Tanimoto Similarity Explained: Compare Molecules and Find Analogs

Understand Tanimoto similarity for molecular comparison. Fingerprints, coefficient math, score interpretation, and hands-on analog searching.

Ryan Bethencourt
April 8, 2026

Why Molecular Similarity Matters

One of the oldest principles in medicinal chemistry is that structurally similar molecules tend to have similar biological activities. This idea, known as the similar property principle, has driven drug discovery for decades. If compound A inhibits a kinase with 10 nM affinity, compounds that look like A are more likely to also inhibit that kinase than random molecules from chemical space.

But "looks like" is subjective. Two chemists can disagree about whether two molecules are similar. What we need is a quantitative, reproducible measure of molecular similarity that can be computed automatically, applied to millions of molecules, and used as a filter in drug discovery pipelines. That measure is the Tanimoto coefficient, calculated over molecular fingerprints.

In this guide, we will explain how molecular fingerprints work, walk through the Tanimoto math with a visual example, interpret similarity scores for real drug pairs, and show you how to compute similarity at scale using the SciRouter API. By the end, you will be able to find analogs, assess chemical diversity, compare your compounds to patent landscapes, and build similarity-based screening filters.

Molecular Fingerprints: Encoding Structure as Bits

Before you can measure similarity, you need a way to represent molecules as comparable data structures. Molecular fingerprints convert a chemical structure into a fixed-length binary vector (a string of 0s and 1s), where each bit indicates the presence or absence of a particular structural feature. Two molecules can then be compared by comparing their bit vectors.

MACCS Keys: Predefined Structural Features

MACCS (Molecular ACCess System) keys are a set of 166 predefined structural patterns. Each key asks a yes/no question about the molecule: Does it contain a ring? Does it have a nitrogen? Does it contain a carbonyl group? Does it have a six-membered aromatic ring with a nitrogen? The result is a 166-bit binary vector.

MACCS keys are simple and interpretable – you can look at which bits differ between two molecules and see which structural features are present or absent. However, with only 166 bits, they capture coarse structural information and often cannot distinguish molecules that carry the same functional groups in different arrangements.

ECFP4 (Morgan Fingerprint): Circular Substructure Enumeration

Extended Connectivity Fingerprints (ECFP) use a radically different approach. Instead of predefined features, ECFP iteratively examines the local chemical environment around each atom, expanding outward to a specified radius. At radius 0, each atom is described by its element, charge, and number of bonds. At radius 1, the description includes the atom's immediate neighbors. At radius 2 (ECFP4 – the "4" refers to the diameter), the description extends two bonds from each atom.
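
The iterative expansion can be sketched in plain Python. This is a simplified, Morgan-style illustration and not the actual ECFP algorithm: atoms carry only an element label, bond orders are ignored, and `stable_hash` and `circular_identifiers` are hypothetical helper names chosen for this example.

```python
import hashlib

def stable_hash(obj) -> int:
    """Deterministic 32-bit hash of a printable object (built-in hash() varies per run)."""
    digest = hashlib.sha1(repr(obj).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def circular_identifiers(atom_labels, bonds, radius=2):
    """Collect one identifier per atom for every radius from 0 to `radius`."""
    neighbors = {i: [] for i in range(len(atom_labels))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each atom is described by its own label only.
    current = [stable_hash((0, label)) for label in atom_labels]
    identifiers = set(current)

    for r in range(1, radius + 1):
        # Radius r: combine an atom's previous identifier with its neighbors'.
        updated = []
        for i in range(len(atom_labels)):
            neighbor_ids = tuple(sorted(current[j] for j in neighbors[i]))
            updated.append(stable_hash((r, current[i], neighbor_ids)))
        current = updated
        identifiers.update(current)
    return identifiers

# Heavy-atom graphs only (assumption: hydrogens are implicit).
ethanol = circular_identifiers(["C", "C", "O"], [(0, 1), (1, 2)])  # C-C-O
propane = circular_identifiers(["C", "C", "C"], [(0, 1), (1, 2)])  # C-C-C

shared = ethanol & propane
print(f"ethanol: {len(ethanol)} identifiers, propane: {len(propane)} identifiers")
print(f"shared environments: {len(shared)}")
```

The two molecules share the bare carbon atom and the terminal methyl environment, while every environment that reaches the oxygen differs. Diameter 4 (ECFP4) corresponds to `radius=2` here.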

Each local environment is hashed into a bit position in a fixed-length vector (typically 1024 or 2048 bits). ECFP4 captures far more structural detail than MACCS keys and is better at distinguishing structurally similar molecules. It is the most widely used fingerprint in modern cheminformatics and is the default in SciRouter's similarity endpoint.
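
The hashing step can be sketched as follows. This is a minimal illustration assuming environments are available as strings; the labels below are made up for the example and are not real ECFP identifiers, and `fold_features` is a hypothetical helper name.

```python
import hashlib

def fold_features(features, n_bits=2048):
    """Map arbitrary feature strings to on-bit positions in an n_bits-long vector."""
    on_bits = set()
    for feature in features:
        digest = hashlib.sha1(feature.encode("utf-8")).digest()
        # Take 32 bits of the hash and fold them into the vector length.
        on_bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return on_bits

# Hypothetical environment labels, for illustration only.
envs = ["aromatic C | r0", "C attached to COOH | r1", "ester O | r1", "isopropyl C | r2"]

bits_2048 = fold_features(envs, n_bits=2048)
bits_8 = fold_features(envs, n_bits=8)

print("on bits in a 2048-bit vector:", sorted(bits_2048))
print("on bits in an 8-bit vector:  ", sorted(bits_8))
```

Shorter vectors make it more likely that two different environments land on the same bit (a collision), which is one reason 1024 or 2048 bits are the usual choices.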

Other Fingerprints: Topological, Pharmacophore, and Beyond

Additional fingerprint types exist for specialized applications:

  • RDKit fingerprint – topological path-based fingerprint that enumerates all linear paths up to a given length
  • Avalon fingerprint – combines substructure and path-based features, often used in large-scale virtual screening
  • Pharmacophore fingerprints – encode spatial arrangements of pharmacophoric features (hydrogen bond donors/acceptors, hydrophobic centers, charged groups) rather than exact atom connectivity
  • MAP4 – a MinHashed atom-pair fingerprint designed for molecules of all sizes, from fragments to macrocycles

Note
The choice of fingerprint affects your similarity scores. The same molecule pair might score 0.45 with ECFP4 and 0.72 with MACCS keys. Always specify which fingerprint you used when reporting similarity values, and use the same fingerprint consistently within a project.

Tanimoto Math: Intersection Over Union

The Tanimoto coefficient measures the overlap between two binary fingerprints. The formula is:

Tc(A, B) = c / (a + b - c)

Where:

  • a = number of bits set to 1 in fingerprint A
  • b = number of bits set to 1 in fingerprint B
  • c = number of bits set to 1 in both A and B (intersection)

This is equivalent to the Jaccard index: the size of the intersection divided by the size of the union. If two molecules have identical fingerprints, c = a = b, and Tc = 1.0. If they share no features, c = 0, and Tc = 0.0.
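
As a minimal sketch, the formula maps directly onto Python sets, where each set holds the positions of the bits that are on. The fingerprints below are arbitrary, and returning 0.0 for two empty fingerprints is a convention chosen for this example.

```python
def tanimoto(on_bits_a: set, on_bits_b: set) -> float:
    """Tc = c / (a + b - c), with fingerprints given as sets of on-bit positions."""
    a = len(on_bits_a)
    b = len(on_bits_b)
    c = len(on_bits_a & on_bits_b)
    union = a + b - c
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are scored as 0.0
    return c / union

fp_x = {3, 17, 42, 101}
fp_y = {3, 42, 77}

print(tanimoto(fp_x, fp_y))  # 2 shared bits / 5 bits in the union = 0.4
print(tanimoto(fp_x, fp_x))  # identical fingerprints = 1.0
print(tanimoto({1}, {2}))    # no shared bits = 0.0
```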

A Visual Example: Aspirin vs Ibuprofen

Let us walk through a simplified example. Imagine a toy fingerprint with 10 bits, where each bit represents a structural feature:

text
Feature:     Ring  COOH  C=O   OH   ArOH  NHR  iPr  Ester  CH3  Cl
Aspirin:     [1]   [1]  [1]  [0]   [0]  [0]  [0]   [1]   [1]  [0]
Ibuprofen:   [1]   [1]  [0]  [0]   [0]  [0]  [1]   [0]   [1]  [0]

Bits in A (aspirin):     a = 5  (Ring, COOH, C=O, Ester, CH3)
Bits in B (ibuprofen):   b = 4  (Ring, COOH, iPr, CH3)
Bits in both (overlap):  c = 3  (Ring, COOH, CH3)

Tanimoto = c / (a + b - c) = 3 / (5 + 4 - 3) = 3/6 = 0.50

A Tanimoto score of 0.50 means that half of the features present in either molecule are present in both. Both molecules contain a ring and a carboxylic acid, but they differ in key functional groups: aspirin has an acetyl ester (the C=O and Ester bits in this toy encoding), while ibuprofen has an isopropyl-containing side chain (the iPr bit). This is consistent with their pharmacological relationship: both are NSAIDs that inhibit cyclooxygenase, but aspirin acetylates the enzyme irreversibly while ibuprofen binds reversibly.

In practice, real fingerprints have 1024 or 2048 bits, and the Tanimoto calculation is done by efficient bitwise operations. The SciRouter API handles this automatically – you provide two SMILES strings and receive a Tanimoto score.
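
The toy aspirin/ibuprofen calculation can be reproduced with bitwise operations, which is roughly how the computation is done on real fingerprints. This sketch packs the 10 toy bits into a Python integer; `to_bitmask` and `popcount` are hypothetical helper names.

```python
FEATURES = ["Ring", "COOH", "C=O", "OH", "ArOH", "NHR", "iPr", "Ester", "CH3", "Cl"]

def to_bitmask(bits):
    """Pack a list of 0/1 values into an integer, one bit per feature."""
    mask = 0
    for position, bit in enumerate(bits):
        if bit:
            mask |= 1 << position
    return mask

def popcount(mask: int) -> int:
    """Number of bits set to 1."""
    return bin(mask).count("1")

aspirin = to_bitmask([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])
ibuprofen = to_bitmask([1, 1, 0, 0, 0, 0, 1, 0, 1, 0])

shared = popcount(aspirin & ibuprofen)  # AND keeps bits set in both: c
union = popcount(aspirin | ibuprofen)   # OR keeps bits set in either: a + b - c
score = shared / union

shared_names = [name for i, name in enumerate(FEATURES) if (aspirin & ibuprofen) >> i & 1]
print(shared_names)              # ['Ring', 'COOH', 'CH3']
print(f"{shared}/{union} = {score:.2f}")  # 3/6 = 0.50
```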

What Tanimoto Scores Mean in Practice

Interpreting Tanimoto scores requires understanding the context. Here are the practical guidelines used by medicinal chemists and cheminformatics researchers, based on ECFP4 fingerprints:

0.85 – 1.0: Very Similar (Close Analogs)

Molecules in this range are typically the same scaffold with minor modifications: a methyl group added, a fluorine replaced by chlorine, or a ring substitution pattern changed. These are the molecules you find when doing a nearest-neighbor search around a lead compound. They are likely to have very similar biological activity and ADMET properties. In patent analysis, compounds this close to a claimed structure often warrant a careful check against the patent's Markush claims.

0.7 – 0.85: Related Analogs

Molecules here share the core scaffold but have more significant modifications: ring size changes, functional group replacements, or additional substituents. This is the typical range for SAR (structure-activity relationship) series. Activity is expected to be related but may vary significantly. This range is often the sweet spot for lead optimization – similar enough to retain activity but different enough to improve ADMET properties.

0.4 – 0.7: Moderate Similarity (Scaffold Hops)

This is the range of scaffold hopping: molecules that look different but share pharmacophoric features. A quinazoline-based kinase inhibitor and an imidazole-based inhibitor targeting the same kinase might score in this range. These molecules may have similar biological activity through different structural approaches. This range is valuable for finding backup series in drug discovery and for patent busting.

Below 0.4: Dissimilar

Molecules below 0.4 Tanimoto are generally considered structurally dissimilar. They may still have the same biological target (many different chemotypes can bind the same protein pocket), but their structural relationship is not obvious. Screening libraries are typically designed with diversity thresholds around 0.3–0.4 to ensure maximum structural coverage.

Hands-On: Comparing 5 Real Drug Pairs

Let us compare five pairs of real drugs to see how Tanimoto similarity correlates with pharmacological relationships.

python
import scirouter

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# Five drug pairs with known relationships
pairs = [
    {
        "name": "Ibuprofen vs Naproxen (both NSAIDs, different scaffolds)",
        "smiles_a": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
        "smiles_b": "COc1ccc2cc(ccc2c1)C(C)C(=O)O",
    },
    {
        "name": "Atorvastatin vs Rosuvastatin (both statins, same class)",
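        # Pyrrole: isopropyl next to the anilide carbon, fluorophenyl next to phenyl
        "smiles_a": "CC(C)c1n(CC[C@@H](O)C[C@@H](O)CC(=O)O)c(c2ccc(F)cc2)c(c3ccccc3)c1C(=O)Nc4ccccc4",
        # Pyrimidine: the heptenoic acid chain sits between the isopropyl and fluorophenyl carbons
        "smiles_b": "CC(C)c1nc(N(C)S(C)(=O)=O)nc(c2ccc(F)cc2)c1/C=C/[C@@H](O)C[C@@H](O)CC(=O)O",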
    },
    {
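        # Assumption: the two SMILES below are simplified stand-ins, not the registered
        # structures of these drugs; check a curated source such as PubChem before reuse.
        "name": "Sotorasib vs Adagrasib (both KRAS G12C inhibitors)",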
        "smiles_a": "C=CC(=O)N1CCN(c2nc(Nc3ccc(N4CCN(C)CC4)c(C)c3)c3[nH]cnc3n2)CC1",
        "smiles_b": "C=CC(=O)N1CCN(c2nc(Nc3cnc(OC)c(Cl)c3)c3cn(C)nc3n2)CC1",
    },
    {
        "name": "Aspirin vs Metformin (different classes entirely)",
        "smiles_a": "CC(=O)Oc1ccccc1C(=O)O",
        "smiles_b": "CN(C)C(=N)NC(=N)N",
    },
    {
        "name": "Erlotinib vs Gefitinib (both EGFR inhibitors)",
        "smiles_a": "COCCOc1cc2ncnc(Nc3cccc(c3)C#C)c2cc1OCCOC",
        "smiles_b": "COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN4CCOCC4",
    },
]

for pair in pairs:
    result = client.chemistry.similarity(
        smiles_a=pair["smiles_a"],
        smiles_b=pair["smiles_b"],
    )
    print(f"{pair['name']}")
    print(f"  Tanimoto (ECFP4): {result.tanimoto:.3f}")
    print()

Here are the approximate scores you can expect and why they make sense:

Ibuprofen vs Naproxen (~0.55): Both are propionic acid NSAIDs with a carboxylic acid and an aryl group, but naproxen has a naphthalene ring system and a methoxy group that ibuprofen lacks. The moderate score correctly reflects that they are in the same pharmacological class but have different scaffolds.

Atorvastatin vs Rosuvastatin (~0.38): Both are statins that inhibit HMG-CoA reductase, sharing the dihydroxyheptanoic acid pharmacophore. But atorvastatin has a pyrrole core while rosuvastatin has a pyrimidine core, and the overall structures are quite different. The low score illustrates that functional analogs can be structurally dissimilar.

Sotorasib vs Adagrasib (~0.62): Both covalent KRAS G12C inhibitors share an acrylamide warhead, a piperazine linker, and a pyrimidine-fused bicyclic core. The moderate-to-high score reflects their common pharmacophoric features with different peripheral groups.

Aspirin vs Metformin (~0.10): These molecules have almost nothing in common structurally. Aspirin is an aromatic ester acid; metformin is a biguanide. The near-zero score correctly identifies them as unrelated.

Erlotinib vs Gefitinib (~0.52): Both are quinazoline-based EGFR inhibitors with an aniline substituent. They share the core scaffold but differ in their solubilizing groups and aniline substitution. The moderate score reflects shared core with meaningful peripheral differences.

Applications of Tanimoto Similarity

Analog Searching: Finding Similar Compounds

The most common use of Tanimoto similarity is finding analogs of a known active compound. Given a hit from a screen or a literature compound, you search a database for molecules with Tanimoto scores above a threshold (typically 0.7). This approach identifies close analogs that might be more potent, more selective, or have better ADMET properties than the original hit.

python
import scirouter

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

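# Reference compound: labeled sotorasib (KRAS G12C inhibitor) in this example.
# The SMILES is a simplified stand-in; confirm it against a curated source.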
reference = "C=CC(=O)N1CCN(c2nc(Nc3ccc(N4CCN(C)CC4)c(C)c3)c3[nH]cnc3n2)CC1"

# Screen a library for analogs
library = [
    ("compound_A", "C=CC(=O)N1CCN(c2nc(Nc3cnc(OC)c(Cl)c3)c3cn(C)nc3n2)CC1"),
    ("compound_B", "CC(=O)Oc1ccccc1C(=O)O"),
    ("compound_C", "C=CC(=O)N1CCN(c2nc(Nc3ccc(F)cc3)c3[nH]cnc3n2)CC1"),
    ("compound_D", "CCOc1cc2ncc(C#N)c(Nc3ccc(F)c(Cl)c3)c2cc1NC(=O)/C=C/CN(C)C"),
]

print(f"Reference: sotorasib")
print(f"{'Compound':<15} {'Tanimoto':<10} {'Classification'}")
print("-" * 50)

for name, smiles in library:
    result = client.chemistry.similarity(smiles_a=reference, smiles_b=smiles)
    tc = result.tanimoto

    if tc >= 0.85:
        classification = "Very similar (close analog)"
    elif tc >= 0.7:
        classification = "Related analog"
    elif tc >= 0.4:
        classification = "Scaffold hop"
    else:
        classification = "Dissimilar"

    print(f"{name:<15} {tc:<10.3f} {classification}")

Diversity Analysis: Ensuring Library Coverage

When building a screening library, you want maximum structural diversity to cover as much chemical space as possible. Tanimoto similarity enables diversity selection: pick compounds from your collection such that no two compounds have Tanimoto similarity above a threshold (typically 0.7). This ensures that each compound in your library represents a distinct region of chemical space.
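
One simple way to apply this rule is a greedy filter: walk through the collection and keep a compound only if it is not too similar to anything already kept. The sketch below assumes fingerprints are sets of on-bit positions; the compound names and bit sets are invented, and `pick_diverse` is a hypothetical helper.

```python
def tanimoto(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def pick_diverse(fingerprints: dict, threshold: float = 0.7) -> list:
    """Greedily keep compounds whose similarity to every kept compound is <= threshold."""
    selected = []
    for name, fp in fingerprints.items():
        if all(tanimoto(fp, fingerprints[kept]) <= threshold for kept in selected):
            selected.append(name)
    return selected

# Invented fingerprints for illustration.
library = {
    "cpd_1": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    "cpd_2": {1, 2, 3, 4, 5, 6, 7, 8, 9, 11},   # 9/11 = 0.82 vs cpd_1: rejected
    "cpd_3": {1, 2, 3, 20, 21, 22, 23},         # 3/14 = 0.21 vs cpd_1: kept
    "cpd_4": {1, 2, 3, 20, 21, 22, 23, 24},     # 7/8 = 0.88 vs cpd_3: rejected
    "cpd_5": {30, 31, 32},                      # shares nothing: kept
}

print(pick_diverse(library))  # ['cpd_1', 'cpd_3', 'cpd_5']
```

The result depends on the order in which compounds are visited, so sorting the collection first (for example by priority or purity) changes which representatives survive.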

Patent Landscape Assessment

Before investing in a drug candidate, check whether structurally similar compounds are patented. Calculate Tanimoto similarity between your candidate and molecules in patent databases (SureChEMBL, Google Patents). Compounds above 0.85 similarity may fall within the scope of broad Markush claims. Compounds between 0.4 and 0.7 represent scaffold hops that are typically distinct enough to avoid infringement, though expert patent analysis is always needed.

Activity Cliff Detection

An activity cliff is a pair of molecules that are very similar structurally (high Tanimoto) but have dramatically different biological activities. Activity cliffs are critically important in SAR analysis because they pinpoint the exact structural feature responsible for activity. If two molecules with Tanimoto 0.90 differ by 100-fold in potency, the single structural difference between them is likely essential for binding.
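
A minimal sketch of cliff detection is a pairwise scan that flags pairs with high similarity and a large potency ratio. The fingerprints and IC50 values below are invented, the 0.85 and 100-fold cutoffs are example settings, and `find_activity_cliffs` is a hypothetical helper.

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def find_activity_cliffs(compounds: dict, min_similarity=0.85, min_fold=100.0):
    """Return (name_a, name_b, similarity, fold_change) for similar pairs with very different potency."""
    cliffs = []
    for (name_a, a), (name_b, b) in combinations(compounds.items(), 2):
        similarity = tanimoto(a["fp"], b["fp"])
        # Assumes IC50 values are positive, so the ratio is well defined.
        fold = max(a["ic50_nm"], b["ic50_nm"]) / min(a["ic50_nm"], b["ic50_nm"])
        if similarity >= min_similarity and fold >= min_fold:
            cliffs.append((name_a, name_b, similarity, fold))
    return cliffs

# Invented data for illustration.
compounds = {
    "cpd_A": {"fp": set(range(20)), "ic50_nm": 5.0},
    "cpd_B": {"fp": set(range(19)) | {50}, "ic50_nm": 800.0},  # one feature swapped, 160x weaker
    "cpd_C": {"fp": set(range(20)) | {60}, "ic50_nm": 10.0},   # one feature added, similar potency
    "cpd_D": {"fp": {100, 101, 102}, "ic50_nm": 5000.0},       # unrelated structure
}

cliffs = find_activity_cliffs(compounds)
for name_a, name_b, similarity, fold in cliffs:
    print(f"{name_a} / {name_b}: Tanimoto {similarity:.2f}, {fold:.0f}-fold potency difference")
```

Only the cpd_A / cpd_B pair is reported: cpd_D is also far weaker than cpd_A, but the two are structurally unrelated, so that difference says nothing about any single feature.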

Beyond Tanimoto: Other Similarity Metrics

While Tanimoto is the standard, other similarity metrics exist for specific use cases:

  • Dice coefficient – Dice = 2c / (a + b). Gives scores at least as high as Tanimoto for the same molecule pairs. Used when you want to emphasize shared features over differences.
  • Cosine similarity – The cosine of the angle between two fingerprint vectors. More commonly used with continuous (count-based) fingerprints rather than binary.
  • Tversky index – An asymmetric generalization of Tanimoto that allows weighting: Tv = c / (alpha*(a - c) + beta*(b - c) + c), where (a - c) and (b - c) count the bits unique to A and to B. Setting alpha = beta = 1 recovers Tanimoto; setting alpha=1, beta=0 measures what fraction of A's features are present in B, useful for substructure-like comparisons.
  • Euclidean distance – Measures the geometric distance between fingerprint vectors. Used in clustering algorithms that require a distance metric rather than a similarity metric.

For most drug discovery applications, Tanimoto with ECFP4 is the right default. Switch to Tversky when comparing molecules of very different sizes, Dice when you want a more generous similarity threshold, or cosine when working with count-based fingerprints.
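
The metrics can be compared side by side on a small example. This sketch assumes non-empty fingerprints given as sets of on-bit positions, and writes Tversky in terms of the bits unique to each fingerprint, (a - c) and (b - c). The "fragment" and "parent" bit sets are invented to mimic a small molecule whose features are all contained in a larger one.

```python
import math

def counts(fp_a: set, fp_b: set):
    """Return a, b, c: on bits in A, on bits in B, on bits in both."""
    return len(fp_a), len(fp_b), len(fp_a & fp_b)

def tanimoto(fp_a, fp_b):
    a, b, c = counts(fp_a, fp_b)
    return c / (a + b - c)

def dice(fp_a, fp_b):
    a, b, c = counts(fp_a, fp_b)
    return 2 * c / (a + b)

def cosine(fp_a, fp_b):
    a, b, c = counts(fp_a, fp_b)
    return c / math.sqrt(a * b)

def tversky(fp_a, fp_b, alpha, beta):
    a, b, c = counts(fp_a, fp_b)
    return c / (alpha * (a - c) + beta * (b - c) + c)

fragment = {1, 2, 3}
parent = set(range(1, 13))  # 12 on bits, including all three fragment bits

print(f"Tanimoto:          {tanimoto(fragment, parent):.2f}")         # 0.25
print(f"Dice:              {dice(fragment, parent):.2f}")             # 0.40
print(f"Cosine:            {cosine(fragment, parent):.2f}")           # 0.50
print(f"Tversky (a=1,b=0): {tversky(fragment, parent, 1, 0):.2f}")    # 1.00
```

With alpha=1 and beta=0 the Tversky score is 1.0 because every fragment feature is found in the parent, even though the symmetric Tanimoto score is only 0.25. Setting alpha = beta = 1 gives the Tanimoto value, and alpha = beta = 0.5 gives the Dice value.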

Computing Similarity at Scale

For large-scale similarity searches (comparing one query against thousands of database compounds), the API pattern is straightforward:

python
import scirouter
import pandas as pd

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# Query compound
query_smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin

# Database of compounds to search
database = pd.read_csv("compound_library.csv")  # Columns: id, smiles, name

# Calculate similarity to every compound in the database
results = []
for _, row in database.iterrows():
    sim = client.chemistry.similarity(
        smiles_a=query_smiles,
        smiles_b=row["smiles"],
    )
    results.append({
        "id": row["id"],
        "name": row["name"],
        "smiles": row["smiles"],
        "tanimoto": sim.tanimoto,
    })

# Rank by similarity and get top analogs
df = pd.DataFrame(results).sort_values("tanimoto", ascending=False)
print("Top 10 analogs of aspirin:")
print(df.head(10)[["name", "tanimoto"]].to_string(index=False))

Note
For libraries with more than 1,000 compounds, consider using the Molecule Similarity endpoint in batch mode or computing fingerprints locally with RDKit and performing the Tanimoto calculation in-memory. Local computation is faster for bulk operations; the API is better for on-demand queries integrated into pipelines.
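
As a rough sketch of the in-memory approach, the search below keeps the top k hits in a heap and skips candidates that cannot beat the current weakest hit. It relies on the bound Tc ≤ min(a, b) / max(a, b), which follows from c ≤ min(a, b). Small integers stand in for fingerprints here; in a real workflow they would come from a toolkit such as RDKit. `top_k_similar` is a hypothetical helper.

```python
import heapq

def popcount(mask: int) -> int:
    return bin(mask).count("1")

def top_k_similar(query: int, library: dict, k: int = 2):
    """Return the k best (score, name) pairs, highest score first."""
    q_bits = popcount(query)
    best = []  # min-heap: best[0] is the weakest hit currently kept
    for name, fp in library.items():
        fp_bits = popcount(fp)
        if q_bits == 0 or fp_bits == 0:
            continue  # skip empty fingerprints
        if len(best) == k:
            # Upper bound on Tanimoto from bit counts alone; no AND needed.
            upper_bound = min(q_bits, fp_bits) / max(q_bits, fp_bits)
            if upper_bound <= best[0][0]:
                continue
        shared = popcount(query & fp)
        score = shared / (q_bits + fp_bits - shared)
        if len(best) < k:
            heapq.heappush(best, (score, name))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, name))
    return sorted(best, reverse=True)

# Invented 8-bit fingerprints for illustration.
query = 0b11110000
library = {
    "hit_1": 0b11110000,
    "hit_2": 0b11100000,
    "hit_3": 0b11111111,
    "miss": 0b00001111,
    "tiny": 0b00010000,
}

print(top_k_similar(query, library, k=2))  # [(1.0, 'hit_1'), (0.75, 'hit_2')]
```

Once two hits are in the heap, `hit_3` and `tiny` are discarded on bit counts alone, which is where most of the savings come from on large libraries.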

Combining Similarity with Property Prediction

Tanimoto similarity becomes most powerful when combined with property prediction. After finding analogs, screen them for drug-likeness and ADMET properties to identify the best candidates:

python
import scirouter

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# Find analogs, then filter by properties
query = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"  # Ibuprofen

candidates = [
    "COc1ccc2cc(ccc2c1)C(C)C(=O)O",       # Naproxen
    "CC(c1ccc(cc1)CC2CCCC2=O)C(=O)O",      # Loxoprofen
    "OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl",      # Diclofenac
    "CC1=CC(=C(C=C1)CC(C)C(=O)O)C",        # Illustrative arylalkanoic acid
]

for smiles in candidates:
    # Get similarity
    sim = client.chemistry.similarity(smiles_a=query, smiles_b=smiles)

    # Get properties
    props = client.chemistry.properties(smiles=smiles)

    print(f"SMILES: {smiles}")
    print(f"  Tanimoto: {sim.tanimoto:.3f}")
    print(f"  MW: {props.molecular_weight:.1f}, LogP: {props.logp:.2f}")
    print(f"  Lipinski violations: {props.lipinski_violations}")
    print()

Try It Now: Compare Your Molecules

Compare any two molecules right now using the Tanimoto Similarity free tool – no account or API key required. Paste two SMILES strings and get an instant Tanimoto score with ECFP4 fingerprints.

For programmatic access and batch similarity searches, create a free SciRouter account at scirouter.ai/signup. The free tier includes 5,000 API calls per month. Combine Molecule Similarity with Molecular Properties to build complete analog search and filtering workflows.

Tanimoto similarity transforms the subjective question of "do these molecules look alike?" into a quantitative, reproducible measurement. Whether you are searching for analogs of a lead compound, assessing the diversity of a screening library, checking patent landscapes, or filtering the output of a generative chemistry model, Tanimoto similarity with molecular fingerprints is the foundational tool that makes it all possible.

Frequently Asked Questions

What is the Tanimoto coefficient?

The Tanimoto coefficient (also called the Jaccard index when applied to sets) is a similarity metric that compares two binary fingerprints by calculating the ratio of bits set in both fingerprints to bits set in either one. The formula is Tc = |A intersection B| / |A union B|, which equals |A intersection B| / (|A| + |B| - |A intersection B|). Values range from 0 (no shared features) to 1 (identical fingerprints). It is the most widely used similarity metric in cheminformatics because it handles sparse binary vectors well and correlates with perceived chemical similarity.

What Tanimoto similarity score means two molecules are similar?

There is no universal threshold, but a commonly used guideline is: 0.85 or above indicates very similar molecules (likely the same scaffold with minor modifications), 0.7-0.85 indicates closely related analogs, 0.4-0.7 indicates moderate similarity (possibly scaffold hops or bioisosteric replacements), and below 0.4 indicates dissimilar molecules. However, these thresholds depend on the fingerprint type. ECFP4 at radius 2 tends to give lower scores than MACCS keys for the same molecule pair because ECFP4 captures more fine-grained structural detail.

Which molecular fingerprint should I use with Tanimoto similarity?

For general-purpose molecular similarity, ECFP4 (Extended Connectivity Fingerprint with radius 2, also called Morgan fingerprint) is the standard choice. It captures local chemical environments around each atom and is highly discriminative. MACCS keys (166 predefined structural features) are simpler and give higher similarity scores for the same pairs, making them useful for clustering and diversity analysis. For activity cliff detection, use ECFP4. For pharmacophore-based similarity, use topological pharmacophore fingerprints. SciRouter uses ECFP4 by default.

Can Tanimoto similarity compare molecules of very different sizes?

Yes, but the results may be misleading. If molecule A is a small fragment (MW 150) and molecule B is a large complex molecule (MW 500), even if A is a substructure of B, the Tanimoto score will be low because B has many features that A lacks. This is a known limitation. For comparing molecules of very different sizes, consider using the Tversky index instead, which allows asymmetric weighting. Alternatively, use substructure search to determine whether the smaller molecule is contained within the larger one.

Is Tanimoto similarity the same as Jaccard similarity?

Yes, for binary fingerprints. The Tanimoto coefficient applied to binary bit vectors is mathematically identical to the Jaccard index applied to sets. Some texts distinguish them by saying Jaccard applies to sets while Tanimoto applies to bit vectors, but the formula is the same. For continuous (non-binary) feature vectors, the generalized Tanimoto coefficient uses a different formula involving dot products and norms, and this differs from Jaccard. In cheminformatics, fingerprints are almost always binary, so the two terms are interchangeable.
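
The generalized form for continuous or count vectors is dot(x, y) / (|x|² + |y|² − dot(x, y)). The sketch below checks that it reduces to the binary formula on 0/1 vectors, using the 10-bit toy vectors from the aspirin/ibuprofen example, and then applies it to two invented count vectors.

```python
def generalized_tanimoto(x, y) -> float:
    """Tanimoto for real-valued vectors: dot / (|x|^2 + |y|^2 - dot)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = sum(xi * xi for xi in x)
    norm_y = sum(yi * yi for yi in y)
    denominator = norm_x + norm_y - dot
    return dot / denominator if denominator else 0.0

# Binary case: dot = c, |x|^2 = a, |y|^2 = b, so the result is c / (a + b - c).
aspirin_bits = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
ibuprofen_bits = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
print(generalized_tanimoto(aspirin_bits, ibuprofen_bits))  # 3 / (5 + 4 - 3) = 0.5

# Count case (invented feature counts): dot = 3, |x|^2 = 14, |y|^2 = 3.
counts_x = [2, 0, 1, 3]
counts_y = [1, 1, 1, 0]
print(round(generalized_tanimoto(counts_x, counts_y), 3))  # 3 / 14 = 0.214
```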

How do I use Tanimoto similarity for patent landscape analysis?

To assess freedom-to-operate, compare your compound against patented molecules using Tanimoto similarity. Compounds with Tanimoto similarity above 0.85 to a patented structure are likely too close and may infringe, especially under Markush claim scope. Compounds in the 0.4-0.7 range represent scaffold hops that are typically distinct enough to be outside patent claims. Always combine computational similarity with expert patent analysis, as Markush claims can cover broad structural classes that Tanimoto does not capture. SciRouter's similarity endpoint can batch-compare your candidates against a patent library.
