Performing basic analysis of molecules generated by ML models

Developing an understanding for some of the most important metrics.

May 3, 2024

In the last post, we created simple generative models for molecules. In this one, we will perform very basic analysis of the generated molecules.

When we have a model that can generate SMILES strings, we want to evaluate its performance beyond just measuring the perplexity.

Various metrics have been proposed to evaluate the performance of generative models. Good references to learn more are:

For exploring some of these metrics, we will use a file of 1000 SMILES strings that I generated using a GPT-like model such as the one we implemented in the last post.


For many applications, we want to generate molecules that have specific properties. In this case, we can use a conditional model that generates molecules with specific properties. That is, the model is “conditioned” on the properties we want the molecule to have. This conditional generation is one example of what some call “inverse design”.

In the case of conditional generation, we also would need to evaluate how well the model is able to generate molecules with the desired properties. We will not cover this in this post, but it is an important topic to consider when evaluating generative models.

from typing import List

with open('../building_an_llm/generations.txt', 'r') as handle:
    # strip trailing newlines so they do not end up in the SMILES strings
    generated = [line.strip() for line in handle]


The simplest check of the generated SMILES is for syntactic correctness. This is done by using the RDKit to parse the SMILES and check for errors. If the SMILES is syntactically correct, the RDKit will return a molecule object. If the SMILES is not syntactically correct, the RDKit will return None.

Note that if we were to use a representation such as SELFIES, any sequence of SELFIES tokens decodes to a syntactically valid molecule, so this check would always pass.

def is_valid_smiles(string: str) -> bool:
    """Check if a string is a valid SMILES string.

    Args:
        string: A string to be checked.

    Returns:
        A boolean value indicating whether the string is a valid SMILES string.
    """
    from rdkit import Chem

    try:
        mol = Chem.MolFromSmiles(string)
        return mol is not None
    except Exception:
        return False
is_valid_generated = [is_valid_smiles(smiles) for smiles in generated]
sum(is_valid_generated) / len(is_valid_generated)

The validity we achieved is not impressive, but it is at least something we can now optimize.

Uniqueness of SMILES

If we sampled a bunch of strings, we would not be happy if all of them were the same SMILES.

A metric that captures this is the fraction of unique SMILES in all generated SMILES. Of course, it makes little sense to include invalid SMILES in this calculation.

def uniqueness(smiles: List[str]) -> float:
    """Calculate the uniqueness of a list of SMILES strings.

    Args:
        smiles: A list of SMILES strings.

    Returns:
        A float value indicating the uniqueness of the SMILES strings.
    """
    valid = [s for s in smiles if is_valid_smiles(s)]
    num_unique = len(set(valid))
    return num_unique / len(valid)
unique = uniqueness(generated)

And in our generation, the redundancy is quite high. However, we also always started sampling from a carbon atom.
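Note that computing uniqueness on raw strings treats two different SMILES of the same molecule as distinct. A more robust variant, sketched below, canonicalizes each valid SMILES with RDKit before counting (the helper name `canonical_uniqueness` is my own, not from the post):

```python
from rdkit import Chem


def canonical_uniqueness(smiles):
    """Uniqueness based on canonical SMILES rather than raw strings."""
    canonical = []
    for s in smiles:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:
            # MolToSmiles returns a canonical SMILES by default
            canonical.append(Chem.MolToSmiles(mol))
    return len(set(canonical)) / len(canonical)


# "OCC" and "C(O)C" are two spellings of the same molecule (ethanol),
# so only two of the three strings are unique after canonicalization
canonical_uniqueness(["OCC", "C(O)C", "c1ccccc1"])
```

This matters in practice: a model that memorized one molecule but emits it in many SMILES spellings would otherwise look deceptively diverse.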


While diversity is a term that is often used, it is surprisingly difficult to pin down. First, there are different perspectives on what diversity means. Three useful ones that we used in previous work are:

  • disparity: how different are the molecules from each other?
  • coverage: how much of the chemical space is covered?
  • balance: how evenly are the molecules distributed in the chemical space?

On top of that, diversity depends on the context. For some applications, certain characteristics do not matter. Hence, considering those characteristics in the diversity metric might be misleading. In the end, any kind of representation will be biased in one form or another.

Commonly, one uses the average pairwise Tanimoto similarity as a measure of diversity. This comes close to the disparity perspective. However, it sweeps at least two problems under the rug:

  • We need to define a fingerprint to calculate the Tanimoto similarity. And the choice of fingerprint will influence the result.
  • The Tanimoto similarity is not necessarily a good measure of chemical similarity.

You can find some more discussion in this paper by Xie and colleagues.
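To see how much the fingerprint choice matters, here is a small sketch comparing the Tanimoto similarity of two molecules under Morgan and MACCS fingerprints (toluene and phenol are illustrative choices, not from the post):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys
from rdkit import DataStructs

# two structurally related molecules: toluene and phenol
a = Chem.MolFromSmiles("Cc1ccccc1")
b = Chem.MolFromSmiles("Oc1ccccc1")

# similarity under Morgan (circular, radius 2) fingerprints
morgan_sim = DataStructs.TanimotoSimilarity(
    AllChem.GetMorganFingerprintAsBitVect(a, 2, nBits=2048),
    AllChem.GetMorganFingerprintAsBitVect(b, 2, nBits=2048),
)

# similarity under MACCS structural keys
maccs_sim = DataStructs.TanimotoSimilarity(
    MACCSkeys.GenMACCSKeys(a), MACCSkeys.GenMACCSKeys(b)
)
```

The two fingerprints generally report different similarity values for the same pair, so any diversity number must always be read together with the fingerprint that produced it.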

def internal_diversity(smiles: List[str]) -> float:
    """Calculate the internal diversity of a list of SMILES strings.

    Args:
        smiles: A list of SMILES strings.

    Returns:
        A float value indicating the internal diversity of the SMILES strings,
        measured as the average pairwise Tanimoto similarity.
    """
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from rdkit import DataStructs

    valid = [s for s in smiles if is_valid_smiles(s)]

    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in valid
    ]

    similarities = []
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            similarities.append(DataStructs.TanimotoSimilarity(fps[i], fps[j]))

    return sum(similarities) / len(similarities)


generated_diversity = internal_diversity(generated)

This number is perhaps not so easy to interpret on its own. However, we can compare it to other techniques. For instance, Maragakis et al. performed a basic analysis of different generative models and found average pairwise Tanimoto similarities in the range of 0.216 to 0.477 (the values are not directly comparable, as they depend on the task and on how the molecules were sampled, but they at least give us a ballpark).

Other metrics

Fréchet ChemNet Distance

The Fréchet ChemNet Distance is a metric used to compare a set of generated molecules against a reference set. It is based on the Fréchet distance, which measures the similarity between two distributions (or curves), and ChemNet, a neural network trained to predict biological activities of chemical compounds. The Fréchet ChemNet Distance can hence be used to measure how close the generated molecules are to the training set.
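Concretely, one fits a Gaussian to the ChemNet activations of each set and computes the Fréchet distance between the two Gaussians: d² = ||μ_g − μ_r||² + Tr(Σ_g + Σ_r − 2(Σ_gΣ_r)^½). Below is a sketch of the distance computation alone, assuming the ChemNet activations have already been computed as NumPy arrays (the featurization itself is not shown):

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sets of feature vectors.

    x, y: arrays of shape (n_samples, n_features), e.g. ChemNet activations
    for the generated and reference molecules.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)

    # matrix square root of the covariance product
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):
        # numerical noise can introduce tiny imaginary components
        covmean = covmean.real

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))
```

Identical feature sets give a distance of (numerically) zero, and the distance grows as the two activation distributions drift apart.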

KL Divergence

Quite related is the KL divergence between the distribution of generated molecules and the distribution of training molecules. This can be used to measure how well the model has learned the distribution of the training data. It is often computed based on molecular descriptors.
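As a minimal sketch of the descriptor-based variant, the function below computes the KL divergence between histograms of a single descriptor (say, molecular weight), assuming the descriptor values for both sets are already available as NumPy arrays; the function name and smoothing constant are my own choices:

```python
import numpy as np


def descriptor_kl(generated: np.ndarray, training: np.ndarray, bins: int = 20) -> float:
    """KL divergence D(P_train || P_gen) between histograms of one descriptor.

    Both arrays hold the same molecular descriptor (e.g. molecular weight)
    for the generated and the training molecules.
    """
    # shared binning so the two histograms are comparable
    lo = min(generated.min(), training.min())
    hi = max(generated.max(), training.max())
    p, _ = np.histogram(training, bins=bins, range=(lo, hi))
    q, _ = np.histogram(generated, bins=bins, range=(lo, hi))

    # smooth to avoid division by zero, then normalize to probabilities
    p = (p + 1e-10) / (p + 1e-10).sum()
    q = (q + 1e-10) / (q + 1e-10).sum()
    return float(np.sum(p * np.log(p / q)))
```

A divergence near zero means the generated descriptor distribution closely matches the training one; note that the result depends on the chosen descriptor and binning.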