Why bother?

In [1]:

                
                    Copied!
                    
%load_ext autoreload
%autoreload 2
%load_ext autoreload
%autoreload 2

One featurizer to rule them all?¶

Contrary to many other machine learning domains, molecular featurization (i.e. the process of transforming a molecule into a vector) lacks a good default. It remains unclear how we can effectively capture the richness of molecular data in a unified representation and what works best heavily depends on the nature and constraints of the task you are trying to model. It is therefore good practice to try different featurization schemes: From structural fingerprints, to physico-chemical descriptors and pre-trained embeddings.

Don't take our word for it¶

To demonstrate the impact a featurizer can have, we setup two simple benchmarks.

To demonstrate the impact on modeling, we will use two datasets from MoleculeNet [1].
To demonstrate the impact on search, we will use the RDKit Benchmarking Platform [2, 3].

We will compare the performance of three different featurizers:

ECFP6 [4]: Binary, circular fingerprints where each bit indicates the presence of particular substructures of a radius up to 3 bonds away from an atom.
Mordred [5]: Continuous descriptors with more than 1800 2D and 3D descriptors.
ChemBERTa [6]: Learned representations from a pre-trained SMILES transformer model.

Modeling¶

We will compare the performance on two datasets using scikit-learn AutoML [7, 8] models.

In [2]:

                
                    Copied!
                    
                        
                        
                    
                    

            
import os
import tqdm
import fsspec
import pickle
import warnings
import numpy as np
import pandas as pd
import datamol as dm
import matplotlib.pyplot as plt
import autosklearn.classification
import autosklearn.regression
from collections import defaultdict
from rdkit.Chem import SaltRemover

from sklearn.metrics import mean_absolute_error, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer
import os
import tqdm
import fsspec
import pickle
import warnings
import numpy as np
import pandas as pd
import datamol as dm
import matplotlib.pyplot as plt
import autosklearn.classification
import autosklearn.regression
from collections import defaultdict
from rdkit.Chem import SaltRemover

from sklearn.metrics import mean_absolute_error, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer

In [3]:

                
                    Copied!
                    
# Making the output less verbose
warnings.simplefilter("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
dm.disable_rdkit_log()
# Making the output less verbose
warnings.simplefilter("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
dm.disable_rdkit_log()

In [4]:

                
                    Copied!
                    
                        
                        
                    
                    

            
def load_dataset(uri: str, readout_col: str):
    """Loads the MoleculeNet dataset"""
    df = pd.read_csv(uri)
    smiles = df["smiles"].values
    y = df[readout_col].values
    return smiles, y


def preprocess_smiles(smi):
    """Preprocesses the SMILES string"""
    mol = dm.to_mol(smi, ordered=True, sanitize=False)    
    try: 
        mol = dm.sanitize_mol(mol)
    except:
        mol = None
            
    if mol is None: 
        return
        
    mol = dm.standardize_mol(mol, disconnect_metals=True)
    remover = SaltRemover.SaltRemover()
    mol = remover.StripMol(mol, dontRemoveEverything=True)

    return dm.to_smiles(mol)


def scaffold_split(smiles):
    """In line with common practice, we will use the scaffold split to evaluate our models"""
    scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(dm.to_mol(smi))) for smi in smiles]
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    return next(splitter.split(smiles, groups=scaffolds))
def load_dataset(uri: str, readout_col: str):
    """Loads the MoleculeNet dataset"""
    df = pd.read_csv(uri)
    smiles = df["smiles"].values
    y = df[readout_col].values
    return smiles, y


def preprocess_smiles(smi):
    """Preprocesses the SMILES string"""
    mol = dm.to_mol(smi, ordered=True, sanitize=False)    
    try: 
        mol = dm.sanitize_mol(mol)
    except:
        mol = None
            
    if mol is None: 
        return
        
    mol = dm.standardize_mol(mol, disconnect_metals=True)
    remover = SaltRemover.SaltRemover()
    mol = remover.StripMol(mol, dontRemoveEverything=True)

    return dm.to_smiles(mol)


def scaffold_split(smiles):
    """In line with common practice, we will use the scaffold split to evaluate our models"""
    scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(dm.to_mol(smi))) for smi in smiles]
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    return next(splitter.split(smiles, groups=scaffolds))

In [5]:

                
                    Copied!
                    
# Setup the featurizers
trans_ecfp = FPVecTransformer(kind="ecfp:6", n_jobs=-1)
trans_mordred = FPVecTransformer(kind="mordred", replace_nan=True, n_jobs=-1)
trans_chemberta = PretrainedHFTransformer(kind='ChemBERTa-77M-MLM', notation='smiles')
# Setup the featurizers
trans_ecfp = FPVecTransformer(kind="ecfp:6", n_jobs=-1)
trans_mordred = FPVecTransformer(kind="mordred", replace_nan=True, n_jobs=-1)
trans_chemberta = PretrainedHFTransformer(kind='ChemBERTa-77M-MLM', notation='smiles')

Lipophilicity¶

Lipophilicity is a regression task with 4200 molecules

In [6]:

                
                    Copied!
                    
                        
                        
                    
                    

            
# Prepare the Lipophilicity dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv", "exp")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if dm.to_mol(smi) is not None])

feats_ecfp, ind_ecfp = trans_ecfp(smiles, ignore_errors=True)
feats_mordred, ind_mordred = trans_mordred(smiles, ignore_errors=True)
feats_chemberta, ind_chemberta = trans_chemberta(smiles, ignore_errors=True)

X = {
    "ECFP": feats_ecfp[ind_ecfp],
    "Mordred": feats_mordred[ind_mordred],
    "ChemBERTa": feats_chemberta[ind_chemberta],
}
# Prepare the Lipophilicity dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv", "exp")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if dm.to_mol(smi) is not None])

feats_ecfp, ind_ecfp = trans_ecfp(smiles, ignore_errors=True)
feats_mordred, ind_mordred = trans_mordred(smiles, ignore_errors=True)
feats_chemberta, ind_chemberta = trans_chemberta(smiles, ignore_errors=True)

X = {
    "ECFP": feats_ecfp[ind_ecfp],
    "Mordred": feats_mordred[ind_mordred],
    "ChemBERTa": feats_chemberta[ind_chemberta],
}

In [7]:

                
                    Copied!
                    
                        
                        
                    
                    

            
# Train a model
train_ind, test_ind = scaffold_split(smiles)

lipo_scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.regression.AutoSklearnRegressor(
        memory_limit=24576, 
        # For practicality’s sake, limit this to 5 minutes! 
        # (x3 = 15 min in total)
        time_left_for_this_task=180,  
        n_jobs=1,
        seed=1,
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict(feats[test_ind])
    
    # Evaluate
    mae = mean_absolute_error(y_true[test_ind], y_hat)
    lipo_scores[name] = mae

lipo_scores
# Train a model
train_ind, test_ind = scaffold_split(smiles)

lipo_scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.regression.AutoSklearnRegressor(
        memory_limit=24576, 
        # For practicality’s sake, limit this to 5 minutes! 
        # (x3 = 15 min in total)
        time_left_for_this_task=180,  
        n_jobs=1,
        seed=1,
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict(feats[test_ind])
    
    # Evaluate
    mae = mean_absolute_error(y_true[test_ind], y_hat)
    lipo_scores[name] = mae

lipo_scores

[WARNING] [2023-03-27 11:14:52,884:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-27 11:15:09,140:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-27 11:15:10,575:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-27 11:17:53,162:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-27 11:17:55,897:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-27 11:18:08,224:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-27 11:20:48,340:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-27 11:20:53,717:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-27 11:20:54,100:Client-EnsembleBuilder] No runs were available to build an ensemble from

Out[7]:

{'ECFP': 0.72650517992544,
 'Mordred': 0.5792950477372836,
 'ChemBERTa': 0.7396155151199458}

ClinTox¶

In [8]:

                
                    Copied!
                    
                        
                        
                    
                    

            
# Prepare the ClinTox dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/clintox.csv.gz", "CT_TOX")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if smi is not None])

feats_ecfp, ind_ecfp = trans_ecfp(smiles, ignore_errors=True)
feats_mordred, ind_mordred = trans_mordred(smiles, ignore_errors=True)
feats_chemberta, ind_chemberta = trans_chemberta(smiles, ignore_errors=True)

X = {
    "ECFP": feats_ecfp[ind_ecfp],
    "Mordred": feats_mordred[ind_mordred],
    "ChemBERTa": feats_chemberta[ind_chemberta],
}
# Prepare the ClinTox dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/clintox.csv.gz", "CT_TOX")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if smi is not None])

feats_ecfp, ind_ecfp = trans_ecfp(smiles, ignore_errors=True)
feats_mordred, ind_mordred = trans_mordred(smiles, ignore_errors=True)
feats_chemberta, ind_chemberta = trans_chemberta(smiles, ignore_errors=True)

X = {
    "ECFP": feats_ecfp[ind_ecfp],
    "Mordred": feats_mordred[ind_mordred],
    "ChemBERTa": feats_chemberta[ind_chemberta],
}

[11:23:45] Unusual charge on atom 0 number of radical electrons set to zero
[11:23:46] Unusual charge on atom 0 number of radical electrons set to zero
[11:23:47] Unusual charge on atom 0 number of radical electrons set to zero
[11:23:47] Unusual charge on atom 0 number of radical electrons set to zero
[11:23:47] Unusual charge on atom 0 number of radical electrons set to zero

In [9]:

                
                    Copied!
                    
                        
                        
                    
                    

            
# Train a model
train_ind, test_ind = scaffold_split(smiles)

clintox_scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.classification.AutoSklearnClassifier(
        memory_limit=24576, 
        # For practicality’s sake, limit this to 5 minutes! 
        # (x3 = 15 min in total)
        time_left_for_this_task=180,
        n_jobs=1,
        seed=1,
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict_proba(feats[test_ind])
    y_hat = y_hat[:, 1]
    
    # Evaluate
    auroc = roc_auc_score(y_true[test_ind], y_hat)
    clintox_scores[name] = auroc

clintox_scores
# Train a model
train_ind, test_ind = scaffold_split(smiles)

clintox_scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.classification.AutoSklearnClassifier(
        memory_limit=24576, 
        # For practicality’s sake, limit this to 5 minutes! 
        # (x3 = 15 min in total)
        time_left_for_this_task=180,
        n_jobs=1,
        seed=1,
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict_proba(feats[test_ind])
    y_hat = y_hat[:, 1]
    
    # Evaluate
    auroc = roc_auc_score(y_true[test_ind], y_hat)
    clintox_scores[name] = auroc

clintox_scores

[WARNING] [2023-03-27 11:30:21,940:Client-EnsembleBuilder] No models better than random - using Dummy losses!
	Models besides current dummy model: 0
	Dummy models: 1
[WARNING] [2023-03-27 11:30:38,533:Client-EnsembleBuilder] No models better than random - using Dummy losses!
	Models besides current dummy model: 0
	Dummy models: 1

Out[9]:

{'ECFP': 0.5345833333333333, 'Mordred': 0.5633333333333334, 'ChemBERTa': 0.665}

Conclusion¶

Dataset	Metric	Representation	Score	Rank
Lipophilicity	MAE ↓	ECFP	0.727	1
		Mordred	0.579	0
		ChemBERTa	0.740	2
ClinTox	AUROC ↑	ECFP	0.535	2
		Mordred	0.563	1
		ChemBERTa	0.665	0

We can see that for Lipophilicity, the Mordred featurizer proves most powerful, outperforming the next best featurizer by about 20%. For ClinTox, however, the tables have turned and it is instead ChemBERTa that outperforms the other featurizers by about 18%. This shows the importance of trying different featurizers. Luckily, with Molfeat, this has just become a lot easier to do!

Search¶

We will evaluate the performance on the search task using

In [10]:

                
                    Copied!
                    
                        
                        
                    
                    

            
# Specify some meta-data

BASE_CMPD_URI = "https://github.com/rdkit/benchmarking_platform/raw/master/compounds/DUD/cmp_list_DUD"
CMPD_EXT = ".dat.gz"

BASE_SPLIT_URI = "https://github.com/rdkit/benchmarking_platform/raw/master/query_lists/data_sets_I/DUD/training_DUD"
SPLIT_EXT = ".pkl"

# Out of practicality, we only use the first 10 targets
TARGETS = [
    "ace",
    "ache",
    "ar",
    "cdk2",
    "cox2",
    "dhfr",
    "egfr",
    "er_agonist",
    "fgfr1",
    "fxa",
]
# Specify some meta-data

BASE_CMPD_URI = "https://github.com/rdkit/benchmarking_platform/raw/master/compounds/DUD/cmp_list_DUD"
CMPD_EXT = ".dat.gz"

BASE_SPLIT_URI = "https://github.com/rdkit/benchmarking_platform/raw/master/query_lists/data_sets_I/DUD/training_DUD"
SPLIT_EXT = ".pkl"

# Out of practicality, we only use the first 10 targets
TARGETS = [
    "ace",
    "ache",
    "ar",
    "cdk2",
    "cox2",
    "dhfr",
    "egfr",
    "er_agonist",
    "fgfr1",
    "fxa",
]

In [11]:

                
                    Copied!
                    
                        
                        
                    
                    

            
def get_compounds_for_target(target: str):
    """Loads the structural data"""
    df = pd.DataFrame()
    for subset in ["actives", "decoys"]:
        df_ = pd.read_csv(f"{BASE_CMPD_URI}_{target}_{subset}{CMPD_EXT}", sep="\t")
        df_["subset"] = subset
        df_["target"] = 1 if subset == "actives" else 0
        df = pd.concat([df, df_])
    return df


def get_train_decoy_split_for_target(target: str, no_of_actives: int = 20):
    """Loads the proposed split of the benchmark"""
    with fsspec.open(f"{BASE_SPLIT_URI}_{target}_{no_of_actives}{SPLIT_EXT}", "rb") as fd:
        data = pickle.load(fd)
    return data[:no_of_actives], data[no_of_actives:]
def get_compounds_for_target(target: str):
    """Loads the structural data"""
    df = pd.DataFrame()
    for subset in ["actives", "decoys"]:
        df_ = pd.read_csv(f"{BASE_CMPD_URI}_{target}_{subset}{CMPD_EXT}", sep="\t")
        df_["subset"] = subset
        df_["target"] = 1 if subset == "actives" else 0
        df = pd.concat([df, df_])
    return df


def get_train_decoy_split_for_target(target: str, no_of_actives: int = 20):
    """Loads the proposed split of the benchmark"""
    with fsspec.open(f"{BASE_SPLIT_URI}_{target}_{no_of_actives}{SPLIT_EXT}", "rb") as fd:
        data = pickle.load(fd)
    return data[:no_of_actives], data[no_of_actives:]

In [12]:

                
                    Copied!
                    
                        
                        
                    
                    

            
results = defaultdict(dict)

for target in tqdm.tqdm(TARGETS, leave=False):

    # Load the structures (i.e. SMILES)
    df = get_compounds_for_target(target)
    n_actives = len(df[df["subset"] == "actives"])

    smiles = df["SMILES"].values
    smiles = np.array([smi for smi in smiles if dm.to_mol(smi) is not None])

    # Featurize
    feats_ecfp, ind_ecfp = trans_ecfp(smiles, ignore_errors=True)
    feats_mordred, ind_mordred = trans_mordred(smiles, ignore_errors=True)
    feats_chemberta, ind_chemberta = trans_chemberta(smiles, ignore_errors=True)

    X = {
        "ECFP": feats_ecfp[ind_ecfp],
        "Mordred": feats_mordred[ind_mordred],
        "ChemBERTa": feats_chemberta[ind_chemberta],
    }

    # Get the train-test split
    train_active_ind, train_decoy_ind = get_train_decoy_split_for_target(target, no_of_actives=20)
    train_decoy_ind = [i + n_actives for i in train_decoy_ind]

    for feat_name, feats in X.items():

        # Train the model
        knn = KNeighborsClassifier()
        train_ind = np.concatenate([train_active_ind, train_decoy_ind])
        test_ind = np.array([i for i in range(len(df)) if i not in train_ind])
        knn.fit(feats[train_ind], df.iloc[train_ind]["target"].values)

        # Get targets and predictions
        y_true = df.iloc[test_ind]["target"].values
        y_pred = knn.predict_proba(feats[test_ind])[:, 1]

        # Compute the recovery score
        auc = roc_auc_score(y_true, y_pred)
        results[feat_name][target] = auc
results = defaultdict(dict)

for target in tqdm.tqdm(TARGETS, leave=False):

    # Load the structures (i.e. SMILES)
    df = get_compounds_for_target(target)
    n_actives = len(df[df["subset"] == "actives"])

    smiles = df["SMILES"].values
    smiles = np.array([smi for smi in smiles if dm.to_mol(smi) is not None])

    # Featurize
    feats_ecfp, ind_ecfp = trans_ecfp(smiles, ignore_errors=True)
    feats_mordred, ind_mordred = trans_mordred(smiles, ignore_errors=True)
    feats_chemberta, ind_chemberta = trans_chemberta(smiles, ignore_errors=True)

    X = {
        "ECFP": feats_ecfp[ind_ecfp],
        "Mordred": feats_mordred[ind_mordred],
        "ChemBERTa": feats_chemberta[ind_chemberta],
    }

    # Get the train-test split
    train_active_ind, train_decoy_ind = get_train_decoy_split_for_target(target, no_of_actives=20)
    train_decoy_ind = [i + n_actives for i in train_decoy_ind]

    for feat_name, feats in X.items():

        # Train the model
        knn = KNeighborsClassifier()
        train_ind = np.concatenate([train_active_ind, train_decoy_ind])
        test_ind = np.array([i for i in range(len(df)) if i not in train_ind])
        knn.fit(feats[train_ind], df.iloc[train_ind]["target"].values)

        # Get targets and predictions
        y_true = df.iloc[test_ind]["target"].values
        y_pred = knn.predict_proba(feats[test_ind])[:, 1]

        # Compute the recovery score
        auc = roc_auc_score(y_true, y_pred)
        results[feat_name][target] = auc

 30%|██████████████████████████████████████████████████████████▏                                                                                                                                       | 3/10 [02:39<06:23, 54.78s/it][11:35:54] Explicit valence for atom # 1 C greater than permitted
 40%|█████████████████████████████████████████████████████████████████████████████▌                                                                                                                    | 4/10 [03:17<04:49, 48.17s/it][11:36:34] Explicit valence for atom # 6 C greater than permitted
 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                   | 9/10 [16:00<01:53, 113.93s/it][11:49:14] Explicit valence for atom # 1 C greater than permitted
[11:49:14] Explicit valence for atom # 1 C greater than permitted
[11:49:14] Explicit valence for atom # 20 C greater than permitted
[11:49:14] Explicit valence for atom # 19 C greater than permitted
[11:49:14] Explicit valence for atom # 1 C greater than permitted
[11:49:14] Explicit valence for atom # 20 C greater than permitted
[11:49:14] Explicit valence for atom # 11 C greater than permitted
[11:49:15] Explicit valence for atom # 20 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 20 C greater than permitted
[11:49:15] Explicit valence for atom # 19 C greater than permitted
[11:49:15] Explicit valence for atom # 21 C greater than permitted
[11:49:15] Explicit valence for atom # 18 C greater than permitted
[11:49:15] Explicit valence for atom # 23 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 20 C greater than permitted
[11:49:15] Explicit valence for atom # 25 C greater than permitted
[11:49:15] Explicit valence for atom # 27 C greater than permitted
[11:49:15] Explicit valence for atom # 24 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 26 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 28 C greater than permitted
[11:49:15] Explicit valence for atom # 27 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 26 C greater than permitted
[11:49:15] Explicit valence for atom # 28 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 18 C greater than permitted
[11:49:15] Explicit valence for atom # 17 C greater than permitted
[11:49:15] Explicit valence for atom # 17 C greater than permitted
[11:49:15] Explicit valence for atom # 29 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] [11:49:15] Explicit valence for atom # 1 C greater than permitted
Explicit valence for atom # 18 C greater than permitted
[11:49:15] Explicit valence for atom # 26 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 25 C greater than permitted
[11:49:15] Explicit valence for atom # 25 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 29 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:15] Explicit valence for atom # 1 C greater than permitted
[11:49:16] Explicit valence for atom # 25 C greater than permitted
[11:49:16] Explicit valence for atom # 25 C greater than permitted
[11:49:16] Explicit valence for atom # 24 C greater than permitted
[11:49:16] Explicit valence for atom # 1 C greater than permitted
[11:49:16] Explicit valence for atom # 1 C greater than permitted
[11:49:16] Explicit valence for atom # 1 C greater than permitted

In [13]:

                
                    Copied!
                    
                        
                        
                    
                    

            
# Replicate the figure from https://doi.org/10.1021/ci400466r

xs = list(range(len(TARGETS)))

ecfp_scores = [results["ECFP"][target] for target in TARGETS]
mordred_scores = [results["Mordred"][target] for target in TARGETS]
chemberta_scores = [results["ChemBERTa"][target] for target in TARGETS]

fig, ax = plt.subplots(figsize=(10, 6))
for scores, label in zip([ecfp_scores, mordred_scores, chemberta_scores], ["ECFP", "Mordred", "ChemBERTa"]):
    ax.plot(xs, scores, label=label)

ax.set_xlabel("Target")
ax.set_ylabel("AUC")
ax.set_xticks(xs)
ax.set_xticklabels(TARGETS, rotation=90)
ax.legend()
# Replicate the figure from https://doi.org/10.1021/ci400466r

xs = list(range(len(TARGETS)))

ecfp_scores = [results["ECFP"][target] for target in TARGETS]
mordred_scores = [results["Mordred"][target] for target in TARGETS]
chemberta_scores = [results["ChemBERTa"][target] for target in TARGETS]

fig, ax = plt.subplots(figsize=(10, 6))
for scores, label in zip([ecfp_scores, mordred_scores, chemberta_scores], ["ECFP", "Mordred", "ChemBERTa"]):
    ax.plot(xs, scores, label=label)

ax.set_xlabel("Target")
ax.set_ylabel("AUC")
ax.set_xticks(xs)
ax.set_xticklabels(TARGETS, rotation=90)
ax.legend()

Out[13]:

<matplotlib.legend.Legend at 0x7fc9a7e6f220>

Conclusion¶

For search, we observe yet again another ranking. It now seems that ECFP is the best option across the board, although it's target dependent.

In summary: There's no single best molecular featurizer. With Molfeat it has just become a whole lot easier to make sure you can experiment with a diverse set of popular options to make sure you pick the featurizer that works best for you!

Citations¶

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., ... & Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical science, 9(2), 513-530.
Riniker, S., Fechner, N., & Landrum, G. A. (2013). Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing. Journal of chemical information and modeling, 53(11), 2829-2836.
Riniker, S., & Landrum, G. A. (2013). Open-source platform to benchmark fingerprints for ligand-based virtual screening. Journal of cheminformatics, 5(1), 26.
Rogers, D., & Hahn, M. (2010). Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5), 742-754.
Moriwaki, H., Tian, Y. S., Kawashita, N., & Takagi, T. (2018). Mordred: a molecular descriptor calculator. Journal of cheminformatics, 10(1), 1-14.
Chithrananda, S., Grand, G., & Ramsundar, B. (2020). Chemberta: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885.
Efficient and Robust Automated Machine Learning Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum and Frank Hutter Advances in Neural Information Processing Systems 28 (2015)
Auto-Sklearn 2.0: The Next Generation Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer and Frank Hutter* arXiv:2007.04074 [cs.LG], 2020