Why bother?

In [1]:

            
                Copied!
                
%load_ext autoreload
%autoreload 2
%load_ext autoreload
%autoreload 2

One featurizer to rule them all?¶

Contrary to many other machine learning domains, molecular featurization (i.e. the process of transforming a molecule into a vector) lacks a good default. It remains unclear how we can effectively capture the richness of molecular data in a unified representation and what works best heavily depends on the nature and constraints of the task you are trying to model. It is therefore good practice to try different featurization schemes: From structural fingerprints, to physico-chemical descriptors and pre-trained embeddings.

Don't take our word for it¶

To demonstrate the impact a featurizer can have, we setup two simple benchmarks.

To demonstrate the impact on modeling, we will use two datasets from MoleculeNet.
To demonstrate the impact on search, we will use the RDKit Benchmarking Platform.

We will compare the performance of three different featurizers:

ECFP6 [1]: Binary, circular fingerprints where each bit indicates the presence of particular substructures of a radius up to 3 bonds away from an atom.
Mordred [2]: Continuous descriptors with more than 1800 2D and 3D descriptors.
ChemBERTa [3]: Learned representations from a pre-trained SMILES transformer model.

Modeling¶

We will compare the performance on two datasets using scikit-learn AutoML [4, 5] models.

In [3]:

            
                Copied!
                
                    
                    
                
                

        
import os
import numpy as np
import pandas as pd
import datamol as dm
import autosklearn.classification
import autosklearn.regression
from sklearn.metrics import mean_absolute_error, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit
from rdkit.Chem import SaltRemover

from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer
import os
import numpy as np
import pandas as pd
import datamol as dm
import autosklearn.classification
import autosklearn.regression
from sklearn.metrics import mean_absolute_error, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit
from rdkit.Chem import SaltRemover

from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer

In [4]:

            
                Copied!
                
                    
                    
                
                

        
def load_dataset(uri: str, readout_col: str):
    """Loads the MoleculeNet dataset"""
    df = pd.read_csv(uri)
    smiles = df["smiles"].values
    y = df[readout_col].values
    return smiles, y


def preprocess_smiles(smi):
    """Preprocesses the SMILES string"""
    with dm.without_rdkit_log():
        mol = dm.to_mol(smi, ordered=True, sanitize=False)
        mol = dm.sanitize_mol(mol)
        if mol is None: 
            return
        
        mol = dm.standardize_mol(mol, disconnect_metals=True)
        remover = SaltRemover.SaltRemover()
        mol = remover.StripMol(mol, dontRemoveEverything=True)

    return dm.to_smiles(mol)


def scaffold_split(smiles):
    """In line with common practice, we will use the scaffold split to evaluate our models"""
    scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(dm.to_mol(smi))) for smi in smiles]
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    return next(splitter.split(smiles, groups=scaffolds))
def load_dataset(uri: str, readout_col: str):
    """Loads the MoleculeNet dataset"""
    df = pd.read_csv(uri)
    smiles = df["smiles"].values
    y = df[readout_col].values
    return smiles, y


def preprocess_smiles(smi):
    """Preprocesses the SMILES string"""
    with dm.without_rdkit_log():
        mol = dm.to_mol(smi, ordered=True, sanitize=False)
        mol = dm.sanitize_mol(mol)
        if mol is None: 
            return
        
        mol = dm.standardize_mol(mol, disconnect_metals=True)
        remover = SaltRemover.SaltRemover()
        mol = remover.StripMol(mol, dontRemoveEverything=True)

    return dm.to_smiles(mol)


def scaffold_split(smiles):
    """In line with common practice, we will use the scaffold split to evaluate our models"""
    scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(dm.to_mol(smi))) for smi in smiles]
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    return next(splitter.split(smiles, groups=scaffolds))

In [5]:

            
                Copied!
                
# Setup the featurizers
trans_ecfp = FPVecTransformer(kind="ecfp:6", n_jobs=-1)
trans_mordred = FPVecTransformer(kind="mordred", replace_nan=True, n_jobs=-1)
trans_chemberta = PretrainedHFTransformer(kind='ChemBERTa-77M-MLM', notation='smiles')
# Setup the featurizers
trans_ecfp = FPVecTransformer(kind="ecfp:6", n_jobs=-1)
trans_mordred = FPVecTransformer(kind="mordred", replace_nan=True, n_jobs=-1)
trans_chemberta = PretrainedHFTransformer(kind='ChemBERTa-77M-MLM', notation='smiles')

/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/google/auth/_default.py:78: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. See the following page for troubleshooting: https://cloud.google.com/docs/authentication/adc-troubleshooting/user-creds. 
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)

Lipophilicity¶

Lipophilicity is a regression task with 4200 molecules

In [6]:

            
                Copied!
                
                    
                    
                
                

        
# Prepare the Lipophilicity dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv", "exp")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if smi != ""])

X = {
    "ECFP": trans_ecfp(smiles),
    "Mordred": trans_mordred(smiles),
    "ChemBERTa": trans_chemberta(smiles),
}
# Prepare the Lipophilicity dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv", "exp")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if smi != ""])

X = {
    "ECFP": trans_ecfp(smiles),
    "Mordred": trans_mordred(smiles),
    "ChemBERTa": trans_chemberta(smiles),
}

/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  warnings.warn(
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

  0%|          | 0/4200 [00:00<?, ?it/s]

  0%|          | 0/4200 [00:00<?, ?it/s]

In [7]:

            
                Copied!
                
                    
                    
                
                

        
# To make the output less verbose: 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Train a model
train_ind, test_ind = scaffold_split(smiles)

scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.regression.AutoSklearnRegressor(
        memory_limit=24576, 
        time_left_for_this_task=360,
        n_jobs=1
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict(feats[test_ind])
    
    # Evaluate
    mae = mean_absolute_error(y_true[test_ind], y_hat)
    scores[name] = mae

scores
# To make the output less verbose: 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Train a model
train_ind, test_ind = scaffold_split(smiles)

scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.regression.AutoSklearnRegressor(
        memory_limit=24576, 
        time_left_for_this_task=360,
        n_jobs=1
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict(feats[test_ind])
    
    # Evaluate
    mae = mean_absolute_error(y_true[test_ind], y_hat)
    scores[name] = mae

scores

[WARNING] [2023-03-21 09:43:37,219:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-21 09:43:52,005:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-21 09:43:53,508:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-21 09:49:31,814:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-21 09:49:35,671:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-21 09:49:45,916:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-21 09:55:25,854:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-21 09:55:31,098:Client-EnsembleBuilder] No runs were available to build an ensemble from
[WARNING] [2023-03-21 09:56:08,207:Client-EnsembleBuilder] No runs were available to build an ensemble from

Out[7]:

{'ECFP': 0.6889895591995786,
 'Mordred': 0.5481806419968572,
 'ChemBERTa': 0.7432117051810577}

ClinTox¶

In [12]:

            
                Copied!
                
                    
                    
                
                

        
# Prepare the ClinTox dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/clintox.csv.gz", "CT_TOX")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if smi is not None])

X = {
    "ECFP": trans_ecfp(smiles),
    "Mordred": trans_mordred(smiles),
    "ChemBERTa": trans_chemberta(smiles),
}
# Prepare the ClinTox dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/clintox.csv.gz", "CT_TOX")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if smi is not None])

X = {
    "ECFP": trans_ecfp(smiles),
    "Mordred": trans_mordred(smiles),
    "ChemBERTa": trans_chemberta(smiles),
}

--- Logging error ---
Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 118, in sanifix
    Chem.SanitizeMol(cp)
rdkit.Chem.rdchem.AtomValenceException: Explicit valence for atom # 0 N, 5, is greater than permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 725, in start
    self.io_loop.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 215, in start
    self.asyncio_loop.run_forever()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue
    await self.process_one()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 502, in process_one
    await dispatch(*args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell
    await result
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
    reply_content = await reply_content
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
    result = self._run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
    result = runner(coro)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <module>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <listcomp>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2436713256.py", line 13, in preprocess_smiles
    mol = dm.sanitize_mol(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/mol.py", line 323, in sanitize_mol
    mol = _sanifix4.sanifix(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 121, in sanifix
    logging.debug(e, Chem.MolToSmiles(m))
Message: AtomValenceException('Explicit valence for atom # 0 N, 5, is greater than permitted')
Arguments: ('[NH4][Pt]([NH4])(Cl)Cl',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 118, in sanifix
    Chem.SanitizeMol(cp)
rdkit.Chem.rdchem.KekulizeException: Can't kekulize mol.  Unkekulized atoms: 21

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 725, in start
    self.io_loop.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 215, in start
    self.asyncio_loop.run_forever()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue
    await self.process_one()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 502, in process_one
    await dispatch(*args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell
    await result
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
    reply_content = await reply_content
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
    result = self._run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
    result = runner(coro)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <module>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <listcomp>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2436713256.py", line 13, in preprocess_smiles
    mol = dm.sanitize_mol(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/mol.py", line 323, in sanitize_mol
    mol = _sanifix4.sanifix(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 121, in sanifix
    logging.debug(e, Chem.MolToSmiles(m))
Message: KekulizeException("Can't kekulize mol.  Unkekulized atoms: 21")
Arguments: ('O=c1c(CCS(=O)c2ccccc2)c(=O)n(c2ccccc2)n1c1ccccc1',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 118, in sanifix
    Chem.SanitizeMol(cp)
rdkit.Chem.rdchem.AtomValenceException: Explicit valence for atom # 81 N, 4, is greater than permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 725, in start
    self.io_loop.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 215, in start
    self.asyncio_loop.run_forever()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue
    await self.process_one()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 502, in process_one
    await dispatch(*args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell
    await result
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
    reply_content = await reply_content
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
    result = self._run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
    result = runner(coro)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <module>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <listcomp>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2436713256.py", line 13, in preprocess_smiles
    mol = dm.sanitize_mol(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/mol.py", line 323, in sanitize_mol
    mol = _sanifix4.sanifix(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 121, in sanifix
    logging.debug(e, Chem.MolToSmiles(m))
Message: AtomValenceException('Explicit valence for atom # 81 N, 4, is greater than permitted')
Arguments: ('CC1=C2N3[C@@H]4[C@H](CC(N)=O)[C@@]2(C)CCC(=O)NC[C@@H](C)OP(=O)([O-])O[C@H]2[C@@H](O)[C@H](O[C@@H]2CO)N2C=N(c5cc(C)c(C)cc52)[Co+]325(O)N3=C1[C@@H](CCC(N)=O)C(C)(C)C3=CC1=N2C(=C(C)C2=N5[C@]4(C)[C@@](C)(CC(N)=O)[C@@H]2CCC(N)=O)[C@@](C)(CC(N)=O)[C@@H]1CCC(N)=O',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 118, in sanifix
    Chem.SanitizeMol(cp)
rdkit.Chem.rdchem.AtomValenceException: Explicit valence for atom # 82 N, 4, is greater than permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 725, in start
    self.io_loop.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 215, in start
    self.asyncio_loop.run_forever()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue
    await self.process_one()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 502, in process_one
    await dispatch(*args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell
    await result
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
    reply_content = await reply_content
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
    result = self._run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
    result = runner(coro)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <module>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <listcomp>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2436713256.py", line 13, in preprocess_smiles
    mol = dm.sanitize_mol(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/mol.py", line 323, in sanitize_mol
    mol = _sanifix4.sanifix(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 121, in sanifix
    logging.debug(e, Chem.MolToSmiles(m))
Message: AtomValenceException('Explicit valence for atom # 82 N, 4, is greater than permitted')
Arguments: ('CC1=C2N3[C@@H]4[C@H](CC(N)=O)[C@@]2(C)CCC(=O)NC[C@@H](C)OP(=O)(O)O[C@H]2[C@@H](O)[C@H](O[C@@H]2CO)N2C=N(c5cc(C)c(C)cc52)[Co]325(C#N)N3=C1[C@@H](CCC(N)=O)C(C)(C)C3=CC1=N2C(=C(C)C2=N5[C@]4(C)[C@@](C)(CC(N)=O)[C@@H]2CCC(N)=O)[C@@](C)(CC(N)=O)[C@@H]1CCC(N)=O',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 118, in sanifix
    Chem.SanitizeMol(cp)
rdkit.Chem.rdchem.KekulizeException: Can't kekulize mol.  Unkekulized atoms: 17

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 725, in start
    self.io_loop.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 215, in start
    self.asyncio_loop.run_forever()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue
    await self.process_one()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 502, in process_one
    await dispatch(*args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell
    await result
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
    reply_content = await reply_content
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
    result = self._run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
    result = runner(coro)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <module>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <listcomp>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2436713256.py", line 13, in preprocess_smiles
    mol = dm.sanitize_mol(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/mol.py", line 323, in sanitize_mol
    mol = _sanifix4.sanifix(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 121, in sanifix
    logging.debug(e, Chem.MolToSmiles(m))
Message: KekulizeException("Can't kekulize mol.  Unkekulized atoms: 17")
Arguments: ('CCCCc1c(=O)n(c2ccccc2)n(c2ccc(O)cc2)c1=O',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 118, in sanifix
    Chem.SanitizeMol(cp)
rdkit.Chem.rdchem.KekulizeException: Can't kekulize mol.  Unkekulized atoms: 16

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 725, in start
    self.io_loop.start()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 215, in start
    self.asyncio_loop.run_forever()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue
    await self.process_one()
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 502, in process_one
    await dispatch(*args)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell
    await result
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
    reply_content = await reply_content
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
    result = self._run_cell(
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
    result = runner(coro)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <module>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2689847368.py", line 3, in <listcomp>
    smiles = np.array([preprocess_smiles(smi) for smi in smiles])
  File "/tmp/ipykernel_11612/2436713256.py", line 13, in preprocess_smiles
    mol = dm.sanitize_mol(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/mol.py", line 323, in sanitize_mol
    mol = _sanifix4.sanifix(mol)
  File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 121, in sanifix
    logging.debug(e, Chem.MolToSmiles(m))
Message: KekulizeException("Can't kekulize mol.  Unkekulized atoms: 16")
Arguments: ('CCCCc1c(=O)n(c2ccccc2)n(c2ccccc2)c1=O',)
[10:36:46] Unusual charge on atom 0 number of radical electrons set to zero
[10:36:48] Unusual charge on atom 0 number of radical electrons set to zero
[10:36:48] Unusual charge on atom 0 number of radical electrons set to zero
[10:36:48] Unusual charge on atom 0 number of radical electrons set to zero
[10:36:48] Unusual charge on atom 0 number of radical electrons set to zero
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  warnings.warn(
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

  0%|          | 0/1478 [00:00<?, ?it/s]

  0%|          | 0/1478 [00:00<?, ?it/s]

In [13]:

            
                Copied!
                
                    
                    
                
                

        
# To make the output less verbose: 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Train a model
train_ind, test_ind = scaffold_split(smiles)

scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.classification.AutoSklearnClassifier(
        memory_limit=24576, 
        time_left_for_this_task=360,
        n_jobs=1
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict_proba(feats[test_ind])
    y_hat = np.max(y_hat, axis=-1)
    
    # Evaluate
    auroc = roc_auc_score(y_true[test_ind], y_hat)
    scores[name] = auroc

scores
# To make the output less verbose: 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Train a model
train_ind, test_ind = scaffold_split(smiles)

scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.classification.AutoSklearnClassifier(
        memory_limit=24576, 
        time_left_for_this_task=360,
        n_jobs=1
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict_proba(feats[test_ind])
    y_hat = np.max(y_hat, axis=-1)
    
    # Evaluate
    auroc = roc_auc_score(y_true[test_ind], y_hat)
    scores[name] = auroc

scores

[WARNING] [2023-03-21 10:49:24,650:Client-EnsembleBuilder] No models better than random - using Dummy losses!
	Models besides current dummy model: 0
	Dummy models: 1
[WARNING] [2023-03-21 10:49:40,878:Client-EnsembleBuilder] No models better than random - using Dummy losses!
	Models besides current dummy model: 0
	Dummy models: 1

Out[13]:

{'ECFP': 0.47138888888888886,
 'Mordred': 0.4252777777777778,
 'ChemBERTa': 0.3705555555555555}

Conclusion¶

We can see that for Lipophilicity, the Mordred featurizer proves most powerful, outperforming the next best featurizer by about 20%. For ClinTox, however, the tables have turned and it is instead ECFP that outperforms Mordred by about 10%.

This shows the importance of trying different featurizers. Luckily, with Molfeat, this has just become a lot easier to do!

Search¶

We will evaluate the performance on the search task using

In [ ]:

            
                Copied!
                
# TODO
# TODO

Citations¶

Rogers, D., & Hahn, M. (2010). Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5), 742-754.
Moriwaki, H., Tian, Y. S., Kawashita, N., & Takagi, T. (2018). Mordred: a molecular descriptor calculator. Journal of cheminformatics, 10(1), 1-14.
Chithrananda, S., Grand, G., & Ramsundar, B. (2020). Chemberta: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885.
Efficient and Robust Automated Machine Learning Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum and Frank Hutter Advances in Neural Information Processing Systems 28 (2015)
Auto-Sklearn 2.0: The Next Generation Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer and Frank Hutter* arXiv:2007.04074 [cs.LG], 2020