Why bother?
%load_ext autoreload
%autoreload 2
One featurizer to rule them all?
Unlike many other machine learning domains, molecular featurization (i.e. the process of transforming a molecule into a vector) lacks a good default. It remains unclear how to capture the richness of molecular data in a unified representation, and what works best depends heavily on the nature and constraints of the task you are trying to model. It is therefore good practice to try different featurization schemes, from structural fingerprints to physico-chemical descriptors and pre-trained embeddings.
Don't take our word for it
To demonstrate the impact a featurizer can have, we set up two simple benchmarks.
- To demonstrate the impact on modeling, we will use two datasets from MoleculeNet.
- To demonstrate the impact on search, we will use the RDKit Benchmarking Platform.
We will compare the performance of three different featurizers:
- ECFP6 [1]: Binary, circular fingerprints where each bit indicates the presence of particular substructures of a radius up to 3 bonds away from an atom.
- Mordred [2]: A set of more than 1800 continuous 2D and 3D molecular descriptors.
- ChemBERTa [3]: Learned representations from a pre-trained SMILES transformer model.
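To make the idea of a binary fingerprint concrete, here is a toy sketch of how a set of substructure identifiers can be folded into a fixed-length bit vector. This is not the ECFP algorithm as implemented in RDKit or molfeat: the function `toy_binary_fingerprint`, the substructure labels, and the choice of `zlib.crc32` as hash are all assumptions made for illustration.

```python
import zlib


def toy_binary_fingerprint(substructures, n_bits=16):
    """Fold a collection of substructure identifiers into a fixed-length bit vector."""
    bits = [0] * n_bits
    for sub in substructures:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str
        index = zlib.crc32(sub.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits


# Hypothetical substructure labels, for illustration only
small_molecule = ["C", "O", "CC", "CO", "CCO"]
larger_molecule = small_molecule + ["CCC", "CCCO"]

fp_a = toy_binary_fingerprint(small_molecule)
fp_b = toy_binary_fingerprint(larger_molecule)

# Bits set in both fingerprints correspond to shared substructures
shared = sum(a & b for a, b in zip(fp_a, fp_b))
print(f"bits set: {sum(fp_a)} vs {sum(fp_b)}, shared: {shared}")
```

Because each bit only records presence, two different substructures can collide on the same bit. Real fingerprints reduce this effect by using far more bits than this sketch does.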
Modeling
We will compare the performance on two datasets using auto-sklearn, an AutoML toolkit built on scikit-learn [4, 5].
import os
import numpy as np
import pandas as pd
import datamol as dm
import autosklearn.classification
import autosklearn.regression
from sklearn.metrics import mean_absolute_error, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit
from rdkit.Chem import SaltRemover
from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer
def load_dataset(uri: str, readout_col: str):
    """Loads the MoleculeNet dataset"""
    df = pd.read_csv(uri)
    smiles = df["smiles"].values
    y = df[readout_col].values
    return smiles, y

def preprocess_smiles(smi):
    """Preprocesses the SMILES string"""
    with dm.without_rdkit_log():
        mol = dm.to_mol(smi, ordered=True, sanitize=False)
        mol = dm.sanitize_mol(mol)
        if mol is None:
            # Signal a failed molecule so the caller can filter it out
            return None
        mol = dm.standardize_mol(mol, disconnect_metals=True)
        remover = SaltRemover.SaltRemover()
        mol = remover.StripMol(mol, dontRemoveEverything=True)
        return dm.to_smiles(mol)

def scaffold_split(smiles):
    """In line with common practice, we will use the scaffold split to evaluate our models"""
    scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(dm.to_mol(smi))) for smi in smiles]
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    return next(splitter.split(smiles, groups=scaffolds))
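The key property of a scaffold split is that all molecules sharing a scaffold end up on the same side of the split. The sketch below illustrates that property with the standard library only. It is written in the spirit of a group-based shuffle split rather than as a reimplementation of `GroupShuffleSplit`; the helper `group_split` and the scaffold labels are hypothetical.

```python
import random
from collections import defaultdict


def group_split(groups, test_size=0.2, seed=42):
    """Split indices so that every group lands entirely in train or in test."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    unique_groups = sorted(by_group)
    random.Random(seed).shuffle(unique_groups)

    # Hold out a fraction of the groups (not of the individual samples)
    n_test_groups = max(1, round(test_size * len(unique_groups)))
    test_groups = set(unique_groups[:n_test_groups])

    train_ind = [i for g in unique_groups if g not in test_groups for i in by_group[g]]
    test_ind = [i for g in test_groups for i in by_group[g]]
    return sorted(train_ind), sorted(test_ind)


# Hypothetical scaffold labels, one per molecule
scaffolds = [
    "benzene", "benzene", "pyridine", "indole", "indole",
    "indole", "furan", "thiophene", "pyrrole", "pyridine",
]
train_ind, test_ind = group_split(scaffolds)

train_scaffolds = {scaffolds[i] for i in train_ind}
test_scaffolds = {scaffolds[i] for i in test_ind}
print("shared scaffolds:", train_scaffolds & test_scaffolds)
```

Since the test set only contains scaffolds the model has never seen, the resulting score is usually a harsher, more realistic estimate than one from a random split.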
# Setup the featurizers
trans_ecfp = FPVecTransformer(kind="ecfp:6", n_jobs=-1)
trans_mordred = FPVecTransformer(kind="mordred", replace_nan=True, n_jobs=-1)
trans_chemberta = PretrainedHFTransformer(kind='ChemBERTa-77M-MLM', notation='smiles')
Lipophilicity
Lipophilicity is a regression task with 4200 molecules.
# Prepare the Lipophilicity dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv", "exp")
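smiles = [preprocess_smiles(smi) for smi in smiles]
# Drop molecules that failed preprocessing (returned as None) and keep the labels aligned
keep = np.array([smi is not None for smi in smiles])
smiles = np.array([smi for smi in smiles if smi is not None])
y_true = y_true[keep]
X = {
    "ECFP": trans_ecfp(smiles),
    "Mordred": trans_mordred(smiles),
    "ChemBERTa": trans_chemberta(smiles),
}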
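Preprocessing returns `None` for molecules that cannot be sanitized, so whenever such entries are dropped from the SMILES, the same entries have to be dropped from the labels. Otherwise features and targets silently go out of sync. A minimal sketch of that bookkeeping, using a hypothetical helper `drop_failed` and made-up values:

```python
def drop_failed(smiles, labels):
    """Remove entries whose preprocessed SMILES is None, keeping labels aligned."""
    kept = [(smi, y) for smi, y in zip(smiles, labels) if smi is not None]
    kept_smiles = [smi for smi, _ in kept]
    kept_labels = [y for _, y in kept]
    return kept_smiles, kept_labels


# Made-up values; None marks a molecule that failed preprocessing
raw_smiles = ["CCO", None, "c1ccccc1", "CC(=O)O"]
raw_labels = [0.1, 0.2, 0.3, 0.4]

clean_smiles, clean_labels = drop_failed(raw_smiles, raw_labels)
print(clean_smiles, clean_labels)
```

Filtering pairs rather than each list separately makes it impossible for the two lists to end up with different lengths.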
# To make the output less verbose:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Train a model
train_ind, test_ind = scaffold_split(smiles)
scores = {}
for name, feats in X.items():
    # Train
    automl = autosklearn.regression.AutoSklearnRegressor(
        memory_limit=24576,
        time_left_for_this_task=360,
        n_jobs=1
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    # Predict
    y_hat = automl.predict(feats[test_ind])
    # Evaluate
    mae = mean_absolute_error(y_true[test_ind], y_hat)
    scores[name] = mae
scores
{'ECFP': 0.6889895591995786, 'Mordred': 0.5481806419968572, 'ChemBERTa': 0.7432117051810577}
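Since the metric here is the mean absolute error, lower is better. The sketch below shows how the metric is computed on a toy example and ranks the three featurizers using the scores reported above, rounded to three decimals. The helper `mae` is a plain-Python stand-in written for illustration, not the scikit-learn function used in the benchmark.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: errors of 0.5, 0.0 and 1.0 average to 0.5
toy_mae = mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])

# Scores reported above, rounded to three decimals
scores = {"ECFP": 0.689, "Mordred": 0.548, "ChemBERTa": 0.743}

# Sort from best (lowest MAE) to worst
ranking = sorted(scores, key=scores.get)
best = ranking[0]
print(toy_mae, ranking)
```

On this dataset and with this time budget, the Mordred descriptors give the lowest error, followed by ECFP and then ChemBERTa.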
ClinTox
# Prepare the ClinTox dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/clintox.csv.gz", "CT_TOX")
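smiles = [preprocess_smiles(smi) for smi in smiles]
# Drop molecules that failed preprocessing (returned as None) and keep the labels aligned
keep = np.array([smi is not None for smi in smiles])
smiles = np.array([smi for smi in smiles if smi is not None])
y_true = y_true[keep]
X = {
    "ECFP": trans_ecfp(smiles),
    "Mordred": trans_mordred(smiles),
    "ChemBERTa": trans_chemberta(smiles),
}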
self._context.run(self._callback, *self._args) File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue await self.process_one() File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 502, in process_one await dispatch(*args) File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell await result File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 729, in execute_request reply_content = await reply_content File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute res = shell.run_cell( File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 540, in run_cell return super().run_cell(*args, **kwargs) File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell result = self._run_cell( File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell result = runner(coro) File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner coro.send(None) File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async has_raised = await self.run_ast_nodes(code_ast.body, cell_name, File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes if await self.run_code(code, result, async_=asy): File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3460, in 
run_code exec(code_obj, self.user_global_ns, self.user_ns) File "/tmp/ipykernel_11612/2689847368.py", line 3, in <module> smiles = np.array([preprocess_smiles(smi) for smi in smiles]) File "/tmp/ipykernel_11612/2689847368.py", line 3, in <listcomp> smiles = np.array([preprocess_smiles(smi) for smi in smiles]) File "/tmp/ipykernel_11612/2436713256.py", line 13, in preprocess_smiles mol = dm.sanitize_mol(mol) File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/mol.py", line 323, in sanitize_mol mol = _sanifix4.sanifix(mol) File "/home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/datamol/_sanifix4.py", line 121, in sanifix logging.debug(e, Chem.MolToSmiles(m)) Message: KekulizeException("Can't kekulize mol. Unkekulized atoms: 16") Arguments: ('CCCCc1c(=O)n(c2ccccc2)n(c2ccccc2)c1=O',) [10:36:46] Unusual charge on atom 0 number of radical electrons set to zero [10:36:48] Unusual charge on atom 0 number of radical electrons set to zero [10:36:48] Unusual charge on atom 0 number of radical electrons set to zero [10:36:48] Unusual charge on atom 0 number of radical electrons set to zero [10:36:48] Unusual charge on atom 0 number of radical electrons set to zero /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak. 
warnings.warn( /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: 
overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs) /home/cas/local/conda/envs/molfeat-benchmark/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
0%| | 0/1478 [00:00<?, ?it/s]
0%| | 0/1478 [00:00<?, ?it/s]
# Silence the Hugging Face tokenizers parallelism warning to keep the output less verbose
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Split once so that every featurizer is evaluated on the same scaffold split
train_ind, test_ind = scaffold_split(smiles)
scores = {}
for name, feats in X.items():
    # Train
    automl = autosklearn.classification.AutoSklearnClassifier(
        memory_limit=24576,
        time_left_for_this_task=360,
        n_jobs=1,
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    # Predict the probability of the positive class (column 1),
    # which is the score roc_auc_score expects for a binary task
    y_hat = automl.predict_proba(feats[test_ind])[:, 1]
    # Evaluate
    scores[name] = roc_auc_score(y_true[test_ind], y_hat)
scores
[WARNING] [2023-03-21 10:49:24,650:Client-EnsembleBuilder] No models better than random - using Dummy losses! Models besides current dummy model: 0 Dummy models: 1
[WARNING] [2023-03-21 10:49:40,878:Client-EnsembleBuilder] No models better than random - using Dummy losses! Models besides current dummy model: 0 Dummy models: 1
{'ECFP': 0.47138888888888886, 'Mordred': 0.4252777777777778, 'ChemBERTa': 0.3705555555555555}
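The evaluation above scores a binary classifier with AUROC. As a minimal sketch on hand-written toy probabilities (the arrays below are illustrative assumptions, not benchmark outputs), the following compares two ways of turning `predict_proba` output into a ranking score for `roc_auc_score`: the positive-class column, and the row-wise maximum. The maximum only measures how confident the model is in whichever class it picked, so it discards the direction of the prediction.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and class probabilities (illustrative values only).
# Columns follow scikit-learn's convention: [P(class 0), P(class 1)].
y_true = np.array([0, 0, 1, 1])
proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.35, 0.65],
    [0.20, 0.80],
])

# Score each sample by the probability of the positive class.
auroc_positive = roc_auc_score(y_true, proba[:, 1])

# Score each sample by its highest class probability instead.
auroc_max = roc_auc_score(y_true, proba.max(axis=-1))

print(f"positive-class column: {auroc_positive:.2f}")
print(f"row-wise maximum:      {auroc_max:.2f}")
```

On this toy example the positive-class column ranks every positive above every negative, while the row-wise maximum does no better than chance, even though both are derived from the same predictions.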
Conclusion
For Lipophilicity, the Mordred featurizer proves most powerful, outperforming the next best featurizer by about 20%. For ClinTox, however, the tables turn: ECFP outperforms Mordred by about 10% (AUROC 0.47 vs. 0.43). All three ClinTox scores fall below 0.5, though, so none of the models ranks the test set better than chance on this split, and the ClinTox ordering should be read with caution.
This shows the importance of trying different featurizers. Luckily, with Molfeat, this has just become a lot easier to do!
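The percentages quoted above are relative gaps between the best and the second-best featurizer. A small sketch of that arithmetic, using the ClinTox AUROC scores reported above; the helper `relative_gap` and its `higher_is_better` flag are hypothetical names introduced here for illustration only:

```python
def relative_gap(scores, higher_is_better=True):
    """Return (best, runner_up, relative gap) for a dict of name -> score.

    The gap is expressed relative to the runner-up's score. For error
    metrics such as MAE, pass higher_is_better=False.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    (best, best_score), (runner_up, runner_score) = ranked[0], ranked[1]
    if higher_is_better:
        gap = (best_score - runner_score) / runner_score
    else:
        gap = (runner_score - best_score) / runner_score
    return best, runner_up, gap


# ClinTox AUROC scores as reported in the output above.
clintox = {
    "ECFP": 0.47138888888888886,
    "Mordred": 0.4252777777777778,
    "ChemBERTa": 0.3705555555555555,
}

best, runner_up, gap = relative_gap(clintox)
print(f"{best} beats {runner_up} by {gap:.1%}")  # ECFP beats Mordred by 10.8%
```

A relative gap says nothing about absolute quality, so it is worth reading it next to the raw scores.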
Search
We will evaluate the performance on the search task using the RDKit Benchmarking Platform.
# TODO
Citations
- Rogers, D., & Hahn, M. (2010). Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5), 742-754.
- Moriwaki, H., Tian, Y. S., Kawashita, N., & Takagi, T. (2018). Mordred: a molecular descriptor calculator. Journal of Cheminformatics, 10(1), 1-14.
- Chithrananda, S., Grand, G., & Ramsundar, B. (2020). ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885.
- Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. Advances in Neural Information Processing Systems, 28.
- Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., & Hutter, F. (2020). Auto-sklearn 2.0: The next generation. arXiv preprint arXiv:2007.04074.