Types of featurizers
%load_ext autoreload
%autoreload 2
import torch
import datamol as dm
import numpy as np
All featurizers in Molfeat inherit from at least one of three classes:

- `molfeat.calc.SerializableCalculator`: A calculator is a Callable that featurizes a single molecule.
- `molfeat.trans.MoleculeTransformer`: A transformer is a class that wraps a calculator in a featurization pipeline.
- `molfeat.trans.pretrained.PretrainedMolTransformer`: A subclass of `MoleculeTransformer` that extends the transformer interface to support the usage of pretrained models.
In this tutorial, we will look at each of these classes in more detail.
Calculators¶
A calculator is a Callable that takes an RDKit `Chem.Mol` object or a SMILES string and returns a feature vector. In the following example, we will use the `FPCalculator`.
from molfeat.calc import FPCalculator
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
calc = FPCalculator("maccs")
X = calc(smiles)
X.shape
(167,)
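As noted above, a calculator accepts an RDKit `Chem.Mol` object just as well as a SMILES string. A quick sanity check (a minimal sketch using the already-imported datamol) shows that both inputs produce the same features:
# The calculator also accepts an RDKit Chem.Mol object
mol = dm.to_mol(smiles)
X_mol = calc(mol)
assert (X == X_mol).all()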
The `FPCalculator` implements several popular molecular fingerprints:
from molfeat.calc import FP_FUNCS
FP_FUNCS.keys()
dict_keys(['maccs', 'avalon', 'ecfp', 'fcfp', 'topological', 'atompair', 'rdkit', 'pattern', 'layered', 'map4', 'secfp', 'erg', 'estate', 'avalon-count', 'rdkit-count', 'ecfp-count', 'fcfp-count', 'topological-count', 'atompair-count'])
Switching to any other fingerprint is easy:
calc = FPCalculator("ecfp")
X = calc(smiles)
X.shape
(2048,)
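Any of the names in `FP_FUNCS` can be passed to `FPCalculator` in the same way. As a small illustration (the exact dimensions depend on each fingerprint's defaults in your installed version), we can compare a few of them:
# Compare the dimensionality of a few common fingerprints
for fp_name in ["maccs", "ecfp", "atompair", "rdkit"]:
    fp_calc = FPCalculator(fp_name)
    print(fp_name, fp_calc(smiles).shape)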
Beyond these fingerprints, Molfeat also provides calculators for other molecular descriptors. The list of available options can be further extended through plugins. All available calculator classes, both built-in and plugin-based, can be found through the `molfeat.calc` module:
from molfeat.calc import _CALCULATORS
_CALCULATORS.keys()
dict_keys(['CATS', 'RDKitDescriptors2D', 'MordredDescriptors', 'RDKitDescriptors3D', 'FPCalculator', 'Pharmacophore2D', 'Pharmacophore3D', 'ScaffoldKeyCalculator', 'USRDescriptors', 'ElectroShapeDescriptors'])
Every calculator is serializable, meaning it can be efficiently stored to — and loaded from — disk. To learn more, please see the tutorial on saving and loading featurizers.
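For example, a calculator can be round-tripped through a plain state dict. This is a minimal sketch assuming the `to_state_dict()` / `from_state_dict()` API covered in that tutorial:
# Serialize the calculator to a plain dictionary...
calc = FPCalculator("ecfp")
state = calc.to_state_dict()
# ...and restore an equivalent calculator from it
calc_reloaded = FPCalculator.from_state_dict(state)
assert (calc(smiles) == calc_reloaded(smiles)).all()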
Transformers¶
In practice, you won't want to featurize a single molecule, but rather a batch of molecules. This is where transformers come in. A transformer is a class that wraps a calculator in a featurization pipeline. The `MoleculeTransformer` class provides a convenient interface for featurizing a batch of molecules, along with a number of useful methods to customize the pipeline.
from molfeat.calc import RDKitDescriptors2D
from molfeat.trans import MoleculeTransformer
data = dm.data.freesolv().smiles.values
# Let's try a different calculator!
# This calculator computes all of the 2D, physicochemical descriptors available in RDKit
calc = RDKitDescriptors2D(replace_nan=True)
# Wrap the calculator in a transformer instance
featurizer = MoleculeTransformer(calc, dtype=np.float64)
with dm.without_rdkit_log():
    feats = featurizer(data)
feats.shape
(642, 214)
As mentioned, the `MoleculeTransformer` class provides a number of useful methods to customize the featurization pipeline. For example, you can easily change the dtype of the features or parallelize the computation.
feats.dtype
dtype('float64')
# To save on memory, we would rather use `float32` than `float64`. Let's change that!
featurizer = MoleculeTransformer(calc, dtype=np.float32)
with dm.without_rdkit_log():
    feats = np.stack(featurizer(data))
feats.dtype
dtype('float32')
# Even better, let's directly cast to Torch vectors so we can use them in PyTorch!
featurizer = MoleculeTransformer(calc, dtype=torch.float32)
with dm.without_rdkit_log():
    feats = featurizer(data)
feats.dtype
torch.float32
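Because the features are now a `torch.Tensor`, they can be consumed directly by standard PyTorch utilities. For example (plain PyTorch, nothing Molfeat-specific):
from torch.utils.data import DataLoader, TensorDataset

# Wrap the feature matrix in a dataset and iterate over mini-batches
dataset = TensorDataset(feats)
loader = DataLoader(dataset, batch_size=64, shuffle=True)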
%%timeit
# Let's time our current featurization pipeline
featurizer = MoleculeTransformer(calc, n_jobs=1, dtype=torch.float32)
with dm.without_rdkit_log():
    X = featurizer(data)
19.8 s ± 4.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# With transformer classes, it's really easy to add parallelization! Let's try to speed this up.
featurizer = MoleculeTransformer(calc, n_jobs=4, dtype=torch.float32)
with dm.without_rdkit_log():
    X = featurizer(data)
5.79 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Even with such a small dataset, we can already see some performance improvements.
Concatenate featurizers¶
Another interesting feature offered by Molfeat is the ability to concatenate multiple featurizers. Feature concatenation has some limitations, however. The most significant is that you cannot set the parameters of all the underlying transformers in a single call, unless you pass a list of strings corresponding to the calculator names at initialization. A concatenated featurizer might therefore not be compatible with the Scikit-learn grid search CV API, and you will need to handle the update of its parameters yourself (see the sketch at the end of this section).
from molfeat.trans.fp import FPVecTransformer
# We will use the FPVecTransformer to automatically create a calculator by name
maccs = FPVecTransformer("maccs", dtype=np.float32)
ecfp4 = FPVecTransformer("ecfp:4", dtype=np.float32)
maccs([smiles]).shape, ecfp4([smiles]).shape
((1, 167), (1, 2000))
from molfeat.trans.concat import FeatConcat
featurizer = FeatConcat([maccs, ecfp4], dtype=np.float32)
featurizer([smiles]).shape
(1, 2167)
Alternatively, you can use a list of strings corresponding to the `FPVecTransformer` names, and even define parameters for each featurizer.
from molfeat.trans.concat import FeatConcat
ecfp_params = {"radius": 2}
featurizer = FeatConcat(["maccs", "ecfp"], params=dict(ecfp=ecfp_params), dtype=np.float32)
featurizer([smiles]).shape
(1, 2167)
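If you need different parameters after a `FeatConcat` has been constructed, the simplest and safest option is to rebuild it. For inspection, the sketch below also assumes that `FeatConcat` can be iterated like a list of its sub-transformers; treat this as a starting point rather than the official update API:
# Inspect the sub-transformers (assumes FeatConcat behaves like a list)
for sub_featurizer in featurizer:
    print(type(sub_featurizer).__name__)

# To "update" parameters, rebuild the concatenated featurizer
featurizer = FeatConcat(["maccs", "ecfp"], params=dict(ecfp={"radius": 3}), dtype=np.float32)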
Further reading¶
This has only scratched the surface of what the `MoleculeTransformer` class offers. Subsequent tutorials will dive into more detail:
- Easily add your own featurizers: learn how to easily add your own featurizers to Molfeat to take full control.
- Integrations with ML frameworks: learn how to easily integrate Molfeat with PyTorch and Scikit-learn.
Pretrained transformers¶
Finally, the `PretrainedMolTransformer` class extends the transformer interface to support the usage of pretrained models. This class is a subclass of `MoleculeTransformer` and inherits all its methods. In addition, it adds the `_embed()` and `_convert()` methods:
- `_embed()`: since pretrained models benefit from batched featurization, this method is called by the transformer instead of the calculator.
- `_convert()`: this method is called by the transformer to convert the input. For example:
    - For a pretrained language model, we convert a SMILES string or Mol object to a SELFIES string.
    - For a pretrained GNN, we convert a SMILES string or Mol object to a DGL graph.
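To make this concrete, here is a minimal sketch of what a custom subclass could look like. The model object and its `encode_batch()` method are hypothetical stand-ins for whatever pretrained model you want to wrap; only the `_convert()` / `_embed()` structure comes from Molfeat:
from molfeat.trans.pretrained import PretrainedMolTransformer

class MyPretrainedTransformer(PretrainedMolTransformer):
    def __init__(self, model, **kwargs):
        super().__init__(dtype=np.float32, **kwargs)
        self.model = model  # a hypothetical pretrained encoder

    def _convert(self, inputs, **kwargs):
        # Convert each molecule to the representation the model expects,
        # here a canonical SMILES string
        return [dm.to_smiles(dm.to_mol(m)) for m in inputs]

    def _embed(self, smiles_batch, **kwargs):
        # Featurize the whole batch at once with the pretrained model
        # (encode_batch is a hypothetical API)
        return self.model.encode_batch(smiles_batch)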
Furthermore, the `PretrainedMolTransformer` supports the use of a caching system. To learn more, see the tutorial on the cache.