Types of featurizers
%load_ext autoreload
%autoreload 2
import torch
import datamol as dm
import numpy as np
All featurizers in Molfeat inherit from at least one of three classes:

- `molfeat.calc.SerializableCalculator`: A calculator is a Callable that featurizes a single molecule.
- `molfeat.trans.MoleculeTransformer`: A transformer is a class that wraps a calculator in a featurization pipeline.
- `molfeat.trans.pretrained.PretrainedMolTransformer`: A subclass of `MoleculeTransformer` that extends the transformer interface to support the usage of pretrained models.
In this tutorial, we will look at each of these classes in more detail.
Calculators¶
A calculator is a Callable that takes an RDKit `Chem.Mol` object or a SMILES string and returns a feature vector. In the following example, we will use the `FPCalculator`.
from molfeat.calc import FPCalculator
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
calc = FPCalculator("maccs")
X = calc(smiles)
X.shape
(167,)
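As noted above, a calculator accepts an RDKit `Chem.Mol` object just as well as a SMILES string. A quick sanity check (a minimal sketch using the already-imported datamol) shows that both inputs produce the same features:
# The calculator also accepts an RDKit Chem.Mol object
mol = dm.to_mol(smiles)
X_mol = calc(mol)
assert (X == X_mol).all()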
The `FPCalculator` implements several popular molecular fingerprints:
from molfeat.calc import FP_FUNCS
FP_FUNCS.keys()
dict_keys(['maccs', 'avalon', 'ecfp', 'fcfp', 'topological', 'atompair', 'rdkit', 'pattern', 'layered', 'map4', 'secfp', 'erg', 'estate', 'avalon-count', 'rdkit-count', 'ecfp-count', 'fcfp-count', 'topological-count', 'atompair-count'])
Switching to any other fingerprint is easy:
calc = FPCalculator("ecfp")
X = calc(smiles)
X.shape
(2048,)
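Any of the names in `FP_FUNCS` can be passed to `FPCalculator` in the same way. As a small illustration (the exact dimensions depend on each fingerprint's defaults in your installed version), we can compare a few of them:
# Compare the dimensionality of a few common fingerprints
for fp_name in ["maccs", "ecfp", "atompair", "rdkit"]:
    fp_calc = FPCalculator(fp_name)
    print(fp_name, fp_calc(smiles).shape)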
Beyond these fingerprints, Molfeat also provides calculators for other molecular descriptors. The list of available options can be further extended through plugins. All available calculator classes, both built-in and plugin-based, can be found through the `molfeat.calc` module:
from molfeat.calc import _CALCULATORS
_CALCULATORS.keys()
dict_keys(['CATS', 'RDKitDescriptors2D', 'MordredDescriptors', 'RDKitDescriptors3D', 'FPCalculator', 'Pharmacophore2D', 'Pharmacophore3D', 'ScaffoldKeyCalculator', 'USRDescriptors', 'ElectroShapeDescriptors'])
Every calculator is serializable, meaning it can be efficiently stored to — and loaded from — disk. To learn more, please see the tutorial on saving and loading featurizers.
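For example, a calculator can be round-tripped through a plain state dict. This is a minimal sketch assuming the `to_state_dict()` / `from_state_dict()` API covered in that tutorial:
# Serialize the calculator to a plain dictionary...
calc = FPCalculator("ecfp")
state = calc.to_state_dict()
# ...and restore an equivalent calculator from it
calc_reloaded = FPCalculator.from_state_dict(state)
assert (calc(smiles) == calc_reloaded(smiles)).all()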
Transformers¶
In practice, you won't want to featurize a single molecule, but rather a batch of molecules. This is where transformers come in. A transformer is a class that wraps a calculator in a featurization pipeline. The `MoleculeTransformer` class provides a convenient interface for featurizing a batch of molecules, along with a number of useful methods to customize the pipeline.
from molfeat.calc import RDKitDescriptors2D
from molfeat.trans import MoleculeTransformer
data = dm.data.freesolv().smiles.values
# Let's try a different calculator!
# This calculator computes all of the 2D, physicochemical descriptors available in RDKit
calc = RDKitDescriptors2D(replace_nan=True)
# Wrap the calculator in a transformer instance
featurizer = MoleculeTransformer(calc, dtype=np.float64)
with dm.without_rdkit_log():
    feats = featurizer(data)
feats.shape
(642, 214)
As mentioned, the `MoleculeTransformer` class provides a number of useful methods to customize the featurization pipeline. For example, you can easily change the dtype of the features or parallelize the computation.
feats.dtype
dtype('float64')
# To save on memory, we would rather use `float32` than `float64`. Let's change that!
featurizer = MoleculeTransformer(calc, dtype=np.float32)
with dm.without_rdkit_log():
    feats = np.stack(featurizer(data))
feats.dtype
dtype('float32')
# Even better, let's directly cast to Torch vectors so we can use them in PyTorch!
featurizer = MoleculeTransformer(calc, dtype=torch.float32)
with dm.without_rdkit_log():
    feats = featurizer(data)
feats.dtype
torch.float32
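Because the features are now a `torch.Tensor`, they can be consumed directly by standard PyTorch utilities. For example (plain PyTorch, nothing Molfeat-specific):
from torch.utils.data import DataLoader, TensorDataset

# Wrap the feature matrix in a dataset and iterate over mini-batches
dataset = TensorDataset(feats)
loader = DataLoader(dataset, batch_size=64, shuffle=True)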
%%timeit
# Let's time our current featurization pipeline
featurizer = MoleculeTransformer(calc, n_jobs=1, dtype=torch.float32)
with dm.without_rdkit_log():
    X = featurizer(data)
19.8 s ± 4.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# With transformer classes, it's really easy to add parallelization! Let's try to speed this up.
featurizer = MoleculeTransformer(calc, n_jobs=4, dtype=torch.float32)
with dm.without_rdkit_log():
    X = featurizer(data)
5.79 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Even with such a small dataset, we can already see some performance improvements.
Concatenate featurizers¶
Another interesting feature offered by Molfeat is the ability to concatenate multiple featurizers. Feature concatenation has some limitations, however. The most significant is that you cannot set the parameters of all the underlying transformers in a single call, unless you pass a list of strings corresponding to the calculator names at initialization. A concatenated featurizer might therefore not be compatible with the Scikit-learn grid search CV API, and you will need to handle the update of its parameters yourself (see the sketch at the end of this section).
from molfeat.trans.fp import FPVecTransformer
# We will use the FPVecTransformer to automatically create a calculator by name
maccs = FPVecTransformer("maccs", dtype=np.float32)
ecfp4 = FPVecTransformer("ecfp:4", dtype=np.float32)
maccs([smiles]).shape, ecfp4([smiles]).shape
((1, 167), (1, 2000))
from molfeat.trans.concat import FeatConcat
featurizer = FeatConcat([maccs, ecfp4], dtype=np.float32)
featurizer([smiles]).shape
(1, 2167)
Alternatively, you can use a list of strings corresponding to the `FPVecTransformer` names, and even define parameters for each featurizer.
from molfeat.trans.concat import FeatConcat
ecfp_params = {"radius": 2}
featurizer = FeatConcat(["maccs", "ecfp"], params=dict(ecfp=ecfp_params), dtype=np.float32)
featurizer([smiles]).shape
(1, 2167)
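If you need different parameters after a `FeatConcat` has been constructed, the simplest and safest option is to rebuild it. For inspection, the sketch below also assumes that `FeatConcat` can be iterated like a list of its sub-transformers; treat this as a starting point rather than the official update API:
# Inspect the sub-transformers (assumes FeatConcat behaves like a list)
for sub_featurizer in featurizer:
    print(type(sub_featurizer).__name__)

# To "update" parameters, rebuild the concatenated featurizer
featurizer = FeatConcat(["maccs", "ecfp"], params=dict(ecfp={"radius": 3}), dtype=np.float32)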
Further reading¶
This has only scratched the surface of what the `MoleculeTransformer` class offers. Subsequent tutorials will dive into more detail:
- Easily add your own featurizers: learn how to easily add your own featurizers to Molfeat to take full control.
- Integrations with ML frameworks: learn how to easily integrate Molfeat with PyTorch and Scikit-learn.
Pretrained transformers¶
Finally, the `PretrainedMolTransformer` class extends the transformer interface to support the usage of pretrained models. This class is a subclass of `MoleculeTransformer` and inherits all its methods. In addition, it adds the `_embed()` and `_convert()` methods:
- `_embed()`: since pretrained models benefit from batched featurization, this method is called by the transformer instead of the calculator.
- `_convert()`: this method is called by the transformer to convert the input. For example:
    - For a pretrained language model, we convert a SMILES string or Mol object to a SELFIES string.
    - For a pretrained GNN, we convert a SMILES string or Mol object to a DGL graph.
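To make this concrete, here is a minimal sketch of what a custom subclass could look like. The model object and its `encode_batch()` method are hypothetical stand-ins for whatever pretrained model you want to wrap; only the `_convert()` / `_embed()` structure comes from Molfeat:
from molfeat.trans.pretrained import PretrainedMolTransformer

class MyPretrainedTransformer(PretrainedMolTransformer):
    def __init__(self, model, **kwargs):
        super().__init__(dtype=np.float32, **kwargs)
        self.model = model  # a hypothetical pretrained encoder

    def _convert(self, inputs, **kwargs):
        # Convert each molecule to the representation the model expects,
        # here a canonical SMILES string
        return [dm.to_smiles(dm.to_mol(m)) for m in inputs]

    def _embed(self, smiles_batch, **kwargs):
        # Featurize the whole batch at once with the pretrained model
        # (encode_batch is a hypothetical API)
        return self.model.encode_batch(smiles_batch)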
Furthermore, the `PretrainedMolTransformer` supports the use of a caching system. To learn more, see the tutorial on the cache.