Integration with scikit-learn and PyTorch

    %load_ext autoreload
    %autoreload 2

    import torch
    from torch import nn
    import datamol as dm
    import matplotlib.pyplot as plt

All transformers in Molfeat are subclasses of MoleculeTransformer, which in turn implements the BaseFeaturizer interface. The BaseFeaturizer interface ensures that transformers are compatible with both Scikit-Learn and deep learning frameworks such as PyTorch and DGL.
MoleculeTransformer implements the BaseEstimator and TransformerMixin interfaces from Scikit-Learn. This makes it easy to integrate Molfeat featurizers with Scikit-Learn.
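To make that contract concrete, here is a minimal sketch of the methods a Scikit-Learn pipeline expects from a transformer step, written in plain Python with no dependencies. `ToyCharCountFeaturizer` is a hypothetical stand-in and not a Molfeat class: it "featurizes" a SMILES string by counting a few characters, which is only meant to show the shape of the interface.

```python
class ToyCharCountFeaturizer:
    """Hypothetical stand-in for a molecule featurizer (not part of Molfeat)."""

    def __init__(self, alphabet="CNO"):
        self.alphabet = alphabet

    # Estimator side: parameters can be read and updated by name,
    # which is what cloning and hyperparameter search rely on.
    def get_params(self, deep=True):
        return {"alphabet": self.alphabet}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    # Transformer side: fit learns state (none here), transform maps inputs to features.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[smi.count(ch) for ch in self.alphabet] for smi in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


features = ToyCharCountFeaturizer().fit_transform(["CCO", "CN"])
print(features)  # [[2, 0, 1], [1, 1, 0]]
```

Any object exposing these methods can sit in front of a scaler or a model in a pipeline, which is the role the Molfeat transformer plays in the example that follows.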
In the example below, we create a simple Scikit-learn pipeline to predict the solubility of molecules using a random forest regressor.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    from molfeat.trans import MoleculeTransformer


    df = dm.data.freesolv()
    X, y = df["smiles"], df["expt"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The Molfeat transformer seamlessly integrates with a Scikit-Learn Pipeline!
    transf = MoleculeTransformer("desc2d")
    pipe = Pipeline([("feat", transf), ("scaler", StandardScaler()), ("rf", RandomForestRegressor())])


    with dm.without_rdkit_log():
        pipe.fit(X_train, y_train)
        score = pipe.score(X_test, y_test)
        y_pred = pipe.predict(X_test)

    print(score)


    fig, ax = plt.subplots()
    ax.scatter(y_pred, y_test)
    ax.set_xlabel("Prediction")
    ax.set_ylabel("Target");

Molfeat transformers are also compatible with Scikit-Learn's GridSearchCV, so the featurizer's own parameters can be tuned together with the model's.

    from molfeat.trans.fp import FPVecTransformer
    from sklearn.model_selection import GridSearchCV

    # To search over the featurizer, we use a single transformer that combines several calculators.
    feat = FPVecTransformer(kind="rdkit")

    param_grid = dict(
        feat__kind=["fcfp:6", "ecfp:6", "maccs"],
        feat__length=[512, 1024],
        rf__n_estimators=[100, 500],
    )

    pipe = Pipeline([("feat", feat), ("scaler", StandardScaler()), ("rf", RandomForestRegressor(n_estimators=100))])
    grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1)

    with dm.without_rdkit_log():
        grid_search.fit(X_train, y_train)
        score = grid_search.score(X_test, y_test)

    score

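It can help to see what such a grid amounts to before launching the search. The sketch below uses only the standard library to enumerate the candidate settings and to show how a `step__parameter` name is split into a pipeline step and a parameter of that step. The `route` helper is illustrative and is not a Scikit-Learn function; it only mimics the naming convention.

```python
from itertools import product

param_grid = {
    "feat__kind": ["fcfp:6", "ecfp:6", "maccs"],
    "feat__length": [512, 1024],
    "rf__n_estimators": [100, 500],
}

# Every combination of values is one candidate configuration.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
print(len(candidates))  # 12


def route(params):
    """Group 'step__param' entries by pipeline step (illustrative helper)."""
    routed = {}
    for name, value in params.items():
        step, _, param = name.partition("__")
        routed.setdefault(step, {})[param] = value
    return routed


print(route(candidates[0]))
# {'feat': {'kind': 'fcfp:6', 'length': 512}, 'rf': {'n_estimators': 100}}
```

With 12 candidates, each cross-validation fold costs 12 pipeline fits, featurization included, so the size of the grid is worth checking when the featurizer is slow.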
MoleculeTransformer also defines some utilities, such as the __len__() and get_collate_fn() methods, which make it easy to integrate with PyTorch.
In the example below, we use the Molfeat featurizer to size the first layer of a PyTorch model and to create a dataloader.

    # We can easily get the dimension of the vector!
    input_size = len(transf)

    # For example, to define the first layer of a neural network
    model = nn.Linear(input_size, 1)

    # Easily get the associated collate function.
    # This is useful, for example, when training a DGL GNN.
    dataloader = torch.utils.data.DataLoader(X_train, collate_fn=transf.get_collate_fn())

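For readers unfamiliar with collate functions: a PyTorch DataLoader gathers a list of samples and passes that list to `collate_fn`, which returns the batch. The dependency-free sketch below imitates that flow. `make_collate_fn` and `iter_batches` are hypothetical helpers written for this illustration; they are not Molfeat or PyTorch APIs, and what Molfeat's own `get_collate_fn()` returns depends on the transformer.

```python
def make_collate_fn(featurize):
    """Return a function that turns a list of raw samples into one batch."""

    def collate(batch):
        return [featurize(sample) for sample in batch]

    return collate


def iter_batches(data, batch_size, collate_fn):
    """Toy stand-in for a data loader: slice the data, then collate each slice."""
    for start in range(0, len(data), batch_size):
        yield collate_fn(data[start:start + batch_size])


def count_cno(smi):
    # Toy featurizer: counts of C, N and O characters in the SMILES string.
    return [smi.count("C"), smi.count("N"), smi.count("O")]


smiles = ["CCO", "CN", "C", "OCCO", "N"]
batches = list(iter_batches(smiles, batch_size=2, collate_fn=make_collate_fn(count_cno)))
print(len(batches))  # 3
print(batches[0])  # [[2, 0, 1], [1, 1, 0]]
```

Featurizing inside the collate step means raw SMILES strings can be stored in the dataset and converted only when a batch is assembled.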
Featurization for training Neural Networks
Molfeat also includes featurization schemes to convert molecules into a format suited for training neural networks (e.g. tokenized strings or graphs).

    from molfeat.trans.graph import DGLGraphTransformer

    smi = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

    # Featurize the molecule as a DGL graph
    transf = DGLGraphTransformer()
    X = transf([smi])
    type(X)

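For the tokenized-string case mentioned above, the following is a minimal sketch of what turning a SMILES string into integer token ids can look like. It uses a hand-written regular expression and a vocabulary built from a single molecule, both assumptions made for illustration; it is not Molfeat's tokenizer and it does not validate the SMILES.

```python
import re

# Bracket atoms, the two-letter halogens Br and Cl, and two-digit ring closures
# are kept whole; every other character becomes its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)


def encode(tokens, vocab):
    return [vocab[tok] for tok in tokens]


smi = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
tokens = tokenize(smi)

# Toy vocabulary built from this one molecule; a real one would cover a whole dataset.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = encode(tokens, vocab)

print(tokens[:6])  # ['C', 'N', '1', 'C', '=', 'N']
print(tokenize("ClC(Br)[NH4+]"))  # ['Cl', 'C', '(', 'Br', ')', '[NH4+]']
```

The resulting id sequence is the kind of input a sequence model consumes, in the same way that a graph object is the input a graph neural network consumes.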
To learn more about the various graph featurization schemes, please see this tutorial.
You can also explore the following two tutorials about integrating Molfeat to train deep neural networks in PyTorch: