The Data Cache
%load_ext autoreload
%autoreload 2
Caching features¶
molfeat offers a caching system to accelerate molecular featurization. There are two main types of caches in molfeat:
DataCache¶
DataCache is the default, mostly in-memory caching system of molfeat. The underlying cache of DataCache is simply a dictionary. To improve efficiency, DataCache also supports shelve for object persistence. See the relevant documentation to learn more about DataCache.
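For instance, a DataCache can be called directly with a list of molecules and any featurizer, following the same calling convention used later in this tutorial. The snippet below is a minimal sketch: the FPVecTransformer featurizer and the cache name are illustrative assumptions, not part of this tutorial.
# Minimal DataCache sketch (FPVecTransformer is an illustrative featurizer choice)
import datamol as dm
from molfeat.utils.cache import DataCache
from molfeat.trans.fp import FPVecTransformer

mini_cache = DataCache(name="ecfp_cache")      # in-memory, dictionary-backed cache
featurizer = FPVecTransformer(kind="ecfp:4")   # any molfeat featurizer works here
feats = mini_cache(["CCO", "c1ccccc1"], featurizer)  # featurize and cache in one call
dm.to_mol("CCO") in mini_cache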
FileCache¶
FileCache takes a file-based serialization approach to establish the underlying caching system. FileCache supports the pickle, parquet and csv formats. We recommend the parquet file format for its efficiency.
For both FileCache and DataCache, the key used to save and retrieve a molecular representation is datamol.unique_id. Alternatively, you can use the InChIKey, which is less robust (e.g. it does not differentiate tautomers), or define your own molecular hashing function and pass it as input to the cache object, as sketched below.
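Below is a minimal sketch of a custom hashing function based on the InChIKey; the helper name my_inchikey_hash and the cache name are hypothetical and only serve to illustrate passing a callable as mol_hasher.
# Hypothetical custom hashing function mapping a molecule to a stable string key
import datamol as dm
from molfeat.utils.cache import FileCache

def my_inchikey_hash(mol):
    # accept either a SMILES string or an RDKit molecule
    mol = dm.to_mol(mol) if isinstance(mol, str) else mol
    return dm.to_inchikey(mol)

custom_cache = FileCache(
    name="custom_hash_cache",
    cache_file="custom_hash_cache.parquet",
    file_type="parquet",
    mol_hasher=my_inchikey_hash,  # custom hashing function instead of dm.unique_id
)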
import datamol as dm
from molfeat.trans.base import PrecomputedMolTransformer
from molfeat.utils.cache import DataCache, FileCache
from molfeat.trans.pretrained import FCDTransformer
data = dm.data.freesolv().sample(500)
smiles_col = "smiles"
molecules = data[smiles_col].values
targets = data["expt"].values
# Define the cache and the transformer. The transformer can be any type of featurizer
cache = FileCache(
    name="fcd_cache_test",
    cache_file="fcd_cache.parquet",
    file_type="parquet",
    mol_hasher="dm.unique_id",
)
transformer = FCDTransformer()
# pregenerate the features and store them in the cache
_ = cache(molecules, transformer)
# persist the cache to disk
cache.save_to_file(filepath=cache.name)
Cache properties¶
You can check whether a cache contains a molecule or not
# benzene
benzene = dm.to_mol('c1ccccc1')
benzene in cache
True
# paracetamol
paracetamol = dm.to_mol('CC(=O)Nc1ccc(cc1)O')
paracetamol in cache
False
You can fetch the cached features of a molecule from the cache
fps = cache.get(benzene)
fps.shape
(512,)
You can also serialize a cache by converting it to a state dict
cache.to_state_dict()
{'_cache_name': 'FileCache', 'cache_file': 'fcd_cache.parquet', 'name': 'fcd_cache_test', 'n_jobs': None, 'verbose': False, 'file_type': 'parquet', 'clear_on_exit': True, 'parquet_kwargs': {}, 'mol_hasher': {'hash_name': 'dm.unique_id'}}
You can load a new cache from a serialized state dict or from another cache, or even load a cache directly from the cache file.
reload_cache = FileCache.load_from_file(
    "fcd_cache.parquet",
    file_type="parquet",
    mol_hasher="dm.unique_id",
)
len(reload_cache)
609
reload_state_dict_cache = FileCache.from_state_dict(cache.to_state_dict())
len(reload_state_dict_cache)
609
You can copy the content of one cache into another, regardless of the cache type.
# load the pregenerated features from the file-based cache into an in-memory cache
memorycache = DataCache(
    name="fcd_cache_memory",
    n_jobs=-1,
    mol_hasher=dm.unique_id,
    delete_on_exit=True,  # delete anything related to this cache at Python exit
)
memorycache.update(cache)
len(memorycache)
609
Using a cache with a precomputed transformer¶
Some molecular transformers natively support a precompute_cache attribute that can be used to cache featurization results or to load a cache state into a new featurizer.
molfeat also provides a PrecomputedMolTransformer class that makes this process easier and allows you to quickly build a new transformer from an existing cache. As with any MoleculeTransformer, you can serialize the state of a PrecomputedMolTransformer and reload it easily, as sketched below.
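As a minimal sketch, assuming the to_state_dict / from_state_dict methods exposed by MoleculeTransformer round-trip a PrecomputedMolTransformer as well:
# Serialize the precomputed transformer to a state dict and rebuild it later
# (assumes to_state_dict / from_state_dict behave as for other MoleculeTransformer objects)
precomputed = PrecomputedMolTransformer(cache=cache, featurizer=FCDTransformer())
state = precomputed.to_state_dict()
reloaded_precomputed = PrecomputedMolTransformer.from_state_dict(state)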
%%timeit -n 1 -r 3
transformer = PrecomputedMolTransformer(cache=cache, featurizer=FCDTransformer())
transformer(molecules)
291 ms ± 21.2 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
%%timeit -n 1 -r 3
transformer = FCDTransformer()
transformer(molecules)
2.17 s ± 120 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
By computing the features once on your dataset, you can dramatically speed up featurization later.
Even better, the PrecomputedMolTransformer class provides a batch_transform method that can leverage parallel computing with shared memory for further performance gains. The batch_transform method lets you both compute features and cache them in a multiprocessing setting for maximum efficiency. This is especially relevant for featurizers that accept a batch of molecules, since the normal caching system computes features one molecule at a time.
from copy import deepcopy
cache_empty = deepcopy(cache)
# clear the copied cache so it starts empty
cache_empty.clear()
len(cache_empty)
0
%%timeit -n 1 -r 1
transformer = PrecomputedMolTransformer(cache=cache_empty, featurizer=FCDTransformer())
transformer.batch_transform(transformer, molecules, n_jobs=-1, batch_size=50)
39.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# now we have all 500 molecules cached
len(cache_empty)
500
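Since the cache is now populated, you can persist it and rebuild a fast transformer from it later. The sketch below simply reuses the save_to_file, load_from_file and PrecomputedMolTransformer calls shown above; the file name fcd_cache_batch.parquet is an arbitrary choice.
# persist the freshly filled cache so future runs can skip featurization entirely
cache_empty.save_to_file(filepath="fcd_cache_batch.parquet")

# later: rebuild a precomputed transformer directly from the saved file
reloaded_batch_cache = FileCache.load_from_file(
    "fcd_cache_batch.parquet",
    file_type="parquet",
    mol_hasher="dm.unique_id",
)
fast_transformer = PrecomputedMolTransformer(cache=reloaded_batch_cache, featurizer=FCDTransformer())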