The Data Cache
%load_ext autoreload
%autoreload 2
Caching features¶
molfeat offers a caching system to accelerate molecular featurization. There are two main types of caches in molfeat:
DataCache¶
DataCache is the default, mostly in-memory caching system of molfeat. The underlying cache of DataCache is simply a dictionary. To improve efficiency, DataCache also supports shelve for object persistence. See the relevant documentation to learn more about DataCache.
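For instance, a DataCache can be called directly with a list of molecules and any featurizer, following the same calling convention used later in this tutorial. The snippet below is a minimal sketch: the FPVecTransformer featurizer and the cache name are illustrative assumptions, not part of this tutorial.
# Minimal DataCache sketch (FPVecTransformer is an illustrative featurizer choice)
import datamol as dm
from molfeat.utils.cache import DataCache
from molfeat.trans.fp import FPVecTransformer

mini_cache = DataCache(name="ecfp_cache")      # in-memory, dictionary-backed cache
featurizer = FPVecTransformer(kind="ecfp:4")   # any molfeat featurizer works here
feats = mini_cache(["CCO", "c1ccccc1"], featurizer)  # featurize and cache in one call
dm.to_mol("CCO") in mini_cache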
FileCache¶
FileCache takes a file-based serialization approach to establish the underlying caching system. FileCache supports the pickle, parquet and csv formats. We recommend the parquet file format for its efficiency.
For both FileCache and DataCache, the key used to save and retrieve a molecular representation is datamol.unique_id. Alternatively, you can use the InChIKey, which is less robust (e.g. it does not differentiate tautomers), or define your own molecular hashing function and pass it as input to the cache object, as sketched below.
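Below is a minimal sketch of a custom hashing function based on the InChIKey; the helper name my_inchikey_hash and the cache name are hypothetical and only serve to illustrate passing a callable as mol_hasher.
# Hypothetical custom hashing function mapping a molecule to a stable string key
import datamol as dm
from molfeat.utils.cache import FileCache

def my_inchikey_hash(mol):
    # accept either a SMILES string or an RDKit molecule
    mol = dm.to_mol(mol) if isinstance(mol, str) else mol
    return dm.to_inchikey(mol)

custom_cache = FileCache(
    name="custom_hash_cache",
    cache_file="custom_hash_cache.parquet",
    file_type="parquet",
    mol_hasher=my_inchikey_hash,  # custom hashing function instead of dm.unique_id
)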
import datamol as dm
from molfeat.trans.base import PrecomputedMolTransformer
from molfeat.utils.cache import DataCache, FileCache
from molfeat.trans.pretrained import FCDTransformer
data = dm.data.freesolv().sample(500)
smiles_col = "smiles"
molecules = data[smiles_col].values
targets = data["expt"].values
# Define the cache and the transformer. The transformer can be any type of featurizer
cache = FileCache(
    name="fcd_cache_test",
    cache_file="fcd_cache.parquet",
    file_type="parquet",
    mol_hasher="dm.unique_id",
)
transformer = FCDTransformer()
# pregenerate the features and store them in the cache
_ = cache(molecules, transformer)
# persist the cache to disk
cache.save_to_file(filepath=cache.name)
Cache properties¶
You can check whether a cache contains a molecule or not
# benzene
benzene = dm.to_mol('c1ccccc1')
benzene in cache
True
# paracetamol
paracetamol = dm.to_mol('CC(=O)Nc1ccc(cc1)O')
paracetamol in cache
False
You can fetch the cached features of a molecule from the cache
fps = cache.get(benzene)
fps.shape
(512,)
You can also serialize a cache by converting it to a state dict
cache.to_state_dict()
{'_cache_name': 'FileCache', 'cache_file': 'fcd_cache.parquet', 'name': 'fcd_cache_test', 'n_jobs': None, 'verbose': False, 'file_type': 'parquet', 'clear_on_exit': True, 'parquet_kwargs': {}, 'mol_hasher': {'hash_name': 'dm.unique_id'}}
You can load a new cache from a serialized state dict or from another cache, or even load a cache directly from the cache file.
reload_cache = FileCache.load_from_file(
    "fcd_cache.parquet",
    file_type="parquet",
    mol_hasher="dm.unique_id",
)
len(reload_cache)
609
reload_state_dict_cache = FileCache.from_state_dict(cache.to_state_dict())
len(reload_state_dict_cache)
609
You can copy the content of one cache into another, regardless of the cache type.
# load the pregenerated features from the file-based cache into an in-memory cache
memorycache = DataCache(
    name="fcd_cache_memory",
    n_jobs=-1,
    mol_hasher=dm.unique_id,
    delete_on_exit=True,  # delete anything related to this cache at Python exit
)
memorycache.update(cache)
len(memorycache)
609
Using a cache with a precomputed transformer¶
Some molecular transformers natively support a precompute_cache attribute that can be used to cache featurization results or to load a cache state into a new featurizer.
molfeat also provides a PrecomputedMolTransformer class that makes this process easier and allows you to quickly build a new transformer from an existing cache. As with any MoleculeTransformer, you can serialize the state of a PrecomputedMolTransformer and reload it easily, as sketched below.
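As a minimal sketch, assuming the to_state_dict / from_state_dict methods exposed by MoleculeTransformer round-trip a PrecomputedMolTransformer as well:
# Serialize the precomputed transformer to a state dict and rebuild it later
# (assumes to_state_dict / from_state_dict behave as for other MoleculeTransformer objects)
precomputed = PrecomputedMolTransformer(cache=cache, featurizer=FCDTransformer())
state = precomputed.to_state_dict()
reloaded_precomputed = PrecomputedMolTransformer.from_state_dict(state)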
%%timeit -n 1 -r 3
transformer = PrecomputedMolTransformer(cache=cache, featurizer=FCDTransformer())
transformer(molecules)
291 ms ± 21.2 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
%%timeit -n 1 -r 3
transformer = FCDTransformer()
transformer(molecules)
2.17 s ± 120 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
By computing the features once on your dataset, you can dramatically speed up featurization later.
Even better, the PrecomputedMolTransformer class provides a batch_transform method that can leverage parallel computing with shared memory for further performance gains. The batch_transform method lets you both compute features and cache them in a multiprocessing setting for maximum efficiency. This is especially relevant for featurizers that accept a batch of molecules, since the normal caching system computes features one molecule at a time.
from copy import deepcopy
cache_empty = deepcopy(cache)
# clear the copied cache so it starts empty
cache_empty.clear()
len(cache_empty)
0
%%timeit -n 1 -r 1
transformer = PrecomputedMolTransformer(cache=cache_empty, featurizer=FCDTransformer())
transformer.batch_transform(transformer, molecules, n_jobs=-1, batch_size=50)
39.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# now we have all 500 molecules cached
len(cache_empty)
500
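Since the cache is now populated, you can persist it and rebuild a fast transformer from it later. The sketch below simply reuses the save_to_file, load_from_file and PrecomputedMolTransformer calls shown above; the file name fcd_cache_batch.parquet is an arbitrary choice.
# persist the freshly filled cache so future runs can skip featurization entirely
cache_empty.save_to_file(filepath="fcd_cache_batch.parquet")

# later: rebuild a precomputed transformer directly from the saved file
reloaded_batch_cache = FileCache.load_from_file(
    "fcd_cache_batch.parquet",
    file_type="parquet",
    mol_hasher="dm.unique_id",
)
fast_transformer = PrecomputedMolTransformer(cache=reloaded_batch_cache, featurizer=FCDTransformer())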