molfeat.trans.struct
ESM
ESMProteinFingerprint
Bases: MoleculeTransformer
ESM (Evolutionary Scale Modeling) protein representation embedding. ESM is a transformer protein language model introduced by Facebook FAIR in Rives et al., 2019: 'Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences'
Source code in molfeat/trans/struct/esm.py
n_layers
property
Number of layers used in the current embeddings
__call__(seqs, names=None, ignore_errors=False, enforce_dtype=True, **kwargs)
Compute the representation of a list of protein sequences. If ignore_errors is True, both the list of features and the list of valid ids are returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seqs | List[str] | list of protein sequences, given as amino acid strings | required |
names | Optional[List[str]] | protein names | None |
enforce_dtype | bool | whether to enforce the instance dtype in the generated fingerprint | True |
ignore_errors | bool | whether to ignore errors during featurization instead of raising an error | False |
kwargs | | named parameters for the transform method | {} |
Returns:
Name | Type | Description |
---|---|---|
feats | | list of valid embeddings |
ids | | all valid positions that did not fail during featurization. Only returned when ignore_errors is True. |
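The return convention described above can be illustrated with a small, library-independent sketch. `featurize_with_ids` and `toy_featurizer` are hypothetical names introduced here for illustration only; they are not part of molfeat, and the toy featurizer simply counts three amino acid letters instead of running a language model.

```python
from typing import Callable, List, Sequence, Tuple, Union

import numpy as np


def featurize_with_ids(
    seqs: Sequence[str],
    featurize_one: Callable[[str], np.ndarray],
    ignore_errors: bool = False,
) -> Union[List[np.ndarray], Tuple[List[np.ndarray], List[int]]]:
    """Hypothetical helper that mimics the documented return convention."""
    feats: List[np.ndarray] = []
    ids: List[int] = []
    for i, seq in enumerate(seqs):
        try:
            feats.append(featurize_one(seq))
            ids.append(i)  # remember which input positions succeeded
        except Exception:
            if not ignore_errors:
                raise
    if ignore_errors:
        # Features for the valid inputs, plus their positions in `seqs`.
        return feats, ids
    return feats


def toy_featurizer(seq: str) -> np.ndarray:
    """Toy stand-in for a real model: counts of the letters A, C and G."""
    if not seq.isalpha():
        raise ValueError(f"invalid sequence: {seq!r}")
    return np.array([seq.count(c) for c in "ACG"], dtype=np.float32)


# The second sequence contains a digit and fails, so only positions 0 and 2 are kept.
feats, ids = featurize_with_ids(["ACGA", "A1C", "GG"], toy_featurizer, ignore_errors=True)
print(ids)
```

With `ignore_errors=False` the same call would raise on the malformed sequence, which is why the ids list is only meaningful in the tolerant mode.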
Source code in molfeat/trans/struct/esm.py
__init__(featurizer='esm1b_t33_650M_UR50S', loader_repo_or_dir='facebookresearch/esm:main', device=None, layers=None, pooling='mean', dtype=None, contact=False, **kwargs)
Constructor for the ESM protein representation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer | str | Name of the ESM model to use. Defaults to "esm1b_t33_650M_UR50S". | 'esm1b_t33_650M_UR50S' |
loader_repo_or_dir | str | Path to a local directory containing the model, or to a GitHub repo. Defaults to "facebookresearch/esm:main". | 'facebookresearch/esm:main' |
device | Optional[str] | Torch device to move the model to. Defaults to None. | None |
layers | List[int] | Layers to extract representations from. Defaults to None, which uses the last layer. | None |
pooling | str | Pooling method to use for sequence embedding. Defaults to "mean". If pooling is set to None, token representations are returned (excluding BOS). | 'mean' |
dtype | Callable | Representation output datatype. Defaults to None. | None |
contact | bool | Whether to return the predicted attention contacts instead of the representation. Defaults to False. | False |
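To make the `pooling` option concrete, here is a minimal NumPy sketch of mean pooling over residue tokens. It assumes that the per-token output for one sequence has a BOS token at position 0 and an EOS token at the end; `pool_tokens` is a hypothetical helper written for this illustration and is not the molfeat implementation.

```python
from typing import Optional

import numpy as np


def pool_tokens(token_repr: np.ndarray, seq_len: int, pooling: Optional[str] = "mean") -> np.ndarray:
    """Pool the per-token representations of a single sequence.

    token_repr has shape (seq_len + 2, feat_dim); position 0 is assumed to be
    BOS and position seq_len + 1 is assumed to be EOS.
    """
    residues = token_repr[1 : seq_len + 1]  # keep residue tokens only
    if pooling is None:
        return residues  # token-level representation, shape (seq_len, feat_dim)
    if pooling == "mean":
        return residues.mean(axis=0)  # sequence-level representation, shape (feat_dim,)
    raise ValueError(f"unsupported pooling in this sketch: {pooling!r}")


seq_len, feat_dim = 5, 4
# Deterministic fake model output: 7 token rows (BOS + 5 residues + EOS).
token_repr = np.arange((seq_len + 2) * feat_dim, dtype=np.float32).reshape(seq_len + 2, feat_dim)

seq_emb = pool_tokens(token_repr, seq_len, pooling="mean")
tok_emb = pool_tokens(token_repr, seq_len, pooling=None)
print(seq_emb.shape, tok_emb.shape)
```

Mean pooling gives one fixed-size vector per protein regardless of its length, while `pooling=None` keeps one vector per residue, which matters for residue-level tasks.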
Source code in molfeat/trans/struct/esm.py
__len__()
Get featurizer length
Source code in molfeat/trans/struct/esm.py
transform(seqs, names=None, **kwargs)
Transform a list of protein sequences into feature vectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seqs | List[str] | list of protein sequences, given as amino acid strings | required |
names | Optional[List[str]] | protein names | None |
Returns:
Type | Description |
---|---|
Embedding of size (N_SEQS, SEQ_LEN, FEAT_DIM * N_LAYERS) for token embeddings and (N_SEQS, FEAT_DIM * N_LAYERS) for sequence embeddings. Use |
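The documented output shapes can be reproduced with plain NumPy. The sketch below assumes that the representations of the selected layers are concatenated along the feature axis (which is what the `FEAT_DIM * N_LAYERS` factor suggests) and that all sequences have the same length; the layer indices and dimensions are arbitrary values chosen for illustration.

```python
import numpy as np

n_seqs, seq_len, feat_dim = 2, 6, 8
layers = [32, 33]  # hypothetical layer indices

rng = np.random.default_rng(0)
# One fake (N_SEQS, SEQ_LEN, FEAT_DIM) tensor per selected layer.
per_layer = {layer: rng.normal(size=(n_seqs, seq_len, feat_dim)) for layer in layers}

# Token embeddings: concatenate the layers along the last (feature) axis.
token_emb = np.concatenate([per_layer[layer] for layer in layers], axis=-1)

# Sequence embeddings: average over the token axis.
seq_emb = token_emb.mean(axis=1)

print(token_emb.shape)  # (N_SEQS, SEQ_LEN, FEAT_DIM * N_LAYERS)
print(seq_emb.shape)    # (N_SEQS, FEAT_DIM * N_LAYERS)
```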
Source code in molfeat/trans/struct/esm.py
Bio Embeddings
ProtBioFingerprint
Bases: MoleculeTransformer
Wrapper for general-purpose biological sequence representations, as provided by bio_embeddings.
For a list of available embeddings, see: https://docs.bioembeddings.com/v0.2.2/api/bio_embeddings.embed.html
Note: the embeddings proposed here are the general-purpose embeddings, meaning that task-specific embeddings offered by bio_embeddings (e.g. PBTucker, DeepBlast) are not included.
According to the bio_embeddings documentation, `prottrans_bert_bfd` and `seqvec` are the best embeddings.
Source code in molfeat/trans/struct/prot1D.py
n_layers
property
Get the number of layers used in this embedding
__call__(seqs, ignore_errors=False, enforce_dtype=True, **kwargs)
Compute the representation of a list of protein sequences. If ignore_errors is True, both the list of features and the list of valid ids are returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seqs | List[str] | list of protein or nucleotide sequences, given as strings | required |
enforce_dtype | bool | whether to enforce the instance dtype in the generated fingerprint | True |
ignore_errors | bool | whether to ignore errors during featurization instead of raising an error | False |
kwargs | | named parameters for the transform method | {} |
Returns:
Name | Type | Description |
---|---|---|
feats | | list of valid embeddings |
ids | | all valid positions that did not fail during featurization. Only returned when ignore_errors is True. |
Source code in molfeat/trans/struct/prot1D.py
__init__(featurizer='seqvec', pooling='mean', dtype=np.float32, device=None, layer_pooling='sum', **kwargs)
Constructor for Deep Learning based Protein representation. SeqVec featurizer will e
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer | Union[str, Callable] | Name or callable of the embedding model | 'seqvec' |
pooling | str | Pooling method to use for sequence embedding. Defaults to "mean". If pooling is set to None, token representations are returned. | 'mean' |
dtype | Callable | Representation output datatype. Defaults to np.float32. | np.float32 |
device | Optional[Union[device, str]] | Torch device to move the model to. Defaults to None. | None |
layer_pooling | str | Layer-wise pooling method to use when more than one layer exists. Defaults to 'sum'. If None, the last layer is taken. This is relevant for | 'sum' |
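The interaction between `layer_pooling` and `pooling` can be sketched in NumPy as follows. This assumes a model that returns one `(seq_len, feat_dim)` matrix per layer, stacked into shape `(n_layers, seq_len, feat_dim)`; `pool_layers` is a hypothetical helper used only for illustration, not the library's code.

```python
from typing import Optional

import numpy as np


def pool_layers(layer_repr: np.ndarray, layer_pooling: Optional[str] = "sum") -> np.ndarray:
    """Reduce a (n_layers, seq_len, feat_dim) array to (seq_len, feat_dim)."""
    if layer_pooling is None:
        return layer_repr[-1]  # keep only the last layer
    if layer_pooling == "sum":
        return layer_repr.sum(axis=0)  # element-wise sum across layers
    raise ValueError(f"unsupported layer pooling in this sketch: {layer_pooling!r}")


n_layers, seq_len, feat_dim = 3, 5, 4
# Fake output where every value in layer k equals k + 1 (1, 2, 3).
layer_repr = np.ones((n_layers, seq_len, feat_dim)) * np.arange(1, n_layers + 1).reshape(n_layers, 1, 1)

summed = pool_layers(layer_repr, layer_pooling="sum")   # every entry is 1 + 2 + 3
last = pool_layers(layer_repr, layer_pooling=None)      # every entry comes from the last layer
seq_emb = summed.mean(axis=0)                           # token-level "mean" pooling afterwards
print(summed.shape, last.shape, seq_emb.shape)
```

Layer pooling runs before token pooling in this sketch, so the final sequence embedding has `feat_dim` entries regardless of how many layers the model exposes.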
Source code in molfeat/trans/struct/prot1D.py
__len__()
Get featurizer length
Source code in molfeat/trans/struct/prot1D.py
transform(seqs, names=None, **kwargs)
Transform a list of protein/nucleotide sequences into feature vectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seqs | List[str] | list of protein/nucleotide sequences, given as strings | required |
names | Optional[List[str]] | names of the macromolecules. Will be ignored. | None |
kwargs | | additional arguments for the featurizer | {} |
Returns:
Type | Description |
---|---|
Embedding of size (N_SEQS, FEAT_DIM) for token embeddings and (FEAT_DIM, N_LAYERS) for sequence embeddings |
Source code in molfeat/trans/struct/prot1D.py