Usage Overview
- Provide a list of molecules as InChI strings.
- Select and configure feature sets using a simple YAML configuration file.
- Use ChemFeat to generate a pandas DataFrame or CSV file with the calculated features.
Notable Features
- Already supports ~70 feature sets thanks to RDKit and PaDEL-Descriptor.
- Modular design facilitates addition of new feature sets.
- Calculated features are cached in a database to avoid redundant calculations when rerunning code.
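The caching idea can be illustrated with a minimal sketch. This is not ChemFeat's actual database schema or code: the table layout, the `cached_features` helper, and the `length_stub` calculator are all hypothetical, chosen only to show how keying results by (InChI, feature set) avoids recomputation.

```python
import json
import sqlite3


def cached_features(conn, inchi, feature_set, compute):
    """Return features for (inchi, feature_set), computing only on a cache miss."""
    # Hypothetical schema: one JSON payload per (InChI, feature set) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features ("
        "inchi TEXT, feature_set TEXT, payload TEXT, "
        "PRIMARY KEY (inchi, feature_set))"
    )
    row = conn.execute(
        "SELECT payload FROM features WHERE inchi = ? AND feature_set = ?",
        (inchi, feature_set),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: nothing is recalculated
    values = compute(inchi)
    conn.execute(
        "INSERT INTO features VALUES (?, ?, ?)",
        (inchi, feature_set, json.dumps(values)),
    )
    conn.commit()
    return values


calls = []  # records how often the calculator actually runs


def length_stub(inchi):
    """Hypothetical stand-in for an expensive descriptor calculation."""
    calls.append(inchi)
    return {"n_chars": len(inchi)}


conn = sqlite3.connect(":memory:")
inchi = "InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)"
first = cached_features(conn, inchi, "stub", length_stub)
second = cached_features(conn, inchi, "stub", length_stub)
```

The second call is served from the database, so the calculator runs only once. Pointing several scripts at the same database file would let them share results in the same way.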
Command-Line Tool
Example
# features.yaml
# QED feature calculator.
- name: qed
# RDK descriptor feature calculator.
- name: rdkdesc
# inchis.csv
InChi,name
"InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)","paracetamol"
"InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)","ibuprofen"
$ chemfeat calculate features.yaml inchis.csv features.csv
# features.csv
InChi,qed__ALERTS,qed__ALOGP,qed__AROM,qed__HBA,qed__HBD,qed__MW,qed__PSA,qed__ROTB,rdkdesc__FpDensityMorgan1,rdkdesc__FpDensityMorgan2,rdkdesc__FpDensityMorgan3,rdkdesc__MaxAbsPartialCharge,rdkdesc__MaxPartialCharge,rdkdesc__MinAbsPartialCharge,rdkdesc__MinPartialCharge,rdkdesc__NumRadicalElectrons,rdkdesc__NumValenceElectrons
"InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)",2,2.0000999999999998,1,2,2,151.16500000000002,52.82000000000001,1,1.2727272727272727,1.8181818181818181,2.272727272727273,0.5079642937129114,0.18214293782620056,0.18214293782620056,-0.5079642937129114,0,58
"InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)",0,3.073200000000001,1,2,1,206.28499999999997,37.3,4,1.2,1.7333333333333334,2.1333333333333333,0.4807885019257389,0.3101853515323108,0.3101853515323108,-0.4807885019257389,0,82
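In the output above, each column name joins the feature set name and the feature name with a double underscore (for example `qed__ALOGP`). As a small sketch using only the standard library, the columns can be grouped by feature set. The `group_columns` helper is hypothetical and not part of ChemFeat, and the header below is abbreviated from the example.

```python
from collections import defaultdict


def group_columns(header):
    """Group 'set__feature' column names by their feature set prefix."""
    groups = defaultdict(list)
    for column in header:
        prefix, sep, feature = column.partition("__")
        if sep:  # skip columns without a prefix, such as the InChi key column
            groups[prefix].append(feature)
    return dict(groups)


# Abbreviated header taken from the features.csv example.
header = ["InChi", "qed__ALERTS", "qed__ALOGP", "rdkdesc__FpDensityMorgan1"]
groups = group_columns(header)
print(groups)
```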
Python API
from chemfeat.database import FeatureDatabase
from chemfeat.features.manager import FeatureManager
# A list of feature specifications, either loaded from a user-provided
# YAML file or set programmatically.
feat_specs = [
{'name': 'qed'},
{'name': 'rdkdesc'},
{'name': 'rdkfp', 'size': 2048}
]
# An iterable of InChi strings, such as a column from a loaded CSV file.
inchis = [
"InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)",
"InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)"
]
# Create the database object. This can point to a common database to centralize
# caching of calculated features.
feat_db = FeatureDatabase('features.sqlite')
# Create the feature manager object.
feat_man = FeatureManager(feat_db, feat_specs)
# Calculate the features and retrieve them as a Pandas dataframe.
feat_dataframe = feat_man.calculate_features(inchis, return_dataframe=True)
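The `inchis` iterable can come from a CSV file such as the `inchis.csv` shown in the command-line example. A minimal sketch with the standard `csv` module, assuming the column is named `InChi` as in that example. The `load_inchis` helper is hypothetical and not part of ChemFeat.

```python
import csv
import io


def load_inchis(handle, column="InChi"):
    """Read one column of InChI strings from an open CSV file handle."""
    return [row[column] for row in csv.DictReader(handle)]


# In-memory stand-in for open('inchis.csv', newline='').
sample = io.StringIO(
    'InChi,name\n'
    '"InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)","paracetamol"\n'
)
inchis = load_inchis(sample)
```

The resulting list can then be passed to `calculate_features` in place of the hard-coded list.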
ChemFeat Conclusion
- Already supports a wide range of feature sets.
- Easy to add new feature sets.
- Simple command-line tool and Python API.
- Available on PyPI.
MolPred Overview
MolPred trains machine-learning and deep-learning models using feature vectors generated by ChemFeat.
Highlights
- All the benefits of Hydronaut (parameterization via YAML configuration files, automatic hyperparameter sweeping and optimization, systematic result tracking with MLflow).
- All the benefits of ChemFeat (a large number of feature sets, easily extensible with more).
- Automatic visualization of numeric and categorical features.
- Easy to define user models and scorers.
Basic Usage
- Define a model as a subclass of ModelBase.
- Modify the configuration file template.
- Provide a CSV file with InChIs, prediction targets and optional additional features.
- Run hydronaut-run.
MolPred Conclusion
- Relatively simple to adapt a model to the framework.
- Everything can be configured with YAML (model, feature sets, various metrics).
- Easy to explore & optimize hyperparameters thanks to Hydronaut (test different models, feature sets, etc.).
- Automatic visualization of features.
- All runs are tracked (parameters, metrics, models, artifacts, plots, code version, etc.).
- Retrieve models via MLflow for testing and prediction.