ChemicalDice API & Module Reference¶
This section provides a comprehensive reference for the core modules, CLI commands, and Python API functions available in the ChemicalDice Integrator (CDI) package.
1. Descriptors Module (ChemicalDice.descriptors)¶
This module handles the generation and processing of molecular descriptors from raw SMILES strings.
1.1 calculate_descriptors¶
Calculates diverse molecular descriptors (Quantum, Graph, Image, LLM, Bioactivity, Physicochemical) from an input CSV.
Python API:

```python
from ChemicalDice.descriptors.calculate import calculate_descriptors

calculate_descriptors(input_file="smiles.csv", output_dir="data", descriptors=["mordred", "mopac"])
```
Arguments:

- `input_file` (str): Path to an input CSV file containing a `SMILES` column.
- `output_dir` (str, default: `"Chemicaldice_data"`): Directory to save output CSVs.
- `descriptors` (list, default: `["all"]`): Descriptors to compute. Options: `["mopac", "grover", "imagemol", "chemberta", "signaturizer", "mordred", "all"]`.
Output: One `.csv` file per descriptor, saved in `output_dir`.
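The input file is a plain CSV with a `SMILES` column. A minimal sketch of constructing and validating one (the `id` column and the helper code are illustrative assumptions, not part of the documented schema):

```python
import csv
import os
import tempfile

# Write a minimal input CSV with the documented `SMILES` column.
# The extra `id` column is an illustrative assumption only.
rows = [
    {"id": "mol_001", "SMILES": "CCO"},        # ethanol
    {"id": "mol_002", "SMILES": "c1ccccc1"},   # benzene
    {"id": "mol_003", "SMILES": "CC(=O)O"},    # acetic acid
]

path = os.path.join(tempfile.mkdtemp(), "smiles.csv")
with open(path, "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id", "SMILES"])
    writer.writeheader()
    writer.writerows(rows)

# Re-read to confirm the file is well formed before passing it on.
with open(path) as fh:
    loaded = list(csv.DictReader(fh))
print(len(loaded), loaded[0]["SMILES"])
```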
1.2 convert_to_hdf5¶
Merges multiple descriptor CSVs, filters for common IDs, imputes missing values (KNN), and saves as chunked HDF5 files for fast I/O during training.
Python API:

```python
from ChemicalDice.descriptors.convert import convert_to_hdf5

convert_to_hdf5(input_dir="data", output_dir="data_h5", chunk_size=10000, knn_neighbors=5)
```
Arguments:

- `input_dir` (str, default: `"Chemicaldice_data"`): Directory containing descriptor CSVs.
- `output_dir` (str, default: `"Chemicaldice_data"`): Directory to save H5 files.
- `chunk_size` (int, default: `10000`): Chunk size for pandas/HDF5 I/O processing.
- `knn_neighbors` (int, default: `5`): Number of neighbors used for Mordred missing-value imputation.
Output: Individual `.h5` files optimized for multi-modal training.
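The KNN imputation step can be illustrated with a pure-Python sketch. This is illustrative only: the function name `knn_impute` is hypothetical, and `convert_to_hdf5` may use a different implementation internally (e.g. scikit-learn's `KNNImputer`).

```python
import math

def knn_impute(rows, k=5):
    """Fill None entries with the mean of that column across the k
    nearest complete rows (Euclidean distance over shared columns)."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum((x - y) ** 2 for x, y in shared)) if shared else float("inf")

    filled = []
    for i, row in enumerate(rows):
        if None not in row:
            filled.append(list(row))
            continue
        # Rank the other rows by distance and keep the k closest.
        neighbours = sorted(
            (r for j, r in enumerate(rows) if j != i),
            key=lambda r: dist(row, r),
        )[:k]
        new_row = list(row)
        for c, val in enumerate(new_row):
            if val is None:
                vals = [n[c] for n in neighbours if n[c] is not None]
                new_row[c] = sum(vals) / len(vals) if vals else 0.0
        filled.append(new_row)
    return filled

data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [0.9, 2.1, 2.9],
    [5.0, 6.0, 7.0],
]
imputed = knn_impute(data, k=2)
print(imputed[1])  # missing value replaced by the 2-NN column mean
```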
2. Training Module (ChemicalDice.training)¶
This module contains the deep learning architectures for both unsupervised integration (CDI-Basic) and sequence-to-embedding prediction (CDI-Generalised).
2.1 train_basic_cdi (CDI-Basic)¶
Trains the unsupervised, multi-modal autoencoder on precomputed descriptor HDF5 files.
Python API:

```python
from ChemicalDice.training.basic_model import train_basic_cdi

model = train_basic_cdi(h5_file_paths=["data/mordred.h5", "data/mopac.h5"], num_epochs=50)
```
Arguments:

- `h5_file_paths` (list): List of paths to descriptor `.h5` files.
- `num_epochs` (int, default: `50`): Total training epochs.
- `batch_size` (int, default: `32`): Batch size for the DataLoader.
- `learning_rate` (float, default: `0.001`): Initial learning rate for the Adam optimizer.
Output: A trained PyTorch model that projects each input modality into the unified 8192-D CDI space.
2.2 train_generalised_cdi (CDI-Generalised)¶
Trains the Mamba State-Space Model to predict 8192-D CDI embeddings directly from SMILES sequences.
Python API:

```python
from ChemicalDice.training.gen_model import train_generalised_cdi

model = train_generalised_cdi(csv_path="smiles.csv", target_h5_file="targets.h5")
```
Arguments:

- `csv_path` (str): Path to a CSV containing training SMILES.
- `target_h5_file` (str): Path to the HDF5 file containing the ground-truth 8192-D embeddings generated by CDI-Basic.
- `mamba_model_dir` (str): Directory containing pre-trained Mamba weights.
Output: A fine-tuned Mamba model capable of generating inference-time CDI embeddings directly from SMILES.
3. Experiments Module (ChemicalDice.experiments)¶
This module houses automated benchmarking pipelines for robust evaluation.
3.1 run_ood_analysis (OOD)¶
Executes Out-of-Distribution analysis using Random vs. Scaffold splitting strategies to test generalization.
Python API:

```python
from ChemicalDice.experiments.ood_analysis import run_ood_analysis

run_ood_analysis(label_csv="PCBA.csv", embedding_files=["PCBA_CDI.parquet"], results_dir="./results")
```
Arguments:

- `label_csv` (str): Path to the target labels CSV.
- `embedding_files` (list): Paths to descriptor/embedding Parquet files.
- `results_dir` (str): Output directory.
- `resampling_strategy` (str, default: `"AUTO"`): How to handle class imbalance (`AUTO`, `DOWNSAMPLING`).
Output: `_random_metrics.csv` and `_scaffold_metrics.csv` reports containing ROC-AUC, Balanced Accuracy, F1, and related metrics.
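The scaffold-splitting strategy can be sketched in pure Python. This is an illustrative sketch, not the package's implementation: `scaffolds` is assumed to be a precomputed scaffold string per molecule (e.g. Murcko scaffolds computed with RDKit), and the scaffold names below are placeholders.

```python
import random
from collections import defaultdict

def scaffold_split(scaffolds, test_frac=0.2, seed=0):
    """Assign whole scaffold groups to the test set until it holds
    roughly `test_frac` of the data, so no scaffold leaks across
    the train/test boundary."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)

    rng = random.Random(seed)
    ordered = list(groups.values())
    rng.shuffle(ordered)

    n_test = int(len(scaffolds) * test_frac)
    test, train = [], []
    for group in ordered:
        # Each group goes entirely to one side of the split.
        (test if len(test) < n_test else train).extend(group)
    return train, test

scaffolds = ["benzene", "benzene", "pyridine", "pyridine", "furan",
             "furan", "indole", "indole", "indole", "pyrrole"]
train, test = scaffold_split(scaffolds, test_frac=0.2)
train_scafs = {scaffolds[i] for i in train}
test_scafs = {scaffolds[i] for i in test}
print(sorted(train_scafs & test_scafs))  # the two scaffold sets are disjoint
```

A random split, by contrast, samples indices directly, so the same scaffold can appear on both sides; that contrast is what the OOD analysis measures.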
3.2 run_ldc_analysis (LDC)¶
Executes Low Data Condition analysis across varying fractions of the training set (e.g., 10%, 25%).
Python API:

```python
from ChemicalDice.experiments.ldc_analysis import run_ldc_analysis

run_ldc_analysis(dataset_name="herg_karim", emb_dir="./", label_dir="./", output="results")
```
Arguments:

- `dataset_name` (str): Base name of the dataset.
- `emb_dir` (str, default: `"."`): Directory containing embedding `.parquet` files.
- `label_dir` (str, default: `"."`): Directory containing target label CSVs.
- `output` (str, default: `"results_ldc"`): Output directory.
- `fractions` (list, default: `[0.10, 0.25, 0.50, 0.75]`): List of data fractions to train on.
Output: `_ldc_results.csv` detailing the performance of multiple SOTA classifiers at restricted dataset scales.
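The fractional subsampling behind a Low Data Condition run can be sketched as follows. This is illustrative only; the package's own sampling (e.g. any stratification by label) may differ, and `ldc_subsets` is a hypothetical helper name.

```python
import random

def ldc_subsets(n_samples, fractions, seed=42):
    """For each fraction, take a fixed-seed random subset of
    training indices."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Prefix slices give nested subsets: the 10% set is contained
    # in the 25% set, which is contained in the 50% set, and so on.
    return {f: sorted(indices[: max(1, int(n_samples * f))]) for f in fractions}

subsets = ldc_subsets(1000, [0.10, 0.25, 0.50, 0.75])
print({f: len(ix) for f, ix in subsets.items()})
```

Using a fixed seed and nested subsets makes the per-fraction results directly comparable: each larger fraction strictly extends the smaller one.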
4. SOTA Pipeline Module (ChemicalDice.sota_pipeline)¶
This module provides a robust benchmarking framework for evaluating molecular embedding performance across multiple SOTA classifiers.
4.1 Evaluate.benchmark¶
Executes a 5-Fold Stratified Cross-Validation suite across 3 seeds, incorporating automated class balancing and metric calculation.
Python API:

```python
from ChemicalDice.sota_pipeline.Evaluate import benchmark, make_default_config

cfg = make_default_config()
cfg["parquet_dir"] = "./embeddings"
cfg["label_dir"] = "./labels"

# Run the benchmark
benchmark(cfg)
```
CLI Execution: `cdi benchmark --stages all`
Arguments (Config/CLI):

- `stages` (list): Pipeline stages to execute (`inspect`, `label`, `extract`, `analysis`, `classify`, `plots`, `all`).
- `dataset_dir`, `label_dir`, `parquet_dir`, `results_dir` (str): Paths for data and output handling.
- `descriptors` (list): Descriptors to evaluate.
- `models` (list): Classifiers to benchmark (e.g., `XGBoost`, `RandomForest`).
Output: Robust evaluation metrics, checkpoint files, and publication-ready plots.
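A config can also be assembled by hand rather than starting from `make_default_config()`. The sketch below uses only the key names documented above; all values are illustrative, and the `required` sanity check is my own convention, not part of the package.

```python
# Illustrative config dict mirroring the documented keys; the actual
# defaults returned by make_default_config() may differ.
cfg = {
    "stages": ["all"],                 # or a subset, e.g. ["inspect", "classify"]
    "dataset_dir": "./datasets",
    "label_dir": "./labels",
    "parquet_dir": "./embeddings",
    "results_dir": "./results",
    "descriptors": ["mordred", "chemberta"],
    "models": ["XGBoost", "RandomForest"],
}

# Minimal sanity check before handing the config to benchmark(cfg).
required = {"stages", "label_dir", "parquet_dir", "results_dir"}
missing = required - cfg.keys()
print(sorted(missing))
```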
5. Clustering Module (ChemicalDice.clustering)¶
This module provides advanced algorithms for reducing dataset redundancy and selecting representative subsets while maintaining structural diversity.
5.1 density_aware_sampling¶
Performs memory-efficient, density-aware subset sampling using ECFP6 fingerprints, UMAP projection, and HDBSCAN clustering, extracting a representative core from a large chemical library.
Python API: Density-aware sampling is typically executed via the CLI due to its heavy memory and GPU requirements.

CLI Execution: `cdi cluster --input data.csv --target 0.1`
Arguments:

- `--input` (str): Input CSV containing SMILES and fingerprints.
- `--target` (float, default: `0.10`): Fraction of the dataset to sample (e.g., `0.10` for 10%).
- `--umap_dims` (int, default: `30`): Dimensionality for UMAP reduction.
- `--min_cluster_size` (int, default: `50`): Minimum cluster size for HDBSCAN.
- `--work_dir`, `--out_dir` (str): Directories for checkpoints and final outputs.
Output: A CSV file (`representative_XXpct_hdbscan.csv`) containing the intelligently sampled molecular subset.
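The core idea of density-aware sampling, drawing a proportional quota from each cluster so dense regions are not over-represented, can be sketched in pure Python. Illustrative only: the real pipeline clusters UMAP-projected ECFP6 fingerprints, and `sample_per_cluster` is a hypothetical helper name.

```python
import random
from collections import Counter

def sample_per_cluster(labels, target=0.10, seed=0):
    """Draw roughly `target` fraction from each cluster; noise
    points (HDBSCAN label -1) are skipped."""
    rng = random.Random(seed)
    by_cluster = {}
    for idx, lab in enumerate(labels):
        by_cluster.setdefault(lab, []).append(idx)

    picked = []
    for lab, members in sorted(by_cluster.items()):
        if lab == -1:  # HDBSCAN marks noise as -1
            continue
        k = max(1, round(len(members) * target))
        picked.extend(rng.sample(members, k))
    return sorted(picked)

# Two clusters of different sizes plus a few noise points.
labels = [0] * 80 + [1] * 20 + [-1] * 5
subset = sample_per_cluster(labels, target=0.10)
print(len(subset), Counter(labels[i] for i in subset))
```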
6. Main CLI Hub (ChemicalDice.cli)¶
ChemicalDice exposes all core functionality via a unified `cdi` command-line interface.
| Command | Description | Example Usage |
|---|---|---|
| `cdi compute` | Generate raw CSV descriptors from SMILES. | `cdi compute --input data.csv --descriptors mordred` |
| `cdi convert` | Fuse CSVs into aligned, imputed HDF5 files. | `cdi convert --input_dir ./csv --output_dir ./h5` |
| `cdi train-basic` | Train the unsupervised multi-modal autoencoder. | `cdi train-basic --data-files a.h5 b.h5 --epochs 50` |
| `cdi train-gen` | Fine-tune the sequence-to-embedding architecture. | `cdi train-gen --smiles-csv smiles.csv --target-h5 out.h5` |
| `cdi ood` | Run Random/Scaffold Out-of-Distribution evaluation. | `cdi ood --labels y.csv --embeddings x.parquet` |
| `cdi ldc` | Run fractional Low Data Condition evaluation. | `cdi ldc --dataset pcba --fractions 0.1 0.25` |
| `cdi cluster` | Perform density-aware subset sampling. | `cdi cluster --input data.csv --target 0.1` |
| `cdi benchmark` | Run the SOTA pipeline orchestrator. | `cdi benchmark --stages all` |
| `cdi setup` | Automate installation of binary dependencies. | `cdi setup` |