Model Training Specifics¶
Training the core networks of the ChemicalDice Integrator is computationally demanding. The standard training loops are exposed in ChemicalDice/training/.
1. Unsupervised Multi-Modal Optimization (CDI-Basic)¶
The CDI_Basic Integrator uses non-linear, multi-layer symmetrical encoders to project multiple modalities into a dense 8192-dimensional vector, trained with a Leave-One-Out (LOO) learning topology.
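The LOO topology can be pictured as a loop in which each modality is held out in turn and reconstructed from the remaining ones. The sketch below is illustrative only, not ChemicalDice's internal loop; the `loo_pairs` helper is hypothetical.

```python
# Illustrative sketch of a Leave-One-Out (LOO) pass, NOT the actual
# ChemicalDice implementation: each modality in turn is the reconstruction
# target, with the remaining modalities as inputs.
modalities = ["mordred", "Grover", "Chemberta", "Signaturizer", "mopac", "ImageMol"]

def loo_pairs(names):
    """Yield (held_out, remaining) pairs, one per modality."""
    for i, held_out in enumerate(names):
        yield held_out, names[:i] + names[i + 1:]

for target, inputs in loo_pairs(modalities):
    # In the real loop: encode `inputs`, fuse them, decode toward `target`,
    # and accumulate an MSE reconstruction loss.
    print(f"reconstruct {target} from {len(inputs)} other modalities")
```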
How to Train¶
You can trigger the training via the CLI or Python API.
CLI Execution¶
# Execute training with the full 6-modality manifold
cdi train-basic --data-files data_h5/mordred.h5 data_h5/Grover.h5 data_h5/Chemberta.h5 data_h5/Signaturizer.h5 data_h5/mopac.h5 data_h5/ImageMol.h5 --epochs 50
Python API¶
from ChemicalDice.training.basic_model import train_basic_cdi
model = train_basic_cdi(
    h5_file_paths=[
        "data_h5/mordred.h5", "data_h5/Grover.h5", "data_h5/Chemberta.h5",
        "data_h5/Signaturizer.h5", "data_h5/mopac.h5", "data_h5/ImageMol.h5"
    ],
    num_epochs=50,
    model_path="cdi_model.pt",
    embedding_path="cdi_embeddings.h5",
    bottleneck=1024  # Reduce to 1024-D
)
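Once training finishes, the file written to embedding_path can be inspected with h5py. A minimal sketch, assuming the embeddings are stored under a dataset key named "embeddings" (an assumption about the file layout; verify against your actual output file):

```python
import h5py
import numpy as np

# Stand-in for the file written via embedding_path= (the real file comes
# from train_basic_cdi); the "embeddings" dataset key is an assumption.
with h5py.File("cdi_embeddings_demo.h5", "w") as f:
    f.create_dataset("embeddings", data=np.random.rand(10, 1024).astype("float32"))

# Inspect the shape: with bottleneck=1024 each row is a 1024-D vector.
with h5py.File("cdi_embeddings_demo.h5", "r") as f:
    emb = f["embeddings"][:]
print(emb.shape)
```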
Customizing Embedding Size (Bottleneck)¶
By default, ChemicalDice produces 8192-dimensional embeddings. You can compress these into smaller, more manageable manifolds using the --bottleneck parameter. This is particularly useful for downstream tasks requiring lower memory footprints.
- To 1024-D: pass --bottleneck 1024 (CLI) or bottleneck=1024 (Python API).
- To 512-D: pass --bottleneck 512 (CLI) or bottleneck=512 (Python API).
- Optimizer: SGD (initial LR: 0.001, weight decay: 0)
- Scheduler: ReduceLROnPlateau(patience=5) tracking MSE loss
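The plateau schedule above can be mimicked in a few lines of plain Python. This stand-in mirrors the behaviour of PyTorch's ReduceLROnPlateau (cut the LR once the tracked MSE loss stops improving for more than patience epochs); it is not ChemicalDice's internal scheduler.

```python
# Plain-Python stand-in for ReduceLROnPlateau(patience=5) tracking MSE loss,
# starting from the documented SGD initial LR of 0.001.
class PlateauScheduler:
    def __init__(self, lr=0.001, patience=5, factor=0.1):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, mse_loss):
        if mse_loss < self.best:
            self.best, self.bad_epochs = mse_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor  # reduce LR after a plateau
                self.bad_epochs = 0
        return self.lr
```

With patience=5, six consecutive non-improving epochs after an initial best drop the LR from 0.001 to 0.0001.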
2. Supervised State-Space Regression (CDI-Generalised)¶
The CDI-Generalised framework leverages SMI-SSED (Structured State-Space Evolution) using Mamba blocks to map raw SMILES strings directly into the 8192-D CDI space.
How to Train¶
This mode requires a mapping between raw SMILES and their corresponding 8192-D embeddings (generated by a trained Basic model).
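Conceptually, the training data is a set of (SMILES, target embedding) pairs, as in the sketch below. The "smiles" column name and index-based alignment are assumptions about the file layout, and the 4-D vectors stand in for real 8192-D embeddings.

```python
import csv
import io

# Stand-in for smiles.csv; the "smiles" column name is an assumption.
demo_csv = "smiles\nCCO\nc1ccccc1\n"
smiles = [row["smiles"] for row in csv.DictReader(io.StringIO(demo_csv))]

# Stand-in for rows of cdi_embeddings.h5 (real vectors are 8192-D).
embeddings = [[0.0] * 4, [1.0] * 4]

# Index-based alignment: row i of the CSV maps to row i of the H5 file.
pairs = list(zip(smiles, embeddings))
print(pairs[0][0])
```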
CLI Execution¶
# Fine-tune Mamba network against target CDI loss maps (custom dim 1024)
cdi train-gen --smiles-csv smiles.csv --target-h5 cdi_embeddings.h5 --target-dim 1024
Python API¶
from ChemicalDice.training.gen_model import train_generalised_cdi
model = train_generalised_cdi(
    csv_path="smiles.csv",
    target_h5_file="cdi_embeddings.h5"
)
- Base Engine: SMI-SSED (Mamba).
- Loss Functions: Native optimization via MSELoss. CosineSimilarityLoss is evaluated purely for validation and metric tracking.
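The split between the optimization loss and the validation metric can be sketched with plain-Python stand-ins (these are not the library's loss classes): only the MSE value would be backpropagated, while cosine similarity is merely logged.

```python
import math

def mse(pred, target):
    """Mean squared error: the quantity driving optimization."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_similarity(pred, target):
    """Cosine similarity: tracked for validation only, never backpropagated."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return dot / norm

pred, target = [1.0, 2.0, 3.0], [1.0, 2.0, 2.0]
train_loss = mse(pred, target)                # minimized during training
val_metric = cosine_similarity(pred, target)  # logged as a metric
```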
[!WARNING] Hardware Requirement: CDI-Generalised training is tightly coupled to CUDA kernels. It explicitly requires CUDA 11.8 for optimal tensor memory allocation.
3. Training CDI with Flexible Descriptors¶
CDI supports flexible training configurations, allowing you to select any number of descriptors for feature integration. You are not restricted to a fixed set; you can easily combine any subset of descriptors.
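For example, a subset run can reuse the train_basic_cdi API shown earlier with only the descriptor files you select; the three-descriptor combination below is one arbitrary choice, not a recommended configuration.

```python
# Select any subset of descriptor files before training.
available = {
    "mordred": "data_h5/mordred.h5",
    "Grover": "data_h5/Grover.h5",
    "Chemberta": "data_h5/Chemberta.h5",
    "Signaturizer": "data_h5/Signaturizer.h5",
}
chosen = ["mordred", "Grover", "Chemberta"]
h5_files = [available[name] for name in chosen]

# The subset is passed exactly like the full six-modality list:
# model = train_basic_cdi(h5_file_paths=h5_files, num_epochs=50,
#                         model_path="cdi_model.pt",
#                         embedding_path="cdi_embeddings.h5")
print(h5_files)
```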
Other Descriptors¶
Any of the following descriptors can be used in place of the six descriptors above:
- Molformer – Large-scale chemical language model trained on a large corpus of SMILES strings.
- CLAMP – Contrastive Language-Assay Molecule Pre-training, trained on molecule-bioassay pairs.
- Graphormer – Graph transformer-based structural encoding.
- VideoMol – An image-enhanced molecular graph representation learning framework.
- AlvaDesc – Comprehensive physicochemical, topological, geometrical, and structural descriptors.