Model Training Specifics

Training the core networks in the ChemicalDice Integrator is a compute-intensive process. The standard training loops are implemented in ChemicalDice/training/.

1. Unsupervised Multi-Modal Optimization (CDI-Basic)

The CDI-Basic Integrator uses non-linear, multi-layer symmetric encoders to project each input modality into a dense 8192-dimensional vector, and is trained with a Leave-One-Out (LOO) scheme.
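
The actual encoder definitions live in ChemicalDice/training/; the snippet below is only a conceptual sketch of one plausible LOO pass, in which every modality name, input dimension, and layer width is illustrative rather than taken from the real architecture.

import torch
import torch.nn as nn

LATENT = 8192  # shared CDI embedding dimension

def make_pair(in_dim):
    # Symmetric encoder/decoder pair for one modality (illustrative widths).
    enc = nn.Sequential(nn.Linear(in_dim, 4096), nn.ReLU(), nn.Linear(4096, LATENT))
    dec = nn.Sequential(nn.Linear(LATENT, 4096), nn.ReLU(), nn.Linear(4096, in_dim))
    return enc, dec

dims = {"mod_a": 512, "mod_b": 768, "mod_c": 300}        # made-up input dims
pairs = {m: make_pair(d) for m, d in dims.items()}
batch = {m: torch.randn(4, d) for m, d in dims.items()}  # dummy mini-batch

# Leave-one-out: fuse the latents of all other modalities, then
# reconstruct the held-out modality from the fused representation.
loss_fn = nn.MSELoss()
total_loss = torch.zeros(())
for held in dims:
    fused = torch.stack([pairs[m][0](batch[m]) for m in dims if m != held]).mean(dim=0)
    total_loss = total_loss + loss_fn(pairs[held][1](fused), batch[held])
total_loss.backward()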

How to Train

You can launch training via the CLI or the Python API.

CLI Execution

# Train on the full set of six modalities
cdi train-basic --data-files data_h5/mordred.h5 data_h5/Grover.h5 data_h5/Chemberta.h5 data_h5/Signaturizer.h5 data_h5/mopac.h5 data_h5/ImageMol.h5 --epochs 50

Python API

from ChemicalDice.training.basic_model import train_basic_cdi

model = train_basic_cdi(
    h5_file_paths=[
        "data_h5/mordred.h5", "data_h5/Grover.h5", "data_h5/Chemberta.h5",
        "data_h5/Signaturizer.h5", "data_h5/mopac.h5", "data_h5/ImageMol.h5"
    ],
    num_epochs=50,
    model_path="cdi_model.pt",
    embedding_path="cdi_embeddings.h5",
    bottleneck=1024  # Reduce to 1024-D
)

Customizing Embedding Size (Bottleneck)

By default, ChemicalDice produces 8192-dimensional embeddings. You can compress these into smaller representations with the --bottleneck parameter, which is useful for downstream tasks with tighter memory budgets.

  • To 1024-D:
    cdi train-basic --data-files data_h5/mordred.h5 data_h5/Grover.h5 data_h5/Chemberta.h5 data_h5/Signaturizer.h5 data_h5/mopac.h5 data_h5/ImageMol.h5 --bottleneck 1024

  • To 512-D:
    cdi train-basic --data-files data_h5/mordred.h5 data_h5/Grover.h5 data_h5/Chemberta.h5 data_h5/Signaturizer.h5 data_h5/mopac.h5 data_h5/ImageMol.h5 --bottleneck 512

Default optimization settings for CDI-Basic:

  • Optimizer: SGD (initial LR 0.001, weight decay 0)
  • Scheduler: ReduceLROnPlateau (patience=5), tracking the MSE loss
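
For reference, a minimal PyTorch sketch of those defaults; the linear model and random tensors below are stand-ins for the actual CDI-Basic networks and data:

import torch

model = torch.nn.Linear(8192, 1024)  # stand-in for the real encoder stack
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=0)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
loss_fn = torch.nn.MSELoss()

features = torch.randn(32, 8192)  # dummy inputs
targets = torch.randn(32, 1024)   # dummy targets

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # plateau scheduler tracks the MSE loss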

2. Supervised State-Space Regression (CDI-Generalised)

The CDI-Generalised framework leverages SMI-SSED, a structured state-space architecture built on Mamba blocks, to map raw SMILES strings directly into the 8192-D CDI space.

How to Train

This mode requires a mapping between raw SMILES strings and their corresponding CDI embeddings generated by a trained CDI-Basic model (8192-D by default, or the chosen bottleneck dimension).
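
The exact on-disk layout is whatever the Basic training step produced; purely as an illustration, assuming smiles.csv exposes a smiles column and the H5 file holds a single (N, D) dataset, a quick sanity check could look like:

import h5py
import pandas as pd

smiles = pd.read_csv("smiles.csv")["smiles"]  # assumed column name
with h5py.File("cdi_embeddings.h5", "r") as f:
    name = list(f.keys())[0]                  # assumed: one (N, D) dataset
    embeddings = f[name][...]

# Every SMILES string needs exactly one matching embedding row.
assert len(smiles) == embeddings.shape[0]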

CLI Execution

# Fine-tune the Mamba network against target CDI embeddings (custom dimension 1024)
cdi train-gen --smiles-csv smiles.csv --target-h5 cdi_embeddings.h5 --target-dim 1024

Python API

from ChemicalDice.training.gen_model import train_generalised_cdi

model = train_generalised_cdi(
    csv_path="smiles.csv",
    target_h5_file="cdi_embeddings.h5"
)

  • Base Engine: SMI-SSED (Mamba).
  • Loss Functions: Optimization uses MSELoss; CosineSimilarityLoss is computed only for validation and metric tracking.
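
A small illustration of that split, using dummy tensors, showing which quantity is backpropagated and which is only logged:

import torch
import torch.nn.functional as F

pred = torch.randn(8, 1024, requires_grad=True)  # dummy model output
target = torch.randn(8, 1024)                    # dummy target embeddings

train_loss = F.mse_loss(pred, target)  # optimized (backpropagated)
train_loss.backward()

with torch.no_grad():
    cosine_metric = F.cosine_similarity(pred, target).mean()  # logged only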

[!WARNING] Hardware Requirement: CDI-Generalised training is tightly coupled to CUDA kernels. It explicitly requires CUDA 11.8 for optimal tensor memory allocation.


3. Training CDI with Flexible Descriptors

CDI supports flexible training configurations: you can train on any subset of the descriptor feature files rather than being restricted to the six shown above (see the sketch after the list below).

Other Descriptors

These descriptors can also be used in place of, or alongside, the six descriptors above:

  • Molformer – Large-scale chemical language model trained on a very large corpus of SMILES strings.
  • CLAMP – Contrastive Language-Assay Molecule Pre-training, trained on molecule-bioassay pairs.
  • Graphormer – Graph transformer-based structural encoding.
  • VideoMol – Video-based molecular representation learning framework.
  • AlvaDesc – Comprehensive physicochemical, topological, geometrical, and structural descriptors.
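
For example, a mixed three-descriptor run via the Python API; the H5 file names below are illustrative and must match the feature files you have actually generated:

from ChemicalDice.training.basic_model import train_basic_cdi

model = train_basic_cdi(
    h5_file_paths=[
        "data_h5/Molformer.h5",    # hypothetical file names
        "data_h5/Graphormer.h5",
        "data_h5/mordred.h5",
    ],
    num_epochs=50,
    model_path="cdi_subset_model.pt",
    embedding_path="cdi_subset_embeddings.h5",
)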