Model Training Specifics

Training the core networks in the ChemicalDice Integrator is a compute-intensive process. The standard training loops are implemented in ChemicalDice/training/.

1. Unsupervised Multi-Modal Optimization (CDI-Basic)

The CDI-Basic Integrator uses non-linear, multi-layer symmetric encoders to project each input modality into a dense 8192-dimensional vector, and is trained with a Leave-One-Out (LOO) scheme.
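
The actual encoder definitions live in ChemicalDice/training/; the snippet below is only a conceptual sketch of one plausible LOO pass, in which every modality name, input dimension, and layer width is illustrative rather than taken from the real architecture.

import torch
import torch.nn as nn

LATENT = 8192  # shared CDI embedding dimension

def make_pair(in_dim):
    # Symmetric encoder/decoder pair for one modality (illustrative widths).
    enc = nn.Sequential(nn.Linear(in_dim, 4096), nn.ReLU(), nn.Linear(4096, LATENT))
    dec = nn.Sequential(nn.Linear(LATENT, 4096), nn.ReLU(), nn.Linear(4096, in_dim))
    return enc, dec

dims = {"mod_a": 512, "mod_b": 768, "mod_c": 300}        # made-up input dims
pairs = {m: make_pair(d) for m, d in dims.items()}
batch = {m: torch.randn(4, d) for m, d in dims.items()}  # dummy mini-batch

# Leave-one-out: fuse the latents of all other modalities, then
# reconstruct the held-out modality from the fused representation.
loss_fn = nn.MSELoss()
total_loss = torch.zeros(())
for held in dims:
    fused = torch.stack([pairs[m][0](batch[m]) for m in dims if m != held]).mean(dim=0)
    total_loss = total_loss + loss_fn(pairs[held][1](fused), batch[held])
total_loss.backward()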

How to Train

You can launch training via the CLI or the Python API.

CLI Execution

# Train on the full set of six modalities
cdi train-basic --data-files data_h5/mordred.h5 data_h5/Grover.h5 data_h5/Chemberta.h5 data_h5/Signaturizer.h5 data_h5/mopac.h5 data_h5/ImageMol.h5 --epochs 50

Python API

from ChemicalDice.training.basic_model import train_basic_cdi

model = train_basic_cdi(
    h5_file_paths=[
        "data_h5/mordred.h5", "data_h5/Grover.h5", "data_h5/Chemberta.h5",
        "data_h5/Signaturizer.h5", "data_h5/mopac.h5", "data_h5/ImageMol.h5"
    ],
    num_epochs=50,
    model_path="cdi_model.pt",
    embedding_path="cdi_embeddings.h5",
    bottleneck=1024  # Reduce to 1024-D
)

Customizing Embedding Size (Bottleneck)

By default, ChemicalDice produces 8192-dimensional embeddings. You can compress these into smaller representations with the --bottleneck parameter, which is useful for downstream tasks with tighter memory budgets.

  • To 1024-D:
    cdi train-basic --data-files data_h5/mordred.h5 data_h5/Grover.h5 data_h5/Chemberta.h5 data_h5/Signaturizer.h5 data_h5/mopac.h5 data_h5/ImageMol.h5 --bottleneck 1024

  • To 512-D:
    cdi train-basic --data-files data_h5/mordred.h5 data_h5/Grover.h5 data_h5/Chemberta.h5 data_h5/Signaturizer.h5 data_h5/mopac.h5 data_h5/ImageMol.h5 --bottleneck 512

Default optimization settings for CDI-Basic:

  • Optimizer: SGD (initial LR 0.001, weight decay 0)
  • Scheduler: ReduceLROnPlateau (patience=5), tracking the MSE loss
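
For reference, a minimal PyTorch sketch of those defaults; the linear model and random tensors below are stand-ins for the actual CDI-Basic networks and data:

import torch

model = torch.nn.Linear(8192, 1024)  # stand-in for the real encoder stack
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=0)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
loss_fn = torch.nn.MSELoss()

features = torch.randn(32, 8192)  # dummy inputs
targets = torch.randn(32, 1024)   # dummy targets

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # plateau scheduler tracks the MSE loss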

2. Supervised State-Space Regression (CDI-Generalised)

The CDI-Generalised framework leverages SMI-SSED, a structured state-space architecture built on Mamba blocks, to map raw SMILES strings directly into the 8192-D CDI space.

How to Train

This mode requires a mapping between raw SMILES strings and their corresponding CDI embeddings generated by a trained CDI-Basic model (8192-D by default, or the chosen bottleneck dimension).
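
The exact on-disk layout is whatever the Basic training step produced; purely as an illustration, assuming smiles.csv exposes a smiles column and the H5 file holds a single (N, D) dataset, a quick sanity check could look like:

import h5py
import pandas as pd

smiles = pd.read_csv("smiles.csv")["smiles"]  # assumed column name
with h5py.File("cdi_embeddings.h5", "r") as f:
    name = list(f.keys())[0]                  # assumed: one (N, D) dataset
    embeddings = f[name][...]

# Every SMILES string needs exactly one matching embedding row.
assert len(smiles) == embeddings.shape[0]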

CLI Execution

# Fine-tune the Mamba network against target CDI embeddings (custom dimension 1024)
cdi train-gen --smiles-csv smiles.csv --target-h5 cdi_embeddings.h5 --target-dim 1024

Python API

from ChemicalDice.training.gen_model import train_generalised_cdi

model = train_generalised_cdi(
    csv_path="smiles.csv",
    target_h5_file="cdi_embeddings.h5"
)

  • Base Engine: SMI-SSED (Mamba).
  • Loss Functions: Optimization uses MSELoss; CosineSimilarityLoss is computed only for validation and metric tracking.
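
A small illustration of that split, using dummy tensors, showing which quantity is backpropagated and which is only logged:

import torch
import torch.nn.functional as F

pred = torch.randn(8, 1024, requires_grad=True)  # dummy model output
target = torch.randn(8, 1024)                    # dummy target embeddings

train_loss = F.mse_loss(pred, target)  # optimized (backpropagated)
train_loss.backward()

with torch.no_grad():
    cosine_metric = F.cosine_similarity(pred, target).mean()  # logged only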

[!WARNING] Hardware Requirement: CDI-Generalised training is tightly coupled to CUDA kernels. It explicitly requires CUDA 11.8 for optimal tensor memory allocation.


3. Training CDI with Flexible Descriptors

CDI supports flexible training configurations: you can train on any subset of the descriptor feature files rather than being restricted to the six shown above (see the sketch after the list below).

Other Descriptors

These descriptors can also be used in place of, or alongside, the six descriptors above:

  • Molformer – Large-scale chemical language model trained on a very large corpus of SMILES strings.
  • CLAMP – Contrastive Language-Assay Molecule Pre-training, trained on molecule-bioassay pairs.
  • Graphormer – Graph transformer-based structural encoding.
  • VideoMol – Video-based molecular representation learning framework.
  • AlvaDesc – Comprehensive physicochemical, topological, geometrical, and structural descriptors.
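
For example, a mixed three-descriptor run via the Python API; the H5 file names below are illustrative and must match the feature files you have actually generated:

from ChemicalDice.training.basic_model import train_basic_cdi

model = train_basic_cdi(
    h5_file_paths=[
        "data_h5/Molformer.h5",    # hypothetical file names
        "data_h5/Graphormer.h5",
        "data_h5/mordred.h5",
    ],
    num_epochs=50,
    model_path="cdi_subset_model.pt",
    embedding_path="cdi_subset_embeddings.h5",
)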