Architecture of ChemicalDice Integrator (CDI)¶
The ChemicalDice Integrator (CDI) is a hierarchical, multimodal deep learning framework that unifies diverse molecular representations into a single, information-rich latent space. The architecture is organized into two primary operational modes: CDI-Basic for unsupervised integration and CDI-Generalised for scalable, sequence-based inference.
1. CDI-Basic: Multimodal Fusion Engine¶
CDI-Basic is the foundational engine that performs the unsupervised integration of six orthogonal molecular modalities. It utilizes a two-tiered hierarchical autoencoder architecture to learn deep semantic relationships between different views of a molecule.
Tier 1: Semantic Commonality Autoencoders (SCA)¶
In the first tier, CDI employs six dedicated autoencoders, one per modality. Each SCA learns inter-modality dependencies through a Leave-One-Out (LOO) reconstruction objective:
- Input: Concatenated latent information from five modalities.
- Target: Reconstruction of the sixth (omitted) modality.
- Outcome: This process enforces the discovery of shared inter-modality information, yielding six latent subspaces that capture the "semantic overlap" between feature domains (a minimal code sketch follows this list).
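The sketch below illustrates a single LOO pass through one SCA in PyTorch. The layer widths, per-modality feature sizes, and single-layer encoder/decoder are illustrative placeholders, not the published CDI hyperparameters.

```python
# Minimal sketch of one Semantic Commonality Autoencoder (SCA) performing
# a leave-one-out (LOO) reconstruction pass. All dimensions are placeholders.
import torch
import torch.nn as nn

class SCA(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int, out_dim: int):
        super().__init__()
        # Encoder: compress the five concatenated modalities into a
        # shared latent subspace (the "semantic overlap").
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        # Decoder: reconstruct the held-out sixth modality from that latent.
        self.decoder = nn.Linear(latent_dim, out_dim)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

# Toy LOO pass: hold out modality k, feed the other five.
feat_dims = [512, 512, 512, 512, 512, 512]      # hypothetical per-modality sizes
feats = [torch.randn(8, d) for d in feat_dims]  # batch of 8 molecules
k = 3                                           # index of the omitted modality
inputs = torch.cat([f for i, f in enumerate(feats) if i != k], dim=1)
sca = SCA(in_dim=inputs.shape[1], latent_dim=256, out_dim=feat_dims[k])
z_k, recon = sca(inputs)
loss_re = nn.functional.mse_loss(recon, feats[k])  # L_RE for this SCA
```

In practice, six such autoencoders are trained, one per held-out modality, and their six latents are what Tier 2 consumes.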
Tier 2: Super-Embedding Autoencoder (SEA)¶
The outputs from the six SCAs are concatenated and fed into the Super-Embedding Autoencoder.
- Function: Compresses the consolidated latent vectors into a single 8192-dimensional Super Embedding.
- Objective: Minimizes global reconstruction error, yielding a cohesive, unified representation that preserves the unique characteristics of each source modality while capturing their synergy (see the sketch below).
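The sketch below illustrates the Tier 2 compression under assumed dimensions (six hypothetical 2048-D SCA latents compressed into the Super Embedding); only the 8192-D bottleneck is taken from the source.

```python
# Minimal sketch of the Super-Embedding Autoencoder (SEA). Only the 8192-D
# bottleneck matches the documented architecture; other sizes are placeholders.
import torch
import torch.nn as nn

class SEA(nn.Module):
    def __init__(self, in_dim: int, super_dim: int = 8192):
        super().__init__()
        # Encoder: consolidated SCA latents -> 8192-D Super Embedding.
        self.encoder = nn.Linear(in_dim, super_dim)
        # Decoder: reconstruct the consolidated latents for the global loss.
        self.decoder = nn.Linear(super_dim, in_dim)

    def forward(self, x):
        e = self.encoder(x)   # the unified Super Embedding
        return e, self.decoder(e)

# Six hypothetical 2048-D SCA latents for a batch of 8 molecules.
sca_latents = [torch.randn(8, 2048) for _ in range(6)]
consolidated = torch.cat(sca_latents, dim=1)             # shape (8, 12288)
sea = SEA(in_dim=consolidated.shape[1])                  # 8192-D bottleneck
super_emb, recon = sea(consolidated)
loss_sea = nn.functional.mse_loss(recon, consolidated)   # global reconstruction
```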
2. CDI-Generalised: Sequence-to-Embedding Mapping¶
While CDI-Basic provides the "gold standard" integrated embedding, it requires all six source feature sets to be computed beforehand, a significant computational bottleneck. CDI-Generalised removes this requirement by providing a direct pathway from chemical structure to the integrated space.
Mamba State-Space Model (SSM) Core¶
CDI-Generalised leverages a modern Mamba SSM architecture (based on the SMI-SSED framework) to map raw SMILES strings directly into the 8192-D CDI space.
- Efficiency: Bypasses the need to generate descriptor-heavy inputs (like MOPAC or Mordred) during inference.
- Training: The model is trained with a supervised regression objective that enforces both geometric and angular alignment with the pre-trained CDI-Basic targets (see the sketch after this list).
- Scalability: Enables high-throughput molecular screening with the latency of a single model while maintaining the breadth of the original multimodal ensemble.
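Below is a minimal sketch of this training objective. A small GRU stands in for the Mamba SSM encoder (the real architecture follows SMI-SSED), the SMILES tokenizer is omitted, and the cosine-penalty weight `lam` is an assumption motivated by the stated geometric-plus-angular alignment goal.

```python
# Minimal sketch of the CDI-Generalised regression objective. A GRU is used
# as a stand-in for the Mamba SSM encoder; tokenization is omitted.
import torch
import torch.nn as nn

class SmilesToCDI(nn.Module):
    def __init__(self, vocab_size=64, hidden=256, out_dim=8192):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)  # Mamba stand-in
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.head(h[:, -1])   # final state -> 8192-D predicted embedding

def alignment_loss(e_pred, e_target, lam=1.0):
    # Geometric alignment: mean squared error to the gold-standard embeddings.
    mse = nn.functional.mse_loss(e_pred, e_target)
    # Angular alignment: cosine penalty (weight lam is an assumption).
    cos = 1 - nn.functional.cosine_similarity(e_pred, e_target, dim=1).mean()
    return mse + lam * cos

tokens = torch.randint(0, 64, (4, 32))  # 4 dummy tokenized SMILES, length 32
e_target = torch.randn(4, 8192)         # CDI-Basic gold-standard embeddings
model = SmilesToCDI()
loss = alignment_loss(model(tokens), e_target)
```

At inference time only this sequence model is needed, which is what removes the descriptor-generation bottleneck.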
3. The Six Orthogonal Modalities¶
CDI integrates information from across the chemical intelligence spectrum (a sketch of the assembled feature blocks follows this list):
- Quantum-Mechanical (MOPAC): Captures electronic properties and orbital landscapes.
- Topological/Graph-Based (GROVER): Maps atomic connections and structural motifs.
- Linguistic/Language-Based (ChemBERTa): Extracts transformer-based semantics from SMILES syntax.
- Biological/Bioactivity (Signaturizer): Represents pre-trained bioactivity signatures and profiles.
- Visual/Image-Based (ImageMol): Learns 2D topological encodings via visual ResNet features.
- Physicochemical (Mordred): Provides deterministic mathematical descriptors defining geometry and chemistry.
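As a concrete picture of the input layout, the sketch below assembles hypothetical feature blocks for the six modalities; all dimensionalities are placeholders, since the actual sizes depend on each upstream tool's output.

```python
# Hypothetical layout of the six modality feature blocks for a batch of
# 8 molecules before Tier 1 fusion. Dimensionalities are placeholders.
import torch

modalities = {
    "MOPAC":        torch.randn(8, 120),    # quantum-mechanical descriptors
    "GROVER":       torch.randn(8, 768),    # graph-transformer embeddings
    "ChemBERTa":    torch.randn(8, 384),    # SMILES language-model embeddings
    "Signaturizer": torch.randn(8, 3200),   # bioactivity signatures
    "ImageMol":     torch.randn(8, 512),    # image-based encodings
    "Mordred":      torch.randn(8, 1613),   # physicochemical descriptors
}
# Each leave-one-out pass concatenates five of these blocks and asks the
# corresponding SCA to reconstruct the sixth (see Tier 1 above).
held_out = "MOPAC"
loo_input = torch.cat([v for k, v in modalities.items() if k != held_out], dim=1)
```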
4. Mathematical Formulation & Loss Functions¶
The hierarchical training protocol of ChemicalDice is governed by objective functions that enforce high-fidelity representation learning.
CDI-Basic Optimization¶
The total loss for the fusion engine combines the reconstruction and alignment objectives of both tiers, with the combined form given after the definitions:
- \(\mathcal{L}_{RE}\): Tier 1 reconstruction error for the leave-one-out modality.
- \(\mathcal{L}_{MSE}\): Mean Squared Error for semantic alignment between latent spaces.
- \(\mathcal{L}_{SEA}\): Global reconstruction loss for the Tier 2 super-embedding.
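Assuming the three terms are combined as an unweighted sum (any weighting coefficients are not specified in this section), the total objective takes the form:

\[
\mathcal{L}_{\text{CDI-Basic}} = \mathcal{L}_{RE} + \mathcal{L}_{MSE} + \mathcal{L}_{SEA}
\]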
CDI-Generalised Optimization¶
The Mamba-based mapping is optimized to align the sequence-derived embeddings with the established CDI manifold. The objective, given below, uses the following quantities:
- \(N\): Batch size.
- \(D\): Embedding dimension (fixed at 8192).
- \(E_{target}\): Gold-standard embeddings from CDI-Basic.
- \(E_{pred}\): Predicted embeddings from the Mamba encoder.
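Given these definitions and the geometric-plus-angular alignment goal stated above, a plausible form of the objective is the following, where the cosine penalty and its weight \(\lambda\) are assumptions:

\[
\mathcal{L}_{\text{CDI-G}} = \underbrace{\frac{1}{ND}\sum_{i=1}^{N}\left\lVert E_{pred}^{(i)} - E_{target}^{(i)}\right\rVert_2^2}_{\text{geometric}} \;+\; \lambda\,\underbrace{\frac{1}{N}\sum_{i=1}^{N}\left(1 - \cos\!\left(E_{pred}^{(i)},\, E_{target}^{(i)}\right)\right)}_{\text{angular}}
\]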
Technical Specifications Summary¶
| Component | Architecture | Dimensionality | Primary Objective |
|---|---|---|---|
| CDI-Basic | Two-Tiered Autoencoder | 8192-D | Multimodal Representation Fusion |
| CDI-Generalised | Mamba State-Space Model | 8192-D | Sequence-to-Latent Mapping |
| Training Scale | ~1.09 Billion Parameters | - | Global Semantic Integration |
| Inference Mode | Sequence-based (SMILES) | - | High-throughput Scalability |