Architecture of ChemicalDice Integrator (CDI)

The ChemicalDice Integrator (CDI) is a hierarchical, multimodal deep learning framework that unifies diverse molecular representations into a single, high-information latent space. The architecture offers two primary operational modes: CDI-Basic for unsupervised integration and CDI-Generalised for scalable, sequence-based inference.

1. CDI-Basic: Multimodal Fusion Engine

CDI-Basic is the foundational engine that performs the unsupervised integration of six orthogonal molecular modalities. It utilizes a two-tiered hierarchical autoencoder architecture to learn deep semantic relationships between different views of a molecule.

Tier 1: Semantic Commonality Autoencoders (SCA)

In the first tier, CDI employs six dedicated autoencoders. Each SCA learns inter-modality dependencies through a Leave-One-Out (LOO) reconstruction objective, sketched in code after the list below:

  • Input: Concatenated latent information from five modalities.
  • Target: Reconstruction of the sixth (omitted) modality.
  • Outcome: This process enforces the discovery of shared inter-modality information, yielding six latent subspaces that capture the "semantic overlap" between feature domains.
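
To make the LOO objective concrete, here is a minimal PyTorch sketch of one SCA. The layer widths and the 2048-D latent size are illustrative assumptions, not the published ChemicalDice configuration.

```python
import torch
import torch.nn as nn

class SemanticCommonalityAE(nn.Module):
    """One Tier-1 SCA: reconstructs the held-out modality from the other five.
    All layer widths here are illustrative placeholders, not the published
    ChemicalDice configuration."""

    def __init__(self, in_dim: int, target_dim: int, latent_dim: int = 2048):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(),
            nn.Linear(4096, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 4096), nn.ReLU(),
            nn.Linear(4096, target_dim),
        )

    def forward(self, five_views: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
        x = torch.cat(five_views, dim=-1)   # concatenate the five retained modalities
        z = self.encoder(x)                 # "semantic overlap" latent subspace
        recon = self.decoder(z)             # reconstruction of the omitted modality
        return z, recon
```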

Tier 2: Super-Embedding Autoencoder (SEA)

The outputs from the six SCAs are concatenated and fed into the Super-Embedding Autoencoder (sketched after the list below).

  • Function: Compress the consolidated latent vectors into a single, 8192-dimensional Super Embedding.
  • Objective: Minimize global reconstruction error and ensure a cohesive, unified representation that preserves the unique characteristics of each source modality while capturing their synergy.
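
A matching sketch of the Tier 2 compressor follows. Only the 8192-D bottleneck comes from the CDI specification; the 2048-D per-SCA latent is the same illustrative assumption used in the Tier 1 sketch.

```python
import torch
import torch.nn as nn

class SuperEmbeddingAE(nn.Module):
    """Tier-2 SEA: fuses six SCA latents into one 8192-D Super Embedding.
    Only the 8192-D bottleneck is taken from the CDI specification; the
    2048-D per-SCA latent is an illustrative assumption."""

    def __init__(self, sca_latent_dim: int = 2048, super_dim: int = 8192):
        super().__init__()
        in_dim = 6 * sca_latent_dim              # six concatenated SCA latents (12288-D here)
        self.encoder = nn.Linear(in_dim, super_dim)
        self.decoder = nn.Linear(super_dim, in_dim)

    def forward(self, sca_latents: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
        x = torch.cat(sca_latents, dim=-1)
        super_embedding = self.encoder(x)        # unified 8192-D representation
        recon = self.decoder(super_embedding)    # global reconstruction target
        return super_embedding, recon
```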

2. CDI-Generalised: Sequence-to-Embedding Mapping

While CDI-Basic provides the "gold standard" integrated embedding, it requires all six source features to be computed beforehand—a significant computational bottleneck. CDI-Generalised solves this by providing a direct pathway from chemical structure to the integrated space.

Mamba State-Space Model (SSM) Core

CDI-Generalised uses a Mamba SSM architecture, based on the SMI-SSED framework, to map raw SMILES strings directly into the 8192-D CDI space; a minimal sketch follows the list below.

  • Efficiency: Bypasses the need to generate descriptor-heavy inputs (like MOPAC or Mordred) during inference.
  • Training: The model is trained using a supervised regression objective to ensure both geometric and angular alignment with the pre-trained CDI-Basic targets.
  • Scalability: Enables high-throughput molecular screening with the latency of a single model while maintaining the breadth of the original multimodal ensemble.
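
The sketch below assumes the open-source mamba_ssm package's Mamba block and a hypothetical SMILES tokenizer; the vocabulary size, model width, and depth are illustrative, and only the 8192-D output dimension comes from the CDI specification.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # https://github.com/state-spaces/mamba (requires a CUDA device)

class SmilesToCDI(nn.Module):
    """Sketch of a Mamba-based SMILES encoder regressing onto the 8192-D CDI
    space. Vocabulary size, width, and depth are illustrative assumptions."""

    def __init__(self, vocab_size: int = 600, d_model: int = 768,
                 n_layers: int = 8, cdi_dim: int = 8192):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
             for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, cdi_dim)  # regression head into CDI space

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                # (batch, seq_len, d_model)
        for block in self.blocks:
            x = block(x)                         # selective state-space mixing
        pooled = x.mean(dim=1)                   # simple mean pooling over tokens
        return self.head(pooled)                 # (batch, 8192) predicted CDI embedding
```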

3. The Six Orthogonal Modalities

CDI integrates information from across the chemical intelligence spectrum:

  1. Quantum-Mechanical (MOPAC): Captures electronic properties and orbital landscapes.
  2. Topological/Graph-Based (GROVER): Maps atomic connections and structural motifs.
  3. Linguistic/Language-Based (ChemBERTa): Extracts transformer-based semantics from SMILES syntax.
  4. Biological/Bioactivity (Signaturizer): Represents pre-trained bioactivity signatures and profiles.
  5. Visual/Image-Based (ImageMol): Learns 2D topological encodings via visual ResNet features.
  6. Physicochemical (Mordred): Provides deterministic mathematical descriptors defining geometry and chemistry (see the example below).
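
As a concrete example of one input view, the physicochemical block can be computed from a SMILES string with the public mordred and RDKit APIs; the exact descriptor subset CDI retains is not specified here, so treat this as illustrative.

```python
from rdkit import Chem
from mordred import Calculator, descriptors

# Compute the full 2D Mordred descriptor set for a single molecule (ethanol).
calc = Calculator(descriptors, ignore_3D=True)
mol = Chem.MolFromSmiles("CCO")
values = list(calc(mol))   # deterministic descriptor values
print(len(values))         # on the order of 1600 two-dimensional descriptors
```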

4. Mathematical Formulation & Loss Functions

The hierarchical training protocol of ChemicalDice is governed by specific objective functions that ensure high-fidelity representation learning.

CDI-Basic Optimization

The total loss for the fusion engine combines the reconstruction and alignment objectives of both tiers:

\[ \mathcal{L}_{total} = \frac{1}{6} \sum_{j=1}^{6} \left( \mathcal{L}_{RE}(\mathbf{a}_j^i, \tilde{\mathbf{a}}_j^i) + \mathcal{L}_{MSE}(\mathbf{p}_j^i, \mathbf{d}_j^i) \right) + \mathcal{L}_{SEA} \]

where \(i\) indexes training molecules and \(j\) the six modalities.

  • \(\mathcal{L}_{RE}\): Tier 1 reconstruction error between the held-out modality \(\mathbf{a}_j^i\) and its reconstruction \(\tilde{\mathbf{a}}_j^i\).
  • \(\mathcal{L}_{MSE}\): Mean Squared Error enforcing semantic alignment between the paired latent representations \(\mathbf{p}_j^i\) and \(\mathbf{d}_j^i\).
  • \(\mathcal{L}_{SEA}\): Global reconstruction loss for the Tier 2 super-embedding.
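
Read as code, the objective is a per-modality sum plus a global term. The sketch below assumes mean squared error for the reconstruction terms and takes the tensor pairings directly from the formula; it is illustrative, not the published implementation.

```python
import torch
import torch.nn.functional as F

def cdi_basic_loss(recons, heldouts, proj_p, proj_d, sea_recon, sea_input):
    """Hedged sketch of L_total. The first four arguments are lists of six
    tensors (one per modality); MSE is assumed for the reconstruction terms."""
    tier1 = sum(
        F.mse_loss(r, a) + F.mse_loss(p, d)      # L_RE + L_MSE for modality j
        for r, a, p, d in zip(recons, heldouts, proj_p, proj_d)
    ) / 6.0
    sea = F.mse_loss(sea_recon, sea_input)       # L_SEA: Tier-2 global reconstruction
    return tier1 + sea
```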

CDI-Generalised Optimization

The Mamba-based mapping is optimized to align the sequence-derived embeddings with the established CDI manifold:

\[ \mathcal{L}_{SSM} = \frac{1}{N \times D} \sum_{i=1}^{N} \sum_{j=1}^{D} (E_{target, ij} - E_{pred, ij})^2 \]
  • \(N\): Batch size.
  • \(D\): Embedding dimension (fixed at 8192).
  • \(E_{target}\): Gold-standard embeddings from CDI-Basic.
  • \(E_{pred}\): Predicted embeddings from the Mamba encoder.
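
In PyTorch terms, this double sum with its \(1/(N \times D)\) factor is exactly mean-reduced MSE, as the stand-in check below illustrates:

```python
import torch
import torch.nn.functional as F

N, D = 32, 8192                              # batch size, CDI embedding width
e_target = torch.randn(N, D)                 # gold-standard CDI-Basic embeddings (stand-in)
e_pred = torch.randn(N, D)                   # Mamba encoder outputs (stand-in)

loss = F.mse_loss(e_pred, e_target)          # mean reduction == the 1/(N*D) double sum
assert torch.isclose(loss, ((e_target - e_pred) ** 2).mean())
```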

Technical Specifications Summary

Component         Architecture              Dimensionality  Primary Objective
CDI-Basic         Two-Tiered Autoencoder    8192-D          Multimodal Representation Fusion
CDI-Generalised   Mamba State-Space Model   8192-D          Sequence-to-Latent Mapping
Training Scale    ~1.09 Billion Parameters  -               Global Semantic Integration
Inference Mode    Sequence-based (SMILES)   -               High-throughput Scalability