# CDI Bot
*Chemical Dice Integrator — Conversational Molecular Embedding Platform*
CDI Bot is a fully containerised, LLM-powered web application that gives researchers and chemists a natural-language interface to the Chemical Dice Integrator (CDI).
> [!TIP]
> Watch the CDI Bot in action:
CDI's core embedding pipeline exists as a Python library and a REST API, but it previously lacked any accessible interface for non-programmers. Researchers had to write code to generate embeddings, understand API contracts, and manage files by hand. The goal was to eliminate that friction entirely:
- Allow any researcher to generate molecular embeddings by simply typing in natural language or uploading a spreadsheet
- Remove the dependency on a locally installed Ollama instance — the LLM runs inside the container, making deployment a single command on any machine
- Provide a microservice interface so ML pipelines can call the CDI API programmatically without the chat layer
## How We Built It
Architecture — Three services managed by Supervisor inside one Docker image: Ollama (LLM server, port 11434), FastAPI backend (port 8001), and Streamlit frontend (port 8501). Ollama starts first; FastAPI waits for its health check; Streamlit starts after a brief delay.
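A minimal sketch of the Supervisor configuration implied by this startup order. The program names, priorities, commands, and the health-check loop are assumptions for illustration, not the project's actual config:

```ini
; Hypothetical supervisord.conf fragment — lower priority starts first.
[supervisord]
nodaemon=true

[program:ollama]        ; LLM server, starts first (port 11434)
command=ollama serve
priority=10
autorestart=true

[program:fastapi]       ; waits for Ollama's health check before serving on 8001
command=bash -c "until curl -sf http://localhost:11434/api/tags; do sleep 1; done && uvicorn app:app --port 8001"
priority=20
autorestart=true

[program:streamlit]     ; starts after a brief delay (port 8501)
command=bash -c "sleep 5 && streamlit run ui.py --server.port 8501"
priority=30
autorestart=true
```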
LLM Pipeline — Every user message triggers two sequential Ollama calls. Step 1 generates a natural-language conversational reply using a CDI-specific system prompt with few-shot examples. Step 2 is a near-deterministic intent classifier (temperature 0.05) that outputs a structured JSON object with fields intent (run_file | run_smiles | chat) and smiles. No regex or keyword lists — the LLM does all parsing.
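The two-step pipeline can be sketched as below. Here `call_llm` stands in for an Ollama chat request, and the prompt wording and helper names are assumptions based on the description; only the intent labels, the `smiles` field, and the 0.05 temperature come from the text. Note the fallback: if the classifier emits malformed JSON, the message is treated as plain chat.

```python
import json

INTENTS = {"run_file", "run_smiles", "chat"}

def classify_intent(call_llm, user_message: str) -> dict:
    """Step 2: near-deterministic intent classification (temperature 0.05).

    `call_llm` is any callable (prompt, temperature) -> str; in the real app
    this would be an Ollama call. Returns {"intent": ..., "smiles": ...},
    falling back to plain chat if the reply is not valid JSON.
    """
    raw = call_llm(
        "Classify the user's intent as run_file, run_smiles, or chat and "
        'extract any SMILES string. Reply with JSON: {"intent": ..., "smiles": ...}\n'
        "User: " + user_message,
        temperature=0.05,
    )
    try:
        parsed = json.loads(raw)
        if parsed.get("intent") in INTENTS:
            return {"intent": parsed["intent"], "smiles": parsed.get("smiles")}
    except (json.JSONDecodeError, TypeError):
        pass
    return {"intent": "chat", "smiles": None}

def handle_message(call_llm, user_message: str) -> dict:
    """Step 1 then Step 2: conversational reply, then structured intent."""
    reply = call_llm("You are the CDI assistant. " + user_message, temperature=0.7)
    intent = classify_intent(call_llm, user_message)
    return {"reply": reply, **intent}
```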
Model Baking — The chosen LLM (default: llama3.1:8b) is pulled from Ollama's registry during docker build and stored as an image layer. Runtime requires zero internet access and zero model downloads.
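Baking the model into a layer can look roughly like the following Dockerfile fragment. The base image, sleep duration, and shutdown step are assumptions; only the `LLM_MODEL` build argument and the default `llama3.1:8b` tag are from this document:

```dockerfile
# Hypothetical Dockerfile fragment — pulls the model during docker build.
ARG LLM_MODEL=llama3.1:8b

# Start a temporary Ollama server, pull the model, then stop it.
# The downloaded weights persist in the image layer, so runtime needs
# no internet access and no model downloads.
RUN ollama serve & \
    sleep 5 && \
    ollama pull ${LLM_MODEL} && \
    pkill ollama
```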
CDI Integration — Single-SMILES requests hit the CDI REST API (chemicaldice.ahujalab.iiitd.edu.in). Batch file requests invoke the ChemicalDice Python library directly. Results are checkpointed as CSV and served via a download endpoint.
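The CSV checkpoint format can be sketched as follows. The column layout (SMILES plus CDI_0…CDI_n) matches the download format described later in this page; the function name is a hypothetical helper, not the library's API:

```python
import csv

def checkpoint_embeddings(path: str, rows: list[tuple[str, list[float]]]) -> None:
    """Write (smiles, embedding) pairs as SMILES, CDI_0 ... CDI_n columns.

    Hypothetical helper mirroring the described checkpoint format;
    assumes every embedding vector has the same length.
    """
    if not rows:
        return
    n = len(rows[0][1])
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["SMILES"] + [f"CDI_{i}" for i in range(n)])
        for smiles, vec in rows:
            writer.writerow([smiles] + list(vec))
```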
## Features
| Feature | Description |
|---|---|
| 💬 Conversational Chat UI | Streamlit-based dark-themed interface with animated message bubbles |
| 🔬 Single SMILES Embedding | Paste any SMILES string; get a CDI vector instantly via the CDI REST API |
| 📂 Batch File Embedding | Upload CSV / Excel / TSV / JSON — auto-detected, converted, and processed |
| 🤖 LLM-Driven Intent Engine | Two-step Ollama pipeline: chat reply + structured JSON intent + SMILES extraction |
| ⚙️ Microservice Tab | Direct API access — no chat needed; ideal for programmatic integration |
| ⬇️ CSV Download | Embeddings exported as a ready-to-use CSV with SMILES + CDI_0…CDI_n columns |
| 🐳 Single Docker Image | One `docker build`, one `docker run` — Ollama + FastAPI + Streamlit bundled |
## How the Public Gets It
| Aspect | Details |
|---|---|
| Docker Image | Single self-contained image (~8 GB); run with one command on any Linux host |
| Default Model | LLM baked in at build time (default: `llama3.1:8b`); swappable via `--build-arg` |
| Ports Exposed | 8501 — Streamlit UI; 8001 — FastAPI backend |
| GPU Support | Pass `--gpus all` for NVIDIA acceleration; falls back to CPU automatically |
| Model Switching | `docker build --build-arg LLM_MODEL=<tag>` — no code changes required |
### Run using a prebuilt Docker image
```bash
docker pull ahujalab/chemicaldice-app:v1
docker run --gpus all --name chemicaldice-app -p 8001:8001 -p 8501:8501 ahujalab/chemicaldice-app:v1
```
## ⚙️ Requirements (must have)
- Docker installed
- NVIDIA GPU (recommended; the app falls back to CPU automatically if `--gpus all` is omitted)
- NVIDIA Container Toolkit installed (for `--gpus all`)
- ~20 GB free disk space
- At least 8 GB RAM (16 GB recommended)