Molecular String Renderer: Robust Visualization Tool

Overview

In computational chemistry and AI-driven drug discovery, visualization pipelines are often brittle: they break on edge cases or fail silently when processing millions of molecules for training data.

I built molecular-string-renderer to treat molecular visualization as a strict software engineering problem. It is a highly configurable, fault-tolerant wrapper around RDKit that standardizes the conversion of text-based chemical representations (SMILES, InChI, SELFIES) into high-fidelity raster and vector graphics.

Features

This library differentiates itself from standard plotting scripts through strict architectural patterns designed for reliability:

1. Strategy Pattern for SVG Generation

RDKit’s vector rendering can sometimes fail on complex molecular topologies. I implemented a Hybrid Strategy that ensures the pipeline never crashes during batch processing:

Vector Strategy: Attempts to generate a true, scalable vector graphic.
Raster Fallback: If the vector engine fails, the system automatically renders a high-res PNG and embeds it transparently into the SVG container.
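The two strategies above can be sketched as a try-then-fall-back function. This is a minimal illustration rather than the library's actual code: render_svg_hybrid, embed_png_in_svg, and the two renderer callables are hypothetical names, and the renderers are passed in so the control flow can be read (and run) without RDKit.

```python
import base64
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

# Hypothetical renderer signatures: (molecule, width, height) -> output
VectorRenderer = Callable[[Any, int, int], str]
RasterRenderer = Callable[[Any, int, int], bytes]


def embed_png_in_svg(png_bytes: bytes, width: int, height: int) -> str:
    """Wrap PNG bytes in a minimal SVG document via a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">'
        f'<image width="{width}" height="{height}" '
        f'href="data:image/png;base64,{encoded}"/>'
        f"</svg>"
    )


def render_svg_hybrid(
    mol: Any,
    vector_renderer: VectorRenderer,
    raster_renderer: RasterRenderer,
    width: int = 500,
    height: int = 500,
) -> str:
    """Try the vector strategy first; on any failure, embed a raster image."""
    try:
        return vector_renderer(mol, width, height)
    except Exception as exc:  # broad on purpose: one bad molecule must not stop a batch
        logger.warning("Vector rendering failed (%s); using raster fallback", exc)
    png_bytes = raster_renderer(mol, width, height)
    return embed_png_in_svg(png_bytes, width, height)
```

Either way the caller receives an SVG string, so downstream code does not need to know which strategy produced it. The plain href attribute follows SVG 2; some older viewers expect xlink:href instead. Logging the fallback keeps the failure visible rather than silent.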

2. Native Generative AI Support

With the rise of Large Language Models in chemistry, SELFIES (Self-Referencing Embedded Strings) has become a standard output format. This library handles SELFIES natively, managing the decoding and sanitization lifecycle internally so that ML training loops can simply “pass strings and get images.”
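A rough sketch of that decode-then-sanitize lifecycle, assuming the selfies package (selfies.decoder) and RDKit (Chem.MolFromSmiles) are installed. The function names, the MoleculeParseError class, and the optional decoder parameter are hypothetical and only there for illustration; the library's internals may differ. Third-party imports are deferred so the error-handling path can be exercised on its own.

```python
from typing import Callable, Optional


class MoleculeParseError(ValueError):
    """Raised when an input string cannot be turned into a valid molecule."""


def selfies_to_smiles(
    selfies_string: str,
    decoder: Optional[Callable[[str], str]] = None,
) -> str:
    """Decode SELFIES to SMILES, mapping every failure to one exception type."""
    text = selfies_string.strip() if isinstance(selfies_string, str) else ""
    if not text:
        raise MoleculeParseError("SELFIES input is empty or not a string")
    if decoder is None:
        import selfies  # third-party; imported lazily

        decoder = selfies.decoder
    try:
        smiles = decoder(text)
    except Exception as exc:
        raise MoleculeParseError(f"could not decode SELFIES {text!r}") from exc
    if not smiles:
        raise MoleculeParseError(f"SELFIES {text!r} decoded to an empty SMILES")
    return smiles


def smiles_to_mol(smiles: str):
    """Parse and sanitize with RDKit; MolFromSmiles returns None on failure."""
    from rdkit import Chem  # third-party; imported lazily

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise MoleculeParseError(f"RDKit rejected SMILES {smiles!r}")
    return mol
```

With both packages installed, smiles_to_mol(selfies_to_smiles("[C][C][O]")) chains the two steps for ethanol. Funnelling all failures into a single exception type means a training loop needs only one except clause to skip a bad sample.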

3. Strict Configuration Contracts

The library uses Pydantic models (RenderConfig, ParserConfig, OutputConfig) to enforce strict data contracts. This ensures that visualization parameters are validated before any heavy computation begins, preventing runtime errors deep in a batch job.
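To make the validate-before-compute idea concrete, here is a dependency-free sketch that uses a frozen standard-library dataclass as a stand-in for a Pydantic model. The class name RenderConfigSketch, its fields, and the MAX_DIMENSION bound are all assumptions made for illustration; the real RenderConfig may expose different fields and limits.

```python
from dataclasses import dataclass

MAX_DIMENSION = 4096  # assumed upper bound, chosen only for this example


@dataclass(frozen=True)
class RenderConfigSketch:
    """Hypothetical stand-in for a validated rendering configuration."""

    width: int = 500
    height: int = 500

    def __post_init__(self) -> None:
        # Runs once at construction, before any molecule is parsed or drawn.
        for name in ("width", "height"):
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(
                    f"{name} must be an int, got {type(value).__name__}"
                )
            if not 1 <= value <= MAX_DIMENSION:
                raise ValueError(
                    f"{name} must be in 1..{MAX_DIMENSION}, got {value}"
                )
```

A bad value such as RenderConfigSketch(width=0) fails at construction, so a batch job that builds its config first stops before rendering anything. A Pydantic model expresses the same constraints declaratively and, by default, also coerces compatible inputs, which this sketch deliberately does not do.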

Usage

The library provides a simple Python API for rendering single molecules or batches of molecules from various string formats.

Results

Type Safety: The codebase runs with strict mypy settings, ensuring type safety across the entire pipeline.
Grid Auto-Fitting: Implemented smart layout algorithms that automatically adjust grid dimensions based on the input batch size.
Format Agnostic: Decouples the parsing logic (SMILES vs. MolBlock vs. SELFIES) from the rendering logic, making it trivial to add support for new proprietary formats.
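As an illustration of the grid auto-fitting item, one simple heuristic is to aim for a roughly square layout capped at a maximum column count. This is a sketch of that heuristic, not necessarily the algorithm the library uses; fit_grid and max_columns are hypothetical names.

```python
import math
from typing import Tuple


def fit_grid(n_items: int, max_columns: int = 4) -> Tuple[int, int]:
    """Return (rows, columns) for a near-square grid holding n_items cells."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    if max_columns < 1:
        raise ValueError("max_columns must be at least 1")
    # Integer ceiling of sqrt(n_items), avoiding floating-point rounding.
    square_side = math.isqrt(n_items - 1) + 1
    columns = min(max_columns, square_side)
    rows = math.ceil(n_items / columns)
    return rows, columns
```

For example, fit_grid(5) gives (2, 3) and fit_grid(10) gives (3, 4). Once the batch is larger than max_columns squared, the column count stays fixed and only the row count grows.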

Why This Matters

For AI research, infrastructure must be bulletproof. When generating 100,000 images for a diffusion model or visualizing the latent space of a chemical LLM, the pipeline must handle edge cases gracefully. This tool bridges the gap between academic scripts and production-grade data pipelines.

Related Work

Visualizing SMILES and SELFIES Strings: walkthrough of the visualization pipeline this library implements
Isomer Dataset Generation: related project generating molecular datasets using SMILES/SELFIES representations
