Computational Chemistry
ChemVLM architecture showing molecular structure and text inputs flowing through vision encoder and language model into multimodal LLM for chemical reasoning

ChemVLM: A Multimodal Large Language Model for Chemistry

A 2025 AAAI paper introducing ChemVLM, a domain-specific multimodal LLM (26B parameters). It achieves state-of-the-art performance on chemical OCR, chemical reasoning, and molecular understanding benchmarks by combining vision and language models trained on curated chemistry data.

Computational Chemistry
Diagram of the Image2InChI architecture showing a SwinTransformer encoder connected to an attention-based feature fusion decoder for converting molecular images to InChI strings.

Image2InChI: SwinTransformer for Molecular Recognition

Proposes Image2InChI, an OCSR model with an improved SwinTransformer encoder and a novel attention-based feature-fusion network, achieving 99.8% InChI accuracy on the BMS dataset.

Computational Chemistry
Architecture diagram of the MarkushGrapher dual-encoder system combining VTL and OCSR encoders for Markush structure recognition.

MarkushGrapher: Multi-modal Markush Structure Recognition

This paper introduces a multi-modal approach for extracting chemical Markush structures from patents, combining a Vision-Text-Layout encoder with a specialized chemical vision encoder. It addresses the lack of training data with a synthetic generation pipeline and introduces M2S, a new real-world benchmark.

Computational Chemistry
Three-stage training pipeline for MolSight showing pretraining, multi-granularity fine-tuning, and RL post-training stages

MolSight: OCSR with RL and Multi-Granularity Learning

MolSight introduces a three-stage training paradigm for Optical Chemical Structure Recognition (OCSR): large-scale pretraining, multi-granularity fine-tuning with auxiliary bond and coordinate prediction tasks, and reinforcement learning post-training with GRPO. It achieves 85.1% stereochemical accuracy on USPTO, recognizing complex stereochemistry such as chiral centers and cis-trans isomers.

Computational Chemistry
ABC-Net detects atom and bond keypoints to reconstruct molecular graphs from images

ABC-Net: Keypoint-Based Molecular Image Recognition

ABC-Net reformulates molecular image recognition as a keypoint detection problem. By predicting atom/bond centers and properties via a single Fully Convolutional Network, it achieves >94% accuracy with high data efficiency.
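The paper's network predicts keypoint heatmaps; the final step of turning detections into a graph can be illustrated with a minimal sketch (toy coordinates, not the paper's actual code), assuming each bond keypoint is the bond's midpoint:

```python
import math

def assemble_graph(atom_points, bond_points):
    """Toy reconstruction in the spirit of ABC-Net: link each detected
    bond keypoint (the bond's center) to its two nearest atom keypoints,
    yielding the molecular graph's edge list."""
    edges = []
    for bx, by in bond_points:
        # Rank atoms by distance to the bond center; the two closest
        # are taken as the bond's endpoints.
        ranked = sorted(range(len(atom_points)),
                        key=lambda i: math.dist((bx, by), atom_points[i]))
        edges.append(tuple(sorted(ranked[:2])))
    return edges

# Hypothetical detections for ethanol's heavy atoms: C-C-O in a row.
atoms = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]   # C, C, O centers
bonds = [(0.5, 0.0), (1.5, 0.0)]               # bond midpoints
print(assemble_graph(atoms, bonds))            # → [(0, 1), (1, 2)]
```

In the real model the same image pass also predicts per-keypoint properties (atom type, bond order), so the graph comes out fully labeled.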

Computational Chemistry
Overview of the ChemPix CNN-LSTM pipeline converting a hand-drawn hydrocarbon sketch to a SMILES string

ChemPix: Hand-Drawn Hydrocarbon Structure Recognition

Proposes a CNN-LSTM architecture that treats chemical structure recognition as an image captioning task. Introduces a synthetic data generation pipeline with augmentation, degradation, and background addition to train models that generalize to hand-drawn inputs without seeing real data during training.

Computational Chemistry
Architecture diagram showing the DECIMER 1.0 transformer pipeline from chemical image input to SELFIES output

DECIMER 1.0: Transformers for Chemical Image Recognition

DECIMER 1.0 introduces a Transformer-based architecture coupled with EfficientNet-B3 to solve Optical Chemical Structure Recognition. By using the SELFIES representation (which guarantees 100% valid output strings) and scaling training to over 35 million molecules, it achieves 96.47% exact match accuracy on synthetic benchmarks, offering an open-source solution for mining chemical data from legacy literature.

Computational Chemistry
Architecture diagram showing Vision Transformer encoder processing image patches and Transformer decoder generating InChI strings

End-to-End Transformer for Molecular Image Captioning

This paper introduces a convolution-free, end-to-end transformer model for molecular image translation. By replacing CNN encoders with Vision Transformers, it achieves a Levenshtein distance of 6.95 on noisy datasets, compared to 7.49 for ResNet50-LSTM baselines.
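The Levenshtein distance reported here is the standard string edit distance; a self-contained sketch of the metric (not the paper's evaluation code) is:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and
    substitutions needed to turn string a into string b, computed by
    dynamic programming with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A single-character error in a predicted InChI costs one edit.
print(levenshtein("InChI=1S/CH4/h1H4", "InChI=1S/CH4/h1H3"))  # → 1
```

Averaged over a test set, lower is better: the paper's 6.95 vs 7.49 means the ViT predictions are, on average, about half an edit closer to the ground-truth InChI than the CNN baseline's.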

Computational Chemistry
Chemical structure diagram representing the ICMDT molecular translation system

ICMDT: Automated Chemical Structure Image Recognition

This paper introduces ICMDT, a Transformer-based architecture for molecular translation (image-to-InChI). By enhancing the TNT block to fuse pixel, small patch, and large patch embeddings, the model achieves superior accuracy on the Bristol-Myers Squibb dataset compared to CNN-RNN and standard Transformer baselines.

Computational Chemistry
Diagram showing a pixelated chemical image passing through a multi-layer encoder to produce a molecular graph with nodes and edges.

Image-to-Graph Transformers for Chemical Structures

This paper proposes an end-to-end deep learning architecture that translates chemical images directly into molecular graphs using a ResNet-Transformer encoder and a graph-aware decoder. It addresses the limitations of SMILES-based approaches by effectively handling non-atomic symbols (abbreviations) and varying drawing styles found in scientific literature.

Computational Chemistry
4-tert-butylphenol molecular structure diagram for Image2SMILES OCSR

Image2SMILES: Transformer OCSR with Synthetic Data Pipeline

A Transformer-based system for optical chemical structure recognition that introduces a comprehensive data generation pipeline (FG-SMILES, Markush structures, visual contamination) and achieves 79% accuracy on real-world images, outperforming rule-based systems such as OSRA.

Computational Chemistry
Bromobenzene molecular structure diagram for MICER OCSR

MICER: Molecular Image Captioning with Transfer Learning

MICER treats optical chemical structure recognition as an image captioning task, using transfer learning with a fine-tuned ResNet encoder and attention-based LSTM decoder to convert molecular images into SMILES strings, reaching 97.54% sequence accuracy on synthetic data and 82.33% on real-world images.
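The "sequence accuracy" used by captioning-style OCSR models like MICER is exact string match between prediction and reference. A minimal sketch of the metric (illustrative, with made-up predictions):

```python
def sequence_accuracy(predictions, references):
    """Fraction of predicted SMILES that exactly match the reference
    string. Note this is stricter than chemical equivalence: the same
    molecule has many valid SMILES, so canonicalizing both sides first
    (e.g. with a cheminformatics toolkit) is common practice."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

preds = ["c1ccccc1", "CCO", "CC(=O)O"]   # hypothetical model outputs
refs  = ["c1ccccc1", "CCO", "CC(=O)N"]   # ground truth
print(round(sequence_accuracy(preds, refs), 2))  # → 0.67
```

The gap between MICER's 97.54% synthetic and 82.33% real-world scores under this metric reflects the usual OCSR domain shift from rendered to scanned/photographed structures.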