Computer-Vision

ChemVLM architecture showing molecular structure and text inputs flowing through vision encoder and language model into multimodal LLM for chemical reasoning

ChemVLM: A Multimodal Large Language Model for Chemistry

A 2025 AAAI paper introducing ChemVLM, a domain-specific multimodal LLM (26B parameters). It achieves state-of-the-art performance on chemical OCR, reasoning benchmarks, and molecular understanding tasks by combining vision and language models trained on curated chemistry data.

Computational Chemistry

Overview of the DECIMER.ai platform combining segmentation, classification, and image-to-SMILES recognition

DECIMER.ai: Optical Chemical Structure Recognition

DECIMER.ai addresses the lack of open tools for Optical Chemical Structure Recognition (OCSR) by providing a comprehensive, deep-learning-based workflow. It features a novel data generation pipeline (RanDepict), a web application, and models for segmentation and recognition that rival or exceed proprietary solutions.

Computational Chemistry

Architecture diagram of the DGAT model showing dual-path decoder with CGFE and SDGLA modules

Dual-Path Global Awareness Transformer (DGAT) for OCSR

Proposes a new architecture (DGAT) to resolve global context loss in chemical structure recognition. Introduces Cascaded Global Feature Enhancement and Sparse Differential Global-Local Attention, achieving 84.0% BLEU-4 and handling complex chiral structures implicitly.

Computational Chemistry

Diagram showing the DECIMER hand-drawn OCSR pipeline from hand-drawn chemical structure image through EfficientNetV2 encoder and Transformer decoder to predicted SMILES output

Enhanced DECIMER for Hand-Drawn Structure Recognition

This paper presents an enhanced deep learning architecture for Optical Chemical Structure Recognition (OCSR) specifically optimized for hand-drawn inputs. By pairing an EfficientNetV2 encoder with a Transformer decoder and training on over 150 million synthetic images, the model achieves 73.25% exact match accuracy on a real-world hand-drawn benchmark of 5,088 images.

Computational Chemistry

Diagram of the Image2InChI architecture showing a SwinTransformer encoder connected to an attention-based feature fusion decoder for converting molecular images to InChI strings.

Image2InChI: SwinTransformer for Molecular Recognition

Proposes Image2InChI, an OCSR model with improved SwinTransformer encoder and novel feature fusion network with attention mechanisms that achieves 99.8% InChI accuracy on the BMS dataset.

Computational Chemistry

Architecture diagram of the MarkushGrapher dual-encoder system combining VTL and OCSR encoders for Markush structure recognition.

MarkushGrapher: Multi-modal Markush Structure Recognition

This paper introduces a multi-modal approach for extracting chemical Markush structures from patents, combining a Vision-Text-Layout encoder with a specialized chemical vision encoder. It addresses the lack of training data with a synthetic generation pipeline and introduces M2S, a new real-world benchmark.

Computational Chemistry

Diagram of the MMSSC-Net architecture showing the SwinV2 encoder and GPT-2 decoder pipeline for molecular image recognition

MMSSC-Net: Multi-Stage Sequence Cognitive Networks

MMSSC-Net introduces a multi-stage cognitive approach for OCSR, utilizing a SwinV2 encoder and GPT-2 decoder to recognize atomic and bond sequences. It achieves 75-98% accuracy across benchmark datasets by handling varying image resolutions and noise through fine-grained perception of atoms and bonds.

Computational Chemistry

Pipeline diagram showing keypoint detection, supergraph construction, and GNN classification for molecular structure recognition

MolGrapher: Graph-based Chemical Structure Recognition

MolGrapher introduces a three-stage pipeline (keypoint detection, supergraph construction, GNN classification) for recognizing chemical structures from images. It achieves 91.5% accuracy on USPTO by treating molecules as graphs, and introduces the USPTO-30K benchmark.

Computational Chemistry

Overview of the MolMole pipeline showing ViDetect, ViReact, and ViMore processing document pages to extract molecules and reactions.

MolMole: Unified Vision Pipeline for Molecule Mining

MolMole unifies molecule detection, reaction parsing, and structure recognition into a single vision-based pipeline, achieving top performance on a newly introduced 550-page benchmark by processing full documents without external layout parsers.

Computational Chemistry

Overview of the MolScribe encoder-decoder architecture predicting atoms with coordinates and bonds from a molecular image.

MolScribe: Robust Image-to-Graph Molecular Recognition

MolScribe reformulates molecular recognition as an image-to-graph generation task, explicitly predicting atom coordinates and bonds to better handle stereochemistry and abbreviated structures compared to image-to-SMILES baselines.

Computational Chemistry

Three-stage training pipeline for MolSight showing pretraining, multi-granularity fine-tuning, and RL post-training stages

MolSight: OCSR with RL and Multi-Granularity Learning

MolSight introduces a three-stage training paradigm for Optical Chemical Structure Recognition (OCSR), utilizing large-scale pretraining, multi-granularity fine-tuning with auxiliary bond and coordinate prediction tasks, and reinforcement learning (GRPO) to achieve 85.1% stereochemical accuracy on USPTO, recognizing complex stereochemical structures like chiral centers and cis-trans isomers.

Computational Chemistry

ABC-Net detects atom and bond keypoints to reconstruct molecular graphs from images

ABC-Net: Keypoint-Based Molecular Image Recognition

ABC-Net reformulates molecular image recognition as a keypoint detection problem. By predicting atom/bond centers and properties via a single Fully Convolutional Network, it achieves >94% accuracy with high data efficiency.