Computational Chemistry
Bar chart showing vision language model performance across chemistry tasks including equipment identification, molecule matching, spectroscopy, and laboratory safety

MaCBench: Multimodal Chemistry and Materials Benchmark

MaCBench evaluates frontier vision language models across 1,153 chemistry and materials science tasks spanning data extraction, experimental execution, and data interpretation, uncovering fundamental limitations in spatial reasoning and cross-modal integration.

Molecular Representations
Bar chart showing MolBERT ablation: combining MLM, PhysChem, and SMILES equivalence tasks gives best improvement

MolBERT: Auxiliary Tasks for Molecular BERT Models

MolBERT pre-trains a BERT model on SMILES strings using masked language modeling, SMILES equivalence, and physicochemical property prediction as auxiliary tasks, achieving state-of-the-art results on virtual screening and QSAR benchmarks.

Predictive Chemistry
Diagram showing ULMFiT-style three-stage pipeline adapted for molecular property prediction

MolPMoFiT: Inductive Transfer Learning for QSAR

MolPMoFiT applies ULMFiT-style transfer learning to QSAR modeling, pre-training an AWD-LSTM on one million ChEMBL molecules and fine-tuning for property prediction on small datasets.
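The ULMFiT recipe that MolPMoFiT adapts fine-tunes with a slanted triangular learning-rate schedule and discriminative per-layer rates. A minimal sketch of both ingredients, using illustrative hyperparameters rather than the paper's exact settings:

```python
# Sketch of two ULMFiT fine-tuning ingredients reused by MolPMoFiT-style
# pipelines. Constants (lr_max, cut_frac, ratio, decay) are illustrative.

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule: short linear warm-up to lr_max,
    then a long linear decay over the remaining steps."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(base_lr, n_layers, decay=2.6):
    """Discriminative fine-tuning: each earlier layer gets the next
    layer's learning rate divided by `decay` (2.6 in ULMFiT)."""
    return [base_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]

schedule = [slanted_triangular_lr(t, 1000) for t in range(1000)]
print(max(schedule))              # peaks at lr_max at the cut point
print(discriminative_lrs(0.01, 4))
```

The third ULMFiT ingredient, gradual unfreezing, simply adds one more trainable layer group per epoch and is omitted here.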

Molecular Representations
Encoder-decoder architecture diagram for translating chemical names between English and Chinese with performance comparison bar chart

Neural Machine Translation of Chemical Nomenclature

This paper applies character-level CNN and LSTM encoder-decoder networks to translate chemical names between English and Chinese, comparing them against an existing rule-based tool.

Computational Chemistry
Conceptual diagram showing natural language prompts flowing into code generation for chemistry tasks

NLP Models That Automate Programming for Chemistry

Hocky and White argue that NLP models capable of generating code from natural language prompts will fundamentally alter how chemists interact with scientific software, reducing barriers to computational research and reshaping programming pedagogy.

Molecular Generation
Horizontal bar chart showing REINVENT 4 unified framework supporting seven generative model types

REINVENT 4: Open-Source Generative Molecule Design

Overview of REINVENT 4, an open-source generative molecular design framework from AstraZeneca that unifies RNN- and transformer-based generators under reinforcement learning, transfer learning, and curriculum learning optimization.

Molecular Representations
Diagram showing SMILES string flowing through encoder to fixed-length fingerprint vector and back through decoder

Seq2seq Fingerprint: Unsupervised Molecular Embedding

A GRU-based sequence-to-sequence model that learns fixed-length molecular fingerprints by training the network to reconstruct each input SMILES string from its encoded representation, enabling unsupervised representation learning for drug discovery tasks.
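The encoder half of such a model can be sketched as a single-layer GRU that consumes token ids and returns its final hidden state as the fixed-length fingerprint. Everything below (the character vocabulary, the dimensions, the random weights) is illustrative; in the actual method these weights are trained jointly with a decoder that must reconstruct the input SMILES:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUEncoder:
    """Minimal single-layer GRU encoder: maps a token-id sequence to a
    fixed-length vector (the 'fingerprint'). Weights are random here,
    not trained."""
    def __init__(self, vocab_size, emb_dim=16, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.emb = rng.normal(0, 0.1, (vocab_size, emb_dim))
        # Stacked weights for the update (z), reset (r), candidate gates
        self.W = rng.normal(0, 0.1, (3, hidden_dim, emb_dim))
        self.U = rng.normal(0, 0.1, (3, hidden_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def encode(self, token_ids):
        h = np.zeros(self.hidden_dim)
        for t in token_ids:
            x = self.emb[t]
            z = sigmoid(self.W[0] @ x + self.U[0] @ h)
            r = sigmoid(self.W[1] @ x + self.U[1] @ h)
            h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h))
            h = (1 - z) * h + z * h_tilde
        return h  # same length regardless of the input SMILES length

# Hypothetical character vocabulary, for illustration only
vocab = {ch: i for i, ch in enumerate("C(=O)N1cno")}
enc = GRUEncoder(vocab_size=len(vocab))
fp = enc.encode([vocab[c] for c in "CC(=O)N"])
print(fp.shape)  # (32,)
```

The key property the sketch demonstrates is that SMILES of any length map to the same fixed-size vector, which is what makes the hidden state usable as a fingerprint for downstream tasks.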

Molecular Representations
Bar chart comparing binding affinity scores across SMILES, AIS, and SMI+AIS hybrid tokenization strategies

SMI+AIS: Hybridizing SMILES with Environment Tokens

Proposes SMI+AIS, a hybrid molecular representation combining standard SMILES tokens with chemical-environment-aware Atom-In-SMILES tokens, demonstrating improved molecular generation for drug design targets.
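The "standard SMILES tokens" side of such a hybrid scheme is commonly produced with a regular expression that keeps bracket atoms, two-letter elements, and ring-closure labels intact. A sketch of that baseline tokenizer, assuming a regex in the style widely used in the SMILES-modeling literature (the AIS tokens, which additionally encode each atom's chemical environment, are not reproduced here):

```python
import re

# Atom-level SMILES tokenization: bracket atoms stay whole, two-letter
# elements (Cl, Br, ...) are matched before single letters, and %NN
# ring closures are kept as one token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}"
    r"|[BCNOPSFIbcnops]|[=#\-\+\(\)\.\\/:~@\?\*\$]|\d)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: every character must land in exactly one token
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize_smiles("CC(=O)Nc1ccc(O)cc1"))
print(tokenize_smiles("C[C@@H](N)Cl"))
```

Alternation order matters: putting `Cl` before the single-letter class prevents chlorine from splitting into `C` + `l`, and the bracket alternative keeps stereo-annotated atoms like `[C@@H]` as one token.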

Molecular Representations
Diagram showing Transformer encoder-decoder architecture converting SMILES strings into molecular fingerprints

SMILES Transformer: Low-Data Molecular Fingerprints

A Transformer-based encoder-decoder pre-trained on 861K SMILES from ChEMBL24 produces 1024-dimensional molecular fingerprints that outperform ECFP and graph convolutions on 5 of 10 MoleculeNet tasks in low-data settings.

Predictive Chemistry
Bar chart comparing SMILES2Vec and Graph Conv scores across five MoleculeNet tasks

SMILES2Vec: Interpretable Chemical Property Prediction

SMILES2Vec is a deep RNN that learns chemical features directly from SMILES strings using a Bayesian-optimized CNN-GRU architecture. It matches graph convolution baselines on toxicity and activity prediction, and its explanation mask identifies chemically meaningful functional groups with 88% accuracy.

Molecular Representations
Bar chart showing SMILES Pair Encoding reduces mean sequence length from 40 to 6 tokens

SPE: Data-Driven SMILES Substructure Tokenization

Introduces SMILES Pair Encoding (SPE), a data-driven tokenization algorithm that learns high-frequency SMILES substrings from ChEMBL to produce shorter, chemically interpretable token sequences for deep learning.
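SPE adapts byte pair encoding to SMILES: starting from atom-level tokens, it repeatedly merges the most frequent adjacent token pair seen in a training corpus. A minimal pure-Python sketch of that merge loop on a toy corpus, using single characters in place of proper atom-level base tokens:

```python
from collections import Counter

def learn_spe_merges(corpus, n_merges=10):
    """BPE-style learner: repeatedly merge the most frequent adjacent
    token pair across the corpus. Characters stand in for the atom-level
    base tokens the real SPE algorithm starts from."""
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:          # stop once no pair repeats
            break
        merges.append(a + b)
        new_seqs = []
        for seq in seqs:       # greedy left-to-right application
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

corpus = ["CC(=O)Nc1ccc(O)cc1", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
merges, tokenized = learn_spe_merges(corpus, n_merges=5)
print(merges)        # learned multi-character substructure tokens
print(tokenized[1])  # shorter sequence than the raw character string
```

Even on this toy corpus the learner picks up aromatic-carbon runs first, illustrating how SPE yields shorter sequences whose tokens correspond to recurring substructures. Trained on ChEMBL, the same procedure produces the chemically interpretable vocabulary the paper reports.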

Computational Chemistry
Bar chart showing scientific LLM taxonomy across five modalities: textual, molecular, protein, genomic, and multimodal

Survey of Scientific LLMs in Bio and Chem Domains

This survey systematically reviews scientific LLMs (Sci-LLMs) across five modalities (textual, molecular, protein, genomic, and multimodal), analyzing architectures, datasets, evaluation methods, and open challenges for AI-driven scientific discovery.