Language Models as 3D Chemical Structure Generators

This is a method paper demonstrating that transformer-based language models can generate molecules, crystalline materials, and protein binding sites directly in three dimensions by training on sequences derived from standard chemical file formats (XYZ, CIF, PDB). The key contribution is showing that unmodified autoregressive language models, trained only with next-token prediction, achieve performance comparable to domain-specific 3D generative models that build in SE(3) equivariance and other geometric inductive biases.

Beyond Graphs and Strings: The Need for 3D Chemical Generation

Molecular design with deep learning has largely relied on two representation paradigms: molecular graphs (processed with graph neural networks) and linearized string representations like SMILES and SELFIES (processed with sequence models). Both approaches have proven effective for drug-like organic molecules, but they share a fundamental limitation: they cannot represent structures whose identity depends on 3D spatial arrangement.

Crystalline materials, for example, have periodic lattice structures that cannot be reduced to simple graphs. Protein binding sites are defined by the 3D arrangement of hundreds of atoms across multiple residues. For tasks like catalysis design or structure-based drug discovery, the geometric positions of atoms are essential information that graphs and strings discard entirely.

Existing 3D generative models address this gap but typically require specialized architectures with SE(3) equivariance to handle rotational and translational symmetries. This work asks whether the general-purpose sequence modeling capability of transformers is sufficient to learn 3D chemical structure distributions without any domain-specific architectural modifications.

Direct Tokenization of Chemical File Formats

The core insight is straightforward: any 3D molecule, crystal, or biomolecule is already stored as text in standard file formats (XYZ, CIF, PDB). These files encode atom types and their Cartesian coordinates as sequences of characters and numbers. Rather than designing specialized architectures for point cloud generation, the authors simply tokenize these files and train a standard GPT-style transformer to predict the next token.

A molecule with $n$ atoms is represented as:

$$ \mathcal{M} = (e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n) $$

where $e_i$ is the element type and $(x_i, y_i, z_i)$ are Cartesian coordinates. Crystals additionally include lattice parameters:

$$ \mathcal{C} = (\ell_a, \ell_b, \ell_c, \alpha, \beta, \gamma, e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n) $$

Protein binding sites use residue-atom indicators (e.g., HIS-C, CYS-N) instead of bare element symbols:

$$ \mathcal{P} = (a_1, x_1, y_1, z_1, \dots, a_n, x_n, y_n, z_n) $$

The language model learns the joint distribution over token sequences $t_1, \dots, t_N$ via the standard autoregressive factorization:

$$ p(t_1, \dots, t_N) = \prod_{i=1}^{N} p(t_i \mid t_1, \dots, t_{i-1}) $$

Two tokenization strategies are explored:

  1. Character-level (LM-CH): Every character in the file is a token, including digits, minus signs, spaces, and newlines. This produces long sequences but uses a small vocabulary (~30 tokens).
  2. Atom+coordinate-level (LM-AC): Each atom placement requires exactly 4 tokens: one element/residue token and three coordinate tokens (e.g., ‘-1.98’). The vocabulary is larger (~100-10K tokens) but sequences are shorter.
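
The two strategies can be illustrated on a single small molecule. The sketch below (pure Python; `tokenize_atom_level` is a hypothetical helper name, not from the paper) shows how an LM-AC-style tokenizer would emit one element token plus three coordinate tokens per atom:

```python
def tokenize_atom_level(atoms, precision=2):
    """LM-AC-style tokenization: one element token plus three coordinate
    tokens per atom, with coordinates rounded to a fixed precision."""
    tokens = []
    for elem, (x, y, z) in atoms:
        tokens.append(elem)
        for c in (x, y, z):
            tokens.append(f"{c:.{precision}f}")  # e.g. '-0.47' is a single token
    return tokens

# Water as (element, (x, y, z)) records, coordinates in angstroms.
water = [("O", (0.0, 0.0, 0.117)),
         ("H", (0.0, 0.757, -0.467)),
         ("H", (0.0, -0.757, -0.467))]
print(tokenize_atom_level(water))
# Character-level (LM-CH) tokenization would instead split every character,
# e.g. list("O 0.00 0.00 0.12") -> ['O', ' ', '0', '.', '0', '0', ...]
```

With 3 atoms the LM-AC sequence has 12 tokens, while a character-level encoding of the same content is several times longer; this is the vocabulary-size-versus-sequence-length trade-off described above.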

Numerical precision is controlled by rounding coordinates to 1, 2, or 3 decimal places. Since the model lacks rotation and translation invariance, random rotation augmentation during training improves performance.
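
Because invariances must be learned from data, a fresh random rotation is applied to each training structure every epoch. A minimal NumPy sketch of such an augmentation step (not the authors' code; the QR-based sampler is one standard way to draw a random rotation):

```python
import numpy as np

def random_rotation_matrix(rng):
    # Random orthogonal matrix via QR of a Gaussian matrix, sign-corrected
    # to make the factorization unique, then forced to det = +1
    # (a proper rotation, not a reflection).
    A = rng.standard_normal((3, 3))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

def augment(coords, rng):
    """Rotate an (n, 3) coordinate array about its centroid."""
    center = coords.mean(axis=0)
    return (coords - center) @ random_rotation_matrix(rng).T + center

rng = np.random.default_rng(0)
coords = np.array([[0.0, 0.0, 0.117],
                   [0.0, 0.757, -0.467],
                   [0.0, -0.757, -0.467]])
rotated = augment(coords, rng)  # same shape, interatomic distances preserved
```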

Experiments Across Molecules, Crystals, and Protein Binding Sites

Molecular Generation (ZINC)

The model is evaluated on 250K commercially available molecules from the ZINC dataset, with an average of 23 heavy atoms. XYZ files are generated using RDKit’s conformer tools. Coordinates use 2 decimal places of precision. The authors generate 10K molecules and evaluate both 3D geometry quality and standard generative metrics.

For 3D geometry assessment, the root mean squared deviation (RMSD) between language-model-generated and RDKit-generated conformers is computed: most molecules fall between 1.0 and 2.0 Å RMSD, with a heavy tail extending to 4.0 Å.
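
The RMSD computation itself is standard: center both conformers, find the optimal rotation with the Kabsch algorithm, then average the residual deviations. A self-contained NumPy sketch (the paper's exact tooling is not specified; RDKit provides equivalent built-in alignment):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) conformers after optimal rigid superposition.
    Assumes a one-to-one atom correspondence between P and Q."""
    P = P - P.mean(axis=0)                  # remove translation
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)       # Kabsch: SVD of the covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1)))
```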

Standard generative metrics include validity, uniqueness, novelty, and the Wasserstein (earth mover's) distance, abbreviated WA, between generated and training distributions of molecular properties (QED, SA score, molecular weight).

| Model | 3D | Valid (%) | Unique (%) | Novel (%) | WA MW | WA SA | WA QED |
|---|---|---|---|---|---|---|---|
| Train | No | 100.0 | 100.0 | 100.0 | 0.816 | 0.013 | 0.002 |
| SM-LM | No | 98.35 | 100.0 | 100.0 | 3.640 | 0.049 | 0.005 |
| SF-LM | No | 100.0 | 100.0 | 100.0 | 3.772 | 0.085 | 0.006 |
| JTVAE | No | 100.0 | 98.56 | 100.0 | 22.63 | 0.126 | 0.023 |
| ENF | Yes | 1.05 | 96.37 | 99.72 | 168.5 | 1.886 | 0.160 |
| G-SchNet | Yes | 1.20 | 55.96 | 98.33 | 152.7 | 1.126 | 0.185 |
| EDM | Yes | 77.51 | 96.40 | 95.30 | 101.2 | 0.939 | 0.093 |
| LM-CH | Yes | 90.13 | 100.0 | 100.0 | 3.912 | 2.608 | 0.077 |
| LM-AC | Yes | 98.51 | 100.0 | 100.0 | 1.811 | 0.026 | 0.004 |

The atom+coordinate tokenization model (LM-AC) achieves 98.51% validity with 100% uniqueness and novelty. Its WA scores for molecular weight (1.811) and QED (0.004) are substantially better than all other 3D generative baselines and competitive with SMILES/SELFIES language models. The character-level model (LM-CH) at 90.13% validity performs comparably to graph-based models but falls short of the string-based language models.

Crystal Generation (Perov-5 and MP-20)

Crystal generation uses CIF-derived sequences with 3 decimal places of precision. Two datasets are used: Perov-5 (18,928 perovskite materials, 5 atoms per unit cell, 56 elements) and MP-20 (45,231 diverse materials, 1-20 atoms per unit cell, 89 elements).

Evaluation metrics include structural validity (minimum interatomic distance > 0.5 angstrom), compositional validity (charge neutrality via SMACT), coverage (recall and precision between generated and test sets), and earth mover’s distance for density and number of unique elements.
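
The structural validity criterion can be checked directly from a generated lattice and fractional coordinates. A minimal NumPy sketch (not the paper's implementation) that also accounts for nearest periodic images:

```python
import numpy as np
from itertools import product

def structurally_valid(lattice, frac_coords, min_dist=0.5):
    """True if no two atoms, including periodic images in adjacent cells,
    are closer than min_dist angstroms.
    lattice: (3, 3) matrix of row lattice vectors; frac_coords: (n, 3)."""
    cart = frac_coords @ lattice
    shift_cells = np.array(list(product((-1, 0, 1), repeat=3)))
    shifts = shift_cells @ lattice              # 27 translations incl. zero
    nonzero = np.any(shift_cells != 0, axis=1)
    for i in range(len(cart)):
        for j in range(len(cart)):
            d = np.linalg.norm(cart[j] + shifts - cart[i], axis=1)
            if i == j:
                d = d[nonzero]                  # skip an atom's distance to itself
            if d.min() < min_dist:
                return False
    return True
```

Checking only the 26 adjacent cells suffices as long as min_dist is much smaller than the lattice vectors, which holds for the 0.5 angstrom threshold used here.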

| Data | Model | Struc. Valid (%) | Comp. Valid (%) | COV-R (%) | COV-P (%) | WA density | WA elements |
|---|---|---|---|---|---|---|---|
| Perov-5 | CDVAE | 100.0 | 98.59 | 99.45 | 98.46 | 0.126 | 0.063 |
| Perov-5 | LM-CH | 100.0 | 98.51 | 99.60 | 99.42 | 0.071 | 0.036 |
| Perov-5 | LM-AC | 100.0 | 98.79 | 98.78 | 99.36 | 0.089 | 0.028 |
| MP-20 | CDVAE | 100.0 | 86.70 | 99.15 | 99.49 | 0.688 | 1.432 |
| MP-20 | LM-CH | 84.81 | 83.55 | 99.25 | 97.89 | 0.864 | 0.132 |
| MP-20 | LM-AC | 95.81 | 88.87 | 99.60 | 98.55 | 0.696 | 0.092 |

On Perov-5, both language models outperform CDVAE across most metrics. On the more diverse MP-20 dataset, LM-AC achieves the best scores on 3 of 6 metrics and remains competitive on the others. LM-CH struggles more with structural validity on MP-20 (84.81%).

Protein Binding Site Generation (PDB)

The most challenging task involves generating protein binding sites (~200-250 atoms each) from PDB-derived sequences. The dataset contains approximately 180K protein-ligand pairs. Residue-atom tokenization is used (e.g., CYS-C, CYS-N), with 2 decimal places of precision.

Validity is assessed per-residue using xyz2mol, with an additional check for inter-residue atomic overlap (atoms from different residues closer than the minimum bond distance). Approximately 99% of generated pockets pass the residue validity check, while about 5% fail the overlap check. Of generated pockets, 89.8% have unique residue orderings, and 83.6% have novel orderings not seen in training, indicating the model is generating novel binding site structures rather than memorizing.
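
The inter-residue overlap check reduces to a pairwise distance test restricted to atoms in different residues. A pure-Python sketch; the 1.0 Å cutoff stands in for the paper's "minimum bond distance" and is an assumption here:

```python
import math

def has_inter_residue_overlap(atoms, min_bond_dist=1.0):
    """atoms: list of (residue_index, element, (x, y, z)) records.
    Flags any pair of atoms from *different* residues that sit closer than
    min_bond_dist angstroms; bonded atoms within a residue are exempt."""
    for i, (res_i, _, pos_i) in enumerate(atoms):
        for res_j, _, pos_j in atoms[i + 1:]:
            if res_i != res_j and math.dist(pos_i, pos_j) < min_bond_dist:
                return True
    return False
```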

Competitive 3D Generation Without Geometric Inductive Biases

The central finding is that standard transformer language models, without any equivariance or geometric inductive biases, can generate valid 3D chemical structures across three substantially different domains. The atom+coordinate tokenization (LM-AC) consistently outperforms character-level tokenization (LM-CH), likely because it produces shorter sequences and reduces the number of sequential decisions needed per atom placement.

Several limitations are worth noting. The model generates atoms using absolute Cartesian coordinates, which means it must learn rotation and translation invariance purely from data augmentation rather than having it built into the architecture. The authors acknowledge this becomes increasingly difficult as structure size grows. The vocabulary size also scales with coordinate precision and structure complexity, which could become prohibitive for very large systems.

The paper does not include computational cost comparisons with baseline models, making it difficult to assess the practical tradeoff between the simplicity of the language modeling approach and the efficiency of specialized architectures. The authors also note that further validation through computational simulation and experiment is needed to confirm the physical plausibility of generated structures.

Future directions identified include inverse design of molecules and materials conditioned on target properties, extension to more complex structures (metal-organic frameworks), and exploration of alternative tokenization strategies to handle larger systems.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training/Eval | ZINC | 250K molecules | ~23 heavy atoms avg; XYZ files via RDKit conformer generation |
| Training/Eval | Perov-5 | 18,928 perovskites | 5 atoms/unit cell, 56 elements |
| Training/Eval | MP-20 | 45,231 materials | 1-20 atoms/unit cell, 89 elements |
| Training/Eval | Protein binding sites | ~180K protein-ligand pairs | Processed to 200-250 atoms per pocket |

Algorithms

  • Architecture: GPT-style transformer with ~1M to 100M parameters
  • Layers: 12
  • Embedding size: 128 to 1024
  • Attention heads: 4 to 12
  • Batch size: 4 to 32 structures
  • Learning rate: $10^{-4}$ to $10^{-5}$, decayed to $9 \times 10^{-6}$
  • Data augmentation: Random rotation of training structures at each epoch
  • Numerical precision: 2 decimal places (molecules, proteins), 3 decimal places (crystals)

Models

No pre-trained model weights are publicly available. The paper mentions “Example code can be found at” but the URL appears to be missing from the published version.

Evaluation

| Metric | Domain | Description |
|---|---|---|
| Validity | Molecules | xyz2mol produces a valid RDKit Mol object |
| Validity | Crystals | Structural (min distance > 0.5 angstrom) and compositional (charge neutral) |
| Uniqueness | All | Fraction of distinct generated structures |
| Novelty | All | Fraction not in training set |
| Earth mover’s distance | All | Distribution match for domain-specific properties |
| RMSD | Molecules | Deviation from RDKit conformer geometries |
| Coverage | Crystals | Recall and precision between generated and test sets |

Hardware

Models were trained on Compute Canada computing systems. Specific GPU types, counts, and training times are not reported.

Artifacts

No public code repository, model weights, or datasets specific to this work were found. The ZINC, Perov-5, and MP-20 datasets used for evaluation are publicly available from their original sources.


Paper Information

Citation: Flam-Shepherd, D. & Aspuru-Guzik, A. (2023). Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files. arXiv preprint arXiv:2305.05708.

@article{flamshepherd2023language,
  title={Language models can generate molecules, materials, and protein binding sites directly in three dimensions as {XYZ}, {CIF}, and {PDB} files},
  author={Flam-Shepherd, Daniel and Aspuru-Guzik, Al{\'a}n},
  journal={arXiv preprint arXiv:2305.05708},
  year={2023}
}