MAT: Graph-Augmented Transformer for Molecules (2020)

A Graph-Augmented Transformer for Molecular Property Prediction

This is a Method paper that proposes the Molecule Attention Transformer (MAT), a Transformer-based architecture adapted for molecular property prediction. The primary contribution is a modified self-attention mechanism that incorporates inter-atomic distances and molecular graph structure alongside the standard query-key attention. Combined with self-supervised pretraining on 2 million molecules from ZINC15, MAT achieves competitive performance across seven diverse molecular property prediction tasks while requiring minimal hyperparameter tuning.

Challenges in Deep Learning for Molecular Properties

Predicting molecular properties is central to drug discovery and materials design, yet deep neural networks have struggled to consistently outperform shallow methods like random forests and SVMs on these tasks. Wu et al. (2018) demonstrated through the MoleculeNet benchmark that graph neural networks do not reliably beat classical models. Two recurring problems compound this:

Underfitting: Graph neural networks tend to underfit training data, with performance failing to scale with model complexity (Ishiguro et al., 2019).
Hyperparameter sensitivity: Deep models for molecule property prediction require extensive hyperparameter search (often 500+ configurations) to achieve competitive results, making them impractical for many practitioners.

Concurrent work explored using vanilla Transformers on SMILES string representations of molecules (Honda et al., 2019; Wang et al., 2019), but these approaches discard the explicit structural information encoded in molecular graphs and 3D conformations. The motivation for MAT is to combine the flexibility of the Transformer architecture with domain-specific inductive biases from molecular structure.

Molecule Self-Attention: Combining Attention, Distance, and Graph Structure

The core innovation is the Molecule Self-Attention layer, which replaces standard Transformer self-attention. In a standard Transformer, head $i$ computes:

$$ \mathcal{A}^{(i)} = \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) \mathbf{V}_{i} $$

MAT augments this with two additional information sources. Let $\mathbf{A} \in {0, 1}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the molecular graph adjacency matrix and $\mathbf{D} \in \mathbb{R}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the inter-atomic distance matrix. The modified attention becomes:

$$ \mathcal{A}^{(i)} = \left(\lambda_{a} \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) + \lambda_{d}, g(\mathbf{D}) + \lambda_{g}, \mathbf{A}\right) \mathbf{V}_{i} $$

where $\lambda_{a}$, $\lambda_{d}$, and $\lambda_{g}$ are scalar hyperparameters weighting each component, and $g$ is either a row-wise softmax or an element-wise exponential decay $g(d) = \exp(-d)$.

Key architectural details:

Atom embedding: Each atom is represented as a 26-dimensional vector encoding atomic identity (one-hot over B, N, C, O, F, P, S, Cl, Br, I, dummy, other), number of heavy neighbors, number of hydrogens, formal charge, ring membership, and aromaticity.
Dummy node: An artificial disconnected node (distance $10^{6}$ from all atoms) is added to each molecule, allowing the model to “skip” attention heads when no relevant pattern exists, similar to how BERT uses the separation token.
3D conformers: Distance matrices are computed from RDKit-generated 3D conformers using the Universal Force Field (UFF).
Pretraining: Node-level masked atom prediction on 2 million ZINC15 molecules (following Hu et al., 2019), where 15% of atom features are masked and the model predicts them.

Benchmark Evaluation and Ablation Studies

Experimental setup

MAT is evaluated on seven molecular property prediction datasets spanning regression and classification:

Dataset	Task	Size	Metric	Split
FreeSolv	Regression (hydration free energy)	642	RMSE	Random
ESOL	Regression (log solubility)	1,128	RMSE	Random
BBBP	Classification (BBB permeability)	2,039	ROC AUC	Scaffold
Estrogen-alpha	Classification (receptor activity)	2,398	ROC AUC	Scaffold
Estrogen-beta	Classification (receptor activity)	1,961	ROC AUC	Scaffold
MetStab-high	Classification (metabolic stability)	2,127	ROC AUC	Random
MetStab-low	Classification (metabolic stability)	2,127	ROC AUC	Random

Baselines include GCN, Weave, EAGCN, Random Forest (RF), and SVM. Each model receives the same hyperparameter search budget (150 or 500 evaluations). Results are averaged over 6 random train/validation/test splits.

Main results

MAT achieves the best average rank across all seven tasks:

Model	Avg. Rank (500 budget)	Avg. Rank (150 budget)
MAT	2.42	2.71
RF	3.14	3.14
SVM	3.57	3.28
GCN	3.57	3.71
Weave	3.71	3.57
EAGCN	4.14	4.14

With self-supervised pretraining, Pretrained MAT achieves an average rank of 1.57, outperforming both Pretrained EAGCN (4.0) and SMILES Transformer (4.29). Pretrained MAT requires tuning only the learning rate (7 values tested), compared to 500 hyperparameter combinations for the non-pretrained models.

Ablation results

Ablation studies on BBBP, ESOL, and FreeSolv reveal:

Variant	BBBP (AUC)	ESOL (RMSE)	FreeSolv (RMSE)
MAT (full)	.723	.286	.250
- Graph	.716	.316	.276
- Distance	.729	.281	.281
- Attention	.692	.306	.329
- Dummy node	.714	.317	.249
+ Edge features	.683	.314	.358

Removing any single component degrades performance on at least one task, supporting the value of combining all three information sources. Adding edge features does not help, suggesting the adjacency and distance matrices already capture sufficient bond-level information.

Interpretability analysis

Individual attention heads in the first layer learn chemically meaningful functions. Six heads were identified that focus on specific chemical patterns: 2-neighbored aromatic carbons, sulfur atoms, non-ring nitrogens, carbonyl oxygens, 3-neighbored aromatic atoms (substitution positions), and aromatic ring nitrogens. Statistical validation using Kruskal-Wallis tests confirmed that atoms matching these SMARTS patterns receive significantly higher attention weights ($p < 0.001$ for all patterns).

Findings, Limitations, and Future Directions

MAT demonstrates that augmenting Transformer self-attention with molecular graph structure and 3D distance information produces a model that performs consistently well across diverse property prediction tasks. The key practical finding is that self-supervised pretraining dramatically reduces the hyperparameter tuning burden: Pretrained MAT matches or exceeds the performance of extensively tuned models while requiring only learning rate selection.

Several limitations are acknowledged:

Fingerprint-based models still win on some tasks: RF and SVM with extended-connectivity fingerprints outperform MAT on metabolic stability and Estrogen-beta tasks, suggesting that incorporating fingerprint representations could improve MAT further.
Single conformer: Only one pre-computed 3D conformer is used per molecule. More sophisticated conformer sampling or ensemble strategies were not explored.
Limited pretraining exploration: Only the masked atom prediction task from Hu et al. (2019) was used. The authors note that exploring additional pretraining objectives is a promising direction.
Scalability: The pretrained model uses 1024-dimensional embeddings with 8 layers and 16 attention heads, fitting the largest model that fits in GPU memory.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Pretraining	ZINC15	2M molecules	Sampled from ZINC database
Evaluation	FreeSolv	642	Hydration free energy regression
Evaluation	ESOL	1,128	Log solubility regression
Evaluation	BBBP	2,039	Blood-brain barrier classification
Evaluation	Estrogen-alpha/beta	2,398 / 1,961	Receptor activity classification
Evaluation	MetStab-high/low	2,127 each	Metabolic stability classification

Algorithms

Optimizer: Adam with Noam learning rate scheduler (warmup then inverse square root decay)
Pretraining: 8 epochs, learning rate 0.001, batch size 256, binary cross-entropy loss
Fine-tuning: 100 epochs, batch size 32, learning rate selected from {1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6}
Distance kernel: exponential decay $g(d) = \exp(-d)$ for pretrained model
Lambda weights: $\lambda_{a} = \lambda_{d} = 0.33$ for pretrained model

Models

Pretrained MAT: 1024-dim embeddings, 8 layers, 16 attention heads, 1 feed-forward layer per block
Dropout: 0.0, weight decay: 0.0 for pretrained model
Atom featurization: 26-dimensional one-hot encoding (Table 1 in paper)

Evaluation

Regression: RMSE (FreeSolv, ESOL)
Classification: ROC AUC (BBBP, Estrogen-alpha/beta, MetStab-high/low)
All experiments repeated 6 times with different train/validation/test splits
Scaffold split for BBBP, Estrogen, random split for others

Hardware

The paper does not specify exact hardware details. The pretrained model is described as “the largest model that still fits the GPU memory.”

Artifacts

Artifact	Type	License	Notes
gmum/MAT	Code	MIT	Official implementation with pretrained weights

Paper Information

Citation: Maziarka, Ł., Danel, T., Mucha, S., Rataj, K., Tabor, J., & Jastrzębski, S. (2020). Molecule Attention Transformer. arXiv preprint arXiv:2002.08264.

@article{maziarka2020molecule,
  title={Molecule Attention Transformer},
  author={Maziarka, {\L}ukasz and Danel, Tomasz and Mucha, S{\l}awomir and Rataj, Krzysztof and Tabor, Jacek and Jastrz{\k{e}}bski, Stanis{\l}aw},
  journal={arXiv preprint arXiv:2002.08264},
  year={2020}
}

A Graph-Augmented Transformer for Molecular Property Prediction#

Challenges in Deep Learning for Molecular Properties#

Molecule Self-Attention: Combining Attention, Distance, and Graph Structure#

Benchmark Evaluation and Ablation Studies#

Experimental setup#

Main results#

Ablation results#

Interpretability analysis#

Findings, Limitations, and Future Directions#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#