This group covers models designed to predict molecular or crystal properties from chemical string representations. It includes SMILES-based QSAR architectures, transfer learning approaches, multitask prediction, hybrid prediction-generation models, and text-based crystal property prediction. Also included are studies using language model perplexity as an intrinsic molecular scoring method and evaluations of how well language models capture complex molecular distributions.

| Paper | Year | Approach | Key Idea |
|---|---|---|---|
| SMILES2Vec | 2017 | CNN-GRU | Interpretable property prediction from raw SMILES embeddings |
| Transformer-CNN | 2020 | Transformer + CNN | Transformer SMILES embeddings with CNN for interpretable QSAR |
| MolPMoFiT | 2020 | Transfer learning | ULMFiT-style inductive transfer for QSAR on small datasets |
| Maxsmi | 2021 | CNN/RNN | SMILES augmentation improves CNN and RNN property prediction |
| Perplexity Ranking | 2022 | LM scoring | Perplexity scores rank molecules and detect pretraining bias |
| LM Distributions | 2022 | RNN LM | RNN language models capture complex molecular distributions |
| MTL-BERT | 2022 | BERT | Multitask pretraining with SMILES enumeration augmentation |
| Regression Transformer | 2023 | Transformer | Unifies property prediction and conditional generation in one model |
| LLM-Prop | 2025 | T5 | Crystal property prediction from text descriptions |
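To illustrate the perplexity-scoring idea from the table, here is a minimal, hypothetical sketch: a character-level bigram language model over SMILES strings, used to rank candidate molecules by perplexity (lower means more "in-distribution" for the training corpus). The actual perplexity-ranking work uses trained neural LMs; the bigram model, smoothing constants, and toy corpus below are illustrative assumptions only.

```python
import math
from collections import defaultdict

def train_bigram(smiles_corpus):
    # Count character bigrams, with ^ and $ as start/end markers.
    counts = defaultdict(lambda: defaultdict(int))
    for s in smiles_corpus:
        seq = "^" + s + "$"
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def perplexity(counts, smiles, alpha=1.0, vocab_size=40):
    # Add-alpha smoothing over an assumed SMILES character vocabulary size.
    seq = "^" + smiles + "$"
    log_prob = 0.0
    for a, b in zip(seq, seq[1:]):
        total = sum(counts[a].values())
        p = (counts[a][b] + alpha) / (total + alpha * vocab_size)
        log_prob += math.log(p)
    # Perplexity = exp of average negative log-likelihood per bigram.
    return math.exp(-log_prob / (len(seq) - 1))

# Toy corpus of small aliphatic molecules.
corpus = ["CCO", "CCC", "CCN", "CCCC", "CCOC"]
model = train_bigram(corpus)

# Rank candidates: the aromatic ring is out-of-distribution for this corpus,
# so it gets the highest perplexity and ranks last.
ranked = sorted(["CCO", "c1ccccc1", "CCCC"], key=lambda s: perplexity(model, s))
print(ranked)
```

The same ranking principle underlies pretraining-bias detection: molecules whose perplexity is systematically low under a pretrained LM reveal what the training distribution favors.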
