Mol2vec: Unsupervised ML with Chemical Intuition

Word2vec Meets Cheminformatics

The Mol2vec paper is a method paper that introduces an unsupervised approach for learning dense vector representations of molecular substructures. The core idea is a direct analogy to Word2vec from natural language processing: molecular substructures (derived from the Morgan algorithm) are treated as “words,” and entire molecules are treated as “sentences.” By training on a large unlabeled corpus of 19.9 million compounds, Mol2vec produces embeddings in which chemically related substructures occupy nearby regions of vector space. Compound-level vectors are then obtained by summing the constituent substructure vectors, and these can serve as features for downstream supervised learning tasks.

Sparse Fingerprints and Their Limitations

Molecular fingerprints, particularly Morgan fingerprints (extended-connectivity fingerprints, ECFP), are among the most widely used molecular representations in cheminformatics. They perform well for similarity searching, virtual screening, and activity prediction. However, they suffer from several practical drawbacks:

High dimensionality and sparsity: Morgan fingerprints are typically hashed to fixed-length binary vectors (e.g., 2048 or 4096 bits), resulting in very sparse representations.
Bit collisions: The hashing step can map distinct substructures to the same bit position, losing structural information.
No learned relationships: Each bit is independent, so the representation does not encode any notion of chemical similarity between substructures.
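
The bit-collision problem can be illustrated with a small sketch. It assumes simple modulo folding of integer identifiers into a fixed-length vector; real fingerprint implementations may fold differently, and the identifier values below are made up for illustration.

```python
def fold_to_bits(identifiers, n_bits=2048):
    """Fold integer substructure identifiers into a fixed-length bit vector.

    Assumption: folding is a plain modulo operation. Returns the bit vector
    and a dict of bit positions that received more than one distinct identifier.
    """
    bits = [0] * n_bits
    positions = {}
    # dict.fromkeys removes duplicate identifiers while keeping input order
    for ident in dict.fromkeys(identifiers):
        pos = ident % n_bits
        bits[pos] = 1
        positions.setdefault(pos, []).append(ident)
    collisions = {p: ids for p, ids in positions.items() if len(ids) > 1}
    return bits, collisions


# Hypothetical identifiers: 10 and 2058 are distinct but share bit 10 at 2048 bits.
bits, collisions = fold_to_bits([10, 2058, 37])
```

Three distinct substructures end up setting only two bits here, so a downstream model cannot tell the two colliding substructures apart.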

At the time of this work (2017), NLP techniques had started to appear in cheminformatics. The tf-idf method had been applied to Morgan fingerprints for compound-protein interaction prediction, and Latent Dirichlet Allocation had been used for chemical topic modeling. The Word2vec concept had been adapted for protein sequences (ProtVec) but had not yet been applied to small molecules. Mol2vec fills this gap.

From Substructure Identifiers to Dense Embeddings

The central insight of Mol2vec is that the Morgan algorithm already produces a natural “vocabulary” of molecular substructures, and the order in which these substructures appear in a molecule provides local context, analogous to word order in a sentence.

Corpus Construction

The training corpus was assembled from ZINC v15 and ChEMBL v23, merged and deduplicated, then filtered by molecular weight (12-600), heavy atom count (3-50), clogP (-5 to 7), and allowed elements (H, B, C, N, O, F, P, S, Cl, Br). This yielded 19.9 million compounds.
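
The filtering step can be sketched as a simple predicate. The record layout (`mol_weight`, `heavy_atoms`, `clogp`, `elements`) is hypothetical and assumes the properties have already been computed with a cheminformatics toolkit; treating the ranges as inclusive is also an assumption.

```python
# Allowed elements as listed in the corpus description.
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br"}


def passes_filters(record):
    """Return True if a precomputed property record satisfies all corpus filters.

    Assumption: range boundaries are inclusive.
    """
    return (
        12 <= record["mol_weight"] <= 600
        and 3 <= record["heavy_atoms"] <= 50
        and -5 <= record["clogp"] <= 7
        and set(record["elements"]) <= ALLOWED_ELEMENTS
    )


# Hypothetical records for illustration only.
drug_like = {"mol_weight": 180.2, "heavy_atoms": 13, "clogp": 1.2,
             "elements": ["C", "H", "O"]}
contains_silicon = {"mol_weight": 250.0, "heavy_atoms": 15, "clogp": 2.0,
                    "elements": ["C", "H", "Si"]}
too_heavy = {"mol_weight": 750.0, "heavy_atoms": 52, "clogp": 4.0,
             "elements": ["C", "H", "N", "O"]}
```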

Sentence Generation

For each molecule, the Morgan algorithm generates atom identifiers at radius 0 and radius 1. Each atom contributes two identifiers (one per radius), ordered according to the atom order in the canonical SMILES. This sequence of identifiers forms a “sentence” for Word2vec training.
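
A minimal sketch of the ordering described above. It assumes the radius-0 and radius-1 identifiers have already been computed per atom (for example with a toolkit such as RDKit) and are given in canonical-SMILES atom order; the integer values are placeholders, not real Morgan identifiers.

```python
def build_sentence(radius0_ids, radius1_ids):
    """Interleave per-atom identifiers: atom 0 (r0, r1), atom 1 (r0, r1), ...

    Both inputs are lists indexed by atom position. Identifiers are converted
    to strings because Word2vec implementations typically expect string tokens.
    """
    if len(radius0_ids) != len(radius1_ids):
        raise ValueError("expected one identifier per atom at each radius")
    sentence = []
    for r0, r1 in zip(radius0_ids, radius1_ids):
        sentence.append(str(r0))
        sentence.append(str(r1))
    return sentence


# Placeholder identifiers for a three-atom molecule.
sentence = build_sentence([101, 102, 103], [9001, 9002, 9003])
```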

Word2vec Training

The model was trained using the gensim implementation of Word2vec. After evaluating both CBOW and Skip-gram architectures with window sizes of 5, 10, and 20, and embedding dimensions of 100 and 300, the best configuration was:

Architecture: Skip-gram
Window size: 10
Embedding dimension: 300
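
To make the role of the window size concrete, the sketch below enumerates the (center, context) pairs that a Skip-gram model is trained to predict. It is a simplification: it uses the full window for every position and omits details that practical implementations such as gensim may apply (for example randomly shrunk windows, subsampling, and negative sampling).

```python
def skipgram_pairs(sentence, window=10):
    """List (center, context) training pairs for a Skip-gram model.

    Every token within `window` positions on either side of the center
    token counts as context.
    """
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# Toy sentence with a window of 1 for readability.
toy_pairs = skipgram_pairs(["a", "b", "c"], window=1)
```

With a window of 10 and two identifiers per atom, each identifier sees context from roughly five neighboring atoms on either side in SMILES order.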

Rare identifiers appearing fewer than 3 times in the corpus were replaced with a special “UNSEEN” token, which learns a near-zero vector. This allows the model to handle novel substructures at inference time.
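
The rare-identifier replacement can be sketched as a preprocessing pass over the corpus. The function name and the in-memory list-of-lists corpus are illustrative assumptions; a 19.9-million-compound corpus would likely be streamed from disk instead.

```python
from collections import Counter


def replace_rare(sentences, min_count=3, unseen="UNSEEN"):
    """Replace identifiers occurring fewer than `min_count` times corpus-wide."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [
        [tok if counts[tok] >= min_count else unseen for tok in sent]
        for sent in sentences
    ]


# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
toy_corpus = [["a", "b"], ["a", "c"], ["a", "b"]]
cleaned = replace_rare(toy_corpus)
```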

Compound Vector Generation

The final vector for a molecule is the sum of all its substructure vectors:

$$\mathbf{v}_{\text{mol}} = \sum_{i=1}^{N} \mathbf{v}_{s_i}$$

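where $\mathbf{v}_{s_i}$ is the 300-dimensional embedding of the $i$-th substructure identifier in the molecule and $N$ is the number of identifiers in its sentence. Because the vectors are summed rather than averaged, a substructure that occurs several times contributes several times, so substructure counts and importance are captured implicitly in the magnitude of the compound vector.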
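
The summation can be sketched in plain Python. The embedding table here is a toy 3-dimensional dictionary rather than the trained 300-dimensional model, and falling back to the “UNSEEN” vector for unknown identifiers follows the handling of rare identifiers described earlier.

```python
def compound_vector(sentence, embeddings, unseen="UNSEEN"):
    """Sum the embedding vectors of all identifiers in a molecular sentence.

    Identifiers missing from `embeddings` fall back to the UNSEEN vector.
    """
    dim = len(embeddings[unseen])
    total = [0.0] * dim
    for tok in sentence:
        vec = embeddings.get(tok, embeddings[unseen])
        for k in range(dim):
            total[k] += vec[k]
    return total


# Toy embeddings for illustration only.
toy_embeddings = {
    "a": [1.0, 0.0, 2.0],
    "b": [0.0, 1.0, 0.0],
    "UNSEEN": [0.0, 0.0, 0.0],
}
vec = compound_vector(["a", "a", "b", "never_seen"], toy_embeddings)
```

The repeated identifier "a" contributes twice, which is how occurrence counts show up in the final vector.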

Benchmarking Across Regression and Classification Tasks

Datasets

The authors evaluated Mol2vec on four datasets:

Dataset	Task	Size	Description
ESOL	Regression	1,144	Aqueous solubility prediction
Ames	Classification	6,471	Mutagenicity (balanced: 3,481 positive, 2,990 negative)
Tox21	Classification	8,192	12 human toxicity targets (imbalanced)
Kinase	Classification	284 kinases	Bioactivity from ChEMBL v23

Machine Learning Methods

Three ML methods were compared using both Mol2vec and Morgan FP features:

Random Forest (RF): scikit-learn, 500 estimators
Gradient Boosting Machine (GBM): XGBoost, 2000 estimators, max depth 3, learning rate 0.1
Deep Neural Network (DNN): Keras/TensorFlow, 4 hidden layers with 2000 neurons each for Mol2vec; 1 hidden layer with 512 neurons for Morgan FP

All models were validated using 20x 5-fold cross-validation with the Wilcoxon signed-rank test for statistical comparison.
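
The repeated cross-validation protocol can be sketched as an index generator. This is a plain shuffled k-fold under assumed details: whether the original splits were stratified, and how they were seeded, is not stated here. The resulting paired per-split scores for two feature sets would then go into a Wilcoxon signed-rank test.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=20, seed=0):
    """Yield (repeat, train_indices, test_indices) for repeated k-fold CV.

    Each repeat reshuffles the samples and partitions them into `n_splits`
    disjoint folds; every fold serves once as the test set.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
            yield repeat, train, test


# 20 repeats x 5 folds gives 100 train/test splits.
splits = list(repeated_kfold(n_samples=23))
```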

ESOL Regression Results

Features	Method	$R^2_{\text{ext}}$	MSE	MAE
Descriptors	MLR	0.81 +/- 0.01	0.82	0.69
Molecular Graph	CNN	0.93	0.31 +/- 0.03	0.40 +/- 0.00
Morgan FP	GBM	0.66 +/- 0.00	1.43 +/- 0.00	0.88 +/- 0.00
Mol2vec	GBM	0.86 +/- 0.00	0.62 +/- 0.00	0.60 +/- 0.00

Mol2vec substantially outperformed Morgan FP ($R^2_{\text{ext}}$ 0.86 vs. 0.66) but did not match the best graph convolution methods ($R^2_{\text{ext}}$ ~0.93).
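
For reference, the three reported regression metrics can be computed as below. The sketch assumes the ordinary coefficient of determination evaluated on held-out predictions; the exact definition of $R^2_{\text{ext}}$ used in the paper may differ (some external-validation variants use the training-set mean instead).

```python
def regression_metrics(y_true, y_pred):
    """Return (R^2, MSE, MAE) for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    r2 = 1.0 - ss_res / ss_tot
    return r2, mse, mae


# Toy values, not taken from the ESOL dataset.
r2, mse, mae = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```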

Classification Results (Ames and Tox21)

On the Ames dataset, Mol2vec and Morgan FP performed comparably (AUC 0.87 vs. 0.88), both matching or exceeding prior SVM and Naive Bayes results. On Tox21, both achieved an average AUC of 0.83, outperforming literature results from graph convolution (0.71) and DNN/SVM approaches (0.71-0.72).
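
The AUC values quoted above are areas under the ROC curve. A minimal pairwise formulation is sketched below: the AUC equals the fraction of (positive, negative) pairs in which the positive compound receives the higher score, with ties counted as half. This quadratic-time version is for illustration only; library implementations use a sort-based algorithm.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` contains 0/1 class labels, `scores` the predicted scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

For a multi-target dataset such as Tox21, an average AUC would be the mean of this quantity over the individual targets.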

Proteochemometric (PCM) Extension

Mol2vec was combined with ProtVec (protein sequence embeddings using the same Word2vec approach on 3-grams) by concatenating vectors, forming PCM2vec. This was evaluated using a rigorous 4-level cross-validation scheme:

CV1: New compound-target pairs
CV2: New targets
CV3: New compounds
CV4: New compounds and targets
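
The difference between these levels can be sketched as a split over (compound, target) pairs. The function below is an illustrative assumption rather than the authors' procedure: CV2, CV3, and CV4 are expressed by choosing which compounds and/or targets are held out, and for CV4 the pairs that involve exactly one held-out entity are discarded. CV1 would instead hold out randomly chosen pairs.

```python
def split_pairs(pairs, held_out_compounds=frozenset(), held_out_targets=frozenset()):
    """Split (compound, target) pairs into train and test sets.

    - only compounds held out: CV3-style (new compounds)
    - only targets held out:   CV2-style (new targets)
    - both held out:           CV4-style (new compounds and new targets);
      pairs with exactly one held-out entity are dropped (assumption)
    """
    train, test = [], []
    for compound, target in pairs:
        c_out = compound in held_out_compounds
        t_out = target in held_out_targets
        if not c_out and not t_out:
            train.append((compound, target))
        elif (c_out or not held_out_compounds) and (t_out or not held_out_targets):
            test.append((compound, target))
        # otherwise the pair mixes seen and unseen entities and is skipped
    return train, test


# Hypothetical compound and target identifiers.
pairs = [("c1", "t1"), ("c1", "t2"), ("c2", "t1"), ("c2", "t2")]
cv3_train, cv3_test = split_pairs(pairs, held_out_compounds={"c2"})
cv4_train, cv4_test = split_pairs(pairs, held_out_compounds={"c2"},
                                  held_out_targets={"t2"})
```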

On Tox21, PCM2vec improved predictions for new compound-target pairs (CV1: AUC 0.87 vs. 0.79 for Morgan FP) and new compounds (CV3: AUC 0.85 vs. 0.78). On the kinase dataset, PCM2vec approached the performance of classical PCM (Morgan + z-scales) while being alignment-independent, meaning it can be applied to proteins with low sequence similarity.
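
A minimal sketch of how a PCM2vec feature vector could be assembled. The 3-gram splitting follows the ProtVec convention of non-overlapping 3-grams read at three offsets, which is an assumption here since the text above only says “3-grams”; the sequence and vector values are toy placeholders.

```python
def three_grams(sequence, offset=0):
    """Split a protein sequence into non-overlapping 3-grams from `offset`."""
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


def pcm_vector(compound_vec, protein_vec):
    """Concatenate a compound vector and a protein vector into one feature vector."""
    return list(compound_vec) + list(protein_vec)


# Toy sequence read at the three possible offsets.
grams = [three_grams("MKTAYIAK", offset=k) for k in range(3)]

# Toy low-dimensional vectors; the real embeddings are much larger.
features = pcm_vector([0.1, 0.2, 0.3], [1.0, 2.0])
```

Because the protein side is built from short sequence fragments rather than aligned positions, no multiple sequence alignment is needed, which is what makes the approach alignment-independent.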

Chemical Intuition and Practical Value

Embedding Quality

The learned substructure embeddings capture meaningful chemical relationships. Hierarchical clustering of the 25 most common substructures shows expected groupings: aromatic carbons cluster together, aliphatic ring carbons form a separate group, and carbonyl carbons and oxygens are closely related. Similarly, t-SNE projections of amino acid vectors encoded by Mol2vec reproduce known amino acid relationships (e.g., similar distances between Glu/Gln and Asp/Asn pairs, reflecting the carboxylic acid to amide transition).
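
Closeness between embeddings is commonly measured with cosine similarity, which could also serve as the basis for a clustering like the one described above. The sketch below uses made-up 2-dimensional vectors; which distance measure the authors used for their clustering is not stated here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Made-up vectors: the first two point in similar directions, the third does not.
aromatic_c_1 = [1.0, 0.1]
aromatic_c_2 = [0.9, 0.2]
carbonyl_o = [0.0, 1.0]
sim_close = cosine_similarity(aromatic_c_1, aromatic_c_2)
sim_far = cosine_similarity(aromatic_c_1, carbonyl_o)
```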

Key Findings

Skip-gram with a window size of 10 and 300-dimensional embeddings gave the best Mol2vec representations among the configurations tested, in line with common Word2vec practice in NLP.
Mol2vec excels at regression tasks, substantially outperforming Morgan FP on ESOL solubility prediction ($R^2_{\text{ext}}$ 0.86 vs. 0.66).
Classification performance is competitive with Morgan FP across Ames and Tox21 datasets.
PCM2vec enables alignment-independent proteochemometrics, extending PCM approaches to diverse protein families with low sequence similarity.
Tree-based methods (RF, GBM) outperformed DNNs on these tasks, though the authors note further DNN tuning could help.

Limitations

The compound vector is a simple sum of substructure vectors, which discards information about substructure arrangement and molecular topology.
Only Morgan identifiers at radii 0 and 1 were used. Larger radii might capture more context but would increase vocabulary size.
DNN architectures were not extensively optimized, leaving open the question of how well Mol2vec pairs with deep learning.
The approach was benchmarked head-to-head only against Morgan FP. Comparisons with other learned representations, such as graph convolution models, rely on literature values rather than a controlled evaluation under the same protocol.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Pre-training	ZINC v15 + ChEMBL v23	19.9M compounds	Filtered by MW, atom count, clogP, element types
Evaluation	ESOL	1,144 compounds	Aqueous solubility regression
Evaluation	Ames	6,471 compounds	Mutagenicity classification
Evaluation	Tox21	8,192 compounds	12 toxicity targets, retrieved via DeepChem
Evaluation	Kinase (ChEMBL v23)	284 kinases	IC50/Kd/Ki binding assays
Protein corpus	UniProt	554,241 sequences	For ProtVec training

Algorithms

Word2vec: Skip-gram, window size 10, 300-dimensional embeddings, min count 3
Morgan algorithm: Radii 0 and 1 (119 and 19,831 unique identifiers respectively)
UNSEEN token: Replaces identifiers occurring fewer than 3 times
Compound vector: Sum of all substructure vectors

Models

RF: scikit-learn, 500 estimators, sqrt features, balanced class weights
GBM: XGBoost, 2000 estimators, max depth 3, learning rate 0.1
DNN: Keras/TensorFlow, 4 layers x 2000 neurons (Mol2vec) or 1 layer x 512 neurons (Morgan FP), ReLU activation, dropout 0.1
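
The settings above can be collected in plain dictionaries. The RF and GBM keys are chosen to match the constructor arguments of scikit-learn's random forest estimators and XGBoost's scikit-learn wrapper as far as I know them, so they could be passed as keyword arguments; the DNN keys are hypothetical names for illustration and do not correspond to a Keras API.

```python
# Random forest settings (keys intended for scikit-learn's RandomForestClassifier;
# class_weight applies to classification only).
RF_PARAMS = {
    "n_estimators": 500,
    "max_features": "sqrt",
    "class_weight": "balanced",
}

# Gradient boosting settings (keys intended for XGBoost's scikit-learn wrapper).
GBM_PARAMS = {
    "n_estimators": 2000,
    "max_depth": 3,
    "learning_rate": 0.1,
}

# DNN settings per feature type; key names are hypothetical.
DNN_PARAMS = {
    "mol2vec": {"hidden_layers": [2000] * 4, "activation": "relu", "dropout": 0.1},
    "morgan_fp": {"hidden_layers": [512], "activation": "relu", "dropout": 0.1},
}
```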

Evaluation

Metric	Mol2vec Best	Morgan FP Best	Task
$R^2_{\text{ext}}$	0.86 (GBM)	0.66 (GBM)	ESOL regression
AUC	0.87 (RF)	0.88 (RF)	Ames classification
AUC	0.83 (RF)	0.83 (RF)	Tox21 classification

Hardware

Not specified in the paper.

Artifacts

Artifact	Type	License	Notes
mol2vec	Code	BSD-3-Clause	Python package with pre-trained model

Paper Information

Citation: Jaeger, S., Fulle, S., & Turk, S. (2018). Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. Journal of Chemical Information and Modeling, 58(1), 27-35. https://doi.org/10.1021/acs.jcim.7b00616

@article{jaeger2018mol2vec,
  title={Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition},
  author={Jaeger, Sabrina and Fulle, Simone and Turk, Samo},
  journal={Journal of Chemical Information and Modeling},
  volume={58},
  number={1},
  pages={27--35},
  year={2018},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.7b00616}
}
