Welcome back to our series of posts on alkane isomer classification using the Coulomb matrix eigenvalues! We’re replicating the methods used in the paper “Can One Hear the Shape of a Molecule (from its Coulomb Matrix Eigenvalues)?”, and in this final part, we’ll be using the Coulomb matrix eigenvalues to classify alkane constitutional isomers using supervised learning.

If you haven’t read the previous posts, I recommend you do so before continuing.

In this post, I thought I’d take a slight deviation from the paper and only replicate some of their supervised learning algorithms. As they’re just using a set of baseline models, I thought it’d be enough to show how to replicate their setup with two baselines (k-nearest neighbors and logistic regression), allowing the interested reader to expand the analysis with more models by themselves. Specifically, I’m omitting:

  • Tree-based models (random forests, decision trees, gradient boosting trees)
  • Support vector machines

Beyond that, I replicate their approach directly by performing a 5-fold cross-validation with stratified sampling. Let’s get started!

Reload the Data

We’ll reload our data, this time only reloading $n=4$ to $n=11$, omitting alkanes without constitutional isomers.

import re 
from glob import glob 

import numpy as np


# Load the eigenvalue spectra generated in the previous posts.
# Each file 'spectra/CnH2n+2_<j>.npy' holds the spectra for constitutional isomer j.
spectra = {}
for n in range(4, 12):
    spectra[n] = {}

    for f in glob(f'spectra/C{n}H{2*n + 2}_*.npy'):
        j = int(re.search(rf'C{n}H{2*n + 2}_(\d+)\.npy', f).group(1))
        spectra[n][j] = np.load(f)

    print(f'Loaded {len(spectra[n])}k spectra for C{n}H{2*n + 2} with {spectra[n][0].shape[1]} dimensions')

You should see the following output:

Loaded 2k spectra for C4H10 with 14 dimensions
Loaded 3k spectra for C5H12 with 17 dimensions
Loaded 5k spectra for C6H14 with 20 dimensions
Loaded 9k spectra for C7H16 with 23 dimensions
Loaded 18k spectra for C8H18 with 26 dimensions
Loaded 35k spectra for C9H20 with 29 dimensions
Loaded 75k spectra for C10H22 with 32 dimensions
Loaded 159k spectra for C11H24 with 35 dimensions

where the number of dimensions is the number of eigenvalues we used to represent the Coulomb matrix (equal to the number of atoms in the molecule). We have 1000 examples of each alkane isomer, so the integer before the k is the number of constitutional isomers for that alkane, i.e., the number of classes we need to distinguish.
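
As a quick sanity check (a small sketch of my own, assuming the spectra dictionary loaded above), we can confirm this for butane, which has two constitutional isomers (n-butane and isobutane):

# Sanity check for C4H10: two isomer classes, each stored as a (1000, 14) array
# of Coulomb matrix eigenvalues (4 C + 10 H = 14 atoms).
print(len(spectra[4]))        # 2 isomer classes
print(spectra[4][0].shape)    # (1000, 14): 1000 examples, 14 eigenvalues each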

We’ll also write ourselves a little helper function for extracting the data and labels from the dictionary.

def prep_data(n: int):
    """Stack the spectra for CnH2n+2 into a feature matrix X and a label vector y of isomer indices."""
    X = []
    y = []
    for j, s in sorted(spectra[n].items(), key=lambda x: x[0]):
        n_samples, _ = s.shape
        X.append(s)
        y.append(np.full(n_samples, j))
    return np.concatenate(X), np.concatenate(y)
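
For example, for butane this should give us 2000 samples of 14 eigenvalues each, labelled by isomer index:

X, y = prep_data(n=4)
print(X.shape, y.shape)   # (2000, 14) and (2000,): 2 isomers x 1000 examples, 14 eigenvalues each
print(np.unique(y))       # the isomer indices, i.e. the class labels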

k-Nearest Neighbors (k-NN)

We’ll start with k-nearest neighbors (k-NN) as our first baseline model. It’s a simple algorithm that classifies a sample based on the majority class of its $k$ nearest neighbors. In this way, it’s a non-parametric model, meaning it doesn’t make any assumptions about the underlying data distribution. The flip side is that it can be computationally expensive, especially for large datasets or high-dimensional data.
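
To make the decision rule concrete, here is a minimal NumPy sketch of what k-NN does under the hood (illustrative only; below we use scikit-learn’s optimized KNeighborsClassifier):

import numpy as np

def knn_predict(X_train, y_train, x_query, k=1):
    """Classify a single query point by majority vote among its k nearest training points."""
    # Euclidean distances from the query to every training sample
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Labels of the k closest training samples
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # Majority vote (ties broken in favor of the smallest label)
    labels, counts = np.unique(nearest_labels, return_counts=True)
    return labels[np.argmax(counts)]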

We will exhibit the code for $k=1$ (1-NN), but will run the model for $k=1, 3, 5$ (as was done in the paper). This can be achieved by varying the n_neighbors parameter in the KNeighborsClassifier class.

Full Dimensionality

To begin, we’ll use the full dimensionality of the Coulomb matrix eigenvalues.

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

df_1nn = []

for n in range(4, 12):

    # Prepare the data for CnH2n+2
    X, y = prep_data(n=n)

    # Create knn classifier
    knn = KNeighborsClassifier(n_neighbors=1)

    # Set up stratified 5-fold cross-validation
    cv = StratifiedKFold(n_splits=5)

    # Perform cross-validation. Since 'cross_val_score' computes accuracy, we compute misclassification rate by subtracting accuracy from 1.
    acc_scores = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')

    # Convert accuracy scores to misclassification error rates
    misclassification_error_rates = 1 - acc_scores

    # Calculate the average and standard deviation of the misclassification error rates
    avg_misclassification_error = np.mean(misclassification_error_rates)
    std_misclassification_error = np.std(misclassification_error_rates)

    print(f'C{n}H{2*n + 2}: {avg_misclassification_error:.2%} ± {std_misclassification_error:.2%}')
    
    df_1nn.append({
        'molecule': f'C{n}H{2*n + 2}', 
        'avg_misclassification_error': avg_misclassification_error, 
        'std_misclassification_error': std_misclassification_error,
        'n': n,
        'representation': 'full',
        'model': '1nn',
    })

You should see as output something like:

C4H10: 0.00% ± 0.00%
C5H12: 0.00% ± 0.00%
C6H14: 0.00% ± 0.00%
C7H16: 0.07% ± 0.05%
C8H18: 0.11% ± 0.05%
C9H20: 0.51% ± 0.09%
C10H22: 1.31% ± 0.09%
C11H24: 3.24% ± 0.09%

This is the average misclassification error rate and its standard deviation for each alkane isomer. We can see that the error rate increases as the number of atoms in the molecule increases, which is expected from what we saw in the unsupervised learning post. Nevertheless, we see perfect classification for all alkanes smaller than $C_7H_{16}$.

As a brief aside, I experimented with scaling the data before fitting the k-NN model, but this had highly detrimental effects (understandably, as the magnitudes of the Coulomb matrix eigenvalues carry information about their importance). You can zero-center the data, but this won’t affect classification performance for the k-NN model, since it only shifts every point by the same constant and leaves pairwise distances unchanged. You should not scale the data in this instance (if you don’t believe me, validate my results by scaling the data and running the model again, as sketched below).
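
If you want to verify this yourself, a minimal sketch (my own addition, not from the paper) is to swap the bare classifier inside the loop above for a pipeline with a scaler in front:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardizing each eigenvalue throws away the magnitude information and hurts accuracy;
# StandardScaler(with_std=False) only zero-centers, which leaves pairwise distances
# (and hence the k-NN predictions) unchanged.
knn = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])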

Top 10 Eigenvalues

We can restrict to the top 10 eigenvalues by adding a line right after the call to prep_data:

# prune to top-10 features
X = X[:, :10]

and changing our log to:

'representation': 'top',

instead of 'representation': 'full'.

I get the following:

C4H10: 0.00% ± 0.00%
C5H12: 0.00% ± 0.00%
C6H14: 0.00% ± 0.00%
C7H16: 0.07% ± 0.05%
C8H18: 0.12% ± 0.06%
C9H20: 0.52% ± 0.09%
C10H22: 1.34% ± 0.03%
C11H24: 3.58% ± 0.08%

We see that the performance is almost identical to the full representation, which is interesting. If anything, it’s slightly worse for the larger alkanes. We should bear in mind that this is for the 1-NN model, and we might see different results for $k > 1$.

PCA-Reduced

As a further experiment beyond the original paper, I wanted to see what the results would be if we first used PCA to reduce the dimensionality of the Coulomb matrix eigenvalues. To achieve this, import PCA from sklearn.decomposition and Pipeline from sklearn.pipeline, and replace the KNeighborsClassifier instantiation with:

# Create pipeline with PCA and knn classifier
knn = Pipeline([
    ('pca', PCA(n_components=10)),
    ('knn', KNeighborsClassifier(n_neighbors=1))
])

and change the log to:

'representation': 'pca',

instead of 'representation': 'full'.

We keep 10 principal components so the PCA-reduced representation has the same dimensionality as the top-10-eigenvalues representation. Running it, I get:

C4H10: 0.00% ± 0.00%
C5H12: 0.00% ± 0.00%
C6H14: 0.00% ± 0.00%
C7H16: 0.07% ± 0.05%
C8H18: 0.12% ± 0.06%
C9H20: 0.51% ± 0.10%
C10H22: 1.33% ± 0.08%
C11H24: 3.39% ± 0.06%

Plotting the Different Representations

To more easily compare the different representations, we can plot the results.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Collect the logged results into a DataFrame for plotting
df_1nn = pd.DataFrame(df_1nn)

plt.figure(figsize=(12, 6), dpi=150)

# make line plot with error bars
# - x-axis: n
# - y-axis: avg_misclassification_error
# - style: representation
# - error bars: std_misclassification_error (drawn manually, since lineplot has no y-error argument)
sns.lineplot(data=df_1nn, x='n', y='avg_misclassification_error', hue='representation', style='representation', marker='o')
for i, row in df_1nn.iterrows():
    plt.errorbar(row['n'], row['avg_misclassification_error'], yerr=row['std_misclassification_error'], color='black', alpha=0.5)

plt.xlabel('Number of carbon atoms')
plt.ylabel('Misclassification error')
plt.title('Misclassification error of 1-NN classifier for alkane classification')
plt.savefig('alkane-classification-1nn.webp', bbox_inches='tight')

Misclassification error of 1-NN classifier for alkane classification

Some observations:

  • For the 1-NN model, the performance is almost identical for the full, top 10, and PCA-reduced representations. Only for $n=11$ do we see a slight difference, which suggests the full representation is slightly better than the reduced ones. The PCA-reduced representation also seems a slightly better basis than the top 10 eigenvalues, which suggests PCA is capturing something meaningful about the data.
  • The performance is perfect for $n=4, 5, 6$, and then degrades as the number of atoms in the molecule increases. This is expected: the Coulomb matrix eigenvalues are less informative for larger molecules, resulting in more heterogeneity in how the data clusters. Since k-NN is a non-parametric model that doesn’t “learn” in the traditional sense, it cannot adapt to this increased heterogeneity. In this way, the results mirror the unsupervised learning results from the previous post.

k-NN with $k > 1$

Running similar code for $k=3, 5$, I find similar results to the 1-NN model.
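
A minimal sketch of how I ran this (df_knn is just an illustrative name of mine) is to wrap the cross-validation loop above in a loop over $k$, with only n_neighbors and the model tag changing:

df_knn = []

for k in (1, 3, 5):
    for n in range(4, 12):
        # Same setup as the 1-NN loop above
        X, y = prep_data(n=n)
        knn = KNeighborsClassifier(n_neighbors=k)
        cv = StratifiedKFold(n_splits=5)
        errors = 1 - cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
        df_knn.append({
            'molecule': f'C{n}H{2*n + 2}',
            'avg_misclassification_error': np.mean(errors),
            'std_misclassification_error': np.std(errors),
            'n': n,
            'representation': 'full',
            'model': f'{k}nn',
        })

df_knn = pd.DataFrame(df_knn)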

Misclassification error of 3-NN classifier for alkane classification

Misclassification error of 5-NN classifier for alkane classification

The trends are basically identical, barring some shifts along the y-axis. We can highlight these shifts by plotting the full representation for each $k$.
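
A sketch of that plot, reusing the hypothetical df_knn frame from the sketch above:

plt.figure(figsize=(12, 6), dpi=150)
sns.lineplot(data=df_knn, x='n', y='avg_misclassification_error', hue='model', style='model', marker='o')
plt.xlabel('Number of carbon atoms')
plt.ylabel('Misclassification error')
plt.title('Misclassification error of k-NN classifier for alkane classification')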

Misclassification error of k-NN classifier for alkane classification

To me, it’s quite interesting that the error increases as $k$ increases, which is not what I expected. However, upon further reflection it makes sense given what we saw with the unsupervised learning results. I don’t think the eigenvalues give rise to meaningful local structure in the data without further processing, and so the k-NN model is not able to leverage the increased number of neighbors to improve classification performance. Instead, when it looks further afield, it’s more likely to find examples from other classes, which increases the misclassification error.

Logistic Regression

For Logistic Regression, we’ll use the LogisticRegression class from sklearn.linear_model. I’m using the lbfgs solver, a quasi-Newton method (and the default in recent versions of scikit-learn) that handles the multi-class problem directly, unlike liblinear, which has to fit one-vs-rest classifiers. I’m using the l2 penalty, which is the default for lbfgs, though I’m not sure that’s the best option for this problem (early experimentation showed slightly better performance without a penalty, but at the cost of much longer computation time; I went with the penalty at a reduced weight, C=10.0, to keep computation manageable).

To keep computation even more manageable, I only run this experiment with the PCA-reduced representation. This is likely to be a poor choice for this model, judging by the results in the paper, which use only the top 10 eigenvalues for logistic regression. I’m hoping the results from the k-NN model transfer, i.e., that the PCA representation is a better basis than the top 10 eigenvalues and slightly closes the gap in performance between the full and PCA representations.

from sklearn.linear_model import LogisticRegression


df_lr = []

for n in range(4, 12):

    # Prepare the data for CnH2n+2
    X, y = prep_data(n=n)

    # Create logistic regression classifier with PCA
    lr = Pipeline([
        ('pca', PCA(n_components=10)),
        ('lr', LogisticRegression(
            max_iter=10_000, 
            penalty='l2',
            solver='lbfgs',
            C=10.0,  # > 1.0, less regularization
            random_state=42, 
            verbose=1, 
            n_jobs=-1,
        ))
    ])

    # Set up stratified 5-fold cross-validation
    cv = StratifiedKFold(n_splits=5)

    # Perform cross-validation. Since 'cross_val_score' computes accuracy, we compute misclassification rate by subtracting accuracy from 1.
    acc_scores = cross_val_score(lr, X, y, cv=cv, scoring='accuracy')

    # Convert accuracy scores to misclassification error rates
    misclassification_error_rates = 1 - acc_scores

    # Calculate the average and standard deviation of the misclassification error rates
    avg_misclassification_error = np.mean(misclassification_error_rates)
    std_misclassification_error = np.std(misclassification_error_rates)

    print(f'C{n}H{2*n + 2}: {avg_misclassification_error:.2%} ± {std_misclassification_error:.2%}')
    
    df_lr.append({
        'molecule': f'C{n}H{2*n + 2}', 
        'avg_misclassification_error': avg_misclassification_error, 
        'std_misclassification_error': std_misclassification_error,
        'n': n,
        'representation': 'pca',
        'model': 'lr',
    })
    
df_lr = pd.DataFrame(df_lr)
df_lr

Plotting the results, we get:

Misclassification error of Logistic Regression classifier for alkane classification

We can plot the logistic regression results alongside the 1-NN results for the PCA representation, to get a better sense of scale.
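
A sketch of that comparison (df_compare is just an illustrative name), concatenating the PCA rows of the 1-NN results with the logistic regression results:

df_compare = pd.concat([
    df_1nn[df_1nn['representation'] == 'pca'],
    df_lr,
])

plt.figure(figsize=(12, 6), dpi=150)
sns.lineplot(data=df_compare, x='n', y='avg_misclassification_error', hue='model', style='model', marker='o')
plt.xlabel('Number of carbon atoms')
plt.ylabel('Misclassification error')
plt.title('Misclassification error of 1-NN and Logistic Regression classifiers for alkane classification')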

Misclassification error of 1-NN and Logistic Regression classifiers for alkane classification

The gap between the two models quickly grows. Better hyperparameters could perhaps close it, but training logistic regression is considerably more costly than fitting k-NN, so I didn’t explore this further.

Nevertheless, these trends align with the paper and illustrate how one would apply a subset of baseline models to the problem of alkane isomer classification using the Coulomb matrix eigenvalues.

Conclusion

In this post, we used the Coulomb matrix eigenvalues to classify alkane constitutional isomers using supervised learning. We replicated the methods used in the paper “Can One Hear the Shape of a Molecule (from its Coulomb Matrix Eigenvalues)?” by performing 5-fold cross-validation with stratified sampling. We used two baseline models, k-nearest neighbors and logistic regression, and found that for k-NN the performance was almost identical across the full, top 10, and PCA-reduced representations.

Ultimately, we find that the k-NN model classifies the alkane isomers more robustly than logistic regression. Both models perform best on the smaller alkanes, with performance degrading as the number of atoms in the molecule increases.

Hopefully this post gives you a sense of how to apply supervised learning to the problem of alkane isomer classification using the Coulomb matrix eigenvalues, and how to replicate the methods used in the paper. I think it begins to highlight some of the limitations of the Coulomb matrix eigenvalues as a molecular descriptor, and how we might need more sophisticated methods to improve classification performance. This is worth keeping in mind if you’re considering this representation for a more complex problem. For example, if you’re generating molecules and using the Coulomb matrix eigenvalues to measure similarity, you might find that the representation misses important nuances of the data and that a more sophisticated similarity measure is needed.

I hope you’ve enjoyed this series of posts on alkane isomer classification using the Coulomb matrix eigenvalues! If you have any questions or comments, feel free to reach out.

If you want to check out another post on molecular simulation, you might be interested in this post on mini protein folding simulations.