Introduction

In Parts One and Two, we discovered a sobering truth about Coulomb matrix eigenvalues: while they work perfectly for simple molecules like butane, they struggle to distinguish constitutional isomers as molecular complexity increases.

Unsupervised clustering painted a clear picture—by C₁₁H₂₄, eigenvalue representations become nearly indistinguishable. But this raises an intriguing question: what happens when we provide labels?

Can supervised learning extract hidden patterns that clustering couldn’t find? Maybe the mathematical structure is there, but requires guidance to discover. This is our final test of whether we can truly “hear the shape of a molecule” from eigenvalue signatures.

I’ll focus on two baseline approaches from the original paper: k-nearest neighbors and logistic regression. These represent fundamentally different learning paradigms—one memorizes patterns, the other learns linear boundaries—giving us insight into what types of structure might exist in the eigenvalue space.

Setting Up the Experiments

First, let’s reload our eigenvalue data, focusing only on molecules with constitutional isomers (C₄H₁₀ through C₁₁H₂₄):

import re 
from glob import glob 
import numpy as np

spectra = {}
for n in range(4, 12):
    spectra[n] = {}

    # Each .npy file holds the eigenvalue spectra for one constitutional isomer,
    # keyed by the integer index in its filename
    for f in glob(f'spectra/C{n}H{2*n + 2}_*.npy'):
        j = int(re.search(rf'C{n}H{2*n + 2}_(\d+).npy', f).group(1))
        spectra[n][j] = np.load(f)

    # Each file contributes 1,000 conformations, hence the 'k' in the count below
    print(f'Loaded {len(spectra[n])}k spectra for C{n}H{2*n + 2} with {spectra[n][0].shape[1]} dimensions')

This gives us a clear picture of our classification challenges:

Loaded 2k spectra for C4H10 with 14 dimensions
Loaded 3k spectra for C5H12 with 17 dimensions  
Loaded 5k spectra for C6H14 with 20 dimensions
Loaded 9k spectra for C7H16 with 23 dimensions
Loaded 18k spectra for C8H18 with 26 dimensions
Loaded 35k spectra for C9H20 with 29 dimensions
Loaded 75k spectra for C10H22 with 32 dimensions
Loaded 159k spectra for C11H24 with 35 dimensions

Each molecular size represents a different classification problem—from 2 classes (butane) to 159 classes (undecane). The dimensionality grows with molecular size, but as we learned from PCA analysis in Part One, most information concentrates in the first few eigenvalues.

Our helper function for preparing training data:

def prep_data(n: int):
    """Stack the conformer spectra for CnH2n+2 into a feature matrix X and a
    label vector y, with one integer class label per constitutional isomer."""
    X = []
    y = []
    for j, s in sorted(spectra[n].items()):
        n_conf = s.shape[0]            # number of conformations for isomer j
        X.append(s)
        y.append(np.full(n_conf, j))   # label every conformation with its isomer index
    return np.concatenate(X), np.concatenate(y)
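
As a quick sanity check, calling it on butane should give roughly 2,000 stacked spectra with two class labels (assuming 1,000 conformations per isomer file, as the loader's output suggests):

X, y = prep_data(n=4)
print(X.shape)       # expected: (2000, 14) -- 2 isomers x 1,000 conformations, 14 atoms each
print(np.unique(y))  # expected: [0 1] -- one label per constitutional isomer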

k-Nearest Neighbors: Learning by Memory

k-NN represents the simplest form of supervised learning—it memorizes all training examples and classifies new samples based on their closest neighbors. If eigenvalue patterns truly distinguish isomers, nearby points in eigenvalue space should belong to the same class.

This directly tests the local structure that unsupervised clustering couldn’t find. Maybe the global structure is poor, but local neighborhoods preserve meaningful distinctions.

Testing Different Representations

Let’s compare three approaches: full eigenvalue vectors, top 10 eigenvalues only, and PCA-reduced representations.

Full dimensionality (1-NN):

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

df_1nn = []

for n in range(4, 12):

    # Prepare the data for CnH2n+2
    X, y = prep_data(n=n)

    # Create knn classifier
    knn = KNeighborsClassifier(n_neighbors=1)

    # Set up stratified 5-fold cross-validation
    cv = StratifiedKFold(n_splits=5)

    # Perform 5-fold cross-validation, scored on accuracy
    acc_scores = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')

    # Convert accuracy scores to misclassification error rates
    misclassification_error_rates = 1 - acc_scores

    # Calculate the average and standard deviation of the misclassification error rates
    avg_misclassification_error = np.mean(misclassification_error_rates)
    std_misclassification_error = np.std(misclassification_error_rates)

    print(f'C{n}H{2*n + 2}: {avg_misclassification_error:.2%} ± {std_misclassification_error:.2%}')
    
    df_1nn.append({
        'molecule': f'C{n}H{2*n + 2}', 
        'avg_misclassification_error': avg_misclassification_error, 
        'std_misclassification_error': std_misclassification_error,
        'n': n,
        'representation': 'full',
        'model': '1nn',
    })

The results are remarkable compared to unsupervised clustering:

C4H10: 0.00% ± 0.00%
C5H12: 0.00% ± 0.00%
C6H14: 0.00% ± 0.00%
C7H16: 0.07% ± 0.05%
C8H18: 0.11% ± 0.05%
C9H20: 0.51% ± 0.09%
C10H22: 1.31% ± 0.09%
C11H24: 3.24% ± 0.09%

Perfect classification for molecules up to C₆H₁₄, and remarkably low error rates even for complex molecules like C₁₁H₂₄. This is a dramatic improvement over clustering, where 97% of C₁₁H₂₄ isomers had misclassified conformations.

Note on data scaling: I tested standardizing features, but this destroyed performance—the magnitude of eigenvalues carries crucial information about molecular structure.
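
For reference, here is a minimal sketch of that check, assuming a StandardScaler pipeline around the same 1-NN setup and a single molecule size (my original experiment may have been wired slightly differently):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Does per-feature standardization help or hurt 1-NN? (Illustrative check on nonane.)
X, y = prep_data(n=9)
scaled_knn = Pipeline([
    ('scale', StandardScaler()),                   # zero mean, unit variance per feature
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])
acc = cross_val_score(scaled_knn, X, y, cv=StratifiedKFold(n_splits=5), scoring='accuracy')
print(f'Scaled 1-NN error for C9H20: {1 - acc.mean():.2%}')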

Comparison across representations:

Testing the same molecules with different feature sets reveals interesting patterns:

Figure 1: 1-NN classification performance across different eigenvalue representations. All approaches perform similarly, with slight advantages for the full representation on larger molecules.

Key insights:

  • Representation choice matters little for 1-NN—full, top-10, and PCA perform nearly identically
  • PCA slightly outperforms top-10 eigenvalues for larger molecules, suggesting it captures more meaningful structural variance
  • Perfect classification persists through C₆H₁₄ regardless of representation

This suggests that the discriminative information concentrates in the largest eigenvalues, validating our PCA findings from Part One.
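
The comparison behind Figure 1 is a small variation on the loop above. Here is a sketch of how the top-10 and PCA variants can be evaluated, assuming the spectra store eigenvalues in descending order and using 10 PCA components (the figure may have used a different count):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

def one_nn_error(model, X, y):
    """Mean 5-fold misclassification error for a given classifier."""
    acc = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5), scoring='accuracy')
    return 1 - acc.mean()

for n in range(4, 12):
    X, y = prep_data(n=n)
    knn = KNeighborsClassifier(n_neighbors=1)

    err_full = one_nn_error(knn, X, y)             # full eigenvalue vectors
    err_top10 = one_nn_error(knn, X[:, :10], y)    # 10 largest eigenvalues (assumes descending order)
    err_pca = one_nn_error(                        # 10-component PCA, fitted inside each fold
        Pipeline([('pca', PCA(n_components=10)), ('knn', knn)]), X, y)

    print(f'C{n}H{2*n + 2}: full {err_full:.2%}, top-10 {err_top10:.2%}, PCA {err_pca:.2%}')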

The Effect of k: When More Neighbors Hurt

Testing k-NN with different neighbor counts (k=1, 3, 5) reveals a counterintuitive pattern:

Figure 2: k-NN classification performance decreases as k increases—more neighbors actually hurt accuracy.
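
For completeness, the sweep behind Figure 2 is a one-line change to the 1-NN loop (plotting code omitted):

# Sweep the neighbor count on the full eigenvalue vectors
for n in range(4, 12):
    X, y = prep_data(n=n)
    results = []
    for k in (1, 3, 5):
        knn = KNeighborsClassifier(n_neighbors=k)
        acc = cross_val_score(knn, X, y, cv=StratifiedKFold(n_splits=5), scoring='accuracy')
        results.append(f'k={k}: {1 - acc.mean():.2%}')
    print(f'C{n}H{2*n + 2}: ' + ', '.join(results))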

Why does performance degrade with more neighbors? This connects directly to our clustering analysis from Part Two. The eigenvalue space lacks meaningful local structure—when k-NN looks beyond the immediate nearest neighbor, it increasingly finds examples from different classes.

This validates our unsupervised findings: without clear cluster boundaries, examining more neighbors introduces noise rather than signal.

Logistic Regression: Learning Linear Boundaries

Logistic regression represents a fundamentally different approach—instead of memorizing examples, it learns linear decision boundaries in eigenvalue space. If eigenvalues encode structural information linearly, this should work well.

Note: I’ll focus on PCA-reduced representations to keep computation manageable, using the insights from k-NN analysis.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

df_lr = []

for n in range(4, 12):
    # Prepare the data for CnH2n+2
    X, y = prep_data(n=n)

    # Create logistic regression classifier with PCA
    lr = Pipeline([
        ('pca', PCA(n_components=10)),
        ('lr', LogisticRegression(
            max_iter=10_000, 
            penalty='l2',
            solver='lbfgs',
            C=10.0,  # Reduced regularization
            random_state=42, 
            n_jobs=-1,
        ))
    ])

    # 5-fold stratified cross-validation
    cv = StratifiedKFold(n_splits=5)
    acc_scores = cross_val_score(lr, X, y, cv=cv, scoring='accuracy')
    
    # Convert to misclassification rates
    avg_error = np.mean(1 - acc_scores)
    std_error = np.std(1 - acc_scores)

    print(f'C{n}H{2*n + 2}: {avg_error:.2%} ± {std_error:.2%}')

    df_lr.append({
        'molecule': f'C{n}H{2*n + 2}',
        'avg_misclassification_error': avg_error,
        'std_misclassification_error': std_error,
        'n': n,
        'representation': 'pca',
        'model': 'logreg',
    })

Comparing k-NN vs. Logistic Regression:

Figure 3: k-NN significantly outperforms logistic regression, especially for larger molecules. The gap widens as molecular complexity increases.
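
With both result lists filled, the comparison can be plotted directly; here is a minimal sketch using pandas and matplotlib (the actual figure's styling may differ):

import pandas as pd
import matplotlib.pyplot as plt

# Combine the 1-NN and logistic regression results and plot error vs. molecule size
df = pd.DataFrame(df_1nn + df_lr)
fig, ax = plt.subplots()
for model, grp in df.groupby('model'):
    ax.errorbar(grp['n'], grp['avg_misclassification_error'],
                yerr=grp['std_misclassification_error'], marker='o', label=model)
ax.set_xlabel('Number of carbon atoms')
ax.set_ylabel('Misclassification error')
ax.legend()
plt.show()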

What this tells us:

  • k-NN dominates across all molecular sizes
  • Linear boundaries fail for larger molecules—suggesting nonlinear eigenvalue relationships
  • Performance gap grows with molecular complexity, indicating that structural patterns become increasingly nonlinear

The poor performance of logistic regression suggests that the discriminative patterns in eigenvalue space are fundamentally nonlinear. Linear models simply can’t capture the complex relationships needed to distinguish constitutional isomers.

The Supervised Learning Verdict

Our supervised learning experiments reveal a nuanced picture of Coulomb matrix eigenvalues as molecular descriptors:

What works:

  • k-NN achieves remarkable performance—even 97% accuracy for C₁₁H₂₄ with 159 classes
  • Local structure exists where global clustering failed
  • Eigenvalue magnitudes matter—scaling destroys performance
  • Dimensionality reduction works—top eigenvalues capture discriminative information

What struggles:

  • Linear models fail badly for complex molecules
  • Performance degrades with molecular size, even for k-NN
  • Nonlinear patterns dominate—requiring memory-based rather than parametric approaches

The key insight: Labels unlock hidden structure that unsupervised methods missed. While eigenvalues can’t form clean global clusters, they preserve enough local structure for nearest-neighbor classification to work remarkably well.

Implications for Molecular Representation

This series reveals important lessons about molecular representations:

  1. Mathematical elegance ≠ practical utility: Eigenvalues are beautiful but have fundamental limitations
  2. Context matters: The same representation can fail unsupervised but succeed supervised
  3. Molecular complexity is hard: Even simple alkanes challenge our best descriptors
  4. Local vs. global structure: Sometimes the neighborhood matters more than the big picture

For practitioners working with molecular representations:

  • Be skeptical of elegant solutions to complex problems
  • Test multiple learning paradigms—supervised and unsupervised often give different insights
  • Consider the curse of dimensionality seriously in molecular design
  • Always validate on the simplest possible test cases first

Looking ahead: Modern deep learning approaches like graph neural networks and transformer-based molecular representations have largely superseded eigenvalue methods. But understanding why simpler approaches fail helps us appreciate what makes the newer methods successful.


This concludes our three-part journey through molecular shape recognition with eigenvalues. For more explorations in computational chemistry and machine learning, check out my work on mini protein folding or the Molecular Representation Methods series. You can also browse all chemistry posts.