Can You Hear the Shape of a Molecule?

Introduction

Can you determine a molecule’s shape from mathematical fingerprints alone? This question drives some of the most fundamental challenges in computational chemistry and machine learning.

I recently encountered a paper with an intriguing title: “Can One Hear the Shape of a Molecule (from its Coulomb Matrix Eigenvalues)?” The title references Mark Kac’s famous mathematical question “Can One Hear the Shape of a Drum?”—which asks whether you can determine a drum’s shape from its sound frequencies.

The molecular version asks: can we determine a molecule’s structure from the eigenvalues of its Coulomb matrix?

This isn’t just academic curiosity. Molecular representations are the foundation of machine learning in chemistry. If eigenvalues can capture structural information, they become powerful features for property prediction. If they can’t distinguish basic structural differences, we need alternative approaches.

The original authors tested this hypothesis using alkane constitutional isomers—molecules with identical chemical formulas but different structural arrangements. I decided to replicate and extend their work to better understand both the methods and their limitations.

This exploration became a three-part series on molecular representation through eigenvalue analysis:

Part 1 (this post): Data generation and computational validation
Part 2: Unsupervised clustering approaches
Part 3: Supervised classification methods

I’ll also explore log-transformed Coulomb matrices, which can reveal structural details that standard matrices miss.

Why Alkanes Make Ideal Test Cases

Alkanes are the simplest organic molecules—carbon and hydrogen connected by single bonds with the general formula $C_{n}H_{2n+2}$.

What makes them perfect for testing molecular representations is their constitutional isomers: molecules with identical chemical formulas but different structural arrangements. For small alkanes (n ≤ 3), atoms can connect in only one way. Starting with butane (n = 4), multiple arrangements become possible:

Butane as a ball-and-stick model. — Butane: a linear chain

Isobutane as a ball-and-stick model. — Isobutane: a branched structure

The number of isomers grows rapidly with molecular size. By undecane (n = 11), there are 159 different structural arrangements:

Alkane	n	Isomers
Butane	4	2
Pentane	5	3
Hexane	6	5
Heptane	7	9
Octane	8	18
Nonane	9	35
Decane	10	75
Undecane	11	159

This creates a natural classification challenge: can Coulomb matrix eigenvalues distinguish these structural differences? If they cannot separate simple alkane isomers, they’re unlikely to succeed with more complex molecules.

Computational Pipeline

The analysis requires three computational steps:

Generate constitutional isomers for each alkane formula
Create multiple 3D conformations for each isomer
Calculate Coulomb matrix eigenvalues for each conformation

Generating Constitutional Isomers

Enumerating all possible carbon skeletons is a combinatorial problem. I used MAYGEN, an open-source Java tool for generating molecular structures from chemical formulas.

For butane (C₄H₁₀):

java -jar MAYGEN-1.8.jar -v -m -f C4H10 -smi -o butane_conformers.smi

This generates:

CCCC
CC(C)C

The first is n-butane (linear), the second is isobutane (branched). We can automate this across all alkanes:

os.makedirs('isomers', exist_ok=True)
for n in range(1, 12):
    cmd = f"java -jar MAYGEN-1.8.jar -f C{n}H{2*n + 2} -smi -o isomers/C{n}H{2*n + 2}.smi"
    os.system(cmd)

Generating 3D Conformations

For machine learning applications, we need multiple 3D structures of each isomer to capture conformational flexibility. I used RDKit’s ETKDG method, which remains competitive with commercial alternatives:

Computing Coulomb Matrix Eigenvalues

The Coulomb matrix encodes 3D structure in a rotation and translation-invariant way. Its eigenvalues should capture structural information while remaining invariant to molecular orientation.

First, I wrote a helper function to convert RDKit molecules into ASE Atoms objects that DScribe can process:

Then I computed Coulomb matrix eigenvalues using DScribe, with optional log transformation:

Combining these functions enables efficient data generation:

# Generate 1000 conformations per isomer for each alkane
os.makedirs('spectra', exist_ok=True)
n_confs = 1000

for n in range(1, 12):
    print(f'Generating spectra for C{n}H{2*n + 2}')

    with open(f'isomers/C{n}H{2*n + 2}.smi') as f:
        smiles_list = [line.strip() for line in f]

    for i, smiles in enumerate(smiles_list):
        spectra = gen_n_spectra(smiles, n_confs, log=False)
        np.save(f'spectra/C{n}H{2*n + 2}_{i}.npy', spectra)

Reproducing the Original Results

To validate our computational pipeline, I replicated key figures from the original paper. This ensures our implementation correctly captures the phenomena they observed.

We generate data using 1000 conformations per isomer:

os.makedirs('spectra', exist_ok=True)

n_confs = 1000
for n in range(1, 12):
    print(f'Generating spectra for C{n}H{2*n + 2}')
    with open(f'isomers/C{n}H{2*n + 2}.smi') as f:
        lines = f.readlines()
        for i, line in enumerate(lines):
            if not line.strip():
                continue

            smiles = line.strip()
            spectra = gen_n_spectra(n_confs, smiles, log=False)
            np.save(f'spectra/C{n}H{2*n + 2}_{i:03d}.npy', spectra)

After generation, we can load the data into a structured format:

import re
from glob import glob

spectra = {}
for n in range(1, 12):
    spectra[n] = {}
    for f in glob(f'spectra/C{n}H{2*n + 2}_*.npy'):
        j = int(re.search(rf'C{n}H{2*n + 2}_(\d+).npy', f).group(1))
        spectra[n][j] = np.load(f)

Largest Eigenvalues Across Alkane Series

The first analysis examines how the largest Coulomb matrix eigenvalues vary across constitutional isomers for each alkane formula. This plot reveals whether single eigenvalues can distinguish between different molecular formulas.

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

for n in range(1, 12):
    eigenvalues = np.array([spectra[n][i][:, 0].mean() for i in spectra[n]])

    # add black dots to boxplot, but not dirextly to the center line
    jitter = np.random.normal(0, 0.1, size=len(eigenvalues))
    ax.scatter(np.full(len(eigenvalues), n) + jitter, eigenvalues, color='black', s=1, alpha=0.3)

    # Plot median
    ax.scatter([n], [np.median(eigenvalues)], color='red', alpha=0.5)

    # Plot range
    ax.plot([n - 0.5, n + 0.5], [np.min(eigenvalues), np.min(eigenvalues)], 'k-', alpha=0.5)  # Draw a line for the range
    ax.plot([n - 0.5, n + 0.5], [np.max(eigenvalues), np.max(eigenvalues)], 'k-', alpha=0.5)  # Draw a line for the range

ax.set_xlabel('Molecular formula')
ax.set_ylabel('Largest eigenvalue')
ax.set_xticks(range(1, 12))
ax.set_xticklabels([f'C{n}H{2*n + 2}' for n in range(1, 12)])
ax.set_title('Largest eigenvalues of the Coulomb matrix for alkane constitutional isomers')
plt.savefig('alkane_coulomb_matrix_largest_eigenvalues.webp', bbox_inches='tight')

Largest eigenvalues of the Coulomb matrix for alkane constitutional isomers. — Largest eigenvalues show sub-linear growth with molecular size and increasing overlap between isomers.

Our results match the original paper. We observe sub-linear growth in the largest eigenvalue with carbon number, and critically, increasing overlap between isomers as molecules grow larger. The largest eigenvalue alone cannot reliably distinguish constitutional isomers for larger alkanes.

Eigenvalue Distributions for Heptane Isomers

Looking deeper at a specific case, I analyzed the probability density functions for heptane ($C_7H_{16}$) isomers. This molecule has nine constitutional isomers, providing a good test of discrimination power.

for n_sel in range(4, 8):
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))

    smiles = []
    with open(f'isomers/C{n_sel}H{2*n_sel + 2}.smi') as f:
        for line in f:
            smiles.append(line.strip())

    for i in range(len(spectra[n_sel])):
        eigenvalues = spectra[n_sel][i][:, 0]

        # kde plot with the following params
        # - Gaussian kernel
        # - bandwidth with Silverman's rule of thumb
        ax = sns.kdeplot(eigenvalues, bw_method='silverman', label=get_iupac_name(smiles[i]))

    ax.set_xlabel('Largest eigenvalue')
    ax.set_ylabel('Density')
    ax.set_title('PDF of the largest eigenvalue for $C_' + str(n_sel) + 'H_{' + str(2*n_sel + 2) + '}$')
    ax.legend()
    plt.savefig(f'pdf_largest_eigenvalue_C{n_sel}H{2*n_sel + 2}.webp', bbox_inches='tight')
    plt.close()

PDFs of the largest eigenvalue for heptane isomers. — Heptane isomers show distinct eigenvalue ranges: n-heptane (smallest), 2,2,3-trimethylbutane (largest), with others overlapping.

The pattern is clear:

n-heptane (linear chain) has the smallest eigenvalues
2,2,3-trimethylbutane (highly branched) has the largest
Seven other isomers fall in between with substantial overlap

This demonstrates the fundamental limitation: while extreme structural differences (linear vs. highly branched) create separable eigenvalue distributions, intermediate structures become indistinguishable.

For smaller alkanes, the separation is more promising:

PDFs for butane isomers. — Butane (n=4): Clean separation between linear and branched structures.

PDFs for pentane isomers. — Pentane (n=5): Good separation between most isomers.

PDFs for hexane isomers. — Hexane (n=6): Some isomers (2-methylpentane, 3-methylpentane) become difficult to distinguish.

The progression is clear: eigenvalue-based discrimination works well for small alkanes but degrades as molecular complexity increases.

Two-Dimensional Eigenvalue Space

Can we improve discrimination by using multiple eigenvalues? For butane, plotting the first two eigenvalues reveals interesting structure:

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

n_sel = 4
smiles = []
with open(f'isomers/C{n_sel}H{2*n_sel + 2}.smi') as f:
    for line in f:
        smiles.append(line.strip())

for i in range(len(spectra[n_sel])):
    eigenvalues = spectra[n_sel][i][:, :2]

    ax.scatter(eigenvalues[:, 0], eigenvalues[:, 1], label=get_iupac_name(smiles[i]))

ax.set_xlabel('Largest eigenvalue')
ax.set_ylabel('Second largest eigenvalue')
ax.set_title('2D plot of the first two eigenvalues for $C_4H_{10}$ conformers')
ax.legend()
plt.savefig(f'2d_largest_eigenvalue_C{n_sel}H{2*n_sel + 2}.webp', bbox_inches='tight')

2D eigenvalue space for butane isomers. — Perfect linear separation of butane isomers using the first two eigenvalues.

The two isomers cluster distinctly, demonstrating that multi-dimensional eigenvalue features can achieve perfect separation for simple cases. The outlier point in the lower right likely results from conformational sampling noise.

Dimensionality and Information Content

How many eigenvalues do we actually need? Principal component analysis reveals the effective dimensionality of the eigenvalue representations:

from sklearn.decomposition import PCA

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

n_components = {}
for n in range(1, 12):
    n_components[n] = []

    for i in range(len(spectra[n])):
        eigenvalues = spectra[n][i]

        # PCA
        pca = PCA(n_components=0.99, svd_solver='full', whiten=False, random_state=42)
        pca.fit(eigenvalues)
        n_components[n].append(pca.n_components_)

    ax.scatter([n] * len(n_components[n]), n_components[n], alpha=0.3)
    ax.plot([n - 0.25, n + 0.25], [np.mean(n_components[n]), np.mean(n_components[n])], 'k-', alpha=0.5)  # Draw a line for the mean

ax.plot([1, 11], [1, 11], 'k--', alpha=0.5, label='y = num carbon')
ax.plot([1, 11], [5, 35], 'r--', alpha=0.5, label='y = num atoms')

ax.set_xlabel('Number of carbon atoms')
ax.set_ylabel('Number of principal components')
ax.set_title('99% variance explained by number of principal components')
plt.legend()
plt.savefig('99_variance_explained.webp', bbox_inches='tight')

Principal components needed for 99% variance. — The eigenvalue space compresses efficiently—far fewer components than molecular degrees of freedom are needed.

Key observations:

High compressibility: Much fewer than the $3n+2$ degrees of freedom are needed
Linear scaling: Principal components grow roughly linearly with carbon number
Efficient representation: The eigenvalue space has lower effective dimensionality than expected

This suggests the representations are highly correlated, enabling significant dimensionality reduction without information loss.

Log-Transformed Coulomb Matrices

As explored in our previous post on Coulomb matrices, log transformation can reveal different structural information. While standard Coulomb matrices emphasize heavy atom interactions, log transformation expands the influence of hydrogen atoms by mapping magnitudes in [0,1] to [-∞,0].

I generated equivalent datasets using log-transformed matrices to test how this affects discriminative power.

os.makedirs('spectra', exist_ok=True)

n_confs = 1000
for n in range(1, 12):
    print(f'Generating spectra for C{n}H{2*n + 2}')
    with open(f'isomers/C{n}H{2*n + 2}.smi') as f:
        lines = f.readlines()
        print(f'\tNumber of SMILES strings: {len(lines)}')
        for i, line in enumerate(tqdm(lines)):
            print(f'\t\t{i + 1}/{len(lines)} - {line.strip()}')

            if not line.strip():
                continue

            if os.path.exists(f'spectra/log-C{n}H{2*n + 2}_{i}.npy'):
                continue

            smiles = line.strip()
            spectra = gen_n_spectra(n=n_confs, smiles_str=smiles, log=True)
            np.save(f'spectra/log-C{n}H{2*n + 2}_{i:03d}.npy', spectra)

Log-Transformed Eigenvalue Distributions

The log transformation dramatically changes the eigenvalue landscape:

Log-transformed eigenvalues across alkane series. — Log transformation emphasizes hydrogen interactions, creating larger eigenvalue ranges and more negative values.

Unlike standard eigenvalues, log-transformed versions:

Span large negative values due to hydrogen atom emphasis
Show increasing variance between isomers as molecular size grows
Demonstrate greater discrimination potential for some isomers

However, this comes with trade-offs. The distributions can become significantly broader, as seen in the heptane analysis:

Log-transformed eigenvalue PDFs for heptane. — Log transformation creates wider, more overlapping distributions that may reduce discrimination power.

The log scale on the y-axis is necessary because unbranched isomers become nearly invisible due to the highly concentrated distributions of branched isomers.

Two-Dimensional Log-Transformed Space

The 2D eigenvalue plot for log-transformed butane shows similar clustering behavior:

2D log-transformed eigenvalue space for butane. — Log transformation brings isomers closer together but maintains separability for simple cases.

While the transformation reduces the separation distance, linear discriminability remains intact for this simple case.

Dimensionality of Log-Transformed Features

Principal component analysis of log-transformed eigenvalues reveals similar compression properties:

Principal components for log-transformed eigenvalues. — Log transformation requires slightly more principal components but maintains efficient compression.

The log-transformed features show comparable dimensionality reduction, though with marginally higher component requirements.

Key Findings

Our computational validation confirms the original paper’s conclusions and reveals important insights about molecular representation:

Single eigenvalues have limited scope: The largest eigenvalue alone can distinguish small alkanes (n ≤ 5) but fails for larger molecules where isomer ranges overlap significantly.

Structural extremes are separable: Linear alkanes (like n-heptane) consistently produce the smallest eigenvalues, while highly branched structures (like 2,2,3-trimethylbutane) generate the largest. However, intermediate branching patterns become increasingly difficult to distinguish.

Eigenvalue space compresses efficiently: PCA analysis reveals that most variance concentrates in far fewer principal components than the molecular degrees of freedom would suggest, indicating highly correlated representations.

Log transformation trades precision for range: While log-transformed eigenvalues expand hydrogen atom influence and increase variance between some isomers, they also create broader, more overlapping distributions that may reduce practical discrimination.

These results establish the fundamental challenge: even mathematically elegant structural descriptors face practical limitations when distinguishing subtle molecular differences.

Looking Ahead

These findings frame the central question for machine learning in computational chemistry: if mathematically sophisticated structural descriptors struggle with basic classification tasks, how can we build better representations?

The eigenvalue limitations we’ve documented aren’t failures—they’re informative constraints that guide us toward more sophisticated approaches. The fact that even rotation-invariant quantum mechanical descriptors hit practical walls highlights why molecular representation learning remains an active frontier.

In the next posts, we’ll test whether machine learning can find patterns that pure mathematical analysis misses. We’ll explore unsupervised clustering methods to discover hidden structure in eigenvalue space, followed by supervised classification to determine the ultimate limits of eigenvalue-based molecular discrimination.

The question “Can you hear the shape of a molecule?” may not have a simple yes or no answer—but exploring it teaches us valuable lessons about the intersection of mathematical elegance and practical utility in computational chemistry.

Introduction#

Why Alkanes Make Ideal Test Cases#

Computational Pipeline#

Generating Constitutional Isomers#

Generating 3D Conformations#

Computing Coulomb Matrix Eigenvalues#

Reproducing the Original Results#

Largest Eigenvalues Across Alkane Series#

Eigenvalue Distributions for Heptane Isomers#

Two-Dimensional Eigenvalue Space#

Dimensionality and Information Content#

Log-Transformed Coulomb Matrices#

Log-Transformed Eigenvalue Distributions#

Two-Dimensional Log-Transformed Space#

Dimensionality of Log-Transformed Features#

Key Findings#

Looking Ahead#