Introduction

I came across a paper with an intriguing title: “Can One Hear the Shape of a Molecule (from its Coulomb Matrix Eigenvalues)?” The title references Mark Kac’s famous mathematical question “Can One Hear the Shape of a Drum?”—which asks whether you can determine a drum’s shape from its sound.

The molecular version asks: can we determine a molecule’s shape from the eigenvalues of its Coulomb matrix?

This is more than an academic curiosity. Molecular representations are fundamental to machine learning in chemistry. If eigenvalues can capture structural information, they could be powerful features for predicting molecular properties. But if they can’t distinguish basic structural differences, we need better approaches.

The original authors tested this using alkane constitutional isomers—molecules with identical formulas but different arrangements. I decided to replicate their work, partly to understand the methods and partly because I wanted to test the limits myself.

This became a three-part series exploring molecular representation through eigenvalue analysis:

  • Part 1 (this post): Data generation and computational validation
  • Part 2: Unsupervised clustering approaches
  • Part 3: Supervised classification methods

For the complete learning path, see the Can You Hear the Shape of a Molecule? series page.

I’ll also explore log-transformed Coulomb matrices, since they can reveal structural details that regular matrices miss.

Why Alkanes Make Perfect Test Cases

Alkanes are the simplest organic molecules—just carbon and hydrogen with single bonds. Their general formula is $C_{n}H_{2n+2}$.

What makes them ideal for testing molecular representations is their constitutional isomers: molecules with identical formulas but different structural arrangements. For small alkanes (n ≤ 3), there’s only one way to connect the atoms. But starting with butane (n = 4), multiple arrangements become possible:

Butane as a ball-and-stick model.

Butane: a linear chain

Isobutane as a ball-and-stick model.

Isobutane: a branched structure

The number of isomers grows rapidly. By undecane (n = 11), there are 159 different ways to arrange the carbon skeleton:

AlkanenIsomers
Butane42
Pentane53
Hexane65
Heptane79
Octane818
Nonane935
Decane1075
Undecane11159

This creates a natural classification challenge: can Coulomb matrix eigenvalues distinguish these structural differences? If they can’t separate simple alkane isomers, they’re unlikely to work for more complex molecules.

Generating the Data

The computational challenge breaks down into three steps:

  1. Generate all constitutional isomers for each alkane
  2. Create multiple 3D conformations for each isomer
  3. Calculate Coulomb matrix eigenvalues for each conformation

Step 1: Generating Constitutional Isomers

Enumerating all possible carbon skeletons is a combinatorial problem. I used MAYGEN, an open-source Java tool that efficiently generates molecular structures from formulas.

For butane (C₄H₁₀):

java -jar MAYGEN-1.8.jar -v -m -f C4H10 -smi -o butane_conformers.smi

This produces:

CCCC
CC(C)C

The first is n-butane (linear), the second is isobutane (branched). We can automate this for all alkanes:

os.makedirs('isomers', exist_ok=True)
for n in range(1, 12):
    cmd = f"java -jar MAYGEN-1.8.jar -f C{n}H{2*n + 2} -smi -o isomers/C{n}H{2*n + 2}.smi"
    os.system(cmd)

Step 2: Generating 3D Conformations

For machine learning, we need multiple 3D structures of each isomer to capture conformational flexibility. I used RDKit’s ETKDG method:

Step 3: Computing Coulomb Matrix Eigenvalues

The Coulomb matrix encodes 3D structure in a rotation and translation-invariant way. Its eigenvalues should capture structural information. I used DScribe for calculations:

With these functions, we can generate training data:

# Generate 1000 conformations per isomer for each alkane
os.makedirs('spectra', exist_ok=True)
n_confs = 1000

for n in range(1, 12):
    print(f'Generating spectra for C{n}H{2*n + 2}')

    with open(f'isomers/C{n}H{2*n + 2}.smi') as f:
        smiles_list = [line.strip() for line in f]

    for i, smiles in enumerate(smiles_list):
        spectra = gen_n_spectra(smiles, n_confs, log=False)
        np.save(f'spectra/C{n}H{2*n + 2}_{i}.npy', spectra)

Validating Our Approach

To ensure our pipeline works correctly, I’ll replicate key figures from the original paper and compare results.

which should, for heptane, print something like:

CCCCCCC heptane
CCC(CC)CC 3-ethylpentane
CC(C)CC(C)C 2,4-dimethylpentane
CC(C)C(C)(C)C 2,2,3-trimethylbutane
CCC(C)(C)CC 3,3-dimethylpentane
CCC(C)C(C)C 2,3-dimethylpentane
CCCC(C)CC 3-methylhexane
CCCC(C)(C)C 2,2-dimethylpentane
CCCCC(C)C 2-methylhexane

Generating Conformers (3D Structures) - A Computational Chemistry Problem

In order to create training data for our machine learning models, we need to generate 3D structures for each isomer. The paper uses RDKit to generate the 3D structures, relying on the ETKDG method for conformer generation. According to past studies on conformer generation, ETKDG is free and still competitive with commercial alternatives.

We write a simple Python function to achieve this like so:

After generating the 3D structures, we’ll use DScribe to calculate the Coulomb matrix, which is invariant to translation and rotation of the molecule. Before doing so, let’s first write a helper function to transform RDKit molecules into ase.Atoms objects, the Atomic Simulation Environment object that DScribe can work with.

With that, we can now compute the Coulomb matrix for each conformer using DScribe and compute the eigenvalues with numpy, allowing for an optional log transformation of the Coloumb matrix prior to eigenvalue calculation.

Using these helper functions, we can sample a batch of conformers, calculate their Coulomb matrices, and calculate the eigenvalues of the Coulomb matrices quite directly:

def gen_n_spectra(
    n: int,
    smiles_str: str,
    log: bool = False
) -> np.ndarray:
    """Generate n Coulomb matrix eigenvalue spectra for a molecule.

    Args:
        n: Number of spectra to generate.
        smiles_str: SMILES string of the molecule.
        log: Whether to log transform the Coulomb matrix prior to calculating the eigenvalues.

    Returns:
        Eigenvalue spectra.
    """
    spectra = []
    for _ in tqdm(range(n)):
        # Convert SMILES string to RDKit mol object
        mol = smiles_str_to_rdkit_mol(smiles_str)

        # Convert RDKit mol object to ASE Atoms object
        atoms = rdkit_mol_to_ase_atoms(mol)

        # Generate Coulomb matrix eigenvalue spectra
        spectra.append(ase_atoms_to_coloumb_matrix_eigenvalues(atoms, log=log))

    # Convert to numpy array
    spectra = np.array(spectra)

    return spectra

Validating the Generation Process

In the remainder of this post, I’d like to go through the Figures of the original paper and compare them to similar figures generated from our own code. This will help us validate our generation process and ensure that we’re on the right track for the next steps of unsupervised and supervised learning.

To start, we’ll generate the data for this section using the following code:

os.makedirs('spectra', exist_ok=True)

n_confs = 1000
for n in range(1, 12):
    print(f'Generating spectra for C{n}H{2*n + 2}')
    with open(f'isomers/C{n}H{2*n + 2}.smi') as f:
        lines = f.readlines()
        print(f'\tNumber of SMILES strings: {len(lines)}')
        for i, line in enumerate(tqdm(lines)):
            print(f'\t\t{i + 1}/{len(lines)} - {line.strip()}')

            if not line.strip():
                continue

            if os.path.exists(f'spectra/C{n}H{2*n + 2}_{i}.npy'):
                continue

            smiles = line.strip()
            spectra = gen_n_spectra(n=n_confs, smiles_str=smiles, log=False)
            np.save(f'spectra/C{n}H{2*n + 2}_{i:03d}.npy', spectra)

As this will take a non-trivial amount of time, be sure to cache it to disk so you don’t have to re-run it!

Then, we can reload the data into a structured format:

import re
from glob import glob

spectra = {}
for n in range(1, 12):
    if n not in spectra:
        spectra[n] = {}

    for i, f in enumerate(glob(f'spectra/C{n}H{2*n + 2}_*.npy')):
        j = int(re.search(rf'C{n}H{2*n + 2}_(\d+).npy', f).group(1))
        spectra[n][j] = np.load(f)

    print(f'Loaded {len(spectra[n])}k spectra for C{n}H{2*n + 2}')

Figure 1 - Largest Eigenvalues of the Coulomb Matrix for Alkane Constitutional Isomers

The first figure of the paper shows the box plots of the largest Eigenvalues of the Coulomb matrix for all alkane constitutional isomers up to $n=11$. Each molecular formula makes up a box plot with individual points for each isomer. This figure is used to argue that the Coulomb matrix eigenvalues are not sufficient for distinguishing between chemical formulas, at least when restricting to the largest eigenvalue.

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

for n in range(1, 12):
    eigenvalues = np.array([spectra[n][i][:, 0].mean() for i in spectra[n]])

    # add black dots to boxplot, but not dirextly to the center line
    jitter = np.random.normal(0, 0.1, size=len(eigenvalues))
    ax.scatter(np.full(len(eigenvalues), n) + jitter, eigenvalues, color='black', s=1, alpha=0.3)

    # Plot median
    ax.scatter([n], [np.median(eigenvalues)], color='red', alpha=0.5)

    # Plot range
    ax.plot([n - 0.5, n + 0.5], [np.min(eigenvalues), np.min(eigenvalues)], 'k-', alpha=0.5)  # Draw a line for the range
    ax.plot([n - 0.5, n + 0.5], [np.max(eigenvalues), np.max(eigenvalues)], 'k-', alpha=0.5)  # Draw a line for the range

ax.set_xlabel('Molecular formula')
ax.set_ylabel('Largest eigenvalue')
ax.set_xticks(range(1, 12))
ax.set_xticklabels([f'C{n}H{2*n + 2}' for n in range(1, 12)])
ax.set_title('Largest eigenvalues of the Coulomb matrix for alkane constitutional isomers')
plt.savefig('alkane_coulomb_matrix_largest_eigenvalues.webp', bbox_inches='tight')
Largest eigenvalues of the Coulomb matrix for alkane constitutional isomers.

Largest eigenvalues of the Coulomb matrix for alkane constitutional isomers.

Barring stylistic differences, our plot looks quite similar to the original paper’s Figure 1. We see a sub-linear growth in the largest eigenvalue with the number of carbon atoms in the molecule, and we see that the largest eigenvalue is not likely to be sufficient for distinguishing between constitutional isomers, especially for larger $n$.

Figure 2 - Maximum Eigenvalue PDFs for $C_7H_{16}$ and Its Constitutional Isomers

The second figure of the paper shows PDFs of the largest eigenvalue for $C_7H_{16}$ (heptane) and its constitutional isomers. The paper highlights how:

  • $n$-heptane conformers have the smallest maximum eigenvalue
  • $2,2,3$-trimethylbutane conformers have the largest maximum eigenvalue
  • all of the other seven conformers have maximum eigenvalues that are intermediate between these two extremes and are not easily distinguishable from one another

In this way, evidence is provided that the Coulomb matrix maximum eigenvalue is not sufficient for distinguishing between constitutional isomers.

for n_sel in range(4, 8):
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))

    smiles = []
    with open(f'isomers/C{n_sel}H{2*n_sel + 2}.smi') as f:
        for line in f:
            smiles.append(line.strip())

    for i in range(len(spectra[n_sel])):
        eigenvalues = spectra[n_sel][i][:, 0]

        # kde plot with the following params
        # - Gaussian kernel
        # - bandwidth with Silverman's rule of thumb
        ax = sns.kdeplot(eigenvalues, bw_method='silverman', label=get_iupac_name(smiles[i]))

    ax.set_xlabel('Largest eigenvalue')
    ax.set_ylabel('Density')
    ax.set_title('PDF of the largest eigenvalue for $C_' + str(n_sel) + 'H_{' + str(2*n_sel + 2) + '}$')
    ax.legend()
    plt.savefig(f'pdf_largest_eigenvalue_C{n_sel}H{2*n_sel + 2}.webp', bbox_inches='tight')
    plt.close()
PDFs of the largest eigenvalue for $C_7H_{16}$ and its constitutional isomers.

PDFs of the largest eigenvalue for $C_7H_{16}$ and its constitutional isomers.

We annotate our plot with the IUPAC names of the isomers, and we see that the PDFs are quite similar to the original paper’s Figure 2.

For the interested, you can also see the plots I generated for $n \in {4, 5, 6}$:

PDFs of the largest eigenvalue for $C_4H_{10}$ and its constitutional isomers.

PDFs of the largest eigenvalue for $C_4H_{10}$ and its constitutional isomers.

For $n=4$, we see that the PDFs are quite distinct from one another, suggesting that the largest eigenvalue is sufficient for distinguishing between butane and isobutane. We’ll see this further confirmed next.

PDFs of the largest eigenvalue for $C_5H_{12}$ and its constitutional isomers.

PDFs of the largest eigenvalue for $C_5H_{12}$ and its constitutional isomers.

Similar to $n=4$, we see fairly decent separation between the PDFs for pentane and its constitutional isomers, suggesting that the largest eigenvalue is sufficient for distinguishing between these isomers.

PDFs of the largest eigenvalue for $C_6H_{14}$ and its constitutional isomers.

PDFs of the largest eigenvalue for $C_6H_{14}$ and its constitutional isomers.

The case of hexane is a bit more ambiguous, but we still see some separation between the PDFs for hexane and its constitutional isomers. However, two isomers (2-mehtylpentane and 3-methylpentane) have PDFs that are quite similar to one another.

Figure 3 - 2D Space of Coulomb Matrix Eigenvalues for $C_4H_{10}$ (Butane and Isobutane)

What if we expand the scope? Look at more than just the largest eigenvalue? Figure 3 of the paper plots the two isomers of $C_4H_{10}$ (butane and isobutane) in 2D space, using the first two eigenvalues of the Coulomb matrix. This demonstrates a clear linear separation between the two isomers, suggesting that the Coulomb matrix eigenvalues are sufficient for distinguishing between these two isomers.

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

n_sel = 4
smiles = []
with open(f'isomers/C{n_sel}H{2*n_sel + 2}.smi') as f:
    for line in f:
        smiles.append(line.strip())

for i in range(len(spectra[n_sel])):
    eigenvalues = spectra[n_sel][i][:, :2]

    ax.scatter(eigenvalues[:, 0], eigenvalues[:, 1], label=get_iupac_name(smiles[i]))

ax.set_xlabel('Largest eigenvalue')
ax.set_ylabel('Second largest eigenvalue')
ax.set_title('2D plot of the first two eigenvalues for $C_4H_{10}$ conformers')
ax.legend()
plt.savefig(f'2d_largest_eigenvalue_C{n_sel}H{2*n_sel + 2}.webp', bbox_inches='tight')
2D space of Coulomb matrix eigenvalues for $C_4H_{10}$ (butane and isobutane).

2D space of Coulomb matrix eigenvalues for $C_4H_{10}$ (butane and isobutane).

For the most part, we observe a similar plot to the original paper’s Figure 3. However, I do think we have one conformer in the lower right that seems noisy/suspicious. This could be a result of the conformer generation process, and it’s something we should consider discarding.

Figure 4 - Number of Principal Components Required for 99% Variance in Coulomb Matrix Eigenvalues

But is this always the case for all constitutional isomers? Definitely not! The author presents Figure 4 as a plot of the number of principal components required to describe at least 99% of the variance in the Coulomb matrix eigenvalues for each alkane constitutional isomer. This figure shows that the number of principal components required grows sub-linearly with the number of carbon atoms in the molecule, suggesting that the Coulomb matrix eigenvalues are not sufficient for distinguishing between constitutional isomers.

from sklearn.decomposition import PCA

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

n_components = {}
for n in range(1, 12):
    n_components[n] = []

    for i in range(len(spectra[n])):
        eigenvalues = spectra[n][i]

        # PCA
        pca = PCA(n_components=0.99, svd_solver='full', whiten=False, random_state=42)
        pca.fit(eigenvalues)
        n_components[n].append(pca.n_components_)

    ax.scatter([n] * len(n_components[n]), n_components[n], alpha=0.3)
    ax.plot([n - 0.25, n + 0.25], [np.mean(n_components[n]), np.mean(n_components[n])], 'k-', alpha=0.5)  # Draw a line for the mean

ax.plot([1, 11], [1, 11], 'k--', alpha=0.5, label='y = num carbon')
ax.plot([1, 11], [5, 35], 'r--', alpha=0.5, label='y = num atoms')

ax.set_xlabel('Number of carbon atoms')
ax.set_ylabel('Number of principal components')
ax.set_title('99% variance explained by number of principal components')
plt.legend()
plt.savefig('99_variance_explained.webp', bbox_inches='tight')
99% variance explained by number of principal components.

99% variance explained by number of principal components.

While I think my plot looks a bit different from the original paper’s Figure 4, I think the general trend is the same in that:

  • For molecules with $N$ degrees of freedom, where $N=3n+2$ for $n$ carbon atoms, we need far less than $N$ principal components to explain 99% of the variance in the Coulomb matrix eigenvalues.
  • The number of principal components required grows seemingly linearly in the number of carbon atoms in the molecule, but it’s not clear if this is a general trend.

This means the representations are highly correlated, and we can likely reduce the dimensionality of the Coulomb matrix eigenvalues without losing much information.

Experimenting with Log-Transformed Coulomb Matrices

I want to repeat the same plot generation process, but this time with log-transformed Coulomb matrices. As we saw in the Coulomb matrix post, the Coulomb matrix has a wider distribution of magnitude across its eigenspectra when we first transform the Coloumb matrix with a log transformation. Instead of down-sampling the hydrogen atoms and up-sampling heavy atoms, taking the log-transformation of the Coulomb matrix has an opposite effect: it expands the range of hydrogen atoms, as magnitudes in $[0, 1]$ are expanded to $[-\infty, 0]$.

What does that do to the eigenvalue spectra?

Let’s find out!

os.makedirs('spectra', exist_ok=True)

n_confs = 1000
for n in range(1, 12):
    print(f'Generating spectra for C{n}H{2*n + 2}')
    with open(f'isomers/C{n}H{2*n + 2}.smi') as f:
        lines = f.readlines()
        print(f'\tNumber of SMILES strings: {len(lines)}')
        for i, line in enumerate(tqdm(lines)):
            print(f'\t\t{i + 1}/{len(lines)} - {line.strip()}')

            if not line.strip():
                continue

            if os.path.exists(f'spectra/log-C{n}H{2*n + 2}_{i}.npy'):
                continue

            smiles = line.strip()
            spectra = gen_n_spectra(n=n_confs, smiles_str=smiles, log=True)
            np.save(f'spectra/log-C{n}H{2*n + 2}_{i:03d}.npy', spectra)

and then:

log_spectra = {}
for n in range(1, 12):
if n not in log_spectra:
    log_spectra[n] = {}

for i, f in enumerate(glob(f'spectra/log-C{n}H{2*n + 2}_*.npy')):
    j = int(re.search(rf'log-C{n}H{2*n + 2}_(\d+).npy', f).group(1))
    log_spectra[n][j] = np.load(f)

print(f'Loaded {len(log_spectra[n])}k spectra for C{n}H{2*n + 2}')

Figure 1 Redux - Largest Eigenvalues of the Log-Transformed Coulomb Matrix for Alkane Constitutional Isomers

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

for n in range(1, 12):
    eigenvalues = np.array([log_spectra[n][i][:, 0].mean() for i in log_spectra[n]])

    # add black dots to boxplot, but not dirextly to the center line
    jitter = np.random.normal(0, 0.1, size=len(eigenvalues))
    ax.scatter(np.full(len(eigenvalues), n) + jitter, eigenvalues, color='black', s=5, alpha=0.3)

    # Plot median
    ax.scatter([n], [np.median(eigenvalues)], color='red')

    # Plot range
    ax.plot([n - 0.5, n + 0.5], [np.min(eigenvalues), np.min(eigenvalues)], 'k-', alpha=0.5)  # Draw a line for the range
    ax.plot([n - 0.5, n + 0.5], [np.max(eigenvalues), np.max(eigenvalues)], 'k-', alpha=0.5)  # Draw a line for the range

ax.set_xlabel('Molecular formula')
ax.set_ylabel('Largest eigenvalue')
ax.set_xticks(range(1, 12))
ax.set_xticklabels([f'C{n}H{2*n + 2}' for n in range(1, 12)])
ax.set_title('Largest eigenvalues of the log-Coulomb matrix for alkane constitutional isomers')
plt.savefig('alkane_log_coulomb_matrix_largest_eigenvalues.webp', bbox_inches='tight')
Largest eigenvalues of the log-Coulomb matrix for alkane constitutional isomers.

Largest eigenvalues of the log-Coulomb matrix for alkane constitutional isomers.

Now, we see that the largest eigenvalues of the log-transformed Coulomb matrix have the potential to become much more dominated to by the hydrogen atoms. This results in a large negative magnitude for the largest eigenvalue.

What I find to be interesting is that the range of the distribution across isomers grows as we add more carbon atoms to the molecule. For some isomers, the carbon atom interactions remain dominant, but for others, the hydrogen atom interactions become dominant. This is in contrast to the original Coulomb matrix eigenvalues, where the range of the distribution across isomers grows sub-linearly with the number of carbon atoms in the molecule and hydrogen atom interactions are virtually irrelevant.

Figure 2 Redux - Maximum Eigenvalue PDFs for $C_7H_{16}$ and Its Constitutional Isomers

for n_sel in range(4, 8):
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))

    smiles = []
    with open(f'isomers/C{n_sel}H{2*n_sel + 2}.smi') as f:
        for line in f:
            smiles.append(line.strip())

    for i in range(len(log_spectra[n_sel])):
        eigenvalues = log_spectra[n_sel][i][:, 0]

        # kde plot with the following params
        # - Gaussian kernel
        # - bandwidth with Silverman's rule of thumb
        # - log scale y-axis
        ax = sns.kdeplot(eigenvalues, bw_method='silverman', label=get_iupac_name(smiles[i]), log_scale=(False, True))

    ax.set_xlabel('Largest eigenvalue')
    ax.set_ylabel('Density')
    ax.set_title('PDF of the largest eigenvalue for $C_' + str(n_sel) + 'H_{' + str(2*n_sel + 2) + '}$')
    ax.legend()
    plt.savefig(f'pdf_log_largest_eigenvalue_C{n_sel}H{2*n_sel + 2}.webp', bbox_inches='tight')
    plt.close()
PDFs of the largest eigenvalue for $C_7H_{16}$ and its constitutional isomers.

PDFs of the largest eigenvalue for $C_7H_{16}$ and its constitutional isomers.

Here, I plot the PDFs of the largest eigenvalue for $C_7H_{16}$ and its constitutional isomers, but this time with a log scale on the y-axis. The reason is otherwise, the less branched isomers become invisible due to the more highly concentrated distributions of the more branched isomers.

In this, I think this exposes an issue of the log-transformation of the Coulomb matrix. The distributions can become a lot more wide and overlapping (as shown here).

Figure 3 Redux - 2D Space of Log-Transformed Coulomb Matrix Eigenvalues for $C_4H_{10}$ (Butane and Isobutane)

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

n_sel = 4
smiles = []
with open(f'isomers/C{n_sel}H{2*n_sel + 2}.smi') as f:
    for line in f:
        smiles.append(line.strip())

for i in range(len(log_spectra[n_sel])):
    eigenvalues = log_spectra[n_sel][i][:, :2]

    ax.scatter(eigenvalues[:, 0], eigenvalues[:, 1], label=get_iupac_name(smiles[i]))

ax.set_xlabel('Largest eigenvalue')
ax.set_ylabel('Second largest eigenvalue')
ax.set_title('2D plot of the first two eigenvalues for $C_4H_{10}$ conformers')
ax.legend()
plt.savefig(f'2d_log_largest_eigenvalue_C{n_sel}H{2*n_sel + 2}.webp', bbox_inches='tight')
2D space of log-transformed Coulomb matrix eigenvalues for $C_4H_{10}$ (butane and isobutane).

2D space of log-transformed Coulomb matrix eigenvalues for $C_4H_{10}$ (butane and isobutane).

While the two isomers have been brought closer together, not much of major substance has changed with this plot. We can still linearly separate the two isomers in 2D space using the first two eigenvalues of the log-transformed Coulomb matrix.

Figure 4 Redux - Number of Principal Components Required for 99% Variance in Log-Transformed Coulomb Matrix Eigenvalues

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

n_components = {}

for n in range(1, 12):
    n_components[n] = []

    for i in range(len(log_spectra[n])):
        eigenvalues = log_spectra[n][i]

        # PCA
        pca = PCA(n_components=0.99, svd_solver='full', whiten=False, random_state=42)
        pca.fit(eigenvalues)
        n_components[n].append(pca.n_components_)

    ax.scatter([n] * len(n_components[n]), n_components[n], alpha=0.3)
    ax.plot([n - 0.25, n + 0.25], [np.mean(n_components[n]), np.mean(n_components[n])], 'k-', alpha=0.5)  # Draw a line for the mean

ax.plot([1, 11], [1, 11], 'k--', alpha=0.5, label='y = num carbon')
ax.plot([1, 11], [5, 35], 'r--', alpha=0.5, label='y = num atoms')

ax.set_xlabel('Number of carbon atoms')
ax.set_ylabel('Number of principal components')
ax.set_title('99% variance explained by number of principal components')
plt.legend()
plt.savefig('99_variance_explained_log.webp', bbox_inches='tight')
99% variance explained by number of principal components for log-transformed Coulomb matrix eigenvalues.

99% variance explained by number of principal components for log-transformed Coulomb matrix eigenvalues.

Key Findings

Our validation reproduces the original paper’s main results:

  1. Single eigenvalues aren’t enough: The largest eigenvalue alone can distinguish small alkanes (n ≤ 5) but fails for larger molecules where isomer ranges overlap significantly.

  2. Dimensionality compression works: PCA analysis shows that most variance concentrates in just a few principal components, suggesting the eigenvalue space has lower effective dimensionality than expected.

  3. Structure matters for separability: Linear alkanes (like n-heptane) consistently have the smallest eigenvalues, while highly branched isomers (like 2,2,3-trimethylbutane) have the largest. But intermediate branching patterns become increasingly difficult to distinguish.

The log-transformed eigenvalues show similar patterns with slightly more principal components needed, but don’t fundamentally change the separability challenge.

What This Means

These results set up the central question: if single eigenvalues and simple projections struggle with structural classification, how will more sophisticated machine learning approaches perform?

The fact that even perfect mathematical structural descriptors face limitations highlights why molecular representation learning remains an active research area. Sometimes the most elegant mathematical approaches hit practical walls that require more complex solutions.

Next, we’ll test whether unsupervised clustering methods can find hidden structure in the eigenvalue space, followed by supervised classification to see how well we can train models to distinguish these molecular shapes.