Recap

In the previous post, we outlined our goal of replicating the paper “Can One Hear the Shape of a Molecule (from its Coulomb Matrix Eigenvalues)?”. We generated data for all constitutional isomers of alkanes with 1 to 11 carbon atoms and computed the Coulomb matrix eigenvalues for each molecule.

We validated our data by creating figures that matched up quite well with the paper’s original figures. These figures introduced us to the idea that the Coulomb matrix eigenvalues are highly correlated across dimensions, making it difficult to separate constitutional isomers based on the largest eigenvalues alone. Furthermore, PCA-based dimensionality reduction showed that most of the variance in the data is captured by just a few principal components, far fewer than the number of eigenvalues. This doesn’t bode well for the separability of the data in the eigenvalue space.

Nevertheless, this post will continue forward with our replication and analysis. By using the full dimensionality of the Coulomb matrix eigenvalues, we will attempt to use unsupervised methods to separate the constitutional isomers. We can use clustering-based measures to inform us about the separability of the data, without needing to rely on a supervised learning model.

If you haven’t read the previous post, I recommend you do so before continuing. Alternatively, if you’re unfamiliar with the Coulomb matrix, I recommend you read my introductory post for the Coulomb matrix. It’s a simple molecular descriptor that attempts to encode the global 3D structure of a molecule into a fixed-size matrix, so it’s quite interesting if you’re into molecular representations.

Unsupervised Learning Methods

This post is all about using clustering metrics to understand the separability of the Coulomb matrix eigenvalues for the constitutional isomers of alkanes. Intuitively, if the eigenvalue space separates isomers well, then clusters of isomers should be tightly concentrated and non-overlapping. We know this to be false in 1 and 2 dimensions for alkanes with $n \geq 6$, but is it true in higher dimensions?

Note: We’re only comparing isomers with the same number of carbon atoms, so we never compare eigenvalue vectors of different dimensionality.

Reloading the Data

We’ll start by reloading the data we generated in the previous post.

import re
from glob import glob

import numpy as np

spectra = {}
for n in range(1, 12):
    if n not in spectra:
        spectra[n] = {}
        
    for f in glob(f'spectra/C{n}H{2*n + 2}_*.npy'):
        j = int(re.search(rf'C{n}H{2*n + 2}_(\d+).npy', f).group(1))
        spectra[n][j] = np.load(f)
        
    print(f'Loaded {len(spectra[n])}k spectra for C{n}H{2*n + 2}')

Method 1 – Dunn Index

The first method we’ll use to measure separability is the Dunn Index. The Dunn Index is a measure of the compactness of clusters and the separation between clusters. It produces a single scalar for the entire dataset, so it’ll give an initial sense of how separable the data is.

The Dunn Index is defined as:

$$ \text{Dunn Index} = \frac{\text{min}_{i \neq j} \left( \text{dist}(C_i, C_j) \right)}{ \text{max}_k \left( \text{diam}(C_k) \right)} $$

Where $C_i$ is the $i$-th cluster, $\text{dist}(C_i, C_j)$ is the distance between clusters $i$ and $j$, and $\text{diam}(C_k)$ is the diameter of cluster $k$.

  • The numerator is also called the inter-cluster separation. We take the minimum distance over all pairs of clusters because we care about the worst case: the two clusters that are closest together.
  • The denominator is the intra-cluster compactness. We take the maximum diameter over all clusters because, again, we care about the worst case: the least compact (most spread out) cluster.
  • We’re free to choose any distance metric. The paper considered Manhattan and Chebyshev distances as well, but opted for Euclidean, so we’ll use Euclidean distance too.
  • The Dunn Index is bounded below by 0; higher values indicate better separability, and as it approaches 0 the clusters become effectively inseparable.

We’ll run through $4 \leq n \leq 11$ and compute the Dunn Index for each set of eigenvalues, treating each isomer’s sampled conformations as one cluster and wrapping the computation in a dunn_index helper.
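A minimal sketch of what that helper might look like, using SciPy’s pairwise-distance routines; the particular conventions here (minimum pairwise Euclidean distance between clusters for the numerator, maximum pairwise distance within a cluster for the denominator, and a dict return so we can tack the timing on below) are my own choices:

from scipy.spatial.distance import cdist, pdist

def dunn_index(clusters):
    # clusters: list of (n_conformations, n_eigenvalues) arrays, one per isomer
    # diameter: largest pairwise Euclidean distance within any single cluster
    diameter = max(pdist(c).max() for c in clusters)

    # distance: smallest pairwise Euclidean distance between points of different clusters
    distance = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )

    return {'diameter': diameter, 'distance': distance, 'dunn_index': distance / diameter}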

import time 

dunn_scores = {}
for n in range(4, 12):
    tik = time.time()
    dunn_scores[n] = dunn_index([spectra[n][i] for i in spectra[n]])
    tok = time.time()
    dunn_scores[n]['time'] = tok - tik
    print(f'C{n}H{2*n + 2}:', dunn_scores[n])
C4H10: {'diameter': 21.43072917950398, 'distance': 8.316362440688767, 'dunn_index': 0.3880578383978837, 'time': 0.06010293960571289}
C5H12: {'diameter': 23.449286379564892, 'distance': 2.4693042873545856, 'dunn_index': 0.10530402705587172, 'time': 0.10832405090332031}
C6H14: {'diameter': 19.602363375467938, 'distance': 1.4477574259511048, 'dunn_index': 0.07385626917634591, 'time': 0.28030991554260254}
C7H16: {'diameter': 20.065014927470955, 'distance': 0.4050094394280803, 'dunn_index': 0.02018485612355977, 'time': 1.0307331085205078}
C8H18: {'diameter': 24.794154667613665, 'distance': 0.5013450168168625, 'dunn_index': 0.020220290771668196, 'time': 4.199508905410767}
C9H20: {'diameter': 21.811025941686033, 'distance': 0.34381162248560415, 'dunn_index': 0.01576320267578513, 'time': 17.400264978408813}
C10H22: {'diameter': 27.180773716656066, 'distance': 0.4986608768730121, 'dunn_index': 0.0183460883811206, 'time': 86.00787401199341}
C11H24: {'diameter': 25.58731511020692, 'distance': 0.5490373275460223, 'dunn_index': 0.021457402825629343, 'time': 424.4431610107422}

Note the roughly quadratic scaling of runtime with the number of isomers: the Dunn Index requires pairwise distances between all conformations, and every additional isomer contributes the same number of conformations.

Figure 5 - Dunn Index by Number of Carbon Atoms

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# in axs[0, 0] - diameter vs number of carbon atoms
axs[0, 0].plot(
    list(range(4, 12)),
    [dunn_scores[n]['diameter'] for n in range(4, 12)],
    marker='o'
)
axs[0, 0].set_xlabel('Number of carbon atoms')
axs[0, 0].set_ylabel('Diameter')
axs[0, 0].set_title('Diameter vs number of carbon atoms')

# in axs[0, 1] - distance vs number of carbon atoms
axs[0, 1].plot(
    list(range(4, 12)),
    [dunn_scores[n]['distance'] for n in range(4, 12)],
    marker='o'
)
axs[0, 1].set_xlabel('Number of carbon atoms')
axs[0, 1].set_ylabel('Distance')
axs[0, 1].set_title('Distance vs number of carbon atoms')

# in axs[1, 0] - dunn index vs number of carbon atoms
axs[1, 0].plot(
    list(range(4, 12)),
    [dunn_scores[n]['dunn_index'] for n in range(4, 12)],
    marker='o'
)
axs[1, 0].set_xlabel('Number of carbon atoms')
axs[1, 0].set_ylabel('Dunn index')
axs[1, 0].set_title('Dunn index vs number of carbon atoms')

# in axs[1, 1] - time vs number of carbon atoms
axs[1, 1].plot(
    list(range(4, 12)),
    [dunn_scores[n]['time'] for n in range(4, 12)],
    marker='o'
)
axs[1, 1].set_xlabel('Number of carbon atoms')
axs[1, 1].set_ylabel('Time (s)')
axs[1, 1].set_title('Time vs number of carbon atoms')

plt.tight_layout()
plt.savefig('dunn_index_vs_num_carbon_atoms.webp', bbox_inches='tight')
Dunn Index, diameter, distance, and time vs number of carbon atoms. The diameter and distance are in arbitrary units, but the Dunn Index is unitless. The time is in seconds.

Comparing my Dunn Index results to the paper’s, I find the general trends to be similar:

  • A high initial Dunn Index for $n=4$, though the paper’s value is much higher than mine. I suspect this is related to the one conformation of isobutane I sampled that was very far away from the other conformations (see the figure in this section of the previous post).
  • For $n=5, 6$, my Dunn Index is lower than the paper’s, but the general trend is the same: starting at $n=7$, there’s another significant drop in the Dunn Index.
  • For the remaining $n$, the Dunn Index is very low, oscillating around $0.02$.

As all this data is sampled with an imprecise method and we’re taking minima and maxima, it’s quite possible that noise distorts the results here. The fact that the general trends match is a good sign, but I wouldn’t put too much stock in the exact values. (This speaks more to the need for a higher-quality data generation pipeline, but that’s a topic for another day.)

Method 2 – Silhouette Score

The silhouette score is another measure of cluster separability, but instead of a single scalar for the whole dataset, it produces a score for each sample: $$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $$ Where $a(i)$ is the average distance from sample $i$ to all other samples in its own cluster, and $b(i)$ is the average distance from sample $i$ to the samples of the nearest other cluster (the cluster, excluding its own, that minimizes this average distance).

The silhouette score lies in the range $[-1, 1]$: a score near $1$ indicates the sample sits well inside its own cluster, far from any neighboring cluster, while a negative score indicates the sample is, on average, closer to a neighboring cluster than to its own.
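As a quick sanity check on the definition, here’s a toy example with two well-separated 1-D clusters; the hand computation for the first sample matches what scikit-learn’s silhouette_samples returns:

import numpy as np
from sklearn.metrics import silhouette_samples

# toy data: two clusters of two points each
X_toy = np.array([[0.0], [1.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 1, 1])

# by hand, for the first sample (x = 0):
#   a(0) = mean distance to the rest of its own cluster = |0 - 1|     = 1.0
#   b(0) = mean distance to the other cluster           = (5 + 6) / 2 = 5.5
#   s(0) = (5.5 - 1.0) / 5.5 ≈ 0.818
print(silhouette_samples(X_toy, y_toy))  # ≈ [0.818, 0.778, 0.778, 0.818]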

For each $n$, we want to understand two things:

  • For each isomer, the fraction of samples with a negative score (i.e., samples that are closer to neighboring clusters than their own, leading to unsupervised misclassification).
  • The fraction of isomers with at least one sample with a negative score.

We’ll use the silhouette_samples function from scikit-learn to compute the silhouette score for each sample, and then we’ll aggregate the results.

from sklearn.metrics import silhouette_samples
from tqdm import tqdm

s_scores = {}
for n in tqdm(range(4, 12)):

    X = []
    y = []
    for i in spectra[n]:
        X.append(spectra[n][i])
        y.extend(np.full(spectra[n][i].shape[0], i))
    X = np.concatenate(X)
    y = np.array(y)

    print(n, X.shape, y.shape)
    s_scores[n] = silhouette_samples(X, y)

Figure 6a - Fraction of Isomers with Negative Silhouette Scores

neg_iso = {}
for n in range(4, 12):
    # each isomer contributed 1,000 conformations, stored as a contiguous block
    n_iso = s_scores[n].shape[0] // 1000
    n_has_neg = 0
    for i in range(n_iso):
        chunk = s_scores[n][i * 1000:(i + 1) * 1000]
        if np.any(chunk < 0):
            n_has_neg += 1
    neg_iso[n] = n_has_neg / n_iso
    print(f'C{n}H{2*n + 2}: {n_has_neg} / {n_iso} = {neg_iso[n]:.4f}')
    
fig, ax = plt.subplots(1, 1, figsize=(10, 5))

ax.plot(
    list(range(4, 12)),
    [neg_iso[n] for n in range(4, 12)],
    marker='o'
)
ax.set_xlabel('Number of carbon atoms')
ax.set_ylabel('Fraction of isomers with a negative silhouette score')
ax.set_title('Fraction of isomers with negative silhouette scores vs number of carbon atoms')

plt.tight_layout()
plt.savefig('fraction_of_negative_silhouette_scores_vs_num_carbon_atoms.webp', bbox_inches='tight')
Fraction of isomers with negative silhouette scores vs number of carbon atoms.

With the output

C4H10: 0 / 2 = 0.0000
C5H12: 1 / 3 = 0.3333
C6H14: 4 / 5 = 0.8000
C7H16: 7 / 9 = 0.7778
C8H18: 16 / 18 = 0.8889
C9H20: 32 / 35 = 0.9143
C10H22: 71 / 75 = 0.9467
C11H24: 155 / 159 = 0.9748

we see results that are fairly consistent with the paper’s. I’d say the high fractions are unsurprising when we consider that an isomer is counted as soon as a single one of its samples has a negative silhouette score; that’s a really stringent criterion.

Figure 6b - Fraction of Samples with Negative Silhouette Scores

neg_confs = {}
for n in range(4, 12):
    n_iso = s_scores[n].shape[0] // 1000
    neg_confs[n] = np.zeros(n_iso)
    for i in range(n_iso):
        isomer_scores = s_scores[n][i * 1000:(i + 1) * 1000]
        neg_confs[n][i] = np.sum(isomer_scores < 0) / isomer_scores.shape[0]
    print(f'C{n}H{2*n + 2}: {neg_confs[n].mean():.4f} +/- {neg_confs[n].std():.4f} ({neg_confs[n].min():.4f}, {neg_confs[n].max():.4f})')
    
fig, ax = plt.subplots(1, 1, figsize=(10, 5))

for n in range(4, 12):
    ax.plot(
        [n] * neg_confs[n].shape[0] + np.random.normal(0, 0.1, size=neg_confs[n].shape[0]),
        neg_confs[n],
        marker='o',
        linestyle='None',
        alpha=0.3
    )
    
    # plot max and min
    ax.plot(
        [n - 0.5, n + 0.5],
        [np.min(neg_confs[n]), np.min(neg_confs[n])],
        'k-',
        alpha=0.5
    )
    ax.plot(
        [n - 0.5, n + 0.5],
        [np.max(neg_confs[n]), np.max(neg_confs[n])],
        'k-',
        alpha=0.5
    )
    
ax.set_xlabel('Number of carbon atoms')
ax.set_ylabel('Fraction of negative silhouette scores')
ax.set_title('Fraction of negative silhouette scores vs number of carbon atoms')

plt.tight_layout()
plt.savefig('fraction_of_negative_silhouette_scores_vs_num_carbon_atoms_individual.webp', bbox_inches='tight')
Fraction of samples with negative silhouette scores vs number of carbon atoms. Each point represents an isomer, and the horizontal lines mark the min and max fraction across isomers for each $n$.

The per-isomer summary statistics (mean +/- std of the fraction of negative scores, with the min and max in parentheses):

C4H10: 0.0000 +/- 0.0000 (0.0000, 0.0000)
C5H12: 0.0133 +/- 0.0189 (0.0000, 0.0400)
C6H14: 0.0812 +/- 0.1455 (0.0000, 0.3720)
C7H16: 0.0367 +/- 0.0433 (0.0000, 0.1230)
C8H18: 0.1232 +/- 0.1463 (0.0000, 0.6230)
C9H20: 0.1918 +/- 0.1605 (0.0000, 0.7830)
C10H22: 0.2836 +/- 0.2323 (0.0000, 0.9950)
C11H24: 0.3504 +/- 0.2397 (0.0000, 0.9950)

The only thing I find significantly inconsistent with the paper’s results is the non-monotonic behavior of the fraction of negative silhouette scores: for whatever reason, heptane (C7H16) scores lower than hexane (C6H14). This shouldn’t be the case, and might point to an under-sampling of the conformational space of heptane.

Nevertheless, the general trend of increasing fraction of isomers with negative silhouette scores is consistent with the paper’s results. And we see that by the time we reach $n=11$, the fraction of isomers with negative silhouette scores is quite high. A non-negligible fraction of isomers have a majority of their samples with negative silhouette scores, which means unsupervised methods will incorrectly classify them.

With that confirmed, it’s no wonder the paper was unable to produce clusters of any meaning for $n \geq 10$.

Conclusion

We’ve used two unsupervised learning methods to measure the separability of the Coulomb matrix eigenvalues for the constitutional isomers of alkanes. The verdict is that the isomers do not look sufficiently separable in the eigenvalue space for unsupervised learning to work well. Both the Dunn Index and the silhouette score indicate that the data is not well separated, and that unsupervised methods will misclassify a non-negligible fraction of the isomers, especially for $n \geq 10$.

This is a bit of a bummer, but it’s also a good result as it’s what the paper found. In the next post, we’ll do some supervised learning to see how well we can classify the isomers with the Coulomb matrix eigenvalues. Stay tuned!
