Introduction
In my previous post on SMILES, I built a tool to visualize molecular structures. For generative models, however, SMILES has a weakness: random strings are often invalid. This is where SELFIES (SELF-referencIng Embedded Strings) comes in. It is a representation designed to be 100% robust: every SELFIES string corresponds to a valid molecular graph.
Just like with SMILES, I often find myself needing to see what a SELFIES string represents without decoding it mentally.
The solution is similar: a Python script using RDKit and PIL, with the selfies library handling the decoding step.
Here is a clean reference implementation.
What Are SELFIES?
SELFIES is a string-based representation for molecules that is robust to mutation and crossover. Unlike SMILES, where a misplaced character can break the syntax (e.g., an unclosed ring or invalid valence), SELFIES is defined such that no invalid strings exist. This makes it ideal for machine learning and evolutionary algorithms (though recent research suggests invalid SMILES might actually be beneficial in some contexts).
For example:
- [C][C][O] is Ethanol
- [C][=C][C][=C][C][=C][Ring1][=Branch1] is Benzene
Building the Solution
I’ll build a tool similar to the SMILES converter, adding the selfies library to the stack.
Core Dependencies
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw, rdDepictor, rdMolDescriptors
from PIL import Image, ImageDraw, ImageFont
We use selfies to decode the string into a SMILES string, which RDKit then processes.
The Main Conversion Function
The logic mirrors the SMILES converter, with an initial decoding step:
def selfies_to_png(selfies_input, output_file, size=500):
    """Generates a 2D molecule image with a chemical formula legend from SELFIES."""
    # Decode SELFIES to SMILES first
    try:
        smiles = sf.decoder(selfies_input)
    except Exception as e:
        raise ValueError(f"Invalid SELFIES string: {selfies_input}") from e

    mol = Chem.MolFromSmiles(smiles)
    if not mol:
        raise ValueError(f"Could not generate molecule from decoded SMILES: {smiles}")

    # Generate 2D coordinates and formula
    rdDepictor.Compute2DCoords(mol)
    formula = rdMolDescriptors.CalcMolFormula(mol)

    # Render the molecule
    img = Draw.MolToImage(mol, size=(size, size)).convert("RGBA")

    # Create a canvas with extra space at the bottom for the legend
    legend_height = int(size * 0.1)
    canvas = Image.new("RGBA", (size, size + legend_height), "white")
    canvas.paste(img, (0, 0))
    draw = ImageDraw.Draw(canvas)

    # Define dynamic font sizes
    font_reg = get_font(int(size * 0.03))
    font_sub = get_font(int(size * 0.02))

    # Draw the legend
    x = int(size * 0.02)
    y = size + int(size * 0.02)

    # Draw "Formula: " label
    draw.text((x, y), "Formula: ", fill="black", font=font_reg)
    x += draw.textlength("Formula: ", font=font_reg)

    # Draw formula with subscript handling for numbers
    for char in formula:
        if char.isdigit():
            font = font_sub
            y_offset = int(size * 0.005)
        else:
            font = font_reg
            y_offset = 0
        draw.text((x, y + y_offset), char, fill="black", font=font)
        x += draw.textlength(char, font=font)

    # Draw original SELFIES string
    draw.text((x, y), f" | SELFIES: {selfies_input}", fill="black", font=font_reg)

    canvas.save(output_file)
    print(f"Saved: {output_file}")
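The subscript loop in the function above decides character by character whether to use the smaller font. The same decision can be sketched without any drawing, which makes it easy to inspect. formula_runs is a hypothetical helper, not part of the script; it groups consecutive characters so that multi-digit counts stay together.

```python
from itertools import groupby


def formula_runs(formula):
    """Split a formula such as 'C2H6O' into (text, is_subscript) runs (hypothetical helper)."""
    return [
        ("".join(chars), is_digit)
        for is_digit, chars in groupby(formula, key=str.isdigit)
    ]


print(formula_runs("C2H6O"))
# [('C', False), ('2', True), ('H', False), ('6', True), ('O', False)]
print(formula_runs("C10H22"))
# [('C', False), ('10', True), ('H', False), ('22', True)]
```

One limitation carries over from the original loop: every digit is treated as a subscript, so if a formula string carries a charge suffix that contains a digit, that digit would be lowered as well.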
Font Handling
We use the same robust font helper:
def get_font(size, font_name="arial.ttf"):
    """Attempts to load a TTF font, falls back to default if unavailable."""
    try:
        return ImageFont.truetype(font_name, size)
    except IOError:
        return ImageFont.load_default()
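On systems without arial.ttf, the helper above falls back to Pillow's built-in font, and called without arguments that fallback does not honor the requested size, so the regular and subscript fonts end up looking the same. A possible variation, sketched under assumptions: the candidate list (including DejaVuSans.ttf) is a guess about what is installed, and the size argument to load_default is, to my knowledge, only available in Pillow 10.1 and later.

```python
from PIL import ImageFont


def get_font_with_fallbacks(size, candidates=("arial.ttf", "DejaVuSans.ttf")):
    """Try several TTF names in order, then Pillow's default font (hypothetical helper)."""
    for name in candidates:
        try:
            return ImageFont.truetype(name, size)
        except (OSError, ImportError):
            continue  # font not found, or FreeType support missing
    try:
        # Assumption: Pillow >= 10.1 accepts a size for the default font.
        return ImageFont.load_default(size=size)
    except (TypeError, OSError, ImportError):
        # Older Pillow: fixed-size built-in bitmap font.
        return ImageFont.load_default()


font = get_font_with_fallbacks(15)
print(type(font).__name__)
```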
Examples in Action
Let’s see how SELFIES strings look when converted.
Simple Molecules
Ethanol, [C][C][O], decodes to the SMILES string CCO.
Aromatic Compounds
In the benzene string, the [Ring1] token indicates a ring closure.
Complex Pharmaceuticals
Command-Line Interface
The script allows for quick command-line usage:
# Basic usage
python selfies2png.py "[C][C][O]"
# Specify output filename
python selfies2png.py "[C][C][O]" ethanol.png
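Note that the quotes around the SELFIES string matter, because square brackets are glob characters in most shells. If the positional sys.argv handling ever needs to grow, argparse is one option. The sketch below keeps the same two positional arguments and the molecule.png default from the script; the --size flag is a hypothetical addition that the original script does not have.

```python
import argparse


def build_parser():
    """Build a parser mirroring the script's positional arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="selfies2png.py",
        description="Render a SELFIES string to a PNG image.",
    )
    parser.add_argument("selfies", help="SELFIES string, e.g. '[C][C][O]'")
    parser.add_argument("output", nargs="?", default="molecule.png",
                        help="output PNG filename")
    # Hypothetical flag: the original script always uses size=500.
    parser.add_argument("--size", type=int, default=500,
                        help="image width and height in pixels")
    return parser


args = build_parser().parse_args(["[C][C][O]", "ethanol.png", "--size", "800"])
print(args.selfies, args.output, args.size)
```

The parsed values could then be passed on as selfies_to_png(args.selfies, args.output, size=args.size).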
Download the Complete Script
You can copy the complete selfies2png.py script below.
Installation and Setup
You will need the selfies library in addition to rdkit and pillow.
pip install rdkit pillow selfies
Complete Script
import sys

import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw, rdDepictor, rdMolDescriptors
from PIL import Image, ImageDraw, ImageFont


def get_font(size, font_name="arial.ttf"):
    """Attempts to load a TTF font, falls back to default if unavailable."""
    try:
        return ImageFont.truetype(font_name, size)
    except IOError:
        return ImageFont.load_default()


def selfies_to_png(selfies_input, output_file, size=500):
    """Generates a 2D molecule image with a chemical formula legend from SELFIES."""
    try:
        smiles = sf.decoder(selfies_input)
    except Exception as e:
        raise ValueError(f"Invalid SELFIES string: {selfies_input}") from e

    mol = Chem.MolFromSmiles(smiles)
    if not mol:
        raise ValueError(f"Could not generate molecule from decoded SMILES: {smiles}")

    # Generate 2D coordinates and formula
    rdDepictor.Compute2DCoords(mol)
    formula = rdMolDescriptors.CalcMolFormula(mol)

    # Render the molecule
    img = Draw.MolToImage(mol, size=(size, size)).convert("RGBA")

    # Create a canvas with extra space at the bottom for the legend
    legend_height = int(size * 0.1)
    canvas = Image.new("RGBA", (size, size + legend_height), "white")
    canvas.paste(img, (0, 0))
    draw = ImageDraw.Draw(canvas)

    # Define dynamic font sizes
    font_reg = get_font(int(size * 0.03))
    font_sub = get_font(int(size * 0.02))

    # Draw the legend
    x = int(size * 0.02)
    y = size + int(size * 0.02)

    # Draw "Formula: " label
    draw.text((x, y), "Formula: ", fill="black", font=font_reg)
    x += draw.textlength("Formula: ", font=font_reg)

    # Draw formula with subscript handling for numbers
    for char in formula:
        if char.isdigit():
            font = font_sub
            y_offset = int(size * 0.005)
        else:
            font = font_reg
            y_offset = 0
        draw.text((x, y + y_offset), char, fill="black", font=font)
        x += draw.textlength(char, font=font)

    # Draw original SELFIES string
    draw.text((x, y), f" | SELFIES: {selfies_input}", fill="black", font=font_reg)

    canvas.save(output_file)
    print(f"Saved: {output_file}")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python selfies2png.py <SELFIES_STRING> [OUTPUT_FILENAME]")
        sys.exit(1)

    selfies_input = sys.argv[1]
    filename = sys.argv[2] if len(sys.argv) > 2 else "molecule.png"
    selfies_to_png(selfies_input, filename)
Quick Start
- Save the script as selfies2png.py
- Install dependencies: pip install rdkit pillow selfies
- Run: python selfies2png.py "[C][C][O]" ethanol.png
