Introduction

In my previous post on SMILES, I built a tool to visualize molecular structures. For generative models, however, SMILES has a weakness: randomly generated or mutated strings are often invalid. This is where SELFIES (Self-referencIng Embedded Strings) comes in. It is a representation designed to be 100% robust: every SELFIES string corresponds to a valid molecular graph.

Just like with SMILES, I often find myself needing to see what a SELFIES string represents without decoding it mentally. The solution is similar: a Python script using RDKit and PIL, but with the specific selfies library handling the decoding. Here is a clean reference implementation.

What Are SELFIES?

SELFIES is a string-based representation for molecules that is robust to mutation and crossover. Unlike SMILES, where a misplaced character can break the syntax (e.g., an unclosed ring or invalid valence), SELFIES is defined such that no invalid strings exist. This makes it ideal for machine learning and evolutionary algorithms (though recent research suggests invalid SMILES might actually be beneficial in some contexts).

For example:

  • [C][C][O] is Ethanol
  • [C][=C][C][=C][C][=C][Ring1][=Branch1] is Benzene
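
Each bracketed group is one SELFIES symbol, so a string is really a sequence of tokens. As a minimal sketch, the symbols can be pulled apart with a regular expression. Note that split_symbols is a hypothetical helper written for illustration; the selfies library provides its own sf.split_selfies for the same job.

```python
import re

# One SELFIES symbol is a bracketed group such as [C], [=C] or [Ring1].
SYMBOL_PATTERN = re.compile(r"\[[^\[\]]*\]")


def split_symbols(selfies_string):
    """Return the bracketed symbols of a SELFIES string, in order."""
    return SYMBOL_PATTERN.findall(selfies_string)


ethanol = split_symbols("[C][C][O]")
benzene = split_symbols("[C][=C][C][=C][C][=C][Ring1][=Branch1]")

print(ethanol)       # ['[C]', '[C]', '[O]']
print(len(benzene))  # 8
```

Counting symbols this way also makes the verbosity visible: benzene takes eight symbols here, two of which describe the ring closure rather than atoms.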

Building the Solution

I’ll build a tool similar to the SMILES converter, adding the selfies library to the stack.

Core Dependencies

import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw, rdDepictor, rdMolDescriptors
from PIL import Image, ImageDraw, ImageFont

We use selfies to decode the input into a SMILES string, which RDKit then parses into a molecule.

The Main Conversion Function

The logic mirrors the SMILES converter, with an initial decoding step:

def selfies_to_png(selfies_input, output_file, size=500):
    """Generates a 2D molecule image with a chemical formula legend from SELFIES."""
    # Decode SELFIES to SMILES first
    try:
        smiles = sf.decoder(selfies_input)
    except Exception as e:
        raise ValueError(f"Invalid SELFIES string: {selfies_input}") from e

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not generate molecule from decoded SMILES: {smiles}")

    # Generate 2D coordinates and formula
    rdDepictor.Compute2DCoords(mol)
    formula = rdMolDescriptors.CalcMolFormula(mol)
    
    # Render the molecule
    img = Draw.MolToImage(mol, size=(size, size)).convert("RGBA")

    # Create a canvas with extra space at the bottom for the legend
    legend_height = int(size * 0.1)
    canvas = Image.new("RGBA", (size, size + legend_height), "white")
    canvas.paste(img, (0, 0))
    
    draw = ImageDraw.Draw(canvas)
    
    # Define dynamic font sizes
    font_reg = get_font(int(size * 0.03))
    font_sub = get_font(int(size * 0.02))
    
    # Draw the legend
    x = int(size * 0.02)
    y = size + int(size * 0.02)
    
    # Draw "Formula: " label
    draw.text((x, y), "Formula: ", fill="black", font=font_reg)
    x += draw.textlength("Formula: ", font=font_reg)
    
    # Draw formula with subscript handling for numbers
    for char in formula:
        if char.isdigit():
            font = font_sub
            y_offset = int(size * 0.005)
        else:
            font = font_reg
            y_offset = 0
            
        draw.text((x, y + y_offset), char, fill="black", font=font)
        x += draw.textlength(char, font=font)

    # Draw original SELFIES string
    draw.text((x, y), f" | SELFIES: {selfies_input}", fill="black", font=font_reg)

    canvas.save(output_file)
    print(f"Saved: {output_file}")
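
The subscript loop in the function above treats every digit as a subscript. The same rule can be isolated and checked without drawing anything. The sketch below groups a formula into runs of letters and digits. Here formula_runs is a hypothetical helper, not part of the script, and it assumes a neutral formula, since a charge suffix would need separate handling.

```python
from itertools import groupby


def formula_runs(formula):
    """Split a formula into (text, is_subscript) runs; digits are subscripts."""
    return [
        ("".join(chars), is_digit)
        for is_digit, chars in groupby(formula, key=str.isdigit)
    ]


print(formula_runs("C2H6O"))
# [('C', False), ('2', True), ('H', False), ('6', True), ('O', False)]
```

Drawing one run at a time instead of one character at a time would keep multi-digit counts such as the 10 in C10H22 together as a single piece of text.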

Font Handling

We use the same robust font helper:

def get_font(size, font_name="arial.ttf"):
    """Attempts to load a TTF font, falls back to default if unavailable."""
    try:
        return ImageFont.truetype(font_name, size)
    except OSError:
        # Font file not found or unreadable; use Pillow's built-in default
        return ImageFont.load_default()
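
A font named arial.ttf is typically available on Windows but is often missing on Linux, in which case the helper drops straight to the default font. One possible extension, sketched here under that assumption, tries several candidate names first. The loader is passed in as an argument so the logic can be exercised without Pillow; in the script it would be ImageFont.truetype. The names first_loadable and fake_loader, and the candidate list, are illustrative only.

```python
def first_loadable(candidates, size, loader):
    """Return the first font that loader(name, size) can open, else None."""
    for name in candidates:
        try:
            return loader(name, size)
        except OSError:
            continue  # font file not found; try the next candidate
    return None


# Stand-in loader for demonstration: only "DejaVuSans.ttf" is "installed".
def fake_loader(name, size):
    if name != "DejaVuSans.ttf":
        raise OSError(f"cannot open resource: {name}")
    return (name, size)


print(first_loadable(["arial.ttf", "DejaVuSans.ttf"], 15, fake_loader))
# ('DejaVuSans.ttf', 15)
```

Inside get_font, this could be used as first_loadable(names, size, ImageFont.truetype) or ImageFont.load_default(), which keeps the existing fallback as the last resort.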

Examples in Action

Let’s see how SELFIES strings look when converted.

Simple Molecules

Ethanol from SELFIES
Ethanol ([C][C][O]): A simple two-carbon alcohol. Note how much more verbose SELFIES is compared to the SMILES CCO.

Aromatic Compounds

Benzene from SELFIES
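Benzene ([C][=C][C][=C][C][=C][Ring1][=Branch1]): The classic aromatic ring, written here with alternating single and double bonds. The [Ring1] token indicates a ring closure, and the [=Branch1] token that follows it is not a branch here: it is read as an index that encodes how far back the ring bond reaches.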

Complex Pharmaceuticals

Aspirin from SELFIES
Aspirin: A complex molecule showing how SELFIES handles branching and functional groups.

Command-Line Interface

The script allows for quick command-line usage:

# Basic usage
python selfies2png.py "[C][C][O]"

# Specify output filename
python selfies2png.py "[C][C][O]" ethanol.png
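
The script reads sys.argv directly. If more options are wanted later, for instance the image size that selfies_to_png already accepts, the standard argparse module is one way to get there. This is a sketch and not the script's actual interface: the --size flag is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the positional arguments of selfies2png.py."""
    parser = argparse.ArgumentParser(
        prog="selfies2png.py",
        description="Render a SELFIES string as a PNG image.",
    )
    parser.add_argument("selfies", help="SELFIES string, e.g. '[C][C][O]'")
    parser.add_argument(
        "output", nargs="?", default="molecule.png", help="output PNG filename"
    )
    # Hypothetical extra option; the original script always uses size=500.
    parser.add_argument("--size", type=int, default=500, help="image width in pixels")
    return parser


args = build_parser().parse_args(["[C][C][O]", "ethanol.png", "--size", "300"])
print(args.selfies, args.output, args.size)  # [C][C][O] ethanol.png 300
```

With this in place, the main block would call selfies_to_png(args.selfies, args.output, args.size), and argparse would produce the usage message on its own.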

Download the Complete Script

You can copy the complete selfies2png.py script below.

Installation and Setup

You will need the selfies library in addition to rdkit and pillow.

pip install rdkit pillow selfies
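
Note that Pillow is installed as pillow but imported as PIL. A quick way to confirm the environment is ready, sketched with the standard library only (missing_modules is a hypothetical helper):

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names, not pip package names: Pillow is imported as PIL.
required = ["rdkit", "PIL", "selfies"]
missing = missing_modules(required)
print("All dependencies found" if not missing else f"Missing: {missing}")
```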

Complete Script

import sys
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw, rdDepictor, rdMolDescriptors
from PIL import Image, ImageDraw, ImageFont

def get_font(size, font_name="arial.ttf"):
    """Attempts to load a TTF font, falls back to default if unavailable."""
    try:
        return ImageFont.truetype(font_name, size)
    except OSError:
        # Font file not found or unreadable; use Pillow's built-in default
        return ImageFont.load_default()

def selfies_to_png(selfies_input, output_file, size=500):
    """Generates a 2D molecule image with a chemical formula legend from SELFIES."""
    try:
        smiles = sf.decoder(selfies_input)
    except Exception as e:
        raise ValueError(f"Invalid SELFIES string: {selfies_input}") from e

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not generate molecule from decoded SMILES: {smiles}")

    # Generate 2D coordinates and formula
    rdDepictor.Compute2DCoords(mol)
    formula = rdMolDescriptors.CalcMolFormula(mol)
    
    # Render the molecule
    img = Draw.MolToImage(mol, size=(size, size)).convert("RGBA")

    # Create a canvas with extra space at the bottom for the legend
    legend_height = int(size * 0.1)
    canvas = Image.new("RGBA", (size, size + legend_height), "white")
    canvas.paste(img, (0, 0))
    
    draw = ImageDraw.Draw(canvas)
    
    # Define dynamic font sizes
    font_reg = get_font(int(size * 0.03))
    font_sub = get_font(int(size * 0.02))
    
    # Draw the legend
    x = int(size * 0.02)
    y = size + int(size * 0.02)
    
    # Draw "Formula: " label
    draw.text((x, y), "Formula: ", fill="black", font=font_reg)
    x += draw.textlength("Formula: ", font=font_reg)
    
    # Draw formula with subscript handling for numbers
    for char in formula:
        if char.isdigit():
            font = font_sub
            y_offset = int(size * 0.005)
        else:
            font = font_reg
            y_offset = 0
            
        draw.text((x, y + y_offset), char, fill="black", font=font)
        x += draw.textlength(char, font=font)

    # Draw original SELFIES string
    draw.text((x, y), f" | SELFIES: {selfies_input}", fill="black", font=font_reg)

    canvas.save(output_file)
    print(f"Saved: {output_file}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python selfies2png.py <SELFIES_STRING> [OUTPUT_FILENAME]")
        sys.exit(1)
        
    selfies_input = sys.argv[1]
    filename = sys.argv[2] if len(sys.argv) > 2 else "molecule.png"
    
    selfies_to_png(selfies_input, filename)

Quick Start

  1. Save the script as selfies2png.py
  2. Install dependencies: pip install rdkit pillow selfies
  3. Run: python selfies2png.py "[C][C][O]" ethanol.png