ZINC-22: Multi-Billion Scale Database
Dataset Details
Authors: Benjamin I. Tingle, Khanh G. Tang, Mar Castanon, John J. Gutierrez, Munkhzul Khurelbaatar, Chinzorig Dandarchuluun, Yurii S. Moroz, John J. Irwin
Paper Title: ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery
Institutions: University of California, San Francisco; Taras Shevchenko National University of Kyiv; Chemspace LLC
Published In: Journal of Chemical Information and Modeling
Category: Computational Chemistry
Format: SDF, mol2, db2, pdbqt, vendor catalogs
Size:
  • Molecules (2D): 37,200,000,000+
  • Molecules (3D): 4,500,000,000+
  • Scaffolds: 96,300,000+ (Bemis-Murcko)
Date: September 2025
ZINC-22’s Tranche Browser displaying the distribution of 37.2 billion molecules organized by heavy atom count and lipophilicity (LogP)

Key Contribution

ZINC-22 addresses the infrastructure challenges of managing multi-billion-scale libraries of make-on-demand compounds with three components: a federated database architecture, the CartBlanche web interface, and cloud distribution. Together these make virtual screening practical at the scale of tens of billions of molecules.

Overview

ZINC-22 is the world’s largest freely available database of commercially available compounds for virtual screening. It contains over 37 billion make-on-demand molecules, a massive expansion over its predecessor ZINC-20, and pairs them with billion-scale similarity and substructure search and cloud-scale distribution. It also provides ready-to-dock 3D conformations for 4.5 billion molecules, complete with pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.

Strengths

  • Unprecedented scale: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)
  • Federated architecture: Supports asynchronous building and horizontal scaling to trillion-molecule growth
  • User-friendly access: CartBlanche GUI provides a shopping cart metaphor for compound acquisition
  • Privacy protection: Dual public/private server clusters protect patentability of undisclosed catalogs
  • Chemical diversity: Scaffold count keeps growing with database size (the paper reports roughly a tenfold increase in Bemis-Murcko scaffolds for every hundredfold increase in molecules), with 96.3M+ unique scaffolds
  • Ready-to-dock: 3D models include pre-calculated charges, protonation states, and solvation energies
  • Cloud distribution: Available via AWS Open Data, Oracle OCI, and UCSF servers
  • Advanced search: SmallWorld (similarity) and Arthor (substructure) tools for billion-scale searches
  • Organized access: Tranche system enables targeted selection of chemical space
  • Open access: Entire database freely available to academic and commercial users

Limitations

  • Data transfer bottlenecks: 4.5B 3D files require ~1 Petabyte of storage, creating significant download challenges
  • Search limits: Interactive search capped at 20,000 hits for server performance
  • Download workflow: Individual 3D molecule downloads no longer available directly; requires rebuilding via TLDR tool
  • Vendor updates: Difficulty removing discontinued vendor molecules due to federated structure
  • Compute requirements: Cloud-based computation nearly mandatory due to massive file sizes
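
To make the data-transfer limitation concrete, here is a back-of-envelope calculation from the figures quoted above. The decimal petabyte and the sustained 1 Gbit/s link are assumptions chosen for illustration, not values from the source.

```python
# Back-of-envelope numbers for the 3D library, using the figures quoted above.
# Assumptions (not from the source): a decimal petabyte (10**15 bytes) and a
# sustained 1 Gbit/s link with no protocol overhead.

TOTAL_BYTES = 10**15             # ~1 PB of 3D files
N_MOLECULES_3D = 4_500_000_000   # 4.5 billion 3D molecules
LINK_BITS_PER_SECOND = 10**9     # hypothetical 1 Gbit/s connection


def bytes_per_molecule(total_bytes: int, n_molecules: int) -> float:
    """Average storage footprint of one 3D molecule, in bytes."""
    return total_bytes / n_molecules


def transfer_days(total_bytes: int, bits_per_second: int) -> float:
    """Days needed to move total_bytes over a link of the given speed."""
    seconds = total_bytes * 8 / bits_per_second
    return seconds / 86_400


avg_kb = bytes_per_molecule(TOTAL_BYTES, N_MOLECULES_3D) / 1000
days = transfer_days(TOTAL_BYTES, LINK_BITS_PER_SECOND)
print(f"average size per 3D molecule: ~{avg_kb:.0f} kB")   # ~222 kB
print(f"full download at 1 Gbit/s: ~{days:.0f} days")      # ~93 days
```

Under these assumptions a full mirror takes about three months of uninterrupted transfer, which is why tranche-level downloads or in-cloud computation are the practical options.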

Technical Notes

Hardware & Software

Compute infrastructure:

  • 1,700 cores across 14 computers for parallel processing
  • 174 independent PostgreSQL 12.0 databases (110 ‘Sn’ for ZINC-ID, 64 ‘Sb’ for Supplier Codes)
  • Distributed across Amazon AWS, Oracle OCI, and UCSF servers
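
The idea of spreading lookups over many independent databases can be sketched as follows. The source does not describe how ZINC-22 actually routes a key to a database, so the hash-based rule, the `sn_###`/`sb_###` names, and the example keys below are all hypothetical stand-ins; only the shard counts (110 and 64) come from the text.

```python
import hashlib

N_SN_SHARDS = 110  # 'Sn' databases, keyed by ZINC ID (count from the text)
N_SB_SHARDS = 64   # 'Sb' databases, keyed by supplier code (count from the text)


def shard_index(key: str, n_shards: int) -> int:
    """Map a key to a shard with a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def route(key: str, kind: str) -> str:
    """Return a hypothetical database name such as 'sn_042' or 'sb_007'."""
    if kind == "zinc_id":
        return f"sn_{shard_index(key, N_SN_SHARDS):03d}"
    if kind == "supplier_code":
        return f"sb_{shard_index(key, N_SB_SHARDS):03d}"
    raise ValueError(f"unknown key kind: {kind!r}")


# Made-up keys, for illustration only.
print(route("EXAMPLE-ZINC-ID-0001", "zinc_id"))
print(route("EXAMPLE-VENDOR-CODE-42", "supplier_code"))
```

Because each key resolves to exactly one database, lookups touch a single server, and databases can be built or added independently of one another.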

Software stack:

  • PostgreSQL 12.2
  • Python 3.6.8
  • RDKit 2020.03
  • Celery task queue with Redis for background processing
  • All code available on GitHub: docking-org/zinc22-2d, zinc22-3d

Data Organization & Access

Tranche system: Molecules organized into “Tranches” based on 4 dimensions:

  1. Heavy Atom Count
  2. Lipophilicity (LogP)
  3. Charge
  4. File Format

This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.
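
A minimal sketch of tranche-style binning, covering only the first two dimensions (heavy atom count and LogP). The label format, the 0.1-wide LogP bins, and the example molecules are assumptions made for illustration and should not be read as ZINC-22's exact naming scheme.

```python
import math
from collections import defaultdict


def tranche_label(heavy_atoms: int, logp: float) -> str:
    """Build an illustrative label, e.g. 'H24P340' for 24 heavy atoms, LogP 3.4x."""
    logp_bin = math.floor(logp * 10)       # assumed 0.1-wide LogP bins
    sign = "P" if logp_bin >= 0 else "M"   # P = non-negative bin, M = negative bin
    return f"H{heavy_atoms:02d}{sign}{abs(logp_bin) * 10:03d}"


def group_by_tranche(molecules):
    """Group (name, heavy_atoms, logp) records by their tranche label."""
    tranches = defaultdict(list)
    for name, heavy_atoms, logp in molecules:
        tranches[tranche_label(heavy_atoms, logp)].append(name)
    return dict(tranches)


# Hypothetical molecules with precomputed properties.
molecules = [
    ("mol_a", 24, 3.47),
    ("mol_b", 24, 3.42),
    ("mol_c", 17, -1.23),
]
print(group_by_tranche(molecules))
```

Selecting a chemical neighborhood then reduces to choosing a set of labels, so only the files for those bins need to be downloaded.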

Search infrastructure:

  • SmallWorld: Ultra-fast similarity searching using Graph Edit Distance
  • Arthor: Substructure and pattern matching
  • CartBlanche: Web interface wrapping search tools with shopping cart functionality

3D Generation Pipeline

The 3D database construction pipeline involves multiple specialized tools:

  1. ChemAxon JChem: Protonation state generation at physiological pH
  2. Corina: Initial 3D structure generation
  3. Omega: Conformation sampling
  4. AMSOL 7.1: Calculation of atomic partial charges and desolvation energies
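
The ordering of the four steps above can be sketched as a simple stage pipeline. The stages here are placeholders that only record their names; they do not call ChemAxon JChem, Corina, Omega, or AMSOL, whose command lines and APIs are not described in the source. The `Molecule` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Molecule:
    smiles: str
    history: List[str] = field(default_factory=list)  # stages applied so far


def make_stage(name: str) -> Callable[[Molecule], Molecule]:
    """Build a placeholder stage that only records its name on the molecule."""
    def stage(mol: Molecule) -> Molecule:
        mol.history.append(name)
        return mol
    return stage


# Stage order follows the list above; the bodies are stand-ins for the real tools.
PIPELINE = [
    make_stage("protonate (ChemAxon JChem)"),
    make_stage("build initial 3D structure (Corina)"),
    make_stage("sample conformations (Omega)"),
    make_stage("compute charges and desolvation (AMSOL 7.1)"),
]


def build_3d(smiles: str) -> Molecule:
    """Run every stage in order and return the annotated molecule."""
    mol = Molecule(smiles)
    for stage in PIPELINE:
        mol = stage(mol)
    return mol


result = build_3d("CCO")
for step in result.history:
    print(step)
```

Keeping each tool behind its own stage function makes it straightforward to rerun or replace a single step without touching the others.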

Chemical Diversity Analysis

Analysis using Bemis-Murcko scaffolds shows that chemical diversity keeps growing with database size: the paper reports roughly a one-log (tenfold) increase in scaffolds for every two-log (hundredfold) increase in molecules. The analyzed 4.5 billion molecule subset contains 96.3M+ unique scaffolds, so even at this scale ZINC-22 continues to add new chemotypes instead of only analogs of existing ones.
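
As a quick check on the totals quoted above, the average number of molecules per scaffold can be computed directly. This is an average over the whole analyzed subset, not the marginal rate at which new scaffolds appear as the database grows, and it treats "96.3M+" as a lower bound.

```python
N_MOLECULES = 4_500_000_000   # analyzed subset (from the text)
N_SCAFFOLDS = 96_300_000      # unique Bemis-Murcko scaffolds, lower bound ("96.3M+")

molecules_per_scaffold = N_MOLECULES / N_SCAFFOLDS
scaffold_fraction = N_SCAFFOLDS / N_MOLECULES

print(f"~{molecules_per_scaffold:.0f} molecules per scaffold on average")   # ~47
print(f"scaffolds are ~{scaffold_fraction:.1%} of the molecule count")      # ~2.1%
```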

Vendor Integration

ZINC-22 emphasizes commercially available compounds from large make-on-demand catalogs:

  • Enamine REAL: Multi-billion compound virtual library
  • WuXi GalaXi: Extensive make-on-demand catalog
  • Mcule: Diverse purchasable compound collection

This focus on purchasable molecules distinguishes ZINC-22 from theoretical chemical space databases.

Dataset Information

Format

SDF, mol2, db2, pdbqt, vendor catalogs

Size

Type | Count
Molecules (2D) | 37,200,000,000+
Molecules (3D) | 4,500,000,000+
Scaffolds (Bemis-Murcko) | 96,300,000+

Dataset Examples

ZINC-22’s 2D Tranche Browser showing the organization of 37.2 billion molecules by physicochemical properties

Dataset Subsets

Subset | Count | Description
2D Database | 37B+ | Complete 2D chemical structures from make-on-demand catalogs (Enamine, WuXi, Mcule)
3D Database | 4.5B+ | Ready-to-dock 3D conformations with pre-calculated charges and solvation energies
Custom Tranches | Variable | User-selected molecular subsets via Tranche Browser (e.g., lead-like, fragment-like)

Dataset | Relationship | Link
ZINC-20 | Predecessor | N/A
Enamine REAL | Source Catalog | N/A
WuXi GalaXi | Source Catalog | N/A

Citation

If you use this dataset, please cite:

https://doi.org/10.1021/acs.jcim.2c01253