Key Contribution
ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through a federated database architecture, intuitive CartBlanche web interface, and cloud distribution systems that enable modern virtual screening at unprecedented scale.
Overview
ZINC-22 is the world’s largest freely available database of commercially available compounds for virtual screening, and a large expansion of its predecessor, ZINC-20. It holds over 37 billion make-on-demand molecules in searchable 2D form, served through billion-scale similarity and substructure search and cloud-hosted downloads. Ready-to-dock 3D conformations are provided for 4.5 billion molecules, complete with pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.
Strengths
- Unprecedented scale: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)
- Federated architecture: Supports asynchronous building and scales horizontally, with room to grow toward trillions of molecules
- User-friendly access: CartBlanche GUI provides a shopping cart metaphor for compound acquisition
- Privacy protection: Dual public/private server clusters protect patentability of undisclosed catalogs
- Chemical diversity: The number of unique Bemis-Murcko scaffolds keeps growing with database size, reaching 96.3M+ in the analyzed 4.5 billion molecule subset
- Ready-to-dock: 3D models include pre-calculated charges, protonation states, and solvation energies
- Cloud distribution: Available via AWS Open Data, Oracle OCI, and UCSF servers
- Advanced search: SmallWorld (similarity) and Arthor (substructure) tools for billion-scale searches
- Organized access: Tranche system enables targeted selection of chemical space
- Open access: Entire database freely available to academic and commercial users
Limitations
- Data transfer bottlenecks: 4.5B 3D files require ~1 Petabyte of storage, creating significant download challenges
- Search limits: Interactive search capped at 20,000 hits for server performance
- Download workflow: Individual 3D molecules can no longer be downloaded directly; they must be rebuilt with the TLDR tool
- Vendor updates: Discontinued vendor molecules are difficult to remove because of the federated structure
- Compute requirements: Cloud-based computation nearly mandatory due to massive file sizes
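To put the data transfer limitation in concrete terms, the following back-of-the-envelope sketch uses the two figures quoted above (4.5 billion 3D molecules, roughly 1 petabyte). The decimal petabyte and the link speeds are assumptions chosen for illustration, not values from the ZINC-22 paper.

```python
# Rough storage and transfer arithmetic for the 3D subset.
# Assumptions: 1 PB = 10**15 bytes (decimal), sustained link speeds are hypothetical.

PETABYTE = 10**15                 # bytes, decimal convention (assumption)
N_3D_MOLECULES = 4_500_000_000    # 3D molecules, from the text
TOTAL_BYTES = 1 * PETABYTE        # "~1 Petabyte", from the text

# Average storage footprint per molecule, in bytes.
bytes_per_molecule = TOTAL_BYTES / N_3D_MOLECULES


def transfer_days(total_bytes: float, link_gbit_per_s: float) -> float:
    """Days needed to move total_bytes over a link sustained at link_gbit_per_s."""
    seconds = total_bytes * 8 / (link_gbit_per_s * 1e9)
    return seconds / 86_400


if __name__ == "__main__":
    print(f"average size per molecule: {bytes_per_molecule / 1e3:.0f} kB")
    for speed in (1.0, 10.0):  # hypothetical sustained link speeds in Gbit/s
        print(f"{speed:>4.0f} Gbit/s -> {transfer_days(TOTAL_BYTES, speed):.1f} days")
```

Under these assumptions a full copy takes on the order of months over a 1 Gbit/s link, which is why selecting subsets or computing next to the data is the practical route.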
Technical Notes
Hardware & Software
Compute infrastructure:
- 1,700 cores across 14 computers for parallel processing
- 174 independent PostgreSQL databases (110 ‘Sn’ for ZINC-ID, 64 ‘Sb’ for Supplier Codes)
- Distributed across Amazon AWS, Oracle OCI, and UCSF servers
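The idea of spreading lookups over many independent databases can be sketched as a small routing function. This is a minimal illustration only: the hash-based assignment and the shard names (`sn000`, `sb000`, ...) are assumptions made for the example, and ZINC-22's actual partitioning scheme may differ. Only the shard counts (110 and 64) come from the text.

```python
import hashlib

N_SN_SHARDS = 110  # 'Sn' databases keyed by ZINC ID (count from the text)
N_SB_SHARDS = 64   # 'Sb' databases keyed by supplier code (count from the text)


def _stable_bucket(key: str, n_buckets: int) -> int:
    """Map a key to a bucket with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def route_zinc_id(zinc_id: str) -> str:
    """Return the (hypothetical) name of the Sn shard holding this ZINC ID."""
    return f"sn{_stable_bucket(zinc_id, N_SN_SHARDS):03d}"


def route_supplier_code(code: str) -> str:
    """Return the (hypothetical) name of the Sb shard holding this supplier code."""
    return f"sb{_stable_bucket(code, N_SB_SHARDS):03d}"


if __name__ == "__main__":
    # The identifiers below are made-up placeholders, not real catalog entries.
    print(route_zinc_id("ZINC-EXAMPLE-0001"))
    print(route_supplier_code("VENDOR-EXAMPLE-42"))
```

Because each key maps to exactly one shard, shards can be built, loaded, and queried independently, which is the property the federated design relies on.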
Software stack:
- PostgreSQL 12.2
- Python 3.6.8
- RDKit 2020.03
- Celery task queue with Redis for background processing
- All code available on GitHub: docking-org/zinc22-2d, zinc22-3d
Data Organization & Access
Tranche system: Molecules organized into “Tranches” based on 4 dimensions:
- Heavy Atom Count
- Lipophilicity (LogP)
- Charge
- File Format
This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.
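A tranche lookup of this kind can be sketched as binning each molecule by its properties and then selecting bins. The bin width, the "neutral lead-like" thresholds, and all names below are hypothetical choices for illustration, not ZINC-22's actual tranche definitions; the file format dimension is left out for brevity.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Tranche(NamedTuple):
    heavy_atoms: int  # heavy atom count
    logp_bin: int     # index of the logP bin (bin width set by logp_step)
    charge: int       # net formal charge


def assign_tranche(heavy_atoms: int, logp: float, charge: int,
                   logp_step: float = 0.1) -> Tranche:
    """Bin a molecule; rounding guards against float noise at bin edges."""
    logp_bin = math.floor(round(logp / logp_step, 9))
    return Tranche(heavy_atoms, logp_bin, charge)


def group_by_tranche(
    molecules: Iterable[Tuple[str, int, float, int]]
) -> Dict[Tranche, List[str]]:
    """Group (id, heavy_atoms, logp, charge) records by tranche."""
    groups: Dict[Tranche, List[str]] = defaultdict(list)
    for mol_id, hac, logp, charge in molecules:
        groups[assign_tranche(hac, logp, charge)].append(mol_id)
    return dict(groups)


def is_neutral_lead_like(t: Tranche) -> bool:
    """Hypothetical filter: charge 0, 17-25 heavy atoms, logP below 3.5."""
    return t.charge == 0 and 17 <= t.heavy_atoms <= 25 and t.logp_bin < 35


if __name__ == "__main__":
    # Made-up records: (id, heavy atom count, logP, charge).
    mols = [("m1", 20, 2.53, 0), ("m2", 31, 4.80, 0), ("m3", 18, 1.10, 1)]
    groups = group_by_tranche(mols)
    wanted = [ids for t, ids in groups.items() if is_neutral_lead_like(t)]
    print(wanted)
```

Selection happens on the tranche keys alone, so only the matching bins ever need to be downloaded.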
Search infrastructure:
- SmallWorld: Ultra-fast similarity searching using Graph Edit Distance
- Arthor: Substructure and pattern matching
- CartBlanche: Web interface wrapping search tools with shopping cart functionality
3D Generation Pipeline
The 3D database construction pipeline involves multiple specialized tools:
- ChemAxon JChem: Protonation state generation at physiological pH
- Corina: Initial 3D structure generation
- Omega: Conformation sampling
- AMSOL 7.1: Calculation of atomic partial charges and desolvation energies
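The way these tools chain together can be sketched as a sequence of stages, each taking a molecule record and returning an enriched copy. The stages below are stubs: they do not call JChem, Corina, Omega, or AMSOL, and the order simply follows the list above, which is an assumption about the real build scripts. All function and field names are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def make_stub(tool: str, produces: str) -> Stage:
    """Create a placeholder stage that records which tool would have run."""
    def stage(record: Record) -> Record:
        out = dict(record)  # copy so earlier records stay untouched
        out["steps"] = list(record.get("steps", [])) + [tool]
        out[produces] = f"<{produces} from {tool}>"  # placeholder, not real output
        return out
    return stage


# Order assumed from the tool list in the text.
PIPELINE: List[Stage] = [
    make_stub("JChem", "protomers"),
    make_stub("Corina", "initial_3d"),
    make_stub("Omega", "conformers"),
    make_stub("AMSOL", "charges_and_desolvation"),
]


def build_3d(smiles: str) -> Record:
    """Run one molecule through every stage in order."""
    record: Record = {"smiles": smiles, "steps": []}
    for stage in PIPELINE:
        record = stage(record)
    return record


if __name__ == "__main__":
    print(build_3d("CCO")["steps"])
```

Keeping each stage as an independent function mirrors how such pipelines are usually parallelized: molecules are independent of each other, so batches can be farmed out to many cores.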
Chemical Diversity Analysis
Analysis using Bemis-Murcko scaffolds indicates that chemical diversity keeps growing with database size, though more slowly than the database itself: the scaffold count rises by roughly one log unit for every two-log-unit increase in the number of molecules. The analyzed 4.5 billion molecule subset contains more than 96.3 million unique scaffolds, about one scaffold per 47 molecules. Even at this scale, ZINC-22 continues to add new chemotypes rather than only redundant structures.
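A scaffold growth curve of this kind can be computed by counting distinct scaffolds as molecules are added. The sketch below assumes the scaffold strings have already been generated (for example with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`) and works on plain strings; the function names are hypothetical and this is not the analysis code used for ZINC-22.

```python
import math
from typing import Iterable, List, Tuple


def scaffold_growth(scaffolds: Iterable[str], every: int = 1) -> List[Tuple[int, int]]:
    """Return (molecules seen, unique scaffolds seen) sampled every `every` molecules."""
    seen = set()
    curve: List[Tuple[int, int]] = []
    n = 0
    for n, scaffold in enumerate(scaffolds, start=1):
        seen.add(scaffold)
        if n % every == 0:
            curve.append((n, len(seen)))
    if n and n % every != 0:
        curve.append((n, len(seen)))  # always include the final point
    return curve


def loglog_slope(p1: Tuple[int, int], p2: Tuple[int, int]) -> float:
    """Slope between two curve points on log-log axes.

    A slope of 1 means scaffolds grow in proportion to molecules;
    a slope below 1 means diversity grows more slowly than the database.
    """
    (n1, s1), (n2, s2) = p1, p2
    return math.log(s2 / s1) / math.log(n2 / n1)


if __name__ == "__main__":
    # Toy scaffold labels standing in for scaffold SMILES.
    toy = ["a", "b", "a", "c", "a", "a"]
    print(scaffold_growth(toy, every=2))
```

For billions of molecules the in-memory set would have to be replaced by a disk-backed or probabilistic counter, but the logic stays the same.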
Vendor Integration
ZINC-22 emphasizes commercially available compounds from large make-on-demand catalogs:
- Enamine REAL: Multi-billion compound virtual library
- WuXi GalaXi: Extensive make-on-demand catalog
- Mcule: Diverse purchasable compound collection
This focus on purchasable molecules distinguishes ZINC-22 from theoretical chemical space databases.

