NaViT: Native Resolution Vision Transformer

A Method for Flexible-Resolution Vision Transformers

This is a Method paper that introduces NaViT (Native Resolution ViT), a Vision Transformer trained with sequence packing so that it can handle images of arbitrary resolution and aspect ratio. The core idea, called “Patch n’ Pack,” borrows example packing from NLP and applies it to vision: patches from multiple images of different sizes are concatenated into a single sequence, so each image is processed at its native resolution without being resized to a fixed square or padded to a common size.

Why Fixed-Resolution Pipelines Are Suboptimal

Standard computer vision pipelines resize all images to a fixed square resolution before processing. This practice originates from convolutional neural network constraints, where fixed spatial dimensions were architecturally required. Even with Vision Transformers, which operate on sequences of patches and could in principle handle variable lengths, the convention of fixed-resolution input persists.

This approach has clear drawbacks. Most images are not square: analysis of ImageNet, LVIS, and WebLI shows that over 20% of images deviate significantly from a 1:1 aspect ratio. Resizing distorts content and discards information, while padding wastes computation. Prior work like FlexiViT addressed variable patch sizes and Pix2Struct introduced aspect-ratio-preserving patching, but neither fully solved the problem of training efficiently on images at their original resolution.

Patch n’ Pack: Sequence Packing for Vision

The key insight is that ViT already processes images as sequences of patch tokens, and NLP has long used example packing to handle variable-length sequences efficiently. NaViT applies this directly: patches from multiple images (each at its native resolution and aspect ratio) are packed into a single fixed-length sequence.
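
As a rough illustration of the packing step, the sketch below counts patch tokens for a few images at native resolution and assigns them to fixed-length sequences with a first-fit heuristic. The helper names (`num_patches`, `greedy_pack`), the patch size of 16, the sequence length of 256, and the rounding up of partial patches are all assumptions made for this example, not details taken from the paper.

```python
import math

def num_patches(height, width, patch=16):
    """Token count for one image; partial edge patches are rounded up (assumption)."""
    return math.ceil(height / patch) * math.ceil(width / patch)

def greedy_pack(token_counts, seq_len):
    """First-fit packing: put each image into the first sequence that has room."""
    sequences = []  # each entry: {"items": [image indices], "used": tokens used}
    for idx, n in enumerate(token_counts):
        if n > seq_len:
            raise ValueError(f"image {idx} needs {n} tokens, more than seq_len={seq_len}")
        for seq in sequences:
            if seq["used"] + n <= seq_len:
                seq["items"].append(idx)
                seq["used"] += n
                break
        else:
            # No existing sequence had room, so open a new one.
            sequences.append({"items": [idx], "used": n})
    return sequences

# Hypothetical image sizes (height, width) in pixels.
sizes = [(224, 224), (128, 384), (96, 64), (320, 160), (64, 64)]
counts = [num_patches(h, w) for h, w in sizes]   # [196, 192, 24, 200, 16]
packed = greedy_pack(counts, seq_len=256)
padding = sum(256 - seq["used"] for seq in packed)
```

Here the two small images fill most of the space left over by the first large one. The remaining slots in each sequence would be filled with padding tokens.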

Architectural Modifications

Three changes enable Patch n’ Pack:

Masked self-attention and masked pooling: Attention masks prevent patches from different images from attending to each other. Masked pooling extracts a single representation per image from the packed sequence.
Factorized positional embeddings: Standard 1D positional embeddings cannot handle arbitrary resolutions. NaViT decomposes position into separate $x$ and $y$ embeddings $\phi_{x}$ and $\phi_{y}$, which are summed together. Two schemes are considered:
- Absolute embeddings: $\phi(p): [0, \text{maxLen}] \to \mathbb{R}^{D}$, a function of the absolute patch index
- Fractional embeddings: $\phi(r): [0, 1] \to \mathbb{R}^{D}$, where $r = p / \text{side-length}$ is the relative position along the image
Chunked contrastive loss: For contrastive pretraining, the $\mathcal{O}(n^{2})$ loss computation is handled via chunked computation across device subsets to support the high number of examples per sequence.
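
The first modification can be sketched in a few lines. The example below builds a block-diagonal attention mask from per-token image ids and pools one vector per image. It uses mean pooling as a simple stand-in for the pooling head, and the `-1` padding convention and function names are assumptions for illustration only.

```python
def attention_mask(image_ids):
    """mask[i][j] is True when token i may attend to token j.

    image_ids holds one integer per token; -1 marks a padding token.
    """
    n = len(image_ids)
    return [
        [image_ids[i] == image_ids[j] and image_ids[i] >= 0 for j in range(n)]
        for i in range(n)
    ]

def masked_mean_pool(tokens, image_ids):
    """Average the token vectors of each image, skipping padding."""
    sums, counts = {}, {}
    for vec, img in zip(tokens, image_ids):
        if img < 0:
            continue
        acc = sums.setdefault(img, [0.0] * len(vec))
        for k, v in enumerate(vec):
            acc[k] += v
        counts[img] = counts.get(img, 0) + 1
    return {img: [s / counts[img] for s in acc] for img, acc in sums.items()}

# Two images (2 and 3 tokens) followed by one padding token.
image_ids = [0, 0, 1, 1, 1, -1]
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [3.0, 3.0], [6.0, 0.0], [9.0, 9.0]]
mask = attention_mask(image_ids)
pooled = masked_mean_pool(tokens, image_ids)   # {0: [2.0, 3.0], 1: [3.0, 1.0]}
```

In this sketch padding rows are fully masked. A real softmax implementation would need to handle such rows explicitly, for example by letting padding attend to itself and discarding the result.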

Training Innovations

Packing enables two techniques that were previously impractical:

Continuous token dropping: Instead of dropping the same proportion of tokens from every image, the drop rate varies per image. Some images keep all their tokens while others are dropped aggressively, which reduces the train/inference discrepancy. The drop rate can also follow a schedule that decreases over training.
Resolution sampling: Each image’s resolution is sampled from a distribution (e.g., $R \sim \mathcal{U}(64, R_{\text{max}})$) while preserving aspect ratio. This mixes the throughput benefits of small images with the detail of large ones.
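
Resolution sampling can be sketched as follows. The example assumes that $R$ denotes the side of a square with the same pixel area as the resized image, and that the result is rounded to a whole number of 16-pixel patches. Both choices, along with the function name, are assumptions for illustration.

```python
import math
import random

def sample_resolution(height, width, r_max, r_min=64, patch=16, rng=random):
    """Sample R ~ U(r_min, r_max) and return a patch grid with the image's aspect ratio."""
    r = rng.uniform(r_min, r_max)
    aspect = width / height
    # Split a target area of r * r pixels according to the aspect ratio.
    new_h = r / math.sqrt(aspect)
    new_w = r * math.sqrt(aspect)
    # Round to whole patches, keeping at least one patch per side.
    grid_h = max(1, round(new_h / patch))
    grid_w = max(1, round(new_w / patch))
    return grid_h, grid_w

rng = random.Random(0)
# A hypothetical 480x640 landscape image resampled several times.
grids = [sample_resolution(480, 640, r_max=384, rng=rng) for _ in range(5)]
token_counts = [h * w for h, w in grids]
```

Each draw yields a different token count for the same image, so a packed sequence ends up holding a mix of cheap low-resolution and detailed high-resolution views.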

Computational Overhead

A natural concern is the $\mathcal{O}(n^{2})$ attention cost of longer packed sequences. In practice, as the transformer hidden dimension scales, attention becomes an increasingly small fraction of total compute because the MLP dominates. Packing itself adds little overhead: with a simple greedy bin-packing algorithm, padding tokens typically make up less than 2% of each packed sequence.
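
A back-of-the-envelope estimate shows why attention matters less at larger widths. The sketch below uses a common per-layer FLOP approximation for a standard transformer block with an MLP expansion ratio of 4. It is a simplified accounting chosen for this example, not the paper's own measurement.

```python
def attention_fraction(seq_len, hidden_dim, mlp_ratio=4):
    """Approximate share of per-layer forward FLOPs spent on the n^2 attention terms."""
    n, d = seq_len, hidden_dim
    proj = 8 * n * d * d              # Q, K, V and output projections
    mlp = 4 * mlp_ratio * n * d * d   # two dense layers of width mlp_ratio * d
    attn = 4 * n * n * d              # QK^T scores plus the weighted sum over V
    return attn / (proj + mlp + attn)

# Same sequence length, increasing width (768 and 1024 are the usual ViT-B and ViT-L widths).
frac_base = attention_fraction(256, 768)
frac_large = attention_fraction(256, 1024)
```

With `mlp_ratio=4` the expression simplifies to $n / (6d + n)$, which gives roughly 5% for $n = 256$, $d = 768$ and 4% for $d = 1024$. The quadratic term grows with sequence length but shrinks relative to the rest as $d$ increases.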

Pretraining and Downstream Evaluation

NaViT is evaluated in two pretraining setups:

Classification pretraining on JFT-4B with sigmoid cross-entropy loss, evaluated via linear probing (10 examples per class)
Contrastive pretraining on WebLI using image-text contrastive loss, evaluated on zero-shot ImageNet classification and COCO retrieval

Training Efficiency

At a fixed compute budget, NaViT consistently outperforms ViT across model scales. The top-performing ViT can be matched by a NaViT trained with 4x less compute. The primary driver is throughput: packing combined with variable resolution and token dropping lets NaViT-L/16 process approximately 5x more images during training.

Variable Resolution Results

Models trained with variable resolution ($R \sim \mathcal{U}(64, R_{\text{max}})$) outperform fixed-resolution models even when both are evaluated at the resolution the fixed-resolution model was trained on. Sampling side lengths from a truncated normal biased toward lower values gives the best cost-performance trade-off.

For fine-tuning on ImageNet-1k, a single NaViT fine-tuned with variable resolutions (64 to 512) matches the performance of models fine-tuned at each specific resolution individually.

Positional Embedding Comparison

Factorized embeddings outperform both standard ViT 1D embeddings (with interpolation) and Pix2Struct’s learned 2D embeddings. The factorized approach generalizes to resolutions outside the training range, while 2D embeddings fail because they require seeing all $(x, y)$ coordinate pairs during training. Additive combination of $\phi_{x}$ and $\phi_{y}$ works best.
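
The additive factorized scheme can be sketched with two lookup tables. In this illustrative example the tables are randomly initialized stand-ins for learned parameters, and the class name, table size, and initialization scale are assumptions.

```python
import random

class FactorizedPosEmbed:
    """Additive x/y positional embeddings indexed by absolute patch coordinate."""

    def __init__(self, max_len, dim, seed=0):
        rng = random.Random(seed)
        # One table per axis; in a real model these would be learned parameters.
        self.phi_x = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(max_len)]
        self.phi_y = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(max_len)]

    def __call__(self, grid_h, grid_w):
        """Return one embedding per patch, in row-major order."""
        return [
            [a + b for a, b in zip(self.phi_x[x], self.phi_y[y])]
            for y in range(grid_h)
            for x in range(grid_w)
        ]

embed = FactorizedPosEmbed(max_len=32, dim=8)
wide = embed(3, 5)    # 3 rows x 5 columns -> 15 embeddings
tall = embed(6, 2)    # a different aspect ratio reuses the same two tables
```

Because each axis has its own table, a coordinate pair $(x, y)$ gets a usable embedding as soon as $x$ and $y$ have each been seen separately. A joint 2D table with `max_len * max_len` entries would need every pair to appear during training.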

Token Dropping Strategies

Variable token dropping with Beta-distributed rates consistently outperforms constant rates. Resolution-dependent dropping (higher rates for higher-resolution images) further improves performance. Scheduling the drop rate to decrease over training provides additional gains.
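
Per-image drop rates drawn from a Beta distribution can be sketched with the standard library. The shape parameters, helper names, and the rule of keeping at least one token are illustrative assumptions, not values from the paper.

```python
import random

def sample_drop_rates(num_images, alpha, beta, rng=random):
    """One drop rate per image, drawn from Beta(alpha, beta) on [0, 1]."""
    return [rng.betavariate(alpha, beta) for _ in range(num_images)]

def drop_tokens(tokens, rate, rng=random):
    """Keep a random subset of tokens, preserving their original order."""
    keep = max(1, round(len(tokens) * (1.0 - rate)))
    kept_idx = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_idx]

rng = random.Random(0)
rates = sample_drop_rates(4, alpha=2.0, beta=2.0, rng=rng)
# Each image in the batch is thinned by its own rate.
images = [list(range(196)) for _ in rates]
thinned = [drop_tokens(img, r, rng=rng) for img, r in zip(images, rates)]
```

A resolution-dependent variant could shift the Beta parameters so that images with more tokens receive higher rates on average, and a schedule could shrink the mean rate as training progresses.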

Downstream Tasks

Task	Setup	Result
Semantic segmentation	ADE20k, L/16, linear decoder	NaViT at $R_{384}$ beats ViT at $R_{512}$ while being 2x faster
Object detection	OWL-ViT-L/14 backbone	NaViT: 28.3% LVIS AP vs. ViT: 23.3%
Video classification	Kinetics-400, tubelet extraction	NaViT-L matches ViViT-L (80.4%) in ~6x fewer epochs
Fairness annotation	FairFace, CelebA linear probes	Statistically significant accuracy improvements ($p = 3 \times 10^{-4}$)

Out-of-Distribution Robustness

NaViT shows strong gains on ImageNet-A (which contains many extreme aspect ratios) when evaluated without center cropping. Performance on ObjectNet is also competitive. The model maintains stable calibration (ECE between 0.045 and 0.047) across a wide range of token counts per image (128 to 1024).
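
For readers unfamiliar with the calibration metric, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence in each bin. The sketch below uses ten equal-width bins, which is a common default and an assumption here. The paper's exact binning is not stated in this summary.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Four predictions at 90% confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In this toy case the model is overconfident by 15 percentage points, so the ECE is 0.15.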

Key Findings and Limitations

NaViT demonstrates that sequence packing, when applied to Vision Transformers, yields substantial improvements in training efficiency, inference flexibility, and downstream performance. The approach processes images at their native resolution without the information loss from resizing or the waste from padding.

Key takeaways:

4x compute reduction to match top ViT performance
A single model works across a continuous range of resolutions at inference time
Variable-resolution training and token dropping provide complementary efficiency gains
Factorized positional embeddings generalize to unseen resolutions
Benefits transfer to detection, segmentation, video, and fairness tasks

Limitations: The paper does not release model weights or code. All experiments use Google-internal datasets (JFT-4B, WebLI) and infrastructure (TPUs, JAX/Scenic), making direct reproduction difficult. The attention masking approach for packing assumes that cross-image attention is undesirable, which may not hold for all tasks.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Classification pretraining	JFT-4B	~4B labeled images	Google-internal, not publicly available
Contrastive pretraining	WebLI	Large-scale web data	Google-internal, not publicly available
Classification fine-tuning	ImageNet-1k	1.28M images	Publicly available
Segmentation	ADE20k	20K images	Publicly available
Detection	LVIS	164K images	Publicly available
Video	Kinetics-400	~240K videos	Publicly available (partial)
Fairness	FairFace, CelebA	108K / 200K images	Publicly available

Algorithms

Greedy bin-packing for sequence construction (less than 2% padding tokens)
Resolution sampling: side length from truncated normal $\mathcal{N}_{t}(-0.5, 1)$ mapped to $[64, R_{\text{max}}]$
Token dropping: Beta-distributed per-image rates, optionally resolution-dependent
Factorized positional embeddings with additive combination
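
The truncated-normal resolution sampling listed above can be sketched with rejection sampling. The truncation interval $[-1, 1]$ and the linear mapping onto $[64, R_{\text{max}}]$ are assumptions made for this example, as is the function name.

```python
import random

def sample_side_length(r_max, r_min=64, mean=-0.5, std=1.0,
                       low=-1.0, high=1.0, rng=random):
    """Draw from a normal truncated to [low, high], then map linearly to [r_min, r_max]."""
    while True:
        z = rng.gauss(mean, std)
        if low <= z <= high:   # reject draws outside the truncation interval
            break
    frac = (z - low) / (high - low)
    return r_min + frac * (r_max - r_min)

rng = random.Random(0)
samples = [sample_side_length(384, rng=rng) for _ in range(2000)]
average = sum(samples) / len(samples)
```

With a negative mean the distribution leans toward the lower end of the range, so the average side length falls below the midpoint of $[64, R_{\text{max}}]$ and most images are processed cheaply.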

Models

NaViT variants: B/16, L/16, L/14
Based on vanilla ViT with query-key normalization, no biases, attention pooling
Implemented in JAX/FLAX within the Scenic framework
No public model checkpoints available

Evaluation

Metric	NaViT	ViT Baseline	Notes
JFT linear probe (L/16)	Matches top ViT	4x more compute	Compute-matched comparison
ImageNet zero-shot (L/14)	72.9%	68.3%	Contrastive pretraining
LVIS AP (L/14)	28.3%	23.3%	OWL-ViT detection
LVIS AP rare (L/14)	24.3%	17.2%	OWL-ViT detection
ADE20k mIoU (L/16, 384)	Beats ViT@512	At 2x cost	Segmenter linear decoder

Hardware

Training on Cloud TPUs (specific configuration not detailed)
Inference latency measured on Cloud TPUv3

Paper Information

Citation: Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I., Oliver, A., Padlewski, P., Gritsenko, A., Lučić, M., & Houlsby, N. (2023). Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).

@misc{dehghani2023patch,
  title={Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
  author={Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lučić, Mario and Houlsby, Neil},
  year={2023},
  eprint={2307.06304},
  archiveprefix={arXiv},
  primaryclass={cs.CV}
}
