Model Overview

Model Summary

MetaCLIP-2 is a family of vision-language models developed by Meta AI. It scales the Metadata-Curated (MetaCLIP) approach to significantly larger datasets and larger architectures. Its automated curation pipeline matches web-scale image-text data to the distribution of high-quality datasets, which lets MetaCLIP-2 reach strong zero-shot classification and image-text retrieval performance without relying on proprietary datasets.

Usage

MetaCLIP-2 can be used with Keras Hub to extract embeddings for images and text or to perform zero-shot classification.
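Zero-shot classification reduces to comparing one image embedding against the text embeddings of candidate labels. The sketch below shows only that comparison step in NumPy. The embeddings are stand-in arrays rather than real model outputs, and the helper names and the `logit_scale` value of 100 are assumptions made for illustration, not values taken from this card.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_probs(image_embeddings, text_embeddings, logit_scale=100.0):
    """Return per-image probabilities over the candidate text labels.

    logit_scale is an assumed value, not one published in this card.
    """
    img = l2_normalize(np.asarray(image_embeddings, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_embeddings, dtype=np.float64))
    logits = logit_scale * (img @ txt.T)          # (num_images, num_labels)
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


# Stand-in embeddings: three label embeddings and two images that lie
# close to labels 2 and 0. Real embeddings would come from the encoders.
text_embeddings = np.eye(3, 8)
image_embeddings = text_embeddings[[2, 0]] + 0.1

probs = zero_shot_probs(image_embeddings, text_embeddings)
predicted = probs.argmax(axis=-1)
print(predicted)
```

In practice the label embeddings are usually computed once from prompts such as "a photo of a dog" and reused for every image.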

Installation

pip install --upgrade keras-hub
pip install --upgrade keras

Presets

The following presets are available in Keras Hub. These presets incorporate the Metadata-Curated training methodology at various scales of the Vision Transformer (ViT) architecture.

| Preset Name | Description | Parameters | Input Resolution |
|---|---|---|---|
| `metaclip_2_vit_huge_patch14_224` | ViT-H/14 backbone trained on 224x224 images using the worldwide dataset. | ~1.1B | 224x224 |
| `metaclip_2_vit_huge_patch14_378` | ViT-H/14 backbone fine-tuned on 378x378 resolution for high-detail tasks. | ~1.1B | 378x378 |
| `metaclip_2_vit_giant_patch14_224` | ViT-bigG/14 backbone, the largest MetaCLIP-2 variant, trained on 224x224 images. | ~2.5B | 224x224 |
| `metaclip_2_vit_giant_patch14_378` | ViT-bigG/14 backbone fine-tuned on 378x378 resolution for maximum performance. | ~2.5B | 378x378 |
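The practical difference between the 224 and 378 presets is the number of patches the vision encoder has to process. The short calculation below works this out from the patch size of 14 in the preset names. The function name is hypothetical, and the count excludes any extra tokens (such as a class token) the encoder may prepend.

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


for resolution in (224, 378):
    # 224 -> 16 x 16 = 256 patches, 378 -> 27 x 27 = 729 patches
    print(resolution, num_patches(resolution))
```

The 378x378 presets therefore see a patch sequence roughly 2.8 times longer than the 224x224 presets, which is where their extra detail and extra compute cost both come from.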

Model Architecture

MetaCLIP-2 utilizes a dual-encoder architecture consisting of a Vision Transformer (ViT) and a Text Transformer.

Vision Encoder: Depending on the preset, it uses either a "Huge" (ViT-H) or "Giant" (ViT-bigG) architecture. Both split each image into 14x14-pixel patches and process the patches as a sequence.
Text Encoder: A standard Transformer that embeds tokenized textual descriptions into the same latent space as the images.
Training Objective: The model is trained using a contrastive loss (InfoNCE), which maximizes the cosine similarity between matching image-text pairs while minimizing it for non-matching pairs.
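The training objective above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the symmetric, CLIP-style form of InfoNCE (image-to-text and text-to-image terms averaged) and a temperature of 0.07. Neither detail is confirmed by this card, and the function names are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = (img @ txt.T) / temperature  # scaled cosine similarities
    idx = np.arange(logits.shape[0])      # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: perfectly aligned pairs versus deliberately mismatched pairs.
embeddings = np.eye(4)
aligned_loss = info_nce_loss(embeddings, embeddings)
shuffled_loss = info_nce_loss(embeddings, np.roll(embeddings, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

The aligned batch yields a loss close to zero, while the mismatched batch is penalized heavily, which is the behavior the contrastive objective is meant to produce.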

The defining characteristic of MetaCLIP-2 is its data pipeline. Instead of training directly on raw web crawls, Meta uses a metadata-curated approach that filters and balances billions of image-text pairs so that the training signal is representative of high-quality visual concepts.
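To make "filters and balances" concrete, here is a toy sketch of one way such curation could work. It is an illustrative assumption, not the actual Meta pipeline: captions are matched against a list of metadata entries by substring, unmatched pairs are dropped, and pairs matching very frequent entries are subsampled so that no entry dominates. The function name, the `max_per_entry` parameter, and the sample data are all hypothetical.

```python
import random
from collections import Counter


def curate(pairs, metadata, max_per_entry, seed=0):
    """Filter and balance (image_id, caption) pairs against metadata entries.

    Assumes metadata entries are lowercase strings.
    """
    rng = random.Random(seed)
    # Filter step: find which metadata entries appear in each caption.
    matches = [[m for m in metadata if m in caption.lower()]
               for _, caption in pairs]
    # Count how many captions matched each entry.
    counts = Counter(m for matched in matches for m in matched)
    # Balance step: frequent entries get a keep probability below 1.
    keep_prob = {m: min(1.0, max_per_entry / c) for m, c in counts.items()}
    curated = []
    for pair, matched in zip(pairs, matches):
        # Pairs with no matching entry are dropped (any() over empty is False).
        if any(rng.random() < keep_prob[m] for m in matched):
            curated.append(pair)
    return curated


# Toy data: one very common concept, one rare concept, some unmatched noise.
pairs = (
    [(f"dog_{i}", "A photo of a dog") for i in range(100)]
    + [(f"axolotl_{i}", "An axolotl in a tank") for i in range(2)]
    + [(f"noise_{i}", "IMG_0001.JPG") for i in range(5)]
)
curated = curate(pairs, metadata=["dog", "axolotl"], max_per_entry=10)
print(Counter(image_id.split("_")[0] for image_id, _ in curated))
```

In this toy run the rare concept is kept in full, the common concept is cut down to roughly `max_per_entry` pairs, and the unmatched captions disappear, which flattens the distribution over concepts.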

License

MetaCLIP-2 is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. For more details, please refer to the official Meta AI repository.

Paper

MetaCLIP 2: A Worldwide Scaling Recipe (arXiv:2507.22062, published Jul 29, 2025)
