KerasHub

Model Overview

Model Summary

MetaCLIP-2 is a family of vision-language models developed by Meta AI. It scales the Metadata-Curated (MetaCLIP) approach to larger datasets and larger architectures. Its automated curation pipeline matches web-scale data to the distribution of high-quality datasets, which lets the models reach strong zero-shot classification and image-text retrieval performance without relying on proprietary datasets.

Links

Usage

MetaCLIP-2 can be used with KerasHub to extract embeddings for images and text, or to perform zero-shot classification by comparing an image embedding against the embeddings of candidate text labels.
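The zero-shot step itself is independent of the model-loading API, so it can be sketched with NumPy alone. The example below is a minimal illustration, not the KerasHub interface: the embeddings are placeholder arrays standing in for encoder outputs, and `zero_shot_probs`, `l2_normalize`, and the `logit_scale` value of 100 are assumptions chosen for the sketch.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def zero_shot_probs(image_emb, text_emb, logit_scale=100.0):
    """Return class probabilities with shape (num_images, num_labels)."""
    image_emb = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    text_emb = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    # Scaled cosine similarity between every image and every label prompt.
    logits = logit_scale * (image_emb @ text_emb.T)
    # Softmax over labels, shifted by the row maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


# Placeholder embeddings (hypothetical values, not real model outputs).
rng = np.random.default_rng(0)
text_emb = np.eye(3, 8)  # one row per candidate label prompt
# Two "images" whose embeddings sit close to labels 2 and 0 respectively.
image_emb = text_emb[[2, 0]] + 0.01 * rng.normal(size=(2, 8))

probs = zero_shot_probs(image_emb, text_emb)
predicted = probs.argmax(axis=-1)  # index of the best-matching label per image
```

In practice the two arrays would come from the vision and text encoders of a loaded preset; only the normalization, similarity, and softmax steps are shown here.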

Installation

pip install --upgrade keras-hub
pip install --upgrade keras

Presets

The following presets are available in KerasHub. Each applies the Metadata-Curated training methodology to a different size of the Vision Transformer (ViT) architecture and a different input resolution.

| Preset Name | Description | Parameters | Input Resolution |
|---|---|---|---|
| metaclip_2_vit_huge_patch14_224 | ViT-H/14 backbone trained on 224x224 images using the worldwide dataset. | ~1.1B | 224x224 |
| metaclip_2_vit_huge_patch14_378 | ViT-H/14 backbone fine-tuned on 378x378 resolution for high-detail tasks. | ~1.1B | 378x378 |
| metaclip_2_vit_giant_patch14_224 | ViT-bigG/14 backbone, the largest MetaCLIP-2 variant, trained on 224x224 images. | ~2.5B | 224x224 |
| metaclip_2_vit_giant_patch14_378 | ViT-bigG/14 backbone fine-tuned on 378x378 resolution for maximum performance. | ~2.5B | 378x378 |

Model Architecture

MetaCLIP-2 uses a dual-encoder architecture consisting of a Vision Transformer (ViT) and a Text Transformer.

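  • Vision Encoder: Depending on the preset, it uses either a "Huge" (ViT-H) or "Giant" (ViT-bigG) architecture. These models split each image into a sequence of 14x14-pixel patches, so a 224x224 input yields a 16x16 grid (256 patches) and a 378x378 input yields a 27x27 grid (729 patches).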
  • Text Encoder: A standard Transformer architecture that tokenizes and embeds textual descriptions into the same latent space as the images.
  • Training Objective: The model is trained using a contrastive loss (InfoNCE), which maximizes the cosine similarity between matching image-text pairs while minimizing it for non-matching pairs.
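
The training objective above can be sketched as a symmetric InfoNCE loss over a batch in which row i of the image embeddings and row i of the text embeddings form a matching pair. This is a minimal NumPy illustration rather than the training code used for MetaCLIP-2; the function names, the fixed `temperature` of 0.07, and the toy one-hot embeddings are assumptions made for the example.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of image_emb matches row i of text_emb."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # (batch, batch) matrix of temperature-scaled cosine similarities.
    logits = (image_emb @ text_emb.T) / temperature
    diag = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column should put its mass on the diagonal entry.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch of four mutually orthogonal text embeddings (hypothetical data).
text_emb = np.eye(4, 16)
aligned_loss = clip_contrastive_loss(text_emb.copy(), text_emb)
# Rolling the rows pairs every image with the wrong caption.
mismatched_loss = clip_contrastive_loss(np.roll(text_emb, 1, axis=0), text_emb)
```

With perfectly aligned pairs the loss is close to zero, while the rolled batch is penalized heavily, which is the behavior the objective relies on to pull matching pairs together and push the rest apart.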

The defining characteristic of MetaCLIP-2 is its data pipeline. Instead of training directly on raw web crawls, Meta uses a metadata-curated approach: image-text pairs are filtered against a list of metadata entries and then balanced across those entries, so that the billions of retained pairs are representative of high-quality visual concepts rather than dominated by the most frequent ones.
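
To make the filter-and-balance idea concrete, here is a small sketch of one way such curation could work. This card does not specify the details, so everything in it is an assumption made for illustration: captions are matched to metadata entries by substring search, entries that match more than `cap` pairs are downsampled, and the names `match_entries` and `curate` are hypothetical.

```python
import random
from collections import Counter


def match_entries(caption, metadata):
    """Return the metadata entries that occur as substrings of the caption."""
    text = caption.lower()
    return [entry for entry in metadata if entry in text]


def curate(pairs, metadata, cap, seed=0):
    """Drop unmatched pairs and downsample entries matched more than `cap` times."""
    rng = random.Random(seed)
    matches = [match_entries(caption, metadata) for _, caption in pairs]
    counts = Counter(entry for matched in matches for entry in matched)
    kept = []
    for pair, matched in zip(pairs, matches):
        # A pair survives if any matched entry samples it. Entries with
        # count <= cap always keep their pairs; frequent entries keep each
        # pair with probability cap / count. Unmatched pairs are dropped.
        if any(rng.random() < cap / counts[entry] for entry in matched):
            kept.append(pair)
    return kept


# Toy corpus (hypothetical): one very common concept, one rare, one noise caption.
metadata = ["dog", "axolotl"]
pairs = [(f"img_{i}.jpg", "a photo of a dog") for i in range(1000)]
pairs.append(("img_axolotl.jpg", "An axolotl in a tank"))
pairs.append(("img_noise.jpg", "IMG_0001"))

kept = curate(pairs, metadata, cap=10)
kept_names = {name for name, _ in kept}
num_dogs = sum(1 for _, caption in kept if "dog" in caption)
```

In this toy run the rare axolotl pair is always retained, the caption with no metadata match is discarded, and the thousand dog captions are thinned to roughly the cap, which flattens the head of the distribution while preserving the tail.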

License

MetaCLIP-2 is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. For more details, please refer to the official Meta AI repository.
