How to use keras/metaclip_2_vit_huge_patch14_378 with KerasHub:
import keras_hub
# Create a Backbone model unspecialized for any task
backbone = keras_hub.models.Backbone.from_preset("hf://keras/metaclip_2_vit_huge_patch14_378")
How to use keras/metaclip_2_vit_huge_patch14_378 with Keras:
# Available backend options are: "jax", "torch", "tensorflow".
import os
os.environ["KERAS_BACKEND"] = "jax"
import keras
model = keras.saving.load_model("hf://keras/metaclip_2_vit_huge_patch14_378")
Model Overview
Model Summary
MetaCLIP-2 is a family of vision-language models developed by Meta AI. It scales the Metadata-Curated (MetaCLIP) approach to significantly larger datasets and larger architectures. An automated curation pipeline matches web-scale data to the distribution of high-quality datasets, which lets MetaCLIP-2 reach strong zero-shot classification and image-text retrieval performance without relying on proprietary datasets.
Links
- MetaCLIP2 Technical Paper
- MetaCLIP2 API Documentation
- MetaCLIP2 Inference Tutorial
- KerasHub Beginner Guide
- KerasHub Model Publishing Guide
Usage
MetaCLIP-2 can be used with KerasHub to extract embeddings for images and text, or to perform zero-shot classification by comparing an image embedding against the embeddings of candidate captions.
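Once image and text embeddings are available, zero-shot classification reduces to a cosine-similarity comparison followed by a softmax. The sketch below illustrates only that final step in plain Python; it does not call KerasHub. The toy three-dimensional vectors stand in for real model outputs, and the function names and the `logit_scale` value are assumptions made for illustration, not values taken from the preset.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per candidate caption for a single image.

    logit_scale is an assumed temperature multiplier, not a preset value.
    """
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    return softmax(logits)

# Toy embeddings standing in for real image and caption embeddings.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [1.0, 0.0, 0.0],  # hypothetical caption: "a photo of a cat"
    [0.0, 1.0, 0.0],  # hypothetical caption: "a photo of a dog"
]
probs = zero_shot_classify(image_embedding, text_embeddings)
best = max(range(len(probs)), key=probs.__getitem__)
print("best caption index:", best)
```

With real embeddings, the caption whose probability is highest is taken as the predicted label.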
Installation
pip install --upgrade keras-hub
pip install --upgrade keras
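After upgrading, it can be useful to confirm which versions are actually installed in the active environment. The following is a small sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Report the distributions named in the pip commands above.
for name in ("keras-hub", "keras"):
    print(name, installed_version(name) or "not installed")
```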
Presets
The following presets are available in KerasHub. Each applies the Metadata-Curated training methodology at a different scale of the Vision Transformer (ViT) architecture.
| Preset Name | Description | Parameters | Input Resolution |
|---|---|---|---|
| metaclip_2_vit_huge_patch14_224 | ViT-H/14 backbone trained on 224x224 images using the worldwide dataset. | ~1.1B | 224x224 |
| metaclip_2_vit_huge_patch14_378 | ViT-H/14 backbone fine-tuned on 378x378 resolution for high-detail tasks. | ~1.1B | 378x378 |
| metaclip_2_vit_giant_patch14_224 | ViT-bigG/14 backbone, the largest MetaCLIP-2 variant, trained on 224x224 images. | ~2.5B | 224x224 |
| metaclip_2_vit_giant_patch14_378 | ViT-bigG/14 backbone fine-tuned on 378x378 resolution for maximum performance. | ~2.5B | 378x378 |
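The preset names encode both the patch size (`patch14`) and the input resolution, which together determine how many patch tokens the vision encoder sees. The arithmetic can be sketched as follows; the helper `patch_tokens` is a hypothetical name, and the count covers image patches only, ignoring any extra tokens a particular implementation may add.

```python
PATCH_SIZE = 14  # pixels per patch side, from "patch14" in the preset names

# Input resolutions copied from the preset table.
PRESET_RESOLUTION = {
    "metaclip_2_vit_huge_patch14_224": 224,
    "metaclip_2_vit_huge_patch14_378": 378,
    "metaclip_2_vit_giant_patch14_224": 224,
    "metaclip_2_vit_giant_patch14_378": 378,
}

def patch_tokens(resolution, patch_size=PATCH_SIZE):
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    per_side = resolution // patch_size
    return per_side * per_side

for preset, resolution in PRESET_RESOLUTION.items():
    print(preset, patch_tokens(resolution))
```

A 224x224 input yields a 16x16 grid (256 patches), while a 378x378 input yields a 27x27 grid (729 patches), so the higher-resolution presets process roughly 2.8 times as many patch tokens per image.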
Model Architecture
MetaCLIP-2 utilizes a dual-encoder architecture consisting of a Vision Transformer (ViT) and a Text Transformer.
- Vision Encoder: Depending on the preset, it uses either a "Huge" (ViT-H) or "Giant" (ViT-bigG) architecture. These models split each image into 14x14-pixel patches and process them as a sequence.
- Text Encoder: A standard Transformer that takes tokenized text and embeds it into the same latent space as the images.
- Training Objective: The model is trained using a contrastive loss (InfoNCE), which maximizes the cosine similarity between matching image-text pairs while minimizing it for non-matching pairs.
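To make the training objective concrete, here is a minimal sketch of an InfoNCE loss over a small batch in plain Python. The symmetric form (averaging the image-to-text and text-to-image directions) and the `temperature` value of 0.07 are assumptions borrowed from common CLIP-style setups; the exact formulation and hyperparameters used for MetaCLIP-2 are not stated here. Embeddings are assumed to be L2-normalized already, so dot products equal cosine similarities.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]

def info_nce(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each list is a matching pair.

    temperature is an assumed value, not one taken from MetaCLIP-2.
    """
    n = len(image_embeddings)
    # logits[i][j] is the scaled similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature
         for txt in text_embeddings]
        for img in image_embeddings
    ]
    # Image-to-text direction: the correct text for image i is column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: transpose and repeat.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Matching pairs aligned on the diagonal give a loss near zero.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Swapping the captions makes every "correct" pair dissimilar.
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is small when each image is most similar to its own caption and large when the pairing is wrong, which is the behavior the contrastive objective rewards.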
The defining characteristic of MetaCLIP-2 is its Data Pipeline. Instead of relying on raw web crawls, Meta uses a metadata-curated approach that filters and balances billions of image-text pairs to ensure the training signal is representative of high-quality visual concepts.
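The filter-and-balance idea can be illustrated with a toy sketch, loosely modeled on the match-then-balance scheme described for the original MetaCLIP: captions are matched against a list of metadata entries, unmatched pairs are dropped, and pairs matching very frequent entries are down-sampled. Everything here is an assumption for illustration, including the function name `curate`, the `cap` threshold, and the toy metadata; the actual MetaCLIP-2 pipeline and its parameters are not specified in this card.

```python
import random

def curate(pairs, metadata, cap, seed=0):
    """Keep pairs whose caption matches metadata, down-sampling frequent entries.

    pairs: list of dicts with a "text" key (hypothetical record format).
    metadata: list of lowercase concept strings.
    cap: assumed per-entry count above which matches are down-sampled.
    """
    rng = random.Random(seed)

    # Step 1: substring-match each caption against the metadata entries.
    counts = {entry: 0 for entry in metadata}
    matches = []
    for pair in pairs:
        caption = pair["text"].lower()
        matched = [entry for entry in metadata if entry in caption]
        matches.append(matched)
        for entry in matched:
            counts[entry] += 1

    # Step 2: rare entries are always kept; frequent ones with probability cap/count.
    keep_prob = {
        entry: min(1.0, cap / count)
        for entry, count in counts.items()
        if count > 0
    }

    curated = []
    for pair, matched in zip(pairs, matches):
        # A pair survives if any of its matched entries is sampled.
        if any(rng.random() < keep_prob[entry] for entry in matched):
            curated.append(pair)
    return curated

# Toy corpus: one very common concept, one rare concept, some noise.
pairs = (
    [{"text": "A dog in the park"}] * 100
    + [{"text": "An axolotl in a tank"}] * 2
    + [{"text": "IMG_0001.jpg"}] * 5
)
curated = curate(pairs, metadata=["dog", "axolotl"], cap=10)
print(len(pairs), "->", len(curated))
```

In this toy corpus the unmatched filenames are removed, both rare "axolotl" pairs are kept, and the common "dog" pairs are thinned toward the cap, which flattens the concept distribution.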
License
MetaCLIP-2 is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. For more details, please refer to the official Meta AI repository.