---
library_name: transformers
tags:
- vision
- image-text
- clip
- zero-shot
---

<div align="center">
<img class="block dark:hidden" src="assets/Raon-VisionEncoder-Gradient-Black.png" alt="Raon VisionEncoder" width="600">
<img class="hidden dark:block" src="assets/Raon-VisionEncoder-Gradient-White.png" alt="Raon VisionEncoder" width="600">
</div>

<p align="center">
<a href="https://www.krafton.ai/ko/"><img src="https://img.shields.io/badge/Homepage-KRAFTON%20AI-blue?style=flat&logo=google-chrome&logoColor=white" alt="Homepage"></a>
<br>
<a href="https://huggingface.co/KRAFTON"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-KRAFTON-yellow?style=flat" alt="Hugging Face"></a>
<a href="https://x.com/Krafton_AI"><img src="https://img.shields.io/badge/X-KRAFTON%20AI-white?style=flat&logo=x&logoColor=black" alt="X"></a>
<br>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-lightgrey?style=flat" alt="License"></a>
</p>

**Raon-VisionEncoder** is a 1.14B-parameter vision-language foundation model by [KRAFTON](https://www.krafton.com) for image and text feature extraction.
It supports zero-shot image classification, image-text retrieval, and native-aspect-ratio inference via NaFlex.
It is built on [OpenCLIP](https://github.com/mlfoundations/open_clip) with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.

## Pretrained Models

| Model | Total Params (Inference) | Vision Params | Text Params | Patch Size | NaFlex Default Patches |
|-------|--------------------------|---------------|-------------|------------|------------------------|
| LocCa ViT-SO400M-16-SigLIP2 | 1.14B | 0.43B | 0.71B | 16x16 | 256 |
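
With a 16x16 patch size, the default NaFlex budget of 256 patches corresponds to a 16x16 patch grid, i.e. a 256x256 pixel input for a square image. For other aspect ratios, NaFlex-style preprocessing picks a different grid whose shape roughly follows the image instead of squashing it to a square. The sketch below illustrates one way to choose that grid: a binary search for the largest scale whose patch count stays within the budget, similar in spirit to the resize performed by the SigLIP2 image processor in `transformers`. `naflex_target_size` is a hypothetical helper written for illustration; Raon-VisionEncoder's own processor may round or resize differently.

```python
import math


def naflex_target_size(height: int, width: int, patch_size: int = 16,
                       max_patches: int = 256, eps: float = 1e-5) -> tuple[int, int]:
    """Hypothetical helper: choose a resize target whose sides are multiples of
    `patch_size`, that roughly keeps the aspect ratio, and whose patch count
    does not exceed `max_patches`."""

    def snap(size: int, scale: float) -> int:
        # Scale one side, round up to a whole number of patches, keep >= 1 patch.
        return max(patch_size, math.ceil(size * scale / patch_size) * patch_size)

    def num_patches(scale: float) -> int:
        return (snap(height, scale) // patch_size) * (snap(width, scale) // patch_size)

    # Bisection invariant: num_patches(lo) fits the budget, num_patches(hi) does not.
    lo, hi = eps / 10, 100.0
    while hi - lo >= eps:
        mid = (lo + hi) / 2
        if num_patches(mid) <= max_patches:
            lo = mid
        else:
            hi = mid
    return snap(height, lo), snap(width, lo)


print(naflex_target_size(1000, 1000))  # (256, 256): 16x16 grid, 256 patches
print(naflex_target_size(512, 1024))   # (176, 352): 11x22 grid, 242 patches
```

The 2:1 image keeps its aspect ratio on an 11x22 grid (242 patches) rather than filling all 256 slots, because every side must be a whole number of patches.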

## Requirements

```bash
pip install torch torchvision timm transformers huggingface-hub safetensors ftfy
```

## Quick Start

```python
import torch
from transformers import AutoModel
from PIL import Image

# Load the model (custom code from the repo, hence trust_remote_code=True) and its processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")

# Preprocess one image and two candidate captions
img_inputs = processor(images=Image.open("assets/photo.jpg").convert("RGB"))
txt_inputs = processor(text=["a cat", "a dog"])

# Encode into normalized [B, 1152] features
with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)

# Compute similarity logits with the learned scale (stored as a log, hence .exp()) and bias
logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
probs = logits.softmax(dim=-1)  # distribution over the candidate captions
print(probs)
```
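
The Quick Start scores each image-caption pair as `exp(logit_scale) * <image, text> + logit_bias`; since both feature vectors are normalized, the dot product is a cosine similarity. A softmax over the candidate captions is unchanged when the same bias is added to every logit, so `logit_bias` only matters when each pair is scored on its own, for example with a sigmoid as in SigLIP-style models. The plain-Python sketch below checks this on made-up 2-d features and made-up scale and bias values (real features are 1152-d, and the learned values come from `model.logit_scale` and `model.logit_bias`); `pair_logits`, `softmax`, and `sigmoid` are hypothetical helpers.

```python
import math


def pair_logits(image_feat, text_feats, logit_scale, logit_bias):
    # logit_j = exp(logit_scale) * <image_feat, text_j> + logit_bias
    # Assumes unit-norm features, so the dot product is a cosine similarity.
    scale = math.exp(logit_scale)
    return [scale * sum(a * b for a, b in zip(image_feat, t)) + logit_bias
            for t in text_feats]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(xs):
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]


# Made-up toy values, for illustration only
image_feat = [0.6, 0.8]                # unit norm
text_feats = [[1.0, 0.0], [0.0, 1.0]]  # unit norm
logit_scale, logit_bias = math.log(10.0), -5.0

with_bias = pair_logits(image_feat, text_feats, logit_scale, logit_bias)
no_bias = pair_logits(image_feat, text_feats, logit_scale, 0.0)

print([round(p, 3) for p in softmax(with_bias)])  # [0.119, 0.881]
print([round(p, 3) for p in softmax(no_bias)])    # [0.119, 0.881]: bias has no effect
print([round(p, 3) for p in sigmoid(with_bias)])  # [0.731, 0.953]
```

Which reading fits this checkpoint depends on its training objective; the softmax used in the Quick Start ranks the captions against each other, while the sigmoid gives an independent score per pair.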

## API Reference

| Method / Attribute | Input | Output |
|--------------------|-------|--------|
| `model.encode_image(**inputs)` | Processor output (image) | `[B, 1152]` normalized image features |
| `model.encode_text(**inputs)` | Processor output (text) | `[B, 1152]` normalized text features |
| `model.logit_scale` | - | Learned temperature, stored in log space (apply `.exp()` before use) |
| `model.logit_bias` | - | Learned bias added to the scaled similarity |
| `model.get_processor(repo_id)` | Hugging Face repo ID | Processor instance |
| `processor(images=img)` | PIL Image | Preprocessed image inputs (dict) |
| `processor(text=["a cat"])` | List of strings | Tokenized text inputs (dict) |
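
Because `encode_image` and `encode_text` return normalized features, text-to-image retrieval amounts to ranking gallery images by their dot product with a query text feature. The sketch below does this in plain Python on made-up 3-d vectors, assuming "normalized" means unit L2 norm; `l2_normalize` and `top_k` are hypothetical helpers, not part of the model's API.

```python
import heapq
import math


def l2_normalize(v):
    # Scale a vector to unit L2 norm so that dot products become cosines.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query, gallery, k=2):
    # Return (index, cosine) pairs for the k gallery rows most similar to `query`.
    # Assumes `query` and every gallery row are already unit-norm.
    scores = ((i, sum(q * g for q, g in zip(query, row))) for i, row in enumerate(gallery))
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Made-up toy "image features" and one "text feature" query
gallery = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0])]
query = l2_normalize([1.0, 0.2, 0.0])

for index, score in top_k(query, gallery):
    print(index, round(score, 3))
```

With real tensors, the same ranking can be computed as `(txt_feat @ img_feat.T).topk(k, dim=-1)`, which returns the top-k scores and image indices for each text query.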

## License

This repository is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
Third-party notices are listed in [NOTICE](NOTICE).

© 2026 KRAFTON