Rax 3.5 Chat

Rax 3.5 Chat is a compact, ~2B-parameter multimodal model for vision-language understanding and conversational AI. It accepts text and image inputs and supports an extended context of up to 262,144 tokens.

Model Details

  • Parameters: ~2B
  • Context Length: 262,144 tokens
  • Input Modalities: Text + Images
  • Attention: Hybrid linear + full attention (24 layers)
  • Vision Encoder: 24-layer transformer with 1024 hidden size
  • Text Hidden Size: 2048
  • Precision: BFloat16
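As a rough sanity check, the ~2B figure is consistent with the layer counts and hidden sizes above under a standard transformer parameter approximation. The MLP ratio, vocabulary size, and tied embeddings below are assumptions, not figures from this card:

```python
# Back-of-the-envelope parameter count from the model card's figures.
# Assumption: ~12 * hidden^2 parameters per transformer block
# (attention + 4x MLP), 150K vocabulary, tied embeddings.

def transformer_params(layers: int, hidden: int) -> int:
    """Approximate parameters in a stack of standard transformer blocks."""
    return 12 * hidden * hidden * layers

text = transformer_params(24, 2048)    # text stack: 24 layers, hidden 2048
vision = transformer_params(24, 1024)  # vision encoder: 24 layers, hidden 1024
embeddings = 150_000 * 2048            # assumed vocab size (not in the card)

total = text + vision + embeddings
print(f"~{total / 1e9:.2f}B params")   # lands in the ~2B ballpark
```

The estimate ignores layer norms, biases, and the vision-projection layers, which add comparatively few parameters.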

Key Features

  • Multimodal Understanding: Processes text and images in unified reasoning
  • Long Context: Supports up to 262K tokens for extended conversations
  • Efficient Architecture: Hybrid attention mechanism for optimal performance
  • Production Ready: Compatible with vLLM, SGLang, and Transformers

Usage

With Transformers

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained(
    "raxcore/Rax-3.5-Chat",
    torch_dtype=torch.bfloat16,  # the model ships in BF16
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("raxcore/Rax-3.5-Chat", trust_remote_code=True)

# Text-only conversation
messages = [{"role": "user", "content": "What is the capital of France?"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))

# With image
image = Image.open("image.jpg")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))

With vLLM

Start an OpenAI-compatible server:

vllm serve raxcore/Rax-3.5-Chat --port 8000 --max-model-len 8192 --trust-remote-code

Then query it from any OpenAI client:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="raxcore/Rax-3.5-Chat",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)

Architecture Highlights

  • Hybrid Attention: Alternates between linear attention and full attention layers for efficiency
  • Vision Encoder: 24-layer transformer with patch size 16 and spatial merge 2x2
  • Efficient KV Cache: 2 key-value heads for reduced memory footprint
  • Multi-resolution Position Embeddings: Optimized for long-context understanding
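To get a feel for the vision settings above: with patch size 16 and 2x2 spatial merging, each image token covers a 32x32-pixel region. A back-of-the-envelope calculation (assuming the encoder sees the image at its native resolution, with no extra resizing):

```python
def image_tokens(width: int, height: int, patch: int = 16, merge: int = 2) -> int:
    """Tokens produced for one image: patchify at `patch`, then merge merge x merge."""
    patches_w = width // patch
    patches_h = height // patch
    return (patches_w // merge) * (patches_h // merge)

print(image_tokens(1024, 1024))  # 32 * 32 = 1024 tokens
print(image_tokens(512, 512))    # 16 * 16 = 256 tokens
```

This is why high-resolution images consume context quickly: token count grows with image area.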

Best Practices

  • Use a temperature of 0.6–0.8 for factual tasks and 0.8–1.0 for creative tasks
  • For long context (>32K tokens), ensure sufficient GPU memory
  • Enable trust_remote_code when loading the model
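The memory point above can be made concrete. Sizing the KV cache as if every layer used full attention gives a worst-case figure (head dimension 128 is an assumption, not from this card); the hybrid design should come in well under this, since linear-attention layers keep constant-size state rather than a per-token cache:

```python
def kv_cache_bytes(seq_len: int, layers: int = 24, kv_heads: int = 2,
                   head_dim: int = 128, bytes_per: int = 2) -> int:
    """Worst-case BF16 KV cache (K and V) if every layer used full attention."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

for n in (32_768, 262_144):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:.2f} GiB KV cache")
# 262,144 tokens -> 6.00 GiB in this worst case; the 2 KV heads
# already keep this far smaller than with full multi-head caching.
```

Add the ~4 GB of BF16 weights on top of this when budgeting GPU memory for long-context runs.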

Limitations

  • 2B parameters may limit complex reasoning compared to larger models
  • Vision understanding optimized for natural images
  • Long context requires significant memory resources

License

Apache 2.0

Citation

@misc{rax3.5chat,
  title={Rax 3.5 Chat: Efficient Multimodal Assistant Model},
  author={Raxcore},
  year={2026}
}