# Qwen2.5-VL-7B Arabic VQA
Fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct on Arabic Visual Question Answering data as part of the ArabicVL-R project.
## Model Description
This model is fine-tuned to answer visual questions in Arabic, reasoning over image content and responding in Arabic text.
## Training Data

- Dataset: Arabic LLaVA — available at ArabicVL-R
- Total samples: 5,000
- Split: 80% train / 10% validation / 10% test
  - Train: 4,000 samples
  - Validation: 500 samples
  - Test: 500 samples
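The 80/10/10 split above can be reproduced with a simple shuffle-and-slice; the seed and index-based approach here are assumptions for illustration, not the project's actual preprocessing code:

```python
import random

# Placeholder indices standing in for the 5,000 QA samples.
samples = list(range(5000))
random.Random(42).shuffle(samples)  # fixed seed is an assumption

train = samples[:4000]        # 80%
val = samples[4000:4500]      # 10%
test = samples[4500:]         # 10%

print(len(train), len(val), len(test))
```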
## Evaluation Results
Evaluated on the test split (503 samples) using Exact Match:
| Metric | Score |
|---|---|
| Exact Match | 0.4334 |
| Accuracy | 43.34% |
| Correct / Total | 218 / 503 |
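The card does not specify how predictions are normalized before comparison; a minimal sketch of Exact Match scoring, assuming whitespace-stripped string equality:

```python
def exact_match(pred: str, ref: str) -> bool:
    # Normalization is an assumption: strip surrounding whitespace only.
    return pred.strip() == ref.strip()

# Toy Arabic predictions vs. references (illustrative, not from the dataset).
preds = ["قطة", "كلب", "سيارة"]
refs = ["قطة", "قط", "سيارة"]

score = sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
print(round(score, 4))  # 2 of 3 exact → 0.6667
```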
## Usage

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Manar01/qwen-vl-arabic",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Manar01/qwen-vl-arabic")

image = Image.open("image.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "your question in Arabic"},
    ],
}]

# Build the chat prompt, then tokenize text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
generated = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(generated, skip_special_tokens=True))
```
## Authors
- Sarah Aldumiji
- Manar Alrabie
- Hadeel Alseni
- Ragheed Samkari
- Mourad Mars
## Citation

```bibtex
@misc{arabicvlr2026,
  title  = {ArabicVL-R: Arabic Vision Language Model Reasoning},
  author = {Aldumiji, Sarah and Alrabie, Manar and Alseni, Hadeel and Samkari, Ragheed and Mars, Mourad},
  year   = {2025},
  url    = {https://github.com/hadeelalseni/ArabicVL-R}
}
```
## License
Apache 2.0