Gemma 4 E4B — Opus Reasoning + Claude Code | GGUF
GGUF version of our Opus 4.6 reasoning model. Ollama ✅ LM Studio ✅ llama.cpp ✅ Reasoning baked in — no adapter needed.
Built by RavenX AI · GGUF converted from MLX source
What is this?
This is the GGUF version of gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit — Gemma 4 E4B with Opus 4.6 reasoning and Claude Code LoRA fused directly into the weights.
No adapter needed, no extra config — just load and run with Claude-style <think> reasoning baked in.
Looking for the Apple Silicon MLX version? → MLX 4-bit model (optimized for Metal GPU)
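Because the reasoning arrives inline as <think>…</think> tags, client code usually separates it from the final answer before display. A minimal sketch (the tag format is the only assumption; `split_reasoning` is an illustrative helper, not part of the model or any library):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the final answer."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return "\n".join(t.strip() for t in thoughts), answer

reasoning, answer = split_reasoning(
    "<think>RSA rests on factoring.</think>Breaking RSA requires factoring large semiprimes."
)
```

This lets a UI show or hide the chain of thought independently of the answer.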
Available Quantizations
| Quantization | Size | Use case |
|---|---|---|
| Q4_K_M | 2.7 GB | Recommended — best balance of quality and speed |
| Q5_K_M | 3.1 GB | Higher quality, slightly more RAM |
| Q8_0 | 4.5 GB | Near-lossless, needs more RAM |
| F16 | 8.3 GB | Full-precision GGUF |
Sizes will be updated once conversion is complete.
🦙 Ollama — One Command
```
ollama run hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
```
With a custom system prompt
Create a Modelfile:
```
FROM hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step using <think> tags."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```

Then build and run it:

```
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
```
OpenAI-compatible API
```
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF",
    "messages": [{"role": "user", "content": "Explain why RSA encryption is hard to break."}]
  }'
```
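The same endpoint works from any OpenAI-compatible client. A standard-library sketch (endpoint and model name taken from the curl example above; `build_chat_request` is an illustrative helper, and the final call assumes the Ollama server is running locally):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"
MODEL = "hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local Ollama server."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_chat_request("Explain RSA.")) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```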
💻 LM Studio
- Open LM Studio
- Search for deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
- Download any quantization (Q4_K_M recommended)
- Load and chat — reasoning is baked in
🔧 llama.cpp
CLI
```
llama-cli \
  -hf deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF \
  -p "Explain why RSA encryption is hard to break." \
  -n 1024
```
Server (OpenAI-compatible)
```
llama-server \
  -hf deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF \
  --port 8080
```

```
# Use with any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
What's different from the base Gemma 4 E4B
| | Base Gemma 4 E4B | This model |
|---|---|---|
| <think> tag reasoning | ❌ | ✅ baked in |
| Claude-style structured answers | ❌ | ✅ |
| Tool-use patterns | ❌ | ✅ |
| Requires adapter | — | ❌ no adapter needed |
| Vision support | ✅ | ✅ |
| Ollama / LM Studio / llama.cpp | ✅ | ✅ |
🧪 Live Demos — Try It Now
| Space | What to try |
|---|---|
| 🔥 Agentic Tool Calling Demo | Live agentic loop — tool calling, <think> reasoning, calculator, web search |
| 🐳 OpenClaw Sandbox Demo | OpenClaw-style orchestration, Docker runtime, sandbox/approval modes |
🧩 Agent Stack Compatibility
This model is built to sit inside a real agent stack, not just a chat box.
| Layer | Role |
|---|---|
| Gemma 4 E4B Opus Reasoning + Claude Code | Reasoning + tool-use baked into weights |
| Gemini CLI | Coding agent + tool orchestration |
| OpenHarness | Harness runtime, tool loop, swarm, hooks, memory |
| OpenClaw | Orchestration, sessions, skills, messaging |
| Hermes skill | Agent behavior for concise, terminal-first execution |
→ Gemini CLI fork · TurboQuant-MLX · RavenX Inference Harness
How it was made
Training data
| Source | Examples |
|---|---|
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2,054 |
| Claude Code tool-use patterns | 140 files |
| Total | 2,163 |
Training
Base: deadbydawn101/gemma-4-E4B-mlx-4bit
Method: SFT completions-only (mlx_vlm.lora)
Rank: 8 · Alpha: 16 · LR: 1e-5 · Iters: 1,000
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Final loss: ~3.5e-7
Fusion + GGUF Conversion
- All 378 LoRA pairs merged via weight arithmetic into base weights
- De-quantized from MLX 4-bit to FP16
- Converted to GGUF using llama.cpp/convert_hf_to_gguf.py
- Quantized to multiple GGUF formats using llama-quantize
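The fusion step above is plain weight arithmetic: each LoRA pair (A, B) is scaled and added into its base matrix, W' = W + (alpha/rank) · B·A. A schematic NumPy sketch (`fuse_lora` and the shapes are illustrative; alpha = 16 and rank = 8 come from the training config above):

```python
import numpy as np

def fuse_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
              alpha: float = 16.0, rank: int = 8) -> np.ndarray:
    """Merge one LoRA pair into a base weight: W' = W + (alpha/rank) * B @ A."""
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 8
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))   # down-projection
B = np.zeros((d_out, r))             # B starts at zero, so this adapter is a no-op
assert np.allclose(fuse_lora(W, A, B), W)
```

After fusing all pairs this way, the adapter file is no longer needed at inference time, which is why the GGUF ships standalone.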
Related Models
| Model | Format | Size | Notes |
|---|---|---|---|
| MLX 4-bit (source) | MLX | ~10.5 GB | Apple Silicon optimized, Metal GPU |
| This model (GGUF) | GGUF | varies | Ollama, LM Studio, llama.cpp |
| Base model (4-bit) | MLX | 4.86 GB | Base model (use with adapter) |
| LoRA adapter only | Safetensors | 658 MB | Adapter-only |
| 2B abliterated | MLX | 3.34 GB | 2B abliterated |
| 21B MoE REAP | MLX | 12 GB | 21B MoE REAP |
Model tree for deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
- Base model: google/gemma-4-E4B-it
- Quantized from: deadbydawn101/gemma-4-E4B-mlx-4bit