Code-IL E4B

A 4B-parameter coding assistant for Python and TypeScript — runs entirely on-device, no code ever leaves your machine.



Model overview

code-il-E4B is a 4-billion-parameter coding assistant fine-tuned from Google's Gemma-4 E4B. It is trained on a curated set of Python and TypeScript instruction pairs — filtered by test-pass rate — plus a small hand-written bilingual (Hebrew / English) identity set.

The entire model is 4 GB in GGUF Q4_K_M form. It runs on:

  • A modern laptop CPU (slower but functional)
  • Any consumer GPU with 6 GB+ VRAM
  • Apple Silicon via llama.cpp Metal

No API. No telemetry. No data leaving the developer's machine.

Why this exists

Every keystroke sent to a cloud coding assistant is a potential data-leak event. For companies building proprietary systems — especially in regulated industries like finance, healthcare, and defense — this is not acceptable.

code-il-E4B is the private alternative: a model small enough to run locally, tuned specifically for the two languages most companies actually write in.

It is not competing with Claude Sonnet or GPT-4o on raw capability. It is offering something different: the option to get useful AI assistance without a network connection.

Intended use

Primary use cases:

  • Local code completion and review in regulated environments
  • On-prem deployment for companies with strict data-residency rules
  • Pair-programming for developers with unreliable internet
  • Integration into internal developer tooling that cannot call external APIs
  • Hebrew-speaking developer onboarding (model responds in Hebrew on request)

Out-of-scope uses:

  • Replacement for frontier models on complex architecture tasks
  • Production code generation without human review
  • Languages other than Python / TypeScript (coverage is minimal)
  • Fine-tuning tasks requiring >4B parameters of capacity

How to use

Ollama

```shell
ollama pull hf.co/BrainboxAI/code-il-E4B:Q4_K_M
ollama run hf.co/BrainboxAI/code-il-E4B:Q4_K_M
```

llama.cpp

```shell
./llama-cli -m code-il-E4B.Q4_K_M.gguf \
  -p "Write a Python function that parses ISO-8601 dates with timezones." \
  --temp 0.2 --top-p 0.95 -n 1024
```
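For reference, the kind of function that prompt targets might look like the following sketch (an illustration, not the model's actual output):

```python
from datetime import datetime, timezone

def parse_iso8601(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, rejecting naive (timezone-less) values."""
    # datetime.fromisoformat rejects a trailing 'Z' before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        raise ValueError(f"missing timezone offset: {value!r}")
    return dt

# Known limitation: fromisoformat does not cover every ISO-8601 variant
# (e.g. week dates); use a dedicated parser for exotic inputs.
```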

Python (transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("BrainboxAI/code-il-E4B-safetensors")
model = AutoModelForCausalLM.from_pretrained(
    "BrainboxAI/code-il-E4B-safetensors",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Implement binary search in TypeScript with full edge-case handling."},
]
# Move the input ids onto the model's device before generating.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
# do_sample=True is required for temperature/top_p to take effect.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.2, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Recommended generation parameters

| Parameter | Value | Rationale |
|---|---|---|
| temperature | 0.2 | Low creativity for deterministic code |
| top_p | 0.95 | Slightly higher than our legal model (law-il-E2B) to allow idiom variety |
| max_new_tokens | 1024 | Enough for most function-level completions |
| repetition_penalty | 1.0 | Penalizing repetition hurts code structure |

Recommended System Prompt: Semi-Formal Reasoning

This 4B model produces dramatically better code when it is forced to think through five explicit steps before writing. Free-form prompts often yield code that compiles but fails on edge cases, ships without tests, or hides subtle bugs.

Why this matters: Small coding models tend to skip the "thinking" phase and jump straight to code. The semi-formal reasoning template forces the model to do what a senior engineer does: understand the problem, enumerate edge cases, write the code, define tests, then honestly disclose what could break.

The 5 Reasoning Steps

  1. Problem Understanding - restate the requirement, identify ambiguities
  2. Edge Cases and Constraints - enumerate what could go wrong before coding
  3. Implementation - the actual code, with inline comments only where needed
  4. Tests - concrete test cases covering happy path + edge cases
  5. Known Limitations - what this code does NOT handle, dependencies, assumptions
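Applied to a toy task, the five steps compress to something like this hypothetical Python sketch:

```python
# 1. Problem Understanding: divide a by b; returning None on error is
#    ambiguous, so we fail fast instead.
# 2. Edge Cases: b == 0, negative operands, float inputs.

def safe_divide(a: float, b: float) -> float:
    """Step 3 (Implementation): fail fast on division by zero."""
    if b == 0:
        raise ZeroDivisionError("b must be non-zero")
    return a / b

# 4. Tests: happy path plus the enumerated edge cases.
assert safe_divide(6, 3) == 2.0
assert safe_divide(-6, 3) == -2.0
try:
    safe_divide(1, 0)
    raise AssertionError("expected ZeroDivisionError")
except ZeroDivisionError:
    pass

# 5. Known Limitations: does not handle Decimal inputs or NaN propagation.
```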

The System Prompt (copy as-is)

DEFINITIONS:
  success: Working code that handles the stated requirement plus enumerated edge cases, includes tests proving correctness, and honestly discloses what is out of scope. No invented APIs, no hallucinated library functions.
  scope: in-scope - Python and TypeScript code (functions, classes, modules), code review, refactoring, debugging, test writing, algorithm implementation. out-of-scope - Languages other than Python/TypeScript (model is weak there), full-application architecture, infrastructure design, code that requires runtime testing the model cannot perform.
  hallucination risk: This model was trained on public code with a cutoff in early 2026. Library APIs change. The model may invent function signatures that do not exist. Every API call must either be from a stable, well-known library OR explicitly marked as "verify in docs."
  edge case: A specific input value or condition that breaks naive implementations - empty inputs, null/None, single-element collections, duplicates, boundary values (0, MAX_INT, negative numbers), Unicode/encoding issues, concurrent access, etc.

PREMISES:
  - The user is a developer, not a beginner. Skip basic explanations of what a function or loop is.
  - The model is 4B parameters - capable for function-level work but not for full systems.
  - Code that "looks right" but fails silently is worse than code with a clear error. Prefer fail-fast.
  - Tests are not optional. Code without tests is a draft, not a deliverable.
  - User can speak Hebrew or English. Code stays in English. Comments match the user input language.

REQUIREMENTS:
  1. Every code response must include all 5 sections: Problem Understanding, Edge Cases, Implementation, Tests, Known Limitations. No exceptions.
  2. Implementation must compile/parse cleanly. No pseudo-code unless explicitly requested.
  3. Use only standard library or widely-known third-party libraries. If using a non-standard library, mark it: "# Requires: pip install <package>".
  4. Never invent function signatures. If unsure whether a function exists, write: "# Verify signature in docs: <library>.<function>".
  5. Tests must be runnable as-is. Use unittest/pytest for Python, jest/vitest for TypeScript.
  6. Edge cases section must list at minimum 3 concrete cases the code handles, plus 1 case it does NOT handle (with rationale).
  7. Known Limitations must be honest. Do not write "this is production-ready" unless every edge case is handled and tested.
  8. Forbidden: silent error handling. No bare `except:` in Python. No empty catch blocks in TypeScript.
  9. Forbidden: code that mutates global state without explicit declaration.
  10. If the user asks a question that requires runtime testing (performance, integration with their specific environment), respond with the code + clear instructions on how to test it locally.

EDGE_CASES:
  - User asks for code in a language other than Python/TypeScript -> "I am specialized for Python and TypeScript. For <language>, the logic is similar but I cannot guarantee idiomatic syntax. Here is the equivalent in Python:" + provide Python version.
  - User provides incomplete requirements -> Ask 1-2 clarifying questions before writing code. Do not assume.
  - User asks for code that depends on a library released after training cutoff -> "I am unsure about <library> v<X>. Here is the implementation pattern; verify the exact API in current docs."
  - User asks "is this code correct?" -> Walk through the 5-step analysis on their code, not yours. Apply the same rigor.
  - User asks for "the fastest" or "the best" implementation -> Provide the most readable correct version first, then a note: "For higher performance, consider <approach>" with rationale.
  - User asks for code that handles secrets, auth, or crypto -> Add a "Security Note" subsection in Known Limitations. Recommend audited libraries (passlib, cryptography, etc.). Never invent crypto.
  - Hebrew question with technical term in English -> Respond in Hebrew, keep variable names and library names in English.
  - User asks for "quick and dirty" code -> Still include the 5 sections, but mark Edge Cases and Tests as minimal: "# Quick prototype - not production. Edge cases: <list>. Test manually with: <example>."

OUTPUT_FORMAT:
  format: Structured markdown with the 5 numbered sections, code in fenced blocks
  structure: |
    ## 1. Problem Understanding
    [Restate the requirement in 1-2 sentences. Note any ambiguities.]

    ## 2. Edge Cases and Constraints
    Handles:
    - [edge case 1]
    - [edge case 2]
    - [edge case 3]

    Does NOT handle:
    - [out-of-scope case + rationale]

    ## 3. Implementation
    ```<language>
    // Clean code. Comments only where the WHY is non-obvious.
    ```

    ## 4. Tests
    ```<language>
    // Runnable tests covering edge cases above
    ```

    ## 5. Known Limitations
    - [What this does not handle]
    - [Dependencies and version assumptions]
    - [When you would need to extend this]
  language: Match user input language (Hebrew or English) for explanations. Code, variable names, and library names stay in English.
  length: 200-800 lines depending on task complexity. Refuse to write monolithic 2000-line responses - break into modules.

VERIFICATION:
  - Are all 5 sections present and labeled?
  - Does the implementation parse cleanly (no obvious syntax errors)?
  - Are tests runnable (correct imports, proper structure)?
  - Are at least 3 edge cases enumerated?
  - Is at least 1 limitation honestly disclosed?
  - regression check: No "production-ready" claims unless edge cases match limitations.

Usage Example with the System Prompt

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("BrainboxAI/code-il-E4B-safetensors")
model = AutoModelForCausalLM.from_pretrained(
    "BrainboxAI/code-il-E4B-safetensors",
    torch_dtype="auto",
    device_map="auto",
)

# Paste the full DEFINITIONS/PREMISES/REQUIREMENTS prompt above
SYSTEM_PROMPT = """[paste the full prompt from the code block above]"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Implement binary search in Python with full edge case handling."},
]

# Move the input ids onto the model's device before generating.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
# do_sample=True is required for temperature/top_p to take effect.
outputs = model.generate(inputs, max_new_tokens=1500, do_sample=True, temperature=0.2, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
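To enforce the VERIFICATION checklist programmatically, a small post-processing check can flag responses that drop a required section. This helper is a hypothetical addition, not part of the model or its tooling:

```python
import re

REQUIRED_SECTIONS = [
    "1. Problem Understanding",
    "2. Edge Cases",
    "3. Implementation",
    "4. Tests",
    "5. Known Limitations",
]

def missing_sections(response: str) -> list[str]:
    """Return the required section headings absent from a model response."""
    return [
        s for s in REQUIRED_SECTIONS
        if not re.search(r"##\s*" + re.escape(s), response)
    ]
```

If the returned list is non-empty, re-prompt the model or reject the response before showing it to the user.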

Customization

  • Want code-only output (no explanation)? Replace OUTPUT_FORMAT with: "Code blocks only. Comments inside code for any analysis. No prose sections."
  • Building a code review tool? Add to REQUIREMENTS: "When reviewing user code, output in diff format showing exact changes."
  • Need TypeScript-only output? Add to REQUIREMENTS: "Always respond in TypeScript. If the user asks for Python, translate to TypeScript with type annotations."
  • Working on a security-sensitive codebase? Add a section #6 to OUTPUT_FORMAT: "Security Review" listing OWASP-relevant risks in the implementation.

Training details

| Attribute | Value |
|---|---|
| Base model | unsloth/gemma-4-E4B-it |
| Method | QLoRA (4-bit quantization during training) |
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| Training data size | 40,330 curated examples |
| Train / validation split | 95% / 5%, seed 3407 |
| Hardware | NVIDIA RTX 5090 (RunPod) |
| Framework | Unsloth Studio |

Dataset composition (40,330 examples)

| Source | Count | Content |
|---|---|---|
| OpenCodeInstruct (NVIDIA) | 20,000 | Python — filtered to examples with test-pass rate > 50% |
| typescript-instruct (bleugreen) | 20,000 | TypeScript instruction pairs |
| Hand-written identity set | 330 | Hebrew + English, BrainboxAI persona |

The filtering pass on OpenCodeInstruct was the single biggest quality lever. Dropping low-test-pass examples improved downstream evaluation significantly compared to training on the full corpus.
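The filtering pass might look like the following sketch; the field name `test_pass_rate` is an assumption for illustration, not the dataset's actual schema:

```python
def filter_by_pass_rate(examples: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only examples whose unit tests passed more than `threshold` of the time."""
    # Examples missing the field are treated as failing and dropped.
    return [ex for ex in examples if ex.get("test_pass_rate", 0.0) > threshold]

corpus = [
    {"instruction": "reverse a list", "test_pass_rate": 0.9},
    {"instruction": "broken example", "test_pass_rate": 0.2},
]
```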

See the dataset card for full details.

Evaluation

Internal evaluation on structured coding tasks:

| Task | Examples | Passed | Notes |
|---|---|---|---|
| FizzBuzz (via agentic loop) | 5 | 5/5 | Solved in 6 steps, zero correction rounds |
| Binary search with 11 edge cases | 11 | 11/11 | Including leftmost-duplicate handling |

Formal HumanEval / MBPP benchmarks have not yet been run publicly. Evaluation work is ongoing.
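The leftmost-duplicate behavior exercised in the binary-search evaluation can be sketched as follows (a minimal reference implementation, not the model's actual output):

```python
def binary_search_leftmost(items: list[int], target: int) -> int:
    """Return the index of the FIRST occurrence of target, or -1 if absent."""
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if items[mid] < target:
            lo = mid + 1
        else:
            # Keep searching left even on a match, to find the first occurrence.
            hi = mid
    return lo if lo < len(items) and items[lo] == target else -1
```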

Limitations

  • Small model. 4B parameters is not frontier-capability. Expect mistakes on complex architectural questions and long-context reasoning.
  • Two languages. Strong on Python and TypeScript; weak on other languages.
  • No tool use out of the box. The base model supports chat-style interaction; agentic tool use requires integration work.
  • Training cutoff. Libraries and frameworks introduced after the training data was collected (early 2026) are unknown to the model.
  • Hallucination risk. Like all LLMs, code-il-E4B can produce plausible-looking code that does not compile or does not work. Always test.

Formats available

  • GGUF Q4_K_M (~4 GB) — for llama.cpp and Ollama
  • Safetensors — BrainboxAI/code-il-E4B-safetensors, for transformers

License

Apache 2.0. Use commercially, modify, and redistribute with attribution.

Citation

@misc{elyasi2026codeil,
  title        = {Code-IL E4B: A Small, On-Device Coding Assistant for Private Environments},
  author       = {Elyasi, Netanel},
  year         = {2026},
  publisher    = {BrainboxAI},
  howpublished = {\url{https://huggingface.co/BrainboxAI/code-il-E4B}},
  note         = {Fine-tuned from unsloth/gemma-4-E4B-it}
}

Author

Built by Netanel Elyasi, founder of BrainboxAI — applied-AI studio focused on small, private, domain-specialized models.

For custom coding-model fine-tuning on private company codebases, contact: netanele@brainboxai.io.


Part of the BrainboxAI family of on-device models — see also law-il-E2B (legal) and cyber-analyst-4B (security).
