Tiny Gemma4 Text 3M
This repository contains a tiny Gemma4 text-only causal language model for validation and debugging.
The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises the Gemma4 text stack in Hugging Face Transformers, including sliding attention, full attention, grouped-query attention, and per-layer input embeddings.
Model purpose
This model is designed for:
- testing Gemma4ForCausalLM
- validating Gemma4TextConfig
- checking model load/save behavior
- testing tokenizer load/save behavior
- exercising both sliding and full attention layers
- exercising grouped-query attention
- exercising Gemma4 per-layer input embedding paths
- providing a small Gemma4 checkpoint for inference-engine validation
It is not designed for:
- high-quality story generation
- benchmark comparison against production language models
- instruction following
- general OCR
- multimodal inference
- chat use
Model architecture
The model uses Gemma4ForCausalLM with a small Gemma4TextConfig.
model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024
hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 640
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32
sliding_window: 128
max_position_embeddings: 1024
layer_types:
- sliding_attention
- sliding_attention
- full_attention
- sliding_attention
- sliding_attention
- full_attention
hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06
enable_moe_block: false
use_double_wide_mlp: false
pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
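The shape relationships implied by these values can be checked with plain arithmetic. The sketch below is only a sanity check on the numbers listed above, not the Transformers implementation; the dictionary `cfg` and the derived variable names are illustrative and do not come from the library.

```python
# Values copied from the configuration listed above.
cfg = {
    "hidden_size": 160,
    "hidden_size_per_layer_input": 24,
    "num_hidden_layers": 6,
    "num_attention_heads": 5,
    "num_key_value_heads": 1,
    "head_dim": 32,
}

# Total width of the query heads: query heads * head_dim.
q_width = cfg["num_attention_heads"] * cfg["head_dim"]
# Total width of the key (or value) heads: key/value heads * head_dim.
kv_width = cfg["num_key_value_heads"] * cfg["head_dim"]
# Number of query heads that share each key/value head (GQA group size).
gqa_group_size = cfg["num_attention_heads"] // cfg["num_key_value_heads"]
# Combined width of the per-layer input embeddings: one slice per layer.
per_layer_embed_width = (
    cfg["num_hidden_layers"] * cfg["hidden_size_per_layer_input"]
)

print(q_width, kv_width, gqa_group_size, per_layer_embed_width)
```

With a single key/value head, all five query heads share it, which is the multi-query limit of grouped-query attention.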
The attention pattern is:
ssFssF
where s means sliding_attention and F means full_attention.
This pattern was chosen for validation coverage. A full-attention-only model would be easier to train, but it would not exercise the sliding attention path. This model intentionally includes both attention types.
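As a small illustration, the compact pattern string can be expanded into the layer_types list shown in the configuration. The helper name `expand_layer_pattern` is hypothetical and is not part of Transformers.

```python
def expand_layer_pattern(pattern: str) -> list[str]:
    """Map each character of a compact pattern to a layer type name."""
    names = {"s": "sliding_attention", "F": "full_attention"}
    return [names[ch] for ch in pattern]


layer_types = expand_layer_pattern("ssFssF")
print(layer_types)
```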
Parameter count
total parameters: 2,597,624
trainable parameters: 2,597,624
Top-level breakdown:
model: 2,597,624
lm_head: 163,840 (tied to model.embed_tokens, so it is not counted again in the total)
Prefix breakdown:
model.embed_tokens: 163,840
model.embed_tokens_per_layer: 147,456
model.layers.0: 377,184
model.layers.1: 377,184
model.layers.2: 377,184
model.layers.3: 377,184
model.layers.4: 377,184
model.layers.5: 377,184
model.norm: 160
model.per_layer_model_projection: 23,040
model.per_layer_projection_norm: 24
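The prefix breakdown can be cross-checked against the reported total with a few lines of arithmetic. The counts below are copied from the list above. The factorizations in the second half are inferred from the configuration values matching the counts, not read from the model source, so treat them as assumptions.

```python
# Parameter counts per prefix, copied from the breakdown above.
prefix_counts = {
    "model.embed_tokens": 163_840,
    "model.embed_tokens_per_layer": 147_456,
    "model.layers.0": 377_184,
    "model.layers.1": 377_184,
    "model.layers.2": 377_184,
    "model.layers.3": 377_184,
    "model.layers.4": 377_184,
    "model.layers.5": 377_184,
    "model.norm": 160,
    "model.per_layer_model_projection": 23_040,
    "model.per_layer_projection_norm": 24,
}
total = sum(prefix_counts.values())
print(f"{total:,}")

# Assumed factorizations, based on the configuration values.
vocab_size, hidden_size, num_layers, per_layer_dim = 1024, 160, 6, 24
embed_tokens = vocab_size * hidden_size
embed_tokens_per_layer = vocab_size * num_layers * per_layer_dim
per_layer_model_projection = hidden_size * num_layers * per_layer_dim
print(embed_tokens, embed_tokens_per_layer, per_layer_model_projection)
```

The sum equals the reported total only if lm_head is left out, which is consistent with tie_word_embeddings: true.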
Training data
The model was trained on TinyStories-style English story text.
The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. This small vocabulary is intentional: it keeps the checkpoint compact and reduces embedding size, but it also limits text generation quality.
Training setup
The model was trained as a compact text-only Gemma4 validation model.
Representative training settings:
num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
vocab_size: 1024
hidden_size: 160
intermediate_size: 640
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128
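Two quantities follow directly from these settings. The sketch assumes that batch_size counts sequences per optimizer step with no gradient accumulation, which this card does not state.

```python
# Values copied from the representative training settings above.
batch_size = 32
block_size = 256
sliding_window = 128

# Tokens processed per step, under the assumption stated in the lead-in.
tokens_per_step = batch_size * block_size
# Training sequences are longer than the window, so the sliding layers
# have to drop distant tokens instead of behaving like full attention.
block_exceeds_window = block_size > sliding_window

print(tokens_per_step, block_exceeds_window)
```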
The final training loss in the reference run was approximately:
Final loss: 3.1163
This value should not be interpreted as a quality benchmark. The model is very small and includes Gemma4-specific architectural paths primarily for validation coverage.
Example generation
Example output from the reference checkpoint:
Prompt: Once upon
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "We can play with the toys, but you can't play with it. You can play with it."
The model can generate TinyStories-like text fragments, but repetitions and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.
Usage
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4text3m"

# The tokenizer and model files are loaded from the "hf" subfolder of the repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Loading with Transformers
This checkpoint requires a Transformers version that supports Gemma4.
from transformers import Gemma4ForCausalLM, Gemma4TextConfig
If this import fails, update Transformers to a version with Gemma4 support.
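A test suite can guard on this import and skip Gemma4 tests instead of crashing. This is a minimal sketch; the helper name `gemma4_available` is hypothetical, and catching RuntimeError is a precaution because some Transformers versions may report backend import failures that way.

```python
def gemma4_available() -> bool:
    """Return True if the installed Transformers exposes the Gemma4 classes."""
    try:
        from transformers import Gemma4ForCausalLM, Gemma4TextConfig  # noqa: F401
    except (ImportError, RuntimeError):
        # Transformers is missing, too old, or failed to load its backend.
        return False
    return True


HAS_GEMMA4 = gemma4_available()
print("Gemma4 support:", HAS_GEMMA4)
```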
Intended validation coverage
This model is useful for checking that an implementation supports:
Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
generate()
save_pretrained()
from_pretrained()
Limitations
This is a tiny debug model. It should not be used as a general-purpose language model.
Known limitations:
- frequent phrase repetition
- weak long-form coherence
- frequent TinyStories template collapse
- small vocabulary
- weak semantic consistency
- no instruction tuning
- no chat formatting
- no multimodal capability
- no OCR capability
The checkpoint is primarily intended to make Gemma4 text-model code paths easy to test without downloading a large model.
Why not full attention only?
A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior. Since this checkpoint is intended for implementation validation, it uses a mixed attention pattern:
sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention
This provides better code-path coverage than FFFFFF.
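The difference between the two layer types can be illustrated with boolean attention masks. This is a sketch under one common convention, in which a query attends to itself and the previous window - 1 tokens. The exact window boundary in Transformers may differ by one, and the function names are illustrative.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """Full causal attention: query q may attend to every key k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Sliding causal attention under the assumed convention:
    query q attends to keys k with q - window < k <= q."""
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


# A sequence longer than the window: the two masks differ.
long_differs = causal_mask(6) != sliding_causal_mask(6, window=3)
# A sequence no longer than the window: the two masks are identical.
short_same = causal_mask(3) == sliding_causal_mask(3, window=3)
last_row = sliding_causal_mask(6, window=3)[5]

print(long_differs, short_same, last_row)
```

One practical consequence is that a validation prompt needs more tokens than sliding_window (128 here) before the sliding layers behave differently from the full-attention layers.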
Notes on OCR and multimodal use
This repository is text-only. It does not include a vision tower, image projector, image token alignment, or OCR training.
A Gemma4 OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.
Citation
This is a synthetic tiny validation checkpoint derived from Gemma4-compatible architecture settings. It is intended for debugging and implementation testing.