Tiny Gemma4 Text 3M

This repository contains a tiny Gemma4 text-only causal language model for validation and debugging.

The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises the Gemma4 text stack in Hugging Face Transformers, including sliding attention, full attention, grouped-query attention, and per-layer input embeddings.

Model purpose

This model is designed for:

  • testing Gemma4ForCausalLM
  • validating Gemma4TextConfig
  • checking model load/save behavior
  • testing tokenizer load/save behavior
  • exercising both sliding and full attention layers
  • exercising grouped-query attention
  • exercising Gemma4 per-layer input embedding paths
  • providing a small Gemma4 checkpoint for inference-engine validation

It is not designed for:

  • high-quality story generation
  • benchmark comparison against production language models
  • instruction following
  • general OCR
  • multimodal inference
  • chat use

Model architecture

The model uses Gemma4ForCausalLM with a small Gemma4TextConfig.

model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024
hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 640
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32
sliding_window: 128
max_position_embeddings: 1024
layer_types:
  - sliding_attention
  - sliding_attention
  - full_attention
  - sliding_attention
  - sliding_attention
  - full_attention
hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06
enable_moe_block: false
use_double_wide_mlp: false
pad_token_id: 2
bos_token_id: 0
eos_token_id: 1

The attention pattern is:

ssFssF

where s means sliding_attention and F means full_attention.
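As a small illustration, the shorthand can be expanded back into the layer_types list shown in the configuration. The helper name expand_pattern and the LAYER_CODES table are hypothetical and are not part of Transformers.

```python
# Hypothetical helper: expand the "s"/"F" shorthand used in this card
# into the layer type names listed under layer_types.
LAYER_CODES = {"s": "sliding_attention", "F": "full_attention"}


def expand_pattern(pattern: str) -> list[str]:
    """Map each shorthand character to its full layer type name."""
    return [LAYER_CODES[code] for code in pattern]


layer_types = expand_pattern("ssFssF")
print(layer_types)
```

The result has one entry per hidden layer, so its length should equal num_hidden_layers (6).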

This pattern was chosen for validation coverage. A full-attention-only model would be easier to train, but it would not exercise the sliding attention path. This model intentionally includes both attention types.

Parameter count

total parameters:     2,597,624
trainable parameters: 2,597,624

Top-level breakdown:

model:   2,597,624
lm_head:   163,840 (tied to model.embed_tokens, so not counted again in the total)

Prefix breakdown:

model.embed_tokens:                163,840
model.embed_tokens_per_layer:      147,456
model.layers.0:                    377,184
model.layers.1:                    377,184
model.layers.2:                    377,184
model.layers.3:                    377,184
model.layers.4:                    377,184
model.layers.5:                    377,184
model.norm:                            160
model.per_layer_model_projection:   23,040
model.per_layer_projection_norm:        24
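The breakdown can be cross-checked with a few lines of arithmetic. This is a sketch: the three products below are inferred from the reported counts and the configuration values, not read from the Transformers source, and the per-layer figure is taken as reported.

```python
# Values copied from the configuration listed in this card.
vocab_size = 1024
vocab_size_per_layer_input = 1024
hidden_size = 160
hidden_size_per_layer_input = 24
num_hidden_layers = 6

# Assumption: these products are inferred from the reported counts.
embed_tokens = vocab_size * hidden_size
embed_tokens_per_layer = (
    vocab_size_per_layer_input * num_hidden_layers * hidden_size_per_layer_input
)
per_layer_model_projection = (
    hidden_size * num_hidden_layers * hidden_size_per_layer_input
)

# Reported values, used as-is from the prefix breakdown.
params_per_layer = 377_184
final_norm = 160
per_layer_projection_norm = 24

total = (
    embed_tokens
    + embed_tokens_per_layer
    + num_hidden_layers * params_per_layer
    + final_norm
    + per_layer_model_projection
    + per_layer_projection_norm
)
print(embed_tokens, embed_tokens_per_layer, per_layer_model_projection, total)
```

lm_head is not added to the sum. The prefix entries reach the reported total without it, which is consistent with tie_word_embeddings: true.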

Training data

The model was trained on TinyStories-style English story text.

The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. This small vocabulary is intentional: it keeps the checkpoint compact and reduces embedding size, but it also limits text generation quality.

Training setup

The model was trained as a compact text-only Gemma4 validation model.

Representative training settings:

num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
vocab_size: 1024
hidden_size: 160
intermediate_size: 640
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128

The final training loss in the reference run was approximately:

Final loss: 3.1163

This value should not be interpreted as a quality benchmark. The model is very small and includes Gemma4-specific architectural paths primarily for validation coverage.
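For a sense of scale only, the loss can be converted to a perplexity. This sketch assumes the reported value is a mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

final_loss = 3.1163  # final training loss reported in this card

# Assumption: the loss is mean cross-entropy per token, measured in nats.
perplexity = math.exp(final_loss)
print(f"perplexity: {perplexity:.2f}")
```

Under that assumption the perplexity is about 22.6 over a 1024-token vocabulary. It is a training-set figure, so it says nothing about held-out quality.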

Example generation

Example output from the reference checkpoint:

Prompt: Once upon

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "We can play with the toys, but you can't play with it. You can play with it."

The model can generate TinyStories-like text fragments, but repetitions and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.

Usage

import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4text3m"

# The tokenizer and model files are loaded from the "hf" subfolder of the repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()  # inference mode: disables dropout

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode the prompt together with the generated continuation.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Loading with Transformers

This checkpoint requires a Transformers version that supports Gemma4.

from transformers import Gemma4ForCausalLM, Gemma4TextConfig

If this import fails, update Transformers to a version with Gemma4 support.
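A test suite can guard against an outdated installation before loading the checkpoint. The function name has_gemma4_support is hypothetical. The sketch only reports whether the import works and does not check a specific version number.

```python
def has_gemma4_support() -> bool:
    """Return True if the installed Transformers exposes the Gemma4 classes."""
    try:
        from transformers import Gemma4ForCausalLM, Gemma4TextConfig  # noqa: F401
    except Exception:
        # Covers a missing package, a missing class, and failures while
        # the library resolves its lazily imported submodules.
        return False
    return True


if not has_gemma4_support():
    print("Gemma4 classes not found; try: pip install --upgrade transformers")
```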

Intended validation coverage

This model is useful for checking that an implementation supports:

Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1 (all query heads share a single key/value head)
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
generate()
save_pretrained()
from_pretrained()
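The GQA item in the list above can be pictured with the usual grouped-query head layout, where consecutive query heads share one key/value head. This is a generic sketch of that layout using the head counts from this card, not code taken from the Gemma4 implementation.

```python
# Head counts from the configuration in this card.
num_attention_heads = 5
num_key_value_heads = 1

# Usual grouped-query layout: consecutive query heads share one KV head.
group_size = num_attention_heads // num_key_value_heads
kv_head_for_query = [q // group_size for q in range(num_attention_heads)]

print(group_size, kv_head_for_query)
```

With a single key/value head, every query head maps to KV head 0. An implementation that mishandles the key/value head repeat for a group size of 5 should fail visibly on this checkpoint.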

Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

  • frequent phrase repetition
  • weak long-form coherence
  • frequent TinyStories template collapse
  • small vocabulary
  • weak semantic consistency
  • no instruction tuning
  • no chat formatting
  • no multimodal capability
  • no OCR capability

The checkpoint is primarily intended to make Gemma4 text-model code paths easy to test without downloading a large model.

Why not full attention only?

A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior. Since this checkpoint is intended for implementation validation, it uses a mixed attention pattern:

sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention

This provides better code-path coverage than FFFFFF.
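The difference between the two layer types can be sketched as two causal masks. The function causal_mask is hypothetical, the window of 3 is chosen only for readability (the checkpoint uses 128), and the exact boundary convention in Transformers may differ by one position.

```python
def causal_mask(seq_len, window=None):
    """Return mask[i][j] == True where query position i may attend to key j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: no attention to future positions
            if window is not None:
                # Sliding window: only the most recent `window` positions.
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


full = causal_mask(6)
sliding = causal_mask(6, window=3)

for name, mask in (("full", full), ("sliding", sliding)):
    print(name)
    for row in mask:
        print("".join("x" if visible else "." for visible in row))
```

Under this convention the two masks are identical until the sequence is longer than the window. A validation prompt therefore needs more than 128 tokens before the sliding layers of this checkpoint behave differently from the full-attention layers.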

Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image token alignment, or OCR training.

A Gemma4 OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

Citation

This is a synthetic tiny validation checkpoint derived from Gemma4-compatible architecture settings. It is intended for debugging and implementation testing.
