Instructions for using RetentionLabs/TTTPilot-Q-5B-Thinking-MAC with libraries and local apps.

Libraries

How to use RetentionLabs/TTTPilot-Q-5B-Thinking-MAC with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RetentionLabs/TTTPilot-Q-5B-Thinking-MAC", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("RetentionLabs/TTTPilot-Q-5B-Thinking-MAC", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use RetentionLabs/TTTPilot-Q-5B-Thinking-MAC with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RetentionLabs/TTTPilot-Q-5B-Thinking-MAC"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RetentionLabs/TTTPilot-Q-5B-Thinking-MAC",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/RetentionLabs/TTTPilot-Q-5B-Thinking-MAC

SGLang

How to use RetentionLabs/TTTPilot-Q-5B-Thinking-MAC with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RetentionLabs/TTTPilot-Q-5B-Thinking-MAC" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RetentionLabs/TTTPilot-Q-5B-Thinking-MAC",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RetentionLabs/TTTPilot-Q-5B-Thinking-MAC" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RetentionLabs/TTTPilot-Q-5B-Thinking-MAC",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


TTTPilot-Q-5B-Thinking-MAC

A Pilot Implementation of Titans MAC Architecture

Combines TTT-Linear-1.3B-Base-Pile-8k (memory layers) and Qwen3-4B-Thinking-2507 (core layers) using the Memory as Context (MAC) architecture pattern.

🎯 Pilot Experiment Overview

This is an experimental pilot exploring how test-time training (TTT) memory layers can be combined with standard transformer cores in a modular architecture. The MAC pattern separates:

Memory Layers: TTT-Linear's self-adaptation mechanism for dynamic context learning
Core Layers: Qwen3's transformer decoder for reasoning and generation

Key Idea: MAC Processing Flow

Input Sequence (Processed in Segments)
    ↓
┌─────────────────────────────────────┐
│  1. Read-Only Retrieval (R-Mode)   │  ← Memory layers (Q-only projection)
│     Generate memory queries          │    Enables parallel computation
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  2. Core Processing                 │  ← Qwen3 transformer layers
│     [Fixed Memory + Query + Input]  │    Standard attention + MLP
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  3. Memory Update (W-Mode)          │  ← Memory layers (full QKV)
│     Update context representations  │    Test-time adaptation
└─────────────────────────────────────┘
    ↓
Final Output (Join updated context with core output)
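
The flow in the diagram can be sketched as a plain Python loop. This is a toy illustration, not the pilot's code: `mac_forward`, `toy_read`, `toy_core`, and `toy_write` are hypothetical names, hidden states are single floats, and the final join is assumed to be an element-wise sum.

```python
from typing import Callable, List

Token = float  # toy stand-in for one hidden-state vector


def mac_forward(
    tokens: List[Token],
    segment_size: int,
    fixed_memory: List[Token],
    read: Callable[[List[Token], List[Token]], List[Token]],
    core: Callable[[List[Token]], List[Token]],
    write: Callable[[List[Token], List[Token]], List[Token]],
) -> List[Token]:
    """Toy MAC loop: R-mode retrieval, core processing, W-mode update, join."""
    state: List[Token] = []  # adaptive memory, empty before the first segment
    outputs: List[Token] = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        # 1. R-mode: query the memory without modifying it.
        retrieved = read(state, segment)
        # 2. Core: run on [fixed memory + retrieved + segment].
        core_out = core(fixed_memory + retrieved + segment)
        segment_out = core_out[-len(segment):]  # keep only the segment positions
        # 3. W-mode: update the memory from the core output.
        state = write(state, segment_out)
        # Final join (assumed here to be a sum with the updated memory's readout).
        recalled = read(state, segment_out)
        outputs.extend(y + m for y, m in zip(segment_out, recalled))
    return outputs


def toy_read(state: List[Token], queries: List[Token]) -> List[Token]:
    """Return the mean of the stored values once per query (read-only)."""
    average = sum(state) / len(state) if state else 0.0
    return [average for _ in queries]


def toy_core(sequence: List[Token]) -> List[Token]:
    """Element-wise stand-in for the transformer layers (ignores context)."""
    return [x + 1.0 for x in sequence]


def toy_write(state: List[Token], values: List[Token]) -> List[Token]:
    """Append the new values and return the new memory state."""
    return state + values


result = mac_forward(
    tokens=[1.0, 2.0, 3.0, 4.0, 5.0],
    segment_size=2,
    fixed_memory=[0.0, 0.0, 0.0],
    read=toy_read,
    core=toy_core,
    write=toy_write,
)
print(result)  # [4.5, 5.5, 7.5, 8.5, 10.0]
```

Passing `read` and `write` as separate callables mirrors the split between the Q-only retrieval path and the full memory update.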

Architecture Benefits

Parallel Memory Retrieval: R-mode uses only the Q projection and never writes to memory, so retrieval for a segment can be computed in parallel
Weight Tying: retriever.q shares its weights with memory.q, so the retriever adds no extra Q-projection parameters
Modular Design: Memory and core layers can be scaled or fine-tuned independently
Hybrid Capabilities: Combines TTT's test-time adaptation with a pretrained transformer core for reasoning and generation
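
A minimal sketch of what weight tying between retriever.q and memory.q could look like, and how it can be checked. The classes below (`Projection`, `Memory`, `Retriever`) are hypothetical stand-ins written in plain Python, not the pilot's modules.

```python
from typing import List


class Projection:
    """Tiny linear projection over Python lists (no bias)."""

    def __init__(self, weight: List[List[float]]) -> None:
        self.weight = weight

    def __call__(self, x: List[float]) -> List[float]:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]


class Memory:
    """Full memory module with separate Q, K, and V projections."""

    def __init__(self, dim: int) -> None:
        identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.q = Projection(identity)
        self.k = Projection([row[:] for row in identity])
        self.v = Projection([row[:] for row in identity])


class Retriever:
    """Q-only view of the memory: reuses the same projection object."""

    def __init__(self, memory: Memory) -> None:
        self.q = memory.q  # tied: no copy is made


memory = Memory(dim=2)
retriever = Retriever(memory)

# An in-place update to memory.q is visible through the retriever.
memory.q.weight[0][0] = 3.0
print(retriever.q is memory.q)    # True
print(retriever.q([1.0, 1.0]))    # [3.0, 1.0]
```

An identity check of this kind (`retriever.q is memory.q`) is one way to verify that tying actually holds after loading a checkpoint.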

📊 Model Statistics

| Component | Source Model | Layers | Parameters | Intermediate Size |
|---|---|---|---|---|
| Embedding & LM Head | Qwen3-4B-Thinking | - | ~310M (vocab: 151,936) | - |
| Memory Layers | TTT-Linear-1.3B | 24 | ~1.3B | 5,504 |
| Core Layers | Qwen3-4B-Thinking | 36 | ~4B | 9,728 |
| Total | Combined MAC | 60 | ~5.6B | Mixed |

Hidden Size: 2048 (from TTT-Linear)
Attention Heads: 32 (memory); 32 query heads with 8 KV heads via GQA (core)
Context Length: Up to 262K tokens (Qwen3's max)
Precision: BFloat16
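
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes the ~310M entry corresponds to a single vocab × hidden matrix, which the table does not state explicitly; the per-component totals are the rounded values from the table.

```python
vocab_size = 151_936
hidden_size = 2_048

# One vocab x hidden matrix (assumption: this is what "~310M" refers to).
embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,}")  # 311,164,928

# Rounded per-component totals from the table, in billions of parameters.
component_params_billion = {
    "embedding_and_lm_head": 0.31,
    "memory_layers": 1.3,
    "core_layers": 4.0,
}
total_billion = round(sum(component_params_billion.values()), 2)
print(total_billion)  # 5.61

# Layer count: memory layers plus core layers.
total_layers = 24 + 36
print(total_layers)  # 60
```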

🏗️ Architecture Details

Memory Module (TTT-Linear)

Purpose: Dynamic context adaptation through test-time training
Key Components:
- Self-adaptation layers with learnable neural memory
- Momentum-based learning rate gates
- Q/K/V projections with RoPE
- Mini-batch processing (chunk size 16, set by mini_batch_size)
Special Features:
- Weight tying between retriever and full memory
- Shared Q/K projections with separate conv layers
- Learnable token-wise learning rates

Core Module (Qwen3)

Purpose: High-capacity reasoning and generation
Key Components:
- Multi-head attention with Grouped Query Attention (GQA)
- SwiGLU MLP activations
- RMSNorm for layer normalization
- RoPE with theta=5,000,000 for long context
Special Features:
- 8 KV heads for efficient inference
- No attention bias
- Sliding window attention support
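
With 32 query heads and 8 KV heads, each KV head serves a group of 4 query heads. The sketch below assumes the common contiguous grouping (query head h reads KV head h // group_size); `kv_head_for_query_head` is a hypothetical helper, not a function from the model code.

```python
def kv_head_for_query_head(
    q_head: int,
    num_attention_heads: int = 32,
    num_key_value_heads: int = 8,
) -> int:
    """Return the index of the KV head that a given query head attends with."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    if not 0 <= q_head < num_attention_heads:
        raise IndexError(f"query head {q_head} out of range")
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


# Group the 32 query heads by the KV head they share.
groups = {
    kv: [q for q in range(32) if kv_head_for_query_head(q) == kv]
    for kv in range(8)
}
print(groups[0])  # [0, 1, 2, 3]
print(groups[7])  # [28, 29, 30, 31]
```

Since only 8 of 32 heads store keys and values, the KV cache is a quarter of the size it would be with one KV head per query head.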

Fixed Persistent Memory

Size: 64 tokens × 2048 dimensions
Purpose: Store global context/knowledge across segments
Initialization: Zeros (trainable parameter)
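
For a sense of scale, the persistent memory is small relative to the rest of the model. The arithmetic below uses the sizes stated above and assumes 2 bytes per value (BFloat16); the nested list is only a plain-Python stand-in for the zero-initialized parameter.

```python
fixed_memory_size = 64   # persistent memory tokens
hidden_size = 2_048      # dimensions per token
bytes_per_value = 2      # BFloat16

num_params = fixed_memory_size * hidden_size
num_bytes = num_params * bytes_per_value
print(num_params)        # 131072
print(num_bytes / 1024)  # 256.0 (KiB)

# Zero-initialized stand-in for the trainable parameter.
fixed_memory = [[0.0] * hidden_size for _ in range(fixed_memory_size)]
print(len(fixed_memory), len(fixed_memory[0]))  # 64 2048
```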

🚀 Usage

Installation

# Install dependencies
pip install torch transformers safetensors accelerate

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model from a local directory (the Hub ID RetentionLabs/TTTPilot-Q-5B-Thinking-MAC also works)
model = AutoModelForCausalLM.from_pretrained(
    "./TTTPilot-Q-5B-Thinking-MAC",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("./TTTPilot-Q-5B-Thinking-MAC")

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced: Segment Processing

# MAC is designed to process inputs in segments for memory efficiency
# Segment size is controlled by mini_batch_size (default: 16)

# For long sequences, the intended flow is:
# 1. Chunk the input into mini-batches
# 2. Process each chunk through R-mode → Core → W-mode
# 3. Accumulate context updates across chunks
# Note: the pilot does not run this segment flow yet (see Known Limitations)

long_text = "..." * 1000  # Placeholder for a very long input
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
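
The chunking step on its own is simple to illustrate. The helper below is a hypothetical sketch, not a function exposed by the model; in particular, how the pilot treats a trailing partial chunk is not documented, so this version just keeps it as a shorter final chunk.

```python
from typing import List


def chunk_into_mini_batches(token_ids: List[int], mini_batch_size: int = 16) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of mini_batch_size."""
    if mini_batch_size <= 0:
        raise ValueError("mini_batch_size must be positive")
    return [
        token_ids[start:start + mini_batch_size]
        for start in range(0, len(token_ids), mini_batch_size)
    ]


chunks = chunk_into_mini_batches(list(range(40)))
print([len(chunk) for chunk in chunks])  # [16, 16, 8]

# Number of segments at the maximum context length of 262,144 tokens.
print(262_144 // 16)  # 16384
```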

🔧 Weight Conversion

To recreate this model from source checkpoints:

python convert_weights.py

The script:

Loads TTT-Linear-1.3B and Qwen3-4B-Thinking
Maps TTT weights → memory layers (preserving exact key names)
Maps Qwen3 weights → core layers (preserving exact key names)
Uses Qwen3's embedding & lm_head (for vocab compatibility)
Copies tokenizer files from Qwen3
Saves combined model in HuggingFace format
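
The mapping steps can be pictured as a merge of two state dicts. The sketch below is an assumption about the general shape of such a merge, not the contents of convert_weights.py: the `memory.` and `core.` prefixes, the shared-key prefixes, and the example key names are all hypothetical, and strings stand in for tensors.

```python
from typing import Dict, Tuple

# Hypothetical key prefixes for the weights taken from Qwen3 unchanged.
SHARED_PREFIXES: Tuple[str, ...] = ("model.embed_tokens.", "lm_head.")


def merge_state_dicts(
    ttt_state: Dict[str, object],
    qwen_state: Dict[str, object],
    memory_prefix: str = "memory.",
    core_prefix: str = "core.",
) -> Dict[str, object]:
    """Combine two checkpoints, keeping each source's original key as a suffix."""
    merged: Dict[str, object] = {}
    for key, tensor in ttt_state.items():
        if key.startswith(SHARED_PREFIXES):
            continue  # embedding and LM head come from Qwen3 for vocab compatibility
        merged[memory_prefix + key] = tensor
    for key, tensor in qwen_state.items():
        if key.startswith(SHARED_PREFIXES):
            merged[key] = tensor
        else:
            merged[core_prefix + key] = tensor
    return merged


# Toy checkpoints with made-up key names.
ttt_state = {"model.embed_tokens.weight": "ttt_emb", "layers.0.q_proj.weight": "ttt_q"}
qwen_state = {
    "model.embed_tokens.weight": "qwen_emb",
    "lm_head.weight": "qwen_head",
    "layers.0.self_attn.q_proj.weight": "qwen_q",
}

merged = merge_state_dicts(ttt_state, qwen_state)
for name in sorted(merged):
    print(name, "->", merged[name])
```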

⚙️ Configuration

Key hyperparameters in config.json (the // comments are annotations for this README; JSON itself does not allow comments):

{
  "model_type": "tttpilot_mac",
  "vocab_size": 151936,          // Qwen3
  "hidden_size": 2048,            // TTT-Linear
  "num_memory_layers": 24,        // TTT-Linear
  "num_core_layers": 36,          // Qwen3
  "memory_intermediate_size": 5504,   // TTT MLP
  "core_intermediate_size": 9728,     // Qwen3 MLP
  "num_attention_heads": 32,
  "num_key_value_heads": 8,       // GQA in cores
  "mini_batch_size": 16,          // TTT chunk size
  "ttt_base_lr": 1.0,             // TTT learning rate
  "fixed_memory_size": 64,        // Persistent memory tokens
  "rope_theta": 5000000,          // Long context RoPE
  "max_position_embeddings": 262144   // Max sequence length
}
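
A few of these values constrain each other, so a quick sanity check can catch a mis-edited config. The sketch below parses a comment-free copy of some of the fields above with the standard `json` module; `check_config` and the particular checks are illustrative assumptions, not part of the model code.

```python
import json
from typing import Dict, List

CONFIG_JSON = """
{
  "model_type": "tttpilot_mac",
  "vocab_size": 151936,
  "hidden_size": 2048,
  "num_memory_layers": 24,
  "num_core_layers": 36,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "mini_batch_size": 16,
  "fixed_memory_size": 64,
  "max_position_embeddings": 262144
}
"""


def check_config(cfg: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems: List[str] = []
    if cfg["num_attention_heads"] % cfg["num_key_value_heads"] != 0:
        problems.append("num_attention_heads is not a multiple of num_key_value_heads")
    if cfg["mini_batch_size"] <= 0:
        problems.append("mini_batch_size must be positive")
    if cfg["max_position_embeddings"] % cfg["mini_batch_size"] != 0:
        problems.append("max_position_embeddings is not a multiple of mini_batch_size")
    return problems


config = json.loads(CONFIG_JSON)
problems = check_config(config)
total_layers = config["num_memory_layers"] + config["num_core_layers"]
gqa_group_size = config["num_attention_heads"] // config["num_key_value_heads"]

print(problems)        # []
print(total_layers)    # 60
print(gqa_group_size)  # 4
```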

🧪 Pilot Experiment Status

What Works

✅ Model architecture defined
✅ Weight conversion pipeline
✅ Configuration files generated
✅ Tokenizer compatibility (Qwen3)
✅ Basic forward pass structure

What's Experimental

⚠️ MAC segment processing: Simplified in pilot, needs full TTT integration
⚠️ Retriever implementation: Placeholder, requires Q-only inference mode
⚠️ Weight tying: Defined but not verified in practice
⚠️ Memory update logic: Full TTT adaptation step needs integration

Known Limitations

Memory layers use placeholder identity functions (need full TTT code)
No actual segment-based MAC flow yet (processes like standard transformer)
TTT cache and Qwen3 KV cache not properly integrated
No training/fine-tuning tested
Generation quality not benchmarked

🎓 Research Context

This pilot implements concepts from:

Test-Time Training (TTT): Self-supervised adaptation during inference
- Paper: Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- Code: TTT-Linear
Titans Architecture: Modular memory-augmented patterns
- Inspiration: Memory-as-X design patterns (MAC, MAG, MAL)
- Idea: Separate stateful memory from stateless reasoning
Grouped Query Attention: Efficient multi-head attention
- From: Qwen3 and modern LLMs
- Benefit: Faster inference with minimal quality loss

📝 Citation

If you use this pilot or build upon it:

@misc{tttpilot-mac-2026,
  title={TTTPilot-MAC: A Pilot Implementation of Memory-Augmented-Core Architecture},
  author={Your Name},
  year={2026},
  note={Pilot experiment combining TTT-Linear and Qwen3},
  howpublished={\url{https://github.com/...}}
}

Source Models: