TTTPilot-Q-5B-Thinking-MAC
A Pilot Implementation of Titans MAC Architecture
This model combines TTT-Linear-1.3B-Base-Pile-8k and Qwen3-4B-Thinking-2507 using the Memory as Context (MAC) architecture pattern.
🎯 Pilot Experiment Overview
This is an experimental pilot exploring how test-time training (TTT) memory layers can be combined with standard transformer cores in a modular architecture. The MAC pattern separates:
- Memory Layers: TTT-Linear's self-adaptation mechanism for dynamic context learning
- Core Layers: Qwen3's transformer decoder for reasoning and generation
Key Idea: MAC Processing Flow
Input Sequence (Processed in Segments)
↓
┌─────────────────────────────────────┐
│ 1. Read-Only Retrieval (R-Mode) │ ← Memory layers (Q-only projection)
│ Generate memory queries │ Enables parallel computation
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 2. Core Processing │ ← Qwen3 transformer layers
│ [Fixed Memory + Query + Input] │ Standard attention + MLP
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 3. Memory Update (W-Mode) │ ← Memory layers (full QKV)
│ Update context representations │ Test-time adaptation
└─────────────────────────────────────┘
↓
Final Output (Join updated context with core output)
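The three phases in the diagram can be sketched as a loop over segments. This is a toy illustration of the control flow only, not the model's code: `retrieve`, `core`, `update`, and `mac_forward` are hypothetical names, tokens are plain floats, and the final join step is reduced to collecting the core outputs.

```python
# Toy sketch of the MAC segment loop (all names are hypothetical).
FIXED_MEMORY = [0.0] * 4  # stands in for the persistent memory tokens


def retrieve(memory_state, segment):
    """R-mode: read from memory without modifying it."""
    summary = sum(memory_state) / len(memory_state) if memory_state else 0.0
    return [summary for _ in segment]


def core(fixed, retrieved, segment):
    """Core: process the concatenation [fixed memory + retrieved + input]."""
    context = fixed + retrieved + segment
    mean = sum(context) / len(context)  # toy replacement for attention
    return [x + mean for x in segment]


def update(memory_state, core_out):
    """W-mode: write a summary of the segment's result into memory."""
    return memory_state + [sum(core_out) / len(core_out)]


def mac_forward(segments):
    memory_state, outputs = [], []
    for segment in segments:
        retrieved = retrieve(memory_state, segment)   # 1. R-mode
        out = core(FIXED_MEMORY, retrieved, segment)  # 2. Core processing
        memory_state = update(memory_state, out)      # 3. W-mode
        outputs.extend(out)
    return outputs, memory_state


outputs, state = mac_forward([[1.0, 2.0], [3.0, 4.0]])
print(len(outputs), len(state))
```

The point of the ordering is that each segment reads memory as it stood after the previous segment, and only writes to it once the core has finished.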
Architecture Benefits
- Parallel Memory Retrieval: Q-only projection in R-mode enables efficient segment processing
- Weight Tying: retriever.q shares weights with memory.q for efficiency
- Modular Design: Memory and core can be independently scaled/fine-tuned
- Hybrid Capabilities: Combines TTT's adaptive learning with transformer's proven performance
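To make the weight-tying point concrete, here is a minimal pure-Python sketch of what "shares weights" means: both projections hold a reference to the same parameter object, so no copy is stored and an update through one is visible through the other. The `Projection` class is a hypothetical stand-in, not one of the model's modules.

```python
# Toy illustration of weight tying (hypothetical class, not the model's code).
class Projection:
    def __init__(self, weight):
        self.weight = weight  # list of rows; shared by reference when tied

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]


memory_q = Projection([[1.0, 0.0], [0.0, 1.0]])
retriever_q = Projection(memory_q.weight)  # tie: same object, no copy

# An in-place update to the memory projection is visible to the retriever.
memory_q.weight[0][0] = 2.0
print(retriever_q([1.0, 1.0]))  # [2.0, 1.0]
```

In PyTorch, an analogous check would be to compare `data_ptr()` of the two weight tensors; whether the pilot's modules are laid out that way is not confirmed here.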
📊 Model Statistics
| Component | Source Model | Layers | Parameters | Intermediate Size |
|---|---|---|---|---|
| Embedding & LM Head | Qwen3-4B-Thinking | - | ~310M (vocab: 151,936) | - |
| Memory Layers | TTT-Linear-1.3B | 24 | ~1.3B | 5,504 |
| Core Layers | Qwen3-4B-Thinking | 36 | ~4B | 9,728 |
| Total | Combined MAC | 60 layers | ~5.6B | Mixed |
Hidden Size: 2048 (from TTT-Linear)
Attention Heads: 32 (memory), 32/8 GQA (core)
Context Length: Up to 262K tokens (Qwen3's max)
Precision: BFloat16
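As a rough sanity check on the table, the figures can be added up directly. The sketch below assumes that the ~310M embedding figure corresponds to a single vocab × hidden matrix, and it takes the memory and core counts as the approximate values listed above.

```python
# Back-of-the-envelope check using the figures from the table above.
vocab_size = 151_936
hidden_size = 2_048

embedding_params = vocab_size * hidden_size  # one embedding matrix (assumption)
memory_params = 1.3e9                         # approximate, from the table
core_params = 4.0e9                           # approximate, from the table

total_params = embedding_params + memory_params + core_params
bf16_bytes = total_params * 2                 # BFloat16 uses 2 bytes per parameter

print(f"embedding: {embedding_params / 1e6:.0f}M")
print(f"total: {total_params / 1e9:.1f}B")
print(f"bf16 weights: {bf16_bytes / 1e9:.1f} GB")
```

The last line estimates weight storage only; activations and caches add to it at inference time.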
🏗️ Architecture Details
Memory Module (TTT-Linear)
- Purpose: Dynamic context adaptation through test-time training
- Key Components:
- Self-adaptation layers with learnable neural memory
- Momentum-based learning rate gates
- Q/K/V projections with RoPE
- Mini-batch processing (chunk_size=16)
- Special Features:
- Weight tying between retriever and full memory
- Shared Q/K projections with separate conv layers
- Learnable token-wise learning rates
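The core idea of a TTT-style memory can be sketched in a few lines: the memory is a small linear map that takes a gradient step at inference time to reconstruct a value from a key, and is then read with a query. This is a heavily simplified illustration under stated assumptions: it updates one token at a time rather than in mini-batches of 16, uses a fixed learning rate `lr` instead of learned token-wise rates, and omits momentum, gating, convolutions, and RoPE. The function names are hypothetical.

```python
# Simplified TTT-style update: the memory is a linear map W trained at
# inference time to reconstruct v from k. Illustrative only.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_step(W, q, k, v, lr=0.1):
    """One test-time training step, followed by a read with the updated W."""
    pred = matvec(W, k)
    err = [p - t for p, t in zip(pred, v)]  # d/d(pred) of 0.5 * ||W k - v||^2
    # The gradient with respect to W is the outer product err * k^T.
    W_new = [[w - lr * e * ki for w, ki in zip(row, k)] for row, e in zip(W, err)]
    return W_new, matvec(W_new, q)


W = [[0.0, 0.0], [0.0, 0.0]]
k, v = [1.0, 0.0], [1.0, 2.0]
for _ in range(50):
    W, out = ttt_step(W, q=k, k=k, v=v)
print([round(x, 2) for x in out])
```

After repeated steps on the same key, reading with that key returns a value close to `v`, which is the sense in which the memory "learns" its context.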
Core Module (Qwen3)
- Purpose: High-capacity reasoning and generation
- Key Components:
- Multi-head attention with Grouped Query Attention (GQA)
- SwiGLU MLP activations
- RMSNorm for layer normalization
- RoPE with theta=5,000,000 for long context
- Special Features:
- 8 KV heads for efficient inference
- No attention bias
- Sliding window attention support
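The 32/8 GQA layout means four query heads share each key/value head. A short sketch of that mapping, using the common convention that consecutive query heads form a group (an assumption about this pilot, though it matches the usual implementation):

```python
# Grouped Query Attention head mapping for 32 query heads and 8 KV heads.
num_attention_heads = 32
num_key_value_heads = 8

group_size = num_attention_heads // num_key_value_heads  # query heads per KV head


def kv_head_for(query_head):
    """Index of the KV head that a given query head attends with."""
    return query_head // group_size


print(group_size)                               # 4
print([kv_head_for(h) for h in (0, 3, 4, 31)])  # [0, 0, 1, 7]
```

Because only 8 key/value heads are cached instead of 32, the KV cache is 4x smaller than with full multi-head attention at the same head dimension.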
Fixed Persistent Memory
- Size: 64 tokens × 2048 dimensions
- Purpose: Store global context/knowledge across segments
- Initialization: Zeros (trainable parameter)
🚀 Usage
Installation
# Install dependencies
pip install torch transformers safetensors accelerate
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained(
"./TTTPilot-Q-5B-Thinking-MAC",
trust_remote_code=True,
torch_dtype="bfloat16",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./TTTPilot-Q-5B-Thinking-MAC")
# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Advanced: Segment Processing
# MAC is designed to process inputs in segments for memory efficiency.
# Segment size is controlled by mini_batch_size (default: 16).
# In the intended design, long sequences are handled automatically:
# 1. Chunk the input into mini-batches
# 2. Process each chunk through R-mode → Core → W-mode
# 3. Accumulate context updates across chunks
# Note: the current pilot does not yet implement this flow (see Known Limitations).
long_text = "..." * 1000 # Very long input
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
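For readers who want to see what step 1 amounts to, the chunking can be sketched as a plain helper. `split_into_segments` is a hypothetical function, not part of the model's API, and how the model treats a trailing partial segment (padding or otherwise) is not specified here.

```python
def split_into_segments(token_ids, mini_batch_size=16):
    """Split a token list into consecutive segments of at most mini_batch_size."""
    if mini_batch_size <= 0:
        raise ValueError("mini_batch_size must be positive")
    return [
        token_ids[start:start + mini_batch_size]
        for start in range(0, len(token_ids), mini_batch_size)
    ]


segments = split_into_segments(list(range(40)))
print([len(s) for s in segments])  # [16, 16, 8]
```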
🔧 Weight Conversion
To recreate this model from source checkpoints:
python convert_weights.py
The script:
- Loads TTT-Linear-1.3B and Qwen3-4B-Thinking
- Maps TTT weights → memory layers (preserving exact key names)
- Maps Qwen3 weights → core layers (preserving exact key names)
- Uses Qwen3's embedding & lm_head (for vocab compatibility)
- Copies tokenizer files from Qwen3
- Saves combined model in HuggingFace format
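The mapping steps can be illustrated with ordinary dictionaries standing in for state dicts. This is a sketch, not the contents of `convert_weights.py`: the `memory.` and `core.` prefixes, the `embed_tokens` / `lm_head` key filters, and the function name are all assumptions made for illustration.

```python
def merge_state_dicts(ttt_state, qwen_state):
    """Combine two checkpoints under hypothetical 'memory.' and 'core.' prefixes."""
    merged = {}
    for key, tensor in ttt_state.items():
        if "embed_tokens" in key or "lm_head" in key:
            continue  # embeddings and head come from Qwen3 instead
        merged[f"memory.{key}"] = tensor
    for key, tensor in qwen_state.items():
        if "embed_tokens" in key or "lm_head" in key:
            merged[key] = tensor  # shared embedding / LM head
        else:
            merged[f"core.{key}"] = tensor
    return merged


# Strings stand in for tensors in this toy example.
ttt = {"layers.0.q_proj.weight": "ttt_q", "embed_tokens.weight": "ttt_emb"}
qwen = {"layers.0.mlp.up_proj.weight": "qwen_up", "embed_tokens.weight": "qwen_emb"}
merged = merge_state_dicts(ttt, qwen)
print(sorted(merged))
```

Keeping the original key names under a prefix makes it easy to trace every tensor in the combined checkpoint back to its source model.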
⚙️ Configuration
Key hyperparameters in config.json:
{
"model_type": "tttpilot_mac",
"vocab_size": 151936, // Qwen3
"hidden_size": 2048, // TTT-Linear
"num_memory_layers": 24, // TTT-Linear
"num_core_layers": 36, // Qwen3
"memory_intermediate_size": 5504, // TTT MLP
"core_intermediate_size": 9728, // Qwen3 MLP
"num_attention_heads": 32,
"num_key_value_heads": 8, // GQA in cores
"mini_batch_size": 16, // TTT chunk size
"ttt_base_lr": 1.0, // TTT learning rate
"fixed_memory_size": 64, // Persistent memory tokens
"rope_theta": 5000000, // Long context RoPE
"max_position_embeddings": 262144 // Max sequence length
}
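The `//` annotations above are for readability; strict JSON does not allow comments. If an annotated copy like this needs to be parsed, a minimal sketch is to strip the comments first and then run a basic consistency check. The excerpt and the regex are illustrative assumptions, and the regex is only safe when no string value contains `//`.

```python
import json
import re

annotated = """{
  "num_attention_heads": 32,
  "num_key_value_heads": 8,    // GQA in cores
  "mini_batch_size": 16        // TTT chunk size
}"""

# Strip trailing // comments. This simple regex assumes no "//" inside string
# values (true for the fields shown here, not for URLs).
clean = re.sub(r"//[^\n]*", "", annotated)
config = json.loads(clean)

# GQA requires the query heads to divide evenly among the KV heads.
assert config["num_attention_heads"] % config["num_key_value_heads"] == 0
print(config["mini_batch_size"])  # 16
```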
🧪 Pilot Experiment Status
What Works
✅ Model architecture defined
✅ Weight conversion pipeline
✅ Configuration files generated
✅ Tokenizer compatibility (Qwen3)
✅ Basic forward pass structure
What's Experimental
⚠️ MAC segment processing: Simplified in pilot, needs full TTT integration
⚠️ Retriever implementation: Placeholder, requires Q-only inference mode
⚠️ Weight tying: Defined but not verified in practice
⚠️ Memory update logic: Full TTT adaptation step needs integration
Known Limitations
- Memory layers use placeholder identity functions (need full TTT code)
- No actual segment-based MAC flow yet (processes like standard transformer)
- TTT cache and Qwen3 KV cache not properly integrated
- No training/fine-tuning tested
- Generation quality not benchmarked
🎓 Research Context
This pilot implements concepts from:
Test-Time Training (TTT): Self-supervised adaptation during inference
- Paper: Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- Code: TTT-Linear
Titans Architecture: Modular memory-augmented patterns
- Inspiration: Memory-as-X design patterns (MAC, MAG, MAL, etc.)
- Idea: Separate stateful memory from stateless reasoning
Grouped Query Attention: Efficient multi-head attention
- From: Qwen3 and modern LLMs
- Benefit: Faster inference with minimal quality loss
📝 Citation
If you use this pilot or build upon it:
@misc{tttpilot-mac-2026,
title={TTTPilot-MAC: A Pilot Implementation of Memory-Augmented-Core Architecture},
author={Your Name},
year={2026},
note={Pilot experiment combining TTT-Linear and Qwen3},
howpublished={\url{https://github.com/...}}
}
Source Models:
- TTT-Linear: test-time-training/TTT-Linear-1.3B-Base-Pile-8k
- Qwen3: Qwen/Qwen3-4B-Thinking-2507
📄 License
This project combines:
- TTT-Linear (MIT License)
- Qwen3 (Apache 2.0 License)
Final license: Apache 2.0 (compatible with both)
See LICENSE for details.
🤝 Contributing
This is a pilot experiment for research exploration. Contributions welcome:
- Full MAC segment processing implementation
- TTT-Linear integration (replace placeholders)
- Retriever Q-only mode
- Training scripts
- Benchmark evaluations
- Documentation improvements
🐛 Issues & Feedback
Found a bug or have suggestions? Open an issue!
Important Notes:
- This is NOT a production-ready model
- Use for research/experimentation only
- Performance not guaranteed
- May require significant compute resources (5.6B parameters)
🌟 Acknowledgments
- TTT Team for the test-time training paradigm
- Qwen Team for the Qwen3-4B-Thinking model
- Titans architecture for the modular memory design patterns that inspired this work
Status: 🚧 Experimental Pilot - Use with caution!