# Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token

This is an INT8 W8A8 quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), created with [llm-compressor](https://github.com/vllm-project/llm-compressor).

**Note:** only weights and activations are quantized; the KV cache is NOT quantized.
## Quantization Details
- Quantization Method: INT8 W8A8 (weights and activations only)
- Weight Precision: INT8 (8-bit integer), static per-channel quantization
  - Scale shape: `(N, 1)` (one scale per output channel)
  - Observer: MinMax
- Activation Precision: INT8 (8-bit integer), dynamic per-token quantization
  - Scale computed at runtime: `absmax / 127.0` per token (row)
  - No activation scales stored in the checkpoint
- KV Cache: not quantized (remains in original precision)
- Quantization Format: `compressed-tensors` (int-quantized)
- Ignored Layers: `lm_head` only
- Calibration Dataset: CNN/DailyMail
- Calibration Samples: 512
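The two scale granularities above can be sketched in a few lines of NumPy. This is illustrative only, not llm-compressor's implementation: per-token activation scales are recomputed from each row's absmax at runtime, while per-channel weight scales (shape `(N, 1)`) are fixed at quantization time.

```python
import numpy as np

np.random.seed(0)

def quantize_per_token(x):
    """Dynamic per-token symmetric INT8: one scale per row, computed at runtime."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # shape (M, 1)
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def quantize_per_channel(w):
    """Static per-channel symmetric INT8 for an (N, K) weight: scales of shape (N, 1)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

x = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_per_token(x)
dequant = q.astype(np.float32) * s
# Round-trip error is bounded by half a quantization step per element
assert np.abs(x - dequant).max() <= s.max() / 2 + 1e-6
```

Because activation scales are derived from the data at runtime, none need to be stored in the checkpoint; only the weight scales ship with the model.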
## vLLM CUTLASS W8A8 Kernel
This model is optimized for the vLLM CUTLASS W8A8 INT8 kernel, which fuses dequantization into the GEMM epilogue:

```
D[m,n] = a_scale[m] * b_scale[n] * int32_accum[m,n]
```

- `a_scale[m]`: per-token activation scale (computed dynamically at runtime)
- `b_scale[n]`: per-channel weight scale (stored in the checkpoint)
- `int32_accum[m,n]`: INT8 x INT8 matmul accumulated in INT32
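The epilogue formula can be verified numerically. The NumPy sketch below (with made-up shapes and scales, not the actual kernel) shows that scaling the INT32 accumulator by the row and column scales gives the same result as dequantizing both operands first:

```python
import numpy as np

np.random.seed(0)
M, K, N = 4, 64, 8

# Hypothetical quantized operands: activations (M, K), weights laid out as (K, N)
a_int8 = np.random.randint(-128, 128, size=(M, K), dtype=np.int8)
b_int8 = np.random.randint(-128, 128, size=(K, N), dtype=np.int8)
a_scale = np.random.rand(M).astype(np.float32) + 0.01  # per-token (row) scales
b_scale = np.random.rand(N).astype(np.float32) + 0.01  # per-channel (column) scales

# INT8 x INT8 GEMM accumulated in INT32 (what the kernel's main loop computes)
int32_accum = a_int8.astype(np.int32) @ b_int8.astype(np.int32)

# Epilogue: D[m, n] = a_scale[m] * b_scale[n] * int32_accum[m, n]
D = a_scale[:, None] * b_scale[None, :] * int32_accum.astype(np.float32)

# Same result as dequantizing each operand first, then multiplying in float
ref = (a_int8 * a_scale[:, None]) @ (b_int8 * b_scale[None, :])
assert np.allclose(D, ref, rtol=1e-4)
```

Fusing the scaling into the epilogue means the expensive inner loop runs entirely in INT8/INT32, with the float conversion done once per output element.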
## Model Size
- Original Model: ~16GB (FP16)
- Quantized Model: ~8.5GB (INT8 W8A8)
- Compression Ratio: ~1.9x
## Usage

### Installation

```shell
pip install "vllm>=0.6.0"
```
### With vLLM
```python
from vllm import LLM, SamplingParams

# Load the INT8 W8A8 quantized model
llm = LLM(
    model="JongYeop/Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token",
)

# Generate text
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
### With Transformers (for inspection)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JongYeop/Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token")
model = AutoModelForCausalLM.from_pretrained(
    "JongYeop/Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# add_generation_prompt=True appends the assistant header so the model
# generates a reply instead of continuing the user turn
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Performance
INT8 W8A8 Dynamic Per-Token quantization provides:
- ~2x memory reduction compared to FP16
- Faster inference with INT8 GEMM kernels on modern GPUs
- Better accuracy than per-tensor due to fine-grained per-token activation scaling
- Per-channel weight quantization preserves weight distribution per output channel
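The accuracy benefit of per-token over per-tensor scaling is easiest to see with an outlier token, a common pattern in LLM activations. In the hedged NumPy sketch below (synthetic data, illustrative only), a single large-magnitude row inflates the global absmax and degrades every other row under per-tensor scaling, while per-token scaling confines the damage:

```python
import numpy as np

np.random.seed(0)

def mean_quant_error(x, scale):
    """Mean absolute round-trip error for symmetric INT8 quantization."""
    q = np.clip(np.round(x / scale), -128, 127)
    return np.abs(x - q * scale).mean()

# Activations with one outlier token
x = np.random.randn(8, 64).astype(np.float32)
x[0] *= 50.0  # this token inflates the global absmax

per_tensor_scale = np.abs(x).max() / 127.0                      # one scale for everything
per_token_scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # one scale per row

# Per-token scaling keeps fine-grained resolution for the non-outlier rows
assert mean_quant_error(x, per_token_scale) < mean_quant_error(x, per_tensor_scale)
```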
## Quantization Recipe

The quantization recipe used for this model is included in the repository as `recipe.yaml`. Key configuration:
```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: int
            strategy: channel   # per-channel (one scale per output channel)
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: int
            strategy: token     # per-token (one scale per row)
            dynamic: true       # scales computed at runtime
            symmetric: true
          targets: ["Linear"]
```
## Hardware Requirements
- GPU: NVIDIA GPU with INT8 Tensor Core support (Turing or later)
  - Examples: RTX 2080 Ti, RTX 3090, RTX 4090, A100, H100
- VRAM: minimum 10 GB for inference
## Citation
If you use this model, please cite:
```bibtex
@software{llm-compressor,
  title  = {LLM Compressor},
  author = {vLLM Team},
  url    = {https://github.com/vllm-project/llm-compressor},
  year   = {2024}
}

@article{llama3,
  title  = {Llama 3 Model Card},
  author = {AI@Meta},
  year   = {2024},
  url    = {https://github.com/meta-llama/llama3}
}
```
## License
This model inherits the license from the original Llama 3.1 model.
## Acknowledgments
- Original model: meta-llama/Llama-3.1-8B-Instruct
- Quantization tool: llm-compressor by vLLM team
- Quantization guide: vLLM INT8 W8A8 Documentation