flute_kernels

CUDA matmul kernels for LUT-quantized LLMs, packaged for loading through the Hugging Face kernels library.

Upstream: hanguo97/flute (Han Guo et al., Apache-2.0).

Use

import torch
from kernels import get_kernel

flute = get_kernel("galqiwi/flute_kernels", version=1)

# qgemm: y = x · W, where W = dequant(Q, table) scaled by the per-group `scales`
y = flute.qgemm(x, Q, scales, table, table2, workspace,
                num_bits, group_size, template_id, num_sms)

# fused HadaCore rotation + qgemm (HIGGS path)
y = flute.qgemm_hadamard(x, Q, scales, table, table2, workspace,
                         num_bits, group_size, hadamard_size,
                         template_id, num_sms)

# stand-alone Hadamard transform (HadaCore; fp16/bf16 input, power-of-two dimension ≤ 32768)
y = flute.hadamard_transform(x, inplace=False)
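
For orientation, the math behind the calls above can be sketched in plain Python. This is only a reference for the semantics, not the kernel's implementation, and it makes several assumptions that this README does not confirm: the weight codes are unpacked integers laid out as (N, K), `scales` holds one value per output row and per group of `group_size` input columns, and the Hadamard transform uses orthonormal 1/sqrt(n) scaling. The names `lut_qgemm_reference` and `hadamard_reference` are hypothetical, and the sketch ignores `table2`, `workspace`, `template_id`, and `num_sms`, which are assumed here to affect only how the kernel executes.

```python
import math


def lut_qgemm_reference(x, codes, scales, table, group_size):
    """Reference for y = x · W^T with W[n][k] = table[codes[n][k]] * scale.

    Assumed layout (not taken from FLUTE): codes is (N, K) of unpacked ints,
    scales is (N, K // group_size), x is (M, K). All inputs are nested lists.
    """
    n_out = len(codes)
    k_dim = len(codes[0])
    y = []
    for row in x:
        out_row = []
        for n in range(n_out):
            acc = 0.0
            for k in range(k_dim):
                # Look up the code in the table, then apply the group's scale.
                w = table[codes[n][k]] * scales[n][k // group_size]
                acc += row[k] * w
            out_row.append(acc)
        y.append(out_row)
    return y


def hadamard_reference(v):
    """Fast Walsh-Hadamard transform of a power-of-two-length list.

    The 1/sqrt(n) normalization is an assumption made for this sketch.
    """
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(v)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [value * norm for value in out]
```

A reference like this is mainly useful for checking kernel output on small shapes, where a pure-Python loop is still fast enough; under the assumptions above, the fused HIGGS path would correspond to applying the Hadamard rotation to blocks of `x` before the matmul.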

Load-time helpers

flute.utils.pack(W, num_bits, template_ids, num_sms)
flute.utils.make_qmap2_from_qmap(qmap)
flute.utils.get_workspace_streamk(device)
flute.utils.get_template_config(num_bits, template_id, num_sms)
flute.utils.get_template_ids(num_bits)
flute.utils.is_template_supported(M, N, K, num_bits, template_id, num_sms)
flute.utils.get_device_num_sms(device)
flute.TEMPLATE_CONFIGS    # the pre-tuned config dict

Attribution

The CUDA code is adapted from hanguo97/flute (Apache-2.0). The HadaCore kernel is borrowed from pytorch-labs/applied-ai. Both are built against NVIDIA CUTLASS v3.5 (BSD-3-Clause); upstream FLUTE pins v3.4.1, but the CuTe API is stable across 3.x.
