KingNish (Nishith Jain)

posted an update about 23 hours ago

Post

459

We trained an open-source, Mythos-like cybersecurity LLM for the Build Small Hackathon. Meet OpenMythos.

Trained in two stages: SFT on ~1.84K filtered ArXiv cs.CR papers + real CVE data, then RLVR on GitHub repos paired with past vulnerabilities, with a verifier model checking outputs against ground truth.
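To make the verifiable-reward idea in the second stage concrete, here is a minimal sketch of a reward function that scores an answer against a ground-truth label. The post uses a verifier model; the exact-match check on CWE identifiers below is a simplified stand-in, and the function name, the regex, and the reward values are all assumptions for illustration.

```python
import re

# Hypothetical pattern for CWE identifiers such as "CWE-79". This is an
# assumption, not the label format OpenMythos actually uses.
CWE_PATTERN = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def verifiable_reward(model_output: str, ground_truth_cwe: str) -> float:
    """Score an answer against a ground-truth vulnerability class.

    Returns 1.0 if the output names exactly the ground-truth class,
    0.5 if it names the right class alongside others, and 0.0 otherwise.
    The partial score is an illustrative way to penalize answers that
    hedge between similar classes.
    """
    predicted = {f"CWE-{num}" for num in CWE_PATTERN.findall(model_output)}
    truth = ground_truth_cwe.upper()
    if predicted == {truth}:
        return 1.0
    if truth in predicted:
        return 0.5
    return 0.0
```

A reward shaped like this gives less credit to answers that list several similar classes than to answers that commit to the correct one.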

Trained on: H100s from Modal

The RLVR stage made the biggest difference: responses got more precise and less prone to confusing similar vulnerability classes.

Everything is open:
🤖 Demo → build-small-hackathon/OpenMythos
🧠 Model → build-small-hackathon/OpenMythos
📦 CVE Dataset → build-small-hackathon/CVE_Vulnerailities_Detailed
📄 ArXiv Dataset → himanshu17HF/ArvixImport-Filtered-Final

Try it out and let us know where it breaks 🙏

reacted to alvarobartt's post with 🔥 5 months ago

Post

3279

💥 hf-mem v0.4.1 now also estimates KV cache memory requirements for any context length and batch size with the --experimental flag!

uvx hf-mem --model-id ... --experimental will automatically pull the required information from the Hugging Face Hub to include the KV cache estimation, when applicable.

💡 Alternatively, you can also set the --max-model-len, --batch-size and --kv-cache-dtype arguments (à la vLLM) manually if preferred.
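For intuition on how context length, batch size, and cache dtype drive the estimate, here is a minimal sketch of the standard KV cache size formula for dense attention. It is not hf-mem's implementation; the function and parameter names are assumptions, and models with sliding-window or compressed attention will need a different calculation.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_model_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # 2 for fp16/bf16, 1 for an 8-bit cache
) -> int:
    """Estimate KV cache size in bytes for a dense-attention transformer."""
    # The leading 2 accounts for storing one key and one value tensor per layer.
    return (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * max_model_len
        * batch_size
        * bytes_per_element
    )


# Example with made-up dimensions: 32 layers, 8 KV heads of size 128,
# an 8192-token context, batch size 1, 16-bit cache.
size = kv_cache_bytes(32, 8, 128, 8192)
print(size / 2**30)  # size in GiB
```

The cache grows linearly with both context length and batch size, which is why those two arguments matter so much for the estimate.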

1 reply

reacted to vincentg64's post with 👀 5 months ago

Post

2814

New Book: No-Blackbox, Secure, Efficient AI and LLM Solutions https://mltblog.com/4aRwvM5

Large language models and modern AI are often presented as technology that needs deep neural networks (DNNs) with billions of Blackbox parameters, expensive and time-consuming training, along with GPU farms, yet prone to hallucinations. This book presents alternatives that rely on explainable AI, featuring new algorithms based on radically different technology with trustworthy, auditable, fast, accurate, secure, replicable Enterprise AI. Most of the material is proprietary and made from scratch, showcasing the culmination of decades of research away from standard models to establish a new framework in machine learning and AI technology.

I discuss an efficient DNN architecture based on a new type of universal functions in chapter 4, with DNN distillation and protection via watermarking in chapter 5. Then, in chapter 6, I discuss non-DNN alternatives that yield exact interpolation on the training set yet benefit from benign overfitting in any dimension. Accurate predictions are obtained with a simple closed-form expression, without gradient descent or other iterative optimization technique, essentially without training.

Case studies include 96% correct predictions for the next token on an Nvidia PDF repository, automated heartbeat clustering and unusually high data compression rates (big data), anomaly detection and fraud litigation linked to a large-scale cybersecurity breach (large Excel repository, automated SQL, time series and geospatial data), as well as predicting the next sequence on real-world genome data with home-made LLM technology. Some datasets with 1000 dimensions are generated with the best and fastest tabular data synthesizer on the market, described in detail in chapter 2 along with the best model evaluation metric. These cases correspond to different agents linked to the xLLM technology (extreme LLM) developed by the author.

reacted to ovi054's post with 🚀🔥 6 months ago

Post

2725

Z-Image Turbo + LoRA ⚡

ovi054/Z-Image-LORA

Z-Image Turbo is the No. 1 trending Text-to-Image model right now. You can add a custom LoRA and generate images with this Space.

👉 Try it now: ovi054/Z-Image-LORA

3 replies

reacted to their post with 🔥 6 months ago

Post

3811

Muon vs MuonClip vs Muon+AdamW

Muon has gone from an experiment to a mainstream optimizer, but does it hold up for fine‑tuning? We ran head‑to‑head tests on Qwen3‑4B (10k+ high‑quality instruction rows) to find out.

Short story: Pure Muon converged fastest at the start, but its gradient‑norm spikes made training unstable. MuonClip (Kimi K2’s clipping) stabilizes long pretraining runs, yet in our small‑scale fine‑tune it underperformed, with lower token accuracy and slower convergence. The winner was the hybrid: Muon for 2D layers + AdamW for 1D layers. It delivered the best balance of stability and final performance and even beat vanilla AdamW.
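As a rough sketch of what the hybrid split looks like, the snippet below partitions parameters by tensor rank: 2D weight matrices go to Muon and everything else goes to AdamW. It works on a plain mapping of names to shapes so it runs without any framework. The function name and the example parameter names are hypothetical, and this is not necessarily the exact grouping used in the blog.

```python
from typing import Dict, List, Tuple


def split_for_hybrid(
    param_shapes: Dict[str, Tuple[int, ...]],
) -> Tuple[List[str], List[str]]:
    """Partition parameter names into (muon_names, adamw_names) by rank."""
    muon_names: List[str] = []
    adamw_names: List[str] = []
    for name, shape in param_shapes.items():
        if len(shape) == 2:
            # 2D weight matrices (attention and MLP projections).
            muon_names.append(name)
        else:
            # 1D tensors such as biases and normalization weights.
            adamw_names.append(name)
    return muon_names, adamw_names


# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "layers.0.attn.q_proj.weight": (2560, 2560),
    "layers.0.mlp.up_proj.weight": (9728, 2560),
    "layers.0.input_norm.weight": (2560,),
    "layers.0.attn.q_proj.bias": (2560,),
}
muon_names, adamw_names = split_for_hybrid(shapes)
```

Some Muon setups also route embeddings and the output head to AdamW even though they are 2D, so a real split may need name-based exceptions on top of the rank check.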

Takeaway: for small-scale fine-tuning, hybrid = practical and reliable.

Next Step: scale to larger models/datasets to see if Muon’s spikes become catastrophic or if clipping wins out.

Full Blog Link: https://huggingface.co/blog/KingNish/optimizer-part1

reacted to their post with 🔥 6 months ago

Post

2840

I tested Muon vs MuonClip vs Muon+AdamW for fine-tuning LLMs.
Just published a blog post on it. Read it here 👉 https://huggingface.co/blog/KingNish/optimizer-part1

1 reply

reacted to piercus's post with 👍 8 months ago

Post

4064

Start erasing! 🎉 🎉 🎉
This is made with a one-step SD1.5 LBM [1] eraser!

Data is open. Data pipeline is open. Training code is open.
On our LBM fork : https://github.com/finegrain-ai/LBM

[1] LBM: Latent Bridge Matching for Fast Image-to-Image Translation (2503.07535)

1 reply

reacted to sergiopaniego's post with 🔥 9 months ago

Post

4372

gpt-oss was possible thanks to new engineering efforts in 🤗 transformers. We just dropped a blog covering them:

- Kernels from the Hub
- MXFP4 Quantization
- Tensor & Expert Parallelism
- Dynamic Sliding Window & Cache
- Continuous Batching & Paged Attention

Grab a coffee & dive in! ☕️

https://huggingface.co/blog/faster-transformers

posted an update 11 months ago

Post

2242

Wan 2.2 Fast: up to 10x faster than the original Wan 2.2

Model: FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers

Space: KingNish/wan2-2-fast

reacted to nicolay-r's post with ❤️ 11 months ago

Post

2995

📢 For those who are planning to start a PhD or research in the UK 🇬🇧 (the AI field in particular) but are facing ATAS (Academic Technology Approval Scheme) issues.
Excited to share the ultimate guide for dealing with ATAS refusals and writing effective rebuttal letters.

🎬 https://youtu.be/bfknM3n-SHs

🔍 In the video you will find:
1. Why appealing an ATAS decision matters even if your visa is approved
2. Which documents to use in understanding the principles behind sponsorship decisions
3. Key tips for structuring a rebuttal letter properly

reacted to fdaudens's post with 🚀 11 months ago

Post

2317

AudioRAG is becoming real! Just built a demo with ColQwen-Omni that does semantic search on raw audio, no transcription needed.

Drop in a podcast, ask your question, and it finds the exact chunks where it happens. You can also get a written answer.

What’s exciting: it skips transcription, making it faster and better at capturing emotion, ambient sound, and tone, surfacing results text search would miss.

- Demo: fdaudens/colqwen-omni-demo
- Blog post from ColQwen team: https://huggingface.co/blog/manu/colqwen-omni-omnimodal-retrieval

1 reply

reacted to Tonic's post with 👍 11 months ago

Post

3428

🙋🏻‍♂️ Normalize adding compute & runtime traces to your model cards

2 replies

reacted to AdinaY's post with 🔥 11 months ago

Post

3485

Kimi-K2 is now available on the hub🔥🚀
This is a trillion-parameter MoE model focused on long context, code, reasoning, and agentic behavior.

moonshotai/kimi-k2-6871243b990f2af5ba60617d

✨ Base & Instruct
✨ 1T total / 32B active - Modified MIT License
✨ 128K context length
✨ Muon optimizer for stable trillion-scale training

1 reply

reacted to mlabonne's post with 🔥 11 months ago

Post

5781

LiquidAI open-sources a new generation of edge LLMs! 🥳

Based on a new hybrid architecture, these 350M, 700M, and 1.2B models are both fast and performant, ideal for on-device deployment.

I recommend fine-tuning them to power your next edge application. We already provide Colab notebooks to guide you. More to come soon!

📝 Blog post: https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models
🤗 Models: LiquidAI/lfm2-686d721927015b2ad73eaa38

1 reply

reacted to a-r-r-o-w's post with 🧠🔥 11 months ago

Post

3543

Caching is an essential technique used in diffusion inference serving for speeding up image/video generations. Diffusers just added support for another caching method: First Block Cache - a technique developed by @chengzeyi building upon the ideas of TeaCache.

The idea in short is: if the model predictions do not vary much over successive inference steps, we can skip certain steps where the prediction difference is small. To figure out whether an inference step will make a significant improvement to the overall velocity/noise prediction, we calculate the relative difference of the output of the first transformer block at timestep $t$ with $t-1$, and compare it against a selected threshold. If the difference is lower than the threshold, we skip the step. A higher threshold will lead to more steps being skipped. However, skipping many steps is bad because it can throw off the model predictions, and so we need to test and select the threshold based on level of quality-speed tradeoff for every model we use it with.
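The skip decision described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the Diffusers implementation: the use of a summed absolute difference as the norm, the flat lists standing in for tensors, and the function names are all assumptions.

```python
from typing import Optional, Sequence


def relative_difference(current: Sequence[float], previous: Sequence[float]) -> float:
    """Relative L1 change between two first-block outputs."""
    change = sum(abs(c - p) for c, p in zip(current, previous))
    scale = sum(abs(p) for p in previous)
    # With a zero reference there is nothing to compare against, so report
    # an infinite difference and let the caller compute the step.
    return change / scale if scale > 0 else float("inf")


def should_skip_step(
    current: Sequence[float],
    previous: Optional[Sequence[float]],
    threshold: float,
) -> bool:
    """Skip the remaining blocks when the first-block output barely changed."""
    if previous is None:
        # First inference step: no earlier output exists, so always compute.
        return False
    return relative_difference(current, previous) < threshold


# Toy first-block outputs at steps t-1 and t.
prev_out = [1.0, 2.0, 3.0]
small_change = [1.0, 2.0, 3.3]
large_change = [2.0, 4.0, 6.0]
print(should_skip_step(small_change, prev_out, threshold=0.2))
print(should_skip_step(large_change, prev_out, threshold=0.2))
```

Raising `threshold` makes the first comparison succeed more often, which is the quality-speed tradeoff the paragraph describes.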

Diffusers usage with CogView4:

```python
import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")
```

Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache

1 reply

reacted to merve's post with 🔥 11 months ago

Post

3536

ByteDance released Tar 1.5B and 7B: image-text in, image-text out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two models (LLM and diffusion)
The model is actually a full LLM (Qwen2), the tokenizer converts image tokens 🤯

KingNish's activity