---
library_name: transformers
tags:
- safetensors
- pruna-ai
---

# Model Card for pruna-test/test-save-tiny-random-llama4-smashed

This model was created using the [pruna](https://github.com/PrunaAI/pruna) library. Pruna is a model optimization framework built for developers, enabling you to deliver more efficient models with minimal implementation overhead.

## Usage

First, install the pruna library:

```bash
pip install pruna
```

You can [use the transformers library to load the model](https://huggingface.co/pruna-test/test-save-tiny-random-llama4-smashed?library=transformers), but this might not include all optimizations by default.

To ensure that all optimizations are applied, load the model with the pruna library using the following code:

```python
from pruna import PrunaModel

loaded_model = PrunaModel.from_pretrained(
    "pruna-test/test-save-tiny-random-llama4-smashed"
)
# we can then run inference using the methods supported by the base model
```

For inference, you can use the inference methods of the original model, as shown in [the original model card](https://huggingface.co/hf-internal-testing/tiny-random-llama4?library=transformers).
Alternatively, you can visit [the Pruna documentation](https://docs.pruna.ai/en/stable/) for more information.
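
As a rough illustration of what "using the inference methods of the original model" could look like, the sketch below wraps the usual tokenize, generate, decode sequence in a helper. The helper names `generate_text` and `load_and_generate` are hypothetical, and the sketch assumes that `PrunaModel` forwards `generate` to the wrapped transformers model; check the Pruna documentation before relying on that. The model name suggests randomly initialized weights, so any generated text is likely to be meaningless.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=20):
    """Tokenize `prompt`, call `model.generate`, and decode the first sequence."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def load_and_generate(prompt="Once upon a time,"):
    """Download the checkpoint and generate a short continuation (needs network access)."""
    # Imported lazily so that defining these helpers does not require pruna.
    from pruna import PrunaModel
    from transformers import AutoTokenizer

    model_id = "pruna-test/test-save-tiny-random-llama4-smashed"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = PrunaModel.from_pretrained(model_id)
    # Assumption: PrunaModel exposes `generate` from the underlying transformers model.
    return generate_text(model, tokenizer, prompt)
```

Calling `load_and_generate()` downloads the checkpoint from the Hub; `generate_text` itself only needs objects that offer the tokenizer and `generate` interfaces, which keeps it easy to exercise with stand-ins.
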

## Smash Configuration

The compression configuration of the model is stored in the `smash_config.json` file, which describes the optimization methods that were applied to the model.

```json
{
    "awq": false,
    "c_generate": false,
    "c_translate": false,
    "c_whisper": false,
    "deepcache": false,
    "diffusers_int8": false,
    "fastercache": false,
    "flash_attn3": false,
    "fora": false,
    "gptq": false,
    "half": false,
    "hqq": false,
    "hqq_diffusers": false,
    "hyper": false,
    "ifw": false,
    "img2img_denoise": false,
    "kvpress": false,
    "llm_int8": false,
    "moe_kernel_tuner": false,
    "pab": false,
    "padding_pruning": false,
    "qkv_diffusers": false,
    "quanto": false,
    "realesrgan_upscale": false,
    "reduce_noe": false,
    "ring_attn": false,
    "sage_attn": false,
    "stable_fast": false,
    "text_to_image_distillation_inplace_perp": false,
    "text_to_image_distillation_lora": false,
    "text_to_image_distillation_perp": false,
    "text_to_image_inplace_perp": false,
    "text_to_image_lora": false,
    "text_to_image_perp": false,
    "text_to_text_inplace_perp": false,
    "text_to_text_lora": false,
    "text_to_text_perp": false,
    "token_merging": false,
    "torch_compile": false,
    "torch_dynamic": false,
    "torch_structured": false,
    "torch_unstructured": false,
    "torchao": false,
    "whisper_s2t": false,
    "x_fast": false,
    "zipar": false,
    "batch_size": 1,
    "device": "cpu",
    "device_map": null,
    "save_fns": [],
    "save_artifacts_fns": [],
    "load_fns": [
        "transformers"
    ],
    "load_artifacts_fns": [],
    "reapply_after_load": {}
}
```
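
To see at a glance which optimizations a given configuration turns on, one option is to parse the JSON and collect the keys set to `true`. The sketch below does this on an abbreviated copy of the configuration shown above. Treating every boolean key as an algorithm toggle is an assumption based on how the file reads, not a documented schema, and `enabled_algorithms` is a hypothetical helper name.

```python
import json

# Abbreviated copy of the smash_config.json shown above; the full file has more keys.
SMASH_CONFIG_JSON = """
{
  "awq": false,
  "half": false,
  "torch_compile": false,
  "batch_size": 1,
  "device": "cpu",
  "device_map": null,
  "load_fns": ["transformers"],
  "reapply_after_load": {}
}
"""


def enabled_algorithms(config):
    """Return the sorted names of all keys whose value is the boolean True."""
    # `value is True` skips non-flag entries such as "batch_size": 1.
    return sorted(name for name, value in config.items() if value is True)


config = json.loads(SMASH_CONFIG_JSON)
print("enabled:", enabled_algorithms(config))
print("device:", config["device"], "| loaders:", config["load_fns"])
```

For the configuration above, every flag is `false`, so the list of enabled algorithms is empty; the model is set to load on `cpu` through the `transformers` loader.
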

## 🌍 Join the Pruna AI community!

- [Twitter](https://twitter.com/PrunaAI)
- [GitHub](https://github.com/PrunaAI)
- [LinkedIn](https://www.linkedin.com/company/93832878/admin/feed/posts/?feedType=following)
- [Discord](https://discord.gg/JFQmtFKCjd)
- [Reddit](https://www.reddit.com/r/PrunaAI/)