DynMoE Family
Collection
DynMoE model checkpoints and paper on Hugging Face • 4 items
How to use LINs-lab/DynMoE-Phi-2-2.7B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="LINs-lab/DynMoE-Phi-2-2.7B", trust_remote_code=True)
# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("LINs-lab/DynMoE-Phi-2-2.7B", trust_remote_code=True, dtype="auto")
How to use LINs-lab/DynMoE-Phi-2-2.7B with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LINs-lab/DynMoE-Phi-2-2.7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "LINs-lab/DynMoE-Phi-2-2.7B",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or run the model with Docker Model Runner:
docker model run hf.co/LINs-lab/DynMoE-Phi-2-2.7B
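The same completion request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and reading the result from `choices[0]["text"]` assumes the server returns the usual OpenAI-style completions response.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build (without sending) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-style responses list results under "choices".
    return body["choices"][0]["text"]


# Example call (requires a running server):
# complete("http://localhost:8000", "LINs-lab/DynMoE-Phi-2-2.7B", "Once upon a time,")
```

Pointing `base_url` at `http://localhost:30000` should work the same way against a server listening on that port.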
How to use LINs-lab/DynMoE-Phi-2-2.7B with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "LINs-lab/DynMoE-Phi-2-2.7B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "LINs-lab/DynMoE-Phi-2-2.7B",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "LINs-lab/DynMoE-Phi-2-2.7B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "LINs-lab/DynMoE-Phi-2-2.7B",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use LINs-lab/DynMoE-Phi-2-2.7B with Docker Model Runner:
docker model run hf.co/LINs-lab/DynMoE-Phi-2-2.7B
Dynamic Mixture of Experts (DynMoE) incorporates (1) a novel gating method that lets each token automatically determine how many experts to activate, and (2) an adaptive process that automatically adjusts the number of experts during training.
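To make the first idea concrete, here is a toy sketch of gating in which the number of active experts varies per token. It is not the DynMoE implementation: the scoring function (sigmoid of cosine similarity), the per-expert thresholds, the top-1 fallback, and the name `select_experts` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def select_experts(token, expert_keys, thresholds):
    """Return the indices of the experts activated for one token.

    Assumption: an expert is active when its score exceeds its own threshold,
    so different tokens can activate different numbers of experts.
    """
    scores = [sigmoid(cosine(token, key)) for key in expert_keys]
    active = [i for i, (s, t) in enumerate(zip(scores, thresholds)) if s > t]
    if not active:
        # Assumed fallback: route to the single best-scoring expert.
        active = [max(range(len(scores)), key=scores.__getitem__)]
    return active


# Three hypothetical experts with 2-dimensional key vectors.
expert_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
thresholds = [0.6, 0.6, 0.6]

print(select_experts([1.0, 0.0], expert_keys, thresholds))  # one expert
print(select_experts([1.0, 1.0], expert_keys, thresholds))  # two experts
```

In this sketch the first token activates a single expert while the second activates two, which is the per-token behavior the description refers to. In a trained model the key vectors and thresholds would be learned rather than fixed by hand.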
We are grateful for the following awesome projects:
This project is released under the Apache-2.0 license as found in the LICENSE file.
@misc{guo2024dynamic,
title={Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models},
author={Yongxin Guo and Zhenglin Cheng and Xiaoying Tang and Tao Lin},
year={2024},
eprint={2405.14297},
archivePrefix={arXiv},
primaryClass={cs.LG}
}