diff --git a/.gitattributes b/.gitattributes
index a6344aac8c09253b3b630fb776ae94478aa0275b..53c101720834ca44149a76b8e32f111ecbe682c7 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -33,3 +33,20 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/audio/0.wav filter=lfs diff=lfs merge=lfs -text
+assets/audio/1.wav filter=lfs diff=lfs merge=lfs -text
+assets/demo.png filter=lfs diff=lfs merge=lfs -text
+assets/depth/0.png filter=lfs diff=lfs merge=lfs -text
+assets/depth/1.png filter=lfs diff=lfs merge=lfs -text
+assets/emergency.jpg filter=lfs diff=lfs merge=lfs -text
+assets/iclr_dataset_sample.jpg filter=lfs diff=lfs merge=lfs -text
+assets/languagebind.jpg filter=lfs diff=lfs merge=lfs -text
+assets/languagebind_frame.jpg filter=lfs diff=lfs merge=lfs -text
+assets/languagebind_result.jpg filter=lfs diff=lfs merge=lfs -text
+assets/languge_result.jpg filter=lfs diff=lfs merge=lfs -text
+assets/logo.jpg filter=lfs diff=lfs merge=lfs -text
+assets/logo_languagebind.png filter=lfs diff=lfs merge=lfs -text
+assets/result1.jpg filter=lfs diff=lfs merge=lfs -text
+assets/sota.jpg filter=lfs diff=lfs merge=lfs -text
+assets/video/0.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/video/1.mp4 filter=lfs diff=lfs merge=lfs -text
diff --git a/1/1 b/1/1
new file mode 100644
index 0000000000000000000000000000000000000000..d00491fd7e5bb6fa28c517a0bb32b8b506539d4d
--- /dev/null
+++ b/1/1
@@ -0,0 +1 @@
+1
diff --git a/DATASETS.md b/DATASETS.md
new file mode 100644
index 0000000000000000000000000000000000000000..4c74c19fb68ac16eac91b0fc07c01762ddecf6f4
--- /dev/null
+++ b/DATASETS.md
@@ -0,0 +1,66 @@
+## Sample data
+We are releasing sample data here so that individuals who are interested can further modify the code to train it on their own data, which includes videos, text from various sources, depth, and infrared.
+
+| | Baidu Yun | Google Cloud | Peking University Yun |
+|---|---|---|---|
+| DATA | Link | Link | Link |
+| ANNOTATION | Link | Link | Link |
+ 
+> [**Video-LLaVA: Learning United Visual Representation by Alignment Before Projection**](https://arxiv.org/abs/2311.10122)
+> Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan
+[GitHub](https://github.com/PKU-YuanGroup/Video-LLaVA) [arXiv](https://arxiv.org/abs/2311.10122)
+
+> [**MoE-LLaVA: Mixture of Experts for Large Vision-Language Models**](https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/MoE-LLaVA.pdf)
+> Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan
+[GitHub](https://github.com/PKU-YuanGroup/MoE-LLaVA) [arXiv](https://arxiv.org/abs/2401.15947)
+
+> [**Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models**](https://arxiv.org/abs/2311.16103)
+> Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan
+[GitHub](https://github.com/PKU-YuanGroup/Video-Bench) [arXiv](https://arxiv.org/abs/2311.16103)
+
+| Modality | LoRA tuning | Fine-tuning |
+|---|---|---|
+| Video | LanguageBind_Video | LanguageBind_Video_FT |
+| Audio | LanguageBind_Audio | LanguageBind_Audio_FT |
+| Depth | LanguageBind_Depth | - |
+| Thermal | LanguageBind_Thermal | - |
+
+| Version | Tuning | Model size | Num_frames | HF Link | MSR-VTT | DiDeMo | ActivityNet | MSVD |
+|---|---|---|---|---|---|---|---|---|
+| LanguageBind_Video | LoRA | Large | 8 | Link | 42.6 | 37.8 | 35.1 | 52.2 |
+| LanguageBind_Video_FT | Full-tuning | Large | 8 | Link | 42.7 | 38.1 | 36.9 | 53.5 |
+| LanguageBind_Video_V1.5_FT | Full-tuning | Large | 8 | Link | 42.8 | 39.7 | 38.4 | 54.1 |
+| LanguageBind_Video_V1.5_FT | Full-tuning | Large | 12 | Coming soon | | | | |
+| LanguageBind_Video_Huge_V1.5_FT | Full-tuning | Huge | 8 | Link | 44.8 | 39.9 | 41.0 | 53.7 |
+| LanguageBind_Video_Huge_V1.5_FT | Full-tuning | Huge | 12 | Coming soon | | | | |
+
+| Cache of pretrained weight | Baidu Yun | Google Cloud | Peking University Yun |
+|---|---|---|---|
+| Large | Link | Link | Link |
+| Huge | Link | - | Link |
+
+| Datasets | Baidu Yun | Google Cloud | Peking University Yun |
+|---|---|---|---|
+| LLVIP | Link | Link | Link |
+| FLIR V1 | Link | Link | Link |
+| FLIR V2 | Link | Link | Link |
+