
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

⭐️ Our series works: [MMStar] [ShareGPT4V] [ShareGPT4Omni]


🚀🚀🚀 Official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.

Here is a video introducing ShareGPT4Video:

demo_clip_v2.mp4

💡 Highlights

  • 🔥 A large-scale, highly descriptive video-text dataset: 40K GPT4-Vision-generated video captions and around 400K implicit video-split captions.
  • 🔥 A general video captioner that handles various video durations, resolutions, and aspect ratios, approaching GPT4-Vision's captioning capability, with two inference modes targeting quality and efficiency, respectively.
  • 🔥 A superior large video-language model, ShareGPT4Video-8B, trained in just 5 hours on 8xA100 GPUs.
  • 🔥 Improved text-to-video performance with high-quality video captions generated by our ShareCaptioner-Video. Thanks to Open-Sora-Plan.

📜 News

[2024/7/1] The batch-inference code for ShareCaptioner-Video is available now!

[2024/6/11] The web demo and local demo of ShareCaptioner-Video are available now!

[2024/6/11] The web demo and local demo of ShareGPT4Video-8B are available now!

[2024/6/7] Our paper has been featured in HuggingFace Daily Papers and ranked 1st on June 7.

[2024/5/27] The ShareGPT4Video-8B model is released!

[2024/5/26] The ShareGPT4Video dataset and project page are released!

👨‍💻 Todo

  • Training code for ShareGPT4Video-8B
  • Batch inference code for ShareCaptioner-Video
  • Web demo and local demo of ShareCaptioner-Video
  • Web demo and local demo of ShareGPT4Video-8B
  • Checkpoints of ShareGPT4Video-8B

Quick Usage

You can directly use our ShareGPT4Video-8B model for conversation with your own video with the following command (note the quotes around the query, which contains spaces):

python run.py --model-path Lin-Chen/sharegpt4video-8b --video examples/yoga.mp4 --query "Describe this video in detail."

Or you can build a local demo to try ShareGPT4Video-8B with the following command:

python app.py

You can also build a local demo to try ShareCaptioner-Video with the following commands:

cd captioner

python app.py

Install

git clone https://github.com/ShareGPT4Omni/ShareGPT4Video
conda create -n share4video python=3.10 -y
conda activate share4video

cd ShareGPT4Video
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
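
After installation, a quick sanity check of the interpreter can catch environment mix-ups before any GPU work (the environment above pins Python 3.10; treating 3.8 as a hard floor is an assumption for illustration):

```python
import sys

# Confirm we are running inside the intended conda env's interpreter.
# The setup above pins Python 3.10; 3.8 as a minimum is an assumption.
major, minor = sys.version_info[:2]
print(f"Python {major}.{minor}")
assert (major, minor) >= (3, 8), "the setup above targets Python 3.10"
```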

Train

To validate the effectiveness of high-quality video captions in improving LVLMs' comprehension capabilities, we choose the VideoLLaVA and LLaMA-VID models as our baselines. The SFT data used for both models is the LLaVA-mix665K image data plus the VideoChatGPT-100K video data. We replace 28K caption entries in VideoChatGPT-100K with 28K high-quality captions from ShareGPT4Video. In what follows, we take VideoLLaVA as the example.

You need to follow the instructions in VideoLLaVA to prepare the images and videos first, then download the 28K videos used in ShareGPT4Video from HuggingFace (only the bdd100k, ego4d, and panda subsets are involved).

Finally, specify the llava_v1_5_mix665k_with_video_chatgpt72k_share4video28k.json file in finetune.sh to perform SFT and reproduce the results in the paper.
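
The caption-replacement step above can be sketched as follows. This is a minimal illustration, not the repo's actual preprocessing script: the LLaVA-style field names (`id`, `video`, `conversations`) and the id-to-caption mapping are assumptions for the sake of the example.

```python
import json

# Toy VideoChatGPT-style SFT entries (field names are assumptions,
# following the common LLaVA annotation layout).
sft_data = [
    {"id": "vid_001", "video": "vid_001.mp4",
     "conversations": [{"from": "human", "value": "<video>\nDescribe the video."},
                       {"from": "gpt", "value": "old short caption"}]},
    {"id": "vid_002", "video": "vid_002.mp4",
     "conversations": [{"from": "human", "value": "<video>\nDescribe the video."},
                       {"from": "gpt", "value": "another short caption"}]},
]

# Hypothetical ShareGPT4Video captions keyed by video id
# (in practice, the 28K high-quality captions from the dataset).
share4video_captions = {"vid_001": "a detailed ShareGPT4Video caption"}

# Swap in the high-quality caption wherever one is available.
for entry in sft_data:
    cap = share4video_captions.get(entry["id"])
    if cap is not None:
        entry["conversations"][-1]["value"] = cap

print(json.dumps(sft_data, indent=2))
```

Entries without a matching ShareGPT4Video caption keep their original answer, mirroring the paper's setup of replacing only a 28K subset of VideoChatGPT-100K.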

✒️ Citation

If you find our work helpful for your research, please consider giving us a star ⭐ and a citation 📝:

@article{chen2024sharegpt4video,
  title={ShareGPT4Video: Improving Video Understanding and Generation with Better Captions},
  author={Chen, Lin and Wei, Xilin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Lin, Bin and Tang, Zhenyu and others},
  journal={arXiv preprint arXiv:2406.04325},
  year={2024}
}

@article{chen2023sharegpt4v,
  title={ShareGPT4V: Improving Large Multi-Modal Models with Better Captions},
  author={Chen, Lin and Li, Jisong and Dong, Xiaoyi and Zhang, Pan and He, Conghui and Wang, Jiaqi and Zhao, Feng and Lin, Dahua},
  journal={arXiv preprint arXiv:2311.12793},
  year={2023}
}

@article{chen2024we,
  title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
  author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
  journal={arXiv preprint arXiv:2403.20330},
  year={2024}
}

❤️ Acknowledgments

  • LLaVA: the codebase we built upon. Thanks for their wonderful work.
  • Open-Sora-Plan: an excellent open-source codebase for Sora-like text-to-video implementation. Thanks for their wonderful work.
  • Open-LLaVA-NeXT: an open-source codebase for reproducing the training procedure of the LLaVA-NeXT series.