Video MiniGPT4 #16
Comments
Yes, it's a good question!
Yes, it seems that we need to train on video to achieve better results. In short, if an LLM can understand the behavior of a time series, I think it would be very interesting.
Yes! We are trying to train a video model for better temporal reasoning!
Any timeline proposed yet for the new module? Rumor says that GPT-5 might incorporate video understanding. I think your work could be easily extended for better performance, as suggested in this thread, or maybe by using VideoMAE as the video encoder? From my understanding, the temporal information would require extra training that involves the alignment of spatial information across different frames; that's why the static information obtained from still images won't be too helpful for video understanding.
Thank you very much for your work, it's been a really novel and interesting experience! As for this question, I just happen to have thought of our newly proposed model mPLUG-Owl. I have tested it for a long time and found that its visual ability is quite strong. Meanwhile, our visual system is designed to support interleaved text and images as well as multiple images. Would it be possible to implement this idea on our model? I'm really looking forward to hearing from your team! 😊
Hi all, motivated by the awesome MiniGPT4, we (the DAMO NLP SG team) are excited to present Video-LLaMA (https://github.com/DAMO-NLP-SG/Video-LLaMA), a modular video-language pre-training framework that empowers instruction-following large language models with video understanding capability. Video-LLaMA uses a video Q-Former (inspired by BLIP-2) and a frame embedding layer to make the image encoder capable of processing video input. These components serve as the "adapter" between the image encoder and the large language model Vicuna-13B. Although Video-LLaMA is still in its initial version, we are already seeing promising results in its ability to capture dynamic events from videos and follow instructions accurately. Here is a single demo. We will keep upgrading Video-LLaMA continuously; stay tuned for more updates on our progress! Lastly, thanks to the awesome projects MiniGPT4 and Ask-Anything!
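As a rough illustration of the adapter design described above, here is a minimal PyTorch sketch: per-frame patch features from a frozen image encoder receive a learnable frame position embedding and are compressed by a Q-Former-like cross-attention block into a fixed set of tokens for the LLM. Module names, dimensions, and layer counts here are illustrative assumptions, not the released Video-LLaMA code.

```python
import torch
import torch.nn as nn

class VideoQFormerAdapter(nn.Module):
    """Sketch of a video Q-Former style adapter: frame position embeddings plus a
    cross-attention block with learnable queries that compresses per-frame patch
    features from a frozen image encoder into a fixed number of LLM tokens."""

    def __init__(self, vis_dim=1408, llm_dim=5120, num_queries=32,
                 max_frames=32, num_heads=8, num_layers=2):
        super().__init__()
        # Learnable temporal (frame) position embedding, added to every patch of a frame.
        self.frame_pos = nn.Embedding(max_frames, vis_dim)
        # Learnable queries that attend over all frames' patch tokens.
        self.queries = nn.Parameter(torch.randn(1, num_queries, vis_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=vis_dim, nhead=num_heads,
                                           batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Linear projection into the LLM's embedding space (e.g. Vicuna-13B).
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, P, vis_dim) patch features from the frozen image encoder.
        B, T, P, D = frame_feats.shape
        pos = self.frame_pos(torch.arange(T, device=frame_feats.device))  # (T, D)
        feats = frame_feats + pos[None, :, None, :]   # inject frame order
        feats = feats.reshape(B, T * P, D)            # flatten time and space
        q = self.queries.expand(B, -1, -1)
        video_tokens = self.qformer(q, feats)         # queries cross-attend over all frames
        return self.proj(video_tokens)                # (B, num_queries, llm_dim)
```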
@YiyangZhou @hangzhang-nlp Thanks for your great work! We have been busy with other things these days.
@Andy1621 Just went through your technical report and project page; it's definitely great work in terms of both technical and resource contributions👍👍👍 We really appreciate your kindness in releasing the video instruction-tuning data so promptly (and also letting us know :) We will try your instruction-tuning data soon and for sure mention your work in our repo.
Firstly, thanks for your interesting work.
For MiniGPT-4, can it be realized directly using video embedding? As for the `self.perceive` module, maybe a simple attention will do, just like Flamingo? A rough sketch of what I mean follows below.
I don't have enough GPUs to verify this idea; maybe it is very naive. I just put it here and hope to inspire some interested friends.
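A minimal PyTorch sketch of that idea: learnable latent queries cross-attending over all frame embeddings, in the spirit of Flamingo's Perceiver Resampler. The class name `SimplePerceiver`, the dimensions, and the number of latent tokens are my own illustrative assumptions, not MiniGPT-4 code.

```python
import torch
import torch.nn as nn

class SimplePerceiver(nn.Module):
    """Hypothetical drop-in for the `self.perceive` idea above: plain cross-attention
    with learnable latent queries that pools a variable number of frame embeddings
    into a fixed number of tokens, similar to Flamingo's Perceiver Resampler."""

    def __init__(self, dim=1408, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, video_emb):
        # video_emb: (B, T*P, dim) frame embeddings from a frozen image/video encoder.
        q = self.latents.expand(video_emb.size(0), -1, -1)
        pooled, _ = self.attn(q, video_emb, video_emb)  # latents attend to all frames
        pooled = self.norm(pooled + q)                  # residual + norm
        return pooled + self.ffn(pooled)                # (B, num_latents, dim)


# Hypothetical usage: the pooled tokens would then be projected into the LLM's
# embedding space and prepended to the language model's input embeddings.
# perceive = SimplePerceiver(dim=1408)
# video_tokens = perceive(frame_embeddings)
```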