
Video MiniGPT4 #16

Open · pixeli99 opened this issue Apr 26, 2023 · 8 comments

@pixeli99 commented Apr 26, 2023

Firstly, thanks for your interesting work.

For MiniGPT4, could video understanding be realized directly from video embeddings?
Something like this:

# run the frozen Q-Former on per-frame image embeddings
# (frames are folded into the batch dimension: b * t)
query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
query_output = self.Qformer.bert(
    query_embeds=query_tokens,
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_atts,
    return_dict=True,
)
# [bs * num_frames, 32, 768] -> [bs, num_frames, 32, 768] -> [bs, num_frames * 32, 768]
video_out = self.perceive(
    query_output.last_hidden_state.view(b, t, query_tokens.shape[-2], query_tokens.shape[-1])
).flatten(1, 2)
inputs_llama = self.llama_proj(video_out)

As for self.perceive, maybe a simple attention module would do, just like Flamingo's Perceiver Resampler:

import torch
from torch import nn
from einops import rearrange, repeat

# PerceiverAttention and FeedForward follow the flamingo-pytorch implementation
# (see the sketch below this snippet)


class PerceiverResampler(nn.Module):
    def __init__(
        self,
        *,
        dim,
        depth,
        dim_head=64,
        heads=8,
        num_latents=64,
        num_media_embeds=4,
        ff_mult=4
    ):
        super().__init__()
        # learned latent queries and per-frame (media) position embeddings
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.media_pos_emb = nn.Parameter(torch.randn(num_media_embeds, 1, dim))

        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads),
                FeedForward(dim=dim, mult=ff_mult)
            ]))

        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: [b, n, d] for a single image or [b, t, n, d] for t frames
        if x.ndim == 3:
            x = rearrange(x, 'b n d -> b 1 n d')

        times = x.shape[1]
        x = x + self.media_pos_emb[:times]

        latents = repeat(self.latents, 'n d -> b m n d', b=x.shape[0], m=x.shape[1])

        for attn, ff in self.layers:
            latents = attn(x, latents) + latents
            latents = ff(latents) + latents

        return self.norm(latents)  # [b, t, num_latents, dim]
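
PerceiverAttention and FeedForward are not defined in the snippet above; the following is a minimal sketch in the style of the open-source flamingo-pytorch implementation (the exact lucidrains code may differ in details, e.g. it uses einops_exts helpers):

import torch
from torch import nn
from einops import rearrange


def FeedForward(dim, mult=4):
    inner_dim = int(dim * mult)
    return nn.Sequential(
        nn.LayerNorm(dim),
        nn.Linear(dim, inner_dim, bias=False),
        nn.GELU(),
        nn.Linear(inner_dim, dim, bias=False),
    )


class PerceiverAttention(nn.Module):
    def __init__(self, *, dim, dim_head=64, heads=8):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.heads = heads
        inner_dim = dim_head * heads

        self.norm_media = nn.LayerNorm(dim)
        self.norm_latents = nn.LayerNorm(dim)

        self.to_q = nn.Linear(dim, inner_dim, bias=False)
        self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)
        self.to_out = nn.Linear(inner_dim, dim, bias=False)

    def forward(self, x, latents):
        # x: [b, t, n, d] media tokens; latents: [b, t, m, d] learned queries
        x = self.norm_media(x)
        latents = self.norm_latents(latents)
        h = self.heads

        q = self.to_q(latents)
        # latents attend to both the media tokens and themselves
        k, v = self.to_kv(torch.cat((x, latents), dim=-2)).chunk(2, dim=-1)

        q = rearrange(q, 'b t n (h d) -> b h t n d', h=h)
        k = rearrange(k, 'b t n (h d) -> b h t n d', h=h)
        v = rearrange(v, 'b t n (h d) -> b h t n d', h=h)

        sim = torch.einsum('... i d, ... j d -> ... i j', q * self.scale, k)
        attn = (sim - sim.amax(dim=-1, keepdim=True).detach()).softmax(dim=-1)

        out = torch.einsum('... i j, ... j d -> ... i d', attn, v)
        out = rearrange(out, 'b h t n d -> b t n (h d)')
        return self.to_out(out)

To match the shapes in the first snippet ([bs, num_frames, 32, 768] -> [bs, num_frames * 32, 768]), self.perceive could be instantiated as something like PerceiverResampler(dim=768, depth=2, num_latents=32, num_media_embeds=num_frames); depth=2 and the other values here are just illustrative.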

I don't have enough GPUs to verify this idea, and maybe it is very naive. I'm just putting it here in the hope of inspiring some interested friends.

@Andy1621 (Collaborator) commented

Yes, it's a good question!
Actually, we have tried different methods, like randomly selecting one frame, averaging the frames, and merging frames via the Q-Former. Those methods work, but they are not very sensitive to temporal information without video fine-tuning. So we just introduced a simple temporal template, since it seems more sensitive to temporal information.
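
For concreteness, a rough sketch of what these simple aggregation baselines could look like (illustrative only; the function and tensor names are hypothetical, not the actual Ask-Anything code):

import torch

def aggregate_frames(frame_feats: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """frame_feats: [bs, num_frames, num_tokens, dim] per-frame Q-Former outputs."""
    if mode == "random":
        # randomly select one frame per clip
        idx = torch.randint(0, frame_feats.shape[1], (frame_feats.shape[0],))
        return frame_feats[torch.arange(frame_feats.shape[0]), idx]  # [bs, num_tokens, dim]
    if mode == "mean":
        # average the frames (temporal order is lost)
        return frame_feats.mean(dim=1)                               # [bs, num_tokens, dim]
    if mode == "concat":
        # keep all frame tokens and let a later module (e.g. a video
        # Q-Former or Perceiver) merge them
        return frame_feats.flatten(1, 2)                             # [bs, num_frames * num_tokens, dim]
    raise ValueError(mode)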

@pixeli99 (Author) commented

Yes, it seems that we need to train on video to achieve better results. In short, if an LLM can understand behavior over a time series of frames, I think it would be very interesting.
Currently, Ask-Anything is not as good as it could be (or there may be issues with my usage) on cases such as a car turning, and so on.

@Andy1621 (Collaborator) commented

Yes! We are trying to train a video model for better temporal reasoning!

@Xinxinatg commented


Is there a proposed timeline for the new module yet? Rumor says that GPT-5 might incorporate video understanding. I think your work could easily be extended for better performance, as suggested in this thread, or maybe by using VideoMAE as the video encoder. From my understanding, the temporal information would require extra training involving the alignment of spatial information across different frames, which is why the static information obtained from still images won't be too helpful for video understanding.
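
For reference, a minimal sketch of extracting video features with a pre-trained VideoMAE backbone via the HuggingFace transformers library (the checkpoint name and frame count are just examples; wiring these features into the Q-Former/LLM would still require the extra training discussed above):

import torch
from transformers import VideoMAEModel

# assumption: "MCG-NJU/videomae-base" is one publicly available checkpoint
encoder = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
encoder.eval()

# dummy clip: [batch, num_frames, channels, height, width]
video = torch.randn(1, 16, 3, 224, 224)

with torch.no_grad():
    out = encoder(pixel_values=video)

video_embeds = out.last_hidden_state  # [1, num_spatiotemporal_tokens, hidden_dim]
# these spatio-temporal tokens could then play the role of image_embeds
# in the Q-Former snippet at the top of this thread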

@YiyangZhou commented

Thank you very much for your work; it's been a really novel and interesting experience! On this question, I happened to think of our newly proposed model, mPLUG-Owl. I have tested it for a long time and found that its visual ability is quite strong. Meanwhile, its visual system is designed to support interleaved text and images as well as multiple images. Would it be possible to implement this idea on our model? I'm really looking forward to hearing from your team! 😊
mPLUG-Owl: https://github.com/X-PLUG/mPLUG-Owl

@hangzhang-nlp commented

@pixeli99 @Xinxinatg


Hi all, motivated by the awesome MiniGPT4, we (the DAMO NLP SG team) are excited to present Video-LLaMA (https://github.com/DAMO-NLP-SG/Video-LLaMA), a modular video-language pre-training framework that empowers instruction-following large language models with video understanding capability.

Video-LLaMA utilizes a video Q-Former (inspired by BLIP-2) and a frame embedding layer to make the image encoder capable of processing video input. These components serve as the "adapter" between the image encoder and the large language model (Vicuna-13B).
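
Conceptually, such an adapter might look like the following (a rough sketch of the idea only, not the actual Video-LLaMA code; the class and parameter names are made up, and a plain transformer encoder stands in for the video Q-Former):

import torch
from torch import nn

class VideoAdapterSketch(nn.Module):
    """Turns per-frame Q-Former tokens into a single video-level token sequence."""

    def __init__(self, dim=768, max_frames=32, num_layers=2, num_heads=8):
        super().__init__()
        # frame embedding layer: tells the model which frame each token came from
        self.frame_pos_emb = nn.Embedding(max_frames, dim)
        # stand-in for the video Q-Former: a small transformer over all frame tokens
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.video_former = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frame_tokens):
        # frame_tokens: [bs, num_frames, num_tokens, dim] from the image Q-Former
        bs, t, n, d = frame_tokens.shape
        pos = self.frame_pos_emb(torch.arange(t, device=frame_tokens.device))
        x = frame_tokens + pos[None, :, None, :]  # add per-frame position embedding
        x = x.flatten(1, 2)                       # [bs, t * n, d]
        return self.video_former(x)               # video-level tokens for the LLM projection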

Although Video-LLaMA is still in its initial version, we are already seeing some promising results in its ability to capture dynamic events from videos and follow instructions accurately.

Here is a demo.

[demo GIF: birthday_demo]

We will keep upgrading Video-LLaMA! Stay tuned for more updates on our progress!

Lastly, thanks to the awesome MiniGPT4 and Ask-Anything projects!

@Andy1621 (Collaborator) commented

@YiyangZhou @hangzhang-nlp Thanks for your great work! These days we have been busy with other things.
I have seen that mPLUG-Owl supports video now, but it seems to simply concatenate the image embeddings. And Video-LLaMA only uses WebVid in the first stage.
We have now released the video data for instruction tuning. Maybe your teams can use it for better results! Whether you find it works or not, don't hesitate to tell me 😁!

@lixin4ever commented May 11, 2023

@Andy1621 Just went through your technical report and project page; it's definitely great work in terms of both technical and resource contributions 👍👍👍 We really appreciate your kindness in releasing the video instruction-tuning data so promptly (and also letting us know :). We will try your instruction-tuning data soon and will for sure mention your work in our repo.
