
Video MiniGPT4 #16

Open · pixeli99 opened this issue Apr 26, 2023 · 8 comments

@pixeli99 commented Apr 26, 2023

Firstly, thanks for your interesting work.

For MiniGPT4, could video understanding be realized directly from video embeddings?
Something like this:

# run the frozen Q-Former on per-frame image embeddings
# (frames are folded into the batch dimension: b * t)
query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
query_output = self.Qformer.bert(
    query_embeds=query_tokens,
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_atts,
    return_dict=True,
)
# [bs * num_frames, 32, 768] -> [bs, num_frames, 32, 768] -> [bs, num_frames * 32, 768]
video_out = self.perceive(
    query_output.last_hidden_state.view(b, t, query_tokens.shape[-2], query_tokens.shape[-1])
).flatten(1, 2)
inputs_llama = self.llama_proj(video_out)

As for self.perceive, maybe a simple attention module would do, just like Flamingo's Perceiver Resampler:

import torch
from torch import nn
from einops import rearrange, repeat

# PerceiverAttention and FeedForward follow the flamingo-pytorch implementation
# (see the sketch below this snippet)


class PerceiverResampler(nn.Module):
    def __init__(
        self,
        *,
        dim,
        depth,
        dim_head=64,
        heads=8,
        num_latents=64,
        num_media_embeds=4,
        ff_mult=4
    ):
        super().__init__()
        # learned latent queries and per-frame (media) position embeddings
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.media_pos_emb = nn.Parameter(torch.randn(num_media_embeds, 1, dim))

        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads),
                FeedForward(dim=dim, mult=ff_mult)
            ]))

        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: [b, n, d] for a single image or [b, t, n, d] for t frames
        if x.ndim == 3:
            x = rearrange(x, 'b n d -> b 1 n d')

        times = x.shape[1]
        x = x + self.media_pos_emb[:times]

        latents = repeat(self.latents, 'n d -> b m n d', b=x.shape[0], m=x.shape[1])

        for attn, ff in self.layers:
            latents = attn(x, latents) + latents
            latents = ff(latents) + latents

        return self.norm(latents)  # [b, t, num_latents, dim]
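
PerceiverAttention and FeedForward are not defined in the snippet above; the following is a minimal sketch in the style of the open-source flamingo-pytorch implementation (the exact lucidrains code may differ in details, e.g. it uses einops_exts helpers):

import torch
from torch import nn
from einops import rearrange


def FeedForward(dim, mult=4):
    inner_dim = int(dim * mult)
    return nn.Sequential(
        nn.LayerNorm(dim),
        nn.Linear(dim, inner_dim, bias=False),
        nn.GELU(),
        nn.Linear(inner_dim, dim, bias=False),
    )


class PerceiverAttention(nn.Module):
    def __init__(self, *, dim, dim_head=64, heads=8):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.heads = heads
        inner_dim = dim_head * heads

        self.norm_media = nn.LayerNorm(dim)
        self.norm_latents = nn.LayerNorm(dim)

        self.to_q = nn.Linear(dim, inner_dim, bias=False)
        self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)
        self.to_out = nn.Linear(inner_dim, dim, bias=False)

    def forward(self, x, latents):
        # x: [b, t, n, d] media tokens; latents: [b, t, m, d] learned queries
        x = self.norm_media(x)
        latents = self.norm_latents(latents)
        h = self.heads

        q = self.to_q(latents)
        # latents attend to both the media tokens and themselves
        k, v = self.to_kv(torch.cat((x, latents), dim=-2)).chunk(2, dim=-1)

        q = rearrange(q, 'b t n (h d) -> b h t n d', h=h)
        k = rearrange(k, 'b t n (h d) -> b h t n d', h=h)
        v = rearrange(v, 'b t n (h d) -> b h t n d', h=h)

        sim = torch.einsum('... i d, ... j d -> ... i j', q * self.scale, k)
        attn = (sim - sim.amax(dim=-1, keepdim=True).detach()).softmax(dim=-1)

        out = torch.einsum('... i j, ... j d -> ... i d', attn, v)
        out = rearrange(out, 'b h t n d -> b t n (h d)')
        return self.to_out(out)

To match the shapes in the first snippet ([bs, num_frames, 32, 768] -> [bs, num_frames * 32, 768]), self.perceive could be instantiated as something like PerceiverResampler(dim=768, depth=2, num_latents=32, num_media_embeds=num_frames); depth=2 and the other values here are just illustrative.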

I don't have enough GPUs to verify this idea, and maybe it is very naive. I'm just putting it here in the hope of inspiring some interested friends.

@Andy1621 (Collaborator) commented

Yes, it's a good question!
Actually, we have tried different methods, like randomly selecting one frame, averaging the frames, and merging frames via the Q-Former. Those methods work, but they are not very sensitive to temporal information without video fine-tuning. So we just introduced a simple temporal template, since it seems more sensitive to temporal information.
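
For concreteness, a rough sketch of what these simple aggregation baselines could look like (illustrative only; the function and tensor names are hypothetical, not the actual Ask-Anything code):

import torch

def aggregate_frames(frame_feats: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """frame_feats: [bs, num_frames, num_tokens, dim] per-frame Q-Former outputs."""
    if mode == "random":
        # randomly select one frame per clip
        idx = torch.randint(0, frame_feats.shape[1], (frame_feats.shape[0],))
        return frame_feats[torch.arange(frame_feats.shape[0]), idx]  # [bs, num_tokens, dim]
    if mode == "mean":
        # average the frames (temporal order is lost)
        return frame_feats.mean(dim=1)                               # [bs, num_tokens, dim]
    if mode == "concat":
        # keep all frame tokens and let a later module (e.g. a video
        # Q-Former or Perceiver) merge them
        return frame_feats.flatten(1, 2)                             # [bs, num_frames * num_tokens, dim]
    raise ValueError(mode)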

@pixeli99 (Author) commented

Yes, it seems that we need to train on video to achieve better results. In short, if an LLM can understand behavior over a time series of frames, I think it would be very interesting.
Currently, Ask-Anything is not as good as it could be (or there may be issues with my usage) on cases such as a car turning, and so on.

@Andy1621 (Collaborator) commented

Yes! We are trying to train a video model for better temporal reasoning!

@Xinxinatg commented


Is there a proposed timeline for the new module yet? Rumor says that GPT-5 might incorporate video understanding. I think your work could easily be extended for better performance, as suggested in this thread, or maybe by using VideoMAE as the video encoder. From my understanding, the temporal information would require extra training involving the alignment of spatial information across different frames, which is why the static information obtained from still images won't be too helpful for video understanding.
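
For reference, a minimal sketch of extracting video features with a pre-trained VideoMAE backbone via the HuggingFace transformers library (the checkpoint name and frame count are just examples; wiring these features into the Q-Former/LLM would still require the extra training discussed above):

import torch
from transformers import VideoMAEModel

# assumption: "MCG-NJU/videomae-base" is one publicly available checkpoint
encoder = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
encoder.eval()

# dummy clip: [batch, num_frames, channels, height, width]
video = torch.randn(1, 16, 3, 224, 224)

with torch.no_grad():
    out = encoder(pixel_values=video)

video_embeds = out.last_hidden_state  # [1, num_spatiotemporal_tokens, hidden_dim]
# these spatio-temporal tokens could then play the role of image_embeds
# in the Q-Former snippet at the top of this thread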

@YiyangZhou commented

Thank you very much for your work; it's been a really novel and interesting experience! On this question, I happened to think of our newly proposed model, mPLUG-Owl. I have tested it for a long time and found that its visual ability is quite strong. Meanwhile, its visual system is designed to support interleaved text and images as well as multiple images. Would it be possible to implement this idea on our model? I'm really looking forward to hearing from your team! 😊
mPLUG-Owl: https://github.com/X-PLUG/mPLUG-Owl

@hangzhang-nlp commented

@pixeli99 @Xinxinatg


Hi all, motivated by the awesome MiniGPT4, we (the DAMO NLP SG team) are excited to present Video-LLaMA (https://github.com/DAMO-NLP-SG/Video-LLaMA), a modular video-language pre-training framework that empowers instruction-following large language models with video understanding capability.

Video-LLaMA utilizes a video Q-Former (inspired by BLIP-2) and a frame embedding layer to make the image encoder capable of processing video input. These components serve as the "adapter" between the image encoder and the large language model (Vicuna-13B).
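
Conceptually, such an adapter might look like the following (a rough sketch of the idea only, not the actual Video-LLaMA code; the class and parameter names are made up, and a plain transformer encoder stands in for the video Q-Former):

import torch
from torch import nn

class VideoAdapterSketch(nn.Module):
    """Turns per-frame Q-Former tokens into a single video-level token sequence."""

    def __init__(self, dim=768, max_frames=32, num_layers=2, num_heads=8):
        super().__init__()
        # frame embedding layer: tells the model which frame each token came from
        self.frame_pos_emb = nn.Embedding(max_frames, dim)
        # stand-in for the video Q-Former: a small transformer over all frame tokens
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.video_former = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frame_tokens):
        # frame_tokens: [bs, num_frames, num_tokens, dim] from the image Q-Former
        bs, t, n, d = frame_tokens.shape
        pos = self.frame_pos_emb(torch.arange(t, device=frame_tokens.device))
        x = frame_tokens + pos[None, :, None, :]  # add per-frame position embedding
        x = x.flatten(1, 2)                       # [bs, t * n, d]
        return self.video_former(x)               # video-level tokens for the LLM projection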

Although Video-LLaMA is still in its initial version, we are already seeing some promising results in its ability to capture dynamic events from videos and follow instructions accurately.

Here is a demo.

[demo GIF: birthday_demo]

We will keep upgrading Video-LLaMA! Stay tuned for more updates on our progress!

Lastly, thanks to the awesome MiniGPT4 and Ask-Anything projects!

@Andy1621 (Collaborator) commented

@YiyangZhou @hangzhang-nlp Thanks for your great work! These days we have been busy with other things.
I have seen that mPLUG-Owl supports video now, but it seems to simply concatenate the image embeddings. And Video-LLaMA only uses WebVid in the first stage.
We have now released the video data for instruction tuning. Maybe your teams can use it for better results! Whether you find it works or not, don't hesitate to tell me 😁!

@lixin4ever commented May 11, 2023

@Andy1621 Just went through your technical report and project page; it's definitely great work in terms of both technical and resource contributions 👍👍👍 We really appreciate your kindness in releasing the video instruction-tuning data so promptly (and also letting us know :). We will try your instruction-tuning data soon and will for sure mention your work in our repo.
