
Custom video dataset encoding/serialize uses all memory, process killed. How to fix? #5499

Open · cleong110 opened this issue Jul 3, 2024 · 4 comments

cleong110 commented Jul 3, 2024

What I need help with / What I was wondering

I want to load a dataset containing videos [screenshot of the dataset contents in the original issue] without the process exhausting all memory and being killed [screenshot of the crash] (Colab notebook for replicating).

How can I edit my dataset loader to use less memory when encoding videos?

Background:
I am trying to load a custom dataset with a Video feature.
When I try to tfds.load() it, or even just run download_and_prepare(), RAM usage climbs very high and then the process gets killed.
For example this notebook will crash if allowed to run, though with a High-RAM instance it may not.
It seems it is using over 30GB of memory to encode one or two 10 MB videos.
I would like to know how to edit/update this custom dataset so that it will not use so much memory.

What I've tried so far

I did a bunch of debugging and tracing of the problem with memray, etc. See this notebook and this issue for detailed analysis including a copy of the memray report.

I tried various ideas in the notebook, including loading just a slice, editing the buffer size, and switching from tfds.load() to download_and_prepare().

Finally, I traced the problem to the serializing and encoding steps (see this comment), which were allocating many GiB of memory to encode even one 10 MB video.

I discovered that even one 10 MB video was extracted to over 13k video frames, taking up nearly 5 GiB of space. Serializing would then take 14-15 GiB, encoding would take another 14-15 GiB, and so the process would be killed.


It would be nice if...

  • ...there were more examples of how to efficiently load video datasets, and explanations of why they are more efficient.
  • ...there were a way to do this in some sort of streaming fashion that used less memory, e.g. loading in a batch of frames, using a sliding window, etc.
  • ...there were some way to set a memory limit, and just have it process more slowly within that limit.
  • ...there were a way to separate the download and prepare steps: a download_only option in the Python API, like --download_only in the CLI.
  • ...there were a warning that the dataset was using a lot of memory in processing, before the OS kills the process.
  • ...for saving disk space, a way to encode and serialize videos without extracting thousands of individual frames, ballooning the size from 10MB to multiple GiB. Maybe there is and I just don't know.
  • ...it was possible to download only part of a dataset. It's possible to load a slice (sketched after this list), but only after download_and_prepare does its whole thing.
  • ...more explanation of what serialization and encoding are for, maybe? What are they?
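For reference, the slicing that does work today (but only after full preparation) looks like this; the dataset name is a placeholder:

```python
# A minimal sketch of current behavior: the split API can return a subset,
# but download_and_prepare() still processes the entire dataset first.
import tensorflow_datasets as tfds

ds = tfds.load("my_video_dataset", split="train[:2]")  # hypothetical name
for example in ds.take(1):
    print(example.keys())
```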

Environment information
I've tested it on Colab and a few other Ubuntu workstations. High-RAM Colab instances seem to have enough memory to get past this.

tomvdw (Collaborator) commented Jul 4, 2024

Hey,

Thanks for your question. Those are some cool datasets! I'm very sorry to hear that you're running into these problems.

We brainstormed a bit and came up with a couple of ideas:

  1. 14-15GB for 13k frames means that each frame takes up ~1MB. IIUC ffmpeg extracts frames as PNG files. Switching to JPG could maybe bring ~5x savings. However, you'd still end up with ~3GB for a 10MB video. Not great.
  2. Store the encoded video in the dataset. This means that the video will stay 10MB, but that the decoding needs to happen when you use the data. I'm not sure if using ffmpeg to decode when training would be a good solution (i.e. running a separate tool that writes 14-15 GB to disk, then reading those 14-15 GB back from disk). Alternatively, there seem to be Python libraries that can read videos, e.g. OpenCV. (Both ideas are sketched below.)
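Sketching both ideas (untested, and assuming your builder uses tfds.features.Video with its default settings):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Idea 1: keep tfds.features.Video, but store each extracted frame as JPEG
# instead of the default PNG (lossy, but roughly ~5x smaller per frame).
features_jpeg = tfds.features.FeaturesDict({
    "video": tfds.features.Video(
        shape=(None, None, None, 3),  # (frames, height, width, channels)
        encoding_format="jpeg",
    ),
})

# Idea 2: skip frame extraction and store the original encoded bytes as a
# scalar string; decoding then happens at read time (e.g. with OpenCV).
features_raw = tfds.features.FeaturesDict({
    "video_bytes": tfds.features.Tensor(shape=(), dtype=tf.string),
})
```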

Even if we make storing encoded videos work, I'm worried that the problem would just be moved to when the dataset is used. Namely, reading a single example would still require 14-15 GB of memory.

After the dataset has been prepared, how are you expecting it will be used? Would it make sense to lower the FPS (it's 50 now, right?)? Will users only use chunks of the video? If so, perhaps you can store the chunks instead of the entire video; a rough sketch of chunking follows.
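If chunking fits, ffmpeg's segment muxer can split a file without re-encoding; something along these lines (untested, paths hypothetical):

```python
# Split source.mp4 into ~10-second chunks without re-encoding. With
# "-c copy" the cuts land on keyframes, so chunk lengths are approximate.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "source.mp4",
        "-c", "copy",            # stream copy: no re-encode
        "-f", "segment",         # segment muxer
        "-segment_time", "10",   # target chunk length in seconds
        "chunk_%03d.mp4",
    ],
    check=True,
)
```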

Kind regards,
Tom

cleong110 (Author) commented

Tom,

Thank you very much for your reply, and those ideas!

How will they be used:

I'm just getting into Sign Language Processing research, so I'm still not quite sure how I want to use these, but potentially for training translation models from signed-language video to spoken-language text, for pretraining a vision transformer, or for a number of other things. A few use cases follow:

test out models on real data

I figured I'd start learning by at least running some inference pipelines with already-trained models, and got stuck on this step. I expected running a model to take significant memory, but didn't expect that loading the video would be the issue. I guess I'm successfully learning things! Specifically I'd like to load in some videos and run this demo of segmentation+recognition pipeline.

replicate other research on github

I went looking for examples of people using these datasets, and it seems that not many use the video option, perhaps for this very reason: loading the videos is too cumbersome.

replicate WMT results, or at least re-run their models

One thing I wanted to do was replicate results from the WMT Sign Language Translation contests, which provide data in a number of formats, including video.

  • WMT 22 data
  • WMT 23 data

According to the "Findings" papers from these contests, a good number of the submissions did take videos as inputs instead of poses, and I'd like to be able to tinker with those pipelines.

At least load the videos and then run pose estimation on them

Another thing I wanted to do was load the videos, run a pose estimator on them, and then use the resulting keypoints, potentially improving that part of the pipeline. A number of sign language translation models take pose keypoints as inputs, and I'd like to try those out.

At the very least I'd like to be able to do this! From there, the pose-based methods should take less compute.

cleong110 (Author) commented Jul 5, 2024

Regarding the suggestions:

  1. seems pretty easy to test, worth a shot!
  2. I admit I'm pretty ignorant about this: what is the encoding/decoding even doing, exactly? What would it mean to store the encoded video and decode it later? From what I've read, I think encoding compresses the frames into a video format and decoding expands them back out to frames...? If so, is there a way to load only a limited number of frames at a time? And why does the dataset need to encode anything when it's already encoded as a .mp4?

I guess I'd like to be able to do the following, though I don't know if any of it is feasible:

  • If I have plenty of time but not memory or hard-drive space, have a way to just slowly decode as needed (see the frame-by-frame sketch after this list).
  • If I have plenty of time AND hard drive space, expand it out to frames on the hard drive, but then only load into memory what I need when I need it.
  • If I have memory enough to load half the video, only load half, and stream the rest in like a buffer.
  • and so forth, but basically have it do its best with the available resources but not crash.
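To make the frame-by-frame idea concrete, here's roughly the access pattern I mean, sketched with OpenCV (which Tom mentioned); untested, and process() is just a placeholder:

```python
# Sketch of "decode only as needed": cv2.VideoCapture decodes one frame
# per read() call, so memory stays at roughly one frame at a time.
import cv2

cap = cv2.VideoCapture("video.mp4")  # hypothetical path
try:
    while True:
        ok, frame = cap.read()   # decode a single frame (BGR ndarray)
        if not ok:
            break                # end of stream (or read error)
        process(frame)           # placeholder for whatever consumes it
finally:
    cap.release()
```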

I did some further Googling and found a few related things.

cleong110 (Author) commented

FPS lowering: that's another good idea. I think there might already be a method in there to set that; maybe tweaking it would reduce memory usage. I can try something like the sketch below.
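If I understand the Video feature docs, ffmpeg_extra_args are appended to the ffmpeg command that extracts the frames, so something like this might resample the 50 fps videos down to 10 fps (untested):

```python
# Untested: "-r 10" as an ffmpeg output option should resample the
# extracted frames to 10 fps, cutting frame count (and memory) ~5x.
import tensorflow_datasets as tfds

video_feature = tfds.features.Video(
    shape=(None, None, None, 3),
    ffmpeg_extra_args=("-r", "10"),
)
```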
