A brief history of gapless audio (and what you can do about it)

Where would video be without synchronized audio? Revisit the evolution of the audio track.

Thomas Daede
Vimeo Engineering Blog
Mar 7, 2023

At Vimeo, delivering high-quality video is only half of the story. We also deliver high-quality, multichannel audio, perfectly synchronized to the video. Vimeo’s audio tracks are also gapless, meaning that the audio track we deliver is exactly the same length as the user’s upload, without any added silence at the beginning or the end. This might seem like a simple task, but achieving gapless, A/V-synced playback has been a bane of digital multimedia for many years. In this post we explore the history of how this came to be, and the solutions devised to address it.

In the beginning: MP3

While hardly the first compressed audio format, MPEG-1 Audio Layer III, or MP3 for short, was the first to be widely adopted among consumers of digital media. MP3 files are very simple: they are a sequence of packets, of either fixed size (for CBR, or constant bit rate files) or variable size (for VBR, or variable bit rate files), as shown in Figure 1. There is no file-level header or any other sort of metadata.

Figure 1. A raw MP3 sequence. Each packet has a duration but no timestamp. There is no header in the file, either.

This works, but it comes with a number of drawbacks. One is that the audio must be a whole number of packets long. There is no way to encode an arbitrary-length uncompressed audio file into an MP3 without putting some silence at the end. For example, an MP3 version of a concert recording from a CD might not exactly preserve track lengths, leading to gaps between tracks.
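To make the end-padding problem concrete, here is a minimal sketch of the arithmetic. It assumes the MPEG-1 Layer III frame size of 1152 samples per channel and ignores priming entirely; the function name is made up for illustration.

```python
SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III: samples per channel in one packet


def mp3_end_padding(num_samples, samples_per_frame=SAMPLES_PER_FRAME):
    """Return (packets_needed, padding_samples) for an input of num_samples.

    Hypothetical helper: it only models the "whole number of packets"
    constraint and does not account for encoder priming.
    """
    packets = -(-num_samples // samples_per_frame)  # integer ceiling division
    return packets, packets * samples_per_frame - num_samples


# A three-minute CD track at 44.1 kHz.
packets, padding = mp3_end_padding(3 * 60 * 44100)
print(packets, padding)  # 6891 432
```

Here the track does not fill its last packet, so 432 samples of silence (roughly 10 ms) end up appended to it.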

However, there is an even bigger problem. MP3 uses several features, such as an overlapped transform, that make each block depend on previous blocks to decode correctly. This presents a problem at the beginning of the file: how do you start decoding without any previous block?

The solution is priming, by which the MP3 encoder inserts some extra silence at the beginning of the track. The extra silence isn’t objectionable, whether the player decodes it or drops it. Unfortunately, there’s no rule for how much silence to insert. This means that most MP3 tracks end up with extra silence at the beginning when played back, too, which only exacerbates the gapless problem and makes it harder to synchronize the track to video.

Some MP3 players implement a form of gapless playback, but without any knowledge of the real length, they have to implement a kludge: searching the beginning and end of the track for silence and then saving those positions for future playback trimming. While this process gets the job done, it’s inherently inaccurate, and it requires scanning the track ahead of playback.
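The silence-scanning kludge can be sketched in a few lines. This is an illustration of the general idea, not the code of any particular player; the function name and the threshold value are assumptions.

```python
def find_silence_trim(samples, threshold=0.001):
    """Return (start, end) of the non-silent region; end is exclusive.

    samples is a sequence of floats in [-1.0, 1.0]. Anything at or below
    the (assumed) threshold in absolute value is treated as silence.
    """
    start = 0
    while start < len(samples) and abs(samples[start]) <= threshold:
        start += 1
    end = len(samples)
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    return start, end


decoded = [0.0, 0.0, 0.5, -0.2, 0.0, 0.3, 0.0, 0.0]
print(find_silence_trim(decoded))  # (2, 6)
```

The inaccuracy is easy to see: a track that deliberately begins or ends with silence loses it, and priming that decodes to low-level noise above the threshold is never trimmed.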

With two magic words: Ogg Vorbis

MP3 has other problems, too. The quality isn’t great because of the legacy inherited from MPEG’s earlier, less successful audio formats. It also has licensing fees, which can be a challenge in the case of software distributed for free. In response, the Xiph.Org Foundation developed the Vorbis audio codec.

Unlike MP3, Vorbis isn’t just a bare sequence of blocks. Rather, its frames are always encapsulated inside a purpose-built container called Ogg. (The name, which derives from gaming slang, essentially means forcing something to work.) This not only makes seeking much easier, but it also allows interleaving audio, video, and subtitles in the same file, and it supports multiple tracks and offers error detection.

Plus, an Ogg Vorbis file has both a real header and timestamps. These timestamps are included with each Ogg page as shown in Figure 2 to indicate the end point of the page, which means that a sequence of Vorbis frames also has an explicit duration.

Figure 2. Vorbis as packaged into an Ogg container. Ogg granule positions appear along the top.

Ogg Vorbis avoids the problem that MP3 has with regard to durations that are not a multiple of the frame size. The last timestamp can be shorter than the frame size, enabling the trimming of the end to an exact length.
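As a rough illustration of how a player can use that last timestamp, the sketch below compares the granule positions of the final two pages with the number of samples actually decoded. It is a simplified model with a made-up function name, not the libvorbis API.

```python
def samples_to_keep_on_last_page(prev_granule, last_granule, decoded_samples):
    """Return how many decoded samples from the final page should be played.

    Granule positions count PCM samples at the end of each page, so the
    difference between the last two is the amount of real audio on the
    final page. Anything decoded beyond that is padding to discard.
    """
    wanted = last_granule - prev_granule
    return max(0, min(wanted, decoded_samples))


# Hypothetical numbers: the final page decodes to 1024 samples,
# but its timestamp says that only 450 of them are real audio.
print(samples_to_keep_on_last_page(88200, 88650, 1024))  # 450
```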

In addition, Vorbis always uses exactly one frame for priming. This frame is placed within the first audio-containing Ogg page. The timestamp is set to represent the original audio duration before priming, which also allows for extra trimming if needed; for example, if the audio needs to be trimmed by an editor without recompression.

This solution made Ogg Vorbis especially popular for video games, where perfectly looping music tracks and delay-free sound effects are essential. The library even includes dedicated functions for looping and joining, to make sure the waveforms line up without any audible pop. The free license didn’t hurt, either.

MPEG strikes back: AAC

The quality problems of MP3 were apparent to everyone early on. MPEG’s own successor format arrived in the form of Advanced Audio Coding (AAC). AAC files can exist as a bare sequence of blocks, known as an ADTS file, but are more commonly put inside of an ISOBMFF (MP4) container, as an M4A file. Just like Ogg, the ISOBMFF container adds provisions for syncing, combining audio and video in the same file, and including a per-track header (the sample entry).

AAC is extremely widespread thanks to its early adoption by Apple in the iPod. Unfortunately, unlike Ogg Vorbis, AAC itself has no way to signal the number of priming samples. This means that AAC inherits the same gapless and A/V sync issues as MP3 before it. Whoops!

One solution is simply to assume a common number of priming samples; 2112 is the most common. This is pretty suboptimal, though, as just like with MP3 there are plenty of encoders that produce different numbers of priming samples. FFmpeg’s internal AAC encoder produces a delay of 1024 samples, and the popular FDK-AAC encoder usually produces 2048 samples of priming delay, but even that can change based on encoder settings!
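To get a feel for how much a wrong guess costs, here is a small sketch that converts the mismatch into milliseconds. The priming values are the ones quoted above; the 48 kHz sample rate and the function name are assumptions for illustration.

```python
def sync_error_ms(assumed_priming, actual_priming, sample_rate):
    """Return the timing error in milliseconds caused by a wrong guess.

    A positive result means too many samples were trimmed, so real audio
    is cut off and the track plays early relative to the video.
    """
    return (assumed_priming - actual_priming) * 1000.0 / sample_rate


ASSUMED = 2112  # the common default guess
for name, actual in [("FFmpeg internal AAC", 1024), ("FDK-AAC (usual)", 2048)]:
    print(f"{name}: {sync_error_ms(ASSUMED, actual, 48000):+.2f} ms")
```

With these inputs the guess is off by a little over 22 ms for the FFmpeg encoder and a little over 1 ms for FDK-AAC, which is enough to break gapless joins even when the sync drift is hard to notice.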

A better solution, now used by Vimeo, involves the use of an existing ISOBMFF feature: edit lists, shown in Figure 3. These were intended as generic tools for applying presentation-level edits to videos, but they turned out to be a good fit for audio, where they trim the priming and padding off the beginning and end of the file. In fact, the browser Media Source API allows exactly one edit list in ISOBMFF: for trimming AAC audio.

Figure 3. AAC as packaged into ISOBMFF for trimming audio.
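The values that go into such an edit list entry can be sketched as follows. In ISOBMFF, an edit list entry carries a media time expressed in the track's timescale and a segment duration expressed in the movie timescale. The helper below is an illustration rather than Vimeo's tooling: it assumes the track timescale equals the audio sample rate, and the class and function names are made up.

```python
from dataclasses import dataclass


@dataclass
class EditListEntry:
    segment_duration: int  # movie timescale units: how long to play
    media_time: int        # track timescale units: where playback starts


def aac_edit_list(priming, original_samples, sample_rate, movie_timescale):
    """Build an entry that skips the priming and plays only the real audio.

    Assumes the track timescale equals sample_rate, so media_time is
    simply the number of priming samples.
    """
    # Round to the nearest movie timescale unit using integer arithmetic.
    duration = (original_samples * movie_timescale + sample_rate // 2) // sample_rate
    return EditListEntry(segment_duration=duration, media_time=priming)


# Three minutes of 44.1 kHz audio with 2112 priming samples, movie timescale 1000.
print(aac_edit_list(2112, 3 * 60 * 44100, 44100, 1000))
```

Because the duration covers only the original samples, the same entry also trims whatever padding the encoder added to fill the last frame.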

Because the trim information isn’t stored in ADTS, extreme care must be taken to preserve the information if using a standalone muxer like GPAC or L-SMASH. At Vimeo, we encode all of our AAC with the exact same number of priming samples and pass that on to L-SMASH. However, we are switching to an integrated, streaming muxer, which enables the encoder to pass the values directly to the muxer. Expect a blog post on that in the future!

One especially tricky bit is that there are multiple variants of AAC. One of them, HE-AAC, is primed with an integer number of output samples, but HE-AAC in ISOBMFF often uses a coarser time base because of an AAC feature called SBR (spectral band replication). The time base of the edit list entry must be set especially carefully in this case.
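A sketch of the pitfall, under the assumption that the track timescale is half the output sample rate (as can happen when SBR doubles the rate of the core codec): priming counted in output samples has to be rescaled, and not every value survives the conversion. The function name and the numbers are hypothetical.

```python
def to_media_time(priming_output_samples, output_rate, media_timescale):
    """Convert priming from output samples to track timescale units.

    Raises ValueError if the priming cannot be represented exactly in
    the coarser timescale, rather than silently rounding.
    """
    numerator = priming_output_samples * media_timescale
    if numerator % output_rate != 0:
        raise ValueError("priming is not representable in this timescale")
    return numerator // output_rate


# Hypothetical: 4224 output samples at 48 kHz, track timescale 24000.
print(to_media_time(4224, 48000, 24000))  # 2112

try:
    to_media_time(4225, 48000, 24000)  # an odd count cannot be halved exactly
except ValueError as err:
    print("error:", err)
```

Whether to reject, round, or pick a finer timescale is a design choice for the muxer; the point is only that the units have to be checked explicitly.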

Magnum: Opus

Opus is the newest format that Vimeo and many others use for audio. It’s a huge improvement over both AAC and Vorbis in quality, and it can also be configured for very low delay if needed.

Like Vorbis, Opus never stands on its own. In fact, it’s stored in the same Ogg container, and its end trimming is handled in the same way, which means that there are no issues with gaps. Opus does add one small improvement over Vorbis: it supports very short delays and can prime with even less than a single frame. Rather than assuming a single frame as Vorbis does, Opus stores the amount to trim in a pre-skip field in the Opus header, as shown in Figure 4.

Figure 4. Opus in Ogg. Note that, unlike in Ogg Vorbis, a pre-skip value in the header signals how much priming to trim at the beginning of the stream. This can be less than a single Opus frame, or span multiple Opus frames.
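Reading the pre-skip value takes only a few lines. The sketch below follows the Ogg Opus identification header layout (the "OpusHead" magic, then version, channel count, a 16-bit little-endian pre-skip, input sample rate, output gain, and mapping family). The function name is made up, and the header in the example is synthetic, with 312 used as a placeholder pre-skip.

```python
import struct


def parse_opus_pre_skip(head: bytes) -> int:
    """Return the pre-skip (in 48 kHz samples) from an OpusHead packet."""
    if len(head) < 19 or head[:8] != b"OpusHead":
        raise ValueError("not an OpusHead packet")
    # version, channels, pre-skip, input sample rate, output gain, mapping family
    _version, _channels, pre_skip, _rate, _gain, _mapping = struct.unpack_from(
        "<BBHIhB", head, 8
    )
    return pre_skip


# Synthetic stereo header with a placeholder pre-skip of 312 samples.
example = b"OpusHead" + struct.pack("<BBHIhB", 1, 2, 312, 48000, 0, 0)
print(parse_opus_pre_skip(example))  # 312
```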

Opus can also be put into the ISOBMFF container, which is what we do at Vimeo. It was very important that Opus in ISOBMFF be at least as good as Ogg Vorbis and AAC with edit lists, and therefore the pre-skip and durations from Ogg Opus are always mapped to the same edit list mechanism that AAC uses. Plus, Opus’s single 48 kHz time base avoids the difficulties that plague HE-AAC. This arrangement is shown in Figure 5.

Figure 5. Opus in ISOBMFF. Note the similarity to AAC in ISOBMFF when an edit list is used.
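The mapping from Ogg Opus values to an edit list entry can be sketched like this. It assumes a stream whose granule positions start at zero and a track timescale of 48000, and the function name and the dictionary keys are illustrative rather than any muxer's actual API.

```python
OPUS_RATE = 48000  # Opus granule positions and pre-skip always use 48 kHz


def opus_edit_list(pre_skip, final_granule, movie_timescale):
    """Map Ogg Opus pre-skip and final granule position to edit list values.

    The playable length is the final granule position minus the pre-skip.
    media_time is in track timescale units (assumed to be 48000);
    segment_duration is in movie timescale units.
    """
    playable = final_granule - pre_skip
    if playable < 0:
        raise ValueError("final granule position is smaller than pre-skip")
    duration = (playable * movie_timescale + OPUS_RATE // 2) // OPUS_RATE
    return {"media_time": pre_skip, "segment_duration": duration}


# Hypothetical stream: 10 seconds of audio plus 312 samples of pre-skip.
print(opus_edit_list(312, 480312, 1000))
```

Since everything is counted at 48 kHz, no rescaling of the pre-skip is needed, which is the simplification the fixed time base buys.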

Because Opus is normally provided to a standalone muxer as Ogg Opus, an ISOBMFF muxer can read the pre-skip value from the input, meaning it doesn’t have to be passed externally and everything will “just work.”

And there you have it

The simplest takeaway here is to always use Opus, since it’s the easiest audio format to get right. However, sheer adoption rates suggest that we’ll be stuck with AAC for years to come, so it’s very important to make sure your tooling produces edit lists for AAC, too, with the correct values!

It’s also likely that there will be new audio formats in the future. You’ll want to package those in ISOBMFF right away to avoid repeating the mistakes of AAC.
