Introductory concepts
To begin, I should explain some introductory concepts related to H.264 video.
What is H.264?
H.264 is a video compression standard known as MPEG-4 Part 10, or MPEG-4 AVC (for "advanced video coding"). It's a joint standard promulgated by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).
H.264's audio sidekick is AAC (advanced audio coding), which is designated MPEG-4 Part 3. Both H.264 and AAC are technically MPEG-4 codecs—though it's more accurate to call them by their specific names—and compatible bitstreams should conform to the requirements of Part 14 of the MPEG-4 spec.
According to Part 14, MPEG-4 files containing both audio and video, including those with H.264/AAC, should use the .mp4 extension, while audio-only files should use .m4a and video-only files should use .m4v. Different vendors have adopted a range of extensions that are recognized by their proprietary players, such as Apple with .m4p for files using FairPlay Digital Rights Management and .m4r for iPhone ringtones. (Mobile phones use the .3gp and .3g2 extensions, though I don't discuss producing for mobile phones in this article.)
Like MPEG-2, H.264 uses three types of frames, meaning that each group of pictures (GOP) is comprised of I-, B-, and P-frames, with I-frames like the DCT-based compression used in DV and B- and P-frames referencing redundancies in other frames to increase compression. I'll cover much more on this later in this article.
Like most video coding standards, H.264 actually standardizes only the "central decoder...such that every decoder conforming to the standard will produce similar output when given an encoded bitstream that conforms to the constraints of the standard," according to Overview of the H.264/AVC Video Coding Standard published in IEEE Transactions on Circuits and Systems for Video Technology (ITCSVT). Basically, this means that there's no standardized H.264 encoder. In fact, H.264 encoding vendors can utilize a range of different techniques to optimize video quality, so long as the bitstream plays on the target player. This is one of the key reasons that H.264 encoding interfaces vary so significantly among the various tools.
Will there be royalties?
If you stream H.264 encoded video after December 31, 2010, there may be an associated royalty obligation. As yet, however, it's undefined and uncertain. Here's an overview of what's known about royalties to date.
Briefly, H.264 was developed by a group of patent holders now represented by the MPEG Licensing Suthoring, or MPEG-LA for short. According to the Summary of AVC/H.264 License Terms (PDF, 34K) you can download from the MPEG-LA site, there are three classes of video producers subject to a potential royalty obligation.
If you're in the first two classes, and are either distributing via pay-per-view or subscription, you may already owe MPEG-LA royalties. The third group, which is clearly the largest, is for free Internet broadcast. Here, there will be no royalties until December 31, 2010 (source: AVC/H.264 License Agreement). After that, "the royalty shall be no more than the economic equivalent of royalties payable during the same time for free television."
According to their website, MPEG-LA must disclose licensing terms at least one year before they become due, or no later than December 31, 2009. Until then, we're unfortunately in the dark as to which uses of H.264 video will incur royalties, and the extent of these charges. For more information on H.264-related royalties, check out my article, The Future's So Bright: H.264 Year in Review, at StreamingMedia.com.
H.264 and Flash Player
As I mentioned, Adobe added H.264 playback support to Adobe Flash Player 9 Update 3 back in 2007. The apparent goal was to support the widest possible variation of files containing H.264 encoded video, and Flash Player should play.mp4, .m4v, .m4a, .mov, and .3gp files, H.264 files using the .flv extension, as well as files using the newer extensions that were released along with Flash Player 9 (see Table 1).
File Extension | FTYP | MIME Type | Description |
---|---|---|---|
.f4v | 'F4V ' | video/mp4 | Video for Flash Player |
.f4p | 'F4P ' | video/mp4 | Protected media for Flash Player |
.f4a | 'F4A ' | audio/mp4 | Audio for Flash Player |
.f4b | 'F4B ' | audio/mp4 | Audio book for Flash Player |
I'll describe profiles and levels in the next section. For now, understand that Flash Player supports the Baseline, Main, High, and High 10 H.264 profiles with no levels excluded. Accordingly, when you're producing H.264 video for Flash Player, you're free to choose the most advanced profile supported by the encoding tool, which is typically the High profile. On the audio side, Flash Player can play AAC Main, AAC Low Complexity, and AAC SBR (spectral band replication), which is otherwise known as High-Efficiency-AAC, or HE-AAC.
Producing H.264 video
You have seen that you have nearly complete flexibility regarding profiles and extensions; what else do you need to know before you dig into the details? A couple of things.
First, unlike VP6, which is available only from On2, there are multiple suppliers of H.264 codecs, including MainConcept, whose codec Adobe uses in Adobe Media Encoder and Adobe Flash Media Encoding Server. I've compared the quality of H.264 files produced with H.264 codecs from other vendors, and MainConcept has proven to be the best.
In general, while the overall quality of other codecs has improved, there are some tools to avoid out there. If you're producing with a different tool and not achieving the quality you were hoping for, try encoding with one of the Adobe tools.
Second, some older encoding tools do not offer output directly into F4V format. If F4V format is not offered in your encoding tool, the best alternative is to produce an MPEG-4 compatible streaming media file using the .mp4 extension.
With this as background, I'll describe the most common H.264 encoding parameters.
Though H.264 codecs come from different vendors, they use the same general encoding techniques and typically present similar encoding options. Here I review the most common H.264 encoding options.
Understanding profiles and levels
According to the aforementioned article, Overview of the H.264/AVC Video Coding Standard, a profile "defines a set of coding tools or algorithms that can be used in generating a conforming bitstream, whereas a level "places constraints on certain key parameters of the bitstream." In other words, a profile defines specific encoding techniques that you can or can't utilize when encoding the files (such as B-frames), while the level defines details such as the maximum resolutions and data rates.
Take a look at Figure 1, which is a filtered screen capture of a features table from Wikipedia's description of H.264. On top are H.264 profiles, including the Baseline, Main, High, and High 10 profiles that Flash Player supports. On the left are the different encoding techniques available, with the table detailing those supported by the respective profiles.
As you would guess, the higher-level profiles use more advanced encoding algorithms and produce better quality (see Figure 2). To produce this comparison, I encoded the same source file to the same encoding parameters. The file on the left uses the Main Profile; the files on the right uses the Baseline. A quick check of the chart in Figure 1 reveals that the Main Profile enables B slices (also called B-frames) and the higher-quality CABAC encoding, which I define later in this article. As you can see, these do help the Main Profile deliver higher-quality video than the Baseline.
So, the Main and High profiles deliver better quality than the Baseline Profile; what's the catch? The catch is, as you use more advanced encoding techniques, the file becomes more difficult to decompress, and may not play smoothly on older, slower computers.
This observation illustrates one of the two trade-offs typically presented by H.264 encoding parameters. One trade-off is better quality for a file that is harder to decompress. The other trade-off is a parameter that delivers better quality at the expense of encoding time. In some rare instances, as with the decision to include B-frames in the stream, you trigger both trade-offs, increasing both decoding complexity and encoding time.
To return to profiles: At a high level, think about profiles as a convenient point of agreement for device manufacturers and video producers. Mobile phone vendor A wants to build a phone that can play H.264 video but needs to keep the cost, heat, and size requirements down. So the crafty chief of engineering searches and finds the optimal processor that's powerful enough to play H.264 files produced to the Baseline Profile. If you're a video producer seeking to create video for that device, you know that if you encode using the Baseline profile, the video will play.
Accordingly, when producing H.264 video, the general rule is to use the maximum profile supported by the target playback platform, since that delivers the best quality at any given data rate. If producing for mobile devices, this typically means the Baseline Profile, but check the documentation for that device to be sure. If producing for Flash Player consumption on Windows or Macintosh computers, this means the High Profile.
This sounds nice and tidy, but understand this: While encoding using the Baseline Profile ensures smooth playback on your target mobile device, using the High Profile for files bound for computer playback doesn't provide the same assurance. That's because the High Profile supports H.264 video produced at a maximum resolution of 4096 × 2048 pixels and a data rate of 720 Mbps. Few desktop computers could display a complete frame, much less play back that stream at 30 frames per second.
Accordingly, while producing for devices is all about profile, producing for computers is all about your video configuration. Here, the general rule is that decoding H.264 video is about as computationally intense as VP6—or Windows Media, for that matter. So long as you produce your H.264 video at a similar resolution and data rate as the other two codecs, it should play fine on the same class of computer. (For comparative playback statistics for H.264, VP6 and VC-1, check out my StreamingMedia.com article, Decoding the Truth About Hi-Def Video Production.)
In general, this means that as long as you're producing SD video at 640 × 480 resolution and lower, it should play fine on most post–2003 computers. If you're producing at 720p or higher, these streams won't play smoothly on many of these computers. You should consider offering an alternative SD stream for these viewers.
What about H.264 levels? If producing for mobile devices with limited screen resolution and bandwidth, you also have to choose the correct level, which again should be specified by the device manufacturer. However, since Flash Player can handle any level supported by any of the supported profiles, you don't have to worry about levels when producing for Flash Player playback on a personal computer.
Entropy coding
When you select the Main or High Profiles, some encoding tools will give you two options for entropy coding mode (see Figure 3):
- CAVLC: Context-based adaptive variable-length coding
- CABAC: Context-based adaptive binary arithmetic coding
Of the two, CAVLC is the lower-quality, easier-to-decode option, while CABAC is the higher-quality, harder-to-decode option.
Though results are source-dependent, CABAC is generally regarded as being between 5–15% more efficient than CAVLC. This means that CABAC should deliver equivalent quality at a 5–15% lower data rate, or better quality at the same data rate. In my own tests, CABAC produced noticeably better quality, though only in HD test clips encoding to very low data rates. This is shown in Figure 4, from a 720p file produced with CABAC on the left and CAVLC on the right, both to the same 800 kbps video data rate. Figure 4 shows a portion of a frame cut from a 16:9 720p video. Now 800 kbps is very low for 720p footage; by way of comparison, YouTube encodes H.264 720p footage at 2 Mbps, over 2.5 times the data rate.
Though neither image would win an award for clarity, the ballerina's face and other details are clearly more visible on the left. The bottom line is that CABAC should deliver better quality, however modest the difference. Now the question becomes, How much harder is the file to decompress and play?
Not that much, it turns out. I tested this on two of the less-powerful multiple-core computers in my office, one a Hewlett-Packard notebook with a Core 2 Duo processor, and the other a Power PC-based Apple PowerMac. As you can see in Table 2, the CABAC file increased the CPU load by less than 1% on the HP notebook, and less than 2% on the Mac. Based on the improved quality and minimal difference in the required playback CPU, I recommend choosing CABAC whenever the option is available.
Computer | CABAC | CAVLC | Difference |
---|---|---|---|
HP Compaq 8710w Mobile Workstation – | Core 2 duo31.1% | 30.5% | 0.6% |
Apple PowerMac – Dual 2.7 GHz PPC G5 | 35.5 | 33.7 | 1.8% |
I, P, and B-frames
It's common knowledge that talking-head footage, where very little changes from frame to frame, encodes at higher quality than dynamic, motion-filled video. That's because H.264, like all high-quality motion codecs, is designed to take advantage of redundancies between video frames. The more redundancy, the higher the quality at any given bit rate.
To leverage this redundancy, H.264 streams include three types of frames (see Figure 5):
- I-frames: Also known as key frames, I-frames are completely self-referential and don't use information from any other frames. These are the largest frames of the three, and the highest-quality, but the least efficient from a compression perspective.
- P-frames: P-frames are "predicted" frames. When producing a P-frame, the encoder can look backwards to previous I or P-frames for redundant picture information. P-frames are more efficient than I-frames, but less efficient than B-frames.
- B-frames: B-frames are bi-directional predicted frames. As you can see in Figure 5, this means that when producing B-frames, the encoder can look both forwards and backwards for redundant picture information. This makes B-frames the most efficient frame of the three. Note that B-frames are not available when producing using H.264's Baseline Profile.
Now that you know the function of each frame type, I'll show you how to optimize their usage.
Working with I-frames
Though I-frames are the least efficient from a compression perspective, they do perform two invaluable functions. First, all playback of an H.264 video file has to start at an I-frame because it's the only frame type that doesn't refer to any other frames during encoding.
Since almost all streaming video may be played interactively, with the viewer dragging a slider around to different sections, you should include regular I-frames to ensure responsive playback. This is true when playing a video streamed from Flash Media Server, or one distributed via progressive download. While there is no magic number, I typically use an I-frame interval of 10 seconds, which means one I-frame every 300 frames when producing at 30 frames per second (and 240 and 150 for 24 fps and 15 fps video, respectively).
The other function of an I-frame is to help reset quality at a scene change. Imagine a sharp cut from one scene to another. If the first frame of the new scene is an I-frame, it's the best possible frame, which is a better starting point for all subsequent P and B-frames looking for redundant information. For this reason, most encoding tools offer a feature called "scene change detection," or "natural key frames," which you should always enable.
Figure 6 shows the I-frame related controls from Flash Media Encoding Server. You can see that Enable Scene Change detection is enabled, and that the size of the Coded Video Sequence is 300, as in 300 frames. This would be simpler to understand if it simply said "I-frame interval," but it's easy enough to figure out.
Specifically, the Coded Video Sequence refers to a "Group of Pictures" or GOP, which is the building block of the H.264 stream—that is, each H.264 stream is composed of multiple GOPs. Each GOP starts with an I-frame and includes all frames up to, but not including, the next I-frame. By choosing a Coded Video Sequence size of 300, you're telling Flash Media Encoding Server to create a GOP of 300 frames, or basically the same as an I-frame interval of 300.
IDR frames
I'll describe the Number of B-Pictures setting further on, and I've addressed Entropy Coding Mode already; but I wanted to explain the Minimum IDR interval and IDR frequency. I'll start by defining an IDR frame.
Briefly, the H.264 specification enables two types of I-frames: normal I-frames and IDR frames. With IDR frames, no frameafter the IDR frame can refer back to any frame before the IDR frame. In contrast, with regular I-frames, B and P-frames located after the I-frame can refer back to reference frames located before the I-frame.
In terms of random access within the video stream, playback can always start on an IDR frame because no frame refers to any frames behind it. However, playback cannot always start on a non-IDR I-frame because subsequent frames may reference previous frames.
Since one of the key reasons to insert I-frames into your video is to enable interactivity, I use the default setting of 1, which makes every I-frame an IDR frame. If you use a setting of 0, only the first I-frame in the video file will be an IDR frame, which could make the file sluggish during random access. A setting of 2 makes every second I-frame an IDR frame, while a setting of 3 makes every third I-frame an IDR frame, and so on. Again, I just use the default setting of 1.
Minimum IDR interval defines the minimum number of frames in a group of pictures. Though you've set the Size of Codec Video Sequence at 300, you also enabled Scene Change Detection, which allows the encoder to insert an I-frame at scene changes. In a very dynamic MTV-like sequence, this could result in very frequent I-frames, which could degrade overall video quality. For these types of videos, you could experiment with extending the minimum IDR interval to 30–60 frames, to see if this improved quality. For most videos, however, the default interval of 1 provides the encoder with the necessary flexibility to insert frequent I-Frames in short, highly dynamic periods, like an opening or closing logo. For this reason, I also use the default option of 1 for this control.
Working with B-frames
B-frames are the most efficient frames because they can search both ways for redundancies. Though controls and control nomenclature varies from encoder to encoder, the most common B-frame related control is simply the number of B-frames, or "B-Pictures" as shown in Figure 6. Note that the number in Figure 6 actually refers to the number of B-frames between consecutive I-frames or P-frames.
Using the value of 2 found in Figure 6, you would create a GOP that looks like this:
IBBPBBPBBPBB......all the way to frame 300. If the number of B-Pictures was 3, the encoder would insert three B-frames between each I-frame and/or P-frame. While there is no magic number, I typically use two sequential B-frames.
How much can B-frames improve the quality of your video? Figure 7 tells the tale. By way of background, this is a frame at the end of a very-high-motion skateboard sequence, and also has significant detail, particularly in the fencing behind the skater. This combination of high motion and high detail is unusual, and makes this frame very hard to encode. As you can see in the figure, the video file encoded using B-frames retains noticeably more detail than the file produced without B-frames. In short, B-frames do improve quality.
What's the performance penalty on the decode side? I ran a battery of cross-platform tests, primarily on older, lower-power computers, measuring the CPU load required to play back a file produced with the Baseline Profile (no B-frames), and a file produced using the High Profile with B-frames. The maximum differential that I saw was 10 percent, which isn't enough to affect my recommendation to always use the High Profile except when producing for devices that support only the Baseline Profile.
Advanced B-frame options
Adobe Flash Media Encoding Server also includes the B and P-frame related controls shown in Figure 8. Adaptive B-frame placement allows the encoder to override the Number of B-Pictures value when it will enhance the quality of the encoded stream; for instance, when it detects a scene change and substitutes an I-frame for the B. I always enable this setting.
Reference B-Pictures lets the encoder to use B-frames as a reference frame for P frames, while Allow pyramid B-frame coding lets the encoder use B-frames as references for other B-frames. I typically don't enable these options because the quality difference is negligible, and I've noticed that these options can cause playback to become unstable in some environments.
Reference frames is the number of frames that the encoder can search for redundancies while encoding, which can impact both encoding time and decoding complexity; that is, when producing a B-frame or P-frame, if you used a setting of 10, the encoder would search until it found up to 10 frames with redundant information, increasing the search time. Moreover, if the encoder found redundancies in 10 frames, each of those frames would have to be decoded and in memory during playback, which increases decode complexity.
Intuitively, for most videos, the vast majority of redundancies are located in the frames most proximate to the frame being encoded. This means that values in excess of 4 or 5 increase encoding time while providing little value. I typically use a value of 4.
Finally, though it's not technically related to B-frames, consider the number of Slices per picture, which can be 1, 2, or 4. At a value of 4, the encoder divides each frame into four regions and searches for redundancies in other frames only within the respective region. This can accelerate encoding on multicore computers because the encoder can assign the regions to different cores. However, since redundant information may have moved to a different region between frames—say in a panning or tilting motion—encoding with multiple slices may miss some redundancies, decreasing the overall quality of the video.
In contrast, at the default value of 1, the encoder treats each frame as a whole, and searches for redundancies in the entire frame of potential reference frames. Since it's harder to split this task among multiple cores, this setting is slower, but also maximizes quality. Unless you're in a real hurry, I recommend the default value of 1.
Other encoding parameters
Once you get beyond I and B-frame related controls, H.264 enables a range of additional encoding parameters, as I will soon describe. To put these in perspective, I would estimate that all the options described up to this point account for 90-95% of the quality available in H.264. The settings discussed in this section can deliver only the remaining 5%, which means that most users can accept the defaults without noticing the difference. Still, if you want to try to eke out the ultimate in H.264 quality,you can use the functions that the settings shown in Figure 9 control.
First is search shape, which can be either 16 × 16 or 8 × 8. The latter (8 × 8) is the higher-quality option, with the trade-off being longer encoding time. The next three "fast" options allow you to speed encoding time at the possible cost of quality. I typically disable these options.
Adaptive Quantization Mode and Quantization Strength are advanced settings that reallocate bits of data within a frame using one of the three selected criteria: brightness, contrast, or complexity. I would only experiment with these settings when areas in the video frame are noticeably blocky. Unfortunately, operation is extremely content-specific, which makes it impossible to offer general advice regarding which techniques and values to use.
Both the rate distortion optimization and Hadamard transformation settings can improve quality but lengthen encoding time; I usually enable both. Finally, the Motion estimation subpixel mode defines the granularity of the search for redundancies: Quarter pixel represents the highest-quality option, though the slowest to encode, and Full pixel represents the fastest but lowest quality. In my low-volume environment, I always use the Quarter pixel option.