Thoughts on Netflix’s Move Toward Perceptual Video Quality Metrics
Netflix recently published a blog post titled “Toward a Practical Perceptual Video Quality Metric” that announced their development of a new video quality metric, termed Video Multimethod Assessment Fusion (VMAF), that they claim better reflects human perception of video quality than other widely used objective metrics.
At EuclidIQ, where we have been working on methods of applying perceptual quality optimization (PQO), integrating consideration of the human visual system (HVS) into the encoding process to produce perceptually better H.264 and HEVC encoding, we welcome such efforts and the general attention toward perceptual (instead of mathematical) measures of video quality. There are a few points to be made about Netflix’s efforts, however.
Netflix’s VMAF metric is a supervised learning method that requires existing subjective test data to train its model parameters. From the blog: “The machine-learning model is trained and tested using the opinion scores obtained through a subjective experiment.” This means that the accuracy of the VMAF metric is limited to the videos in the subjective test upon which it is trained. As is the case for most objective metrics, the accuracy of the VMAF metric would likely be much lower for videos not contained in the subjective test training data. The accuracy of the VMAF metric also depends on viewing videos under the same viewing conditions as used for the subjective test training data. As noted in the Netflix blog, “a Netflix member watching a 720p movie encoded at 1 Mbps on a 4K 60-inch TV may have a very different perception of the quality of that same stream if it were instead viewed on a 5-inch smartphone.” To its credit, Netflix is trying to generalize the VMAF metric to include more videos and more viewing conditions. The well-known tradeoff in the generalization process involves sacrificing accuracy in specific cases in favor of better accuracy in the general case.
Netflix’s VMAF metric has some of the same limitations as most objective metrics. First, most objective metrics detect spatial artifacts well but not temporal artifacts. This is likely true for the current form of the VMAF metric, which combines two “elementary” metrics based on spatial information (the so-called visual information fidelity and detail loss metrics) and one elementary metric based on temporal frame differences. As with most objective metrics, the only temporal component to VMAF is derived from temporal differencing, which is a simplistic and often inaccurate measure of the temporal characteristics of video that doesn’t account for the complex motion of objects within the video frame. Accurately capturing the true temporal characteristics of video (using, for example, the motion vectors computed during encoding) is computationally expensive, which is why most objective metrics don’t attempt to do so. Second, VMAF aggregates measurement over an entire video by computing a simple frame average. To its credit, Netflix acknowledges this as a weakness of the current version of VMAF and seeks to apply a better “temporal pooling” mechanism to future versions. We pointed out both of these limitations of objective metrics in our recent talk at Streaming Media East on using subjective testing for perceptual quality optimization (see Slides 4 and 5 in the linked presentation).
Subjective testing is still important. While effective objective metrics can reduce the frequency of subjective testing, they won’t eliminate the need for subjective testing entirely. As noted above, supervised learning methods such as VMAF require accurate subjective tests to generate training data. And final evaluation of encoding techniques often requires at least some subjective testing. EuclidIQ has developed a practical subjective test methodology that generates meaningful results without being prohibitive in terms of time or cost. We have written about the methodology in a series of blog posts and have released a white paper you can access here.
Developing better perceptual quality metrics is good, but applying perceptual quality optimization to improve encoding is better. If Netflix is successful in developing a perceptual quality objective metric that is widely accepted in the industry, we will certainly applaud the effort. However, the development of a better objective metric will only make evaluation of different encoding techniques easier – it will not make the encoding itself perceptually better. Following the same line of thinking as Netflix but applying it to the encoding process, EuclidIQ has been working to integrate perceptual quality optimization (PQO) directly into the encoder. Our IQ264 technology generates a Unified Importance Map that uses HVS considerations to identify areas of the video frame that are most noticeable to human observers and adjusts encoding quality accordingly. Where our method differs from other PQO efforts is that IQ264 operates during encoding, not as a pre-processor or as post-processor. Because of this, IQ264 can both improve encoding quality for a given bitrate and reduce bitrate for a given quality.
To find out more about IQ264, contact us at firstname.lastname@example.org.