How Well Does Netflix’s VMAF Metric Capture Video Quality?


Last year, Netflix published a blog post titled “Toward a Practical Perceptual Video Quality Metric” announcing their development of a new video quality metric, termed Video Multimethod Assessment Fusion (VMAF), that they claim better reflects human perception of video quality than other widely used objective metrics.  At the time, we made some initial observations about VMAF in our own blog post.

In the past year, after Netflix open-sourced the code for VMAF, industry thought leaders have begun to use it in compression evaluation studies as a replacement for the long-time standard video quality metric, PSNR.  At EuclidIQ, we have worked hard to develop a subjective testing methodology that produces mean opinion score (MOS) data quickly and accurately, and we base all evaluation of our video encoding technology on MOS data.  So, a natural question arises: how well does VMAF approximate MOS data?  Read on to find out!

Subjective Test Data

The data presented in this blog comes from two double-blind subjective tests conducted using our subjective testing methodology.  One test included eight expert viewers, while the other test included nine experts.  Each test contained twenty different 1080p video clips with varying characteristics, for a total of forty different video clips in all.  The video clips ranged between ten and thirty seconds in length.  Each video clip was encoded with two different H.264 encoders, a “reference” Encoder A and an “enhanced” Encoder B, and at three different bitrates chosen to be below, near, and above the video quality breakdown bitrate for the video clip.  Thus, for each video clip, six different encodings were presented in randomized order to the viewers for scoring, for a total of 240 encodings (forty video clips at six encodings each) across the two tests.

Viewers scored each clip on a 9-grade impairment scale, with scores ranging from 1 to 5 (5 being the best) in increments of 0.5.  MOS data for each encoding was derived by averaging the scores across viewers.  For each of the 240 encodings across the two tests, PSNR and VMAF values were computed for each frame and then frame-averaged to obtain an overall metric value for the encoding.  VMAF was computed using the “default” model.
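The two pooling steps just described — averaging viewer scores into a MOS, and frame-averaging per-frame metric values into one overall score — can be sketched as follows. This is a minimal illustration; the function names and sample values are ours, not part of the test data.

```python
def mean_opinion_score(viewer_scores):
    """Average per-viewer scores (1.0-5.0 in 0.5 steps) into a single MOS."""
    return sum(viewer_scores) / len(viewer_scores)

def pooled_metric(per_frame_values):
    """Frame-average per-frame PSNR or VMAF values into one overall value
    for the entire encoding."""
    return sum(per_frame_values) / len(per_frame_values)

# Illustrative usage with made-up scores:
mos = mean_opinion_score([4.0, 4.5, 3.5, 4.0])   # -> 4.0
psnr = pooled_metric([34.8, 35.2, 35.6, 35.0])
```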

VMAF Performance: Correlation with MOS Values

The most basic measure of how well an objective metric correlates with subjective MOS data is to plot one versus the other and observe how “tightly” the resulting scatter plot fits to a line.  These scatter plots are shown in Figure 1 for both PSNR vs. MOS and VMAF vs. MOS, for all 240 encodings (independent of encoder).  Similar to what Netflix found in their original VMAF blog post (linked above), Figure 1 shows that VMAF was more tightly correlated with MOS than PSNR was, though the VMAF correlation was not as tight as in the Netflix results.


Figure 1.  Overall correlation of PSNR vs. MOS and VMAF vs. MOS.
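The “tightness” of these scatter plots is conventionally summarized by a Pearson correlation coefficient between the metric values and the MOS values. A minimal, stdlib-only sketch (the sample pairs below are illustrative, not our test data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (VMAF, MOS) pairs -- not actual test results:
vmaf = [70.0, 80.0, 90.0, 95.0]
mos = [3.0, 3.5, 4.0, 4.5]
r = pearson(vmaf, mos)
```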

VMAF Performance: Correlation with MOS Ordering

When using objective metrics for encoder evaluations, however, general correlation of metric values to MOS values is not the whole story.  What is more important is whether the objective metric values preserve the relative ordering of encoder performance in A vs. B comparisons.  In other words, for a given encoding, if Encoder A has a higher MOS value than Encoder B, does Encoder A also have a higher VMAF value than Encoder B?

Table 1 shows some example results for two video clips in the subjective test data, AirChicago at 7700 kbps and AlpineCow at 3650 kbps.  For AirChicago, Encoder A had a higher VMAF value than Encoder B, matching the ordering of MOS values, while PSNR was mismatched, showing a higher value for Encoder B than Encoder A.  For AlpineCow, the result was the opposite, where PSNR matched the MOS order (B better than A) while VMAF was mismatched.




              VMAF                      MOS                      PSNR
Video         A      B      B-A        A     B     B-A         A      B      B-A
AirChicago    90.89  90.19  -0.71      4.13  4.00  -0.13       35.29  35.62  0.33
AlpineCow     88.70  88.32  -0.38      3.56  3.88   0.32       35.47  35.71  0.24

Table 1.  Selected results showing metric values attained by Encoder A and Encoder B for two video clips, AirChicago at 7700 kbps and AlpineCow at 3650 kbps.

The full set of subjective test data contained 120 different comparisons of Encoder A vs. Encoder B (40 different video clips at 3 bitrates each).  Table 2 shows how well-correlated VMAF and PSNR were with MOS in terms of this “order preservation.” Surprisingly, according to this method of evaluation, PSNR outperformed VMAF significantly, matching the MOS ordering 89% of the time while VMAF matched the MOS ordering only 70% of the time.  This is surprising because VMAF is designed to capture various aspects of human perception and thus should theoretically match the conclusions of MOS data more often than PSNR.

        MOS    VMAF   PSNR
MOS     1.00   0.70   0.89
VMAF    0.70   1.00   0.79
PSNR    0.89   0.79   1.00

Table 2.  Correlation of VMAF and PSNR to MOS, measured in terms of how often each metric preserved the MOS ordering of which encoder was better.
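The “order preservation” rate described above can be computed directly: for each A-vs-B comparison, check whether the metric ranks the two encoders the same way MOS does. A minimal sketch with hypothetical data (not our test results):

```python
def order_agreement(mos_pairs, metric_pairs):
    """Fraction of A-vs-B comparisons where the metric ranks the encoders
    the same way as MOS.  Each pair is (value_for_A, value_for_B)."""
    matches = sum(
        1 for (ma, mb), (xa, xb) in zip(mos_pairs, metric_pairs)
        if (ma > mb) == (xa > xb)
    )
    return matches / len(mos_pairs)

# Three hypothetical comparisons; the metric agrees with MOS in two of them:
mos = [(4.1, 4.0), (3.6, 3.9), (3.0, 3.5)]
metric = [(90.9, 90.2), (88.7, 88.3), (70.0, 75.0)]
rate = order_agreement(mos, metric)   # -> 2/3
```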

VMAF Performance: Correlation with MOS Gains

Complete encoder evaluations from subjective test data do not simply compare MOS values at different bitrates, since the output bitrates for different encoders may vary slightly for the same encoding bitrate and the overall MOS vs. bitrate behavior for different encoders may vary over multiple bitrates.  One way to capture the compression “gain” of one encoder over another is to compute the BD-Rate metric from the respective rate-quality plots of the encoders.  Full details of how we compute the BD-Rate metric are beyond the scope of this blog but are contained in our whitepaper on subjective testing methodology.  For the purposes of this blog, it is sufficient to describe BD-Rate as a measure of the average bandwidth reduction provided by one encoder over another, averaged over a range of qualities represented in a rate-quality plot.
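As a rough illustration of the idea, BD-Rate can be approximated by averaging the log-bitrate gap between the two rate-quality curves over their overlapping quality range. The sketch below uses simple piecewise-linear interpolation rather than the cubic polynomial fit of the classic Bjøntegaard method, and it is not our whitepaper's exact procedure; the sign convention matches the text (positive = Encoder B gains over Encoder A).

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be ascending."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside data range")

def bd_rate(rates_a, qual_a, rates_b, qual_b, steps=100):
    """Simplified BD-Rate (percent): average the log-rate gap between the
    two rate-quality curves over their overlapping quality range, then
    convert back to a percentage.  Positive = Encoder B needs fewer bits
    than Encoder A for equal quality (a gain for B)."""
    la = [math.log(r) for r in rates_a]
    lb = [math.log(r) for r in rates_b]
    lo = max(min(qual_a), min(qual_b))
    hi = min(max(qual_a), max(qual_b))
    total = 0.0
    for k in range(steps + 1):
        q = min(lo + (hi - lo) * k / steps, hi)
        total += _interp(q, qual_b, lb) - _interp(q, qual_a, la)
    avg = total / (steps + 1)
    return (1.0 - math.exp(avg)) * 100.0

# Illustrative curves: B hits the same quality at 10% lower bitrate everywhere.
gain = bd_rate([1000, 2000, 4000], [3.0, 4.0, 5.0],
               [900, 1800, 3600], [3.0, 4.0, 5.0])   # -> 10.0
```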



Figure 2.  Rate-quality plots for the AirChicago video, both MOS vs. bitrate (left) and VMAF vs. bitrate (right).

Figure 2 shows an example on the AirChicago video of different BD-Rate gains indicated by different metrics.  The rate-quality plot on the left measures quality using subjective MOS and gives a very different picture of the gain of Encoder B vs. Encoder A than does the rate-quality plot on the right, where quality is measured by VMAF.  The BD-Rate of Encoder B relative to Encoder A calculated from the MOS-based rate-quality plot on the left is +24.5%; i.e., Encoder B achieves a 24.5% gain relative to Encoder A.  However, the same BD-Rate calculated from the VMAF-based rate-quality plot on the right is -2.2%; i.e., Encoder B achieves a 2.2% loss relative to Encoder A.

The full set of subjective test data contained 40 different BD-Rate calculations for the gain of Encoder B relative to Encoder A, corresponding to 40 different video clips encoded at three bitrates each for both Encoder A and Encoder B. The BD-Rate values can be sorted into three categories:

  1. Encoder A significantly better than Encoder B [BD-Rate less than -5%]
  2. Both encoders approximately the same [BD-Rate between -5% and +5%]
  3. Encoder B significantly better than Encoder A [BD-Rate greater than +5%]
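The three-way categorization above is a simple thresholding of the BD-Rate value; a minimal sketch (the ±5% threshold comes from the list, the function and label names are ours):

```python
def bd_rate_category(bd, threshold=5.0):
    """Map a BD-Rate gain (percent, positive = Encoder B better) to one of
    the three categories defined above."""
    if bd < -threshold:
        return "A significantly better"
    if bd > threshold:
        return "B significantly better"
    return "approximately the same"
```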

Table 3 shows how often VMAF and PSNR agreed with MOS in terms of BD-Rate gain categories, i.e., how often the respective BD-Rate gains fell into the same category.

        MOS    VMAF   PSNR
MOS     1.00   0.43   0.78
VMAF    0.43   1.00   0.55
PSNR    0.78   0.55   1.00

Table 3.  Correlation of VMAF and PSNR to MOS, measured in terms of how often each metric matched the MOS BD-Rate gain category.

Again, surprisingly, PSNR was better correlated with MOS than VMAF (78% vs. 43%) when evaluated using BD-Rate gain categories.  Simply averaging the BD-Rate gains of Encoder B vs. Encoder A over the 40 video clips resulted in a +20.4% gain using MOS, +2.7% using VMAF, and +13.5% using PSNR.  While it is well-known that subjective testing often results in greater distinctions between encoders than objective metric measurements, it is clear from this data set that VMAF pointed to a very different conclusion (Encoder A and Encoder B performed about the same) than either MOS or PSNR (Encoder B was significantly better than Encoder A).


To summarize, we presented an analysis of how well VMAF correlates with subjective MOS data derived from two subjective tests we conducted, comparing VMAF correlation with that of PSNR, which served as a reference metric.  While VMAF was better-correlated with MOS than PSNR when metric values were considered independent of encoder, VMAF was less well-correlated with MOS than PSNR when performing encoder comparisons, both when measuring encoder ordering (which of two encoders is “better”) and when measuring encoder gain (how much compression improvement one encoder achieves versus the other).  This was a surprising result, because VMAF is designed to reflect human perception, unlike PSNR, and so should better reflect subjective test scores from human subjects.  This is especially true because Netflix used subjective MOS data to train their model for VMAF.

However, it is possible that the MOS data Netflix used to train their VMAF model differs fundamentally from our MOS data, causing the poor correlation of VMAF with our MOS data.  For example, there are some differences in the way Netflix conducted its subjective testing, as compared to our subjective testing methodology.  Also, Netflix seemed to use videos with multiple frame resolutions in its training set, possibly diluting the accuracy of the VMAF model for any single frame size, since the correlation of VMAF to MOS may vary with frame size.  Finally, VMAF might do better with encoder comparison metrics if separate models were computed for each encoder, as the correlation of VMAF to MOS may vary with encoder type.

At EuclidIQ, we have been developing technologies that are directly relevant to many of the issues mentioned in this blog.  Our enhanced encoder uses perceptual quality optimization (PQO™), which identifies the areas of each video frame most noticeable to human observers, to better allocate bits within each frame and across frames in a video.  And our content-adaptive modeling algorithm, SABRE (signal-adaptive bitrate estimation), measures video quality using a “MOS predictor” (MOSp) objective metric that correlates well with actual MOS data.

In a future blog, we will explain how MOSp is calculated and how well it performs according to the evaluation metrics described in this blog.  If you’re interested in getting updates on this work, sign up for our blog mailing list.  You can also find more information about how we bring the human back into digital video encoding on our website.


