In a previous blog post, we examined how well Netflix’s new video quality metric, Video Multimethod Assessment Fusion (VMAF), predicts subjective test results in comparison with the long-time standard video quality metric, PSNR. We measured both correlation with raw mean opinion score (MOS) data and correlation with encoder comparisons derived from MOS data.
This comparison was of particular interest because Netflix claims VMAF better reflects human perception of video quality than other widely used objective metrics (such as PSNR), so VMAF should theoretically have bested PSNR in correlating with subjective test results. What we found was somewhat surprising: while VMAF was better correlated with MOS data than PSNR when metric values were considered independent of encoder, VMAF was less well correlated with MOS data than PSNR when performing encoder comparisons, both when measuring encoder ordering (which of two encoders was “better”) and when measuring encoder gain (how much compression improvement one encoder achieved versus the other). In this follow-up blog post, we examine the performance, within the same comparison environment, of a perceptually-based video quality metric called MOS predictor (MOSp) that we at EuclidIQ have recently developed.
Our MOS predictor (MOSp) metric is based on edge-based statistics and their variants. For each video encoding, we compute edge-based statistics on a frame-by-frame basis and then calculate video average values of the edge-based statistics by averaging over frames. We then fit a model to predict subjective MOS data using the edge-based statistics.
We experimented with several methods of modeling, including support vector machines (SVMs), but we found that linear modeling – the simplest and fastest modeling method – was sufficient, as no other modeling method performed appreciably better. The resulting model accepts edge-based statistics as inputs and outputs predicted MOS on the same scale as typical subjective scoring (1 to 5, with 5 being the best) – hence the term MOS predictor for the video quality metric. One important difference between our modeling and that of VMAF, which also fits edge-based statistics to subjective MOS data, is that we compute different models for different encoders. For example, the relative weighting of edge-based statistics used to produce MOS predictions is different for the x264 encoder than for the JM encoder.
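As a rough illustration of per-encoder linear modeling, the sketch below fits one ordinary least-squares model per encoder and uses it to predict MOS. It is not EuclidIQ's actual implementation: the two feature values per video, the training numbers, the function names, and the clamping of predictions to the 1-to-5 range are all assumptions made for the example.

```python
def fit_linear_model(features, mos):
    """Least-squares fit of mos ~ w0 + w1*f1 + ... + wk*fk (normal equations)."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept term
    n = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, mos)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, n))) / a[i][i]
    return w


def predict_mos(weights, feature_vector):
    """Apply the linear model; clamping to [1, 5] is an assumption of this sketch."""
    raw = weights[0] + sum(w * f for w, f in zip(weights[1:], feature_vector))
    return min(5.0, max(1.0, raw))


# Hypothetical video-average edge statistics and MOS values, one set per encoder.
training = {
    "x264": ([(0.1, 1.0), (0.5, 2.0), (0.9, 0.5), (1.2, 3.0)], [1.7, 3.0, 3.05, 4.9]),
    "JM": ([(0.2, 1.5), (0.6, 0.5), (1.0, 2.5), (1.4, 1.0)], [2.0, 2.6, 4.2, 4.4]),
}
# One model per encoder, so the feature weights can differ between encoders.
models = {enc: fit_linear_model(x, y) for enc, (x, y) in training.items()}
print(predict_mos(models["x264"], (0.7, 1.5)))
```

Keeping a separate weight vector per encoder is the point of the example: the same edge statistics can map to different predicted MOS values depending on which encoder produced the video.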
Subjective Test Data
To review, the data presented in this blog comes from two double-blind subjective tests conducted using our subjective testing methodology. One test included eight expert viewers, while the other test included nine experts. Each test contained twenty different 1080p video clips with varying characteristics, for a total of forty different video clips in all. The video clips ranged between ten and thirty seconds in length. Each video clip was encoded with two different H.264 encoders, a “reference” Encoder A and an “enhanced” Encoder B, and at three different bitrates chosen to be below, near, and above the video quality breakdown bitrate for the video clip. Thus, for each video clip, six different encodings were presented in randomized order to human subjects for scoring, for a total of 240 encodings (forty video clips at six encodings each) across the two tests.
Viewers scored each clip using a 9-grade impairment scale, with scores ranging from 1 to 5 (5 being the best), in increments of 0.5. Mean opinion score (MOS) data for the two tests was derived by averaging viewer scores across viewers. For each of the 240 encodings across the two tests, objective metric values (including PSNR, VMAF, and MOSp) were computed for each frame and then frame-averaged to obtain the overall metric value for the entire encoding.
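The two averaging steps described above can be sketched as follows. The function names and the sample numbers are hypothetical, and the score validation is an added assumption, not a documented part of the test procedure.

```python
from statistics import mean

# The nine allowed grades: 1.0, 1.5, ..., 5.0
VALID_SCORES = {1.0 + 0.5 * k for k in range(9)}


def mean_opinion_score(viewer_scores):
    """Average one encoding's scores across viewers."""
    for score in viewer_scores:
        if score not in VALID_SCORES:
            raise ValueError(f"score {score} is not on the 9-grade scale")
    return mean(viewer_scores)


def frame_average(per_frame_values):
    """Collapse per-frame metric values into one value for the whole encoding."""
    return mean(per_frame_values)


# Hypothetical scores from eight viewers for a single encoding.
print(mean_opinion_score([4.0, 3.5, 4.5, 4.0, 3.0, 4.0, 3.5, 4.5]))
# Hypothetical per-frame PSNR values (dB) for a very short clip.
print(frame_average([38.2, 37.9, 38.5, 38.0]))
```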
MOSp Performance: Correlation with MOS Values
The most basic measure of how well an objective metric correlates with subjective MOS data is to plot one versus the other and observe how “tightly” the resulting scatter plot fits to a line. In our previous blog post, we displayed the scatter plots for all 240 encodings (independent of encoder) and showed that the scatter plot for VMAF vs. MOS was tighter than that for PSNR vs. MOS. Here, Figure 1 compares the scatter plot for VMAF vs. MOS with that for MOSp vs. MOS, this time color-coded by the two encoders, the “reference” Encoder A (in red) and the “enhanced” Encoder B (in blue).
Figure 1. Correlation of VMAF vs. MOS (left) and MOSp vs. MOS (right), color-coded by Encoder (Encoder A in red and Encoder B in blue).
It is clear from inspection of the two plots that the correlation is tighter for MOSp than for VMAF. In addition, the MOSp vs. MOS data points are well-distributed near the MOSp = MOS (“y=x”) line, as they should be, since the MOS predictor metric is trying to predict MOS. The better correlation of MOSp to MOS is corroborated in Table 1, which lists the Pearson correlation coefficients for PSNR, VMAF, and MOSp vs. MOS (the higher the value, the more correlated, with 1 representing perfect positive correlation and 0 no correlation). While VMAF had higher correlations than PSNR, the correlations for MOSp were higher still.
| Metric | Encoder A | Encoder B | Overall |
Table 1. Pearson correlation coefficients for PSNR, VMAF, and MOSp vs. MOS.
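For readers who want to reproduce this kind of measurement on their own data, a minimal sketch of the Pearson correlation coefficient follows. The metric and MOS values in the example are made up for illustration and are not taken from the subjective tests.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical metric values and MOS values for five encodings.
metric_values = [2.1, 2.9, 3.4, 4.0, 4.6]
mos_values = [2.0, 3.0, 3.5, 3.9, 4.8]
print(pearson(metric_values, mos_values))
```

The coefficient ranges from -1 to 1; values near 1 correspond to the "tight" scatter plots described above.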
MOSp Performance: Correlation with MOS Ordering
When using objective metrics for encoder evaluations, however, general correlation of metric values to MOS values is not the whole story. What is more important is whether the objective metric values preserve the relative ordering of encoder performance in A vs. B comparisons. In other words, for a given encoding, if Encoder A has a higher MOS value than Encoder B, does Encoder A also have a higher metric value than Encoder B?
Table 2 shows some example results for two video clips in the subjective test data, AirChicago at 7700 kbps and AlpineCow at 3650 kbps. For AirChicago, Encoder B had a lower VMAF than Encoder A, matching the ordering of the MOS values from the subjective test, whereas MOSp was mismatched, with Encoder B having a higher MOSp value than Encoder A. Additionally, MOSp significantly underpredicted MOS for this video (by 0.4 to 0.8 on the 5-point scale). For AlpineCow, however, VMAF was mismatched relative to the ordering of MOS, with Encoder A having a higher VMAF value than Encoder B, while MOSp maintained the MOS ordering, with Encoder B having a higher MOSp value than Encoder A. Additionally, the MOSp values were highly accurate for this video, with predictions within 0.1 of the MOS values. As can be seen from these sample results, interpretation of MOSp values is easier, as MOSp takes on values according to the same 1-to-5 scale that MOS does.
Table 2. Selected results showing metric values attained by Encoder A and Encoder B for two video clips, AirChicago at 7700 kbps and AlpineCow at 3650 kbps.
The full set of subjective test data contained 120 different comparisons of Encoder A vs. Encoder B (40 different video clips at 3 bitrates each). It should be noted, however, that MOSp was trained on 20 of the videos, which account for the first 60 comparisons. To avoid testing on the same videos that the MOSp models were trained on, the metrics were evaluated on the other 20 videos in the subjective tests, which account for the other 60 comparisons. Table 3 shows, for these other 60 comparisons, how well-correlated VMAF, PSNR, and MOSp were with MOS in terms of “order preservation.” By this measure, PSNR and MOSp outperformed VMAF significantly, matching the MOS ordering 90% and 92% of the time, respectively, while VMAF matched the MOS ordering only 67% of the time.
Table 3. Correlation of VMAF, PSNR, and MOSp to MOS (first row and first column), measured in terms of how often each metric preserved the MOS ordering of which encoder was better.
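Order preservation can be computed along the following lines. This is a minimal sketch: the tuple layout, the function names, and the sample values are hypothetical, and the tie handling (a tie only matches another tie) is an assumption, since the handling of ties is not described here.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def order_preserved(mos_a, mos_b, metric_a, metric_b):
    """True if the metric ranks Encoder A vs. Encoder B the same way MOS does."""
    return sign(mos_a - mos_b) == sign(metric_a - metric_b)


def order_preservation_rate(comparisons):
    """Fraction of A-vs-B comparisons in which the metric matches the MOS ordering."""
    matches = sum(order_preserved(*c) for c in comparisons)
    return matches / len(comparisons)


# Hypothetical comparisons: (MOS for A, MOS for B, metric for A, metric for B).
comparisons = [
    (3.5, 4.0, 80.0, 85.0),  # both say B is better: match
    (4.0, 3.5, 90.0, 92.0),  # MOS says A, metric says B: mismatch
    (3.0, 3.5, 70.0, 75.0),  # match
    (4.5, 4.0, 95.0, 93.0),  # match
]
print(order_preservation_rate(comparisons))
```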
MOSp Performance: Correlation with MOS Gains
As noted in our previous blog, complete encoder evaluations from subjective test data do not simply compare MOS values at different bitrates, since the output bitrates for different encoders may vary slightly for the same encoding bitrate and the overall MOS vs. bitrate behavior for different encoders may vary over multiple bitrates. One way to capture the compression “gain” of one encoder over another is to compute the BD-Rate metric from the respective rate-quality plots of the encoders. Full details of how we compute the BD-Rate metric are beyond the scope of this blog but are contained in our whitepaper on subjective testing methodology. For the purposes of this blog, it is sufficient to describe BD-Rate as a measure of the average bandwidth reduction provided by one encoder over another, averaged over a range of qualities represented in a rate-quality plot.
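To make the idea of an average bandwidth reduction concrete, here is a simplified sketch. It is not the procedure from our whitepaper and not the standard Bjøntegaard calculation, which fits a polynomial to each rate-quality curve. Instead, it assumes piecewise-linear interpolation of log-bitrate as a function of quality, requires quality to increase strictly with bitrate, and reports a positive value when Encoder B needs less bitrate than Encoder A. The function names and sample points are hypothetical.

```python
from math import exp, log


def _interp(x, xs, ys):
    """Linear interpolation of y at x over increasing knots xs (x within range)."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) / (x1 - x0) * (y1 - y0)
    raise ValueError("x outside interpolation range")


def average_bitrate_reduction(points_a, points_b, samples=100):
    """Average fraction of bitrate saved by B relative to A at equal quality.

    points_a, points_b: lists of (bitrate_kbps, quality) pairs, one per encoding.
    """
    qa, la = zip(*sorted((q, log(r)) for r, q in points_a))
    qb, lb = zip(*sorted((q, log(r)) for r, q in points_b))
    lo, hi = max(qa[0], qb[0]), min(qa[-1], qb[-1])  # overlapping quality range
    if lo >= hi:
        raise ValueError("rate-quality curves do not overlap in quality")
    step = (hi - lo) / samples
    # Midpoint rule: average the log-bitrate difference over the overlap.
    total = 0.0
    for i in range(samples):
        q = lo + (i + 0.5) * step
        total += _interp(q, qb, lb) - _interp(q, qa, la)
    return 1.0 - exp(total / samples)


# Hypothetical rate-quality points (bitrate in kbps, MOS) at three bitrates each.
encoder_a = [(1000, 2.0), (2000, 3.0), (4000, 4.0)]
encoder_b = [(800, 2.0), (1600, 3.0), (3200, 4.0)]
print(f"{100 * average_bitrate_reduction(encoder_a, encoder_b):.1f}%")
```

Averaging in the log-bitrate domain keeps a 20% saving at a low bitrate on equal footing with a 20% saving at a high bitrate.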
The full set of subjective test data contained 40 different BD-Rate calculations for the gain of Encoder B relative to Encoder A, corresponding to 40 different video clips encoded at three bitrates each for both Encoder A and Encoder B. It should be noted again, however, that MOSp was trained on 20 of the 40 videos. Again, to avoid testing on the same videos as the MOSp models were trained on, the metrics were evaluated on the other 20 videos in the subjective tests. The BD-Rate values can be sorted into three categories: (1) Encoder A significantly better than Encoder B [BD-Rate less than -5%]; (2) both encoders approximately the same [BD-Rate between -5% and +5%]; and (3) Encoder B significantly better than Encoder A [BD-Rate greater than +5%]. Table 4 shows how often VMAF, PSNR, and MOSp agreed with MOS in terms of BD-Rate gain categories, i.e., how often the respective BD-Rate gains fell into the same category.
Table 4. Correlation of VMAF, PSNR, and MOSp to MOS (first row and first column), measured in terms of how often each metric matched the MOS BD-Rate gain category.
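The category matching behind Table 4 can be sketched as below. The function names and BD-Rate values are hypothetical, and treating values of exactly -5% or +5% as "approximately the same" is an assumption of this example.

```python
def gain_category(bd_rate_percent, threshold=5.0):
    """Sort a BD-Rate value (in percent) into one of three gain categories."""
    if bd_rate_percent < -threshold:
        return "A better"
    if bd_rate_percent > threshold:
        return "B better"
    return "same"


def category_agreement(mos_bd_rates, metric_bd_rates):
    """Fraction of videos where the metric's category matches the MOS category."""
    pairs = list(zip(mos_bd_rates, metric_bd_rates))
    matches = sum(gain_category(m) == gain_category(x) for m, x in pairs)
    return matches / len(pairs)


# Hypothetical per-video BD-Rate values (percent) from MOS and from a metric.
mos_bd = [12.0, -8.0, 3.0, 6.0]
metric_bd = [9.0, 2.0, -1.0, 20.0]
print(category_agreement(mos_bd, metric_bd))
```

Note that two gains of very different size, such as 6% and 20%, still count as agreement because they fall in the same category.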
Again, the ordering of metric performance was preserved: PSNR was better than VMAF, and MOSp was better than PSNR. Thus, when matched against the MOS data, VMAF was not able to identify well which encoder was better (Table 3), nor was it able to identify well by how much one encoder was better than another (Table 4). PSNR performed better than VMAF by both criteria, and MOSp performed best of all three metrics.
To summarize, we presented a new video quality metric called MOS predictor (MOSp) that fits a model of edge-based statistics to mean opinion score (MOS) data in order to predict MOS. MOSp has intuitive meaning, as it is scaled to the same 1-to-5 scale at which MOS is measured. For the subjective test data that we evaluated, we found that MOSp performed better than either PSNR or VMAF, both in terms of encoder-independent correlation with the MOS data and in terms of encoder ordering (which encoder was better) and encoder gain (by how much one encoder was better than the other).
As we noted in the previous blog, it was surprising that VMAF performed worse than PSNR in terms of its ability to match MOS data from actual subjective tests. VMAF is designed to reflect human perception, unlike PSNR, and so should better reflect subjective test scores from human subjects. This is especially true because Netflix used subjective MOS data to train their model for VMAF. Our MOSp metric, also trained on MOS data, did perform better than PSNR in all evaluation metrics.
Why does MOSp perform so much better than VMAF? We noted in the previous blog that it is possible that the MOS data Netflix used to train their VMAF model differs fundamentally from our MOS data, causing the poor correlation of VMAF with our MOS data. Perhaps most importantly, Netflix used non-experts as their subjects while we used (a likely smaller number of) expert observers. We have found in past subjective testing that non-experts give scores that are both more variable (both within subject and across subjects) and less reliable – for example, by inverting their scores for two encodings of the same video with the same encoder but different bitrates. In our testing, such score inversions occur at least three times as often for non-experts as for experts, which would in turn lead to less accurate modeling. The process of modeling video quality by fitting statistics to subjective test scores requires the ability to generate sufficient MOS data quickly and accurately, which is why we worked hard to develop our practical subjective testing methodology (again, see our whitepaper on this topic).
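One simple way to quantify score inversions is to count, for a single viewer, video, and encoder, how many pairs of encodings have the lower-bitrate encoding scored strictly higher than the higher-bitrate one. The sketch below does this; the function name and sample scores are hypothetical, and this exact definition of an inversion is an assumption made for illustration.

```python
def count_inversions(scores_by_bitrate):
    """Count pairs where a lower-bitrate encoding outscored a higher-bitrate one.

    scores_by_bitrate: (bitrate_kbps, score) pairs from one viewer for one
    video and one encoder.
    """
    ordered = [score for _, score in sorted(scores_by_bitrate)]
    return sum(
        1
        for i in range(len(ordered))
        for j in range(i + 1, len(ordered))
        if ordered[i] > ordered[j]
    )


# Hypothetical scores: the 2000 kbps encoding was rated above the 3650 kbps one.
print(count_inversions([(3650, 3.5), (2000, 4.0), (5000, 4.5)]))
```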
We also noted in the previous blog that Netflix seemed to use videos with multiple frame resolutions in its training set, possibly diluting the accuracy of the VMAF model for any single frame size, since the correlation of VMAF to MOS may vary with frame size. By comparison, MOSp is trained on videos at a single frame size, 1080p. Finally, VMAF might do better with encoder comparison metrics if separate models were computed for each encoder, as was done for MOSp in the results presented here. However, the scatter plot on the left-hand side of Figure 1 indicates that there would only be marginal improvement from doing that.
We are continually looking at ways to improve the accuracy of MOSp, including better methods of spatial and temporal pooling of statistics (combining block values into frame-average and video-average values) and better representation of artifacts not well-captured by edge-based statistics.
At EuclidIQ, we have been developing a content-adaptive modeling algorithm called SABRE (signal-adaptive bitrate estimation) that uses the MOSp metric to measure video quality and determine the bitrate needed to achieve the user’s desired MOS quality level. In turn, SABRE is part of EuclidIQ’s new content-adaptive cloud transcoder called Rithm. You can read more about this technology and see a demonstration of it at www.euclidiq.com/rithm. You can also sign up for the upcoming webinar — Rithm: The Content-Adaptive Transcoder With a Human Touch, Wednesday, March 14th from 1:00 – 1:30pm ET.