Subjective Testing Part 4: Post-Test Analysis
Introduction
EuclidIQ’s blog series on subjective testing has presented a “practical” subjective test methodology, designed to produce meaningful results without being prohibitive in time or cost. In Part 1, we detailed the general test setup, including the physical design and the scoring scheme (scoring scale and reference scheme). In Part 2, we described how we select representative video data for testing, first using motion-complexity sectors to categorize content (Part 2a) and then assessing the suitability of content for subjective testing by considering its watchability (Part 2b). In Part 3, we explained how we choose appropriate performance points for subjective testing by determining the video quality-breakdown bitrate specific to each video clip in the test.
This fourth and final part of the series describes the post-test analysis that we perform to quantify the results of our subjective tests. First, we apply a set of subject reliability metrics to identify and remove those test subjects whose scores are likely unreliable. Then, we aggregate the scores of the remaining (reliable) test subjects into mean opinion scores and calculate compression gain (of one encoding type relative to a reference encoding type) as average bandwidth savings over a common quality interval.
Subject Reliability Metrics
To identify which test subjects have produced scores that are likely unreliable, we apply a set of three reliability metrics. Recall that for each video clip in our subjective tests, we encode using two encoding types (referred to here without loss of generality as the “test” and “reference” encoding types) at three bitrates each, with the bitrates based on the video quality breakdown bitrate, as detailed in Part 3. This results in six total performance points per clip.
The first metric is what we call switch percentage. For a given video clip and encoding type, with all other encoding settings being the same, subjects should give a higher score to a higher-bitrate encoding; if this isn’t the case, a switch has occurred. For example, for the reference encodings at 1 Mbits/s, 2 Mbits/s, and 3 Mbits/s for a given video clip, subjects should assign the highest score to the 3 Mbits/s encoding and the lowest score to the 1 Mbits/s encoding. For any set of three encodings (corresponding to three bitrates in a given video for a given encoding type), there are three possible switches: high bitrate with medium bitrate, medium with low, and high with low. A high-low switch may be considered more serious than the other two switch types. The switch percentage is then calculated as the number of switches divided by the number of possible switches. For our external tests conducted with non-expert subjects, we allow a maximum switch percentage of 20%; subjects with higher switch percentages have their scores discarded. For our internal tests, our development team (most of whom may be classified as expert viewers) consistently achieves switch percentages of around 3 to 5%.
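The switch percentage can be sketched in a few lines of Python. This is only an illustration: the function name and data layout are hypothetical, and treating tied scores as non-switches is an assumption, since the description above does not say how ties are handled.

```python
from itertools import combinations

def switch_percentage(score_sets):
    """Fraction of bitrate pairs where the higher bitrate got a lower score.

    score_sets: one list of scores per (clip, encoding type), each ordered
    from lowest to highest bitrate (hypothetical layout).
    """
    switches = 0
    possible = 0
    for scores in score_sets:
        # With three bitrates this yields the three pairs:
        # low/medium, low/high, medium/high.
        for lower_rate_score, higher_rate_score in combinations(scores, 2):
            possible += 1
            # Assumption: a tie is not counted as a switch.
            if higher_rate_score < lower_rate_score:
                switches += 1
    return switches / possible if possible else 0.0

# One subject, two (clip, encoding type) sets: one medium-low switch in six pairs.
example = switch_percentage([[2, 3, 4], [3, 2, 4]])
is_reliable = example <= 0.20  # 20% threshold used for external tests
```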
The second metric is termed (intra-subject) variance percentage and applies to tests where all encodings are scored twice, in two separate “runs.” In this case, subjects should score the same encoding (a given video clip encoded using a given encoding type at a given bitrate) consistently from run to run. If the scores for the same encoding differ by more than 1 from run to run, a scoring variance has occurred. The variance percentage is then calculated as the number of variances divided by the number of possible variances. For our external tests, we again allow a maximum variance percentage of 20%; subjects with higher variance percentages have their scores discarded. For our internal tests, our development team consistently achieves variance percentages of around 3 to 5%.
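A minimal sketch of the variance percentage follows, assuming the two runs are stored as equal-length lists in which position i holds the score for the same encoding in each run. The function name and layout are illustrative only.

```python
def variance_percentage(run1_scores, run2_scores, tolerance=1):
    """Fraction of encodings whose two scores differ by more than `tolerance`.

    run1_scores, run2_scores: scores from one subject, aligned so that index i
    refers to the same encoding in both runs (hypothetical layout).
    """
    if len(run1_scores) != len(run2_scores):
        raise ValueError("both runs must score the same encodings")
    pairs = list(zip(run1_scores, run2_scores))
    # A scoring variance occurs when the run-to-run gap exceeds the tolerance.
    variances = sum(1 for a, b in pairs if abs(a - b) > tolerance)
    return variances / len(pairs) if pairs else 0.0

# Four encodings scored twice; only the second differs by more than 1.
example = variance_percentage([3, 4, 2, 5], [3, 2, 2, 4])
```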
The third metric is termed (inter-subject) difference percentage and measures, for a given encoding (a given video clip, encoding type, and bitrate), the difference between an individual subject’s score and the mean opinion score (MOS) for that encoding. If that difference is greater than 1, a scoring difference has occurred. The difference percentage is then calculated as the number of differences divided by the number of possible differences. We generally use the difference percentage, for both internal and external tests, to monitor whether a given subject is stricter or more lenient than other subjects in the test. A high difference percentage in itself will not disqualify a subject from the subjective test results, as long as the subject’s scores are otherwise consistent (i.e., low switch and variance percentages).
It should be noted that the above three metrics are variations of subject reliability metrics proposed in the ITU-R BT.500 recommendation[1], where the metrics are termed local inversions (a variation of switch percentage) and systematic shifts (a variation of difference percentage). Our metrics are simpler than those found in the standards and tailored to our specific test methodology.
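The difference percentage could be computed along the following lines. This sketch makes two assumptions not stated in the text: the MOS for each encoding is taken over all subjects (including the one being checked), and the difference is compared in absolute value so that both stricter and more lenient subjects are flagged. Names and data layout are hypothetical.

```python
from statistics import mean

def difference_percentages(scores_by_subject, tolerance=1.0):
    """Per-subject fraction of encodings scored more than `tolerance` from the MOS.

    scores_by_subject: dict mapping subject id -> list of scores, with index i
    referring to the same encoding for every subject (hypothetical layout).
    """
    subjects = list(scores_by_subject)
    n_encodings = len(scores_by_subject[subjects[0]])
    # Assumption: MOS is the plain mean over all subjects for each encoding.
    mos = [mean(scores_by_subject[s][i] for s in subjects)
           for i in range(n_encodings)]
    result = {}
    for s in subjects:
        # Assumption: deviation is measured in absolute value.
        differences = sum(1 for i in range(n_encodings)
                          if abs(scores_by_subject[s][i] - mos[i]) > tolerance)
        result[s] = differences / n_encodings
    return result

example = difference_percentages({
    "a": [3, 3, 3, 3],
    "b": [3, 3, 3, 3],
    "c": [5, 3, 3, 1],  # deviates from the MOS on the first and last encoding
})
```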
Calculating Compression Gain
Once subject reliability metrics have been applied and unreliable subjects and their scores have been discarded, the remaining scores are averaged to obtain mean opinion scores (MOS). The MOS values are paired with the corresponding encoding bitrates to obtain rate-quality plots, a variation of the more well-known rate-distortion plots. For the bitrates in the rate-quality plots, instead of using the target (input) bitrate, we measure the actual output bitrate of each encoding from the size in bits of the bitstream and the elapsed time of the video clip (determined from the video clip’s frame rate and number of frames). With two encoding types (a “Reference” encoding and a “Test” encoding) and three bitrates each, the subjective test results for a given video contain six total performance points, as illustrated in Table 1 and Figure 1.
Reference Bitrate (kbits/s)  Reference MOS  Test Bitrate (kbits/s)  Test MOS
987                          1.82           995                     2.32
1489                         2.55           1481                    3.36
1997                         3.32           2055                    3.64
Figure 1: Example Plot of Subjective Test Results
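The measured output bitrate used for the rate-quality plots follows directly from the bitstream size and the clip duration. A small sketch, with a hypothetical function name and assuming the bitstream size is available in bytes:

```python
def measured_bitrate_kbps(bitstream_bytes, num_frames, frame_rate):
    """Actual output bitrate in kbits/s from bitstream size and clip length."""
    duration_s = num_frames / frame_rate  # elapsed time of the clip
    bits = bitstream_bytes * 8            # size of the bitstream in bits
    return bits / duration_s / 1000

# A 300-frame clip at 30 frames/s (10 s) encoded into 1,250,000 bytes.
example = measured_bitrate_kbps(1_250_000, num_frames=300, frame_rate=30)
```

The illustrative call corresponds to 1000 kbits/s.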
The next step is to approximate the rate-quality curves underlying the six data points from the subjective test (three data points per encoding type). This is done by interpolation, under the (relatively safe) assumption that quality (MOS) monotonically increases with bitrate for a given encoding type, all other settings being equal. We find that polynomial interpolation often results in unrealistic rate-quality curves, as seen in Figure 2, where polynomial interpolation of the data from Table 1 produces a “Test” encoding curve that contains an unrealistic “kink.” Thus, we use piecewise spline interpolation, which usually results in the more realistic rate-quality curves seen in Figure 3.
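The text does not say which piecewise spline is used. As one possible illustration, the sketch below implements a monotone piecewise cubic Hermite interpolant (in the style of Fritsch and Carlson, often called PCHIP), which cannot overshoot between data points. The choice of this particular spline, the simple secant slopes at the two endpoints, and all function names are assumptions made for the example.

```python
import bisect

def pchip_slopes(x, y):
    """Knot slopes that keep the cubic Hermite interpolant monotone."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]           # interval widths
    d = [(y[i + 1] - y[i]) / h[i] for i in range(n - 1)]  # secant slopes
    # Assumption: endpoints simply reuse the adjacent secant slope.
    m = [d[0]] + [0.0] * (n - 2) + [d[-1]]
    for i in range(1, n - 1):
        if d[i - 1] * d[i] > 0:  # same sign: weighted harmonic mean
            w1 = 2 * h[i] + h[i - 1]
            w2 = h[i] + 2 * h[i - 1]
            m[i] = (w1 + w2) / (w1 / d[i - 1] + w2 / d[i])
        # otherwise the slope stays 0 (local extremum or flat segment)
    return m

def interpolate(x, y, xq):
    """Evaluate the interpolant at xq, where x is strictly increasing."""
    m = pchip_slopes(x, y)
    i = max(0, min(bisect.bisect_right(x, xq) - 1, len(x) - 2))
    h = x[i + 1] - x[i]
    t = (xq - x[i]) / h
    # Cubic Hermite basis functions on the unit interval.
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * y[i] + h10 * h * m[i] + h01 * y[i + 1] + h11 * h * m[i + 1]

# "Test" encoding points from Table 1: bitrate (kbits/s) -> MOS.
test_rates = [995, 1481, 2055]
test_mos = [2.32, 3.36, 3.64]
curve = [interpolate(test_rates, test_mos, r) for r in range(995, 2056, 5)]
```

The resulting curve passes through the three measured points and stays non-decreasing between them, which matches the monotonicity assumption stated above.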
Given the interpolated curves from Figure 3, we can then calculate the compression gain of the “Test” encoding relative to the “Reference” encoding in terms of average bitrate savings over a common quality interval. This metric is termed Bjøntegaard delta bitrate, or BD-Rate for short, after Bjøntegaard, who originally[2] applied it to PSNR-based rate-quality curves. More recently, Hanhart and Ebrahimi extended[3] the BD-Rate calculation to MOS-based rate-quality curves, as is done here.
In the example from Figure 3, the quality (MOS) interval common to both the “Reference” and “Test” curves is bounded by the lowest MOS value of the “Test” curve (2.32) and the highest MOS value of the “Reference” curve (3.32), so the common quality interval is [2.32, 3.32]. To obtain the average bandwidth required by each encoding, we calculate the area to the left of the rate-quality curve, within the common quality interval, as illustrated in Figure 4 for the “Test” encoding curve (note that the horizontal bitrate axis has been extended all the way back to 0 in Figure 4, to illustrate the full area being calculated). This area can be calculated using simple integration techniques, such as the trapezoidal rule.
If the area to the left of the “Reference” encoding curve is given by AR and the area to the left of the “Test” encoding curve is given by AT, then the BD-Rate bandwidth savings is calculated as (AR - AT) / AR. In the example of Figs. 1-4, the BD-Rate is calculated as 0.29, or 29%, meaning that the “Test” encoding produces an average of 29% bandwidth savings over the “Reference” encoding over the MOS quality interval [2.32, 3.32].
We have used the subjective test methodology presented in the four parts of this blog series to measure the compression gains of our IQ264 technology (the “test” encoding) relative to x264 (the “reference” encoding). Because IQ264 uses perceptual quality optimization that focuses on human perceptual considerations to improve H.264 encoding, its gains are best measured via subjective testing. In the subjective tests referenced in Part 3, where 14 video clips were each scored by 25 to 30 subjects under formal test conditions, the average MOS-based BD-Rate gain of IQ264 over x264 was measured to be 22.6%. This means that IQ264 was seen to provide 22.6% bandwidth savings relative to x264 for equivalent MOS quality.
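The area-based calculation can be sketched as follows. To keep the example short and dependency-free, it interpolates bitrate linearly as a function of MOS rather than with a spline; this is a deliberate simplification, not the method used for the figures. Function names are hypothetical, and the MOS values for each encoding are assumed to be strictly increasing with bitrate.

```python
def rate_at(mos, points):
    """Bitrate at a given MOS by linear interpolation (simplifying assumption).

    points: list of (bitrate, mos) tuples sorted by strictly increasing MOS.
    """
    for (r0, q0), (r1, q1) in zip(points, points[1:]):
        if q0 <= mos <= q1:
            return r0 + (r1 - r0) * (mos - q0) / (q1 - q0)
    raise ValueError("MOS outside the measured range")

def area_left_of_curve(points, q_lo, q_hi, steps=1000):
    """Trapezoidal-rule integral of bitrate over the MOS interval [q_lo, q_hi]."""
    dq = (q_hi - q_lo) / steps
    # min() guards against floating-point drift past the upper bound.
    rates = [rate_at(min(q_lo + i * dq, q_hi), points) for i in range(steps + 1)]
    return dq * (sum(rates) - 0.5 * (rates[0] + rates[-1]))

def bd_rate(reference, test, steps=1000):
    """Average bandwidth savings (AR - AT) / AR over the common MOS interval."""
    q_lo = max(min(q for _, q in reference), min(q for _, q in test))
    q_hi = min(max(q for _, q in reference), max(q for _, q in test))
    a_r = area_left_of_curve(reference, q_lo, q_hi, steps)
    a_t = area_left_of_curve(test, q_lo, q_hi, steps)
    return (a_r - a_t) / a_r

# (bitrate in kbits/s, MOS) pairs from Table 1.
reference_points = [(987, 1.82), (1489, 2.55), (1997, 3.32)]
test_points = [(995, 2.32), (1481, 3.36), (2055, 3.64)]
savings = bd_rate(reference_points, test_points)
```

With linear interpolation, the Table 1 data give savings of roughly 26%, somewhat below the 29% obtained from the spline-interpolated curves, which shows that the choice of interpolation has a visible effect on the result.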
Summary
Because EuclidIQ’s IQ264 technology uses perceptual quality optimization (PQO) that focuses on human perceptual considerations to improve H.264 encoding, its gains are best measured via subjective testing. This four-part blog series has presented the subjective test methodology that we designed so that we could quantify the compression gains from IQ264 in a way that is both practical and meaningful. The subjective test methodology includes several components: the design of the physical setup, selection of the test type and scoring scheme, selection of representative data, determination of challenging performance points (encoding bitrates), and application of post-test metrics.
We welcome your comments, questions, and suggestions regarding our subjective test methodology. If you would like to evaluate our IQ264 technology and see how PQO improves H.264 encoding, please contact us at sales@euclidiq.com for more details.
[1] (ITU-R BT.500-13, 2012, pp. 26, 35-38)
[2] (Bjontegaard, 2001)
[3] (Hanhart & Ebrahimi, 2014)
Bjontegaard, G. (2001). Calculation of average PSNR differences between RD-curves. VCEG-M33. Austin, TX: ITU-T.
Hanhart, P., & Ebrahimi, T. (2014). Calculation of average coding efficiency based on subjective quality scores. J. of Visual Communication and Image Representation, 25(3), 555-564.
ITU-R BT.500-13. (2012). Methodology for the subjective assessment of the quality of television pictures. International Telecommunication Union. Retrieved from http://www.itu.int/rec/R-REC-BT.500/en