Why EuclidIQ Uses Subjective Testing (Part 1)
“We need to do subjective testing.”
This simple yet daunting directive was given by our company CEO, Richard Wingard, in a September 2014 meeting. The engineers in the room recognized the significant challenges involved in subjective testing, but we all agreed it was necessary.
EuclidIQ had been working on cutting-edge video compression technology for several years, and part of that work required making our test results credible to potential customers by testing over a wide variety of videos and at realistic settings.
Following industry-standard practice, we had measured compression gains from rate-distortion curves that used an objective quality measure, peak signal-to-noise ratio (PSNR). But one problem kept coming up: PSNR does not always reflect human perception.
Because compression gain is measured in terms of bandwidth reduction for equivalent quality, the gain numbers can be inaccurate if quality is measured using PSNR.
For example, we might objectively measure, say, a 20% gain (i.e., 20% bandwidth reduction at the same PSNR), but the “same-PSNR” streams might not look the same subjectively to our customers’ video experts, meaning that the subjective gain is something less than 20%.
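To make the "bandwidth reduction at equal quality" measurement concrete, here is a minimal sketch. The function names, the log-bitrate interpolation, and the sample R-D points are all illustrative assumptions, not EuclidIQ's actual tooling or data. The quality axis can be PSNR or a subjective score; the calculation is the same.

```python
import math

def bitrate_at_quality(curve, target_quality):
    """Interpolate the bitrate needed to reach target_quality.

    curve: list of (bitrate_kbps, quality) points, with quality strictly
    increasing as bitrate increases. Interpolation is linear in log-bitrate
    (an assumed convention for this sketch).
    """
    pts = sorted(curve)
    for (r0, q0), (r1, q1) in zip(pts, pts[1:]):
        if q0 <= target_quality <= q1:
            t = (target_quality - q0) / (q1 - q0)
            return math.exp(math.log(r0) + t * (math.log(r1) - math.log(r0)))
    raise ValueError("target quality is outside the measured range")

def bandwidth_reduction(reference_curve, test_curve, target_quality):
    """Fraction of bitrate saved by the test encoder at equal quality."""
    ref_rate = bitrate_at_quality(reference_curve, target_quality)
    test_rate = bitrate_at_quality(test_curve, target_quality)
    return 1.0 - test_rate / ref_rate

# Hypothetical R-D points: (bitrate in kbps, quality score).
reference = [(1000, 34.0), (2000, 37.0), (4000, 40.0)]
enhanced = [(800, 34.0), (1600, 37.0), (3200, 40.0)]

gain = bandwidth_reduction(reference, enhanced, 37.0)
print(f"Bandwidth reduction at equal quality: {gain:.1%}")
```

If the two "equal quality" streams do not actually look the same to viewers, the quality values fed into this calculation are wrong, and so is the resulting gain.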
There are two possible solutions to this problem: (1) measure quality using a different objective metric that better reflects human perception, or (2) move to subjective testing.
The first solution is an active area of study in the video compression industry (see, for example, the survey paper by Chikkerur et al. 2011), but no objective metric to date has captured human perception well enough to be universally agreed upon. Additionally, we ourselves were beginning to work on perceptual quality optimization techniques for improving video compression, work that has since become our IQ264 product line. Every objective metric is built on some model or understanding of how people perceive videos, so there was no guarantee that any such metric would reflect the perceptual qualities our algorithms were trying to optimize.
So the better option was to move to subjective testing, which in any case is the gold standard of video quality analysis. This technical note, then, is the first in a multipart series that explains how our company made the conversion to subjective testing.
Requirements for Subjective Testing
Once we decided to convert to subjective testing, the next step was to develop a subjective testing methodology that was credible, feasible, and useful.
To achieve credible subjective testing results (results that would be accepted as valid and believable in the video compression community), we needed our methodology to adhere closely to accepted standards in all essential areas. Accepted standards documents include the ITU-R BT.500 recommendation (ITU-R BT.500-13 2012) for subjective video quality analysis (VQA) of television video, the ITU-T P.910 recommendation (ITU-T P.910 2008) for subjective VQA in multimedia applications, and the ITU-T P.913 recommendation (ITU-T P.913 2014) for subjective VQA of both Internet and distributed television video “in any environment.”
Unlike the BT.500 and P.910 recommendations, which describe VQA for controlled environments (broadcast or pay TV signals over reliable networks transmitted to an immobile screen in a “quiet and non-distracting” environment), the P.913 recommendation recognizes a “new paradigm of video watching” that includes on-demand video over unreliable networks, transmitted to a variety of devices, many of which are mobile, in often-distracting environments.
Thus, P.913 allows some flexibility in the test setup (including the physical environment, test type, and scoring method), depending on the purposes of the test. However, P.913 recommends at least 35 subjects for “public environment” tests, compared to 24 subjects for “controlled environment” tests.
Constructing a feasible subjective testing methodology, given time and resource constraints, required a balance between stricter adherence to standards for controlled environment testing and more subjects for public environment testing.
One additional requirement specific to our situation was that our subjective testing methodology needed to be useful, by allowing for subjective testing to occur frequently and over a broad set of test videos and conditions. This would enable our R&D team to analyze subjective test results and refine our still-under-development algorithms.
We quickly determined that the credibility and usefulness requirements conflicted, because setting up a credible subjective test—whether under a controlled environment or a public environment—required too much time and resulted in too few data points to provide the frequent feedback needed for algorithm development.
We then concluded that we would conduct two types of subjective tests with separate subjective testing methodologies: (1) external or “formal” subjective tests conducted in a controlled environment with close adherence to accepted standards, designed to produce results that could be credibly reported publicly; and (2) internal or “informal” subjective tests constructed to approximate accepted standards, designed to produce results with sufficient frequency and breadth to aid algorithm development.
With the two different types of subjective testing, we could then make further choices in line with their respective purposes: the external tests would be designed to reflect public opinion and scoring by non-expert viewers when watching the video clips, while the internal tests would be designed to reflect the opinion and scoring of expert viewers.
For the purposes of the external tests, we converted an interior room in the company offices in Concord, MA into a video viewing room (VVR) that met various requirements in the BT.500 and P.910 standards (ITU-R BT.500-13 2012; ITU-T P.910 2008).
To help meet the various luminance requirements in the standards, we blocked off the VVR’s single glass window, painted the walls gray, and purchased two torch lights for the room. The torch lights were positioned to avoid any direct glare on the monitor screens. We then calibrated the monitor displays using a ColorMunki Display colorimeter (X-Rite Color Services 2013) to adhere to the ITU-R BT.709 recommendation (ITU-R BT.709-5 2009).
The ratio of ambient light and light behind the monitors to peak screen luminance was set to be less than 0.15 as suggested in the BT.500 general viewing conditions in a laboratory environment (ITU-R BT.500-13 2012, 3).
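A check of this ratio can be sketched in a few lines. The function name and the luminance readings below are hypothetical; only the 0.15 limit comes from the text above.

```python
def background_ratio_ok(background_cd_m2, peak_screen_cd_m2, limit=0.15):
    """Return True if background luminance stays below `limit` times peak
    screen luminance (both measured in cd/m^2)."""
    if peak_screen_cd_m2 <= 0:
        raise ValueError("peak screen luminance must be positive")
    return background_cd_m2 / peak_screen_cd_m2 < limit

# Hypothetical meter readings, for illustration only.
print(background_ratio_ok(background_cd_m2=20.0, peak_screen_cd_m2=200.0))
```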
We used Apple Thunderbolt displays with 27-inch diagonal screens and screen resolutions of 2560×1440 as our reference monitors. We situated each chair at 30 inches away from the screen, representing 1.67 picture heights, and ensured that the viewing angle from monitor to chair was no greater than 30°.
For the purposes of the internal, informal tests, we decided that we would make use of our R&D team of developers, who are geographically dispersed throughout the United States. Instead of calibrating each individual environment, we gave the developers general guidelines for setting up their viewing areas: the room should be fairly dark, with little outside light coming in; the monitor should have 1080p resolution or better; and the viewing distance should reflect each developer’s typical viewing distance when watching videos on their computer.
The most interesting decision in determining the subjective testing methodology was how to design the subjective test itself. There are three main testing methods, as summarized in the P.910 recommendation (ITU-T P.910 2008, 6-9):
- Absolute Category Rating (ACR), where each video clip is judged according to its quality, independently of other clips;
- Degradation Category Rating (DCR), also known as Double Stimulus Impairment Scale (DSIS), where each video clip is compared against a reference clip and judged according to how much impairment the viewer notices in comparison to the reference;
- Pairwise Comparison (PC), where video clips from different processes (systems or algorithms) are presented in pairs and judged relative to the other video in the pair.
The ACR method usually uses a five-level quality scale:
5 = Excellent
4 = Good
3 = Fair
2 = Poor
1 = Bad.
The DCR method uses a five-level impairment scale:
5 = Imperceptible
4 = Perceptible but not annoying
3 = Slightly annoying
2 = Annoying
1 = Very annoying.
And the PC method uses a seven-level relative impairment scale, comparing the second clip in a pair to the first:
-3 = Much worse
-2 = Worse
-1 = Slightly worse
0 = Same
1 = Slightly better
2 = Better
3 = Much better.
Because our intention was to test an encoder enhanced with our perceptual quality optimization (PQO) algorithms against a reference encoder and measure compression gains against the reference, the PC method seemed on the surface to be the most natural fit for our purposes.
However, we also wanted to measure performance at multiple operating points (e.g., multiple bitrates for target bitrate tests or multiple QP values for QP-mode tests) to form rate-distortion (R-D) curves and measure gains from the R-D curves, and the PC method is not well-suited for that purpose.
The pairwise structure of the PC method enables relative evaluation of a video clip against its pairwise counterpart but not against multiple operating points.
The “double stimulus” or “full reference” nature of the DCR method, where every video clip is compared against an unimpaired reference stream, makes it well-suited for the generation of calibrated R-D curves. However, the process of viewing a reference stream together with every video clip doubles the testing time relative to a single stimulus method such as ACR.
To create valid R-D curves, we needed a minimum of three operating points and two processing streams (our PQO-enhanced encoding and the reference encoding), or six clips, for each video. And we wanted to test a wide range of videos in each subjective test.
Given the above considerations, we decided to use the ACR method for internal tests and the DCR method for external tests. For the internal tests, where it is most important to obtain subjective testing results useful for developing algorithms, ACR is the method that enables evaluation of the most videos in a given amount of testing time. However, to allow for some calibration to reference streams, we departed from the standards recommendations by playing a high-quality reference stream before each set of six clips (three operating points for two processing streams) for each video.
We felt that this “partial reference” setup struck a good balance between the full-reference calibration of the DCR method and the no-reference speed of the (original) ACR method. For the external tests, we felt that the full-reference DCR method provided the most consistent and most frequent calibration for untrained viewers and also the most credibility for a formal test, even though the testing time for each subject would be longer.
Once we identified the physical environment and the testing method, we needed a way to actually run the test. To do this, we created an application to “conduct” the subjective test as double blind, meaning that neither the subject nor the test presenter knows what is being presented and when.
As noted above, video clips are grouped into sets of six associated with each video. In the partial-reference internal tests, for each set of six clips, the application plays a high-quality reference stream first, followed by a randomized presentation ordering of the six clips. In the full-reference external tests, the presentation ordering of each six-clip set is again randomized, but the high-quality reference stream is played before each clip.
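The two presentation orders described above can be sketched as follows. This is an illustration under stated assumptions, not our test application: the function name, the clip identifiers, and the choice to keep videos in their given order are all hypothetical, and the repeated playback used in the external tests is not modeled.

```python
import random

def build_playlist(videos, mode, rng=None):
    """Build a presentation order for one test run.

    videos: dict mapping a video name to its list of clip ids (six per video).
    mode:   "partial" plays the reference once before each set of clips;
            "full" plays the reference before every clip.
    """
    if mode not in ("partial", "full"):
        raise ValueError("mode must be 'partial' or 'full'")
    rng = rng or random.Random()
    playlist = []
    for video, clips in videos.items():
        order = list(clips)
        rng.shuffle(order)  # randomize clip order within the set
        if mode == "partial":
            playlist.append((video, "reference"))
        for clip in order:
            if mode == "full":
                playlist.append((video, "reference"))
            playlist.append((video, clip))
    return playlist

# Hypothetical clip ids: two processing streams at three operating points.
clips = [f"{stream}@{point}" for stream in ("pqo", "ref_enc")
         for point in ("low", "mid", "high")]
videos = {"video_a": clips}

internal_run = build_playlist(videos, "partial", random.Random(1))
external_run = build_playlist(videos, "full", random.Random(1))
print(len(internal_run), len(external_run))
```

Because the application generates the order itself, neither the subject nor the presenter needs to know which clip is playing at any point.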
After each clip, the subject is asked to score the clip according to a nine-level scale (1 to 5, but with additional half-point gradations in scoring, e.g., 1.5, 2.5, 3.5, and 4.5). Subjects are asked to score according to the usual DCR impairment scale descriptions (e.g., 5 = Imperceptible, 4 = Perceptible but not annoying, etc.), in terms of noticeability of artifacts relative to the high-quality reference. Even though the internal tests are set up more in the style of ACR, we have found that the more experienced developers on our R&D team can score accurately according to the DCR impairment scale descriptions, by judging the amount of impairment they notice relative to the high-quality reference viewed once per set of six clips.
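As an illustration of the nine-level scale, the sketch below lists the valid scores and rejects anything off the half-point grid. The names `DCR_LABELS`, `VALID_SCORES`, and `validate_score` are hypothetical and are not taken from our test application.

```python
# DCR impairment descriptions for the whole-number levels.
DCR_LABELS = {
    5: "Imperceptible",
    4: "Perceptible but not annoying",
    3: "Slightly annoying",
    2: "Annoying",
    1: "Very annoying",
}

# Nine levels: 1.0, 1.5, ..., 5.0 (half-point steps are exact in floating point).
VALID_SCORES = [1 + 0.5 * i for i in range(9)]

def validate_score(raw):
    """Convert a subject's input to a float and check it is on the scale."""
    score = float(raw)
    if score not in VALID_SCORES:
        raise ValueError(f"score must be one of {VALID_SCORES}")
    return score

print(validate_score("3.5"))
```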
In the external tests, subjects view the video clips at full speed, without the ability to pause playback, and each clip (and its high-quality reference) is played back twice before scoring.
In the internal tests, subjects are told to view the video clips at full speed but are given the ability to pause playback and restart the video, and clips can be played back a number of times at the discretion of the subject (but at an average of around two playbacks per clip). For internal tests, our video clips have durations ranging from 10 to 30 seconds; for external tests, all clips have a duration of 10 seconds, adhering to the guidelines from the P.913 recommendation (ITU-T P.913 2014, 5).
A future technical note will address our criteria for selecting videos with both diverse semantic content and diverse spatiotemporal characteristics, as well as how we determine the target bitrates at which to test each video.
For both the internal and external tests, in order to avoid visual fatigue, we ask subjects to take a five-minute break after 20 minutes of video viewing. For internal tests, the 20-minute viewing time generally allows for one full run of five videos with six clips each, with each clip repeated twice, plus high-quality playbacks before each six-clip set and a 3-second pause after every clip. For external tests, the 20-minute viewing time generally allows for a full-reference run of three videos with six clips each, with each clip and its corresponding high-quality reference played back twice, plus a 3-second pause after every clip.
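A rough playback-time estimate for one run can be sketched as below. The function is hypothetical, and it assumes that the reference stream has the same duration as the clips and that every clip in a run has the same length. It counts playback and pauses only; time spent entering scores is not included.

```python
def session_seconds(n_videos, clip_s, mode, clips_per_video=6, plays=2, pause_s=3):
    """Estimate playback time (seconds) for one run.

    mode "partial": one reference playback per set, each clip played `plays` times.
    mode "full":    each clip and its reference both played `plays` times.
    A pause of `pause_s` seconds follows every clip.
    """
    if mode == "partial":
        per_video = clip_s + clips_per_video * (plays * clip_s + pause_s)
    elif mode == "full":
        per_video = clips_per_video * (plays * 2 * clip_s + pause_s)
    else:
        raise ValueError("mode must be 'partial' or 'full'")
    return n_videos * per_video

internal = session_seconds(n_videos=5, clip_s=10, mode="partial")
external = session_seconds(n_videos=3, clip_s=10, mode="full")
print(f"internal: {internal / 60:.1f} min, external: {external / 60:.1f} min")
```

With 10-second clips, both estimates come in under 20 minutes, which leaves room for scoring and, in the internal tests, for longer clips.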
For external tests, we conduct pre-test screening of subjects by giving them a vision test and a color blindness test, following two recommendations: ITU-R BT.500-13 2012, 8 and ITU-T P.910 2008, 12. For internal tests, we initially trusted that our developers possessed the requisite vision and color vision, and we indirectly verified this through extensive post-test screening of their test scores.
We initially designed both the internal and external tests with replication, such that each subject scores the entire set of video clips in the test twice, with a different, randomized presentation ordering in each run. Replication reduces the effects of presentation bias, also known as the “order effect,” where scores for a given clip may be affected by the score for the previous clip. Replication also aids in post-test screening of subjects, as we can determine the variance of each subject’s scores from run to run for identical clips.
Within-subject variance is one method of determining which subjects’ scores should be discarded as unreliable. (A future technical note will describe in greater detail our entire process for analyzing subjective test scores.) For the internal tests, after extensive verification of all developers’ scores as reliable, we removed the replication of runs in favor of testing more videos in each test. For the external tests, we removed replication in favor of full-reference testing, because having both would make the testing time for each subject too long.
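One way to compute within-subject variance from two replicated runs is sketched below. The function names, the sample scores, and the threshold are illustrative assumptions; this is not the screening criterion we actually use.

```python
from statistics import mean

def within_subject_variance(run1, run2):
    """Mean per-clip variance between two runs by the same subject.

    run1, run2: dicts mapping clip id -> score. For two values a and b,
    the sample variance is (a - b)**2 / 2.
    """
    clips = run1.keys() & run2.keys()
    if not clips:
        raise ValueError("the two runs share no clips")
    return mean((run1[c] - run2[c]) ** 2 / 2 for c in clips)

def flag_unreliable(scores_by_subject, threshold):
    """Return the subjects whose within-subject variance exceeds `threshold`."""
    return sorted(
        subject
        for subject, (run1, run2) in scores_by_subject.items()
        if within_subject_variance(run1, run2) > threshold
    )

# Hypothetical scores for two subjects over two clips.
scores = {
    "A": ({"c1": 4.0, "c2": 3.0}, {"c1": 4.0, "c2": 3.5}),
    "B": ({"c1": 5.0, "c2": 1.0}, {"c1": 2.0, "c2": 4.0}),
}
print(flag_unreliable(scores, threshold=1.0))
```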
This technical note describes the beginning steps in our company’s conversion to subjective testing: specification of the physical environment, the design of the test itself, and the screening of subjects. An upcoming technical note will describe how we select videos for our subjective tests and determine the appropriate operating points for those videos. A final technical note will describe how we conduct post-test analysis of the raw subjective test scores.
This blog post, the first in a series on subjective testing, was co-authored by Nigel Lee, Chief Science Officer of EuclidIQ, and Katie Cornog, Senior Video Codec Analyst.
For the reader’s benefit, we provide the following bibliography:
Chikkerur, S., V. Sundaram, M. Reisslein, and L. Karam. 2011. “Objective video quality assessment methods: a classification, review, and performance comparison.” IEEE Trans. on Broadcasting 57 (2): 165-182.
ITU-R BT.500-13. 2012. “Methodology for the subjective assessment of the quality of television pictures.” International Telecommunication Union. http://www.itu.int/rec/R-REC-BT.500/en.
ITU-R BT.709-5. 2009. “Parameter values for the HDTV standards for production and international programme exchange.” International Telecommunication Union. https://www.itu.int/rec/R-REC-BT.709-5-200204-I/en.
ITU-T P.910. 2008. “Subjective video quality assessment methods for multimedia applications.” International Telecommunication Union. https://www.itu.int/rec/T-REC-P.910/en.
ITU-T P.913. 2014. “Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment.” International Telecommunication Union. http://www.itu.int/rec/T-REC-P.913/en.
X-Rite Color Services. 2013. Profiling with ColorMunki Display. http://xritephoto.com/documents/literature/en/MonitorSetColorMunkiCalNTK_EN.pdf.