In the first part of our series on Subjective Testing, we detailed the transition in our company from objective testing to subjective testing, including the development of a subjective testing methodology that achieved a balance between being credible (adhering reasonably well to accepted standards), feasible (able to be executed by a limited number of test subjects), and useful (able to be repeated frequently and over a broad set of videos and test conditions).
While Part 1 described the proper set-up of the physical environment for subjective testing—as well as our design of a “partial-reference” test that combines elements of both the absolute category rating (ACR) and degradation category rating (DCR) test types—the second part of the series describes how we select the video clips that we include in our subjective tests.
For the results of any encoder comparison – whether subjective or objective – to be meaningful, the test materials (video clips in the test) must be representative. However, there are multiple criteria for defining “representative” test materials, including semantic categories, motion and complexity characteristics, and encoding difficulty.
Additionally, the chosen video clips must be suitable for subjective evaluation, which brings in further considerations such as clip length, “watchability,” and noticeability of artifacts.
We perform subjective testing for two different purposes: first, to demonstrate the benefits of our perceptual optimization (PQO) technology to current and potential customers; second, to measure our overall performance as we develop and productize our technology.
For technology development and productization, we need a set of test videos that accurately represents the broad range of video content consumed by viewers today. Not only should the videos cover the full content space, in terms of genres and modern editing styles, but they must also be applicable to subjective quality testing in a way that leads to meaningful test results.
In this technical note, we focus on this case and describe how we select content for general viewing.
Representative Content: Classifying Based On Semantic Categories
In typical video quality analysis studies, “representative” test materials are classified based on the diversity of semantic categories, defined by the meaning and purpose of the video clips.
For example, the Video Quality Experts Group (VQEG) stated in a recent test plan, “The test material will be representative of a range of content and applications,” selecting a set of eight semantic categories:
- Movies and movie trailers
- Music videos
- Broadcast news
- Home videos
Semantic category lists, while they often cover a wide range of video types, are problematic because some of the categories are exceedingly broad and may involve full-length videos that consist of many individual clips edited together. Movies, for example, contain many different scene types with varying characteristics, including long establishing shots of city skylines or panoramic natural vistas, juxtaposed against action scenes with fast motion and quick scene cuts, or even dialogue scenes with intense facial close-ups.
Sports is itself a wide category: tennis, basketball, baseball, ice hockey, football, golf, and boxing are all considered sports, but the respective broadcasts of those sports contain very different types of scenes with very different characteristics (e.g., athlete motion, ball/puck motion, the size of the athletes relative to the overall playing field, and the rate of switching camera angles to properly follow the action).
In addition, some of the categories often involve video clips that have one of two issues: they are either too difficult to watch and score, or too simple for viewers to distinguish differences in quality. Including these types of clips leads to inconclusive subjective test results.
In videos with multiple, rapid scene cuts, it is difficult to focus on any particular area or subject of the video, making the overall video difficult to watch. A good example is movie trailers, especially fast-paced, CGI-based trailers, which are meant to be flashy and awe-inspiring and therefore combine fast motion with multiple, rapid scene cuts. Music videos, though not involving as much motion as movie trailers, also typically contain multiple scene cuts.
Conversely, videoconferencing and broadcast news videos often contain a minimal amount of motion set against a stationary background, with most of the video consisting of a person talking to the camera. Such videos are easy to watch but difficult to use in subjective testing to distinguish encoding quality at typical bitrates, as most encoders will produce “equally good” quality for them.
The bottom line is that, while some videos in these semantic categories may be appropriate for subjective testing, the majority are not, so requiring a full representation of videos in each semantic category is problematic. In addition, semantic lists provide no guidance to help select individual segments for subjective viewing once a particular long-form video is selected to represent a category.
For example, if the subject “football” is selected to represent the sports category, which shots of a football game should be used in the test? Should it be the crowd scenes, the kickoff, the slow-motion replay, or the benched player looking morose? Each choice has characteristics that will produce both different encoding performance and a different subjective viewing response.
Representative Content: Classifying Based On Motion and Complexity Sectors
Test sets based solely on semantic category lists often are not “representative” data, so we believe it is better to classify videos according to the characteristics of their content rather than the (semantic) categories of their content. In particular, the motion and complexity characteristics of videos often directly correlate with their encoding difficulty.
For the purposes of this technical note, we consider general, conceptual definitions for motion and complexity. Motion is defined as the temporal displacement in the video from frame to frame, while complexity is defined as the amount of high spatial frequency content in the scene.
Videos containing soft content, with no strong edges or texture, have low complexity (visualize an empty blackboard or a uniformly gray sky), while high-complexity videos contain many regions with significant edge and texture information (imagine a highly detailed scene of a spring forest).
In a previous whitepaper, we detailed the technical challenges of motion and complexity for video encoders, as well as some of the possible solutions. Since our purpose here is to construct representative data sets for subjective testing, we can classify video data according to their motion (high, medium, or low) and complexity (high, medium, or low) characteristics. This results in a total of nine motion/complexity sectors, as illustrated in Figure 1.
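The nine-sector classification can be sketched in a few lines of Python. This is only an illustration: the function names, the cut-off values in MOTION_CUTS and COMPLEXITY_CUTS, and the units of the two scores are all hypothetical and would need to be calibrated against a real clip library.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple

LEVELS = ("low", "medium", "high")

# Hypothetical cut-offs, chosen only for illustration.
MOTION_CUTS = (2.0, 8.0)       # e.g. average motion vector magnitude, in pixels
COMPLEXITY_CUTS = (0.5, 1.5)   # e.g. bits per pixel of an intra-only encode


def level(value: float, cuts: Tuple[float, float]) -> str:
    """Map a score to low / medium / high using two cut-off values."""
    low_cut, high_cut = cuts
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


def sector(motion: float, complexity: float) -> Tuple[str, str]:
    """Return the (motion level, complexity level) sector of one clip."""
    return (level(motion, MOTION_CUTS), level(complexity, COMPLEXITY_CUTS))


def sector_coverage(clips: Iterable[Tuple[float, float]]) -> Dict[Tuple[str, str], int]:
    """Count clips per sector, listing all nine sectors so gaps show up as 0."""
    counts = Counter(sector(m, c) for m, c in clips)
    return {s: counts.get(s, 0) for s in product(LEVELS, LEVELS)}
```

Running sector_coverage over the (motion, complexity) scores of a candidate test set shows at a glance which of the nine sectors are empty or over-represented.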
Generally, it is difficult to distinguish encoder quality for videos in either the high-motion, high-complexity sector or the low-motion, low-complexity sector. Videos are more difficult to encode with perceptually good quality the closer they are to the upper right-hand corner, with high-motion, high-complexity videos the most difficult (all encoders will likely perform relatively poorly). Conversely, videos are easier to encode with good quality the closer they are to the lower left-hand corner, with low-motion, low-complexity videos the easiest (all encoders will likely perform relatively well).
We believe that truly “representative” data sets should include a good distribution of video clips from all nine motion/complexity sectors.
Quantifying Motion and Complexity Sectors
One might ask how motion and complexity can be quantified so that videos can be placed in their proper motion/complexity sector. The ITU-T P.910 recommendation for subjective video quality analysis (VQA) in multimedia applications suggests encoding-independent measures[i] to compute the temporal and spatial characteristics of videos. While these measures have the advantage of not requiring an encoding to compute them, in our experience they are not well correlated with encoding quality.
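For reference, the encoding-independent measures in P.910 are commonly described as spatial information (SI), the maximum over time of the standard deviation of each Sobel-filtered luma frame, and temporal information (TI), the maximum over time of the standard deviation of the difference between successive luma frames. The following is a minimal pure-Python sketch under that reading; the frame representation (nested lists of luma values) and the function names are assumptions made for illustration, and a practical implementation would use an array library for speed.

```python
import math
import statistics
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # rows of luma samples (assumed layout)


def sobel_magnitudes(frame: Frame) -> List[float]:
    """Sobel gradient magnitude at every interior pixel of one frame."""
    height, width = len(frame), len(frame[0])
    out = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            out.append(math.hypot(gx, gy))
    return out


def spatial_information(frames: Sequence[Frame]) -> float:
    """SI: max over frames of the std. dev. of the Sobel-filtered frame."""
    return max(statistics.pstdev(sobel_magnitudes(f)) for f in frames)


def temporal_information(frames: Sequence[Frame]) -> float:
    """TI: max over frame pairs of the std. dev. of the pixel-wise difference."""
    return max(
        statistics.pstdev(
            [b - a for row_a, row_b in zip(prev, cur) for a, b in zip(row_a, row_b)]
        )
        for prev, cur in zip(frames, frames[1:])
    )
```

Both functions expect frames of at least 3x3 samples, and temporal_information needs at least two frames.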
A more suitable way to quantify the amount of motion in a video is to perform a sample encoding and gather statistics for the magnitudes of the motion vectors in the encoding. One can calculate the average motion vector magnitude across the entire video or the median of the average motion vector magnitudes from individual frames. To measure complexity, the video is encoded with intra-frame (I-Frame) encoding and then PSNR or bits-per-pixel (bpp) statistics are gathered from the sample I-frame encoding, where a low overall PSNR (or high bpp) corresponds to high complexity and a high overall PSNR (or low bpp) corresponds to low complexity.
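The statistics described above can be sketched as follows. The sketch assumes the motion vectors have already been extracted from a sample encoding into one list of (dx, dy) pairs per frame, and that the size of an intra-only encode is known; how those values are obtained depends on the encoder and tooling, and all function and parameter names here are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple

MotionVector = Tuple[float, float]  # (dx, dy) in pixels, assumed already extracted


def mean_mv_magnitude(frames: Sequence[Sequence[MotionVector]]) -> float:
    """Average motion vector magnitude across the entire video."""
    mags = [math.hypot(dx, dy) for frame in frames for dx, dy in frame]
    return statistics.fmean(mags) if mags else 0.0


def median_frame_mv_magnitude(frames: Sequence[Sequence[MotionVector]]) -> float:
    """Median of the per-frame average motion vector magnitudes."""
    per_frame: List[float] = [
        statistics.fmean([math.hypot(dx, dy) for dx, dy in frame])
        for frame in frames
        if frame  # skip frames with no motion vectors (e.g. intra frames)
    ]
    return statistics.median(per_frame) if per_frame else 0.0


def bits_per_pixel(encoded_bytes: int, width: int, height: int, frame_count: int) -> float:
    """Bits per pixel of an intra-only encode; higher values suggest higher complexity."""
    return encoded_bytes * 8 / (width * height * frame_count)
```

The median of per-frame averages is less sensitive than the overall mean to a few frames with extreme motion, such as a scene cut, which is one reason to compute both.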
This part of our series on subjective testing describes the process by which our company selects video data sets used in our subjective testing.
In summary, we select videos that cover a good distribution of motion/complexity sectors and are generally “watchable.” We also select videos with good cinematography (good camera focus and tracking, good lighting), a variety of camera angles and camera motion, and a variety of acquisition formats. We avoid videos that are difficult to watch, heavily edited, or have poor quality in the original.
In the case of demonstrations for customers, we first request their representative content. Since they best know the types of content most appropriate for their application, we first conduct tests on video clips that they supply, and sometimes augment these tests with additional content of our own.
For example, if the encoder is to be applied to Blu-ray creation for motion picture distribution, then subjective tests should be run on video content that will match motion picture video in terms of source video acquisition, shot composition, editing style and effects, CGI content, titles, and quality requirements.
We welcome your comments, questions, and even your sample content and settings, should you want to evaluate our PQO technologies. Please contact us at firstname.lastname@example.org for more details.
[Ed note: Both Dane Kottke and Nigel Lee contributed to this blog post. Dane Kottke is the Director of Software Development at EuclidIQ. Nigel Lee is the Chief Science Officer at EuclidIQ.]