Suitable Content for Subjective Testing: Why EuclidIQ Uses Subjective Testing (Part 2b)
If you’ve read the first two parts of our series on subjective testing, you’ll recall that Part 1 detailed our company’s shift in philosophy away from objective testing and toward subjective testing (including the balance among credibility, feasibility, and usefulness) and described the proper set-up of the physical environment for subjective testing.
Part 2 of the series highlighted several criteria for defining representative test materials, including semantic categories, motion and complexity characteristics, and encoding difficulty, for the clips we choose to represent the motion-complexity spectrum.
As an example of how EuclidIQ chooses suitable content for its subjective tests (where we now see a consistent 25% bandwidth savings, at equivalent quality, compared to the reference x264 encoder), this short blog post offers a few practical suggestions.
Finding videos that reside in each of the nine motion and complexity sectors ensures that the encoder is stressed for all combinations of motion and complexity. We further refine the set of videos to ensure that they are suitable for subjective viewing, both in terms of their “watchability” and the consistency of the resulting subjective scoring.
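The coverage check described above can be sketched in code. This is a hypothetical illustration, not EuclidIQ’s actual tooling: we assume each clip has already been given normalized motion and complexity scores in the range 0–1, and the one-third bucket boundaries are placeholders chosen for the example.

```python
# Hypothetical sketch: bucketing candidate clips into the nine
# motion/complexity sectors to verify that every sector is covered.
# Scores and thresholds are illustrative, not EuclidIQ's actual values.

def sector(motion: float, complexity: float) -> tuple:
    """Map normalized motion and complexity scores (0-1) to a 3x3 grid cell."""
    def bucket(x: float) -> str:
        if x < 1 / 3:
            return "low"
        elif x < 2 / 3:
            return "medium"
        return "high"
    return (bucket(motion), bucket(complexity))

# Example candidate list: (clip name, motion score, complexity score).
clips = [
    ("talking_head", 0.10, 0.20),
    ("soccer_match", 0.80, 0.55),
    ("confetti_drop", 0.90, 0.95),
]

# Group clips by sector; empty sectors reveal gaps in the test corpus.
coverage = {}
for name, motion, complexity in clips:
    coverage.setdefault(sector(motion, complexity), []).append(name)

print(coverage)
```

With a full corpus, any of the nine cells left empty flags a combination of motion and complexity for which the encoder would not be stressed.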
Watchable videos are interesting and enjoyable to view during the test, and they are neither too long, nor too short. However, not all watchable videos produce consistent scoring within and across viewers. Videos can be difficult to score because they are chaotic and visually complex (e.g., a school of fish swimming, confetti falling from the sky, swirling water, etc.). Because these videos have no clear subject, subjective scoring of such videos will have greater variability and thus be less reliable, making it difficult to draw quantitative conclusions about encoder performance.
We use the following factors in the refinement process: the video clip length, the video watchability (including the technical attributes of the cinematography), and the consistency and reliability of subjective scoring. This section briefly describes each of these. Other than the video clip length, these considerations are not easy to quantify, so we rely on staff members who have experience in video production to help identify appropriate clips.
Because all video clips are subjectively scored by human subjects after real-time playback, both the P.910 recommendation and the BT.500 recommendation for subjective VQA of television video specify that test clips should be limited to 10 seconds in length. Clips shorter than 10 seconds often do not contain enough content for viewers to easily distinguish encoder performance. Clips longer than 25 seconds are problematic for two reasons. First, they often contain multiple scenes with different performance characteristics, making overall evaluation of the clip more difficult. Second, in a long clip it is hard to remember the relative degradation of an artifact that occurs near the start, which makes quality scoring less accurate. We feel expert viewers can accurately assess video clips between 10 and 25 seconds in length.
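The length screen above reduces to a simple filter. A minimal sketch follows; the clip names and durations are invented for illustration:

```python
# Minimal sketch of the clip-length screen: keep only clips between
# 10 and 25 seconds. Candidate names and durations are illustrative.

MIN_SECONDS = 10.0
MAX_SECONDS = 25.0

def length_ok(duration_s: float) -> bool:
    """True if the clip falls in the acceptable 10-25 s window."""
    return MIN_SECONDS <= duration_s <= MAX_SECONDS

candidates = {"intro": 7.5, "parade": 14.0, "documentary": 31.2}
kept = [name for name, dur in candidates.items() if length_ok(dur)]
print(kept)  # only clips inside the 10-25 s window survive
```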
Videos that are interesting and enjoyable to watch have high “watchability” and make good test material. For example, videos with human faces are highly watchable and important for subjective tests because of the typical viewer’s sensitivity to distortions in faces. On the other hand, some video clips are not watchable because they are physically taxing to watch (e.g., videos taken by shaky or hand-held cameras, videos that capture motion that would normally cause motion-sickness like a point-of-view video from a rollercoaster, or videos with strobing and continually flashing lights). Additionally, sometimes short video segments do not make sense when taken out of the context of a longer video. These types of videos are confusing for viewers because they don’t provide a good reference with which to make an assessment of video quality. Finally, some videos are not suitable because of inappropriate content (e.g., politically-charged or violent videos) that viewers would find objectionable.
Another factor that affects the watchability of video content for subjective testing is the quality of the cinematography. Cinematography can be defined as “the science or art of motion-picture photography.” For the purposes of subjective test content selection, we skip the artistic notions of “narrative and themes” and concentrate on the technical aspects of cinematography. We select videos that show good camera focus and tracking and good scene lighting. Video clips with poor camera work or bad lighting are not easy to watch and can confuse subjective test results because viewers mistakenly interpret poor original source media as compression artifacts. Additionally, we strive in our content selection to include a variety of camera angles and camera motion, as these can alter encoding performance due to variations in motion and spatial frequency characteristics.
The purpose of subjective visual testing is to derive numerical measures of video quality based on human scoring. As such, it is vital to test with videos that produce consistent and meaningful mean opinion scores. Generally, watchable videos with easily identifiable subjects and background are easy for viewers to score consistently. These videos make up the majority of video content in motion pictures, television, advertisements, news, and most sports productions, and they are an important part of our testing corpus. In our experience, there is also a class of video clips that are watchable but difficult to score. These clips might have rapid scene changes or contain visually dense content or show large variations in lighting and contrast. They are sometimes seen in movies, concert footage, and action dramas on TV. These clips are also important because they stress the limits of encoders and help to differentiate encoder performance. We believe that such difficult-to-score videos should be included as part of an overall encoder evaluation process, but should not be a large factor in subjective testing. We consider difficult-to-score videos to be “corner cases,” which we evaluate outside of our subjective testing by using expert viewers.
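One way to operationalize the split between consistently scored clips and corner cases is to look at the spread of viewer scores around the mean opinion score (MOS). The sketch below is a hedged illustration, not EuclidIQ’s actual procedure: the standard-deviation threshold is an assumed value, and real analysis would use many more viewers and account for per-viewer bias.

```python
# Hedged sketch: flag difficult-to-score clips by the spread of their
# 1-5 opinion scores. High-variance clips are routed to expert
# "corner case" review. The threshold is an assumption for illustration.

from statistics import mean, stdev

def screen_clip(scores, max_stdev=1.0):
    """Return (MOS, verdict) for a list of 1-5 opinion scores."""
    mos = mean(scores)
    spread = stdev(scores) if len(scores) > 1 else 0.0
    verdict = "subjective test" if spread <= max_stdev else "corner case"
    return round(mos, 2), verdict

print(screen_clip([4, 4, 5, 4, 3]))   # clear subject, consistent scoring
print(screen_clip([1, 5, 2, 5, 3]))   # chaotic content, scores scatter
```

The first clip yields a trustworthy MOS; the second, despite a similar average, says more about viewer disagreement than about the encoder, which is exactly why we handle such clips separately.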
Other important considerations for content selection include the nature of the source acquisition and post-production work such as color correction and video editing. The acquisition format (camera type, frame size, and frame rate) can have a significant effect on the video’s look. For example, film-based content that has been digitized to video can often have high amounts of film grain. Similarly, the frame rate is important since lower frame rate video will contain higher amounts of motion blur during high-motion sequences. We include a variety of acquisition formats in our content selection. Post-production editing and color correction can drive encoder performance but are often confusing to viewers who watch short clips and do not have the full context of the longer video to give clues about the intent of the effects. We push these types of heavily-edited videos into corner case evaluation.
Finally, it is important when selecting content to be aware of the compression format and compression ratio of the original source video. Evaluating encoder performance when the original video is highly compressed and shows noticeable artifacts is difficult because viewers cannot easily distinguish between artifacts present in the original video and additional artifacts introduced by the encoder. The effects of this are minimized to some extent by using a high-quality reference stream of the video during subjective testing, but we prefer to use the original source video with the highest quality possible and will throw out content that is overly soft or blocky.
We hope you’ve gained insight into how we choose our test clips so that they’re representative and cover a wide range of the motion-complexity spectrum. We welcome your comments, questions, and even your sample content and settings, should you want to evaluate our PQO technologies. Please contact us at firstname.lastname@example.org for more details.
[Ed note: Both Dane Kottke and Nigel Lee contributed to this blog post. Dane Kottke is the Director of Software Development at EuclidIQ. Nigel Lee is the Chief Science Officer at EuclidIQ. Part 3 of this series will describe how to determine the appropriate operating points (bitrates) for the videos we have selected, while Part 4 will describe how we conduct post-test analysis of the raw subjective test scores.]