In a previous article, we discussed how techniques from the field of computer vision (CV) can be used to improve video compression, the process of representing video content more compactly than in its original form. Video compression is what allows high-quality video to be stored efficiently and transmitted over limited-capacity networks. By emulating how the human visual system views and understands a video, CV builds models that identify (detect) important content and follow (track) it across a series of video frames. CV-based modeling lets video processing systems better exploit temporal redundancy in the video, the cornerstone of video compression. To understand why, it is useful to contrast CV-based compression with conventional video compression.

Conventional video compression systems employ a “bottom-up” approach to modeling: divide each video frame into blocks of data, then, for each block, perform a brute-force search for a “good match” in previously coded frames. While this approach is computationally fast, it struggles to achieve good compression for videos with “complex” motion (e.g., fast-moving objects, multiple moving objects, or objects that become obscured by other objects). In these situations, the simple, limited search of conventional compression cannot always find a “good match” for a data block in a previous frame, so that block is relatively expensive to encode.
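To make the “bottom-up” search concrete, here is a minimal Python sketch of full-search block matching, assuming a sum-of-absolute-differences (SAD) cost and a small square search window. The names `sad` and `best_match` are hypothetical, and this is an illustration rather than any particular codec's algorithm; production encoders such as H.264/AVC or HEVC typically use faster search patterns and rate-distortion costs instead of this exhaustive loop.

```python
from typing import List, Tuple

Frame = List[List[int]]  # a 2-D grid of luma samples, indexed [row][column]


def sad(cur: Frame, ref: Frame, bx: int, by: int, dx: int, dy: int, size: int) -> int:
    """Sum of absolute differences between the size x size block at (bx, by)
    in the current frame and the block displaced by (dx, dy) in the reference frame."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_match(cur: Frame, ref: Frame, bx: int, by: int,
               size: int, search_range: int) -> Tuple[int, int, int]:
    """Brute-force search: try every displacement within +/- search_range
    and return (dx, dy, cost) for the lowest-SAD candidate."""
    height, width = len(ref), len(ref[0])
    # Start from zero motion (assumes the block itself lies inside the frame).
    best = (0, 0, sad(cur, ref, bx, by, 0, 0, size))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidate blocks that would fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + size <= width
                    and 0 <= by + dy and by + dy + size <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if cost < best[2]:
                best = (dx, dy, cost)
    return best


# Demo: the current frame is the reference frame shifted one sample to the right,
# so the block at (2, 2) is best predicted from one sample to the left.
ref = [[(x * 7 + y * 13) % 31 for x in range(8)] for y in range(8)]
cur = [[ref[y][x - 1] if x > 0 else 0 for x in range(8)] for y in range(8)]
print(best_match(cur, ref, bx=2, by=2, size=3, search_range=2))  # (-1, 0, 0)
```

When the content of a block moves outside the search window, deforms, or is hidden by another object, no candidate scores well and the lowest SAD stays high, which is the expensive-encoding case described above.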

By contrast, CV-based video compression takes a “top-down” approach to modeling: detect interesting objects (or features) in previously coded frames, track the motion of those objects across a series of previously coded frames, and then predict where those objects should be in the current frame. This is a higher level of modeling than conventional compression in two ways: it does not constrain objects of interest to preset block sizes, and it follows object motion across multiple frames. In so doing, CV-based modeling can identify temporal redundancies in the video that conventional methods miss, thereby improving compression.
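The “top-down” idea can likewise be sketched under simple assumptions: an object is represented by a bounding box of arbitrary size rather than a fixed block, its positions in several previously coded frames form a track, and a constant-velocity model fitted by least squares predicts where it should appear in the current frame. `Box` and `predict_next` are hypothetical names, and the sketch is not meant to represent EuclidIQ's actual method; practical trackers often use richer motion models such as Kalman filters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical detected object: a bounding box of any size, not tied to a block grid."""
    x: float
    y: float
    w: float
    h: float


def predict_next(track: List[Box]) -> Box:
    """Predict an object's box in the next frame from its positions in
    previously coded frames, assuming constant velocity (least-squares line fit)."""
    n = len(track)
    if n == 1:
        return track[0]  # no motion history yet: assume the object stays put
    ts = range(n)  # frame indices 0 .. n-1
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in ts)

    def extrapolate(values: List[float]) -> float:
        v_mean = sum(values) / n
        slope = sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, values)) / denom
        return v_mean + slope * (n - t_mean)  # evaluate the fitted line at t = n

    last = track[-1]
    return Box(extrapolate([b.x for b in track]),
               extrapolate([b.y for b in track]),
               last.w, last.h)  # size is carried over from the latest detection


# Demo: an object moving 4 samples right and 1 sample down per frame.
track = [Box(10, 5, 16, 24), Box(14, 6, 16, 24), Box(18, 7, 16, 24), Box(22, 8, 16, 24)]
print(predict_next(track))  # Box(x=26.0, y=9.0, w=16, h=24)
```

The box keeps its own width and height, reflecting the point that objects of interest need not match preset block sizes, and the prediction draws on the whole track rather than a single previous frame.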

However, the higher-level modeling provided by CV comes at a cost: greater computational complexity. Detecting objects across a larger portion of the video and tracking them across multiple frames both increase the amount of data that must be held in memory and lengthen the computation time needed for modeling. Thus, choosing between CV-based and conventional video compression involves a tradeoff between compression and computation.

Keeping this tradeoff in mind, an “optimum” video compression system could employ a hybrid approach, “turning on” CV-based modeling where the data warrants it and reverting to conventional modeling everywhere else. CV-based modeling would be enabled for portions of the data where conventional compression performs poorly, so that the compression gain justifies the extra computational expense. Where conventional compression performs “well” (and defining “well” properly is important), CV-based modeling is unlikely to provide enough compression gain to justify the extra computation. At EuclidIQ, researchers are working to develop a hybrid video compression system that integrates the power of CV-based modeling with the simplicity of conventional compression, providing a technological solution to the next generation of video processing challenges.
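A hybrid encoder ultimately needs a per-region switch. The following sketch assumes abstract costs where lower is better (for example SAD or estimated bits), a hypothetical `good_enough` threshold standing in for the definition of “well”, and a hypothetical `min_gain` standing in for the saving that must justify the extra computation. The CV model is passed in as a callable so that it only runs when the conventional result is poor; none of these names come from EuclidIQ's system.

```python
from typing import Callable


def choose_mode(conventional_cost: float,
                run_cv_model: Callable[[], float],
                good_enough: float,
                min_gain: float) -> str:
    """Hypothetical per-region mode decision for a hybrid encoder.
    Costs are abstract (e.g., SAD or estimated bits); lower is better."""
    # Conventional search already performs "well": skip CV modeling entirely.
    if conventional_cost <= good_enough:
        return "conventional"
    # Conventional search performed poorly: pay for CV modeling...
    cv_cost = run_cv_model()
    # ...but keep its result only if the saving justifies the extra computation.
    if conventional_cost - cv_cost >= min_gain:
        return "cv"
    return "conventional"


# Demo with a stand-in CV model that records how often it is called.
calls = []


def fake_cv_model() -> float:
    calls.append(1)
    return 40.0


print(choose_mode(50.0, fake_cv_model, good_enough=100.0, min_gain=20.0), len(calls))   # conventional 0
print(choose_mode(500.0, fake_cv_model, good_enough=100.0, min_gain=20.0), len(calls))  # cv 1
```

Evaluating the CV model lazily is the design choice that matters here: in regions where conventional compression already performs well, the expensive modeling is never run at all.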

This article first appeared on BostInno.
