Visual Search and its Application to Video Compression


The secret to any useful visual search is the ability to distinguish between what is important (the target) and what is not (the distractors). While the subject has been studied for over forty years, new technology, such as smartphones and mobile apps, requires transforming and adapting how we achieve that distinction.

Visual searches have traditionally relied on human vision: you can easily find a red square in a sea of green circles or (eventually) find Waldo. But, like subjective measurements of video quality, human searches for specific items are time-consuming and inefficient. Thus the need for automated search algorithms arose.

Merchandisers have been using a combination of consumer behavior psychology and visual search paradigms for over a decade to increase sales and profits, drawing on eye-tracking techniques that record fixations and saccades. The London start-up Snap Fashion, for instance, uses its proprietary search engine in its mobile app to let customers look for specific items of clothing in its online collection using images instead of words or metadata.

Amazon’s Android-based smartphone, the Fire Phone, allows developers to experiment with Firefly software development kits (SDKs) to build plug-ins that recognize real-world objects. Drawing on Amazon’s existing database of physical and digital content, Firefly can match millions of items, including TV episodes, QR codes, songs, and products, against search requests that use image, text, and audio recognition technologies.

These examples focus on searches based on static information. But how can visual searches be applied to a series of dynamic images, such as a video? What additional challenges and complications does a series of moving images bring?

Researchers at the University of Oxford have created Video Google for searching within a video. The aim is to retrieve objects with the “ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.” The researchers describe “an approach to object retrieval, which searches for and localizes all the occurrences of an object in a video, given a query image of the object….This requires a visual analogy of a word, which is provided…by vector quantizing the region descriptors….The result is that retrieval is immediate, returning a ranked list of shots in the manner of Google.”
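To make the analogy concrete, the quoted pipeline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Oxford implementation: it assumes local region descriptors (such as SIFT vectors) have already been extracted from each shot, and the function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=1000):
    """Vector-quantize region descriptors pooled from the whole library:
    each cluster center plays the role of a 'word' in text retrieval."""
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def bag_of_visual_words(vocab, shot_descriptors, n_words=1000):
    """Describe one shot by its visual-word histogram, just as a text
    document is described by its word counts."""
    words = vocab.predict(shot_descriptors)
    return np.bincount(words, minlength=n_words).astype(float)

def rank_shots(query_hist, shot_hists):
    """Rank database shots by cosine similarity to the query histogram,
    returning a Google-style ranked list of shot indices."""
    db = np.asarray(shot_hists)
    sims = db @ query_hist / (
        np.linalg.norm(db, axis=1) * np.linalg.norm(query_hist) + 1e-9
    )
    return np.argsort(-sims)
```

Given a query image, the same vocabulary quantizes its descriptors into a histogram, and `rank_shots` returns every shot ordered by similarity, which is what makes retrieval “immediate” once the index is built.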

Searching for static images, or using words to define the search criteria, seems inherently easier than looking for images within content that changes constantly or nearly constantly. When the database consists of a series of videos instead of a series of still shots, the challenge is multiplied. Nonetheless, this variation on visual search has broad application for indexing video databases based on visual content.

The main challenge in visual search is to locate and identify different, significant occurrences of the same visual elements throughout a video or across multiple videos. This is accomplished by finding good visual descriptors: compact but accurate representations of each element. Descriptors must be compact because the cost of storing and searching through raw video data is usually prohibitive. Descriptors must be accurate representations of the corresponding video element, or the search process will either miss occurrences of the element or incorrectly associate different elements.

There are many types of visual element descriptors. Some represent, for example, an element’s color, texture, shape, motion, or location; others identify specific types of elements, such as faces or vehicles. Specific statistics used to generate descriptors include color histograms, wavelet or Fourier-transform coefficients that capture frequency content, and image gradients that identify edges in the video frame. Because visual elements may change slightly over the course of a video and, especially, across different videos, their descriptors must be robust enough to associate multiple, slightly different occurrences of the same element without mistakenly associating occurrences of other elements.
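To make those trade-offs concrete, here is a minimal sketch of one of the simplest descriptors mentioned above, a normalized color histogram, together with a tolerance-based match test. The function names and the 0.25 tolerance are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def color_histogram_descriptor(frame, bins=8):
    """Compact descriptor: a joint RGB histogram with bins**3 entries
    (512 floats for bins=8), normalized so that frames of different
    sizes stay comparable. `frame` is an (H, W, 3) uint8 array."""
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3).astype(float),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.ravel()
    return hist / hist.sum()

def same_element(desc_a, desc_b, tolerance=0.25):
    """Robust association: treat two occurrences as the same element when
    the L1 distance between their descriptors is small, so slight changes
    in appearance across frames or videos do not break the match."""
    return np.abs(desc_a - desc_b).sum() < tolerance
```

Note the compactness: 512 floats stand in for an entire frame, and the tolerance is what buys robustness, at the risk of occasionally associating distinct elements.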

The following graphic provides examples of typical queries against a ground truth database.

[Figure: example query images and the ground-truth database they are matched against]

As noted above, successful matching of a query element with the appropriate element from the database requires the use of descriptors that are compact, accurate, and robust.

The utility of visual search in video processing is not limited to the search problem itself; visual search can actually aid in video compression. “Significant improvements in video coding efficiency have been achieved with modern hybrid video coding methods such as [the video compression format] H.264…in the last two decades. Spatial and temporal redundancy in video sequences has been dramatically decreased by introducing intensive spatial-temporal prediction, transform coding, and entropy coding. However, to achieve better compression performance, reducing such kind of so-called objective redundancy is limited and highly complex in computation.” An effective video search algorithm, executed prior to or in conjunction with a traditional video encoder, can identify different occurrences of the same element and both simplify and target the encoder’s basic motion estimation process. Because video compression fundamentally seeks to exploit temporal redundancies in the video, visual search, which identifies just such redundancies, provides a “top-down approach” to supplement or improve the “bottom-up” search for temporal redundancies during basic motion estimation.
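One way to picture this coupling is sketched below: a conventional exhaustive block search (the “bottom-up” part) paired with a hypothetical descriptor index that, when it recognizes a block as a previously seen element, seeds the search at that element’s last known position and shrinks the window (the “top-down” part). This is an illustration of the idea, not any encoder’s actual design; `element_index`, `word_id`, and the window radii are assumptions.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences: the cost that block-based motion
    estimation typically minimizes."""
    return np.abs(a.astype(int) - b.astype(int)).sum()

def motion_search(ref, block, cy, cx, radius):
    """Exhaustive 'bottom-up' search for the best match to `block`
    inside a +/- radius window around (cy, cx) in the reference frame."""
    h, w = block.shape
    best_cost, best_pos = float("inf"), (cy, cx)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            if 0 <= y <= ref.shape[0] - h and 0 <= x <= ref.shape[1] - w:
                cost = sad(ref[y:y + h, x:x + w], block)
                if cost < best_cost:
                    best_cost, best_pos = cost, (y, x)
    return best_pos, best_cost

def seeded_search(ref, block, cy, cx, element_index, word_id):
    """'Top-down' seeding: if the visual-search index maps this block's
    visual word to an element seen earlier, start the motion search at
    that element's position with a small window; otherwise fall back to
    a wide full search."""
    hit = element_index.get(word_id)  # hypothetical map: {word_id: (y, x)}
    if hit is not None:
        return motion_search(ref, block, hit[0], hit[1], radius=4)
    return motion_search(ref, block, cy, cx, radius=32)
```

The payoff in this toy setting is the search-window size: a radius-4 window around a good seed examines 81 candidate positions versus 4,225 for the radius-32 fallback, which is the sense in which a top-down match simplifies and targets motion estimation.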

The compression and search problems have always been closely linked through mature database and search applications; they have recently become more closely linked within video compression itself. Many believe that the next breakthrough in video compression will come through the successful marriage of the efficient, bottom-up motion estimation and compensation algorithms already found in state-of-the-art video encoders and the computer-vision-based, top-down approaches of visual search that identify and associate visual elements in the videos.

 

To learn more about H.264 and how improved encoding efficiencies provide a competitive advantage on H.264 platforms, download our white paper by clicking here.

