Monday, December 21, 2009

Why is visual search so hard to do?

Superfish is centered on visual search—the use of images, rather than text, as information retrieval queries. Our vision is, no pun intended, to connect any real-world object, through its captured images, to all relevant information.

Robust visual search—retrieving relevant information on any image—is really hard to do. Many researchers, in academia and industry, are working to crack the visual search nut. Google has recently released Goggles in their labs, and Microsoft’s Bing has included a mode of visual search for some time. Dozens of other research teams and start-ups are busy working on this problem. Yet, the state of the art is not quite there. Why is it so hard? Why can’t computers perform a task so simple for us humans?

Human ability to sense and recognize images is the result of hundreds of millions of years of natural selection. Computer-based machine vision and visual search have been around for merely tens of years. In their quest for robust visual search, researchers are faced with several hurdles.

The first of these is that, contrary to the title of a recently popular book, the world is not really flat. Visual search for flat objects already exists, and it is not bad at all. For example, there are pretty good optical character recognition (OCR) and bar-code readers in use today. But we live in a three-dimensional world where objects take on dissimilar visual forms when viewed from different angles. The same shoe looks completely different from the front, back, side, top and bottom. While even a young child can abstract a real-world object from its myriad appearances, computers can only compare images by their apparent features. Superfish employs algorithms that handle complex geometries to recognize an object regardless of the angle from which the image was captured.


The first image is the query image of a 3D object. The next two are results from our algorithm.
Notice how we are able to find similar items even when the object is rotated.




Secondly, unless professionally shot, most pictures have a lot of “noise.” The resolution may be poor, the lighting dim, the main subject may be partially covered, or there may be dominant background elements. We describe such images as having a low signal-to-noise ratio. It would be nice if every object were photographed with perfect lighting on a white background, but the great majority of images captured by camera or cell phone have a lot of confusing “noise.”

Some visual search algorithms employ rules of thumb, or heuristics, that try to ferret out visual features that are unique to specific classes of images, but such heuristics significantly limit the scope and scale of visual search. Because our goal is to enable visual search from common images to very large indexes of common images, we had no choice but to develop algorithms that rely on mathematics and statistics, rather than heuristics, to extract the signal from its noisy environment.

The query image (left) and the result from our algorithms (right).
Notice how both images have background and lighting "noise."




Third, image search features offered on Google and Microsoft’s Bing are heavily dependent on keyword tags and other text-based image-related descriptors. Such tags are useful in classifying images into categories such as people, places and objects, and subsequently can augment ranking and filtering algorithms. But sadly, most images captured from the real world are not tagged, thus rendering them invisible to search engines that rely on tags.

More importantly, many objects we search for simply cannot be described with words. A specific pattern in a curtain fabric, for example, may be indescribable with words, even for fabric experts. While in our minds we may not have words for the great dictionary of visual features, such a vocabulary can in fact be extracted and used for visual search. While we certainly appreciate and leverage textual, temporal and locational tags as a supplement, we rely primarily on visual vocabularies to deliver the most relevant results.
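To make the idea of a visual vocabulary more concrete, here is a minimal sketch of the generic "bag of visual words" technique. It is only an illustration and not a description of our own algorithms. It assumes each image has already been reduced to a handful of numeric feature descriptors, and that a small vocabulary of representative descriptors is given. The two-dimensional descriptors, the three-entry `VOCABULARY`, and all function names are made up for the example.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to the descriptor."""
    best_index, best_dist = 0, float("inf")
    for i, word in enumerate(vocabulary):
        # Squared Euclidean distance is enough for picking the nearest entry.
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def word_histogram(descriptors, vocabulary):
    """Count how often each visual word occurs among an image's descriptors."""
    hist = [0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1
    return hist

def cosine_similarity(a, b):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical vocabulary and descriptors, invented for this example.
VOCABULARY = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
striped = [(0.9, 0.1), (1.1, -0.1), (0.1, 0.0)]
striped_again = [(1.0, 0.05), (0.95, 0.0), (0.0, 0.1)]
dotted = [(0.0, 0.9), (0.1, 1.1), (0.05, 1.0)]

h_striped = word_histogram(striped, VOCABULARY)
h_again = word_histogram(striped_again, VOCABULARY)
h_dotted = word_histogram(dotted, VOCABULARY)

print(cosine_similarity(h_striped, h_again))   # close to 1: same visual words
print(cosine_similarity(h_striped, h_dotted))  # 0.0: no visual words in common
```

Two images of the same pattern map to the same visual words even though their raw descriptors differ slightly, so no textual tag is needed to match them.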

Lastly, size matters. A robust visual search solution must index and process billions of images quickly and accurately, and must handle image queries in real time. All this processing is very expensive, both in server costs and in time. Engineering a solution that operates on arbitrary and “noisy” images to deliver high relevance at low latency on an unbounded index is a significant challenge. While our first application, Window Shopper, demonstrates visual search on a small subset of all the images one can think of, our algorithmic work has been based, from day one, on scalability and high performance.
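One common way to keep query latency low as an index grows is an inverted index, which maps each feature to the images that contain it so that a query only touches images sharing at least one feature with it. The sketch below shows that general idea under simple assumptions and is not our production design. It assumes every image has already been reduced to a list of integer feature IDs, and the image names, IDs, and function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_index(image_features):
    """Map each feature ID to the set of images that contain it."""
    index = defaultdict(set)
    for image_id, features in image_features.items():
        for feature in features:
            index[feature].add(image_id)
    return index

def candidates(index, query_features):
    """Rank images by how many distinct query features they share."""
    votes = Counter()
    for feature in set(query_features):
        # Only images listed under this feature are ever visited.
        for image_id in index.get(feature, ()):
            votes[image_id] += 1
    return votes.most_common()

# Hypothetical catalog: image name -> feature IDs found in that image.
catalog = {
    "shoe_a": [3, 7, 7, 12],
    "shoe_b": [3, 15],
    "curtain": [21, 22],
}

index = build_inverted_index(catalog)
ranked = candidates(index, [3, 7, 40])
print(ranked)
```

In this toy catalog the query never visits the curtain image at all, which is the property that matters when the index holds billions of images rather than three.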

As we move forward with our visual search technology, these are a few of the key challenges we tackle. While we’re not there yet, we believe our foundation is sound and deep. Our goal is a world where anyone can connect any image to relevant information through visual search.
