Is a picture worth a thousand search words?

Written by Catherine Bolgar

Selecting the right Internet search words can be frustrating. But thanks to broader bandwidth and better picture-recognition technology, future searches may be image- or video-driven.

“There’s a long history of search engines that have tried to use images,” says Greg Sterling, vice president of strategy for the Local Search Association, an industry association of media companies, agencies and technology providers. “Visual search was seen as more directly delivering information than text. Maybe it was a technology thing or timing thing, but they didn’t quite find the right model.”

As smartphones began reshaping the Internet landscape—some 340 million were shipped in the second quarter of 2015 alone—earlier visual search engines such as Grokker, Viewzi and SearchMe foundered. Yet the proliferation of smartphones and tablets may have increased demand for visual search, because their small screens are better suited to pictures than to text.

“Visual is definitely one path forward for search,” Mr. Sterling says. At the moment, when searching for a particular product, “unless you have a specific brand name, it’s hard and frustrating clicking back and forth to different sites.”

An image search “will confirm quickly if it’s what you’re looking for, plus provide customer reviews and other product information,” Mr. Sterling says.


However, image search is not so straightforward. You take a photograph and use it to search for related information, but success depends on the angle, lighting and focus of the photo.

“In the future, maybe it will be the case where you snap a picture of a landmark and get all the information about it,” he says. “What’s open for improvement is using a camera to get information. Inputting a 16-digit credit card number into a small screen on a phone is problematic. You mistype. Today, you can take a picture of the credit card and certain apps will recognize it and process it into the form.”

Images by themselves probably aren’t the future. “Look for a mix of images and structured data, finding what images are, finding other related things and organizing that information with tags and other data,” Mr. Sterling says. “There’s more and more sophistication in how you identify and index, with machine learning and other technology that exists behind the scenes that could apply to a pure text or image model.”

Researchers are working to improve the technological foundations of image search. A group of universities is developing ImageNet, a database of 14 million images, each labeled with the nouns it depicts.

Meanwhile, Lorenzo Torresani, associate professor of computer science at Dartmouth College in New Hampshire, has helped create a machine-learning algorithm that uses images to find documents. However, only a few users annotate their uploaded pictures and videos, and not necessarily accurately. “The repository is expanding at an astonishing rate, but we can’t retrieve content efficiently,” Dr. Torresani says.

Software can check whether searched-for objects are in a picture and, if so, automatically tag them. “It works, but has limitations,” Dr. Torresani says. “It’s difficult to expose all the content in the picture with predefined classes. And if you use predefined classes, then the search is only accessible through those keywords.”
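To make the predefined-class approach concrete, here is a minimal sketch in Python that tags a photo with an off-the-shelf classifier trained on ImageNet’s 1,000 noun categories. The model choice, file name and probability threshold are illustrative assumptions, not Dr. Torresani’s system; anything in the photo that falls outside the predefined label set simply goes untagged, which is the limitation he describes.

```python
# A minimal sketch of tagging with predefined classes, using an off-the-shelf
# classifier trained on ImageNet's 1,000 noun categories. Model, file name
# and threshold are illustrative assumptions only.
import torch
from PIL import Image
from torchvision import models, transforms

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.eval()
labels = weights.meta["categories"]  # the 1,000 predefined class names

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def tag_image(path, top_k=5, min_prob=0.10):
    """Return (label, probability) tags for one photo; content outside the
    predefined label set goes untagged -- the limitation quoted above."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1)[0]
    top = torch.topk(probs, top_k)
    return [(labels[i], round(p.item(), 3))
            for p, i in zip(top.values, top.indices) if p.item() >= min_prob]

print(tag_image("photo.jpg"))  # hypothetical file, e.g. [("golden retriever", 0.82)]
```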

Another approach is to extract visual features, a kind of visual signature, that let users search by example. Alternatively, because users are accustomed to searching via text, software could translate keywords into a visual signature. This would work like language-translation software, but translating from text to images instead.

“It could be used to find images or videos that are similar in context or appearance, and link them somehow,” Dr. Torresani says. “It could make the repositories browsable.”
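A rough sketch of what such a visual-signature index could look like, using an open CLIP-style image-text embedding model via the sentence-transformers library. The article names no specific software, so the model and file names here are assumptions; the point is that one vector space answers both query-by-example and keyword queries, which is the text-to-image translation Dr. Torresani describes.

```python
# A sketch of a "visual signature" index: images and text are embedded into
# one shared vector space (an open CLIP-style model; an assumption, since the
# article names no specific software), so a photo or a keyword can be the query.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Index: one signature vector per repository image (hypothetical files).
paths = ["cat.jpg", "beach.jpg", "tower.jpg"]
signatures = model.encode([Image.open(p) for p in paths])

# Query by keyword: the text is "translated" into the same visual space,
# then matched against the stored signatures. An image query works the same
# way -- encode the example photo instead of the string.
query = model.encode("a landmark tower at night")
for hit in util.semantic_search(query, signatures, top_k=2)[0]:
    print(paths[hit["corpus_id"]], round(hit["score"], 3))
```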

Video is the bigger challenge. “One second of video has 30 images,” he says. “The amount of data we need to analyze a one-minute video is huge. Storage is a problem. Retrieval is a problem. Processing is a problem.”
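Some back-of-the-envelope arithmetic shows why. At 30 frames per second, a one-minute clip is 1,800 still images; the resolution below is an assumed 720p, purely to put a number on the raw pixel data.

```python
# Back-of-the-envelope arithmetic behind the quote: frame count and raw pixel
# data for a one-minute clip. Resolution is an assumption (720p RGB).
fps, seconds = 30, 60                    # "one second of video has 30 images"
width, height, channels = 1280, 720, 3   # assumed 720p, 3 bytes per pixel

frames = fps * seconds
raw_bytes = frames * width * height * channels
print(f"{frames} frames, about {raw_bytes / 1e9:.1f} GB of raw pixels")
# -> 1800 frames, about 5.0 GB -- before storage, retrieval or processing
```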

Yet “even if the recognition process fails on one or two images, we have so many of them and the view maybe changes and the object that was ambiguous becomes clearer later in the video,” Dr. Torresani says. “From that point of view, video is easier than a still image.”


Photos courtesy of iStock

Catherine Bolgar is a former managing editor of The Wall Street Journal Europe, now working as a freelance writer and editor with WSJ. Custom Studios in EMEA. For more from Catherine Bolgar, along with other industry experts, join the Future Realities discussion on LinkedIn.