‘Deep Web’ dive and ‘social media’ search

December 23, 2009 09:17 pm | Updated November 13, 2021 09:44 am IST - Chennai

With an overcast sky, the prospect of an outdoor interview looks bleak. And to add to my qualms, daylight filtering in through the trees in the Nageswara Rao park, at 7.30 am, is apparently not enough for the camera to fire up a few stills without the flash. As if in bold contrast, though, my guest Ravi N. Raj, VP-Products & GM, Kosmix Sites, US (www.kosmix.com) is all fired up, about search and content, about the consumers and their interests.

Kosmix is a guide to the Web, Ravi begins. Offering ‘a 360-degree view for a topic by aggregating facts, images, videos, discussions, social networks, blogs and much more about the topic,’ the company currently receives over 15 million visits on its two sites, Kosmix.com and RightHealth.com, he adds. Try ‘mamallapuram’ or ‘marina_beach,’ ‘kamal hasan’ or ‘carnatic music,’ invites Ravi.

Founded by Venky Harinarayan and Anand Rajaraman (whose earlier venture Junglee was sold to Amazon in 1998) and Silicon Valley-based Cambrian Ventures, Kosmix has Time Warner, Accel Partners and Lightspeed providing the funding.

Over the next ninety minutes or so, at good speed, Ravi covers topics ranging from rich content to the Deep Web, from tagging to texting, from social media to search research. And our conversation continues over email…

Excerpts from the interview.

How good is Web search as a lead indicator?

Web search queries are an excellent lead indicator of future trends. Examples of top queries conducted by users on search engines this decade include “Google”, “ipod”, “Harry Potter”, “MySpace”, “Facebook” and “Twitter.” Not coincidentally, all of these have turned out to be phenomena in terms of their impact on consumer behaviour.

Many of these queries topped the search terms list long before these companies and products became household names. The list of top search terms is a great indicator of future success of products and services because consumers tend to search for them in large numbers before investors and the stock market get wind of them. Looking at the top search terms over the past 3-6 months can, therefore, provide the ability to predict consumer trends and behaviour.

On the other hand, real-time search terms (i.e. top search terms in the last few minutes or hours) give you an indicator of breaking news stories and events that are happening around the world. For example, when the Mumbai terrorist attacks happened last year, one of the first places where the story broke was on Twitter, where it spiked as the #1 search term within minutes of the event happening.

Good resources for tracking top search terms include Google Trends, Yahoo Buzz Index, Twitter and Hitwise. Incidentally, the most searched for term this year is “Facebook”, according to Hitwise.
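The idea of real-time trends boils down to counting search terms inside a short sliding window. A minimal sketch, with an entirely hypothetical stream of timestamped search events (the terms and timings are illustrative, not real data):

```python
from collections import Counter
from datetime import datetime, timedelta

def trending_terms(events, now, window=timedelta(hours=1), top_n=3):
    """Count search terms seen within `window` of `now`; return the most frequent."""
    recent = Counter(term for ts, term in events if now - ts <= window)
    return recent.most_common(top_n)

# Hypothetical stream of (timestamp, search term) events.
now = datetime(2009, 12, 23, 12, 0)
events = [
    (now - timedelta(minutes=5), "facebook"),
    (now - timedelta(minutes=10), "facebook"),
    (now - timedelta(minutes=20), "twitter"),
    (now - timedelta(hours=3), "ipod"),  # outside the window, ignored
]
print(trending_terms(events, now))  # [('facebook', 2), ('twitter', 1)]
```

Widening the window from minutes to months is what turns this from a breaking-news signal into the longer-term trend indicator described above.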

What have been the trends in search over the years?

The first few search engines on the Web in the mid-1990s included WebCrawler, Lycos, Excite, Inktomi and AltaVista. These search engines crawled pages on the Web and built a queryable index. These search engines largely used keyword-based methods for ranking search results.

In other words, they would rank pages by how often the search terms occurred in the page, or how strongly associated the search terms were within each resulting page. Web publishers were able to easily game the system by including multiple instances of popular search terms on their pages.
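The keyword-based ranking described above, and why it was so easy to game, can be sketched in a few lines. The page texts here are invented for illustration:

```python
def keyword_score(page_text, query_terms):
    """Score a page by how often the query terms occur in it (1990s-style ranking)."""
    words = page_text.lower().split()
    return sum(words.count(t) for t in query_terms)

pages = {
    "honest": "a guide to carnatic music concerts in chennai",
    "stuffed": "carnatic carnatic carnatic music music buy ringtones",
}
query = ["carnatic", "music"]
ranked = sorted(pages, key=lambda p: keyword_score(pages[p], query), reverse=True)
print(ranked)  # ['stuffed', 'honest'] -- the keyword-stuffed page wins
```

Because the score depends only on what the publisher writes on the page, stuffing in extra copies of popular terms is enough to outrank a genuinely relevant page.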

Then, in 1998, Google changed the game by introducing a popularity algorithm (called PageRank) for ranking search results. This relied on the uniquely democratic nature of the Web by using its massive link structure as an indicator of a page’s usefulness.

Google interpreted a link from one page to another as a vote by the first page for the second page. A vote cast by a popular page (i.e. one that has a large number of links pointing to it) carried more weight than a vote cast by a page with fewer links pointing to it. Google’s relevance algorithms have evolved over the years to include many more “signals” to rank pages, but the underlying principle has remained the same.

The challenges going forward for top-tier search engines like Google, Yahoo! and Bing are going to be two-fold: first, crawling, indexing and surfacing content from the Deep Web, which is estimated to be about 500 times larger than the surface Web; and second, dealing with social media sites such as Facebook, MySpace and Twitter, which is where most of the content on the Web is currently being generated.

Why is Deep Web growing in significance?

The Deep Web, as you may know, is simply the Web behind HTML forms. When a user fills in a form, the resulting page is generated just for that user from a series of databases, so a search engine’s crawler never sees it. Such pages remain lost in the Deep Web.

Local information, research documents, mountains of medical data, and otherwise useful information lie buried beyond the reach of GoogleBot and other crawlers. The world anxiously awaits a way to access this information.

According to one study, the Deep Web is estimated to be 500 times larger than the surface Web. As the number of dynamic websites and applications increases, this number will only go up. Imagine…all that data is not available to search engines!

It is difficult to automatically determine if a Web resource is a member of the surface Web or the Deep Web. If a search engine provides a backlink for a resource, one may assume that the resource is in the surface Web.

Unfortunately, search engines do not always provide all backlinks to resources. Even if a backlink does exist, there is no way to determine if the resource providing the link is itself in the surface Web without crawling all of the Web.

Furthermore, a resource may reside in the surface Web, but it may not have yet been found by a search engine. Therefore, if we have an arbitrary resource, we cannot know for sure if the resource resides in the surface Web or Deep Web without a complete crawl of the Web.

The format in which search results are to be presented varies widely by the particular topic of the search and the type of content being exposed. The challenge is to find and map similar data elements from multiple disparate sources so that search results may be presented in a unified format on the results page, irrespective of their source.

The lines between search engine content and the Deep Web have begun to blur, as search services start to provide access to part or all of once-restricted content. An increasing amount of Deep Web content is opening up to free search as publishers and libraries make agreements with large search engines.

In the future, Deep Web content may be defined less by opportunity for search than by access fees or other types of authentication.

Isn’t content from social media, too, getting into search space?

Over the last couple of years, there has been an explosion of social media content on the Web. This includes blog posts, postings on forums and message boards, Facebook status updates, tweets, and so on. The challenge for search engines is going to be to crawl and index this content, and surface the results in a relevant manner in real-time to users.

Social media content is not only hard to discover, crawl and index, but it also changes at an extremely rapid pace, which is something that search engines currently have not been designed to deal with.

The other trend on the Web, which has largely been driven by social media sites, is the diversification of information types. The Web, which predominantly consisted of HTML documents a few years ago, has evolved into a rich and interactive medium consisting of user-generated videos, widgets and gizmos, playable games, and so on. This poses a challenge for search engines because many of their relevance algorithms are still text-centric, and don’t leverage metadata like user ratings and reputations to rank results.

More importantly, the paradigm of presenting ten blue links for a query may not work in this new world. Showing a link to a YouTube video, for example, in the form of a title and two lines of snippet, is not the best way to present a video result to a user. A much better way would be to show a thumbnail image of the video along with play time and date of creation, and have the video play inline at the click of a button.

The next game-changer in search is going to come from a site that offers the ability to surface social media results in a manner which is timely, relevant and comprehensive. Whether this is going to be offered by traditional search engines like Google, Yahoo! or Bing, or by social media sites like Facebook or Twitter, or by a brand new startup is the billion-dollar question.

You had mentioned dynamic and persistent intent…

Users have two types of intent, dynamic intent and persistent intent. Dynamic intent, which typically lasts a short time, involves users looking for answers to questions or researching a topic. Examples include searching for the nearest pizza place or researching a travel destination for an upcoming vacation. This intent typically does not last for more than a few minutes or days.

Persistent intent, on the other hand, lasts for a very long time. It includes users tracking their interests, which could be a hobby, a sports team, a music artist or an industry sector.

Users turn to search engines when their intent is dynamic. They turn to social media sites to track their persistent interests.

What has been the impact of slowdown/ recession on search?

The year 2009 was an extremely challenging one for most companies, small and large, given the worldwide economic slowdown. Bellwether technology companies like Microsoft, Google and Cisco showed a drop or flattening of topline revenue for the first time in their history.

In the search industry, the harsh economic conditions led to a few search startups like SearchMe shutting down. It also contributed to a consolidation, with the #2 and #3 players in the space (Yahoo! and Microsoft) deciding to join forces to take on Google.

Your outlook for the near term.

Several indicators are now pointing to a global economic recovery sometime in 2010. Stock markets in several parts of the world have made a huge comeback this year, with tech stocks, in particular, returning to the levels they were at in 2008. The big event in the search industry in 2010-11 will be the launch of the partnership between Yahoo! and Microsoft, assuming the deal gets regulatory approval in the US and Europe.

Do you see a technology gap between India and the rest of the world, especially in the educational institutions?

Thanks to the Internet, technology professionals and educators in India are now much more in touch with technology trends than they were before. About 20 years ago, the technology infrastructure in even the best educational institutions in India used to be several years behind the Western world. Also, what was being taught in colleges in India was a few years behind what was being taught in educational institutions in the US. That has changed thanks to the global economy we now live in.

Would you like to talk about your company, how it works, and what your revenue models are?

Kosmix was launched as a health search engine in 2006 on RightHealth.com. RightHealth quickly grew to be a top health site, and is currently the #2 health site on the Web after WebMD, according to Hitwise. In an effort to extend its success in health to other verticals, Kosmix launched its horizontal product on Kosmix.com in December 2008.

Several verticals on Kosmix, including Autos, Travel, Business, Finance, Jobs, Shopping, Technology, are now getting a lot of traffic. In addition, Kosmix launched MeeHive.com earlier this year, which allows users to enter their interests and create a personalised newspaper, which is accessible both on the Web and on smartphones.

Samachar India News, an iPhone app from Kosmix, pulls the top 20 news sources from India and delivers them directly to your iPhone. This free service helps enthusiasts receive international, national and local coverage from India.

Kosmix works through a taxonomy for the Web, consisting of several million categories, built using a combination of humans and algorithms. Editors discover, integrate, and tag Web services to the taxonomy; algorithms then route the user’s query through the relevant taxonomy nodes, which enables the engine to decide which Web services to call.
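The route-a-query-through-the-taxonomy step can be sketched as a keyword-overlap match between the query and taxonomy nodes. This is a minimal illustration, not Kosmix’s actual system: the node names, keywords and service names below are all hypothetical stand-ins.

```python
# Hypothetical taxonomy: each node carries editor-assigned keywords and the
# Web services tagged to it.
taxonomy = {
    "Health/Conditions": {
        "keywords": {"symptoms", "diabetes", "flu"},
        "services": ["RightHealth", "medline_api"],
    },
    "Travel/Destinations": {
        "keywords": {"hotels", "flights", "beach"},
        "services": ["travel_guides", "maps_api"],
    },
}

def route_query(query):
    """Pick the taxonomy node with the largest keyword overlap with the query,
    and return the Web services tagged to that node."""
    terms = set(query.lower().split())
    best = max(taxonomy, key=lambda node: len(terms & taxonomy[node]["keywords"]))
    return taxonomy[best]["services"]

print(route_query("flu symptoms"))  # ['RightHealth', 'medline_api']
```

A production system would score millions of nodes with far richer signals than raw keyword overlap, but the shape of the pipeline — query, to taxonomy node, to tagged services — is the one described above.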

All Kosmix services, which include its websites as well as its iPhone apps (MeeHive and Samachar), are completely free for consumers to use. Kosmix makes money through online advertising.

**

InterviewsInsights.blogspot.com
