The Tech Stuff: Biasing Web results for topic familiarity

This post is based on a Yahoo! Research paper written by Omid Madani and Rosie Jones of Yahoo! and Giridhar Kumaran of the University of Massachusetts. Its premise is that, depending on how familiar a user is with the search topic, it would be appropriate to return either introductory or advanced search results.

The findings come from a four-step procedure. First, introductory and advanced web pages are defined.

An introductory web page is defined as a page that doesn’t presuppose any background knowledge of the topic and, to an extent, introduces or defines the topic’s key terms.

An advanced web page is one that assumes sufficient background knowledge of the topic and probably builds upon it.

Next, inter-labeler agreement is used to show that these definitions hold up. Three annotators were asked to label randomized sets of results for particular queries, and they agreed about 70% of the time. Their labels also showed that, overall, search engines favour introductory and advanced web pages about equally. The precision for an introductory page at position 1 is slightly more than 0.5, however, so the top result is an introductory one slightly more often than not. The work tries to improve the precision for introductory documents across positions 1 to 10.
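To make the two measurements above concrete, here is a minimal Python sketch of pairwise labeler agreement and precision at position k. The function names, the label strings and the sample data are my own assumptions for illustration. The post doesn't say exactly how the paper computed agreement, so simple pairwise matching is used here.

```python
from itertools import combinations


def pairwise_agreement(labels_by_annotator):
    """Fraction of (annotator pair, document) comparisons where the labels match."""
    matches = 0
    total = 0
    for a, b in combinations(labels_by_annotator, 2):
        for x, y in zip(a, b):
            matches += (x == y)
            total += 1
    return matches / total if total else 0.0


def precision_at_k(ranked_labels, k, target="intro"):
    """Fraction of the top-k ranked results that carry the target label."""
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == target) / len(top)


# Made-up labels (not data from the paper): one list per annotator,
# one entry per document, in the same document order.
annotators = [
    ["intro", "adv", "intro", "intro"],
    ["intro", "adv", "adv", "intro"],
    ["intro", "intro", "adv", "intro"],
]

print(pairwise_agreement(annotators))
print(precision_at_k(["intro", "adv", "intro", "adv"], k=3))
```

With three annotators there are three pairs per document, so every document contributes three comparisons to the agreement figure.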

The introductory and advanced documents were then scored with the Fog, Flesch and Kincaid readability indices. All three rated the documents as unreadable and could not distinguish the introductory ones from the advanced ones, showing that reading-level measures alone aren’t enough to tell the two apart. In another experiment, queries were expanded with introductory trigger words, but this didn’t bring about a significant improvement in the rankings of the introductory documents.
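For readers who haven't met these indices, the sketch below computes the three scores using their standard published formulas. The tokenization and the vowel-group syllable counter are crude assumptions of mine, not what the paper used, so the numbers are only approximate on real text.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: count runs of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return Flesch Reading Ease, Flesch-Kincaid Grade and Gunning Fog scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog counts "complex" words; here that means 3+ estimated syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "gunning_fog": 0.4 * (words_per_sentence + 100 * complex_words / len(words)),
    }


print(readability("The cat sat. The dog ran."))
```

All three formulas depend only on sentence length and word length, which may be part of why they struggle on web pages: navigation menus, lists and fragments don't form clean sentences.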

A familiarity classifier was therefore developed using reading-level measures, the distribution of stop words in the text, and non-text features such as the average line length. Once trained, this classifier could label documents as introductory or advanced. It could be used to increase precision at the top ranks by moving more results of the desired kind into those positions. Relevance, however, can’t be increased this way. The documents can, though, be classified at crawl time, which addresses this problem too.
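As a rough illustration of the kind of features mentioned above, the following sketch extracts per-stop-word frequencies and the average line length from a piece of text. The tiny stop-word list and the feature names are assumptions of mine, the reading-level features are left out for brevity, and the paper's actual feature set and learning algorithm are not reproduced here.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list would be longer).
STOP_WORDS = ("a", "an", "and", "in", "is", "it", "of", "that", "the", "to")


def familiarity_features(text):
    """Build a feature dict from stop-word frequencies and average line length."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        raise ValueError("text must contain at least one word")

    features = {
        # Non-text (layout) feature: mean length of the non-blank lines.
        "avg_line_length": sum(len(ln) for ln in lines) / len(lines),
        # Overall share of tokens that are stop words.
        "stop_word_fraction": sum(1 for w in words if w in STOP_WORDS) / len(words),
    }
    # Distribution of stop words: relative frequency of each one.
    for sw in STOP_WORDS:
        features["stop:" + sw] = words.count(sw) / len(words)
    return features


sample = "The cat is in the hat.\nIt is a hat of note."
print(familiarity_features(sample))
```

A dict like this could be fed to any standard binary classifier. None of these features depends on the query topic, which fits the topic-independent character of the classifier.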

The study was able to re-rank the documents, producing a statistically significantly higher proportion of introductory documents at 5 documents retrieved and at 10 documents retrieved, over baseline search engine retrieval. This kind of topic-independent, user-independent classifier is empowering for personalized search, as with a single change to the retrieval reranking, any user can specify whether they want introductory or advanced documents for any query.
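The reranking idea can be sketched in a few lines. This is a simplified version under my own assumptions: `intro_score` is a hypothetical function returning the classifier's estimate that a document is introductory, and the top k results are simply sorted by that score. The post doesn't describe the paper's exact reranking scheme, so this should not be read as the authors' method.

```python
def rerank_for_familiarity(results, intro_score, want="introductory", k=10):
    """Reorder only the top-k results by classifier score; leave the rest as-is.

    results     -- documents in the search engine's original order
    intro_score -- hypothetical callable: document -> score, higher = more introductory
    want        -- "introductory" or "advanced"
    """
    if want not in ("introductory", "advanced"):
        raise ValueError("want must be 'introductory' or 'advanced'")
    head, tail = results[:k], results[k:]
    # sorted() is stable, so ties keep the engine's original relative order.
    head = sorted(head, key=intro_score, reverse=(want == "introductory"))
    return head + tail


# Made-up scores for four documents, for illustration only.
scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.99}
original = ["a", "b", "c", "d"]

print(rerank_for_familiarity(original, scores.get, want="introductory", k=3))
print(rerank_for_familiarity(original, scores.get, want="advanced", k=3))
```

Because only the documents already in the top k are shuffled, the set of results shown stays the same and only their order changes, so a single `want` switch covers both kinds of user.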

Further work in this area would be to integrate a user profile so that the user’s knowledge level is known automatically and the user doesn’t have to say explicitly whether they want advanced or introductory results. Under such a scheme, the majority of the 10 results could match the user profile information. If the user clicks on the minority results, it’s clear that they want the opposite kind of information. The classifier could also include more features that help distinguish advanced documents from introductory ones.

Saturday, September 30, 2006
