ProbFuse: Probabilistic Data Fusion
David Lillis
MSc, University College Dublin, Feb. 2006.
Abstract
In recent years, the proliferation of information made available in domains such as the World Wide Web, corporate intranets and knowledge management systems, together with the "information overload" problem, has caused Information Retrieval (IR) to change from a niche research area into a multi-billion dollar industry. Numerous researchers have proposed approaches to this task of identifying documents that satisfy a user's information need. Because of this diversity of methods, different retrieval systems rarely return the same documents in response to the same query. This has led to research in the fields of data fusion and metasearch, which seek to improve the quality of the results presented to the user by combining the outputs of multiple IR algorithms or systems into a single result set. This thesis introduces probFuse, a probabilistic data fusion algorithm. ProbFuse uses the results of a number of training queries to build a profile of the distribution of relevant documents in the result sets produced by each of its input systems. These distributions are used to calculate the probability of relevance for documents returned in subsequent result sets, and these probabilities are used to produce a final fused result set that is returned to the user. ProbFuse has been evaluated on a number of test collections, ranging from small collections such as Cranfield and LISA to the Web Track collection from the TREC-2004 conference. On each of these collections, probFuse significantly outperformed CombMNZ, a data fusion algorithm often used as a baseline against which to compare new techniques.
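The training and fusion steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not say how the distribution of relevant documents is represented or how probabilities from different systems are combined, so the sketch assumes that each result set is split into a fixed number of equal-width rank segments and that a document's fused score is the sum of the segment probabilities from every system that returned it. The names `segment_of`, `train`, `fuse` and `num_segments` are hypothetical.

```python
from collections import defaultdict


def segment_of(rank, result_len, num_segments):
    """Map a 0-based rank to a 0-based segment index (assumed equal-width segments)."""
    return min(rank * num_segments // result_len, num_segments - 1)


def train(training_runs, relevant, num_segments):
    """Estimate, per input system, the probability of relevance in each segment.

    training_runs: {system: {query: [doc ids in rank order]}}
    relevant:      {query: set of relevant doc ids}
    Returns {system: [probability for segment 0, segment 1, ...]}.
    """
    profile = {}
    for system, runs in training_runs.items():
        totals = [0.0] * num_segments
        for query, ranking in runs.items():
            hits = [0] * num_segments
            sizes = [0] * num_segments
            for rank, doc in enumerate(ranking):
                s = segment_of(rank, len(ranking), num_segments)
                sizes[s] += 1
                hits[s] += doc in relevant[query]
            # Fraction of relevant documents in each non-empty segment.
            for s in range(num_segments):
                if sizes[s]:
                    totals[s] += hits[s] / sizes[s]
        # Average over the training queries (guarding against an empty set).
        profile[system] = [t / max(len(runs), 1) for t in totals]
    return profile


def fuse(profile, results, num_segments):
    """Fuse one query's result sets into a single ranking.

    results: {system: [doc ids in rank order]}
    Assumption: scores from different systems are simply summed.
    """
    scores = defaultdict(float)
    for system, ranking in results.items():
        for rank, doc in enumerate(ranking):
            s = segment_of(rank, len(ranking), num_segments)
            scores[doc] += profile[system][s]
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

In this sketch, a system that tended to place relevant documents near the top of its training result sets contributes more weight to documents it ranks highly later, while a system whose relevant documents were spread evenly contributes roughly the same weight at every rank.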