Abstract:
The ease of finding and retrieving information has become an integral part of our lives. With the growing surge of data, search engines facilitate the task of finding information relevant to a need in large collections. Yet, searching massive volumes of diverse textual documents to satisfy a user's specific information need is extremely challenging.
In commercial search engines, as in enterprise search engines, the retrieved units are full documents; i.e., documents in a corpus are ranked by their presumed relevance to the information need expressed by a query. Relevant documents can contain much non-relevant information; indeed, a single short passage with relevant information suffices to deem the entire document relevant. This fact has motivated work on passage retrieval and passage-based document retrieval. In the former, also known as focused retrieval, the retrieved units are passages; in the latter, the retrieved units are documents, but passage-level information is used for document ranking.
In this thesis, we first present a suite of novel document retrieval methods that are based on learning a document ranking function using an effective passage ranking. We then explore the use of inter-passage similarities to improve the effectiveness of the retrieved list of passages. To lay theoretical grounds for the use of inter-passage similarities, we propose a novel set of cluster hypothesis tests for passages. Finally, we examine and analyze the effect of using true relevance feedback at the token level on the retrieval performance of document ranking.
https://technion.zoom.us/j/3800541616