Abstract:
Topical search engines, also known as vertical search engines, specialize in retrieving information on a particular subject. The essential component of such a search engine is a topical crawler, a software agent that navigates the Web in search of documents that fit the scope of its search engine. The common approach to topical crawling appears to be best-first search over the digraph formed by the Web, with the links in the search queue prioritized by the lexical similarity between the representation of the document embedding the link and the representation of the topic. In our work, we re-examine this approach. First, we show that the optimal footprint of the topical crawling process constitutes a minimum directed Steiner tree of the Web digraph, with the on-topic documents as Steiner terminals. Second, with this “Steiner optimality” in mind, we formalize a novel best-first approach that prioritizes search directions using a Bayesian inference model that continuously updates its estimates given the evidence collected during crawling. Our comparative empirical evaluation on large real-world snapshots of the Web shows that the proposed approach substantially outperforms the standard technique for best-first topical crawling.
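
To make the baseline concrete, here is a minimal sketch of the standard best-first crawler described in the abstract, not the proposed Bayesian method. It assumes a toy in-memory digraph in place of the Web, bag-of-words vectors as the document and topic representations, and cosine similarity as the lexical similarity measure. The names best_first_crawl, cosine, graph, text, and topic are hypothetical and chosen only for illustration.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_first_crawl(graph, text, seeds, topic, budget):
    """Visit up to `budget` pages, always expanding the link whose
    embedding (parent) document is lexically closest to the topic."""
    topic_vec = Counter(topic.lower().split())
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, seed) for seed in seeds]
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < budget:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Every out-link inherits the similarity of the page that embeds it.
        score = cosine(Counter(text[page].lower().split()), topic_vec)
        for target in graph.get(page, []):
            if target not in visited:
                heapq.heappush(frontier, (-score, target))
    return order


# Toy data (assumed for illustration only).
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
text = {
    "seed": "portal home",
    "a": "steiner tree graph algorithms",
    "b": "cooking recipes pasta",
    "c": "directed steiner tree approximation",
    "d": "dessert recipes",
}
topic = "steiner tree graph"

crawl_order = best_first_crawl(graph, text, ["seed"], topic, budget=4)
print(crawl_order)
```

In this toy run the crawler follows the link out of the on-topic page "a" before returning to the off-topic branch. The priority of a link depends only on the document that embeds it, which is the design choice the talk re-examines.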
https://technion.zoom.us/j/3800541616