Building an Open Source Meta-Search Engine
Abstract
A recent study [1] estimated the size of the publicly indexable Web at more than 11.5 billion pages. Furthermore, the intersection of the indexes of the largest available search engines (Google, Yahoo!, MSN, and Ask/Teoma) is estimated to be 28.8%. A study [2] showed that 44% of searchers regularly use only a single search engine, 48% use just two or three, and only 7% use more than three. Another study, conducted by Jux2 [3], found that Google and Yahoo! share on average only 3.8 of their top 10 results across the 500 most popular search terms. In a separate test of 91 random searches, the same study found that Google and Yahoo! share only 23% of their top 100 results. Its authors conclude that "If the search engines are providing top results that are very different from each other, then by using only one search engine, Internet searchers are potentially missing relevant results".
As a consequence, meta-search engines are useful for several reasons. For instance, they allow (i) integration of the search results provided by different engines, (ii) comparison of the rank positions that different engines assign to the same result, and (iii) advanced search features built on top of commodity engines (e.g. clustering, question answering, and personalized results).
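Point (i) can be illustrated with a minimal Python sketch that merges the result lists of several engines and removes duplicate URLs. The engine names, the example URLs, and the `normalize` rule are assumptions made for illustration only; a real meta-search engine would likely need a more careful URL canonicalization.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Crude URL normalization (an assumption, not a standard):
    drop the scheme and fragment, lowercase the host, strip a trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def merge_results(results_by_engine):
    """Map each distinct (normalized) URL to {engine: 1-based rank}.

    URLs are kept in first-seen order; if an engine returns the same
    URL twice, only its best (first) position is recorded.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            key = normalize(url)
            merged.setdefault(key, {}).setdefault(engine, rank)
    return merged


# Hypothetical result lists from two engines.
results = {
    "engine_a": ["http://Example.com/a/", "http://example.org/b"],
    "engine_b": ["https://example.com/a#top", "http://example.net/c"],
}
merged = merge_results(results)
for url, ranks in merged.items():
    print(url, ranks)
```

Keeping the per-engine rank next to each merged URL means the same structure can later feed a rank comparison or a rank aggregation step.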
There are many industrial meta-search engines. Vivisimo and Dogpile are commercial clustering engines that group results drawn on the fly from remote search engines. Jux2 is an industrial meta-search engine that compares the rank positions a set of URLs receives on three search engines. A list of meta-search engines is available in [4].
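The kind of rank comparison described above can be sketched in a few lines of Python. This is not Jux2's implementation, which is not described here; the engine labels and URLs are placeholders, and the function simply tabulates the position of each URL in each engine's list.

```python
def rank_positions(results_by_engine):
    """Return (engines, table), where table maps each URL to a list of
    1-based ranks, one per engine, with None where the engine did not
    return that URL."""
    engines = list(results_by_engine)
    table = {}
    for column, engine in enumerate(engines):
        for rank, url in enumerate(results_by_engine[engine], start=1):
            row = table.setdefault(url, [None] * len(engines))
            if row[column] is None:  # keep the best position if a URL repeats
                row[column] = rank
    return engines, table


# Hypothetical top-3 lists from three engines.
results = {
    "A": ["u1", "u2", "u3"],
    "B": ["u2", "u1", "u4"],
    "C": ["u3", "u2", "u5"],
}
engines, table = rank_positions(results)
print("url", engines)
for url, row in table.items():
    print(url, row)
```

A `None` entry makes it easy to see which results are unique to one engine, which is the situation the overlap statistics in the abstract point to.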
The academic literature contains many proposals for meta-searching. Lawrence and Giles [10] propose downloading the individual documents, rather than working with the list of snippets returned by search engines. This approach has evident performance problems, since the result pages must be fetched before an answer can be shown. Meng et al. [11] survey techniques that have been proposed to tackle several underlying challenges in building a meta-search engine. Chidlovskii [7] discusses methods for improving answer relevance in meta-search engines. [8, 12, 13] propose several strategies for combining the ranked results returned by multiple search engines.
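To give a concrete idea of what combining ranked results can look like, here is a minimal sketch of a Borda count, a classical rank-based aggregation method. It is chosen only for illustration and does not reproduce the specific methods of [8, 12, 13]; the scoring rule for unlisted items (zero points) and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Aggregate ranked lists with a Borda count.

    In a list of n items, the item at 0-based position i earns n - i
    points; items missing from a list earn nothing from it (assumption).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical top-3 lists from three engines.
rankings = [
    ["u1", "u2", "u3"],
    ["u2", "u1", "u4"],
    ["u3", "u2", "u5"],
]
fused = borda_merge(rankings)
print(fused)
```

In this example `u2` comes first because it appears in all three lists, even though only one engine ranks it at the top; rank-based methods of this kind need only positions, not the engines' internal scores.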
Bibliography
[ 1] A.Gulli and A.Signorini, The Indexable Web is More than 11.5 Billion Pages [WWW2005]
[ 2] http://www.pewinternet.org/pdfs/PIP_Searchengine_users.pdf
[ 3] http://www.jux2.com/stats.php
[ 4] http://searchenginewatch.com/links/article.php/2156241
[ 5] http://rankcomparison.di.unipi.it/
[ 6] http://www.gnu.org/software/wget/
[ 7] Chidlovskii, System and method for improving answer relevance in meta-search engines. [U.S. Pat. 6829599, 2004]
[ 8] R.Fagin, R.Kumar, M.Mahdian, D.Sivakumar, and E.Vee, Comparing and aggregating rankings with ties. [PODS, 2004]
[ 9] P.Ferragina and A.Gulli, A personalized search engine based on web-snippet hierarchical clustering [WWW2005]
[10] S.Lawrence and C.L.Giles, Inquirus, the NECI meta search engine [WWW1998]
[11] W.Meng, C.Yu, and K.Liu, Building efficient and effective metasearch engines [ACM Computing Surveys, 2002]
[12] M.E.Renda and U.Straccia, Web metasearch: Rank vs. score based rank aggregation methods [SAC, 2003]
[13] S.Wu, F.Crestani, and F.Gibb, New methods of results merging for distributed information retrieval [DMIR, 2003]
[14] R.Stevens, UNIX Network Programming II [Prentice Hall, 2000]