Algorithms for the web and for social networks (Algoritmica per il web e le reti sociali)
About this page
This page contains pointers, information, links etc. related to the above-mentioned course that I am going to teach at the Bertinoro International Spring School (12-16 March 2012).
Syllabus
- Introduction [handout]
- The world-wide web [handout]
- Background and history
- Web characteristics
- The web graph
- Spam
- Advertising
- Search engines
- Index size and estimation
- Near-duplicates and shingling
- Search-engine architecture [handout #1, handout #2]
- Anatomy of a search engine
- Crawling
- Crawler architecture
- The URL frontier
- Design and implementation issues
- Distributed crawlers
- Indexing and front-end
- Information retrieval fundamentals [handout]
- Evaluation in information retrieval
- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Boolean model for IR
- Vector-space model for IR
- Probabilistic model for IR
- Evaluation in information retrieval
- Building inverted indices for large text collections [handout]
- Index construction
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Index construction
- Link analysis [handout]
- The Web as a graph
- PageRank
- Markov chains
- The PageRank computation
- PageRank as a function of its parameters
- Topic-specic PageRank
- Hubs and Authorities: HITS
- Advanced topics
- Large dictionaries [handout #1, handout #2]
- Overview on hashing
- Perfect hashing
- Minimal perfect hashing
- Graph representation and compression [handout]
- Overview
- Variable-length coding and compression: Elias, Golomb, arithmetic compression
- Locality and similarity
- Node order and compression
- Index compression [handout]
- Statistical properties of terms in information retrieval
- Postings and compression
- Designing algorithms for large graphs
- Large dictionaries [handout #1, handout #2]
References and further pointers
A large part of the course topics are covered in: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval (Cambridge University Press, 2008). I will moreover provide the handouts containing the slides used during the course. Further useful resources and links:
- for the part on crawling and search engine architecture
- the World Wide Web Consortium websites, containing links to standards and recommendations
- Allan Heydon and Marc Najork, Mercator: A Scalable, Extensible Web Crawler (World Wide Web 2(4):219-229, 1999)
- Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov, IRLbot: scaling to 6 billion pages and beyond (World Wide Web, 427-436, 2008)
- Hector Garcia-Molina and Junghoo Cho, Parallel crawlers (World Wide Web, 124-135, 2002)
- Paolo Boldi, Bruno Codenotti, Massimo Santini, Sebastiano Vigna: UbiCrawler: a scalable fully distributed Web crawler (Softw., Pract. Exper. 34(8): 711-726, 2004)
- Marc Najork and Janet L. Wiener, Breadth-First Search Crawling Yields High-Quality Pages (World Wide Web, 114-118, 2001)
- Paolo Boldi, Massimo Santini, Sebastiano Vigna, Paradoxical Effects in PageRank Incremental Computations (Internet Mathematics 2(3): 387-404, 2005)
- for the part on information retrieval, index construction and compression
- Ian H. Witten, Alistair Moffat, Timothy C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images (Second Edition, Morgan Kaufmann, 1999)
- Ricardo A. Baeza-Yates, Berthier A. Ribeiro-Neto, Modern Information Retrieval - the concepts and technology behind search (Second edition Pearson Education Ltd., Harlow, England 2011)
- Stephen E. Robertson, Hugo Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond (Foundations and Trends in Information Retrieval (FTIR), 3(4):333-389, 2009)
- for the part on link analysis, some background and a good survey can be found in
- Eugene Seneta, Non-negative Matrices and Markov Chains (Springer, 2006)
- Amy N. Langville , Carl D. Meyer, Deeper inside pagerank (Internet Mathematics, 1(3): 335-380, 2004)
- the paper about the analysis of the facebook graph is also available on Arxiv
- for more papers, software and datasets, see the LAW website.
Final assignments
The students can choose among three types of final assignments, fully described in this document. Use this wiki if you plan to participate to the final evaluation.