Paolo Boldi

Algorithms for the web and for social networks (Algoritmica per il web e le reti sociali)

About this page

This page contains pointers, information, links etc. related to the above-mentioned course that I am going to teach at the Bertinoro International Spring School (12-16 March 2012).

Syllabus

  1. Introduction [handout]
  2. The world-wide web [handout]
    • Background and history
    • Web characteristics
      • The web graph
      • Spam
    • Advertising
    • Search engines
    • Index size and estimation
    • Near-duplicates and shingling
  3. Search-engine architecture [handout #1, handout #2]
    • Anatomy of a search engine
    • Crawling
      • Crawler architecture
      • The URL frontier
      • Design and implementation issues
    • Distributed crawlers
    • Indexing and front-end
  4. Information retrieval fundamentals [handout]
    • Evaluation in information retrieval
      • Information retrieval system evaluation
      • Standard test collections
      • Evaluation of unranked retrieval sets
      • Evaluation of ranked retrieval results
    • Boolean model for IR
    • Vector-space model for IR
    • Probabilistic model for IR
  5. Building inverted indices for large text collections [handout]
    • Index construction
      • Blocked sort-based indexing
      • Single-pass in-memory indexing
      • Distributed indexing
  6. Link analysis [handout]
    • The Web as a graph
    • PageRank
      • Markov chains
      • The PageRank computation
      • PageRank as a function of its parameters
      • Topic-specic PageRank
    • Hubs and Authorities: HITS
  7. Advanced topics
    • Large dictionaries [handout #1, handout #2]
      • Overview on hashing
      • Perfect hashing
      • Minimal perfect hashing
    • Graph representation and compression [handout]
      • Overview
      • Variable-length coding and compression: Elias, Golomb, arithmetic compression
      • Locality and similarity
      • Node order and compression
    • Index compression [handout]
      • Statistical properties of terms in information retrieval
      • Postings and compression
    • Designing algorithms for large graphs
      • Approximating the clustering coefficients [handout]
      • Approximating the distance distribution [handout]

References and further pointers

A large part of the course topics are covered in: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval (Cambridge University Press, 2008). I will moreover provide the handouts containing the slides used during the course. Further useful resources and links:

Final assignments

The students can choose among three types of final assignments, fully described in this document. Use this wiki if you plan to participate to the final evaluation.