CS 121 Project Spidermen


Welcome to the 🕷️ men 121 crawler wiki!

Project Requirements

Stage 2

  1. Crawl the following web domains
  • *.ics.uci.edu/*
  • *.cs.uci.edu/*
  • *.informatics.uci.edu/*
  • *.stat.uci.edu/*
  2. For crawled web pages, extract the following information (see the statistics sketch after this list)
  • # of unique pages (based on the URL, ignoring fragments)
  • longest page (highest # of words, ignoring HTML markup)
  • 50 most common words (ignoring stop words), ordered by frequency
  • *.ics.uci.edu/* subdomains, ordered alphabetically, with the # of unique pages per subdomain
  • list of URLs scraped (to add to the frontier)
  3. Follow the scraper requirements (see the URL-handling sketch after this list)
  • honor the politeness requirement
  • implement an is_valid check to verify a web page falls within the domains above
  • defragment returned URLs
  • transform relative URLs into absolute URLs
  4. Follow the crawler requirements (see the heuristics sketch after this list)
  • only crawl pages with high textual information content
  • detect + avoid infinite traps
  • detect + avoid sets of similar pages with no information
  • detect redirects + index the redirected content
  • detect + avoid dead URLs (200 status but no content)
  • detect + avoid crawling very large files with low information value
  5. Development requirements
  • add comments throughout code
  6. Extra Credit (see the near-duplicate sketch after this list)
  • utilize GitHub (+1)
  • check and use robots.txt / sitemap files (+1)
  • webpage similarity detection (exact / near) (+2)
  • crawler multithreading (+5)
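
A minimal sketch of the statistics from requirement 2, assuming the page's visible text has already been extracted (HTML markup stripped) and that a real stop-word list is loaded elsewhere; the names below (`record_page`, `STOP_WORDS`, etc.) are placeholders, not part of the assignment's starter code:

```python
import re
from collections import Counter
from urllib.parse import urlparse

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to"}  # placeholder; load a full list

seen_urls = set()            # defragmented URLs, for the unique-page count
word_counts = Counter()      # running word frequencies across all pages
longest_page = ("", 0)       # (url, word count) of the longest page so far
subdomain_pages = Counter()  # unique-page counts per *.ics.uci.edu subdomain

def record_page(url: str, text: str) -> None:
    """Update the running statistics with one crawled page."""
    global longest_page
    if url in seen_urls:     # only count each defragmented URL once
        return
    seen_urls.add(url)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if len(tokens) > longest_page[1]:
        longest_page = (url, len(tokens))
    word_counts.update(t for t in tokens if t not in STOP_WORDS)
    host = urlparse(url).hostname or ""
    if host.endswith(".ics.uci.edu"):  # subdomains of ics.uci.edu
        subdomain_pages[host] += 1

def report() -> None:
    """Print the report items in the order the requirements list them."""
    print("unique pages:", len(seen_urls))
    print("longest page:", longest_page)
    print("50 most common words:", word_counts.most_common(50))
    for host, count in sorted(subdomain_pages.items()):  # alphabetical
        print(f"{host}, {count}")
```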
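
For requirement 3, the URL handling fits in the standard library; a sketch assuming the scraper receives raw `href` values (the regex and function names are illustrative, not the starter code's API):

```python
import re
from urllib.parse import urljoin, urldefrag, urlparse

# matches ics.uci.edu, cs.uci.edu, informatics.uci.edu, stat.uci.edu and their subdomains
ALLOWED_HOST = re.compile(r"^(.+\.)?(ics|cs|informatics|stat)\.uci\.edu$", re.IGNORECASE)

def normalize(base_url: str, href: str) -> str:
    """Make a link absolute and strip its #fragment."""
    absolute = urljoin(base_url, href)       # relative -> absolute
    defragged, _fragment = urldefrag(absolute)
    return defragged

def is_valid(url: str) -> bool:
    """Keep only http(s) URLs inside the allowed domains."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return bool(parsed.hostname and ALLOWED_HOST.match(parsed.hostname))
```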
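
The crawler requirements in item 4 are mostly heuristics; the thresholds below and the idea of keying trap detection on a URL's host and leading path segments are assumptions to tune after watching the crawler run, not values from the spec:

```python
from collections import Counter
from urllib.parse import urlparse

MAX_BYTES = 2_000_000   # assumed cutoff for "very large" files
MIN_WORDS = 50          # assumed cutoff for "low textual information"
path_counts = Counter() # how often each (host, path prefix) has been visited

def looks_like_trap(url: str, limit: int = 100) -> bool:
    """Flag URLs whose host + leading path segments repeat excessively,
    e.g. a calendar that links to the next month forever."""
    parsed = urlparse(url)
    key = (parsed.hostname, "/".join(parsed.path.split("/")[:3]))
    path_counts[key] += 1
    return path_counts[key] > limit

def worth_indexing(raw_bytes: bytes, word_count: int) -> bool:
    """Skip huge files and near-empty pages (e.g. a 200 response with no content)."""
    return len(raw_bytes) <= MAX_BYTES and word_count >= MIN_WORDS
```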
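
For the similarity extra credit, exact duplicates can be caught with a plain hash of the page text; near duplicates need a fingerprint such as simhash. A small sketch (the 64-bit size and the distance threshold are assumptions for illustration):

```python
import hashlib

def simhash(tokens, bits: int = 64) -> int:
    """Build a simhash fingerprint: similar token sets yield nearby hashes."""
    v = [0] * bits
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Two pages are near duplicates if their fingerprints differ in few bits."""
    return bin(a ^ b).count("1") <= max_distance
```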

Hints / Tips

  • use an external library to parse HTML responses (e.g., BeautifulSoup, lxml); a parsing sketch follows this list
  • (optional) save each URL / webpage to local disk in the scraper
  • for the crawler requirements, first monitor where the crawler goes, then adjust its behavior
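
A sketch combining the first two hints, assuming the raw response body is available as bytes; `parse_page` and `cache_dir` are made-up names for illustration:

```python
from hashlib import sha256
from pathlib import Path

from bs4 import BeautifulSoup

def parse_page(url: str, body: bytes, cache_dir: str = "pages"):
    """Extract visible text and outgoing links; optionally cache the raw page."""
    soup = BeautifulSoup(body, "lxml")  # lxml backend; "html.parser" also works
    text = soup.get_text(separator=" ")
    links = [a["href"] for a in soup.find_all("a", href=True)]
    # optional: save the page to local disk so reruns don't re-download it
    Path(cache_dir).mkdir(exist_ok=True)
    (Path(cache_dir) / (sha256(url.encode()).hexdigest() + ".html")).write_bytes(body)
    return text, links
```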