CS 121 Project Spidermen
Welcome to the 🕷️ men 121 crawler wiki!
- Crawl the following web domains (a sketch of the domain check follows this list)
  - *.ics.uci.edu/*
  - *.cs.uci.edu/*
  - *.informatics.uci.edu/*
  - *.stat.uci.edu/*
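A minimal sketch of how the domain restriction might be enforced; the helper name `in_allowed_domains` and the `ALLOWED_DOMAINS` constant are our own choices for illustration, not part of the starter code:

```python
from urllib.parse import urlparse

# Allow-list covering the four required domains (illustrative constant name).
ALLOWED_DOMAINS = (".ics.uci.edu", ".cs.uci.edu",
                   ".informatics.uci.edu", ".stat.uci.edu")

def in_allowed_domains(url: str) -> bool:
    """True if the URL's host is one of the allowed domains or a subdomain of one."""
    host = urlparse(url).netloc.lower().split(":")[0]  # drop any port
    return any(host == d.lstrip(".") or host.endswith(d) for d in ALLOWED_DOMAINS)
```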
- For crawled web pages, extract the following information (a tallying sketch follows this list)
  - # of unique pages (based on hostname)
  - longest page (highest word count, ignoring HTML markup)
  - 50 most common words (ignoring stop words), ordered by frequency
  - *.ics.uci.edu/* subdomains, ordered alphabetically with # of unique pages for each subdomain
  - list of URLs scraped (to be added to the frontier)
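One way these statistics might be tallied, as a sketch; the stop-word set below is a placeholder for whatever list the course specifies:

```python
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # placeholder set

word_counts = Counter()   # running tally of non-stop-words across all pages
longest_page = ("", 0)    # (url, word count), HTML markup already stripped

def record_page(url: str, text: str) -> None:
    """Update the report statistics from a page's markup-free text."""
    global longest_page
    words = re.findall(r"[a-zA-Z0-9']+", text.lower())
    if len(words) > longest_page[1]:
        longest_page = (url, len(words))
    word_counts.update(w for w in words if w not in STOP_WORDS)

# After the crawl, word_counts.most_common(50) yields the top-50 list for the report.
```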
- Follow the scraper requirements (a URL-normalization sketch follows this list)
  - honor the politeness requirement (rate-limit requests to each domain)
  - implement an is_valid check that a page falls within the above domains
  - defragment returned URLs
  - transform relative URLs to absolute URLs
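The defragmenting and absolutizing steps map directly onto the standard library; a sketch:

```python
from urllib.parse import urljoin, urldefrag

def normalize(base_url: str, href: str) -> str:
    """Resolve a possibly-relative href against the page URL and drop the fragment."""
    absolute = urljoin(base_url, href)          # relative -> absolute
    defragged, _fragment = urldefrag(absolute)  # strip any '#...' suffix
    return defragged

# normalize("https://www.ics.uci.edu/a/b", "../c#top") -> "https://www.ics.uci.edu/c"
```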
- Follow the crawler requirements (heuristic sketches follow this list)
  - only crawl pages with high textual information content
  - detect + avoid infinite traps
  - detect + avoid similar sets of pages with no information
  - detect redirects, index the redirected content
  - detect + avoid dead URLs (200 status but no content)
  - detect + avoid crawling very large files with low information value
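A sketch of how the trap and low-value checks might look; every threshold below is an illustrative assumption, not a value from the spec:

```python
from urllib.parse import urlparse

MAX_PATH_DEPTH = 10            # assumed cutoff for suspiciously deep paths
MAX_CONTENT_BYTES = 2_000_000  # assumed cap for the "very large file" check
MIN_WORDS = 50                 # assumed floor for "high textual information content"

def looks_like_trap(url: str) -> bool:
    """Flag URLs with extreme depth or heavily repeated path segments (e.g., calendar traps)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) > MAX_PATH_DEPTH or len(set(segments)) * 2 < len(segments)

def worth_indexing(status: int, content: bytes, word_count: int) -> bool:
    """Skip dead URLs (200 but empty) and large, word-sparse files."""
    if status != 200 or not content:
        return False
    if len(content) > MAX_CONTENT_BYTES and word_count < MIN_WORDS:
        return False
    return word_count >= MIN_WORDS
```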
- Development requirements
  - add comments throughout the code
- Extra Credit
  - utilize GitHub (+1)
  - check + honor robots.txt / sitemap files (+1) (see the robots.txt sketch after this list)
  - webpage similarity detection (exact / near) (+2)
  - crawler multithreading (+5)
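For the robots.txt extra credit, the standard library's robotparser covers the basic check; a sketch with a per-host cache (the cache and fallback behavior are our own choices):

```python
from urllib import robotparser
from urllib.parse import urlparse

_robots_cache = {}  # one parsed robots.txt per scheme://host

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Consult the host's robots.txt (fetched once and cached) before crawling url."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"
    if root not in _robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(root + "/robots.txt")
        try:
            rp.read()   # fetch and parse robots.txt
        except OSError:
            pass        # unread parser -> can_fetch conservatively returns False
        _robots_cache[root] = rp
    return _robots_cache[root].can_fetch(user_agent, url)
```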
- use external libraries to parse HTML responses (BeautifulSoup, lxml)
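A sketch of that parsing step with BeautifulSoup (one of the suggested libraries), combining text extraction for the word statistics with link extraction for the frontier:

```python
from urllib.parse import urljoin, urldefrag
from bs4 import BeautifulSoup

def extract(url: str, html: str):
    """Return (markup-free text, absolute defragmented outgoing links) from raw HTML."""
    soup = BeautifulSoup(html, "lxml")   # "html.parser" also works if lxml is unavailable
    text = soup.get_text(separator=" ")
    links = [urldefrag(urljoin(url, a["href"]))[0]
             for a in soup.find_all("a", href=True)]
    return text, links
```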
- (optional) save the URL / webpage content to local disk in the scraper
- for the crawler requirements, first monitor where the crawler goes, then adjust its behavior accordingly