CS 121 Project Spidermen


Welcome to the 🕷️ men 121 crawler wiki!

Project Requirements

Stage 2

  1. Crawl the following web domains
  • *.ics.uci.edu/*
  • *.cs.uci.edu/*
  • *.informatics.uci.edu/*
  • *.stat.uci.edu/*
  2. For crawled web pages, extract the following information (see the statistics sketch after this list)
  • # of unique pages (based on the URL, ignoring fragments)
  • longest page (highest # of words, ignoring HTML markup)
  • 50 most common words (ignoring stop words), ordered by frequency
  • *.ics.uci.edu/* subdomains, ordered alphabetically, with the # of unique pages per subdomain
  • list of URLs scraped (to add to the frontier)
  3. Follow the scraper requirements (see the URL-handling sketch after this list)
  • honor the politeness requirement
  • implement an is_valid check to verify a web page falls within the domains above
  • defragment returned URLs
  • transform relative URLs into absolute URLs
  4. Follow the crawler requirements (see the heuristics sketch after this list)
  • only crawl pages with high textual information content
  • detect + avoid infinite traps
  • detect + avoid sets of similar pages with no information
  • detect redirects + index the redirected content
  • detect + avoid dead URLs (200 status but no content)
  • detect + avoid crawling very large files with low information value
  5. Development requirements
  • add comments throughout code
  6. Extra Credit (see the near-duplicate sketch after this list)
  • utilize GitHub (+1)
  • check and use robots.txt / sitemap files (+1)
  • webpage similarity detection (exact / near) (+2)
  • crawler multithreading (+5)
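
A minimal sketch of the statistics from requirement 2, assuming the page's visible text has already been extracted (HTML markup stripped) and that a real stop-word list is loaded elsewhere; the names below (`record_page`, `STOP_WORDS`, etc.) are placeholders, not part of the assignment's starter code:

```python
import re
from collections import Counter
from urllib.parse import urlparse

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to"}  # placeholder; load a full list

seen_urls = set()            # defragmented URLs, for the unique-page count
word_counts = Counter()      # running word frequencies across all pages
longest_page = ("", 0)       # (url, word count) of the longest page so far
subdomain_pages = Counter()  # unique-page counts per *.ics.uci.edu subdomain

def record_page(url: str, text: str) -> None:
    """Update the running statistics with one crawled page."""
    global longest_page
    if url in seen_urls:     # only count each defragmented URL once
        return
    seen_urls.add(url)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if len(tokens) > longest_page[1]:
        longest_page = (url, len(tokens))
    word_counts.update(t for t in tokens if t not in STOP_WORDS)
    host = urlparse(url).hostname or ""
    if host.endswith(".ics.uci.edu"):  # subdomains of ics.uci.edu
        subdomain_pages[host] += 1

def report() -> None:
    """Print the report items in the order the requirements list them."""
    print("unique pages:", len(seen_urls))
    print("longest page:", longest_page)
    print("50 most common words:", word_counts.most_common(50))
    for host, count in sorted(subdomain_pages.items()):  # alphabetical
        print(f"{host}, {count}")
```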
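
For requirement 3, the URL handling fits in the standard library; a sketch assuming the scraper receives raw `href` values (the regex and function names are illustrative, not the starter code's API):

```python
import re
from urllib.parse import urljoin, urldefrag, urlparse

# matches ics.uci.edu, cs.uci.edu, informatics.uci.edu, stat.uci.edu and their subdomains
ALLOWED_HOST = re.compile(r"^(.+\.)?(ics|cs|informatics|stat)\.uci\.edu$", re.IGNORECASE)

def normalize(base_url: str, href: str) -> str:
    """Make a link absolute and strip its #fragment."""
    absolute = urljoin(base_url, href)       # relative -> absolute
    defragged, _fragment = urldefrag(absolute)
    return defragged

def is_valid(url: str) -> bool:
    """Keep only http(s) URLs inside the allowed domains."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return bool(parsed.hostname and ALLOWED_HOST.match(parsed.hostname))
```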
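
The crawler requirements in item 4 are mostly heuristics; the thresholds below and the idea of keying trap detection on a URL's host and leading path segments are assumptions to tune after watching the crawler run, not values from the spec:

```python
from collections import Counter
from urllib.parse import urlparse

MAX_BYTES = 2_000_000   # assumed cutoff for "very large" files
MIN_WORDS = 50          # assumed cutoff for "low textual information"
path_counts = Counter() # how often each (host, path prefix) has been visited

def looks_like_trap(url: str, limit: int = 100) -> bool:
    """Flag URLs whose host + leading path segments repeat excessively,
    e.g. a calendar that links to the next month forever."""
    parsed = urlparse(url)
    key = (parsed.hostname, "/".join(parsed.path.split("/")[:3]))
    path_counts[key] += 1
    return path_counts[key] > limit

def worth_indexing(raw_bytes: bytes, word_count: int) -> bool:
    """Skip huge files and near-empty pages (e.g. a 200 response with no content)."""
    return len(raw_bytes) <= MAX_BYTES and word_count >= MIN_WORDS
```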
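
For the similarity extra credit, exact duplicates can be caught with a plain hash of the page text; near duplicates need a fingerprint such as simhash. A small sketch (the 64-bit size and the distance threshold are assumptions for illustration):

```python
import hashlib

def simhash(tokens, bits: int = 64) -> int:
    """Build a simhash fingerprint: similar token sets yield nearby hashes."""
    v = [0] * bits
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Two pages are near duplicates if their fingerprints differ in few bits."""
    return bin(a ^ b).count("1") <= max_distance
```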

Hints / Tips

  • use an external library to parse HTML responses (e.g., BeautifulSoup, lxml); a parsing sketch follows this list
  • (optional) save each URL / webpage to local disk in the scraper
  • for the crawler requirements, first monitor where the crawler goes, then adjust its behavior
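
A sketch combining the first two hints, assuming the raw response body is available as bytes; `parse_page` and `cache_dir` are made-up names for illustration:

```python
from hashlib import sha256
from pathlib import Path

from bs4 import BeautifulSoup

def parse_page(url: str, body: bytes, cache_dir: str = "pages"):
    """Extract visible text and outgoing links; optionally cache the raw page."""
    soup = BeautifulSoup(body, "lxml")  # lxml backend; "html.parser" also works
    text = soup.get_text(separator=" ")
    links = [a["href"] for a in soup.find_all("a", href=True)]
    # optional: save the page to local disk so reruns don't re-download it
    Path(cache_dir).mkdir(exist_ok=True)
    (Path(cache_dir) / (sha256(url.encode()).hexdigest() + ".html")).write_bytes(body)
    return text, links
```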