MIDA: A Tool for Measuring the Web

MIDA is a general-purpose tool for web measurement projects. It is written in Go on top of Chrome/Chromium and the DevTools protocol, which gives it a realistic vantage point for studying the web and fine-grained access to the information exposed by Chrome Developer Tools.

Getting Started

Getting started with MIDA is easy! First, install:

$ wget files.mida.sprai.org/setup.py
$ sudo python3 setup.py

Now we are ready to visit a site and collect some data:

$ mida go www.illinois.edu

You can find the results of your crawl in the results/ directory.

Easy At-Scale Crawling

A major benefit of MIDA is that it can run large-scale, highly configurable crawls without requiring you to write your own crawler code. Here is an example of a single MIDA command that crawls the Alexa Top 100K and gathers a few specific types of data:

$ mida go -f https://files.mida.sprai.org/toplists/alexa.lst -n100000 -c8 --all-resources --screenshot --dom

Breaking this down by argument:

-f https://files.mida.sprai.org/toplists/alexa.lst: The list of sites to crawl, here a list of the Alexa Top Websites. The list can be read from a local file or fetched from a URL.

-n100000: Read the top 100,000 entries from the list

-c8: Run with 8 parallel crawlers (browser instances)

--all-resources: Gather all of the actual files/resources required to render the web page. Beware, this takes a lot of space!

--screenshot: Capture a screenshot of each website after its load event fires, if it fires.

--dom: Capture a JSON representation of the DOM for each website visited.
