Wikipedia_webCrawler_game

A simple Wikipedia web crawler that randomly hops from page A to page B.


HOW TO SET IT UP

⚠️ You MUST have BeautifulSoup4 installed in order to run the code. A requirements.txt will be provided once the project is in a satisfactory state. ⚠️

1. Clone the GitHub repo: https://github.com/ErvinGrimaldi03/Wikipedia_webCrawler_game.git
2. Install BeautifulSoup4 by executing the following in a terminal (not the Python interpreter):

   pip install beautifulsoup4

The crawler also imports requests; if it is not already present, install it the same way (pip install requests). The re and random modules ship with Python.

What is this?

This software is a simple Wikipedia web crawler. Starting from any valid Wikipedia page, it selects 5 random wikiPages (valid Wikipedia articles) and keeps exploring the site from there.


How is it made?

The software is entirely written in Python 3. The main libraries used are BeautifulSoup4, requests, re, and random: BeautifulSoup parses the retrieved pages, re filters valid wikiPages through regular expressions, requests handles the connection between the user and the server, and random selects the next wikiPage to visit. The software retrieves the HTML of the current page and produces a list of valid Wikipedia links; that list is then used to choose a random new link, and the process repeats until a stop criterion is met. Future updates may introduce new libraries and functionality. A minimal sketch of this loop is shown below.
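The sketch assumes the English Wikipedia and a fixed number of hops as the stop criterion; the function names and the regular expression are illustrative, not the repository's actual identifiers.

```python
import random
import re

import requests
from bs4 import BeautifulSoup

# Internal article links look like /wiki/Web_crawler; this pattern skips
# namespaced pages (/wiki/File:..., /wiki/Help:...) and section anchors.
WIKI_ARTICLE_RE = re.compile(r"^/wiki/[^:#]+$")


def get_article_links(url):
    """Return the valid Wikipedia article links found on `url`."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = {
        "https://en.wikipedia.org" + a["href"]
        for a in soup.find_all("a", href=True)
        if WIKI_ARTICLE_RE.match(a["href"])
    }
    return sorted(links)


def crawl(start_url, max_steps=5):
    """Hop from article to article, picking the next one at random."""
    current = start_url
    for _ in range(max_steps):  # stop criterion: a fixed number of hops
        print(current)
        links = get_article_links(current)
        if not links:
            break
        current = random.choice(links)


if __name__ == "__main__":
    crawl("https://en.wikipedia.org/wiki/Web_crawler")
```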


Can I just download the dataset without running the program?

Not at the moment! Unfortunately, the dataset is not ready yet. As soon as it is available, I will work out a way to let you download the data I have gathered.


Who can use it?

Most of the time, the answer is: anyone. This is open-source software, and everyone is free to use it. However, if you are unsure about this kind of matter, I recommend simply contacting me in private. If you'd like to use it, I'd love to hear about it; feel free to shoot me an email.


Why this?

While web crawling is not a complicated task, handling and storing massive amounts of data is. My intention is to map out Wikipedia and generate a graph of every wikiPage article. This is no easy task: as of today (01/26/2024), there are over 6.7 million articles [1]. The hardest part is efficient graph search and visualization. The current Python side of the project mostly mines the needed data, while the user interface still has to be designed. My learning objectives are: efficient web crawling, multithreading/multiprocessing, very large data visualization, very large database management, and data analytics.
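As one illustration of the graph-mapping idea, the crawl results could be stored as an adjacency list, with each article mapped to its outgoing article links. The schema and file name below are assumptions for the sketch, not the project's actual storage format.

```python
import json
from collections import defaultdict

# Adjacency list: article URL -> list of outgoing article URLs.
graph = defaultdict(list)


def record_page(page, outgoing_links):
    """Record the outgoing edges discovered while crawling `page`."""
    graph[page].extend(outgoing_links)


def save_graph(path="wiki_graph.json"):  # hypothetical file name
    """Persist the adjacency list so later runs can extend the map."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(graph, f)
```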

Footnotes

  1. Wikimedia Foundation. (2024, January 20). Size of Wikipedia. Wikipedia. https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#:~:text=As%20of%2026%20January%202024,number%20of%20pages%20is%2059%2C868%2C987.
