Crawler system
There are two ways to crawl websites and save the indexes:
- Multithreaded approach
- Distributed Crawler system
The multithreaded crawler is implemented in the Phantom class in the src/phantom.py file. It uses multiple threads to crawl websites concurrently, which significantly speeds up the crawling process.
Here's a brief overview of how it works:
- The Phantom class is initialized with a list of URLs to crawl, the number of threads to use, and other optional parameters such as whether to show logs, print logs, and a burnout time after which the crawler stops.
- The run method starts the crawling process. It generates the specified number of threads and starts them. Each thread runs the crawler method with a unique ID and a randomly chosen URL from the provided list.
- The crawler method is the heart of the crawler (a minimal sketch of this loop follows the list). It starts with a queue containing the initial URL and continuously pops URLs from the queue, fetches their content, and adds their neighbors (links on the page) to the queue. It also keeps track of visited URLs to avoid revisiting them. The content of each visited URL is stored in a Storage object.
- The Parser class is used to fetch and parse the content of a URL. It uses the BeautifulSoup library to parse the HTML content, extract the text and the links, and clean the URLs.
- The Storage class is used to store the crawled data. It stores the data in a dictionary and can save it to a JSON file.
- The stop method can be used to stop the crawling process. It sets a kill flag that causes the crawler methods to stop, waits for all threads to finish, and then saves the crawled data and prints some statistics.
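To make the threading model concrete, here is a minimal sketch of the kind of crawl loop described above. It is illustrative only, not the actual Phantom implementation; the class and parameter names are placeholders, while the queue handling, visited set, kill flag, and BeautifulSoup parsing mirror the behaviour described in the list.

```python
import threading
import random
import requests
from collections import deque
from urllib.parse import urljoin
from bs4 import BeautifulSoup

class MiniCrawler:
    """Illustrative multithreaded crawler; not the project's Phantom class."""

    def __init__(self, urls, num_threads=4):
        self.urls = urls
        self.num_threads = num_threads
        self.visited = set()
        self.lock = threading.Lock()
        self.kill = False
        self.data = {}  # url -> extracted text (stand-in for the Storage object)

    def crawler(self, tid, start_url):
        queue = deque([start_url])
        while queue and not self.kill:
            url = queue.popleft()
            with self.lock:
                if url in self.visited:
                    continue          # skip URLs another thread already handled
                self.visited.add(url)
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue
            soup = BeautifulSoup(html, "html.parser")
            self.data[url] = soup.get_text(" ", strip=True)
            # neighbours: every link found on the page, resolved to absolute URLs
            for a in soup.find_all("a", href=True):
                queue.append(urljoin(url, a["href"]))

    def run(self):
        threads = [
            threading.Thread(target=self.crawler, args=(i, random.choice(self.urls)))
            for i in range(self.num_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```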
You can start the program by running the src/phantom.py script. It uses phantom_engine.py to crawl the sites using multiple threads.
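Assuming a constructor along the lines described above (the exact parameter names are not shown in this overview, so treat them as placeholders), usage might look like this:

```python
from src.phantom import Phantom

# Parameter names here are assumptions based on the description above.
phantom = Phantom(
    urls=["https://example.com"],  # seed URLs to crawl
    num_threads=4,                 # worker threads
    burnout=60,                    # stop after 60 seconds
)
phantom.run()   # spawn the threads and start crawling
phantom.stop()  # set the kill flag, join threads, save data, print stats
```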
The distributed crawler system uses a master-slave architecture to coordinate multiple crawlers. The master node is implemented in the phantom_master.py file, and the slave nodes are implemented in the phantom_child.py file. They communicate using sockets.
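The master/slave split can be pictured with a minimal socket skeleton: one accept loop on the master and one handler thread per connected crawler. This is only a sketch of the pattern, with hypothetical host, port, and message strings; it is not the project's actual wire protocol.

```python
import socket
import threading

def handle_client(conn, addr):
    """One thread per connected crawler: receive requests and react to them."""
    while True:
        msg = conn.recv(1024).decode()
        if not msg or msg == "close":
            break
        print(f"{addr} says: {msg}")
    conn.close()

def serve(host="localhost", port=8080, num_clients=2):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(num_clients)
    for _ in range(num_clients):
        conn, addr = server.accept()
        threading.Thread(target=handle_client, args=(conn, addr), daemon=True).start()
```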
The phantom_master.py file contains the Server class, which is the master node in the distributed crawler system. It manages the slave nodes (crawlers) and assigns them websites to crawl.
Here's a brief overview:
- The Server class is initialized with the host and port to listen on, the number of clients (crawlers) to accept, and a burnout time after which the crawlers stop.
- The run method starts the server. It creates a socket, binds it to the specified host and port, and starts listening for connections. It accepts connections from the crawlers, starts a new thread to handle each crawler, and adds the crawler to its list of clients.
- The handle_client method is used to handle a crawler. It continuously receives requests from the crawler and processes them. If a crawler sends a "close" request, it removes the crawler from its list of clients. If a crawler sends a "status" request, it updates its status.
- The status method is used to print the status of the server and the crawlers. It prints the list of crawlers and their statuses.
- The send_message method is used to send a message to a specific crawler. If an error occurs while sending the message, it removes the crawler from its list of clients.
- The assign_sites method is used to assign websites to the crawlers. It either assigns each website to a different crawler or assigns all websites to all crawlers, depending on the remove_exist parameter.
- The generate method is used to generate the websites to crawl. It asks the user to enter the websites, assigns them to the crawlers, and starts the crawlers.
- The start method is used to start the server. It starts the server in a new thread and then enters a command loop where it waits for user commands. The user can enter commands to get the status of the server, broadcast a message to all crawlers, send a message to a specific crawler, stop the server, generate websites, assign websites to crawlers, and merge the crawled data.
- The merge method is used to merge the data crawled by the crawlers (see the sketch after this list). It merges the index and title data from all crawlers into a single index file and a single title file and deletes the old files.
- The stop method is used to stop the server. It sends a "stop" message to all crawlers, stops the server thread, and closes the server socket.
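As an illustration of the merge step, the sketch below combines per-crawler JSON index files into a single file and removes the originals. The file-name pattern is hypothetical; the real file layout is defined in phantom_master.py.

```python
import glob
import json
import os

def merge_indexes(pattern="index_*.json", out_file="index.json"):
    """Merge per-crawler index files into one file and delete the old files."""
    merged = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            merged.update(json.load(f))
        os.remove(path)  # delete the old per-crawler file
    with open(out_file, "w") as f:
        json.dump(merged, f, indent=2)
```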
You can start the server by running the phantom_master.py script. It will start listening for connections from crawlers, and you can then enter commands to control the crawlers.
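A start-up sketch, assuming a constructor that takes the host, port, and number of clients (the actual signature in phantom_master.py may differ):

```python
from phantom_master import Server

# Argument names are assumptions based on the description above.
server = Server(host="localhost", port=8080, num_clients=2)
server.start()  # run the server thread and enter the interactive command loop
```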
The phantom_child.py file contains the Crawler and Storage classes, which implement the slave nodes in the distributed crawler system.
Here's a brief overview:
- The Crawler class is initialized with the host and port of the server. It creates a socket and connects to the server. It also initializes several other attributes, such as its ID, a list of threads, and several flags.
- The connect method is used to connect the crawler to the server. It starts a new thread to listen to the server and enters a command loop where it waits for user commands. The user can enter commands to stop the crawler, send a message to the server, get the status of the crawler, toggle the running state of the crawler, and store the crawled data.
- The listen_to_server method is used to listen to the server. It continuously receives messages from the server and processes them (see the sketch after this list). If the server sends a "stop" message, it stops the crawler. If the server sends a "setup" message, it sets up the crawler. If the server sends a "status" message, it prints the status of the crawler. If the server sends an "append" message, it adds URLs to the queue. If the server sends a "restart" message, it reinitializes the crawler. If the server sends a "crawl" message, it starts crawling.
- The setup method is used to set up the crawler. It sets the URL to crawl and the burnout time.
- The add_queue method is used to add URLs to the queue.
- The initialize method is used to initialize the crawler. It initializes several attributes, such as the list of local URLs, the queue, the start time, and the parser.
- The crawl method is used to start crawling. It continuously pops URLs from the queue, parses them, and adds the parsed data to the storage. It also adds the neighbors of the current URL to the queue. If the burnout time is reached, it stops crawling.
- The store method is used to store the crawled data. It saves the index and title data to the storage.
- The stop method is used to stop the crawler. It sets the kill flag, clears the traversed URLs, joins all threads, sends a "close" message to the server, and closes the client socket.
- The send method is used to send a message to the server.
- The status method is used to print the status of the crawler.
- The Storage class is used to store the crawled data. It is initialized with a filename and a dictionary to store the data. The add method is used to add data to the storage, and the save method is used to save the data to a file.
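To make the message handling concrete, here is an illustrative dispatch loop of the kind listen_to_server implements. The plain-string message format, the attribute names, and the decision to run crawling on its own thread are all assumptions; the real protocol lives in phantom_child.py.

```python
import threading

def listen_to_server(crawler, sock):
    """Illustrative dispatch loop: react to commands sent by the master node."""
    while not crawler.kill:
        msg = sock.recv(1024).decode()
        if not msg or msg.startswith("stop"):
            crawler.stop()
            break
        elif msg.startswith("setup"):
            crawler.setup(msg)       # set the URL to crawl and the burnout time
        elif msg.startswith("status"):
            crawler.status()
        elif msg.startswith("append"):
            crawler.add_queue(msg)   # add URLs to the queue
        elif msg.startswith("restart"):
            crawler.initialize()
        elif msg.startswith("crawl"):
            # crawling runs on its own thread so the crawler keeps listening
            threading.Thread(target=crawler.crawl).start()
```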
You can start a crawler by creating an instance of the Crawler class and calling the connect method. The crawler will connect to the server and start listening for commands.
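A minimal start-up sketch, assuming the constructor takes the server's host and port (the actual argument names may differ):

```python
from phantom_child import Crawler

# Host and port are placeholders for wherever phantom_master.py is listening.
crawler = Crawler(host="localhost", port=8080)
crawler.connect()  # connect to the server and start listening for commands
```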