Crawler system

There are two ways you can crawl websites and save the resulting indexes:

  1. Multithreaded approach
  2. Distributed Crawler system

1) Multithreaded Crawlers

The multithreaded crawler is implemented in the Phantom class in the src/phantom.py file. It uses multiple threads to crawl websites concurrently, which significantly speeds up the crawling process.

Here's a brief overview of how it works:

  • The Phantom class is initialized with a list of URLs to crawl, the number of threads to use, and other optional parameters, such as flags for showing and printing logs and a burnout time after which the crawler stops.

  • The run method starts the crawling process. It generates the specified number of threads and starts them. Each thread runs the crawler method with a unique ID and a randomly chosen URL from the provided list.

  • The crawler method is the heart of the crawler. It starts with a queue containing the initial URL and continuously pops URLs from the queue, fetches their content, and adds their neighbors (links on the page) to the queue. It also keeps track of visited URLs to avoid revisiting them. The content of each visited URL is stored in a Storage object (a sketch of this loop appears after this list).

  • The Parser class is used to fetch and parse the content of a URL. It uses the BeautifulSoup library to parse the HTML content, extract the text and the links, and clean the URLs.

  • The Storage class is used to store the crawled data. It stores the data in a dictionary and can save it to a JSON file.

  • The stop method can be used to stop the crawling process. It sets a kill flag that causes the crawler methods to stop, waits for all threads to finish, and then saves the crawled data and prints some statistics.
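
A rough sketch of the crawl loop and parsing steps described above is shown below. It is a simplified stand-in, not the project's Phantom, Parser, or Storage code; requests and BeautifulSoup are assumed to be available.

```python
# Minimal sketch of the queue-based crawl loop described above.
# Helper names and structure are illustrative, not the project's exact API.
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(start_url, burnout=60):
    queue = deque([start_url])
    visited = set()
    storage = {}                      # stands in for the Storage object
    start = time.time()

    while queue and time.time() - start < burnout:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue

        # Parse the page, keep its text, and queue its neighbours (links)
        soup = BeautifulSoup(html, "html.parser")
        storage[url] = soup.get_text(separator=" ", strip=True)
        for link in soup.find_all("a", href=True):
            neighbour = urljoin(url, link["href"])
            if neighbour not in visited:
                queue.append(neighbour)

    return storage
```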

You can start the program by running the src/phantom.py script. It uses phantom_engine.py to crawl the sites with multiple threads.
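
If you want to drive the crawler from your own script instead, a usage sketch could look like the following. The argument names are assumptions based on the description above, not the exact Phantom signature.

```python
# Illustrative only: argument names are assumptions, not the exact Phantom API.
from phantom import Phantom

phantom = Phantom(
    urls=["https://example.com"],  # seed URLs to crawl
    num_threads=4,                 # worker threads crawling concurrently
    burnout=60,                    # stop crawling after 60 seconds
    show_logs=True,
)
phantom.run()    # spawn the threads and start crawling
phantom.stop()   # set the kill flag, join threads, save data, print stats
```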

2) Distributed Crawler system

The distributed crawler system uses a master-slave architecture to coordinate multiple crawlers. The master node is implemented in the phantom_master.py file, and the slave nodes are implemented in the phantom_child.py file. They communicate using sockets.

Phantom Master

The phantom_master.py file contains the Server class, which is the master node in the distributed crawler system. It manages the slave nodes (crawlers) and assigns them websites to crawl.

Here's a brief overview:

  • The Server class is initialized with the host and port to listen on, the number of clients (crawlers) to accept, and a burnout time after which the crawlers stop.

  • The run method starts the server. It creates a socket, binds it to the specified host and port, and starts listening for connections. It accepts connections from the crawlers, starts a new thread to handle each crawler, and adds the crawler to its list of clients (a sketch of this accept loop appears after this list).

  • The handle_client method is used to handle a crawler. It continuously receives requests from the crawler and processes them. If a crawler sends a "close" request, the server removes the crawler from its list of clients. If a crawler sends a "status" request, the server updates that crawler's recorded status.

  • The status method prints the status of the server and lists each crawler along with its current status.

  • The send_message method is used to send a message to a specific crawler. If an error occurs while sending the message, it removes the crawler from its list of clients.

  • The assign_sites method is used to assign websites to the crawlers. It either assigns each website to a different crawler or assigns all websites to all crawlers, depending on the remove_exist parameter.

  • The generate method is used to generate the websites to crawl. It asks the user to enter the websites, assigns them to the crawlers, and starts the crawlers.

  • The start method is used to start the server. It starts the server in a new thread and then enters a command loop where it waits for user commands. The user can enter commands to get the status of the server, broadcast a message to all crawlers, send a message to a specific crawler, stop the server, generate websites, assign websites to crawlers, and merge the crawled data.

  • The merge method is used to merge the data crawled by the crawlers. It merges the index and title data from all crawlers into a single index and title file and deletes the old files.

  • The stop method is used to stop the server. It sends a "stop" message to all crawlers, stops the server thread, and closes the server socket.
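
The accept loop in the run method follows the common pattern of one handler thread per connection. The sketch below is a simplified illustration, not the Server class itself; the protocol strings and structure are assumptions.

```python
# Simplified sketch of a master node that accepts crawler connections and
# handles each one on its own thread. Names and protocol are illustrative.
import socket
import threading


def handle_client(conn, addr, clients):
    while True:
        data = conn.recv(1024).decode()
        if not data or data == "close":
            clients.remove(conn)          # drop the crawler from the client list
            conn.close()
            break
        elif data.startswith("status"):
            print(f"{addr} reports: {data}")


def serve(host="0.0.0.0", port=5000, n_clients=2):
    clients = []
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(n_clients)
    while len(clients) < n_clients:
        conn, addr = server.accept()
        clients.append(conn)
        threading.Thread(target=handle_client,
                         args=(conn, addr, clients), daemon=True).start()
    return server, clients
```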

You can start the server by running the phantom_master.py script. It will start listening for connections from crawlers, and you can then enter commands to control them.
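
The merge command described above essentially combines each crawler's JSON output into a single file. A minimal sketch, assuming the per-crawler index files match a pattern like index_*.json (the real file names may differ):

```python
# Illustrative sketch of merging per-crawler index files into one.
# The glob pattern and output name are assumptions, not the real file layout.
import glob
import json
import os


def merge(pattern="index_*.json", out_file="index.json"):
    merged = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            merged.update(json.load(f))   # later files overwrite duplicate keys
        os.remove(path)                   # delete the old per-crawler file
    with open(out_file, "w") as f:
        json.dump(merged, f)
```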

Phantom Child

The phantom_child.py file contains the Crawler and Storage classes, which implement the slave nodes in the distributed crawler system.

Here's a brief overview:

  • The Crawler class is initialized with the host and port of the server. It creates a socket and connects to the server. It also initializes several other attributes, such as its ID, a list of threads, and several flags.

  • The connect method is used to connect the crawler to the server. It starts a new thread to listen to the server and enters a command loop where it waits for user commands. The user can enter commands to stop the crawler, send a message to the server, get the status of the crawler, toggle the running state of the crawler, and store the crawled data.

  • The listen_to_server method is used to listen to the server. It continuously receives messages from the server and acts on them: a "stop" message stops the crawler, "setup" configures it, "status" prints its status, "append" adds URLs to the queue, "restart" reinitializes the crawler, and "crawl" starts crawling (see the sketch after this list).

  • The setup method is used to set up the crawler. It sets the URL to crawl and the burnout time.

  • The add_queue method is used to add URLs to the queue.

  • The initialize method is used to initialize the crawler. It initializes several attributes, such as the list of local URLs, the queue, the start time, and the parser.

  • The crawl method is used to start crawling. It continuously pops URLs from the queue, parses them, and adds the parsed data to the storage. It also adds the neighbors of the current URL to the queue. If the burnout time is reached, it stops crawling.

  • The store method is used to store the crawled data. It saves the index and title data to the storage.

  • The stop method is used to stop the crawler. It sets the kill flag, clears the traversed URLs, joins all threads, sends a "close" message to the server, and closes the client socket.

  • The send method is used to send a message to the server.

  • The status method is used to print the status of the crawler.

  • The Storage class is used to store the crawled data. It is initialized with a filename and a dictionary to store the data. The add method is used to add data to the storage. The save method is used to save the data to a file.
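
The listen_to_server method is essentially a receive-and-dispatch loop over the socket. The sketch below is illustrative: the message strings mirror the ones described above, but the exact wire format and handler signatures are assumptions.

```python
# Illustrative sketch of the slave's listen-and-dispatch loop.
# Message strings and handler calls are assumptions, not the exact protocol.
import socket


def listen_to_server(sock: socket.socket, crawler):
    while True:
        message = sock.recv(1024).decode()
        if not message or message == "stop":
            crawler.stop()
            break
        elif message.startswith("setup"):
            crawler.setup(message)       # set the URL to crawl and the burnout time
        elif message == "status":
            crawler.status()
        elif message.startswith("append"):
            crawler.add_queue(message)   # add URLs to the crawl queue
        elif message == "restart":
            crawler.initialize()
        elif message == "crawl":
            crawler.crawl()
```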

You can start a crawler by creating an instance of the Crawler class and calling the connect method. The crawler will connect to the server and start listening for commands.
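
For example, a start-up snippet could look like the following; the keyword arguments are assumptions about the Crawler constructor, not its exact signature.

```python
# Illustrative only: argument names are assumptions about the Crawler API.
from phantom_child import Crawler

crawler = Crawler(host="127.0.0.1", port=5000)  # address of the master node
crawler.connect()  # connect to the server and start listening for commands
```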
