# CrawlWave

An obsolete, small-scale distributed web crawler written in C#

Written back in 2003-2004, CrawlWave is a distributed web crawler primarily targeted at Greek web pages.

## Current Status

Given that the project was targeted at .NET Framework 2.0 using the technologies available at the time, it may not compile cleanly under newer versions of the framework. Effort has been put into making it build successfully under MS Visual Studio 2010, but further modernization is required.

Moreover, since it was a student project (the outcome of an MSc Thesis), it suffers from many design issues. Some of them are described in the following TODO list.

## TODO List

1. Create a proper DB access layer based on System.Data.Common, or employ a NoSQL DB like MongoDB (see the first sketch after this list)
2. Remove the Web services and replace them with WCF / remoting interfaces or an AMQP broker
3. Remove code duplication, revise inheritance
4. Revise locking / synchronization mechanisms and queue management
5. Fix Singleton-itis
6. Allow plugins to target specific phases of processing
7. Revise caching mechanisms
8. Build a proper 'Url-Seen' filter (see the second sketch after this list)
9. Integrate the operations of common plugins, like the DBUpdater and UrlSelection, into the server core, and allow plugins to redefine aspects of these operations
10. Revise the logging mechanism
11. Use a full-blown HTML parser, like HtmlAgilityPack (see the third sketch after this list)
12. Convert CrawlWave.Client to a Windows service and integrate the functionality provided by CrawlWave.Scheduler into it
13. Create a simple launcher/updater for the Client
14. Implement plugin lifecycle management (install, uninstall, activate, deactivate, etc.)
15. Use generics for collections and other interfaces
16. Implement content extraction from other sources
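
As an illustration of item 1, the first sketch below shows what a provider-agnostic data access call could look like on top of System.Data.Common (available since .NET Framework 2.0). The provider name, connection string, and table/column names are placeholders for this example and do not come from the CrawlWave code base.

```csharp
// Sketch only: a DB call written against the System.Data.Common abstractions
// rather than a concrete ADO.NET provider. Provider name, connection string
// and SQL below are placeholders.
using System;
using System.Data.Common;

public static class CrawlDb
{
    // In a real refactoring these would come from configuration.
    private const string ProviderName = "System.Data.SqlClient";
    private const string ConnectionString =
        "Data Source=.;Initial Catalog=CrawlWave;Integrated Security=True";

    public static int CountPendingUrls()
    {
        DbProviderFactory factory = DbProviderFactories.GetFactory(ProviderName);
        using (DbConnection connection = factory.CreateConnection())
        {
            connection.ConnectionString = ConnectionString;
            connection.Open();
            using (DbCommand command = connection.CreateCommand())
            {
                // Hypothetical table/column names; the point is that no
                // SqlClient-specific type appears anywhere in the method.
                command.CommandText = "SELECT COUNT(*) FROM UrlQueue WHERE State = @state";
                DbParameter state = command.CreateParameter();
                state.ParameterName = "@state";
                state.Value = 0;
                command.Parameters.Add(state);
                return Convert.ToInt32(command.ExecuteScalar());
            }
        }
    }
}
```

Swapping SQL Server for another relational backend would then only require changing the provider invariant name and connection string (plus any non-portable SQL).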
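
For item 8, a minimal in-memory 'Url-Seen' filter might look like the following. The class and method names are illustrative; a production version would hash the normalized URL and back it with a Bloom filter or a persistent store instead of a plain HashSet.

```csharp
// Sketch of a 'Url-Seen' filter: normalize each URL and remember it, so the
// crawler never queues the same page twice. Names are illustrative only.
using System;
using System.Collections.Generic;

public class UrlSeenFilter
{
    private readonly HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);
    private readonly object syncRoot = new object();

    // Returns true the first time a URL is offered, false on every later attempt.
    public bool TryAdd(string url)
    {
        string key = Normalize(url);
        lock (syncRoot)
        {
            return seen.Add(key);
        }
    }

    private static string Normalize(string url)
    {
        var uri = new Uri(url);
        // Lower-case scheme and host, drop the fragment, keep path and query as-is.
        return uri.Scheme.ToLowerInvariant() + "://" + uri.Host.ToLowerInvariant() +
               (uri.IsDefaultPort ? string.Empty : ":" + uri.Port) +
               uri.PathAndQuery;
    }
}
```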
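
For item 11, link extraction with HtmlAgilityPack could look roughly like this. The helper class is hypothetical; only HtmlDocument, SelectNodes and GetAttributeValue are actual HtmlAgilityPack APIs.

```csharp
// Sketch of link extraction with HtmlAgilityPack, replacing hand-rolled
// regex/string parsing of HTML.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    public static List<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var links = new List<Uri>();
        HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            Uri absolute;
            // Resolve relative links against the page's base URI.
            if (Uri.TryCreate(baseUri, href, out absolute))
            {
                links.Add(absolute);
            }
        }
        return links;
    }
}
```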