This project usage is restricted to Educational & Research purposes only. It consists of 2 parts:
Implements a web database classification algorithm that is described in
'QProber: A System for Automatic Classification of Hidden-Web Databases' by Gravano, Ipeirotis, and Sahami
Part 1 tries to classify web databases using a meta searching, whereby a query is made to Bing. This query searches for particular keywords listed in a directory,say root.txt and classifies database in a particular category by looking at the number of documents returned for those particular keywords
Implements a simplified version of content-summary extraction algorithm, which is fully described in
'Classification Aware Hidden-Web Text Database Selection' by Ipeirotis and Gravano
Part 2 generates a dictionary containing all the terms listed on that website.
To run the program:
$ python qproberGeneric.py
Once running, it will ask for the following parameters
- Coverage Threshold
- Specificity Threshold
- Database Url
Here are a few examples of input parameters
Coverage Threshold 100
Specificity Threshold: 0.6
Database Url: www.diabetes.org
Coverage Threshold: 100
Specificity Threshold: 0.6
Database Url: www.microsoft.com