This program serves as software support for my research project. The project focuses on searching for patterns in URLs: I try to find, describe and quantify patterns in URLs. The program could be categorized as a crawler (spider). The following outlines how an experiment with the program proceeds:
- Take an input URL from the linklist.
- Expand it to a tree.
- Try to find pattern(s) in the input URL.
- Try to iterate the pattern (there might be about 1000 iterations per URL); a minimal sketch of this step follows the list.
- Do an operation (log/download).
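
As an illustration of the pattern-finding and iteration steps, here is a minimal, hypothetical sketch. It is not the actual sumid implementation: it simply treats the last numeric run in a URL as the pattern, increments it, and prints each candidate URL (where the real program would log or download it).

```python
import re

def iterate_url_pattern(url, max_iterations=1000):
    """Yield candidate URLs by incrementing the last numeric run in `url`."""
    match = None
    for match in re.finditer(r"\d+", url):
        pass  # keep only the last numeric run found in the URL
    if match is None:
        return  # no numeric pattern found, nothing to iterate
    start, end = match.span()
    width = end - start
    base = int(match.group())
    for i in range(1, max_iterations + 1):
        candidate = str(base + i).zfill(width)
        yield url[:start] + candidate + url[end:]

for candidate in iterate_url_pattern("http://example.com/gallery/img001.jpg", 5):
    print(candidate)  # the real program would log or download the candidate
```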
- The program is intended to run for a very long time (about a week). For that reason it is written in a multithreaded manner, with rather chatty logging.
- The essential configuration is in the file sumid.ini; fine tuning is in settings.py. (An example configuration is sketched below.)
- Set up at least the linklist parameter.
- It is also a good idea to set up the WorkDir and LogDir parameters.
- Currently the script probably won't work under Windows, because of problems with filesystem paths.
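
For illustration only, here is a sketch of what sumid.ini and its parsing might look like. Only the linklist, WorkDir and LogDir option names come from the notes above; the section name and the paths are assumptions.

```python
# Hypothetical example of sumid.ini; the [sumid] section name and the paths
# are assumptions -- only linklist, WorkDir and LogDir are real option names.
import configparser
import io

EXAMPLE_INI = """
[sumid]
linklist = /home/user/sumid/linklist.txt
WorkDir = /home/user/sumid/work
LogDir = /home/user/sumid/logs
"""

config = configparser.ConfigParser()
config.read_file(io.StringIO(EXAMPLE_INI))  # a real run would use config.read("sumid.ini")
print(config.get("sumid", "linklist"))      # -> /home/user/sumid/linklist.txt
```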
Sumid is more a toolbox than a single program, and it would be good to separate it into several smaller pieces. Its current state is the result of how the program evolved to satisfy concrete needs rather than being designed as a product. Currently I am exploring the scrapy framework in order to transfer the core functionality into it.
- sumid.py - the main program, containing the four classes of the producer/consumer pipeline. Each producer runs in a separate thread (a toy sketch of the idea follows this list).
- comptree.py - implements a tree structure for exploring web resources.
- linklist.py - takes care of the input data.
- miscutil.py - holds settings, debugging and some miscellaneous functionality.
- bow.py - implements a bag of words: analyses URLs and looks for the words with the highest frequencies (see the second sketch below).
- sls.py - adaptor to the pydigg library, used to collect links for further experiments.
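
To illustrate the producer/consumer pipeline mentioned for sumid.py, here is a toy sketch. The stage names, queue and sentinel handling are invented for this example and do not correspond to the real sumid classes.

```python
# A toy producer/consumer pipeline: one producer thread feeds a queue,
# one consumer thread performs the final operation (here, just printing).
import queue
import threading

def producer(name, out_queue, items):
    """Each producer stage runs in its own thread and feeds the next stage."""
    for item in items:
        out_queue.put(f"{name}:{item}")
    out_queue.put(None)  # sentinel: tell the consumer we are done

def consumer(in_queue):
    """Final stage: performs the operation (here, just logging to stdout)."""
    while True:
        item = in_queue.get()
        if item is None:
            break
        print("processed", item)

q = queue.Queue()
t_prod = threading.Thread(target=producer, args=("expander", q, ["url1", "url2"]))
t_cons = threading.Thread(target=consumer, args=(q,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
```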
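
A tiny illustration of the bag-of-words analysis described for bow.py: split URLs into words and count frequencies. The tokenisation rule (lower-case alphabetic runs) is an assumption, not necessarily what bow.py really does.

```python
# Count word frequencies across a small set of example URLs.
import re
from collections import Counter

urls = [
    "http://example.com/photos/summer/img001.jpg",
    "http://example.com/photos/winter/img002.jpg",
]

counter = Counter()
for url in urls:
    counter.update(re.findall(r"[a-z]+", url.lower()))  # alphabetic runs only

print(counter.most_common(5))  # words with the highest frequencies
```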