forked from CalculatedContent/cloud-crawler
RELEASE_0.2_REQUIREMENTS
chm 9-aug-2013
cloud-crawler 0.2 release requirements
what we hope to achieve for the 0.2 release
does not cover chef recipes or required gems
A. Stabilize existing features
A.1 fix bloom filter test
A.2 implement url hashing and test in case url gets munged
A.3 review and update all current spec tests where skeletons exist
A.4 more more more examples
A.5 blog post on DSL
include singleton method optimization as a side note
A.6 add open source license to all files, with Calculated Content
A.7 fix PowerPoint to reflect DSL changes and additions
A.8 some way to test an actual deployed crawl
vagrant
regression
heartbeating
regression test / spec test of examples
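
For A.2, a minimal sketch of URL hashing could look like the following. It assumes that "munged" means superficial differences such as host case, a trailing fragment, or a missing root path. The helper names `normalize_url` and `url_key` are hypothetical and not part of the cloud-crawler code, and SHA1 is only one possible choice of digest.

```ruby
require 'uri'
require 'digest'

# Hypothetical helper (not cloud-crawler API): normalize a URL so that
# trivially different spellings of the same address compare equal.
def normalize_url(url)
  uri = URI.parse(url.strip)
  uri.scheme = uri.scheme.downcase if uri.scheme
  uri.host = uri.host.downcase if uri.host
  uri.fragment = nil                          # fragments never reach the server
  uri.path = '/' if uri.path.nil? || uri.path.empty?
  uri.to_s
end

# Hypothetical helper: fixed-length key for the normalized URL, usable as
# a page-store key or as input to a bloom filter.
def url_key(url)
  Digest::SHA1.hexdigest(normalize_url(url))
end
```

A spec for this could assert that two spellings of one URL share a key and that distinct paths do not.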
B. Minor feature additions
B.1 DSL extensions
before/pre_batch
after/post_batch
(super optional)
html pre-processing via tidy, sanitize, etc.
i.e. allow removal of pre tags for extracting lines
at least include these in the gem deploy
random user agent
mozilla user agent
B.2 Driver sugar
add methods to create batch jobs from a list
B.3 Defaults for Trollop
clean this up so the DSLs are cleaner
B.4 Better tracking of batch jobs in qless
actual tracking of individual jobs in a batch so we can monitor
B.5 shutdown
how to shutdown a crawl
B.6 image repo?
B.7 interactive shell console for DSL
need to be able to test and debug on the fly!!!
See: http://doc.scrapy.org/en/latest/intro/overview.html
for more info
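
One way the B.1 batch hooks and the B.2 "batch jobs from a list" sugar could fit together is sketched below. Everything here is an assumption for illustration: the `BatchRunner` class, the `before_batch` / `after_batch` method names, and the `batch_size` parameter are hypothetical, and the sketch runs in-process rather than through qless.

```ruby
# Hypothetical sketch, not the cloud-crawler DSL: a tiny runner that
# accepts before/after hooks and slices a flat URL list into batches.
class BatchRunner
  def initialize(&block)
    @before = []
    @after = []
    instance_eval(&block) if block   # evaluate the DSL block against this runner
  end

  # Register a hook that runs before each batch; receives the batch.
  def before_batch(&hook)
    @before << hook
  end

  # Register a hook that runs after each batch; receives batch and results.
  def after_batch(&hook)
    @after << hook
  end

  # Split urls into batches and pass each url to the caller's block,
  # wrapping every batch in the registered hooks.
  def run(urls, batch_size: 2)
    urls.each_slice(batch_size).map do |batch|
      @before.each { |hook| hook.call(batch) }
      results = batch.map { |url| yield url }
      @after.each { |hook| hook.call(batch, results) }
      results
    end
  end
end

log = []
runner = BatchRunner.new do
  before_batch { |batch| log << "start #{batch.size}" }
  after_batch  { |_batch, results| log << "done #{results.size}" }
end
out = runner.run(%w[a b c]) { |url| url.upcase }
```

In a real deployment each batch would presumably become a queued job instead of an in-process loop; the hooks would then run on the worker that picks up the batch.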
C. Features for Startup Signal
C.1 API / JSON support
allow crawling a rest-based api that returns json
store json as raw document in page store, but as json, not page
=> might as well store page as html, not processed with page object !
=> store xml, html, or json
support with different dsl call: fetch_json()
=> dsl core not used...different core !!!
=> what is the DSL? should support general api using ruby meta-magic for get/post/rest apis
=> json metamagic necessary
C.1.* how the f*ck do we test?
need a mock test AND a regression test
C.2 Sinatra json access endpoint on the master
How can we deploy qless-web AND json endpoint on the same sinatra_app?
Can we just extend the base class of the monitor and add a json route?
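
For C.1 and the testing question in C.1.*, one possible shape is sketched below: keep the raw body tagged with its kind, and let the fetcher be injected so a mock test needs no network. The names `RawDocument`, `kind`, and the `fetcher` argument are hypothetical, and a plain Hash stands in for the page store; only the name `fetch_json` comes from the notes above.

```ruby
require 'json'

# Hypothetical raw-document record: keeps the body exactly as fetched,
# tagged as :json, :html, or :xml, and parses only on demand.
RawDocument = Struct.new(:url, :kind, :body) do
  def parsed
    kind == :json ? JSON.parse(body) : body
  end
end

# Hypothetical fetch_json: `fetcher` is any callable mapping a URL to a
# response body, so a spec can pass a lambda instead of an HTTP client.
def fetch_json(url, store, fetcher)
  body = fetcher.call(url)
  JSON.parse(body)   # validate early; raises JSON::ParserError on bad input
  store[url] = RawDocument.new(url, :json, body)
end

# Mock test: a canned response stands in for the remote API.
store = {}
fake = ->(_url) { '{"items": [1, 2, 3]}' }
doc = fetch_json('http://api.example.com/items', store, fake)
```

The regression test mentioned in C.1.* would still need a real or recorded endpoint; the injected fetcher only covers the mock half.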
D. Community Building: best way?
IRC channel
Mailing List
Twitter
StackOverflow and Quora questions
python scrapy has: Healthy community
1,500 watchers, 350 forks on Github (link)
700 followers on Twitter (link)
850 questions on StackOverflow (link)
200 messages per month on mailing list (link)
40-50 users always connected to IRC channel (link)
D.1 video demo, similar to rails
D.2 PowerPoint / SlideShare
D.3 record video talk in a 'venue'
D.4 once ready, try to talk locally
D.5 blog posts
DSL
Performance of Qless
Detailed examples