Py MapReduce
=================
Py MapReduce is a simple single-machine implementation of MapReduce in Python, built on the multiprocessing module.
It can be used, for instance, to quickly parallelize file-processing tasks, e.g. performing an operation on each line of a large file.
CPU-bound operations (regexp matching, etc.) gain little from threads in Python because the Global Interpreter Lock (http://wiki.python.org/moin/GlobalInterpreterLock) lets only one thread execute Python bytecode at a time. Running the work in separate processes with multiprocessing avoids this limitation.
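
As a rough illustration of the point above, independent of Py MapReduce itself, the sketch below counts lines matching a regexp using a pool of worker processes from the standard multiprocessing module. The pattern, the function names (matches, count_matches) and the processes=0 in-process fallback are all assumptions made for this example, not part of the library.

```python
import re
from multiprocessing import Pool

# Hypothetical pattern for illustration: lines containing the token ERROR.
PATTERN = re.compile(r"\bERROR\b")


def matches(line):
    """Return 1 if the line matches PATTERN, else 0 (runs in a worker)."""
    return 1 if PATTERN.search(line) else 0


def count_matches(lines, processes=None, chunksize=1000):
    """Count matching lines using worker processes.

    processes=None lets Pool pick one worker per CPU;
    processes=0 (an assumption of this sketch) skips the pool entirely,
    which is handy for debugging.
    """
    if processes == 0:
        return sum(matches(line) for line in lines)
    with Pool(processes) as pool:
        # imap hands lines to the workers in chunks and yields the
        # results in input order as they become available.
        return sum(pool.imap(matches, lines, chunksize))
```

To use it on a file, call something like count_matches(open(path), processes=4) from under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that spawn rather than fork worker processes.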
Sample job (Word Count)
------------
    class WC(Job):
        "Sample Word count parallel implementation"
        lc = 0
        wc = 0
        bc = 0

        def __init__(self, f):
            self.file = f

        def reduce_start(self):
            self.lc = 0
            self.wc = 0
            self.bc = 0

        def enumerate(self):
            return enumerate(open(self.file))

        def map(self, pos, item):
            return (pos, (1, len(item.split()), len(item)))

        def reduce(self, pos, r):
            (lc, wc, bc) = r
            self.lc = self.lc + lc
            self.wc = self.wc + wc
            self.bc = self.bc + bc

        def reduce_stop(self):
            return (self.lc, self.wc, self.bc)
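
This README does not show how a job is executed, so the following is only a sketch of how a driver for the interface above might look. It assumes map runs in worker processes while reduce_start, reduce and reduce_stop run in the parent. The Job base class, run_job, the helper functions and the WCLines job (a variant of WC that reads from an in-memory list) are hypothetical stand-ins, not the library's actual API.

```python
from multiprocessing import Pool


class Job(object):
    """Hypothetical stand-in for the library's Job base class."""

    def reduce_start(self):
        pass

    def enumerate(self):
        raise NotImplementedError

    def map(self, pos, item):
        raise NotImplementedError

    def reduce(self, pos, r):
        raise NotImplementedError

    def reduce_stop(self):
        return None


_job = None  # per-process copy of the job, set once by _init_worker


def _init_worker(job):
    """Store the job in each worker so it is not re-sent with every item."""
    global _job
    _job = job


def _call_map(task):
    """Apply the job's map step to one (pos, item) pair."""
    pos, item = task
    return _job.map(pos, item)


def run_job(job, processes=None, chunksize=100):
    """Hypothetical driver: map in workers, reduce in the parent.

    processes=0 (an assumption of this sketch) runs everything in-process.
    """
    job.reduce_start()
    if processes == 0:
        _init_worker(job)
        for pos, r in map(_call_map, job.enumerate()):
            job.reduce(pos, r)
    else:
        with Pool(processes, _init_worker, (job,)) as pool:
            for pos, r in pool.imap(_call_map, job.enumerate(), chunksize):
                job.reduce(pos, r)
    return job.reduce_stop()


class WCLines(Job):
    """Word count over an in-memory list of lines (illustration only)."""

    def __init__(self, lines):
        self.lines = lines

    def reduce_start(self):
        self.lc = self.wc = self.bc = 0

    def enumerate(self):
        return enumerate(self.lines)

    def map(self, pos, item):
        return (pos, (1, len(item.split()), len(item)))

    def reduce(self, pos, r):
        lc, wc, bc = r
        self.lc += lc
        self.wc += wc
        self.bc += bc

    def reduce_stop(self):
        return (self.lc, self.wc, self.bc)
```

With this driver, run_job(WCLines(["a b\n", "c\n"]), processes=0) returns the (line, word, character) totals for the two lines. Keeping the reduce step in the parent means the counters on the job need no locking, at the cost of funnelling every mapped result through a single process.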