This is a short overview of how this works, intended primarily for those
who work on the code. More user-focused documentation is available in
separate repos.
Processes
=========

There are a few different processes involved in a running system:

    main daemon
        python version specific runner (runner.py)
            when running a job this forks to create the actual job processes
        maybe more runners (one per py\d line in the conf)
        short lived worker pools when loading metadata about jobs.
    automata (asks the daemon to run jobs)
Communication between the automata and the daemon is over http (by default
over a unix socket). You can find the code in automata_common.py and
daemon.py.
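A minimal sketch (not the project's actual client code) of speaking HTTP
over a unix socket with Python's standard library; the socket path and
the /status endpoint are made up:

    import http.client
    import socket

    class UnixHTTPConnection(http.client.HTTPConnection):
        """http.client connection that connects to a unix socket path."""
        def __init__(self, path):
            super().__init__('localhost')
            self.unix_path = path
        def connect(self):
            sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            sock.connect(self.unix_path)
            self.sock = sock

    conn = UnixHTTPConnection('/tmp/daemon.sock')  # hypothetical socket path
    conn.request('GET', '/status')                 # hypothetical endpoint
    print(conn.getresponse().status)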
Communication between the daemon and the runners is over anonymous sockets
(from socketpair()), each packet is a length prefix followed by pickled
data. All the communications code is in runner.py.
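A minimal sketch of that framing, assuming a 4-byte length prefix (the
exact prefix encoding and the messages are assumptions; the real code is
in runner.py):

    import os, pickle, socket, struct

    def send_msg(sock, obj):
        data = pickle.dumps(obj)
        sock.sendall(struct.pack('<I', len(data)) + data)  # length prefix, then pickle

    def recv_msg(sock):
        (length,) = struct.unpack('<I', sock.recv(4))
        data = b''
        while len(data) < length:
            data += sock.recv(length - len(data))
        return pickle.loads(data)

    parent_sock, child_sock = socket.socketpair()
    if os.fork() == 0:
        # child pretends to be a runner answering one request
        print('runner got', recv_msg(child_sock))
        send_msg(child_sock, {'status': 'ok'})
        os._exit(0)
    send_msg(parent_sock, {'cmd': 'load_methods'})  # hypothetical message
    print('daemon got', recv_msg(parent_sock))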
Communication between the runner and the main job process is a single json
object written back over a pipe. Job setup is partly loaded from setup.json
in the job dir and partly inherited from the parent process (normal fork
behaviour).
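A minimal sketch of the single-json-object-over-a-pipe idea; the result
field names here are made up:

    import json, os

    r, w = os.pipe()
    if os.fork() == 0:                  # child plays the main job process
        os.close(r)
        result = {'status': 'ok', 'exectime': 1.23}   # hypothetical fields
        os.write(w, json.dumps(result).encode('utf-8'))
        os._exit(0)
    os.close(w)                         # parent plays the runner
    with os.fdopen(r, 'rb') as fh:
        print(json.loads(fh.read()))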
Communication between the main job process and the analysis processes (the
parallel part) is over a multiprocessing.Queue object.
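In the same spirit, a minimal sketch of fanning per-slice results back
over a multiprocessing.Queue; the slice count and the per-slice "result"
are made up:

    import multiprocessing

    def analysis(sliceno, q):
        q.put((sliceno, sliceno * 10))   # pretend per-slice result

    if __name__ == '__main__':
        q = multiprocessing.Queue()
        procs = [multiprocessing.Process(target=analysis, args=(s, q))
                 for s in range(3)]
        for p in procs:
            p.start()
        results = dict(q.get() for _ in procs)   # drain before joining
        for p in procs:
            p.join()
        print(results)                           # one result per slice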
Yes, this has more variation than there is any reason for. Rewriting it is
probably not useful though; all the communication methods work.
setup.json
==========
The job parameters are all in setup.json in the job dir, which contains
mostly a normalized version of the same things the automata sends to the
daemon to dispatch the job. The valid contents are specified by the
{options, datasets, jobids} variables in the method source. There is also
some meta info about the method.

For completed jobs there is also a short version of the profiling
information.
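Roughly what such a file might contain, sketched here as a Python dict;
every key name below is an illustration, not the actual schema:

    import json

    example_setup = {
        'method': 'csvimport',                 # which method the job ran
        'options': {'filename': 'data.csv'},   # from the method's options variable
        'datasets': {},                        # from the method's datasets variable
        'jobids': {},                          # from the method's jobids variable
        'exectime': {'analysis': 2.3},         # short profiling summary
    }
    print(json.dumps(example_setup, indent=4))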
Identification of jobs for reuse
================================
When a job is requested, a matching job is first searched for. A job with
the same parameters is valid if the source code for the method has not
been changed since it was run, or if the new code uses equivalent_hashes
to indicate that it is compatible with the old code. This code (in
control.py, database.py, dependency.py, deptree.py, methods.py,
workspace.py) is much harder to follow than there is any reason for.

database.Database.match_exact is called from dependency.initialise_jobs,
which is called from control.initialise_jobs, which is called from the
submit portion of daemon.py.
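A conceptual sketch of that reuse decision (not the real control.py /
database.py logic; the Job tuple and all names below are made up):

    from collections import namedtuple

    Job = namedtuple('Job', ['jobid', 'method_hash'])

    def find_reusable_job(candidates, current_hash, equivalent_hashes):
        """Pick an already built job whose method source is unchanged or
        declared compatible via equivalent_hashes, else None."""
        for job in candidates:                    # jobs with the same parameters
            if job.method_hash == current_hash:   # same source code as now
                return job.jobid
            if job.method_hash in equivalent_hashes:
                return job.jobid                  # new code says it's compatible
        return None

    print(find_reusable_job([Job('test-0', 'aaa')], 'bbb', {'aaa'}))  # -> test-0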
The data about method versions and compatibility comes from the runners;
the setup.json files are loaded directly by the daemon (using a worker
pool).
Datasets
========
Datasets are our main data storage system, suitable for data you stream
through. (Other data is normally stored in pickles.) Each column is stored
separately in one file per slice (except for small columns, where all
slices are in one file). Each column has a single type, one of the types in
sourcedata.type2iter (which is a few more than you see at the top of the
file). Most of these types are handled through gzutil, a C extension
available in a separate repo.
Each job can contain any number of datasets. On disk each dataset is a
directory containing a pickle with metainfo and all the column files.
Look in dataset.py for more details.
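A toy illustration of the one-file-per-column-per-slice layout; the real
column files use gzutil's typed format rather than gzipped text, and the
paths and names below are made up:

    import gzip, os

    dsdir = '/tmp/example_job/default'          # hypothetical dataset directory
    os.makedirs(dsdir, exist_ok=True)
    column = [1, 2, 3, 4, 5, 6]                 # one column of a single type
    slices = 3
    for sliceno in range(slices):
        path = os.path.join(dsdir, 'count.%d' % sliceno)
        with gzip.open(path, 'wt') as fh:       # one file per slice
            fh.write('\n'.join(str(v) for v in column[sliceno::slices]) + '\n')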
List of files
=============
Runnable files
    daemon.py
        Main daemon
    automatarunner.py
        Runs your automata scripts
    dsgrep.py
        Grep one or more datasets
    dsinfo.py
        Print some info about a dataset

Bookkeeping around jobs
    control.py
    database.py
    dependency.py
    deptree.py
    methods.py
    workspace.py

Datasets and ds/jobchaining
    chaining.py
    dataset.py
    sourcedata.py

Launching of jobs, forking magic
    dispatch.py
    launch.py
    runner.py

Other
    automata_common.py
        support functions for automatarunner/subjobs
    extras.py
        dumping ground for a lot of useful utility functions
    status.py
        sending and receiving of status-tree messages (the ^T stuff)
    subjobs.py
        running jobs from within other jobs

Not so interesting files
    autoflush.py
    blob.py
    compat.py
        py2/py3 compat
    configfile.py
    dscmdhelper.py
    g.py
    gzwrite.py
    jobid.py
    report.py
    safe_pool.py
    setupfile.py
    status_messaging.py
    unixhttp.py
    web.py
    workarounds.py

default_analysis
    Directory with the default methods.