<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LENSE outline</title>
<link rel="stylesheet" href="https://stackedit.io/res-min/themes/base.css" />
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
<style>
img {
width: 700px;
}
</style>
</head>
<body><div class="container"><h1 id="the-lense-project">The L.E.N.S.E. Project</h1>
<p>(Learning from Expensive, Noisy, Slow Experts)</p>
<p><strong>Version 0.2</strong> <br>
This is a casually written draft document to explain the project.</p>
<hr>
<h4 id="latest-news">Latest News</h4>
<ul>
<li>Rough implementation is now <a href="https://github.com/lense-project/lense-base">up on GitHub</a></li>
<li>Check out our <a href="http://lense-project.github.io/early-results.txt">early results</a>.</li>
</ul>
<h4 id="abstract">Abstract</h4>
<p>We explore making time-sensitive predictions on an infinite stream of “problem instances” by selectively querying humans for noisy partial information, and exploiting previous observations to train a classifier that can incrementally take over from the humans. This stream is assumed to be too large to be kept in memory, so querying backwards over previously seen examples is considered impractical.</p>
<p>A limited version of this setting has been independently proposed several times under different names over the past two decades. It’s been called “<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.414.1128&rep=rep1&type=pdf">Active Learning in a streaming setting</a>“, “<a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=0CDIQFjAC&url=http%3A%2F%2Fvideolectures.net%2Fsite%2Fnormal_dl%2Ftag%3D645374%2Fecmlpkdd2011_zliobaite_data_01.pdf&ei=VmNBVaGCNYS7ogSVuYHIAw&usg=AFQjCNGKwmhKslrNUDPS9lf03cLZp6F--Q&sig2=NoWhiWRe5MLeQvwZHumBvg">Online Active Learning</a>” or “<a href="https://users.soe.ucsc.edu/~dph/mypubs/LabelEfficient97.pdf">Label Efficient Learning</a>,” to name a few. This is generally proposed under two limitations: the “problem instances” must be multi-class classification problems, and the learning algorithm optimizes a Frequentist objective that does not provide a way for you to specify the complicated trade-off between expense, time delay, and expected number of errors.</p>
<p>This project proposes two improvements on the historical work: </p>
<ul>
<li>We propose a way to do structured instance prediction in the general case, by which we mean we can handle arbitrary multi-variable problem instances, with factor-graph structure. We also allow collection of human opinions about one variable at a time. (We stress <em>arbitrary</em> factor graphs, because there have been domain specific online structured prediction systems in the past, e.g. <a href="http://cogcomp.cs.illinois.edu/papers/SmallRo10.pdf">dependency parsing by Dan Roth</a>)</li>
<li>We also provide a novel fully Bayesian approach to Online Active Learning, which can optimize arbitrary loss functions.</li>
</ul>
<hr>
<h2 id="the-lense-motto-fake-it-until-you-make-it">The L.E.N.S.E. motto: “Fake it until you make it”</h2>
<p>Humans are already AI-complete, so why not fake hyper-intelligent AI by just asking for help once in a while? Generally, this doesn’t scale because it’s too expensive. Our trick is to <em>learn</em> from the human responses we get, and phase out humans as we learn. In short, the LENSE approach to hard AI: “fake it until you make it.”</p>
<h2 id="goals">Goals:</h2>
<ul>
<li>Provide <strong>arbitrary precision</strong> classifiers, by asking humans.</li>
<li>Make our classifiers <strong>cheap</strong> by incorporating machine learning.</li>
<li>Be <strong>useful</strong> to the real world.</li>
</ul>
<h2 id="motivating-example">Motivating Example:</h2>
<p>You answer the phone for a Chinese food restaurant. You start to notice that every call follows pretty much the same flow, and you want to know whether it would be possible to teach a computer to take over the phones for you. We’ll call this the Chinese Restaurant Process (pun intended).</p>
<p><img src="http://keenon.github.io/assets/lense/chinese_restaurant_process.png" alt="the Chinese Restaurant Process (pun intended)" title=""></p>
<p><em>If</em> we had AI advanced enough to figure out the decisions at the red parts, we could let MacSpeech do the rest. No such AI exists at the moment, so we’ll have to <em>fake it</em>. LENSE aims to do exactly that.</p>
<h2 id="people-are-ai-complete-why-is-this-news">People are AI complete. Why is this news?</h2>
<p>Well, it isn’t, quite. The trick is <strong>money</strong>. People want to get paid for the work they do. Businesses don’t like how that scales.</p>
<p>Computers can learn from examples in real time, and not bother humans with questions the model already knows about. This will <strong>save money</strong>. Costs should scale at roughly <script type="math/tex" id="MathJax-Element-1">O(log(N))</script>, instead of <script type="math/tex" id="MathJax-Element-2">O(N)</script> for the no-learning case, as the computer learns to handle the common kinds of queries independently. The tagline here is <strong>decreasing marginal cost</strong>.</p>
<p><img src="http://keenon.github.io/assets/lense/scaling_costs.png" alt="relative scaling costs" title=""></p>
<h2 id="a-simple-example">A simple example</h2>
<p>Let’s pick apart the confirmation node from the Chinese Restaurant example, and talk about exactly how L.E.N.S.E. works in a very simple case.</p>
<p><img src="http://keenon.github.io/assets/lense/lense_example.png" alt="confirmation detection example" title=""></p>
<p>First, we get a “prior belief” over the possible values of <script type="math/tex" id="MathJax-Element-3">x</script>, <script type="math/tex" id="MathJax-Element-4">P_{\theta}(x)</script>, according to our model. This comes from a standard machine learning model, which we’ll explain in the next section.</p>
<p>In the absence of other information, <script type="math/tex" id="MathJax-Element-5">P(x)</script>, our belief over the true value of <script type="math/tex" id="MathJax-Element-6">x</script>, is equal to <script type="math/tex" id="MathJax-Element-7">P_{\theta}(x)</script>. <script type="math/tex" id="MathJax-Element-8">P(x)</script> can be more or less “peaky.” When <script type="math/tex" id="MathJax-Element-9">P(x)</script> is not peaky enough, we can sharpen our belief over <script type="math/tex" id="MathJax-Element-10">x</script> by querying a human and applying Bayes’ rule: <script type="math/tex" id="MathJax-Element-11">P(x|h) \propto P(x)P(h|x)</script>. Once sufficient mass is concentrated in one class, we turn in our result.</p>
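<p>As a toy illustration of that update (a minimal sketch, not LENSE’s actual code), here it is in Python, assuming the simplest possible human noise model: with some fixed error rate the human answers uniformly at random, and otherwise answers correctly. The prior and the responses are made-up numbers.</p>
<pre><code># Toy sketch of the belief update above (illustration only, not LENSE's code).
# Human noise model: correct with probability 1 - eps, otherwise uniform at random.

def human_likelihood(h, x, classes, eps=0.3):
    """P(h | x) under the uniform-error assumption."""
    if h == x:
        return (1 - eps) + eps / len(classes)
    return eps / len(classes)

def update_belief(prior, h, eps=0.3):
    """One application of Bayes' rule: P(x | h) is proportional to P(x) P(h | x)."""
    classes = list(prior.keys())
    unnormalized = {x: p * human_likelihood(h, x, classes, eps) for x, p in prior.items()}
    z = sum(unnormalized.values())
    return {x: v / z for x, v in unnormalized.items()}

# Prior P_theta(x) for the "confirmation" variable (made-up numbers).
belief = {"yes": 0.55, "no": 0.45}
for response in ["yes", "yes"]:       # two humans both say "yes"
    belief = update_belief(belief, response)
print(belief)                         # mass concentrates on "yes"
</code></pre>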
<p>Deciding whether to query another human or turn in our current best guess is complicated. We resort to Bayesian Decision Theory here, where we specify some loss function <script type="math/tex" id="MathJax-Element-12">L(c,t,E)</script> which we seek to minimize. The inputs are: <script type="math/tex" id="MathJax-Element-13">c</script> is the financial cost, <script type="math/tex" id="MathJax-Element-14">t</script> is the time delay, and <script type="math/tex" id="MathJax-Element-15">E</script> is the expected error, i.e. the probability that our guess is wrong (<script type="math/tex" id="MathJax-Element-16">E = 1 - P(guess)</script>). We make decisions in order to minimize expected loss. We get to this in the section after next.</p>
<h2 id="learning-pthetax-and-phx">Learning <script type="math/tex" id="MathJax-Element-17">P_{\theta}(x)</script> and <script type="math/tex" id="MathJax-Element-18">P(h|x)</script></h2>
<p>The key to making LENSE work is that its parameters for the model <script type="math/tex" id="MathJax-Element-19">P_{\theta}(x)</script> are learned over time. We want <strong>fewer and fewer queries to be made to humans as (features, opinion) pairs accumulate</strong> and the parameters are refined. As an added bonus, LENSE should also learn the parameters for <script type="math/tex" id="MathJax-Element-20">P(h|x)</script> as accurately as possible. </p>
<p>We will perform this learning in a PGM structure, taking advantage of existing algorithms for handling learning with unobserved variables.</p>
<p><strong>Note:</strong> In the naive case I present here, this can be done with EM in batches over all currently available data. It turns out Percy has a paper on doing it in the online case, so that’s no problem: <a href="http://cs.stanford.edu/~pliang/papers/online-naacl2009.pdf">Online EM for Unsupervised Models (NAACL ‘09)</a></p>
<p>In our Chinese Restaurant confirmation system example, we have the following structure (which can be extended; we discuss that later):</p>
<p><img src="http://keenon.github.io/assets/lense/plate_model.png" alt="plate model" title=""></p>
<p>The only thing not specified here is exactly how the model weights combine with features to create <script type="math/tex" id="MathJax-Element-21">P_{\theta}(x)</script>, and how human-error weights combine with the hidden true value to create <script type="math/tex" id="MathJax-Element-22">P(h|x)</script>. This is by design, to allow flexibility within the framework.</p>
<p><script type="math/tex" id="MathJax-Element-23">P_{\theta}(x)</script> and <script type="math/tex" id="MathJax-Element-24">P(h|x)</script> can be <strong>any functions that output valid probability distributions</strong>. This includes logistic regression, neural networks ending in a normalized layer, and everything in between.</p>
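<p>To make the learning step concrete, here is a naive batch-EM sketch in Python for the single-variable case, assuming a log-linear prior model and the uniform-error human model from the sketch above. The data shapes and the plain gradient step are illustrative choices, not the project’s actual interfaces.</p>
<pre><code>import numpy as np

# Naive batch-EM sketch for the single-variable case (illustration only).
# Each instance is (feats, votes): feats is a (K, D) array of per-class features,
# votes is a list of human answers (class indices). eps is the human error rate.

def model_probs(theta, feats):
    """P_theta(x): a softmax over per-class scores."""
    scores = feats @ theta
    scores = scores - scores.max()
    p = np.exp(scores)
    return p / p.sum()

def e_step(theta, eps, instances, K):
    """Posterior over the hidden true label for each (features, votes) pair."""
    posteriors = []
    for feats, votes in instances:
        post = model_probs(theta, feats).copy()
        for h in votes:                   # fold in each human response
            lik = np.full(K, eps / K)
            lik[h] += 1 - eps
            post *= lik
        posteriors.append(post / post.sum())
    return posteriors

def m_step(theta, instances, posteriors, K, lr=0.1):
    """One gradient step on theta, plus a closed-form re-estimate of eps."""
    grad = np.zeros_like(theta)
    agree, total = 0.0, 0.0
    for (feats, votes), post in zip(instances, posteriors):
        p = model_probs(theta, feats)
        grad += feats.T @ (post - p)      # expected minus predicted statistics
        for h in votes:
            agree += post[h]
            total += 1
    # Invert P(h = x | x) = 1 - eps * (K - 1) / K to re-estimate the error rate.
    eps = (1 - agree / max(total, 1.0)) * K / (K - 1)
    return theta + lr * grad, eps
</code></pre>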
<h2 id="minimizing-lcte">Minimizing <script type="math/tex" id="MathJax-Element-25">L(c,t,E)</script></h2>
<p>The decision to query humans is made based on our current beliefs <script type="math/tex" id="MathJax-Element-26">P(x)</script>, and our preferences about outcomes, as expressed in <script type="math/tex" id="MathJax-Element-27">L(c,t,E)</script>.</p>
<p>In some settings, the cost of a 1% chance of error is far higher than the cost of asking 3 more people their opinions (e.g. lawyers collecting case data). In other settings, the cost of an additional 500ms of latency is unacceptable (e.g. hedge fund news analysis, consumer products). In still others, cost is the primary concern and neither accuracy nor latency matters as much (e.g. corpus construction for ML). This is all encoded precisely by <script type="math/tex" id="MathJax-Element-28">L(c,t,E)</script>.</p>
<p>Bayesian statisticians are in general a pessimistic bunch, looking at every outcome as incurring some loss, and we follow their lead. Loss is a function of the cost <script type="math/tex" id="MathJax-Element-29">c</script>, the time <script type="math/tex" id="MathJax-Element-30">t</script>, and the expectation of error, <script type="math/tex" id="MathJax-Element-31">E</script>. <script type="math/tex" id="MathJax-Element-32">c</script> and <script type="math/tex" id="MathJax-Element-33">t</script> are trivial to calculate, and <script type="math/tex" id="MathJax-Element-34">E</script> is calculated by taking the probability under our current <script type="math/tex" id="MathJax-Element-35">P(x)</script> that our guess <script type="math/tex" id="MathJax-Element-36">g</script> is incorrect. That’s <script type="math/tex" id="MathJax-Element-37">E = 1-P(g)</script>.</p>
<p>Making no assumptions about the form of <script type="math/tex" id="MathJax-Element-38">L(c,t,E)</script>, we are trying to make a decision that minimizes our expected loss. We can look at this as a game tree, where our “opponent” (the crowd we’re querying) is playing the statistical strategy <script type="math/tex" id="MathJax-Element-39">P(h) = \sum_x P(h|x)P_{\theta}(x)</script>. We can calculate <script type="math/tex" id="MathJax-Element-40">L(c,t,E)</script> in closed form whenever we choose to “turn in” our best guess at a node in our game tree, since we can calculate <script type="math/tex" id="MathJax-Element-41">E</script> and we know <script type="math/tex" id="MathJax-Element-42">c</script>, and can take an expectation over <script type="math/tex" id="MathJax-Element-43">t</script> at that point.</p>
<p><img src="http://keenon.github.io/assets/lense/query_game.png" alt="query game structure" title=""></p>
<p>There’s lots of existing work on how to optimally play a game like this. Exhaustive tree traversal is impossible in the general case, because of the exponential size of the tree. In those situations, there are quite a few approximate algorithms that estimate expectations by sampling paths in the tree.</p>
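<p>As a hedged illustration, here is the simplest possible version of that decision in Python: a one-step lookahead rather than a full tree search, reusing the <code>human_likelihood</code> and <code>update_belief</code> helpers from the earlier sketch. The per-query cost and delay are invented numbers.</p>
<pre><code># One-step lookahead sketch (illustration, not the project's actual game player).
# belief maps class to P(x); loss(c, t, E) is the user-supplied loss function.

def turn_in_loss(belief, cost_so_far, time_so_far, loss):
    E = 1 - max(belief.values())          # probability our best guess is wrong
    return loss(cost_so_far, time_so_far, E)

def expected_query_loss(belief, cost_so_far, time_so_far, loss,
                        query_cost=0.01, query_delay=2.0, eps=0.3):
    """Expected loss if we ask one more human and then turn in."""
    classes = list(belief.keys())
    total = 0.0
    for h in classes:
        # P(h) = sum over x of P(h|x) P(x)
        p_h = sum(human_likelihood(h, x, classes, eps) * belief[x] for x in classes)
        updated = update_belief(belief, h, eps)
        total += p_h * turn_in_loss(updated, cost_so_far + query_cost,
                                    time_so_far + query_delay, loss)
    return total

def should_query(belief, cost_so_far, time_so_far, loss):
    """Ask another human only if that lowers the expected loss."""
    return (turn_in_loss(belief, cost_so_far, time_so_far, loss)
            > expected_query_loss(belief, cost_so_far, time_so_far, loss))
</code></pre>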
<h2 id="the-multivariable-case">The multivariable case</h2>
<p>So far we’ve fleshed out what <script type="math/tex" id="MathJax-Element-44">K</script>-way classification looks like. Lots of problems can be reduced to a (potentially huge) sequence of <script type="math/tex" id="MathJax-Element-45">K</script>-way classification tasks, but that can be very inefficient.</p>
<p>Let’s imagine using LENSE to do accurate, low-latency NER for use in some pipeline system (say this is for that hedge fund that <a href="http://www.theatlantic.com/technology/archive/2011/03/does-anne-hathaway-news-drive-berkshire-hathaways-stock/72661/">accidentally thought Anne Hathaway positive mentions applied to Berkshire Hathaway</a>). That is generally represented by a sequence model, because the NER classification of a token’s neighbor affects its own NER classification.</p>
<p>We want to be able to ask humans about single tokens in the sequence, since that is much faster and cheaper than asking for labels of the whole sequence. We’d like to be able to use information about some tokens to inform others, and to know which tokens to ask queries about.</p>
<p>Going to arbitrary factor graphs is a relatively straightforward change to the machinery we’ve already built up. Learning doesn’t need to be changed: we just insert the factor graph where we used to have a single node “True X,” and hang human observations off of the individual nodes within the factor graph that they are observations for. Our loss function <script type="math/tex" id="MathJax-Element-46">L(c,t,E)</script> needs to be generalized so that <script type="math/tex" id="MathJax-Element-47">E</script> is a vector, where each entry <script type="math/tex" id="MathJax-Element-48">E_i</script> is the probability of getting the class of node <script type="math/tex" id="MathJax-Element-49">i</script> in the graph wrong. Our game player for minimizing expected loss also needs to generalize to decide which variable to query, if it chooses to ask humans. Otherwise everything is exactly as before.</p>
<p><img src="http://keenon.github.io/assets/lense/ner_example.png" alt="anne hathaway ner example" title=""></p>
<p>There are a lot of interesting real world tasks that can be cast as MAP inference over a set of variables. In these tasks, it’s often (though not always) faster to query humans about individual variables than whole groups, and these settings especially should see a huge speedup with LENSE.</p>
<h2 id="in-full-generality">In full generality</h2>
<p>LENSE expects queries to arrive in the form <script type="math/tex" id="MathJax-Element-50">(G, f(x))</script>. Here <script type="math/tex" id="MathJax-Element-51">G</script> is a factor graph of random variables we want the values for, where each factor represents feature values (but no weights or probabilistic interpretation). </p>
<p><script type="math/tex" id="MathJax-Element-52">f(x)</script> is a function to which we can pass a subset of variables; it will return a human opinion on them, at some cost we know and with some human error rate we don’t know. <script type="math/tex" id="MathJax-Element-53">f(x)</script> is responsible for phrasing the question to humans, making the query, and returning an answer.</p>
<p>LENSE will attempt to respond with the MAP assignment to <script type="math/tex" id="MathJax-Element-54">G</script>. It will do this by calling <script type="math/tex" id="MathJax-Element-55">f(x)</script> on subsets of variables to gather evidence. The number of times <script type="math/tex" id="MathJax-Element-56">f(x)</script> will be called, and on which variables, is governed by a Bayesian decision-theoretic model that seeks to find the MAP assignment in minimal time, at minimum monetary cost, and with maximum accuracy, traded off according to an equation you specify.</p>
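<p>One plausible way to phrase that contract in code is sketched below in Python; the names are illustrative and do not correspond to the actual classes in the lense-base repository.</p>
<pre><code>from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A sketch of the (G, f(x)) query contract described above (names are hypothetical).

@dataclass
class Variable:
    name: str
    classes: List[str]            # possible values for this random variable

@dataclass
class Factor:
    variables: List[str]          # which variables this factor touches
    features: Dict[str, float]    # feature values, with no weights attached yet

@dataclass
class FactorGraph:                # this is "G" in the text
    variables: List[Variable]
    factors: List[Factor]

# "f(x)": given a subset of variables, ask a human; returns (answers, cost, delay).
HumanQuery = Callable[[List[Variable]], Tuple[Dict[str, str], float, float]]

def lense_query(graph: FactorGraph, ask_human: HumanQuery) -> Dict[str, str]:
    """Degenerate strategy for illustration: ask a human about every variable once.
    The real system would instead weigh queries against the loss L(c,t,E)."""
    answers, _cost, _delay = ask_human(graph.variables)
    return answers
</code></pre>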
<p>For every stream of queries to LENSE that comes from the same “task,” there are three things you need to give it:</p>
<ul>
<li><script type="math/tex" id="MathJax-Element-57">P_{\theta}(x)</script>: the kind of model to use for your problem (log-linear, SVR, NN, etc.). Your model must have <em>some</em> statistical interpretation, and the better tuned it is, the better.</li>
<li><script type="math/tex" id="MathJax-Element-58">P(h|x)</script>: the kind of model to use for modeling human error, where <script type="math/tex" id="MathJax-Element-59">h</script> is the human response, and <script type="math/tex" id="MathJax-Element-60">x</script> is the true value. This could be as simple as “With probability <script type="math/tex" id="MathJax-Element-61">\epsilon</script> human chooses uniformly at random, otherwise <script type="math/tex" id="MathJax-Element-62">h = x</script>”, or as complicated as a rich template model with lots of complicated shared unobserved parameters.</li>
<li><script type="math/tex" id="MathJax-Element-63">L(c,t,E)</script>: your loss function, which you would like minimized, as a function of money spent <script type="math/tex" id="MathJax-Element-64">c</script>, time delay for responses <script type="math/tex" id="MathJax-Element-65">t</script>, and expected error <script type="math/tex" id="MathJax-Element-66">E</script>. This can be arbitrarily complex, but a simple example is <script type="math/tex" id="MathJax-Element-67">L(c,t,E) = c + 0.5t^2 + 4E</script>, where the user is less concerned by cost than error rate (by a factor of 4), and really doesn’t want long delays (hence the squared term); see the sketch after this list.</li>
</ul>
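<p>To make those three pieces concrete, here is a minimal Python sketch using the examples from the list above (a log-linear model, the uniform-error human model, and the simple example loss). These are placeholders, not the project’s defaults.</p>
<pre><code>import numpy as np

# The three user-supplied pieces, sketched with the examples given in the list above.

def p_theta(theta, feats):
    """P_theta(x): a log-linear model over classes (any normalized model would do)."""
    scores = feats @ theta
    scores = scores - scores.max()
    p = np.exp(scores)
    return p / p.sum()

def p_h_given_x(h, x, num_classes, eps=0.1):
    """P(h|x): with probability eps the human answers uniformly at random, else h = x."""
    return (1 - eps) * (1.0 if h == x else 0.0) + eps / num_classes

def loss(c, t, E):
    """The simple example loss from the list: errors 4x worse than cost, delays squared."""
    return c + 0.5 * t**2 + 4 * E
</code></pre>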
<p>Once you’ve specified all of this, and are sending queries of the form <script type="math/tex" id="MathJax-Element-68">(G,f(x))</script>, it’s up to LENSE to behave such that your loss function is minimized (getting high accuracy, low cost, and short delays), and to learn parameters for both of your models (human error and priors).</p>
<h1 id="thats-all-folks">That’s all folks!</h1>
<p>That’s all that LENSE is so far. What follows are some suggestions for ways to contribute to the project, both in ways that lead to academic papers and in ways that are useful to others (and sometimes both at once).</p>
<h2 id="a-central-implementation">A Central Implementation:</h2>
<p>The unsexy first order of business is a central implementation people can use as a platform for experimentation. That means abstract classes for each of the subcomponents, so changes are easily pluggable. It also means a lot of plumbing and machinery to get the whole thing working, and some careful interface design.</p>
<p>A <a href="https://github.com/lense-project/lense-ref">GitHub repo</a> is up, but currently at a very early stage, with no documentation at all. It implements all factor-graph calculations using <a href="http://factorie.cs.umass.edu/">FACTORIE</a>. Contributors are of course welcome.</p>
<h2 id="research-ideas">Research Ideas</h2>
<p>LENSE can be broken down into several subcomponents, which can each be improved individually, more or less. This provides an opportunity to fruitfully <strong>work completely separately, but with mutual goals</strong>, which I’ve always found to be the most productive way to go.</p>
<h3 id="better-ways-to-model-phx">Better ways to model <script type="math/tex" id="MathJax-Element-69">P(h|x)</script>:</h3>
<p>This is a hot topic in the ML community because it leads to more accurate corpus construction. Lots of models have been tried, each one with more hidden variables and more complicated plate diagrams than the last. NIPS has held a workshop on this for the last two years, which has yielded some papers on error modeling.</p>
<p><strong>Related Work</strong>:</p>
<ul>
<li><a href="http://www.ics.uci.edu/~qliu1/nips13_workshop/Papers/ActivePaper5.pdf">Exploiting Commonality and Interaction Effects in Crowdsourcing Tasks Using Latent Factor Models (NIPS ‘13)</a></li>
<li><a href="https://www.soe.ucsc.edu/research/technical-reports/UCSC-SOE-14-01/download">WorkerRank: Using Employer Implicit Judgements To Infer Worker Reputation (NIPS ‘14)</a></li>
</ul>
<h3 id="the-right-objective-for-pthetax-and-phx">The right objective for <script type="math/tex" id="MathJax-Element-70">P_{\theta}(x)</script> and <script type="math/tex" id="MathJax-Element-71">P(h|x)</script>:</h3>
<p>Any model for <script type="math/tex" id="MathJax-Element-72">P_{\theta}(x)</script> that gets used with LENSE needs to provide <strong>accurate statistical estimates</strong>, and <strong>must return a uniform distribution when it doesn’t know</strong>. This is more important than F1. That means that regularization is very important, and that we can’t expect optimal performance from many off-the-shelf algorithms (basically all of classification, and even some regression setups).</p>
<p>Picking the right training objective for this is key. This should probably involve a reference to <script type="math/tex" id="MathJax-Element-73">L(c,t,E)</script>, because a stronger regularizer will increase your expected cost <script type="math/tex" id="MathJax-Element-74">c</script> and delay <script type="math/tex" id="MathJax-Element-75">t</script> in return for better guarantees that your estimate of the error expectation <script type="math/tex" id="MathJax-Element-76">E</script> errs on the side of overestimating.</p>
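<p>One simple diagnostic for the “accurate statistical estimates” requirement is a standard calibration check on held-out data: bin predictions by confidence and compare confidence against accuracy. This is just one possible check, not something the project prescribes.</p>
<pre><code>import numpy as np

# Expected calibration error: a quick check that predicted confidences match reality.
# confidences: model's probability for its own guess; correct: 1.0 if the guess was right.

def expected_calibration_error(confidences, correct, num_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
</code></pre>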
<p><strong>Related Work</strong>:</p>
<ul>
<li><a href="http://link.springer.com/book/10.1007%2F978-1-4757-4286-2">Statistical Decision Theory and Bayesian Analysis</a> (this is a whole book, but its concept of applying a loss function to statistical estimation of values might be a good place to start, beginning on p. 24)</li>
</ul>
<h3 id="interface-design-to-optimize-human-response-time-when-asking-questions">Interface design to optimize human response time when asking questions:</h3>
<p>There’s a human-interface problem here: how do we phrase questions so as to minimize response delay? This has been explored in the past in HCI venues like CHI, so that’s an excellent place to start.</p>
<p><strong>Related Work</strong>:</p>
<ul>
<li><a href="http://dl.acm.org/citation.cfm?id=2557155&CFID=505300189&CFTOKEN=23922551">Cognitively Inspired Task Design to Improve User <br>
Performance on Crowdsourcing Platforms (CHI ‘14)</a></li>
</ul>
<h3 id="a-richer-set-of-choices-to-optimize-lcte-over">A richer set of choices to optimize <script type="math/tex" id="MathJax-Element-79">L(c,t,E)</script> over:</h3>
<p>We could give the algorithm a richer set of choices. For example, what if there are multiple available humans, each with a different queue of pending requests (affecting expected delay) and a different accuracy and rate of completion? We would then have to choose between humans.</p>
<p>And what about the case where there is a parallel stream of requests arriving at LENSE, so that we have to account for queueing theory as we traverse the game tree, because expected human response times will change?</p>
<p><strong>Related Work:</strong></p>
<ul>
<li><a href="https://www.aaai.org/Papers/AAAI/2008/AAAI08-041.pdf">Simulation-Based Approach to General Game Playing (AAAI ‘08)</a></li>
<li><a href="http://www.ru.is/~yngvi/pdf/FinnssonB09a.pdf">Simulation Control in General Game Playing Agents (‘09)</a></li>
</ul></div></body>
</html>