Numeric fixes + combinators #99

Merged: @olorin merged 12 commits into master from topic/numbers on Apr 5, 2016
Conversation

@olorin (Contributor) commented Mar 31, 2016

Turns out I wasn't paying attention when I wrote the stddev accumulator - it was totally borked, and there was a bug in the tests which let it go undetected. Redid a bunch of the numeric types in order to fix it.

Also: implemented the functions for combining partial numeric results (for resolving intermediate results of the parallel folds).

/cc @charleso @tmcgilchrist @thumphries
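For reviewers wanting a reference point, here is a minimal sketch of the standard pairwise combination for a count/mean/M2 accumulator (Chan, Golub & LeVeque); the NumericState type, its field names, and the function names below are illustrative assumptions, not the code in this PR:

-- Sketch only: a partial numeric state carrying the element count, the
-- running mean, and M2 (the sum of squared deviations from that mean).
data NumericState = NumericState
  { nsCount :: !Int
  , nsMean  :: !Double
  , nsM2    :: !Double
  } deriving (Eq, Show)

-- Combine two partial states (e.g. the results of parallel folds over
-- disjoint chunks) into the state for the concatenated input.
combineNumericState :: NumericState -> NumericState -> NumericState
combineNumericState (NumericState na ma m2a) (NumericState nb mb m2b)
  | na == 0   = NumericState nb mb m2b
  | nb == 0   = NumericState na ma m2a
  | otherwise =
      let n     = na + nb
          delta = mb - ma
          mean  = ma + delta * fromIntegral nb / fromIntegral n
          m2    = m2a + m2b
                    + delta * delta * fromIntegral na * fromIntegral nb / fromIntegral n
      in NumericState n mean m2

-- Sample standard deviation recovered from a (possibly combined) state.
stddev :: NumericState -> Maybe Double
stddev (NumericState n _ m2)
  | n < 2     = Nothing
  | otherwise = Just (sqrt (m2 / fromIntegral (n - 1)))

A combinator of this shape is mathematically associative with the empty state as identity, which is what lets intermediate results of parallel folds be resolved in any grouping; floating-point round-off means different groupings can differ slightly.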

@olorin (Contributor, Author) commented Mar 31, 2016

(Haven't fixed the GHC 7.10 build yet; ignore that failure.)

@olorin (Contributor, Author) commented Mar 31, 2016

Running 1 benchmarks...
Benchmark bench: RUNNING...
benchmarking decoding/decode/conduit+attoparsec-bytestring/1000
time                 314.3 ms   (232.4 ms .. 386.5 ms)
                     0.983 R²   (0.963 R² .. 1.000 R²)
mean                 297.3 ms   (276.2 ms .. 312.6 ms)
std dev              21.00 ms   (12.93 ms .. 26.41 ms)
variance introduced by outliers: 17% (moderately inflated)

benchmarking field-parsing/parseField/200
time                 78.22 μs   (77.33 μs .. 79.30 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 79.07 μs   (78.44 μs .. 79.84 μs)
std dev              2.402 μs   (1.742 μs .. 3.330 μs)
variance introduced by outliers: 29% (moderately inflated)

benchmarking folds/updateSVParseState/1000
time                 158.6 ms   (154.9 ms .. 162.9 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 159.2 ms   (157.0 ms .. 161.1 ms)
std dev              2.895 ms   (2.040 ms .. 3.703 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking folds/hashText/1000
time                 98.05 μs   (96.99 μs .. 98.73 μs)
                     0.992 R²   (0.983 R² .. 0.997 R²)
mean                 99.64 μs   (94.74 μs .. 110.8 μs)
std dev              23.46 μs   (12.15 μs .. 42.48 μs)
variance introduced by outliers: 96% (severely inflated)

benchmarking folds/updateTextCounts/1000
time                 20.09 ms   (19.32 ms .. 21.06 ms)
                     0.992 R²   (0.984 R² .. 0.997 R²)
mean                 20.11 ms   (19.79 ms .. 20.47 ms)
std dev              780.4 μs   (636.1 μs .. 1.019 ms)
variance introduced by outliers: 13% (moderately inflated)

benchmarking numerics/updateNumericState/10000
time                 5.358 ms   (5.083 ms .. 5.555 ms)
                     0.986 R²   (0.979 R² .. 0.992 R²)
mean                 5.204 ms   (5.057 ms .. 5.512 ms)
std dev              569.2 μs   (331.0 μs .. 919.6 μs)
variance introduced by outliers: 66% (severely inflated)

, "value" .= toJSON v
]
fromMean NoMean = object [
"type" .= String "no-mean"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual question: I'm assuming we're not worried about any live data for this yet?

@olorin (Contributor, Author) replied:

Yep, the NumericSummary isn't in any live data.
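
For readers without the diff open, a rough sketch of the kind of aeson encoder the hunk above could belong to. Only NoMean and the "no-mean"/"value" keys come from the hunk; the Mean constructor, its Double payload, and the surrounding module are assumptions:

{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (ToJSON (..), Value (String), object, (.=))

-- Sketch only: constructor names other than NoMean are assumed.
data Mean =
    NoMean
  | Mean !Double

fromMean :: Mean -> Value
fromMean (Mean v) =
  object [
      "type"  .= String "mean"
    , "value" .= toJSON v
    ]
fromMean NoMean =
  object [
      "type" .= String "no-mean"
    ]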

@olorin (Contributor, Author) commented Apr 1, 2016

Anyone mathsy feel like sanity-checking the mean/median combining? I would ping Huw or Aaron, but data scientists can't see the repo anymore. If not, I can get an IRL review on Monday. /cc @jystic @markhibberd

@olorin (Contributor, Author) commented Apr 3, 2016

@amosr maybe?

@olorin (Contributor, Author) commented Apr 4, 2016

Got a verbal 👍 from @adefazio on the numeric stuff, with the exception of the numerical stability of the variance-combining bit - I agree it will need to be addressed, following up in #101.
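
For anyone picking this up in #101 (the exact concern raised verbally isn't recorded here, so this is a general note rather than a statement about the current code): the classic instability in variance combining comes from working with raw sums of squares,

\sigma^2 \;=\; \frac{1}{n}\sum_{i} x_i^2 \;-\; \bar{x}^2,

which cancels catastrophically when the variance is small relative to the mean. Carrying M2 (the sum of squared deviations) and combining partials with

M_{2,ab} \;=\; M_{2,a} + M_{2,b} + \delta^2\,\frac{n_a n_b}{n_a + n_b},
\qquad \delta \;=\; \bar{x}_b - \bar{x}_a,
\qquad \sigma^2_{ab} \;=\; \frac{M_{2,ab}}{n_a + n_b}

(divide by n_a + n_b - 1 instead for the sample variance) is the better-conditioned alternative analysed by Chan, Golub & LeVeque.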

@markhibberd (Contributor) commented:

@olorin Can you give me a bit more time before you merge?

@olorin (Contributor, Author) commented Apr 4, 2016

@markhibberd yeah, waiting on that review from you.

@markhibberd (Contributor) commented:

@olorin This looks good to me 👍 I have stepped through and can't see any issues. Possibly worth a chat sometime about how to make these easier for everyone to review. We need to get to the point where we provide enough context, and have enough of a shared body of knowledge, that anyone can get in and desk-check these algorithms rather than fobbing them off. Having soft references to TAoCP and stats books, and a standard library (and a standard way of referencing it) for this type of code, would go a long way I think. I will have a think about whether there is anything easy we can do to facilitate this, but ideas are very welcome.

@olorin (Contributor, Author) commented Apr 5, 2016

@markhibberd

Possibly worth a chat sometime about how to make these easier for everyone to review. We need to get to the point where we provide enough context, and have enough of a shared body of knowledge, that anyone can get in and desk-check these algorithms rather than fobbing them off. Having soft references to TAoCP and stats books, and a standard library (and a standard way of referencing it) for this type of code, would go a long way I think. I will have a think about whether there is anything easy we can do to facilitate this, but ideas are very welcome.

Agreed, this sounds good. Will have a think about ways we could implement it. I like the idea of having standard libraries for commonly-needed fiddly implementations, like some numeric things (this was the idea behind fisher, which I still haven't gotten around to implementing).

I was also wondering about including derivations for any non-obvious numeric stuff in either comments or associated documentation so anyone interested can check the work months down the track. It's not the easiest thing to do in text format, but pandoc can handle markdown-embedded LaTeX and it could be built along with the haddocks.
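
As a purely illustrative example of the kind of markdown-embedded LaTeX note that could live in the repo and be built by pandoc alongside the haddocks, a derivation of the mean-combining step might read:

For disjoint chunks $a$ and $b$ with counts $n_a, n_b$ and means
$\bar{x}_a, \bar{x}_b$, the total is $n_a\bar{x}_a + n_b\bar{x}_b$, so

$$
\bar{x}_{ab}
  = \frac{n_a\bar{x}_a + n_b\bar{x}_b}{n_a + n_b}
  = \bar{x}_a + (\bar{x}_b - \bar{x}_a)\,\frac{n_b}{n_a + n_b},
$$

where the second form avoids building the (potentially large) raw sum.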

@olorin merged commit 4335367 into master on Apr 5, 2016.
@olorin deleted the topic/numbers branch on Apr 5, 2016 at 00:50.