Numeric fixes + combinators #99

Merged: @olorin merged 12 commits into master from topic/numbers on Apr 5, 2016
Conversation

@olorin (Contributor) commented Mar 31, 2016

Turns out I wasn't paying attention when I wrote the stddev accumulator - it was totally borked, and there was a bug in the tests which let it go undetected. Redid a bunch of the numeric types in order to fix it.

Also: implemented the functions for combining partial numeric results (for resolving intermediate results of the parallel folds).

/cc @charleso @tmcgilchrist @thumphries
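For reviewers wanting a reference point, here is a minimal sketch of the standard pairwise combination for a count/mean/M2 accumulator (Chan, Golub & LeVeque); the NumericState type, its field names, and the function names below are illustrative assumptions, not the code in this PR:

-- Sketch only: a partial numeric state carrying the element count, the
-- running mean, and M2 (the sum of squared deviations from that mean).
data NumericState = NumericState
  { nsCount :: !Int
  , nsMean  :: !Double
  , nsM2    :: !Double
  } deriving (Eq, Show)

-- Combine two partial states (e.g. the results of parallel folds over
-- disjoint chunks) into the state for the concatenated input.
combineNumericState :: NumericState -> NumericState -> NumericState
combineNumericState (NumericState na ma m2a) (NumericState nb mb m2b)
  | na == 0   = NumericState nb mb m2b
  | nb == 0   = NumericState na ma m2a
  | otherwise =
      let n     = na + nb
          delta = mb - ma
          mean  = ma + delta * fromIntegral nb / fromIntegral n
          m2    = m2a + m2b
                    + delta * delta * fromIntegral na * fromIntegral nb / fromIntegral n
      in NumericState n mean m2

-- Sample standard deviation recovered from a (possibly combined) state.
stddev :: NumericState -> Maybe Double
stddev (NumericState n _ m2)
  | n < 2     = Nothing
  | otherwise = Just (sqrt (m2 / fromIntegral (n - 1)))

A combinator of this shape is mathematically associative with the empty state as identity, which is what lets intermediate results of parallel folds be resolved in any grouping; floating-point round-off means different groupings can differ slightly.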

@olorin (Contributor, Author) commented Mar 31, 2016

(Haven't fixed the GHC 7.10 build yet; ignore that failure.)

@olorin (Contributor, Author) commented Mar 31, 2016

Running 1 benchmarks...
Benchmark bench: RUNNING...
benchmarking decoding/decode/conduit+attoparsec-bytestring/1000
time                 314.3 ms   (232.4 ms .. 386.5 ms)
                     0.983 R²   (0.963 R² .. 1.000 R²)
mean                 297.3 ms   (276.2 ms .. 312.6 ms)
std dev              21.00 ms   (12.93 ms .. 26.41 ms)
variance introduced by outliers: 17% (moderately inflated)

benchmarking field-parsing/parseField/200
time                 78.22 μs   (77.33 μs .. 79.30 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 79.07 μs   (78.44 μs .. 79.84 μs)
std dev              2.402 μs   (1.742 μs .. 3.330 μs)
variance introduced by outliers: 29% (moderately inflated)

benchmarking folds/updateSVParseState/1000
time                 158.6 ms   (154.9 ms .. 162.9 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 159.2 ms   (157.0 ms .. 161.1 ms)
std dev              2.895 ms   (2.040 ms .. 3.703 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking folds/hashText/1000
time                 98.05 μs   (96.99 μs .. 98.73 μs)
                     0.992 R²   (0.983 R² .. 0.997 R²)
mean                 99.64 μs   (94.74 μs .. 110.8 μs)
std dev              23.46 μs   (12.15 μs .. 42.48 μs)
variance introduced by outliers: 96% (severely inflated)

benchmarking folds/updateTextCounts/1000
time                 20.09 ms   (19.32 ms .. 21.06 ms)
                     0.992 R²   (0.984 R² .. 0.997 R²)
mean                 20.11 ms   (19.79 ms .. 20.47 ms)
std dev              780.4 μs   (636.1 μs .. 1.019 ms)
variance introduced by outliers: 13% (moderately inflated)

benchmarking numerics/updateNumericState/10000
time                 5.358 ms   (5.083 ms .. 5.555 ms)
                     0.986 R²   (0.979 R² .. 0.992 R²)
mean                 5.204 ms   (5.057 ms .. 5.512 ms)
std dev              569.2 μs   (331.0 μs .. 919.6 μs)
variance introduced by outliers: 66% (severely inflated)

, "value" .= toJSON v
]
fromMean NoMean = object [
"type" .= String "no-mean"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual question: I'm assuming we're not worried about any live data for this yet?

@olorin (Contributor, Author) replied:

Yep, the NumericSummary isn't in any live data.
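
For readers without the diff open, a rough sketch of the kind of aeson encoder the hunk above could belong to. Only NoMean and the "no-mean"/"value" keys come from the hunk; the Mean constructor, its Double payload, and the surrounding module are assumptions:

{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (ToJSON (..), Value (String), object, (.=))

-- Sketch only: constructor names other than NoMean are assumed.
data Mean =
    NoMean
  | Mean !Double

fromMean :: Mean -> Value
fromMean (Mean v) =
  object [
      "type"  .= String "mean"
    , "value" .= toJSON v
    ]
fromMean NoMean =
  object [
      "type" .= String "no-mean"
    ]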

@olorin (Contributor, Author) commented Apr 1, 2016

Anyone mathsy feel like sanity-checking the mean/median combining? I would ping Huw or Aaron, but data scientists can't see the repo anymore. If not, I can get an IRL review on Monday. /cc @jystic @markhibberd

@olorin (Contributor, Author) commented Apr 3, 2016

@amosr maybe?

@olorin (Contributor, Author) commented Apr 4, 2016

Got a verbal 👍 from @adefazio on the numeric stuff, with the exception of the numerical stability of the variance-combining bit - I agree it will need to be addressed, following up in #101.
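
For anyone picking this up in #101 (the exact concern raised verbally isn't recorded here, so this is a general note rather than a statement about the current code): the classic instability in variance combining comes from working with raw sums of squares,

\sigma^2 \;=\; \frac{1}{n}\sum_{i} x_i^2 \;-\; \bar{x}^2,

which cancels catastrophically when the variance is small relative to the mean. Carrying M2 (the sum of squared deviations) and combining partials with

M_{2,ab} \;=\; M_{2,a} + M_{2,b} + \delta^2\,\frac{n_a n_b}{n_a + n_b},
\qquad \delta \;=\; \bar{x}_b - \bar{x}_a,
\qquad \sigma^2_{ab} \;=\; \frac{M_{2,ab}}{n_a + n_b}

(divide by n_a + n_b - 1 instead for the sample variance) is the better-conditioned alternative analysed by Chan, Golub & LeVeque.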

@markhibberd (Contributor) commented:

@olorin Can you give me a bit more time before you merge?

@olorin (Contributor, Author) commented Apr 4, 2016

@markhibberd yeah, waiting on that review from you.

@markhibberd (Contributor) commented:

@olorin This looks good to me 👍 I have stepped through and can't see any issues. Possibly worth a chat sometime about how to make these easier for everyone to review. We need to get to the point where we provide enough context, and have enough of a shared body of knowledge, that anyone can get in and desk-check these algorithms rather than fobbing them off. Having soft references to TAoCP and stats books, and a standard library (and a standard way of referencing it) for this type of code, would go a long way I think. I will have a think about whether there is anything easy we can do to facilitate this, but ideas are very welcome.

@olorin (Contributor, Author) commented Apr 5, 2016

@markhibberd

Possibly worth a chat sometime about how to make these easier for everyone to review. We need to get to the point where we provide enough context, and have enough of a shared body of knowledge, that anyone can get in and desk-check these algorithms rather than fobbing them off. Having soft references to TAoCP and stats books, and a standard library (and a standard way of referencing it) for this type of code, would go a long way I think. I will have a think about whether there is anything easy we can do to facilitate this, but ideas are very welcome.

Agreed, this sounds good. Will have a think about ways we could implement it. I like the idea of having standard libraries for commonly-needed fiddly implementations, like some numeric things (this was the idea behind fisher, which I still haven't gotten around to implementing).

I was also wondering about including derivations for any non-obvious numeric stuff in either comments or associated documentation so anyone interested can check the work months down the track. It's not the easiest thing to do in text format, but pandoc can handle markdown-embedded LaTeX and it could be built along with the haddocks.
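
As a purely illustrative example of the kind of markdown-embedded LaTeX note that could live in the repo and be built by pandoc alongside the haddocks, a derivation of the mean-combining step might read:

For disjoint chunks $a$ and $b$ with counts $n_a, n_b$ and means
$\bar{x}_a, \bar{x}_b$, the total is $n_a\bar{x}_a + n_b\bar{x}_b$, so

$$
\bar{x}_{ab}
  = \frac{n_a\bar{x}_a + n_b\bar{x}_b}{n_a + n_b}
  = \bar{x}_a + (\bar{x}_b - \bar{x}_a)\,\frac{n_b}{n_a + n_b},
$$

where the second form avoids building the (potentially large) raw sum.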

@olorin merged commit 4335367 into master on Apr 5, 2016.
@olorin deleted the topic/numbers branch on Apr 5, 2016 at 00:50.