You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It seems that our standard library's randomness support is not up to our usual standards. I think it's using old technology, and perhaps also old Lean naming conventions from before most of the current library was designed.
We have two separate interfaces to randomness: there's IO.getRandomBytes, and a splittable RNG based on RandomGen. Here, I'm talking about the latter.
First off, the namespacing is inconsistent. Some of the operators are in the IO namespace, like IO.setRandSeed and IO.rand. Most of them are in no defined namespace at all, like randBool, randNat, and stdSplit, making it harder to find them with autocomplete.
But the big issue seems to be that the StdGen in the Lean library uses an algorithm that's known to have poor randomness properties (see paper). It seems to have been based on an old version of the Haskell libraries. I'm far from an expert here, but I believe that SplitMix (which the Haskell libraries have used for the last four or so years) is the state of the art for splittable RNGs, though when I last looked into this a few years ago for a different project, I also ran across a blog post about common implementation mistakes in SplitMix. Git blame says that the Haskell libraries were used as inspiration five years ago, so Lean would have imported them just before they were updated.
For reference: here's the PR that updated the Haskell library in 2020. They considered, but rejected, having separate classes for splittable and non-splittable RNGs, deciding against it primarily due to backwards compatibility. It seems that ours is not used much, so perhaps we should have both?
Plausible also uses it. IIRC tf-random, which was a slower predecessor to SplitMix that used the Threefish block cipher to generate new states, came about when the statistical issues with the old Haskell StdGen started causing problems with QuickCheck generators in practice.
Impact
I don't know of anyone being actively blocked by this. It came up while working on the language reference, and I figured it was worth writing down what little I know about the matter for the stdlib team to prioritize as you see fit.
It seems that our standard library's randomness support is not up to our usual standards. I think it's using old technology, and perhaps also old Lean naming conventions from before most of the current library was designed.
We have two separate interfaces to randomness: there's
IO.getRandomBytes
, and a splittable RNG based onRandomGen
. Here, I'm talking about the latter.First off, the namespacing is inconsistent. Some of the operators are in the
IO
namespace, likeIO.setRandSeed
andIO.rand
. Most of them are in no defined namespace at all, likerandBool
,randNat
, andstdSplit
, making it harder to find them with autocomplete.But the big issue seems to be that the
StdGen
in the Lean library uses an algorithm that's known to have poor randomness properties (see paper). It seems to have been based on an old version of the Haskell libraries. I'm far from an expert here, but I believe that SplitMix (which the Haskell libraries have used for the last four or so years) is the state of the art for splittable RNGs, though when I last looked into this a few years ago for a different project, I also ran across a blog post about common implementation mistakes in SplitMix. Git blame says that the Haskell libraries were used as inspiration five years ago, so Lean would have imported them just before they were updated.For reference: here's the PR that updated the Haskell library in 2020. They considered, but rejected, having separate classes for splittable and non-splittable RNGs, deciding against it primarily due to backwards compatibility. It seems that ours is not used much, so perhaps we should have both?
I found one use of
StdGen
in Mathlib. Its comments point out what seem to be implementation issues with seeding.Plausible also uses it. IIRC
tf-random
, which was a slower predecessor to SplitMix that used the Threefish block cipher to generate new states, came about when the statistical issues with the old Haskell StdGen started causing problems with QuickCheck generators in practice.Impact
I don't know of anyone being actively blocked by this. It came up while working on the language reference, and I figured it was worth writing down what little I know about the matter for the stdlib team to prioritize as you see fit.
Add 👍 to issues you consider important. If others are impacted by this issue, please ask them to add 👍 to it.
The text was updated successfully, but these errors were encountered: