Dev diary: single-atom Beta power law #64
We can use cross entropy to measure the performance of the model. It cannot be cheated by over-confident predictions, because it measures the difference between the predicted distribution and the real distribution.
Oh nice! Is there somewhere I can read about this? Does this work for data that's real flashcards? Or do you need to simulate flashcards from a known probability distribution in order to obtain the K-L distance between the "true" prior and the distribution assumed by Ebisu/FSRS?
The FSRS Optimizer uses binary cross entropy as the loss function:
It's very common to use cross entropy as the loss function in classification tasks: https://machinelearningmastery.com/cross-entropy-for-machine-learning/
Oh I see what you mean 😕 alas, in all my examples, what I call "log likelihood" is literally cross-entropy from binary classification: per https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html, cross-entropy is exactly what's computed in ebisu/scripts/analyzeHistory.py, lines 37 to 39 at 8ae2a95 (with `n=1`), before summing them. In the Bernoulli/binomial quiz case, the "likelihood" is just the probability mass function, so the two are identical.
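As a quick numeric sanity check (my own snippet, not from the repo), summing Bernoulli log-PMFs gives exactly the negative of scikit-learn's un-normalized binary cross-entropy:

```python
# Quick check (mine, not the repo's): sum of Bernoulli log-likelihoods == -(binary cross-entropy).
import numpy as np
from scipy.stats import bernoulli
from sklearn.metrics import log_loss

results = np.array([1, 1, 0, 1])            # hypothetical quiz outcomes (1 = recalled)
predicted = np.array([0.9, 0.7, 0.4, 0.8])  # hypothetical predicted recall probabilities

sumLogLik = bernoulli.logpmf(results, predicted).sum()
crossEntropy = log_loss(results, predicted, normalize=False)
assert np.isclose(sumLogLik, -crossEntropy)
```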
I've definitely noticed that this metric is vulnerable to being misled by overconfident models that assign overoptimistic predictions all the time, like I shared in fasiha/ebisu.js#23 (comment). Let's carry on discussing that there!
(disclaimer for the following: I am dyscalculic, so I stumbled into this one more or less accidentally at 2am while making a combinatorial analogue of Anki that represents concepts latently and separately from their testing contexts. I have a rough mental image of what it's doing, but I kind of just tried every single combination of "Math.log", "/", "*", etc. that I could think of, and once it started spitting out good numbers I ran with it. I don't know how it works but I'm suspecting it's similar.)

```js
let decayEstimate = -0.5; // gradient over time
let flexibility = 0.095;
// how much it freaks out and adjusts if it gets something wrong
let base = 1.168;
// -- learning --
// first entry is recorded
let $sinceLastSeen = 0.001; // in hours
let $accuracy = 1.0; // between 0 and 1
let impliedDecay = (Math.log($accuracy) / Math.log(base)) / $sinceLastSeen;
// impliedDecay: 0
// i.e, "wow! the user has a perfect memory and will never forget this ever"
let newDecay = average( [decayEstimate, impliedDecay], { ratio: [flexibility, 1 - flexibility] } )
// newDecay: -0.4525
decayEstimate = newDecay;
// ...
// typically after a few repetitions, you will get:
// decayEstimate: -0.35
// -- situation 1: user waited way too long after learning --
let $sinceLastSeen = 24;
let expectedAccuracy = Math.pow(1.168, -0.35 * 24)
// expectedAccuracy: 0.27
// case A (did better than we expected):
let accuracy = 0.84;
// usually was calculated via levenshtein distance, I don't think it matters too much
let impliedDecay = (Math.log(0.84) / Math.log(1.168)) / 24;
// impliedDecay = -0.047
// ...
// -> newDecay = -0.33
// case B (did as well as expected) -> decay remains identical
// case C (did worse than expected) -> decay increases
```

All that gets stored per concept entry is lastSeen and decay. In retrospect, what I think this is implicitly doing is modelling maturity/interval, strength of belief updates, cross-entropy loss, and inherent difficulty, all with a single parameter. I don't know what the downsides of this approach are, but it was fast to calculate and held up surprisingly well across the ~20 people I tested it on (ranging from beginners to advanced learners who wanted daily practice). I wish I knew how to apply it to that test data.

edit: important thing to note, I optimized the hyperparams here for minimizing boredom, not maximizing accuracy.
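For intuition, here is a quick back-of-envelope (my own arithmetic, not from the comment above): with `accuracy(t) = base**(decay*t)`, a given decay maps directly to a halflife of `ln(0.5) / (decay * ln(base))`.

```python
# Back-of-envelope only (my own): the halflife implied by a Langwitch-style decay value,
# assuming accuracy(t) = base**(decay * t).
from math import log

base = 1.168

def impliedHalflifeHours(decay: float) -> float:
    return log(0.5) / (decay * log(base))

print(impliedHalflifeHours(-0.35))  # ≈ 12.8 hours for the "after a few repetitions" decay above
```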
@ckoshka thank you so much for sharing! I love these simple, battle-tested models, and I totally adore language learning apps that break the mold 🤩 Langwitch is such a fresh take, I look forward to experimenting with it! Can you check if the following is right? (I used Python, sorry; if you prefer, I can readily rewrite it in JavaScript.)

```python
from math import log
initDecay = -0.5
flexibility = 0.095
base = 1.168
def update(decay: float, sinceLastSeen: float, accuracy: float) -> float:
    impliedDecay = (log(accuracy) / log(base)) / sinceLastSeen
    return decay * flexibility + impliedDecay * (1 - flexibility)
def predict(decay: float, sinceLastSeen: float) -> float:
    return base**(decay * sinceLastSeen)
model = initDecay
for sinceLastSeen, accuracy in [(24, .2), (12, .6), (36, .3)]:
    prediction = predict(model, sinceLastSeen)
    model = update(model, sinceLastSeen, accuracy)
    print(
        f'{sinceLastSeen} hours: pred={prediction:.2f}. Quiz accuracy={accuracy} → new decay={model}')
```

That last bit is a little demo that starts with the initial decay and goes through three quizzes, 24 hours, then 12 hours, then 36 hours, updating after each quiz. At each quiz it prints out the prediction and the new decay. This is the output:
If I did the coding right, these numbers hopefully make sense? I haven't dug enough to know if Langwitch actively prompts users to review a concept or if it's fully passive, but is there an accuracy (returned by …)?

If the above code is right, it might be straightforward to plug this into the test frameworks @L-M-Sherlock and I both have, and see how this algorithm would have handled a bunch of flashcards I'm familiar with (the same ~380 test cases from my original post). Thanks again!!!
(Apologies for spamming those subscribed 🙇 please feel free to mute this issue, I'll be continuing to use it as a dev diary.)

I was thinking about how annoying it is that the sum-of-log-likelihoods under-penalizes overconfident models: for example, consider a flashcard for which the student got five successes followed by a failure. A model that predicted 90% for all six of these would score considerably better than a more conservative model that predicted 60% for all six:

```python
In [10]: from scipy.stats import bernoulli
In [16]: bernoulli.logpmf(1, 0.9) * 5 + bernoulli.logpmf(0, 0.9)
Out[16]: -2.829387671283177
In [17]: bernoulli.logpmf(1, 0.6) * 5 + bernoulli.logpmf(0, 0.6)
Out[17]: -3.4704188507041085
```

In the sum-of-log-likelihood metric, higher is better. (I'm being very extra by importing `scipy.stats.bernoulli` just for this.)

I feel like the conservative model that predicts 60% is "better" here, and I'd prefer a metric that handled this. So I was reading up about how machine learning folks handle unbalanced data, which is basically what we have here: failures being a lot less common than successes is what causes this problem, and thanks to @L-M-Sherlock's very kind pointer I was able to understand that what I called "sum-of-log-likelihoods" is called cross-entropy 🙇.

So, a few years ago a user on the Data Science Stack Exchange pointed to a paper on arXiv introducing a focal loss metric, a modification to cross-entropy that appears to ameliorate the impact of unbalanced data in an intriguing way. Rather than the score of a single observation being `log(p if result else 1-p)`, Lin et al. propose `log(p**((1-p)**gamma) if result else (1-p)**(p**gamma))` for some exponent `gamma >= 0` (the code below defaults to `gamma=2`). Check it out:

```python
from math import log
def bernoulliLogProbabilityFocal(result: bool, p: float, gamma: float = 2) -> float:
    assert 0 <= p <= 1
    assert 0 <= gamma
    focalP = p**((1 - p)**gamma)
    focalQ = (1 - p)**(p**gamma)
    return log(focalP if result else focalQ)
```

and

```python
In [20]: bernoulliLogProbabilityFocal(True, 0.9, 2)*5 + bernoulliLogProbabilityFocal(False, 0.9, 2)
Out[20]: -1.8703619511080685
In [21]: bernoulliLogProbabilityFocal(True, 0.6, 2)*5 + bernoulliLogProbabilityFocal(False, 0.6, 2)
Out[21]: -0.7385251624874881
```

With the focal loss (`gamma=2`), the important thing to see here is that the rankings have switched: the conservative model is better (higher number). Hmmmm, this might be what we want!

So, last time we compared a few parameterizations of the v3-ensemble algorithm (an ensemble of weighted Beta distributions over logarithmically-increasing halflives, all decaying exponentially) and the single-Beta power law algorithm. I wanted to find the "best" parameterization for both algorithms for my training sample of ~380 flashcards (~4700 reviews):
I did a quick 2D grid search varying alpha/beta and halflife for both models to see what parameters led to the best sum of focal loss for all flashcards, all reviews in my test set (grid-search charts for "Beta power law" and "v3-ensemble" omitted here).

So these tell us the "optimal" parameters for each of these algorithms that yield the best focal loss summed over all flashcards, all quizzes. The quotes around "optimal" are very heavy because this is such a rough grid search, and who knows, maybe focal loss isn't what we'll want to use anyway, but let's see where it leads us. Picking some parameters around the optima for both algorithms and remaking the charts similar to the first post above yields:

Umm, I'm not really sure how focal loss can go positive. It could be because I adapted it from its original Bernoulli-probability-mass-function setting to the binomial and fuzzy-binary probability distributions that Ebisu uses (I map my flashcard history from Anki's 1-4 ranks to binomial (4=easy=2 out of 2) and fuzzy-binary (1=hard=…)). But assuming this is reasonable, the two algorithms perform surprisingly similarly: the best parameterization for either doesn't really outstrip the other.

Of course I then wanted to see what halflives these algorithms output. For each of the 382 flashcards I can take the final model after the last quiz and compute its 50-percentile decay time (halflife) and 80-percentile time, and plot it:

It's clear that the v3-ensemble algorithm behaves a lot better than the Beta-power-law algorithm when it comes to halflives. The v3-ensemble final halflives are between a few thousand and a few tens of thousands of hours, and the 80-percentile times are all uniformly a bit less than that. The Beta-power-law, however, has enormous halflives, up to 10^10 hours (a million years…): for a hundred out of 382 flashcards the Beta-power-law halflife is somewhat comparable to the v3-ensemble halflife, but the rest of them are considerably higher. The 80-percentile times are a lot more reasonable.

I don't think this is a major condemnation of the Beta-power-law model (though I might be biased, this is after all its dev diary 😝): there is literally no serious bound on the power law's recall probability. In terms of usability, I think we could very reasonably build Anki-style quiz apps that schedule reviews when the recall probability drops to 80% or 70%? (I personally don't like such apps, I prefer apps that calculate which card to review now, when I'm free, on my schedule, not on the app's schedule, so I could use some feedback.) And of course both algorithms output reasonable probability of recall for each flashcard's review.

I'll be digging into the individual reviews next and seeing if I prefer one or the other algorithm, now that I have a metric (focal loss) that I think helps deal with unbalanced data.

The code to generate all these charts is in this branch. Setup instructions are at the top of this issue; then:

```bash
cd ebisu # make sure you're in the right branch
python scripts/analyzeHistory.py # saves the last plot above, for v3-ensemble
python scripts/betapowerlaw.py # saves the second-to-last plot above, beta-power-law
python scripts/compareEnsemblePowerlawHalflives.py # generates halflife/80-percentile plots
```

If you run these in `ipython`, …
I figured it out. The main divergence of our perspectives is that Ebisu considers the probability of recall at the card level, but FSRS considers P(recall) at the review-history level. I built a dataset here: https://huggingface.co/datasets/open-spaced-repetition/fsrs-dataset. I hope it will be helpful to your development and research.
@L-M-Sherlock yeah, this is a great observation. I still am not exactly sure how you convert Ebisu's …

Continuing with the dev diary for this single-Beta power-law model. I found the reason the focal loss was going positive (since focal loss is in the log-domain, this means probability > 1 🙊): I had a typo in the noisy-binary focal loss function, I was mixing up … This changes the results of the grid search just a little, and I confirmed that if I map my Anki history to just binary (…). So it appears that the bug in focal loss hasn't destroyed our nice algorithm here 😅🙏!

I've been digging into the predictions and halflives generated by this algorithm and I like them. The v3-ensemble model (an ensemble of several Beta random variables at different halflives, each with exponential decay) produces different predictions and posterior halflives compared to this single-Beta power law when it comes to failures, early successes, and surprise long successes, but I think the power law reacts more quickly? In the table below (sorry for how big it is, if possible please read it on desktop), I have the results for flashcard 1300038030806 (in my Anki database, linked in the first post). The results for each quiz are Anki's 1-4 grades. For each quiz, I show … (table omitted here).
We see that by the 11th quiz, after a run of ✅s with the last one 1000 hours after the last review, the v3 ensemble (right-most columns) has exploded the halflife to almost 2800 hours. However, the power law algorithm is still conservative and predicts a 1500-hour halflife. Anki chose to schedule the 12th quiz 2500 hours later, which failed. There are three runs of ✅s and in each, the power law chooses to strengthen its halflife in a different way from the v3 algorithm. Both choices seem defensible, and the probability of recall predicted by each is not bad.

The main issue that authors of Anki-style apps will see is that the time to reach 80% recall probability is extremely conservative for the power-law algorithm. Given that this quiz schedule (picked by Anki) results in the power-law algorithm assigning the quizzes 60-ish percent predicted recall probability, quiz apps might use the time to 60% (or 70%) recall to schedule reviews? I don't have good instincts when it comes to designing apps like this (as I said, I much prefer apps that don't schedule cards but rather check which are most at risk of being forgotten when I am ready to spend time reviewing, on my schedule), but it seems like there's a risk when the time-to-80% is too soon but the time-to-70% is too far? Is this inevitable when scheduling reviews in the future, and should I not worry about it?

For apps that I tend to write, of course, that don't schedule reviews, the fact that the power-law model predicts probability of recall reasonably is very exciting. We now have a few different algorithms that solve the core issue raised in #43: an ensemble of Gamma random variables, an ensemble of Beta random variables (reusing Ebisu2), and also potentially this single-Beta power-law algorithm (also leveraging Ebisu2).
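On the time-to-80% question above, a rough closed form (my own back-of-envelope, not from the thread) shows why the gap is so large: if we ignore the spread of the Beta and treat recall at the halflife as exactly 0.5, the power-law prediction collapses to `0.5**log2(1 + elapsed/h) = 1/(1 + elapsed/h)`, so the time to reach a target recall R is roughly `h*(1/R - 1)`.

```python
# Back-of-envelope only: assumes the point-estimate curve 1/(1 + elapsed/h),
# not Ebisu's full Beta-averaged prediction.
def hoursToTargetRecall(halflifeHours: float, targetRecall: float) -> float:
    return halflifeHours * (1 / targetRecall - 1)

h = 1500  # hours, e.g., the power-law halflife after the 11th quiz above
for target in [0.8, 0.7, 0.6, 0.5]:
    print(f"recall={target:.0%}: ~{hoursToTargetRecall(h, target):.0f} hours")
# → 375, 643, 1000, and 1500 hours respectively: time-to-80% is far sooner than the halflife.
```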
The code is here: https://github.com/open-spaced-repetition/fsrs-benchmark/blob/0da0a9adeecb4d354ce3f8933fc2d5ee300a9bb6/other.py#L234-L264
I think the unbalanced data is not a big deal for FSRS, because FSRS can predict the probability of recall accurately:

Source: https://www.reddit.com/r/Anki/comments/15mab6e/fsrs_explained_part_2_accuracy/
FSRS allows users to set their desired retention. FSRS uses this formula to schedule the interval based on desired retention and stability:
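The formula itself was shared as an image that didn't survive this copy. For reference, here is a sketch of the FSRS-v4-style relationship (my paraphrase; the exact constants in the version under discussion may differ): the forgetting curve is `R(t, S) = (1 + t/(9*S))**-1`, and inverting it for the elapsed time gives the next interval at a desired retention.

```python
# Sketch of an FSRS-v4-style interval calculation (my paraphrase; constants assumed):
# invert R(t, S) = (1 + t/(9*S))**-1 for t at the desired retention.
def next_interval(stability_days: float, desired_retention: float) -> float:
    return 9 * stability_days * (1 / desired_retention - 1)

print(next_interval(stability_days=10, desired_retention=0.9))  # 10.0: at 90% retention, interval equals stability
```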
I was doodling and realized that while everything in the single-atom Ebisu case (v2 and v3) takes `pNow = p**(elapsed / t)` for an atom parameterized by `[a, b, t]` (where the probability of recall at time `t` is assumed to be a `Beta(a, b)` random variable, that is, `probability recall at time t ~ Beta(a, b)`), there's nothing stopping us from changing things.

The `p**elapsed` exponentiation is why our Beta random variable decays via an exponential, and we can very very easily get a single Beta to exhibit power-law forgetting by saying `pNow = p**log2(1 + elapsed / t)`. Both these expressions share some nice properties:

- `p**(elapsed / t)` and `p**log2(1 + elapsed / t)` are both 0.5 when `t == elapsed` and `a = b`, i.e., `t` remains a halflife of the new expression
- both go to 1 as `elapsed / t` → 0 and asymptotically approach 0 as `elapsed / t` grows very large

The difference of course is that the power-law `p**log2(1 + elapsed / t)` decays muuuch slower than the exponential decay. It turns out that it's very easy to reuse the existing Ebisu v2 Beta+exponential library to do this power-law scheme, since basically `pNow = p**f(elapsed)`, i.e., the Beta random variable is raised to some power: `elapsed` for exponential decay, `log2(elapsed...)` for power-law decay.

I have a little script that demonstrates this: https://github.com/fasiha/ebisu/blob/v3-release-candidate/scripts/betapowerlaw.py
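To make the "raise the Beta to a different power" idea concrete, here is a tiny standalone sketch (mine, not the repo's `betapowerlaw.py`): for `p ~ Beta(a, b)`, the expected recall `E[p**n]` is a ratio of Beta functions, so the exponential and power-law schemes differ only in the exponent `n`.

```python
# Standalone sketch (not the actual betapowerlaw.py): for p ~ Beta(a, b),
# E[p**n] = B(a + n, b) / B(a, b), evaluated in the log domain for stability.
from math import exp, log2

from scipy.special import betaln

def predictRecall(a: float, b: float, halflife: float, elapsed: float, powerLaw: bool) -> float:
    n = log2(1 + elapsed / halflife) if powerLaw else elapsed / halflife
    return exp(betaln(a + n, b) - betaln(a, b))

# With a == b, both rules give exactly 0.5 at elapsed == halflife,
# but the power law decays far more slowly afterwards:
for elapsed in [24.0, 240.0, 2400.0]:
    print(elapsed,
          round(predictRecall(2, 2, 24, elapsed, powerLaw=False), 4),
          round(predictRecall(2, 2, 24, elapsed, powerLaw=True), 4))
```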
To run this:

- `python -m pip install numpy scipy pandas matplotlib tqdm ipython "git+https://github.com/fasiha/ebisu@v3-release-candidate"`
- check out the `rc1` branch: `git clone https://github.com/fasiha/ebisu.git && cd ebisu && git fetch -a && git checkout v3-release-candidate`
- put `collection-no-fields.anki2` in the `scripts` folder so the script can find it
- fire up `ipython` and `%run scripts/betapowerlaw.py`. This will produce some text/figures.

Now you can follow along:
Above we compare the predicted recall at 10 and 100 halflives:
Running the script above will generate this chart comparing a few models for a few hundred quizzes in terms of log-likelihood:
I have a very similar script for benchmarking the v3 ensemble-of-Betas algorithm; `%run scripts/analyzeHistory.py` will run it and generate this:

In the two charts above, higher is better (higher likelihood). Each point corresponds to the sum of log-likelihoods (product of raw likelihoods) over all the quizzes for that flashcard. Each chart is sorted from the worst log-likelihood to the best, and the 125 right-most points are flashcards for which I have no failures.
Looking at these side-by-side:
Both scripts also spit out a text file containing per-flashcard, per-quiz details of what likelihood each model assigned to the current quiz and its current halflife. Looking through these is really interesting because you can see how different models result in very different halflives after each quiz. This also emphasizes why benchmarking algorithms via log-likelihood (see fasiha/ebisu.js#23) is tricky: an easy way to "cheat" is just to be overly optimistic, because in general failures are quite uncommon, and the penalty an algorithm incurs by being very wrong about occasional failures is more than made up for by the boost it gets by over-confidently predicting every quiz to be a success. This is really important: an algorithm/model that performs well in terms of sum-of-log-likelihoods isn't necessarily the best; we have to look at how it handles failures, how it grows halflives after quizzes, and whether those are reasonable.
So right now I'm not sure what to do 😂 hence this dev diary—maybe writing things out will give me some ideas. I could try to see if there are better initial parameters that improve on these. I'm also going to investigate whether the halflives produced by the two algorithms are reasonable (since some apps will no doubt want to do the Anki thing and schedule reviews for when recall probability drops below a threshold).
If it turns out the single-atom Beta power law algorithm is good enough, should I scrap the Beta-ensemble model…? 😝!