
Support Eleuther LM-Eval-Harness in Levanter #675

Merged: 43 commits merged into main from the eval_harness branch on Dec 4, 2024

Conversation

dlwh (Member) commented on Jul 25, 2024:

Adds Eleuther's LM Eval Harness as a callback in Levanter. It's much slower than it needs to be because I'm not doing any sequence packing, but it gets the job done. Scores on Llama 3 seem reasonable, so I think this is right.

Closes #564

dlwh marked this pull request as a draft on July 25, 2024 at 20:52.
Comment on lines +80 to +81
whole_enc = self.tokenizer(context + completion)
context_enc = self.tokenizer(context)

Must we run the tokenizer twice?

dlwh (Member, Author):

It's how it's done in the lm harness and inside alpaca; it's the easiest thing and not a bottleneck.
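
(Side note: a minimal sketch of how those two encodings are typically used; the function name and return values here are illustrative, not Levanter's actual code. The length of the context encoding marks where the completion tokens begin, which is what downstream scoring needs, since only the completion's log-likelihood is measured.)

def encode_request(tokenizer, context: str, completion: str):
    # Tokenize the concatenation and the context separately; this assumes the
    # tokenization splits cleanly at the context/completion boundary.
    whole_enc = tokenizer(context + completion)
    context_enc = tokenizer(context)

    whole_ids = whole_enc["input_ids"]
    context_len = len(context_enc["input_ids"])

    # Everything after the context prefix belongs to the completion.
    completion_ids = whole_ids[context_len:]
    return whole_ids, context_len, completion_ids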


task: str
task_alias: str | None = None
num_fewshot: int | None = None

What does None represent that's not representable by an integer?

dlwh (Member, Author):

None means the default number of few-shot examples for the task, whatever that happens to be.
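
(As an illustration of the distinction, assuming TaskConfig is a dataclass constructed with the fields shown above as keyword arguments; the task names below are arbitrary examples.)

task_spec = [
    # No num_fewshot: use whatever default few-shot count the task defines
    # inside lm-eval-harness.
    TaskConfig(task="hellaswag"),
    # Explicit override, with an alias so both variants can be reported separately.
    TaskConfig(task="hellaswag", task_alias="hellaswag_10shot", num_fewshot=10),
]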

return [task.to_dict() if isinstance(task, TaskConfig) else task for task in self.task_spec]

def to_task_dict(self) -> dict:
import lm_eval.tasks as tasks

Add docstring on what this function is doing / what it's for?
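(A sketch of the kind of docstring being requested; the body is a simplified approximation built from the snippet above and the lm_eval.tasks API, not the exact merged code.)

def to_task_dict(self) -> dict:
    """Instantiate the configured tasks via lm-eval-harness.

    Converts each entry of self.task_spec (a plain task name or a TaskConfig)
    into the form lm_eval expects, then returns a mapping from task name to an
    instantiated lm_eval task object, ready to be passed to the evaluator.
    """
    import lm_eval.tasks as tasks

    manager = tasks.TaskManager()
    specs = [task.to_dict() if isinstance(task, TaskConfig) else task for task in self.task_spec]
    return tasks.get_task_dict(specs, manager)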


EvalPos = model.Pos if max_eval_length is None else model.Pos.resize(max_eval_length)
harness = LevanterHarnessLM(EvalBatch, EvalPos, model, axis_resources, tokenizer)
# we always log_samples here and filter out the samples later if we don't want them

log_samples is a verb?

dlwh (Member, Author):

I mean, it's a verb phrase :-)

dlwh (Member, Author):

Also, we don't actually need this behavior, because I can't compute the metrics I want from the samples anyway.

dlwh (Member, Author):

Oh, I lied; we do need it.


NAT_TO_BIT = 1 / np.log(2)

# eval_harness isn't consistent enough for this to actually be workable

What does this mean? The lm evaluation harness is just a framework, so whether it's meaningful or not depends on the actual eval and the model size?

dlwh (Member, Author):

Well, the samples for multiple-choice tasks don't all expose the answer in a standard format, even though they easily could (and must have it internally at some point).
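
(For reference, the NAT_TO_BIT constant in the snippet above converts log-likelihoods from nats, i.e. natural log, to bits; a minimal illustration.)

import numpy as np

NAT_TO_BIT = 1 / np.log(2)

loss_nats = 2.0                      # e.g. a per-token loss reported in nats
loss_bits = loss_nats * NAT_TO_BIT   # ~2.885 bits per token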

return self.trainer.EvalBatch

@cached_property
def the_tokenizer(self):

It's not obvious from the name what this does... maybe tokenizer_object or something?

dlwh (Member, Author):

I use this convention throughout Levanter...

to_log = {}
for task_name, task_results in report["results"].items():
for metric_name, metric_value in task_results.items():
if metric_name.endswith(",none"):

What is this hackery? Please put the assumptions in a comment.

dlwh (Member, Author):

The harness just appends ",none" to all the default metric names for some reason, e.g. acc,none, acc_norm,none, etc.
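
(A sketch of how that loop could document the assumption; the metric-key format used for logging here is illustrative, not necessarily what the PR logs.)

to_log = {}
for task_name, task_results in report["results"].items():
    for metric_name, metric_value in task_results.items():
        # lm-eval-harness names metrics "<metric>,<filter>", and the default
        # filter is literally "none" (e.g. "acc,none", "acc_norm,none").
        # Keep only the default-filter metrics and strip the suffix.
        if metric_name.endswith(",none"):
            clean_name = metric_name[: -len(",none")]
            to_log[f"{task_name}/{clean_name}"] = metric_value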

)

if jax.process_index() == 0:
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".json") as f:

A temp file that doesn't get deleted doesn't sound good. I'd either delete it or put the file in the evaluation directory (which I guess would have to be passed in as an explicit location).

dlwh (Member, Author):

It's delete=False so that the wandb process can still upload it.
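
(For illustration, a minimal version of that pattern, assuming wandb is the tracker; the surrounding PR code is more involved. The report is written on process 0 only and the path is handed to wandb, whose background uploader needs the file to still exist, hence delete=False.)

import json
import tempfile

import jax
import wandb

def save_report(report: dict) -> None:
    if jax.process_index() == 0:
        with tempfile.NamedTemporaryFile("w", delete=False, suffix=".json") as f:
            json.dump(report, f)
        # delete=False keeps the file on disk long enough for wandb's
        # asynchronous uploader to pick it up.
        wandb.save(f.name)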

@@ -122,6 +126,14 @@ def main(config: TrainLmConfig):
Pos = config.model.Pos
KeyPos = config.model.KeyPos

# to do partitioning, our dimensions have to be divisible by the size of the physical axes they're mapped to

Why is this moved up?

ignore_id: Optional[int] = None,
all_causal: bool = True,
) -> "LmExample":
# mask out the prompt tokens

Add a docstring.
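
(The docstring being asked for might describe the masking rule below; this is a framework-free numpy illustration of "mask out the prompt tokens", not Levanter's actual LmExample code, which uses named arrays.)

from typing import Optional

import numpy as np

def completion_loss_mask(
    tokens: np.ndarray,
    prompt_length: int,
    ignore_id: Optional[int] = None,
) -> np.ndarray:
    """Return a mask that is 0 over the prompt and 1 over the completion,
    optionally also zeroing positions equal to ignore_id (e.g. padding), so
    the loss is computed only on completion tokens."""
    mask = np.ones_like(tokens, dtype=np.float32)
    mask[:prompt_length] = 0.0           # mask out the prompt tokens
    if ignore_id is not None:
        mask[tokens == ignore_id] = 0.0  # also drop padding / special tokens
    return mask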

task_dict = tasks.get_task_dict([task], manager)
this_task = task_dict.popitem()[1]
# hacky, but this allows us to run multiple instances of the same task with different fewshot settings
this_task.config.task = our_name
Contributor:

I think this works well: the result file is saved as task.jsonl, so using the alias helps us distinguish the number of shots. It might be worth moving this logic into a helper function, create_task_with_unique_name(), for readability...

dlwh (Member, Author):

Good idea!
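
(A sketch of the suggested helper; the name comes from the review comment and the body is lifted almost verbatim from the snippet above.)

import lm_eval.tasks as tasks

def create_task_with_unique_name(task, our_name: str, manager):
    """Instantiate a single lm-eval task and rename it, so several copies of
    the same task (e.g. with different few-shot settings) don't collide."""
    task_dict = tasks.get_task_dict([task], manager)
    this_task = task_dict.popitem()[1]
    # hacky, but this allows us to run multiple instances of the same task
    # with different fewshot settings
    this_task.config.task = our_name
    return this_task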

return outputs


def _actually_run_eval_harness(config: LmEvalHarnessConfig, model, tasks_to_run, tokenizer, EvalBatch, axis_resources):
nikil-ravi (Contributor) commented on Dec 3, 2024:

Typing and a docstring for outputs would be useful here.
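
(For instance, the signature could be annotated roughly as follows; the parameter and return types are guesses based on the conversation, not the merged code.)

def _actually_run_eval_harness(
    config: "LmEvalHarnessConfig",
    model,                # the Levanter language model being evaluated
    tasks_to_run: dict,   # task name -> instantiated lm_eval task
    tokenizer,            # HuggingFace tokenizer used to encode requests
    EvalBatch,            # batch axis for evaluation
    axis_resources,       # partitioning spec for sharded evaluation
) -> dict:
    """Run lm-eval-harness over tasks_to_run with the given model and return
    the harness's results dict (per-task metrics plus sample-level records)."""
    ...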

return outputs


def _compute_averages(outputs):
nikil-ravi (Contributor) commented on Dec 3, 2024:

Unrelated to this PR specifically, but we could add more ways to aggregate results; the DCLM paper, for example, reports centered accuracy. Maybe we could make this aggregation function something we pass into LmEvalHarnessConfig, or something like subtract_random_baseline: true in the YAML config...

dlwh (Member, Author):

IMHO we should modify lm harness to do that, but it's a great idea.
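
(For reference, "centered accuracy" in the DCLM sense rescales raw accuracy by the task's random-chance baseline; a minimal sketch, where the baseline would have to come from task metadata.)

def centered_accuracy(acc: float, random_baseline: float) -> float:
    """Rescale accuracy so random guessing maps to 0 and perfect accuracy to 1."""
    return (acc - random_baseline) / (1.0 - random_baseline)

# e.g. a 4-way multiple-choice task with 60% raw accuracy:
# (0.60 - 0.25) / (1 - 0.25) ≈ 0.467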

dlwh merged commit d125206 into main on Dec 4, 2024.
8 checks passed
dlwh deleted the eval_harness branch on December 4, 2024 at 01:18.
Closes: wire up llm-eval-harness (#564)