
[Prototype] Sandbox for Implementation of generate and integration of lm_eval (evaluation harness) #222


Draft · wants to merge 47 commits into base: main

Conversation

@bigximik (Contributor) commented Apr 3, 2025

✨ Description

This PR draft will be split into 3 PRs.

πŸ” Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

πŸ“ Changes

List the key changes introduced in this PR:

  1. Change A
  2. Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • πŸ‹ I have updated the Docker configuration or dependencies, if applicable.
  • πŸ”„ I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@bigximik (Contributor Author) commented Apr 3, 2025

I have created a debugging sandbox with manual tests for now. The results are as follows:

Ignoring attention_mask and position_ids:

| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Same output (same model via HF and Fast-LLM) | Same output | Different output |
| 2 | Different output | Different output | Different output |

Converting attention_mask (from HF forward) to sequence_lengths:

| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | FastLLM empty output | FastLLM empty output | Different output |
| 2 | FastLLM empty output | FastLLM empty output | Different output |

It seems sequence_lengths is not supported for fused attention, and it does not help with Flash Attention either. Could this be correct?

If attention_mask is a left-padded mask like this:
[[0, 0, 0, 1, 1, 1, 1], ....]
I convert it to sequence_lengths = [[3, 4], ....].

# First non-zero index per row, or 0 if the row is all zeros (invalid row)
first_non_zero_indexes = attention_mask.argmax(dim=1)

# Check that the sequence is left-padded, i.e. the remaining entries are contiguous 1s
assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero_indexes)).all()

sequence_lengths = [
    torch.tensor(
        [attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64
    )
    for el in first_non_zero_indexes.tolist()
]
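
For a concrete check, here is a minimal, self-contained version of the conversion applied to the example mask above (the function wrapper and its name are mine, purely for illustration):

import torch


def mask_to_sequence_lengths(attention_mask: torch.Tensor) -> list[torch.Tensor]:
    # Index of the first non-zero entry per row (0 if the row is all zeros, i.e. an invalid row)
    first_non_zero = attention_mask.argmax(dim=1)
    # The mask must be left-padded: everything after the first 1 must also be 1
    assert (attention_mask.sum(dim=1) == attention_mask.shape[1] - first_non_zero).all()
    return [
        torch.tensor([attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64)
        for el in first_non_zero.tolist()
    ]


# 3 padding tokens followed by 4 real tokens, as in the example above
print(mask_to_sequence_lengths(torch.tensor([[0, 0, 0, 1, 1, 1, 1]])))  # [tensor([3, 4])]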

@bigximik (Contributor Author) commented Apr 3, 2025

@sohamparikh @jlamypoirier Hi, I am trying to use the cross-document attention prevention that @tscholak pointed me to (https://github.com/ServiceNow/Fast-LLM/pull/177/files) to mimic left padding for documents in a batch during generation. It appears to be doing the correct thing internally (building the internal mask and position IDs), but it is not working. Could you please comment on what might be wrong? Thanks!

Comment on lines 335 to 341
assert not args.wandb_args # default empty string
assert not args.wandb_config_args # default empty string
assert args.model == "hf" # default value of 'hf'
assert not args.model_args # default empty string
assert args.batch_size == 1 # default value of 1
assert args.max_batch_size is None
assert args.device is None

Collaborator:

make sure these are raised during config class validation

Contributor Author:

Moving this to _validate would require lm_eval to be installed even for a dry_run, and would also require a separate parameter-preparation function, since it now not only passes arguments but also creates the lm_eval tasks and evaluation_tracker objects.

Contributor Author:

tasks are the objects that are actually passed to evaluate, and evaluation_tracker tracks and saves details to disk.

@@ -0,0 +1,25 @@
import argparse

Collaborator:

I don't really see the point of this, it's just a synonym of training. Let's remove it and maybe later add proper standalone evaluation.

Contributor Author:

Having a separate CLI command for evaluation, even if it's just a wrapper around training, was requested by @tscholak, so that people can start using it with our models with fml-ops.

Contributor Author:

There are three use cases:

  • To run evaluations during training, which is handled by the fast-llm train command.
  • To run evaluation only, using a training config; this is what the evaluate command currently supports.
  • To run evaluation on a model without a training config; this is the candidate for standalone evaluation.

Collaborator:

I get the need, but a separate command for it brings nothing new. It's not harder to write train than evaluate, train_iters is 0 by default anyway, and if using an existing config with a specified train_iters it's simple enough to add the CLI argument. I'd rather keep the evaluate keyword for standalone evaluation, even if we don't do it yet.

Also this won't really work with #245

Collaborator:

@jlamypoirier, @bigximik, thanks for having this discussion. To clarify:

We absolutely need a standalone evaluate command. It should be configured exactly like the evaluation that runs during training. The goal is total parity: when we run evaluation separately, it must match precisely what happens during training (same configs, same metrics, same logging to wandb under the same project, group, and run). This ensures reproducibility, avoids maintaining multiple workflows, and streamlines processes.

Yes, technically it could be wrapped into train with additional flags, but that's exactly the complexity we don't want. There should be a single obvious way to run standalone evaluation. Having one clear, endorsed entry point (evaluate) reduces cognitive overhead and simplifies team practices.

I'd like to separate this point from implementation specifics. This functionality is high-priority and needs to land ASAP, either in this PR or @bigximik's next one. And I want us to be on the same page and acknowledge the need. So can we please do that?

Also, @bigximik is driving this work and owns the deliverable. Let's keep this in mind, @jlamypoirier.

Collaborator:

Btw, I get that under the hood this could be routed through train with train_iters=0. Thanks for pointing this out, @jlamypoirier. I believe that's what @bigximik is doing here.
The point is that we expose a clear evaluate command to users. Even if it starts out as a thin wrapper, that's a win. It gives us the simplicity and clarity we want. Perfect is the enemy of good enough. Let's land this and improve it incrementally if needed.

Contributor Author:

Okay, I will leave evaluate as-is for now. Once this PR is merged, I'll create a proper evaluate command that can accept either a TrainerConfig or an EvaluationConfig.

We'll likely need to call the config for the evaluate command EvaluationConfig, since EvaluatorConfig is already used for individual evaluations. Alternatively, we could name it EvaluatorRunnerConfig.

@jlamypoirier @tscholak, thoughts on the config naming?

raise NotImplementedError()

@classmethod
def from_fast_llm_model_in_training(

Collaborator:

def from_trainer(cls, trainer: Trainer, **kwargs):

Contributor Author:

I have changed it to from_model, and it accepts an optional runner and micro_batch_size.

Collaborator:

from_trainer would be more appropriate since it's the construct we are actually using.
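
For reference, a rough sketch of the two alternatives being discussed; the class name, constructor bodies, and the trainer.model attribute are assumptions for illustration only, not the actual Fast-LLM code:

class EvalHarnessModel:  # hypothetical name, for illustration only
    def __init__(self, model, runner=None, micro_batch_size=None):
        self._model = model
        self._runner = runner
        self._micro_batch_size = micro_batch_size

    @classmethod
    def from_model(cls, model, runner=None, micro_batch_size=None):
        # Current approach in this draft: construct directly from a Fast-LLM model,
        # with an optional runner and micro_batch_size.
        return cls(model, runner=runner, micro_batch_size=micro_batch_size)

    @classmethod
    def from_trainer(cls, trainer, **kwargs):
        # Alternative suggested above: take the Trainer, the construct actually in use,
        # and pull the model from it (attribute access is assumed here).
        return cls.from_model(trainer.model, **kwargs)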

@bigximik (Contributor Author) commented May 9, 2025

I've finished working on this draft and will create 3 new PRs from it:

  • Generate support
  • Refactoring of evaluations
  • lm_eval integration

In addition to the changes here, I'll be adding tests and documentation updates as needed.

I'll also be tracking this draft in case further discussion continues here.

@bigximik changed the title from "Implementation of generate and integration of lm_eval (evaluation harness)" to "Sandbox for Implementation of generate and integration of lm_eval (evaluation harness)" on May 9, 2025.
def get_iteration_count(self, training_iterations: int, extra_evaluations: int = 0):
# Number of completed validation iterations
return (self.get_count(training_iterations) + extra_evaluations) * self.iterations if self.enabled() else 0
class TrainingEvaluatorConfig(EvaluatorConfigBase):

Collaborator:

I don't see the point of the inheritance. We can just call .evaluator.get_evaluator()

Contributor Author:

So that the evaluator runner does not need to know which one it is instantiating; it works only with the interface:

https://github.com/ServiceNow/Fast-LLM/blob/denis/generate/fast_llm/engine/evaluation/evaluator.py#L352

This is in preparation for the standalone evaluation command. Actually, everything is ready for it; we just need to decide how to name the EvaluationConfig for it and how exactly to lay out the needed fields: do we want to mimic the Trainer config, or just put everything at the same base level?

# Number of completed evaluation runs
return (self.run_interval.get_count(training_iterations) + extra_evaluations) if self.run_interval.enabled() else 0

def get_evaluator(

Collaborator:

Why this?

Contributor Author:

It's for the same reason: so that EvaluatorRunner can instantiate evaluators from either TrainingEvaluatorConfig or EvaluatorConfig transparently.

TrainingEvaluatorConfig and TrainingEvaluator simply wrap the Evaluator classes to support being called multiple times, controlled via run_interval.

This is part of the preparation for introducing a standalone evaluate command.

Contributor Author:

In TrainerConfig, we have a dict[str, TrainingEvaluatorConfig], which supports multiple evaluation runs (controlled via run_interval).

In EvaluationConfig, we'll have a dict[str, EvaluatorConfig], which supports a single evaluation run.

Both types can be passed transparently to EvaluatorRunner, allowing it to handle either case uniformly.
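
To make that shape concrete, here is a minimal sketch of the pattern described above; the class bodies are assumptions for illustration, not the actual Fast-LLM implementation (only the names EvaluatorConfigBase, EvaluatorConfig, TrainingEvaluatorConfig, TrainingEvaluator, EvaluatorRunner, get_evaluator, and run_interval come from this thread):

import abc


class EvaluatorConfigBase(abc.ABC):
    @abc.abstractmethod
    def get_evaluator(self) -> "Evaluator":
        ...


class EvaluatorConfig(EvaluatorConfigBase):
    # Configures a single evaluation run.
    def get_evaluator(self) -> "Evaluator":
        return Evaluator(self)


class TrainingEvaluatorConfig(EvaluatorConfigBase):
    # Wraps an inner EvaluatorConfig so the evaluation can run repeatedly
    # during training, controlled via run_interval.
    def __init__(self, evaluator: EvaluatorConfig, run_interval) -> None:
        self.evaluator = evaluator
        self.run_interval = run_interval

    def get_evaluator(self) -> "Evaluator":
        return TrainingEvaluator(self)


class Evaluator:
    def __init__(self, config: EvaluatorConfigBase) -> None:
        self._config = config


class TrainingEvaluator(Evaluator):
    pass


class EvaluatorRunner:
    # Sees only the EvaluatorConfigBase interface, so a dict of either
    # TrainingEvaluatorConfig (from TrainerConfig) or plain EvaluatorConfig
    # (from a future EvaluationConfig) works the same way.
    def __init__(self, evaluations: dict[str, EvaluatorConfigBase]) -> None:
        self._evaluators = {name: config.get_evaluator() for name, config in evaluations.items()}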

@bigximik mentioned this pull request on May 12, 2025.
@jlamypoirier changed the title from "Sandbox for Implementation of generate and integration of lm_eval (evaluation harness)" to "[Prototype] Sandbox for Implementation of generate and integration of lm_eval (evaluation harness)" on Jun 4, 2025.