[Prototype] Sandbox for Implementation of generate and integration of lm_eval (evaluation harness) #222
Conversation
I have created a debugging sandbox with manual tests for now. The results are as follows:

Ignoring `attention_mask`:

| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Same output (same model via HF and Fast-LLM) | Same output | Different output |
| 2 | Different output | Different output | Different output |
Converting `attention_mask` (from HF `forward`) to `sequence_lengths`:

| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Fast-LLM empty output | Fast-LLM empty output | Different output |
| 2 | Fast-LLM empty output | Fast-LLM empty output | Different output |
It seems `sequence_lengths` is not supported for fused attention and does not improve the Flash Attention results. Could this be correct?

If `attention_mask` is a left-padded mask like this:

`[[0, 0, 0, 1, 1, 1, 1], ....]`

I convert it to `sequence_lengths = [[3, 4], ....]`.
```python
import torch

# First non-zero index per row, or zero if the row is all zeros (invalid row).
first_non_zero_indexes = attention_mask.argmax(dim=1)

# Check that each sequence is left-padded: the remaining entries must be a contiguous run of 1s.
assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero_indexes)).all()

sequence_lengths = [
    torch.tensor(
        [attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64
    )
    for el in first_non_zero_indexes.tolist()
]
```
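For concreteness, here is the same conversion run on the example mask above. This is just a self-contained sanity check based on the values quoted in this comment, not code taken from the PR:

```python
import torch

# Example left-padded mask from above: three padding zeros, then four real tokens.
attention_mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1]], dtype=torch.int64)

first_non_zero_indexes = attention_mask.argmax(dim=1)
assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero_indexes)).all()

sequence_lengths = [
    torch.tensor(
        [attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64
    )
    for el in first_non_zero_indexes.tolist()
]

print(sequence_lengths)  # [tensor([3, 4])] -> 3 padding positions, 4 real tokens
```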
@sohamparikh @jlamypoirier Hi, I am trying to use the cross-document attention prevention that @tscholak pointed me to (https://github.com/ServiceNow/Fast-LLM/pull/177/files) to mimic left padding for documents in a batch during generation. It appears to be doing the correct thing, such as building the internal mask and position IDs, but it is not working. Could you please comment on what might be wrong? Thanks!
…model and saving
```python
assert not args.wandb_args  # default empty string
assert not args.wandb_config_args  # default empty string
assert args.model == "hf"  # default value of 'hf'
assert not args.model_args  # default empty string
assert args.batch_size == 1  # default value of 1
assert args.max_batch_size is None
assert args.device is None
```
make sure these are raised during config class validation
Moving this to `_validate` will require `lm_eval` to be installed even for a dry run, and will also require splitting parameter preparation into a separate function, since it now doesn't just pass arguments through but also creates `lm_eval` tasks and `evaluation_tracker` objects.
`tasks` are the objects that are actually passed to `evaluate`, and `evaluation_tracker` tracks results and saves details to disk.
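To make the trade-off concrete, here is a minimal sketch of one way the split could look. Everything in it is an assumption for illustration (the class name, the `_validate` and `_prepare_lm_eval` methods, the error messages); the idea is only that cheap argument checks raise during config validation without importing `lm_eval`, while task and `evaluation_tracker` creation stays in a separate step that runs at evaluation time.

```python
import argparse


class LmEvalEvaluatorConfig:  # hypothetical config class, for illustration only
    def __init__(self, args: argparse.Namespace):
        self.args = args

    def _validate(self) -> None:
        # Cheap checks on the parsed arguments: raise instead of assert, and no
        # lm_eval import is needed, so a dry run can validate without lm_eval installed.
        if self.args.wandb_args or self.args.wandb_config_args:
            raise ValueError("wandb arguments are not supported")
        if self.args.model != "hf":
            raise ValueError("only the default model type 'hf' is supported")
        if self.args.model_args:
            raise ValueError("model_args is not supported")
        if self.args.batch_size != 1 or self.args.max_batch_size is not None:
            raise ValueError("only batch_size=1 is supported")
        if self.args.device is not None:
            raise ValueError("device must not be set")

    def _prepare_lm_eval(self):
        # Heavier preparation kept separate: this is where lm_eval tasks and the
        # evaluation_tracker object would be created, so lm_eval is only imported
        # when an evaluation actually runs.
        import lm_eval  # noqa: F401

        raise NotImplementedError("task and evaluation_tracker creation goes here")


# Validation alone never touches lm_eval:
config = LmEvalEvaluatorConfig(
    argparse.Namespace(
        wandb_args="", wandb_config_args="", model="hf", model_args="",
        batch_size=1, max_batch_size=None, device=None,
    )
)
config._validate()
```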
```
@@ -0,0 +1,25 @@
import argparse
```
I don't really see the point of this; it's just a synonym of training. Let's remove it and maybe later add proper standalone evaluation.
Having a separate CLI command for evaluation, even if it's just a wrapper around training, was requested by @tscholak, so that people can start using it with our models via fml-ops.
There are three use cases:

- To run evaluations during training, which is handled by the `fast-llm train` command.
- To run evaluation only, using a training config; this is what the `evaluate` command currently supports.
- To run evaluation on a model without a training config; this is the candidate for standalone evaluation.
I get the need, but there is no need for a separate command for it, as it brings nothing new. It's not harder to write `train` than `evaluate`, `train_iters` is 0 by default anyway, and if using an existing config with a specified `train_iters` it's simple enough to add the CLI argument. I'd rather keep the `evaluate` keyword for standalone evaluation, even if we don't do it yet.

Also, this won't really work with #245.
@jlamypoirier, @bigximik, thanks for having this discussion. To clarify:
We absolutely need a standalone evaluate command. It should be configured exactly like the evaluation that runs during training. The goal is total parity: when we run evaluation separately, it must match precisely what happens during training (same configs, same metrics, same logging to wandb under the same project, group, and run). This ensures reproducibility, avoids maintaining multiple workflows, and streamlines processes.
Yes, technically it could be wrapped into `train` with additional flags, but that's exactly the complexity we don't want. There should be a single obvious way to run standalone evaluation. Having one clear, endorsed entry point (`evaluate`) reduces cognitive overhead and simplifies team practices.
I'd like to separate this point from implementation specifics. This functionality is high-priority and needs to land ASAP, either in this PR or @bigximik's next one. And I want us to be on the same page and acknowledge the need. So can we please do that?
Also, @bigximik is driving this work and owns the deliverable. Let's keep this in mind, @jlamypoirier.
Btw, I get that under the hood this could be routed through `train` with `train_iters=0`. Thanks for pointing this out, @jlamypoirier. I believe that's what @bigximik is doing here.

The point is that we expose a clear `evaluate` command to users. Even if it starts out as a thin wrapper, that's a win. It gives us the simplicity and clarity we want. Perfect is the enemy of good enough. Let's land this and improve it incrementally if needed.
Okay, I will leave `evaluate` as-is for now. Once this PR is merged, I'll create a proper `evaluate` command that can accept either a `TrainerConfig` or an `EvaluationConfig`.

We'll likely need to call the config for the `evaluate` command `EvaluationConfig`, since `EvaluatorConfig` is already used for individual evaluations. Alternatively, we could name it `EvaluatorRunnerConfig`.

@jlamypoirier @tscholak, thoughts on the config naming?
```python
raise NotImplementedError()


@classmethod
def from_fast_llm_model_in_training(
```
```python
def from_trainer(cls, trainer: Trainer, **kwargs):
```
I have changed it to `from_model`, and it accepts optional `runner` and `micro_batch_size` arguments.
`from_trainer` would be more appropriate since it's the construct we are actually using.
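For reference, a rough sketch of the two constructor shapes being discussed. The wrapper class name, type annotations, and defaults below are illustrative assumptions; only the method names and the optional `runner` and `micro_batch_size` parameters come from the comments above.

```python
import typing


class WrappedFastLLMModel:  # hypothetical class name, for illustration only
    @classmethod
    def from_model(
        cls,
        model: typing.Any,  # the Fast-LLM model to wrap (actual type assumed)
        runner: typing.Optional[typing.Any] = None,  # optional runner, reused when evaluating during training
        micro_batch_size: typing.Optional[int] = None,  # optional micro-batch-size override
        **kwargs: typing.Any,
    ) -> "WrappedFastLLMModel":
        # Current approach: build the wrapper directly from a model.
        ...

    @classmethod
    def from_trainer(cls, trainer: typing.Any, **kwargs: typing.Any) -> "WrappedFastLLMModel":
        # Alternative suggested in review: build the wrapper from the trainer,
        # since that is the construct actually being used.
        ...
```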
…nts instead of TrainerConfig
…iningProgressInfor to TrainingProgress
I've finished working on this draft and will create 3 new PRs from it.

In addition to the changes here, I'll be adding tests and documentation updates as needed. I'll also keep tracking this draft in case further discussion continues here.
```python
def get_iteration_count(self, training_iterations: int, extra_evaluations: int = 0):
    # Number of completed validation iterations
    return (self.get_count(training_iterations) + extra_evaluations) * self.iterations if self.enabled() else 0


class TrainingEvaluatorConfig(EvaluatorConfigBase):
```
I don't see the point of the inheritance. We can just call `.evaluator.get_evaluator()`.
So the evaluator runner does not need to know which config type it is instantiating; it works purely through the interface.

This is in preparation for the standalone evaluation command. Actually, everything is ready for it; we just need to decide how to name its `EvaluationConfig` and how exactly to lay out the needed fields: do we want to mimic the `Trainer` config, or just put everything at the same base level?
```python
# Number of completed evaluation runs
return (self.run_interval.get_count(training_iterations) + extra_evaluations) if self.run_interval.enabled() else 0


def get_evaluator(
```
Why this?
It's for the same reason: so that `EvaluatorRunner` can instantiate evaluators from either `TrainingEvaluatorConfig` or `EvaluatorConfig` transparently.

`TrainingEvaluatorConfig` and `TrainingEvaluator` simply wrap the `Evaluator` classes to support being called multiple times, controlled via `run_interval`.

This is part of the preparation for introducing a standalone `evaluate` command.
In `TrainerConfig`, we have a `dict[str, TrainingEvaluatorConfig]`, which supports multiple evaluation runs (controlled via `run_interval`).

In `EvaluationConfig`, we'll have a `dict[str, EvaluatorConfig]`, which supports a single evaluation run.

Both types can be passed transparently to `EvaluatorRunner`, allowing it to handle either case uniformly.
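A condensed sketch of the structure described above. The class names and the `get_evaluator` / `run_interval` relationships come from this thread; the method bodies, constructor signatures, and treating `run_interval` as a plain `int` are simplifying assumptions.

```python
import abc


class Evaluator:
    def run(self) -> dict:
        return {}  # placeholder; a real evaluator would compute and return metrics


class EvaluatorConfigBase(abc.ABC):
    # The shared interface that EvaluatorRunner works against.
    @abc.abstractmethod
    def get_evaluator(self) -> Evaluator: ...


class EvaluatorConfig(EvaluatorConfigBase):
    # Configures a single evaluation run (what EvaluationConfig would hold).
    def get_evaluator(self) -> Evaluator:
        return Evaluator()


class TrainingEvaluator(Evaluator):
    # Wraps an Evaluator so it can be called repeatedly during training.
    def __init__(self, inner: Evaluator, run_interval: int):
        self.inner, self.run_interval = inner, run_interval

    def run(self) -> dict:
        return self.inner.run()


class TrainingEvaluatorConfig(EvaluatorConfigBase):
    # Configures repeated evaluation runs (what TrainerConfig holds).
    def __init__(self, evaluator: EvaluatorConfig, run_interval: int):
        self.evaluator, self.run_interval = evaluator, run_interval

    def get_evaluator(self) -> Evaluator:
        return TrainingEvaluator(self.evaluator.get_evaluator(), self.run_interval)


class EvaluatorRunner:
    # Accepts either a dict[str, TrainingEvaluatorConfig] (training) or a
    # dict[str, EvaluatorConfig] (standalone evaluation) transparently, because
    # it only relies on EvaluatorConfigBase.get_evaluator().
    def __init__(self, evaluators: dict[str, EvaluatorConfigBase]):
        self._evaluators = {name: config.get_evaluator() for name, config in evaluators.items()}

    def run_all(self) -> dict[str, dict]:
        return {name: evaluator.run() for name, evaluator in self._evaluators.items()}
```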
Description
This PR draft will be split into 3 PRs.