
[Prototype] Sandbox for Implementation of generate and integration of lm_eval (evaluation harness) #222


Draft · wants to merge 47 commits into base: main

Conversation

@bigximik (Contributor) commented Apr 3, 2025

✨ Description

This PR draft will be split into 3 PRs.

πŸ” Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

πŸ“ Changes

List the key changes introduced in this PR:

  1. Change A
  2. Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • πŸ‹ I have updated the Docker configuration or dependencies, if applicable.
  • πŸ”„ I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@bigximik (Contributor Author) commented Apr 3, 2025

I have created a debugging sandbox with manual tests for now. The results are as follows:

Ignoring attention_mask and position_ids:

| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Same output (same model via HF and Fast-LLM) | Same output | Different output |
| 2 | Different output | Different output | Different output |

Converting attention_mask (from HF forward) to sequence_lengths:

| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | FastLLM empty output | FastLLM empty output | Different output |
| 2 | FastLLM empty output | FastLLM empty output | Different output |

It seems sequence_lengths is not supported for fused attention, and it does not help with Flash Attention either. Could this be correct?

If attention_mask is a left-padded mask like this:
[[0, 0, 0, 1, 1, 1, 1], ....]
I convert it to sequence_lengths = [[3, 4], ....].

# First non-zero index per row, or 0 if the row is all zeros (invalid row)
first_non_zero_indexes = attention_mask.argmax(dim=1)

# Check that the sequence is left-padded, i.e. the remaining entries are contiguous 1s
assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero_indexes)).all()

sequence_lengths = [
    torch.tensor(
        [attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64
    )
    for el in first_non_zero_indexes.tolist()
]
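
For a concrete check, here is a minimal, self-contained version of the conversion applied to the example mask above (the function wrapper and its name are mine, purely for illustration):

import torch


def mask_to_sequence_lengths(attention_mask: torch.Tensor) -> list[torch.Tensor]:
    # Index of the first non-zero entry per row (0 if the row is all zeros, i.e. an invalid row)
    first_non_zero = attention_mask.argmax(dim=1)
    # The mask must be left-padded: everything after the first 1 must also be 1
    assert (attention_mask.sum(dim=1) == attention_mask.shape[1] - first_non_zero).all()
    return [
        torch.tensor([attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64)
        for el in first_non_zero.tolist()
    ]


# 3 padding tokens followed by 4 real tokens, as in the example above
print(mask_to_sequence_lengths(torch.tensor([[0, 0, 0, 1, 1, 1, 1]])))  # [tensor([3, 4])]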

@bigximik (Contributor Author) commented Apr 3, 2025

@sohamparikh @jlamypoirier Hi, I am trying to use the cross-document attention prevention that @tscholak pointed me to (https://github.com/ServiceNow/Fast-LLM/pull/177/files) to mimic left padding for documents in a batch during generation. It appears to be doing the correct thing internally (building the internal mask and position IDs), but it is not working. Could you please comment on what might be wrong? Thanks!

Comment on lines 335 to 341
assert not args.wandb_args # default empty string
assert not args.wandb_config_args # default empty string
assert args.model == "hf" # default value of 'hf'
assert not args.model_args # default empty string
assert args.batch_size == 1 # default value of 1
assert args.max_batch_size is None
assert args.device is None

Collaborator:

make sure these are raised during config class validation

Contributor Author:

Moving this to _validate would require lm_eval to be installed even for a dry_run, and would also require a separate parameter-preparation function, since it now not only passes arguments but also creates the lm_eval tasks and evaluation_tracker objects.

Contributor Author:

tasks are the objects that are actually passed to evaluate, and evaluation_tracker tracks and saves details to disk.

@@ -0,0 +1,25 @@
import argparse

Collaborator:

I don't really see the point of this, it's just a synonym of training. Let's remove it and maybe later add proper standalone evaluation.

Contributor Author:

Having a separate CLI command for evaluation, even if it's just a wrapper around training, was requested by @tscholak, so that people can start using it with our models with fml-ops.

Contributor Author:

There are three use cases:

  • To run evaluations during training, which is handled by the fast-llm train command.
  • To run evaluation only, using a training config; this is what the evaluate command currently supports.
  • To run evaluation on a model without a training config; this is the candidate for standalone evaluation.

Collaborator:

I get the need, but a separate command for it brings nothing new. It's not harder to write train than evaluate, train_iters is 0 by default anyway, and if using an existing config with a specified train_iters it's simple enough to add the CLI argument. I'd rather keep the evaluate keyword for standalone evaluation, even if we don't do it yet.

Also this won't really work with #245

Collaborator:

@jlamypoirier, @bigximik, thanks for having this discussion. To clarify:

We absolutely need a standalone evaluate command. It should be configured exactly like the evaluation that runs during training. The goal is total parity: when we run evaluation separately, it must match precisely what happens during training (same configs, same metrics, same logging to wandb under the same project, group, and run). This ensures reproducibility, avoids maintaining multiple workflows, and streamlines processes.

Yes, technically it could be wrapped into train with additional flags, but that's exactly the complexity we don't want. There should be a single obvious way to run standalone evaluation. Having one clear, endorsed entry point (evaluate) reduces cognitive overhead and simplifies team practices.

I'd like to separate this point from implementation specifics. This functionality is high-priority and needs to land ASAP, either in this PR or @bigximik's next one. And I want us to be on the same page and acknowledge the need. So can we please do that?

Also, @bigximik is driving this work and owns the deliverable. Let's keep this in mind, @jlamypoirier.

Collaborator:

Btw, I get that under the hood this could be routed through train with train_iters=0. Thanks for pointing this out, @jlamypoirier. I believe that's what @bigximik is doing here.
The point is that we expose a clear evaluate command to users. Even if it starts out as a thin wrapper, that's a win. It gives us the simplicity and clarity we want. Perfect is the enemy of good enough. Let's land this and improve it incrementally if needed.

Contributor Author:

Okay, I will leave evaluate as-is for now. Once this PR is merged, I'll create a proper evaluate command that can accept either a TrainerConfig or an EvaluationConfig.

We'll likely need to call the config for the evaluate command EvaluationConfig, since EvaluatorConfig is already used for individual evaluations. Alternatively, we could name it EvaluatorRunnerConfig.

@jlamypoirier @tscholak, thoughts on the config naming?

raise NotImplementedError()

@classmethod
def from_fast_llm_model_in_training(

Collaborator:

def from_trainer(cls, trainer: Trainer, **kwargs):

Contributor Author:

I have changed it to from_model, and it accepts an optional runner and micro_batch_size.

Collaborator:

from_trainer would be more appropriate since it's the construct we are actually using.
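
For reference, a rough sketch of the two alternatives being discussed; the class name, constructor bodies, and the trainer.model attribute are assumptions for illustration only, not the actual Fast-LLM code:

class EvalHarnessModel:  # hypothetical name, for illustration only
    def __init__(self, model, runner=None, micro_batch_size=None):
        self._model = model
        self._runner = runner
        self._micro_batch_size = micro_batch_size

    @classmethod
    def from_model(cls, model, runner=None, micro_batch_size=None):
        # Current approach in this draft: construct directly from a Fast-LLM model,
        # with an optional runner and micro_batch_size.
        return cls(model, runner=runner, micro_batch_size=micro_batch_size)

    @classmethod
    def from_trainer(cls, trainer, **kwargs):
        # Alternative suggested above: take the Trainer, the construct actually in use,
        # and pull the model from it (attribute access is assumed here).
        return cls.from_model(trainer.model, **kwargs)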

@bigximik (Contributor Author) commented May 9, 2025

I've finished working on this draft and will create 3 new PRs from it:

  • Generate support
  • Refactoring of evaluations
  • lm_eval integration

In addition to the changes here, I'll be adding tests and documentation updates as needed.

I'll also be tracking this draft in case further discussion continues here.

@bigximik changed the title from "Implementation of generate and integration of lm_eval (evaluation harness)" to "Sandbox for Implementation of generate and integration of lm_eval (evaluation harness)" on May 9, 2025.
def get_iteration_count(self, training_iterations: int, extra_evaluations: int = 0):
# Number of completed validation iterations
return (self.get_count(training_iterations) + extra_evaluations) * self.iterations if self.enabled() else 0
class TrainingEvaluatorConfig(EvaluatorConfigBase):

Collaborator:

I don't see the point of the inheritance. We can just call .evaluator.get_evaluator()

Contributor Author:

So that the evaluator runner does not need to know which one it is instantiating; it works only with the interface:

https://github.com/ServiceNow/Fast-LLM/blob/denis/generate/fast_llm/engine/evaluation/evaluator.py#L352

This is in preparation for the standalone evaluation command. Actually, everything is ready for it; we just need to decide how to name the EvaluationConfig for it and how exactly to lay out the needed fields: do we want to mimic the Trainer config, or just put everything at the same base level?

# Number of completed evaluation runs
return (self.run_interval.get_count(training_iterations) + extra_evaluations) if self.run_interval.enabled() else 0

def get_evaluator(

Collaborator:

Why this?

Contributor Author:

It's for the same reason: so that EvaluatorRunner can instantiate evaluators from either TrainingEvaluatorConfig or EvaluatorConfig transparently.

TrainingEvaluatorConfig and TrainingEvaluator simply wrap the Evaluator classes to support being called multiple times, controlled via run_interval.

This is part of the preparation for introducing a standalone evaluate command.

Contributor Author:

In TrainerConfig, we have a dict[str, TrainingEvaluatorConfig], which supports multiple evaluation runs (controlled via run_interval).

In EvaluationConfig, we'll have a dict[str, EvaluatorConfig], which supports a single evaluation run.

Both types can be passed transparently to EvaluatorRunner, allowing it to handle either case uniformly.
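
To make that shape concrete, here is a minimal sketch of the pattern described above; the class bodies are assumptions for illustration, not the actual Fast-LLM implementation (only the names EvaluatorConfigBase, EvaluatorConfig, TrainingEvaluatorConfig, TrainingEvaluator, EvaluatorRunner, get_evaluator, and run_interval come from this thread):

import abc


class EvaluatorConfigBase(abc.ABC):
    @abc.abstractmethod
    def get_evaluator(self) -> "Evaluator":
        ...


class EvaluatorConfig(EvaluatorConfigBase):
    # Configures a single evaluation run.
    def get_evaluator(self) -> "Evaluator":
        return Evaluator(self)


class TrainingEvaluatorConfig(EvaluatorConfigBase):
    # Wraps an inner EvaluatorConfig so the evaluation can run repeatedly
    # during training, controlled via run_interval.
    def __init__(self, evaluator: EvaluatorConfig, run_interval) -> None:
        self.evaluator = evaluator
        self.run_interval = run_interval

    def get_evaluator(self) -> "Evaluator":
        return TrainingEvaluator(self)


class Evaluator:
    def __init__(self, config: EvaluatorConfigBase) -> None:
        self._config = config


class TrainingEvaluator(Evaluator):
    pass


class EvaluatorRunner:
    # Sees only the EvaluatorConfigBase interface, so a dict of either
    # TrainingEvaluatorConfig (from TrainerConfig) or plain EvaluatorConfig
    # (from a future EvaluationConfig) works the same way.
    def __init__(self, evaluations: dict[str, EvaluatorConfigBase]) -> None:
        self._evaluators = {name: config.get_evaluator() for name, config in evaluations.items()}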

@bigximik mentioned this pull request on May 12, 2025.
@jlamypoirier changed the title from "Sandbox for Implementation of generate and integration of lm_eval (evaluation harness)" to "[Prototype] Sandbox for Implementation of generate and integration of lm_eval (evaluation harness)" on Jun 4, 2025.