
Covariate Std Err with baselines #245


Open · wants to merge 7 commits into main


Conversation

@recursix (Collaborator) commented May 7, 2025

Description by Korbit AI

What change is being made?

Add new functionality for covariate standard error analysis with baselines in covariate_std_err.py, introduce new agent configurations and LLM settings, update environment variable documentation, adjust supported Python version, and extend the reproducibility journal. Also, include mock data for toy experiments under covariate_toy_experiment.

Why are these changes being made?

These changes enhance the analysis toolkit by providing methods for evaluating covariate effects on model performance. Incorporating new agents and LLM configurations improves adaptability for different LLM scenarios. Updating environment documentation provides clarity for users on configuration settings. Raising the required Python version aligns with newer dependencies, ensuring compatibility.


recursix and others added 5 commits May 7, 2025 08:23
- Implemented Task and Agent classes to simulate agent performance on tasks.
- Added methods for calculating task success rates and overall success rates.
- Included functions for sampling rewards from a Bernoulli distribution based on task success rates.
- Created plotting functions for visualizing task difficulty and Gaussian distributions.
- Introduced a utility function to augment matrices with averages for analysis.
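The Bernoulli reward-sampling idea from the mock-data commit can be sketched as follows. This is a hypothetical standalone illustration with synthetic success rates; the PR's `mock_data.py` defines its own `Task` and `Agent` classes, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-task success probabilities (stand-ins for Task objects).
task_success_rates = np.array([0.2, 0.5, 0.9])

def sample_rewards(success_rates, n_samples, rng):
    """Draw Bernoulli rewards: 1 with probability equal to the task success rate."""
    return rng.binomial(1, success_rates, size=(n_samples, len(success_rates)))

rewards = sample_rewards(task_success_rates, n_samples=1000, rng=rng)
overall_success_rate = rewards.mean()  # empirical estimate of the mean success rate
```

With enough samples, `overall_success_rate` converges to the average of the per-task probabilities.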
* add llama4-support

* llama4 maverick L1

* llama4 maverick L2

---------

Co-authored-by: agentlabtraces <[email protected]>
@recursix requested a review from optimass May 7, 2025 12:26
@korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Issues found, by category:
- Readability: Misspelled Parameter Name
- Readability: Missing Type Hint and Unclear Default
- Functionality: Missing Gaussian Parameter Validation
- Performance: Excessive Memory Usage in Cross-validation
- Functionality: Missing Promised Return Value
- Readability: Missing Type Hints in Function Signature
- Readability: Complex Formula Without Clarity
- Documentation: Incorrect return documentation
- Design: Duplicate GLM Implementation Logic
Files scanned:
- src/agentlab/agents/generic_agent/__init__.py
- src/agentlab/analyze/covariate_toy_experiment/mock_data.py
- src/agentlab/llm/llm_configs.py
- src/agentlab/agents/generic_agent/agent_configs.py
- src/agentlab/analyze/covariate_std_err.py
- src/agentlab/analyze/agent_xray.py


    competence: float,
    benchmark: list[Task],
    type: int = None,
    consistancy: float = 10,

Misspelled Parameter Name (Readability)

What is the issue?

The parameter 'consistancy' in Agent.init() is misspelled. It should be 'consistency'.

Why this matters

Misspelled variable names can cause confusion and make code harder to understand and maintain.

Suggested change:
consistency: float = 10,

Comment on lines +62 to +65
def agent_on_benchmark(
    agent: Agent,
    benchmark: list[Task],
    n_samples_per_task=None,

Missing Type Hint and Unclear Default (Readability)

What is the issue?

The parameter n_samples_per_task lacks a type hint and has None as default without explanation of what that means.

Why this matters

Missing type hints and unexplained None defaults make it unclear what values are acceptable for this parameter.

Suggested change:
def agent_on_benchmark(
    agent: Agent,
    benchmark: list[Task],
    n_samples_per_task: int | None = 1,  # Number of samples to generate per task

plt.title("Distribution of Task Difficulty")


def plot_gaussian(mu, sigma, label=None):

Missing Gaussian Parameter Validation (Functionality)

What is the issue?

The function doesn't validate that sigma is positive, which is required for a valid Gaussian distribution.

Why this matters

Negative or zero sigma values would result in invalid probability distributions or division by zero errors.

Suggested change:

Add parameter validation:

def plot_gaussian(mu, sigma, label=None):
    """
    Plot a Gaussian distribution with mean mu and standard deviation sigma.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    x = np.linspace(0, 1, 1000)
    plt.plot(
        x, 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mu) / sigma) ** 2), label=label
    )

Comment on lines +132 to +169
def std_err_diff_baselines(rewards, baselines):
    """
    Find the best baseline and compute the adjusted mean and SE.

    Parameters:
    - rewards: array-like of shape (n,)
        Observed rewards (may contain NaN).
    - baselines: array-like of shape (n, k)
        k baseline estimates per sample (may contain NaN).

    Returns:
    - adjusted_reward_mean: float
        Mean of valid rewards.
    - adjusted_se: float
        SE of the adjusted mean (differences) using the selected baseline.
    - selected_baseline: np.ndarray of shape (n,)
        The values of the chosen baseline with NaNs filled.
    """
    rewards = np.asarray(rewards, dtype=float)
    baselines = _replace_nans_by_average(baselines)

    if rewards.shape[0] != baselines.shape[0]:
        raise ValueError("rewards and baselines must have the same length.")

    # Identify valid reward samples
    valid = ~np.isnan(rewards)
    reward_valid = rewards[valid]
    if reward_valid.size == 0:
        return np.nan, np.nan

    selected_baseline_valid = _select_best_baseline(reward_valid, baselines[valid])
    diffs = reward_valid - selected_baseline_valid
    adjusted_se = np.std(diffs, ddof=1) / np.sqrt(diffs.size)

    # Adjusted mean reward is the raw mean of valid rewards
    adjusted_reward_mean = reward_valid.mean()

    return adjusted_reward_mean, adjusted_se
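The variance-reduction idea behind this function can be illustrated with a small self-contained simulation on synthetic data (a sketch, not the PR's code: when rewards correlate strongly with a baseline, the standard error of the differences is much smaller than the raw standard error):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data: rewards track a baseline estimate plus small noise.
baseline = rng.normal(0.5, 0.1, size=n)
rewards = baseline + rng.normal(0.0, 0.02, size=n)

# Raw SE of the mean reward vs. SE of the baseline-adjusted differences.
raw_se = np.std(rewards, ddof=1) / np.sqrt(n)
diff_se = np.std(rewards - baseline, ddof=1) / np.sqrt(n)
```

Here `diff_se` is roughly five times smaller than `raw_se`, because differencing removes the variance shared with the baseline.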

Missing Promised Return Value (Functionality)

What is the issue?

The function's docstring promises to return a 'selected_baseline' value, but the function only returns two values (adjusted_reward_mean and adjusted_se).

Why this matters

Callers expecting three return values will receive a ValueError or unpacking error when using this function.

Suggested change:

Either update the docstring to reflect the actual return values or modify the function to return the selected baseline as documented:

def std_err_diff_baselines(rewards, baselines):
    ...
    return adjusted_reward_mean, adjusted_se, selected_baseline_valid

Comment on lines +210 to +212
def std_err_glm_cv_regularized(
    rewards, baselines, lambda_grid=None, n_splits=5, n_boot=200, random_state=None
):

Missing Type Hints in Function Signature (Readability)

What is the issue?

Function uses generic names 'rewards' and 'baselines' without type hints, making it unclear what data types and value ranges are expected.

Why this matters

Without proper type hints, developers need to read the docstring or implementation to understand valid inputs, which slows down code comprehension and can lead to runtime errors.

Suggested change:
def std_err_glm_cv_regularized(
    rewards: np.ndarray,  # binary values (0 or 1)
    baselines: np.ndarray,  # shape (n, k) baseline estimates
    lambda_grid: Optional[np.ndarray] = None,
    n_splits: int = 5,
    n_boot: int = 200,
    random_state: Optional[int] = None
) -> Tuple[float, float]:
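For intuition about what a regularized regression adjustment buys here, the following is a minimal plain-NumPy sketch on synthetic data (hypothetical, not the PR's implementation, which also adds cross-validation over a lambda grid and bootstrapping): a ridge fit of rewards on baseline covariates shrinks the residual variance, and hence the standard error of the adjusted mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 400, 3

# Synthetic baselines and rewards that are a noisy linear function of them.
X = rng.normal(0.5, 0.1, size=(n, k))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0.0, 0.02, size=n)

# Ridge regression on centered data: beta = (X'X + lam*I)^-1 X'y.
lam = 1e-2
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)

# Residual-based SE of the adjusted mean vs. raw SE of the mean.
resid = yc - Xc @ beta
adjusted_se = np.std(resid, ddof=k + 1) / np.sqrt(n)
raw_se = np.std(y, ddof=1) / np.sqrt(n)
```

The regularization matters when the baselines are many or collinear; with `lam = 0` this reduces to an ordinary least-squares adjustment.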

@marcotet commented May 7, 2025

The aggregate_success method is imported and used in the notebook, but it is not defined in the module it is imported from:
ImportError: cannot import name 'aggregate_success' from 'agentlab.analyze.covariate_std_err'
