This document explains, step by step, how to evaluate the generation component of the system. Follow the steps in order so that the generation run and its evaluation are consistent.
The evaluation uses the following metrics:
- Correctness Mean Score: This metric evaluates whether the generated answer correctly matches the reference answer provided for the query. The score ranges from 1 to 5, where a higher score indicates better correctness.
- Faithfulness Relevancy: This metric checks if the generated answer is relevant to the contexts provided. It evaluates whether the response focuses on and pertains to the information within the context. The accepted values are YES or NO.
- Faithfulness Accuracy: This metric assesses if the information provided in the generated answer is accurate and correctly reflects the context. While relevancy checks if the answer is related to the context, accuracy evaluates if the details provided are correct. The accepted values are YES or NO.
- Faithfulness Conciseness and Pertinence: This metric evaluates whether the generated answer avoids including unrelated or irrelevant information and remains concise. The accepted values are YES or NO.
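To illustrate how these four metrics could be summarized over a batch of evaluated answers, here is a minimal sketch. The record layout and the field names (`correctness`, `relevancy`, `accuracy`, `conciseness`) are assumptions made for this example and are not taken from the project's output schema.

```python
from statistics import mean


def summarize(records):
    """Aggregate per-answer evaluations into summary metrics.

    Each record is assumed to hold a 1-5 correctness score and
    three YES/NO faithfulness verdicts (hypothetical field names).
    """
    def yes_rate(field):
        # Fraction of records whose verdict for `field` is YES.
        hits = sum(1 for r in records if r[field].strip().upper() == "YES")
        return hits / len(records)

    return {
        "correctness_mean_score": mean(r["correctness"] for r in records),
        "faithfulness_relevancy": yes_rate("relevancy"),
        "faithfulness_accuracy": yes_rate("accuracy"),
        "faithfulness_conciseness": yes_rate("conciseness"),
    }


records = [
    {"correctness": 5, "relevancy": "YES", "accuracy": "YES", "conciseness": "NO"},
    {"correctness": 3, "relevancy": "YES", "accuracy": "NO", "conciseness": "NO"},
]
summary = summarize(records)
print(summary)
```

Reporting the YES/NO metrics as a fraction of YES verdicts is one possible convention; the actual endpoint may report them differently.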
Make sure you have completed the first two steps mentioned in the retrieval evaluation:
- Evaluation dataset: Prepare a dataset containing the questions and answers for each resource or resource chunk, which will be used for the evaluation.
- Populate database: Ensure that the Elasticsearch database is populated using one of the strategies outlined in the indexing strategies documentation.
@router.post("/dataset_generation")
async def dataset_generation(
    file: UploadFile = File(...),
    limit: int = File(None),
    question_column: str = File("openai_query"),
    model_prompt: str = Form("llama3.1")...):
- file: The JSONL file generated using the OpenAI batch API. This file will be used as input for generating responses.
- limit: For testing purposes, this parameter limits the number of queries to be processed, allowing you to generate only a few responses and ensure everything is working correctly.
- question_column: The name of the column containing the query when the JSONL file is converted to a Pandas DataFrame.
- model_prompt: Specifies the model for which the prompt is generated. If you edited the Docker Compose file before startup and changed the model in the Llama.cpp container, set this parameter to match. The available options are llama3.1 and Phi3.5-mini.
- llm_model: The local LLM model to be used for generation.
- search_text_boost: Boost value for the text-based search, same as in the retrieval evaluation. Refer to the retrieval evaluation documentation for more details.
- search_embedding_boost: Boost value for the embedding-based search, also the same as in the retrieval evaluation.
- k: The number of top documents returned from the search, similar to the retrieval evaluation.
- process: Indicates whether the responses were generated by OpenAI or the local LLM. This is used to assign the appropriate name to the final output file.
- job: Specifies the type of task, whether it is for summarization, retrieval evaluation, or generation evaluation. This also affects the naming of the output file.
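The interaction between file, question_column, and limit can be sketched as follows. This is only an illustration using the standard library: it assumes each JSONL line is a flat JSON object with the query stored under `question_column`, and the helper name `load_queries` and the sample rows are hypothetical.

```python
import io
import json


def load_queries(jsonl_stream, question_column="openai_query", limit=None):
    """Read queries from a JSONL stream, stopping after `limit` rows."""
    queries = []
    for line in jsonl_stream:
        if limit is not None and len(queries) >= limit:
            break  # enough rows for a quick test run
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        queries.append(row[question_column])
    return queries


# Hypothetical sample data standing in for the uploaded file.
SAMPLE = (
    '{"resource_id": "r1", "openai_query": "What is A?", "openai_answer": "A1"}\n'
    '{"resource_id": "r2", "openai_query": "What is B?", "openai_answer": "B1"}\n'
    '{"resource_id": "r3", "openai_query": "What is C?", "openai_answer": "C1"}\n'
)

first_two = load_queries(io.StringIO(SAMPLE), limit=2)
all_queries = load_queries(io.StringIO(SAMPLE))
print(first_two)
```

Running with a small limit first is a cheap way to confirm that the column name matches the file before processing the full dataset.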
After starting Docker Compose and populating the database, run the generation/dataset_generation endpoint to answer the queries that were initially created with the OpenAI batch API (which generated a question and an answer for each resource). The endpoint executes the queries one by one using the configured local LLM and stores all the responses in a file in the app/data directory.
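A client call to this endpoint might look like the sketch below. The base URL `http://localhost:8000`, the helper names, and the timeout are assumptions; only the form fields whose defaults appear in the signature above are set, and the remaining parameters are left to the server's defaults. The `requests` package is a third-party dependency and is imported only when the request is actually sent.

```python
def build_generation_request(base_url="http://localhost:8000", limit=5):
    """Return the URL and form fields for a small test run."""
    url = f"{base_url}/generation/dataset_generation"
    data = {
        "limit": limit,                     # process only a few queries
        "question_column": "openai_query",  # documented default
        "model_prompt": "llama3.1",         # documented default
    }
    return url, data


def send_generation_request(jsonl_path, **kwargs):
    """Upload the JSONL file and trigger generation (needs a running API)."""
    import requests  # third-party; imported lazily

    url, data = build_generation_request(**kwargs)
    with open(jsonl_path, "rb") as fh:
        # The file is sent as multipart form data under the "file" field.
        return requests.post(url, data=data, files={"file": fh}, timeout=600)


url, data = build_generation_request(limit=3)
print(url, data)
```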
The generated file will contain the following fields:
- resource_id_source: The original ID of the resource from which the query and answer were generated.
- openai_query: The question/query generated by the OpenAI API.
- openai_answer: The corresponding answer generated by the OpenAI API.
- context: The context used for the query, typically derived from the resource data.
- resources_id_context: The resource IDs associated with the context used.
- local_llm_model: The local LLM model used to generate the response.
- model_prompt: The prompt provided to the local LLM model for response generation.
- response: The actual response generated by the local LLM model.
- tokens_predicted: The number of tokens predicted by the LLM in its response.
- tokens_evaluated: The number of tokens evaluated during the prompt-response cycle.
- prompt_n: Number of tokens used in the prompt.
- prompt_ms: Time taken (in milliseconds) to process the prompt.
- prompt_per_token_ms: Average time taken per token in the prompt (in milliseconds).
- prompt_per_second: Number of tokens processed per second in the prompt.
- predicted_n: Number of tokens in the predicted output.
- predicted_ms: Time taken (in milliseconds) to generate the prediction.
- predicted_per_token_ms: Average time taken per token in the predicted output (in milliseconds).
- predicted_per_second: Number of tokens predicted per second.
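The per-token and per-second fields can be derived from the corresponding token count and elapsed time. The sketch below assumes the usual relationship between these quantities (as in the timings reported by the llama.cpp server); the function name and the sample numbers are illustrative only.

```python
def derive_rates(n_tokens, elapsed_ms):
    """Return (milliseconds per token, tokens per second)."""
    per_token_ms = elapsed_ms / n_tokens
    per_second = 1000.0 * n_tokens / elapsed_ms
    return per_token_ms, per_second


# Illustrative values: a 200-token prompt processed in 500 ms,
# followed by 50 predicted tokens generated in 2000 ms.
prompt_per_token_ms, prompt_per_second = derive_rates(200, 500.0)
predicted_per_token_ms, predicted_per_second = derive_rates(50, 2000.0)

print(prompt_per_token_ms, prompt_per_second)        # 2.5 400.0
print(predicted_per_token_ms, predicted_per_second)  # 40.0 25.0
```

Comparing the prompt and predicted rates shows whether a slow response is dominated by prompt processing (long context) or by token generation.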
@router.post("/evaluate_generation")
The /evaluation/evaluate_generation endpoint evaluates the answers generated by the local LLM (or OpenAI) based on correctness and faithfulness metrics. This process involves comparing the generated answer against the reference answer and the given context.
Parameters:
- file: A CSV file containing the questions, reference answers, and the generated answers by the local model. It is the output of the dataset_generation endpoint.
- openai_api_key: The OpenAI API key, used for evaluation.
- openai_model: The OpenAI model to use for the evaluation (default is gpt-4o-mini-2024-07-18).
- max_tokens: The maximum number of tokens OpenAI may generate for each evaluation (default is 400).
- limit: An optional limit to process a specific number of entries from the CSV file.
- query_column: The column in the CSV containing the queries or questions.
- reference_answer_column: The column in the CSV containing the reference answers generated by OpenAI.
- generated_answer_column: The column in the CSV containing the answers generated by the local model.
- resource_id_column: The column in the CSV containing the resource IDs associated with the queries and answers.
- contexts_column: The column that contains the context related to the queries.
- correctness_threshold: The minimum threshold that a generated answer must meet to be considered correct.
- process: Indicates whether the responses were generated by OpenAI or by the local model and affects the output file naming.
- search_text_boost: Boost value for the text-based search, similar to the retrieval evaluation.
- search_embedding_boost: Boost value for the embedding-based search.
- k: The number of documents returned from the search.
- clearml_track_experiment: Indicates whether the results should be tracked and saved in ClearML.
- clearml_experiment_name: The experiment name in ClearML (default is "Generation evaluation").
- clearml_project_name: The project name in ClearML.
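As an illustration of how correctness_threshold could be applied to the 1 to 5 correctness scores, consider the sketch below. The threshold value of 4.0, the inclusive comparison, and the helper name are assumptions for this example and are not confirmed by the endpoint's implementation.

```python
def passes_threshold(score, correctness_threshold=4.0):
    """Treat an answer as correct when its score reaches the threshold."""
    return score >= correctness_threshold


# Hypothetical correctness scores for four generated answers.
scores = [5, 4, 3.5, 2]
passing = [s for s in scores if passes_threshold(s)]
pass_rate = len(passing) / len(scores)

print(passing, pass_rate)
```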
The evaluations are saved in separate files in the app/data directory, one for each metric:
- Correctness file: Saves the correctness score and the reasoning behind it.
- Faithfulness file: Saves the evaluations of relevancy, accuracy, and conciseness.
Upon completion of the evaluation, the system returns a set of metrics summarizing the quality of the generated responses. If ClearML tracking is enabled, these metrics will also be uploaded to the ClearML dashboard.