initial persona data gen 2 commit (#489)
* initial persona data gen 2 commit

* removed input persona filed moved to hf

* updated the readme
fabrahman authored Jan 6, 2025
1 parent c0dcdaf commit 5eb8cfe
Showing 6 changed files with 805 additions and 0 deletions.
51 changes: 51 additions & 0 deletions scripts/persona_driven_data_gen/README.md
@@ -0,0 +1,51 @@
## Persona-driven Data Generation


To start, make sure you have your OpenAI and Anthropic API keys and have installed the libraries listed in `requirements.txt`:

```
pip install -r requirements.txt
```

This folder contains code to synthetically generate data (both prompts and responses) for a target skill using a [persona-driven approach](https://arxiv.org/pdf/2406.20094):


**1- Precise Instruction Following:**

```
# Generate Instruction Following prompts
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following
# Generate Responses for generated prompts
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution
# Rewrite prompts to form Rejected Response (used for Persona-IF DPO data)
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template rewrite_if_prompt
```
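
As a rough sketch of how constraints flow through these commands (mirroring the logic in `persona_driven_generate_ifdata.py`): prompt generation samples one to three constraint categories plus a single few-shot example, and the rewrite step skips any prompt record that carries fewer than two constraints. The small `CONSTRAINTS` table and the helper names `sample_constraints` and `eligible_for_rewrite` are hypothetical stand-ins for illustration; the script itself loads `data/if_constraint_fewshots_handwritten.json` and inlines this logic.

```python
import random

# Hypothetical stand-in for data/if_constraint_fewshots_handwritten.json:
# constraint category -> list of few-shot example prompts.
CONSTRAINTS = {
    "punctuation:use no comma": ["example prompt A"],
    "format:title": ["example prompt B"],
    "response language": ["example prompt C", "example prompt D"],
    "specific ending": ["example prompt E"],
}


def sample_constraints(constraints, rng):
    """Sample 1-3 constraint categories and one few-shot example from them."""
    names = rng.sample(list(constraints.keys()), rng.choice([1, 2, 3]))
    few_shot = rng.choice(constraints[rng.choice(names)])
    return names, few_shot


def eligible_for_rewrite(record):
    """The rewrite step only handles prompts with two or more constraints."""
    return len(record.get("constraints", [])) >= 2


rng = random.Random(0)  # seeded only to make the sketch repeatable
names, few_shot = sample_constraints(CONSTRAINTS, rng)
print(names, few_shot)
```

Because of the two-constraint filter, the rewrite command can produce fewer records than the number of prompts it reads.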


**2- Math Word Problems**
```
# Generate math word problems
python persona_driven_generate_math_code.py --model "gpt-4o" --end_index 1000 --output_path <MATH_PROBLEMS> --openai_key XXX --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template math
# Generate math solutions for generated math problems
python persona_driven_generate_math_code.py --model "gpt-4o" --end_index 1000 --output_path <OUTPUT_MATH> --openai_key XXX --org_id YYY --dataset <MATH_PROBLEMS> --template math_solution
```
Note that you can change `--template` to any of `['grade_math', 'math_int_algebra']` to generate other types of math data.



**3- Code (python)**
```
# Generate python problems
python persona_driven_generate_math_code.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path <PYTHON_PROBLEMS> --openai_key XXX --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template code
# Generate python code
python persona_driven_generate_math_code.py --org_name anthropic --model 'claude-3-5-sonnet-20240620' --start_index 0 --end_index 1000 --output_path <OUTPUT_CODE> --openai_key XXX --org_id YYY --dataset <PYTHON_PROBLEMS> --template code_solution
```
Note that we used `claude-3-5-sonnet-20240620` to generate the Python code.


All generated prompts and solutions will be saved in the `messages` format, ready for supervised finetuning. An example output can be found [here](https://huggingface.co/datasets/ai2-adapt-dev/personahub_math_v5_regen_149960).
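
For orientation, here is a minimal sketch of one output record in the `messages` format. The field names follow what `persona_driven_generate_ifdata.py` writes for `--template instruction_following_solution`; the helper `to_messages_record` and all example values are hypothetical, and the math/code script may emit a slightly different set of fields.

```python
import json


def to_messages_record(record_id, prompt, response, persona, constraints=None):
    """Build one record shaped like the solution output of the IF script."""
    return {
        "id": record_id,
        "prompt": prompt,
        "input_persona": persona,
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ],
        "constraints": constraints or [],
    }


# Hypothetical values, for illustration only
record = to_messages_record(
    record_id="personahub_example",
    prompt="List three herbs for a balcony garden. Use no commas.",
    response="Basil and mint and thyme",
    persona="A hobby gardener living in a small apartment",
    constraints=["punctuation:use no comma"],
)
line = json.dumps(record, ensure_ascii=False)  # one line of the .jsonl output
roles = [m["role"] for m in json.loads(line)["messages"]]
print(roles)
```

Each line of the output file is one such JSON object, so the file can be read back line by line with `json.loads`.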
@@ -0,0 +1,85 @@
{
"punctuation:use no comma": [
"Write two words that can be added to this series: Apple, Bat, Cat, Dog. Your answer should not contain commas."
],
"format:number of highlighted sections": [
"I am hosting a family party for Halloween and need 10 ideas to decorate my house. Please include and highlight more than 3 ideas specifically for my yard decoration.",
"I want to start a blog about gardening and veggies. This will be a full-fledged media empire in a few years. Can you give me some pointers on how to succeed on the internet in this domain? Have at least 2 bold text sections, such as: *bold text 1*, *bold text 2*, etc. Repeat your response twice."
],
"length constraints:number of words": [
"List some of the most popular fantasy novels from the last decade. I want a short answer, not more than 100 words. Please do not include works by Brandon Sanderson.",
"Help me write proposal for a new business plan that investigates how AI tools effect on human creativity:\n1. Include a lot of commas in your response\n2. Your response should contain no more than 200 words."
],
"content:number of placeholders": [
"I need to lose some weight. Make a weekly food plan for me that is healthy and affordable. I would also like the food plan to be highly customizable to my taste, so make sure to include at least 5 placeholders for customizable food options, such as [low-calorie snack], use all capital letters in your response."
],
"repeat the prompt": [
"What would a fish say about an aquarium? Imagine you are a fish that likes the sea and answer that question.\n\nBut first, repeat the request above verbatim.",
"Please include the following question in your response without changing anything except changing ? to !\n\nWhat are some lethal risks one could face when practicing the scuba diving? What are some effective ways to reduce them?"
],
"format:title": [
"Write the lyrics for a 3-minute-long song that express the cruelty of war and people's longing for peace. Include the title of the song wrapped in double angular brackets, i.e. <<title>> and repeat the title twice."
],
"case:in english and lowercase": [
"Give me some startup ideas. I want to work on GenAI, LLMs, Blockchain, and ChatGPT. Output at least three paragraphs, all in lowercase."
],
"format:number of bullet lists": [
"Suggest a couple of names for my cat. He is orange and kind of chubby. Don't say Garfield. Give me exactly two suggestions as a bulleted list, like\n* [suggestion 1]\n* [suggestion 2]\n\nDon't use letter b in any of your suggestion.",
"Give me some evidence why global warming is a fraud. Answer with at least 4 piecies of evidence in a bullet list.\nAn examle of the bullet list is as below:\n* E1\n* E2"
],
"in english and capital": [
"Come up with a slogan for my presidential campaign, highlighting the fact I am in a minor party that is much better than Democrats and Republicans. Make sure to the slogan is in English and all capital letters."
],
"format:number of sections": [
"Write a long bedtime story for my kids. The story should have 7 Sections, each marked with Section N, so that I can read one to them everyday for a week. Repeat your response in French language as well. Make sure to include at least 5 sentences in each section."
],
"case: frequency of capital words": [
"Write an ad on a newspaper for an amusement part. Please use a dramastic tone with excitement, but the number of words in all capital letters must be less than 3."
],
"use quotation": [
"Give me some suggestions for ways I can improve my chocolate chip cookies, if they're always coming out a little crunchy. Put your entire response in double quotation marks.",
"Use double quotation marks to wrap around your entire answer.\n\nWhat would a innocent person would say after he falsefully recieved a death sentence from the court?"
],
"include keywords": [
"Try explaining to me what causes heartburn and what are the ways to alleviate that. Your response should contain the keywords: 'cheese', 'exam', and 'honeycomb'. Outputs 2 paragraphs separated by ^^^^."
],
"format:use json format": [
"What are the middle 10 states by population in the United States? No matter how you answer, make sure your entire output is valid JSON.",
"Rewrite following the description of Mike into a json format putting each sentence in a separate field names 'sentence_1', 'sentence_2', ...:\nMike is a 15-year-old teenager with a height of 6 feet. He has a long curly hair, green eyes, and a big flat nose. He is good at ice hockey and drawing. His grade on math also never drops below 99 out of 100 points. However, he has terrible eyesight and has to sleep at least 14 hours per day, otherwise he would never talk that day."
],
"length constraints:number of paragraphs": [
"Give me a biography of Nikola Tesla in exactly 4 paragraphs separated by ---, each paragraph should have a title, wrapped in double angular brackets, i.e. <<title>>."
],
" give two responses": [
"Give me two recipes for chocolate chip cookies, one gluten free in Vietnamese. Separate the two recipes like so:\nRecipe 1\n******\nRecipe 2"
],
"response language": [
"Write me a 1 paragraph summary of the movie Ratatouille. Your response should be entirely in French. Make sure to include the word \"rat\" at least 3 times."
],
"keywords:letter frequency": [
"Tell me about the history of the Seattle public transit system in 10 sentences. Each sentence should contain the letter 'e' at least 7 times. End your response with the phrase 'That's all folks!'"
],
"specific ending": [
"Write a scary story, where friends of the protagonist get mysteriously killed one after another during a day. End the story with the exact sentence: \"She cried and chocked herself.\""
],
"keywords:exclude words": [
"Give me five reasons why I should buy an electric car over a gas car? Do not mention the words \"cheap\" or \"fast\"."
],
"keywords:frequency": [
"Write a short love poem that include the word \"moon\" at least five times, word \"garden\" at least 3 times and ends with exact sentence: \"I will stare on the road till then\"."
],
"length constraints:number of sentences": [
"Explain the concept of model overfitting to a 3rd grader with no more than 3 sentences, also use all lower case letters. Add a postscript starting with P.S. at the end of your response."
],
"content:include a postscript": [
"Please write me a letter I could send to my friend that moved overseas a few years ago, but that I talk to every day on the phone. Make sure to include the word \"cat\" in the last sentence. At the end, add a postscript starting with P.P.S."
],
"length constraints:first word of the nth paragraph": [
"Campus safety is a big concern given the increasing gun violence. Help me write a 5 paragraph letter to the head of the school to persuade her to devote more budget into improving campus safety. Make sure to start the second paragraph with the word \"however\" and end your paragraph with word \"behaviour\"."
],
"format:choose one from options": [
"How much does a PhD candidate sleep a day on average? Choose from the following: ('6 hours', '6.6 hours', 'it depends') -- please include the exact phrase in your response.",
"What is the best way to cook a steak? Choose from the following: ('grill', 'oven', 'microwave') -- please include the exact phrase in your response.",
"Given the 7 cannonical colors of the rainbow, ROYGBIV, the middle color is green. Your answer must contain one of the following exact phrases: ‚\"yes, 100%\", \"No, no way\", \"not sure\""
]
}
163 changes: 163 additions & 0 deletions scripts/persona_driven_data_gen/persona_driven_generate_ifdata.py
@@ -0,0 +1,163 @@
"""
This code is partially borrowed and adapted from: https://github.com/tencent-ailab/persona-hub
Example commands:
# generate 20 if prompts:
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 20 --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following
# generate 20 IF responses
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 20 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution
"""

import argparse
import json
import random
import string

import openai
# from openai import OpenAI
from datasets import Dataset, load_dataset
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff
from tqdm import tqdm

from prompt_templates import instruction_template, knowledge_template, npc_template, math_template, math_solution_template, instruction_following, instruction_following_solution, rewrite_if_prompt


# Retry failed API calls with random exponential backoff (waits capped at 60 s),
# giving up after 6 attempts.
# Note: openai.ChatCompletion is the pre-1.0 interface of the openai package.
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)


system_prompt = '''You are a helpful mathematician assistant.'''
# client = OpenAI() # set up your config/env/api for calling openai models
MODEL_COSTS = {"gpt-4": [0.00003, 0.00006], "gpt-3.5-turbo": [0.0000015, 0.000002], "gpt-4-1106-preview": [0.00001, 0.00003], 'gpt-4o': [0.000005, 0.000015]}


CONSTRAINTS = json.load(open("./data/if_constraint_fewshots_handwritten.json"))

def get_response(args, user_prompt):
    # completion = client.chat.completions.create(
    completion = completion_with_backoff(
        model=args.model,
        temperature=0.7,
        top_p=0.95,
        messages=[
            {"role": "system", "content": f"{system_prompt}"},
            {"role": "user", "content": f"{user_prompt}"}
        ]
    )
    total_input_tokens = completion.usage.prompt_tokens
    total_output_tokens = completion.usage.completion_tokens
    return completion.choices[0].message.content, total_input_tokens, total_output_tokens

process = lambda x: x.replace("Math problem:\n", "").replace("User instruction:\n", "").replace("User instruction:", "").lstrip("\n")

def main(args):
    # Load the appropriate template
    if args.template == "instruction":
        template = instruction_template
    elif args.template == "knowledge":
        template = knowledge_template
    elif args.template == "npc":
        template = npc_template
    elif args.template == "math":
        template = math_template
    elif args.template == "math_solution":
        template = math_solution_template
    elif args.template == "instruction_following":
        template = instruction_following
    elif args.template == "instruction_following_solution":
        template = instruction_following_solution
    elif args.template == "rewrite_if_prompt":
        template = rewrite_if_prompt
    else:
        raise ValueError(
            "Invalid template type. Choose from 'instruction', 'knowledge', 'npc', 'math', 'math_solution', "
            "'instruction_following', 'instruction_following_solution', or 'rewrite_if_prompt'."
        )

    total_input_tokens, total_output_tokens = 0, 0
    in_cost_per_token, out_cost_per_token = MODEL_COSTS[args.model]

    # Load the dataset
    if args.dataset.endswith(".jsonl"):
        persona_dataset = load_dataset("json", data_files=args.dataset)['train']
    else:
        persona_dataset = load_dataset(args.dataset)['train']

    if args.sanity_check > 0:
        persona_dataset = persona_dataset.select(range(0, args.sanity_check))
    print(f"Total number of input personas: {len(persona_dataset)}")

    # Fall back to the end of the dataset when --end_index is not given
    # (range() would otherwise fail on the default value of None)
    end_index = len(persona_dataset) if args.end_index is None else args.end_index

    input_field = 'synthesized text' if args.template in ["instruction_following_solution", "rewrite_if_prompt"] else 'persona'
    with open(args.output_path, "w") as out:
        for idx, example in enumerate(tqdm(persona_dataset.select(range(args.start_index, end_index)))):
            if args.template == "rewrite_if_prompt":
                candid_constraints = example['constraints']
                few_shot = ""
                # only rewrite prompts for examples with 2 or more constraints
                if len(candid_constraints) < 2:
                    continue
            else:
                # randomly sample 1-3 constraints from the IFEval constraints list
                candid_constraints = random.sample(list(CONSTRAINTS.keys()), random.choice([1,2,3]))
                few_shot = random.choice(CONSTRAINTS[random.choice(candid_constraints)])

            id = "personahub_" +''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(24))
            input_text = example[input_field].strip()
            persona = example['input persona'] if args.template in ["instruction_following_solution", "rewrite_if_prompt"] else input_text
            user_prompt = template.format(persona=input_text, example=few_shot, constraints=", ".join(candid_constraints), category=", ".join([ex.split(':')[0] for ex in candid_constraints]))
            gpt4o_out_text, in_tokens, out_tokens = get_response(args, user_prompt)
            # Solutions are written in the `messages` format; prompts keep the intermediate fields
            o = {
                "id": id,
                "prompt": input_text,
                "input_persona": persona,
                "messages": [
                    {"role": "user", "content": input_text},
                    {"role": "assistant", "content": gpt4o_out_text}
                ],
                "constraints": example.get('constraints', [])
            } if args.template in ["instruction_following_solution"] else {
                'input persona': persona,
                'synthesized text': process(gpt4o_out_text),
                'constraints': candid_constraints if args.template in ["instruction_following", "rewrite_if_prompt"] else [],
                'description': f'{args.template} problem'
            }
            out.write(json.dumps(o, ensure_ascii=False) + '\n')
            # breakpoint()

            total_input_tokens += in_tokens
            total_output_tokens += out_tokens
            if idx % 20 == 0:
                # report the running total, not just the cost of the latest call
                print(f"estimated cost so far= ${in_cost_per_token * total_input_tokens + out_cost_per_token * total_output_tokens}")

    print(f"Outputted the results to: {args.output_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Synthesize text using a specified model and template.")
    parser.add_argument(
        '--template',
        type=str,
        required=True,
        choices=['math', 'math_solution', 'instruction_following', 'instruction_following_solution', 'rewrite_if_prompt'],
        help=(
            "Prompt templates. Choose from 'math', 'math_solution', 'instruction_following', "
            "'instruction_following_solution' or 'rewrite_if_prompt'. "
            "You can also add more customized templates in prompt_templates.py"
        )
    )
    parser.add_argument("--dataset", required=False, default="proj-persona/PersonaHub")
    parser.add_argument("--openai_key", required=True)
    parser.add_argument("--org_id", required=True)
    parser.add_argument("--model", default="gpt-4o", choices=["gpt-4", "gpt-3.5-turbo", "gpt-4-1106-preview", "gpt-4o"])
    parser.add_argument('--output_path', type=str, required=True, help='Path to the output file.')
    parser.add_argument('--chat_format', type=str, required=False, help='whether to put in chat format')
    parser.add_argument("--start_index", type=int, default=0)
    parser.add_argument("--end_index", type=int, default=None)
    parser.add_argument("--sanity_check", type=int, default=0)

    args = parser.parse_args()
    openai.api_key = args.openai_key
    openai.organization = args.org_id

    main(args)