initial persona data gen 2 commit (#489)

* initial persona data gen 2 commit
* removed input persona filed moved to hf
* updated the readme
allenai · Jan 6, 2025 · 5eb8cfe
1 parent c0dcdaf
commit 5eb8cfe
Showing 6 changed files with 805 additions and 0 deletions.
diff --git a/scripts/persona_driven_data_gen/README.md b/scripts/persona_driven_data_gen/README.md
@@ -0,0 +1,51 @@
+## Persona-driven Data Generation
+
+
+To start, make sure you have your OpenAI and Anthropic API keys and have installed the libraries listed in `requirements.txt`:
+
+```
+pip install -r requirements.txt
+```
+
+This folder contains code to synthetically generate data (both prompts and responses) for a target skill using a [persona-driven approach](https://arxiv.org/pdf/2406.20094):
+
+
+**1- Precise Instruction Following:**
+
+```
+# Generate Instruction Following prompts
+python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following
+
+# Generate Responses for generated prompts
+python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution
+
+# Rewrite prompts to form Rejected Response (used for Persona-IF DPO data)
+python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template rewrite_if_prompt
+```
+
+
+**2- Math Word Problems**
+```
+# Generate math word problems
+python persona_driven_generate_math_code.py --model "gpt-4o" --end_index 1000 --output_path <MATH_PROBLEMS> --openai_key XXX --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template math
+
+# Generate math solutions for generated math problems
+python persona_driven_generate_math_code.py --model "gpt-4o" --end_index 1000 --output_path <OUTPUT_MATH> --openai_key XXX --org_id YYY --dataset <MATH_PROBLEMS> --template math_solution 
+```
+Note that you can change `--template` to any of `['grade_math', 'math_int_algebra']` to generate other types of math data.
+
+
+
+**3- Code (python)**
+```
+# Generate python problems
+
+python persona_driven_generate_math_code.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path <PYTHON_PROBLEMS> --openai_key XXX --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template code 
+
+# Generate python code
+python persona_driven_generate_math_code.py --org_name anthropic --model 'claude-3-5-sonnet-20240620' --start_index 0 --end_index 1000 --output_path <OUTPUT_CODE> --openai_key XXX --org_id YYY --dataset <PYTHON_PROBLEMS> --template code_solution 
+```
+Note that we used `claude-3-5-sonnet-20240620` to generate the Python code solutions.
+
+
+All generated prompts and solutions will be saved in the `messages` format, ready for supervised fine-tuning. An example output can be found [here](https://huggingface.co/datasets/ai2-adapt-dev/personahub_math_v5_regen_149960).
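
Reviewer note: for readers unfamiliar with the `messages` format mentioned in the README, the following is a minimal sketch of one output record and a check on its shape. The keys follow the dictionary that `persona_driven_generate_ifdata.py` writes for the `instruction_following_solution` template; the record contents and the helper `is_valid_record` are hypothetical and only illustrate the layout.

```python
import json

# Hypothetical record: the keys mirror those written by the solution template,
# the values are made up for illustration.
line = json.dumps({
    "id": "personahub_example",
    "prompt": "Write a haiku about tea. Use no commas.",
    "input_persona": "A tea sommelier",
    "messages": [
        {"role": "user", "content": "Write a haiku about tea. Use no commas."},
        {"role": "assistant", "content": "Steam curls from the cup"},
    ],
    "constraints": ["punctuation:use no comma"],
})


def is_valid_record(record):
    """Return True if the record holds one user turn followed by one assistant turn."""
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    has_content = all(isinstance(m.get("content"), str) for m in messages)
    return roles == ["user", "assistant"] and has_content


record = json.loads(line)
print(is_valid_record(record))
```

A check like this can be run over each line of a generated `.jsonl` file before passing it to a fine-tuning pipeline.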
diff --git a/scripts/persona_driven_data_gen/data/if_constraint_fewshots_handwritten.json b/scripts/persona_driven_data_gen/data/if_constraint_fewshots_handwritten.json
@@ -0,0 +1,85 @@
+{
+     "punctuation:use no comma": [
+          "Write two words that can be added to this series: Apple, Bat, Cat, Dog. Your answer should not contain commas."
+     ],
+     "format:number of highlighted sections": [
+          "I am hosting a family party for Halloween and need 10 ideas to decorate my house. Please include and highlight more than 3 ideas specifically for my yard decoration.",
+          "I want to start a blog about gardening and veggies. This will be a full-fledged media empire in a few years. Can you give me some pointers on how to succeed on the internet in this domain? Have at least 2 bold text sections, such as: *bold text 1*, *bold text 2*, etc. Repeat your response twice."
+     ],
+     "length constraints:number of words": [
+          "List some of the most popular fantasy novels from the last decade. I want a short answer, not more than 100 words. Please do not include works by Brandon Sanderson.",
+          "Help me write proposal for a new business plan that investigates how AI tools effect on human creativity:\n1. Include a lot of commas in your response\n2. Your response should contain no more than 200 words."
+     ],
+     "content:number of placeholders": [
+          "I need to lose some weight. Make a weekly food plan for me that is healthy and affordable. I would also like the food plan to be highly customizable to my taste, so make sure to include at least 5 placeholders for customizable food options, such as [low-calorie snack], use all capital letters in your response."
+     ],
+     "repeat the prompt": [
+          "What would a fish say about an aquarium? Imagine you are a fish that likes the sea and answer that question.\n\nBut first, repeat the request above verbatim.",
+          "Please include the following question in your response without changing anything except changing ? to !\n\nWhat are some lethal risks one could face when practicing the scuba diving? What are some effective ways to reduce them?"
+     ],
+     "format:title": [
+          "Write the lyrics for a 3-minute-long song that express the cruelty of war and people's longing for peace. Include the title of the song wrapped in double angular brackets, i.e. <<title>> and repeat the title twice."
+     ],
+     "case:in english and lowercase": [
+          "Give me some startup ideas. I want to work on GenAI, LLMs, Blockchain, and ChatGPT. Output at least three paragraphs, all in lowercase."
+     ],
+     "format:number of bullet lists": [
+          "Suggest a couple of names for my cat. He is orange and kind of chubby. Don't say Garfield. Give me exactly two suggestions as a bulleted list, like\n* [suggestion 1]\n* [suggestion 2]\n\nDon't use letter b in any of your suggestion.",
+          "Give me some evidence why global warming is a fraud. Answer with at least 4 pieces of evidence in a bullet list.\nAn example of the bullet list is as below:\n* E1\n* E2"
+     ],
+     "in english and capital": [
+          "Come up with a slogan for my presidential campaign, highlighting the fact I am in a minor party that is much better than Democrats and Republicans. Make sure to the slogan is in English and all capital letters."
+     ],
+     "format:number of sections": [
+          "Write a long bedtime story for my kids. The story should have 7 Sections, each marked with Section N, so that I can read one to them everyday for a week. Repeat your response in French language as well. Make sure to include at least 5 sentences in each section."
+     ],
+     "case: frequency of capital words": [
+          "Write an ad in a newspaper for an amusement park. Please use a dramatic tone with excitement, but the number of words in all capital letters must be less than 3."
+     ],
+     "use quotation": [
+          "Give me some suggestions for ways I can improve my chocolate chip cookies, if they're always coming out a little crunchy. Put your entire response in double quotation marks.",
+          "Use double quotation marks to wrap around your entire answer.\n\nWhat would an innocent person say after he falsely received a death sentence from the court?"
+     ],
+     "include keywords": [
+          "Try explaining to me what causes heartburn and what are the ways to alleviate that. Your response should contain the keywords: 'cheese', 'exam', and 'honeycomb'. Outputs 2 paragraphs separated by ^^^^."
+     ],
+     "format:use json format": [
+          "What are the middle 10 states by population in the United States? No matter how you answer, make sure your entire output is valid JSON.",
+          "Rewrite following the description of Mike into a json format putting each sentence in a separate field names 'sentence_1', 'sentence_2', ...:\nMike is a 15-year-old teenager with a height of 6 feet. He has a long curly hair, green eyes, and a big flat nose. He is good at ice hockey and drawing. His grade on math also never drops below 99 out of 100 points. However, he has terrible eyesight and has to sleep at least 14 hours per day, otherwise he would never talk that day."
+     ],
+     "length constraints:number of paragraphs": [
+          "Give me a biography of Nikola Tesla in exactly 4 paragraphs separated by ---, each paragraph should have a title, wrapped in double angular brackets, i.e. <<title>>."
+     ],
+     " give two responses": [
+          "Give me two recipes for chocolate chip cookies, one gluten free in Vietnamese. Separate the two recipes like so:\nRecipe 1\n******\nRecipe 2"
+     ],
+     "response language": [
+          "Write me a 1 paragraph summary of the movie Ratatouille. Your response should be entirely in French. Make sure to include the word \"rat\" at least 3 times."
+     ],
+     "keywords:letter frequency": [
+          "Tell me about the history of the Seattle public transit system in 10 sentences. Each sentence should contain the letter 'e' at least 7 times. End your response with the phrase 'That's all folks!'"
+     ],
+     "specific ending": [
+          "Write a scary story, where friends of the protagonist get mysteriously killed one after another during a day. End the story with the exact sentence: \"She cried and chocked herself.\""
+     ],
+     "keywords:exclude words": [
+          "Give me five reasons why I should buy an electric car over a gas car? Do not mention the words \"cheap\" or \"fast\"."
+     ],
+     "keywords:frequency": [
+          "Write a short love poem that include the word \"moon\" at least five times, word \"garden\" at least 3 times and ends with exact sentence: \"I will stare on the road till then\"."
+     ],
+     "length constraints:number of sentences": [
+          "Explain the concept of model overfitting to a 3rd grader with no more than 3 sentences, also use all lower case letters. Add a postscript starting with P.S. at the end of your response."
+     ],
+     "content:include a postscript": [
+          "Please write me a letter I could send to my friend that moved overseas a few years ago, but that I talk to every day on the phone. Make sure to include the word \"cat\" in the last sentence. At the end, add a postscript starting with P.P.S."
+     ],
+     "length constraints:first word of the nth paragraph": [
+          "Campus safety is a big concern given the increasing gun violence. Help me write a 5 paragraph letter to the head of the school to persuade her to devote more budget into improving campus safety. Make sure to start the second paragraph with the word \"however\" and end your paragraph with word \"behaviour\"."
+     ],
+     "format:choose one from options": [
+          "How much does a PhD candidate sleep a day on average? Choose from the following: ('6 hours', '6.6 hours', 'it depends') -- please include the exact phrase in your response.",
+          "What is the best way to cook a steak? Choose from the following: ('grill', 'oven', 'microwave') -- please include the exact phrase in your response.",
+          "Given the 7 canonical colors of the rainbow, ROYGBIV, the middle color is green. Your answer must contain one of the following exact phrases: \"yes, 100%\", \"No, no way\", \"not sure\""
+     ]
+}
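
Reviewer note: the prompt-generation script in this commit draws one to three constraint names from this file, picks one few-shot example belonging to one of the drawn names, and derives each category from the text before the colon. The following is a minimal sketch of that sampling step, assuming a three-entry excerpt of the file; the function name `sample_constraints` and the seeded `random.Random` instance are hypothetical additions for reproducibility.

```python
import random

# Three entries copied from the few-shot file; the script loads the full JSON.
CONSTRAINTS = {
    "punctuation:use no comma": [
        "Write two words that can be added to this series: Apple, Bat, Cat, Dog. "
        "Your answer should not contain commas."
    ],
    "case:in english and lowercase": [
        "Give me some startup ideas. I want to work on GenAI, LLMs, Blockchain, "
        "and ChatGPT. Output at least three paragraphs, all in lowercase."
    ],
    "keywords:exclude words": [
        "Give me five reasons why I should buy an electric car over a gas car? "
        "Do not mention the words \"cheap\" or \"fast\"."
    ],
}


def sample_constraints(constraints, rng):
    """Draw 1-3 constraint names, one few-shot example, and the categories."""
    k = rng.choice([1, 2, 3])
    names = rng.sample(list(constraints.keys()), k)
    # The few-shot example comes from one randomly chosen sampled constraint.
    few_shot = rng.choice(constraints[rng.choice(names)])
    # The category is the part of the name before the first colon.
    categories = [name.split(":")[0] for name in names]
    return names, few_shot, categories


rng = random.Random(0)  # seeded only so repeated runs give the same draw
names, few_shot, categories = sample_constraints(CONSTRAINTS, rng)
print(names, categories)
```

Keys without a colon, such as `"repeat the prompt"`, yield the whole key as their category under this split.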
diff --git a/scripts/persona_driven_data_gen/persona_driven_generate_ifdata.py b/scripts/persona_driven_data_gen/persona_driven_generate_ifdata.py
@@ -0,0 +1,163 @@
+"""
+This code is partially borrowed and adapted from: https://github.com/tencent-ailab/persona-hub
+
+Example commands:
+# generate 20 if prompts: 
+python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 20 --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following
+
+# generate 20 IF responses
+python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 20 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution
+"""
+
+import argparse
+import json
+# from openai import OpenAI
+from prompt_templates import instruction_template, knowledge_template, npc_template, math_template, math_solution_template, instruction_following, instruction_following_solution, rewrite_if_prompt
+from datasets import load_dataset
+from tqdm import tqdm
+import random
+import string
+import openai
+from datasets import Dataset
+
+from tenacity import (
+    retry,
+    stop_after_attempt,
+    wait_random_exponential,
+)  # for exponential backoff
+
+@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
+def completion_with_backoff(**kwargs):
+    return openai.ChatCompletion.create(**kwargs)
+
+
+system_prompt = '''You are a helpful mathematician assistant.'''
+# client = OpenAI()   # set up your config/env/api for calling openai models
+MODEL_COSTS = {"gpt-4": [0.00003, 0.00006], "gpt-3.5-turbo": [0.0000015, 0.000002], "gpt-4-1106-preview": [0.00001, 0.00003], 'gpt-4o': [0.000005, 0.000015]}
+
+
+# Load the handwritten few-shot examples, closing the file handle afterwards
+with open("./data/if_constraint_fewshots_handwritten.json") as f:
+    CONSTRAINTS = json.load(f)
+
+def get_response(args, user_prompt):
+    # completion = client.chat.completions.create(
+    completion = completion_with_backoff(
+        model=args.model,
+        temperature=0.7,
+        top_p=0.95,
+        messages=[
+            {"role": "system", "content":  f"{system_prompt}"},
+            {"role": "user", "content": f"{user_prompt}"}
+        ]
+    )
+    total_input_tokens = completion.usage.prompt_tokens
+    total_output_tokens = completion.usage.completion_tokens
+    return completion.choices[0].message.content, total_input_tokens, total_output_tokens
+
+process = lambda x: x.replace("Math problem:\n", "").replace("User instruction:\n", "").replace("User instruction:", "").lstrip("\n")
+
+def main(args):
+    # Load the appropriate template
+    if args.template == "instruction":
+        template = instruction_template
+    elif args.template == "knowledge":
+        template = knowledge_template
+    elif args.template == "npc":
+        template = npc_template
+    elif args.template == "math":
+        template = math_template
+    elif args.template == "math_solution":
+        template = math_solution_template
+    elif args.template == "instruction_following":
+        template = instruction_following
+    elif args.template == "instruction_following_solution":
+        template = instruction_following_solution
+    elif args.template == "rewrite_if_prompt":
+        template = rewrite_if_prompt
+    else:
+        raise ValueError("Invalid template type. Choose from 'instruction', 'knowledge', 'npc', 'math', 'math_solution', 'instruction_following', 'instruction_following_solution', or 'rewrite_if_prompt'.")
+
+    total_input_tokens, total_output_tokens = 0, 0
+    in_cost_per_token, out_cost_per_token = MODEL_COSTS[args.model]
+
+    # Load the dataset
+    if args.dataset.endswith(".jsonl"):
+        persona_dataset = load_dataset("json", data_files=args.dataset)['train']
+    else:
+        persona_dataset = load_dataset(args.dataset)['train']
+
+    if args.sanity_check > 0:
+        persona_dataset = persona_dataset.select(range(0, args.sanity_check))
+    print(f"Total number of input personas: {len(persona_dataset)}")
+
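+    input_field = 'synthesized text' if args.template in ["instruction_following_solution", "rewrite_if_prompt"] else 'persona'
+    # Default to the end of the dataset when --end_index is not given (range() rejects None)
+    end_index = len(persona_dataset) if args.end_index is None else args.end_index
+    with open(args.output_path, "w") as out:
+        for idx, example in enumerate(tqdm(persona_dataset.select(range(args.start_index, end_index)))):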
+            if args.template == "rewrite_if_prompt":
+                candid_constraints = example['constraints']
+                few_shot = ""
+                # only rewrite prompts for examples with 2 or more constraints
+                if len(candid_constraints) < 2:
+                    continue
+            else:
+                # randomly sample 1-3 constraints from the IFEval constraints list
+                candid_constraints = random.sample(list(CONSTRAINTS.keys()), random.choice([1,2,3]))
+                few_shot = random.choice(CONSTRAINTS[random.choice(candid_constraints)])
+
+            id = "personahub_" +''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(24))
+            input_text = example[input_field].strip()
+            persona = example['input persona'] if args.template in ["instruction_following_solution", "rewrite_if_prompt"] else input_text
+            user_prompt = template.format(persona=input_text, example=few_shot, constraints=", ".join(candid_constraints), category=", ".join([ex.split(':')[0] for ex in candid_constraints]))
+            gpt4o_out_text, in_tokens, out_tokens = get_response(args, user_prompt)
+            o = {
+                "id": id, 
+                "prompt": input_text, 
+                "input_persona": persona, 
+                "messages": [
+                    {"role": "user", "content": input_text}, 
+                    {"role": "assistant", "content": gpt4o_out_text}
+                    ],
+                "constraints": example.get('constraints', [])
+                } if args.template in ["instruction_following_solution"] else {
+                    'input persona': persona,
+                    'synthesized text': process(gpt4o_out_text),
+                    'constraints': candid_constraints if args.template in ["instruction_following", "rewrite_if_prompt"] else [],
+                    'description': f'{args.template} problem'
+                }
+            out.write(json.dumps(o, ensure_ascii=False) + '\n')
+            # breakpoint()
+
+            total_input_tokens += in_tokens
+            total_output_tokens += out_tokens
+            if idx % 20 == 0:
+                # Report the cumulative cost, not only the cost of the latest call
+                print(f"estimated cost so far= ${in_cost_per_token * total_input_tokens + out_cost_per_token * total_output_tokens}")
+
+    print(f"Outputted the results to: {args.output_path}")
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Synthesize text using a specified model and template.")
+    parser.add_argument(
+        '--template', 
+        type=str, 
+        required=True, 
+        choices=['math', 'math_solution', 'instruction_following', 'instruction_following_solution', 'rewrite_if_prompt'], 
+        help=(
+            "Prompt templates. Choose from 'math', 'math_solution', 'instruction_following', 'instruction_following_solution' or 'rewrite_if_prompt'. "
+            "You can also add more customized templates in prompt_templates.py"
+        )
+    )
+    parser.add_argument("--dataset", required=False, default="proj-persona/PersonaHub")
+    parser.add_argument("--openai_key", required=True)
+    parser.add_argument("--org_id", required=True)
+    parser.add_argument("--model", default="gpt-4o", choices=["gpt-4", "gpt-3.5-turbo", "gpt-4-1106-preview", "gpt-4o"])
+    parser.add_argument('--output_path', type=str, required=True, help='Path to the output file.')
+    parser.add_argument('--chat_format', type=str, required=False, help='whether to put in chat format')
+    parser.add_argument("--start_index", type=int, default=0)
+    parser.add_argument("--end_index", type=int, default=None)
+    parser.add_argument("--sanity_check", type=int, default=0)
+
+    args = parser.parse_args()
+    openai.api_key = args.openai_key
+    openai.organization = args.org_id
+
+    main(args)
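
Reviewer note: the script reports an estimated cost from token counts and the per-token prices in `MODEL_COSTS`. The following is a minimal sketch of a running estimate computed over accumulated totals. The `gpt-4o` prices are copied from the script; the helper `estimated_cost` and the per-call token counts are hypothetical.

```python
# Per-token (input, output) prices in USD, copied from the script's MODEL_COSTS.
MODEL_COSTS = {"gpt-4o": (0.000005, 0.000015)}


def estimated_cost(model, input_tokens, output_tokens):
    """Return the estimated USD cost for the given token totals."""
    in_price, out_price = MODEL_COSTS[model]
    return in_price * input_tokens + out_price * output_tokens


# Hypothetical (input_tokens, output_tokens) usage for two API calls.
calls = [(1000, 200), (800, 400)]

total_in = total_out = 0
for in_tokens, out_tokens in calls:
    # Accumulate first, then price the totals to get the cost so far.
    total_in += in_tokens
    total_out += out_tokens

print(f"estimated cost so far = ${estimated_cost('gpt-4o', total_in, total_out):.4f}")
```

Pricing the accumulated totals rather than the latest call's counts is what makes the printed figure a cost "so far".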