diff --git a/scripts/persona_driven_data_gen/README.md b/scripts/persona_driven_data_gen/README.md new file mode 100644 index 000000000..2b07ee54d --- /dev/null +++ b/scripts/persona_driven_data_gen/README.md @@ -0,0 +1,51 @@ +## Persona-driven Data Generation + + +To start, make sure you have your OpenAI and Anthropic API keys and have installed the libraries listed in `requirements.txt`: + +``` +pip install -r requirements.txt +``` + +This folder contains code to synthetically generate data (both prompts and responses) for a target skill using a [persona-driven approach](https://arxiv.org/pdf/2406.20094): + + +**1- Precise Instruction Following:** + +``` +# Generate Instruction Following prompts +python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following + +# Generate Responses for generated prompts +python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution + +# Rewrite prompts to form Rejected Response (used for Persona-IF DPO data) +python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template rewrite_if_prompt +``` + + +**2- Math Word Problems** +``` +# Generate math word problems +python persona_driven_generate_math_code.py --model "gpt-4o" --end_index 1000 --output_path --openai_key XXX --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template math + +# Generate math solutions for generated math problems +python persona_driven_generate_math_code.py --model "gpt-4o" --end_index 1000 --output_path --openai_key XXX --org_id YYY --dataset --template math_solution +``` +Note that you can change 
`--template` to any of `['grade_math', 'math_int_algebra']` to generate other types of math data. + + + +**3- Code (Python)** +``` +# Generate Python problems + +python persona_driven_generate_math_code.py --model "gpt-4o" --start_index 0 --end_index 1000 --output_path --openai_key XXX --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template code + +# Generate Python code +python persona_driven_generate_math_code.py --org_name anthropic --model 'claude-3-5-sonnet-20240620' --start_index 0 --end_index 1000 --output_path --openai_key XXX --org_id YYY --dataset --template code_solution +``` +Note that we used `claude-3-5-sonnet-20240620` to generate the Python code. + + +Generated solutions are saved in the `messages` format, ready for supervised fine-tuning; generated prompts are saved with `input persona` and `synthesized text` fields. An example output can be found [here](https://huggingface.co/datasets/ai2-adapt-dev/personahub_math_v5_regen_149960). diff --git a/scripts/persona_driven_data_gen/data/if_constraint_fewshots_handwritten.json b/scripts/persona_driven_data_gen/data/if_constraint_fewshots_handwritten.json new file mode 100644 index 000000000..dec4c76a0 --- /dev/null +++ b/scripts/persona_driven_data_gen/data/if_constraint_fewshots_handwritten.json @@ -0,0 +1,85 @@ +{ + "punctuation:use no comma": [ + "Write two words that can be added to this series: Apple, Bat, Cat, Dog. Your answer should not contain commas." + ], + "format:number of highlighted sections": [ + "I am hosting a family party for Halloween and need 10 ideas to decorate my house. Please include and highlight more than 3 ideas specifically for my yard decoration.", + "I want to start a blog about gardening and veggies. This will be a full-fledged media empire in a few years. Can you give me some pointers on how to succeed on the internet in this domain? Have at least 2 bold text sections, such as: *bold text 1*, *bold text 2*, etc. Repeat your response twice." 
+ ], + "length constraints:number of words": [ + "List some of the most popular fantasy novels from the last decade. I want a short answer, not more than 100 words. Please do not include works by Brandon Sanderson.", + "Help me write proposal for a new business plan that investigates how AI tools effect on human creativity:\n1. Include a lot of commas in your response\n2. Your response should contain no more than 200 words." + ], + "content:number of placeholders": [ + "I need to lose some weight. Make a weekly food plan for me that is healthy and affordable. I would also like the food plan to be highly customizable to my taste, so make sure to include at least 5 placeholders for customizable food options, such as [low-calorie snack], use all capital letters in your response." + ], + "repeat the prompt": [ + "What would a fish say about an aquarium? Imagine you are a fish that likes the sea and answer that question.\n\nBut first, repeat the request above verbatim.", + "Please include the following question in your response without changing anything except changing ? to !\n\nWhat are some lethal risks one could face when practicing the scuba diving? What are some effective ways to reduce them?" + ], + "format:title": [ + "Write the lyrics for a 3-minute-long song that express the cruelty of war and people's longing for peace. Include the title of the song wrapped in double angular brackets, i.e. <<title>> and repeat the title twice." + ], + "case:in english and lowercase": [ + "Give me some startup ideas. I want to work on GenAI, LLMs, Blockchain, and ChatGPT. Output at least three paragraphs, all in lowercase." + ], + "format:number of bullet lists": [ + "Suggest a couple of names for my cat. He is orange and kind of chubby. Don't say Garfield. Give me exactly two suggestions as a bulleted list, like\n* [suggestion 1]\n* [suggestion 2]\n\nDon't use letter b in any of your suggestion.", + "Give me some evidence why global warming is a fraud. 
Answer with at least 4 piecies of evidence in a bullet list.\nAn examle of the bullet list is as below:\n* E1\n* E2" + ], + "in english and capital": [ + "Come up with a slogan for my presidential campaign, highlighting the fact I am in a minor party that is much better than Democrats and Republicans. Make sure to the slogan is in English and all capital letters." + ], + "format:number of sections": [ + "Write a long bedtime story for my kids. The story should have 7 Sections, each marked with Section N, so that I can read one to them everyday for a week. Repeat your response in French language as well. Make sure to include at least 5 sentences in each section." + ], + "case: frequency of capital words": [ + "Write an ad on a newspaper for an amusement part. Please use a dramastic tone with excitement, but the number of words in all capital letters must be less than 3." + ], + "use quotation": [ + "Give me some suggestions for ways I can improve my chocolate chip cookies, if they're always coming out a little crunchy. Put your entire response in double quotation marks.", + "Use double quotation marks to wrap around your entire answer.\n\nWhat would a innocent person would say after he falsefully recieved a death sentence from the court?" + ], + "include keywords": [ + "Try explaining to me what causes heartburn and what are the ways to alleviate that. Your response should contain the keywords: 'cheese', 'exam', and 'honeycomb'. Outputs 2 paragraphs separated by ^^^^." + ], + "format:use json format": [ + "What are the middle 10 states by population in the United States? No matter how you answer, make sure your entire output is valid JSON.", + "Rewrite following the description of Mike into a json format putting each sentence in a separate field names 'sentence_1', 'sentence_2', ...:\nMike is a 15-year-old teenager with a height of 6 feet. He has a long curly hair, green eyes, and a big flat nose. He is good at ice hockey and drawing. 
His grade on math also never drops below 99 out of 100 points. However, he has terrible eyesight and has to sleep at least 14 hours per day, otherwise he would never talk that day." + ], + "length constraints:number of paragraphs": [ + "Give me a biography of Nikola Tesla in exactly 4 paragraphs separated by ---, each paragraph should have a title, wrapped in double angular brackets, i.e. <<title>>." + ], + " give two responses": [ + "Give me two recipes for chocolate chip cookies, one gluten free in Vietnamese. Separate the two recipes like so:\nRecipe 1\n******\nRecipe 2" + ], + "response language": [ + "Write me a 1 paragraph summary of the movie Ratatouille. Your response should be entirely in French. Make sure to include the word \"rat\" at least 3 times." + ], + "keywords:letter frequency": [ + "Tell me about the history of the Seattle public transit system in 10 sentences. Each sentence should contain the letter 'e' at least 7 times. End your response with the phrase 'That's all folks!'" + ], + "specific ending": [ + "Write a scary story, where friends of the protagonist get mysteriously killed one after another during a day. End the story with the exact sentence: \"She cried and chocked herself.\"" + ], + "keywords:exclude words": [ + "Give me five reasons why I should buy an electric car over a gas car? Do not mention the words \"cheap\" or \"fast\"." + ], + "keywords:frequency": [ + "Write a short love poem that include the word \"moon\" at least five times, word \"garden\" at least 3 times and ends with exact sentence: \"I will stare on the road till then\"." + ], + "length constraints:number of sentences": [ + "Explain the concept of model overfitting to a 3rd grader with no more than 3 sentences, also use all lower case letters. Add a postscript starting with P.S. at the end of your response." 
+ ], + "content:include a postscript": [ + "Please write me a letter I could send to my friend that moved overseas a few years ago, but that I talk to every day on the phone. Make sure to include the word \"cat\" in the last sentence. At the end, add a postscript starting with P.P.S." + ], + "length constraints:first word of the nth paragraph": [ + "Campus safety is a big concern given the increasing gun violence. Help me write a 5 paragraph letter to the head of the school to persuade her to devote more budget into improving campus safety. Make sure to start the second paragraph with the word \"however\" and end your paragraph with word \"behaviour\"." + ], + "format:choose one from options": [ + "How much does a PhD candidate sleep a day on average? Choose from the following: ('6 hours', '6.6 hours', 'it depends') -- please include the exact phrase in your response.", + "What is the best way to cook a steak? Choose from the following: ('grill', 'oven', 'microwave') -- please include the exact phrase in your response.", + "Given the 7 cannonical colors of the rainbow, ROYGBIV, the middle color is green. 
Your answer must contain one of the following exact phrases: \"yes, 100%\", \"No, no way\", \"not sure\"" + ] +} \ No newline at end of file diff --git a/scripts/persona_driven_data_gen/persona_driven_generate_ifdata.py b/scripts/persona_driven_data_gen/persona_driven_generate_ifdata.py new file mode 100755 index 000000000..a0e07d933 --- /dev/null +++ b/scripts/persona_driven_data_gen/persona_driven_generate_ifdata.py @@ -0,0 +1,163 @@ +""" +This code is partially borrowed and adapted from: https://github.com/tencent-ailab/persona-hub + +Example commands: +# generate 20 IF prompts: +python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 20 --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following + +# generate 20 IF responses +python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index 20 --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution +""" + +import argparse +import json +# from openai import OpenAI +from prompt_templates import instruction_template, knowledge_template, npc_template, math_template, math_solution_template, instruction_following, instruction_following_solution, rewrite_if_prompt +from datasets import load_dataset +from tqdm import tqdm +import random +import string +import openai +from datasets import Dataset +import random + +from tenacity import ( + retry, + stop_after_attempt, + wait_random_exponential, +) # for exponential backoff + +@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6)) +def completion_with_backoff(**kwargs): + return openai.ChatCompletion.create(**kwargs) + + +system_prompt = '''You are a helpful mathematician assistant.''' +# client = OpenAI() # set up your config/env/api for calling openai models +MODEL_COSTS = {"gpt-4": [0.00003, 0.00006], "gpt-3.5-turbo": [0.0000015, 0.000002], 
"gpt-4-1106-preview": [0.00001, 0.00003], 'gpt-4o': [0.000005, 0.000015]} + + +CONSTRAINTS = json.load(open("./data/if_constraint_fewshots_handwritten.json")) + +def get_response(args, user_prompt): + # completion = client.chat.completions.create( + completion = completion_with_backoff( + model=args.model, + temperature=0.7, + top_p=0.95, + messages=[ + {"role": "system", "content": f"{system_prompt}"}, + {"role": "user", "content": f"{user_prompt}"} + ] + ) + total_input_tokens = completion.usage.prompt_tokens + total_output_tokens = completion.usage.completion_tokens + return completion.choices[0].message.content, total_input_tokens, total_output_tokens + +process = lambda x: x.replace("Math problem:\n", "").replace("User instruction:\n", "").replace("User instruction:", "").lstrip("\n") + +def main(args): + # Load the appropriate template + if args.template == "instruction": + template = instruction_template + elif args.template == "knowledge": + template = knowledge_template + elif args.template == "npc": + template = npc_template + elif args.template == "math": + template = math_template + elif args.template == "math_solution": + template = math_solution_template + elif args.template == "instruction_following": + template = instruction_following + elif args.template == "instruction_following_solution": + template = instruction_following_solution + elif args.template == "rewrite_if_prompt": + template = rewrite_if_prompt + else: + raise ValueError("Invalid template type. 
See the --template choices for valid values.") + + total_input_tokens, total_output_tokens = 0, 0 + in_cost_per_token, out_cost_per_token = MODEL_COSTS[args.model] + + # Load the dataset + if args.dataset.endswith(".jsonl"): + persona_dataset = load_dataset("json", data_files=args.dataset)['train'] + else: + persona_dataset = load_dataset(args.dataset)['train'] + + if args.sanity_check > 0: + persona_dataset = persona_dataset.select(range(0, args.sanity_check)) + print(f"Total number of input personas: {len(persona_dataset)}") + + input_field = 'synthesized text' if args.template in ["instruction_following_solution", "rewrite_if_prompt"] else 'persona' + with open(args.output_path, "w") as out: + for idx, example in enumerate(tqdm(persona_dataset.select(range(args.start_index, args.end_index)))): + if args.template == "rewrite_if_prompt": + candid_constraints = example['constraints'] + few_shot = "" + # only rewrite prompts for examples with 2 or more constraints + if len(candid_constraints) < 2: + continue + else: + # randomly sample 1-3 constraints from the IFEval constraints list + candid_constraints = random.sample(list(CONSTRAINTS.keys()), random.choice([1,2,3])) + few_shot = random.choice(CONSTRAINTS[random.choice(candid_constraints)]) + + id = "personahub_" + ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(24)) + input_text = example[input_field].strip() + persona = example['input persona'] if args.template in ["instruction_following_solution", "rewrite_if_prompt"] else input_text + user_prompt = template.format(persona=input_text, example=few_shot, constraints=", ".join(candid_constraints), category=", ".join([ex.split(':')[0] for ex in candid_constraints])) + gpt4o_out_text, in_tokens, out_tokens = get_response(args, user_prompt) + o = { + "id": id, + "prompt": input_text, + "input_persona": persona, + "messages": [ + {"role": "user", "content": input_text}, + {"role": "assistant", "content": gpt4o_out_text} + ], + 
"constraints": example.get('constraints', []) + } if args.template in ["instruction_following_solution"] else { + 'input persona': persona, + 'synthesized text': process(gpt4o_out_text), + 'constraints': candid_constraints if args.template in ["instruction_following", "rewrite_if_prompt"] else [], + 'description': f'{args.template} problem' + } + out.write(json.dumps(o, ensure_ascii=False) + '\n') + # breakpoint() + + total_input_tokens += in_tokens + total_output_tokens += out_tokens + if idx % 20 == 0: + print(f"estimated cost so far= ${in_cost_per_token * total_input_tokens + out_cost_per_token * total_output_tokens}") + + print(f"Wrote the results to: {args.output_path}") + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Synthesize text using a specified model and template.") + parser.add_argument( + '--template', + type=str, + required=True, + choices=['math', 'math_solution', 'instruction_following', 'instruction_following_solution', 'rewrite_if_prompt'], + help=( + "Prompt template to use; valid values are listed in choices. 
" + "You can also add more customized templates in prompt_templates.py" + ) + ) + parser.add_argument("--dataset", required=False, default="proj-persona/PersonaHub") + parser.add_argument("--openai_key", required=True) + parser.add_argument("--org_id", required=True) + parser.add_argument("--model", default="gpt-4o", choices=["gpt-4", "gpt-3.5-turbo", "gpt-4-1106-preview", "gpt-4o"]) + parser.add_argument('--output_path', type=str, required=True, help='Path to the output file.') + parser.add_argument('--chat_format', type=str, required=False, help='whether to put in chat format') + parser.add_argument("--start_index", type=int, default=0) + parser.add_argument("--end_index", type=int, default=None) + parser.add_argument("--sanity_check", type=int, default=0) + + args = parser.parse_args() + openai.api_key = args.openai_key + openai.organization = args.org_id + + main(args) diff --git a/scripts/persona_driven_data_gen/persona_driven_generate_math_code.py b/scripts/persona_driven_data_gen/persona_driven_generate_math_code.py new file mode 100755 index 000000000..94c976ccd --- /dev/null +++ b/scripts/persona_driven_data_gen/persona_driven_generate_math_code.py @@ -0,0 +1,177 @@ +""" +This code is partially borrowed and adapted from: https://github.com/tencent-ailab/persona-hub + +Example uses: +# example 10 math solutions +python persona_driven_generate_math_code.py --model "gpt-4o" --end_index 10 --output_path <OUTPUT_MATH> --openai_key XXX --org_id YYY --dataset <OUTPUT_MATH_PROMPT> --template math_solution +# example for 10 code prompts +python persona_driven_generate_math_code.py --model "gpt-4o" --start_index 0 --end_index 10 --output_path <OUTPUT_CODE_PROMPT> --openai_key XXX --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template code +# example for 10 code soltuions +python persona_driven_generate_math_code.py --org_name anthropic --model 'claude-3-5-sonnet-20240620' --start_index 0 --end_index 10 --output_path <OUTPUT_CODE> --openai_key XXX 
--org_id YYY --dataset <OUTPUT_CODE_PROMPT> --template code_solution +""" + +import argparse +import json +# from openai import OpenAI +from prompt_templates import instruction_template, knowledge_template, npc_template, math_template, math_solution_template, math_template_easy, grade_math_solution_template, code_template, code_solution_template, math_int_algebra_template +from datasets import load_dataset +from tqdm import tqdm +import random +import string +import openai +from datasets import Dataset +import anthropic + +from tenacity import ( + retry, + stop_after_attempt, + wait_random_exponential, +) # for exponential backoff + +@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6)) +def completion_with_backoff(**kwargs): + return openai.ChatCompletion.create(**kwargs) + + +system_prompt = '''You are a helpful mathematician assistant.''' +# client = OpenAI() # set up your config/env/api for calling openai models +MODEL_COSTS = {"gpt-4": [0.00003, 0.00006], "gpt-3.5-turbo": [0.0000015, 0.000002], "gpt-4-1106-preview": [0.00001, 0.00003], 'gpt-4o': [0.000005, 0.000015], 'claude-sonnet':[0.000003, 0.000015], 'claude-3-5-sonnet-20240620': [0.000003, 0.000015]} + + +def get_response(args, user_prompt): + # completion = client.chat.completions.create( + if args.org_name == "openai": + completion = completion_with_backoff( + model=args.model, + max_tokens=1024, + temperature=0.7, + messages=[ + {"role": "system", "content": f"{system_prompt}"}, + {"role": "user", "content": f"{user_prompt}"} + ] + ) + output_text = completion.choices[0].message.content + total_input_tokens = completion.usage.prompt_tokens + total_output_tokens = completion.usage.completion_tokens + + elif args.org_name == 'anthropic': + client = anthropic.Anthropic(api_key=args.openai_key) # the script defines no separate Anthropic key flag, so --openai_key carries the key + response = client.messages.create( + model=args.model, + max_tokens=1024, + temperature=0.7, + 
system=system_prompt, + messages=[ + {"role": "user", "content": user_prompt} + ] + ) + output_text = response.content[0].text + total_input_tokens = 
response.usage.input_tokens + total_output_tokens = response.usage.output_tokens + return output_text, total_input_tokens, total_output_tokens + +process = lambda x: x.replace("Math problem:", "").replace("Question: ", "").lstrip("\n").strip() + +def main(args): + # Load the appropriate template + if args.template == "instruction": + template = instruction_template + elif args.template == "knowledge": + template = knowledge_template + elif args.template == "npc": + template = npc_template + elif args.template == "math": + template = math_template + elif args.template == "grade_math": + template = math_template_easy + elif args.template == "math_solution": + template = math_solution_template + elif args.template == "grade_math_solution": + template = grade_math_solution_template + elif args.template == "code": + template = code_template + elif args.template == "code_solution": + template = code_solution_template + elif args.template == "math_int_algebra": + template = math_int_algebra_template + elif args.template == "instruction_following": + template = instruction_following + elif args.template == "instruction_following_solution": + template = instruction_following_solution + elif args.template == "rewrite_if_prompt": + template = rewrite_if_prompt + else: + raise ValueError("Invalid template type. 
See the --template choices for valid values.") + + + total_input_tokens, total_output_tokens = 0, 0 + in_cost_per_token, out_cost_per_token = MODEL_COSTS[args.model] + + # Load the dataset + if args.dataset.endswith(".jsonl"): + persona_dataset = load_dataset("json", data_files=args.dataset)['train'] + else: + persona_dataset = load_dataset(args.dataset)['train'] + + if args.sanity_check > 0: + persona_dataset = persona_dataset.select(range(0, args.sanity_check)) + print(f"Total number of input personas: {len(persona_dataset)}") + + input_field = 'synthesized text' if args.template in ["math_solution", "rewrite_if_prompt", "grade_math_solution", "code_solution"] else 'persona' + with open(args.output_path, "w") as out: + for idx, example in enumerate(tqdm(persona_dataset.select(range(args.start_index, args.end_index)))): + id = "personahub_" + ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(24)) + input_text = example[input_field].strip() + persona = example['input persona'] if args.template in ["math_solution", "grade_math_solution", "code_solution"] else input_text + user_prompt = template.format(persona=input_text) + gpt4o_out_text, in_tokens, out_tokens = get_response(args, user_prompt) + o = { + "id": id, + "prompt": input_text, + "input_persona": persona, + "messages": [ + {"role": "user", "content": input_text}, + {"role": "assistant", "content": gpt4o_out_text} + ] + } if args.template in ["math_solution", "grade_math_solution", "code_solution"] else { + 'input persona': persona, + 'synthesized text': process(gpt4o_out_text), + 'description': f'{args.template} problem' + } + out.write(json.dumps(o, ensure_ascii=False) + '\n') + + total_input_tokens += in_tokens + total_output_tokens += out_tokens + if idx % 20 == 0: + print(f"estimated cost so far= ${in_cost_per_token * total_input_tokens + out_cost_per_token * total_output_tokens}") + + print(f"Wrote the results to: {args.output_path}") + +if __name__ == "__main__": + parser = 
argparse.ArgumentParser(description="Synthesize text using a specified model and template.") + parser.add_argument( + '--template', + type=str, + required=True, + choices=['math', 'math_solution', 'grade_math', 'grade_math_solution', 'code', 'code_solution', 'math_int_algebra'], + help=( + "Prompt templates. Choose from 'instruction', 'knowledge', 'math' or 'npc'. " + "You can also add more customized templates in prompt_templates.py" + ) + ) + parser.add_argument("--dataset", required=False, default="proj-persona/PersonaHub") + parser.add_argument("--org_name", default="openai", help="choose either openai for gpt-x or anthropic for claude") + parser.add_argument("--openai_key", required=True) + parser.add_argument("--org_id", required=False) + parser.add_argument("--model", default="gpt-4o", choices=["gpt-4", "gpt-3.5-turbo", "gpt-4-1106-preview", "gpt-4o", 'claude-sonnet', 'claude-3-5-sonnet-20240620']) + parser.add_argument('--output_path', type=str, required=True, help='Path to the output file.') + parser.add_argument('--chat_format', type=str, required=False, help='whether to put in chat format') + parser.add_argument("--start_index", type=int, default=0) + parser.add_argument("--end_index", type=int, default=None) + parser.add_argument("--sanity_check", type=int, default=0) + + args = parser.parse_args() + openai.api_key = args.openai_key + openai.organization = args.org_id + + main(args) diff --git a/scripts/persona_driven_data_gen/prompt_templates.py b/scripts/persona_driven_data_gen/prompt_templates.py new file mode 100755 index 000000000..ab07c956a --- /dev/null +++ b/scripts/persona_driven_data_gen/prompt_templates.py @@ -0,0 +1,326 @@ +math_template = '''Create a math problem related to the following persona: + +{persona} + +Note: + +1. The math problem should be challenging and involve advanced mathematical skills and knowledge. Only top talents can solve it correctly. +2. 
You should make full use of the persona description to create the math problem to ensure that the math problem is unique and specific to the persona. +3. Your response should always start with "Math problem:". Your response should not include a solution to the created math problem. +4. Your created math problem should include no more than 2 sub-problems. +''' + +math_template_easy = '''Create a grade school math word problem related to the following persona: + +{persona} + +Note: + +1. You should make full use of the persona description to create the math problem to ensure that the math problem is unique and specific to the persona. +2. The problem primarily involves performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach a single final answer. +3. Your response should always start with "Math problem:". Your response should not include a solution to the created math problem. +''' + + +instruction_template = '''Guess a prompt that the following persona may ask you to do: + +{persona} + +Note: + +1. The prompt should be informative and specific. +2. Your output should start with "User prompt:"''' + +knowledge_template = '''{persona} + +Assume you are the persona described above and you are writing a Quora article using your knowledge, skills, experience, or insights to help others learn and benefit from it. + +Note: + +1. The article should be specific, informative and knowledge-rich. +2. Your response should start with "Title:"''' + +npc_template = '''World of Warcraft (WoW) is a massively multiplayer online role-playing game (MMORPG) developed by Blizzard Entertainment. It is set in the high-fantasy world of Azeroth, a land filled with rich lore, diverse races, and epic conflicts. The game has evolved significantly since its release in 2004, with numerous expansions adding new continents, races, classes, and storylines. 
Below is a detailed overview of the game's worldview, story background, and some key characters and NPCs. + +### Worldview and Story Background + +**Azeroth** is a world steeped in ancient history, powerful magic, and epic conflicts. The planet is divided into several continents, each with its own unique environments, cultures, and histories. The main continents include: + +- Eastern Kingdoms: Home to the human kingdoms, dwarves, gnomes, and the undead Forsaken. +- Kalimdor: Inhabited by orcs, night elves, tauren, trolls, and other races. +- Northrend: A frozen continent, home to the Lich King and the undead Scourge. +- Pandaria: A mystical land shrouded in mists, home to the Pandaren. +- Broken Isles: The remnants of the ancient Night Elf civilization and the site of the Tomb of Sargeras. +- Zandalar and Kul Tiras: Introduced in the Battle for Azeroth expansion, these are the homelands of the Zandalari trolls and the human kingdom of Kul Tiras, respectively. +- Shadowlands: The realm of the afterlife, introduced in the Shadowlands expansion. + +The story of Azeroth is vast and complex, spanning millennia and involving numerous races, factions, and cosmic forces. Here are some key aspects of the world's background: + +#### **The Titans and the Old Gods** + +- **The Titans**: Azeroth was shaped by the Titans, colossal beings who are part of the Pantheon, a group of god-like entities dedicated to bringing order to the universe. The Titans discovered Azeroth and found it infested with chaotic entities known as the Old Gods. To combat this, they created the Titan-forged, including the Keepers, to help shape and protect the world. + +- **The Old Gods**: These malevolent, ancient beings sought to corrupt Azeroth. The Titans imprisoned the Old Gods beneath the surface of the world, but their influence persisted, causing chaos and corruption throughout history. Notable Old Gods include C'Thun, Yogg-Saron, N'Zoth, and Y'Shaarj. 
+ +#### **The Sundering** + +- **The Well of Eternity**: At the center of ancient Kalimdor was the Well of Eternity, a source of immense arcane power. The Highborne, a group of night elves led by Queen Azshara, recklessly tapped into its power, attracting the attention of the Burning Legion, a demonic army led by the dark titan Sargeras. + +- **The War of the Ancients**: This conflict saw the night elves, dragons, and other races unite to repel the Burning Legion's invasion. The war culminated in the Sundering, a catastrophic event that shattered the supercontinent of Kalimdor into several smaller continents and created the Maelstrom, a massive, swirling vortex of energy. + +#### **The Rise and Fall of Empires** + +- **The Troll Empires**: Before the Sundering, the trolls established powerful empires, such as the Gurubashi and Amani. These empires declined over time but left a lasting impact on Azeroth's history. + +- **The Night Elf Empire**: After the Sundering, the night elves established a new empire, centered around the World Tree, Nordrassil. They became the guardians of nature and the Emerald Dream, a parallel realm of primal life. + +- **The Human Kingdoms**: Humans emerged as a dominant race in the Eastern Kingdoms, founding powerful kingdoms such as Stormwind, Lordaeron, and Dalaran. These kingdoms played crucial roles in the defense of Azeroth against various threats. + +#### **The First and Second Wars** + +- **The First War**: The orcs, originally from the world of Draenor, were corrupted by the Burning Legion and transported to Azeroth through the Dark Portal. They waged war against the human kingdom of Stormwind, ultimately destroying it. + +- **The Second War**: The orcs, now united under the Horde, continued their conquest, clashing with the Alliance of Lordaeron, a coalition of human, dwarf, and high elf forces. The Alliance eventually triumphed, and the orcs were interned in camps. 
+ +#### **The Scourge and the Lich King** + +- **The Lich King**: Created by the demon lord Kil'jaeden, the Lich King was originally the orc shaman Ner'zhul. He was transformed into a powerful undead entity and imprisoned in the Frozen Throne in Northrend. The Lich King created the Scourge, an army of undead, to pave the way for a new invasion by the Burning Legion. + +- **The Third War**: The Scourge ravaged the human kingdoms, leading to the fall of Lordaeron and the rise of the undead Forsaken. The war culminated in the Battle of Mount Hyjal, where the combined forces of the night elves, Horde, and Alliance defeated the Burning Legion. + +#### **The Burning Crusade and Beyond** + +- **The Burning Crusade**: The first expansion of WoW saw players journey to Outland, the shattered remnants of Draenor, to combat the Burning Legion and its allies. + +- **Wrath of the Lich King**: This expansion focused on the conflict with the Lich King in Northrend, culminating in his defeat at Icecrown Citadel. + +- **Cataclysm**: The return of the corrupted Dragon Aspect Deathwing caused massive upheaval across Azeroth, reshaping the world and leading to new conflicts. + +- **Mists of Pandaria**: This expansion introduced the mysterious continent of Pandaria and its inhabitants, the Pandaren, as well as new threats from the Sha and the mogu. + +- **Warlords of Draenor**: Players traveled to an alternate-timeline Draenor to confront the Iron Horde, a new orcish threat. + +- **Legion**: The Burning Legion launched a full-scale invasion of Azeroth, leading to epic battles and the eventual defeat of the dark titan Sargeras. + +- **Battle for Azeroth**: This expansion reignited the conflict between the Alliance and Horde, with new zones, races, and storylines. + +- **Shadowlands**: The latest expansion takes players to the realm of the afterlife, where they must confront new threats and uncover the mysteries of death. + +### Overarching Themes + +**1. 
Conflict and Unity** +- The world of Azeroth is defined by its conflicts, both internal and external. The ongoing struggle between the Alliance and Horde is a central theme, but there are also numerous other conflicts involving ancient evils, demonic invasions, and cosmic forces. Despite these conflicts, there are moments of unity where disparate factions come together to face common threats. + +**2. Corruption and Redemption** +- Many of Azeroth's greatest heroes and villains have faced corruption, often by dark forces such as the Old Gods or the Burning Legion. Redemption is a recurring theme, with characters seeking to atone for their past actions and reclaim their honor. + +**3. Legacy and Heritage** +- The history of Azeroth is rich with ancient civilizations, legendary heroes, and powerful artifacts. The legacy of these past events shapes the present, with characters and factions often drawing on their heritage to guide their actions. + +**4. Magic and Technology** +- Azeroth is a world where magic and technology coexist. Arcane magic, divine power, and druidic nature magic are all integral to the world's functioning, while technological advancements by races like the gnomes and goblins add another layer of complexity. + +**5. Exploration and Discovery** +- The world of Azeroth is vast and filled with hidden secrets, ancient ruins, and uncharted territories. Exploration and discovery are key aspects of the game's appeal, with players constantly uncovering new lore and adventures. + +### Key Characters and NPCs + +**1. Thrall (Go'el)** +- **Race**: Orc +- **Class**: Shaman +- **Background**: Thrall is one of the most iconic characters in WoW. He was the Warchief of the Horde and played a crucial role in uniting the orc clans and leading them to a new home in Kalimdor. Thrall is known for his wisdom, strength, and deep connection to the elements. + +**2. 
Jaina Proudmoore** +- **Race**: Human +- **Class**: Mage +- **Background**: Jaina is the daughter of Admiral Daelin Proudmoore and one of the most powerful mages in Azeroth. She has been a key figure in many of the game's major events, including the founding of Theramore and the defense of Azeroth against various threats. + +**3. Sylvanas Windrunner** +- **Race**: Undead (formerly High Elf) +- **Class**: Hunter +- **Background**: Sylvanas was the Ranger-General of Silvermoon before being turned into a banshee by Arthas Menethil. She later became the leader of the Forsaken and, for a time, the Warchief of the Horde. Her actions have often been controversial and have had significant impacts on the game's storyline. + +**4. Anduin Wrynn** +- **Race**: Human +- **Class**: Priest +- **Background**: Anduin is the King of Stormwind and the son of the legendary King Varian Wrynn. Known for his compassion and desire for peace, Anduin has grown into a strong leader, guiding the Alliance through numerous conflicts. + +**5. Arthas Menethil (The Lich King)** +- **Race**: Undead (formerly Human) +- **Class**: Death Knight +- **Background**: Arthas was the Crown Prince of Lordaeron who fell from grace and became the Lich King, one of the most feared beings in Azeroth. His story is central to the Wrath of the Lich King expansion. + +**6. Illidan Stormrage** +- **Race**: Night Elf (Demon Hunter) +- **Class**: Demon Hunter +- **Background**: Illidan is a complex character who has walked the line between hero and villain. He was imprisoned for ten thousand years for his use of forbidden magic but later became a key figure in the fight against the Burning Legion. + +**7. Bolvar Fordragon** +- **Race**: Human (later Undead) +- **Class**: Paladin (later Death Knight) +- **Background**: Bolvar was a noble paladin who sacrificed himself to become the new Lich King, containing the Scourge. His story takes a dramatic turn in the Shadowlands expansion. + +**8. 
Tyrande Whisperwind** +- **Race**: Night Elf +- **Class**: Priestess of Elune +- **Background**: Tyrande is the High Priestess of Elune and the leader of the Night Elves. She is a fierce warrior and a devoted leader, often seen alongside her husband, Malfurion Stormrage. + +**9. Malfurion Stormrage** +- **Race**: Night Elf +- **Class**: Druid +- **Background**: Malfurion is the first Night Elf druid and one of the most powerful druids in Azeroth. He has played a crucial role in many of the world's major events, including the War of the Ancients and the defense of Azeroth against numerous threats. + +**10. Vol'jin** +- **Race**: Troll +- **Class**: Shadow Hunter +- **Background**: Vol'jin was the leader of the Darkspear Trolls and later became the Warchief of the Horde. He is known for his wisdom, bravery, and deep connection to the spirits. + +### Notable NPCs + +**1. Khadgar** +- **Race**: Human +- **Class**: Mage +- **Background**: Khadgar is one of the most powerful mages in Azeroth and a key figure in the fight against the Burning Legion. He played a significant role in the events of the Warlords of Draenor and Legion expansions. + +**2. Varok Saurfang** +- **Race**: Orc +- **Class**: Warrior +- **Background**: Saurfang is a legendary orc warrior known for his honor and strength. He played a pivotal role in the events of the Battle for Azeroth expansion. + +**3. Lor'themar Theron** +- **Race**: Blood Elf +- **Class**: Ranger +- **Background**: Lor'themar is the Regent Lord of Quel'Thalas and the leader of the Blood Elves. He has guided his people through many challenges, including their alliance with the Horde. + +**4. Genn Greymane** +- **Race**: Worgen (formerly Human) +- **Class**: Warrior +- **Background**: Genn is the King of Gilneas and a fierce leader of the Worgen. He has a deep-seated hatred for Sylvanas Windrunner and has been a key figure in the Alliance's efforts against the Horde. + +**5. 
Baine Bloodhoof** +- **Race**: Tauren +- **Class**: Warrior +- **Background**: Baine is the High Chieftain of the Tauren and the son of the legendary Cairne Bloodhoof. He is known for his wisdom, strength, and dedication to his people. + +**6. Alexstrasza the Life-Binder** +- **Race**: Dragon (Red Dragonflight) +- **Class**: Aspect of Life +- **Background**: Alexstrasza is the Aspect of Life and the leader of the Red Dragonflight. She has played a crucial role in many of Azeroth's major events, including the fight against Deathwing and the Cataclysm. + +**7. Magni Bronzebeard** +- **Race**: Dwarf +- **Class**: Warrior (later Speaker of Azeroth) +- **Background**: Magni is the former King of Ironforge who was transformed into a diamond form to become the Speaker of Azeroth, communicating with the world-soul of the planet. + +**8. Turalyon** +- **Race**: Human +- **Class**: Paladin +- **Background**: Turalyon is a legendary paladin and one of the original Knights of the Silver Hand. He spent many years fighting the Burning Legion in the Twisting Nether and returned to Azeroth during the Legion expansion. + +**9. Alleria Windrunner** +- **Race**: High Elf (later Void Elf) +- **Class**: Ranger +- **Background**: Alleria is the eldest of the Windrunner sisters and a skilled ranger. She embraced the powers of the Void and became a key figure in the fight against the Burning Legion. + +**10. Nathanos Blightcaller** +- **Race**: Undead +- **Class**: Hunter +- **Background**: Nathanos is a loyal champion of Sylvanas Windrunner and one of the most skilled hunters in Azeroth. He played a significant role in the events of the Battle for Azeroth expansion. + +--- + +Above is the introduction and background story of the game "World of Warcraft (WoW)". + +Your task is to consider what NPC the following persona will become after they come to the world of WoW: + +{persona} + +Note: + +1. Your response should start with "Name:". +2. 
Your NPC description should be specific and consistent with the game. +3. You also need to specify how the NPC interacts with players in the game. +''' + + +math_solution_template = '''Provide a solution to the given math problem. + +Problem: {persona} + +Note: Provide your solution step-by-step, and end your solution on a new line in the following format: \nFinal Answer: The final answer is $final_answer$. I hope it is correct. +''' + +grade_math_solution_template = '''Provide a solution to the given math problem. + +Problem: {persona} + +Note: First provide your solution step-by-step, and then output only a single final answer after ####''' +# end with your final answer in the following format: #### final_answer + +instruction_following = '''Create a verifiable instruction that the following persona might ask you to do: + +{persona} + +An example of a verifiable instruction could be: {example} + +Note: + +1. The above example is not tied to any particular persona, but you should create one that is unique and specific to the given persona. +2. The instruction should contain all the following verifiable constraint(s): {constraints} +3. Your output should start with "User instruction:". Your output should not include an answer to the instruction. +''' + +instruction_following_solution = '''Provide a response to the given instruction while satisfying the constraints. + +Instruction: {persona} + +Note that you should follow the instruction precisely and satisfy all the constraints. +''' + +rewrite_if_prompt = '''Rewrite the given instruction to remove one of the constraints. + +Instruction: {persona} + +Note: + +1. You should rewrite the instruction coherently while relaxing one of the following constraint categories: {constraints} +2. Remember to entirely relax one constraint category, namely {category} +3. Your output should start with "User instruction:". Your output should not include an answer to the instruction. 
+''' + +# (such as "write in more than 400 words") + +code_template = '''{persona} + +Assume you are the persona described above and you are asking a python programming question on Stack Overflow. + +Note: + +1. Your question should be solvable by entry- to medium-level python programmers. +2. Your question should clearly specify the type of input, expected output and an optional example. +3. Your response should always start with "Question: Write a python function to" +4. Your response should not include a solution to the created coding problem.''' + + +code_solution_template = '''Provide a solution to the given python programming question. + +Question: {persona} + + +Note: + +1. Your response should always start with the function definition and end with the final return statement. +2. Your response should include only the python function.''' + + +math_int_algebra_template = '''Create an intermediate algebra math problem related to the following persona: + +{persona} + +Note: + +1. The math problem should be challenging and involve one of the intermediate algebra topics such as solving polynomial equations, solving linear equations, inequalities, or quadratic equations, or simplifying rational and radical expressions. +2. You should make full use of the persona description to create the math problem to ensure that the math problem is unique and specific to the persona. +3. Your response should always start with "Math problem:". Your response should not include a solution to the created math problem. +4. Your created math problem should include no more than 2 sub-problems. +''' \ No newline at end of file diff --git a/scripts/persona_driven_data_gen/requirements.txt b/scripts/persona_driven_data_gen/requirements.txt new file mode 100644 index 000000000..e405088ab --- /dev/null +++ b/scripts/persona_driven_data_gen/requirements.txt @@ -0,0 +1,3 @@ +anthropic +openai==0.28 +tenacity
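The prompt templates in this diff use named placeholders (`{persona}`, `{example}`, `{constraints}`, `{category}`) and ask the model to begin or end its output with a fixed marker such as "User instruction:" or "Final Answer: The final answer is $final_answer$. I hope it is correct." The sketch below shows one way such templates could be filled and such markers parsed. It is an illustration only, not the code from the generation scripts: `INSTRUCTION_FOLLOWING` is a shortened stand-in for the real template, and `build_prompt`, `strip_prefix`, `FINAL_ANSWER_RE`, and `extract_final_answer` are hypothetical names.

```python
import re

# Shortened stand-in for the `instruction_following` template in the diff.
# The real prompt is longer, but it uses the same three placeholders.
INSTRUCTION_FOLLOWING = (
    "Create a verifiable instruction that the following persona might ask you to do:\n\n"
    "{persona}\n\n"
    "An example of verifiable instruction could be: {example}\n\n"
    "The instruction should contain all the following verifiable constraint(s): {constraints}\n"
)


def build_prompt(template: str, **fields: str) -> str:
    """Fill a template's named placeholders with str.format."""
    return template.format(**fields)


def strip_prefix(response: str, prefix: str) -> str:
    """Drop a required leading marker such as 'User instruction:' if present."""
    text = response.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()


# Assumed pattern for the closing line that math_solution_template asks for.
# The dollar signs are treated as optional in case the model omits them.
FINAL_ANSWER_RE = re.compile(
    r"Final Answer: The final answer is \$?(.+?)\$?\. I hope it is correct\."
)


def extract_final_answer(solution: str):
    """Return the text of the final answer, or None if the marker is missing."""
    match = FINAL_ANSWER_RE.search(solution)
    return match.group(1) if match else None


if __name__ == "__main__":
    prompt = build_prompt(
        INSTRUCTION_FOLLOWING,
        persona="A retired sailor who teaches knot tying",
        example="Write two words that can be added to this series.",
        constraints="punctuation:use no comma",
    )
    print(prompt)
    print(strip_prefix("User instruction: List three knots.", "User instruction:"))
    print(extract_final_answer("x = 42\nFinal Answer: The final answer is $42$. I hope it is correct."))
```

Because `str.format` treats every brace as a placeholder, a persona or problem text should be passed in as a field value rather than concatenated into the template itself.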