SciPhi is a simple framework for generating synthetic / fine-tuning data, and for robust evaluation of LLMs.

TheGrognardling/sciphi

This branch is 124 commits behind SciPhi-AI/synthesizer:main.

Latest commit

4db0701 · Sep 22, 2023

SciPhi [ΨΦ]: A framework for breaking LLM scaling laws

Overview

SciPhi is a Python package that provides two high-level features:

  • Configurable generation of LLM-mediated synthetic training / tuning data for LLMs.
  • Seamless LLM-mediated evaluation of model output.

Questions?

Join us on Discord here or contact me directly. For a SciPhi tutorial, go here.

Installation

# Repository setup
git clone https://github.com/emrgnt-cmplxty/sciphi.git
cd sciphi
# Install dependencies
# pip3 install poetry (if you don't have it)
poetry install -E all
# Setup your environment
cp .env.example .env && vim .env

Requirements

  • Python >= 3.11 and < 3.12
  • Poetry for package management

Optional Feature Requirements

For additional features, you can install the optional dependencies:

poetry install -E <extra_name>
  • anthropic_support: For running with Anthropic models.
  • hf_support: For running with the HuggingFace package, useful for accessing a wide variety of models.
  • openai_support: For running with OpenAI models.
  • vllm_support: For running with vLLM, useful for fast inference.
  • llama_index_support: For LlamaIndex, useful for grounded synthesis.
  • chroma_support: For Chroma support, used for large vector databases.
  • all: For all dependencies except vLLM, which requires a separate install.

Usage

Dataset Generation

You can use SciPhi for dataset generation by executing the relevant runner.py file with various command-line arguments.

poetry run python sciphi/examples/data_generation/runner.py --provider_name=openai --model_name=gpt-4 --log_level=DEBUG --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need

Key Command-Line Arguments

  • --provider_name: Which provider to use for completions (default: "openai").
  • --model_name: The name of the model to load from the provider (default: "gpt-3.5-turbo").
  • --temperature: Temperature parameter for the provided model (default: 0.7).
  • --example_config: Which example configuration to use (default: "textbooks_are_all_you_need").
  • --override_config_path: Path to a custom config that overrides the example configuration.
  • --num_samples: Number of samples to generate (default: 1_024).
  • --output_dir: Directory path to override the default output path with.
  • --output_file_name: Filename to override the default output file name with.

Stock data configs

  • evol_instruct - A config for replicating the EvolInstruct dataset
  • textbooks_are_all_you_need - A config for replicating the Python textbook data from Textbooks Are All You Need [2]

Example generated data

[Screenshot: example of generated data]

Development

The code snippet below shows how to use SciPhi to generate synthetic data for a given LLM provider.

# Build an LLM and provider interface
llm_config = LLMConfigManager.get_config_for_provider(
    provider_name
).create(**build_llm_config(args))
llm_provider = InterfaceManager.get_provider(
    provider_name,
    model_name,
    llm_config,
)

# Initialize the data maker
data_maker = DataMaker(
    DataGeneratorMode(data_config.generator_mode),
    prompt_generator,
    prompt,
    # Optional field; currently only used when
    # generator_mode == "from_hf_dataset"
    dataset_name=data_config.dataset_name,
)

# Generate & write out the results
output_path = get_output_path(args)
logger.debug(f"Writing results to: {output_path}.")
writer = JsonlDataWriter(output_path)

for batch in data_maker.generator(args.batch_size, args.num_samples):
    completions = llm_provider.get_batch_completion(batch)
    for formatted_prompt, completion in zip(batch, completions):
        logger.debug("-" * 100)
        logger.debug(f"Formatted Prompt:\n{formatted_prompt}")
        logger.debug(f"\nCompletion:\n{completion}")
        logger.debug("-" * 100)

        # Write the results using DataWriter
        writer.write(
            [
                {
                    "formatted_prompt": formatted_prompt,
                    "completion": completion,
                }
            ]
        )
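Once a run finishes, the output file can be loaded back for inspection. The sketch below assumes the writer emits standard JSONL (one JSON object per line) with the "formatted_prompt" and "completion" keys used in the snippet above. `read_jsonl` and `summarize` are hypothetical helpers written for illustration and are not part of SciPhi.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def summarize(path):
    """Return the record count and the mean completion length in characters."""
    records = list(read_jsonl(path))
    lengths = [len(record["completion"]) for record in records]
    mean_chars = sum(lengths) / len(lengths) if lengths else 0.0
    return {"num_records": len(records), "mean_completion_chars": mean_chars}
```

For example, `summarize("example_output.jsonl")` would report how many prompt and completion pairs were written, which is a quick sanity check against the requested `--num_samples`.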

License

This project is licensed under the Apache-2.0 License.

Datasets Generated

[1] Python Synthetic Textbooks

Sources

[1] WizardCoder: Empowering Code Large Language Models with Evol-Instruct

[2] Textbooks Are All You Need

📖 Citation

Reference to cite if you use SciPhi in a paper:

@software{Emergent_AGI_SciPhi,
author = {Colegrove, Owen},
doi = {Pending},
month = {09},
title = {{SciPhi}},
url = {https://github.com/emrgnt-cmplxty/sciphi},
year = {2023}
}
