SciPhi [ΨΦ]: A framework for breaking LLM scaling laws

Overview

SciPhi is a Python package that provides two high-level features:

Configurable generation of LLM-mediated synthetic training / tuning data for LLMs.
Seamless LLM-mediated evaluation of model output.

Questions?

Join us on Discord here or contact me directly. For a SciPhi tutorial, go here.

Installation

# Repository setup
git clone https://github.com/emrgnt-cmplxty/sciphi.git
cd sciphi
# Install dependencies
# pip3 install poetry (if you don't have it)
poetry install -E all
# Setup your environment
cp .env.example .env && vim .env

Requirements

Python >= 3.11 and < 3.12
Poetry for package management
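
The version constraint above can be checked before installing. This is a minimal sketch using only the standard library; the helper name is_supported_python is hypothetical and not part of SciPhi.

```python
import sys


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True if the interpreter satisfies Python >= 3.11 and < 3.12."""
    # Compare only (major, minor); the patch level does not matter here.
    return (3, 11) <= tuple(version_info[:2]) < (3, 12)


# Check the interpreter that is running this script.
print(is_supported_python())
```

If this prints False, Poetry can be pointed at a matching interpreter before running poetry install.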

Optional Feature Requirements

For additional features, you can install the optional dependencies:

poetry install -E <extra_name>

anthropic_support: For running with Anthropic models.
hf_support: For running with the HuggingFace package, useful for access to a wide variety of models.
openai_support: For running with OpenAI models.
vllm_support: For running with vLLM, useful for fast inference.
llama_index_support: For LlamaIndex, useful for grounded synthesis.
chroma_support: For Chroma support, used for large vector databases.
all: For all dependencies (excluding vLLM, which requires a separate install).

Usage

Dataset Generation

You can use SciPhi for dataset generation by executing the relevant runner.py file with various command-line arguments.

poetry run python sciphi/examples/data_generation/runner.py --provider_name=openai --model_name=gpt-4 --log_level=DEBUG --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need

Key Command-Line Arguments

--provider_name: Which provider to use for completions (default: "openai").
--model_name: The name of the model to load from the provider (default: "gpt-3.5-turbo").
--temperature: Temperature parameter for the provided model (default: 0.7).
--example_config: Which example configuration to use (default: "textbooks_are_all_you_need").
--override_config_path: Used to override the example configurations with custom config.
--num_samples: Number of samples to generate (default: 1_024).
--output_dir: Directory to override the default output path with.
--output_file_name: Filename to override the default output file name with.

Stock data configs

evol_instruct - A config for replicating the EvolInstruct dataset
textbooks_are_all_you_need - A config for replicating the Python textbook data from Textbooks Are All You Need [2]

Example generated data

Development

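The code snippet below shows how to use SciPhi to generate synthetic data for a given LLM provider. It is an excerpt: it assumes that args, provider_name, model_name, data_config, prompt_generator, prompt, and logger are already defined.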

# Build an LLM and provider interface
llm_config = LLMConfigManager.get_config_for_provider(
    provider_name
).create(**build_llm_config(args))
llm_provider = InterfaceManager.get_provider(
    provider_name,
    model_name,
    llm_config,
)

# Initialize the data maker
data_maker = DataMaker(
    DataGeneratorMode(data_config.generator_mode),
    prompt_generator,
    prompt,
    # Optional field,
    # currently only used when generator_mode == "from_hf_dataset"
    dataset_name=data_config.dataset_name,
)

# Generate & write out the results
output_path = get_output_path(args)
logger.debug(f"Writing results to: {output_path}.")
writer = JsonlDataWriter(output_path)

for batch in data_maker.generator(args.batch_size, args.num_samples):
    completions = llm_provider.get_batch_completion(batch)
    for formatted_prompt, completion in zip(batch, completions):
        logger.debug("-" * 100)
        logger.debug(f"Formatted Prompt:\n{formatted_prompt}")
        logger.debug(f"\nCompletion:\n{completion}")
        logger.debug("-" * 100)

        # Write the results using DataWriter
        writer.write(
            [
                {
                    "formatted_prompt": formatted_prompt,
                    "completion": completion,
                }
            ]
        )
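
The records written above carry two keys, formatted_prompt and completion, so the output file can be inspected with the standard library alone. The sketch below assumes the writer appends one JSON object per line, as the .jsonl extension suggests. The helpers append_jsonl and read_jsonl are hypothetical stand-ins, not SciPhi functions, and the sample records are placeholders.

```python
import json
import tempfile
from pathlib import Path


def append_jsonl(path: Path, records: list[dict]) -> None:
    """Append each record as one JSON object per line (assumed writer behavior)."""
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Load every non-empty line of a JSONL file as a dict."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "example_output.jsonl"
    # Placeholder data standing in for real prompts and completions.
    append_jsonl(output_path, [{"formatted_prompt": "Prompt A", "completion": "Completion A"}])
    append_jsonl(output_path, [{"formatted_prompt": "Prompt B", "completion": "Completion B"}])
    records = read_jsonl(output_path)

for record in records:
    print(record["formatted_prompt"], "->", record["completion"])
```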

License

This project is licensed under the Apache-2.0 License.

Datasets Generated

[1] Python Synthetic Textbooks

Sources

[1] WizardCoder: Empowering Code Large Language Models with Evol-Instruct

[2] Textbooks Are All You Need

📖 Citation

Reference to cite if you use SciPhi in a paper:

@software{Emergent_AGI_SciPhi,
author = {Colegrove, Owen},
doi = {Pending},
month = {09},
title = {{SciPhi}},
url = {https://github.com/emrgnt-cmplxty/sciphi},
year = {2023}
}
