
Commit

Clean up outstanding files for v0.1
minimaxir committed May 17, 2020
1 parent 25c3e94 commit 7a616e5
Showing 9 changed files with 39 additions and 37 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,4 +2,5 @@ __pycache__
test_notebooks/
/build
/dist
*.egg-info
*.egg-info
.vscode/settings.json
2 changes: 2 additions & 0 deletions DESIGN.md
@@ -9,10 +9,12 @@ A few notes on some opinionated design decisions present in aitextgen.
- Although GPT-2 is the default Transformers model architecture used, there is limited GPT-2 specific code. This allows this tool to easily adapt to new CLM architectures that may be released in the future, such as SparseTransformers or Reformers.
- `generate_to_file()` automatically assigns the generation process a seed if one is not specified. This allows other users to reproduce a generation deterministically (e.g. in a Jupyter Notebook), in order to provide proof that the text was generated by AI and not altered (see the sketch after this list).
- For training checkpoints, aitextgen deliberately disables pytorch-lightning's checkpointing feature since it incurs a lot of overhead; saving natively through the model itself is easier.
- Testing/validation is deliberately not implemented, since it serves more as a crutch and doesn't provide much help in identifying overfitting.
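
A minimal sketch of the seeding behavior described above, assuming `generate_to_file()` accepts an explicit `seed` keyword argument (the argument values are illustrative):

```python
from aitextgen import aitextgen

ai = aitextgen()  # loads the default pretrained 124M GPT-2 model

# if seed were omitted, generate_to_file() would assign one automatically;
# passing it explicitly pins the output so others can reproduce it
ai.generate_to_file(n=10, max_length=100, seed=42)
```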

## Philosophies

- The development intent of aitextgen is as a _tool_ for AI text generation, not a philosophical experiment about AI consciousness or whatnot. (Alternatively, one could argue that _humans_ are the ones who perform actions based on prior knowledge and [free will is a myth](https://www.youtube.com/watch?v=kQjb-EP2JEE), but that's a discussion for another time.)
- AI generated text should _never_ be called "deepfake text." Stop trying to make deepfake text happen.

## Deviations from Huggingface Transformers

2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,2 @@
graft */static
global-exclude .DS_Store
12 changes: 6 additions & 6 deletions README.md
@@ -7,16 +7,16 @@ aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hu
- Finetunes on a pretrained 124M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency! (even from the 1.5B GPT-2 model!)
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks and upload it to the Huggingface model repository. It also uses the included `generate()` function, allowing a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress progress, with the ability to add optional loggers.
- The input dataset is its own object, allowing you to not only easily encode, cache, and compress them on a local computer before transporting it, but you are able to _merge_ datasets without biasing the resulting dataset, or _cross-train_ models so it learns some data fully and some partially to create blended output.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
- The input dataset is its own object, allowing you to not only easily encode, cache, and compress it on a local computer before transporting it to a remote server, but also to _merge_ datasets without biasing the resulting dataset, or _cross-train_ on multiple datasets to create blended output (see the sketch after this list).
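
A rough sketch of the dataset workflow described in the last bullet, assuming a `TokenDataset` class and a `merge_datasets()` helper with an `equalize` option (file names are hypothetical):

```python
from aitextgen.TokenDataset import TokenDataset, merge_datasets

# encode and cache two text files as dataset objects
data1 = TokenDataset("hacker_news.txt")
data2 = TokenDataset("shakespeare.txt")

# merge them, equalizing samples so neither dataset dominates the blend
merged = merge_datasets([data1, data2], equalize=True)
```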

You can read more about aitextgen in the documentation!

## Demo

You can play with aitextgen _for free_ with powerful GPUs using these Colaboratory Notebooks!

- Finetune an existing 124M GPT-2 model on your own dataset (GPU)
- [Finetune an existing 124M GPT-2 model on your own dataset (GPU)](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing)
- Train a GPT-2 model + tokenizer from scratch (GPU)

## Installation
@@ -27,11 +27,11 @@ aitextgen can be installed from PyPI:
pip3 install aitextgen
```

## Quick Example
## Quick Examples

Here's how you can quickly test out aitextgen on your own computer, even if you don't have a GPU!

For generating text from a pretrained GPT-2 model ([Jupyter Notebook](/notebooks/generation_hello_world.ipynb)):
For generating text from a pretrained GPT-2 model:

```python
from aitextgen import aitextgen

# a minimal sketch: with no arguments, aitextgen() loads the
# default pretrained 124M GPT-2 model
ai = aitextgen()

ai.generate()
```
@@ -114,7 +114,7 @@ aitextgen is a tool primarily intended to help facilitate creative content. It i

- State that the text was generated using aitextgen and/or a GPT-2 model architecture. (a link to this repo would be a bonus!)
- If parodying a person, explicitly state that it is a parody, and reference who it is parodying.
- If the generated human-curated, or if it's unsupervised random output
- If the generated text is human-curated, or if it's unsupervised random output
- Indicating who is maintaining/curating the AI-generated text.
- Make a good-faith effort to remove overfit output from the generated text that matches the input text verbatim.

31 changes: 15 additions & 16 deletions ROADMAP.md
@@ -4,19 +4,18 @@ A rough roadmap for implementing new features. **All is subject to change at a m

## Launch

* Training using pytorch-lightning, with support for fp16 and Colab TPUs.
* Training a GPT-2 model from scratch w/ parameterized context window sizes and parameters
* PyTorch support for training/generating
* Export to static Torchscript trace.
* Generation from Transformers' native generate() function
* Actual documentation
* Examples
* Training on a CPU
* Training on a GPU
* Training on multiple GPUs (4x T4)
* Training on a TPU
* Cross-Training on Multiple Datasets
* Generate on a CPU
* Generate on a GPU
* Model Deployment w/ Torchscript and starlette
* API docs for all classes
- Training using pytorch-lightning, with support for fp16 and Colab TPUs.
- Training a GPT-2 model from scratch w/ parameterized context window sizes and parameters
- PyTorch support for training/generating
- Export to static Torchscript trace.
- Generation from Transformers' native generate() function
- Actual documentation
- Examples
- Training on a CPU
- Training on a GPU
- Training on multiple GPUs (4x T4)
- Training on a TPU
- Cross-Training on Multiple Datasets
- Generate on a CPU
- Generate on a GPU
- API docs for all classes
2 changes: 1 addition & 1 deletion aitextgen/tokenizers.py
@@ -29,7 +29,7 @@ def train_tokenizer(
:param save_path: Where to save the final tokenizer
:param added_tokens: List of tokens to add to the tokenizer (currently not working)
:param bos_token: Beginning-of-string special token
:param eos_token: End-of-string special token
:param eos_token: End-of-string special token
:param unk_token: Unknown special token
"""

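A hedged usage sketch for the `train_tokenizer()` signature documented above; the positional `files` argument and the `vocab_size` keyword are assumptions not shown in this hunk, and `input.txt` is a hypothetical file:

```python
from aitextgen.tokenizers import train_tokenizer

# train a tokenizer on a plain-text file and write its vocabulary
# files to save_path, using the special tokens documented above
train_tokenizer("input.txt", vocab_size=5000, save_path="my_tokenizer")
```
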
2 changes: 1 addition & 1 deletion aitextgen/utils.py
@@ -102,7 +102,7 @@ def build_gpt2_config(
bos_token_id: int = 0,
eos_token_id: int = 0,
max_length: int = 1024,
dropout: float = 0.1,
dropout: float = 0.0,
**kwargs
):
"""
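
A short sketch of how `build_gpt2_config()` might be called, reflecting the new 0.0 dropout default; the `vocab_size` argument is an assumption, since the function's earlier parameters are not shown in this hunk:

```python
from aitextgen.utils import build_gpt2_config

# build a small GPT-2 config for training a model from scratch;
# dropout now defaults to 0.0 but can still be overridden
config = build_gpt2_config(vocab_size=5000, max_length=1024, dropout=0.0)
```
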
8 changes: 3 additions & 5 deletions requirements.txt
@@ -1,6 +1,4 @@
transformers>=2.8.0
fire
transformers>=2.9.1
fire>=0.3.0
msgpack
pytorch-lightning>=0.7.3
tqdm>=4.41.0
pyyaml
pytorch-lightning>=0.7.6
14 changes: 7 additions & 7 deletions setup.py
@@ -1,30 +1,30 @@
from setuptools import setup, find_packages
from setuptools import setup

long_description = """
A robust tool for advanced AI text generation.
A robust Python tool for text-based AI training and generation using GPT-2.
"""


setup(
name="aitextgen",
packages=["aitextgen"], # this must be the same as the name above
version="0.1",
description="A robust tool for advanced AI text generation using Transformers.",
description="A robust Python tool for text-based AI training and generation using GPT-2.",
long_description=long_description,
long_description_content_type="text/markdown",
author="Max Woolf",
author_email="[email protected]",
url="https://github.com/minimaxir/aitextgen",
keywords=["wordcloud", "data visualization", "text cool stuff"],
keywords=["gpt-2", "gpt2", "text generation", "ai"],
classifiers=[],
license="MIT",
entry_points={"console_scripts": ["aitextgen=aitextgen.cli:aitextgen_cli"]},
python_requires=">=3.6",
include_package_data=True,
install_requires=[
"transformers>=2.9.0",
"fire",
"transformers>=2.9.1",
"fire>=0.3.0",
"msgpack",
"pytorch-lightning>=0.7.5",
"pytorch-lightning>=0.7.6",
],
)
