
Commit

Clean up outstanding files for v0.1
minimaxir committed May 17, 2020
1 parent 25c3e94 commit 7a616e5
Showing 9 changed files with 39 additions and 37 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,4 +2,5 @@ __pycache__
test_notebooks/
/build
/dist
*.egg-info
*.egg-info
.vscode/settings.json
2 changes: 2 additions & 0 deletions DESIGN.md
@@ -9,10 +9,12 @@ A few notes on some opinionated design decisions present in aitextgen.
- Although GPT-2 is the default Transformers model architecture used, there is limited GPT-2 specific code. This allows this tool to easily adapt to new CLM architectures that may be released in the future, such as SparseTransformers or Reformers.
- `generate_to_file()` automatically assigns the generation process a seed if one is not specified. This allows other users to reproduce a generation deterministically (e.g. in a Jupyter Notebook), in order to provide proof that the text was generated by AI and not altered (see the sketch after this list).
- For training checkpoints, aitextgen deliberately disables pytorch-lightning's checkpointing feature since it incurs a lot of overhead; saving natively through the model itself is easier.
- Testing/validation is deliberately not implemented, since it serves more as a crutch and doesn't provide much help in identifying overfitting.
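
A minimal sketch of the seeding behavior described above, assuming `generate_to_file()` accepts an explicit `seed` keyword argument (the argument values are illustrative):

```python
from aitextgen import aitextgen

ai = aitextgen()  # loads the default pretrained 124M GPT-2 model

# if seed were omitted, generate_to_file() would assign one automatically;
# passing it explicitly pins the output so others can reproduce it
ai.generate_to_file(n=10, max_length=100, seed=42)
```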

## Philosophies

- The development intent of aitextgen is as a _tool_ for AI text generation, not a philosophical experiment about AI consciousness or whatnot. (Alternatively, one could argue that _humans_ are the ones who perform actions based on prior knowledge and [free will is a myth](https://www.youtube.com/watch?v=kQjb-EP2JEE), but that's a discussion for another time.)
- AI generated text should _never_ be called "deepfake text." Stop trying to make deepfake text happen.

## Deviations from Huggingface Transformers

2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,2 @@
graft */static
global-exclude .DS_Store
12 changes: 6 additions & 6 deletions README.md
@@ -7,16 +7,16 @@ aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hu
- Finetunes on a pretrained 124M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency! (even from the 1.5B GPT-2 model!)
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks and upload it to the Huggingface model repository. It also uses the included `generate()` function, allowing a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress progress, with the ability to add optional loggers.
- The input dataset is its own object, allowing you to not only easily encode, cache, and compress them on a local computer before transporting it, but you are able to _merge_ datasets without biasing the resulting dataset, or _cross-train_ models so it learns some data fully and some partially to create blended output.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
- The input dataset is its own object, allowing you to not only easily encode, cache, and compress it on a local computer before transporting it to a remote server, but also to _merge_ datasets without biasing the resulting dataset, or _cross-train_ on multiple datasets to create blended output (see the sketch after this list).
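
A rough sketch of the dataset workflow described in the last bullet, assuming a `TokenDataset` class and a `merge_datasets()` helper with an `equalize` option (file names are hypothetical):

```python
from aitextgen.TokenDataset import TokenDataset, merge_datasets

# encode and cache two text files as dataset objects
data1 = TokenDataset("hacker_news.txt")
data2 = TokenDataset("shakespeare.txt")

# merge them, equalizing samples so neither dataset dominates the blend
merged = merge_datasets([data1, data2], equalize=True)
```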

You can read more about aitextgen in the documentation!

## Demo

You can play with aitextgen _for free_ with powerful GPUs using these Colaboratory Notebooks!

- Finetune an existing 124M GPT-2 model on your own dataset (GPU)
- [Finetune an existing 124M GPT-2 model on your own dataset (GPU)](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing)
- Train a GPT-2 model + tokenizer from scratch (GPU)

## Installation
@@ -27,11 +27,11 @@ aitextgen can be installed from PyPI:
pip3 install aitextgen
```

## Quick Example
## Quick Examples

Here's how you can quickly test out aitextgen on your own computer, even if you don't have a GPU!

For generating text from a pretrained GPT-2 model ([Jupyter Notebook](/notebooks/generation_hello_world.ipynb)):
For generating text from a pretrained GPT-2 model:

```python
from aitextgen import aitextgen

# a minimal sketch: with no arguments, aitextgen() loads the
# default pretrained 124M GPT-2 model
ai = aitextgen()

ai.generate()
```
@@ -114,7 +114,7 @@ aitextgen is a tool primarily intended to help facilitate creative content. It i

- State that the text was generated using aitextgen and/or a GPT-2 model architecture. (a link to this repo would be a bonus!)
- If parodying a person, explicitly state that it is a parody, and reference who it is parodying.
- If the generated human-curated, or if it's unsupervised random output
- If the generated text is human-curated, or if it's unsupervised random output
- Indicating who is maintaining/curating the AI-generated text.
- Make a good-faith effort to remove overfit output from the generated text that matches the input text verbatim.

31 changes: 15 additions & 16 deletions ROADMAP.md
@@ -4,19 +4,18 @@ A rough roadmap for implementing new features. **All is subject to change at a m

## Launch

* Training using pytorch-lightning, with support for fp16 and Colab TPUs.
* Training a GPT-2 model from scratch w/ parameterized context window sizes and parameters
* PyTorch support for training/generating
* Export to static Torchscript trace.
* Generation from Transformers' native generate() function
* Actual documentation
* Examples
* Training on a CPU
* Training on a GPU
* Training on multiple GPUs (4x T4)
* Training on a TPU
* Cross-Training on Multiple Datasets
* Generate on a CPU
* Generate on a GPU
* Model Deployment w/ Torchscript and starlette
* API docs for all classes
- Training using pytorch-lightning, with support for fp16 and Colab TPUs.
- Training a GPT-2 model from scratch w/ parameterized context window sizes and parameters
- PyTorch support for training/generating
- Export to static Torchscript trace.
- Generation from Transformers' native generate() function
- Actual documentation
- Examples
- Training on a CPU
- Training on a GPU
- Training on multiple GPUs (4x T4)
- Training on a TPU
- Cross-Training on Multiple Datasets
- Generate on a CPU
- Generate on a GPU
- API docs for all classes
2 changes: 1 addition & 1 deletion aitextgen/tokenizers.py
@@ -29,7 +29,7 @@ def train_tokenizer(
:param save_path: Where to save the final tokenizer
:param added_tokens: List of tokens to add to the tokenizer (currently not working)
:param bos_token: Beginning-of-string special token
:param eos_token: End-of-string special token
:param eos_token: End-of-string special token
:param unk_token: Unknown special token
"""

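A hedged usage sketch for the `train_tokenizer()` signature documented above; the positional `files` argument and the `vocab_size` keyword are assumptions not shown in this hunk, and `input.txt` is a hypothetical file:

```python
from aitextgen.tokenizers import train_tokenizer

# train a tokenizer on a plain-text file and write its vocabulary
# files to save_path, using the special tokens documented above
train_tokenizer("input.txt", vocab_size=5000, save_path="my_tokenizer")
```
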
2 changes: 1 addition & 1 deletion aitextgen/utils.py
@@ -102,7 +102,7 @@ def build_gpt2_config(
bos_token_id: int = 0,
eos_token_id: int = 0,
max_length: int = 1024,
dropout: float = 0.1,
dropout: float = 0.0,
**kwargs
):
"""
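
A short sketch of how `build_gpt2_config()` might be called, reflecting the new 0.0 dropout default; the `vocab_size` argument is an assumption, since the function's earlier parameters are not shown in this hunk:

```python
from aitextgen.utils import build_gpt2_config

# build a small GPT-2 config for training a model from scratch;
# dropout now defaults to 0.0 but can still be overridden
config = build_gpt2_config(vocab_size=5000, max_length=1024, dropout=0.0)
```
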
8 changes: 3 additions & 5 deletions requirements.txt
@@ -1,6 +1,4 @@
transformers>=2.8.0
fire
transformers>=2.9.1
fire>=0.3.0
msgpack
pytorch-lightning>=0.7.3
tqdm>=4.41.0
pyyaml
pytorch-lightning>=0.7.6
14 changes: 7 additions & 7 deletions setup.py
@@ -1,30 +1,30 @@
from setuptools import setup, find_packages
from setuptools import setup

long_description = """
A robust tool for advanced AI text generation.
A robust Python tool for text-based AI training and generation using GPT-2.
"""


setup(
name="aitextgen",
packages=["aitextgen"], # this must be the same as the name above
version="0.1",
description="A robust tool for advanced AI text generation using Transformers.",
description="A robust Python tool for text-based AI training and generation using GPT-2.",
long_description=long_description,
long_description_content_type="text/markdown",
author="Max Woolf",
author_email="[email protected]",
url="https://github.com/minimaxir/aitextgen",
keywords=["wordcloud", "data visualization", "text cool stuff"],
keywords=["gpt-2", "gpt2", "text generation", "ai"],
classifiers=[],
license="MIT",
entry_points={"console_scripts": ["aitextgen=aitextgen.cli:aitextgen_cli"]},
python_requires=">=3.6",
include_package_data=True,
install_requires=[
"transformers>=2.9.0",
"fire",
"transformers>=2.9.1",
"fire>=0.3.0",
"msgpack",
"pytorch-lightning>=0.7.5",
"pytorch-lightning>=0.7.6",
],
)
