- Use `.yml` extension for GitHub Actions workflows consistently (#40).
- Use isort and ruff to sort imports and format Python code (#41).
- Add `TorchDiskDataset` class to support using `.pt` or `.pth` files as inputs for `fit_model()` and `fit_model_distributed()` (#38). Similar to `NumpyDiskDataset` added in tinytopics 0.6.0, this class also uses memory-mapped mode to load data, so that datasets larger than system memory can be used for training.
- Add distributed training speed and cost metrics on 8x A100 (40 GB SXM4) to the distributed training article (#34). This supplements the existing 1x H100 and 4x H100 metrics.
- Add unit tests for `fit_model_distributed()` (#35).
- Add pytest-cov to development dependencies (#35).
- Add `fit_model_distributed()` to support distributed training using Hugging Face Accelerate. See the distributed training article for details (#32).
- Use `tqdm.auto` for better progress bar visuals when used in notebooks (#30).
- Move dataset classes and loss functions into dedicated modules to improve code structure and reusability (#31).
- `fit_model()` now supports using PyTorch `Dataset` as input, in addition to in-memory tensors. This allows fitting topic models on data larger than GPU VRAM or system RAM. The `NumpyDiskDataset` class is added to read `.npy` document-term matrices from disk on-demand (#26).
- Added a memory-efficient training article demonstrating the new features for fitting topic models on large datasets (#27).
- Add badges for CI tests and mkdocs workflows to `README.md` (#24).
- Add PyTorch management guide link for uv to `README.md` (735fcca).
- Use hatchling 1.26.3 in `pyproject.toml` to work around `rye publish` errors (c56387c).
- Increased the speed of `generate_synthetic_data()` significantly by using direct mixture sampling, which leverages the properties of multinomial distributions (#21). This change makes simulating data at the scale of 100K x 100K more feasible. Although the approaches before and after are mathematically equivalent, the data generated with the same seed in previous versions and this version onward will be bitwise different.
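As an illustration of the idea only (not the tinytopics implementation, which may differ), direct mixture sampling can be sketched in plain Python: instead of drawing a topic for every word and then a term from that topic, collapse the document's topic proportions into a single term distribution and sample all counts from it. The function names and the toy inputs below are hypothetical.

```python
import random
from collections import Counter


def mixture_probs(doc_topics, topic_terms):
    """Collapse one document's topic mixture into a single term distribution."""
    n_terms = len(topic_terms[0])
    return [
        sum(w * topic[j] for w, topic in zip(doc_topics, topic_terms))
        for j in range(n_terms)
    ]


def sample_counts(doc_topics, topic_terms, n_words, rng):
    """Draw term counts for one document from the collapsed distribution.

    Counting n_words i.i.d. categorical draws is a multinomial draw.
    """
    probs = mixture_probs(doc_topics, topic_terms)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_words)
    tally = Counter(draws)
    return [tally.get(j, 0) for j in range(len(probs))]


# Toy inputs (assumed for illustration): 2 topics over 3 terms.
rng = random.Random(42)
doc_topics = [0.7, 0.3]
topic_terms = [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]
probs = mixture_probs(doc_topics, topic_terms)
counts = sample_counts(doc_topics, topic_terms, n_words=1000, rng=rng)
```

Because the random number generator is consumed in a different order than in a per-word scheme, the same seed yields different draws even though the distribution of the counts is unchanged.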
- Use `pip` and `python3` in command line instructions consistently.
- tinytopics now requires Python >= 3.10 to use PEP 604 style shorthand syntax for union and optional types (#14).
- Refactor type hints to use more base abstract classes, making them less limiting to specific implementations (#14).
- Add unit tests for all functions using pytest, with a GitHub Actions workflow to run tests under Linux and Windows (#18).
- Update articles to simplify import syntax using `import tinytopics as tt` (#16).
- Close precise figure handles in plot functions instead of the current figure (#18).
- Plot functions now correctly use string and list type color palette inputs when specified (do not call them as functions) (#18).
- Refactor the code to use a more functional style and add type hints to improve code clarity (#9).
- Add `scale_color_tinytopics()` to support coloring an arbitrary number of topics (#4).
- Simplify hyperparameter tuning by adopting modern stochastic gradient methods. `fit_model()` now uses a combination of the AdamW optimizer (with weight decay) and the cosine annealing (with warm restarts) scheduler (#2).
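To make the scheduler's behavior concrete, the sketch below reproduces only the cosine annealing with warm restarts formula in plain Python. PyTorch provides this as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, alongside `torch.optim.AdamW`. The parameter names mirror that scheduler's `T_0`, `T_mult`, and `eta_min`, and the values used here are illustrative assumptions, not the defaults of `fit_model()`.

```python
import math


def cosine_warm_restart_lr(step, base_lr, eta_min=0.0, t_0=10, t_mult=1):
    """Learning rate at integer `step` for cosine annealing with warm restarts.

    The rate decays from `base_lr` to `eta_min` along a half cosine over a
    cycle of length t_0, then jumps back to `base_lr`. Each following cycle
    is `t_mult` times longer than the previous one (t_mult must be >= 1).
    """
    t_i = t_0      # length of the current cycle
    t_cur = step   # steps elapsed since the last restart
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


# Three cycles of 10 steps each, restarting at steps 10 and 20.
schedule = [cosine_warm_restart_lr(s, base_lr=0.01, t_0=10) for s in range(30)]
```

The periodic jump back to the base rate is what distinguishes warm restarts from a single cosine decay.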
- Fix "Structure plot" y-axis range issue by adding a `normalize_rows` argument to `plot_structure()` for normalizing rows so that they all sum exactly to 1, and explicitly setting the y-axis limit to [0, 1] (#1).
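The row normalization described above amounts to dividing each row by its sum. A minimal sketch in plain Python follows; the helper name is hypothetical, and the handling of all-zero rows is an assumption rather than documented `plot_structure()` behavior.

```python
def normalize_each_row(matrix):
    """Rescale each row to sum to 1; all-zero rows are returned unchanged."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total for x in row] if total else list(row))
    return out


# Rows that do not sum to 1 would otherwise overshoot or undershoot
# a y-axis fixed to [0, 1] in a stacked bar plot.
weights = [[2.0, 1.0, 1.0], [0.2, 0.2, 0.1]]
normalized = normalize_each_row(weights)
```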
- Add text data topic modeling example article (#7).
- Reorder arguments in plotting functions to follow conventions.
- Reduce the minimum version requirement for all dependencies in `pyproject.toml`.
- Add more details on PyTorch installation in `README.md`.
- Improve text quality in articles.
- Add `CHANGELOG.md` to record changes.
- Add essential metadata to `pyproject.toml`.
- First version.