Changes from 3 commits
100 changes: 8 additions & 92 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,29 +10,12 @@

Khisto is a Python library for creating histograms using the **Khiops optimal binning algorithm**. Unlike standard histograms that use fixed-width bins or simple heuristics, Khisto automatically determines the optimal number of bins and their variable widths to best represent the underlying data distribution.

## Features

- **Optimal Binning**: Uses the MODL (Minimum Optimized Description Length) principle to find the best discretization.
- **Variable-Width Bins**: Captures dense regions with fine bins and sparse regions with wider bins.
- **NumPy Compatible**: Drop-in replacement for `numpy.histogram`.
- **Matplotlib Integration**: `khisto.matplotlib.hist` works like `plt.hist`.
- **Core Histogram API**: Inspect every available granularity with `khisto.core.compute_histograms` and `HistogramResult`.
- **Minimal Dependencies**: Only requires NumPy (matplotlib optional for plotting).
Documentation is available at **[khiops.github.io/khisto-python](https://khiopsml.github.io/khisto-python/)**.

| Standard Gaussian | Heavy-tailed Pareto |
| --- | --- |
| ![Adaptive Gaussian histogram](docs/images/gaussian-quick-start.png) | ![Adaptive Pareto histogram](docs/images/pareto-quick-start.png) |

## Reproducing The Example Distributions

The complete runnable script is available in `scripts/generate_distribution_examples.py`.

Run it from the repository root to regenerate both example distributions and the figure files used in this README:

```bash
python scripts/generate_distribution_examples.py
```

## Installation

```bash
Expand All @@ -47,85 +30,28 @@ pip install "khisto[matplotlib]"

## Quick Start

### NumPy-like API

```python
import numpy as np
from khisto import histogram

# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

# Compute optimal histogram (drop-in replacement for np.histogram)
hist, bin_edges = histogram(data)

# With density normalization
density, bin_edges = histogram(data, density=True)

# Limit maximum number of bins
hist, bin_edges = histogram(data, max_bins=10)

# Specify range
hist, bin_edges = histogram(data, range=(-2, 2))
```

Using 10,000 samples keeps the adaptive refinement visible while remaining fast to compute.

Heavy-tailed example:

```python
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

# Generate 10,000 samples from a Pareto distribution, shifted to start at 1 for better log-log visualization
shape = 3
long_tail_data = np.random.pareto(shape, size=10000) + 1
# Generate 10,000 samples from a Pareto distribution
long_tail_data = np.random.pareto(3, size=10000)

# Plot an adaptive histogram on logarithmic axes.
n, bins, patches = hist(long_tail_data, density=True)
plt.xscale("log")
plt.xscale("symlog")
plt.yscale("log")
plt.show()
```

### Matplotlib Integration

```python
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

# Density is usually the most interpretable view with variable-width bins.
n, bins, patches = hist(data, density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
# Generate 10,000 samples from a Normal distribution
normal_data = np.random.normal(size=10000)

# Cumulative density follows matplotlib semantics.
n, bins, patches = hist(data, density=True, cumulative=True)
plt.ylabel('Cumulative probability')
# Plot an adaptive histogram
n, bins, patches = hist(normal_data, density=True)
plt.show()
```

## How It Works

Khisto uses the Khiops optimal binning algorithm based on the MODL (Minimum Optimized Description Length) principle. Instead of using fixed-width bins like traditional histograms, it:

1. Analyzes the data distribution
2. Finds bin boundaries that minimize information loss
3. Creates variable-width bins that adapt to data density

This results in histograms that better represent the underlying distribution, with finer bins in dense regions and wider bins in sparse regions.
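
With variable-width bins, raw counts are hard to compare across bins, which is why the density view matters. The sketch below is not Khisto code: it is a plain-Python illustration, with a hypothetical helper name `density_from_counts` and made-up counts and edges, of the normalization convention that `numpy.histogram(..., density=True)` uses, where each count is divided by the total count times the bin width.

```python
def density_from_counts(counts, edges):
    """Convert per-bin counts into densities for variable-width bins.

    Hypothetical helper for illustration only; it is not part of Khisto.
    """
    if len(edges) != len(counts) + 1:
        raise ValueError("edges must have exactly one more entry than counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    # density_i = count_i / (total * width_i)
    return [
        c / (total * (hi - lo))
        for c, lo, hi in zip(counts, edges[:-1], edges[1:])
    ]


# Made-up example: a narrow, dense middle bin between two wider, sparse bins.
counts = [2, 6, 2]
edges = [0.0, 1.0, 1.5, 3.5]
density = density_from_counts(counts, edges)

# The area under the density (sum of density * width) should equal 1.
area = sum(d * (hi - lo) for d, lo, hi in zip(density, edges[:-1], edges[1:]))
print(density, area)
```

The first and last bins hold the same count, but the last bin is twice as wide, so its density is half as large. This is the effect a count-based plot would hide.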

The method implemented in Khiops is comprehensively detailed in [2] and further extended in [1].

- [1] M. Boullé. Floating-point histograms for exploratory analysis of large scale real-world data sets. Intelligent Data Analysis, 28(5):1347-1394, 2024
- [2] V. Zelaya Mendizábal, M. Boullé, F. Rossi. Fast and fully-automated histograms for large-scale data sets. Computational Statistics & Data Analysis, 180:0-0, 2023

## Development

```bash
Expand All @@ -140,16 +66,6 @@ uv sync --group dev --extra all
uv run pytest
```

## Documentation

Full documentation is hosted at **[khiops.github.io/khisto-python](https://khiops.github.io/khisto-python/)**.

- [API Reference](https://khiops.github.io/khisto-python/array/histogram/index.html) — NumPy-like histogram API
- [Matplotlib Integration](https://khiops.github.io/khisto-python/matplotlib/index.html) — `hist` plotting function
- [Core API](https://khiops.github.io/khisto-python/core/index.html) — full access to histogram granularity levels
- [API Comparison](https://khiops.github.io/khisto-python/api_comparison.html) — side-by-side with NumPy and Matplotlib
- [Demo Notebook](https://khiops.github.io/khisto-python/demo.html) — interactive walkthrough

## License

[BSD 3-Clause Clear License](LICENSE)
611 changes: 294 additions & 317 deletions docs/demo.ipynb

Large diffs are not rendered by default.

14 changes: 2 additions & 12 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ Get started
data = np.random.normal(0, 1, 10_000)
hist, bin_edges = histogram(data) # optimal bins, no guessing

.. grid:: 1 1 3 3
.. grid:: 1 1 2 2
:gutter: 3
:class-container: sd-mt-3

Expand All @@ -68,16 +68,6 @@ Get started
``compute_histograms`` exposes every granularity level so you can
pick the resolution that suits your analysis.

.. grid:: 1 1 2 2
:gutter: 3
:class-container: sd-mt-1

.. grid-item-card:: :octicon:`git-compare;1.5em` API comparison
:link: api_comparison
:link-type: doc

Side-by-side parameter tables for NumPy, Matplotlib, and Khisto.

.. grid-item-card:: :octicon:`play;1.5em` Interactive demo
:link: demo
:link-type: doc
Expand All @@ -90,8 +80,8 @@ Get started
:hidden:

Histograms <array/histogram/index>
Core <core/index>
Matplotlib <matplotlib/index>
Core <core/index>

.. toctree::
:maxdepth: 2
Expand Down
31 changes: 16 additions & 15 deletions sandbox/khisto_demo.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -130,7 +130,7 @@
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mValueError\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[4]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m khisto \u001b[38;5;28;01mimport\u001b[39;00m matplotlib\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m matplotlib.hist([data, [\u001b[32m1\u001b[39m, \u001b[32m2\u001b[39m, \u001b[32m3\u001b[39m], [\u001b[32m2\u001b[39m,\u001b[32m2\u001b[39m,\u001b[32m2\u001b[39m,\u001b[32m2\u001b[39m]], max_bins=\u001b[32m20\u001b[39m, alpha=\u001b[32m0.5\u001b[39m)\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/python/khisto-python/src/khisto/matplotlib/hist.py:122\u001b[39m, in \u001b[36mhist\u001b[39m\u001b[34m(x, range, max_bins, density, cumulative, histtype, orientation, log, color, label, ax, edgecolor, linewidth, alpha, **kwargs)\u001b[39m\n\u001b[32m 119\u001b[39m ax = plt.gca()\n\u001b[32m 121\u001b[39m \u001b[38;5;66;03m# Compute histogram using khisto\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m122\u001b[39m hist_values, bin_edges = \u001b[30;43mkhisto_histogram\u001b[39;49m\u001b[30;43m(\u001b[39;49m\n\u001b[32m 123\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mx\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mrange\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mrange\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mmax_bins\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mmax_bins\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mdensity\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mdensity\u001b[39;49m\n\u001b[32m 124\u001b[39m \u001b[30;43m\u001b[39;49m\u001b[30;43m)\u001b[39;49m\n\u001b[32m 125\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m cumulative_mode != \u001b[32m0\u001b[39m:\n\u001b[32m 126\u001b[39m hist_values = _apply_cumulative(\n\u001b[32m 127\u001b[39m hist_values,\n\u001b[32m 128\u001b[39m bin_edges,\n\u001b[32m 129\u001b[39m density=density,\n\u001b[32m 130\u001b[39m reverse=cumulative_mode < \u001b[32m0\u001b[39m,\n\u001b[32m 131\u001b[39m )\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/python/khisto-python/src/khisto/matplotlib/hist.py:122\u001b[39m, in \u001b[36mhist\u001b[39m\u001b[34m(x, range, max_bins, density, cumulative, histtype, orientation, log, color, label, ax, edgecolor, linewidth, alpha, **kwargs)\u001b[39m\n\u001b[32m 119\u001b[39m ax = plt.gca()\n\u001b[32m 121\u001b[39m \u001b[38;5;66;03m# Compute histogram using khisto\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m122\u001b[39m hist_values, bin_edges = \u001b[30;43mkhisto_histogram\u001b[39;49m\u001b[30;43m(\u001b[39;49m\u001b[30;43mx\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mrange\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mrange\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mmax_bins\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mmax_bins\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mdensity\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mdensity\u001b[39;49m\u001b[30;43m)\u001b[39;49m\n\u001b[32m 123\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m cumulative_mode != \u001b[32m0\u001b[39m:\n\u001b[32m 124\u001b[39m hist_values = _apply_cumulative(\n\u001b[32m 125\u001b[39m hist_values,\n\u001b[32m 126\u001b[39m bin_edges,\n\u001b[32m 127\u001b[39m density=density,\n\u001b[32m 128\u001b[39m reverse=cumulative_mode < \u001b[32m0\u001b[39m,\n\u001b[32m 129\u001b[39m )\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/python/khisto-python/src/khisto/array/histogram/api.py:107\u001b[39m, in \u001b[36mhistogram\u001b[39m\u001b[34m(a, range, max_bins, density)\u001b[39m\n\u001b[32m 53\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mhistogram\u001b[39m(\n\u001b[32m 54\u001b[39m a: ArrayLike,\n\u001b[32m 55\u001b[39m \u001b[38;5;28mrange\u001b[39m: Optional[\u001b[38;5;28mtuple\u001b[39m[\u001b[38;5;28mfloat\u001b[39m, \u001b[38;5;28mfloat\u001b[39m]] = \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[32m 56\u001b[39m max_bins: Optional[\u001b[38;5;28mint\u001b[39m] = \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[32m 57\u001b[39m density: \u001b[38;5;28mbool\u001b[39m = \u001b[38;5;28;01mFalse\u001b[39;00m,\n\u001b[32m 58\u001b[39m ) -> \u001b[38;5;28mtuple\u001b[39m[NDArray[np.float64], NDArray[np.float64]]:\n\u001b[32m 59\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"Compute an optimal histogram using the Khiops binning algorithm.\u001b[39;00m\n\u001b[32m 60\u001b[39m \n\u001b[32m 61\u001b[39m \u001b[33;03m Parameters\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 105\u001b[39m \u001b[33;03m Analysis, 180:0-0, 2023.\u001b[39;00m\n\u001b[32m 106\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m107\u001b[39m arr = \u001b[30;43mnp\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43masarray\u001b[39;49m\u001b[30;43m(\u001b[39;49m\u001b[30;43ma\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mdtype\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mnp\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43mfloat64\u001b[39;49m\u001b[30;43m)\u001b[39;49m\n\u001b[32m 109\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m arr.ndim != \u001b[32m1\u001b[39m:\n\u001b[32m 110\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m 111\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mExpected 1-D array, got 
\u001b[39m\u001b[38;5;132;01m{\u001b[39;00marr.ndim\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m-D array instead. \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 112\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mReshape your data or flatten it before calling histogram.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 113\u001b[39m )\n",
"\u001b[31mValueError\u001b[39m: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part."
]
Expand All @@ -153,7 +153,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"id": "23c19584",
"metadata": {},
"outputs": [
Expand All @@ -174,7 +174,7 @@
" <a list of 3 BarContainer objects>)"
]
},
"execution_count": 5,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
Expand All @@ -195,7 +195,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"id": "2749c664",
"metadata": {},
"outputs": [
Expand All @@ -219,7 +219,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"id": "1cca2392",
"metadata": {},
"outputs": [
Expand All @@ -241,7 +241,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"id": "2ad6d7e5",
"metadata": {},
"outputs": [
Expand Down Expand Up @@ -273,7 +273,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 5,
"id": "b6c4ea8c",
"metadata": {},
"outputs": [
Expand All @@ -297,6 +297,7 @@
],
"source": [
"from khisto.matplotlib import hist\n",
"from khisto.matplotlib.hist import _hist\n",
"\n",
"# Basic histogram plot\n",
"fig, ax = plt.subplots(figsize=(8, 5))\n",
Expand All @@ -311,7 +312,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 6,
"id": "09479225",
"metadata": {},
"outputs": [
Expand Down Expand Up @@ -339,7 +340,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 8,
"id": "6c89bf07",
"metadata": {},
"outputs": [
Expand All @@ -366,7 +367,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 12,
"id": "25d8d0e5",
"metadata": {},
"outputs": [
Expand Down Expand Up @@ -403,7 +404,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 15,
"id": "d985437b",
"metadata": {},
"outputs": [
Expand Down Expand Up @@ -487,7 +488,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 18,
"id": "51179a02",
"metadata": {},
"outputs": [
Expand Down Expand Up @@ -521,7 +522,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 21,
"id": "e9bbabc9",
"metadata": {},
"outputs": [
Expand Down Expand Up @@ -558,7 +559,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 22,
"id": "1190f8aa",
"metadata": {},
"outputs": [
Expand Down Expand Up @@ -594,7 +595,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 23,
"id": "bf2ba150",
"metadata": {},
"outputs": [
Expand Down
13 changes: 9 additions & 4 deletions src/khisto/core/backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -235,13 +235,14 @@ def _process_histogram_file(file_path: Path) -> list[HistogramResult]:
]


def compute_histograms(x: np.ndarray) -> list[HistogramResult]:
def compute_histograms(x: NDArray[np.float64]) -> list[HistogramResult]:
"""Compute optimal histogram of an array using khisto CLI binary input.

Parameters
----------
x : np.ndarray
Array of numeric values.
x : NDArray[np.float64]
Array of numeric values. Only 1-dimensional arrays are supported.
Missing values (NaN) are filtered out.

Returns
-------
Expand All @@ -257,10 +258,14 @@ def compute_histograms(x: np.ndarray) -> list[HistogramResult]:
If input array is empty after filtering.
"""
x = np.asarray(x, dtype=np.float64)

if len(x) == 0:
raise ValueError("Input array is empty")

x = x[~np.isnan(x)]

if len(x) == 0:
raise ValueError("Input array is empty after filtering")
raise ValueError("Input array is empty after filtering missing values")

# Use delete=False so the files are closed before the subprocess reads them.
# On Windows, files keep an exclusive lock while open, whence,
Expand Down
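
The order of the two emptiness checks in the `compute_histograms` hunk above can be tried in isolation. The following is a minimal stand-alone sketch, not the library's implementation: `filter_missing` is a hypothetical name, and it uses plain Python lists with `math.isnan` instead of NumPy so that it runs without dependencies. Only the two error messages are taken from the diff.

```python
import math


def filter_missing(values):
    """Hypothetical stand-in mirroring the check order shown in the diff."""
    data = [float(v) for v in values]

    # First check: the caller passed nothing at all.
    if len(data) == 0:
        raise ValueError("Input array is empty")

    # Drop missing values (NaN), as the updated docstring describes.
    data = [v for v in data if not math.isnan(v)]

    # Second check: there was input, but every value was missing.
    if len(data) == 0:
        raise ValueError("Input array is empty after filtering missing values")

    return data


def error_message(values):
    """Return the ValueError message raised for `values`, or None."""
    try:
        filter_missing(values)
    except ValueError as exc:
        return str(exc)
    return None


print(filter_missing([1, float("nan"), 2]))
print(error_message([]))
print(error_message([float("nan")]))
```

Keeping the two messages distinct lets a caller tell an empty input apart from an input that contained only missing values.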