WIP:DataScan refactor to expose data and statistics consistently #94

Open
wants to merge 86 commits into
base: main
86 commits
f3e20c7
Split physical data and GT data functionality up
tylerriccio33 Mar 15, 2025
91f63f4
Unify datascan summary data access in property
tylerriccio33 Mar 15, 2025
f392e64
Begin refactor of datascan to expose internal data
tylerriccio33 Mar 16, 2025
e1a20b3
add dedicated profile file
tylerriccio33 Mar 16, 2025
9cba8b2
Fully implement statistics calculation
tylerriccio33 Mar 16, 2025
fc183d8
Adding everything for check in
tylerriccio33 Mar 17, 2025
0a2f565
remove lockfile
tylerriccio33 Mar 17, 2025
b22dacb
account for panic exceptions specially
tylerriccio33 Mar 19, 2025
06b5b3b
add conftest quality of life
tylerriccio33 Mar 19, 2025
906cbf3
Add chatlas to dev dependency
tylerriccio33 Mar 20, 2025
8cc3fe3
add dict pivot function
tylerriccio33 Mar 20, 2025
eb44d17
complete data exposure and partially complete html creation
tylerriccio33 Mar 20, 2025
5f9db0e
ensure debugger stops in conftest
tylerriccio33 Mar 20, 2025
f42de7b
add more parametric testing
tylerriccio33 Mar 20, 2025
94e40b1
consolidate html creation
tylerriccio33 Mar 20, 2025
8589f4d
make table column generation dynamic with analysis
tylerriccio33 Mar 21, 2025
ff41c60
type hint mapping instead of dict on utils
tylerriccio33 Mar 23, 2025
a95b42a
consolidate dataframe creation and html
tylerriccio33 Mar 23, 2025
3043c57
add dedicated statistics profile
tylerriccio33 Mar 23, 2025
e94a6a3
implement stat profile classes
tylerriccio33 Mar 23, 2025
3e6ae95
increase random examples
tylerriccio33 Mar 23, 2025
b5c4c17
add possible depracation to test previoew
tylerriccio33 Mar 23, 2025
2362c42
remove df lib narwhals/ibis branch
tylerriccio33 Mar 23, 2025
63e96b8
add .swp to git ignore
tylerriccio33 Mar 23, 2025
d2ecaae
inline colname/coltype gt formatting
tylerriccio33 Mar 23, 2025
4ff3afa
resolve pytest settings
tylerriccio33 Mar 23, 2025
c6339ec
Fix left borders
tylerriccio33 Mar 23, 2025
766d8e5
Merge branch 'main' into proc-compare
tylerriccio33 Mar 23, 2025
9b1828d
Delete uv.lock
tylerriccio33 Mar 23, 2025
3a9ff99
remove uv from makefile test
tylerriccio33 Mar 24, 2025
87f8a4e
add pytest-randomly to workflow test dependancies
tylerriccio33 Mar 24, 2025
da59a46
Update .github/workflows/ci-tests.yaml
rich-iannone Mar 24, 2025
fde4f28
allow dataframe construction with multiple types
tylerriccio33 Mar 27, 2025
e5d8624
add numeric and fraction formatting
tylerriccio33 Mar 27, 2025
19f5bdb
implement fully lazyframe support
tylerriccio33 Mar 28, 2025
0d25ea5
fix lockfile
tylerriccio33 Mar 28, 2025
6b86be9
implement stat level labeling
tylerriccio33 Mar 28, 2025
6f244bd
allow json returns to be non consistent types
tylerriccio33 Mar 28, 2025
bb3d9df
remove defunct metadata class
tylerriccio33 Mar 28, 2025
b1c2302
remove defunct py_type attr
tylerriccio33 Mar 28, 2025
8a6fc11
remove defunct todos and uneccessary nw conversions
tylerriccio33 Mar 28, 2025
6fcc860
implement IQR
tylerriccio33 Mar 28, 2025
33564ac
optional sample data in tabular reporting
tylerriccio33 Mar 29, 2025
4194ef2
fix borders
tylerriccio33 Mar 29, 2025
6f2965c
fix bool calculation and formatting
tylerriccio33 Mar 29, 2025
b1c8373
drop all null stats
tylerriccio33 Mar 29, 2025
2c1626b
annotate row count
tylerriccio33 Mar 29, 2025
da17430
fix inconsistent handling of false counts across libraries
tylerriccio33 Mar 30, 2025
aaf285d
remove html snapshot tests as it's no longer exactly the same
tylerriccio33 Mar 30, 2025
8e92baf
fix dataframe implementation polars specific
tylerriccio33 Mar 30, 2025
63d2f4b
clean up TODOs, extract label map creater
tylerriccio33 Mar 30, 2025
0079f9b
clean up stat classes casting
tylerriccio33 Mar 30, 2025
45f9dc7
standardize extension of stats instead of constant appends
tylerriccio33 Mar 30, 2025
3e13d58
add more tests for developer error in statistics
tylerriccio33 Mar 30, 2025
8de208f
add iqr to string length
tylerriccio33 Mar 30, 2025
3c0804a
add pandas and arrow strategies
tylerriccio33 Mar 30, 2025
7218e96
move min/max to structure
tylerriccio33 Mar 30, 2025
832ac2e
remove pandas datetime tests
tylerriccio33 Mar 30, 2025
524ad9c
generically type datascan init
tylerriccio33 Mar 30, 2025
a2eaa73
add deterministic test of values
tylerriccio33 Mar 30, 2025
b81dd39
add col table summry examples
tylerriccio33 Mar 30, 2025
6d225cc
decrease rounding decimals in visual
tylerriccio33 Mar 30, 2025
2c66c0a
remove metadata only test
tylerriccio33 Mar 30, 2025
3b75f7f
re-impliment assistant scan endpoint
tylerriccio33 Apr 1, 2025
eb91321
implement ibis intake
tylerriccio33 Apr 2, 2025
fbae108
rename UQ and NA
tylerriccio33 Apr 5, 2025
d8ea301
SD and mean in their own category
tylerriccio33 Apr 5, 2025
5c10c98
fix order of min/max
tylerriccio33 Apr 5, 2025
97018bf
IQR in dedicated cat
tylerriccio33 Apr 5, 2025
4a43125
align column values to right and headers to center
tylerriccio33 Apr 5, 2025
d6bff9d
make calculations aware of column name
tylerriccio33 Apr 5, 2025
d8890ff
Implement frequencies as a skeleton concept; incorporating bools
tylerriccio33 Apr 14, 2025
b5b5c15
drop trailing dec marks and zeros
tylerriccio33 Apr 16, 2025
8f0330d
early terminations in compact fmt
tylerriccio33 Apr 19, 2025
9f4b8e0
impliment custom fraction formatting
tylerriccio33 Apr 19, 2025
0d94ea7
put min/max in dedicated stat group called bounds
tylerriccio33 Apr 19, 2025
63cb101
annotate min/max should return to descr at some point
tylerriccio33 Apr 19, 2025
307d0a5
fix unformatted t/f percentages
tylerriccio33 Apr 19, 2025
1c1db37
move _fmt_frac to utils
tylerriccio33 Apr 20, 2025
a5282fa
account for 0s in fraction formatting
tylerriccio33 Apr 20, 2025
f0feae0
bug where _fmt_frac does not convert to series
tylerriccio33 Apr 20, 2025
507bbb2
account for arbitrary types in integer formatting
tylerriccio33 Apr 20, 2025
a7dc59b
account for tables that don't need freqs
tylerriccio33 Apr 20, 2025
68b7e65
remove testing artifact
tylerriccio33 Apr 20, 2025
614b3a1
add datetime and date formatting to gt
tylerriccio33 Apr 26, 2025
6c12b51
add SL footnote
tylerriccio33 Apr 27, 2025
2 changes: 1 addition & 1 deletion .github/workflows/ci-tests.yaml
@@ -29,7 +29,7 @@ jobs:
           pip install -e '.[dev]'
       - name: Install test dependencies
         run: |
-          pip install pytest pytest-cov pytest-snapshot pandas polars ibis-framework[duckdb,mysql,postgres,sqlite]>=9.5.0 chatlas shiny
+          pip install pytest pytest-randomly pytest-cov pytest-snapshot pandas polars ibis-framework[duckdb,mysql,postgres,sqlite]>=9.5.0 chatlas shiny hypothesis
       - name: pytest unit tests
         run: |
           make test
Expand Down
1 change: 1 addition & 0 deletions .gitignore
@@ -124,3 +124,4 @@ datasets/
/*.parquet
/*.csv
.ruff_cache
.swp
4 changes: 3 additions & 1 deletion Makefile
@@ -1,7 +1,9 @@
 .PHONY: check

 test:
-	pytest --cov=pointblank --cov-report=xml
+	pytest --cov=pointblank --cov-report=xml \
+	  --randomly-seed=12301998
+

 test-update:
 	pytest --snapshot-update
Expand Down
65 changes: 65 additions & 0 deletions pointblank/_datascan_utils.py
@@ -0,0 +1,65 @@
from __future__ import annotations

from math import floor, log10
from typing import TYPE_CHECKING

from great_tables.vals import fmt_integer, fmt_number, fmt_scientific

if TYPE_CHECKING:
    pass


def _round_to_sig_figs(value: float, sig_figs: int) -> float:
    if value == 0:
        return 0
    return round(value, sig_figs - int(floor(log10(abs(value)))) - 1)


def _compact_integer_fmt(value: float | int) -> str:
    if value == 0:
        formatted = "0"
    elif abs(value) >= 1 and abs(value) < 10_000:
        formatted = fmt_integer(value, use_seps=False)[0]
    else:
        formatted = fmt_scientific(value, decimals=1, exp_style="E1")[0]

    return formatted


def _compact_decimal_fmt(value: float | int) -> str:
    if value == 0:
        formatted = "0.00"
    elif abs(value) < 1 and abs(value) >= 0.01:
        formatted = fmt_number(value, decimals=2)[0]
    elif abs(value) < 0.01:
        formatted = fmt_scientific(value, decimals=1, exp_style="E1")[0]
    elif abs(value) >= 1 and abs(value) < 1000:
        formatted = fmt_number(value, n_sigfig=3)[0]
    elif abs(value) >= 1000 and abs(value) < 10_000:
        formatted = fmt_number(value, decimals=0, use_seps=False)[0]
    else:
        formatted = fmt_scientific(value, decimals=1, exp_style="E1")[0]

    return formatted


def _compact_0_1_fmt(value: float | int | None) -> str | None:
    if value is None:
        return value

    if value == 0:
        return " 0.00"

    if value == 1:
        return " 1.00"

    if abs(value) < 1 and abs(value) >= 0.01:
        return " " + fmt_number(value, decimals=2)[0]

    if abs(value) < 0.01:
        return "<0.01"

    if abs(value) > 0.99:
        return ">0.99"

    return fmt_number(value, n_sigfig=3)[0]
31 changes: 31 additions & 0 deletions pointblank/_utils.py
@@ -2,6 +2,7 @@

import inspect
import re
from collections import defaultdict
from typing import TYPE_CHECKING, Any

import narwhals as nw
@@ -12,9 +13,28 @@
from pointblank._constants import ASSERTION_TYPE_METHOD_MAP, GENERAL_COLUMN_TYPES

if TYPE_CHECKING:
    from collections.abc import Mapping

    from pointblank._typing import AbsoluteBounds, Tolerance


def transpose_dicts(list_of_dicts: list[dict[str, Any]]) -> dict[str, list[Any]]:
    if not list_of_dicts:
        return {}

    # Get all unique keys across all dictionaries
    all_keys = set()
    for d in list_of_dicts:
        all_keys.update(d.keys())

    result = defaultdict(list)
    for d in list_of_dicts:
        for key in all_keys:
            result[key].append(d.get(key))  # None is default for missing keys

    return dict(result)


def _derive_single_bound(ref: int, tol: int | float) -> int:
    """Derive a single bound using the reference."""
    if not isinstance(tol, float | int):
@@ -750,3 +770,14 @@ def _format_to_float_value(
    formatted_vals = _get_column_of_values(gt, column_name="x", context="html")

    return formatted_vals[0]


def _pivot_to_dict(col_dict: Mapping[str, Any]):  # TODO : Type hint and unit test
    result_dict = {}
    for col, sub_dict in col_dict.items():
        for key, value in sub_dict.items():
            # add columns fields not present
            if key not in result_dict:
                result_dict[key] = [None] * len(col_dict)
            result_dict[key][list(col_dict.keys()).index(col)] = value
    return result_dict
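The two dictionary helpers added in this file can be exercised in isolation. The sketch below reproduces `transpose_dicts` and `_pivot_to_dict` from the diff and runs them on small inputs; the sample data (`rows`, `per_column`) is hypothetical and only meant to show how missing keys are padded with `None`.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any, Mapping


def transpose_dicts(list_of_dicts: list[dict[str, Any]]) -> dict[str, list[Any]]:
    # Reproduced from the diff: one output list per key, padded with None.
    if not list_of_dicts:
        return {}
    all_keys = set()
    for d in list_of_dicts:
        all_keys.update(d.keys())
    result = defaultdict(list)
    for d in list_of_dicts:
        for key in all_keys:
            result[key].append(d.get(key))
    return dict(result)


def _pivot_to_dict(col_dict: Mapping[str, Any]):
    # Reproduced from the diff: outer keys become list positions,
    # inner keys become the keys of the result.
    result_dict = {}
    for col, sub_dict in col_dict.items():
        for key, value in sub_dict.items():
            if key not in result_dict:
                result_dict[key] = [None] * len(col_dict)
            result_dict[key][list(col_dict.keys()).index(col)] = value
    return result_dict


# Hypothetical inputs, not taken from the PR.
rows = [{"colname": "a", "mean": 1.5}, {"colname": "b"}]
per_column = {"a": {"mean": 1.5, "min": 0}, "b": {"mean": 2.0}}

transposed = transpose_dicts(rows)
pivoted = _pivot_to_dict(per_column)
print(transposed)
print(pivoted)
```

Because `all_keys` is a set, the key order of the dictionary returned by `transpose_dicts` is not guaranteed; only the per-key lists follow the input order.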
40 changes: 40 additions & 0 deletions pointblank/_utils_html.py
@@ -1,9 +1,49 @@
from __future__ import annotations

from typing import Any

from great_tables import html

from pointblank._constants import TABLE_TYPE_STYLES
from pointblank._utils import _format_to_integer_value


def _fmt_frac(vec) -> list[str | None]:
    res: list[str | None] = []
    for x in vec:
        if x is None:
            res.append(x)
            continue

        if x == 0:
            res.append("0")
            continue

        if x < 0.01:
            res.append("<.01")
            continue

        try:
            intx: int = int(x)
        except ValueError:  # generic object, ie. NaN
            res.append(str(x))
            continue

        if intx == x:  # can remove trailing 0s w/o loss
            res.append(str(intx))
            continue

        res.append(str(round(x, 2)))

    return res


def _make_sublabel(major: str, minor: str) -> Any:
    return html(
        f'{major!s}<span style="font-size: 0.75em; vertical-align: sub; position: relative; line-height: 0.5em;">{minor!s}</span>'
    )


def _create_table_type_html(
    tbl_type: str | None, tbl_name: str | None, font_size: str = "10px"
) -> str:
Expand Down
4 changes: 1 addition & 3 deletions pointblank/assistant.py
@@ -176,9 +176,7 @@ def assistant(
     if data is not None:
         scan = DataScan(data=data)

-        scan_dict = scan.to_dict()
-
-        tbl_type = scan_dict["tbl_type"]
+        tbl_type: str = scan.profile.implementation.name.lower()
         tbl_json = scan.to_json()

         if tbl_name is not None:
Expand Down
27 changes: 27 additions & 0 deletions pointblank/compare.py
@@ -0,0 +1,27 @@
from __future__ import annotations

from typing import TYPE_CHECKING

from pointblank import DataScan

if TYPE_CHECKING:
    from narwhals.typing import IntoFrame


class Compare:
    def __init__(self, a: IntoFrame, b: IntoFrame) -> None:
        self.a: IntoFrame = a
        self.b: IntoFrame = b

    def compare(self) -> None:
        ## Scan both frames
        self._scana = DataScan(self.a)
        self._scanb = DataScan(self.b)

        ## Get summary outs
        summarya = self._scana.summary_data
        summaryb = self._scanb.summary_data

        summarya.columns

        self._scana.profile