WIP: DataScan refactor to expose data and statistics consistently #94


Open
wants to merge 86 commits into main

Conversation

@tylerriccio33 (Contributor) commented Mar 16, 2025

This is going to be a big one... I'm working to expose the statistics logic in DataScan in a more object-oriented way. I'm planning to do three major things:

  • Create a more object-oriented interface to the column (and data) profiles with defined attributes. That way, IDEs can help us find things instead of us blindly remembering a key in some dict, and the static type checker can catch issues we might not immediately notice (see the sketch after this list).
  • Unify the ibis/narwhals interface to use basically the same statistics. I want to avoid branching and duplicated logic if possible. Maybe I can contribute to narwhals if things are genuinely missing.
  • Use a dataframe to do everything instead of pulling out series and calculating stats iteratively. Using dataframes gets us two things: it expands access to DataScan for libraries without series APIs, and it takes advantage of parallelism. For example, Polars can parallelize the stats ops and even share subplans (if we introduce laziness), which will dramatically improve performance.
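
For a sense of what the attribute-based interface could look like, here's a purely illustrative sketch (the `ColumnProfile` name and its fields are hypothetical, not the actual classes in this PR):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ColumnProfile:
    # Defined attributes instead of ad hoc dict keys, so IDEs and the static
    # type checker can see exactly what a profile contains.
    colname: str
    coltype: str
    n_missing: int
    n_unique: int
    mean: float | None = None
    std: float | None = None
    min: float | None = None
    max: float | None = None


profile = ColumnProfile(colname="d", coltype="Float64", n_missing=0, n_unique=12)
print(profile.n_missing)  # attribute access; a typo here is flagged by the type checker
```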

This is all a draft as of now I just wanted to clue you in.

Also, I don't plan on changing any tests and the interface will remain untouched!

@rich-iannone (Member)

@tylerriccio33 this is great, thanks for working on this key piece! Could you try merging with main and see if there are any conflicts? I did some recent work within this file just two days ago (merging #93). None of that work touched the DataScan class (check https://github.com/posit-dev/pointblank/pull/93/files), but it might be good to merge just in case, since there was also a previous merged PR (#88).

Also, is it okay if I release a new version of the package while you're working on this? Given that https://posit-dev.github.io/pointblank/reference/col_summary_tbl.html#pointblank.col_summary_tbl is at a good place (and it's in the docs), it makes sense for a new release.

@tylerriccio33 (Contributor, Author)

I will push what I have right now for reference, but it'll come with a ton of conflicts. I'm about 90% of the way there; I just have to unify the HTML rendering interfaces for each of the columns. To be clear, I'm not changing functionality, just unifying the way the comparison data is exposed so I can build on it for Compare.

So, I'll push what I have now, but I still have maybe a day or two to go until I think it's in a good spot. Sorry for the giant PR.

@tylerriccio33 (Contributor, Author)

Sorry, this is a comically large refactor; we might need to meet in person again to figure some stuff out.

@rich-iannone (Member)

This is great! Don’t worry (I’m not worried) and keep going :)

Send me an email anytime if you want to discuss anything through Zoom (we can schedule easily that way).

@tylerriccio33 (Contributor, Author)

I don't mean to hold you hostage on this refactor; it's virtually all complete. I just need to reroute all the HTML logic and add comprehensive test cases. I've been busy the last few nights, but tonight I should have this done and we'll be off to the races.

@tylerriccio33 (Contributor, Author)

  • Create a more object-oriented interface to the column (and data) profiles with defined attributes. That way, IDEs can help us find things instead of us blindly remembering a key in some dict, and the static type checker can catch issues we might not immediately notice.

This is there. I think it's a start that can be improved in the future; more bugs and ways to optimize the interface will be found once we add new statistics.

  • Unify the ibis/narwhals interface to use basically the same statistics. I want to avoid branching and duplicated logic if possible. Maybe I can contribute to narwhals if things are genuinely missing.

There's no more branching: the function takes something narwhals-compliant and returns as such.
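
In spirit, the unified path is something like the following (a minimal sketch using only the public narwhals API; `compute_basic_stats` and the particular statistics are illustrative, not the PR's actual function):

```python
import narwhals as nw
from narwhals.typing import IntoFrame


def compute_basic_stats(native_df: IntoFrame, column: str):
    # Accept anything narwhals-compliant, compute with narwhals expressions,
    # and never branch on the backing library.
    df = nw.from_native(native_df)
    stats = df.select(
        nw.col(column).mean().alias("mean"),
        nw.col(column).std().alias("std"),
        nw.col(column).min().alias("min"),
        nw.col(column).max().alias("max"),
    )
    return stats.to_native()  # hand the result back in the caller's own library
```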

  • Use a dataframe to do everything instead of pulling out series and calculating stats iteratively. Using dataframes gets us two things: it expands access to DataScan for libraries without series APIs, and it takes advantage of parallelism. For example, Polars can parallelize the stats ops and even share subplans (if we introduce laziness), which will dramatically improve performance.

This was easy, thankfully, but I need to improve the test coverage as it relates to lazy and eager narwhals-compliant frames. The code can cover lazy frames somewhat gracefully, but it isn't battle-tested.
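
As a rough illustration of the single-pass idea (the `summarize` helper here is hypothetical, not the PR's code): everything goes through one `select`, so a lazy engine like Polars can parallelize the aggregations and share subplans, and collection happens once at the end.

```python
import narwhals as nw
import polars as pl


def summarize(native_df, columns):
    # Build one expression list covering every column/statistic...
    exprs = []
    for col in columns:
        exprs.extend(
            [
                nw.col(col).min().alias(f"{col}_min"),
                nw.col(col).max().alias(f"{col}_max"),
                nw.col(col).mean().alias(f"{col}_mean"),
            ]
        )
    frame = nw.from_native(native_df)
    summary = frame.select(*exprs)
    # ...and only collect if the input was lazy.
    if isinstance(summary, nw.LazyFrame):
        summary = summary.collect()
    return summary.to_native()


# Lazy Polars input: the stats run as a single optimized query plan.
lf = pl.LazyFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(summarize(lf, ["a", "b"]))
```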

Currently, the big thing I'm still reconnecting is the HTML generation. The PR does generate the table, but it's missing a few of the amazing formatting features you had. I'm pretty close; I just need to reconnect and reapply the logic.

@rich-iannone (Member)

@tylerriccio33 Take your time, you got this! Honestly, no rush as I'm not touching DataScan-related things for a while anyway.

@tylerriccio33 (Contributor, Author) commented Apr 2, 2025

Ibis tables now work, but in order to reduce complexity, the table gets converted to either arrow, polars, or pandas; it will fail if it's not able to do so. This lets it fit with narwhals gracefully. What's your opinion on this? It saves a ton of branching but adds those libraries to the implied dependencies. I guess that's OK, because I think GT already requires one of them.

These tests all run:

```python
@given(happy_path_df | happy_path_ldf | _arrow_strat() | _pandas_strat())
@example(pb.load_dataset("small_table", "polars"))
@example(pb.load_dataset("small_table", "pandas"))
@example(pb.load_dataset("small_table", "duckdb"))
@example(pb.load_dataset("game_revenue", "polars"))
@example(pb.load_dataset("game_revenue", "pandas"))
@example(pb.load_dataset("game_revenue", "duckdb"))
@example(pb.load_dataset("nycflights", "polars"))
@example(pb.load_dataset("nycflights", "pandas"))
@example(pb.load_dataset("nycflights", "duckdb"))
@settings(deadline=None)
def test_col_summary_tbl(df):
    col_summary = col_summary_tbl(df)

    assert isinstance(col_summary, GT)
```

This is the "controversial" block:

```python
if as_native.implementation.name == "IBIS" and as_native._level == "lazy":
    assert isinstance(as_native, LazyFrame)  # help mypy

    ibis_native = as_native.to_native()

    valid_conversion_methods = ("to_pyarrow", "to_pandas", "to_polars")
    for conv_method in valid_conversion_methods:
        try:
            valid_native = getattr(ibis_native, conv_method)()
        except (NotImplementedError, ImportError, ModuleNotFoundError):
            continue
        break
    else:
        msg = (
            "To use `ibis` as input, you must have one of arrow, pandas, polars or numpy "
            "available in the process. Until `ibis` is fully supported by Narwhals, this is "
            "necessary. Additionally, the data must be collected in order to calculate some "
            "structural statistics, which may be performance detrimental."
        )
        raise ImportError(msg)
    as_native = nw.from_native(valid_native)

self.nw_data: Frame = nw.from_native(as_native)
```

@tylerriccio33 (Contributor, Author) commented Apr 2, 2025

The formatting of the code in that comment looks all out of whack for some reason, but you can see the changed lines in the commit anyways.

I'll get going on the table formatting, not worried :)

@rich-iannone (Member)

Hey! Going to provide thoughts on the dependencies. Overall, we're trying to keep them low, especially with respect to DataFrame libraries. We want people to bring their own data (i.e., not have to install both Polars and Pandas) and make the package work with whatever DF is provided. So we do careful things internally to determine the type of table object and act accordingly.

Unfortunately, right now in Great Tables you have to provide data in either Pandas or Polars (and we have that requirement here: you need at least one of those DF libraries). In the future we're hoping to have a no-dependency SimpleFrame with more limited capabilities, but adequate for packages like Pointblank that don't need a lot of features. Also, we want to remove other heavy dependencies from Great Tables, like NumPy.

So given these constraints, you might have to have a less elegant chunk of Ibis-specific code (which I know isn't great), but it'll at least be good for the user.

Hope this is all okay and reasonable. I know it's limiting from a development standpoint but being dependency light is something users will appreciate a lot.

@tylerriccio33 (Contributor, Author)

So you'd have to have either polars or pandas if you pass a duckdb-backed ibis table; everything else is fine. Since polars or pandas is already a requirement, I don't think my PR would upset anything?

@rich-iannone (Member)

On more careful reading, I think you’re totally fine! Apologies, because on first reading, I somehow thought the move was to add some hard dependencies.

Moral of the story is to not read GH comments so late at night :/

@tylerriccio33 (Contributor, Author)

[Screenshot: draft column summary table, 2025-04-20]

Alrighty, I think I'm close on the visual. I have five things that I've observed are different, two of which I could use a little help on.

  1. The datetime type is wrapping funny. I can't figure out how to make it small and flat flexibly.
  2. The datetime values themselves are too large. I'm not sure what to do here, since in the future I think the datetime should also get the full suite of stats, i.e. p5, q1, etc.
  3. I didn't put the SL back in. I think string lengths may be better indicated in another location, for example in the subtitle or a footnote, with text saying "stats x, y and z are the length of the string". Conversely, we could keep the SL and add that text? I'm worried the SL could crowd out the values.
  4. I removed some of the rounding. I actually have strong feelings on this: I don't agree with rounding values at the top/bottom end of the spectrum. For example, column d has a max of 9999.99 and it's visually rounded to 10k. I would (personally) really like to see the true max of that value; there could be serious value in seeing that the max is 9999.99 and not 10k. What do you think?
  5. I see the full suite of stats running for string lengths in the original version. Why is that? It does in my screenshot too. I'm happy to take it out, or account for this; just curious. I personally don't see a problem with IQR stats getting computed on string lengths.

@rich-iannone (Member) commented Apr 25, 2025

Apologies for the delay in responding to this. I'll provide some comments for each of the questions:

The datetime type is wrapping funny. I can't figure out how to make it small and flat flexibly.
The datetime values themselves are too large. I'm not sure what to do here, since in the future I think the datetime should also get the full suite of stats, i.e. p5, q1, etc.

Maybe what can be done is something like this: `2016-<wbr>01-04<br>00:32:00`? The `<wbr>` tag gives a hint to break at that character position (a plain `<br>` might also work fine; experimentation in a browser is needed). Then a datetime would fit on three lines while still looking readable. I agree datetime should get all the stats, and this formatting would help make it look presentable. The same sort of formatting could be applied to date objects.
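
If it helps, producing that markup is just string formatting over the datetime (a throwaway sketch, nothing GT- or Pointblank-specific):

```python
from datetime import datetime


def compact_datetime_html(dt: datetime) -> str:
    # Hint a break after the year with <wbr>, force one before the time with <br>.
    return f"{dt:%Y}-<wbr>{dt:%m-%d}<br>{dt:%H:%M:%S}"


print(compact_datetime_html(datetime(2016, 1, 4, 0, 32)))
# -> 2016-<wbr>01-04<br>00:32:00
```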

I didn't put the SL back in. I think string lengths may be better indicated in another location, for example in the subtitle or a footnote, with text saying "stats x, y and z are the length of the string". Conversely, we could keep the SL and add that text? I'm worried the SL could crowd out the values.

I think you're right, a message in the footnote about which entries use string-lengths as the measure would be more appropriate. I'm all for leaving out the SL text.

I removed some of the rounding. I actually have strong feelings on this: I don't agree with rounding values at the top/bottom end of the spectrum. For example, column d has a max of 9999.99 and it's visually rounded to 10k. I would (personally) really like to see the true max of that value; there could be serious value in seeing that the max is 9999.99 and not 10k. What do you think?

I think we should preserve as much of the original value as possible until we're out of space (running out of space is the main thing I always wanted to avoid). That looks like eight characters, including a comma and the decimal mark. Beyond that, I have an idea: include the full unrounded version as tooltip text (and we may have to add a note or some visual indication that values are rounded and that there is a precise value on hover).
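
One low-tech way to get the hover behavior would be a plain HTML `title` attribute on the rendered cell; how exactly it would plug into the table build is a separate question (sketch only, with a hypothetical helper name):

```python
def rounded_with_tooltip(exact_value: float, display_text: str) -> str:
    # Show the compact/rounded text, but keep the exact value available on hover.
    return f'<span title="{exact_value}">{display_text}</span>'


print(rounded_with_tooltip(9999.99, "10K"))
# -> <span title="9999.99">10K</span>
```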

I see the full suite of stats running for string lengths in the original version. Why is that? It does in my screenshot too. I'm happy to take it out, or account for this; just curious. I personally don't see a problem with IQR stats getting computed on string lengths.

I think I was just going by what other comparable data summary tools were doing. I noticed that the Positron data viewer only included min/max/median for string lengths. Coupled with the fact that the range is often low and integer-based, it just seemed a bit noisy/redundant to me. Happy to reconsider this one and just have them included for consistency!


Just wanted to add that this is looking great! And we can always make little adjustments as time goes on (I'm open to pretty much anything here).

@tylerriccio33 (Contributor, Author)

Do you think the datetime (and date) formatting could be handled by fmt_datetime from GT? Perhaps if there were a compaction option, or something interesting we could do with the formatting arguments? It would be nice to leverage the HTML logic that already exists rather than custom-build it.

https://posit-dev.github.io/great-tables/reference/GT.fmt_datetime.html
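
A rough sketch of what leaning on `fmt_datetime` might look like; the exact `date_style`/`time_style` values (and whether they can compact things enough) would need checking against the Great Tables docs, and the column name here is just an example:

```python
import polars as pl
from great_tables import GT

df = pl.DataFrame({"date_time": ["2016-01-04 00:32:00"]}).with_columns(
    pl.col("date_time").str.to_datetime()
)

# Let GT handle the datetime rendering rather than hand-building the HTML.
gt_tbl = GT(df).fmt_datetime(columns="date_time", date_style="iso", time_style="iso")
```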
