WIP: DataScan refactor to expose data and statistics consistently #94


Open
wants to merge 86 commits into main

Conversation

@tylerriccio33 (Contributor) commented Mar 16, 2025

This is going to be a big one... I'm working to expose the statistics logic in DataScan in a more object-oriented way. I'm planning to do three major things:

  • Create a more object-oriented interface to the column (and data) profiles with defined attributes. That way, IDEs can help us find things instead of us blindly remembering a key in some dict, and the static type checker can catch issues we might not immediately notice (see the sketch after this list).
  • Unify the ibis/narwhals interface to use basically the same statistics. I want to avoid branching and duplicated logic if possible. Maybe I can contribute to narwhals if things are genuinely missing.
  • Use a dataframe to do everything instead of pulling out series and calculating stats iteratively. Using dataframes gets us two things: it expands access to DataScan for libraries without series APIs, and it takes advantage of parallelism. For example, Polars can parallelize the stats ops and even share subplans (if we introduce laziness), which will dramatically improve performance.
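
For a sense of what the attribute-based interface could look like, here's a purely illustrative sketch (the `ColumnProfile` name and its fields are hypothetical, not the actual classes in this PR):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ColumnProfile:
    # Defined attributes instead of ad hoc dict keys, so IDEs and the static
    # type checker can see exactly what a profile contains.
    colname: str
    coltype: str
    n_missing: int
    n_unique: int
    mean: float | None = None
    std: float | None = None
    min: float | None = None
    max: float | None = None


profile = ColumnProfile(colname="d", coltype="Float64", n_missing=0, n_unique=12)
print(profile.n_missing)  # attribute access; a typo here is flagged by the type checker
```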

This is all a draft as of now I just wanted to clue you in.

Also, I don't plan on changing any tests and the interface will remain untouched!

@rich-iannone (Member)

@tylerriccio33 this is great, thanks for working on this key piece! Could you try merging with main and see if there are any conflicts? I did some recent work within this file just two days ago (merging #93). None of that work touched the DataScan class (check https://github.com/posit-dev/pointblank/pull/93/files), but it might be good to merge just in case, since there was also a previous merged PR (#88).

Also, is it okay if I release a new version of the package while you're working on this? Given that https://posit-dev.github.io/pointblank/reference/col_summary_tbl.html#pointblank.col_summary_tbl is at a good place (and it's in the docs), it makes sense for a new release.

@tylerriccio33 (Contributor, Author)

I will push what I have right now for reference, but it'll come with a ton of conflicts. I'm about 90% of the way there; I just have to unify the HTML rendering interfaces for each of the columns. To be clear, I'm not changing functionality, just unifying the way the comparison data is exposed so I can build on it for Compare.

So, I'll push what I have now, but I still have maybe a day or two to go until I think it's in a good spot. Sorry for the giant PR.

@tylerriccio33 (Contributor, Author)

Sorry, this is a comically large refactor; we might need to meet in person again to figure some stuff out.

@rich-iannone (Member)

This is great! Don’t worry (I’m not worried) and keep going :)

Send me an email anytime if you want to discuss anything through Zoom (we can schedule easily that way).

@tylerriccio33 (Contributor, Author)

I don't mean to hold you hostage on this refactor; it's virtually all complete. I just need to reroute all the HTML logic and add comprehensive test cases. I've been busy the last few nights, but tonight I should have this done and we'll be off to the races.

@tylerriccio33 (Contributor, Author)

  • Create a more object-oriented interface to the column (and data) profiles with defined attributes. That way, IDEs can help us find things instead of us blindly remembering a key in some dict, and the static type checker can catch issues we might not immediately notice.

This is there. I think it's a start that can be improved in the future; more bugs and ways to optimize the interface will be found once we add new statistics.

  • Unify the ibis/narwhals interface to use basically the same statistics. I want to avoid branching and duplicated logic if possible. Maybe I can contribute to narwhals if things are genuinely missing.

There's no more branching: the function takes something narwhals-compliant and returns as such.
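
In spirit, the unified path is something like the following (a minimal sketch using only the public narwhals API; `compute_basic_stats` and the particular statistics are illustrative, not the PR's actual function):

```python
import narwhals as nw
from narwhals.typing import IntoFrame


def compute_basic_stats(native_df: IntoFrame, column: str):
    # Accept anything narwhals-compliant, compute with narwhals expressions,
    # and never branch on the backing library.
    df = nw.from_native(native_df)
    stats = df.select(
        nw.col(column).mean().alias("mean"),
        nw.col(column).std().alias("std"),
        nw.col(column).min().alias("min"),
        nw.col(column).max().alias("max"),
    )
    return stats.to_native()  # hand the result back in the caller's own library
```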

  • Use a dataframe to do everything instead of pulling out series and calculating stats iteratively. Using dataframes gets us two things: it expands access to DataScan for libraries without series APIs, and it takes advantage of parallelism. For example, Polars can parallelize the stats ops and even share subplans (if we introduce laziness), which will dramatically improve performance.

This was easy, thankfully, but I need to improve the test coverage as it relates to lazy and eager narwhals-compliant frames. The code can cover lazy frames somewhat gracefully, but it isn't battle-tested.
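
As a rough illustration of the single-pass idea (the `summarize` helper here is hypothetical, not the PR's code): everything goes through one `select`, so a lazy engine like Polars can parallelize the aggregations and share subplans, and collection happens once at the end.

```python
import narwhals as nw
import polars as pl


def summarize(native_df, columns):
    # Build one expression list covering every column/statistic...
    exprs = []
    for col in columns:
        exprs.extend(
            [
                nw.col(col).min().alias(f"{col}_min"),
                nw.col(col).max().alias(f"{col}_max"),
                nw.col(col).mean().alias(f"{col}_mean"),
            ]
        )
    frame = nw.from_native(native_df)
    summary = frame.select(*exprs)
    # ...and only collect if the input was lazy.
    if isinstance(summary, nw.LazyFrame):
        summary = summary.collect()
    return summary.to_native()


# Lazy Polars input: the stats run as a single optimized query plan.
lf = pl.LazyFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(summarize(lf, ["a", "b"]))
```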

Currently, the big thing I'm still reconnecting is the HTML generation. The PR does generate the table, but it's missing a few of the amazing formatting features you had. I'm pretty close; I just need to reconnect and reapply the logic.

@rich-iannone (Member)

@tylerriccio33 Take your time, you got this! Honestly, no rush as I'm not touching DataScan-related things for a while anyway.

@tylerriccio33 (Contributor, Author) commented Apr 2, 2025

Ibis tables now work, but in order to reduce complexity, the table gets converted to either arrow, polars, or pandas; it will fail if it's not able to do so. This lets it fit with narwhals gracefully. What's your opinion on this? It saves a ton of branching but adds those libraries to the implied dependencies. I guess that's OK, because I think GT already requires one of them.

These tests all run:

```python
@given(happy_path_df | happy_path_ldf | _arrow_strat() | _pandas_strat())
@example(pb.load_dataset("small_table", "polars"))
@example(pb.load_dataset("small_table", "pandas"))
@example(pb.load_dataset("small_table", "duckdb"))
@example(pb.load_dataset("game_revenue", "polars"))
@example(pb.load_dataset("game_revenue", "pandas"))
@example(pb.load_dataset("game_revenue", "duckdb"))
@example(pb.load_dataset("nycflights", "polars"))
@example(pb.load_dataset("nycflights", "pandas"))
@example(pb.load_dataset("nycflights", "duckdb"))
@settings(deadline=None)
def test_col_summary_tbl(df):
    col_summary = col_summary_tbl(df)

    assert isinstance(col_summary, GT)
```

This is the "controversial" block:

```python
if as_native.implementation.name == "IBIS" and as_native._level == "lazy":
    assert isinstance(as_native, LazyFrame)  # help mypy

    ibis_native = as_native.to_native()

    valid_conversion_methods = ("to_pyarrow", "to_pandas", "to_polars")
    for conv_method in valid_conversion_methods:
        try:
            valid_native = getattr(ibis_native, conv_method)()
        except (NotImplementedError, ImportError, ModuleNotFoundError):
            continue
        break
    else:
        msg = (
            "To use `ibis` as input, you must have one of arrow, pandas, polars or numpy "
            "available in the process. Until `ibis` is fully supported by Narwhals, this is "
            "necessary. Additionally, the data must be collected in order to calculate some "
            "structural statistics, which may be performance detrimental."
        )
        raise ImportError(msg)
    as_native = nw.from_native(valid_native)

self.nw_data: Frame = nw.from_native(as_native)
```

@tylerriccio33 (Contributor, Author) commented Apr 2, 2025

The formatting of the code in that comment looks all out of whack for some reason, but you can see the changed lines in the commit anyways.

I'll get going on the table formatting, not worried :)

@rich-iannone (Member)

Hey! Going to provide thoughts on the dependencies. Overall, we're trying to keep them low, especially with respect to DataFrame libraries. We want people to bring their own data (i.e., not have to install both Polars and Pandas) and make the package work with whatever DF is provided. So we do careful things internally to determine the type of table object and act accordingly.

Unfortunately, right now in Great Tables you have to provide data in either Pandas or Polars (and we have that requirement here: you need at least one of those DF libraries). In the future we're hoping to have a no-dependency SimpleFrame with more limited capabilities, but adequate for packages like Pointblank that don't need a lot of features. Also, we want to remove other heavy dependencies from Great Tables, like NumPy.

So given these constraints, you might have to have a less elegant chunk of Ibis-specific code (which I know isn't great), but it'll at least be good for the user.

Hope this is all okay and reasonable. I know it's limiting from a development standpoint but being dependency light is something users will appreciate a lot.

@tylerriccio33 (Contributor, Author)

So you'd have to have either polars or pandas if you pass a duckdb-backed ibis table; everything else is fine. Since polars or pandas is already a requirement, I don't think my PR would upset anything?

@rich-iannone (Member)

On more careful reading, I think you’re totally fine! Apologies, because on first reading, I somehow thought the move was to add some hard dependencies.

Moral of the story is to not read GH comments so late at night :/

@tylerriccio33 (Contributor, Author)

[Screenshot: draft column summary table, 2025-04-20]

Alrighty, I think I'm close on the visual. I have five things that I've observed are different, two of which I could use a little help on.

  1. The datetime type is wrapping funny. I can't figure out how to make it small and flat flexibly.
  2. The datetime values themselves are too large. I'm not sure what to do here, since in the future I think the datetime should also get the full suite of stats, i.e. p5, q1, etc.
  3. I didn't put the SL back in. I think string lengths may be better indicated in another location, for example in the subtitle or a footnote, with text saying "stats x, y and z are the length of the string". Conversely, we could keep the SL and add that text? I'm worried the SL could crowd out the values.
  4. I removed some of the rounding. I actually have strong feelings on this: I don't agree with rounding values at the top/bottom end of the spectrum. For example, column d has a max of 9999.99 and it's visually rounded to 10k. I would (personally) really like to see the true max of that value; there could be serious value in seeing that the max is 9999.99 and not 10k. What do you think?
  5. I see the full suite of stats running for string lengths in the original version. Why is that? It does in my screenshot too. I'm happy to take it out, or account for this; just curious. I personally don't see a problem with IQR stats getting computed on string lengths.

@rich-iannone (Member) commented Apr 25, 2025

Apologies for the delay in responding to this. I'll provide some comments for each of the questions:

The datetime type is wrapping funny. I can't figure out how to make it small and flat flexibly.
The datetime values themselves are too large. I'm not sure what to do here, since in the future I think the datetime should also get the full suite of stats, i.e. p5, q1, etc.

Maybe what can be done is something like this: `2016-<wbr>01-04<br>00:32:00`? The `<wbr>` tag gives a hint to break at that character position (a plain `<br>` might also work fine; experimentation in a browser is needed). Then a datetime would fit on three lines while still looking readable. I agree datetime should get all the stats, and this formatting would help make it look presentable. The same sort of formatting could be applied to date objects.
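
If it helps, producing that markup is just string formatting over the datetime (a throwaway sketch, nothing GT- or Pointblank-specific):

```python
from datetime import datetime


def compact_datetime_html(dt: datetime) -> str:
    # Hint a break after the year with <wbr>, force one before the time with <br>.
    return f"{dt:%Y}-<wbr>{dt:%m-%d}<br>{dt:%H:%M:%S}"


print(compact_datetime_html(datetime(2016, 1, 4, 0, 32)))
# -> 2016-<wbr>01-04<br>00:32:00
```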

I didn't put the SL back in. I think string lengths may be better indicated in another location, for example in the subtitle or a footnote, with text saying "stats x, y and z are the length of the string". Conversely, we could keep the SL and add that text? I'm worried the SL could crowd out the values.

I think you're right, a message in the footnote about which entries use string-lengths as the measure would be more appropriate. I'm all for leaving out the SL text.

I removed some of the rounding. I actually have strong feelings on this: I don't agree with rounding values at the top/bottom end of the spectrum. For example, column d has a max of 9999.99 and it's visually rounded to 10k. I would (personally) really like to see the true max of that value; there could be serious value in seeing that the max is 9999.99 and not 10k. What do you think?

I think we should preserve as much of the original value as possible until we're out of space (running out of space is the main thing I always wanted to avoid). That looks like eight characters, including a comma and the decimal mark. Beyond that, I have an idea: include the full unrounded version as tooltip text (and we may have to add a note or some visual indication that values are rounded and that there is a precise value on hover).
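
One low-tech way to get the hover behavior would be a plain HTML `title` attribute on the rendered cell; how exactly it would plug into the table build is a separate question (sketch only, with a hypothetical helper name):

```python
def rounded_with_tooltip(exact_value: float, display_text: str) -> str:
    # Show the compact/rounded text, but keep the exact value available on hover.
    return f'<span title="{exact_value}">{display_text}</span>'


print(rounded_with_tooltip(9999.99, "10K"))
# -> <span title="9999.99">10K</span>
```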

I see the full suite of stats running for string lengths in the original version. Why is that? It does in my screenshot too. I'm happy to take it out, or account for this; just curious. I personally don't see a problem with IQR stats getting computed on string lengths.

I think I was just going by what other comparable data summary tools were doing. I noticed that the Positron data viewer only included min/max/median for string lengths. Coupled with the fact that the range is often low and integer-based, it just seemed a bit noisy/redundant to me. Happy to reconsider this one and just have them included for consistency!


Just wanted to add that this is looking great! And we can always make little adjustments as time goes on (I'm open to pretty much anything here).

@tylerriccio33 (Contributor, Author)

Do you think the datetime (and date) formatting could be handled by fmt_datetime from GT? Perhaps if there were a compaction option, or something interesting we could do with the formatting arguments? It would be nice to leverage the HTML logic that already exists rather than custom-build it.

https://posit-dev.github.io/great-tables/reference/GT.fmt_datetime.html
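
A rough sketch of what leaning on `fmt_datetime` might look like; the exact `date_style`/`time_style` values (and whether they can compact things enough) would need checking against the Great Tables docs, and the column name here is just an example:

```python
import polars as pl
from great_tables import GT

df = pl.DataFrame({"date_time": ["2016-01-04 00:32:00"]}).with_columns(
    pl.col("date_time").str.to_datetime()
)

# Let GT handle the datetime rendering rather than hand-building the HTML.
gt_tbl = GT(df).fmt_datetime(columns="date_time", date_style="iso", time_style="iso")
```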
