WIP: DataScan refactor to expose data and statistics consistently #94
@tylerriccio33 this is great, thanks for working on this key piece! Could you try merging with main and see if there are any conflicts? I did some recent work within this file (merging #93) just two days ago. All of that work didn't touch the Also, is it okay if I release a new version of the package while you're working on this? Given that
I will push what I have right now for reference, but it'll come with a ton of conflicts. I'm about 90% of the way there; I just have to unify the HTML rendering interfaces for each of the columns. To be clear, I'm not changing functionality, just unifying the way the comparison data is exposed so I can build on it for Compare. So, I'll push what I have now, but I still have maybe a day or two to go until I think it's in a good spot. Sorry for the giant PR.
Sorry, this is like a comically large refactor; we might need to meet in person again to figure some stuff out.
This is great! Don’t worry (I’m not worried) and keep going :) Send me an email anytime if you want to discuss anything through Zoom (we can schedule easily that way).
Don’t mean to hold you hostage on this refactor. It’s virtually all complete; I just need to reroute all the HTML logic and add comprehensive test cases. Been busy the last few nights, but tonight I should have this done and we’d be off to the races.
This is there, I think it's a start and can be improved in the future. More bugs will be found and ways to optimize the interface once we add new statistics.
There's no more branching, the function takes something Narwhals compliant and returns as such.
This was easy, thankfully, but I need to improve the test coverage as it relates to lazy and eager Narwhals-compliant frames. The code can cover lazy frames somewhat gracefully, but it isn't battle-tested. Currently, the big thing I'm still reconnecting is the HTML generation. The PR does generate the table, but it's missing a few of the amazing formatting features you had. I'm pretty close; I just need to reconnect and reapply the logic.
@tylerriccio33 Take your time, you got this! Honestly, no rush as I'm not touching DataScan-related things for a while anyway.
Ibis tables now work, but to reduce the complexity, the table gets converted to either Arrow, Polars, or pandas. It will fail if it's not able to do so. This enables it to fit with Narwhals gracefully. What's your opinion on this? It saves a ton of branching but adds those libraries to the implied dependencies. I guess that's OK because I think GT already requires one of them.

These tests all run:

This is the "controversial" block: `self.nw_data: Frame = nw.from_native(as_native)`
The formatting of the code in that comment looks all out of whack for some reason, but you can see the changed lines in the commit anyways. I'll get going on the table formatting, not worried :)
Hey! Going to provide thoughts on the dependencies. Overall we're trying to keep them low, especially w.r.t. DataFrame libraries. We want people to bring their own data (i.e., not have to install both Polars and Pandas) and make the package work with whatever DF is provided. So we do careful things internally to determine the type of table object and act accordingly. Unfortunately, right now in Great Tables you have to provide data in either Pandas or Polars (and we have that requirement here: you need at least one of those DF libraries). In the future we're hoping to have a no-dependency SimpleFrame with more limited capabilities, but adequate enough for packages like Pointblank that don't need a lot of features. Also, we want to remove other heavy dependencies from Great Tables, like NumPy. So given these constraints, you might have to have a less elegant chunk of Ibis-specific code (which I know isn't great), but it'll at least be good for the user. Hope this is all okay and reasonable. I know it's limiting from a development standpoint, but being dependency-light is something users will appreciate a lot.
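One way to "determine the type of table object" without importing any DataFrame library is to look at the module the object's class comes from. The sketch below is an assumption for illustration, not Pointblank's actual implementation; `table_kind` and `_KNOWN_ROOTS` are hypothetical names.

```python
from typing import Any

# Top-level packages this sketch recognizes (an assumed, non-exhaustive set).
_KNOWN_ROOTS = {"polars", "pandas", "pyarrow", "ibis"}


def table_kind(obj: Any) -> str:
    """Guess the owning library from the class's module path, importing nothing."""
    root = type(obj).__module__.split(".")[0]
    return root if root in _KNOWN_ROOTS else "unknown"


# A stand-in class that merely claims to live in Polars, so the example runs
# without Polars installed.
FakePolarsFrame = type("DataFrame", (), {"__module__": "polars.dataframe.frame"})

kind = table_kind(FakePolarsFrame())
```

Because the check reads only a string attribute, it works whether or not the library is installed, which keeps the bring-your-own-DataFrame approach intact.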
So you'd have to have either Polars or pandas if you pass a DuckDB-backed Ibis table; everything else is fine. Since Polars or pandas is already a requirement, I don't think my PR would upset anything?
On more careful reading, I think you’re totally fine! Apologies, because on first reading, I somehow thought the move was to add some hard dependencies. Moral of the story is to not read GH comments so late at night :/
Apologies for the delay in responding to this. I'll provide some comments for each of the questions:
Maybe what can be done is something like this:
I think you're right, a message in the footnote about which entries use string-lengths as the measure would be more appropriate. I'm all for leaving out the SL text.
I think we should preserve as much of the original value as possible until we're out of space, which is the main thing I always wanted to avoid. That looks like eight characters including a comma and the decimal mark. Beyond that, I have an idea: include the full unrounded version as tooltip text (and we may have to add a note or some visual indication that values are rounded and that there is a precise value on hover).
I think I was just going by what other comparable data summary tools were doing. I noticed that the Positron data viewer only included min/max/median for string lengths. Coupled with the fact that the range is often low and integer-based, it just seemed a bit noisy/redundant to me. Happy to reconsider this one and just have them included for consistency! Just wanted to add that this is looking great! And we can always make little adjustments as time goes on (I'm open to pretty much anything here).
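The rounding idea above (keep as much of the value as fits in roughly eight characters, and expose the unrounded value on hover) can be sketched as follows. This is a minimal illustration under assumptions, not the PR's rendering code: `compact_number`, `with_tooltip`, the eight-character default, and the scientific-notation fallback are all hypothetical choices.

```python
import html


def compact_number(value: float, max_chars: int = 8) -> str:
    """Keep as many decimals as fit within max_chars, separators included."""
    for decimals in range(6, -1, -1):
        text = f"{value:,.{decimals}f}"
        if len(text) <= max_chars:
            return text
    # Even the integer part is too wide: fall back to scientific notation.
    return f"{value:.2e}"


def with_tooltip(value: float, max_chars: int = 8) -> str:
    """Wrap the rounded display text in a span whose title holds the full value."""
    full = html.escape(repr(value))
    return f'<span title="{full}">{compact_number(value, max_chars)}</span>'


cell = with_tooltip(1234.5678)
```

Here `1234.5678` is shown as `1,234.57`, which is eight characters including the comma and the decimal mark, while the `title` attribute carries the precise value for the hover text.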
Do you think the datetime (and date) formatting could be handled by `GT.fmt_datetime()`? See https://posit-dev.github.io/great-tables/reference/GT.fmt_datetime.html
This is going to be a big one... I'm working to expose the statistics logic in `DataScan` in a more object-oriented way. I'm planning to do 3 major things:

This is all a draft as of now; I just wanted to clue you in.
Also, I don't plan on changing any tests and the interface will remain untouched!