Data Explorer: Summary statistics heuristics for precision #2339
Comments
The summary statistics are pretty ugly now. Part of this will be returning the unformatted numbers from the backend and handling the formatting in the UI |
Since we moved to fixed-space fonts, this thin space solution won't work anymore. My first principles approach would be to add formatting options to the |
How about adding thin spaces using spans with padding but no contents? Those will copy out cleanly but will also let us format things nicely.
We'd obviously need to add these on the frontend, probably fine as long as we know that the column type is numeric and it's parseable as such |
We would have to put some kind of placeholder Unicode character on the backend so that the frontend can reliably replace it with the HTML display formatting that we want |
…for data values, summary stats (#3310) Partly addresses #2339, #3210, and #3314. Adds a format_options parameter to the get_data_values and get_column_profiles backend methods to allow customized formatting of large, small, and medium-sized numbers. If a value falls above or below the limits implied by the parameters, scientific notation is used. We've talked about dynamically trimming zero padding from the end of numbers (e.g. if all the floats displayed are integral, then we can trim the trailing zeros off all of them), but this will be a frontend-only change that can be done afterwards.
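To make the threshold behaviour concrete, here is a rough sketch of how options like these could drive the choice between fixed and scientific notation. The `FormatOptions` class, its field names, and the default values are assumptions made up for this illustration; they are not taken from the actual backend code.

```python
from dataclasses import dataclass


@dataclass
class FormatOptions:
    # Hypothetical field names and defaults, chosen for this sketch only.
    large_num_digits: int = 2     # decimals shown when |x| >= 1
    small_num_digits: int = 4     # decimals shown when |x| < 1
    max_integral_digits: int = 7  # more integer digits than this -> scientific


def format_number(x: float, opts: FormatOptions) -> str:
    """Format x in fixed notation, or scientific outside the implied limits."""
    if x == 0:
        return f"{0:.{opts.large_num_digits}f}"
    magnitude = abs(x)
    upper = 10.0 ** opts.max_integral_digits   # too large for fixed notation
    lower = 10.0 ** (-opts.small_num_digits)   # too small to show any digit
    if magnitude >= upper or magnitude < lower:
        return f"{x:.{opts.large_num_digits}e}"
    if magnitude >= 1:
        return f"{x:.{opts.large_num_digits}f}"
    return f"{x:.{opts.small_num_digits}f}"


opts = FormatOptions()
for value in (1234.5678, 0.05671, 12345678.9, 0.000000027):
    print(format_number(value, opts))
```

Values that round up across the upper limit (for example 9999999.999) would still print in fixed notation here; a real implementation would probably want to check the rounded value instead.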
What else do we want to try to do this week from where things are now? |
The Python formatting in summary stats looks remarkably good -- thanks for all the PRs! Only thing I see missing is |
@wesm it does stick out to me a bit that we're adding a lot of sig-figs in the decimals. I think it'd be nice if > 1, then avoid printing more than 2 decimal places, ie |
Right -- we discussed this a bit in the past. If we want decimal alignment with numbers > 1 and small numbers between 1 and -1, we either:
I will go ahead and do #2 until we are ready to implement #3325.
This might be a good time to review the current state of things and decide what needs to be refined or changed. Given that the formatting is happening on the backend, we need to be careful about doing too much non-trivial formatting logic in the front end, but there may be some low-hanging fruit |
We've largely decided on trying to respect the underlying dataframe library options as seen in: #7266 (comment)
For moderately variable data (small, med, large) we're doing a good job, but the scientific notation is still a bit ugly:
Given we have nice alignment otherwise, perhaps we want to close this issue as done and open two secondary issues:
Problem Space: How to handle decimal precision across extremely broad ranges of possible data.
Guiding principles:
Tasks
Very large data:
Print max of 7 digits, this gives us enough room alongside the `median`/`mean` etc to let the numbers breathe at min width of the summary column.
NICE TO HAVE: We should have a thin space for each three digit group, ie `1000000.` becomes `1 000 000.` with thinner spaces. We'd still need to be careful to make sure alignment across rows at the decimal place is valid. Alternatively, we may want to use an underscore to indicate each 3 digits (1,000s place), but I think that can happen post public beta.
This avoids major locale problems with using `,` meaning a decimal in Europe.
At that scale, we could safely drop all decimals and rely on whole numbers but indicate it is still not a whole number by including a trailing `.`, ie `1,000,000.`
If > 1, then avoid printing more than 2 decimal places, ie `1.23` is ok but `1.23456` is not.
After that "max printable value" we should switch to either 5 max significant figures + scientific notation, ie `112.05e+10` or 3 significant figures + scientific notation, ie `1.12e+10`. My preference would be 3 significant figures + scientific notation.
We must be careful to treat all the numbers equally, though, so there is still nice alignment at the decimal, with a scientific notation, and then the exponent can vary across ranges, ie `1.12e+10` and `1.10e+21`.
Very small data (<1):
`0.` which counts as one additional digit to get us 5.
`0.05, 0.05671, 0.000000027` becomes
Alternatively, we could go with only necessary scientific notation, but I think that the consistent scientific notation is a bit cleaner.
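As a rough illustration of the large-number rules above (thin-space digit groups with a trailing `.`, and a switch to 3 significant figures + scientific notation for every value once any value exceeds the 7-digit budget), something like the following could work. The function names and the `MAX_DIGITS` constant are invented for this sketch, and applying one notation to the whole set of values is just one reading of "treat all the numbers equally".

```python
THIN_SPACE = "\u2009"  # Unicode thin space
MAX_DIGITS = 7         # from the proposal: print at most 7 digits


def group_thousands(x: float) -> str:
    """Whole-number display with thin-space groups and a trailing '.'."""
    # Python's ',' format spec does not depend on locale; swap it for a thin space.
    return f"{round(x):,}".replace(",", THIN_SPACE) + "."


def format_large(values: list[float]) -> list[str]:
    """Grouped whole numbers if every value fits, else 3 sig figs + exponent for all."""
    if all(abs(round(v)) < 10 ** MAX_DIGITS for v in values):
        return [group_thousands(v) for v in values]
    # Same mantissa width everywhere, so the decimal points line up
    # and only the exponent varies.
    return [f"{v:.2e}" for v in values]


print(format_large([1234.4, 1000000.0]))
print(format_large([1.12e10, 1.10e21]))
```

Because every grouped value ends in `.`, right-aligning the strings would keep the decimal place aligned, although a thin space may not render at a predictable width in a fixed-width font.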
I think it would be useful to coordinate some of the existing logic/heuristics that `tibble` and `pillar` use:
We should be able to apply extremely similar numerical handling for sane defaults.
Backend
We may need to handle rounding or even display on the backend.