Data Explorer: Summary statistics heuristics for precision #2339

jthomasmock · 2024-02-27T20:21:05Z

Problem Space: How to handle decimal precision across extremely broad ranges of possible data.

Guiding principles:

Target decimal alignment, even across large ranges to make comparisons at a glance easy
Avoid printing too many digits -- in particular, limit the number of significant figures we print.

Tasks

R formatting
Python formatting

Very large data:

Print a max of 7 digits. This gives us enough room alongside the median/mean etc. to let the numbers breathe at the min width of the summary column.
NICE TO HAVE: We should have a thin space for each three digit group, ie 1000000. becomes 1 000 000. with thinner spaces. We'd still need to be careful to make sure alignment across rows at the decimal place is valid. Alternatively, we may want to use an underscore to indicate each 3 digits (1,000s place), but I think that can happen post public beta.
This avoids major locale problems with using , meaning a decimal in Europe.
At that scale, we could safely drop all decimals and rely on whole numbers but indicate it is still not a whole number by including a trailing ., ie 1,000,000.
If > 1, then avoid printing more than 2 decimal places, ie 1.23 is ok but 1.23456 is not.
After that "max printable value" we should switch to either 5 max significant figures + scientific notation, ie 112.05e+10 or 3 significant figures + scientific notation, ie 1.12e+10. My preference would be 3 significant figures + scientific notation.
We must be careful to treat all the numbers equally, though, so there is still nice alignment at the decimal, with a scientific notation, and then the exponent can vary across ranges, ie 1.12e+10 and 1.10e+21.
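
A minimal sketch of the large-number rules above, assuming Python's built-in format specifiers. The function name `format_large` and its parameters are hypothetical and only illustrate the 7-digit cap, the 2-decimal cap, the trailing `.`, and the switch to 3 significant figures with an exponent. Digit grouping is left out, and the edge case where rounding adds a digit (for example 9999999.6) is not handled.

```python
def format_large(value: float, max_digits: int = 7, max_decimals: int = 2,
                 sci_sig_figs: int = 3) -> str:
    """Format a finite number with magnitude >= 1 (hypothetical helper)."""
    magnitude = abs(value)
    # Number of digits to the left of the decimal point.
    int_digits = len(str(int(magnitude)))

    if int_digits > max_digits:
        # Past the "max printable value": fixed significant figures + exponent.
        return f"{value:.{sci_sig_figs - 1}e}"

    # Spend the leftover digit budget on decimals, capped at max_decimals.
    decimals = min(max_decimals, max_digits - int_digits)
    text = f"{value:.{decimals}f}"

    # With no room for decimals, a trailing "." signals "not a whole number".
    return text + "." if decimals == 0 else text


for sample in (1.23456, 123456.789, 1234567.89, 1.12e10, 1.1e21):
    print(format_large(sample))
```

Because every value past the cap uses the same number of significant figures, the mantissas line up at the decimal point while the exponent varies.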

Very small data (<1):

Max of 4 digits, not including the 0. which counts as one additional digit to get us 5.
For small data, I think we can be a bit more aggressive with switching to scientific notation, and switch to 3 digits max but likely 2-3 in most situations.
0.05, 0.05671, 0.000000027 becomes

5.  e-2 
5.67e-2 
2.7 e-8
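
One way to produce the padded mantissas shown above, as a sketch only. The helper name `format_small_sci` is hypothetical, it assumes positive inputs, and it relies on Python's `e` format type to do the rounding before trailing zeros are swapped for spaces.

```python
def format_small_sci(value: float, max_decimals: int = 2) -> str:
    """Scientific notation with a fixed-width, zero-trimmed mantissa."""
    # "5.00e-02" -> mantissa "5.00", exponent "-02"
    mantissa, exponent = f"{value:.{max_decimals}e}".split("e")
    # Drop trailing zeros, then pad with spaces so the "e" stays in one column.
    trimmed = mantissa.rstrip("0")
    return f"{trimmed.ljust(len(mantissa))}e{int(exponent)}"


for sample in (0.05, 0.05671, 0.000000027):
    print(format_small_sci(sample))
```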

Alternatively, we could go with only necessary scientific notation, but I think that the consistent scientific notation is a bit cleaner.

0.05 
0.0567
2.7 e-8

If all numbers are 5 or fewer digits, then don't use scientific notation, just decimal align.

0.05
0.0567
0.0113
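
The column-level decision could be sketched as below. This is an interpretation, not a settled rule: it assumes "fits in 5 digits" means a nonzero value does not round to zero at 4 decimal places, and the name `format_small_column` is hypothetical.

```python
def format_small_column(values, max_decimals=4):
    """Use plain decimals if every value survives rounding, else use exponents."""
    fixed = [f"{v:.{max_decimals}f}" for v in values]
    # Assumption: a value "fits" unless it is nonzero but rounds to zero.
    fits = all(v == 0 or float(s) != 0 for v, s in zip(values, fixed))
    if fits:
        # All values are below 1, so left-aligned strings align at the ".".
        return [s.rstrip("0") for s in fixed]
    # Otherwise treat the whole column the same way.
    return [f"{v:.2e}" for v in values]


print(format_small_column([0.05, 0.05671, 0.0113]))
print(format_small_column([0.05, 0.000000027]))
```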

I think it would be useful to coordinate with some of the existing logic/heuristics that tibble and pillar use:

We should be able to apply extremely similar numerical handling for sane defaults.

Backend

We may need to handle rounding or even display on the backend.


wesm · 2024-04-15T21:31:50Z

The summary statistics are pretty ugly now. Part of fixing this will be returning the unformatted numbers from the backend and handling the formatting in the UI.

wesm · 2024-05-23T14:13:00Z

> We should have a thin space for each three digit group, ie 1000000. becomes 1 000 000. with thinner spaces. We'd still need to be careful to make sure alignment across rows at the decimal place is valid.
>
> This avoids major locale problems with using , meaning a decimal in Europe.

Since we moved to fixed-space fonts, this thin space solution won't work anymore. My first principles approach would be to add formatting options to the get_data_values request (e.g. pass the thousands separator and decimal point that you want based on the application locale). Thoughts? cc @jmcphers
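
A sketch of what locale-style formatting options on a request could look like. The `FormatOptions` class and its field names are hypothetical illustrations, not the actual `get_data_values` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatOptions:
    """Hypothetical options a frontend could pass based on its locale."""
    thousands_sep: str = ","
    decimal_point: str = "."
    max_decimals: int = 2


def format_value(value: float, options: FormatOptions) -> str:
    # Format with Python's fixed "," and "." first, e.g. "1,234,567.89".
    text = f"{value:,.{options.max_decimals}f}"
    # Split before substituting so "," and "." can be swapped safely.
    integer_part, _, fraction = text.partition(".")
    integer_part = integer_part.replace(",", options.thousands_sep)
    if fraction:
        return integer_part + options.decimal_point + fraction
    return integer_part


print(format_value(1234567.891, FormatOptions()))
print(format_value(1234567.891, FormatOptions(thousands_sep=".", decimal_point=",")))
```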

jthomasmock · 2024-05-23T19:25:56Z

Yah eventually we may want to make it configurable or approach like pillar with underscores instead of spaces. That is tricky for copy-paste out though

jmcphers · 2024-05-23T20:02:08Z

How about adding thin spaces using spans with padding but no contents? Those will copy out cleanly but will also let us format things nicely.

1<span class="tinyspace"></span>000<span class="tinyspace"></span>000

We'd obviously need to add these on the frontend, probably fine as long as we know that the column type is numeric and it's parseable as such
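
To illustrate the idea, here is a sketch that wraps an already-formatted plain number in empty spans. It is written in Python for consistency even though the real work would happen in the frontend, and the function name `add_thin_space_spans` is hypothetical. Anything that does not parse as a plain decimal number is returned untouched.

```python
import re

SPAN = '<span class="tinyspace"></span>'
PLAIN_NUMBER = re.compile(r"([+-]?)([0-9]+)(\.[0-9]*)?")


def add_thin_space_spans(number_text: str) -> str:
    """Insert an empty span between each group of three integer digits."""
    match = PLAIN_NUMBER.fullmatch(number_text)
    if match is None:
        return number_text  # e.g. "1.12e+10" or "NaN": leave as-is
    sign, integer, fraction = match.group(1), match.group(2), match.group(3) or ""
    groups = []
    while integer:
        groups.append(integer[-3:])  # peel off groups from the right
        integer = integer[:-3]
    return sign + SPAN.join(reversed(groups)) + fraction


print(add_thin_space_spans("1000000"))
```

Since the spans have no text content, copying the cell yields the bare digits.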

wesm · 2024-05-23T20:20:52Z

We would have to put some kind of placeholder unicode character on the backend so that the frontend can reliably replace it with the HTML display formatting that we want
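
A sketch of that round trip, assuming U+2009 (THIN SPACE) as the placeholder. The choice of character and all three function names are hypothetical.

```python
PLACEHOLDER = "\u2009"  # assumed placeholder; any agreed-upon character works
SPAN = '<span class="tinyspace"></span>'


def backend_format(value: float, decimals: int = 0) -> str:
    """Backend side: group digits, marking each group boundary."""
    return f"{value:,.{decimals}f}".replace(",", PLACEHOLDER)


def frontend_render(text: str) -> str:
    """Frontend side: swap the placeholder for the display markup."""
    return text.replace(PLACEHOLDER, SPAN)


def frontend_copy_text(text: str) -> str:
    """Text placed on the clipboard: placeholder removed entirely."""
    return text.replace(PLACEHOLDER, "")


wire = backend_format(1000000.0)
print(frontend_render(wire))
print(frontend_copy_text(wire))
```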

…for data values, summary stats (#3310) Partly addresses #2339 and #3210, and #3314. Adds a format_options parameter in the get_data_values and get_column_profiles backend methods to allow customized formatting of large, small, and medium-sized numbers. If the value threshold is above or below limits implied by the parameters, scientific notation is used. We've talked about dynamically trimming zero padding from the end of numbers (e.g. if all the floats displayed are integral, then we can trim the trailing zeros off all of them), but this will be a frontend-only change that can be done after.
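
The threshold behavior described here can be sketched roughly as follows. The parameter names and defaults are assumptions for illustration and are not taken from the actual `format_options` definition.

```python
def format_with_options(value: float, large_num_digits: int = 2,
                        small_num_digits: int = 4,
                        max_integral_digits: int = 7) -> str:
    """Pick fixed or scientific notation from thresholds implied by the options."""
    magnitude = abs(value)
    upper = 10.0 ** max_integral_digits   # too many integral digits above this
    lower = 10.0 ** -small_num_digits     # would round to zero below this

    if magnitude >= upper or 0 < magnitude < lower:
        return f"{value:.{large_num_digits}e}"
    if magnitude >= 1:
        return f"{value:.{large_num_digits}f}"
    return f"{value:.{small_num_digits}f}"


for sample in (123.456, 0.05671, 1.12e10, 2.7e-8):
    print(format_with_options(sample))
```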

wesm · 2024-06-03T23:10:44Z

What else do we want to try to do this week from where things are now?

jthomasmock · 2024-06-04T01:14:30Z

> What else do we want to try to do this week from where things are now?

The Python formatting in summary stats looks remarkably good -- thanks for all the PRs!

Only thing I see missing is missing and categorical types, which I think are captured in #2161

jthomasmock · 2024-06-04T02:13:46Z

@wesm it does stick out to me a bit that we're adding a lot of significant figures in the decimals. I think it'd be nice if > 1, then avoid printing more than 2 decimal places, ie 1.23 or 1.00 is ok but 1.23456 or 1.00000 is a bit much.

wesm · 2024-06-04T13:56:38Z

> I think it'd be nice if > 1, then avoid printing more than 2 decimal places, ie 1.23 or 1.00 is ok but 1.23456 or 1.00000 is a bit much.

Right -- we discussed this a bit in the past. If we want decimal alignment with numbers > 1 and small numbers between 1 and -1, we either:

1. have to ask the backend to format all numbers with additional significant digits, and then trim superfluous zeros dynamically on the front end (see Dynamically trim zero padding from decimal digits in data explorer grid (or similar) #3325), or
2. do not do decimal alignment, and set the formatting to use 2 significant digits for numbers over 1 and 4 digits for small numbers < 1.

I will go ahead and do option 2 until we are ready to implement #3325.
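
For reference, the dynamic trimming in the first option might look like the sketch below. It is shown in Python although the step is proposed for the front end, the name `trim_common_zero_padding` is hypothetical, and it assumes all inputs were formatted with the same number of decimals. It keeps the decimal point when every decimal is removed, which is one possible choice.

```python
from typing import List


def trim_common_zero_padding(formatted: List[str]) -> List[str]:
    """Remove only the trailing zeros that every value in the column shares."""
    if not formatted:
        return []

    def trailing_zeros(text: str) -> int:
        _, dot, fraction = text.partition(".")
        if not dot:
            return 0  # no decimal part, so nothing can be trimmed
        return len(fraction) - len(fraction.rstrip("0"))

    removable = min(trailing_zeros(text) for text in formatted)
    if removable == 0:
        return list(formatted)
    # Every value loses the same count, so decimal alignment is preserved.
    return [text[:-removable] for text in formatted]


print(trim_common_zero_padding(["1.0000", "2.5000", "10.2500"]))
```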

~~I am also not sure why the numbers are left-aligned all of a sudden, that looks like a bug to me cc @softwarenerd~~ I see this is #3376

wesm · 2025-03-21T20:18:41Z

This might be a good time to review the current state of things and decide what needs to be refined or changed. Given that the formatting is happening on the backend, we need to be careful about doing too much non-trivial formatting logic in the front end, but there may be some low-hanging fruit.

jthomasmock · 2025-04-30T01:21:56Z

We've largely decided on trying to respect the underlying dataframe library options as seen in: #7266 (comment)

For moderately variable data (small, med, large) we're doing a good job, but the scientific notation is still a bit ugly:

Given we have nice alignment otherwise, perhaps we want to close this issue as done and open two secondary issues:

Respect underlying pd.set_option() or equivalent for precision, sci units, etc.
Better handle scientific notation in summary statistics and in grid

jthomasmock added this to the Public Beta 2024 Q2 milestone Feb 28, 2024

wesm added the area: data explorer Issues related to Data Explorer category. label Feb 29, 2024

wesm mentioned this issue Apr 22, 2024

Data Explorer: Use pandas for more nicely formatted numeric column summary stats #2850

Merged

dfalbel mentioned this issue May 9, 2024

Add support for computing summary stats posit-dev/ark#342

Merged

jthomasmock added the epic Epic label May 21, 2024

jthomasmock assigned dfalbel and wesm May 21, 2024

jthomasmock mentioned this issue May 21, 2024

Data Explorer: Pandas summary statistics formatting #3210

Closed

dfalbel mentioned this issue May 21, 2024

R Formatting of summary statistics #3211

Closed

wesm mentioned this issue May 29, 2024

Data Explorer: Add preliminary customizable float formatting options for data values, summary stats #3310

Merged

wesm mentioned this issue May 30, 2024

Dynamically trim zero padding from decimal digits in data explorer grid (or similar) #3325

Open

dfalbel mentioned this issue Jun 4, 2024

Data Explorer: Add support for formatting options posit-dev/ark#382

Merged

dfalbel modified the milestones: Public Beta 2024 Q2, Release Candidate Jun 27, 2024

jthomasmock mentioned this issue Aug 9, 2024

Data Explorer: Decimal alignment rules not applied #4303

Open
