Skip to content

Use ICU for better Unicode sorting #1066

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

alerque
Copy link
Collaborator

@alerque alerque commented Jul 14, 2019

This is an extract from #1063 with the ICU related bit separately for testing and review.

...and as long as its being run separately here there are probably a few other places that could take advantage of better Unicode support.

Additionally if there is a way (sorting flag?) to do phonetic interspersed sorting instead of per-alphabet that would be an interesting alternative as well.

@simonmichael simonmichael changed the title Use ICU for better Unicode support Use ICU for better Unicode sorting Jul 15, 2019
@simonmichael
Copy link
Owner

We probably need a little test for this and a mention & example of it in the manual. Also, as previously discussed we are looking for someone on windows (and the other platforms) to test building this, we want to know if it makes installation harder.

@simonmichael simonmichael added A-WISH Some kind of improvement request or proposal. help wanted needs-docs To unblock: needs corresponding documentation or doc updates needs-testing To unblock: needs more developer testing or general usage needs-tests To unblock: needs more automated tests or test updates platform-freebsd platform-linux platform-mac platform-nix platform-windows labels Jul 15, 2019
@alerque alerque self-assigned this Jul 15, 2019
@simonmichael simonmichael added the i18n Internationalisation/localisation-related. label Jul 15, 2019
@simonmichael
Copy link
Owner

For the record, this improves the sorting of the new descriptions/payees/notes commands by making them case insensitive and “accent insensitive”, so that eg ã sorts similarly to a. It adds text-icu as a dependency, which requires the icu C library.

@simonmichael simonmichael removed the needs-tests To unblock: needs more automated tests or test updates label Jul 15, 2019
@alerque alerque added help wanted needs-tests To unblock: needs more automated tests or test updates and removed help wanted labels Jul 15, 2019
@alerque
Copy link
Collaborator Author

alerque commented Jul 15, 2019

The test I added was for a proposed bit that isn't implemented yet (account name sorting), I didn't add a test yet for the part I did implement (descriptions/payee/note sorting) ;-)

@simonmichael
Copy link
Owner

What's needed to move this forward ? I think:

  • find someone to see if this builds on Windows.

And if that works, then:

  • clarify which commands are changed, mention the new sorting behaviour in their docs
  • consider what other commands to change for consistency
  • rebase these changes on latest master, and combine/clean up commits

@simonmichael simonmichael removed the needs-tests To unblock: needs more automated tests or test updates label Nov 19, 2019
@alerque
Copy link
Collaborator Author

alerque commented Nov 26, 2019

I think that summary is correct. When the time comes I'll be happy to take care of some of the other steps, but I don't have access to Windows to test on.

@simonmichael simonmichael force-pushed the master branch 6 times, most recently from 56bc295 to 01f9c70 Compare July 11, 2021 09:26
@simonmichael
Copy link
Owner

Hi @alerque. Returning to your 2019 hledger pr to add text-icu for better sorting. I believe text-icu is still a problem for portability/installability - see discussion this year on #2319. So I'm inclined to close this.

@alerque
Copy link
Collaborator Author

alerque commented May 17, 2025

Is this because Haskell's text-icu package has C library dependencies? Would that be the only C library being depended on?

In very recent news ICU4x just stabilized 2.0.0 recently. That should have some nice effects that trickle-down to other language ecosystems eventually, but may not have affected Haskell yet.

@simonmichael
Copy link
Owner

I expect there are some other C libs depended on, but only very common usually trouble-free ones. ICU is usually not in that tier, and never will be I think, so adding a dependency on it is not free. I found installation hassles on windows and mac this year.

The haskell text-icu package does have a maintainer again, since 2022.

How much is better international text sorting needed ?

@simonmichael
Copy link
Owner

PS, and as an alternative, could we implement our own 80% solution ?

@alerque
Copy link
Collaborator Author

alerque commented May 17, 2025

It isn't an earth shattering issue for sure, but 6 years later it is still a paint point and I still post-process lots of hledger output to fix it.

80% is probably better than nothing and I understand the concern over adding non-native dependencies. I do find it hard to believe that Haskell doesn't have anything native to handle this properly though. It's tricky but not that hard. Unfortunately my Haskell has not been exercised very much in the last few years as Rust, Lua, and some shell stuff have dominated my work.

@simonmichael
Copy link
Owner

simonmichael commented May 17, 2025

Maybe there is something native. For #2319 we found the encoding lib; I'm assuming that one doesn't help us here.

Good international support is definitely a goal for hledger, though lower priority than installability and maintainability.

@simonmichael
Copy link
Owner

Some interesting things to review here: https://hackage.haskell.org/packages/search?terms=text-icu

@simonmichael simonmichael added needs-research To unblock: needs some research/investigation and removed needs-docs To unblock: needs corresponding documentation or doc updates needs-testing To unblock: needs more developer testing or general usage labels May 17, 2025
@alerque
Copy link
Collaborator Author

alerque commented May 17, 2025

https://hackage.haskell.org/package/unicode-collation

Presumably this is what Pandoc uses, and seems to be a pure Haskell library that covers the necessary ground.

@simonmichael
Copy link
Owner

Exactly.. that looks great.

@simonmichael simonmichael added needs-impact-analysis To unblock: needs analysis of interactions with other features, users, ecosystem needs-docs To unblock: needs corresponding documentation or doc updates needs-code To unblock: needs code/code updates and removed needs-research To unblock: needs some research/investigation labels May 17, 2025
@alerque alerque force-pushed the icu-sorting branch 3 times, most recently from d8a143c to 8899971 Compare May 17, 2025 21:20
@simonmichael
Copy link
Owner

Looking good! I imagine when you are ready you'll squash this into one commit.

We'll want all commands to sort the same way, ultimately. I see some sorting-related functional tests for three commands:

hledger/test/accounts-sorting.test
hledger/test/balance/sorting.test
hledger/test/register/sort.test

But perhaps it'll be easier to start a new unicode-sorting.test file just for this.

@alerque
Copy link
Collaborator Author

alerque commented May 17, 2025

At the moment I haven't even gotten my (a little bit dated) GHC toolchain from Arch Linux to cooperate and build anything to test. Old toolchains have issues with the current deps, but using a new version has other issues too, namely I got lost in this snafu when trying to use 9.10.1 via stack build.

But yes I'll look at testing all three bits we're sorting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-WISH Some kind of improvement request or proposal. i18n Internationalisation/localisation-related. needs-code To unblock: needs code/code updates needs-docs To unblock: needs corresponding documentation or doc updates needs-impact-analysis To unblock: needs analysis of interactions with other features, users, ecosystem platform-freebsd platform-linux platform-mac platform-nix platform-windows
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants