Use ICU for better Unicode sorting #1066
We probably need a small test for this, plus a mention and example of it in the manual. Also, as previously discussed, we are looking for someone on Windows (and the other platforms) to test building this; we want to know if it makes installation harder.
For the record, this improves the sorting of the new descriptions/payees/notes commands by making them case insensitive and “accent insensitive”, so that e.g. ã sorts similarly to a. It adds text-icu as a dependency, which requires the ICU C library.
The test I added was for a proposed bit that isn't implemented yet (account name sorting); I haven't yet added a test for the part I did implement (descriptions/payee/note sorting) ;-)
What's needed to move this forward? I think:
And if that works, then:
I think that summary is correct. When the time comes I'll be happy to take care of some of the other steps, but I don't have access to Windows to test on.
Is this because Haskell's In very recent news, ICU4X just stabilized 2.0.0. That should have some nice effects that trickle down to other language ecosystems eventually, but may not have affected Haskell yet.
I expect there are some other C libs depended on, but only very common, usually trouble-free ones. ICU is usually not in that tier, and never will be I think, so adding a dependency on it is not free. I found installation hassles on Windows and Mac this year. The Haskell text-icu package does have a maintainer again, since 2022. How much is better international text sorting needed?
PS, and as an alternative, could we implement our own 80% solution?
It isn't an earth-shattering issue for sure, but 6 years later it is still a pain point and I still post-process lots of hledger output to fix it. 80% is probably better than nothing and I understand the concern over adding non-native dependencies. I do find it hard to believe that Haskell doesn't have anything native to handle this properly though. It's tricky but not that hard. Unfortunately my Haskell has not been exercised very much in the last few years as Rust, Lua, and some shell stuff have dominated my work.
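For illustration only, an "80% solution" of the kind asked about above could be sketched with nothing but `base`: lower-case each character, fold a few accented letters onto their base letters, and sort on the resulting key. Everything here is an assumption made for the example: the names `accentTable`, `foldChar`, `sortKey` and `sortInsensitive` are hypothetical, the table is a deliberately tiny sample, and this is not what the PR implements (the PR uses text-icu).

```haskell
import Data.Char (toLower)
import Data.List (sortOn)
import Data.Maybe (fromMaybe)
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical, deliberately incomplete table: accented letter -> base letter.
-- A real solution would need full Unicode decomposition data.
accentTable :: [(Char, Char)]
accentTable =
  [ ('á','a'), ('à','a'), ('â','a'), ('ã','a'), ('ä','a')
  , ('é','e'), ('è','e'), ('ê','e'), ('ë','e')
  , ('í','i'), ('ó','o'), ('õ','o'), ('ö','o')
  , ('ú','u'), ('ü','u'), ('ç','c'), ('ñ','n')
  ]

-- Lower-case a character, then strip its accent if the table knows it.
foldChar :: Char -> Char
foldChar c = fromMaybe lc (lookup lc accentTable)
  where lc = toLower c

-- The key used for comparison; the original string is left untouched.
sortKey :: String -> String
sortKey = map foldChar

-- Case-insensitive, (partially) accent-insensitive sort.
sortInsensitive :: [String] -> [String]
sortInsensitive = sortOn sortKey

main :: IO ()
main = do
  hSetEncoding stdout utf8  -- avoid locale-dependent encoding errors
  mapM_ putStrLn (sortInsensitive ["zebra", "Ãcai", "apple", "Banana"])
```

A sketch like this ignores anything outside its table (other scripts, digraphs, locale-specific rules), which is the gap a real collation library is meant to close.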
Maybe there is something native. For #2319 we found the Good international support is definitely a goal for hledger, though lower priority than installability and maintainability.
Some interesting things to review here: https://hackage.haskell.org/packages/search?terms=text-icu
https://hackage.haskell.org/package/unicode-collation Presumably this is what Pandoc uses, and seems to be a pure Haskell library that covers the necessary ground.
Exactly... that looks great.
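As a rough sketch of what using unicode-collation might look like, the snippet below sorts a list of `Text` values with the library's root collator. It assumes the `unicode-collation` and `text` packages are available; `sortUnicode` is a hypothetical name chosen for this example, and how hledger would actually wire this into its commands is not settled in this thread.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (sortBy)
import Data.Text (Text)
import qualified Data.Text.IO as T
import System.IO (hSetEncoding, stdout, utf8)
import Text.Collate (collate, rootCollator)

-- Sort with the Unicode Collation Algorithm's root (locale-neutral) ordering.
-- sortUnicode is a hypothetical helper name, not part of any library.
sortUnicode :: [Text] -> [Text]
sortUnicode = sortBy (collate rootCollator)

main :: IO ()
main = do
  hSetEncoding stdout utf8  -- avoid locale-dependent encoding errors
  mapM_ T.putStrLn (sortUnicode ["zebra", "ãcai", "Apple", "banana"])
```

Because the comparison is a plain `Text -> Text -> Ordering` function, it can be passed to `sortBy` anywhere a default `Ord`-based sort is used today.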
Looking good! I imagine when you are ready you'll squash this into one commit. We'll want all commands to sort the same way, ultimately. I see some sorting-related functional tests for three commands:
But perhaps it'll be easier to start a new unicode-sorting.test file just for this.
At the moment I haven't even gotten my (slightly dated) GHC toolchain from Arch Linux to cooperate and build anything to test. Old toolchains have issues with the current deps, but using a new version has other issues too; namely, I got lost in this snafu when trying to use 9.10.1 via But yes, I'll look at testing all three bits we're sorting.
This is an extract from #1063 with the ICU-related bit separated out for testing and review.
...and as long as it's being run separately here, there are probably a few other places that could take advantage of better Unicode support.
Additionally, if there is a way (a sorting flag?) to do phonetic interspersed sorting instead of per-alphabet sorting, that would be an interesting alternative as well.