Better time zone workflow for JS target #46
@glennsl Any thoughts?
Hmm, I'm not sure where that would get us. Once you have the guesses, what would you do with them? Also, as I mentioned in #42, it is possible to query the JS environment's time zone database using …
This would allow loading the appropriate JSON file of the time zone, instead of loading every single time zone in tzdb.
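For concreteness, a minimal js_of_ocaml sketch of that kind of flow: query the environment for its zone name, then load only that zone's JSON. The Intl call is the standard browser API; the URL layout and the use of XmlHttpRequest here are illustrative assumptions, not timedesc's actual mechanism.

```ocaml
(* Sketch only: ask the JS runtime for its IANA zone name, then fetch just
   that zone's JSON instead of shipping the whole tzdb.
   The "/tzdb/<name>.json" path is an assumption for illustration. *)
open Js_of_ocaml

let local_zone_name () : string =
  Js.to_string
    (Js.Unsafe.eval_string
       "Intl.DateTimeFormat().resolvedOptions().timeZone")

let fetch_local_zone_json () : string Lwt.t =
  let open Js_of_ocaml_lwt in
  Lwt.map
    (fun frame -> frame.XmlHttpRequest.content)
    (XmlHttpRequest.get ("/tzdb/" ^ local_zone_name () ^ ".json"))
```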
That would likely be too slow at least for timere, correct. For timedesc, it may just be easier to ask the user to construct a JS time and convert it into Timedesc.t, if desirable at all.
Ah, so off-loading the async work to the consumer. Yeah that would be a nice option to have. It does have some significant architectural repercussions though, as the consumer will either have to pass this time zone around everywhere, or somehow deal with the possibility that the local timezone is not yet known.
It's definitely a trade-off, but apparently not bad enough for the successor of the most widely used JS date time library to note any performance impact in its documentation. It would be a nice option to have, and also nice to have some numbers to evaluate alternatives.
Then you'll just have a fixed-offset time zone with all the problems that come with that, right? A third option might be to offer a reduced time zone database without all the historical baggage. Spacetime manages to get down to 47 kB that way, which is quite acceptable compared to moment-timezone at 479 kB and date-fns-timezone at 922 kB, and compared (unfairly) to timere's own unminified full tzdb at 5.4 MB.
Timere needs access to the (full) transition table to resolve pattern-matching queries. Timedesc is comparable to Luxon, yes, though I'll have to think about how this would work out - we'll have to replace the stored transition tables with an oracle that interacts with …
True.
The approach of Spacetime (https://github.com/spencermountain/spacetime/blob/master/zonefile/iana.js) seems to be to just restrict to roughly the immediate next 12 months - namely, for Sydney we have …
which is just the next transition (without even the year). Similarly for https://github.com/vvo/tzdb. Both libraries seem to assume the application will always load the latest rebuilt library from NPM, and that you don't compute anything more than one year into the future. I'll add that I find their exact approach very fragile - trimming too much to save space. If that is indeed what suffices for JS usage, I am happy to add a tzdb-js backend that only includes very recent months of data.
I see. Yeah that certainly makes this trickier.
Yeah, for using …
Does the transition really happen at different times every year? Or is it just that there's a (very small) possibility that it will?
I think it's at least very unlikely that most front-end applications will deal with datetimes before 1970 or beyond, say, 10 years into the future. So filtering out anything outside that should already reduce the size quite a bit. For our specific use case, which is quite time-centric, we have some historical data going back to the 90s I believe. But that's just used for machine learning on the back end. On the front end we don't show historical data from before 2015, and only look about a year into the future. And if we were to display historical data somehow it would still just be with dates (or more likely weeks at the smallest granularity), so no offset is really needed anyway. I can also imagine other formats for the time zone database that would allow more flexibility with a smaller footprint, such as a rule-based format that enables computing the table on demand. A rule could be in the form …
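As a purely hypothetical illustration of the kind of rule meant here (field names and layout are made up for this sketch, not an actual timedesc or tzdb format):

```ocaml
(* Hypothetical rule-based entry: "from <year> onwards, switch to <offset>
   at the <nth> <weekday> of <month>, at <time> local".
   All names here are illustrative only. *)
type transition_rule = {
  from_year : int;        (* first year the rule applies *)
  month : int;            (* 1..12 *)
  nth_weekday : int;      (* 1 = first, 2 = second, ..., -1 = last *)
  weekday : int;          (* 0 = Sunday ... 6 = Saturday *)
  local_time_s : int;     (* seconds past local midnight *)
  utc_offset_s : int;     (* UTC offset in effect after the transition *)
}

(* Roughly Australia/Sydney's DST start since 2008:
   first Sunday of October, 02:00 local, switching to UTC+11. *)
let sydney_dst_start = {
  from_year = 2008;
  month = 10;
  nth_weekday = 1;
  weekday = 0;
  local_time_s = 2 * 3600;
  utc_offset_s = 11 * 3600;
}
```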
For most places afaict, yes - I think government bodies tend to define it around "nth weekday of some month" or something like that? But no strict rule on a specific pattern being followed. I actually don't know if the date then becomes fixed if we switch to ISO week calendar - I'll have to check.
Fair enough - sounds like a tzdb-slim (or some better name) will be a good addition.
Transitions in the general case cannot be algorithmically generated; it might be possible for contemporary periods using the ISO week calendar as suggested above - I don't know for certain yet.
I checked and it seems to be the case for Australia/Sydney for 2021/2022 that you can state the split using a fixed ISO week date (varying year), but I don't have time to write code to check if that's always the case for all time zones in recent ±N years at the moment. Another (simpler) idea is to simply pass the table through a compression pass, which might already yield a fair bit of savings.
Ah, of course they do 🙄
I don't see why not. In the worst case you'd just have a "rule" for every transition, which is effectively the status quo (with a bit more overhead). Anyway, I'm not trying to say that the specific rule format I suggested would be optimal, or even better than a table, but just to seed the idea that the database doesn't have to be stored as a bulky lookup table, but could be (partly) computed on demand based on some kind of rules. Coming up with the actual rules, format and encoding process is of course still a non-trivial task.
Promising. Thanks for checking! This is just optimization, and so not very time sensitive, but I might take this on as a fun challenge if and when I get some time.
There's definitely lots of potential for compression, but they ought to be transferred in gzipped form already, and so I don't imagine traditional compression would do much for transfer size. It could help with memory use though, which is especially important on mobile. It would also be easy to remove quite a bit of bloat just by transforming the entries from `[ "3920576400", { "is_dst": false, "offset": 3600 } ]` to `[ 3920576400, 0, 3600 ]`.
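A sketch of that entry rewrite, assuming the tzdb JSON shape quoted above and using Yojson at build time (names and the minimal error handling are illustrative; assumes a 64-bit OCaml runtime for the timestamp):

```ocaml
(* Rewrite one entry from
   [ "3920576400", { "is_dst": false, "offset": 3600 } ]
   into the flatter [ 3920576400, 0, 3600 ] form. *)
let flatten_entry (entry : Yojson.Basic.t) : Yojson.Basic.t =
  match entry with
  | `List [ `String ts; `Assoc fields ] ->
    let is_dst =
      match List.assoc_opt "is_dst" fields with
      | Some (`Bool b) -> b
      | _ -> false
    in
    let offset =
      match List.assoc_opt "offset" fields with
      | Some (`Int n) -> n
      | _ -> 0
    in
    `List [ `Int (int_of_string ts);       (* timestamp as a plain int *)
            `Int (if is_dst then 1 else 0);
            `Int offset ]
  | other -> other
```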
Re compression: I was primarily thinking at the marshalled level (so it'd be without the clutter of JSON already). It might also be even better to switch to a hashtable as the underlying store... EDIT for context: the install process looks like …
Right, yeah I had no idea what that process was. Thanks for the context! Here's another interesting idea. Try running this JS code on one of the JSON time zone objects:

```js
{
  var last = 0;
  tz.table.map(([time, {offset}]) => {
    let delta = time - last;
    last = time;
    return [delta, offset];
  })
}
```

It's not hard to see the recurring patterns, but also that there's a very small set of unique entries. Which means you could encode this in a two-level lookup table. E.g.:

```js
{
  "L1": {
    1: [ 16934400, 3600 ],
    2: [ 14515200, 0 ],
    ...
  },
  "L2": [1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 1, ...]
}
```

with the entries in … I imagine you could do much better than this by taking the recurring patterns into account, but this could already reduce the database size by 80% if we're lucky, I think.

Edit: Also, what's the significance of the …
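A rough sketch of decoding such a two-level table back into absolute transition times, assuming L1 maps small ids to (delta, offset) pairs and L2 is the id sequence (illustrative names, not timedesc's actual representation):

```ocaml
(* Summing the deltas in L2 order recovers the absolute timestamps. *)
let decode_two_level
    ~(l1 : (int * int) array)   (* id -> (delta_seconds, utc_offset) *)
    ~(l2 : int array)           (* one id per transition *)
  : (int64 * int) array =
  let last = ref 0L in
  Array.map
    (fun id ->
       let delta, offset = l1.(id) in
       last := Int64.add !last (Int64.of_int delta);
       (!last, offset))
    l2
```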
Hm...adjusting to a different binary encoding would indeed save us a lot of space, but then the initialisation of timedesc could be significantly slower compared to just unmarshalling the string - I don't know if this is the case.
Would it be hard to unmarshal on demand instead?
A few examples of the potential effectiveness of this encoding: …
Come to think of it, the main table is just a pair of arrays, so it is really as efficient as one can be when maximising for both space efficiency and initialisation efficiency. The change would be transparent as type …
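To make "a pair of arrays" concrete, a purely illustrative shape (field and type names are made up, not timedesc's actual definitions):

```ocaml
(* Illustrative only: the i-th timestamp and the i-th entry describe the
   same transition. *)
type entry = {
  is_dst : bool;
  offset : int;                 (* UTC offset in seconds *)
}

type table = int64 array * entry array
(* fst: transition timestamps (seconds since epoch), sorted ascending
   snd: the entry in effect from that timestamp onwards *)
```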
Ah right, you're mapping that way... right, we can then use a byte/char as the index, with dynamic unmarshalling... yeah, okay, this sounds like a good idea.
Yeah okay, this is very doable, just that adding the relevant testing will take some time (on top of me lacking a functional desktop right now for fuzzing).
Cool! Let me know if you want me to pitch in with something. I could for example write the encoding script. I don't know what you use as a source though.
Actually, this would probably be useful for many other projects as well, across languages, so perhaps we should set up a separate repository for the time zone database, with OCaml and JS packages to encode and decode. That would give it more real-world testing, some help with keeping it up to date, etc.
We need a … We can then just use …
Hm... yeah, interesting idea - our current representation is copied from Rust … One issue with that is that we'll then have to put extra time into making it future proof etc. (which, uh, would exceed my free time budget : D).
Sounds like a good plan. I don't quite understand the rationale behind this format though. A lookup table for offsets would offer some savings, but only 50% of the offset field at most, since it's replacing an int16 with an int8. The offset is only 2/7ths the size of the entry as a whole, however, which means it would amount to a 15% reduction in total size at most, even disregarding the overhead of the L1 table. What I show above is that, assuming … That would be something more like this: …
Rust WASM is already pretty bloated (I say as my jsoo bundle size just passed 10MB, unminified and whatnot, but still), so my guess would be no 😆 It's a good question, and definitely something that would be worth looking into. But from a quick search of their issues I can't find anything.
Ah, true! Maybe not at first then, but something we could have in mind for the future!
Oh, I searched the wrong repo! I found two issues mentioning size, neither of them related to WASM. The solutions they focus on seem to be various ways of filtering which timezones are included.
The original time zone database actually seems to be in a rule-based format already. Definitely not a simple format though. I think I prefer the compression scheme we've come up with here, even if it ends up a bit bulkier.
Right, relative time, right, okay. This requires investigating the exact upper bound of the gaps... which will need another set of code... though I think 32-bit is a safe bet, and we can just make the code gen raise an exception when a delta doesn't fit in 32 bits.
I was thinking we might just keep using int64, for simplicity. That will also allow using 0 as the starting point while still starting at "the beginning of history" (i.e. the first delta being Int64.min). Otherwise we'll also need a starting time field, I think. Going from int64 to int32 will also just yield 1.5 percentage points or so of additional reduction, because we've already reduced the total number of timestamps by 90%.
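A minimal sketch of that delta scheme with plain int64 timestamps, assuming 0 as the implicit starting point (function names are illustrative):

```ocaml
(* With a starting point of 0, a first delta of Int64.min_int decodes back
   to Int64.min_int, i.e. "the beginning of history". *)
let to_deltas (timestamps : int64 list) : int64 list =
  let rec go last = function
    | [] -> []
    | t :: rest -> Int64.sub t last :: go t rest
  in
  go 0L timestamps

let of_deltas (deltas : int64 list) : int64 list =
  let rec go last = function
    | [] -> []
    | d :: rest ->
      let t = Int64.add last d in
      t :: go t rest
  in
  go 0L deltas
```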
I added the serialisation side (with some more conservative choices in the layout for now), and the size of … Deserialisation is not implemented yet.
Swapping from 16-bit to 8-bit for the index shrinks it down to 667K.
Made some further adjustments to reduce the width of …
The format should be relatively obvious from …
That's awesome! Almost 85% reduction in size! I'm surprised the encoding of offset needs to be this involved though, but I guess that makes sense for arcane historical reasons. I see you've added a deserializer too now, so it seems you have it fully under control! I continue to be impressed by how fast you manage to make things happen after I suggest something. Thanks so much for this!
Yeah, always some oddities in tzdb sadly. I'll make a different PR for reducing the number of years we include in tzdb (reorganising the tzdb backends a bit).
: D thanks! Right now I'm doing the most boring part - testing the (de)serialisation (laptop not very good at running ever so slightly heavy tests...)
Okay, finished debugging - quickcheck catches a lot of errors as usual, namely …
I see it's been merged now. Nice work! It seems that the whole database is decompressed at the same time though, which means you'll still have to pay the memory and CPU cost of all the time zones even if you most likely just need one of them. This is especially unfortunate on mobile, which is generally more memory- and CPU-constrained, but also power-constrained. And just having things in memory consumes quite a bit of power, as I understand it. Would it be possible to have a map with time zones compressed individually, and only decompress them as needed? Also looking forward to seeing how much removing historic and far-future time zone transitions will affect the size now!
: D
Yep that's the intended direction, just wanted to make that a separate PR.
Indeed - what's a good default, you reckon? ±20 years?
Awesome!
I think most applications' needs are likely asymmetric, and my gut feeling is +10/-30 years. That'd cover the nineties, and therefore the whole mainstream digital/internet era. I also can't imagine most applications needing to project more than 10 years into the future.
Hey @glennsl, I've adjusted the generator pipeline and shrunk the year range to 1990 to 2040 …
Hey. I'm not sure how I'd be able to confirm those numbers independently, but it certainly seems plausible! I would expect most of the space to be taken up by irregularities, and most irregularities to occur in early history. I wonder how much it would actually increase if you include more recent history. As long as the pattern remains the same, it would just add a few more entries to the L2 lookup table. Would it be easy to check how much difference going back to 1970 would make, for example?
Yeah, I streamlined the year range specification, so you can do the adjustment in the …
Do a …, then finally do … (The order here is different since I was doing the …)
For 1850 - 2040, the command yields …
For 1850 - 2100, the command yields …
For 1970 - 2040, the command yields …
For 1970 - 2100, the command yields …
For 1990 - 2040, the command yields …
For 1990 - 2100, the command yields …
Hmm, interesting. The size increases by about 50% of the increase in time for 1990 vs 1970. Much bigger than I thought it would be. Compared to ~33% for 1990 vs 1850 and ~25% for 2040 vs 2100. Much better than 100% (or more) of course, but still quite expensive to do so just in case. Unless there are good reasons for extending the range I think 1990 is a good default. Thanks for getting the numbers and explaining the procedure!
I don't think I follow this: …
The number of years between 1990 and 2040 is 50, and between 1970 and 2040 is 70. That's a 40% increase in time covered. And the size for those ranges is 532K and 628K respectively, which is an 18% increase. Therefore the increase in size is about half that of the increase in time covered. Hope that's easier to follow (and also free of mistakes!)
The total is all the files combined though - we're only interested in comparing the sizes of one of the compiled object files or tzdb_compressed.ml. So for, say, 1970 to 2040, my understanding is that it's roughly 110K when compiled.
Ah yes, sorry. The difference is roughly the same though. For …
I am leaning toward 1970 - 2040 for the final publish - what do you think? I believe the tests cover everything we've added:
```ocaml
let lookup_record name : record option =
  match M.find_opt name !db with
  | Some table ->
    assert (check_table table);
    Some (process_table table)
  | None ->
    match M.find_opt name compressed with
    | Some compressed_table ->
      let table =
        Compressed_table.of_string_exn compressed_table
      in
      assert (check_table table);
      db := M.add name table !db;
      Some (process_table table)
    | None -> None
```

… but implicitly tested by the … If you don't spot anything missing, I am going to finalise timedesc.0.7.0 and submit later.
Sweet! Can't think of anything, no.
1970 is a very natural point in time that is likely to conform to users' expectations thanks to Unix time. I still think 1990 is sufficient for 99% of use cases though, and that it's relatively expensive to extend it to 1970 given that. Then again, we've already reduced it by quite a lot, so perhaps it's time to relax a bit 😄
Completed with timedesc (and timedesc-tzlocal-js) 0.8.0. One additional adjustment since then is the removal of the use of Marshal entirely, to make …
Right now Time_zone.local attempts construction using name guesses (strings) from Timedesc_tzlocal.local, but this would fail if the tzdb.none backend is picked. Run-time retrieval of the tzdb JSON file without being tied to a Lwt promise was not possible iirc, so perhaps one way forward is to expose the string guesses as an alias (Timedesc.Time_zone.local_string perhaps).
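As a purely hypothetical sketch of the shape such an alias could take (the signature below is an assumption for illustration, not timedesc's actual interface; it only assumes the guesses are a list of strings, per the text above):

```ocaml
(* Hypothetical interface sketch only - not timedesc's actual API. *)
module type TIME_ZONE_LOCAL = sig
  type t

  val local : unit -> t option
  (* Existing behaviour described above: try to construct a time zone from
     the name guesses; fails under the tzdb.none backend. *)

  val local_string : unit -> string list
  (* Proposed alias: expose the raw name guesses and let the consumer
     resolve them, possibly asynchronously on the JS side. *)
end
```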