Add cache timeout of 5s, now works with offline git commits #192
Conversation
My plan (#155) was to make a module to set up an external git repo that can be synced with, and automatically invalidate the cache when it pulls changes from remote. Unfortunately I haven't quite found the motivation to actually make it yet.
Yes, all cached items which use the global Python object 'cache' (which is a Flask 'Cache' object) now expire after 5s.
It shouldn't. I have used the `git log` command with large git histories (like the Android AOSP source, Linux kernel source, long-running software projects, etc.) and it's always very fast. I don't see any reason it should be slow in Realms-wiki. If you can provide a public git repo of Markdown files, with a large history, I will try to repro the "slow" history page you describe. If it takes "a very long time", then that is a bug that should be fixed in Realms-wiki... I would be happy to help with that, as I am putting Realms-wiki into production use here.
A busy site would likely utilize a reverse proxy cache using Varnish or nginx, or a CDN like Cloudflare. Also, to invalidate the server-side cache more precisely, a git hook could be used.
Some of the history for our site (http://flexget.com) can take up to 30 seconds to load on our server. (Here's the git repo for it: https://github.com/Flexget/wiki) It isn't so bad now that the actual items are loaded in the background, as at least the page comes up quickly.
I think something like this would be nicer than invalidating the cache every 5 seconds by default. If we ever make an integrated plugin to sync with a remote repo, the invalidation setup would be greatly eased.
Or, could we make this just apply to the wiki pages and not the history cache? The history cache is built so that the old bits can stay cached and will get updated with only newer items when needed. |
No, I think that would defeat the purpose, which is to make Realms-wiki compatible with out-of-band editing of the wiki's .md files with pure git. An out-of-date history would be displayed to Realms users, but not command-line git users, if the history cache did not expire.
That is a straight-up bug. This is something that can be fixed. Maximum page load time should be 1 to 5 seconds. On my ~10 year old AMD Athlon(tm) 64 X2 Dual Core, with Realms-wiki running through gunicorn and SSL, most of my lightweight test pages take less than 0.5 s to load. I see your site is public; would you be willing to publish your server-side git database to GitHub? I'd like to run timing tests, first using Realms, then command-line git, to see where the slowness is. Perhaps the usage pattern of git requires using git gc or git prune, or perhaps there is a parser recursion someplace, or perhaps SQLite (or whatever your database server is) is the culprit. Or perhaps I will be unable to reproduce it (could the problem depend on the memory or disk I/O limits of your server?). As an aside, there are several types of backend storage supported by Flask's Cache (see their docs).
Maybe trying a different cache backend is the correct solution... anyway, I'd like to fix the bug. 30 seconds is too long.
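For what it's worth, switching Flask-Cache backends is a small config change. A minimal sketch, assuming Flask-Cache's documented config keys (the app wiring and Redis URL are illustrative, not Realms' actual setup):

```python
# Sketch: selecting a different Flask-Cache backend. Config keys come
# from the Flask-Cache docs; the app wiring here is illustrative.
from flask import Flask
from flask_cache import Cache

app = Flask(__name__)
cache = Cache(app, config={
    "CACHE_TYPE": "redis",  # alternatives: "simple", "memcached", "filesystem", ...
    "CACHE_REDIS_URL": "redis://localhost:6379/0",
})
```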
The bug you referenced had a comment proposing two possible causes:
In Gollum, both of those things are done by Ruby libraries that Realms-wiki does not use. (The "Poor caching" mentioned there was a reference to needing multiple backend types, like Flask has.) I would be surprised if the root cause of your 30 s page time is the same. I agree that a 5 s cache timeout is a simplistic solution. A git hook would be better. It would be best if the git hook could invalidate just the affected pages' cache entries, rather than flushing the whole thing.
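To make that concrete, here is a hypothetical post-receive hook along those lines. Everything in it is an assumption for illustration: the Redis backend, the cache key format, and the hook living on the server-side repo are not part of Realms today.

```python
#!/usr/bin/env python
# Hypothetical post-receive hook: invalidate only the cache entries for
# pages changed by the push. Key names and the Redis layout are assumed
# for illustration; they are not Realms' actual scheme.
import subprocess
import sys

import redis

r = redis.StrictRedis()

for line in sys.stdin:  # post-receive receives "<old> <new> <ref>" lines
    old, new, ref = line.split()
    changed = subprocess.check_output(
        ["git", "diff", "--name-only", old, new]).decode().splitlines()
    for path in changed:
        if path.endswith(".md"):
            r.delete("page:" + path)  # hypothetical cache key format
```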
The way the history cache works (or at least the way I intended it to) is that it compares the current HEAD to the cached head from the history cache, and generates any missing pieces between HEAD and the last cached item. I think that should still work even when we don't invalidate the history cache.
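A minimal sketch of that incremental scheme, using dulwich (which Realms uses); the cache layout and helper names are assumptions, not Realms' actual code:

```python
# Walk only the commits between the new HEAD and the previously cached
# head; entries older than the cached head stay cached untouched.
# Usage (illustrative): update_history(Repo("wiki"), cache, b"home.md")
from dulwich.repo import Repo

def update_history(repo, cache, page):
    # `page` is the file path as bytes; cache layout is assumed.
    cached = cache.get(page) or {"head": None, "entries": []}
    head = repo.head()
    if cached["head"] != head:
        exclude = [cached["head"]] if cached["head"] else []
        walker = repo.get_walker(include=[head], exclude=exclude, paths=[page])
        new = [entry.commit for entry in walker]
        cached = {"head": head, "entries": new + cached["entries"]}
        cache.set(page, cached)
    return cached["entries"]
```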
Already on GitHub, here: https://github.com/Flexget/wiki. Would be awesome if you can speed this up. I couldn't get it to work fast, which is why I implemented the somewhat complicated history cache capable of filling in the gaps and only loading a page at a time. It would be great to remove all that if we can make it work faster.
gollum/gollum#1078 (comment) Here you can see they traced the slowness to querying the underlying git repo for the commit history of a given file. They decided on the same solutions I had in Realms, which was to eliminate the call to determine the latest commit to a file when viewing a page, and only do that with AJAX in other places, due to the inherent slowness. I had assumed this very similar problem was because of the way git works: you have to load a lot of the git blobs to do this tracing of history for a file, which manifested in both the Python and Ruby git libs. I could be wrong though; once again, it would be great if you can figure out something to speed it up.
There isn't a database involved; logins are via OAuth, and I'm pretty sure that's the only thing Realms uses the db for. The slowness could certainly be due to disk limits of the server, since doing that trace requires loading a bunch of the git blobs off the disk, but I think for that reason it's good to keep the already generated history cached (if we can't figure out a significant speedup).
flexget.com is using redis currently, but everything is nice and quick when it's cached already. The problem is just when it has to generate the history cache.
To expand on what I mean by git history inherently requiring loading many blobs from the disk, here is what it must do to get the history for any given wiki file in the repo: read every commit object from disk, across the entire history of the git repo, all the way from HEAD back to the initial commit. For every one of those commits, it also has to load the associated tree object from disk and scan it to see whether that commit affected the file in question. When the git repo gets a lot of commits, and a lot of files, that's quite a bit of work.
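In dulwich (Realms' git library), that walk looks roughly like this; the repo path and file name are illustrative:

```python
from dulwich.repo import Repo

repo = Repo("wiki")
# get_walker() starts at HEAD and visits every commit, diffing each
# commit's tree against its parent's tree to decide whether any of
# `paths` changed -- exactly the blob-loading work described above.
for entry in repo.get_walker(paths=[b"home.md"]):
    commit = entry.commit
    print(commit.id.decode(), commit.message.decode("utf-8", "replace").splitlines()[0])
```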
Thanks for the references. I will take a look within the next couple of days and try to repro the issue.
I disagree that it was traced to the underlying git repo. That comment specifically mentions the slowness is in a Ruby function called page.last_version, and furthermore, it says the slowness is caused by their Ruby code instead of raw git:
(Emphasis mine.) So my money is still on a bug in the Realms Python, and not in the git database itself. Time for science. I'll report back to this thread.
@gazpachoking, could you post the full paths to 2 or 3 of your slowest history pages on Flexget? Please list the exact pages where you see the symptom. I will diagnose those pages in particular. The slower the better. (Please post here publicly so others can also reproduce and/or diagnose this issue.) Thanks!
The slowest history load times will be for any page that has only enough commits to fit on one page of the history view. The time should be roughly the same for any such page, as it has to trace the entire history of the repo in one go to generate the history items for that file. It's a bit hard to link to them, because as soon as one of those history pages is visited it gets cached and the problem disappears for that page. If you go to the index page and pick a random page you are pretty likely to come across one; we have a few heavily edited pages, but a bunch of them probably only have a few commits and will show the issue.
Hmm, that was true last time I tested, but it seems maybe something isn't quite working as planned with the cache even on our deployed version. Here is one such page: https://flexget.com/_history/PluginError (who knows whether it will already be cached when you try). EDIT: A couple more such pages: https://flexget.com/_history/TerminalTable https://flexget.com/_history/Searches/divxatope
I guess I'm actually not even super fussed about history load time anymore now that the items are loaded with AJAX. History isn't needed so commonly anyway. If we can speed it up, great; otherwise I'm down for whatever everyone thinks is best.
That Ruby function had to trace the history commits in the git repo just like our history page does. And our git library (dulwich) is pure Python (EDIT: I forgot about the C extensions in dulwich; just double checked, and they are enabled on that server), which is why I think it's slow for the same reason their pure Ruby git library was slow at that operation. EDIT: To elaborate, in a git repo with many commits, calling their Ruby function page.last_version on a worst-case page that was only edited in the initial git commit of the repo would cause the same history trace all the way down to the root commit, just like a full history of the page would.
I am able to reproduce the issue. The Realms AJAX call to get history for the home page (home.md) takes about 5 seconds. This is a ridiculously long time, as home.md only has 311 commits. In contrast, I can get the entire history of home.md with command-line git (e.g. `git log -- home.md`) in less than a quarter-second.
My conclusion is that the bug is not in the git database, it is somewhere in Realms or Dulwich.
I do not understand why this is necessary. Surely there is an equivalent of `git log filename.md` under Dulwich? I've never used Dulwich before, so I'll dig into it more later today. Thanks again.
It doesn't matter how many commits the actual file has, only how many the whole repo has. The way git works under the hood, you cannot trace back the history of commits for just one file; you must trace from the head commit back, checking whether each commit touched the file. Now that I think about it though, there might be an easy optimization to be added: I'm not sure whether it stops searching at the commit where the file was added. It might search all the way back to the root commit no matter what. I could be wrong, but it deserves some investigation.
My assumption was that was just a much faster C implementation.
(cont'd) First, a note: Realms is walking the commit tree, searching for relevant commits. Newer versions of Dulwich have a very similar method called dulwich.porcelain.log(), which does the same thing (but it prints to STDOUT instead of returning a list of Python commit objects). Eliminating Realms entirely and just calling dulwich.porcelain.log() is also very slow. The following line of Python code takes several seconds to execute:

```python
dulwich.porcelain.log("../../Flexget/wiki/", paths=iter(["home.md"]))
```

It seems that walking the Dulwich commit tree in Python is just plain slow. We are not the first ones to be burned by this. Read this code comment I found in another Python project, at https://github.com/jonashaag/klaus/blob/master/klaus/repo.py:

```python
def history(self, commit, path=None, max_commits=None, skip=0):
    """Return a list of all commits that affected `path`, starting at branch
    or commit `commit`. `skip` can be used for pagination, `max_commits`
    to limit the number of commits returned.

    Similar to `git log [branch/commit] [--skip skip] [-n max_commits]`.
    """
    # XXX The pure-Python/dulwich code is very slow compared to `git log`
    # at the time of this writing (mid-2012).
    # For instance, `git log .tx` in the Django root directory takes
    # about 0.15s on my machine whereas the history() method needs 5s.
    # Therefore we use `git log` here until dulwich gets faster.
    # For the pure-Python implementation, see the 'purepy-hist' branch.
    cmd = ['git', 'log', '--format=%H']
    if skip:
        cmd.append('--skip=%d' % skip)
    if max_commits:
        cmd.append('--max-count=%d' % max_commits)
    cmd.append(commit.id)
    if path:
        cmd.extend(['--', path])
    output = check_output(cmd, cwd=os.path.abspath(self.path))
    sha1_sums = output.strip().split(b'\n')
    return [self[sha1] for sha1 in sha1_sums]
```

Personally, I like his solution of simply shelling out to `git log`. Very few lines of code. And I found this comment at https://pypi.python.org/pypi/gitmit/0.1.
So it seems that the Dulwich commit object hierarchy is to blame. I will look at the git C code. I don't know if it is faster simply because of C, or if it has a smarter way to find relevant entries in the tree. But right now it looks like this Realms bug fix will boil down to one of these: fixing the slow tree walk in Dulwich itself, switching to libgit2, or shelling out to command-line `git log`.
It seems like fixing this in Dulwich would be generally useful, so I will take a stab at that first. But I don't hold high hopes of finding a magic walker algorithm in git's C code. Assuming that fails, what are your thoughts on libgit2 vs. `git log ...`?
(cont'd) I did a deep dive into Dulwich, and also read through much of the command-line git C code. I used the Python speed test script for Dulwich posted here, which includes profiling (it's an attachment to comment #4).

As a simple example, I found that this one line of code, which provides the entire Flexget commit history, takes about 1 second on my old hardware:

```python
dulwich.porcelain.log("../../Flexget/wiki/")
```

However, to get the history for only home.md, it takes over 20 seconds! Note the additional argument paths:

```python
dulwich.porcelain.log("../../Flexget/wiki/", paths=iter(["home.md"]))
```

The big jump in slowness takes place in Dulwich's walk.py, in the method Walker._should_return(), which is called on each commit to decide whether it should be filtered out (depending on whether the commit shows a change to one of the files listed in paths[]). This test is done very inefficiently. Walker() walks over every commit, getting all of the deltas between it and its parent (by recursively walking over all associated git Tree objects and comparing file revisions). It then only returns the commit if its delta matches one of the supplied paths. This means (basically) that every commit gets its entire Tree object hierarchy walked twice: once to check against its parent, and once again to check against its child. (Most of the Tree objects get walked multiple times, as they are used by multiple commits.) There is no caching of the results of walking the Trees. There is no early filtering of git Trees with a previously seen revision, nor pruning of sub-Trees that don't contain the paths we care about. In contrast, the command-line git C code uses optimizations like pruning while walking, and stopping when the requested path is unreachable. (Command-line git has bit flags like "seen" and "uninteresting" that get set on objects as it walks the tree.)

But the real reason for Dulwich's slowness is its .pack file implementation, in pack.py and _pack.c. It does a stat(), lseek(), and read() every time it goes to grab an object (like a git Tree) out of the pack file. I used the strace utility to see this -- 104K system calls to find the log for home.md:

```
(.venv) ubuntu@cst-wiki:~/realms-wiki/.venv$ strace python ./speed_test.py ../../Flexget/wiki 2>&1 | grep -E 'read|stat|seek' | wc
 104284  566792 9620847
```

Command-line git, in contrast, does a grand total of 87 of those calls to get the same data. No wonder it's 100x faster.

I will take another look at Dulwich's pack.py next. It might be possible to get a quick speed improvement by putting the whole pack file into RAM first to avoid all those system calls (it's about 15 MB, not bad by modern standards). Or maybe it could just do some simple caching. After that, there may still be some optimizations for class Walker(), but pack.py should definitely be fixed first. This will probably not be a quick fix. So, my recommendation is to just shell out to command-line `git log ...` for now, like the example code above.
I tried modifying Dulwich to use a memory-mapped (mmap) file for the pack file. I also hacked around a bug where it would stat the pack file every time it tried to read from it. That successfully reduced the number of read/stat/seek system calls from 104K to just 688, but it only sped up the search by about 1 second (out of 20). So all those syscalls are not the major offender here (but they don't help, either).

I looked closer at the profiling data. The way Dulwich walks the tree does about 4 million calls to <string>:8(__new__), and also about 4 million calls to {built-in method __new__ of type object at 0x906bc0}. There are also 2 million calls to {dulwich._diff_tree._is_tree} -- why? -- and about 1 million calls to each of {method 'extend' of 'list' objects} and {method 'pop' of 'list' objects}.

So my revised conclusion is that the slowness is mostly due to the repeatedly-recursive way the git Tree objects are walked in Dulwich. I do not think there's a simple bug fix; IMHO, the files diff_tree.py and walk.py need to be redesigned from scratch. I'm still in favor of shelling out to command-line `git log ...`, but I haven't looked at libgit2 yet. I would be willing to look at libgit2 if that is a preferred solution, but I'm running out of time for this bug. @scragg0x , thoughts?
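For reference, the mmap experiment boils down to something like this; a sketch only, not the actual Dulwich patch (the pack file name is made up):

```python
import mmap

f = open(".git/objects/pack/pack-1234.pack", "rb")  # hypothetical pack name
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def read_object_bytes(offset, length):
    # Slicing an mmap issues no per-object stat/lseek/read system calls;
    # the kernel pages the file in on demand.
    return data[offset:offset + length]
```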
I will merge a PR that uses `git log` for now.
I looked at libgit2, which is written in C but has Python bindings. Sadly, it has the same basic issue as Dulwich, even though it's written in C. Here is one of several posts about it, titled "Why would it take nearly 10 seconds just to query one file's last commit log". In that post, the user wound up doing the same research I've done here. He also found that command-line git was 100x faster than libgit2, because git dynamically flags commit objects with optimization flags like SEEN, UNINTERESTING, TREESAME, CHILD_SHOWN, etc. It then uses these flags to early-out from the tree walk whenever possible. Official response from libgit2: "We have not implemented git log functionality." All of the libgit2 bug entries about git log were collected into one summary bug report, which is still open.

Thus, with libgit2 disqualified, my final recommendation is to shell out to command-line git for Realms-wiki page histories. (I'll implement it if @scragg0x agrees and would accept the pull request.)
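For the record, the hand-rolled walk that post describes looks roughly like this in pygit2; the repo path and file name are illustrative, and TreeEntry attribute names vary between pygit2 versions:

```python
import pygit2

repo = pygit2.Repository("wiki")

def touched(commit, path):
    # Did this commit change `path`? Compare the blob id for the path
    # against the first parent -- the per-commit tree inspection that
    # command-line git avoids via its SEEN/TREESAME flags.
    def blob_id(tree):
        return tree[path].id if path in tree else None
    if not commit.parents:
        return blob_id(commit.tree) is not None
    return blob_id(commit.tree) != blob_id(commit.parents[0].tree)

history = [c for c in repo.walk(repo.head.target, pygit2.GIT_SORT_TOPOLOGICAL)
           if touched(c, "home.md")]
```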
Ha, beat me by 6 minutes. I'll get it to you within the next couple of days. Thanks for your time!
To me, the primary benefit of a git-and-file-based wiki is that one can also use plain git + text editor to maintain the wiki (in addition to using the beautiful browser interface).
Realms is currently incompatible with this type of usage, because it keeps its cached pages forever (unless the user makes a change using the Realms GUI). It will ignore a new git commit that was not made via the Realms GUI, because it finds its old cached HTML and thinks that old HTML is still valid.
This commit adds a 5 second cache timeout, thus causing offline git commits to show up no later than 5s later (when the Realms cache expires). The hard-coded magic number of 5s limits cache refreshes to at most 12 per minute per page, which still provides good caching for busy servers, but now allows us SSH+vi users to easily create and maintain pages.
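For reference, a minimal sketch of the change, assuming Flask-Cache's documented `cached` decorator (the app, route, and renderer are illustrative, not Realms' actual code):

```python
from flask import Flask
from flask_cache import Cache  # Flask-Cache; the maintained fork is flask_caching

app = Flask(__name__)
cache = Cache(app, config={"CACHE_TYPE": "simple"})

@app.route("/<name>")
@cache.cached(timeout=5)  # cached HTML expires after 5 s, so offline
def show_page(name):      # git commits become visible within 5 seconds
    return "rendered wiki page for %s" % name  # stand-in for the real renderer
```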