
Commit ca27299

Reranking on the client side
1 parent c6f5300 commit ca27299

1 file changed, +108 -0 lines changed

---
title: "Re-ranking search results on the client side"
date: "2024-11-03"
author: ["Daoud Clarke"]
---

By many measures, [Mwmbl](https://mwmbl.org) is doing great. We have indexed over half a billion pages, we have over 4,000 registered users, and over 30,000 curations from those users. Our volunteers are crawling around 5 million pages a day.

But the score that I care about most right now is [NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain). This measures the quality of our search results against a "gold standard", which is just Bing search results for the same query. Obviously, we are not ultimately aiming to be like Bing, so eventually we will stop using Bing and start using our curated data, once we have enough and the quality is high enough. But we are far enough away from being good that moving in a Bing-like direction is great, for now.

That's because our NDCG score is pretty poor. A score of 1 would be "matches Bing exactly", while a score of 0 would be "nothing in common with Bing". We are scoring 0.336 on our test set. However, most of that comes from sticking Wikipedia results at the top, by querying Wikipedia with their [excellent API](https://www.mediawiki.org/wiki/API:Search) and using their ranking. Without that, we were scoring 0.059. Using Wikipedia alone, we would score 0.297.
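
For anyone unfamiliar with the metric, here is a minimal sketch of how NDCG can be computed, using the common 2^rel - 1 gain with a logarithmic position discount. This is just an illustration of the formula, not Mwmbl's actual evaluation code:

```rust
/// Discounted cumulative gain for a list of relevance grades in ranked
/// order, using a 2^rel - 1 gain and a log2 position discount.
fn dcg(relevances: &[f64]) -> f64 {
    relevances
        .iter()
        .enumerate()
        .map(|(i, rel)| (2f64.powf(*rel) - 1.0) / ((i as f64) + 2.0).log2())
        .sum()
}

/// NDCG: the DCG of the ranking we produced, divided by the DCG of the
/// ideal ordering (here, the ordering implied by the gold standard).
fn ndcg(relevances: &[f64]) -> f64 {
    let mut ideal = relevances.to_vec();
    ideal.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let ideal_dcg = dcg(&ideal);
    if ideal_dcg == 0.0 {
        0.0
    } else {
        dcg(relevances) / ideal_dcg
    }
}
```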

I've experimented off and on with [learning to rank](https://en.wikipedia.org/wiki/Learning_to_rank). This was the industry standard for ranking before large language models, so it seemed like an obvious place to start. Implementing a lot of [standard features](https://www.microsoft.com/en-us/research/project/mslr/) improved the results slightly over my original intuitively defined features, but still only gave us an NDCG score of 0.365.

Nevertheless, after trying for a while to improve on this further, I decided to deploy this ranking instead of the heuristic that we were using before. It didn't seem to make the results worse, and it seemed to be fast enough, at least on my local machine.

I didn't realise it at the time, but this was when things started breaking. You see, we only have one server. We don't do actual crawling on the server, since that is done by volunteers. But we have to process the crawl data to find new links, prioritise those links for crawling, preprocess the data for indexing, and index it, as well as serving search results. All on one server.

And it turned out that adding a ton of features and using XGBoost for every search query added enough load to the server that things started to slow down. Even loading the main page could take three or four seconds. At the time, I didn't realise that this was causing the problem, since a bunch of other stuff was happening. We had some enthusiastic volunteers who were crawling many millions of pages a day.

Search got so slow that I decided we had to turn off the crawling. I made the server that sends batches to crawl return an empty list. I turned off the scripts that update the crawl queue and index new pages. And things got better.

But they still weren't as good as they were before. It took me a long time to realise that it was the change in ranking that was the problem.

In retrospect, we should have had better monitoring around everything so that we could more easily identify the cause of the slowdown. But this is a project that I do for fun, so I've put off doing this kind of thing, because I find it boring. Boring is not fun. But a broken website is also not fun. You live and learn.

Now I've changed the ranking back to the old heuristic and turned on the crawling again. I still think learning to rank has potential. But we can't afford it with our current resources. So I've started looking at alternatives.

What would make ranking almost infinitely scalable? Not doing it ourselves, but getting our users to do it, on the client side. This works really well for us, because we don't rank like a normal search engine. Our results are already ranked for each unigram and bigram query. We pull out the pre-ranked results for each unigram and bigram in a user's query, then re-rank them using our heuristic. This is perfectly feasible to do on the client side, since we don't have more than around 30 results per unigram or bigram on average.
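
To make that concrete, here is a minimal sketch of the merge-then-re-rank step. The `heuristic_score` function below is a placeholder (it just counts query terms appearing in the title); Mwmbl's real heuristic and data structures are more involved:

```rust
use std::collections::HashMap;

/// A search result as returned by the index for a single unigram or bigram.
struct SearchResult {
    url: String,
    title: String,
}

/// Placeholder heuristic: count how many query terms appear in the title.
fn heuristic_score(query: &str, result: &SearchResult) -> usize {
    let title = result.title.to_lowercase();
    query
        .to_lowercase()
        .split_whitespace()
        .filter(|term| title.contains(*term))
        .count()
}

/// Merge the pre-ranked result lists for each unigram and bigram in the
/// query, deduplicate by URL, and re-rank with the heuristic.
fn rerank(query: &str, term_results: Vec<Vec<SearchResult>>) -> Vec<SearchResult> {
    let mut by_url: HashMap<String, SearchResult> = HashMap::new();
    for results in term_results {
        for result in results {
            by_url.entry(result.url.clone()).or_insert(result);
        }
    }
    let mut merged: Vec<SearchResult> = by_url.into_values().collect();
    merged.sort_by_key(|result| std::cmp::Reverse(heuristic_score(query, result)));
    merged
}
```

Because each unigram and bigram only has a few dozen pre-ranked results, the merged candidate set stays small, which is what makes doing this in the browser cheap.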

It also gave me the final push I needed to implement ranking in Rust. I know that Python is a bad choice for a search engine, but I chose it because I knew I could build something quickly, and that I would probably give up at some point if I tried to do it in Rust. Yes, I know Java and C++, which would have been fine choices, but they would not be fun. This project has to be fun or it will not happen.

I am a beginner in Rust, and doing two hard things at the same time - building a search engine and learning Rust - seemed like a recipe for disaster. But I can build small bits in Rust, especially if I have already built them in Python. So over the last few days, I've [rebuilt our heuristic ranking in Rust](https://github.com/mwmbl/rankeval/). The Rust compiles to WebAssembly, which will eventually be called from our [excellent new front end](https://alpha.mwmbl.org/). Now all the back end needs to do is pull out pre-ranked results for each unigram and bigram. These can be easily cached, and potentially even kept in cloud storage. If we are ever going to serve millions of users on our shoe-string budget, then this is how we will have to do it.
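
As an illustration of the WebAssembly side, here is a minimal sketch of how a re-ranking function could be exposed to the front end with `wasm-bindgen` and `serde-wasm-bindgen`. The types, field names and the trivial scoring below are assumptions for the example, not the actual rankeval interface:

```rust
use serde::{Deserialize, Serialize};
use wasm_bindgen::prelude::*;

/// The shape of a result passed across the JS/WASM boundary.
/// These field names are illustrative, not the real rankeval types.
#[derive(Serialize, Deserialize)]
pub struct SearchResult {
    pub url: String,
    pub title: String,
    pub extract: String,
}

/// Exposed to JavaScript: takes the query and the pre-ranked results
/// fetched from the back end, and returns them re-ranked.
#[wasm_bindgen]
pub fn rank(query: &str, results: JsValue) -> Result<JsValue, JsValue> {
    let mut results: Vec<SearchResult> = serde_wasm_bindgen::from_value(results)
        .map_err(|e| JsValue::from_str(&e.to_string()))?;
    // Placeholder ordering: count query terms appearing in the title.
    // The real heuristic would score each result more carefully.
    let query = query.to_lowercase();
    results.sort_by_key(|r| {
        let title = r.title.to_lowercase();
        std::cmp::Reverse(
            query
                .split_whitespace()
                .filter(|term| title.contains(*term))
                .count(),
        )
    });
    serde_wasm_bindgen::to_value(&results).map_err(|e| JsValue::from_str(&e.to_string()))
}
```

The front end then only needs to fetch the cached pre-ranked results for each term and hand them to this function, which is what keeps the server-side cost so low.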

If you would like to get involved, join us on our [Matrix server](https://matrix.to/#/#mwmbl:matrix.org), or send a pull request to fix my rubbish Rust code. Thank you for reading!
