---
title: "Re-ranking search results on the client side"
date: "2024-11-03"
author: ["Daoud Clarke"]
---

By many measures, [Mwmbl](https://mwmbl.org) is doing great. We have
indexed over half a billion pages, we have over 4,000 registered
users, and over 30,000 curations from those users. Our volunteers are
crawling around 5 million pages a day.

But the score that I care about most right now is
[NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain). This
measures the quality of our search results against a "gold standard",
which is just Bing search results for the same query. Obviously, we
are not ultimately aiming to be like Bing, so eventually we will stop
using Bing and start using our curated data, once we have enough of it
and the quality is high enough. But we are far enough away from being
good that moving in a Bing-like direction is great, for now.

Because our NDCG score is pretty poor. A score of 1 would be "matches
Bing exactly", while a score of 0 would be "nothing in common with
Bing". We are scoring 0.336 on our test set. However, most of that
comes from sticking Wikipedia results at the top, by querying
Wikipedia with their [excellent
API](https://www.mediawiki.org/wiki/API:Search) and using their
ranking. Without that, we were scoring 0.059. Using Wikipedia alone,
we would score 0.297.

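To make the metric concrete, here is a minimal sketch of how NDCG can
be computed. This is illustrative rather than our actual evaluation
code, and the relevance numbers in the example are made up.

```rust
/// Discounted cumulative gain: each result's relevance is discounted
/// by the log of its position, so the top few positions dominate.
fn dcg(relevances: &[f64]) -> f64 {
    relevances
        .iter()
        .enumerate()
        .map(|(i, rel)| rel / ((i as f64) + 2.0).log2())
        .sum()
}

/// NDCG: DCG of our ordering divided by the DCG of the ideal ordering,
/// so 1.0 means we match the gold standard exactly.
fn ndcg(relevances: &[f64]) -> f64 {
    let mut ideal = relevances.to_vec();
    ideal.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let ideal_dcg = dcg(&ideal);
    if ideal_dcg == 0.0 {
        0.0
    } else {
        dcg(relevances) / ideal_dcg
    }
}

fn main() {
    // Made-up example: relevance of our top five results judged against
    // Bing (3 = Bing's top result, 0 = not in Bing's results at all).
    let relevances = [1.0, 3.0, 0.0, 2.0, 0.0];
    println!("NDCG = {:.3}", ndcg(&relevances));
}
```

The log discount means that getting the first few positions right
matters far more than anything in the tail.
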
I've experimented off and on with [learning to
rank](https://en.wikipedia.org/wiki/Learning_to_rank). This was the
industry standard for ranking before large language models, so it
seemed like an obvious place to start. Implementing a lot of [standard
features](https://www.microsoft.com/en-us/research/project/mslr/)
improved the results slightly over my original intuitively defined
features, but still only gave us an NDCG score of 0.365.

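For a flavour of what those features look like, here is a small
illustrative sketch in the style of the MSLR feature set. The feature
names and the extraction helper are hypothetical, not our actual
feature code, and real feature sets run to over a hundred features,
feeding a model like XGBoost trained offline.

```rust
/// A handful of illustrative learning-to-rank features in the style of
/// the MSLR feature set (hypothetical, not Mwmbl's actual features).
struct Features {
    covered_query_term_ratio_title: f64, // fraction of query terms in the title
    covered_query_term_ratio_body: f64,  // fraction of query terms in the body
    url_length: f64,
    bm25_body: f64, // assumed to be computed elsewhere
}

fn extract_features(
    query_terms: &[&str],
    title: &str,
    body: &str,
    url: &str,
    bm25_body: f64,
) -> Features {
    let covered = |text: &str| {
        let lower = text.to_lowercase();
        let hits = query_terms
            .iter()
            .filter(|t| lower.contains(&t.to_lowercase()))
            .count();
        hits as f64 / query_terms.len().max(1) as f64
    };
    Features {
        covered_query_term_ratio_title: covered(title),
        covered_query_term_ratio_body: covered(body),
        url_length: url.len() as f64,
        bm25_body,
    }
}
```
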
Nevertheless, after trying for a while to improve on this further, I
decided to deploy this ranking instead of the heuristic that we were
using before. It didn't seem to make the results worse, and it seemed
to be fast enough, at least on my local machine.

I didn't realise it at the time, but this was when things started
breaking. You see, we only have one server. We don't do actual
crawling on the server, since that is done by volunteers. But we have
to process the crawl data to find new links, prioritise those links
for crawling, preprocess the data for indexing, and index it, as well
as serve search results. All on one server.

And it turned out that adding a ton of features and running XGBoost for
every search query added enough load to the server that things started
to slow down. Even loading the main page could take three or four
seconds. At the time, I didn't realise that this was causing the
problem, since a bunch of other stuff was happening. We had some
enthusiastic volunteers who were crawling many millions of pages a
day.

Search got so slow that I decided we had to turn off the crawling. I
made the server that sends batches to crawl return an empty list. I
turned off the scripts that update the crawl queue and index new
pages. And things got better.

But they still weren't as good as they were before. It took me a long
time to realise that it was the change in ranking that was the
problem.

In retrospect, we should have had better monitoring around everything
so that we could more easily identify the cause of the slowdown. But
this is a project that I do for fun, so I've put off doing this kind
of thing, because I find it boring. Boring is not fun. But a broken
website is also not fun. You live and learn.

Now I've changed the ranking back to the old heuristic and turned on
the crawling again. I still think learning to rank has potential. But
we can't afford it with our current resources. So I've started looking
at alternatives.

What would make ranking almost infinitely scalable? Not doing it
ourselves, but getting our users to do it, on the client side. This
works really well for us, because we don't rank like a normal search
engine. Our results are already ranked for each unigram and bigram
query. We pull out the pre-ranked results for each unigram and bigram
in a user's query, then re-rank them using our heuristic. This is
perfectly feasible to do on the client side, since we have no more
than around 30 results per unigram or bigram on average.

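Here is a minimal sketch of that merge-and-re-rank step. The scoring
heuristic in it is a simple stand-in, not our actual one, and the
types are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct SearchResult {
    url: String,
    title: String,
}

/// The unigrams and bigrams of a whitespace-tokenised query.
fn ngrams(query: &str) -> Vec<String> {
    let terms: Vec<&str> = query.split_whitespace().collect();
    let mut grams: Vec<String> = terms.iter().map(|t| t.to_string()).collect();
    grams.extend(terms.windows(2).map(|w| w.join(" ")));
    grams
}

/// Pull the pre-ranked list for each unigram and bigram, merge them by
/// URL, and re-rank. The score (title matches, boosted by how high the
/// result sat in each pre-ranked list) is a stand-in heuristic.
fn rerank(query: &str, index: &HashMap<String, Vec<SearchResult>>) -> Vec<SearchResult> {
    let mut scored: HashMap<String, (f64, SearchResult)> = HashMap::new();
    for gram in ngrams(query) {
        if let Some(results) = index.get(&gram) {
            for (pos, result) in results.iter().enumerate() {
                let title = result.title.to_lowercase();
                let hits = gram
                    .split_whitespace()
                    .filter(|t| title.contains(&t.to_lowercase()))
                    .count() as f64;
                let score = hits + 1.0 / (pos as f64 + 1.0);
                let entry = scored
                    .entry(result.url.clone())
                    .or_insert((0.0, result.clone()));
                entry.0 += score;
            }
        }
    }
    let mut merged: Vec<(f64, SearchResult)> = scored.into_values().collect();
    merged.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    merged.into_iter().map(|(_, result)| result).collect()
}
```

With at most around 30 results per n-gram, even a long query only
means re-ranking a few hundred candidates, which is trivial work for a
browser.
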
It also gave me the final push I needed to implement ranking in
Rust. I know that Python is a bad choice for a search engine, but I
chose it because I knew I could build something quickly, and that I
would probably give up at some point if I tried to do it in Rust. Yes,
I know Java and C++, which would have been fine choices, but they
would not be fun. This project has to be fun or it will not happen.

I am a beginner in Rust, and doing two hard things at the same time -
building a search engine and learning Rust - seemed like a recipe for
disaster. But I can build small bits in Rust, especially if I have
already built them in Python. So over the last few days, I've [rebuilt
our heuristic ranking in Rust](https://github.com/mwmbl/rankeval/).
The Rust compiles to WebAssembly, which will eventually be called from
our [excellent new front end](https://alpha.mwmbl.org/). Now all the
back end needs to do is pull out the pre-ranked results for each
unigram and bigram. These can easily be cached, and potentially even
kept in cloud storage. If we are ever going to serve millions of users
on our shoestring budget, then this is how we will have to do it.

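The WebAssembly boundary itself can stay very thin. Here is a rough
sketch of what that could look like with wasm-bindgen, passing JSON
strings to keep the bindings simple; the function name and JSON shape
are assumptions, not the actual rankeval interface.

```rust
use std::collections::HashMap;

use serde::Deserialize;
use wasm_bindgen::prelude::*;

#[derive(Deserialize)]
struct SearchResult {
    url: String,
    title: String,
}

/// Exposed to JavaScript via wasm-bindgen. The back end only has to
/// serve the pre-ranked results per unigram and bigram as JSON; the
/// re-ranking itself runs in the user's browser. The function name and
/// JSON shape here are assumptions, not the actual rankeval interface.
#[wasm_bindgen]
pub fn rank(query: &str, ngram_results_json: &str) -> String {
    let index: HashMap<String, Vec<SearchResult>> =
        serde_json::from_str(ngram_results_json).unwrap_or_default();

    // Stand-in heuristic: score each result by how many query terms
    // appear in its title, summed over the n-gram lists it occurs in.
    let mut scores: HashMap<String, f64> = HashMap::new();
    for results in index.values() {
        for result in results {
            let title = result.title.to_lowercase();
            let hits = query
                .split_whitespace()
                .filter(|t| title.contains(&t.to_lowercase()))
                .count() as f64;
            *scores.entry(result.url.clone()).or_insert(0.0) += hits;
        }
    }

    let mut ranked: Vec<(String, f64)> = scores.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let urls: Vec<String> = ranked.into_iter().map(|(url, _)| url).collect();
    serde_json::to_string(&urls).unwrap_or_else(|_| "[]".into())
}
```

From the front end, it is then just a matter of fetching the
pre-ranked lists (or hitting a cache), calling `rank`, and rendering
the returned URLs.
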
If you would like to get involved, join us on our [Matrix
server](https://matrix.to/#/#mwmbl:matrix.org), or send a pull request
to fix my rubbish Rust code. Thank you for reading!