Pagefind 1.4.x ranking improvements #764
Comments
I'm glad to see a general discussion about this! Pagefind is great, but getting search to work well for our case (a unified website that hosts documentation, marketing pages, blog, courses, newsletter archive, etc.) was a bit challenging. To get the most out of Pagefind, we decided to build a custom component using Vue. Here are the settings we found worked best for us:
{
  ranking: {
    termFrequency: 0.4,
    termSimilarity: 10,
    termSaturation: 1.6,
    pageLength: 0.6,
  },
  excerptLength: 30, // The longer excerpt length actually helps with the sub result re-ranking (see below)
}
We then have an index that holds custom weights for each page type:
const pageTypes: Record<string, PageType> = {
  blog: {
    label: 'Blog Post',
    weight: 0.9,
  },
  lesson: {
    label: 'Lesson',
    weight: 1,
  },
  course: {
    label: 'Course',
    weight: 0.8,
  },
  documentation: {
    label: 'Documentation',
    weight: 1.2,
  },
  // ... and so forth
};
In addition, we rescore the sub results ourselves, as we found that sometimes the lower-ranked parts of a single result were worse matches than the next document.
const calculateScore = (subResult: PagefindSubResult) => {
  return subResult.weighted_locations.reduce((acc, loc) => acc + loc.balanced_score, 0);
};
const rerankFragment = (fragment: PagefindSearchFragment) => {
  // Rerank the sub results
  const subResults = fragment.sub_results
    .map((sub) => ({
      url: sub.url,
      title: sub.title,
      excerpt: sub.excerpt,
      score: calculateScore(sub),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5);
  return {
    url: fragment.url,
    meta: fragment.meta,
    max_score: subResults.reduce((acc, sub) => Math.max(acc, sub.score), 0),
    sub_results: subResults,
  };
};
watch(() => props.results, async (value) => {
  const fragments = await Promise.all(value.results.map(async (result) => {
    const fragment = rerankFragment(await result.data());
    const pageType = fragment.meta.type ? (pageTypes[fragment.meta.type] ?? null) : null;
    return {
      id: result.id,
      score: result.score * (pageType?.weight ?? 1),
      type: pageType,
      fragment,
    };
  }));
  fragments.sort((a, b) => b.score - a.score);
  fragments.forEach((fragment, index) => {
    const nextFragment = fragments[index + 1];
    if (nextFragment) {
      // cut off the sub results if the next fragment has a higher score
      const cutoff = fragment.fragment.sub_results.findIndex((sub) => {
        return sub.score < nextFragment.fragment.max_score;
      });
      if (cutoff !== -1) {
        fragment.fragment.sub_results = fragment.fragment.sub_results.slice(0, Math.max(3, cutoff));
      }
      return;
    }
    const prevFragment = fragments[index - 1];
    if (prevFragment) {
      // cut off the sub results if the previous fragment has a higher score
      const cutoff = fragment.fragment.sub_results.findIndex((sub) => {
        return sub.score < prevFragment.fragment.max_score;
      });
      if (cutoff !== -1) {
        fragment.fragment.sub_results = fragment.fragment.sub_results.slice(0, Math.max(3, cutoff));
      }
    }
  });
  displayItems.value = fragments;
}, {
  immediate: true,
});
Maybe our approach can be a bit of inspiration for others :) That said, any kind of "search aliasing" where we can provide alternative page titles or spellings that could be used during search would be great. Plus the other things people have reported around typos and spelling errors would be amazing to see as well.
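For readers who want to try the trimming step in isolation, here is a minimal sketch of it as a pure function on plain data. The `ScoredSub` and `ScoredPage` shapes, the `trimAgainstNeighbour` name, and the sample scores are assumptions made up for illustration; they are not Pagefind types, and the floor of 3 simply mirrors the `Math.max(3, cutoff)` used in the snippet.

```typescript
// Hypothetical minimal shapes; the real Pagefind fragment types carry more fields.
interface ScoredSub { title: string; score: number }
interface ScoredPage { url: string; subResults: ScoredSub[] } // sorted by score, descending

const MIN_KEPT = 3; // assumed floor, mirroring Math.max(3, cutoff) in the snippet

// Keep sub results up to the first one that scores below the neighbouring
// page's best sub result, but never return fewer than MIN_KEPT when trimming.
function trimAgainstNeighbour(page: ScoredPage, neighbourMax: number): ScoredSub[] {
  const cutoff = page.subResults.findIndex((sub) => sub.score < neighbourMax);
  if (cutoff === -1) return page.subResults; // nothing falls below the neighbour
  return page.subResults.slice(0, Math.max(MIN_KEPT, cutoff));
}

// Made-up sample data for illustration only.
const page: ScoredPage = {
  url: "/docs/import/",
  subResults: [
    { title: "Importing data", score: 9 },
    { title: "CSV", score: 7 },
    { title: "JSON", score: 4 },
    { title: "Troubleshooting", score: 2 },
    { title: "Changelog", score: 1 },
  ],
};

const trimmed = trimAgainstNeighbour(page, 5);
console.log(trimmed.map((s) => s.title));
```

With a neighbouring best score of 5, the first weaker sub result sits at index 2, so the floor of three applies and the two weakest sections are dropped.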
Thanks for the long writeup @spaceemotion! Some good ideas in there to draw inspiration from.
Also a good idea. Could you open an issue with more details on this one? I don't believe there is an existing issue for this work.
NB: I'm going to use this thread to jot down observations on rankings, so for anyone reading feel free to jump in with thoughts on good approaches. I'll be working on solutions in lieu of discussions here but any thoughts people have are more than welcome.

Unique word-extension ranking issue

The way Pagefind searches word extensions can bias pages higher than they should be in the results. As a loose example, when searching for the word [...]

Since this word is unique on the site (this is the only page it exists on), BM25 ranks that term as very important and the page shoots up the rankings. Below it, we find pages that contain [...]

On other sites I have found more pathological cases, where a long unique word that was indexed completely swamps what would be an otherwise reasonable result for a search.

This feels like an internal misuse of the concept from BM25. The goal is that if I were to search [...]

Applying this boost to words that are simply extensions of the search term is causing subpar results.

In part, the existing "term similarity" concept tries to address this. If you're searching for [...]
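To put rough numbers on why a word that exists on a single page dominates, here is a small sketch using the textbook BM25 inverse document frequency. This is the standard formula, not Pagefind's actual implementation, and the page counts are made-up inputs for illustration.

```typescript
// Textbook BM25 inverse document frequency (the "+1" variant that stays positive).
// totalPages and pagesWithTerm are illustrative inputs, not Pagefind internals.
function bm25Idf(totalPages: number, pagesWithTerm: number): number {
  return Math.log(1 + (totalPages - pagesWithTerm + 0.5) / (pagesWithTerm + 0.5));
}

const totalPages = 1000;
const uniqueExtension = bm25Idf(totalPages, 1);   // a long word found on one page only
const commonExtension = bm25Idf(totalPages, 200); // a word found on many pages

console.log(uniqueExtension.toFixed(2), commonExtension.toFixed(2));
```

The single-page word receives a several times larger weight than the common one, which matches the behaviour described above when that word is merely an extension of the search term.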
In addition to the similarity thing, I also noticed that we have a bit of an issue with heading matching. We have an article on importing data, so the word "import" appears many times. But if you search for "import XZY", for which we have a dedicated FAQ article with that exact phrasing in the page title/heading, the FAQ page gets ranked a lot lower, because its contents don't repeat the words that much. I tried to work around that by adding a custom weight to all FAQ/help article titles, but it's not an ideal solution either.
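As a point of comparison with the indexing-time title weight mentioned above, a client-side variant could boost pages whose title contains every query term. This is only a sketch of that idea: the `RankedPage` shape, the `boostTitleMatches` helper, the `TITLE_BOOST` constant, and the sample scores are all hypothetical and not part of Pagefind.

```typescript
// Hypothetical result shape for illustration; not a Pagefind type.
interface RankedPage { title: string; score: number }

// Assumed tuning constant: how strongly a full title match is favoured.
const TITLE_BOOST = 2;

// Multiply the score when every query term appears in the title, then re-sort.
function boostTitleMatches(pages: RankedPage[], query: string): RankedPage[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return pages
    .map((page) => {
      const title = page.title.toLowerCase();
      const allInTitle = terms.length > 0 && terms.every((t) => title.includes(t));
      return { ...page, score: allInTitle ? page.score * TITLE_BOOST : page.score };
    })
    .sort((a, b) => b.score - a.score);
}

// Made-up scores: the long article outranks the FAQ before the boost.
const reranked = boostTitleMatches(
  [
    { title: "Importing data", score: 8 },
    { title: "FAQ: Import XZY", score: 5 },
  ],
  "import XZY",
);
console.log(reranked[0].title);
```

Like the custom title weight, this treats the symptom rather than the ranking itself, so it shares the same drawback of needing a hand-tuned constant.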
Discussion issue for the 1.4.x ranking improvements milestone.
See all tickets here: https://github.com/CloudCannon/pagefind/milestone/6
This issue is to discuss any changes that should be made to the core ranking algorithm. Examples: