
Pagefind 1.4.x ranking improvements #764

Open
bglw opened this issue Dec 18, 2024 · 4 comments

bglw commented Dec 18, 2024

Discussion issue for the 1.4.x ranking improvements milestone.
See all tickets here: https://github.com/CloudCannon/pagefind/milestone/6

This issue is to discuss any changes that should be made to the core ranking algorithm. Examples:

  • Changing the algorithm's implementation
  • Using a different algorithm
  • Fixing any subtle bugs in the current ranking
  • Changing the default values of the user-exposed ranking parameters
  • Creating default presets of ranking parameters tailored to different use-cases (docs vs blogs, for example)
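On that last point, presets could simply be named bundles of the existing user-exposed ranking parameters. A minimal sketch — the four parameter names are Pagefind's real ranking options, but the preset names and values here are purely illustrative, not tuned recommendations:

```typescript
// The four parameter names match Pagefind's documented `ranking`
// options; the preset names and values are illustrative only.
type RankingOptions = {
  termFrequency: number;
  termSimilarity: number;
  termSaturation: number;
  pageLength: number;
};

const rankingPresets: Record<string, RankingOptions> = {
  // Docs sites: many short pages sharing vocabulary, so damp raw
  // term frequency and reward words close to the search term.
  docs: { termFrequency: 0.4, termSimilarity: 4.0, termSaturation: 1.4, pageLength: 0.6 },
  // Blogs: longer prose pages, closer to stock behaviour.
  blog: { termFrequency: 1.0, termSimilarity: 1.0, termSaturation: 1.4, pageLength: 0.75 },
};

// Resolve a preset by name, falling back to the "blog" values.
function rankingFor(preset: string): RankingOptions {
  return rankingPresets[preset] ?? rankingPresets.blog;
}
```

The resolved object could then be passed straight through as the `ranking` block of the search options.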
@bglw bglw changed the title Core ranking improvements Pagefind 1.4.x ranking improvements Dec 18, 2024

spaceemotion commented Jan 11, 2025

I'm glad to see a general discussion about this! Pagefind is great, but getting search to work well for our case (a unified website that hosts documentation, marketing pages, a blog, courses, a newsletter archive, etc.) was a bit challenging.

To get the most out of Pagefind we decided to build a custom component using Vue.

Here are the settings we found worked best for us:

{
  ranking: {
    termFrequency: 0.4,
    termSimilarity: 10,
    termSaturation: 1.6,
    pageLength: 0.6,
  },

  excerptLength: 30, // The longer excerpt length actually helps with the sub result re-ranking (see below)
}

We then have a lookup table that holds custom weights for each page type:

const pageTypes: Record<string, PageType> = {
  blog: {
    label: 'Blog Post',
    weight: 0.9,
  },

  lesson: {
    label: 'Lesson',
    weight: 1,
  },

  course: {
    label: 'Course',
    weight: 0.8,
  },

  documentation: {
    label: 'Documentation',
    weight: 1.2,
  },

  // ... and so forth
};

In addition, we re-score the results on the client, as we found that sometimes the sub results within a single page were worse matches than the next document in the list.

  1. We calculate a score for each sub result using the weighted locations
  2. We sort the sub results by score
  3. We then limit each result to 5 sub results
  4. Each (page) result then gets a general "score" using the sum of its sub-results
  5. As said above, we then scale the score per page type
  6. We sort the list from highest score to lowest
  7. In a last pass, we cut off any sub results that score lower than the next page result in the list (this keeps the results strong at the top of the list)

const calculateScore = (subResult: PagefindSubResult) => {
  return subResult.weighted_locations.reduce((acc, loc) => acc + loc.balanced_score, 0);
};

const rerankFragment = (fragment: PagefindSearchFragment) => {
  // Rerank the sub results
  const subResults = fragment.sub_results
    .map((sub) => ({
      url: sub.url,
      title: sub.title,
      excerpt: sub.excerpt,
      score: calculateScore(sub),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5);

  return {
    url: fragment.url,
    meta: fragment.meta,
    max_score: subResults.reduce((acc, sub) => Math.max(acc, sub.score), 0),
    sub_results: subResults,
  };
};

watch(() => props.results, async (value) => {
  const fragments = await Promise.all(value.results.map(async (result) => {
    const fragment = rerankFragment(await result.data());
    const pageType = fragment.meta.type ? (pageTypes[fragment.meta.type] ?? null) : null;

    return {
      id: result.id,
      score: result.score * (pageType?.weight ?? 1),
      type: pageType,
      fragment,
    };
  }));

  fragments.sort((a, b) => b.score - a.score);

  fragments.forEach((fragment, index) => {
    const nextFragment = fragments[index + 1];

    if (nextFragment) {
      // cut off the sub results if the next fragment has a higher score
      const cutoff = fragment.fragment.sub_results.findIndex((sub) => {
        return sub.score < nextFragment.fragment.max_score;
      });

      if (cutoff !== -1) {
        fragment.fragment.sub_results = fragment.fragment.sub_results.slice(0, Math.max(3, cutoff));
      }

      return;
    }

    const prevFragment = fragments[index - 1];

    if (prevFragment) {
      // cut off the sub results if the previous fragment has a higher score
      const cutoff = fragment.fragment.sub_results.findIndex((sub) => {
        return sub.score < prevFragment.fragment.max_score;
      });

      if (cutoff !== -1) {
        fragment.fragment.sub_results = fragment.fragment.sub_results.slice(0, Math.max(3, cutoff));
      }
    }
  });

  displayItems.value = fragments;
}, {
  immediate: true,
});

Maybe our approach can be a bit of inspiration for others :)

That said, any kind of "search aliasing" where we can provide alternative page titles or spellings that could be used during search would be great. Plus the other things people have reported around typos and spelling errors would be amazing to see as well.


bglw commented Jan 21, 2025

Thanks for the long writeup, @spaceemotion! Some good ideas in there to draw inspiration from.

That said, any kind of "search aliasing" where we can provide alternative page titles or spellings that could be used during search would be great

Also a good idea — could you open an issue with more details on this one? I don't believe there is an existing issue for this work.


bglw commented Jan 21, 2025

NB: I'm going to use this thread to jot down observations on ranking, so for anyone reading, feel free to jump in with thoughts on good approaches. I'll be working on solutions alongside the discussion here, and any thoughts people have are more than welcome.

Unique word-extension ranking issue

The way Pagefind searches word extensions can bias pages higher than they should be in the results.

As a loose example, when searching for the word pre on the Pagefind documentation playground, the top result scores high largely because it contains precompiled.

Since this word is unique on the site (it exists on only this one page), BM25 ranks the term as very important and the page shoots up the rankings. Below it we find pages that contain pre, prefix, and preload, which are likely better results here, but they rank lower in part because those words are slightly more common.

On other sites I have found more pathological cases, where a long unique word that was indexed completely swamps what would be an otherwise reasonable result for a search.

This feels like an internal misuse of the concept from BM25. The goal is that if I were to search Pagefind install on the docs, the bulk of the search ranking would be on install rather than on pagefind, since pagefind will exist on every page and is thus of middling use. We want to avoid results ranking high purely because they're a great match for the word pagefind.

Applying this boost to words that are simply extensions of the search term is causing subpar results.

In part, the existing "term similarity" concept tries to address this. If you're searching for pre, then prefix gets boosted above precompiled. With the default settings, this boost doesn't make a big impact versus the uniqueness ranking. So one solution might be to crank the term similarity setting substantially higher by default (e.g. 8.0 instead of 1.0). In my test cases so far, moving that parameter up makes a pretty clear positive impact on the results when searching for single words. But there might be a smarter blend of settings here, or a rethink of how to apply them, that works better. This change alone could cause problems for multi-word searches where we need the uniqueness parameter to have an effect, so this needs more thought in any case.
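To make the trade-off concrete, here is a toy model of the two competing pressures. This is not Pagefind's actual formula: the IDF term follows the standard BM25 shape, but the similarity boost is a made-up stand-in, and the document frequencies are invented. It shows how a high similarity weight can let a common-but-close extension like prefix overtake a rare-but-distant one like precompiled:

```typescript
// Toy scorer: BM25-style IDF multiplied by a hypothetical
// similarity boost. Illustration only, not Pagefind's code.
const N = 100; // total pages in the toy index

// BM25 inverse document frequency: rarer words score higher.
const idf = (docFreq: number): number =>
  Math.log((N - docFreq + 0.5) / (docFreq + 0.5));

// Stand-in similarity boost: penalise words much longer than the
// query, with the penalty sharpened by a tunable weight.
const score = (
  queryLen: number,
  wordLen: number,
  docFreq: number,
  similarityWeight: number,
): number => idf(docFreq) * Math.pow(queryLen / wordLen, similarityWeight);

// Query "pre" (3 chars). Pretend "precompiled" (11 chars) appears
// on 1 page of 100, while "prefix" (6 chars) appears on 20.
const lowPrecompiled = score(3, 11, 1, 1);  // low similarity weight
const lowPrefix = score(3, 6, 20, 1);
const highPrecompiled = score(3, 11, 1, 8); // high similarity weight
const highPrefix = score(3, 6, 20, 8);
```

With the weight at 1, the rare word's IDF dominates and precompiled wins; at 8, the length penalty dominates and prefix wins — the same flip described above, in miniature.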

@spaceemotion

In addition to the similarity issue, I've also noticed a problem with heading matching.

We have an article on importing data, so the word "import" appears many times.

But if you search for "import XZY" (for which we have a dedicated FAQ article, and the exact phrasing appears in the page title/heading), the page gets ranked a lot lower, because the page contents themselves don't repeat those words much.

I tried to work around that by adding a custom weight to all FAQ/help article titles, but it's not an ideal solution either.
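For anyone hitting the same thing: that title-weight workaround can be expressed at index time with Pagefind's data-pagefind-weight attribute on the heading elements. The weight value below is illustrative, not a recommendation:

```html
<!-- Boost matches inside FAQ/help article headings at index time;
     the weight value here is an example, not a tuned number. -->
<h1 data-pagefind-weight="7">How do I import XZY?</h1>
```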
