
Pagefind 1.4.x ranking improvements #764

Open
bglw opened this issue Dec 18, 2024 · 4 comments

bglw commented Dec 18, 2024

Discussion issue for the 1.4.x ranking improvements milestone.
See all tickets here: https://github.com/CloudCannon/pagefind/milestone/6

This issue is to discuss any changes that should be made to the core ranking algorithm. Examples:

  • Changing the algorithm's implementation
  • Using a different algorithm
  • Fixing any subtle bugs in the current ranking
  • Changing the default values of the user-exposed ranking parameters
  • Creating default presets of ranking parameters tailored to different use-cases (docs vs blogs, for example)
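On that last point, presets could simply be named bundles of the existing user-exposed ranking parameters. A minimal sketch — the four parameter names are Pagefind's real ranking options, but the preset names and values here are purely illustrative, not tuned recommendations:

```typescript
// The four parameter names match Pagefind's documented `ranking`
// options; the preset names and values are illustrative only.
type RankingOptions = {
  termFrequency: number;
  termSimilarity: number;
  termSaturation: number;
  pageLength: number;
};

const rankingPresets: Record<string, RankingOptions> = {
  // Docs sites: many short pages sharing vocabulary, so damp raw
  // term frequency and reward words close to the search term.
  docs: { termFrequency: 0.4, termSimilarity: 4.0, termSaturation: 1.4, pageLength: 0.6 },
  // Blogs: longer prose pages, closer to stock behaviour.
  blog: { termFrequency: 1.0, termSimilarity: 1.0, termSaturation: 1.4, pageLength: 0.75 },
};

// Resolve a preset by name, falling back to the "blog" values.
function rankingFor(preset: string): RankingOptions {
  return rankingPresets[preset] ?? rankingPresets.blog;
}
```

The resolved object could then be passed straight through as the `ranking` block of the search options.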
@bglw bglw changed the title Core ranking improvements Pagefind 1.4.x ranking improvements Dec 18, 2024

spaceemotion commented Jan 11, 2025

I'm glad to see a general discussion about this! Pagefind is great, but getting search to work well for our case (a unified website that hosts documentation, marketing pages, a blog, courses, a newsletter archive, etc.) was a bit challenging.

To get the most out of Pagefind we decided to build a custom component using Vue.

Here are the settings we found worked best for us:

{
  ranking: {
    termFrequency: 0.4,
    termSimilarity: 10,
    termSaturation: 1.6,
    pageLength: 0.6,
  },

  excerptLength: 30, // The longer excerpt length actually helps with the sub result re-ranking (see below)
}

We then have a lookup table that holds custom weights for each page type:

const pageTypes: Record<string, PageType> = {
  blog: {
    label: 'Blog Post',
    weight: 0.9,
  },

  lesson: {
    label: 'Lesson',
    weight: 1,
  },

  course: {
    label: 'Course',
    weight: 0.8,
  },

  documentation: {
    label: 'Documentation',
    weight: 1.2,
  },

  // ... and so forth
};

In addition, we re-score the results on the client, as we found that sometimes the sub results within a single page were worse matches than the next document in the list.

  1. We calculate a score for each sub result using the weighted locations
  2. We sort the sub results by score
  3. We then limit each result to 5 sub results
  4. Each (page) result then gets a general "score" using the sum of its sub-results
  5. As said above, we then scale the score per page type
  6. We sort the list from highest score to lowest
  7. In a last pass, we cut off any sub results that score lower than the next page result in the list (this keeps the results strong at the top of the list)

const calculateScore = (subResult: PagefindSubResult) => {
  return subResult.weighted_locations.reduce((acc, loc) => acc + loc.balanced_score, 0);
};

const rerankFragment = (fragment: PagefindSearchFragment) => {
  // Rerank the sub results
  const subResults = fragment.sub_results
    .map((sub) => ({
      url: sub.url,
      title: sub.title,
      excerpt: sub.excerpt,
      score: calculateScore(sub),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5);

  return {
    url: fragment.url,
    meta: fragment.meta,
    max_score: subResults.reduce((acc, sub) => Math.max(acc, sub.score), 0),
    sub_results: subResults,
  };
};

watch(() => props.results, async (value) => {
  const fragments = await Promise.all(value.results.map(async (result) => {
    const fragment = rerankFragment(await result.data());
    const pageType = fragment.meta.type ? (pageTypes[fragment.meta.type] ?? null) : null;

    return {
      id: result.id,
      score: result.score * (pageType?.weight ?? 1),
      type: pageType,
      fragment,
    };
  }));

  fragments.sort((a, b) => b.score - a.score);

  fragments.forEach((fragment, index) => {
    const nextFragment = fragments[index + 1];

    if (nextFragment) {
      // cut off the sub results if the next fragment has a higher score
      const cutoff = fragment.fragment.sub_results.findIndex((sub) => {
        return sub.score < nextFragment.fragment.max_score;
      });

      if (cutoff !== -1) {
        fragment.fragment.sub_results = fragment.fragment.sub_results.slice(0, Math.max(3, cutoff));
      }

      return;
    }

    const prevFragment = fragments[index - 1];

    if (prevFragment) {
      // cut off the sub results if the previous fragment has a higher score
      const cutoff = fragment.fragment.sub_results.findIndex((sub) => {
        return sub.score < prevFragment.fragment.max_score;
      });

      if (cutoff !== -1) {
        fragment.fragment.sub_results = fragment.fragment.sub_results.slice(0, Math.max(3, cutoff));
      }
    }
  });

  displayItems.value = fragments;
}, {
  immediate: true,
});

Maybe our approach can be a bit of inspiration for others :)

That said, any kind of "search aliasing" where we can provide alternative page titles or spellings that could be used during search would be great. Plus the other things people have reported around typos and spelling errors would be amazing to see as well.


bglw commented Jan 21, 2025

Thanks for the long writeup, @spaceemotion! Some good ideas in there to draw inspiration from.

That said, any kind of "search aliasing" where we can provide alternative page titles or spellings that could be used during search would be great

Also a good idea — could you open an issue with more details on this one? I don't believe there is an existing issue for this work.


bglw commented Jan 21, 2025

NB: I'm going to use this thread to jot down observations on ranking, so for anyone reading, feel free to jump in with thoughts on good approaches. I'll be working on solutions alongside the discussion here, and any thoughts people have are more than welcome.

Unique word-extension ranking issue

The way Pagefind searches word extensions can bias pages higher than they should be in the results.

As a loose example, when searching for the word pre on the Pagefind documentation playground, the top result scores high largely because it contains precompiled.

Since this word is unique on the site (it exists on only this one page), BM25 ranks the term as very important and the page shoots up the rankings. Below it we find pages that contain pre, prefix, and preload, which are likely better results here, but they rank lower in part because those words are slightly more common.

On other sites I have found more pathological cases, where a long unique word that was indexed completely swamps what would be an otherwise reasonable result for a search.

This feels like an internal misuse of the concept from BM25. The goal is that if I were to search Pagefind install on the docs, the bulk of the search ranking would be on install rather than on pagefind, since pagefind will exist on every page and is thus of middling use. We want to avoid results ranking high purely because they're a great match for the word pagefind.

Applying this boost to words that are simply extensions of the search term is causing subpar results.

In part, the existing "term similarity" concept tries to address this. If you're searching for pre, then prefix gets boosted above precompiled. With the default settings, this boost doesn't make a big impact versus the uniqueness ranking. So one solution might be to crank the term similarity setting substantially higher by default (e.g. 8.0 instead of 1.0). In my test cases so far, moving that parameter up makes a pretty clear positive impact on the results when searching for single words. But there might be a smarter blend of settings here, or a rethink of how to apply them, that works better. This change alone could cause problems for multi-word searches where we need the uniqueness parameter to have an effect, so this needs more thought in any case.
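To make the trade-off concrete, here is a toy model of the two competing pressures. This is not Pagefind's actual formula: the IDF term follows the standard BM25 shape, but the similarity boost is a made-up stand-in, and the document frequencies are invented. It shows how a high similarity weight can let a common-but-close extension like prefix overtake a rare-but-distant one like precompiled:

```typescript
// Toy scorer: BM25-style IDF multiplied by a hypothetical
// similarity boost. Illustration only, not Pagefind's code.
const N = 100; // total pages in the toy index

// BM25 inverse document frequency: rarer words score higher.
const idf = (docFreq: number): number =>
  Math.log((N - docFreq + 0.5) / (docFreq + 0.5));

// Stand-in similarity boost: penalise words much longer than the
// query, with the penalty sharpened by a tunable weight.
const score = (
  queryLen: number,
  wordLen: number,
  docFreq: number,
  similarityWeight: number,
): number => idf(docFreq) * Math.pow(queryLen / wordLen, similarityWeight);

// Query "pre" (3 chars). Pretend "precompiled" (11 chars) appears
// on 1 page of 100, while "prefix" (6 chars) appears on 20.
const lowPrecompiled = score(3, 11, 1, 1);  // low similarity weight
const lowPrefix = score(3, 6, 20, 1);
const highPrecompiled = score(3, 11, 1, 8); // high similarity weight
const highPrefix = score(3, 6, 20, 8);
```

With the weight at 1, the rare word's IDF dominates and precompiled wins; at 8, the length penalty dominates and prefix wins — the same flip described above, in miniature.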

@spaceemotion

In addition to the similarity issue, I've also noticed a problem with heading matching.

We have an article on importing data, so the word "import" appears many times.

But if you search for "import XZY" (for which we have a dedicated FAQ article, and the exact phrasing appears in the page title/heading), the page gets ranked a lot lower, because the page contents themselves don't repeat those words much.

I tried to work around that by adding a custom weight to all FAQ/help article titles, but it's not an ideal solution either.
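For anyone hitting the same thing: that title-weight workaround can be expressed at index time with Pagefind's data-pagefind-weight attribute on the heading elements. The weight value below is illustrative, not a recommendation:

```html
<!-- Boost matches inside FAQ/help article headings at index time;
     the weight value here is an example, not a tuned number. -->
<h1 data-pagefind-weight="7">How do I import XZY?</h1>
```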
