Remove non-alphanumerics (except quotes) from search query #486

Merged
2 commits merged on Mar 25, 2025

Conversation

colagrosso
Collaborator

That is, don't replace non-alphanumerics with a space. This matches the behavior of Formatter::RemoveDiacriticsAndNonalphanumerics(), which would be used here except that function would also remove quotes. We actually discussed not introducing spaces previously, but I made a mistake and didn't apply the same change to the user's search query: #470 (comment)

This change fixes queries with compound words like:

Haycraft-Queen

and also fixes queries for authors with apostrophes:

O'Neill

No changes to existing DBs are necessary because they already have terms like haycraftqueen and oneill stored in IndexableCollections and IndexableAuthors.
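
As a rough illustration of the rule above, here is a minimal shell sketch, not the PHP code changed in this PR. The `normalize_query` helper is hypothetical, it assumes "quotes" means double quotes, and it skips diacritic removal entirely:

```shell
#!/bin/sh
# Hypothetical helper, for illustration only (the PR itself changes PHP code).
# Lowercases the query, then deletes every character that is not an ASCII
# letter, digit, space, or double quote. Nothing is replaced with a space.
# Assumption: "quotes" means double quotes; diacritics are not handled here.
normalize_query() {
  printf '%s' "$1" | LC_ALL=C tr 'A-Z' 'a-z' | LC_ALL=C tr -cd 'a-z0-9 "'
}

normalize_query "Haycraft-Queen"; echo   # haycraftqueen
normalize_query "O'Neill"; echo          # oneill
```

Because nothing is replaced with a space, the normalized query lines up with single stored terms such as `haycraftqueen` and `oneill`.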

Fixes #484

See standardebooks#484 for details. With a special case that replaces
hyphens with spaces instead of removing them, users can search for
either part of a compound word:

`beta` to match `Alpha-Beta`
`queen` to match `Haycraft-Queen`

These searches also work as expected:

`Alpha-Beta`
`Alpha`
`Haycraft-Queen`
`Haycraft`

I don't think these queries should work, and they do not:

`AlphaBeta`
`HaycraftQueen`
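
To make the matching rules above concrete, here is a minimal shell sketch. Both helpers are hypothetical: `index_terms` assumes the same hyphen-to-space rule is applied to the stored text and to the query, and `matches` uses whole-word comparison as a stand-in for the database's full-text search, which this sketch does not model:

```shell
#!/bin/sh
# Hypothetical helper: lowercase, turn hyphens into spaces, then delete every
# remaining character that is not an ASCII letter, digit, or space.
index_terms() {
  printf '%s' "$1" | LC_ALL=C tr 'A-Z' 'a-z' | sed 's/-/ /g' | LC_ALL=C tr -cd 'a-z0-9 '
}

# Hypothetical stand-in for the real search: succeeds when the normalized
# query appears in the normalized title as a run of whole words.
matches() {
  case " $(index_terms "$1") " in
    *" $(index_terms "$2") "*) return 0 ;;
    *) return 1 ;;
  esac
}

matches "Alpha-Beta" "beta" && echo "beta matches"
matches "Alpha-Beta" "Alpha-Beta" && echo "Alpha-Beta matches"
matches "Alpha-Beta" "AlphaBeta" || echo "AlphaBeta does not match"
```

Under these assumptions `Alpha-Beta` is stored as the two terms `alpha` and `beta`, which is why `AlphaBeta` no longer lines up with anything.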

This commit changes `IndexableText`, `IndexableAuthors`, and
`IndexableCollections`, so existing DBs need an update. This will update
all published books:

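```
cd /standardebooks.org/ebooks
# Redeploy each ebook directory so its indexable fields are regenerated.
# `-mindepth 1` keeps `find` from also returning /standardebooks.org/ebooks itself.
for BOOK in $(find /standardebooks.org/ebooks -mindepth 1 -maxdepth 1 -type d)
do
  tsp nice /standardebooks.org/web/scripts/deploy-ebook-to-www --verbose --no-build --no-images --no-recompose --no-epubcheck --no-feeds --no-bulk-downloads "$BOOK"
done
```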

And this PHP code will update placeholders:

```
<?
require_once('/standardebooks.org/web/lib/Core.php');

$ebooks = Ebook::GetAll();

foreach($ebooks as $ebook){
        if($ebook->IsPlaceholder()){
                print('Saving ' . $ebook->Identifier . "\n");

                // Need to force `Ebook::GetAllContributors()` to be called before `Ebook::Save()`. Otherwise, authors and translators will be deleted.
                $ebook->Authors;

                $ebook->Save();
        }
}
```
@colagrosso
Collaborator Author

Based on the discussion in #484, I added a special case for replacing hyphens with spaces. Let me know how it looks to you and whether you want me to move it into a Formatter function (or combine it with Formatter::RemoveDiacriticsAndNonalphanumerics()).


@acabal
Member

acabal commented Mar 25, 2025

OK great. This looks good, I'll merge it in and start the updates now. Thanks!

@acabal acabal merged commit 9202717 into standardebooks:master Mar 25, 2025

Successfully merging this pull request may close these issues.

Searching for compound words doesn't appear to work