Have search engine ignore all abbreviation marks #2781

Open
stsccfr opened this issue Dec 17, 2024 · 4 comments

@stsccfr
Collaborator

stsccfr commented Dec 17, 2024

After extracting everything we have ever entered into a Forename field for #2779, it is clear that the variety of abbreviation marks that transcribers use is preventing the search engine from finding records. Abbreviation marks are not confined to the end of a name, as in Jno. or Wm., but can also occur in the middle of a name, as in Rich:d or Eliz:th, and there are cases where two abbreviation marks have been used, such as Eliz.th., Will'm. and Rich'd. Unless Soundex is used, each of these characters counts as if it were a real letter and prevents the search engine from finding the record.

The emendation rules that we have in /lib/tasks/load_emendations.rake are what we use to handle abbreviations directly. We can only have so many of these rules, however, because they add to the processing time of an uploaded CSV file. The proposal here is to allow users to continue entering whatever abbreviation marks they think best represent what the register has, but have the search engine ignore them. What isn't clear to me is whether this should be done by pruning out abbreviation marks in the search_records (which would require rebuilding the entire DB) or whether it is possible to have the search engine ignore them if present (which could slow down searching).
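
To make the second option concrete, here is a minimal sketch of ignoring abbreviation marks at query time. It is plain Ruby and not tied to whatever query layer the search engine actually uses; the helper name `tolerant_pattern` and the constant are hypothetical, and the mark list is the one discussed in this issue. A pattern like this has to be tested against every stored forename, which is in line with the concern that this route could slow down searching.

```ruby
# Abbreviation marks to tolerate (assumed list, taken from this discussion).
ABBREVIATION_MARKS = %w[. : ; ` ' " ’ -].freeze

# Hypothetical helper: build a pattern that matches the search term even when
# abbreviation marks appear between or after its letters.
def tolerant_pattern(term)
  # Zero or more abbreviation marks, as a character class.
  marks = "[#{Regexp.escape(ABBREVIATION_MARKS.join)}]*"
  # Allow marks between every pair of letters in the search term.
  body = term.chars.map { |ch| Regexp.escape(ch) }.join(marks)
  # Anchor both ends, allow trailing marks, ignore case.
  /\A#{body}#{marks}\z/i
end

pattern = tolerant_pattern('Richd')
puts pattern.match?('Rich:d')   # stored form with a colon
puts pattern.match?("Rich'd.")  # stored form with two marks
puts pattern.match?('Richard')  # a different spelling, not an abbreviation mark issue
```

Note that the unabbreviated spelling Richard is still not matched by a search for Richd; this only removes the effect of the marks themselves.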

Ignoring abbreviation marks is important because there are far too many abbreviated forms of names for us to cope with them by Soundex, wildcard searches or emendation rules. Soundex tends to return too many false matches, and is currently turned on for both forenames and surnames (one cannot select one or the other). Wildcards can only be used after at least two initial letters, and only when a single Place is specified. Emendation rules are too specific, and we would need many thousands of them to cover the variety of abbreviations that we already have.

@DeniseColbert
Contributor

Can the search engine ignore all punctuation marks? How would we achieve it? How much work is this?

Vino will look into the options.

@Vino-S
Collaborator

Vino-S commented Jan 29, 2025

Made some changes in test2.
The query is efficient when a place is provided. If not, the query currently times out.

@Vino-S
Collaborator

Vino-S commented Jan 29, 2025

Try autocomplete for names on test2.

@stsccfr
Collaborator Author

stsccfr commented Jan 29, 2025

I want to recap what we discussed today so that it's documented here. The UCF characters we must retain in forenames are:

square brackets []
curly brackets {}
underscore _
asterisk *
question mark ?
comma ,

The comma only occurs inside curly brackets as a range indicator, as in _{3,5} to indicate 3 to 5 unreadable letters.

The most common abbreviation marks that the search engine should ignore are:

full stop .
colon :
single quote '
hyphen -
semicolon ;
backtick `
double quote "
smart quote ’

Happily, there is no overlap between these two lists. What I proposed was that, for each forename field in a search_record, we create a corresponding search_ field which would be populated with whatever is in the forename field but stripped of abbreviation marks. The original forename field would then be used for display purposes since it contains what the transcriber actually entered. So for example, in a burial record we have the burial_person_forename field, so we would create a new search_burial_person_forename field and populate it according to the following:

search_burial_person_forename = burial_person_forename.gsub(/[.:;`'"’-]/, '')
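
As a quick check of that expression, the sketch below runs the same character class over the abbreviated forms mentioned at the top of this issue, and over two made-up forenames containing UCF characters to confirm those are left alone. The helper name `strip_abbreviation_marks` and the constant are hypothetical, not taken from the codebase.

```ruby
# Same character class as the gsub above; the hyphen is last so it is literal.
ABBREVIATION_MARK_PATTERN = /[.:;`'"’-]/

# Hypothetical helper: return the forename with abbreviation marks removed.
# to_s guards against a nil forename.
def strip_abbreviation_marks(forename)
  forename.to_s.gsub(ABBREVIATION_MARK_PATTERN, '')
end

# Abbreviated forms quoted earlier in this issue.
%w[Jno. Wm. Rich:d Eliz:th Eliz.th. Will'm. Rich'd.].each do |name|
  puts "#{name} -> #{strip_abbreviation_marks(name)}"
end

# UCF characters ([] {} _ * ? ,) are not in the class, so they survive.
puts strip_abbreviation_marks('Will_{3,5}m')
puts strip_abbreviation_marks('[Jno.]?')
```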

The search engine would then try to match what was in the search_burial_person_forename field, and if it matches, the displayed record would then show the contents of the original burial_person_forename field.
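
A rough sketch of that match-then-display flow, using an in-memory Struct as a stand-in for a search_record. Only the two field names come from the proposal; `build_record`, `find_forenames`, the idea of stripping marks from the query as well, and the case folding are assumptions made for illustration.

```ruby
MARKS_TO_IGNORE = /[.:;`'"’-]/

# Stand-in for a search_record, holding only the two fields discussed here.
BurialRecord = Struct.new(:burial_person_forename, :search_burial_person_forename)

# Populate the search_ field from what the transcriber entered.
def build_record(forename)
  BurialRecord.new(forename, forename.gsub(MARKS_TO_IGNORE, ''))
end

# Hypothetical search: compare against the stripped field, but return the
# original forename for display. Stripping the query too is an assumption.
def find_forenames(records, query)
  needle = query.gsub(MARKS_TO_IGNORE, '').downcase
  records
    .select { |record| record.search_burial_person_forename.downcase == needle }
    .map(&:burial_person_forename)
end

records = ['Rich:d', "Rich'd.", 'Richard', 'Eliz.th.'].map { |f| build_record(f) }

p find_forenames(records, 'Richd')   # => ["Rich:d", "Rich'd."]
p find_forenames(records, 'Eliz:th') # => ["Eliz.th."]
```

Because the stripped value is computed once when the record is built, the lookup itself is a plain equality test, at the cost of populating the new field for existing records.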
