Have search engine ignore all abbreviation marks #2781
Comments
Can the search engine ignore all punctuation marks? How would we achieve it? How much work is this? Vino will look into the options.
Made some changes in test2.
Try autocomplete for names for test2.
I want to recap what we discussed today so that it's documented here.

The UCF characters we must retain in forenames are: square brackets [] The comma only occurs inside curly brackets as a range indicator, as in _{3,5} to indicate 3 to 5 unreadable letters.

The most common abbreviation marks that the search engine should ignore are: full stop . Happily, there is no overlap between these two lists.

What I proposed was that, for each forename field in a search_record, we create a corresponding search_ field which would be populated with whatever is in the forename field but stripped of abbreviation marks. The original forename field would then be used for display purposes, since it contains what the transcriber actually entered. So, for example, in a burial record we have the burial_person_forename field, so we would create a new search_burial_person_forename field and populate it according to the following:

    search_burial_person_forename = burial_person_forename.gsub(/[.:;`'"’-]/, '')

The search engine would then try to match what was in the search_burial_person_forename field, and if it matches, the displayed record would then show the contents of the original burial_person_forename field.
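As a rough illustration of the proposal above, here is a minimal Ruby sketch. The character class is the one quoted in the comment. The helper name `strip_abbreviation_marks`, the constant name, and the use of a plain Hash in place of a real search_record are assumptions made only for this example, not the project's actual code.

```ruby
# Character class taken from the proposal above. The constant name is hypothetical.
ABBREVIATION_MARKS = /[.:;`'"’-]/

# Hypothetical helper: returns the forename with abbreviation marks removed.
# UCF characters such as [ ] { } _ and the comma are not in the class,
# so they pass through unchanged.
def strip_abbreviation_marks(forename)
  return forename if forename.nil?

  forename.gsub(ABBREVIATION_MARKS, '')
end

# A plain Hash stands in for a search_record here (assumption).
record = { burial_person_forename: "Eliz:th" }
record[:search_burial_person_forename] =
  strip_abbreviation_marks(record[:burial_person_forename])

puts record[:search_burial_person_forename] # search on this value
puts record[:burial_person_forename]        # display this value
```

The original field is never modified, so what the transcriber entered remains available for display.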
After extracting everything we have ever entered into a Forename field for #2779, it is clear that the variety of abbreviation marks that transcribers are using is preventing the search engine from finding the records. Abbreviation marks are not confined to the end of a name, as in Jno. or Wm., but can also occur in the middle of a name, as in Rich:d or Eliz:th, and we also have cases where two abbreviation marks have been used, such as Eliz.th. and Will'm. and Rich'd. Unless Soundex is used, each of these characters counts as if it were a real letter and prevents the search engine from finding the record.
The emendation rules that we have in /lib/tasks/load_emendations.rake are what we use to handle abbreviations directly. We can only have so many of these rules, however, because they add to the processing time of an uploaded CSV file. The proposal here is to allow users to continue entering whatever abbreviation marks they think best represent what the register has, but have the search engine ignore them. What isn't clear to me is whether this should be done by pruning out abbreviation marks in the search_records (which would require rebuilding the entire DB) or whether it is possible to have the search engine ignore them if present (which could slow down searching).

Ignoring abbreviation marks is important because there are far too many abbreviated forms of names for us to cope with them by Soundex, wildcard searches or emendation rules. Soundex tends to return too many false matches, and is currently turned on for both forenames and surnames together (one cannot select one or the other). Wildcards can only be used after at least two initial letters, and only when a single Place is specified. Emendation rules are too specific, and we would need many thousands of them to cover the variety of abbreviations that we already have.
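To make the two options in the question above concrete, here is a small in-memory Ruby sketch. It is only an illustration under stated assumptions: the record Hashes, the constant names, and the two finder methods are hypothetical, the real search runs against the database rather than an Array, and stripping marks from the user's query as well is an added assumption.

```ruby
# Marks to ignore. The set is an assumption for this sketch.
MARKS_TO_IGNORE = /[.:;`'"’-]/

# Hypothetical sample data standing in for search_records.
SAMPLE_RECORDS = [
  { forename: "Rich:d" },
  { forename: "Rich'd." },
  { forename: "Richd" },
  { forename: "Eliz:th" }
].freeze

# Option 1: prune once, up front, into a separate search field.
# The cost is paid when records are built (or rebuilt), not per search.
INDEXED_RECORDS = SAMPLE_RECORDS.map do |r|
  r.merge(search_forename: r[:forename].gsub(MARKS_TO_IGNORE, ''))
end.freeze

def find_precomputed(indexed, query)
  wanted = query.gsub(MARKS_TO_IGNORE, '')
  indexed.select { |r| r[:search_forename] == wanted }
end

# Option 2: leave stored data alone and strip marks during every search.
# No rebuild is needed, but each stored value is processed per query.
def find_on_the_fly(records, query)
  wanted = query.gsub(MARKS_TO_IGNORE, '')
  records.select { |r| r[:forename].gsub(MARKS_TO_IGNORE, '') == wanted }
end

puts find_precomputed(INDEXED_RECORDS, "Richd").map { |r| r[:forename] }.inspect
puts find_on_the_fly(SAMPLE_RECORDS, "Rich.d").map { |r| r[:forename] }.inspect
```

Both approaches return the same records in this sketch. They differ only in where the stripping work is done, which is the trade-off raised above.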