-
-
Notifications
You must be signed in to change notification settings - Fork 29
Description
Found 2 differences between the strutil implementation of Jara Winkler and a bunch of alternative when comparing implementations (and varying results for some specific short phonetic strings such as SN and STFN).
While other implementations (including web based https://calculatorshub.net/mathematical-calculators/jaro-winkler-calculator/#google_vignette) gave a 0.583 similarity, the strutil Go version gave 0.85. Other Go implementations also gave 0.583.
I found 2 possible bugs in this strutil library
-
Two characters from s1 and s2respectively, are considered matching only if they are the same and not farther than
Max(lenA, lenB)/2-1- the comparison is done withMax(lenA, lenB)/2instead -
There is no threshold to apply the Wrinkler prefix boost - other implementation use 0.7
The 0.7 threshold is an empirical rule to ensure that the prefix boost is only applied to pairs of strings that are already quite similar—its purpose is to avoid giving too much weight to the prefix in otherwise dissimilar strings.
Despite being told by ChatGPT that:
From Winkler’s paper: “The Jaro–Winkler distance uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length l. The prefix bonus is only added if the Jaro score is above a certain threshold (usually 0.7).”
I couldn't find mention of it in the spec but it is common practice in many other implementations, although some offer to disable it or specify a different one. It takes the name
- prefix boost threshold
- wrinkler modification score cutoff
Implementations
Reference
https://files.eric.ed.gov/fulltext/ED325505.pdf
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Notes
I will be submitting a pull request to address the issue.