Jaro Winkler string comparison algorithm matching runes maximum distance incorrect

Found 2 differences between the strutil implementation of Jara Winkler and a bunch of alternative when comparing implementations (and varying results for some specific short phonetic strings such as SN and STFN).

While other implementations (including web based https://calculatorshub.net/mathematical-calculators/jaro-winkler-calculator/#google_vignette) gave a 0.583 similarity, the strutil Go version gave 0.85. Other Go implementations also gave 0.583.

I found 2 possible bugs in this strutil library

* Two characters from s1 and s2respectively, are considered matching only if they are the same and not farther than `Max(lenA, lenB)/2-1` - the comparison is done with `Max(lenA, lenB)/2` instead

* There is no threshold to apply the Wrinkler prefix boost - other implementation use 0.7

The 0.7 threshold is an empirical rule to ensure that the prefix boost is only applied to pairs of strings that are already quite similar—its purpose is to avoid giving too much weight to the prefix in otherwise dissimilar strings.

Despite being told by ChatGPT that:

>From Winkler’s paper: “The Jaro–Winkler distance uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length l. The prefix bonus is only added if the Jaro score is above a certain threshold (usually 0.7).”

I couldn't find mention of it in the spec but it is common practice in many other implementations, although some offer to disable it or specify a different one. It takes the name

* prefix boost threshold
* wrinkler modification score cutoff

# Implementations

https://github.com/jamesturk/jellyfish/blob/b21046a4f0c490f2c20548ba9b2c6c15fe120847/python/jellyfish/_jellyfish.py#L104

# Reference

https://files.eric.ed.gov/fulltext/ED325505.pdf
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance


# Notes

I will be submitting a pull request to address the issue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Jaro Winkler string comparison algorithm matching runes maximum distance incorrect #61

Implementations

Reference

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Jaro Winkler string comparison algorithm matching runes maximum distance incorrect #61

Description

Implementations

Reference

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions