Skip to content

Jaro Winkler string comparison algorithm matching runes maximum distance incorrect #61

@fisabelle-lmi

Description

@fisabelle-lmi

Found 2 differences between the strutil implementation of Jara Winkler and a bunch of alternative when comparing implementations (and varying results for some specific short phonetic strings such as SN and STFN).

While other implementations (including web based https://calculatorshub.net/mathematical-calculators/jaro-winkler-calculator/#google_vignette) gave a 0.583 similarity, the strutil Go version gave 0.85. Other Go implementations also gave 0.583.

I found 2 possible bugs in this strutil library

  • Two characters from s1 and s2respectively, are considered matching only if they are the same and not farther than Max(lenA, lenB)/2-1 - the comparison is done with Max(lenA, lenB)/2 instead

  • There is no threshold to apply the Wrinkler prefix boost - other implementation use 0.7

The 0.7 threshold is an empirical rule to ensure that the prefix boost is only applied to pairs of strings that are already quite similar—its purpose is to avoid giving too much weight to the prefix in otherwise dissimilar strings.

Despite being told by ChatGPT that:

From Winkler’s paper: “The Jaro–Winkler distance uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length l. The prefix bonus is only added if the Jaro score is above a certain threshold (usually 0.7).”

I couldn't find mention of it in the spec but it is common practice in many other implementations, although some offer to disable it or specify a different one. It takes the name

  • prefix boost threshold
  • wrinkler modification score cutoff

Implementations

https://github.com/jamesturk/jellyfish/blob/b21046a4f0c490f2c20548ba9b2c6c15fe120847/python/jellyfish/_jellyfish.py#L104

Reference

https://files.eric.ed.gov/fulltext/ED325505.pdf
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Notes

I will be submitting a pull request to address the issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions