SUTime extracts dashed numbers as timex duration #145

Open
@garfieldnate

Using the demo code from the website (except that I don't specify the document date) with the input "Call 090-1234-5678", I get this output:

1234-5678 [from char offset 5 to 18] --> (1234-XX-XX,5678-XX-XX,PT38955312H)

The Timex attributes are:

{tid=t3, value=PT38955312H, type=DURATION, beginPoint=t1, endPoint=t2}

This happens to be a Japanese phone number, so I understand that it won't work out of the box. However, when extracting \d\d\d\d-\d\d\d\d, I would expect the resulting timex to be a range from year 1234 to year 5678, not a duration giving the number of hours in that span. The referenced Timex spec calls this an "anchored duration" and shows it being broken into multiple annotations (page 13), which would be easier to handle.
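
For reference, the PT38955312H value is consistent with counting the hours from January 1, 1234 to January 1, 5678 in the proleptic Gregorian calendar. The sketch below reproduces that number with plain `java.time`. It assumes that is how the duration was derived; it does not call SUTime, and the class and method names are made up for illustration.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of SUTime or CoreNLP.
class AnchoredDurationCheck {
    // Hours between midnight on January 1 of two years (proleptic ISO calendar).
    static long hoursBetweenYears(int startYear, int endYear) {
        return ChronoUnit.HOURS.between(
                LocalDate.of(startYear, 1, 1).atStartOfDay(),
                LocalDate.of(endYear, 1, 1).atStartOfDay());
    }

    public static void main(String[] args) {
        // 4444 years = 1,623,138 days (1078 of the years are leap years), times 24 hours.
        System.out.println("PT" + hoursBetweenYears(1234, 5678) + "H"); // prints PT38955312H
    }
}
```

So the duration appears to be the length of the 1234 to 5678 range rather than an independent parse, which matches the beginPoint=t1 and endPoint=t2 attributes above.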

Even if this turns out not to be fixable, can you help me find the rule responsible so I can delete it? I see plenty of duration rules in english.sutime, but they also seem to require other words to match (like "the" and "year").
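
Until the responsible rule is identified, one possible stopgap is to discard any extracted timex whose character span overlaps something that looks like a phone number. The following is only a sketch of that post-filter: `PhoneSpanFilter`, `overlapsPhoneNumber`, and the regular expression are hypothetical, it uses only the character offsets reported in the output above, and it does not touch SUTime's rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-filter, not part of SUTime or CoreNLP.
class PhoneSpanFilter {
    // Assumed pattern for hyphenated Japanese-style numbers such as 090-1234-5678.
    private static final Pattern PHONE =
            Pattern.compile("(?<!\\d)0\\d{1,3}-\\d{2,4}-\\d{4}(?!\\d)");

    // True if the half-open character span [begin, end) overlaps a phone-like match.
    static boolean overlapsPhoneNumber(String text, int begin, int end) {
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            if (begin < m.end() && m.start() < end) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Offsets 5 to 18 are the ones reported for the timex in this issue.
        System.out.println(overlapsPhoneNumber("Call 090-1234-5678", 5, 18));
        System.out.println(overlapsPhoneNumber("Active 1990-1995", 7, 16));
    }
}
```

The pattern is deliberately narrow, but it would also match a numeric date such as 02-25-2020, so it would need tuning for text that mixes phone numbers and dates.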
