Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wm_util.Date? #91

Open
hannesm opened this issue May 23, 2018 · 6 comments
Open

Wm_util.Date? #91

hannesm opened this issue May 23, 2018 · 6 comments

Comments

@hannesm
Copy link
Contributor

hannesm commented May 23, 2018

while reviewing the diff between 0.5 and 0.6, I couldn't resist to lookup the specification of HTTP 1.1 and especially the Date header (see (obsoleted) RFC 2616 Sec 3.3 and RFC 7231 Section 7.1.1.1) as references):

  • An HTTP-date value represents time as an instance of Coordinated Universal Time (UTC). -- why does the parser in Wm_util include (i.e. are there clients out there who actually set a different timezone which webmachine needs to support - and what is the reason to hardcode the 8 timezones below)?
          | "" | "Z" | "GMT" | "UTC" | "UT" -> 0
          | "PST" -> -480
          | "MST" | "PDT" -> -420
          | "CST" | "MDT" -> -360
          | "EST" | "CDT" -> -300
          | "EDT" -> -240
          | s -> Scanf.sscanf s "%c%02d%_[:]%02d" (fun sign hour min ->
              min + hour * (if sign = '-' then -60 else 60))
  • HTTP-date is case sensitive. this doesn't seem to be taken care of in webmachine at all
  • I fail to understand why year is bound to if year < 50 then 2000 + year else if year < 1000 then year + 1000 after %4d was used in sscanf (which parses a 4 digit number)

The function is named parse_rfc1123_date (which may actually parse a full RFC 1123 timestamp), but in HTTP the HTTP-Date is slightly different (a subset thereof, plus allowing other formats) according to the RFCs mentioned above.

Using sscanf is dangerous: there can be characters at the end of the input which are not matched by the expression, thus a date such as Mon, 10 Mar 1994 10:20:30 a0040abcdef is accepted by webmachine.

Should we revise the implementation to (a) include some test cases and (b) be more strict about its input? It is not clear to me whether there are still clients out there (which we would like to talk to) that don't use IMF-fixdate, but RFC 850 (Sunday, 06-Nov-94 08:49:37 GMT) or asctime () (Sun Nov 6 08:49:37 1994), which are both required by 7231. If this is worthwhile, I suspect using angstrom for the parser would be more convenient than the OCaml stdlib utilities. or is a dependency onto angstrom in webmachine intentionally avoided?

/cc @ansiwen @seliopou

@ansiwen
Copy link
Contributor

ansiwen commented May 23, 2018

I only looked into RFC 1123 and never verified if this is actually the correct RFC for HTTP headers. And RFC 1123 Section 5.2.14 states:

There is a strong trend towards the use of numeric timezone
indicators, and implementations SHOULD use numeric timezones
instead of timezone names.  However, all implementations MUST
accept either notation.  If timezone names are used, they MUST
be exactly as defined in RFC-822.

and RFC 822 Section 5.1. states:

zone        =  "UT"  / "GMT"                ; Universal Time
                                            ; North American : UT
            /  "EST" / "EDT"                ;  Eastern:  - 5/ - 4
            /  "CST" / "CDT"                ;  Central:  - 6/ - 5
            /  "MST" / "MDT"                ;  Mountain: - 7/ - 6
            /  "PST" / "PDT"                ;  Pacific:  - 8/ - 7
            /  1ALPHA                       ; Military: Z = UT;
                                            ;  A:-1; (J not used)
                                            ;  M:-12; N:+1; Y:+12
            / ( ("+" / "-") 4DIGIT )        ; Local differential
                                            ;  hours+min. (HHMM)

You are right, the year calculations are useless, because it's always parsed with 4 digits. I didn't spend too much time on that parser, I just wanted something independent of CalendarLib that is at least not worse than the old parser. So I avoided to read RFCs as much as possible, which is not a joy for me anyway. So if we need a different parser, and not RFC1123, feel free to replace it. So we should implement an RFC7231 parser instead? Regarding the accepted input I would go with Postel's law and be liberal about it as long as it doesn't impair security. But also here I don't have a strong opinion.

@cfcs
Copy link

cfcs commented May 24, 2018

FWIW I haven't been able to find any examples of clients that send timezones alternative to "GMT", and I believe that doing so is wrong.

I checked apache's httpd, and nginx, and they seem to agree:

@cfcs
Copy link

cfcs commented May 24, 2018

to address @hannesm's points:

  • Ignoring output after the timestamp seems to be what both nginx and apache does (see apache's use of * in the apr_date_checkmask(date, "## @$$ #### ##:##:## *") calls), so this is compliant with that.
    Specifying anything but GMT seems to be illegal, so the question is if you want to penalize illegal clients by failing the request, or silently use their broken date header.
    Given that the Date header is not extremely important for anything, and is mostly useful for the If- headers, I don't think ignoring broken clients is a problem here.

  • I can't think of any HTTP/1.1 headers names that aren't case-insensitive, so if webmachine doesn't handle that, it seems to be a bug. It sure is annoying since it increases time spent in the parser, but that's what you get for using a crappy protocol.

  • The 50 years thing: RFC7231 (HTTP/1.1) says:

    Recipients of a timestamp value in rfc850-date format, which uses a
    two-digit year, MUST interpret a timestamp that appears to be more
    than 50 years in the future as representing the most recent year in
    the past that had the same last two digits.

    • It follows that whoever wrote the spec considered 2-digit years legal, and the implementation if year < 50 then 2000 + year else if year < 1000 then year + 1000 is wrong.
      Take the examples 17 ; 40; 70.
    • 17: 2017 is not more than 50 years after 2018, so it should be 2017.
    • 40: 2040 is not more than 50 years after 2018, so it should be 2040.
    • 51: 2051 is not more than 50 years after 2018, so it should be 2051.
    • 70: 2070 is more than 50 years after 2018, so it should be 1970.
      With the implementation cited above applied two 2-digit years those would be:
    • 17: 2017
    • 40: 2040
    • 51: 1051
    • 70: 1070 - this is a ridiculous year in the context of purporting the be the time of generation of the HTTP request.
      I believe the logic should be:
     (* handle two-digit years: *)
     if year < 0 then failwith "no negative years";
     if year <= (current_year mod 100) + 50 then
        year = (current_year - current_year mod 100) + year
     else
        year = (current_year - 100 - current_year mod 100) + year

    Furthermore, as @hannesm points out it doesn't make sense to use this logic on input from a four-digit year, since 0017 is a valid four-digit year, and while nonsense in a real-world context, it could very well be used to specify HTTP-requests prophecized by Jesus.
    It doesn't make sense to apply it to four-digit years either, consider 0999 turning into 1999 and 1000 turning into 1000; that's almost a thousand years off.
    But of course this is a bit pedantic; I couldn't find any clients that actually send two-digit years so I'd be inclined to not support that.

ansiwen added a commit to ansiwen/ocaml-webmachine that referenced this issue May 24, 2018
Before the date has been parsed according to RFC 1123.  However, as
pointed out by issue inhabitedtype#91, RFC 7231 defines a subset of this format
for HTTP dates (IMF-fixdate).  This change simplifies the parser,
assuming the IMF-fixdate format according to RFC 7231.  In violence
to RFC 7231 the legacy RFC 850 and asctime() formats are
intentionally not handled, as before.

Closes inhabitedtype#91
ansiwen added a commit to ansiwen/ocaml-webmachine that referenced this issue May 24, 2018
Before the date has been parsed according to RFC 1123.  However, as
pointed out by issue inhabitedtype#91, RFC 7231 defines a subset of this format
for HTTP dates (IMF-fixdate).  This change simplifies the parser,
assuming the IMF-fixdate format according to RFC 7231.  In violence
to RFC 7231 the legacy RFC 850 and asctime() formats are
intentionally not handled, as before.

Closes inhabitedtype#91
@ansiwen
Copy link
Contributor

ansiwen commented May 24, 2018

@cfcs @hannesm

  • Regarding the time-zones: I agree, they are not required (they don't hurt either), we can safely dump them.

  • IMF-Date months are case-sensitive, this parser is case-sensitive, I don't see the problem here.

  • Regarding the year, there are two things to make clear:

    • With Scanf a %4d reads an integer having at most 4 decimal digits, so it would still match two digit integers. So the calculations would still make sense in this case.

    • The calculations don't follow RFC850, but as the commit message mentioned RFC 1123 and RFC 5322, and there they have - surprisingly 🙄 - a different understanding:
      RFC 5322 Sec. 4.3 says:

      Where a two or three digit year occurs in a date, the year is to be
      interpreted as follows: If a two digit year is encountered whose
      value is between 00 and 49, the year is interpreted by adding 2000,
      ending up with a value between 2000 and 2049. If a two digit year is
      encountered with a value between 50 and 99, or any three digit year
      is encountered, the year is interpreted by adding 1900.

      So I boldly claim, my calculations are correct. 😏

  • I think we are over-engineering a bit here, since this is meant to be not a validator but a simple parser that is able to understand the most common format, which appears to be called IMF-fixdate. And it does that already now. However, because I also like lean code, I simplified the parser a bit, assuming IMF-fixdate input, and made the scanf format a bit stricter. See PR Simplify http date parser according to RFC 7231 #92.

@cfcs
Copy link

cfcs commented May 24, 2018

  • Timezones: Nice!

  • IMF-Date-stuff: I had completely misunderstood, I thought we were talking about the header name (daTe: vs Date: etc), sorry!

  • Scanf: We lose the ability to tell 0017 from 2017, but that's fine since 0017 would not be legal to generate anyway.

  • Calculations:

    • Well it should follow RFC7231 from 2014, but oh well, this doesn't really seem to be used anyway, so I don't think it matters.
    • Your implementation of RFC5322's looks correct to me! :)
  • I think you PR looks good :)

@seliopou
Copy link
Member

ACKing discussion. I haven't had a chance to read it thoroughly but I will get to it this weekend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants