Message codes revamp #1092

rdeltour · 2019-12-05T10:53:03Z

TL;DR: should we revamp EPUBCheck’s message code system? If so, how?

Background: EPUBCheck message codes

All the validation messages (e.g. warnings and errors) produced by EPUBCheck are associated with codes. For example, message codes can be RSC-005, PKG-007, etc.

The first 3 letters indicate the topic the message is related to (e.g. HTM for XHTML Content Documents, PKG for package-related issues, NAV for issues related to Navigation Documents). The second part of the code is a number, which is incremented whenever a new check is implemented and we need a new message.

Drawbacks of the current system

These codes and their organization can be confusing, for various reasons.

First, the topic code (the first 3 letters) does not always help identify what the error is related to:

all validation errors coming from schema checks (e.g. an invalid element or attribute) are reported as RSC-005 (parsing error), whether it’s about the Package Document, an XHTML Content Document, a Navigation Document, etc… Similarly, all warnings will be reported as RSC-017.
the difference between OPF- and PKG- is not obvious. "OPF" is EPUB 2 legacy, so errors related to the Package Document (.opf extension) will be OPF- and not PKG-. PKG- errors are more related to the package as a collection of files. So for instance, when the package document is missing ("OPF file could not be found"), the error is reported as PKG-020, not OPF-something!
some topics do not really have a code, so messages were sometimes shoehorned into existing categories. For instance, HTM-048 is about SVG fixed-layout documents. Similarly, the MED category is used for media files (video, images), but sometimes also for media overlays.

Then, the numbering scheme is a bit wonky:

using incremented numbers means that some similar messages are several numbers apart. For instance, parsing errors are coded RSC-005, while parsing warnings are RSC-017.
sometimes, message variants were needed or introduced after an original message. In this case, a lowercase letter is added to the number, like OPF-004, OPF-004a, OPF-004b, etc.
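
To make the code structure described above concrete (three-letter topic, hyphen, number, optional lowercase variant letter), here is a minimal Java sketch. The `MessageCode` record and its `parse` method are hypothetical names used for illustration only; they are not part of EPUBCheck, and the pattern is inferred solely from the examples quoted in this issue.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of EPUBCheck: splits a code such as "OPF-004a"
// into its topic, number, and optional variant letter.
record MessageCode(String topic, int number, String variant) {

    // Assumed format: three uppercase letters, a hyphen, digits,
    // then an optional single lowercase letter.
    private static final Pattern FORMAT =
            Pattern.compile("([A-Z]{3})-(\\d+)([a-z]?)");

    static MessageCode parse(String code) {
        Matcher m = FORMAT.matcher(code);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a message code: " + code);
        }
        // group(3) is the empty string when the code has no variant letter.
        return new MessageCode(m.group(1), Integer.parseInt(m.group(2)), m.group(3));
    }

    public static void main(String[] args) {
        MessageCode c = MessageCode.parse("OPF-004a");
        // prints: OPF / 4 / a
        System.out.println(c.topic() + " / " + c.number() + " / " + c.variant());
    }
}
```

Note that nothing in such a parser can tell that RSC-005 and RSC-017 are the error and warning forms of the same check; that relationship lives only in the documentation, which is the drawback described above.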

Possible refactoring

There are several ways to revamp the message code system, for instance:

do not use message codes (remove them altogether)
work on a better code system
use only incremental numbers
use random codes (i.e. remove all logic, so there's nothing to be confused about)
reorganize code topics to match specs, more consistently
???

Questions

are you using the current message codes?
are you happy with the current message codes? or pulling your hair out because of them?
are you depending on the current message codes? is changing them absolutely a no-go?
any suggestions for improving the current system?


samalloing · 2019-12-06T07:59:32Z

Hi @rdeltour

We use the message codes, but not yet in an automated way, so changing them would not be a problem for us. We are happy with the current system and don't have a strong opinion on how it could be improved. The one thing that would be interesting to add is whether a message is an error or a warning. We select which problems we need to deal with; for example, we could decide to ignore all warnings. This matters because the Severity element in the XML output says error even if it is a warning, and we select on this Severity element in our process. Valid files (status is valid and well-formed) can also have a severity of error.

Thanks for all the work on epubcheck!

Sam

bitsgalore · 2019-12-10T14:21:46Z

Don't know if this helps, but you might want to have a look at how VeraPDF (a conformance checker for the PDF/A standards) handles this. They created validation rules, where each rule contains an explicit reference to the standard it applies to, as well as the clause in the standard on which the rule is based. Below is an example:

  <rule specification="ISO 19005-1:2005" clause="6.7.2" testNumber="1" status="failed" passedChecks="0" failedChecks="1">
    <description>The document catalog dictionary of a conforming file shall contain the Metadata key.</description>
    <object>PDDocument</object>
    <test>metadata_size == 1</test>
    <check status="failed">
      <context>root/document[0]</context>
    </check>
  </rule>

The obvious advantage is that it establishes a direct link between the validator and the filespec. Perhaps it is possible to use something similar for EPUBCheck?
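
As a rough illustration of what such a link could look like on the EPUBCheck side, here is a small Java sketch that maps a message code to a specification reference. Everything in it is an assumption made for illustration: the `SpecReference` record, the `label` method, and the placeholder code `XXX-001` are hypothetical, and the specification and clause values are simply copied from the VeraPDF example above rather than taken from any EPUB specification.

```java
import java.util.Map;

// Hypothetical data model: the reference a message code would carry
// to the specification clause it enforces.
record SpecReference(String specification, String clause, int testNumber) {

    // Human-readable label, e.g. for appending to a validation message.
    String label() {
        return specification + ", clause " + clause + ", test " + testNumber;
    }

    public static void main(String[] args) {
        // "XXX-001" is a placeholder code; the values come from the VeraPDF example.
        Map<String, SpecReference> refs =
                Map.of("XXX-001", new SpecReference("ISO 19005-1:2005", "6.7.2", 1));
        // prints: ISO 19005-1:2005, clause 6.7.2, test 1
        System.out.println(refs.get("XXX-001").label());
    }
}
```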

A possible argument against doing this is that it might complicate things if new versions of the filespec are organised differently than the current one, since that would break this link, and fixing this could turn into a major pain, especially if there are frequent updates to the spec. Also, looking at the evolution of EPUB thus far, I think changes to the format have been both more frequent and more radical than changes to the PDF/A profiles, so the situations for both formats may not be completely comparable. In any case this would require quite a bit of coordination between the writers of the filespec and the EPUBCheck developers.

It might also be a good idea to get in touch with the VeraPDF developers at the Open Preservation Foundation (OPF). One of the other tools they're maintaining is JHOVE, and they're currently working on a JHOVE EPUB module that wraps EPUBCheck. So they will probably be both interested in this and willing to help.

sci-phi · 2020-01-30T18:07:11Z

The VST system uses the short codes from EPUBCheck to interpret or discard preflight messages:

		if (message.code.equalsIgnoreCase("RSC-005")) {
			if (message.message.contains("spine")) {
				// Reject spine-related errors
				return RESULT_FAIL;
			}
		}

GarthConboy · 2020-01-30T23:13:41Z

Yes, we use, and are dependent upon, these error codes. These form the basis of our ingestion whitelisting system. The existence of these codes and their immutability makes integration of each updated epubcheck version easy for Google Play. We would vote (strongly) for "stay the course."
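
For readers unfamiliar with this kind of setup, a code-based allowlist can be sketched in a few lines of Java. This is a generic illustration under stated assumptions, not Google Play's actual pipeline: the `CodeAllowlist` class, the `blocking` method, and the set of tolerated codes are all hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of a code-based allowlist; not any vendor's real system.
class CodeAllowlist {

    // Assumed, purely illustrative set of codes the ingestion step tolerates.
    private static final Set<String> TOLERATED = Set.of("RSC-017", "OPF-004a");

    // Returns the reported codes that are not on the allowlist, in report order.
    static List<String> blocking(List<String> reportedCodes) {
        return reportedCodes.stream()
                .filter(code -> !TOLERATED.contains(code))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints: [RSC-005, PKG-020]
        System.out.println(blocking(List.of("RSC-017", "RSC-005", "PKG-020")));
    }
}
```

A filter like this only keeps working across EPUBCheck releases if each code keeps its meaning, which is why immutability matters for this use case.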

karenhanson · 2020-01-30T23:37:53Z

I contributed the first iteration of the EPUB module for JHOVE mentioned above and it's part of the current release candidate. It makes use of the severity level and the 3-letter prefix (PKG only) to assign Well-Formedness and Validity. The documentation explains how they are used. It being a new module I was anticipating some maintenance, and wondered if I might need to refine how the message codes are interpreted. Will stay tuned!

vincent-gros · 2020-01-31T10:48:50Z

Hi @rdeltour,

We use error codes for automated analysis on multiple files. It will be harder without them. Refactoring based on specs could be a good idea.

rdeltour added the labels "type: improvement" (the issue suggests an improvement of an existing feature) and "status: in discussion" (the issue is being discussed by the development team) on Dec 5, 2019

rdeltour self-assigned this Dec 5, 2019
