Message codes revamp #1092

Open
rdeltour opened this issue Dec 5, 2019 · 6 comments
Labels: status: in discussion (the issue is being discussed by the development team); type: improvement (the issue suggests an improvement of an existing feature)

Comments

@rdeltour
Member

rdeltour commented Dec 5, 2019

TL;DR: should we revamp EPUBCheck’s message code system? if yes, how?

Background: EPUBCheck message codes

All the validation messages (e.g. warnings and errors) produced by EPUBCheck are associated with codes. For example, message codes can be RSC-005, PKG-007, etc.

The first 3 letters indicate the topic the message is related to (e.g. HTM for XHTML Content Documents, PKG for package-related issues, NAV for issues related to Navigation Documents). The second part of the code is a number which is incremented when a new check is implemented and we need a new message.

Drawbacks of the current system

These codes and their organization can be confusing, for various reasons.

First, the topic code (the first 3 letters) does not always help identify what the error is related to:

  • all validation errors coming from schema checks (e.g. an invalid element or attribute) are reported as RSC-005 (parsing error), whether it’s about the Package Document, an XHTML Content Document, a Navigation Document, etc… Similarly, all warnings will be reported as RSC-017.
  • the difference between OPF- and PKG- is not obvious. "OPF" is EPUB 2 legacy, so errors related to the Package Document (.opf extension) will be OPF- and not PKG-. PKG- errors are more related to the package as a collection of files. So for instance, when the package document is missing ("OPF file could not be found"), the error is reported as PKG-020, not OPF-something!
  • some topics do not really have a code, so messages were sometimes shoehorned into existing categories. For instance, HTM-048 is about SVG fixed-layout documents, and the MED category is used for media files (video, images) but sometimes also for media overlays.

Then, the numbering scheme is a bit wonky:

  • using incremented numbers means that some similar messages are several numbers apart. For instance, parsing errors are coded RSC-005, while parsing warnings are RSC-017.
  • sometimes, message variants were needed or introduced after an original message. In this case, a lowercase letter is added to the number, like OPF-004, OPF-004a, OPF-004b, etc.
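
To make the structure described above concrete, here is a minimal sketch that splits a code into its topic, number, and optional variant letter. The class name `MessageCode` and the regular expression are assumptions made for illustration (inferred from the examples in this issue), not part of EPUBCheck's API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an EPUBCheck class.
class MessageCode {
    // Assumed shape: three uppercase letters, a hyphen, digits,
    // and an optional lowercase variant letter (e.g. OPF-004a).
    private static final Pattern CODE =
            Pattern.compile("([A-Z]{3})-(\\d+)([a-z]?)");

    final String topic;   // e.g. "OPF"
    final int number;     // e.g. 4
    final String variant; // e.g. "a", or "" when there is no variant

    private MessageCode(String topic, int number, String variant) {
        this.topic = topic;
        this.number = number;
        this.variant = variant;
    }

    static MessageCode parse(String code) {
        Matcher m = CODE.matcher(code);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a message code: " + code);
        }
        return new MessageCode(m.group(1), Integer.parseInt(m.group(2)), m.group(3));
    }

    public static void main(String[] args) {
        MessageCode c = MessageCode.parse("OPF-004a");
        System.out.println(c.topic + " " + c.number + " " + c.variant); // prints "OPF 4 a"
    }
}
```

Note that the topic alone says little about the failing document: under the current scheme, `RSC-005` parses to the same topic whether the schema error came from a Package Document or a Navigation Document.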

Possible refactoring

There are several ways to revamp the message code system, for instance:

  • do not use message codes (remove them altogether)
  • work on a better code system
  • use only incremental numbers
  • use random codes (i.e. remove all logic, so there's nothing to be confused about)
  • reorganize code topics to match specs, more consistently
  • ???

Questions

  • are you using the current message codes?
  • are you happy with the current message codes? or pulling your hair out because of them?
  • are you depending on the current message codes? is changing them absolutely a no-go?
  • any suggestions for improving the current system?

rdeltour added the "type: improvement" and "status: in discussion" labels Dec 5, 2019
rdeltour self-assigned this Dec 5, 2019

@samalloing

Hi @rdeltour

We use the message codes, but not yet in an automated way, so for us they can be changed. We are happy with the current system and don't have a strong opinion about how it could be improved. The only thing that would be interesting to add is whether something is an error or a warning. We select which problems we need to deal with; for example, we could decide to ignore all warnings. This is important because the Severity element in the XML output says error even if it is a warning. We select on this severity element in our process. But valid files (status is valid and well-formed) can also have a severity of error.

Thanks for all the work on epubcheck!

Sam

@bitsgalore

bitsgalore commented Dec 10, 2019

Don't know if this helps, but you might want to have a look at how VeraPDF (a conformance checker for the PDF/A standards) handles this. They created validation rules, where each rule contains an explicit reference to the standard it applies to, as well as the clause in the standard on which the rule is based. Below is an example:

  <rule specification="ISO 19005-1:2005" clause="6.7.2" testNumber="1" status="failed" passedChecks="0" failedChecks="1">
    <description>The document catalog dictionary of a conforming file shall contain the Metadata key.</description>
    <object>PDDocument</object>
    <test>metadata_size == 1</test>
    <check status="failed">
      <context>root/document[0]</context>
    </check>
  </rule>

The obvious advantage is that it establishes a direct link between the validator and the filespec. Perhaps it is possible to use something similar for EPUBCheck?

A possible argument against doing this is that it might complicate things if new versions of the filespec are organised differently than the current one, since that would break this link, and fixing this could turn into a major pain, especially if there are frequent updates to the spec. Also, looking at the evolution of EPUB thus far, I think changes to the format have been both more frequent and more radical than changes to the PDF/A profiles, so the situations for both formats may not be completely comparable. In any case this would require quite a bit of coordination between the writers of the filespec and the EPUBCheck developers.
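
As a rough illustration of the idea, a message could carry a structured reference to the spec instead of an opaque code. The sketch below is hypothetical: the class `SpecRule`, its fields, and the `id()` format are assumptions modelled on the attributes of the VeraPDF example above, not on anything in EPUBCheck or VeraPDF's Java API.

```java
// Hypothetical data structure, modelled on the VeraPDF <rule> attributes.
class SpecRule {
    final String specification; // e.g. "ISO 19005-1:2005"
    final String clause;        // e.g. "6.7.2"
    final int testNumber;       // distinguishes several checks for one clause

    SpecRule(String specification, String clause, int testNumber) {
        this.specification = specification;
        this.clause = clause;
        this.testNumber = testNumber;
    }

    // Assumed identifier format: specification/clause/testNumber.
    String id() {
        return specification + "/" + clause + "/" + testNumber;
    }

    public static void main(String[] args) {
        SpecRule rule = new SpecRule("ISO 19005-1:2005", "6.7.2", 1);
        System.out.println(rule.id()); // prints "ISO 19005-1:2005/6.7.2/1"
    }
}
```

Because the clause number is part of the identifier, any renumbering in a new spec revision changes the identifier too, which is the maintenance cost mentioned above.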

It might also be a good idea to get in touch with the VeraPDF developers at the Open Preservation Foundation (OPF). One of the other tools they're maintaining is JHOVE, and they're currently working on a JHOVE EPUB module that wraps EPUBCheck. So they will probably be both interested in this and willing to help.

@sci-phi

sci-phi commented Jan 30, 2020

The VST system uses the short-codes from EPUBCheck to interpret or discard preflight messages:

		if (message.code.equalsIgnoreCase("RSC-005")) {
			if (message.message.contains("spine")) {
				// Reject spine-related errors
				return RESULT_FAIL;
			}
		}

@GarthConboy
Contributor

Yes, we use, and are dependent upon, these error codes. These form the basis of our ingestion whitelisting system. The existence of these codes and their immutability makes integration of each updated epubcheck version easy for Google Play. We would vote (strongly) for "stay the course."
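
For readers unfamiliar with this kind of integration, the following is a minimal sketch of how an allow-list keyed on message codes might work. It is an assumption for illustration only: the class `CodeAllowList`, its method names, and the choice of allowed codes are hypothetical and do not describe the actual Google Play ingestion system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical allow-list: reported codes not on the list block ingestion.
class CodeAllowList {
    private final Set<String> allowed;

    CodeAllowList(Collection<String> allowedCodes) {
        this.allowed = new HashSet<>(allowedCodes);
    }

    // Returns the reported codes that are NOT on the allow-list, in order.
    List<String> blocking(List<String> reportedCodes) {
        List<String> result = new ArrayList<>();
        for (String code : reportedCodes) {
            if (!allowed.contains(code)) {
                result.add(code);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example policy (assumed): tolerate schema warnings only.
        CodeAllowList list = new CodeAllowList(Arrays.asList("RSC-017"));
        System.out.println(list.blocking(Arrays.asList("RSC-017", "PKG-020"))); // prints "[PKG-020]"
    }
}
```

A lookup like this only keeps working across EPUBCheck releases if each code keeps its meaning, which is why immutability of the codes matters for this kind of consumer.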

@karenhanson

I contributed the first iteration of the EPUB module for JHOVE mentioned above and it's part of the current release candidate. It makes use of the severity level and the 3-letter prefix (PKG only) to assign Well-Formedness and Validity. The documentation explains how they are used. It being a new module I was anticipating some maintenance, and wondered if I might need to refine how the message codes are interpreted. Will stay tuned!

@vincent-gros
Contributor

Hi @rdeltour,

We use error codes for automated analysis on multiple files. It will be harder without them. Refactoring based on specs could be a good idea.

7 participants