Poor output quality on PDFs converted to Markdown with default settings #146

Open
lsorber opened this issue Jan 1, 2025 · 1 comment
lsorber commented Jan 1, 2025

Here's a minimal example that applies the largest model, sat-12l-sm, to the first paragraph of Einstein's Special Relativity paper (extracted from PDF):

from wtpsplit import SaT

specrel_head = """
# ON THE ELECTRODYNAMICS OF MOVING BODIES

## By A. EINSTEIN  June 30, 1905

It is known that Maxwell’s electrodynamics—as usually understood at the
present time—when applied to moving bodies, leads to asymmetries which do
not appear to be inherent in the phenomena. Take, for example, the recipro-
cal electrodynamic action of a magnet and a conductor. The observable phe-
nomenon here depends only on the relative motion of the conductor and the
magnet, whereas the customary view draws a sharp distinction between the two
cases in which either the one or the other of these bodies is in motion. For if the
magnet is in motion and the conductor at rest, there arises in the neighbour-
hood of the magnet an electric field with a certain definite energy, producing
a current at the places where parts of the conductor are situated. But if the
magnet is stationary and the conductor in motion, no electric field arises in the
neighbourhood of the magnet. In the conductor, however, we find an electro-
motive force, to which in itself there is no corresponding energy, but which gives
rise—assuming equality of relative motion in the two cases discussed—to elec-
tric currents of the same path and intensity as those produced by the electric
forces in the former case.
""".strip()

sat = SaT("sat-12l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
sat.split(specrel_head)

This is the example's output:

[
    "# ON THE ELECTRODYNAMICS OF MOVING BODIES",
    "",
    "## ",
    "By A. EINSTEIN  June 30, 1905",
    "",
    "",
    "It is known that Maxwell’s electrodynamics—as usually understood at the",
    "present time—when applied to moving bodies, leads to asymmetries which do",
    "not appear to be inherent in the phenomena. ",
    "Take, for example, the recipro-",
    "cal electrodynamic action of a magnet and a conductor. ",
    "The observable phe-",
    "nomenon here depends only on the relative motion of the conductor and the",
    "magnet, whereas the customary view draws a sharp distinction between the two",
    "cases in which either the one or the other of these bodies is in motion. ",
    "For if the",
    "magnet is in motion and the conductor at rest, there arises in the neighbour-",
    "hood of the magnet an electric field with a certain definite energy, producing",
    "a current at the places where parts of the conductor are situated. ",
    "But if the",
    "magnet is stationary and the conductor in motion, no electric field arises in the",
    "neighbourhood of the magnet. ",
    "In the conductor, however, we find an electro-",
    "motive force, to which in itself there is no corresponding energy, but which gives",
    "rise—assuming equality of relative motion in the two cases discussed—to elec-",
    "tric currents of the same path and intensity as those produced by the electric",
    "forces in the former case.",
]

The output quality seems to be quite low:

  1. The model produces many more sentences than there actually are, seemingly splitting on newlines.
  2. The model produces superfluous empty sentences.
  3. The model incorrectly splits Markdown headings into multiple sentences.
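
Points 2 and 3 could be worked around after the fact. The following is a minimal post-processing sketch, not part of wtpsplit: the helper name `clean_segments` and the rule "a segment consisting only of `#` characters and whitespace belongs to the next segment" are assumptions made for illustration.

```python
import re

# A segment that is only a Markdown heading marker, e.g. "## " (assumed heuristic).
HEADING_MARKER = re.compile(r"#{1,6}\s*")


def clean_segments(segments):
    """Drop empty segments and glue bare heading markers onto the next segment."""
    cleaned = []
    pending = ""  # heading marker waiting for its heading text
    for segment in segments:
        if not segment.strip():
            continue  # superfluous empty "sentence"
        if HEADING_MARKER.fullmatch(segment):
            pending += segment
            continue
        cleaned.append(pending + segment)
        pending = ""
    if pending:
        cleaned.append(pending)  # marker at the very end, keep it rather than lose it
    return cleaned
```

Applied to the first six entries of the output above, this yields the two headings as one segment each. It does not address point 1, the splits on newlines.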

With treat_newline_as_space=True, i.e. sat.split(specrel_head, treat_newline_as_space=True), the output improves:

[
    "# ON THE ELECTRODYNAMICS OF MOVING BODIES\n\n## ",
    "By A. EINSTEIN  June 30, 1905\n\n",
    "It is known that Maxwell’s electrodynamics—as usually understood at the\npresent time—when applied to moving bodies, leads to asymmetries which do\nnot appear to be inherent in the phenomena. ",
    "Take, for example, the recipro-\ncal electrodynamic action of a magnet and a conductor. ",
    "The observable phe-\nnomenon here depends only on the relative motion of the conductor and the\nmagnet, whereas the customary view draws a sharp distinction between the two\ncases in which either the one or the other of these bodies is in motion. ",
    "For if the\nmagnet is in motion and the conductor at rest, there arises in the neighbour-\nhood of the magnet an electric field with a certain definite energy, producing\na current at the places where parts of the conductor are situated. ",
    "But if the\nmagnet is stationary and the conductor in motion, no electric field arises in the\nneighbourhood of the magnet. ",
    "In the conductor, however, we find an electro-\nmotive force, to which in itself there is no corresponding energy, but which gives\nrise—assuming equality of relative motion in the two cases discussed—to elec-\ntric currents of the same path and intensity as those produced by the electric\nforces in the former case.",
]
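
Even in this improved output, the hard line wraps and end-of-line hyphenation from the PDF remain inside each sentence. One option is to normalize the text before splitting. The sketch below is an assumption about what such a step could look like, not something wtpsplit provides; `unwrap_pdf_text` is a hypothetical helper that uses only Python's `re` module.

```python
import re


def unwrap_pdf_text(text: str) -> str:
    """Undo hard line wraps from PDF extraction, keeping blank-line paragraph breaks."""
    # Rejoin words hyphenated across a line break: "recipro-\ncal" -> "reciprocal".
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Turn a single newline inside a paragraph into a space; "\n\n" is left alone.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```

The result could then be passed to `sat.split(...)` in place of `specrel_head`. One caveat: a genuine compound that happens to break at a line end (for example "well-" followed by "known") loses its hyphen under this rule.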

A few questions:

  1. To me, "class-\nroom" and "class- room" are not the same. When setting treat_newline_as_space=True, does the model actually treat newlines as spaces, or as its own character?
  2. For my understanding, is there a reason why treat_newline_as_space is False by default?

markus583 (Collaborator) commented

Hi! treat_newline_as_space was introduced to retain newlines provided in the input. With the default (False), whatever newlines you have in your input also appear as splits in the output, in addition to the boundaries found by the model. If it is set to True, they are not retained as splits. The model does not care in either case since it cannot process newlines; this is just an additional processing step.
But now I realize the name is pretty ambiguous and should be changed to retain_newlines. I'd change it in the next release.
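
To picture the "additional processing step": the two outputs in this issue are consistent with the default behaving like the `treat_newline_as_space=True` result with every sentence further split on `\n`. The toy function below only illustrates that relationship. It is not taken from wtpsplit's source, and `split_on_input_newlines` is a hypothetical name.

```python
def split_on_input_newlines(sentences):
    """Split each model-found sentence again at every newline it contains."""
    pieces = []
    for sentence in sentences:
        pieces.extend(sentence.split("\n"))
    return pieces


# First two entries of the treat_newline_as_space=True output from this issue.
with_flag = [
    "# ON THE ELECTRODYNAMICS OF MOVING BODIES\n\n## ",
    "By A. EINSTEIN  June 30, 1905\n\n",
]
print(split_on_input_newlines(with_flag))
```

This reproduces the first six entries of the default output, including the empty strings, which come from the blank lines around the headings.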

markus583 self-assigned this Jan 12, 2025