[Feature] Use HTML Tables Instead of Markdown Syntax for Better Table Support #1211

Description

@tosmart01

Problem

The current DOCX-Table-to-Markdown conversion loses critical formatting for:

  • Merged cells (rowspan/colspan)

  • Complex tables (nested structures, multi-level headers)

  • Styling (borders, alignment)

Markdown’s native table syntax (| --- |) lacks support for these features, resulting in broken or oversimplified output.
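
To make the limitation concrete, here is a minimal sketch (standard library only, not part of markitdown) that flattens an HTML table into pipe rows the way a naive converter would. The class name `NaivePipeTable` and the sample table are hypothetical and exist only for illustration. The header cell with colspan="2" collapses to a single cell, so the header row ends up narrower than the body row.

```python
from html.parser import HTMLParser


class NaivePipeTable(HTMLParser):
    """Hypothetical helper: collect cell text row by row, ignoring span attributes."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Sample input: one header cell spanning two columns, then a two-cell row.
SAMPLE = (
    "<table>"
    '<tr><th colspan="2">Name</th></tr>'
    "<tr><td>First</td><td>Last</td></tr>"
    "</table>"
)

parser = NaivePipeTable()
parser.feed(SAMPLE)
pipe_rows = ["| " + " | ".join(row) + " |" for row in parser.rows]
print("\n".join(pipe_rows))
```

The two printed rows have different cell counts, and pipe syntax has no way to say that the first cell covers both columns.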

Solution

Implemented a non-invasive override to output tables as HTML instead of Markdown, preserving structure and merged cells. Key changes:

  1. CustomMarkdownify Class (extends _CustomMarkdownify):

    Overrides convert_table(), convert_td(), convert_tr(), and convert_th() to return raw HTML elements.

    Wraps each table in <html><body> ... </body></html> to ensure valid HTML5 output.

  2. CustomHtmlConverter & CustomDocxConverter:

    Propagate the modified table handling while maintaining other conversions (e.g., text, headings).

  3. CustomMarkitdown Class:

    Swaps the default DocxConverter with CustomDocxConverter at runtime.

HTML table result example:

[Image: HTML table result example]

Code:

from typing import BinaryIO, Any

from bs4 import BeautifulSoup
from markitdown._markitdown import ConverterRegistration, PRIORITY_SPECIFIC_FILE_FORMAT
from markitdown.converters import DocxConverter, HtmlConverter
from markitdown.converters._markdownify import _CustomMarkdownify
from markitdown import MarkItDown, DocumentConverterResult, StreamInfo

import logging

logger = logging.getLogger(__name__)


class CustomMarkdownify(_CustomMarkdownify):
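    def convert_table(self, el, text, parent_tags):
        # Unwrap heading tags (h1-h6) inside the table so cell text stays inline.
        headers = [f"h{i}" for i in range(1, 7)]
        for h in headers:
            for h_element in el.find_all(h):
                h_element.unwrap()
        # Emit the table's original HTML rather than the Markdown-converted text.
        return f"<html><body>{el}</body></html>"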

    def convert_td(self, el, text, parent_tags):
        return str(el)

    def convert_tr(self, el, text, parent_tags):
        return str(el)

    def convert_th(self, el, text, parent_tags):
        return str(el)


class CustomHtmlConverter(HtmlConverter):
    def convert(
            self,
            file_stream: BinaryIO,
            stream_info: StreamInfo,
            **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Convert only the main content (the <body>, if present)
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(body_elm)
        else:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(soup)

        assert isinstance(webpage_text, str)

        # remove leading and trailing \n
        webpage_text = webpage_text.strip()

        return DocumentConverterResult(
            markdown=webpage_text,
            title=None if soup.title is None else soup.title.string,
        )


class CustomDocxConverter(DocxConverter):
    def __init__(self):
        super().__init__()
        self._html_converter = CustomHtmlConverter()


class CustomMarkitdown(MarkItDown):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.replace_converter()

    def replace_converter(self):
        for ix, convert in enumerate(self._converters):
            if isinstance(convert.converter, DocxConverter):
                self._converters[ix] = ConverterRegistration(converter=CustomDocxConverter(),
                                                             priority=PRIORITY_SPECIFIC_FILE_FORMAT)
                logger.info(f"Replaced markitdown DOCX converter with custom converter: {CustomDocxConverter}")
                break


if __name__ == '__main__':
    markdown = CustomMarkitdown()
    md = markdown.convert('test.docx')
    with open("result.md", "w", encoding="utf-8") as f:
        f.write(md.markdown)

Benefits

  • Preserves merged cells and complex table structure, because each table's original HTML is emitted as-is.
  • No upstream breaks (override-based, doesn’t modify core logic).
  • Works with renderers supporting HTML (GitHub, Typora, etc.).

Request

Consider merging this as an opt-in feature (e.g., via table_format="html" flag) or as the default behavior for complex tables.
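
One way the opt-in could behave is sketched below, under stated assumptions. The `table_format` option, its "auto" value, and the helper `resolve_table_format` are all hypothetical; none of them exists in markitdown today. The sketch uses only the standard library and picks HTML output for a table only when a cell actually carries a rowspan or colspan greater than 1.

```python
from html.parser import HTMLParser


class _SpanDetector(HTMLParser):
    """Sets `found` when any td/th has rowspan or colspan greater than 1."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            if name in ("rowspan", "colspan") and (value or "").isdigit() and int(value) > 1:
                self.found = True


def resolve_table_format(table_html: str, table_format: str = "markdown") -> str:
    """Hypothetical policy: return "html" or "markdown" for one table.

    `table_format` is an assumed option with values "markdown", "html", or "auto";
    it is not an existing markitdown parameter.
    """
    if table_format in ("markdown", "html"):
        return table_format
    if table_format != "auto":
        raise ValueError(f"unknown table_format: {table_format!r}")
    detector = _SpanDetector()
    detector.feed(table_html)
    return "html" if detector.found else "markdown"


# Illustrative inputs: one table with a merged cell, one without.
merged = '<table><tr><td rowspan="2">A</td><td>B</td></tr><tr><td>C</td></tr></table>'
simple = "<table><tr><td>A</td><td>B</td></tr></table>"

print(resolve_table_format(merged, "auto"))
print(resolve_table_format(simple, "auto"))
```

Keeping "markdown" as the default in this sketch leaves existing output unchanged unless a caller opts in, while "auto" covers the "default behavior for complex tables" variant of the request.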


Why This Matters

  • Many users need DOCX tables to render correctly in Markdown viewers.
  • HTML tables are the only reliable way to express merged cells in Markdown.
