[Feature] Use HTML Tables Instead of Markdown Syntax for Better Table Support

### Problem
The current DOCX-Table-to-Markdown conversion loses critical formatting for:

- Merged cells (rowspan/colspan)

- Complex tables (nested structures, multi-level headers)

- Styling (borders, alignment)

Markdown’s native table syntax (`| --- |`) lacks support for these features, resulting in broken or oversimplified output.
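To make the limitation concrete, here is a minimal, standard-library-only sketch. It is a deliberately naive flattener written for illustration, not MarkItDown's actual converter; `RowCollector`, `naive_pipe_rows`, and `SAMPLE` are hypothetical names. It emits one pipe cell per HTML cell, so a header with `colspan="2"` ends up with fewer columns than the row beneath it.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect [text, colspan] pairs for every cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # each row is a list of [text, colspan]
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            colspan = int(dict(attrs).get("colspan", 1))
            self.rows[-1].append(["", colspan])
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1][0] += data.strip()


def naive_pipe_rows(html):
    """Flatten each row to pipe syntax, ignoring colspan entirely."""
    parser = RowCollector()
    parser.feed(html)
    return ["| " + " | ".join(text for text, _ in row) + " |" for row in parser.rows]


SAMPLE = (
    "<table>"
    "<tr><th colspan='2'>Name</th><th>Age</th></tr>"
    "<tr><td>Ada</td><td>Lovelace</td><td>36</td></tr>"
    "</table>"
)

for line in naive_pipe_rows(SAMPLE):
    print(line)
# | Name | Age |
# | Ada | Lovelace | 36 |
```

The header row has two pipe cells and the body row has three. Pipe syntax has no attribute for spans, so the merge can only be approximated (for example by padding with empty cells), never expressed.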

### Solution
Implemented a non-invasive override to output tables as HTML instead of Markdown, preserving structure and merged cells. Key changes:

1. `CustomMarkdownify` class (extends `_CustomMarkdownify`):

   > Overrides `convert_table()`, `convert_td()`, `convert_tr()`, and `convert_th()` to return raw HTML elements.

   > Wraps each table in `<html><body>` so it is emitted as a standalone HTML block.

2. `CustomHtmlConverter` and `CustomDocxConverter`:

   > Propagate the modified table handling while leaving other conversions (e.g., text, headings) unchanged.

3. `CustomMarkitdown` class:

   > Swaps the default `DocxConverter` for `CustomDocxConverter` at runtime.


### HTML table result example:

![Image](https://github.com/user-attachments/assets/da5511b9-fde0-45a6-bf27-b9a6c023d369)

### Code:
```python
from typing import BinaryIO, Any

from bs4 import BeautifulSoup
from markitdown._markitdown import ConverterRegistration, PRIORITY_SPECIFIC_FILE_FORMAT
from markitdown.converters import DocxConverter, HtmlConverter
from markitdown.converters._markdownify import _CustomMarkdownify
from markitdown import MarkItDown, DocumentConverterResult, StreamInfo

import logging

logger = logging.getLogger(__name__)


class CustomMarkdownify(_CustomMarkdownify):
    """Emit tables as raw HTML so rowspan/colspan survive the conversion."""

    def convert_table(self, el, text, parent_tags):
        # Unwrap heading tags (h1-h6) inside the table, keeping their text.
        for h_element in el.find_all([f"h{i}" for i in range(1, 7)]):
            h_element.unwrap()
        return f"<html><body>{el}</body></html>"

    def convert_td(self, el, text, parent_tags):
        return str(el)

    def convert_tr(self, el, text, parent_tags):
        return str(el)

    def convert_th(self, el, text, parent_tags):
        return str(el)


class CustomHtmlConverter(HtmlConverter):
    def convert(
            self,
            file_stream: BinaryIO,
            stream_info: StreamInfo,
            **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Print only the main content
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(body_elm)
        else:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(soup)

        assert isinstance(webpage_text, str)

        # remove leading and trailing \n
        webpage_text = webpage_text.strip()

        return DocumentConverterResult(
            markdown=webpage_text,
            title=None if soup.title is None else soup.title.string,
        )


class CustomDocxConverter(DocxConverter):
    def __init__(self):
        super().__init__()
        self._html_converter = CustomHtmlConverter()


class CustomMarkitdown(MarkItDown):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.replace_converter()

    def replace_converter(self):
        for ix, convert in enumerate(self._converters):
            if isinstance(convert.converter, DocxConverter):
                self._converters[ix] = ConverterRegistration(converter=CustomDocxConverter(),
                                                             priority=PRIORITY_SPECIFIC_FILE_FORMAT)
                logger.info(f"Replaced MarkItDown DOCX converter with {CustomDocxConverter.__name__}")
                break


if __name__ == '__main__':
    markdown = CustomMarkitdown()
    md = markdown.convert('test.docx')
    with open("result.md", "w", encoding="utf-8") as f:
        f.write(md.markdown)
```


### **Benefits**

- ✅ **Structure preserved** for merged/complex tables (rowspan/colspan are kept in the output).
- ✅ **No upstream breaks** (override-based, doesn’t modify core logic).
- ✅ **Works with renderers** supporting HTML (GitHub, Typora, etc.).

### **Request**

Consider merging this as an **opt-in feature** (e.g., via `table_format="html"` flag) or as the default behavior for complex tables.
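One way the opt-in could behave is sketched below. Only the `table_format="html"` flag name comes from this request; the `"markdown"` and `"auto"` values, `choose_table_format`, and `_SpanDetector` are hypothetical and not part of MarkItDown. The `"auto"` mode falls back to HTML only when a table contains merged cells; it checks spans only and does not detect nested tables or multi-level headers.

```python
from html.parser import HTMLParser


class _SpanDetector(HTMLParser):
    """Set `merged` to True when any cell spans more than one row or column."""

    def __init__(self):
        super().__init__()
        self.merged = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            if name in ("rowspan", "colspan") and (value or "").strip().isdigit():
                if int(value) > 1:
                    self.merged = True


def choose_table_format(table_html, table_format="auto"):
    """Hypothetical helper: pick "html" or "markdown" for a single table.

    "html" and "markdown" force that output; "auto" selects HTML only
    when the table contains merged cells.
    """
    if table_format in ("html", "markdown"):
        return table_format
    if table_format != "auto":
        raise ValueError(f"unknown table_format: {table_format!r}")
    detector = _SpanDetector()
    detector.feed(table_html)
    return "html" if detector.merged else "markdown"
```

With a rule like this, simple tables would keep the compact pipe syntax and only tables that pipe syntax cannot represent would switch to HTML.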

------

### **Why This Matters**

- Many users need DOCX tables to render correctly in Markdown viewers.
- HTML tables are the only reliable way to express merged cells in Markdown.


