Fix regex in mdextractor for code blocks with escaped backticks #9

@chigwell

User Story
As a maintainer,
I want robust test coverage for code blocks containing escaped backticks
so that the Markdown parser reliably handles edge cases without false positives.

Background
The current regex pattern in mdextractor/__init__.py is r"```(?:\w+\s+)?(.*?)```" with re.DOTALL. Because (.*?) is lazy, the match ends at the first ``` it encounters, so a code block whose content contains a triple backtick (e.g., echo "```") is closed prematurely. This creates maintenance risks:

  • The test_nested_code_blocks unit test demonstrates incorrect parsing of inner backticks
  • Real-world code snippets with escaped backticks could be truncated
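  • Language specifier detection might interfere with content extraction: the optional group (?:\w+\s+)? consumes any leading word followed by whitespace, so in an unlabelled one-line block such as ```foo bar``` the word foo is treated as a language and dropped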
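
The premature close described above can be reproduced with the pattern alone. This is a minimal sketch that applies the regex quoted in this issue directly with re.findall; it assumes nothing about any post-processing that extract_md_blocks may do on top of it.

```python
import re

# Pattern as quoted in this issue. The real extract_md_blocks may wrap or
# post-process it; only the raw regex behaviour is shown here.
PATTERN = re.compile(r"```(?:\w+\s+)?(.*?)```", re.DOTALL)

text = '```sh\necho "```"\n```'

# The lazy group stops at the backticks inside the echo string,
# so the extracted content is cut off after the opening quote.
blocks = PATTERN.findall(text)
print(blocks)  # ['echo "']
```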

Acceptance Criteria

  • Add test case to tests/test_mdextractor.py verifying:
    def test_backticks_inside_code_content(self):
        text = '''```sh
        echo "```"
        ```'''
        self.assertEqual(extract_md_blocks(text), ['echo "```"'])
  • Update regex pattern to handle backticks within code content
  • Ensure existing tests pass after modification
  • Verify extraction works for:
    • Code blocks containing \``, ``, and ```` sequences
    • Mixed-language examples with internal backticks
    • Consecutive valid blocks separated by text
  • Document pattern limitations in function docstring if any edge cases remain
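
One possible direction for the pattern update, offered as a sketch rather than the project's chosen fix: treat a fence as valid only when it starts a line, and require the closing fence to be at least as long as the opening one. The names FENCE and extract_blocks_sketch are hypothetical and not part of mdextractor. The sketch assumes unindented fences, an optional language tag made of word characters, + or -, and a newline directly after the opening fence line.

```python
import re

# Hypothetical pattern, not the one shipped in mdextractor.
FENCE = re.compile(
    r"^(?P<fence>`{3,})[ \t]*(?P<lang>[\w+-]*)[ \t]*\n"  # opening fence, optional language tag
    r"(?P<body>.*?)"                                      # block content, matched lazily
    r"^(?P=fence)`*[ \t]*$",                              # closing fence alone on its own line
    re.DOTALL | re.MULTILINE,
)


def extract_blocks_sketch(text):
    """Return the content of each fenced block, without the trailing newline.

    Limitations of this sketch: indented fences and one-line blocks such as
    ```code``` are not matched.
    """
    return [m.group("body").rstrip("\n") for m in FENCE.finditer(text)]


# Backticks inside the content no longer end the block.
print(extract_blocks_sketch('```sh\necho "```"\n```'))  # ['echo "```"']
```

Because the closing fence must begin a line, backticks in the middle of a content line are ignored. A content line that itself consists only of a fence would still end the block, which is the kind of remaining edge case the docstring criterion above asks to document.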
