
Fix markdown parser regex to support hyphenated language identifiers #7

@chigwell

User Story
As a developer using markdown code blocks,
I want to use hyphenated language identifiers (e.g., python-3) in code fences
so that valid syntax isn’t rejected by the parser.

Background
The current regex pattern in mdextractor/__init__.py (r"```(?:\w+\s+)?(.*?)```") does not recognize language specifiers containing hyphens (e.g., python-3). CommonMark does not restrict fence info strings to word characters, so this breaks compatibility with tools and linters that expect such syntax. Because \w+ excludes hyphens, the optional language group fails to match and the specifier is left in the captured content, so valid code blocks are misparsed.
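The failure mode can be reproduced in isolation. The sketch below is not the project's code: it assumes the pattern is compiled with re.DOTALL and applied with findall followed by strip(), and the names FENCE, OLD_PATTERN, NEW_PATTERN, and sample are invented for illustration.

```python
import re

# Build the fence programmatically so this example contains no literal triple backticks.
FENCE = "`" * 3

# Pattern quoted in this issue, and a hyphen-aware variant.
# Assumption: the real code compiles its pattern with re.DOTALL.
OLD_PATTERN = re.compile(FENCE + r"(?:\w+\s+)?(.*?)" + FENCE, re.DOTALL)
NEW_PATTERN = re.compile(FENCE + r"(?:[\w-]+\s+)?(.*?)" + FENCE, re.DOTALL)

# A fenced block whose language specifier contains a hyphen.
sample = FENCE + "python-3\nprint('hi')\n" + FENCE

# With the old pattern the optional language group cannot match "python-3",
# so the specifier ends up inside the captured content.
old_blocks = [block.strip() for block in OLD_PATTERN.findall(sample)]

# With [\w-]+ the specifier is consumed and only the code is captured.
new_blocks = [block.strip() for block in NEW_PATTERN.findall(sample)]

print(old_blocks)
print(new_blocks)
```

Under these assumptions the old pattern still returns a block, but with python-3 as its first line, while the variant returns only the code.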

Acceptance Criteria

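  • Update the regex in extract_md_blocks (file: mdextractor/__init__.py) to r"```(?:[\w-]+\s+)?(.*?)```", keeping the language group non-capturing as in the current pattern so that the captured groups do not change.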
  • Add test cases to tests/test_mdextractor.py verifying hyphenated identifiers:
    • Test python-3 as a language specifier.
    • Test mixed alphanumeric-hyphen combinations (e.g., rust-2021-edition).
  • Ensure existing tests (e.g., test_with_language_specifier, test_single_line) pass with the updated regex.
  • Validate that code blocks without hyphens (e.g., python) remain unaffected.
  • Confirm nested backticks and malformed fences (e.g., test_malformed_fences) are still handled correctly.
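The hyphenated-identifier tests requested above could look roughly like the following pytest-style sketch. It runs against a local stand-in, extract_blocks_sketch, rather than the real extract_md_blocks, whose exact behavior is not shown in this issue. The stand-in, the fenced helper, and the test names are all hypothetical.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid literal triple backticks

# Hypothetical stand-in for extract_md_blocks; the real function in
# mdextractor/__init__.py may differ (flags, stripping, return type).
_PATTERN = re.compile(FENCE + r"(?:[\w-]+\s+)?(.*?)" + FENCE, re.DOTALL)


def extract_blocks_sketch(text: str) -> list:
    """Return the stripped contents of every fenced block in text."""
    return [block.strip() for block in _PATTERN.findall(text)]


def fenced(lang: str, body: str) -> str:
    """Wrap body in a code fence with the given language specifier."""
    return f"{FENCE}{lang}\n{body}\n{FENCE}"


def test_python_3_specifier():
    assert extract_blocks_sketch(fenced("python-3", "print('hi')")) == ["print('hi')"]


def test_mixed_alphanumeric_hyphen_specifier():
    text = fenced("rust-2021-edition", "fn main() {}")
    assert extract_blocks_sketch(text) == ["fn main() {}"]


def test_plain_specifier_unaffected():
    assert extract_blocks_sketch(fenced("python", "x = 1")) == ["x = 1"]


def test_no_specifier_unaffected():
    assert extract_blocks_sketch(fenced("", "x = 1")) == ["x = 1"]
```

In the real test file these would import extract_md_blocks from mdextractor instead of defining a stand-in.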
