Description
User Story
As a developer using markdown code blocks,
I want to use hyphenated language identifiers (e.g., python-3) in code fences
so that valid syntax isn’t rejected by the parser.
Background
The current regex pattern in mdextractor/__init__.py (r"```(?:\w+\s+)?(.*?)```") fails to recognize language specifiers containing hyphens (e.g., python-3). CommonMark permits hyphens in a fence's info string, so this breaks compatibility with tools and linters that expect such syntax. Because \w matches only letters, digits, and underscores, the \w+ group cannot match a hyphen, and valid code blocks are misparsed or ignored.
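To illustrate the failure described above, here is a minimal sketch. It assumes the pattern is applied with re.findall and re.DOTALL and that results are stripped; the issue does not show the call site, and extract_current is a hypothetical helper, not the project's function.

```python
import re

# Pattern quoted in the issue. Applying it with re.DOTALL is an assumption.
CURRENT_PATTERN = r"```(?:\w+\s+)?(.*?)```"


def extract_current(text):
    """Hypothetical helper: return captured block bodies, whitespace-stripped."""
    return [body.strip() for body in re.findall(CURRENT_PATTERN, text, re.DOTALL)]


plain = "```python\nprint('hi')\n```"
hyphenated = "```python-3\nprint('hi')\n```"

# The plain specifier is consumed by the optional group.
print(extract_current(plain))
# The hyphenated specifier is not, so it leaks into the captured body.
print(extract_current(hyphenated))
```

Under these assumptions, the optional group is skipped as soon as \w+ reaches the hyphen, so python-3 ends up at the start of the extracted content instead of being treated as the language specifier.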
Acceptance Criteria
- Update the regex in extract_md_blocks (file: mdextractor/__init__.py) to r"```([\w-]+\s+)?(.*?)```".
- Add test cases to tests/test_mdextractor.py verifying hyphenated identifiers:
  - Test python-3 as a language specifier.
  - Test mixed alphanumeric-hyphen combinations (e.g., rust-2021-edition).
- Ensure existing tests (e.g., test_with_language_specifier, test_single_line) pass with the updated regex.
- Validate that code blocks without hyphens (e.g., python) remain unaffected.
- Confirm nested backticks and malformed fences (e.g., test_malformed_fences) are still handled correctly.
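The criteria above could be sketched roughly as follows. This is not the project's code: the extract_md_blocks body shown here is a hypothetical stand-in, the use of re.findall with re.DOTALL is an assumption, and the test names are illustrative.

```python
import re

# Hyphen-aware pattern. The non-capturing (?:...) group is an assumption made
# so that re.findall returns plain strings rather than tuples.
UPDATED_PATTERN = r"```(?:[\w-]+\s+)?(.*?)```"


def extract_md_blocks(text):
    """Hypothetical stand-in for the function in mdextractor/__init__.py."""
    return [body.strip() for body in re.findall(UPDATED_PATTERN, text, re.DOTALL)]


def test_hyphenated_language_specifier():
    assert extract_md_blocks("```python-3\nprint('hi')\n```") == ["print('hi')"]


def test_mixed_alphanumeric_hyphen_specifier():
    source = "```rust-2021-edition\nfn main() {}\n```"
    assert extract_md_blocks(source) == ["fn main() {}"]


def test_plain_specifier_unaffected():
    assert extract_md_blocks("```python\nprint('hi')\n```") == ["print('hi')"]
```

One design note: with the capturing form ([\w-]+\s+)? quoted in the first criterion, re.findall would return (specifier, body) tuples instead of strings, because the pattern then has two capturing groups. The sketch keeps the group non-capturing for that reason; whether that matches the project's intent is an assumption.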