Add CSS selector flag to HTML converter #1228

Open: wants to merge 1 commit into base: main

@AnupamKumar-1 AnupamKumar-1 commented Apr 30, 2025

  • This PR adds a -s/--selector flag to the MarkItDown CLI so that users can pass a CSS selector that narrows the scope of HTML parsing.
  • CLI Update: Added the -s/--selector flag, which accepts a CSS selector.
  • HtmlConverter Modification: The selector is passed through to HtmlConverter for processing.
  • Scoped BeautifulSoup Parsing: When -s/--selector is given, conversion is limited to the elements matching the selector, which reduces the amount of HTML converted and keeps the output relevant.
  • Error Handling: If the selector matches no elements, a ValueError is raised to tell the user.
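
For reviewers, the scoping and error-handling behavior described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `scope_html` is hypothetical, and the sketch assumes matched elements are simply joined back into an HTML string before conversion.

```python
from typing import Optional

from bs4 import BeautifulSoup


def scope_html(html: str, selector: Optional[str] = None) -> str:
    """Return only the HTML matched by `selector` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    if not selector:
        # No selector given: keep the whole document.
        return str(soup)

    # select() returns every element matching the CSS selector.
    matches = soup.select(selector)
    if not matches:
        # Mirrors the PR description: no match is reported as a ValueError.
        raise ValueError(f"No elements match selector: {selector!r}")

    # Assumption: matched nodes are concatenated in document order.
    return "\n".join(str(element) for element in matches)
```

With a selector such as 'article.entry', only the matching article markup would be returned, and a selector that matches nothing would raise ValueError.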

How to Test:

  • Run the CLI with the -s/--selector flag
  • python -m markitdown -s 'article.entry' path/to/input.html
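
To show how the command line above would be interpreted, here is a minimal argparse sketch. It is an assumption about the wiring, not the actual MarkItDown CLI code: the `build_parser` function and the `filename` positional are hypothetical names used only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser with the flag from this PR (hypothetical)."""
    parser = argparse.ArgumentParser(prog="markitdown")
    # Hypothetical positional for the input file.
    parser.add_argument("filename", nargs="?")
    # The flag described in this PR: an optional CSS selector.
    parser.add_argument(
        "-s",
        "--selector",
        default=None,
        help="CSS selector that limits which HTML elements are converted",
    )
    return parser


# Equivalent to: python -m markitdown -s 'article.entry' path/to/input.html
args = build_parser().parse_args(["-s", "article.entry", "path/to/input.html"])
print(args.selector, args.filename)
```

When the flag is omitted, `args.selector` stays `None`, which a converter could treat as "convert the whole document".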

All related tests pass:

  • pytest -s tests/test_html_selector.py

- Introduce -s/--selector in CLI
- Pass selector through to HtmlConverter
- Scope BeautifulSoup parsing to selected nodes only
- Raise ValueError on no matches
@AnupamKumar-1 (Author)

@microsoft-github-policy-service agree
