HTML Reader Mode

A Python library to extract the main content from an HTML document, similar to the "Reader Mode" feature found in web browsers. It filters out navigation, ads, sidebars, and other non-content elements.

Installation

pip install html-reader-mode

Usage

from html_reader_mode import HTMLReaderMode

html_content = """
<html>
    <body>
        <div id="header">Header content</div>
        <div id="content">
            <h1>Article Title</h1>
            <p>This is the main content of the article.</p>
        </div>
        <div id="footer">Footer content</div>
    </body>
</html>
"""

reader = HTMLReaderMode()
content = reader.sanitize(html_content)

print(content)
# Output:
# [{'tag': 'h1', 'content': 'Article Title'}, {'tag': 'p', 'content': 'This is the main content of the article.'}]
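
The output shown above is a list of dictionaries with `tag` and `content` keys. As a minimal sketch of how that result might be consumed, the helper below turns such a list into plain text. `blocks_to_text` is a hypothetical name and is not part of the library; the sample data is copied from the output above so the snippet runs without the package installed.

```python
def blocks_to_text(blocks):
    """Render extracted blocks as plain text (hypothetical helper, not library API)."""
    lines = []
    for block in blocks:
        tag = block["tag"]
        text = block["content"]
        # Prefix headings (h1..h6) with '#' marks; pass other blocks through as-is.
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            lines.append("#" * int(tag[1]) + " " + text)
        else:
            lines.append(text)
    # Separate blocks with a blank line.
    return "\n\n".join(lines)


# Sample data copied from the usage output above.
blocks = [
    {"tag": "h1", "content": "Article Title"},
    {"tag": "p", "content": "This is the main content of the article."},
]
print(blocks_to_text(blocks))
```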

Features

Content Extraction: Identifies and extracts the main text blocks.
Noise Reduction: Removes scripts, styles, and high-link-density blocks (like navigation menus).
Customizable: Configure block tags, script tags, and filtering thresholds.
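
To illustrate what "high-link-density" means in the Noise Reduction feature, here is a rough sketch of one common way to measure it: the share of a block's text that sits inside `<a>` tags. This is an assumption about the general technique, not the library's actual implementation; the names `LinkDensityParser`, `link_density`, `is_noise`, and the `0.5` threshold are all hypothetical. It uses only the standard library.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters overall and those inside <a> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the ratio.
        n = len(data.strip())
        self.total_chars += n
        if self._anchor_depth > 0:
            self.link_chars += n


def link_density(fragment):
    """Return the fraction of text characters that are link text (0.0 if no text)."""
    parser = LinkDensityParser()
    parser.feed(fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


def is_noise(fragment, threshold=0.5):
    """Treat a block as noise when its link density exceeds the assumed threshold."""
    return link_density(fragment) > threshold


nav = '<div><a href="/">Home</a> <a href="/about">About</a></div>'
article = '<p>This is the main content of the article. <a href="#">Source</a></p>'
print(is_noise(nav), is_noise(article))
```

A navigation menu consists almost entirely of link text, so its density approaches 1, while an article paragraph with an occasional link stays well below the threshold.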

License

MIT
