Skip to content

lamasters/html-reader-mode

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HTML Reader Mode

A Python library to extract the main content from an HTML document, similar to the "Reader Mode" feature found in web browsers. It filters out navigation, ads, sidebars, and other non-content elements.

Installation

pip install html-reader-mode

Usage

from html_reader_mode import HTMLReaderMode

html_content = """
<html>
    <body>
        <div id="header">Header content</div>
        <div id="content">
            <h1>Article Title</h1>
            <p>This is the main content of the article.</p>
        </div>
        <div id="footer">Footer content</div>
    </body>
</html>
"""

reader = HTMLReaderMode()
content = reader.sanitize(html_content)

print(content)
# Output:
# [{'tag': 'h1', 'content': 'Article Title'}, {'tag': 'p', 'content': 'This is the main content of the article.'}]

Features

  • Content Extraction: Identifies and extracts the main text blocks.
  • Noise Reduction: Removes scripts, styles, and high-link-density blocks (like navigation menus).
  • Customizable: Configure block tags, script tags, and filtering thresholds.

License

MIT

About

Parse an HTML document and extract the main content. Similar to the reader mode in web browsers.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages