Commit

Merge pull request #80 from weblyzard/feature/python3.12
Feature/python3.12
AlbertWeichselbraun authored Jan 16, 2024
2 parents be56e73 + c0630ff commit 16404b0
Showing 58 changed files with 1,357 additions and 1,149 deletions.
1 change: 1 addition & 0 deletions .git-blame-ignore-revs
@@ -0,0 +1 @@
+55fa29ca39f9ed5895f9e88b2eb0f17e4d84245f
3 changes: 0 additions & 3 deletions .github/workflows/codeql-analysis.yml
@@ -13,10 +13,7 @@ name: "CodeQL"

 on:
   push:
-    branches: [ master ]
   pull_request:
-    # The branches below must be a subset of the branches above
-    branches: [ master ]
   schedule:
     - cron: '26 5 * * 2'

22 changes: 6 additions & 16 deletions .github/workflows/python-package.yml
@@ -2,37 +2,27 @@ name: build

 on:
   push:
-    branches: [ master ]
   pull_request:
-    branches: [ master ]

 jobs:
   build:

-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     strategy:
       fail-fast: false
       matrix:
-        python-version: ['3.6', '3.7', '3.8', '3.9', '3.10', '3.11']
+        python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']

     steps:
     - uses: actions/checkout@v3
     - name: Set up Python ${{ matrix.python-version }}
       uses: actions/setup-python@v4
       with:
         python-version: ${{ matrix.python-version }}
-    - name: Install dependencies
+    - name: Install build environment
       run: |
         python -m pip install --upgrade pip
-        python -m pip install flake8 pytest pytest-cov codecov
-        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
-        python setup.py install
-    - name: Lint with flake8
+        python -m pip install tox setuptools pytest pytest-cov codecov
+    - name: Build and test with tox.
       run: |
-        # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=80 --statistics
-    - name: Test with pytest
-      run: |
-        py.test --cov=inscripits ./tests && codecov
+        tox
1 change: 1 addition & 0 deletions .gitignore
@@ -23,3 +23,4 @@ tests/reference.txt
 *.c
 docs/paper/*.pdf
 htmlcov/
+poetry.lock
9 changes: 3 additions & 6 deletions Dockerfile
@@ -4,10 +4,8 @@
 FROM python:3.11-slim-bullseye AS builder

 WORKDIR /inscriptis
-COPY requirements.txt .
 RUN python -m venv .venv && .venv/bin/python -m pip install --upgrade pip
-RUN .venv/bin/pip install --no-cache-dir -r requirements.txt && \
-    .venv/bin/pip install --no-cache-dir Flask waitress && \
+RUN .venv/bin/pip install --no-cache-dir inscriptis[web-service] && \
     find /inscriptis/.venv \( -type d -a -name test -o -name tests \) -o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) -exec rm -rf '{}' \+

 #
@@ -18,10 +16,9 @@ LABEL maintainer="[email protected]"

 # Note: only copy the src directory, to prevent bloating the image with
 # irrelevant files from the project directory.
-WORKDIR /inscriptis/src
+WORKDIR /inscriptis
 COPY --from=builder /inscriptis /inscriptis
-COPY ./src /inscriptis/src

 ENV PATH="/inscriptis/.venv/bin:$PATH"
-CMD ["waitress-serve", "inscriptis.service.web:app", "--port=5000", "--host=0.0.0.0"]
+CMD ["uvicorn", "inscriptis.service.web:app", "--port=5000", "--host=0.0.0.0"]
 EXPOSE 5000
32 changes: 18 additions & 14 deletions README.rst
@@ -131,9 +131,9 @@ the corresponding text representation.
 Command line parameters
 -----------------------

-The inscript.py command line client supports the following parameters::
+The inscript command line client supports the following parameters::

-  usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR] [--indentation INDENTATION]
+  usage: inscript [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR] [--indentation INDENTATION]
                   [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]
                   [input]

@@ -172,19 +172,19 @@ HTML to text conversion
 -----------------------
 convert the given page to text and output the result to the screen::

-  $ inscript.py https://www.fhgr.ch
+  $ inscript https://www.fhgr.ch

 convert the file to text and save the output to fhgr.txt::

-  $ inscript.py fhgr.html -o fhgr.txt
+  $ inscript fhgr.html -o fhgr.txt

 convert the file using strict indentation (i.e., minimize indentation and extra spaces) and save the output to fhgr-layout-optimized.txt::

-  $ inscript.py --indentation strict fhgr.html -o fhgr-layout-optimized.txt
+  $ inscript --indentation strict fhgr.html -o fhgr-layout-optimized.txt

 convert HTML provided via stdin and save the output to output.txt::

-  $ echo "<body><p>Make it so!</p></body>" | inscript.py -o output.txt
+  $ echo "<body><p>Make it so!</p></body>" | inscript -o output.txt


 HTML to annotated text conversion
@@ -193,7 +193,7 @@ convert and annotate HTML from a Web page using the provided annotation rules.

 Download the example `annotation-profile.json <https://github.com/weblyzard/inscriptis/blob/master/examples/annotation-profile.json>`_ and save it to your working directory::

-  $ inscript.py https://www.fhgr.ch -r annotation-profile.json
+  $ inscript https://www.fhgr.ch -r annotation-profile.json

 The annotation rules are specified in `annotation-profile.json`:

@@ -241,7 +241,7 @@ Annotation postprocessors enable the post processing of annotations to formats
 that are suitable for your particular application. Post processors can be
 specified with the ``-p`` or ``--postprocessor`` command line argument::

-  $ inscript.py https://www.fhgr.ch \
+  $ inscript https://www.fhgr.ch \
          -r ./examples/annotation-profile.json \
          -p surface

@@ -286,7 +286,7 @@ Currently, inscriptis supports the following postprocessors:

 .. code-block:: bash

-   inscript.py --annotation-rules ./wikipedia.json \
+   inscript --annotation-rules ./wikipedia.json \
                --postprocessor html \
                https://en.wikipedia.org/wiki/Chur.html

@@ -311,14 +311,18 @@ Currently, inscriptis supports the following postprocessors:
 Web Service
 ===========

-The Flask Web Service translates HTML pages to the corresponding plain text.
+A FastAPI-based Web Service that uses Inscriptis for translating HTML pages to plain text.

 Run the Web Service on your host system
 ---------------------------------------
-Provide additional requirement `python3-flask <https://flask.palletsprojects.com/en/2.2.x/>`_, then start the inscriptis Web service with the following command::
+Install the optional feature `web-service` for inscriptis::
+
+  $ pip install inscriptis[web-service]
+
+Start the Inscriptis Web service with the following command::
+
+  $ uvicorn inscriptis.service.web:app --port 5000 --host 127.0.0.1

-  $ export FLASK_APP="inscriptis.service.web"
-  $ python3 -m flask run

 Run the Web Service with Docker
 -------------------------------
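Review note: once the service is started with the `uvicorn` command from this hunk, it can be called from Python using only the standard library. The sketch below is a minimal client under stated assumptions: the endpoint path `/get_text` and the `text/html` content type are not shown in this diff and should be checked against the service's own documentation; `build_request` and `html_to_text` are hypothetical helper names.

```python
import urllib.request

# Assumption: the service listens on the host and port used in the README
# and exposes a POST endpoint named /get_text (not shown in this diff).
SERVICE_URL = "http://127.0.0.1:5000/get_text"


def build_request(html: str, url: str = SERVICE_URL) -> urllib.request.Request:
    """Wrap an HTML string in a POST request for the conversion endpoint."""
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )


def html_to_text(html: str, url: str = SERVICE_URL) -> str:
    """Send the HTML to a running service and return the response body."""
    with urllib.request.urlopen(build_request(html, url), timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `html_to_text("<body><p>Make it so!</p></body>")` requires the service to be running; `build_request` can be inspected on its own without any network access.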
@@ -499,7 +503,7 @@ The following options are available for fine tuning inscriptis' HTML rendering:
 1. **More rigorous indentation:** call ``inscriptis.get_text()`` with the
    parameter ``indentation='extended'`` to also use indentation for tags such as
    ``<div>`` and ``<span>`` that do not provide indentation in their standard
-   definition. This strategy is the default in ``inscript.py`` and many other
+   definition. This strategy is the default in ``inscript`` and many other
    tools such as Lynx. If you do not want extended indentation you can use the
    parameter ``indentation='standard'`` instead.