
Add scripts to preprocess WARP-pdf #64


Open

wants to merge 23 commits into main
13 changes: 13 additions & 0 deletions corpus/llm-jp-corpus-v4/ja/ja_warp_pdf/.gitignore
@@ -0,0 +1,13 @@
# python generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# venv
.venv

# bunkai
bunkai_model/
1 change: 1 addition & 0 deletions corpus/llm-jp-corpus-v4/ja/ja_warp_pdf/.python-version
@@ -0,0 +1 @@
3.12.5
31 changes: 31 additions & 0 deletions corpus/llm-jp-corpus-v4/ja/ja_warp_pdf/README.md
@@ -0,0 +1,31 @@
# ja-warp-pdf

Preprocess text extracted from PDFs provided by WARP (the National Diet Library's Web Archiving Project).

## Environment

- Python 3.12.5

## Installation

Use [rye](https://rye.astral.sh/) to install the dependencies. The `RUSTFLAGS` setting works around a compilation error in the pinned `tokenizers` release on recent Rust toolchains.

```bash
RUSTFLAGS="-A invalid_reference_casting" rye sync
```

Then download the Bunkai sentence splitter model.

```bash
rye run bunkai --model bunkai_model --setup
```
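
For reference, here is a minimal sketch of driving the downloaded model from Python, based on the upstream [Bunkai](https://github.com/megagonlabs/bunkai) API; the fork pinned in `pyproject.toml` may expose a slightly different interface, so treat the snippet as illustrative rather than this project's actual code.

```python
# Sketch based on the upstream Bunkai API; the pinned fork may differ.
from pathlib import Path

from bunkai import Bunkai

# Load the line-break detection model downloaded by `bunkai --setup`.
bunkai = Bunkai(path_model=Path("bunkai_model"))

# Upstream Bunkai represents line breaks with "▁" (U+2581) in its input.
text = "本勉強会では、自然言語処理および▁計算機システムの研究者が集まります。"
for sentence in bunkai(text):
    print(sentence)
```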

## Usage

### Conversion

This step cleans the extracted text by removing unnecessary characters, such as line breaks and spaces introduced by PDF layout.

```bash
rye run python scripts/convert.py --input-file <input-file> --output-file <output-file>
```
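
The script itself is not reproduced in this diff, but the inputs in `examples/example.jsonl` show the artifacts it has to undo: sentences wrapped mid-line, one-character-per-line columns, and spaces wedged between characters. Below is a minimal, hypothetical sketch of that kind of normalization; the `normalize` helper and the stdin/stdout interface are illustrative assumptions, not the actual `convert.py` API (which takes `--input-file` and `--output-file`).

```python
# Hypothetical sketch of the normalization convert.py might perform;
# not taken from this repository.
import json
import re
import sys

def normalize(text: str) -> str:
    # Join lines broken mid-sentence: drop a newline unless it follows
    # sentence-final punctuation (examples 1-3 in example.jsonl).
    text = re.sub(r"(?<=[^\s。!?])\n(?=\S)", "", text)
    # Remove a space wedged between two Japanese characters (example 4).
    jp = r"[\u3040-\u30ff\u4e00-\u9fff、。]"
    text = re.sub(rf"(?<={jp})[ \u3000](?={jp})", "", text)
    return text

for line in sys.stdin:
    doc = json.loads(line)
    doc["text"] = normalize(doc["text"])
    print(json.dumps(doc, ensure_ascii=False))
```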
5 changes: 5 additions & 0 deletions corpus/llm-jp-corpus-v4/ja/ja_warp_pdf/examples/example.jsonl
@@ -0,0 +1,5 @@
{"docId": "example0", "text": "本勉強会では、自然言語処理および計算機システムの研究者が集まり大規模言語モデルの研究開発について定期的に情報共有を行っています。"}
{"docId": "example1", "text": "本勉強会では、自然言語処理および\n計算機システムの研究者が集まり\n大規模言語モデルの研究開発について\n定期的に情報共有を行っています。"}
{"docId": "example2", "text": "本勉強会では、自然言\n語処理および計算機シ\nステムの研究者が集ま\nり大規模言語モデルの\n研究開発について定期\n的に情報共有を行って\nいます。"}
{"docId": "example3", "text": "本\n勉\n強\n会\nで\nは\n、\n自\n然\n言\n語\n処\n理\nお\nよ\nび\n計\n算\n機\nシ\nス\nテ\nム\nの\n研\n究\n者\nが\n集\nま\nり\n大\n規\n模\n言\n語\nモ\nデ\nル\nの\n研\n究\n開\n発\nに\nつ\nい\nて\n定\n期\n的\nに\n情\n報\n共\n有\nを\n行\nっ\nて\nい\nま\nす\n。"}
{"docId": "example4", "text": "本 勉 強 会 で は 、 自 然 言 語 処 理 お よ び 計 算 機 シ ス テ ム の 研 究 者 が 集 ま り 大 規 模 言 語 モ デ ル の 研 究 開 発 に つ い て 定 期 的 に 情 報 共 有 を 行 っ て い ま す 。"}
19 changes: 19 additions & 0 deletions corpus/llm-jp-corpus-v4/ja/ja_warp_pdf/pyproject.toml
@@ -0,0 +1,19 @@
[project]
name = "ja-warp-pdf"
version = "0.1.0"
description = "Preprocess text extracted from PDFs provided by WARP"
authors = [
{ name = "Hirokazu Kiyomaru", email = "[email protected]" }
]
dependencies = [
"bunkai[lb] @ git+https://github.com/hkiyomaru/bunkai@feature/sequential-prediction",
"transformers==4.33.3",
"pytest>=8.3.4",
]
readme = "README.md"
requires-python = ">= 3.8"

[tool.rye]
managed = true
virtual = true
dev-dependencies = []
132 changes: 132 additions & 0 deletions corpus/llm-jp-corpus-v4/ja/ja_warp_pdf/requirements-dev.lock
@@ -0,0 +1,132 @@
# generated by rye
# use `rye lock` or `rye sync` to update this lockfile
#
# last locked with the following flags:
# pre: false
# features: []
# all-features: false
# with-sources: false
# generate-hashes: false
# universal: false

bunkai @ git+https://github.com/hkiyomaru/bunkai@12a2dfa9eb47e203f4135edac3087a2809d92ca7
certifi==2024.12.14
# via requests
charset-normalizer==3.4.1
# via requests
dataclasses-json==0.6.7
# via bunkai
emoji==2.14.0
# via bunkai
emojis==0.7.0
# via bunkai
filelock==3.16.1
# via huggingface-hub
# via torch
# via transformers
# via triton
fsspec==2024.12.0
# via huggingface-hub
# via torch
huggingface-hub==0.27.1
# via transformers
idna==3.10
# via requests
iniconfig==2.0.0
# via pytest
janome==0.5.0
# via bunkai
jinja2==3.1.5
# via torch
markupsafe==3.0.2
# via jinja2
marshmallow==3.25.1
# via dataclasses-json
more-itertools==10.6.0
# via bunkai
mpmath==1.3.0
# via sympy
mypy-extensions==1.0.0
# via typing-inspect
networkx==3.4.2
# via torch
numpy==2.2.1
# via bunkai
# via transformers
nvidia-cublas-cu12==12.4.5.8
# via nvidia-cudnn-cu12
# via nvidia-cusolver-cu12
# via torch
nvidia-cuda-cupti-cu12==12.4.127
# via torch
nvidia-cuda-nvrtc-cu12==12.4.127
# via torch
nvidia-cuda-runtime-cu12==12.4.127
# via torch
nvidia-cudnn-cu12==9.1.0.70
# via torch
nvidia-cufft-cu12==11.2.1.3
# via torch
nvidia-curand-cu12==10.3.5.147
# via torch
nvidia-cusolver-cu12==11.6.1.9
# via torch
nvidia-cusparse-cu12==12.3.1.170
# via nvidia-cusolver-cu12
# via torch
nvidia-nccl-cu12==2.21.5
# via torch
nvidia-nvjitlink-cu12==12.4.127
# via nvidia-cusolver-cu12
# via nvidia-cusparse-cu12
# via torch
nvidia-nvtx-cu12==12.4.127
# via torch
packaging==24.2
# via huggingface-hub
# via marshmallow
# via pytest
# via transformers
pluggy==1.5.0
# via pytest
pytest==8.3.4
pyyaml==6.0.2
# via huggingface-hub
# via transformers
regex==2024.11.6
# via bunkai
# via transformers
requests==2.32.3
# via bunkai
# via huggingface-hub
# via transformers
safetensors==0.5.2
# via transformers
setuptools==75.8.0
# via torch
spans==1.1.1
# via bunkai
sympy==1.13.1
# via torch
tokenizers==0.13.3
# via transformers
toml==0.10.2
# via bunkai
torch==2.5.1
# via bunkai
tqdm==4.67.1
# via bunkai
# via huggingface-hub
# via transformers
transformers==4.33.3
# via bunkai
triton==3.1.0
# via torch
typing-extensions==4.12.2
# via huggingface-hub
# via torch
# via typing-inspect
typing-inspect==0.9.0
# via dataclasses-json
urllib3==2.3.0
# via requests
132 changes: 132 additions & 0 deletions corpus/llm-jp-corpus-v4/ja/ja_warp_pdf/requirements.lock
@@ -0,0 +1,132 @@
# generated by rye
# use `rye lock` or `rye sync` to update this lockfile
#
# last locked with the following flags:
# pre: false
# features: []
# all-features: false
# with-sources: false
# generate-hashes: false
# universal: false

bunkai @ git+https://github.com/hkiyomaru/bunkai@12a2dfa9eb47e203f4135edac3087a2809d92ca7
certifi==2024.12.14
# via requests
charset-normalizer==3.4.1
# via requests
dataclasses-json==0.6.7
# via bunkai
emoji==2.14.0
# via bunkai
emojis==0.7.0
# via bunkai
filelock==3.16.1
# via huggingface-hub
# via torch
# via transformers
# via triton
fsspec==2024.12.0
# via huggingface-hub
# via torch
huggingface-hub==0.27.1
# via transformers
idna==3.10
# via requests
iniconfig==2.0.0
# via pytest
janome==0.5.0
# via bunkai
jinja2==3.1.5
# via torch
markupsafe==3.0.2
# via jinja2
marshmallow==3.25.1
# via dataclasses-json
more-itertools==10.6.0
# via bunkai
mpmath==1.3.0
# via sympy
mypy-extensions==1.0.0
# via typing-inspect
networkx==3.4.2
# via torch
numpy==2.2.1
# via bunkai
# via transformers
nvidia-cublas-cu12==12.4.5.8
# via nvidia-cudnn-cu12
# via nvidia-cusolver-cu12
# via torch
nvidia-cuda-cupti-cu12==12.4.127
# via torch
nvidia-cuda-nvrtc-cu12==12.4.127
# via torch
nvidia-cuda-runtime-cu12==12.4.127
# via torch
nvidia-cudnn-cu12==9.1.0.70
# via torch
nvidia-cufft-cu12==11.2.1.3
# via torch
nvidia-curand-cu12==10.3.5.147
# via torch
nvidia-cusolver-cu12==11.6.1.9
# via torch
nvidia-cusparse-cu12==12.3.1.170
# via nvidia-cusolver-cu12
# via torch
nvidia-nccl-cu12==2.21.5
# via torch
nvidia-nvjitlink-cu12==12.4.127
# via nvidia-cusolver-cu12
# via nvidia-cusparse-cu12
# via torch
nvidia-nvtx-cu12==12.4.127
# via torch
packaging==24.2
# via huggingface-hub
# via marshmallow
# via pytest
# via transformers
pluggy==1.5.0
# via pytest
pytest==8.3.4
pyyaml==6.0.2
# via huggingface-hub
# via transformers
regex==2024.11.6
# via bunkai
# via transformers
requests==2.32.3
# via bunkai
# via huggingface-hub
# via transformers
safetensors==0.5.2
# via transformers
setuptools==75.8.0
# via torch
spans==1.1.1
# via bunkai
sympy==1.13.1
# via torch
tokenizers==0.13.3
# via transformers
toml==0.10.2
# via bunkai
torch==2.5.1
# via bunkai
tqdm==4.67.1
# via bunkai
# via huggingface-hub
# via transformers
transformers==4.33.3
# via bunkai
triton==3.1.0
# via torch
typing-extensions==4.12.2
# via huggingface-hub
# via torch
# via typing-inspect
typing-inspect==0.9.0
# via dataclasses-json
urllib3==2.3.0
# via requests