
@praateekmahajan praateekmahajan commented Oct 2, 2025

Description

Usage

python tutorials/finewebpdfs/download.py --download-dir raw_warcs --output-dir pdfs_output --limit 10 --verbose
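For reference, the flags in the command above could be parsed roughly as follows. This is only a sketch assuming `argparse`; the actual parser in `download.py` is not shown in this conversation, so the types, defaults, and help strings here are assumptions, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the flag names come from the usage line,
    # but types, defaults, and help text are assumptions.
    parser = argparse.ArgumentParser(description="Download WARCs and extract PDFs (sketch)")
    parser.add_argument("--download-dir", required=True, help="directory for raw WARC data")
    parser.add_argument("--output-dir", required=True, help="directory for extracted PDFs")
    parser.add_argument("--limit", type=int, default=None, help="maximum number of records")
    parser.add_argument("--verbose", action="store_true", help="enable verbose logging")
    return parser


# Parse the same arguments as the usage line, passed explicitly as a list.
args = build_parser().parse_args(
    ["--download-dir", "raw_warcs", "--output-dir", "pdfs_output", "--limit", "10", "--verbose"]
)
print(args.download_dir, args.output_dir, args.limit, args.verbose)
```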

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@praateekmahajan praateekmahajan marked this pull request as draft October 2, 2025 02:04

copy-pr-bot bot commented Oct 2, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Comment on lines 45 to 47
with open("/raid/praateekm/NeMo-Curator/finepdfs.jsonl") as f:
    dataset = [json.loads(line) for line in f]

@praateekmahajan praateekmahajan Oct 2, 2025


This file was generated using the following script:

import json
import time
from datasets import load_dataset

dataset = load_dataset(
    "HuggingFaceFW/finepdfs",
    streaming=True,
    split="train",
).select_columns(["offset", "file_path"])

t0 = time.perf_counter()
with open("finepdfs.jsonl", "a") as f:
    for i, row in enumerate(dataset):
        f.write(json.dumps(row) + "\n")
        if i % 100_000 == 0:
            print(f"{i:,} records done in {(time.perf_counter()-t0):.2f}s")
            t0 = time.perf_counter()
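
Since the resulting file can be large, the list comprehension in the commented lines could instead read it lazily and stop early. A minimal sketch, assuming each line is a JSON object with the `offset` and `file_path` keys selected above; `iter_records` and its `limit` parameter are hypothetical names, not part of this PR, and the sample rows are made up.

```python
import itertools
import json
import os
import tempfile
from typing import Iterator, Optional


def iter_records(path: str, limit: Optional[int] = None) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, stopping after `limit` records."""
    with open(path) as f:
        rows = (json.loads(line) for line in f if line.strip())
        # islice with limit=None iterates until the file is exhausted.
        yield from itertools.islice(rows, limit)


# Usage with a throwaway file; the values below are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w") as f:
        for i in range(3):
            f.write(json.dumps({"offset": i, "file_path": f"example-{i}.warc.gz"}) + "\n")
    first_two = list(iter_records(sample_path, limit=2))
```

Reading lazily keeps memory use flat and lets a `--limit`-style option take effect without parsing the whole file.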


# Skip parquet index files - only process actual WARC files
if file_path.endswith(".parquet") or "/cc-index/table/" in file_path:
    continue

I'm not sure why a bunch of the files are index files, so for now I've skipped them. A more thorough approach would include them too.
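
One way to keep the skip visible for a later follow-up is to pull the check into a small predicate and count what gets skipped. A sketch only: `is_index_path` is a hypothetical helper name, and the example paths are made up rather than taken from the dataset.

```python
def is_index_path(file_path: str) -> bool:
    """Return True for paths that the check above skips (parquet index files)."""
    return file_path.endswith(".parquet") or "/cc-index/table/" in file_path


# Made-up example paths, for illustration only.
paths = [
    "crawl-data/EXAMPLE/segments/0/warc/example-00000.warc.gz",
    "cc-index/table/cc-main/warc/crawl=EXAMPLE/subset=warc/part-00000.parquet",
]

warc_paths = [p for p in paths if not is_index_path(p)]
skipped = [p for p in paths if is_index_path(p)]
print(f"processing {len(warc_paths)} WARC paths, skipped {len(skipped)} index paths")
```

Reporting the skipped count makes it easier to judge how much data the current approach leaves out.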
