[WIP] Download FineWebPDFs #1158
Signed-off-by: Praateek <[email protected]>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
tutorials/finewebpdfs/download.py (Outdated)

with open("/raid/praateekm/NeMo-Curator/finepdfs.jsonl") as f:
    dataset = [json.loads(line) for line in f]
This file is generated using:

import json
import time

from datasets import load_dataset

# Stream the dataset and keep only the two columns needed.
dataset = load_dataset(
    "HuggingFaceFW/finepdfs",
    streaming=True,
    split="train",
).select_columns(["offset", "file_path"])

t0 = time.perf_counter()
# Append mode: remove any existing finepdfs.jsonl before re-running to avoid duplicate rows.
with open("finepdfs.jsonl", "a") as f:
    for i, row in enumerate(dataset):
        f.write(json.dumps(row) + "\n")
        if i % 100_000 == 0:
            # Report the time taken since the previous progress message.
            print(f"{i:,} records done in {(time.perf_counter()-t0):.2f}s")
            t0 = time.perf_counter()
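
The file written above can grow large, so reading it back one record at a time may be preferable to building a full list in memory. A minimal sketch, assuming the same one-JSON-object-per-line layout; `iter_records` is a hypothetical helper, not part of this PR, and the demo rows are made up:

```python
import json
import os
import tempfile
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Self-contained demo with made-up rows (not real FinePDFs data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w") as f:
        f.write(json.dumps({"offset": 0, "file_path": "a.warc.gz"}) + "\n")
        f.write(json.dumps({"offset": 10, "file_path": "b.warc.gz"}) + "\n")
    demo_records = list(iter_records(demo_path))
```

Whether this matters in practice depends on how many rows the file holds and how much memory is available; for a small file the list comprehension in download.py is fine.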
          
tutorials/finewebpdfs/download.py (Outdated)

# Skip parquet index files - only process actual WARC files
if file_path.endswith(".parquet") or "/cc-index/table/" in file_path:
    continue
I'm not sure why a bunch of the files are index files, so for now I've skipped them. A more thorough approach would include them too.
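
One way to keep the skipped entries visible for a later, more thorough pass is to split records instead of silently dropping them. A minimal sketch, assuming the same condition as the diff above; `is_index_file` and `split_records` are hypothetical helper names, and the example paths are made up for illustration:

```python
from typing import Iterable


def is_index_file(file_path: str) -> bool:
    """Same check as in the diff: parquet files or paths under /cc-index/table/."""
    return file_path.endswith(".parquet") or "/cc-index/table/" in file_path


def split_records(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Separate records into (warc_records, index_records) by file_path."""
    warc_records, index_records = [], []
    for record in records:
        if is_index_file(record["file_path"]):
            index_records.append(record)
        else:
            warc_records.append(record)
    return warc_records, index_records


# Made-up example rows; real paths in finepdfs.jsonl may look different.
example = [
    {"offset": 123, "file_path": "s3://bucket/crawl-data/segments/warc/example.warc.gz"},
    {"offset": 456, "file_path": "s3://bucket/cc-index/table/cc-main/warc/part-0.parquet"},
]
warc_records, index_records = split_records(example)
print(f"{len(warc_records)} WARC records, {len(index_records)} index records skipped")
```

Counting the skipped records this way would at least show how much of the dataset the current filter leaves out.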