A complete automated pipeline for scraping and parsing real estate data from Njuskalo.hr.
This project is designed for data collection, research, and analysis purposes, with full support for resuming, skipping, and modular execution of each pipeline step.
- Python: 3.8+
- OS: Linux, macOS, Windows
- Dependencies: listed in `requirements.txt`

Install the dependencies and the Playwright browsers:

```bash
pip install -r requirements.txt
playwright install
```

Run the full pipeline:

```bash
python pipeline.py
```

Or run individual steps:

```bash
python pipeline.py --step 1   # Category scraper only
python pipeline.py --step 2   # HTML scraper only
python pipeline.py --step 3   # Phone fetcher only
python pipeline.py --step 4   # Parser only
```

Skip steps whose output already exists:

```bash
python pipeline.py --skip-existing
```

On Windows you can also use `run_pipeline.bat` (a Windows-only helper script; not required on Linux/macOS).
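The orchestration script `pipeline.py` is not reproduced in this README. Purely as an illustration, a minimal dispatcher with the same command-line interface might look like the sketch below; the step-to-script mapping mirrors the four steps documented next, and everything else (error handling, `--skip-existing` checks) is an assumption.

```python
# Hypothetical sketch of a step dispatcher with the same CLI as pipeline.py.
# The step-to-script mapping mirrors the documented steps; --skip-existing handling is omitted here.
import argparse
import subprocess
import sys

STEPS = {
    1: "njuskalo_category_tree_scraper.py",
    2: "scrape_leaf_entries.py",
    3: "fetch_phones_from_api.py",
    4: "parser_ultrafast.py",
}

def main() -> None:
    parser = argparse.ArgumentParser(description="Njuskalo scraping pipeline")
    parser.add_argument("--step", type=int, choices=sorted(STEPS),
                        help="run a single step instead of the whole pipeline")
    args = parser.parse_args()

    for step in ([args.step] if args.step else sorted(STEPS)):
        script = STEPS[step]
        print(f"[pipeline] step {step}: {script}")
        # Each step runs as a subprocess; a non-zero exit code stops the pipeline.
        result = subprocess.run([sys.executable, script])
        if result.returncode != 0:
            sys.exit(f"Step {step} ({script}) exited with code {result.returncode}")

if __name__ == "__main__":
    main()
```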
- Script: `njuskalo_category_tree_scraper.py`
- Purpose: Scrape the category tree and collect all property listing URLs (see the sketch below)
- Output: Category data and URL lists
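As a rough illustration of what this step does (Playwright is already a project dependency via `playwright install`), here is a minimal sketch that collects ad links from a single category page. The example category URL and the CSS selector are placeholders, not the scraper's actual logic.

```python
# Hypothetical sketch of collecting listing URLs from one category page with Playwright.
# The category URL and the CSS selector below are placeholders, not the real scraper's logic.
from playwright.sync_api import sync_playwright

def collect_listing_urls(category_url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(category_url, wait_until="domcontentloaded")
        # Placeholder: grab every anchor whose href looks like an ad link.
        hrefs = page.eval_on_selector_all(
            "a[href*='oglas']", "els => els.map(e => e.href)"
        )
        browser.close()
    return sorted(set(hrefs))

if __name__ == "__main__":
    urls = collect_listing_urls("https://www.njuskalo.hr/prodaja-stanova")  # example category
    print(f"collected {len(urls)} listing URLs")
```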
- Script: `scrape_leaf_entries.py`
- Purpose: Download individual property listing HTML pages
- Output: `backend/website/` directory with HTML files
- Features: checkpointing, proxy rotation, session management, retry logic (see the sketch below)
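A simplified sketch of the skip-if-downloaded-and-retry pattern this step relies on. The request handling, proxy rotation, and checkpoint details of `scrape_leaf_entries.py` are assumptions here; only the `backend/website/` output path comes from this README.

```python
# Hypothetical sketch of downloading listing HTML with resume and retry.
# Only the backend/website/ path comes from the README; everything else is simplified.
import time
from pathlib import Path

import requests

OUTPUT_DIR = Path("backend/website")

def download_listing(url, ad_id, retries=3):
    target = OUTPUT_DIR / f"{ad_id}.html"
    if target.exists():
        return  # already downloaded: this is what makes re-runs resumable
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            target.write_text(resp.text, encoding="utf-8")
            return
        except requests.RequestException:
            time.sleep(2 ** attempt)  # exponential backoff before the next retry
    print(f"giving up on {url}")

if __name__ == "__main__":
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    download_listing("https://www.njuskalo.hr/nekretnine/ad-id-123456", "123456")
```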
- Script: `fetch_phones_from_api.py`
- Purpose: Extract phone numbers via the Njuskalo API
- Output: `backend/phoneDB/phones.db` SQLite database
- Features: async processing (~50 req/sec), automatic token refresh, skips already-processed ads (see the sketch below)
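A minimal sketch of rate-limited async fetching into SQLite, illustrating the idea behind this step. The API endpoint, response shape, and table schema are placeholders; only the concurrency figure and the `phones.db` path come from this README.

```python
# Hypothetical sketch of rate-limited async phone fetching into SQLite.
# Endpoint, response shape, and schema are placeholders, not the real Njuskalo API.
import asyncio
import sqlite3
from pathlib import Path

import aiohttp

DB_PATH = Path("backend/phoneDB/phones.db")
CONCURRENCY = 50  # matches the documented ~50 requests/second

async def fetch_phone(session, semaphore, ad_id, token):
    async with semaphore:
        url = f"https://api.example.invalid/ads/{ad_id}/phone"  # placeholder endpoint
        headers = {"Authorization": f"Bearer {token}"}
        async with session.get(url, headers=headers) as resp:
            if resp.status != 200:
                return ad_id, None
            data = await resp.json()
            return ad_id, data.get("phone")

async def run(ad_ids, token):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(fetch_phone(session, semaphore, ad_id, token) for ad_id in ad_ids)
        )
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(DB_PATH) as db:
        db.execute("CREATE TABLE IF NOT EXISTS phones (ad_id TEXT PRIMARY KEY, phone TEXT)")
        db.executemany("INSERT OR REPLACE INTO phones VALUES (?, ?)", results)

if __name__ == "__main__":
    asyncio.run(run(["123456"], token="<bearer token>"))
```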
- Script: `parser_ultrafast.py`
- Purpose: Parse HTML files into structured JSON data
- Output: `backend/json/` directory with parsed data
- Features: memory-cached lookups, multi-core parsing, 3–5× faster than the standard parser (see the sketch below)
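A minimal sketch of multi-core parsing with `multiprocessing`, illustrating the approach. The HTML parsing itself (BeautifulSoup, a single `title` field) is a placeholder for what `parser_ultrafast.py` actually extracts.

```python
# Hypothetical sketch of multi-core parsing of downloaded HTML into per-ad JSON files.
# BeautifulSoup and the single "title" field are placeholders; the real parser extracts far more.
import json
from multiprocessing import Pool, cpu_count
from pathlib import Path

from bs4 import BeautifulSoup  # assumed HTML parsing dependency

HTML_DIR = Path("backend/website")
JSON_DIR = Path("backend/json")

def parse_one(html_path):
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    record = {
        "id": html_path.stem,
        "title": soup.title.get_text(strip=True) if soup.title else None,
    }
    out_path = JSON_DIR / f"{html_path.stem}.json"
    out_path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")

if __name__ == "__main__":
    JSON_DIR.mkdir(parents=True, exist_ok=True)
    files = sorted(HTML_DIR.glob("*.html"))
    with Pool(processes=cpu_count()) as pool:      # all CPU cores, as with MAX_WORKERS = cpu_count()
        pool.map(parse_one, files, chunksize=200)  # batches of files, as with BATCH_SIZE = 200
```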
The pipeline writes its output under `backend/`:

```
backend/
├── website/        # HTML files from Step 2
├── phoneDB/        # Phone database from Step 3
│   ├── phones.db   # SQLite database
│   └── phones.log  # Phone fetcher logs
├── json/           # Parsed JSON from Step 4
└── logs/           # Parser logs
```
Each parsed listing is written as JSON, for example:

```json
{
  "id": "123456",
  "title": "2-Bedroom Apartment in Zagreb",
  "price": "150000 EUR",
  "location": "Zagreb - Maksimir",
  "phone": "+385991234567",
  "url": "https://www.njuskalo.hr/nekretnine/ad-id-123456"
}
```

Phone fetcher settings (in `fetch_phones_from_api.py`):

- `RESCRAPE_NULL_PHONES = False` – re-scrape ads with no phone number when set to `True`
- `BATCH_SIZE = 50` – number of concurrent API requests
Parser settings (in `parser_ultrafast.py`):

- `BATCH_SIZE = 200` – files per processing batch
- `MAX_WORKERS = cpu_count()` – use all CPU cores
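Once Step 4 has run, the files in `backend/json/` can be loaded for analysis. A minimal sketch, assuming one JSON object per file shaped like the example above (pandas is used here only for convenience and is not a pipeline requirement):

```python
# Hypothetical sketch of loading the parsed output for analysis.
# Assumes one JSON object per file in backend/json/, shaped like the example above.
import json
from pathlib import Path

import pandas as pd  # assumed analysis dependency, not required by the pipeline itself

records = [
    json.loads(path.read_text(encoding="utf-8"))
    for path in sorted(Path("backend/json").glob("*.json"))
]
df = pd.DataFrame(records)
print(df[["id", "price", "location"]].head())
```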
- Pipeline log: `pipeline.log` – overall pipeline execution
- Phone log: `backend/phoneDB/phones.log` – phone fetching details
- Parser logs: `backend/logs/` – individual parsing logs
Expected throughput (based on typical tests):
- Phone Fetcher: ~50 requests/second
- Parser: 3–5× faster than the standard parser
- Complete Pipeline: depends on dataset size and network speed
The pipeline supports resuming from any step. Each step checks for existing output and can skip completed work.
```bash
python pipeline.py --step 3        # Re-run the phone fetcher only
python pipeline.py --skip-existing # Skip steps whose output already exists
```
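A minimal sketch of the kind of existence check `--skip-existing` could perform, based on the directory layout above; the actual logic in `pipeline.py` may differ.

```python
# Hypothetical sketch of an output-existence check behind --skip-existing.
# The "done" markers follow the documented backend/ layout; Step 1's output
# location is not specified in this README, so it is omitted here.
from pathlib import Path

STEP_OUTPUTS = {
    2: Path("backend/website"),            # HTML files
    3: Path("backend/phoneDB/phones.db"),  # phone database
    4: Path("backend/json"),               # parsed JSON
}

def step_has_output(step):
    target = STEP_OUTPUTS.get(step)
    if target is None:
        return False
    if target.is_dir():
        return any(target.iterdir())  # a non-empty directory counts as completed work
    return target.is_file()

if __name__ == "__main__":
    for step in sorted(STEP_OUTPUTS):
        print(f"step {step}: {'skip' if step_has_output(step) else 'run'}")
```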
Ensure these scripts exist in your project directory:

- `njuskalo_category_tree_scraper.py`
- `scrape_leaf_entries.py`
- `fetch_phones_from_api.py`
- `parser_ultrafast.py`
- `bearer_token_finder.py` (required by the phone fetcher)
- The `backend/` folder is excluded from Git (see `.gitignore`)
- Phone fetching requires valid bearer tokens (handled automatically)
- The parser uses memory caching for maximum speed
- All steps include comprehensive error handling and logging
This project is provided for educational and research purposes only.
Before running scrapers against Njuskalo.hr, please review and comply with their Terms of Service. The author(s) are not responsible for misuse.