This web scraper gets data on properties for sale and for rent from the Israeli OnMap website.
The website has four main data sources: buy, rent, new homes and commercial data.
Listing type | Description
---|---
Buy | Properties for sale
Rent | Properties for rent
Commercial | Commercial properties for rent
New homes | Properties in the planning or construction phase
The scraper is built using a mixture of Selenium and BeautifulSoup. Selenium is in charge of scrolling each webpage to the bottom so that BeautifulSoup can read the entire HTML.
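The scroll-then-parse pattern can be sketched as follows. This is an illustrative helper, not the project's actual code: the function name, the `pause` parameter, and the duck-typed `driver` argument (anything exposing Selenium's `execute_script()`) are assumptions for the example.

```python
import time


def scroll_to_bottom(driver, limit=None, pause=1.0):
    """Scroll a Selenium driver until the page height stops growing.

    `driver` is any object with Selenium's execute_script() method;
    `limit` caps the number of scrolls (mirroring the --limit flag).
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while limit is None or scrolls < limit:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more listings
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new loaded: we hit the bottom
            break
        last_height = new_height
        scrolls += 1
    return scrolls


# After scrolling, BeautifulSoup can parse the fully loaded page, e.g.:
# soup = BeautifulSoup(driver.page_source, "html.parser")
```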
Make sure to install all the required packages for the scraper to work:
$ pip install -r requirements.txt
If you are planning on storing the scraped information in a database, please install MySQL.
Then to create the database structure:
$ mysql -u <username> -p < db/on_map.sql
Make sure to change the values in the `DBConfig` class in `config.py` to match your database configuration.
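A `DBConfig` class of this kind might look roughly like the sketch below. The field names and defaults here are illustrative assumptions; check `config.py` for the actual attributes.

```python
from dataclasses import dataclass


@dataclass
class DBConfig:
    """Connection settings for the on_map MySQL database (illustrative)."""
    host: str = "localhost"
    user: str = "root"
    password: str = ""        # set to your MySQL password
    database: str = "on_map"  # created by db/on_map.sql
```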
Run `web_scraper.py` from the command line:
usage: web_scraper.py [-h] [--limit n] [--print] [--save] [--database]
                      [--fetch] [--verbose]
                      {buy,rent,commercial,new_homes,all}

Scraping OnMap website | Checkout https://www.onmap.co.il/en/

positional arguments:
  {buy,rent,commercial,new_homes,all}
                        choose which type of properties you would like to
                        scrape

optional arguments:
  -h, --help            show this help message and exit
  --limit n, -l n       limit to n number of scrolls per page
  --print, -p           print the results to the screen
  --save, -s            save the scraped information into a csv file in the
                        same directory
  --database, -d        inserts new information found into the on_map database
  --fetch, -f           fetches more information for each property using
                        Nominatim API
  --verbose, -v         prints messages during the scraper execution
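The interface above maps onto a straightforward `argparse` parser. The sketch below reproduces the same flags; the real `web_scraper.py` may differ in details (the `dest` names and the `build_parser` helper are assumptions for this example).

```python
import argparse


def build_parser():
    """Build a parser matching the help text shown above (illustrative)."""
    parser = argparse.ArgumentParser(
        description="Scraping OnMap website | Checkout https://www.onmap.co.il/en/")
    parser.add_argument("property_type",
                        choices=["buy", "rent", "commercial", "new_homes", "all"],
                        help="choose which type of properties you would like to scrape")
    parser.add_argument("--limit", "-l", type=int, metavar="n",
                        help="limit to n number of scrolls per page")
    parser.add_argument("--print", "-p", action="store_true", dest="print_results",
                        help="print the results to the screen")
    parser.add_argument("--save", "-s", action="store_true",
                        help="save the scraped information into a csv file")
    parser.add_argument("--database", "-d", action="store_true",
                        help="insert new information into the on_map database")
    parser.add_argument("--fetch", "-f", action="store_true",
                        help="fetch more information using the Nominatim API")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="print messages during the scraper execution")
    return parser


# Example: scrape sale listings, capped at 3 scrolls per page, saving to CSV.
args = build_parser().parse_args(["buy", "--limit", "3", "--save"])
```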
Using the `GeoFetcher` class, we are able to add more geolocation information to each property.
This class is based on Geopy and uses Nominatim as the geolocation service.
Although we fetch the information asynchronously with asyncio and AioHTTPAdapter, Nominatim is a free service with a low request limit.
Thus, some properties may appear with `None` features after fetching additional information.
If you wish, you can increase the `DELAY_TIME` in `conf.py` as a way to obtain all the information.
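The delay-and-retry idea behind `DELAY_TIME` can be sketched synchronously as follows. The helper, its parameters, and the `DELAY_TIME` value are illustrative assumptions; the project's `GeoFetcher` actually works asynchronously with asyncio and AioHTTPAdapter.

```python
import time

DELAY_TIME = 1.0  # seconds between Nominatim requests (illustrative value)


def fetch_location(geocode, address, retries=2, delay=DELAY_TIME):
    """Call a Geopy-style geocode function, pausing between attempts.

    Returns the geocoded result, or None if the service keeps refusing --
    which is why some properties end up with None features after fetching.
    """
    for _attempt in range(retries + 1):
        result = geocode(address)
        if result is not None:
            return result
        # A larger DELAY_TIME gives the free tier room to recover.
        time.sleep(delay)
    return None
```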
The current ERD for this project is:

- In `property_types`, we have whether the property is an apartment, penthouse, cottage, and so on.
- In `cities`, we have all the city names of the properties.
- In `listings`, we have the listing types offered on the website: buy, rent, commercial, new homes.
- In `properties`, each record is a different property on the website, providing the address, price, number of rooms, floor, area, and number of parking spots available. If the property is under construction, `ConStatus` tells what the construction status is. Latitude, longitude, and details in Hebrew are obtained using GeoPy with the Nominatim service and might not be available for all properties due to request limitations, since Nominatim is a free and limited API.
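The four tables can be sketched in SQL along the lines below. This uses SQLite syntax for illustration and runs it via Python's `sqlite3`; the authoritative MySQL schema lives in `db/on_map.sql`, and the column names here are inferred from the description above, not copied from the actual schema.

```python
import sqlite3

# Rough sketch of the ERD described above (column names are assumptions).
SCHEMA = """
CREATE TABLE property_types (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL            -- apartment, penthouse, cottage, ...
);
CREATE TABLE cities (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE listings (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL            -- buy, rent, commercial, new homes
);
CREATE TABLE properties (
    id               INTEGER PRIMARY KEY,
    listing_id       INTEGER REFERENCES listings(id),
    city_id          INTEGER REFERENCES cities(id),
    property_type_id INTEGER REFERENCES property_types(id),
    address   TEXT,
    price     REAL,
    rooms     REAL,
    floor     INTEGER,
    area      REAL,
    parking   INTEGER,
    ConStatus TEXT,               -- construction status for new homes
    latitude  REAL,               -- from GeoPy/Nominatim, may be NULL
    longitude REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```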
- The database is currently not 100% in accordance with 3NF standards: the additional data fetched from the API is not normalized.
- The API performance can be further enhanced.
- For macOS users, there is a known error when using geckodriver. The error is:
OSError: [Errno 86] Bad CPU type in executable: '/Users/username/.wdm/drivers/geckodriver/macos/v0.30.0/geckodriver'
And the fix is:
$ cd ~/.wdm/drivers/geckodriver/macos/v0.30.0
$ curl -Lsk -O https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-macos.tar.gz
$ ls
geckodriver
geckodriver-v0.30.0-macos-aarch64.tar.gz
geckodriver-v0.30.0-macos.tar.gz
$ rm geckodriver
$ tar zxvf geckodriver-v0.30.0-macos.tar.gz
For a short presentation with some data on rent in Israel and specifically in Tel Aviv, click here.
@lnros - Leonardo Rosenberg
@Shahar9772 - Shahar Shoshany