The Astro City Road Transport Authority (ACRTA) in the US is capturing vehicle and drive data to provide insight into road usage. This will allow ACRTA to create policies that incentivise safe and environmentally friendly driving behaviour.
You are a data engineer tasked with assisting ACRTA in processing its raw data into information that can be used by data scientists and policy makers.
Sample data has been provided to you alongside this document. The data contains the following tables:
Parquet files containing data coming from the car-mounted devices, which provide the car's statistics at regular intervals.
| Name | Type | Primary Key | Description |
|---|---|---|---|
| trip_id | string | yes (part of key) | unique identifier for the trip |
| datetime | timestamp | yes (part of key) | timestamp in UTC associated with the reading |
| vehicle_spec_id | long | no | unique identifier for the vehicle specification |
| engine_coolant_temp | double | no | engine coolant temperature in Celsius |
| eng_load | double | no | load on the engine; range is from 0 to 255 |
| fuel_level | double | no | level of the fuel; ranges from 0 to 255 |
| iat | double | no | intake air temperature in Celsius |
| rpm | double | no | engine revolutions per minute |
| lat | double | no | latitude |
| long | double | no | longitude |
| velocity | double | no | vehicle speed in kilometres per hour |
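Before building anything, it may help to load a sample and confirm the schema matches the table above. A minimal PySpark sketch, where the `data/drive/` path is an assumption and should point at the provided sample files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acrta-explore").getOrCreate()

# Load the car-mounted device readings; "data/drive/" is a placeholder path.
drive = spark.read.parquet("data/drive/")
drive.printSchema()            # confirm the types listed in the table above
drive.show(5, truncate=False)  # eyeball a few readings
```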
A CSV file that contains the vehicles' technical specifications, which come from the car manufacturer.
| Name | Type | Primary Key | Description |
|---|---|---|---|
| vehicle_spec_id | long | yes | unique identifier for the vehicle type |
| year | int | no | year of manufacture |
| make | string | no | brand |
| model | string | no | model name for the vehicle |
| drivetrain | int | no | number of drive axles that are directly provided with power |
| fuel_tank_capacity | int | no | capacity of the fuel tank in litres |
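The CSV can be loaded with an explicit schema so that type mismatches surface immediately rather than being silently inferred. A sketch under the assumptions that the file sits at `data/vehicle_specs.csv` and has a header row:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, LongType,
                               IntegerType, StringType)

spark = SparkSession.builder.appName("acrta-explore").getOrCreate()

# Declare the schema from the specification table above rather than
# inferring it, so a malformed file fails loudly.
vehicle_schema = StructType([
    StructField("vehicle_spec_id", LongType(), False),
    StructField("year", IntegerType(), True),
    StructField("make", StringType(), True),
    StructField("model", StringType(), True),
    StructField("drivetrain", IntegerType(), True),
    StructField("fuel_tank_capacity", IntegerType(), True),
])

# Path and header row are assumptions about the sample data.
vehicle = spark.read.csv("data/vehicle_specs.csv", header=True, schema=vehicle_schema)
```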
Create a Spark job that will produce a Parquet table with the following fields:
| Name | Type | Primary Key | Description |
|---|---|---|---|
| date_pst | date | yes (part of key) | date in the PST timezone |
| trip_id | long | yes (part of key) | unique identifier of the trip |
| make | string | no | make of the vehicle; null if the make is unknown |
| model | string | no | model of the vehicle; null if the model is unknown |
| trip_duration_minutes | decimal | no | total duration of the trip that occurred within date_pst, in minutes |
| distance_travelled | decimal | no | total distance travelled by the vehicle on date_pst, in km |
The job should have the following characteristics:
- The job takes a date as an input. Only rows for the given date are to be produced.
- The input date (and the data produced) is to be in the Pacific Standard Time (PST) timezone.
- When the job is run multiple times with the same date, all data for the given date should be replaced.
- When the job is run multiple times with different dates, all data from previous runs should be preserved.
Record all assumptions made about the data, and justify any decisions made, as comments within the code.
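As a reference point, one possible shape for this job is sketched below in PySpark. It is a sketch, not a required implementation, and it bakes in several assumptions that your own solution may make differently: "PST" is interpreted as the America/Los_Angeles zone, paths are placeholders, trip duration is approximated as last reading minus first reading within the day, and distance is approximated by summing haversine steps between consecutive readings.

```python
import sys

import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

# Input date, e.g. "2024-01-15", interpreted in PST. Assumption: "PST" means
# the America/Los_Angeles zone; use a fixed "-08:00" offset instead if strict
# standard time is required year-round.
run_date = sys.argv[1]

spark = SparkSession.builder.appName("acrta-daily-trip-job").getOrCreate()
# Dynamic partition overwrite: re-running a date replaces only that date's
# partition, while partitions written by earlier runs are preserved.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

drive = spark.read.parquet("data/drive/")  # placeholder path
vehicle = spark.read.csv("data/vehicle_specs.csv", header=True, inferSchema=True)

day = (
    drive
    .withColumn("ts_pst", F.from_utc_timestamp("datetime", "America/Los_Angeles"))
    .withColumn("date_pst", F.to_date("ts_pst"))
    .filter(F.col("date_pst") == F.lit(run_date).cast("date"))
)

# Assumption: distance is approximated by summing haversine (great-circle)
# distances between consecutive readings of the same trip.
w = Window.partitionBy("trip_id").orderBy("ts_pst")
lat1, lon1 = F.radians(F.lag("lat").over(w)), F.radians(F.lag("long").over(w))
lat2, lon2 = F.radians("lat"), F.radians("long")
a = (F.sin((lat2 - lat1) / 2) ** 2
     + F.cos(lat1) * F.cos(lat2) * F.sin((lon2 - lon1) / 2) ** 2)
day = day.withColumn("step_km", 2 * 6371.0 * F.asin(F.sqrt(a)))  # Earth radius in km

trips = (
    day.groupBy("date_pst", "trip_id")
    .agg(
        F.first("vehicle_spec_id").alias("vehicle_spec_id"),
        # Assumption: duration is last reading minus first reading in the day.
        ((F.unix_timestamp(F.max("ts_pst")) - F.unix_timestamp(F.min("ts_pst"))) / 60.0)
        .cast("decimal(18,2)").alias("trip_duration_minutes"),
        F.sum("step_km").cast("decimal(18,3)").alias("distance_travelled"),
    )
    # Left join keeps trips whose vehicle is unknown, with null make/model.
    .join(vehicle.select("vehicle_spec_id", "make", "model"), "vehicle_spec_id", "left")
    .select("date_pst", "trip_id", "make", "model",
            "trip_duration_minutes", "distance_travelled")
)

trips.write.mode("overwrite").partitionBy("date_pst").parquet("output/daily_trips/")
```

The partitioned write with dynamic partition overwrite is what satisfies the replace/preserve requirements: re-running the same date rewrites only that date's partition, and other dates are untouched.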
Write a SQL query using the Spark SQL dialect that will produce the following data:
| Name | Type | Primary Key | Description |
|---|---|---|---|
| trip_id | long | yes | unique identifier for the trip |
| average_eng_load_perc | decimal | no | average engine load as a percentage; engine load percentage is calculated as 100 * (eng_load / 255) |
| average_velocity | decimal | no | average velocity over the entirety of the trip in km/hr |
| fuel_used | decimal | no | amount of fuel used in litres during the trip |
Additionally, the query should have the following characteristics:
- Only trips for which we have vehicle specifications should be output (i.e. the vehicle must be in the vehicle specification table)
- Fuel in the tank can be calculated by (fuel_level / 255) * fuel_tank_capacity

Record all assumptions made about the data, and justify any decisions made, as comments within the SQL script.
You may assume the following tables exist and have already been exposed:
- Drive - contains all data in the drive data
- Vehicle - contains all data in the vehicle specifications
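One possible shape for this query, shown here as a Spark SQL string submitted from PySpark, is sketched below. It assumes each trip involves exactly one vehicle and that fuel_level only decreases during a trip (no mid-trip refuelling), so fuel used is the spread between the highest and lowest readings, scaled to litres via the tank capacity.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acrta-trip-sql").getOrCreate()

# Assumptions: one vehicle per trip (so MAX(fuel_tank_capacity) is just the
# trip's capacity) and no refuelling mid-trip (fuel_level monotonically falls).
summary = spark.sql("""
    SELECT
        d.trip_id,
        CAST(AVG(100 * d.eng_load / 255) AS DECIMAL(18, 2)) AS average_eng_load_perc,
        CAST(AVG(d.velocity)             AS DECIMAL(18, 2)) AS average_velocity,
        CAST((MAX(d.fuel_level) - MIN(d.fuel_level)) / 255
             * MAX(v.fuel_tank_capacity) AS DECIMAL(18, 2)) AS fuel_used
    FROM Drive d
    JOIN Vehicle v                    -- inner join drops trips without specs
      ON d.vehicle_spec_id = v.vehicle_spec_id
    GROUP BY d.trip_id
""")
summary.show(truncate=False)
```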
Have the code ready and running prior to the interview. You can share your screen and walk us through your approach, any challenges you encountered, and any assumptions you made. Have your environment prepared for some live coding, as there may be additional questions asked on the same dataset.
Your solution can be in any Spark-supported language you are comfortable with (e.g. Scala, Python, .NET), using any IDE in which you feel most productive. If you are unable to complete the exercise due to time or complexity, do not worry; we are still interested in hearing about your planned approach.
Your code should include:
- the daily trip job, including a note of any assumptions you made
- the trip SQL script, including a note of any assumptions you made