The Astro City Road Transport Authority (ACRTA) in the US is capturing vehicle and drive data to provide insight into road usage. This will allow ACRTA to create policies that incentivise safe and environmentally friendly driving behaviour.
You are a data engineer tasked with assisting ACRTA in processing its raw data into information that can be used by data scientists and policy makers.
Sample data has been provided to you alongside this document. The data contains the following tables:
Parquet files containing data coming from the car-mounted devices, which provide the car's statistics at regular intervals.
| Name | Type | Primary Key | Description |
|---|---|---|---|
| trip_id | string | yes (part of key) | unique identifier for the trip |
| datetime | timestamp | yes (part of key) | timestamp in UTC associated with the reading |
| vehicle_spec_id | long | no | unique identifier for the vehicle specification |
| engine_coolant_temp | double | no | engine coolant temperature in Celsius |
| eng_load | double | no | load on the engine; range is from 0 to 255 |
| fuel_level | double | no | level of the fuel; ranges from 0 to 255 |
| iat | double | no | intake air temperature in Celsius |
| rpm | double | no | engine revolutions per minute |
| lat | double | no | latitude |
| long | double | no | longitude |
| velocity | double | no | vehicle speed in kilometres per hour |
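Before building anything, it may help to load a sample and confirm the schema matches the table above. A minimal PySpark sketch, where the `data/drive/` path is an assumption and should point at the provided sample files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acrta-explore").getOrCreate()

# Load the car-mounted device readings; "data/drive/" is a placeholder path.
drive = spark.read.parquet("data/drive/")
drive.printSchema()            # confirm the types listed in the table above
drive.show(5, truncate=False)  # eyeball a few readings
```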
A CSV file that contains the vehicles' technical specifications, which come from the car manufacturer.
| Name | Type | Primary Key | Description |
|---|---|---|---|
| vehicle_spec_id | long | yes | unique identifier for the vehicle type |
| year | int | no | year of manufacture |
| make | string | no | brand |
| model | string | no | model name for the vehicle |
| drivetrain | int | no | number of drive axles that are directly provided with power |
| fuel_tank_capacity | int | no | capacity of the fuel tank in litres |
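The CSV can be loaded with an explicit schema so that type mismatches surface immediately rather than being silently inferred. A sketch under the assumptions that the file sits at `data/vehicle_specs.csv` and has a header row:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, LongType,
                               IntegerType, StringType)

spark = SparkSession.builder.appName("acrta-explore").getOrCreate()

# Declare the schema from the specification table above rather than
# inferring it, so a malformed file fails loudly.
vehicle_schema = StructType([
    StructField("vehicle_spec_id", LongType(), False),
    StructField("year", IntegerType(), True),
    StructField("make", StringType(), True),
    StructField("model", StringType(), True),
    StructField("drivetrain", IntegerType(), True),
    StructField("fuel_tank_capacity", IntegerType(), True),
])

# Path and header row are assumptions about the sample data.
vehicle = spark.read.csv("data/vehicle_specs.csv", header=True, schema=vehicle_schema)
```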
Create a Spark job that will produce a Parquet table with the following fields:
| Name | Type | Primary Key | Description |
|---|---|---|---|
| date_pst | date | yes (part of key) | date in the PST timezone |
| trip_id | long | yes (part of key) | unique identifier of the trip |
| make | string | no | make of the vehicle; null if the make is unknown |
| model | string | no | model of the vehicle; null if the model is unknown |
| trip_duration_minutes | decimal | no | total duration of the trip that occurred within date_pst, in minutes |
| distance_travelled | decimal | no | total distance travelled by the vehicle on date_pst, in km |
The job should have the following characteristics:
- The job takes a date as an input. Only rows for the given date are to be produced.
- The input date (and the data produced) is to be in the Pacific Standard Time (PST) timezone.
- When the job is run multiple times with the same date, all data for the given date should be replaced.
- When the job is run multiple times with different dates, all data from previous runs should be preserved.
Record all assumptions made about the data, and justify any decisions made, as comments within the code.
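As a reference point, one possible shape for this job is sketched below in PySpark. It is a sketch, not a required implementation, and it bakes in several assumptions that your own solution may make differently: "PST" is interpreted as the America/Los_Angeles zone, paths are placeholders, trip duration is approximated as last reading minus first reading within the day, and distance is approximated by summing haversine steps between consecutive readings.

```python
import sys

import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

# Input date, e.g. "2024-01-15", interpreted in PST. Assumption: "PST" means
# the America/Los_Angeles zone; use a fixed "-08:00" offset instead if strict
# standard time is required year-round.
run_date = sys.argv[1]

spark = SparkSession.builder.appName("acrta-daily-trip-job").getOrCreate()
# Dynamic partition overwrite: re-running a date replaces only that date's
# partition, while partitions written by earlier runs are preserved.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

drive = spark.read.parquet("data/drive/")  # placeholder path
vehicle = spark.read.csv("data/vehicle_specs.csv", header=True, inferSchema=True)

day = (
    drive
    .withColumn("ts_pst", F.from_utc_timestamp("datetime", "America/Los_Angeles"))
    .withColumn("date_pst", F.to_date("ts_pst"))
    .filter(F.col("date_pst") == F.lit(run_date).cast("date"))
)

# Assumption: distance is approximated by summing haversine (great-circle)
# distances between consecutive readings of the same trip.
w = Window.partitionBy("trip_id").orderBy("ts_pst")
lat1, lon1 = F.radians(F.lag("lat").over(w)), F.radians(F.lag("long").over(w))
lat2, lon2 = F.radians("lat"), F.radians("long")
a = (F.sin((lat2 - lat1) / 2) ** 2
     + F.cos(lat1) * F.cos(lat2) * F.sin((lon2 - lon1) / 2) ** 2)
day = day.withColumn("step_km", 2 * 6371.0 * F.asin(F.sqrt(a)))  # Earth radius in km

trips = (
    day.groupBy("date_pst", "trip_id")
    .agg(
        F.first("vehicle_spec_id").alias("vehicle_spec_id"),
        # Assumption: duration is last reading minus first reading in the day.
        ((F.unix_timestamp(F.max("ts_pst")) - F.unix_timestamp(F.min("ts_pst"))) / 60.0)
        .cast("decimal(18,2)").alias("trip_duration_minutes"),
        F.sum("step_km").cast("decimal(18,3)").alias("distance_travelled"),
    )
    # Left join keeps trips whose vehicle is unknown, with null make/model.
    .join(vehicle.select("vehicle_spec_id", "make", "model"), "vehicle_spec_id", "left")
    .select("date_pst", "trip_id", "make", "model",
            "trip_duration_minutes", "distance_travelled")
)

trips.write.mode("overwrite").partitionBy("date_pst").parquet("output/daily_trips/")
```

The partitioned write with dynamic partition overwrite is what satisfies the replace/preserve requirements: re-running the same date rewrites only that date's partition, and other dates are untouched.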
Write a SQL query using the Spark SQL dialect that will produce the following data:
| Name | Type | Primary Key | Description |
|---|---|---|---|
| trip_id | long | yes | unique identifier for the trip |
| average_eng_load_perc | decimal | no | average engine load as a percentage; engine load percentage is calculated as 100 * (eng_load / 255) |
| average_velocity | decimal | no | average velocity over the entirety of the trip in km/hr |
| fuel_used | decimal | no | amount of fuel used in litres during the trip |
Additionally, the query should have the following characteristics:
- Only trips for which we have vehicle specifications should be output (i.e. the vehicle must be in the vehicle specification table)
- Fuel in the tank can be calculated by (fuel_level / 255) * fuel_tank_capacity

Record all assumptions made about the data, and justify any decisions made, as comments within the SQL script.
You may assume the following tables exist and have already been exposed:
- Drive - contains all data in the drive data
- Vehicle - contains all data in the vehicle specifications
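One possible shape for this query, shown here as a Spark SQL string submitted from PySpark, is sketched below. It assumes each trip involves exactly one vehicle and that fuel_level only decreases during a trip (no mid-trip refuelling), so fuel used is the spread between the highest and lowest readings, scaled to litres via the tank capacity.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acrta-trip-sql").getOrCreate()

# Assumptions: one vehicle per trip (so MAX(fuel_tank_capacity) is just the
# trip's capacity) and no refuelling mid-trip (fuel_level monotonically falls).
summary = spark.sql("""
    SELECT
        d.trip_id,
        CAST(AVG(100 * d.eng_load / 255) AS DECIMAL(18, 2)) AS average_eng_load_perc,
        CAST(AVG(d.velocity)             AS DECIMAL(18, 2)) AS average_velocity,
        CAST((MAX(d.fuel_level) - MIN(d.fuel_level)) / 255
             * MAX(v.fuel_tank_capacity) AS DECIMAL(18, 2)) AS fuel_used
    FROM Drive d
    JOIN Vehicle v                    -- inner join drops trips without specs
      ON d.vehicle_spec_id = v.vehicle_spec_id
    GROUP BY d.trip_id
""")
summary.show(truncate=False)
```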
Have the code ready and running prior to the interview. You can share your screen and walk us through your approach, any challenges you encountered, and any assumptions you made. Have your environment prepared for some live coding, as there may be additional questions asked on the same dataset.
Your solution can be in any Spark-supported language you are comfortable with (e.g. Scala, Python, .NET), using any IDE in which you feel most productive. If you are unable to complete the exercise due to time or complexity, do not worry; we are still interested in hearing about your planned approach.
Your code should include:
- the daily trip job, including a note of any assumptions you made
- the trip SQL script, including a note of any assumptions you made