Spark Challenge

Problem statement

The Astro City Road Transport Authority (ACRTA) in the US is capturing vehicle and drive data to provide insight into road usage. This will allow ACRTA to create policies that incentivise safe and environmentally friendly driving behaviour.

You are a data engineer tasked with assisting ACRTA in processing its raw data into information that can be used by data scientists and policy makers.

Data description

Sample data has been provided to you alongside this document. The data contains the following tables:

Drive data

Parquet files containing data from the car-mounted devices, which provide car statistics at regular intervals.

| Name | Type | Primary Key | Description |
| --- | --- | --- | --- |
| trip_id | string | yes (part of key) | unique identifier for the trip |
| datetime | timestamp | yes (part of key) | timestamp in UTC associated with the reading |
| vehicle_spec_id | long | no | unique identifier for the vehicle specification |
| engine_coolant_temp | double | no | engine coolant temperature in Celsius |
| eng_load | double | no | load on the engine; ranges from 0 to 255 |
| fuel_level | double | no | level of the fuel; ranges from 0 to 255 |
| iat | double | no | intake air temperature in Celsius |
| rpm | double | no | engine revolutions per minute |
| lat | double | no | latitude |
| long | double | no | longitude |
| velocity | double | no | vehicle speed in kilometres per hour |

Vehicle specifications

CSV file containing the vehicles' technical specifications, which come from the car manufacturer.

| Name | Type | Primary Key | Description |
| --- | --- | --- | --- |
| vehicle_spec_id | long | yes | unique identifier for the vehicle type |
| year | int | no | year of manufacture |
| make | string | no | brand |
| model | string | no | model name of the vehicle |
| drivetrain | int | no | drive axles that are directly provided with power |
| fuel_tank_capacity | int | no | capacity of the fuel tank in litres |

Challenge Tasks

Daily trip job

Create a Spark job that produces a Parquet table with the following fields:

| Name | Type | Primary Key | Description |
| --- | --- | --- | --- |
| date_pst | date | yes (part of key) | date in PST timezone |
| trip_id | long | yes (part of key) | unique identifier of the trip |
| make | string | no | make of the vehicle, null if make is unknown |
| model | string | no | model of the vehicle, null if model is unknown |
| trip_duration_minutes | decimal | no | total duration of the trip that occurred within date_pst, in minutes |
| distance_travelled | decimal | no | total distance travelled by the vehicle on date_pst, in km |
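The spec does not say how distance_travelled is to be derived from the readings; one common assumption (which should be recorded in the job's comments) is to sum great-circle distances between consecutive lat/long readings of a trip. A minimal pure-Python sketch of that arithmetic, where the haversine_km helper and the Earth-radius constant are illustrative choices rather than part of the spec:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption, not in the spec

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Distance travelled for a trip = sum over consecutive readings,
# ordered by the datetime column.
points = [(34.05, -118.25), (34.06, -118.24), (34.07, -118.22)]
total = sum(haversine_km(p[0], p[1], q[0], q[1])
            for p, q in zip(points, points[1:]))
```

In the Spark job this pairing of consecutive readings would typically be done with a lag window function over datetime, partitioned by trip_id.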

The job should have the following characteristics:

  • The job takes a date as an input; only rows for the given date are to be produced
  • The input date (and the data produced) is in the Pacific Standard Time (PST) timezone
  • When the job is run multiple times with the same date, all data for that date should be replaced
  • When the job is run multiple times with different dates, the data from previous runs should be preserved

Record all assumptions made about the data, and justify any decisions made, as comments within the code.
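The subtle step here is converting the input PST date into a UTC interval, since the drive data's datetime column is in UTC. A minimal sketch, assuming "PST" is read literally as the fixed UTC-8 offset (if local Pacific time with DST is intended, substitute zoneinfo.ZoneInfo("America/Los_Angeles")):

```python
from datetime import date, datetime, time, timedelta, timezone

# Assumption: "PST" means the fixed UTC-8 offset, per the literal wording.
PST = timezone(timedelta(hours=-8), "PST")

def pst_day_utc_bounds(day):
    """Return the half-open UTC interval [start, end) covering `day` in PST."""
    start = datetime.combine(day, time.min, tzinfo=PST)
    end = start + timedelta(days=1)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)

# In the Spark job these bounds would filter the drive readings
# (datetime >= start AND datetime < end) before aggregating per trip.
# Writing with partitionBy("date_pst") and
# spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# makes re-runs replace only the given date's partition, preserving others.
start, end = pst_day_utc_bounds(date(2024, 1, 15))
print(start.isoformat(), end.isoformat())
```

Midnight PST on 2024-01-15 corresponds to 08:00 UTC, so the interval above runs from 2024-01-15T08:00Z to 2024-01-16T08:00Z.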

Trip SQL

Write a SQL query, using the Spark SQL dialect, that produces the following data:

| Name | Type | Primary Key | Description |
| --- | --- | --- | --- |
| trip_id | long | yes | unique identifier for the trip |
| average_eng_load_perc | decimal | no | average engine load as a percentage; engine load percentage is calculated as 100 * (eng_load / 255) |
| average_velocity | decimal | no | average velocity over the entirety of the trip in km/hr |
| fuel_used | decimal | no | amount of fuel used in litres during the trip |

Additionally, the query should have the following characteristics:

  • Only trips for which we have vehicle specifications should be output (i.e. the vehicle must be in the vehicle specification table)
  • Fuel in the tank can be calculated as (fuel_level / 255) * fuel_tank_capacity

Record all assumptions made about the data, and justify any decisions made, as comments within the SQL script.
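The two formulas above can be checked with a small arithmetic sketch (pure Python, with illustrative helper names; the actual deliverable is the Spark SQL query). One assumption worth recording in the script: fuel used over a trip is the drop between the first and last fuel reading, and a negative drop (mid-trip refuelling or sensor noise) is clamped to zero.

```python
def eng_load_percent(eng_load):
    """eng_load is a raw 0-255 reading; scale it to a percentage per the spec."""
    return 100.0 * eng_load / 255.0

def fuel_used_litres(first_level, last_level, tank_capacity_litres):
    """Fuel in the tank is (fuel_level / 255) * fuel_tank_capacity, so fuel
    used is the drop between the first and last reading of the trip.
    Assumption: negative usage (refuelling / noise) is clamped to zero."""
    used = (first_level - last_level) / 255.0 * tank_capacity_litres
    return max(used, 0.0)

print(eng_load_percent(127.5))              # half of the 0-255 range -> 50%
print(fuel_used_litres(200.0, 150.0, 51.0)) # 50/255 of a 51 L tank -> 10 L
```

In SQL, "first" and "last" readings per trip would come from min/max over datetime (e.g. via a window or a self-join), another detail worth a comment in the script.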

You may assume the following tables exist and have already been exposed:

  • Drive - contains all data in drive data
  • Vehicle - contains all data in vehicle specifications

Evaluation

Have the code ready and running prior to the interview. You can share your screen and walk us through your approach, any challenges you faced, and any assumptions you made. Have your environment prepared for live coding, as there may be additional questions on the same dataset.

Your solution can be in any Spark-supported language you are comfortable with (e.g. Scala, Python, .NET), using any IDE in which you feel most productive. If you are unable to complete the tasks due to time or complexity, do not worry; we are still interested in hearing about your planned approach.

Your code should include:

  1. The daily trip job, including a note of the assumptions you made
  2. The trip SQL query, including a note of any assumptions you made

About

The technical test provided for Data Engineering roles.
