- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data
- Step 5: Complete Project Write Up
- This data comes from the US National Tourism and Trade Office.
- A data dictionary is included in the workspace.
- There's a sample file so you can take a look at the data in csv format before reading it all in.
- You do not have to use the entire dataset; use only what you need to accomplish the goal you set at the beginning of the project.
- https://i94.cbp.dhs.gov/I94/#/home
- The immigration data and the global temperature data are on an attached disk.
- You can access the immigration data in a folder with the following path: ../../data/18-83510-I94-Data-2016/.
- There's a file for each month of the year. An example file name is i94_apr16_sub.sas7bdat.
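Since there is one file per month, the full set of paths can be generated rather than typed out. The following is a minimal sketch, not the project's own loader: the folder is the one quoted above, while the three-letter month abbreviations and the `monthly_files` helper are assumptions extrapolated from the single example name `i94_apr16_sub.sas7bdat`.

```python
from pathlib import PurePosixPath

# Folder quoted in the text above.
DATA_DIR = PurePosixPath("../../data/18-83510-I94-Data-2016")

# Assumption: every month follows the pattern of the one example file name.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def monthly_files(year_suffix="16"):
    """Return one candidate file path per month, in calendar order."""
    return [str(DATA_DIR / f"i94_{m}{year_suffix}_sub.sas7bdat") for m in MONTHS]


paths = monthly_files()
print(paths[3])
```

Checking that each generated path exists before reading it would guard against a month whose file name deviates from the assumed pattern.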
What is a Form I-94?
- Form I-94 is the DHS Arrival/Departure Record issued to aliens who are admitted to the U.S., who are adjusting status while in the U.S., or who are extending their stay, among other things.
- A CBP officer generally attaches the I-94 to the non-immigrant visitor's passport upon U.S. entry.
- This data comes from Opendatasoft.
- This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000.
- This data comes from the US Census Bureau's 2015 American Community Survey.
Using the immigration and demographics datasets, we will create a star schema optimized for queries on immigration analysis. Because we are interested in the flow of travellers through the United States, the i94 data will serve as our fact table. The schema includes the following tables.
- immigration

| # | Col | Description |
|---|---|---|
| 1 | cicid | Application number / Citizenship and Immigration C... |
| 2 | arrival_year | Arrival year |
| 3 | arrival_month | Arrival month |
| 4 | citizinship | Country the immigrant is originally from (country of citizenship) |
| 5 | residence | Country of immigrant residence |
| 6 | port | Air / seaport of entry into the US ('XXX': 'NOT REPORTED/UNKNOWN' - '888': 'UNIDENTIFED AIR / SEAPORT' - 'UNK': 'UNKNOWN POE') |
| 7 | port_city | Port city |
| 8 | port_state | Port state |
| 9 | arrival_date | Arrival date in the USA |
| 10 | travel_mode | Mode of travel (1: 'Air' - 2: 'Sea' - 3: 'Land' - 9: 'Not reported') |
| 11 | us_state | U.S. state of the immigrant's address inside the USA ('99' = 'All Other Codes'). This is the final address of the migrants, that is, where they currently live in the US. |
| 12 | departure_date | Departure date from the USA |
| 13 | age | Age of respondent in years |
| 14 | visa_category | Visa codes collapsed into three categories (Business - Pleasure - Student) |
| 15 | dep_issued_visa | Department of State post where the visa was issued (a U.S. embassy or U.S. consulate) - CIC does not use |
| 16 | visa_expiration_date | Character date field - date to which admitted to U.S. (allowed to stay until) - CIC does not use |
| 17 | gender | Non-immigrant sex |
| 18 | airline | Airline used to arrive in U.S. |
| 19 | admission_number | Admission number - an 11-digit number assigned to an alien on entering the United States |
| 20 | flight_number | Flight number of airline used to arrive in U.S. |
| 21 | visa_type | VISATYPE - class of admission legally admitting the non-immigrant to temporarily stay in U.S. |
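The coded columns in the table can be decoded with small lookup helpers. The sketch below uses only the code values listed in the `travel_mode` and `port` rows; the helper names are hypothetical, and mapping missing or unlisted travel codes to 'Not reported' is an assumption rather than something the data dictionary specifies.

```python
# Code values copied from the travel_mode and port rows of the table.
TRAVEL_MODES = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}
UNKNOWN_PORTS = {"XXX", "888", "UNK"}


def decode_travel_mode(code):
    """Map a raw travel_mode code (int, float, or numeric string) to its label.

    Assumption: missing or unlisted codes fall back to 'Not reported'.
    """
    try:
        return TRAVEL_MODES.get(int(float(code)), "Not reported")
    except (TypeError, ValueError, OverflowError):
        return "Not reported"


def is_known_port(port):
    """Return False for empty values and for the three 'unknown' port codes."""
    return bool(port) and port.strip().upper() not in UNKNOWN_PORTS


print(decode_travel_mode(1.0), is_known_port("XXX"))
```

Filtering on `is_known_port` before deriving `port_city` and `port_state` would keep the sentinel codes out of those columns.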
- date - to aggregate the data using various time units
|-- arrdate: date (nullable = true)
|-- arrival_day: integer (nullable = true)
|-- arrival_week: integer (nullable = true)
|-- arrival_month: integer (nullable = true)
|-- arrival_year: integer (nullable = true)
|-- arrival_weekday: integer (nullable = true)
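Each row of the date table can be derived from the arrival date alone. The following is a minimal sketch in plain Python, assuming ISO 8601 week numbers and a weekday numbering of 1 = Sunday through 7 = Saturday (the conventions of Spark's `weekofyear` and `dayofweek`); the source does not state which conventions were actually used, and `date_dim_row` is a hypothetical helper.

```python
from datetime import date


def date_dim_row(arrdate):
    """Build one date-dimension row with the columns of the schema above."""
    return {
        "arrdate": arrdate,
        "arrival_day": arrdate.day,
        # Assumption: ISO 8601 week number.
        "arrival_week": arrdate.isocalendar()[1],
        "arrival_month": arrdate.month,
        "arrival_year": arrdate.year,
        # Assumption: 1 = Sunday ... 7 = Saturday.
        "arrival_weekday": arrdate.isoweekday() % 7 + 1,
    }


row = date_dim_row(date(2016, 4, 1))  # a Friday
print(row)
```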
- demographics - To look at the demographic data of the areas with the most travelers
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- median_age: double (nullable = true)
|-- male_population: integer (nullable = true)
|-- female_population: integer (nullable = true)
|-- total_population: integer (nullable = true)
|-- n_veterans: integer (nullable = true)
|-- foreign_born: integer (nullable = true)
|-- avg_household_size: double (nullable = true)
|-- state_code: string (nullable = true)
|-- Race: string (nullable = true)
|-- Count: integer (nullable = true)
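To illustrate how the fact table and this dimension might be queried together, here is a small sketch using Python's built-in `sqlite3` with made-up rows. Two assumptions are baked in: that `immigration.us_state` joins to `demographics.state_code`, and that the demographics table holds one row per city and race (suggested by the `Race` and `Count` columns), so `total_population` has to be de-duplicated per city before summing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE immigration (cicid INTEGER, us_state TEXT, visa_category TEXT);
CREATE TABLE demographics (City TEXT, State TEXT, state_code TEXT,
                           total_population INTEGER, Race TEXT, "Count" INTEGER);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO immigration VALUES (?, ?, ?)", [
    (1, "NY", "Pleasure"), (2, "NY", "Business"), (3, "CA", "Student"),
])
conn.executemany("INSERT INTO demographics VALUES (?, ?, ?, ?, ?, ?)", [
    ("Albany", "New York", "NY", 100, "White", 60),
    ("Albany", "New York", "NY", 100, "Asian", 40),
    ("Fresno", "California", "CA", 500, "White", 500),
])

QUERY = """
WITH state_pop AS (
    -- Collapse the per-race rows to one row per city, then sum per state.
    SELECT state_code, SUM(total_population) AS population
    FROM (SELECT DISTINCT City, state_code, total_population FROM demographics)
    GROUP BY state_code
)
SELECT i.us_state, COUNT(*) AS arrivals, p.population
FROM immigration AS i
JOIN state_pop AS p ON p.state_code = i.us_state
GROUP BY i.us_state, p.population
ORDER BY arrivals DESC;
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

Without the `DISTINCT` step, Albany's population would be counted once per race row and the state total would be inflated.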
Considering the significant size of the immigration dataset (~3 million rows for a single month), Spark is the most sensible technology choice, especially if we were to process data over a longer period of time.
Apache Spark was used because:
- It can handle multiple file formats with large amounts of data.
- It offers a fast, unified analytics engine for big data.
- It has easy-to-use APIs for operating on large datasets.
If the data was increased by 100x:
- Spark can handle the increase, but we would consider increasing the number of nodes in our cluster.
- We would still use Spark as our data processing platform, since it is well suited to very large datasets.
- Our data would be stored in an Amazon S3 bucket (instead of storing it in the EMR cluster along with the staging tables) and loaded to our staging tables.
If the data populates a dashboard that must be updated daily by 7am:
- We would use Apache Airflow to schedule and run the data pipelines so that each daily run finishes before 7am.
If the database needed to be accessed by 100+ people:
- We would move our analytics database into Amazon Redshift
- Once the data is ready to be consumed, it would be stored in the Redshift cluster, which is PostgreSQL-compatible and easily supports multiuser access.