diff --git a/.gitignore b/.gitignore
new file mode 100644
index 00000000..52cb04c0
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,4 @@
+*.db
+zippedData/*
+*.db
+zippedData/*
diff --git a/Movie_README.md b/Movie_README.md
new file mode 100644
index 00000000..27b4ee4c
--- /dev/null
+++ b/Movie_README.md
@@ -0,0 +1,155 @@
+# 🎬 Movie Studio Investment Analysis
+
+This notebook explores movie performance data to help our company decide **what types of films to create**.
+We will use exploratory data analysis and statistical modeling to answer business questions about ROI.
+
+
+> **Deliverables included (starter set):**
+> - `README.md`: Notebook summary and guide
+> - `presentation.pptx` (or `presentation.md`): Non-technical presentation for management
+> - `student-checkpoint.ipynb`: analysis notebook developed by the project team
+> - `data/`: expected location for raw inputs (e.g., `zippedData/...`), usually **gitignored**
+
+---
+
+## 1) Overview
+
+**Business Problem.** The company is launching a **new movie studio** and needs evidence-based guidance on **what kinds of films to green-light**.
+
+**Objective.** Use exploratory data analysis to identify patterns linked to **Return on Investment (ROI)** and produce **three concrete recommendations** for slate strategy.
+
+**Current Scope.** We begin with:
+- **The Numbers** (`tn.movie_budgets.csv.gz`): budgets and grosses
+- **IMDB** (`im.db` → `movie_basics`): title, year, runtime, genres
+
+The workflow is **modular** so teammates can add Rotten Tomatoes, TMDB, Box Office Mojo, etc.
+
+---
+
+## 2) Business Understanding
+
+**Stakeholder.** Head of the new studio (green-lighting decisions).
+
+**Key Questions.**
+1. **Genres vs ROI:** Which genres yield the best returns?
+2. **Budget vs ROI %:** Are bigger budgets more or less profitable?
+3. **Runtime vs ROI:** Does movie length impact profitability?
+
+**Decision Use.** Prioritize genre slate, design budget bands, and define runtime guardrails by genre.
+
+---
+
+## 3) Data Understanding & Preparation
+
+**Sources.**
+- **The Numbers:** production budgets, domestic/worldwide grosses, release dates.
+- **IMDB:** movie metadata (titles, start_year, runtime_minutes, genres).
+
+**Join Logic.** Normalize titles and match on **title + year** to merge The Numbers with IMDB.
+
+**Target Metric.** `ROI = (worldwide_gross - production_budget) / production_budget` (requires budget > 0).
+
+**Cleaning.**
+- Cast currency strings to numeric.
+- Drop missing/non-finite ROI.
+- **Multi-genre policy (simple):** explode genres; each film contributes to every genre it’s labeled with. Results are **directional** because genres overlap.
+
+---
+
+## 4) Methods (aligned to syllabus)
+
+- **Descriptive statistics:** counts, mean/median ROI, % profitable (ROI > 0).
+- **t-based confidence intervals** (n ≥ 30; CLT justification).
+- **Optional one-sample t-tests:** compare each genre’s mean to the overall mean.
+- **Simple linear regression** (StatsModels): ROI ~ log10(budget), ROI ~ runtime.
+- **Diagnostics:** QQ plot (normality) and residuals vs fitted (homoscedasticity).
+- **Outlier awareness:** distributions plotted; optional winsorization/log transforms for sensitivity.
+
+---
+
+## 5) Analysis Summary (current iteration)
+
+### 5.1 Genres vs ROI (Simple Multi-Genre)
+- Per-genre **n, mean ROI, median ROI, % profitable**.
+- Visuals: Top 10 by **Avg ROI**, Top 10 by **% Profitable**, **boxplots** (outliers hidden).
+- Interpretation: Favor genres above overall Avg ROI **and** with high % profitable; medians close to means suggest less outlier risk.
+
+### 5.2 Budget vs ROI
+- Regression **ROI ~ log10(budget)** with diagnostics; informs budget bands and slate mix.
+
+### 5.3 Runtime vs ROI
+- Regression **ROI ~ runtime_minutes** with diagnostics; checks for diminishing returns with very long runtimes.
+- Visuals: Runtime vs Return on Investment with regression line.
+- Interpretation: The graph shows that runtime has no meaningful effect on profitability.
+
+> Teammates can extend with Rotten Tomatoes/TMDB (ratings, votes), cast/star power, franchise flags, marketing proxies.
+
+---
+
+## 6) Recommendations (draft; refine with added evidence)
+
+1. **Slate focus:** Prioritize the top 2-3 genres that are above the overall Avg ROI **and** show high % profitable with medians close to means.
+2. **Budget discipline:** Concentrate budgets in bands where ROI is resilient. Use a **tiered slate** (a few mid-to-high budget bets plus a steady pipeline of mid/low budgets).
+3. **Runtime guardrails:** Avoid extreme runtimes unless the genre historically sustains them; target the runtime range with stable ROI in the regression.
+
+**Next iteration:** Add confidence intervals to slides, perform sensitivity (winsorized/log ROI), and enrich with ratings and cast variables.
+
+---
+
+## 7) How to Run
+
+**Environment:** Python 3.x; `pandas`, `numpy`, `matplotlib`, `scipy`, `statsmodels`, `sqlite3` (stdlib).
+
+**Data layout (example):**
+```
+data/
+└── zippedData/
+    ├── tn.movie_budgets.csv.gz
+    └── im.db
+```
+
+**Notebook:**
+1. Open `student-checkpoint.ipynb`.
+2. Run all cells (update any paths if needed).
+3. Export charts to the `figures/` folder for the deck.
+
+**.gitignore suggestions:** `/data/`, `*.zip`, `*.gz`, `*.db`, `*.DS_Store`
+
+---
+
+## 8) Repository Structure (suggested)
+
+```
+.
+├── README.md
+├── presentation.pptx
+├── student-checkpoint.ipynb
+├── notebooks/
+├── figures/
+├── data/
+│   └── zippedData/          # raw inputs (not tracked by git)
+├── .gitignore
+└── LICENSE (optional)
+```
+
+---
+
+## 9) Team Collaboration Notes
+
+- Add new analyses as **separate notebooks** (e.g., `notebooks/ratings_analysis.ipynb`).
+- Export figures to `/figures` and reference them in `presentation.pptx`.
+- Use **clear commit messages** (`feat:`, `fix:`, `viz:`, `doc:`) and keep commits small.
+
+---
+
+## 10) Limitations & Next Steps
+
+**Limitations:** multi-genre overlap (directional results), ROI skewness (blockbusters), imperfect title-year joins.
+
+**Next Steps:** add t-CIs to slides; winsorized/log ROI sensitivity; integrate ratings/votes/star power/franchise; consider multivariate/regularized regression when feature set grows.
+
+---
+
+## 11) Contact
+Owner: _Your Name_ · _your.email@example.com_ · _LinkedIn URL_
+Collaborators: _Teammate A, Teammate B, …_
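
Review note on `Movie_README.md`, sections 3 ("Join Logic", "Target Metric") and the cleaning list: the README describes normalizing titles, matching on title + year, and computing ROI only when the budget is positive. A minimal standard-library sketch of those three steps is given here for reference. The helper names (`parse_currency`, `normalize_title`, `roi`) are hypothetical, and the normalization rule (lowercase, non-alphanumeric runs collapsed to one space) is an assumption about what the notebook does, not a copy of it. The sample rows reuse figures shown in the notebook outputs in this diff.

```python
import re


def parse_currency(value):
    """Turn a string such as '$425,000,000' into the float 425000000.0."""
    return float(re.sub(r"[$,]", "", value))


def normalize_title(title):
    """Lowercase a title and collapse each non-alphanumeric run to one space.

    Assumed rule for illustration; the notebook may normalize differently.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def roi(worldwide_gross, production_budget):
    """ROI = (worldwide_gross - production_budget) / production_budget."""
    if production_budget <= 0:
        raise ValueError("ROI requires production_budget > 0")
    return (worldwide_gross - production_budget) / production_budget


# Budget rows in the shape of The Numbers file (values from the notebook output).
budget_rows = [
    {"movie": "Avatar", "year": 2009,
     "production_budget": "$425,000,000", "worldwide_gross": "$2,776,345,279"},
    {"movie": "Dark Phoenix", "year": 2019,
     "production_budget": "$350,000,000", "worldwide_gross": "$149,762,350"},
    {"movie": "Avengers: Age of Ultron", "year": 2015,
     "production_budget": "$330,600,000", "worldwide_gross": "$1,403,013,963"},
]

# Metadata rows in the shape of IMDB movie_basics; Avatar has no counterpart here.
imdb_rows = [
    {"primary_title": "Dark Phoenix", "start_year": 2019,
     "genres": "Action,Adventure,Sci-Fi"},
    {"primary_title": "Avengers: Age of Ultron", "start_year": 2015,
     "genres": "Action,Adventure,Sci-Fi"},
]

# Index metadata by (normalized title, year); a later duplicate key would
# overwrite an earlier one, which is one source of imperfect joins.
imdb_by_key = {
    (normalize_title(r["primary_title"]), r["start_year"]): r for r in imdb_rows
}

merged = []
for row in budget_rows:
    match = imdb_by_key.get((normalize_title(row["movie"]), row["year"]))
    if match is None:
        continue  # unmatched titles are dropped in this sketch
    merged.append({
        "title_norm": normalize_title(row["movie"]),
        "genres": match["genres"].split(","),  # one entry per genre label
        "ROI": roi(parse_currency(row["worldwide_gross"]),
                   parse_currency(row["production_budget"])),
    })

for record in merged:
    print(record["title_norm"], record["genres"], round(record["ROI"], 4))
```

An exact-key match like this drops any film whose title or year differs slightly between sources, which is the "imperfect title-year joins" limitation the README lists in section 10.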
diff --git a/Untitled.ipynb b/Untitled.ipynb
new file mode 100644
index 00000000..8bee5bdd
--- /dev/null
+++ b/Untitled.ipynb
@@ -0,0 +1,32 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python (learn-env)",
+ "language": "python",
+ "name": "learn-env"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/index.ipynb b/index.ipynb
index 3623bc14..d1510ab0 100644
--- a/index.ipynb
+++ b/index.ipynb
@@ -2,7 +2,6 @@
"cells": [
{
"cell_type": "markdown",
- "id": "5d35b2b4",
"metadata": {},
"source": [
"# Phase 2 Project Description"
@@ -10,7 +9,6 @@
},
{
"cell_type": "markdown",
- "id": "b5e9e179",
"metadata": {},
"source": [
"You've made it through the second phase of this course, and now you will put your new skills to use with a large end-of-Phase project!\n",
@@ -25,7 +23,6 @@
},
{
"cell_type": "markdown",
- "id": "58851385",
"metadata": {},
"source": [
"## Project Overview"
@@ -33,7 +30,6 @@
},
{
"cell_type": "markdown",
- "id": "6f37995f",
"metadata": {},
"source": [
"For this project, you will use exploratory data analysis to generate insights for a business stakeholder."
@@ -41,7 +37,6 @@
},
{
"cell_type": "markdown",
- "id": "8b0f1668",
"metadata": {},
"source": [
"### Business Problem"
@@ -49,7 +44,6 @@
},
{
"cell_type": "markdown",
- "id": "dce55d1d",
"metadata": {},
"source": [
 "Your company now sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of your company's new movie studio can use to help decide what type of films to create."
@@ -57,7 +51,6 @@
},
{
"cell_type": "markdown",
- "id": "d3d557bf",
"metadata": {},
"source": [
"### The Data"
@@ -65,7 +58,6 @@
},
{
"cell_type": "markdown",
- "id": "ca34efb7",
"metadata": {},
"source": [
"In the folder `zippedData` are movie datasets from:\n",
@@ -93,7 +85,6 @@
},
{
"cell_type": "markdown",
- "id": "5ace6e4f",
"metadata": {},
"source": [
"### Key Points"
@@ -101,7 +92,6 @@
},
{
"cell_type": "markdown",
- "id": "c9d2edeb",
"metadata": {},
"source": [
"* **Your analysis should yield three concrete business recommendations.** The ultimate purpose of exploratory analysis is not just to learn about the data, but to help an organization perform better. Explicitly relate your findings to business needs by recommending actions that you think the business should take.\n",
@@ -113,7 +103,6 @@
},
{
"cell_type": "markdown",
- "id": "474e2ec3",
"metadata": {},
"source": [
"## Deliverables"
@@ -121,7 +110,6 @@
},
{
"cell_type": "markdown",
- "id": "eaeda85f",
"metadata": {},
"source": [
"There are three deliverables for this project:\n",
@@ -133,7 +121,6 @@
},
{
"cell_type": "markdown",
- "id": "a7f8e274",
"metadata": {},
"source": [
"### Non-Technical Presentation"
@@ -141,7 +128,6 @@
},
{
"cell_type": "markdown",
- "id": "540d5c27",
"metadata": {},
"source": [
"The non-technical presentation is a slide deck presenting your analysis to business stakeholders.\n",
@@ -183,7 +169,6 @@
},
{
"cell_type": "markdown",
- "id": "d27915ba",
"metadata": {},
"source": [
"### Jupyter Notebook"
@@ -191,7 +176,6 @@
},
{
"cell_type": "markdown",
- "id": "2d5d45ea",
"metadata": {},
"source": [
"The Jupyter Notebook is a notebook that uses Python and Markdown to present your analysis to a data science audience.\n",
@@ -219,7 +203,6 @@
},
{
"cell_type": "markdown",
- "id": "2027aa4c",
"metadata": {},
"source": [
"### GitHub Repository"
@@ -227,7 +210,6 @@
},
{
"cell_type": "markdown",
- "id": "b8057390",
"metadata": {},
"source": [
"The GitHub repository is the cloud-hosted directory containing all of your project files as well as their version history.\n",
@@ -276,7 +258,6 @@
},
{
"cell_type": "markdown",
- "id": "f19694e7",
"metadata": {},
"source": [
"## Grading"
@@ -284,7 +265,6 @@
},
{
"cell_type": "markdown",
- "id": "06e9cfb7",
"metadata": {},
"source": [
"***To pass this project, you must pass each project rubric objective.*** The project rubric objectives for Phase 2 are:\n",
@@ -296,7 +276,6 @@
},
{
"cell_type": "markdown",
- "id": "a4c04769",
"metadata": {},
"source": [
"### Data Communication"
@@ -304,7 +283,6 @@
},
{
"cell_type": "markdown",
- "id": "0834a4ee",
"metadata": {},
"source": [
"Communication is a key \"soft skill\". In [this survey](https://www.payscale.com/data-packages/job-skills), 46% of hiring managers said that recent college grads were missing this skill.\n",
@@ -324,7 +302,6 @@
},
{
"cell_type": "markdown",
- "id": "276dff7c",
"metadata": {},
"source": [
"#### Exceeds Objective"
@@ -332,7 +309,6 @@
},
{
"cell_type": "markdown",
- "id": "e87c2713",
"metadata": {},
"source": [
"Creates and describes appropriate visualizations for given business questions, where each visualization fulfills all elements of the checklist\n",
@@ -342,7 +318,6 @@
},
{
"cell_type": "markdown",
- "id": "b4e8a4c7",
"metadata": {},
"source": [
"#### Meets Objective (Passing Bar)"
@@ -350,7 +325,6 @@
},
{
"cell_type": "markdown",
- "id": "bc4e21d0",
"metadata": {},
"source": [
"Creates and describes appropriate visualizations for given business questions\n",
@@ -360,7 +334,6 @@
},
{
"cell_type": "markdown",
- "id": "d0403eb9",
"metadata": {},
"source": [
"#### Approaching Objective"
@@ -368,7 +341,6 @@
},
{
"cell_type": "markdown",
- "id": "22dd4ad6",
"metadata": {},
"source": [
"Creates visualizations that are not related to the business questions, or uses an inappropriate type of visualization\n",
@@ -380,7 +352,6 @@
},
{
"cell_type": "markdown",
- "id": "aa1b808d",
"metadata": {},
"source": [
"#### Does Not Meet Objective"
@@ -388,7 +359,6 @@
},
{
"cell_type": "markdown",
- "id": "a8a64869",
"metadata": {},
"source": [
"Does not submit the required number of visualizations"
@@ -396,7 +366,6 @@
},
{
"cell_type": "markdown",
- "id": "db2e0ce8",
"metadata": {},
"source": [
"### Authoring Jupyter Notebooks"
@@ -404,7 +373,6 @@
},
{
"cell_type": "markdown",
- "id": "91cc89b5",
"metadata": {},
"source": [
"According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), 74.1% of data scientists use a Jupyter development environment, which is more than twice the percentage of the next-most-popular IDE, Visual Studio Code. Jupyter Notebooks allow for reproducible, skim-able code documents for a data science audience. Comfort and skill with authoring Jupyter Notebooks will prepare you for job interviews, take-home challenges, and on-the-job tasks as a data scientist.\n",
@@ -416,7 +384,6 @@
},
{
"cell_type": "markdown",
- "id": "b9272672",
"metadata": {},
"source": [
"#### Exceeds Objective"
@@ -424,7 +391,6 @@
},
{
"cell_type": "markdown",
- "id": "efc937e5",
"metadata": {},
"source": [
"Uses Markdown and code comments to create a well-organized, skim-able document that follows all best practices\n",
@@ -434,7 +400,6 @@
},
{
"cell_type": "markdown",
- "id": "d01725ea",
"metadata": {},
"source": [
"#### Meets Objective (Passing Bar)"
@@ -442,7 +407,6 @@
},
{
"cell_type": "markdown",
- "id": "2c854f50",
"metadata": {},
"source": [
"Uses some Markdown to create an organized notebook, with an introduction at the top and a conclusion at the bottom"
@@ -450,7 +414,6 @@
},
{
"cell_type": "markdown",
- "id": "3e0b3385",
"metadata": {},
"source": [
"#### Approaching Objective"
@@ -458,7 +421,6 @@
},
{
"cell_type": "markdown",
- "id": "67767f89",
"metadata": {},
"source": [
"Uses Markdown cells to organize, but either uses only headers and does not provide any explanations or justifications, or uses only plaintext without any headers to segment out sections of the notebook\n",
@@ -468,7 +430,6 @@
},
{
"cell_type": "markdown",
- "id": "195ef62a",
"metadata": {},
"source": [
"#### Does Not Meet Objective"
@@ -476,7 +437,6 @@
},
{
"cell_type": "markdown",
- "id": "709181b9",
"metadata": {},
"source": [
"Does not submit a notebook, or does not use Markdown cells at all to organize the notebook"
@@ -484,7 +444,6 @@
},
{
"cell_type": "markdown",
- "id": "290335d1",
"metadata": {},
"source": [
"### Data Manipulation and Analysis with `pandas`"
@@ -492,7 +451,6 @@
},
{
"cell_type": "markdown",
- "id": "2c0aae32",
"metadata": {},
"source": [
"`pandas` is a very popular data manipulation library, with over 2 million downloads on Anaconda (`conda install pandas`) and over 19 million downloads on PyPI (`pip install pandas`) at the time of this writing. In our own internal data, we see that the overwhelming majority of Flatiron School DS grads use `pandas` on the job in some capacity.\n",
@@ -510,7 +468,6 @@
},
{
"cell_type": "markdown",
- "id": "e070c91b",
"metadata": {},
"source": [
"#### Exceeds Objective"
@@ -518,7 +475,6 @@
},
{
"cell_type": "markdown",
- "id": "20092dcd",
"metadata": {},
"source": [
"Uses `pandas` to prepare data and answer business questions in an idiomatic, performant way"
@@ -526,7 +482,6 @@
},
{
"cell_type": "markdown",
- "id": "882b158d",
"metadata": {},
"source": [
"#### Meets Objective (Passing Bar)"
@@ -534,7 +489,6 @@
},
{
"cell_type": "markdown",
- "id": "c2c426e6",
"metadata": {},
"source": [
"Successfully uses `pandas` to prepare data in order to answer business questions\n",
@@ -544,7 +498,6 @@
},
{
"cell_type": "markdown",
- "id": "88d1667b",
"metadata": {},
"source": [
"#### Approaching Objective"
@@ -552,7 +505,6 @@
},
{
"cell_type": "markdown",
- "id": "ec132034",
"metadata": {},
"source": [
"Uses `pandas` to prepare data, but makes significant errors\n",
@@ -562,7 +514,6 @@
},
{
"cell_type": "markdown",
- "id": "c5e3c86b",
"metadata": {},
"source": [
"#### Does Not Meet Objective"
@@ -570,7 +521,6 @@
},
{
"cell_type": "markdown",
- "id": "d9566206",
"metadata": {},
"source": [
"Unable to prepare data using `pandas`\n",
@@ -580,7 +530,6 @@
},
{
"cell_type": "markdown",
- "id": "b0923637",
"metadata": {},
"source": [
"## Getting Started"
@@ -588,7 +537,6 @@
},
{
"cell_type": "markdown",
- "id": "8e37e815",
"metadata": {},
"source": [
"Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP.\n",
@@ -604,7 +552,6 @@
},
{
"cell_type": "markdown",
- "id": "290d61a5",
"metadata": {},
"source": [
"## Summary"
@@ -612,7 +559,6 @@
},
{
"cell_type": "markdown",
- "id": "ac002279",
"metadata": {},
"source": [
"This project will give you a valuable opportunity to develop your data science skills using real-world data. The end-of-phase projects are a critical part of the program because they give you a chance to bring together all the skills you've learned, apply them to realistic projects for a business stakeholder, practice communication skills, and get feedback to help you improve. You've got this!"
@@ -635,7 +581,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.9.16"
+ "version": "3.8.5"
}
},
"nbformat": 4,
diff --git a/student-phase2project-cg.ipynb b/student-phase2project-cg.ipynb
new file mode 100644
index 00000000..19851964
--- /dev/null
+++ b/student-phase2project-cg.ipynb
@@ -0,0 +1,1383 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Final Project Submission\n",
+ "\n",
+ "Please fill out:\n",
+ "* Student name: Catherine Gachiri\n",
+ "* Student pace: Remote\n",
+ "* Scheduled project review date/time: 14/09/2025\n",
+ "* Instructor name: Fidelis Wanalwenge\n",
+ "* Blog post URL:\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 🎬 Movie Studio Investment Analysis\n",
+ "\n",
+ "## Project Overview\n",
+ "This notebook explores movie performance data to help our company decide **what types of films to create**. \n",
+ "We will use exploratory data analysis and statistical modeling to answer business questions about ROI.\n",
+ "\n",
+ "**Key Data Sources:**\n",
+ "- **The Numbers** (`tn.movie_budgets.csv.gz`): Budgets & grosses (used for ROI).\n",
+ "- **IMDB** (`im.db`): Movie metadata (genres, runtime, year).\n",
+ "- **Box Office Mojo** (`bom.movie_gross.csv.gz`): Additional grosses (optional).\n",
+ "\n",
+ "**Goal:** Build a dataset that combines **financial data** (budgets & grosses) with **metadata** (genres, runtime, release timing) for statistical analysis."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Business Understanding\n",
+ "Our stakeholders (head of the new movie studio) want to know:\n",
+ "\n",
+ "1. **Genres vs ROI:** Which genres yield the best returns? \n",
+ "2. **Release Quarter vs ROI:** Does timing affect financial success? \n",
+ "3. **Budget vs ROI %:** Are bigger budgets more (or less) profitable? \n",
+ "4. **Runtime vs ROI:** Does movie length impact profitability?\n",
+ "\n",
+ "We will prepare a clean dataset to test these hypotheses."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 1: Load and Inspect Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Box Office Mojo sample:\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ " title studio domestic_gross \\\n",
+ "0 Toy Story 3 BV 415000000.0 \n",
+ "1 Alice in Wonderland (2010) BV 334200000.0 \n",
+ "2 Harry Potter and the Deathly Hallows Part 1 WB 296000000.0 \n",
+ "3 Inception WB 292600000.0 \n",
+ "4 Shrek Forever After P/DW 238700000.0 \n",
+ "\n",
+ " foreign_gross year \n",
+ "0 652000000 2010 \n",
+ "1 691300000 2010 \n",
+ "2 664300000 2010 \n",
+ "3 535700000 2010 \n",
+ "4 513900000 2010 "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "(3387, 5)"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The Numbers sample:\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ " id release_date movie \\\n",
+ "0 1 Dec 18, 2009 Avatar \n",
+ "1 2 May 20, 2011 Pirates of the Caribbean: On Stranger Tides \n",
+ "2 3 Jun 7, 2019 Dark Phoenix \n",
+ "3 4 May 1, 2015 Avengers: Age of Ultron \n",
+ "4 5 Dec 15, 2017 Star Wars Ep. VIII: The Last Jedi \n",
+ "\n",
+ " production_budget domestic_gross worldwide_gross \n",
+ "0 $425,000,000 $760,507,625 $2,776,345,279 \n",
+ "1 $410,600,000 $241,063,875 $1,045,663,875 \n",
+ "2 $350,000,000 $42,762,350 $149,762,350 \n",
+ "3 $330,600,000 $459,005,868 $1,403,013,963 \n",
+ "4 $317,000,000 $620,181,382 $1,316,721,747 "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "(5782, 6)"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "IMDB Tables:\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ " name\n",
+ "0 movie_basics\n",
+ "1 directors\n",
+ "2 known_for\n",
+ "3 movie_akas\n",
+ "4 movie_ratings\n",
+ "5 persons\n",
+ "6 principals\n",
+ "7 writers\n",
+ "8 movies_financials_ratings\n",
+ "9 merged_with_ratings"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import sqlite3\n",
+ "import re\n",
+ "from pathlib import Path\n",
+ "\n",
+ "# Define paths\n",
+ "data_dir = Path('zippedData')\n",
+ "bom_path = data_dir/'bom.movie_gross.csv.gz'\n",
+ "tn_path = data_dir/'tn.movie_budgets.csv.gz'\n",
+ "imdb_path = Path('zippedData/im.db')\n",
+ "\n",
+ "# Load Box Office Mojo\n",
+ "bom = pd.read_csv(bom_path)\n",
+ "print(\"Box Office Mojo sample:\")\n",
+ "display(bom.head())\n",
+ "display(bom.shape)\n",
+ "\n",
+ "# Load The Numbers (budgets)\n",
+ "tn = pd.read_csv(tn_path)\n",
+ "print(\"The Numbers sample:\")\n",
+ "display(tn.head())\n",
+ "display(tn.shape)\n",
+ "\n",
+ "# Inspect IMDB tables\n",
+ "con = sqlite3.connect(imdb_path)\n",
+ "tables = pd.read_sql(\"SELECT name FROM sqlite_master WHERE type='table';\", con)\n",
+ "print(\"IMDB Tables:\")\n",
+ "display(tables)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 2: Clean `The Numbers` Dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " id release_date movie \\\n",
+ "0 1 2009-12-18 Avatar \n",
+ "1 2 2011-05-20 Pirates of the Caribbean: On Stranger Tides \n",
+ "2 3 2019-06-07 Dark Phoenix \n",
+ "3 4 2015-05-01 Avengers: Age of Ultron \n",
+ "4 5 2017-12-15 Star Wars Ep. VIII: The Last Jedi \n",
+ "\n",
+ " production_budget domestic_gross worldwide_gross year quarter ROI \n",
+ "0 425000000.0 760507625.0 2.776345e+09 2009 4 5.532577 \n",
+ "1 410600000.0 241063875.0 1.045664e+09 2011 2 1.546673 \n",
+ "2 350000000.0 42762350.0 1.497624e+08 2019 2 -0.572108 \n",
+ "3 330600000.0 459005868.0 1.403014e+09 2015 2 3.243841 \n",
+ "4 317000000.0 620181382.0 1.316722e+09 2017 4 3.153696 "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "(5782, 9)"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "tn_clean = tn.copy()\n",
+ "\n",
+ "# Convert currency columns to numeric\n",
+ "currency_cols = [\"production_budget\", \"domestic_gross\", \"worldwide_gross\"]\n",
+ "for col in currency_cols:\n",
+ " tn_clean[col] = (tn_clean[col]\n",
+ " .replace(r'[$,]', '', regex=True)\n",
+ " .astype(float))\n",
+ "\n",
+ "# Parse release date\n",
+ "tn_clean[\"release_date\"] = pd.to_datetime(tn_clean[\"release_date\"], errors=\"coerce\")\n",
+ "tn_clean[\"year\"] = tn_clean[\"release_date\"].dt.year\n",
+ "tn_clean[\"quarter\"] = tn_clean[\"release_date\"].dt.quarter\n",
+ "\n",
+ "# Compute ROI\n",
+ "tn_clean[\"ROI\"] = (tn_clean[\"worldwide_gross\"] - tn_clean[\"production_budget\"]) / tn_clean[\"production_budget\"]\n",
+ "\n",
+ "display(tn_clean.head())\n",
+ "display(tn_clean.shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 3: Extract Metadata from IMDB"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " movie_id primary_title start_year runtime_minutes \\\n",
+ "0 tt0063540 Sunghursh 2013 175.0 \n",
+ "1 tt0066787 One Day Before the Rainy Season 2019 114.0 \n",
+ "2 tt0069049 The Other Side of the Wind 2018 122.0 \n",
+ "3 tt0069204 Sabse Bada Sukh 2018 NaN \n",
+ "4 tt0100275 The Wandering Soap Opera 2017 80.0 \n",
+ "\n",
+ " genres \n",
+ "0 Action,Crime,Drama \n",
+ "1 Biography,Drama \n",
+ "2 Drama \n",
+ "3 Comedy,Drama \n",
+ "4 Comedy,Drama,Fantasy "
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "con = sqlite3.connect(imdb_path)\n",
+ "imdb = pd.read_sql(\"\"\"\n",
+ " SELECT movie_id, primary_title, start_year, runtime_minutes, genres\n",
+ " FROM movie_basics\n",
+ " WHERE start_year BETWEEN 1980 AND 2025\n",
+ " AND primary_title IS NOT NULL\n",
+ "\"\"\", con)\n",
+ "con.close()\n",
+ "\n",
+ "imdb.head()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 4: Normalize Titles and Join Datasets"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " id release_date movie \\\n",
+ "0 1 2009-12-18 Avatar \n",
+ "1 2 2011-05-20 Pirates of the Caribbean: On Stranger Tides \n",
+ "2 3 2019-06-07 Dark Phoenix \n",
+ "3 4 2015-05-01 Avengers: Age of Ultron \n",
+ "4 5 2017-12-15 Star Wars Ep. VIII: The Last Jedi \n",
+ "\n",
+ " production_budget domestic_gross worldwide_gross year quarter \\\n",
+ "0 425000000.0 760507625.0 2.776345e+09 2009 4 \n",
+ "1 410600000.0 241063875.0 1.045664e+09 2011 2 \n",
+ "2 350000000.0 42762350.0 1.497624e+08 2019 2 \n",
+ "3 330600000.0 459005868.0 1.403014e+09 2015 2 \n",
+ "4 317000000.0 620181382.0 1.316722e+09 2017 4 \n",
+ "\n",
+ " ROI title_norm movie_id \\\n",
+ "0 5.532577 avatar NaN \n",
+ "1 1.546673 pirates of the caribbean on stranger tides tt1298650 \n",
+ "2 -0.572108 dark phoenix tt6565702 \n",
+ "3 3.243841 avengers age of ultron tt2395427 \n",
+ "4 3.153696 star wars ep viii the last jedi NaN \n",
+ "\n",
+ " primary_title start_year runtime_minutes \\\n",
+ "0 NaN NaN NaN \n",
+ "1 Pirates of the Caribbean: On Stranger Tides 2011.0 136.0 \n",
+ "2 Dark Phoenix 2019.0 113.0 \n",
+ "3 Avengers: Age of Ultron 2015.0 141.0 \n",
+ "4 NaN NaN NaN \n",
+ "\n",
+ " genres \n",
+ "0 NaN \n",
+ "1 Action,Adventure,Fantasy \n",
+ "2 Action,Adventure,Sci-Fi \n",
+ "3 Action,Adventure,Sci-Fi \n",
+ "4 NaN "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "(5782, 15)"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Title normalization function\n",
+ "def normalize_title(title: str) -> str:\n",
+ "    if pd.isna(title):\n",
+ "        return np.nan\n",
+ "    title = title.lower().strip()\n",
+ "    title = re.sub(r\"\\([^)]*\\)\", \"\", title)  # remove parentheticals\n",
+ "    title = re.sub(r\"[^a-z0-9 ]\", \"\", title)  # drop punctuation\n",
+ "    title = re.sub(r\"\\s+\", \" \", title).strip()  # collapse repeated whitespace\n",
+ "    return title\n",
+ "\n",
+ "tn_clean[\"title_norm\"] = tn_clean[\"movie\"].map(normalize_title)\n",
+ "imdb[\"title_norm\"] = imdb[\"primary_title\"].map(normalize_title)\n",
+ "\n",
+ "# Bring in ratings to get numvotes\n",
+ "con = sqlite3.connect(imdb_path)\n",
+ "ratings = pd.read_sql(\"SELECT movie_id, averagerating, numvotes FROM movie_ratings;\", con)\n",
+ "con.close()\n",
+ "\n",
+ "imdb_full = (imdb.merge(ratings, on='movie_id', how='left')\n",
+ " .assign(numvotes=lambda d: d['numvotes'].fillna(0),\n",
+ " runtime_minutes=lambda d: d['runtime_minutes'].fillna(-1)))\n",
+ "\n",
+ "# Sort by best proxy for canonical record, then keep first per key\n",
+ "imdb_dedup = (imdb_full.sort_values(['title_norm','start_year','numvotes','runtime_minutes'],\n",
+ "                                    ascending=[True, True, False, False])\n",
+ "              .drop_duplicates(['title_norm','start_year'], keep='first')\n",
+ "              .drop(columns=['averagerating','numvotes'])  # keep if you need them\n",
+ "              .replace({'runtime_minutes': {-1: np.nan}}))  # undo the sort-only -1 sentinel so missing runtimes stay NaN\n",
+ "\n",
+ "# Re-join with deduped IMDB\n",
+ "movies_dedup = tn_clean.merge(\n",
+ " imdb_dedup, left_on=['title_norm','year'], right_on=['title_norm','start_year'], how='left'\n",
+ ")\n",
+ "len(movies_dedup) - len(tn_clean)  # rows added by the join; should now be ~0 (or much smaller)\n",
+ "\n",
+ "display(movies_dedup.head())\n",
+ "display(movies_dedup.shape)"
+ ]
+ },
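+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The join above keys on normalized title plus year, so it is worth checking both that no budget rows were duplicated and how many rows actually found an IMDB match. The cell below is a minimal sketch, assuming toy stand-ins for `tn_clean` and `imdb_dedup` (the frames `left` and `right` and their values are illustrative, not project data). `validate` and `indicator` are standard `DataFrame.merge` options."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical stand-ins for tn_clean and imdb_dedup (illustrative values only)\n",
+ "left = pd.DataFrame({\n",
+ "    \"title_norm\": [\"avatar\", \"dark phoenix\", \"spectre\"],\n",
+ "    \"year\": [2009, 2019, 2015],\n",
+ "})\n",
+ "right = pd.DataFrame({\n",
+ "    \"title_norm\": [\"dark phoenix\", \"spectre\"],\n",
+ "    \"start_year\": [2019, 2015],\n",
+ "    \"runtime_minutes\": [113.0, 148.0],\n",
+ "})\n",
+ "\n",
+ "# validate=\"m:1\" raises MergeError if the right-hand keys are not unique;\n",
+ "# indicator=True adds a _merge column marking matched vs unmatched rows\n",
+ "checked = left.merge(\n",
+ "    right,\n",
+ "    left_on=[\"title_norm\", \"year\"],\n",
+ "    right_on=[\"title_norm\", \"start_year\"],\n",
+ "    how=\"left\",\n",
+ "    validate=\"m:1\",\n",
+ "    indicator=True,\n",
+ ")\n",
+ "\n",
+ "assert len(checked) == len(left)  # a left join on unique right keys keeps the row count\n",
+ "match_rate = (checked[\"_merge\"] == \"both\").mean()\n",
+ "print(f\"Matched {match_rate:.1%} of budget rows\")"
+ ]
+ },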
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 5: Create Final Analysis Dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " movie release_date year quarter \\\n",
+ "0 Avatar 2009-12-18 2009 4 \n",
+ "1 Pirates of the Caribbean: On Stranger Tides 2011-05-20 2011 2 \n",
+ "2 Dark Phoenix 2019-06-07 2019 2 \n",
+ "3 Avengers: Age of Ultron 2015-05-01 2015 2 \n",
+ "4 Star Wars Ep. VIII: The Last Jedi 2017-12-15 2017 4 \n",
+ "5 Star Wars Ep. VII: The Force Awakens 2015-12-18 2015 4 \n",
+ "6 Avengers: Infinity War 2018-04-27 2018 2 \n",
+ "7 Pirates of the Caribbean: At World’s End 2007-05-24 2007 2 \n",
+ "8 Justice League 2017-11-17 2017 4 \n",
+ "9 Spectre 2015-11-06 2015 4 \n",
+ "\n",
+ " production_budget worldwide_gross ROI runtime_minutes \\\n",
+ "0 425000000.0 2.776345e+09 5.532577 NaN \n",
+ "1 410600000.0 1.045664e+09 1.546673 136.0 \n",
+ "2 350000000.0 1.497624e+08 -0.572108 113.0 \n",
+ "3 330600000.0 1.403014e+09 3.243841 141.0 \n",
+ "4 317000000.0 1.316722e+09 3.153696 NaN \n",
+ "5 306000000.0 2.053311e+09 5.710167 NaN \n",
+ "6 300000000.0 2.048134e+09 5.827114 149.0 \n",
+ "7 300000000.0 9.634204e+08 2.211401 NaN \n",
+ "8 300000000.0 6.559452e+08 1.186484 120.0 \n",
+ "9 300000000.0 8.796209e+08 1.932070 148.0 \n",
+ "\n",
+ " genres \n",
+ "0 NaN \n",
+ "1 Action,Adventure,Fantasy \n",
+ "2 Action,Adventure,Sci-Fi \n",
+ "3 Action,Adventure,Sci-Fi \n",
+ "4 NaN \n",
+ "5 NaN \n",
+ "6 Action,Adventure,Sci-Fi \n",
+ "7 NaN \n",
+ "8 Action,Adventure,Fantasy \n",
+ "9 Action,Adventure,Thriller "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "(5782, 9)"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "analysis_df = movies_dedup[[\n",
+ " \"movie\", \"release_date\", \"year\", \"quarter\",\n",
+ " \"production_budget\", \"worldwide_gross\", \"ROI\",\n",
+ " \"runtime_minutes\", \"genres\"\n",
+ "]].copy()\n",
+ "\n",
+ "display(analysis_df.head(10))\n",
+ "analysis_df.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Base dataset: (5782, 9)\n",
+ "Genre dataset: (4030, 11)\n",
+ "Quarter dataset: (5782, 9)\n",
+ "Budget dataset: (5782, 9)\n",
+ "Runtime dataset: (1582, 9)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Drop rows without ROI\n",
+ "df = analysis_df.dropna(subset=[\"ROI\"]).copy()\n",
+ "\n",
+ "# Extract primary genre (first listed)\n",
+ "#df[\"primary_genre\"] = df[\"genres\"].dropna().apply(lambda x: x.split(\",\")[0] if isinstance(x, str) else np.nan)\n",
+ "\n",
+ "# 1) Explode genres so a movie appears once per genre\n",
+ "df_multi = df.dropna(subset=[\"genres\"]).copy()\n",
+ "df_multi[\"genre\"] = df_multi[\"genres\"].str.split(\",\")\n",
+ "df_multi = df_multi.explode(\"genre\")\n",
+ "df_multi[\"genre\"] = df_multi[\"genre\"].str.strip()\n",
+ "\n",
+ "# 2) Cluster id: all repeated rows from same movie share this\n",
+ "# (use an actual unique id if you have it; title+year is a good fallback)\n",
+ "df_multi[\"cluster_id\"] = (\n",
+ " df_multi[\"movie\"].str.lower().str.strip() + \"_\" + df_multi[\"year\"].astype(str)\n",
+ ")\n",
+ "\n",
+ "# (optional) keep genres with enough data\n",
+ "counts = df_multi[\"genre\"].value_counts()\n",
+ "keep = counts[counts >= 30].index\n",
+ "sub = df_multi[df_multi[\"genre\"].isin(keep)].copy()\n",
+ "\n",
+ "# Hypothesis-specific datasets\n",
+ "df_genre = df_multi.dropna(subset=[\"genre\"])\n",
+ "df_quarter = df.dropna(subset=[\"quarter\"])\n",
+ "df_budget = df[df[\"production_budget\"] > 0].copy()\n",
+ "df_runtime = df.dropna(subset=[\"runtime_minutes\"])\n",
+ "\n",
+ "print(\"Base dataset:\", df.shape)\n",
+ "print(\"Genre dataset:\", df_genre.shape)\n",
+ "print(\"Quarter dataset:\", df_quarter.shape)\n",
+ "print(\"Budget dataset:\", df_budget.shape)\n",
+ "print(\"Runtime dataset:\", df_runtime.shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Test whether a movie's length (runtime) affects its profitability (ROI) using a simple linear regression model.\n",
+ "\n",
+ "- Independent variable (X): runtime (minutes)\n",
+ "- Dependent variable (Y): ROI (return on investment)\n",
+ "- Null hypothesis (H₀): There is no linear relationship between movie runtime and ROI.\n",
+ "- Alternative hypothesis (H₁): There is a linear relationship between movie runtime and ROI."
+ ]
+ },
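+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Written out, the model fitted in the next cells is the usual simple linear regression. The symbols $\\beta_0$, $\\beta_1$, and $\\varepsilon_i$ are standard notation introduced here for illustration, not names used elsewhere in the notebook:\n",
+ "\n",
+ "$$\\text{ROI}_i = \\beta_0 + \\beta_1 \\cdot \\text{runtime}_i + \\varepsilon_i$$\n",
+ "\n",
+ "Under this notation the hypotheses amount to $H_0: \\beta_1 = 0$ versus $H_1: \\beta_1 \\neq 0$, and the p-value reported for the slope is the test of that null."
+ ]
+ },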
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " primary_title runtime_minutes ROI\n",
+ "0 Foodfight! 91.0 -0.998362\n",
+ "1 The Secret Life of Walter Mitty 114.0 1.064409\n",
+ "2 A Walk Among the Tombstones 114.0 1.218164\n",
+ "3 Jurassic World 124.0 6.669092\n",
+ "4 The Rum Diary 119.0 -0.521228\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Preparation - Create a join key using title + year\n",
+ "\n",
+ "tn_clean[\"title_key\"] = tn_clean[\"movie\"].str.lower().str.strip() + tn_clean[\"year\"].astype(str)\n",
+ "imdb[\"title_key\"] = imdb[\"primary_title\"].str.lower().str.strip() + imdb[\"start_year\"].astype(str)\n",
+ "\n",
+ "runtime_data = imdb.merge(tn_clean, on=\"title_key\", how=\"inner\")\n",
+ "\n",
+ "print(runtime_data[[\"primary_title\", \"runtime_minutes\", \"ROI\"]].head())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import warnings\n",
+ "warnings.simplefilter(action=\"ignore\", category=FutureWarning)\n",
+ "\n",
+ "import statsmodels.api as sm\n",
+ "\n",
+ "# Drop missing runtimes or ROIs\n",
+ "runtime_data = runtime_data.dropna(subset=[\"runtime_minutes\", \"ROI\"])\n",
+ "\n",
+ "# Define variables\n",
+ "X = runtime_data[\"runtime_minutes\"]\n",
+ "y = runtime_data[\"ROI\"]\n",
+ "\n",
+ "# Add intercept\n",
+ "X = sm.add_constant(X)\n",
+ "\n",
+ "# Fit regression\n",
+ "model = sm.OLS(y, X).fit()\n",
+ "print(model.summary())\n",
+ "\n",
+ "# 95% CI for slope\n",
+ "print(\"95% CI for slope:\", model.conf_int().loc[\"runtime_minutes\"])\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The R² value (0.001) means runtime explains almost none of the variation in ROI.\n",
+ "\n",
+ "The slope (-0.0144) is not statistically significant (p = 0.359), so there is no evidence that longer or shorter movies differ in ROI.\n",
+ "\n",
+ "The 95% confidence interval for the slope includes zero, so the true effect could be slightly positive or slightly negative; either way, it is too small to matter.\n"
+ ]
+ },
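+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The figures quoted above can be read directly from the fitted results object instead of being copied by hand from the summary table. The cell below is a minimal sketch on synthetic data: the `toy` frame, its values, and the name `fit` are assumptions for illustration. On the real data, the same attributes would be read from `model`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import statsmodels.api as sm\n",
+ "\n",
+ "# Synthetic data (assumption): ROI is drawn independently of runtime\n",
+ "rng = np.random.default_rng(0)\n",
+ "toy = pd.DataFrame({\"runtime_minutes\": rng.normal(105, 15, 200)})\n",
+ "toy[\"ROI\"] = rng.normal(1.5, 3.0, 200)\n",
+ "\n",
+ "# Same model shape as the notebook: ROI regressed on runtime plus an intercept\n",
+ "fit = sm.OLS(toy[\"ROI\"], sm.add_constant(toy[[\"runtime_minutes\"]])).fit()\n",
+ "\n",
+ "# Pull the headline statistics by name rather than reading them off the summary\n",
+ "slope = fit.params[\"runtime_minutes\"]\n",
+ "p_value = fit.pvalues[\"runtime_minutes\"]\n",
+ "ci_low, ci_high = fit.conf_int().loc[\"runtime_minutes\"]\n",
+ "\n",
+ "print(f\"R^2 = {fit.rsquared:.3f}, slope = {slope:.4f}, p = {p_value:.3f}\")\n",
+ "print(f\"95% CI for slope: [{ci_low:.4f}, {ci_high:.4f}]\")"
+ ]
+ },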
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Visualization\n",
+ "\n",
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "plt.figure(figsize=(10,7))\n",
+ "sns.regplot(x=\"runtime_minutes\", y=\"ROI\", data=runtime_data,\n",
+ " scatter_kws={\"alpha\":0.5}, line_kws={\"color\":\"red\"})\n",
+ "\n",
+ "plt.title(\"Runtime vs Return on Investment with Regression Line\")\n",
+ "plt.xlabel(\"Runtime (minutes)\")\n",
+ "plt.ylabel(\"Return on Investment\")\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The plot above shows movie runtime against ROI, with the fitted regression line in red. The line is almost flat, which is consistent with runtime having no meaningful effect on profitability.\n",
+ "\n",
+ "- Null hypothesis (H₀): There is no relationship between movie runtime and ROI.\n",
+ "\n",
+ "- Alternative hypothesis (H₁): There is a relationship between movie runtime and ROI.\n",
+ "\n",
+ "- Since the p-value (0.359) is greater than 0.05, we fail to reject H₀: the data show no statistically significant relationship between runtime and profitability."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Business Recommendation\n",
+ "\n",
+ "The analysis shows that movie length does not significantly affect profitability: whether a film runs shorter or longer has almost no effect on its return on investment (ROI).\n",
+ "\n",
+ "#### Implication for the Company:\n",
+ "\n",
+ "Runtime should not be a deciding factor when selecting or producing films. Strategic focus should shift to factors more likely to drive returns, such as:\n",
+ "\n",
+ "- Budget management (spending efficiently to maximize returns)\n",
+ "- Release timing (launching films in profitable quarters/seasons)\n",
+ "- Marketing and distribution strategies\n",
+ "\n",
+ "Advice to stakeholder: When developing original video content, do not prioritize movie length.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}