diff --git a/.gitignore b/.gitignore
new file mode 100644
index 00000000..52cb04c0
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+*.db
+zippedData/*
diff --git a/Movie_README.md b/Movie_README.md
new file mode 100644
index 00000000..27b4ee4c
--- /dev/null
+++ b/Movie_README.md
@@ -0,0 +1,155 @@
+# 🎬 Movie Studio Investment Analysis
+
+This notebook explores movie performance data to help our company decide **what types of films to create**.  
+We will use exploratory data analysis and statistical modeling to answer business questions about ROI.
+
+
+> **Deliverables included (starter set):**
+> - `README.md` — Notebook summary and guide
+> - `presentation.pptx` (or `presentation.md`) — Non-technical presentation for Management
+> - `student-checkpoint.ipynb` — analysis notebook developed by the project team
+> - `data/` — expected location for raw inputs (e.g., `zippedData/...`), usually **gitignored**
+
+---
+
+## 1) Overview
+
+**Business Problem.** The company is launching a **new movie studio** and needs evidence-based guidance on **what kinds of films to green-light**.
+
+**Objective.** Use exploratory data analysis to identify patterns linked to **Return on Investment (ROI)** and produce **three concrete recommendations** for slate strategy.
+
+**Current Scope.** We begin with:
+- **The Numbers** (`tn.movie_budgets.csv.gz`): budgets and grosses
+- **IMDB** (`im.db` → `movie_basics`): title, year, runtime, genres
+
+The workflow is **modular** so teammates can add Rotten Tomatoes, TMDB, Box Office Mojo, etc.
+
+---
+
+## 2) Business Understanding
+
+**Stakeholder.** Head of the new studio (green-lighting decisions).
+
+**Key Questions.**
+1. **Genres vs ROI** – Which genres yield the best returns?
+2. **Budget vs ROI %** – Are bigger budgets more or less profitable?
+3. **Runtime vs ROI** – Does movie length impact profitability?
+
+**Decision Use.** Prioritize genre slate, design budget bands, and define runtime guardrails by genre.
+
+---
+
+## 3) Data Understanding & Preparation
+
+**Sources.**
+- **The Numbers** — production budgets, domestic/worldwide grosses, release dates.
+- **IMDB** — movie metadata (titles, start_year, runtime_minutes, genres).
+
+**Join Logic.** Normalize titles and match on **title + year** to merge The Numbers with IMDB.
+
+**Target Metric.** `ROI = (worldwide_gross − production_budget) / production_budget` (requires budget > 0).
+
+**Cleaning.**
+- Cast currency strings to numeric.
+- Drop missing/non-finite ROI.
+- **Multi-genre policy (simple):** explode genres; each film contributes to every genre it’s labeled with. Results are **directional** because genres overlap.
+
+---
+
+## 4) Methods (aligned to syllabus)
+
+- **Descriptive statistics:** counts, mean/median ROI, % profitable (ROI > 0).
+- **t-based confidence intervals** (n ≥ 30; CLT justification).
+- **Optional one-sample t-tests:** compare each genre’s mean to the overall mean.
+- **Simple linear regression** (StatsModels): ROI ~ log10(budget), ROI ~ runtime.
+- **Diagnostics:** QQ plot (normality) and residuals vs fitted (homoscedasticity).
+- **Outlier awareness:** distributions plotted; optional winsorization/log transforms for sensitivity.
+
+---
+
+## 5) Analysis Summary (current iteration)
+
+### 5.1 Genres vs ROI (Simple Multi-Genre)
+- Per-genre **n, mean ROI, median ROI, % profitable**.
+- Visuals: Top 10 by **Avg ROI**, Top 10 by **% Profitable**, **boxplots** (outliers hidden).  
+- Interpretation: Favor genres above overall Avg ROI **and** with high % profitable; medians close to means suggest less outlier risk.
+
+### 5.2 Budget vs ROI
+- Regression **ROI ~ log10(budget)** with diagnostics; informs budget bands and slate mix.
+
+### 5.3 Runtime vs ROI
+- Regression **ROI ~ runtime_minutes** with diagnostics; checks for diminishing returns with very long runtimes.
+- Visuals: Runtime vs ROI with regression line.
+- Interpretation: The plot shows no meaningful relationship between runtime and profitability.
+
+> Teammates can extend with Rotten Tomatoes/TMDB (ratings, votes), cast/star power, franchise flags, marketing proxies.
+
+---
+
+## 6) Recommendations (draft; refine with added evidence)
+
+1. **Slate focus:** Prioritize the top 2–3 genres that are above the overall Avg ROI **and** show high % profitable with medians close to means.
+2. **Budget discipline:** Concentrate budgets in bands where ROI is resilient. Use a **tiered slate** (a few mid–high budget bets plus a steady pipeline of mid/low budgets).
+3. **Runtime guardrails:** Avoid extreme runtimes unless the genre historically sustains them; target the runtime range with stable ROI in the regression.
+
+**Next iteration:** Add confidence intervals to slides, perform sensitivity (winsorized/log ROI), and enrich with ratings and cast variables.
+
+---
+
+## 7) How to Run
+
+**Environment:** Python 3.x; `pandas`, `numpy`, `matplotlib`, `scipy`, `statsmodels`, `sqlite3` (stdlib).
+
+**Data layout (example):**
+```
+data/
+└── zippedData/
+    ├── tn.movie_budgets.csv.gz
+    └── im.db
+```
+
+**Notebook:**
+1. Open `student-checkpoint.ipynb`.
+2. Run all cells (update any paths if needed).
+3. Export charts to the `figures/` folder for the deck.
+
+**.gitignore suggestions:** `/data/`, `*.zip`, `*.gz`, `*.db`, `*.DS_Store`
+
+---
+
+## 8) Repository Structure (suggested)
+
+```
+.
+├── README.md
+├── presentation.pptx            
+├── student-checkpoint.ipynb
+├── notebooks/                   
+├── figures/                     
+├── data/
+│   └── zippedData/              # raw inputs (not tracked by git)
+├── .gitignore
+└── LICENSE (optional)
+```
+
+---
+
+## 9) Team Collaboration Notes
+
+- Add new analyses as **separate notebooks** (e.g., `notebooks/ratings_analysis.ipynb`).
+- Export figures to `/figures` and reference them in `presentation.pptx`.
+- Use **clear commit messages** (`feat:`, `fix:`, `viz:`, `doc:`) and keep commits small.
+
+---
+
+## 10) Limitations & Next Steps
+
+**Limitations:** multi-genre overlap (directional results), ROI skewness (blockbusters), imperfect title-year joins.
+
+**Next Steps:** add t-CIs to slides; winsorized/log ROI sensitivity; integrate ratings/votes/star power/franchise; consider multivariate/regularized regression when feature set grows.
+
+---
+
+## 11) Contact
+Owner: _Your Name_ · _your.email@example.com_ · _LinkedIn URL_  
+Collaborators: _Teammate A, Teammate B, …_
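Note on `Movie_README.md`, section 3 (Data Understanding & Preparation): the title + year join, the ROI metric, and the simple multi-genre policy described there can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the notebook's actual code (the notebook uses pandas). The helper names (`parse_money`, `normalize_title`, `roi`, `summarize_by_genre`), the toy rows, the IMDB `title` key, and the exact normalization rule are all hypothetical. Only the ROI formula and the column names `movie`, `year`, `production_budget`, `worldwide_gross`, `start_year`, and `genres` come from the README and the notebook output.

```python
import re
import statistics
from collections import defaultdict


def parse_money(text):
    """Turn a currency string such as '$425,000,000' into a float."""
    return float(text.replace("$", "").replace(",", ""))


def normalize_title(title):
    """Lowercase and collapse punctuation so titles from both sources can match.

    Assumption: this rule is illustrative; the notebook may normalize differently.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def roi(budget, worldwide_gross):
    """ROI = (worldwide_gross - production_budget) / production_budget."""
    if budget <= 0:
        return None  # ROI is undefined without a positive budget
    return (worldwide_gross - budget) / budget


def summarize_by_genre(budget_rows, imdb_rows):
    """Join on (normalized title, year), then count each film once per genre."""
    meta = {(normalize_title(r["title"]), r["start_year"]): r for r in imdb_rows}
    by_genre = defaultdict(list)
    for row in budget_rows:
        match = meta.get((normalize_title(row["movie"]), row["year"]))
        if match is None:
            continue  # no title + year match, so the film is dropped
        value = roi(parse_money(row["production_budget"]),
                    parse_money(row["worldwide_gross"]))
        if value is None:
            continue  # drop rows where ROI cannot be computed
        for genre in match["genres"].split(","):
            by_genre[genre].append(value)  # "explode": one entry per genre
    return {
        genre: {
            "n": len(values),
            "mean_roi": statistics.mean(values),
            "median_roi": statistics.median(values),
            "pct_profitable": 100 * sum(v > 0 for v in values) / len(values),
        }
        for genre, values in by_genre.items()
    }


# Toy rows (made up for illustration; not taken from the datasets).
budget_rows = [
    {"movie": "Film A", "year": 2015,
     "production_budget": "$10,000,000", "worldwide_gross": "$30,000,000"},
    {"movie": "Film B!", "year": 2016,
     "production_budget": "$20,000,000", "worldwide_gross": "$10,000,000"},
    {"movie": "Film C", "year": 2017,
     "production_budget": "$0", "worldwide_gross": "$5,000,000"},
]
imdb_rows = [
    {"title": "film a", "start_year": 2015, "genres": "Action,Comedy"},
    {"title": "Film B", "start_year": 2016, "genres": "Comedy"},
    {"title": "Film C", "start_year": 2017, "genres": "Drama"},
]

summary = summarize_by_genre(budget_rows, imdb_rows)
for genre, stats in sorted(summary.items()):
    print(genre, stats)
```

Because each film contributes to every genre it is labeled with, the per-genre figures overlap and are directional only, as the README notes. In pandas, the same steps would typically map onto `Series.str.replace` with `pd.to_numeric`, `DataFrame.merge`, and `DataFrame.explode`.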
diff --git a/Untitled.ipynb b/Untitled.ipynb
new file mode 100644
index 00000000..8bee5bdd
--- /dev/null
+++ b/Untitled.ipynb
@@ -0,0 +1,32 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python (learn-env)",
+   "language": "python",
+   "name": "learn-env"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/index.ipynb b/index.ipynb
index 3623bc14..d1510ab0 100644
--- a/index.ipynb
+++ b/index.ipynb
@@ -2,7 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "5d35b2b4",
    "metadata": {},
    "source": [
     "# Phase 2 Project Description"
@@ -10,7 +9,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b5e9e179",
    "metadata": {},
    "source": [
     "You've made it through the second phase of this course, and now you will put your new skills to use with a large end-of-Phase project!\n",
@@ -25,7 +23,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "58851385",
    "metadata": {},
    "source": [
     "## Project Overview"
@@ -33,7 +30,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "6f37995f",
    "metadata": {},
    "source": [
     "For this project, you will use exploratory data analysis to generate insights for a business stakeholder."
@@ -41,7 +37,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "8b0f1668",
    "metadata": {},
    "source": [
     "### Business Problem"
@@ -49,7 +44,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dce55d1d",
    "metadata": {},
    "source": [
     "Your company now sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of your company's new movie studio can use to help decide what type of films to create."
@@ -57,7 +51,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d3d557bf",
    "metadata": {},
    "source": [
     "### The Data"
@@ -65,7 +58,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ca34efb7",
    "metadata": {},
    "source": [
     "In the folder `zippedData` are movie datasets from:\n",
@@ -93,7 +85,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5ace6e4f",
    "metadata": {},
    "source": [
     "### Key Points"
@@ -101,7 +92,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "c9d2edeb",
    "metadata": {},
    "source": [
     "* **Your analysis should yield three concrete business recommendations.** The ultimate purpose of exploratory analysis is not just to learn about the data, but to help an organization perform better. Explicitly relate your findings to business needs by recommending actions that you think the business should take.\n",
@@ -113,7 +103,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "474e2ec3",
    "metadata": {},
    "source": [
     "## Deliverables"
@@ -121,7 +110,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "eaeda85f",
    "metadata": {},
    "source": [
     "There are three deliverables for this project:\n",
@@ -133,7 +121,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "a7f8e274",
    "metadata": {},
    "source": [
     "### Non-Technical Presentation"
@@ -141,7 +128,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "540d5c27",
    "metadata": {},
    "source": [
     "The non-technical presentation is a slide deck presenting your analysis to business stakeholders.\n",
@@ -183,7 +169,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d27915ba",
    "metadata": {},
    "source": [
     "### Jupyter Notebook"
@@ -191,7 +176,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "2d5d45ea",
    "metadata": {},
    "source": [
     "The Jupyter Notebook is a notebook that uses Python and Markdown to present your analysis to a data science audience.\n",
@@ -219,7 +203,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "2027aa4c",
    "metadata": {},
    "source": [
     "### GitHub Repository"
@@ -227,7 +210,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b8057390",
    "metadata": {},
    "source": [
     "The GitHub repository is the cloud-hosted directory containing all of your project files as well as their version history.\n",
@@ -276,7 +258,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f19694e7",
    "metadata": {},
    "source": [
     "## Grading"
@@ -284,7 +265,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "06e9cfb7",
    "metadata": {},
    "source": [
     "***To pass this project, you must pass each project rubric objective.*** The project rubric objectives for Phase 2 are:\n",
@@ -296,7 +276,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "a4c04769",
    "metadata": {},
    "source": [
     "### Data Communication"
@@ -304,7 +283,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0834a4ee",
    "metadata": {},
    "source": [
     "Communication is a key \"soft skill\". In [this survey](https://www.payscale.com/data-packages/job-skills), 46% of hiring managers said that recent college grads were missing this skill.\n",
@@ -324,7 +302,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "276dff7c",
    "metadata": {},
    "source": [
     "#### Exceeds Objective"
@@ -332,7 +309,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e87c2713",
    "metadata": {},
    "source": [
     "Creates and describes appropriate visualizations for given business questions, where each visualization fulfills all elements of the checklist\n",
@@ -342,7 +318,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b4e8a4c7",
    "metadata": {},
    "source": [
     "#### Meets Objective (Passing Bar)"
@@ -350,7 +325,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "bc4e21d0",
    "metadata": {},
    "source": [
     "Creates and describes appropriate visualizations for given business questions\n",
@@ -360,7 +334,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d0403eb9",
    "metadata": {},
    "source": [
     "#### Approaching Objective"
@@ -368,7 +341,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "22dd4ad6",
    "metadata": {},
    "source": [
     "Creates visualizations that are not related to the business questions, or uses an inappropriate type of visualization\n",
@@ -380,7 +352,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "aa1b808d",
    "metadata": {},
    "source": [
     "#### Does Not Meet Objective"
@@ -388,7 +359,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "a8a64869",
    "metadata": {},
    "source": [
     "Does not submit the required number of visualizations"
@@ -396,7 +366,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "db2e0ce8",
    "metadata": {},
    "source": [
     "### Authoring Jupyter Notebooks"
@@ -404,7 +373,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "91cc89b5",
    "metadata": {},
    "source": [
     "According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), 74.1% of data scientists use a Jupyter development environment, which is more than twice the percentage of the next-most-popular IDE, Visual Studio Code. Jupyter Notebooks allow for reproducible, skim-able code documents for a data science audience. Comfort and skill with authoring Jupyter Notebooks will prepare you for job interviews, take-home challenges, and on-the-job tasks as a data scientist.\n",
@@ -416,7 +384,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b9272672",
    "metadata": {},
    "source": [
     "#### Exceeds Objective"
@@ -424,7 +391,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "efc937e5",
    "metadata": {},
    "source": [
     "Uses Markdown and code comments to create a well-organized, skim-able document that follows all best practices\n",
@@ -434,7 +400,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d01725ea",
    "metadata": {},
    "source": [
     "#### Meets Objective (Passing Bar)"
@@ -442,7 +407,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "2c854f50",
    "metadata": {},
    "source": [
     "Uses some Markdown to create an organized notebook, with an introduction at the top and a conclusion at the bottom"
@@ -450,7 +414,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3e0b3385",
    "metadata": {},
    "source": [
     "#### Approaching Objective"
@@ -458,7 +421,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "67767f89",
    "metadata": {},
    "source": [
     "Uses Markdown cells to organize, but either uses only headers and does not provide any explanations or justifications, or uses only plaintext without any headers to segment out sections of the notebook\n",
@@ -468,7 +430,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "195ef62a",
    "metadata": {},
    "source": [
     "#### Does Not Meet Objective"
@@ -476,7 +437,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "709181b9",
    "metadata": {},
    "source": [
     "Does not submit a notebook, or does not use Markdown cells at all to organize the notebook"
@@ -484,7 +444,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "290335d1",
    "metadata": {},
    "source": [
     "### Data Manipulation and Analysis with `pandas`"
@@ -492,7 +451,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "2c0aae32",
    "metadata": {},
    "source": [
     "`pandas` is a very popular data manipulation library, with over 2 million downloads on Anaconda (`conda install pandas`) and over 19 million downloads on PyPI (`pip install pandas`) at the time of this writing. In our own internal data, we see that the overwhelming majority of Flatiron School DS grads use `pandas` on the job in some capacity.\n",
@@ -510,7 +468,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e070c91b",
    "metadata": {},
    "source": [
     "#### Exceeds Objective"
@@ -518,7 +475,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "20092dcd",
    "metadata": {},
    "source": [
     "Uses `pandas` to prepare data and answer business questions in an idiomatic, performant way"
@@ -526,7 +482,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "882b158d",
    "metadata": {},
    "source": [
     "#### Meets Objective (Passing Bar)"
@@ -534,7 +489,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "c2c426e6",
    "metadata": {},
    "source": [
     "Successfully uses `pandas` to prepare data in order to answer business questions\n",
@@ -544,7 +498,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "88d1667b",
    "metadata": {},
    "source": [
     "#### Approaching Objective"
@@ -552,7 +505,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ec132034",
    "metadata": {},
    "source": [
     "Uses `pandas` to prepare data, but makes significant errors\n",
@@ -562,7 +514,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "c5e3c86b",
    "metadata": {},
    "source": [
     "#### Does Not Meet Objective"
@@ -570,7 +521,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d9566206",
    "metadata": {},
    "source": [
     "Unable to prepare data using `pandas`\n",
@@ -580,7 +530,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b0923637",
    "metadata": {},
    "source": [
     "## Getting Started"
@@ -588,7 +537,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "8e37e815",
    "metadata": {},
    "source": [
     "Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP.\n",
@@ -604,7 +552,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "290d61a5",
    "metadata": {},
    "source": [
     "## Summary"
@@ -612,7 +559,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ac002279",
    "metadata": {},
    "source": [
     "This project will give you a valuable opportunity to develop your data science skills using real-world data. The end-of-phase projects are a critical part of the program because they give you a chance to bring together all the skills you've learned, apply them to realistic projects for a business stakeholder, practice communication skills, and get feedback to help you improve. You've got this!"
@@ -635,7 +581,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.16"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,
diff --git a/student-phase2project-cg.ipynb b/student-phase2project-cg.ipynb
new file mode 100644
index 00000000..19851964
--- /dev/null
+++ b/student-phase2project-cg.ipynb
@@ -0,0 +1,1383 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Final Project Submission\n",
+    "\n",
+    "Please fill out:\n",
+    "* Student name: Catherine Gachiri\n",
+    "* Student pace: Remote\n",
+    "* Scheduled project review date/time: 14/09/2025\n",
+    "* Instructor name: Fidelis Wanalwenge\n",
+    "* Blog post URL:\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 🎬 Movie Studio Investment Analysis\n",
+    "\n",
+    "## Project Overview\n",
+    "This notebook explores movie performance data to help our company decide **what types of films to create**.  \n",
+    "We will use exploratory data analysis and statistical modeling to answer business questions about ROI.\n",
+    "\n",
+    "**Key Data Sources:**\n",
+    "- **The Numbers** (`tn.movie_budgets.csv.gz`) → Budgets & grosses (used for ROI).\n",
+    "- **IMDB** (`im.db`) → Movie metadata (genres, runtime, year).\n",
+    "- **Box Office Mojo** (`bom.movie_gross.csv.gz`) → Additional grosses (optional).\n",
+    "\n",
+    "**Goal:** Build a dataset that combines **financial data** (budgets & grosses) with **metadata** (genres, runtime, release timing) for statistical analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Business Understanding\n",
+    "Our stakeholder (the head of the new movie studio) wants to know:\n",
+    "\n",
+    "1. **Genres vs ROI** – Which genres yield the best returns?  \n",
+    "2. **Release Quarter vs ROI** – Does timing affect financial success?  \n",
+    "3. **Budget vs ROI %** – Are bigger budgets more (or less) profitable?  \n",
+    "4. **Runtime vs ROI** – Does movie length impact profitability?\n",
+    "\n",
+    "We will prepare a clean dataset to test these hypotheses."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1: Load and Inspect Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Box Office Mojo sample:\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>title</th>\n",
+       "      <th>studio</th>\n",
+       "      <th>domestic_gross</th>\n",
+       "      <th>foreign_gross</th>\n",
+       "      <th>year</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Toy Story 3</td>\n",
+       "      <td>BV</td>\n",
+       "      <td>415000000.0</td>\n",
+       "      <td>652000000</td>\n",
+       "      <td>2010</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Alice in Wonderland (2010)</td>\n",
+       "      <td>BV</td>\n",
+       "      <td>334200000.0</td>\n",
+       "      <td>691300000</td>\n",
+       "      <td>2010</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Harry Potter and the Deathly Hallows Part 1</td>\n",
+       "      <td>WB</td>\n",
+       "      <td>296000000.0</td>\n",
+       "      <td>664300000</td>\n",
+       "      <td>2010</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Inception</td>\n",
+       "      <td>WB</td>\n",
+       "      <td>292600000.0</td>\n",
+       "      <td>535700000</td>\n",
+       "      <td>2010</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Shrek Forever After</td>\n",
+       "      <td>P/DW</td>\n",
+       "      <td>238700000.0</td>\n",
+       "      <td>513900000</td>\n",
+       "      <td>2010</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                         title studio  domestic_gross  \\\n",
+       "0                                  Toy Story 3     BV     415000000.0   \n",
+       "1                   Alice in Wonderland (2010)     BV     334200000.0   \n",
+       "2  Harry Potter and the Deathly Hallows Part 1     WB     296000000.0   \n",
+       "3                                    Inception     WB     292600000.0   \n",
+       "4                          Shrek Forever After   P/DW     238700000.0   \n",
+       "\n",
+       "  foreign_gross  year  \n",
+       "0     652000000  2010  \n",
+       "1     691300000  2010  \n",
+       "2     664300000  2010  \n",
+       "3     535700000  2010  \n",
+       "4     513900000  2010  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(3387, 5)"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The Numbers sample:\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>id</th>\n",
+       "      <th>release_date</th>\n",
+       "      <th>movie</th>\n",
+       "      <th>production_budget</th>\n",
+       "      <th>domestic_gross</th>\n",
+       "      <th>worldwide_gross</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>1</td>\n",
+       "      <td>Dec 18, 2009</td>\n",
+       "      <td>Avatar</td>\n",
+       "      <td>$425,000,000</td>\n",
+       "      <td>$760,507,625</td>\n",
+       "      <td>$2,776,345,279</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2</td>\n",
+       "      <td>May 20, 2011</td>\n",
+       "      <td>Pirates of the Caribbean: On Stranger Tides</td>\n",
+       "      <td>$410,600,000</td>\n",
+       "      <td>$241,063,875</td>\n",
+       "      <td>$1,045,663,875</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>3</td>\n",
+       "      <td>Jun 7, 2019</td>\n",
+       "      <td>Dark Phoenix</td>\n",
+       "      <td>$350,000,000</td>\n",
+       "      <td>$42,762,350</td>\n",
+       "      <td>$149,762,350</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>4</td>\n",
+       "      <td>May 1, 2015</td>\n",
+       "      <td>Avengers: Age of Ultron</td>\n",
+       "      <td>$330,600,000</td>\n",
+       "      <td>$459,005,868</td>\n",
+       "      <td>$1,403,013,963</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>5</td>\n",
+       "      <td>Dec 15, 2017</td>\n",
+       "      <td>Star Wars Ep. VIII: The Last Jedi</td>\n",
+       "      <td>$317,000,000</td>\n",
+       "      <td>$620,181,382</td>\n",
+       "      <td>$1,316,721,747</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   id  release_date                                        movie  \\\n",
+       "0   1  Dec 18, 2009                                       Avatar   \n",
+       "1   2  May 20, 2011  Pirates of the Caribbean: On Stranger Tides   \n",
+       "2   3   Jun 7, 2019                                 Dark Phoenix   \n",
+       "3   4   May 1, 2015                      Avengers: Age of Ultron   \n",
+       "4   5  Dec 15, 2017            Star Wars Ep. VIII: The Last Jedi   \n",
+       "\n",
+       "  production_budget domestic_gross worldwide_gross  \n",
+       "0      $425,000,000   $760,507,625  $2,776,345,279  \n",
+       "1      $410,600,000   $241,063,875  $1,045,663,875  \n",
+       "2      $350,000,000    $42,762,350    $149,762,350  \n",
+       "3      $330,600,000   $459,005,868  $1,403,013,963  \n",
+       "4      $317,000,000   $620,181,382  $1,316,721,747  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(5782, 6)"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "IMDB Tables:\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>name</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>movie_basics</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>directors</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>known_for</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>movie_akas</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>movie_ratings</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>persons</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>principals</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>writers</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>movies_financials_ratings</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>merged_with_ratings</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                        name\n",
+       "0               movie_basics\n",
+       "1                  directors\n",
+       "2                  known_for\n",
+       "3                 movie_akas\n",
+       "4              movie_ratings\n",
+       "5                    persons\n",
+       "6                 principals\n",
+       "7                    writers\n",
+       "8  movies_financials_ratings\n",
+       "9        merged_with_ratings"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import sqlite3\n",
+    "import re\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Define paths\n",
+    "data_dir = Path('zippedData')\n",
+    "bom_path = data_dir/'bom.movie_gross.csv.gz'\n",
+    "tn_path = data_dir/'tn.movie_budgets.csv.gz'\n",
+    "imdb_path = data_dir/'im.db'\n",
+    "\n",
+    "# Load Box Office Mojo\n",
+    "bom = pd.read_csv(bom_path)\n",
+    "print(\"Box Office Mojo sample:\")\n",
+    "display(bom.head())\n",
+    "display(bom.shape)\n",
+    "\n",
+    "# Load The Numbers (budgets)\n",
+    "tn = pd.read_csv(tn_path)\n",
+    "print(\"The Numbers sample:\")\n",
+    "display(tn.head())\n",
+    "display(tn.shape)\n",
+    "\n",
+    "# Inspect IMDB tables\n",
+    "con = sqlite3.connect(imdb_path)\n",
+    "tables = pd.read_sql(\"SELECT name FROM sqlite_master WHERE type='table';\", con)\n",
+    "print(\"IMDB Tables:\")\n",
+    "display(tables)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2: Clean `The Numbers` Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>id</th>\n",
+       "      <th>release_date</th>\n",
+       "      <th>movie</th>\n",
+       "      <th>production_budget</th>\n",
+       "      <th>domestic_gross</th>\n",
+       "      <th>worldwide_gross</th>\n",
+       "      <th>year</th>\n",
+       "      <th>quarter</th>\n",
+       "      <th>ROI</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>1</td>\n",
+       "      <td>2009-12-18</td>\n",
+       "      <td>Avatar</td>\n",
+       "      <td>425000000.0</td>\n",
+       "      <td>760507625.0</td>\n",
+       "      <td>2.776345e+09</td>\n",
+       "      <td>2009</td>\n",
+       "      <td>4</td>\n",
+       "      <td>5.532577</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2</td>\n",
+       "      <td>2011-05-20</td>\n",
+       "      <td>Pirates of the Caribbean: On Stranger Tides</td>\n",
+       "      <td>410600000.0</td>\n",
+       "      <td>241063875.0</td>\n",
+       "      <td>1.045664e+09</td>\n",
+       "      <td>2011</td>\n",
+       "      <td>2</td>\n",
+       "      <td>1.546673</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>3</td>\n",
+       "      <td>2019-06-07</td>\n",
+       "      <td>Dark Phoenix</td>\n",
+       "      <td>350000000.0</td>\n",
+       "      <td>42762350.0</td>\n",
+       "      <td>1.497624e+08</td>\n",
+       "      <td>2019</td>\n",
+       "      <td>2</td>\n",
+       "      <td>-0.572108</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>4</td>\n",
+       "      <td>2015-05-01</td>\n",
+       "      <td>Avengers: Age of Ultron</td>\n",
+       "      <td>330600000.0</td>\n",
+       "      <td>459005868.0</td>\n",
+       "      <td>1.403014e+09</td>\n",
+       "      <td>2015</td>\n",
+       "      <td>2</td>\n",
+       "      <td>3.243841</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>5</td>\n",
+       "      <td>2017-12-15</td>\n",
+       "      <td>Star Wars Ep. VIII: The Last Jedi</td>\n",
+       "      <td>317000000.0</td>\n",
+       "      <td>620181382.0</td>\n",
+       "      <td>1.316722e+09</td>\n",
+       "      <td>2017</td>\n",
+       "      <td>4</td>\n",
+       "      <td>3.153696</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   id release_date                                        movie  \\\n",
+       "0   1   2009-12-18                                       Avatar   \n",
+       "1   2   2011-05-20  Pirates of the Caribbean: On Stranger Tides   \n",
+       "2   3   2019-06-07                                 Dark Phoenix   \n",
+       "3   4   2015-05-01                      Avengers: Age of Ultron   \n",
+       "4   5   2017-12-15            Star Wars Ep. VIII: The Last Jedi   \n",
+       "\n",
+       "   production_budget  domestic_gross  worldwide_gross  year  quarter       ROI  \n",
+       "0        425000000.0     760507625.0     2.776345e+09  2009        4  5.532577  \n",
+       "1        410600000.0     241063875.0     1.045664e+09  2011        2  1.546673  \n",
+       "2        350000000.0      42762350.0     1.497624e+08  2019        2 -0.572108  \n",
+       "3        330600000.0     459005868.0     1.403014e+09  2015        2  3.243841  \n",
+       "4        317000000.0     620181382.0     1.316722e+09  2017        4  3.153696  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(5782, 9)"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "tn_clean = tn.copy()\n",
+    "\n",
+    "# Convert currency columns to numeric\n",
+    "currency_cols = [\"production_budget\", \"domestic_gross\", \"worldwide_gross\"]\n",
+    "for col in currency_cols:\n",
+    "    tn_clean[col] = (tn_clean[col]\n",
+    "                     .replace(r'[$,]', '', regex=True)\n",
+    "                     .astype(float))\n",
+    "\n",
+    "# Parse release date\n",
+    "tn_clean[\"release_date\"] = pd.to_datetime(tn_clean[\"release_date\"], errors=\"coerce\")\n",
+    "tn_clean[\"year\"] = tn_clean[\"release_date\"].dt.year\n",
+    "tn_clean[\"quarter\"] = tn_clean[\"release_date\"].dt.quarter\n",
+    "\n",
+    "# Compute ROI\n",
+    "tn_clean[\"ROI\"] = (tn_clean[\"worldwide_gross\"] - tn_clean[\"production_budget\"]) / tn_clean[\"production_budget\"]\n",
+    "\n",
+    "display(tn_clean.head())\n",
+    "display(tn_clean.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3: Extract Metadata from IMDB"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>movie_id</th>\n",
+       "      <th>primary_title</th>\n",
+       "      <th>start_year</th>\n",
+       "      <th>runtime_minutes</th>\n",
+       "      <th>genres</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>tt0063540</td>\n",
+       "      <td>Sunghursh</td>\n",
+       "      <td>2013</td>\n",
+       "      <td>175.0</td>\n",
+       "      <td>Action,Crime,Drama</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>tt0066787</td>\n",
+       "      <td>One Day Before the Rainy Season</td>\n",
+       "      <td>2019</td>\n",
+       "      <td>114.0</td>\n",
+       "      <td>Biography,Drama</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>tt0069049</td>\n",
+       "      <td>The Other Side of the Wind</td>\n",
+       "      <td>2018</td>\n",
+       "      <td>122.0</td>\n",
+       "      <td>Drama</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>tt0069204</td>\n",
+       "      <td>Sabse Bada Sukh</td>\n",
+       "      <td>2018</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>Comedy,Drama</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>tt0100275</td>\n",
+       "      <td>The Wandering Soap Opera</td>\n",
+       "      <td>2017</td>\n",
+       "      <td>80.0</td>\n",
+       "      <td>Comedy,Drama,Fantasy</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "    movie_id                    primary_title  start_year  runtime_minutes  \\\n",
+       "0  tt0063540                        Sunghursh        2013            175.0   \n",
+       "1  tt0066787  One Day Before the Rainy Season        2019            114.0   \n",
+       "2  tt0069049       The Other Side of the Wind        2018            122.0   \n",
+       "3  tt0069204                  Sabse Bada Sukh        2018              NaN   \n",
+       "4  tt0100275         The Wandering Soap Opera        2017             80.0   \n",
+       "\n",
+       "                 genres  \n",
+       "0    Action,Crime,Drama  \n",
+       "1       Biography,Drama  \n",
+       "2                 Drama  \n",
+       "3          Comedy,Drama  \n",
+       "4  Comedy,Drama,Fantasy  "
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "con = sqlite3.connect(imdb_path)\n",
+    "imdb = pd.read_sql(\"\"\"\n",
+    "    SELECT movie_id, primary_title, start_year, runtime_minutes, genres\n",
+    "    FROM movie_basics\n",
+    "    WHERE start_year BETWEEN 1980 AND 2025\n",
+    "      AND primary_title IS NOT NULL\n",
+    "\"\"\", con)\n",
+    "con.close()\n",
+    "\n",
+    "imdb.head()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 4: Normalize Titles and Join Datasets"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>id</th>\n",
+       "      <th>release_date</th>\n",
+       "      <th>movie</th>\n",
+       "      <th>production_budget</th>\n",
+       "      <th>domestic_gross</th>\n",
+       "      <th>worldwide_gross</th>\n",
+       "      <th>year</th>\n",
+       "      <th>quarter</th>\n",
+       "      <th>ROI</th>\n",
+       "      <th>title_norm</th>\n",
+       "      <th>movie_id</th>\n",
+       "      <th>primary_title</th>\n",
+       "      <th>start_year</th>\n",
+       "      <th>runtime_minutes</th>\n",
+       "      <th>genres</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>1</td>\n",
+       "      <td>2009-12-18</td>\n",
+       "      <td>Avatar</td>\n",
+       "      <td>425000000.0</td>\n",
+       "      <td>760507625.0</td>\n",
+       "      <td>2.776345e+09</td>\n",
+       "      <td>2009</td>\n",
+       "      <td>4</td>\n",
+       "      <td>5.532577</td>\n",
+       "      <td>avatar</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2</td>\n",
+       "      <td>2011-05-20</td>\n",
+       "      <td>Pirates of the Caribbean: On Stranger Tides</td>\n",
+       "      <td>410600000.0</td>\n",
+       "      <td>241063875.0</td>\n",
+       "      <td>1.045664e+09</td>\n",
+       "      <td>2011</td>\n",
+       "      <td>2</td>\n",
+       "      <td>1.546673</td>\n",
+       "      <td>pirates of the caribbean on stranger tides</td>\n",
+       "      <td>tt1298650</td>\n",
+       "      <td>Pirates of the Caribbean: On Stranger Tides</td>\n",
+       "      <td>2011.0</td>\n",
+       "      <td>136.0</td>\n",
+       "      <td>Action,Adventure,Fantasy</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>3</td>\n",
+       "      <td>2019-06-07</td>\n",
+       "      <td>Dark Phoenix</td>\n",
+       "      <td>350000000.0</td>\n",
+       "      <td>42762350.0</td>\n",
+       "      <td>1.497624e+08</td>\n",
+       "      <td>2019</td>\n",
+       "      <td>2</td>\n",
+       "      <td>-0.572108</td>\n",
+       "      <td>dark phoenix</td>\n",
+       "      <td>tt6565702</td>\n",
+       "      <td>Dark Phoenix</td>\n",
+       "      <td>2019.0</td>\n",
+       "      <td>113.0</td>\n",
+       "      <td>Action,Adventure,Sci-Fi</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>4</td>\n",
+       "      <td>2015-05-01</td>\n",
+       "      <td>Avengers: Age of Ultron</td>\n",
+       "      <td>330600000.0</td>\n",
+       "      <td>459005868.0</td>\n",
+       "      <td>1.403014e+09</td>\n",
+       "      <td>2015</td>\n",
+       "      <td>2</td>\n",
+       "      <td>3.243841</td>\n",
+       "      <td>avengers age of ultron</td>\n",
+       "      <td>tt2395427</td>\n",
+       "      <td>Avengers: Age of Ultron</td>\n",
+       "      <td>2015.0</td>\n",
+       "      <td>141.0</td>\n",
+       "      <td>Action,Adventure,Sci-Fi</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>5</td>\n",
+       "      <td>2017-12-15</td>\n",
+       "      <td>Star Wars Ep. VIII: The Last Jedi</td>\n",
+       "      <td>317000000.0</td>\n",
+       "      <td>620181382.0</td>\n",
+       "      <td>1.316722e+09</td>\n",
+       "      <td>2017</td>\n",
+       "      <td>4</td>\n",
+       "      <td>3.153696</td>\n",
+       "      <td>star wars ep viii the last jedi</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   id release_date                                        movie  \\\n",
+       "0   1   2009-12-18                                       Avatar   \n",
+       "1   2   2011-05-20  Pirates of the Caribbean: On Stranger Tides   \n",
+       "2   3   2019-06-07                                 Dark Phoenix   \n",
+       "3   4   2015-05-01                      Avengers: Age of Ultron   \n",
+       "4   5   2017-12-15            Star Wars Ep. VIII: The Last Jedi   \n",
+       "\n",
+       "   production_budget  domestic_gross  worldwide_gross  year  quarter  \\\n",
+       "0        425000000.0     760507625.0     2.776345e+09  2009        4   \n",
+       "1        410600000.0     241063875.0     1.045664e+09  2011        2   \n",
+       "2        350000000.0      42762350.0     1.497624e+08  2019        2   \n",
+       "3        330600000.0     459005868.0     1.403014e+09  2015        2   \n",
+       "4        317000000.0     620181382.0     1.316722e+09  2017        4   \n",
+       "\n",
+       "        ROI                                  title_norm   movie_id  \\\n",
+       "0  5.532577                                      avatar        NaN   \n",
+       "1  1.546673  pirates of the caribbean on stranger tides  tt1298650   \n",
+       "2 -0.572108                                dark phoenix  tt6565702   \n",
+       "3  3.243841                      avengers age of ultron  tt2395427   \n",
+       "4  3.153696             star wars ep viii the last jedi        NaN   \n",
+       "\n",
+       "                                 primary_title  start_year  runtime_minutes  \\\n",
+       "0                                          NaN         NaN              NaN   \n",
+       "1  Pirates of the Caribbean: On Stranger Tides      2011.0            136.0   \n",
+       "2                                 Dark Phoenix      2019.0            113.0   \n",
+       "3                      Avengers: Age of Ultron      2015.0            141.0   \n",
+       "4                                          NaN         NaN              NaN   \n",
+       "\n",
+       "                     genres  \n",
+       "0                       NaN  \n",
+       "1  Action,Adventure,Fantasy  \n",
+       "2   Action,Adventure,Sci-Fi  \n",
+       "3   Action,Adventure,Sci-Fi  \n",
+       "4                       NaN  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(5782, 15)"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# Title normalization function\n",
+    "def normalize_title(title: str) -> str:\n",
+    "    if pd.isna(title):\n",
+    "        return np.nan\n",
+    "    title = title.lower().strip()\n",
+    "    title = re.sub(r\"\\([^)]*\\)\", \"\", title)  # remove parentheticals\n",
+    "    title = re.sub(r\"[^a-z0-9 ]\", \"\", title)   # drop punctuation\n",
+    "    title = re.sub(r\"\\s+\", \" \", title).strip()\n",
+    "    return title\n",
+    "\n",
+    "tn_clean[\"title_norm\"] = tn_clean[\"movie\"].map(normalize_title)\n",
+    "imdb[\"title_norm\"] = imdb[\"primary_title\"].map(normalize_title)\n",
+    "\n",
+    "# Bring in ratings to get numvotes\n",
+    "con = sqlite3.connect(imdb_path)\n",
+    "ratings = pd.read_sql(\"SELECT movie_id, averagerating, numvotes FROM movie_ratings;\", con)\n",
+    "con.close()\n",
+    "\n",
+    "imdb_full = (imdb.merge(ratings, on='movie_id', how='left')\n",
+    "                 .assign(numvotes=lambda d: d['numvotes'].fillna(0),\n",
+    "                         runtime_minutes=lambda d: d['runtime_minutes'].fillna(-1)))\n",
+    "\n",
+    "# Sort so the most-voted record comes first per (title_norm, start_year), then keep it; note the -1 runtime sentinel persists downstream\n",
+    "imdb_dedup = (imdb_full.sort_values(['title_norm','start_year','numvotes','runtime_minutes'],\n",
+    "                                    ascending=[True, True, False, False])\n",
+    "                        .drop_duplicates(['title_norm','start_year'], keep='first')\n",
+    "                        .drop(columns=['averagerating','numvotes']))  # keep if you need them\n",
+    "\n",
+    "# Re-join with deduped IMDB\n",
+    "movies_dedup = tn_clean.merge(\n",
+    "    imdb_dedup, left_on=['title_norm','year'], right_on=['title_norm','start_year'], how='left'\n",
+    ")\n",
+    "assert len(movies_dedup) == len(tn_clean), \"join on (title_norm, year) duplicated rows\"  # dedup keeps the left join one-to-one\n",
+    "\n",
+    "display(movies_dedup.head())\n",
+    "display(movies_dedup.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 5: Create Final Analysis Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>movie</th>\n",
+       "      <th>release_date</th>\n",
+       "      <th>year</th>\n",
+       "      <th>quarter</th>\n",
+       "      <th>production_budget</th>\n",
+       "      <th>worldwide_gross</th>\n",
+       "      <th>ROI</th>\n",
+       "      <th>runtime_minutes</th>\n",
+       "      <th>genres</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Avatar</td>\n",
+       "      <td>2009-12-18</td>\n",
+       "      <td>2009</td>\n",
+       "      <td>4</td>\n",
+       "      <td>425000000.0</td>\n",
+       "      <td>2.776345e+09</td>\n",
+       "      <td>5.532577</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Pirates of the Caribbean: On Stranger Tides</td>\n",
+       "      <td>2011-05-20</td>\n",
+       "      <td>2011</td>\n",
+       "      <td>2</td>\n",
+       "      <td>410600000.0</td>\n",
+       "      <td>1.045664e+09</td>\n",
+       "      <td>1.546673</td>\n",
+       "      <td>136.0</td>\n",
+       "      <td>Action,Adventure,Fantasy</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Dark Phoenix</td>\n",
+       "      <td>2019-06-07</td>\n",
+       "      <td>2019</td>\n",
+       "      <td>2</td>\n",
+       "      <td>350000000.0</td>\n",
+       "      <td>1.497624e+08</td>\n",
+       "      <td>-0.572108</td>\n",
+       "      <td>113.0</td>\n",
+       "      <td>Action,Adventure,Sci-Fi</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Avengers: Age of Ultron</td>\n",
+       "      <td>2015-05-01</td>\n",
+       "      <td>2015</td>\n",
+       "      <td>2</td>\n",
+       "      <td>330600000.0</td>\n",
+       "      <td>1.403014e+09</td>\n",
+       "      <td>3.243841</td>\n",
+       "      <td>141.0</td>\n",
+       "      <td>Action,Adventure,Sci-Fi</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Star Wars Ep. VIII: The Last Jedi</td>\n",
+       "      <td>2017-12-15</td>\n",
+       "      <td>2017</td>\n",
+       "      <td>4</td>\n",
+       "      <td>317000000.0</td>\n",
+       "      <td>1.316722e+09</td>\n",
+       "      <td>3.153696</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>Star Wars Ep. VII: The Force Awakens</td>\n",
+       "      <td>2015-12-18</td>\n",
+       "      <td>2015</td>\n",
+       "      <td>4</td>\n",
+       "      <td>306000000.0</td>\n",
+       "      <td>2.053311e+09</td>\n",
+       "      <td>5.710167</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>Avengers: Infinity War</td>\n",
+       "      <td>2018-04-27</td>\n",
+       "      <td>2018</td>\n",
+       "      <td>2</td>\n",
+       "      <td>300000000.0</td>\n",
+       "      <td>2.048134e+09</td>\n",
+       "      <td>5.827114</td>\n",
+       "      <td>149.0</td>\n",
+       "      <td>Action,Adventure,Sci-Fi</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>Pirates of the Caribbean: At World’s End</td>\n",
+       "      <td>2007-05-24</td>\n",
+       "      <td>2007</td>\n",
+       "      <td>2</td>\n",
+       "      <td>300000000.0</td>\n",
+       "      <td>9.634204e+08</td>\n",
+       "      <td>2.211401</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>Justice League</td>\n",
+       "      <td>2017-11-17</td>\n",
+       "      <td>2017</td>\n",
+       "      <td>4</td>\n",
+       "      <td>300000000.0</td>\n",
+       "      <td>6.559452e+08</td>\n",
+       "      <td>1.186484</td>\n",
+       "      <td>120.0</td>\n",
+       "      <td>Action,Adventure,Fantasy</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>Spectre</td>\n",
+       "      <td>2015-11-06</td>\n",
+       "      <td>2015</td>\n",
+       "      <td>4</td>\n",
+       "      <td>300000000.0</td>\n",
+       "      <td>8.796209e+08</td>\n",
+       "      <td>1.932070</td>\n",
+       "      <td>148.0</td>\n",
+       "      <td>Action,Adventure,Thriller</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                         movie release_date  year  quarter  \\\n",
+       "0                                       Avatar   2009-12-18  2009        4   \n",
+       "1  Pirates of the Caribbean: On Stranger Tides   2011-05-20  2011        2   \n",
+       "2                                 Dark Phoenix   2019-06-07  2019        2   \n",
+       "3                      Avengers: Age of Ultron   2015-05-01  2015        2   \n",
+       "4            Star Wars Ep. VIII: The Last Jedi   2017-12-15  2017        4   \n",
+       "5         Star Wars Ep. VII: The Force Awakens   2015-12-18  2015        4   \n",
+       "6                       Avengers: Infinity War   2018-04-27  2018        2   \n",
+       "7     Pirates of the Caribbean: At World’s End   2007-05-24  2007        2   \\\n",
+       "8                               Justice League   2017-11-17  2017        4   \n",
+       "9                                      Spectre   2015-11-06  2015        4   \n",
+       "\n",
+       "   production_budget  worldwide_gross       ROI  runtime_minutes  \\\n",
+       "0        425000000.0     2.776345e+09  5.532577              NaN   \n",
+       "1        410600000.0     1.045664e+09  1.546673            136.0   \n",
+       "2        350000000.0     1.497624e+08 -0.572108            113.0   \n",
+       "3        330600000.0     1.403014e+09  3.243841            141.0   \n",
+       "4        317000000.0     1.316722e+09  3.153696              NaN   \n",
+       "5        306000000.0     2.053311e+09  5.710167              NaN   \n",
+       "6        300000000.0     2.048134e+09  5.827114            149.0   \n",
+       "7        300000000.0     9.634204e+08  2.211401              NaN   \n",
+       "8        300000000.0     6.559452e+08  1.186484            120.0   \n",
+       "9        300000000.0     8.796209e+08  1.932070            148.0   \n",
+       "\n",
+       "                      genres  \n",
+       "0                        NaN  \n",
+       "1   Action,Adventure,Fantasy  \n",
+       "2    Action,Adventure,Sci-Fi  \n",
+       "3    Action,Adventure,Sci-Fi  \n",
+       "4                        NaN  \n",
+       "5                        NaN  \n",
+       "6    Action,Adventure,Sci-Fi  \n",
+       "7                        NaN  \n",
+       "8   Action,Adventure,Fantasy  \n",
+       "9  Action,Adventure,Thriller  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(5782, 9)"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "analysis_df = movies_dedup[[\n",
+    "    \"movie\", \"release_date\", \"year\", \"quarter\",\n",
+    "    \"production_budget\", \"worldwide_gross\", \"ROI\",\n",
+    "    \"runtime_minutes\", \"genres\"\n",
+    "]].copy()\n",
+    "\n",
+    "display(analysis_df.head(10))\n",
+    "analysis_df.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Base dataset: (5782, 9)\n",
+      "Genre dataset: (4030, 11)\n",
+      "Quarter dataset: (5782, 9)\n",
+      "Budget dataset: (5782, 9)\n",
+      "Runtime dataset: (1582, 9)\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Drop rows without ROI\n",
+    "df = analysis_df.dropna(subset=[\"ROI\"]).copy()\n",
+    "\n",
+    "# Extract primary genre (first listed)\n",
+    "#df[\"primary_genre\"] = df[\"genres\"].dropna().apply(lambda x: x.split(\",\")[0] if isinstance(x, str) else np.nan)\n",
+    "\n",
+    "# 1) Explode genres so a movie appears once per genre\n",
+    "df_multi = df.dropna(subset=[\"genres\"]).copy()\n",
+    "df_multi[\"genre\"] = df_multi[\"genres\"].str.split(\",\")\n",
+    "df_multi = df_multi.explode(\"genre\")\n",
+    "df_multi[\"genre\"] = df_multi[\"genre\"].str.strip()\n",
+    "\n",
+    "# 2) Cluster id: all repeated rows from same movie share this\n",
+    "# (use an actual unique id if you have it; title+year is a good fallback)\n",
+    "df_multi[\"cluster_id\"] = (\n",
+    "    df_multi[\"movie\"].str.lower().str.strip() + \"_\" + df_multi[\"year\"].astype(str)\n",
+    ")\n",
+    "\n",
+    "# (optional) keep genres with enough data\n",
+    "counts = df_multi[\"genre\"].value_counts()\n",
+    "keep = counts[counts >= 30].index\n",
+    "sub = df_multi[df_multi[\"genre\"].isin(keep)].copy()\n",
+    "\n",
+    "# Hypothesis-specific datasets\n",
+    "df_genre   = df_multi.dropna(subset=[\"genre\"])\n",
+    "df_quarter = df.dropna(subset=[\"quarter\"])\n",
+    "df_budget  = df[df[\"production_budget\"] > 0].copy()\n",
+    "df_runtime = df.dropna(subset=[\"runtime_minutes\"])\n",
+    "\n",
+    "print(\"Base dataset:\", df.shape)\n",
+    "print(\"Genre dataset:\", df_genre.shape)\n",
+    "print(\"Quarter dataset:\", df_quarter.shape)\n",
+    "print(\"Budget dataset:\", df_budget.shape)\n",
+    "print(\"Runtime dataset:\", df_runtime.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Test whether a movie's runtime is associated with its profitability (ROI) using a simple linear regression model.\n",
+    "\n",
+    "- Independent Variable (X): Runtime (minutes)\n",
+    "\n",
+    "\n",
+    "- Dependent Variable (Y): ROI (Return on Investment) \n",
+    "\n",
+    "\n",
+    "- Null Hypothesis (H₀): The regression slope is zero, i.e., there is no linear relationship between movie runtime and ROI\n",
+    "\n",
+    "\n",
+    "- Alternative Hypothesis (H₁): The regression slope is non-zero, i.e., there is a linear relationship between movie runtime and ROI"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                     primary_title  runtime_minutes       ROI\n",
+      "0                       Foodfight!             91.0 -0.998362\n",
+      "1  The Secret Life of Walter Mitty            114.0  1.064409\n",
+      "2      A Walk Among the Tombstones            114.0  1.218164\n",
+      "3                   Jurassic World            124.0  6.669092\n",
+      "4                    The Rum Diary            119.0 -0.521228\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Preparation: build a join key from lowercased title + year (exact match; IMDB duplicates are not removed here)\n",
+    "\n",
+    "tn_clean[\"title_key\"] = tn_clean[\"movie\"].str.lower().str.strip() + tn_clean[\"year\"].astype(str)\n",
+    "imdb[\"title_key\"] = imdb[\"primary_title\"].str.lower().str.strip() + imdb[\"start_year\"].astype(str)\n",
+    "\n",
+    "runtime_data = imdb.merge(tn_clean, on=\"title_key\", how=\"inner\")\n",
+    "\n",
+    "print(runtime_data[[\"primary_title\", \"runtime_minutes\", \"ROI\"]].head())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import warnings\n",
+    "warnings.simplefilter(action=\"ignore\", category=FutureWarning)\n",
+    "\n",
+    "import statsmodels.api as sm\n",
+    "\n",
+    "# Drop missing runtimes or ROIs\n",
+    "runtime_data = runtime_data.dropna(subset=[\"runtime_minutes\", \"ROI\"])\n",
+    "\n",
+    "# Define variables\n",
+    "X = runtime_data[\"runtime_minutes\"]\n",
+    "y = runtime_data[\"ROI\"]\n",
+    "\n",
+    "# Add intercept\n",
+    "X = sm.add_constant(X)\n",
+    "\n",
+    "# Fit regression\n",
+    "model = sm.OLS(y, X).fit()\n",
+    "print(model.summary())\n",
+    "\n",
+    "# 95% CI for slope\n",
+    "print(\"95% CI for slope:\", model.conf_int().loc[\"runtime_minutes\"])\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The R² value (0.001) means runtime explains almost none of the variation in ROI.\n",
+    "\n",
+    "The slope (-0.0144) is small and its p-value (0.359) is above 0.05, so the estimated effect of runtime on ROI is not statistically significant.\n",
+    "\n",
+    "The 95% confidence interval for the slope includes zero, so the true effect could be slightly positive, slightly negative, or zero; either way it is too small to matter.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Visualization\n",
+    "\n",
+    "import seaborn as sns\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "plt.figure(figsize=(10,7))\n",
+    "sns.regplot(x=\"runtime_minutes\", y=\"ROI\", data=runtime_data,\n",
+    "            scatter_kws={\"alpha\":0.5}, line_kws={\"color\":\"red\"})\n",
+    "\n",
+    "plt.title(\"Runtime vs Return on Investment with Regression Line\")\n",
+    "plt.xlabel(\"Runtime (minutes)\")\n",
+    "plt.ylabel(\"Return on Investment\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The plot above is a scatter plot of movie runtime vs ROI, with the fitted regression line drawn in red. The line is almost flat, which is consistent with the regression result that runtime has no meaningful effect on profitability.\n",
+    "\n",
+    "- Null Hypothesis (H₀): There is no relationship between movie runtime and ROI.\n",
+    "\n",
+    "- Alternative Hypothesis (H₁): There is a relationship between movie runtime and ROI.\n",
+    "\n",
+    "- Since the p-value (0.359) is greater than 0.05, we fail to reject H₀: the data show no statistically significant linear relationship between runtime and ROI."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Business Recommendation\n",
+    "\n",
+    "The analysis shows that movie length does not significantly impact profitability: whether a film runs shorter or longer has almost no effect on its return on investment (ROI).\n",
+    "\n",
+    "#### Implication for the Company:\n",
+    "\n",
+    "Runtime should not be a deciding factor when selecting or producing films. Strategic focus should shift to more influential drivers of success, such as:\n",
+    "\n",
+    " - Budget management (spending efficiently to maximize returns)\n",
+    "\n",
+    " - Release timing (launching films in profitable quarters/seasons)\n",
+    "\n",
+    " - Marketing and distribution strategies\n",
+    "\n",
+    "Advice to Stakeholder: When developing original video content, do not make movie length a priority.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}