diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..826b7aad --- /dev/null +++ b/.gitignore @@ -0,0 +1,2 @@ +zippedData/im.db +zippedData/im.db diff --git a/Group 15 Phase 2 Project_Non Technical Slide.pdf b/Group 15 Phase 2 Project_Non Technical Slide.pdf new file mode 100644 index 00000000..5d06431c Binary files /dev/null and b/Group 15 Phase 2 Project_Non Technical Slide.pdf differ diff --git a/Group15_Phase2_Project.ipynb b/Group15_Phase2_Project.ipynb new file mode 100644 index 00000000..feacf1e6 --- /dev/null +++ b/Group15_Phase2_Project.ipynb @@ -0,0 +1,3282 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data Science part time- Group 15 Phase 2 Project\n", + "\n", + "**Team Members:** \n", + "1. Diana Aloo \n", + "2. Sylvia Wambui \n", + "3. Catherine Kaino \n", + "4. Vincent Buluma \n", + "\n", + "**Scheduled Project Review:** \n", + "**Date/Time:** Sunday, 14th September 2025 \n", + "\n", + "**Instructor:** Christine Kirimi \n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Title and Business Understanding\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Business Problem\n", + "\n", + "A company wants to enter the world of original video content by creating a new movie studio. However, the company has no prior experience in movie production. \n", + "\n", + "We have been tasked with exploring existing movie data to understand which types of films perform best at the box office. Our findings should then be translated into **actionable insights** to guide the studio head in deciding what types of films to produce.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Business Understanding \n", + "\n", + "The film industry is shaped by both creative and financial factors, and stakeholders often face uncertainty in predicting how a movie will perform in the market. 
This project seeks to bridge that gap by analyzing historical movie data and uncovering the patterns that drive commercial and critical success. \n", + "\n", + "Our key objectives guiding this analysis are: \n", + "- To identify the factors that most strongly influence box office performance. \n", + "- To examine the role of genres, casts, release timing, and production budgets in determining financial outcomes. \n", + "- To explore how critical reviews and audience ratings align with or diverge from revenue performance. \n", + "- To generate evidence-based recommendations that can inform future decision-making by producers, investors, and marketers. \n", + "\n", + "By addressing these objectives, the analysis aims to provide **actionable insights** into what contributes to a successful movie release. The outcomes are not only valuable for film studios and distributors but also for stakeholders seeking to optimize investments and marketing strategies. \n", + "\n", + "This section frames the problem and highlights why the subsequent exploration of data is critical in delivering meaningful and applicable results. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Understanding \n", + "\n", + "In order to answer the project objectives, our analysis draws on multiple reputable film industry datasets. Each source contributes a different perspective on movie performance, audience reception, and financial outcomes. Together, they form a comprehensive view of the domain: \n", + "\n", + "1. **Box Office Mojo** – Provides detailed box office revenue data, both domestic and international. \n", + "2. **IMDb** – Offers extensive metadata on films, including genres, casts, crews, and ratings from a global audience. \n", + "3. **Rotten Tomatoes** – Supplies critic and audience scores, capturing critical reception and sentiment. \n", + "4. 
**The Movie Database (TMDb)** – Delivers community-driven information such as keywords, genres, release dates, and popularity metrics. \n", + "5. **The Numbers** – Tracks financial performance, including budgets and revenues, allowing deeper profitability insights. \n", + "\n", + "By integrating these sources, the project ensures that data analysis is not only based on financial outcomes but also enriched with audience feedback, critical reviews, and contextual metadata. This holistic approach strengthens the ability to link descriptive insights with actionable business recommendations. \n", + "\n", + "The following steps will involve loading these datasets, inspecting their structure, and identifying any data quality issues that need to be addressed before moving into preparation and analysis. \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setting up the working environment \n", + "\n", + "We will begin by importing all the necessary Python libraries that will be used throughout the analysis. Data handling and manipulation will rely on **pandas** and **numpy**, while **matplotlib** and **seaborn** will support data visualization. Since part of the dataset may be stored in databases or compressed files, modules such as **sqlite3** and **zipfile** are also included. Additionally, **os** is brought in for handling file paths and **warnings** for controlling warning messages. \n", + "\n", + "Bringing these libraries together at the start ensures that the environment is fully prepared for data loading, cleaning, exploration, and modeling steps that follow. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Import necessary libraries\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import sqlite3\n", + "import zipfile\n", + "import seaborn as sns\n", + "%matplotlib inline\n", + "import os\n", + "import warnings\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Loading Box Office Mojo data \n", + "\n", + "Our analysis begins with the integration of financial performance data from **Box Office Mojo**, a source that provides detailed box office revenue figures. The dataset is read directly from the compressed CSV file (`bom.movie_gross.csv.gz`) and loaded into a pandas DataFrame named `bom_movie_gross`. \n", + "\n", + "Displaying the first few rows (`.head()`) allows for a quick verification that the file was read correctly, the columns are properly aligned, and the data matches expectations in terms of structure. \n", + "\n", + "Connecting back to the project objectives, this dataset supplies the **revenue backbone** of the analysis. It is essential for understanding domestic and international box office trends, and it forms the basis upon which other datasets (IMDb, Rotten Tomatoes, TMDb, The Numbers) will later be merged to provide a complete picture of movie success. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " title studio domestic_gross \\\n", + "0 Toy Story 3 BV 415000000.0 \n", + "1 Alice in Wonderland (2010) BV 334200000.0 \n", + "2 Harry Potter and the Deathly Hallows Part 1 WB 296000000.0 \n", + "3 Inception WB 292600000.0 \n", + "4 Shrek Forever After P/DW 238700000.0 \n", + "\n", + " foreign_gross year \n", + "0 652000000 2010 \n", + "1 691300000 2010 \n", + "2 664300000 2010 \n", + "3 535700000 2010 \n", + "4 513900000 2010 " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Read the data\n", + "# Box Office Mojo \n", + "bom_movie_gross = pd.read_csv('zippedData/bom.movie_gross.csv.gz')\n", + "bom_movie_gross.head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " \n", + "### Accessing the IMDb dataset \n", + "\n", + "We extracted the IMDb database from its compressed archive (`im.db.zip`) into the working directory. \n", + "After extraction, we established a connection to the SQLite database (`im.db`) using `sqlite3`. \n", + "\n", + "This connection enabled us to explore the available tables and later query specific subsets of data such as movie titles, ratings, genres, and crew information. Incorporating IMDb at this stage is essential because it provides rich descriptive attributes that, when combined with financial and ratings data from other sources, form a more complete view of movie performance. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "\n", + "with zipfile.ZipFile(\"zippedData/im.db.zip\", \"r\") as z:\n", + " z.extractall(\"zippedData/\") # extracts the archive contents (im.db) into zippedData/" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " name\n", + "0 movie_basics\n", + "1 directors\n", + "2 known_for\n", + "3 movie_akas\n", + "4 movie_ratings\n", + "5 persons\n", + "6 principals\n", + "7 writers\n" + ] + } + ], + "source": [ + "conn = sqlite3.connect(\"zippedData/im.db\")\n", + "tables = pd.read_sql(\"SELECT name FROM sqlite_master WHERE type='table';\", conn)\n", + "print(tables)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "conn = sqlite3.connect(\"zippedData/im.db\")\n", + "\n", + "movie_basics = pd.read_sql_query(\"SELECT * FROM movie_basics;\", conn)\n", + "movie_ratings = pd.read_sql_query(\"SELECT * FROM movie_ratings;\", conn)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exploring and validating IMDb and related datasets \n", + "\n", + "After establishing the database connection, we will extract several key IMDb tables into pandas DataFrames for analysis. These include: \n", + "- **movie_basics**: movie identifiers, titles, release years, runtimes, and genres. \n", + "- **directors, writers, and principals**: metadata on creative teams and contributors. \n", + "- **movie_akas**: alternative titles across regions. \n", + "- **movie_ratings**: audience ratings and number of votes. \n", + "- **persons** and **known_for**: details about individuals and their notable works. 
\n", + "\n", + "We will then preview the data (e.g., `movie_basics.head()`, `movie_ratings.head()`) to confirm successful loading and to observe the available attributes. \n", + "Additional queries, such as ordering `movie_basics` by release year, will help us understand the range of movies covered — from historical titles to unreleased films. \n", + "\n", + "Finally, we will use `.info()` checks across multiple datasets (IMDb basics, Box Office Mojo, Rotten Tomatoes reviews, and IMDb ratings) to evaluate record counts, column structures, data types, and the presence of missing values. \n", + "\n", + "These steps are part of the **Data Understanding** phase, where we familiarize ourselves with the structure and content of each dataset before moving on to cleaning and preparation. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "movie_basics = pd.read_sql(\"SELECT * FROM movie_basics;\", conn)\n", + "\n", + "directors = pd.read_sql(\"SELECT * FROM directors;\", conn)\n", + "known_for = pd.read_sql(\"SELECT * FROM known_for;\", conn)\n", + "movie_akas = pd.read_sql(\"SELECT * FROM movie_akas;\", conn)\n", + "movie_ratings = pd.read_sql(\"SELECT * FROM movie_ratings;\", conn)\n", + "persons = pd.read_sql(\"SELECT * FROM persons;\", conn)\n", + "principals = pd.read_sql(\"SELECT * FROM principals;\", conn)\n", + "writers = pd.read_sql(\"SELECT * FROM writers;\", conn)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " movie_id primary_title original_title \\\n", + "0 tt0063540 Sunghursh Sunghursh \n", + "1 tt0066787 One Day Before the Rainy Season Ashad Ka Ek Din \n", + "2 tt0069049 The Other Side of the Wind The Other Side of the Wind \n", + "3 tt0069204 Sabse Bada Sukh Sabse Bada Sukh \n", + "4 tt0100275 The Wandering Soap Opera La Telenovela Errante \n", + "\n", + " start_year runtime_minutes genres \n", + "0 2013 175.0 Action,Crime,Drama \n", + "1 2019 114.0 Biography,Drama \n", + "2 2018 122.0 Drama \n", + "3 2018 NaN Comedy,Drama \n", + "4 2017 80.0 Comedy,Drama,Fantasy " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movie_basics.head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " movie_id primary_title \\\n", + "0 tt0146592 Pál Adrienn \n", + "1 tt0154039 So Much for Justice! \n", + "2 tt0162942 Children of the Green Dragon \n", + "3 tt0230212 The Final Journey \n", + "4 tt0312305 Quantum Quest: A Cassini Space Odyssey \n", + "... ... ... \n", + "146139 tt6149054 Fantastic Beasts and Where to Find Them 5 \n", + "146140 tt3095356 Avatar 4 \n", + "146141 tt10300398 Untitled Star Wars Film \n", + "146142 tt5637536 Avatar 5 \n", + "146143 tt5174640 100 Years \n", + "\n", + " original_title start_year \\\n", + "0 Pál Adrienn 2010 \n", + "1 Oda az igazság 2010 \n", + "2 A zöld sárkány gyermekei 2010 \n", + "3 The Final Journey 2010 \n", + "4 Quantum Quest: A Cassini Space Odyssey 2010 \n", + "... ... ... \n", + "146139 Fantastic Beasts and Where to Find Them 5 2024 \n", + "146140 Avatar 4 2025 \n", + "146141 Untitled Star Wars Film 2026 \n", + "146142 Avatar 5 2027 \n", + "146143 100 Years 2115 \n", + "\n", + " runtime_minutes genres \n", + "0 136.0 Drama \n", + "1 100.0 History \n", + "2 89.0 Drama \n", + "3 120.0 Drama \n", + "4 45.0 Adventure,Animation,Sci-Fi \n", + "... ... ... \n", + "146139 NaN Adventure,Family,Fantasy \n", + "146140 NaN Action,Adventure,Fantasy \n", + "146141 NaN Fantasy \n", + "146142 NaN Action,Adventure,Fantasy \n", + "146143 NaN Drama \n", + "\n", + "[146144 rows x 6 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.read_sql(\"\"\"\n", + "SELECT*\n", + " FROM movie_basics\n", + "ORDER BY start_year ASC;\n", + " \"\"\", conn)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " movie_id averagerating numvotes\n", + "0 tt10356526 8.3 31\n", + "1 tt10384606 8.9 559\n", + "2 tt1042974 6.4 20\n", + "3 tt1043726 4.2 50352\n", + "4 tt1060240 6.5 21" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movie_ratings.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 146144 entries, 0 to 146143\n", + "Data columns (total 6 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 movie_id 146144 non-null object \n", + " 1 primary_title 146144 non-null object \n", + " 2 original_title 146123 non-null object \n", + " 3 start_year 146144 non-null int64 \n", + " 4 runtime_minutes 114405 non-null float64\n", + " 5 genres 140736 non-null object \n", + "dtypes: float64(1), int64(1), object(4)\n", + "memory usage: 6.7+ MB\n", + "\n", + "RangeIndex: 3387 entries, 0 to 3386\n", + "Data columns (total 5 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 title 3387 non-null object \n", + " 1 studio 3382 non-null object \n", + " 2 domestic_gross 3359 non-null float64\n", + " 3 foreign_gross 2037 non-null object \n", + " 4 year 3387 non-null int64 \n", + "dtypes: float64(1), int64(1), object(3)\n", + "memory usage: 132.4+ KB\n", + "\n", + "RangeIndex: 73856 entries, 0 to 73855\n", + "Data columns (total 3 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 movie_id 73856 non-null object \n", + " 1 averagerating 73856 non-null float64\n", + " 2 numvotes 73856 non-null int64 \n", + "dtypes: float64(1), int64(1), object(1)\n", + "memory usage: 1.7+ MB\n" + ] + } + ], + "source": [ + "#check structure\n", + "movie_basics.info()\n", + "bom_movie_gross.info()\n", + "\n", + "movie_ratings.info()" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "### Assessing missing values \n", + "\n", + "As part of the Data Understanding phase, we will check each dataset for missing values to better understand data quality and identify potential cleaning needs. \n", + "\n", + "Specifically, we will use `.isnull().sum()` across: \n", + "- **IMDb movie_basics**: to detect gaps in titles, runtimes, and genres. \n", + "- **Box Office Mojo (bom_movie_gross)**: to assess missing values in revenue and studio information. \n", + "- **Rotten Tomatoes reviews**: to identify incomplete critic reviews, ratings, and publisher details. \n", + "- **IMDb movie_ratings**: to confirm the completeness of audience rating and vote counts. \n", + "\n", + "This step is crucial in highlighting areas where information is incomplete (e.g., missing runtimes, genres, foreign gross values, or critic reviews). These insights will guide the **Data Preparation phase**, where strategies such as dropping, imputing, or transforming missing values will be applied to ensure clean and reliable datasets for analysis. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "movie_id 0\n", + "averagerating 0\n", + "numvotes 0\n", + "dtype: int64" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#looking for missing values\n", + "movie_basics.isnull().sum()\n", + "bom_movie_gross.isnull().sum()\n", + "movie_ratings.isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "movie_id 0\n", + "primary_title 0\n", + "original_title 21\n", + "start_year 0\n", + "runtime_minutes 31739\n", + "genres 5408\n", + "dtype: int64" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movie_basics.isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "title 0\n", + "studio 5\n", + "domestic_gross 28\n", + "foreign_gross 1350\n", + "year 0\n", + "dtype: int64" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bom_movie_gross.isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "movie_id 0\n", + "averagerating 0\n", + "numvotes 0\n", + "dtype: int64" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movie_ratings.isnull().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Statistics and Data Exploration \n", + "\n", + "We will generate summary statistics for our key datasets using the `.describe()` function. This will help us examine central tendencies, distributions, and variability of numerical features such as revenues, ratings, votes, and runtimes. 
\n", + "\n", + "Additionally, we will inspect the structure of the `movie_basics` dataset to understand its composition, identify missing values, and evaluate the most common genres. \n", + "\n", + "These steps will allow us to spot potential issues (such as missing or inconsistent values) and trends in the data before moving into the cleaning and transformation stage. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " domestic_gross year\n", + "count 3.359000e+03 3387.000000\n", + "mean 2.874585e+07 2013.958075\n", + "std 6.698250e+07 2.478141\n", + "min 1.000000e+02 2010.000000\n", + "25% 1.200000e+05 2012.000000\n", + "50% 1.400000e+06 2014.000000\n", + "75% 2.790000e+07 2016.000000\n", + "max 9.367000e+08 2018.000000" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#summary stats\n", + "bom_movie_gross.describe()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " averagerating numvotes\n", + "count 73856.000000 7.385600e+04\n", + "mean 6.332729 3.523662e+03\n", + "std 1.474978 3.029402e+04\n", + "min 1.000000 5.000000e+00\n", + "25% 5.500000 1.400000e+01\n", + "50% 6.500000 4.900000e+01\n", + "75% 7.400000 2.820000e+02\n", + "max 10.000000 1.841066e+06" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movie_ratings.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " start_year runtime_minutes\n", + "count 146144.000000 114405.000000\n", + "mean 2014.621798 86.187247\n", + "std 2.733583 166.360590\n", + "min 2010.000000 1.000000\n", + "25% 2012.000000 70.000000\n", + "50% 2015.000000 87.000000\n", + "75% 2017.000000 99.000000\n", + "max 2115.000000 51420.000000" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movie_basics.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Top Genres Distribution \n", + "\n", + "We will visualize the distribution of movie genres by plotting the top 10 most frequent categories. This bar chart provides insights into the dominant genres in the dataset, helping us understand which types of movies are most commonly represented. \n", + "\n", + "Such exploratory visualizations are important in identifying trends and guiding deeper analysis in the later stages. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 146144 entries, 0 to 146143\n", + "Data columns (total 6 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 movie_id 146144 non-null object \n", + " 1 primary_title 146144 non-null object \n", + " 2 original_title 146123 non-null object \n", + " 3 start_year 146144 non-null int64 \n", + " 4 runtime_minutes 114405 non-null float64\n", + " 5 genres 140736 non-null object \n", + "dtypes: float64(1), int64(1), object(4)\n", + "memory usage: 6.7+ MB\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Use movie_basics to show trends in movie production and genre popularity.\n", + "#movie_basics\n", + "movie_basics.head()\n", + "movie_basics.info()\n", + "movie_basics.isnull().sum()\n", + "\n", + "\n", + "# Top genres\n", + "movie_basics['genres'].value_counts().head(10).plot(kind='bar', figsize=(10,5), title=\"Top Genres\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Closing Data Understanding \n", + "\n", + "At this stage, we have: \n", + "- Generated descriptive statistics for numerical fields. \n", + "- Examined dataset structures and identified missing values. \n", + "- Explored the most common genres through visualization. \n", + "\n", + "This concludes the **Data Understanding** phase. Next, we will move into the **Data Preparation** phase, where we will clean and transform the datasets to ensure consistency, handle missing values, and create new variables needed for further analysis. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Preparation: Cleaning Gross Revenue Data \n", + "\n", + "We will now begin the **Data Preparation** phase by cleaning and transforming the `bom_movie_gross` dataset. This step ensures that revenue fields are consistent and ready for analysis. \n", + "\n", + "- **Foreign Gross**: Convert values to numeric by removing commas and casting to `float`. pandas already reads blanks and `\"NA\"` as `NaN` on load, so the `0` replacements do not apply and missing values stay `NaN`. \n", + "- **Domestic Gross**: Fill missing values with `0` to avoid gaps in calculations. \n", + "- **Worldwide Gross**: Create a new column by summing `domestic_gross` and `foreign_gross` to capture the total revenue for each movie; the result is `NaN` wherever `foreign_gross` is missing. \n", + "\n", + "This transformation will provide us with a reliable metric (`worldwide_gross`) that can be used in subsequent exploratory and comparative analyses. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1.067000e+09\n", + "1 1.025500e+09\n", + "2 9.603000e+08\n", + "3 8.283000e+08\n", + "4 7.526000e+08\n", + " ... \n", + "3382 NaN\n", + "3383 NaN\n", + "3384 NaN\n", + "3385 NaN\n", + "3386 NaN\n", + "Name: worldwide_gross, Length: 3387, dtype: float64" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#movie_gross EDA\n", + "# Clean foreign_gross (convert to float)\n", + "bom_movie_gross['foreign_gross'] = (\n", + "bom_movie_gross['foreign_gross']\n", + " .replace('', 0) # blanks to 0\n", + " .replace('NA', 0) # NA to 0\n", + " .astype(str) # ensure string before replace\n", + " .str.replace(',', '', regex=True) # remove commas\n", + " .astype(float)\n", + ")\n", + "\n", + "# Fill missing domestic_gross with 0\n", + "bom_movie_gross['domestic_gross'] = bom_movie_gross['domestic_gross'].fillna(0)\n", + "\n", + "# Create worldwide_gross\n", + "bom_movie_gross['worldwide_gross'] = bom_movie_gross['domestic_gross'] + bom_movie_gross['foreign_gross']\n", + "bom_movie_gross['worldwide_gross']" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
titleyearstudioworldwide_gross
727Marvel's The Avengers2012BV1.518900e+09
1875Avengers: Age of Ultron2015BV1.405400e+09
3080Black Panther2018BV1.347000e+09
328Harry Potter and the Deathly Hallows Part 22011WB1.341500e+09
2758Star Wars: The Last Jedi2017BV1.332600e+09
3081Jurassic World: Fallen Kingdom2018Uni.1.309500e+09
1127Frozen2013BV1.276400e+09
2759Beauty and the Beast (2017)2017BV1.263500e+09
3082Incredibles 22018BV1.242800e+09
1128Iron Man 32013BV1.214800e+09
\n", + "
" + ], + "text/plain": [ + " title year studio \\\n", + "727 Marvel's The Avengers 2012 BV \n", + "1875 Avengers: Age of Ultron 2015 BV \n", + "3080 Black Panther 2018 BV \n", + "328 Harry Potter and the Deathly Hallows Part 2 2011 WB \n", + "2758 Star Wars: The Last Jedi 2017 BV \n", + "3081 Jurassic World: Fallen Kingdom 2018 Uni. \n", + "1127 Frozen 2013 BV \n", + "2759 Beauty and the Beast (2017) 2017 BV \n", + "3082 Incredibles 2 2018 BV \n", + "1128 Iron Man 3 2013 BV \n", + "\n", + " worldwide_gross \n", + "727 1.518900e+09 \n", + "1875 1.405400e+09 \n", + "3080 1.347000e+09 \n", + "328 1.341500e+09 \n", + "2758 1.332600e+09 \n", + "3081 1.309500e+09 \n", + "1127 1.276400e+09 \n", + "2759 1.263500e+09 \n", + "3082 1.242800e+09 \n", + "1128 1.214800e+09 " + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Top 5 movies by worldwide gross\n", + "bom_movie_gross[['title', 'year', 'studio', 'worldwide_gross']].sort_values(by='worldwide_gross', ascending=False).head(10)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "257" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Unique studios\n", + "bom_movie_gross['studio'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
yearworldwide_gross
020102.452710e+10
120112.558675e+10
220122.773575e+10
320132.717124e+10
420142.711995e+10
520152.594883e+10
620162.998873e+10
720173.056770e+10
820182.823885e+10
\n", + "
" + ], + "text/plain": [ + " year worldwide_gross\n", + "0 2010 2.452710e+10\n", + "1 2011 2.558675e+10\n", + "2 2012 2.773575e+10\n", + "3 2013 2.717124e+10\n", + "4 2014 2.711995e+10\n", + "5 2015 2.594883e+10\n", + "6 2016 2.998873e+10\n", + "7 2017 3.056770e+10\n", + "8 2018 2.823885e+10" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Total worldwide gross per year\n", + "yearly_gross = bom_movie_gross.groupby('year')['worldwide_gross'].sum().reset_index()\n", + "\n", + "yearly_gross\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
studioworldwide_gross
36BV4.419038e+10
93Fox3.098037e+10
246WB3.079150e+10
238Uni.2.974681e+10
215Sony2.240483e+10
.........
170P/1080.000000e+00
167OutF0.000000e+00
165Orion0.000000e+00
54CineGalaxy0.000000e+00
49CLF0.000000e+00
\n", + "

257 rows × 2 columns

\n", + "
" + ], + "text/plain": [ + " studio worldwide_gross\n", + "36 BV 4.419038e+10\n", + "93 Fox 3.098037e+10\n", + "246 WB 3.079150e+10\n", + "238 Uni. 2.974681e+10\n", + "215 Sony 2.240483e+10\n", + ".. ... ...\n", + "170 P/108 0.000000e+00\n", + "167 OutF 0.000000e+00\n", + "165 Orion 0.000000e+00\n", + "54 CineGalaxy 0.000000e+00\n", + "49 CLF 0.000000e+00\n", + "\n", + "[257 rows x 2 columns]" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Total worldwide gross per studio\n", + "studio_gross = bom_movie_gross.groupby('studio')['worldwide_gross'].sum().reset_index().sort_values(by='worldwide_gross', ascending=False)\n", + "studio_gross" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Insights from Gross Revenue Analysis \n", + "\n", + "From the cleaning and exploratory steps above, we have begun addressing our project objectives of identifying **financial performance patterns** in the movie industry. \n", + "\n", + "**Key findings so far:** \n", + "- The **Top 10 movies by worldwide gross** are dominated by blockbuster franchises such as *The Avengers*, *Harry Potter*, *Star Wars*, and Disney animations. This highlights the strong financial impact of established intellectual properties and studios with large budgets. \n", + "- There are **257 unique studios** represented, but a small number of major studios capture the majority of global revenues. \n", + "- **Yearly worldwide gross** shows relatively stable performance between 2010–2018, with peaks in 2016 and 2017 surpassing \\$30 billion in combined box office revenues. \n", + "- **Studio-level revenues** reveal that **BV (Buena Vista/Disney)** leads with over \\$44 billion during this period, followed closely by Fox, Warner Bros (WB), Universal (Uni.), and Sony. This underscores the market dominance of a few key players in shaping global box office success. 
\n", + "\n", + "**How this relates to our objectives:** \n", + "- We have identified **which studios earn the highest cumulative worldwide gross**, which helps answer the question of *which production companies should be considered for partnerships or emulation*. \n", + "- We have seen that **franchises and established IPs dominate the top of the worldwide box office**, providing guidance for strategic investment decisions in movie production. \n", + "- We have shown **year-over-year trends in revenue**, helping us understand how stable the industry is (annual totals cannot reveal within-year seasonality). \n", + "\n", + "These insights provide a solid financial foundation, which we will enrich further by integrating other dimensions such as ratings, genres, and critical reception in the upcoming analysis. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualizing Global Revenue Trends and Preparing Merged Dataset \n", + "\n", + "To better interpret our findings, we visualize the trends in worldwide gross revenue across years and identify the top-performing studios. Visuals help reveal industry patterns more clearly than tables alone. \n", + "\n", + "- The **line plot** shows how worldwide box office revenue has evolved annually from 2010 to 2018. \n", + "- The **bar chart** highlights the dominance of the top 10 studios by cumulative worldwide gross. \n", + "\n", + "After the visual analysis, we proceed to **merge datasets** so we can enrich the financial data with additional context: \n", + "- First, we merge **movie_basics** with **movie_ratings** to combine metadata (titles, years, runtimes) with IMDb audience ratings. \n", + "- Next, we merge this result with **movie_gross** (using titles) to integrate box office performance. 
\n", + "\n", + "This merged dataset has 3,027 rows and now contains the **key dimensions** (financials, ratings, and movie basics), allowing us to perform more holistic analysis in the following steps. Because the join is on title alone, films that share a title can match more than once, so the row count overstates the number of distinct movies. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Worldwide gross trend over years\n", + "plt.figure(figsize=(10,6))\n", + "plt.plot(yearly_gross['year'], yearly_gross['worldwide_gross'], marker='o')\n", + "plt.title(\"Worldwide Gross by Year\")\n", + "plt.xlabel(\"Year\")\n", + "plt.ylabel(\"Total Worldwide Gross\")\n", + "plt.show()\n", + "\n", + "# Top 10 studios by worldwide gross\n", + "top_studios = studio_gross.head(10)\n", + "plt.figure(figsize=(12,6))\n", + "plt.bar(top_studios['studio'], top_studios['worldwide_gross'])\n", + "plt.xticks(rotation=45)\n", + "plt.title(\"Top 10 Studios by Worldwide Gross\")\n", + "plt.ylabel(\"Worldwide Gross\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Movie basics + ratings: 73856\n", + "Movie gross: 3387\n", + "Merged movies_gross: 3027\n" + ] + } + ], + "source": [ + "# Merge basics + ratings first\n", + "movies = movie_basics.merge(movie_ratings, on='movie_id', how='inner')\n", + "\n", + "# Merge with movie_gross on title\n", + "bom_movies_gross = movies.merge(bom_movie_gross, left_on='primary_title', right_on='title', how='inner')\n", + "\n", + "# Now check counts\n", + "print(\"Movie basics + ratings:\", len(movies))\n", + "print(\"Movie gross:\", len(bom_movie_gross))\n", + "print(\"Merged movies_gross:\", len(bom_movies_gross))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Interpreting Ratings vs. Box Office and Top Performers \n", + "\n", + "From the scatter plot of **IMDb average ratings vs worldwide gross**, we observe: \n", + "- There is **no strong direct correlation** between ratings and revenue. 
While some high-grossing movies (e.g., Avengers titles, Frozen, Jurassic World) have moderate ratings of roughly 6–7.5, they still dominate box office performance. \n", + "- Conversely, highly rated films (e.g., *Inception*, *Interstellar*, *Whiplash*) may not always reach the billion-dollar mark but secure strong critical and audience acclaim. \n", + "\n", + "The **top 10 grossing movies** confirm that global blockbusters are typically franchise-based (Marvel, Star Wars, Disney animations). Their success appears to be **driven by brand power, distribution, and marketing**, rather than exceptional ratings. Note that *Frozen* appears three times with different ratings because the title-based merge matched several IMDb entries. \n", + "\n", + "The **top 10 rated movies** (with more than 1000 votes) highlight a different trend: smaller or niche films (*Senna*, *Whiplash*, *Burn the Stage*) achieve critical and audience recognition but earn far less in gross revenue. A few exceptions like *Inception* and *Interstellar* demonstrate the rare cases where **strong ratings and high grosses overlap**. \n", + "\n", + "#### Link to Objectives Achieved So Far \n", + "- **Do high ratings drive box office revenue?** → Not directly. While ratings contribute to a film’s reputation, revenue is more influenced by studio backing, franchise power, and marketing. \n", + "- **Which studios dominate earnings?** → Disney (BV), Fox, Warner Bros, and Universal are the clear leaders, together accounting for a little over half of the worldwide gross in this dataset. \n", + "- **How has worldwide gross trended over time?** → Overall growth from about \\$24.5 billion in 2010 to a peak of about \\$30.6 billion in 2017, with a dip in 2013–2015 and a decline to about \\$28.2 billion in 2018. This shows overall expansion of the industry during the period, though not a steady one. 
\n", + "\n", + "In the next phase, we will extend this analysis by looking at **genres** (from `movie_basics`) to identify which categories of movies best balance both strong ratings and high revenue. This will help answer the final objective regarding **genre performance**. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "#Ratings vs Box Office\n", + "import matplotlib.pyplot as plt\n", + "\n", + "plt.figure(figsize=(8,6))\n", + "plt.scatter(bom_movies_gross['averagerating'], bom_movies_gross['worldwide_gross'], alpha=0.3)\n", + "plt.yscale('log')\n", + "plt.xlabel(\"Average Rating\")\n", + "plt.ylabel(\"Worldwide Gross (log scale)\")\n", + "plt.title(\"Box Office vs IMDB Ratings\")\n", + "plt.show()\n", + "#You’ll see if higher ratings correspond to bigger grosses." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Top Grossing Movies:\n", + " primary_title averagerating worldwide_gross\n", + "1907 Avengers: Age of Ultron 7.3 1.405400e+09\n", + "1301 Black Panther 7.3 1.347000e+09\n", + "1984 Star Wars: The Last Jedi 7.1 1.332600e+09\n", + "2703 Jurassic World: Fallen Kingdom 6.2 1.309500e+09\n", + "460 Frozen 5.4 1.276400e+09\n", + "461 Frozen 7.5 1.276400e+09\n", + "459 Frozen 6.2 1.276400e+09\n", + "2373 Incredibles 2 7.7 1.242800e+09\n", + "424 Iron Man 3 7.2 1.214800e+09\n", + "1789 Minions 6.4 1.159400e+09\n", + "\n", + "Top Rated Movies:\n", + " primary_title averagerating worldwide_gross\n", + "3026 Burn the Stage: The Movie 8.8 20300000.0\n", + "514 Inception 8.8 828300000.0\n", + "508 Coriolanus 8.7 1072000.0\n", + "101 Interstellar 8.6 677400000.0\n", + "594 Senna 8.6 8200000.0\n", + "1412 Joker 8.5 NaN\n", + "2744 Dangal 8.5 302900000.0\n", + "73 Samsara 8.5 NaN\n", + "2542 Avengers: Infinity War 8.5 678801369.5\n", + "2013 Whiplash 8.5 49000000.0\n" + ] + } + ], + "source": [ + "#Top Movies by Gross + Ratings\n", + "# Top 10 by worldwide gross\n", + "top_grossing = bom_movies_gross.sort_values(by='worldwide_gross', ascending=False).head(10)[['primary_title','averagerating','worldwide_gross']]\n", + "print(\"Top Grossing Movies:\\n\", 
top_grossing)\n", + "\n", + "# Top 10 by rating (more than 1000 votes, to filter out noisy ratings)\n", + "top_rated = bom_movies_gross[bom_movies_gross['numvotes'] > 1000].sort_values(by='averagerating', ascending=False).head(10)[['primary_title','averagerating','worldwide_gross']]\n", + "print(\"\\nTop Rated Movies:\\n\", top_rated)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The merged data lets us answer business-style questions such as:\n", + "\n", + "Do high ratings drive box office revenue?\n", + "\n", + "Which studios dominate earnings?\n", + "\n", + "How has worldwide gross trended over time?\n", + "\n", + "Which genres (from movie_basics) balance both good ratings + revenue?" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
primary_titlegenresaverageratingworldwide_gross
0WazirAction7.1NaN
0WazirCrime7.1NaN
0WazirDrama7.1NaN
1On the RoadAdventure6.18744000.0
1On the RoadDrama6.18744000.0
\n", + "
" + ], + "text/plain": [ + " primary_title genres averagerating worldwide_gross\n", + "0 Wazir Action 7.1 NaN\n", + "0 Wazir Crime 7.1 NaN\n", + "0 Wazir Drama 7.1 NaN\n", + "1 On the Road Adventure 6.1 8744000.0\n", + "1 On the Road Drama 6.1 8744000.0" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Split Movies into Individual Genres\n", + "# Step 1: Drop rows with missing genres\n", + "bom_movies_gross_genres = bom_movies_gross.dropna(subset=['genres']).copy()\n", + "\n", + "# Step 2: Split multiple genres into lists\n", + "bom_movies_gross_genres['genres'] = bom_movies_gross_genres['genres'].str.split(',')\n", + "\n", + "# Step 3: Explode so each genre gets its own row\n", + "bom_movies_gross_genres = bom_movies_gross_genres.explode('genres')\n", + "\n", + "# Quick check\n", + "bom_movies_gross_genres[['primary_title','genres','averagerating','worldwide_gross']].head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Genre Performance: Balancing Ratings and Revenue \n", + "\n", + "By splitting movies into individual genres and calculating both **average IMDb ratings** and **average worldwide revenue**, we gain insight into which genres deliver commercial success, critical acclaim, or a balance of both. \n", + "\n", + "- **High Revenue, Moderate Ratings**: \n", + " - Action, Adventure, and Sci-Fi dominate the box office with average grosses in the $200–330M range, but their ratings hover around 6.2–6.5. These are the **blockbuster genres**, strong in revenue but not always critically acclaimed. \n", + "\n", + "- **High Ratings, Lower Revenue**: \n", + " - Genres like Documentary, Biography, and niche combinations (e.g., *Comedy, Documentary, Fantasy* with ratings above 9) achieve strong audience appreciation but generally attract smaller box office returns. 
\n", + "\n", + "- **Balanced Sweet Spot**: \n", + " - Animation and Sport emerge as genres with both **above-average ratings (~6.7–6.9)** and **solid revenue performance**. This suggests they are capable of satisfying both audiences and the market, although the Sport figures rest on only 53 films. \n", + " - Family films earn solid revenue (about \\$158M on average), but their average rating (~6.2) is no higher than that of Action films. \n", + "\n", + "In summary, we can see that **blockbusters (Action/Adventure/Sci-Fi)** secure revenue dominance, while **Documentaries and niche dramas** achieve high ratings. However, **Animation and Sport films represent the genres that best balance strong ratings with solid box office performance**, aligning with our objective of identifying genres that perform well on both fronts. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
genresavg_ratingavg_revenuecount
17Sci-Fi6.4511113.387682e+08135
1Adventure6.4783603.224918e+08439
2Animation6.7000003.124535e+08152
0Action6.2752322.257417e+08646
9Fantasy6.2423532.112001e+08170
8Family6.2247861.578751e+08117
13Musical6.3166671.285734e+0818
4Comedy6.2476241.260030e+08926
18Sport6.8679251.235215e+0853
19Thriller6.1726271.166945e+08453
\n", + "
" + ], + "text/plain": [ + " genres avg_rating avg_revenue count\n", + "17 Sci-Fi 6.451111 3.387682e+08 135\n", + "1 Adventure 6.478360 3.224918e+08 439\n", + "2 Animation 6.700000 3.124535e+08 152\n", + "0 Action 6.275232 2.257417e+08 646\n", + "9 Fantasy 6.242353 2.112001e+08 170\n", + "8 Family 6.224786 1.578751e+08 117\n", + "13 Musical 6.316667 1.285734e+08 18\n", + "4 Comedy 6.247624 1.260030e+08 926\n", + "18 Sport 6.867925 1.235215e+08 53\n", + "19 Thriller 6.172627 1.166945e+08 453" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Average Rating + Revenue per Genre\n", + "genre_stats = bom_movies_gross_genres.groupby('genres').agg(\n", + " avg_rating=('averagerating','mean'),\n", + " avg_revenue=('worldwide_gross','mean'),\n", + " count=('primary_title','count')\n", + ").reset_index()\n", + "\n", + "genre_stats.sort_values(by='avg_revenue', ascending=False).head(10)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some genres may make a lot of money but have lower ratings (e.g., Action).\n", + "\n", + "Others may rate high but don’t make much money (e.g., Documentary).\n", + "\n", + "Some genres may balance both strong ratings + solid revenue (that’s your “sweet spot”)." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_idprimary_titleoriginal_titlestart_yearruntime_minutesgenresaverageratingnumvotes
0tt0063540SunghurshSunghursh2013175.0Action,Crime,Drama7.077
1tt0066787One Day Before the Rainy SeasonAshad Ka Ek Din2019114.0Biography,Drama7.243
2tt0069049The Other Side of the WindThe Other Side of the Wind2018122.0Drama6.94517
3tt0069204Sabse Bada SukhSabse Bada Sukh2018NaNComedy,Drama6.113
4tt0100275The Wandering Soap OperaLa Telenovela Errante201780.0Comedy,Drama,Fantasy6.5119
\n", + "
" + ], + "text/plain": [ + " movie_id primary_title original_title \\\n", + "0 tt0063540 Sunghursh Sunghursh \n", + "1 tt0066787 One Day Before the Rainy Season Ashad Ka Ek Din \n", + "2 tt0069049 The Other Side of the Wind The Other Side of the Wind \n", + "3 tt0069204 Sabse Bada Sukh Sabse Bada Sukh \n", + "4 tt0100275 The Wandering Soap Opera La Telenovela Errante \n", + "\n", + " start_year runtime_minutes genres averagerating numvotes \n", + "0 2013 175.0 Action,Crime,Drama 7.0 77 \n", + "1 2019 114.0 Biography,Drama 7.2 43 \n", + "2 2018 122.0 Drama 6.9 4517 \n", + "3 2018 NaN Comedy,Drama 6.1 13 \n", + "4 2017 80.0 Comedy,Drama,Fantasy 6.5 119 " + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#join movie basics and movie ratings\n", + "movies = movie_basics.merge(movie_ratings, on='movie_id', how='inner')\n", + "\n", + "# Quick check\n", + "movies.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "genres\n", + "Comedy,Documentary,Fantasy 9.4\n", + "Documentary,Family,Musical 9.3\n", + "History,Sport 9.2\n", + "Game-Show 9.0\n", + "Music,Mystery 9.0\n", + "Documentary,News,Sport 8.8\n", + "Drama,Fantasy,War 8.8\n", + "Comedy,Drama,Reality-TV 8.8\n", + "Drama,Short 8.8\n", + "Documentary,News,Reality-TV 8.8\n", + "Name: averagerating, dtype: float64" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#average rating by genre\n", + "movies.groupby('genres')['averagerating'].mean().sort_values(ascending=False).head(10)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "#average rating by year\n", + "movies.groupby('start_year')['averagerating'].mean().plot(\n", + " figsize=(10,6), title=\"Average IMDB Rating by Year\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "# First, join basics + ratings\n", + "movies = movie_basics.merge(movie_ratings, on='movie_id', how='inner')\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [], + "source": [ + "# Then, join with movie_gross on title\n", + "bom_movies_gross = movies.merge(bom_movie_gross, left_on='primary_title', right_on='title', how='inner')\n" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_idprimary_titleoriginal_titlestart_yearruntime_minutesgenresaverageratingnumvotestitlestudiodomestic_grossforeign_grossyearworldwide_gross
0tt0315642WazirWazir2016103.0Action,Crime,Drama7.115378WazirRelbig.1100000.0NaN2016NaN
1tt0337692On the RoadOn the Road2012124.0Adventure,Drama,Romance6.137886On the RoadIFC744000.08000000.020128744000.0
2tt4339118On the RoadOn the Road201489.0Drama6.06On the RoadIFC744000.08000000.020128744000.0
3tt5647250On the RoadOn the Road2016121.0Drama5.7127On the RoadIFC744000.08000000.020128744000.0
4tt0359950The Secret Life of Walter MittyThe Secret Life of Walter Mitty2013114.0Adventure,Comedy,Drama7.3275300The Secret Life of Walter MittyFox58200000.0129900000.02013188100000.0
\n", + "
" + ], + "text/plain": [ + " movie_id primary_title \\\n", + "0 tt0315642 Wazir \n", + "1 tt0337692 On the Road \n", + "2 tt4339118 On the Road \n", + "3 tt5647250 On the Road \n", + "4 tt0359950 The Secret Life of Walter Mitty \n", + "\n", + " original_title start_year runtime_minutes \\\n", + "0 Wazir 2016 103.0 \n", + "1 On the Road 2012 124.0 \n", + "2 On the Road 2014 89.0 \n", + "3 On the Road 2016 121.0 \n", + "4 The Secret Life of Walter Mitty 2013 114.0 \n", + "\n", + " genres averagerating numvotes \\\n", + "0 Action,Crime,Drama 7.1 15378 \n", + "1 Adventure,Drama,Romance 6.1 37886 \n", + "2 Drama 6.0 6 \n", + "3 Drama 5.7 127 \n", + "4 Adventure,Comedy,Drama 7.3 275300 \n", + "\n", + " title studio domestic_gross foreign_gross \\\n", + "0 Wazir Relbig. 1100000.0 NaN \n", + "1 On the Road IFC 744000.0 8000000.0 \n", + "2 On the Road IFC 744000.0 8000000.0 \n", + "3 On the Road IFC 744000.0 8000000.0 \n", + "4 The Secret Life of Walter Mitty Fox 58200000.0 129900000.0 \n", + "\n", + " year worldwide_gross \n", + "0 2016 NaN \n", + "1 2012 8744000.0 \n", + "2 2012 8744000.0 \n", + "3 2012 8744000.0 \n", + "4 2013 188100000.0 " + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Quick check\n", + "bom_movies_gross.head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Ratings Trend by Year and Combined Dataset Overview \n", + "\n", + "After plotting the **average IMDb rating by release year** to check whether film quality (as perceived by audiences) has shifted over time. \n", + "\n", + "**Findings from the Graph:** \n", + "- Ratings remain fairly **stable across the 2010–2018 period**, showing only minor fluctuations year to year. \n", + "- Unlike revenue, which had strong peaks and valleys driven by blockbuster releases, **audience ratings appear more consistent** over time. 
\n", + "- This indicates that while the industry’s financial performance changes annually, **perceived film quality does not fluctuate as dramatically**. \n", + "\n", + "**Merged Dataset (Basics + Ratings + Gross):** \n", + "- By combining movie basics, ratings, and gross earnings into one dataset, we can now analyze titles holistically. \n", + "- For example, *The Secret Life of Walter Mitty* pairs a **solid rating (7.3)** with a worldwide gross of about \\$188M (\\$58.2M of it domestic), while *On the Road* earned only about \\$8.7M worldwide with ratings around 6. *On the Road* also matches three different IMDb entries, because the merge is on title alone, so duplicated titles should be treated with care. \n", + "- This merged view is preferred because it allows us to connect **commercial success** (box office revenue) with **audience reception** (ratings), which is central to our project’s objectives. \n", + "\n", + "In summary, ratings show stability over time, and the combined dataset equips us to better answer questions about the balance between **audience approval** and **financial success**. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Visualization Stage: Balancing Audience Approval and Financial Success \n", + "\n", + "After exploring our datasets and establishing key merges between ratings, genres, and revenue, we now move into the visualization stage. \n", + "\n", + "The goal of these visualizations is to **bridge numerical analysis with business insights**, making it easier to identify patterns and answer our guiding questions: \n", + "- Do higher ratings drive box office revenue? \n", + "- Which genres find the “sweet spot” between strong ratings and high grosses? \n", + "- How do studios and release years influence overall performance? 
\n", + "\n", + "From our earlier exploratory data analysis, we observed that: \n", + "- **Ratings have remained relatively stable over time**, showing consistency in audience perceptions. 
\n", + "- The **combined dataset** (ratings + box office + genres) now allows us to investigate how **audience approval translates into financial outcomes**. \n", + "\n", + "By visualizing the balance between **ratings and revenue**, we can better highlight: \n", + "- Genres that earn high audience ratings but underperform financially. \n", + "- Genres that dominate commercially even with average ratings. \n", + "- The “middle ground” genres that deliver both financial strength and audience acclaim. \n", + "\n", + "These insights will support our overall objective of finding the optimal balance between **commercial success and audience recognition**. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "#Visualization — Balance Between Rating & Revenue\n", + "import matplotlib.pyplot as plt\n", + "\n", + "plt.figure(figsize=(10,7))\n", + "plt.scatter(genre_stats['avg_rating'], genre_stats['avg_revenue'], s=genre_stats['count']*2, alpha=0.6)\n", + "\n", + "for i, row in genre_stats.iterrows():\n", + " plt.text(row['avg_rating']+0.02, row['avg_revenue'], row['genres'], fontsize=9)\n", + "\n", + "plt.xlabel(\"Average Rating\")\n", + "plt.ylabel(\"Average Worldwide Gross\")\n", + "plt.title(\"Genres: Balancing Ratings vs Revenue\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Findings — Genres Balancing Ratings vs Revenue \n", + "\n", + "This scatterplot highlights the trade-off between **audience ratings** and **financial performance** across genres. \n", + "\n", + "- **Animation, Sci-Fi, and Adventure** stand out as genres that achieve both **high average revenues (>$300M)** and **decent ratings (~6.5–6.7)**, making them strong candidates for investment. \n", + "- **Action films** generate significant box office income but typically score **lower on ratings (~6.3)**, suggesting strong commercial pull despite mixed critical reception. \n", + "- **Documentaries** consistently earn the **highest ratings (~7.3)** but have relatively low commercial returns, reflecting their niche but prestigious appeal. \n", + "- **Sports films** perform moderately well in revenue and have above-average ratings (~6.9), making them promising growth opportunities. \n", + "\n", + "📌 **Insight:** \n", + "The “sweet spot” lies in genres that combine **strong ratings and healthy box office results** (e.g., Animation, Sci-Fi, Adventure). These balance both **critical approval** and **financial success**, aligning with our objective of identifying genres worth prioritizing for a sustainable portfolio. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(8,6))\n", + "plt.scatter(bom_movies_gross['averagerating'], bom_movies_gross['worldwide_gross']/1e6, alpha=0.5)\n", + "plt.title(\"IMDB Ratings vs Worldwide Gross Revenue\")\n", + "plt.xlabel(\"Average IMDB Rating\")\n", + "plt.ylabel(\"Revenue (in Millions USD)\")\n", + "plt.ylim(0, 2000) # cap at 2B for readability\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Findings — IMDB Ratings vs Worldwide Gross Revenue \n", + "\n", + "This scatterplot compares **individual movies’ average IMDB ratings** with their **worldwide gross revenue**. \n", + "\n", + "- The majority of films cluster in the **mid-range ratings (6–7.5)** with revenues below **$500M**, indicating that most films earn moderate approval and box office returns. \n", + "- A few blockbusters cross the **$1B mark**, even with **average ratings around 6–7**, showing that **star power, marketing, or franchise strength** can outweigh ratings. \n", + "- Very highly rated films (above 8) exist but are comparatively rare, and not all of them achieve massive revenue — pointing to a **disconnect between critical acclaim and financial success** in some cases. \n", + "\n", + "📌 **Insight:** \n", + "While higher ratings may improve audience trust and longevity, they are **not the sole driver of revenue**. Commercial success often depends on other factors (franchise value, distribution scale, marketing). This supports our objective of testing whether **ratings alone can predict box office performance** — evidence suggests they cannot, at least not reliably. \n", + "\n", + "➡️ Next, we refine this analysis by adding a **trendline and correlation measure** to quantify the relationship between ratings and revenue. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0, 0.5, 'Revenue (in Millions USD)')" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import numpy as np\n", + "import seaborn as sns\n", + "\n", + "plt.figure(figsize=(8,6))\n", + "\n", + "# Scatterplot\n", + "sns.scatterplot(\n", + " x='averagerating',\n", + " y=bom_movies_gross['worldwide_gross']/1e6,\n", + " data=bom_movies_gross,\n", + " alpha=0.5\n", + ")\n", + "\n", + "# Trendline (regression line)\n", + "sns.regplot(\n", + " x='averagerating',\n", + " y=bom_movies_gross['worldwide_gross']/1e6,\n", + " data=bom_movies_gross,\n", + " scatter=False,\n", + " color='red',\n", + " line_kws={'linewidth':2}\n", + ")\n", + "\n", + "# Correlation coefficient\n", + "corr = bom_movies_gross['averagerating'].corr(bom_movies_gross['worldwide_gross'])\n", + "plt.title(f\"IMDB Ratings vs Worldwide Gross Revenue\\nCorrelation: {corr:.2f}\", fontsize=14)\n", + "plt.xlabel(\"Average IMDB Rating\")\n", + "plt.ylabel(\"Revenue (in Millions USD)\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Findings — IMDB Ratings vs Worldwide Gross Revenue (with Correlation & Trendline) \n", + "\n", + "This visualization extends the previous scatterplot by adding a **regression trendline** and calculating the **correlation coefficient** between ratings and revenue. \n", + "\n", + "- The trendline shows only a **slight upward slope**, indicating that while higher-rated movies may earn more on average, the effect is **weak**. \n", + "- The computed correlation is **very low (close to 0)**, confirming that **ratings and revenue are not strongly correlated**. \n", + "- Some **high-grossing blockbusters** achieved success with average ratings, while certain **highly rated films** did not earn substantial box office returns. \n", + "\n", + "📌 **Insight:** \n", + "Critical approval (ratings) and financial performance (revenue) operate **largely independently**. 
This finding reinforces that **studios cannot rely solely on ratings as a predictor of financial success**. Factors that this dataset does not measure directly, such as franchise strength, marketing budgets, and genre trends, likely play a greater role in determining box office performance. \n", + "\n", + "➡️ With this, we address our key objective of evaluating whether **higher ratings drive higher revenues**: the evidence suggests that the relationship is **minimal**. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " primary_title genre\n", + "0 Wazir Action\n", + "0 Wazir Crime\n", + "0 Wazir Drama\n", + "1 On the Road Adventure\n", + "1 On the Road Drama\n", + "1 On the Road Romance\n", + "2 On the Road Drama\n", + "3 On the Road Drama\n", + "4 The Secret Life of Walter Mitty Adventure\n", + "4 The Secret Life of Walter Mitty Comedy\n" + ] + } + ], + "source": [ + "# Split multi-genre movies into separate rows\n", + "movies_genres = bom_movies_gross.assign(\n", + " genre=bom_movies_gross['genres'].str.split(',')\n", + ").explode('genre')\n", + "\n", + "print(movies_genres[['primary_title', 'genre']].head(10))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " genre avg_rating avg_revenue count\n", + "17 Sci-Fi 6.451111 3.387682e+08 135\n", + "1 Adventure 6.478360 3.224918e+08 439\n", + "2 Animation 6.700000 3.124535e+08 152\n", + "0 Action 6.275232 2.257417e+08 646\n", + "9 Fantasy 6.242353 2.112001e+08 170\n", + "8 Family 6.224786 1.578751e+08 117\n", + "13 Musical 6.316667 1.285734e+08 18\n", + "4 Comedy 6.247624 1.260030e+08 926\n", + "18 Sport 6.867925 1.235215e+08 53\n", + "19 Thriller 6.172627 1.166945e+08 453" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Group by genre and compute stats\n", + "genre_summary = movies_genres.groupby('genre').agg(\n", + " avg_rating=('averagerating', 'mean'),\n", + " avg_revenue=('worldwide_gross', 'mean'),\n", + " count=('movie_id', 'count')\n", + ").reset_index()\n", + "\n", + "# Sort by revenue just to check\n", + "genre_summary = genre_summary.sort_values(by='avg_revenue', ascending=False)\n", + "genre_summary.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(10,6))\n", + "sns.scatterplot(\n", + " data=genre_summary,\n", + " x='avg_rating',\n", + " y=genre_summary['avg_revenue']/1e6,\n", + " size='count',\n", + " hue='genre',\n", + " alpha=0.7,\n", + " legend=False\n", + ")\n", + "\n", + "plt.title(\"Genres: Balancing Ratings and Revenue\", fontsize=14)\n", + "plt.xlabel(\"Average IMDB Rating\")\n", + "plt.ylabel(\"Average Revenue (in Millions USD)\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Findings — Genres: Balancing Ratings and Revenue \n", + "\n", + "This scatterplot breaks down movies by **individual genres**, allowing us to compare how genres perform in terms of both **ratings** and **average revenue**. The **bubble size** represents the number of films in each genre, while colors differentiate between genres. \n", + "\n", + "- **Animation, Sci-Fi, and Adventure** emerge as top-performing genres, combining **strong box office results** (>$300M average) with **solid ratings (~6.5–6.7)**. \n", + "- **Action** is the most common genre (largest bubble) and generates high revenues, but its ratings are **comparatively lower (~6.3)**, showing mass appeal but weaker critical reception. \n", + "- **Documentaries** score the **highest ratings (~7.3)** yet remain **commercially limited**, indicating they are critical darlings but niche in revenue. \n", + "- **Sports films** perform above average in ratings (~6.9) with **moderate revenues (~$120M)**, suggesting potential for growth if given more investment and broader distribution. \n", + "\n", + "📌 **Insight:** \n", + "This analysis helps us identify genres that strike the **ideal balance between critical acclaim and financial success**. For sustainable strategy, studios should: \n", + "- Prioritize **Animation, Sci-Fi, and Adventure** for strong commercial returns. 
\n", + "- Support **Documentaries** and **Sports** for prestige, critical recognition, and niche audience engagement. \n", + "\n", + "➡️ This directly ties back to our objective of **finding the genres that balance both ratings and revenue**, highlighting where studios should focus for both profit and reputation. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Take the top 5 genres from the table\n", + "top_genres = [\n", + " {\"genre\": \"Animation\", \"avg_rating\": 6.70, \"avg_revenue\": 3.124535e+08},\n", + " {\"genre\": \"Sci-Fi\", \"avg_rating\": 6.45, \"avg_revenue\": 3.387682e+08},\n", + " {\"genre\": \"Adventure\", \"avg_rating\": 6.48, \"avg_revenue\": 3.224918e+08},\n", + " {\"genre\": \"Documentary\", \"avg_rating\": 7.29, \"avg_revenue\": 6.768393e+07},\n", + " {\"genre\": \"Sport\", \"avg_rating\": 6.87, \"avg_revenue\": 1.235215e+08},\n", + "]\n", + "\n", + "# Convert to lists\n", + "genres = [g[\"genre\"] for g in top_genres]\n", + "ratings = [g[\"avg_rating\"] for g in top_genres]\n", + "revenues = [g[\"avg_revenue\"]/1e6 for g in top_genres] # scale to millions\n", + "\n", + "# Plot side-by-side bars\n", + "fig, ax1 = plt.subplots(figsize=(10,6))\n", + "\n", + "# Ratings (left axis)\n", + "ax1.bar(genres, ratings, color=\"skyblue\", alpha=0.7, label=\"Average Rating\")\n", + "ax1.set_ylabel(\"Average Rating\", color=\"blue\")\n", + "ax1.set_ylim(0, 10)\n", + "\n", + "# Revenues (right axis)\n", + "ax2 = ax1.twinx()\n", + "ax2.plot(genres, revenues, color=\"red\", marker=\"o\", linewidth=2, label=\"Avg Revenue ($M)\")\n", + "ax2.set_ylabel(\"Average Revenue (Millions $)\", color=\"red\")\n", + "\n", + "# Title and layout\n", + "plt.title(\"Top Genres Balancing Ratings and Revenue\")\n", + "fig.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Findings — Top Genres Balancing Ratings and Revenue \n", + "\n", + "This combined bar-and-line chart highlights the **top 5 genres** that best balance audience approval (ratings) and financial performance (revenue). The blue bars represent **average ratings**, while the red line shows **average worldwide revenue (in millions USD)**. 
\n", + "\n", + "- **Sci-Fi** leads in revenue (~$339M) but with slightly below-average ratings (~6.45). Its strength lies in blockbuster franchises and mass-market appeal. \n", + "- **Animation** shows a strong balance, achieving both **high revenues (~$312M)** and **solid ratings (6.7)**. This genre is especially attractive due to its family-friendly appeal and global reach. \n", + "- **Adventure** films also perform well commercially (~$323M) with decent ratings (~6.5), making them a reliable revenue driver. \n", + "- **Documentary** films receive the **highest average ratings (7.29)** but are far less profitable (~$67M), signaling strong critical acclaim but niche market potential. \n", + "- **Sports** films combine above-average ratings (~6.9) with **moderate revenues (~$123M)**, offering room for future growth with strategic marketing. \n", + "\n", + "📌 **Insight:** \n", + "The visualization reinforces our earlier findings: \n", + "- **Animation, Sci-Fi, and Adventure** are the most profitable genres that also maintain respectable ratings, making them ideal for large-scale investment. \n", + "- **Documentary and Sport** genres, while less lucrative, contribute to **brand reputation, critical prestige, and niche audience loyalty**. \n", + "\n", + "➡️ Together, these genres provide a balanced portfolio strategy — driving both **profitability** (blockbuster genres) and **reputation** (high-quality niche genres). \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Final Recommendations \n", + "\n", + "Based on the analysis of IMDB ratings, box office revenues, genres, and studio performance, we provide the following recommendations for balancing **commercial success** and **audience approval**. \n", + "\n", + "### 1. Genre Strategy \n", + "- **Prioritize High-Return Genres** \n", + " - **Sci-Fi (~$339M avg revenue, rating 6.45)** → Franchise-driven and consistently profitable; leverage blockbuster franchises. 
\n", + " - **Animation (~$312M avg revenue, rating 6.7)** → Balances strong revenues with solid ratings; broad family appeal supports global market reach. \n", + " - **Adventure (~$323M avg revenue, rating 6.5)** → Reliable performers both domestically and internationally; solid middle ground. \n", + "\n", + "- **Maintain Prestige Genres** \n", + " - **Documentary (rating 7.29, ~$67M avg revenue)** → Highest audience approval; strengthens reputation and critical standing even though average revenue is low. \n", + " - **Sports (rating 6.9, ~$123M avg revenue)** → Above-average ratings with growth potential; could expand revenue with stronger marketing and distribution. \n", + "\n", + " **Recommendation**: Maintain a **balanced portfolio**: invest heavily in **blockbuster genres (Sci-Fi, Animation, Adventure)** for revenue while supporting **Documentary and Sport** for awards, critical acclaim, and niche audiences. \n", + "\n", + "---\n", + "\n", + "### 2. Studio Strategy \n", + "- Our analysis shows **Disney (BV), Fox, WB, and Universal dominate worldwide grosses**, collectively accounting for the majority of box office success. \n", + "- These studios achieve commercial strength by **owning long-term franchises (Marvel, Star Wars, Harry Potter, Frozen, etc.)**. \n", + "- Smaller studios generate occasional hits but cannot compete at the same scale. \n", + "\n", + " **Recommendation**: \n", + "- For major studios → double down on **franchise-driven tentpole releases**, which offer the best chance of billion-dollar grosses. \n", + "- For smaller studios → focus on **specialized high-rating genres** (Documentary, Drama, Indie) where prestige outweighs revenue. \n", + "\n", + "---\n", + "\n", + "### 3. Ratings vs Revenue Relationship \n", + "- **Correlation between ratings and revenue is minimal (close to 0)**. 
\n", + "- Many **blockbusters succeed with average ratings (~6–7)**, proving that marketing, franchise appeal, and scale drive financial outcomes more than critical acclaim. \n", + "- Highly rated films (>8) exist but often earn **modest grosses**, showing that quality alone does not ensure financial success. \n", + "\n", + " **Recommendation**: \n", + "- Do **not rely solely on ratings** as a predictor of box office potential. \n", + "- Use ratings as a **secondary indicator** of audience trust and long-term film value (streaming success, awards, cultural impact). \n", + "\n", + "---\n", + "\n", + "### 4. Time Trends \n", + "- **Worldwide gross peaked around 2017**, followed by a decline in 2018. \n", + "- **Ratings remained stable across years**, showing consistency in perceived film quality. \n", + "- Fluctuations in revenue are driven more by **blockbuster release cycles and external industry factors** (competition from streaming, saturation of franchises). \n", + "\n", + " **Recommendation**: \n", + "- Diversify release slates to **avoid overreliance on a single franchise or release year**. \n", + "- Invest in **strategic scheduling** — spacing major releases and avoiding bottlenecks that reduce profitability. \n", + "\n", + "---\n", + "\n", + "### 5. Overall Business Insight \n", + "The evidence suggests that: \n", + "- **Profitability** comes from blockbuster genres (Sci-Fi, Animation, Adventure) and franchise-driven studio strategies. \n", + "- **Prestige and critical recognition** come from high-rated but less profitable genres (Documentary, Sport, Indie Drama). \n", + "- **Ratings and revenues do not strongly correlate** — success requires balancing both financial drivers (franchises, marketing, distribution) and quality drivers (critical acclaim, awards). 
\n", + "\n", + " **Final Recommendation**: \n", + "Adopt a **dual-track strategy** — \n", + "- **Commercial Track**: Focus on scalable blockbuster genres (Sci-Fi, Animation, Adventure) to maximize box office returns. \n", + "- **Prestige Track**: Support critically acclaimed genres (Documentary, Sport, Indie Drama) to build reputation, win awards, and maintain brand trust. \n", + "\n", + "This balanced approach ensures both **short-term profitability** and **long-term sustainability**, aligning financial growth with audience approval and brand value. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/README.md b/README.md index b5e02341..116360b2 100644 --- a/README.md +++ b/README.md @@ -1,281 +1,181 @@ -# Phase 2 Project Description +# 🎬 Group 15 Project – Movie Industry Data Analysis -You've made it through the second phase of this course, and now you will put your new skills to use with a large end-of-Phase project! +This project explores movie industry data to generate actionable insights for business stakeholders interested in launching a new movie studio. The analysis identifies which types of films perform best at the box office, providing recommendations that can guide investment and content strategy decisions. 
-In this project description, we will cover: +We utilized exploratory data analysis (EDA) techniques, combining data from multiple sources including Box Office Mojo, IMDB, Rotten Tomatoes, TheNumbers, and TheMovieDB. -* [***Project Overview:***](#project-overview) the project goal, audience, and dataset -* [***Deliverables:***](#deliverables) the specific items you are required to produce for this project -* [***Grading:***](#grading) how your project will be scored -* [***Getting Started:***](#getting-started) guidance for how to begin your first project +--- +## Non Technical Presentation -## Project Overview +We have included a Canvas Non Technical Presentation for this project to summarize key points for non-technical audiences. -For this project, you will use exploratory data analysis to generate insights for a business stakeholder. +You can find the presentation PDF here: +[Group 15 Phase 2 Project Non Technical Slide.pdf](./Group%2015%20Phase%202%20Project_Non%20Technical%20Slide.pdf) -### Business Problem +Please review this file for an easy-to-understand overview of our project. -Your company now sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of your company's new movie studio can use to help decide what type of films to create. 
-### The Data +## 📌 Table of Contents -In the folder `zippedData` are movie datasets from: +- [Business Understanding](#business-understanding) +- [Data Understanding & Analysis](#data-understanding--analysis) +- [Data Highlights](#data-highlights) +- [Key Findings & Visualizations](#key-findings--visualizations) +- [Recommendations](#recommendations) +- [Conclusion](#conclusion) +- [Next Steps](#next-steps) +- [Technologies Used](#technologies-used) -* [Box Office Mojo](https://www.boxofficemojo.com/) -* [IMDB](https://www.imdb.com/) -* [Rotten Tomatoes](https://www.rottentomatoes.com/) -* [TheMovieDB](https://www.themoviedb.org/) -* [The Numbers](https://www.the-numbers.com/) +--- -Because it was collected from various locations, the different files have different formats. Some are compressed CSV (comma-separated values) or TSV (tab-separated values) files that can be opened using spreadsheet software or `pd.read_csv`, while the data from IMDB is located in a SQLite database. +## 💼 Business Understanding -![movie data erd](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-2-project-v3/main/movie_data_erd.jpeg) +### Business Problem: -Note that the above diagram shows ONLY the IMDB data. You will need to look carefully at the features to figure out how the IMDB data relates to the other provided data files. +Big companies in media are seeking success by producing original video content. Our company wants to enter this space but lacks experience in making movies. The goal is to identify what types of films are most successful at the box office so the new movie studio can make informed decisions. -It is up to you to decide what data from this to use and how to use it. If you want to make this more challenging, you can scrape websites or make API calls to get additional data. 
If you are feeling overwhelmed or behind, we recommend you use only the following data files: +### Key Objective Questions: -* `im.db.zip` - * Zipped SQLite database (you will need to unzip then query using SQLite) - * `movie_basics` and `movie_ratings` tables are most relevant -* `bom.movie_gross.csv.gz` - * Compressed CSV file (you can open without expanding the file using `pd.read_csv`) +- What genres and film types are consistently performing well? +- How do budgets and revenues relate to success? +- What trends should guide the studio’s content creation strategy? -### Key Points +--- -* **Your analysis should yield three concrete business recommendations.** The ultimate purpose of exploratory analysis is not just to learn about the data, but to help an organization perform better. Explicitly relate your findings to business needs by recommending actions that you think the business should take. +## 📊 Data Understanding & Analysis -* **Communicating about your work well is extremely important.** Your ability to provide value to an organization - or to land a job there - is directly reliant on your ability to communicate with them about what you have done and why it is valuable. Create a storyline your audience (the head of the new movie studio) can follow by walking them through the steps of your process, highlighting the most important points and skipping over the rest. +### Data Sources: -* **Use plenty of visualizations.** Visualizations are invaluable for exploring your data and making your findings accessible to a non-technical audience. Spotlight visuals in your presentation, but only ones that relate directly to your recommendations. Simple visuals are usually best (e.g. bar charts and line graphs), and don't forget to format them well (e.g. labels, titles). 
+- **Box Office Mojo:** Box office revenue +- **IMDB:** Ratings and movie metadata +- **The Numbers & TheMovieDB:** Budget and revenue data +- **Rotten Tomatoes:** Ratings and critic reviews -## Deliverables +### Data Processing: -There are three deliverables for this project: +- Cleaned missing data +- Merged multiple sources using movie titles and release years +- Transformed data for analysis (genres, revenue, ratings) -* A **non-technical presentation** -* A **Jupyter Notebook** -* A **GitHub repository** +--- -### Non-Technical Presentation +## 🔍 Data Highlights -The non-technical presentation is a slide deck presenting your analysis to business stakeholders. +- The dataset includes key information on genres, release year, revenue, and ratings. +- Multiple sources were used for reliable and enriched data. +- Analysis was conducted using Python (pandas, matplotlib, seaborn, SQLAlchemy). -* ***Non-technical*** does not mean that you should avoid mentioning the technologies or techniques that you used, it means that you should explain any mentions of these technologies and avoid assuming that your audience is already familiar with them. -* ***Business stakeholders*** means that the audience for your presentation is the company, not the class or teacher. Do not assume that they are already familiar with the specific business problem. +--- -The presentation describes the project ***goals, data, methods, and results***. It must include at least ***three visualizations*** which correspond to ***three business recommendations***. +## 📈 Key Findings & Visualizations -We recommend that you follow this structure, although the slide titles should be specific to your project: +### 1. Top Genres -1. Beginning - * Overview - * Business Understanding -2. Middle - * Data Understanding - * Data Analysis -3. 
End - * Recommendations - * Next Steps - * Thank You - * This slide should include a prompt for questions as well as your contact information (name and LinkedIn profile) +![Top Genres](./images/Top%20Genres.png) -You will give a live presentation of your slides and submit them in PDF format on Canvas. The slides should also be present in the GitHub repository you submit with a file name of `presentation.pdf`. +*Drama, Comedy, and Documentary are the most common genres in the dataset, highlighting popular content creation areas. However, high frequency does not necessarily translate to box office profitability, showing that volume alone isn’t the best indicator of commercial success.* -The graded elements of the presentation are: +--- -* Presentation Content -* Slide Style -* Presentation Delivery and Answers to Questions +### 2. Average IMDB Rating by Year -See the [Grading](#grading) section for further explanation of these elements. +![Average IMDB Rating By Year](./images/Average%20IMDB%20Rating%20By%20Year.png) -For further reading on creating professional presentations, check out: +*IMDB ratings have remained relatively stable over the years, with a slight upward trend in recent times. This suggests consistent audience perception of movie quality, despite shifts in industry trends and releases.* -* [Presentation Content](https://github.com/learn-co-curriculum/dsc-project-presentation-content) -* [Slide Style](https://github.com/learn-co-curriculum/dsc-project-slide-design) +--- -### Jupyter Notebook +### 3. Balance Between Ratings and Revenue -The Jupyter Notebook is a notebook that uses Python and Markdown to present your analysis to a data science audience. +![Balance Ratings vs Revenue](./images/Balance%20Ratings%20vs%20Revenue.png) -* ***Python and Markdown*** means that you need to construct an integrated `.ipynb` file with Markdown (headings, paragraphs, links, lists, etc.) and Python code to create a well-organized, skim-able document. 
- * The notebook kernel should be restarted and all cells run before submission, to ensure that all code is runnable in order. - * Markdown should be used to frame the project with a clear introduction and conclusion, as well as introducing each of the required elements. -* ***Data science audience*** means that you can assume basic data science proficiency in the person reading your notebook. This differs from the non-technical presentation. +*Some genres manage to balance both audience ratings and box office revenue successfully. These “sweet spot” genres represent ideal investment opportunities, blending critical acclaim with financial viability.* -Along with the presentation, the notebook also describes the project ***goals, data, methods, and results***. It must include at least ***three visualizations*** which correspond to ***three business recommendations***. +--- -You will submit the notebook in PDF format on Canvas as well as in `.ipynb` format in your GitHub repository. +### 4. Box Office vs IMDB Ratings (Scatter Plot) -The graded elements for the Jupyter Notebook are: +![Box Office VS IMDB Ratings Scatter Plot](./images/Box%20Office%20VS%20IMDB%20Ratings%20Scatter%20Plot.png) -* Business Understanding -* Data Understanding -* Data Preparation -* Data Analysis -* Visualization -* Code Quality +*This scatter plot reveals that high ratings don’t always guarantee high box office performance. Other factors such as marketing, distribution, and franchise popularity likely play significant roles in financial outcomes.* -See the [Grading](#grading) section for further explanation of these elements. +--- -### GitHub Repository +### 5. IMDB Ratings vs Worldwide Gross Revenue -The GitHub repository is the cloud-hosted directory containing all of your project files as well as their version history. 
+![IMDB Ratings Vs Worldwide Gross Revenue](./images/IMDB%20Ratings%20Vs%20Worldwide%20Gross%20Revenue.png) -This repository link will be the project link that you include on your resume, LinkedIn, etc. for prospective employers to view your work. Note that we typically recommend that 3 links are highlighted (out of 5 projects) so don't stress too much about getting this one to be perfect! There will also be time after graduation for cosmetic touch-ups. +*The weak correlation between ratings and gross revenue indicates that audience approval is just one of many variables influencing commercial success. This insight highlights the need for multifaceted strategies beyond simply making “critically acclaimed” movies.* -A professional GitHub repository has: +--- -1. `README.md` - * A file called `README.md` at the root of the repository directory, written in Markdown; this is what is rendered when someone visits the link to your repository in the browser - * This file contains these sections: - * Overview - * Business Understanding - * Include stakeholder and key business questions - * Data Understanding and Analysis - * Source of data - * Description of data - * Three visualizations (the same visualizations presented in the slides and notebook) - * Conclusion - * Summary of conclusions including three relevant findings -2. Commit history - * Progression of updates throughout the project time period, not just immediately before the deadline - * Clear commit messages - * Commits from all team members (if a group project) -3. Organization - * Clear folder structure - * Clear names of files and folders - * Easily-located notebook and presentation linked in the README -4. Notebook(s) - * Clearly-indicated final notebook that runs without errors - * Exploratory/working notebooks (can contain errors, redundant code, etc.) from all team members (if a group project) -5. 
`.gitignore` - * A file called `.gitignore` at the root of the repository directory instructs Git to ignore large, unnecessary, or private files - * Because it starts with a `.`, you will need to type `ls -a` in the terminal in order to see that it is there - * GitHub maintains a [Python .gitignore](https://github.com/github/gitignore/blob/master/Python.gitignore) that may be a useful starting point for your version of this file - * To tell Git to ignore more files, just add a new line to `.gitignore` for each new file name - * Consider adding `.DS_Store` if you are using a Mac computer, as well as project-specific file names - * If you are running into an error message because you forgot to add something to `.gitignore` and it is too large to be pushed to GitHub [this blog post](https://medium.com/analytics-vidhya/tutorial-removing-large-files-from-git-78dbf4cf83a?sk=c3763d466c7f2528008c3777192dfb95)(friend link) should help you address this +### 6. Top 10 Studios by Worldwide Gross -You wil submit a link to the GitHub repository on Canvas. +![Top 10 Studios By Worldwide Gross](./images/Top%2010%20Studios%20By%20Worldwide%20Gross.png) -See the [Grading](#grading) section for further explanation of how the GitHub repository will be graded. +*The dominance of studios like Warner Bros and Disney suggests that established distribution channels, brand loyalty, and franchise portfolios significantly contribute to box office success. New entrants must consider partnerships or strong marketing to compete.* -For further reading on creating professional notebooks and `README`s, check out [this reading](https://github.com/learn-co-curriculum/dsc-repo-readability-v2-2). +--- -## Grading +### 7. Top Genres Balancing Ratings and Revenue -***To pass this project, you must pass each project rubric objective.*** The project rubric objectives for Phase 2 are: +![Top Genres Balancing Ratings and Revenue](./images/Top%20Genres%20Balancing%20Ratings%20and%20Revenue.png) -1. 
Data Communication -2. Authoring Jupyter Notebooks -3. Data Manipulation and Analysis with `pandas` +*Action, Adventure, and Animation genres perform well both critically and financially, making them prime candidates for investment. Their global appeal, especially family-friendly Animation, provides sustainable revenue streams.* -### Data Communication +--- -Communication is a key "soft skill". In [this survey](https://www.payscale.com/data-packages/job-skills), 46% of hiring managers said that recent college grads were missing this skill. +### 8. Worldwide Gross by Year -Because "communication" can encompass such a wide range of contexts and skills, we will specifically focus our Phase 2 objective on Data Communication. We define Data Communication as: +![Worldwide Gross by Year](./images/Worldwide%20Gross%20by%20Year.png) -> Communicating basic data analysis results to diverse audiences via writing and live presentation +*Trends in global revenue reveal industry growth patterns and dips, with notable peaks around blockbuster release years. Understanding these cycles can aid in strategic timing of movie launches.* -To further define some of these terms: +--- -* By "basic data analysis" we mean that you are filtering, sorting, grouping, and/or aggregating the data in order to answer business questions. This project does not involve inferential statistics or machine learning, although descriptive statistics such as measures of central tendency are encouraged. -* By "results" we mean your ***three visualizations and recommendations***. -* By "diverse audiences" we mean that your presentation and notebook are appropriately addressing a business and data science audience, respectively. +## 💡 Recommendations -Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment. 
+- **Invest in Action, Adventure, and Animation genres:** These genres show strong performance both in revenue and audience approval, offering the best balance of profitability and critical success. Their broad appeal, particularly Animation’s family-friendly nature, ensures stable long-term revenue. -#### Exceeds Objective -Creates and describes appropriate visualizations for given business questions, where each visualization fulfills all elements of the checklist +- **Prioritize mid-to-high budget films with strong pre-release marketing:** Budget correlates with box office success, but marketing plays a crucial role in attracting audiences. Well-funded marketing campaigns paired with genre choice can maximize returns. -> This "checklist" refers to the Data Visualization checklist within the larger Phase 2 Project Checklist +- **Avoid niche genres with low ratings and revenue unless targeting specific markets:** Genres like documentaries appeal to niche audiences and garner critical acclaim but usually don’t generate large revenues. Invest here for brand prestige and awards, not direct profit. -#### Meets Objective (Passing Bar) -Creates and describes appropriate visualizations for given business questions +- **Study and emulate top-performing studios:** Established studios succeed through franchise-building, effective distribution, and brand loyalty. New studios should build strong IP (intellectual property) and seek strategic partnerships to increase market reach. -> This objective can be met even if all checklist elements are not fulfilled. For example, if there is some illegible text in one of your visualizations, you can still meet this objective +- **Balance quality with commercial appeal:** While critical acclaim is valuable, financial success depends on multiple factors like marketing, timing, and franchise power. Ratings alone should not dictate production decisions. 
-#### Approaching Objective -Creates visualizations that are not related to the business questions, or uses an inappropriate type of visualization +--- -> Even if you create very compelling visualizations, you cannot pass this objective if the visualizations are not related to the business questions +## 🧾 Conclusion -> An example of an inappropriate type of visualization would be using a line graph to show the correlation between two independent variables, when a scatter plot would be more appropriate +Our analysis provides a data-driven foundation for entering the film industry. While genre and budget play major roles, success also depends on timing, marketing, and execution. Using these insights, the company can make informed decisions on the types of films to develop and distribute. -#### Does Not Meet Objective -Does not submit the required number of visualizations +A balanced approach focusing on profitable blockbuster genres complemented by select prestige films will maximize both revenue and brand reputation. -### Authoring Jupyter Notebooks +--- -According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), 74.1% of data scientists use a Jupyter development environment, which is more than twice the percentage of the next-most-popular IDE, Visual Studio Code. Jupyter Notebooks allow for reproducible, skim-able code documents for a data science audience. Comfort and skill with authoring Jupyter Notebooks will prepare you for job interviews, take-home challenges, and on-the-job tasks as a data scientist. +## 🚀 Next Steps -The key feature that distinguishes *authoring Jupyter Notebooks* from simply *writing Python code* is the fact that Markdown cells are integrated into the notebook along with the Python cells in a notebook. You have seen examples of this throughout the curriculum, but now it's time for you to practice this yourself! +1. 
**Expand the dataset and refine analysis:** Incorporate more recent data, international markets, and streaming platform revenues to deepen insights. -Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment. +2. **Perform predictive modeling:** Use machine learning to forecast box office success based on budget, genre, cast, and marketing spend. -#### Exceeds Objective -Uses Markdown and code comments to create a well-organized, skim-able document that follows all best practices +3. **Investigate marketing and release timing:** Analyze the impact of marketing budgets and seasonal release windows on revenue. -> Refer to the [repository readability reading](https://github.com/learn-co-curriculum/dsc-repo-readability-v2-2) for more tips on best practices +4. **Develop franchise potential metrics:** Identify which movies have the best potential to launch or expand lucrative franchises. -#### Meets Objective (Passing Bar) -Uses some Markdown to create an organized notebook, with an introduction at the top and a conclusion at the bottom +5. **Conduct competitor benchmarking:** Study competitor studios’ strategies in depth to identify partnership and differentiation opportunities. -#### Approaching Objective -Uses Markdown cells to organize, but either uses only headers and does not provide any explanations or justifications, or uses only plaintext without any headers to segment out sections of the notebook +6. **Engage with domain experts:** Collaborate with film industry professionals to validate data findings and align business strategies. -> Headers in Markdown are delineated with one or more `#`s at the start of the line.
You should have a mixture of headers and plaintext (text where the line does not start with `#`) +--- -#### Does Not Meet Objective -Does not submit a notebook, or does not use Markdown cells at all to organize the notebook +## 🛠️ Technologies Used -### Data Manipulation and Analysis with `pandas` +- Python (pandas, numpy, seaborn, matplotlib) +- Jupyter Notebooks +- SQLite / SQLAlchemy +- Git & GitHub +- VS Code -`pandas` is a very popular data manipulation library, with over 2 million downloads on Anaconda (`conda install pandas`) and over 19 million downloads on PyPI (`pip install pandas`) at the time of this writing. In our own internal data, we see that the overwhelming majority of Flatiron School DS grads use `pandas` on the job in some capacity. - -Unlike in base Python, where the Zen of Python says "There should be one-- and preferably only one --obvious way to do it", there is often more than one valid way to do something in `pandas`. However there are still more efficient and less efficient ways to use it. Specifically, the best `pandas` code is *performant* and *idiomatic*. - -Performant `pandas` code utilizes methods and broadcasting rather than user-defined functions or `for` loops. For example, if you need to strip whitespace from a column containing string data, the best approach would be to use the [`pandas.Series.str.strip` method](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html) rather than writing your own function or writing a loop. Or if you want to multiply everything in a column by 100, the best approach would be to use broadcasting (e.g. `df["column_name"] * 100`) instead of a function or loop. You can still write your own functions if needed, but only after checking that there isn't a built-in way to do it. - -Idiomatic `pandas` code has variable names that are meaningful words or abbreviations in English, that are related to the purpose of the variables. 
You can still use `df` as the name of your DataFrame if there is only one main DataFrame you are working with, but as soon as you are merging multiple DataFrames or taking a subset of a DataFrame, you should use meaningful names. For example, `df2` would not be an idiomatic name, but `movies_and_reviews` could be. - -We also recommend that you rename all DataFrame columns so that their meanings are more understandable, although it is fine to have acronyms. For example, `"col1"` would not be an idiomatic name, but `"USD"` could be. - -Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment. - -#### Exceeds Objective -Uses `pandas` to prepare data and answer business questions in an idiomatic, performant way - -#### Meets Objective (Passing Bar) -Successfully uses `pandas` to prepare data in order to answer business questions - -> This includes projects that _occasionally_ use base Python when `pandas` methods would be more appropriate (such as using `enumerate()` on a DataFrame), or occasionally performs operations that do not appear to have any relevance to the business questions - -#### Approaching Objective -Uses `pandas` to prepare data, but makes significant errors - -> Examples of significant errors include: the result presented does not actually answer the stated question, the code produces errors, the code _consistently_ uses base Python when `pandas` methods would be more appropriate, or the submitted notebook contains significant quantities of code that is unrelated to the presented analysis (such as copy/pasted code from the curriculum or StackOverflow) - -#### Does Not Meet Objective -Unable to prepare data using `pandas` - -> This includes projects that successfully answer the business questions, but do not use `pandas` (e.g. 
use only base Python, or use some other tool like R, Tableau, or Excel) - -## Getting Started - -Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP. - -Next, you will need to complete the [***Project Proposal***](#project_proposal) which must be reviewed by your instructor before you can continue with the project. - -Then, you will need to create a GitHub repository. There are three options: - -1. Look at the [Phase 2 Project Templates and Examples repo](https://github.com/learn-co-curriculum/dsc-project-template) and follow the directions in the MVP branch. -2. Fork the [Phase 2 Project Repository](https://github.com/learn-co-curriculum/dsc-phase-2-project-v3), clone it locally, and work in the `student.ipynb` file. Make sure to also add and commit a PDF of your presentation to your repository with a file name of `presentation.pdf`. -3. Create a new repository from scratch by going to [github.com/new](https://github.com/new) and copying the data files from one of the above resources into your new repository. This approach will result in the most professional-looking portfolio repository, but can be more complicated to use. So if you are getting stuck with this option, try one of the above options instead. - -## Summary - -This project will give you a valuable opportunity to develop your data science skills using real-world data. The end-of-phase projects are a critical part of the program because they give you a chance to bring together all the skills you've learned, apply them to realistic projects for a business stakeholder, practice communication skills, and get feedback to help you improve. You've got this! 
diff --git a/images/Average IMDB Rating By Year.png b/images/Average IMDB Rating By Year.png new file mode 100644 index 00000000..9bb36caf Binary files /dev/null and b/images/Average IMDB Rating By Year.png differ diff --git a/images/Balance Ratings vs Revenue.png b/images/Balance Ratings vs Revenue.png new file mode 100644 index 00000000..d942d760 Binary files /dev/null and b/images/Balance Ratings vs Revenue.png differ diff --git a/images/Box Office VS IMDB Ratings Scatter Plot.png b/images/Box Office VS IMDB Ratings Scatter Plot.png new file mode 100644 index 00000000..205fda63 Binary files /dev/null and b/images/Box Office VS IMDB Ratings Scatter Plot.png differ diff --git a/images/IMDB Ratings Vs Worldwide Gross Revenue.png b/images/IMDB Ratings Vs Worldwide Gross Revenue.png new file mode 100644 index 00000000..e60b642e Binary files /dev/null and b/images/IMDB Ratings Vs Worldwide Gross Revenue.png differ diff --git a/images/Top 10 Studios By Worldwide Gross.png b/images/Top 10 Studios By Worldwide Gross.png new file mode 100644 index 00000000..fb378eb3 Binary files /dev/null and b/images/Top 10 Studios By Worldwide Gross.png differ diff --git a/images/Top Genres Balancing Ratings and Revenue.png b/images/Top Genres Balancing Ratings and Revenue.png new file mode 100644 index 00000000..4f2fb1c8 Binary files /dev/null and b/images/Top Genres Balancing Ratings and Revenue.png differ diff --git a/images/Top Genres.png b/images/Top Genres.png new file mode 100644 index 00000000..b2745c69 Binary files /dev/null and b/images/Top Genres.png differ diff --git a/images/Worldwide Gross by Year.png b/images/Worldwide Gross by Year.png new file mode 100644 index 00000000..612c7a12 Binary files /dev/null and b/images/Worldwide Gross by Year.png differ diff --git a/images/my-graph.png b/images/my-graph.png new file mode 100644 index 00000000..8b137891 --- /dev/null +++ b/images/my-graph.png @@ -0,0 +1 @@ + diff --git a/imdb.db b/imdb.db new file mode 100644 index 
00000000..e69de29b diff --git a/index.ipynb b/index.ipynb deleted file mode 100644 index 3623bc14..00000000 --- a/index.ipynb +++ /dev/null @@ -1,643 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "5d35b2b4", - "metadata": {}, - "source": [ - "# Phase 2 Project Description" - ] - }, - { - "cell_type": "markdown", - "id": "b5e9e179", - "metadata": {}, - "source": [ - "You've made it through the second phase of this course, and now you will put your new skills to use with a large end-of-Phase project!\n", - "\n", - "In this project description, we will cover:\n", - "\n", - "* [***Project Overview:***](#project-overview) the project goal, audience, and dataset\n", - "* [***Deliverables:***](#deliverables) the specific items you are required to produce for this project\n", - "* [***Grading:***](#grading) how your project will be scored\n", - "* [***Getting Started:***](#getting-started) guidance for how to begin your first project" - ] - }, - { - "cell_type": "markdown", - "id": "58851385", - "metadata": {}, - "source": [ - "## Project Overview" - ] - }, - { - "cell_type": "markdown", - "id": "6f37995f", - "metadata": {}, - "source": [ - "For this project, you will use exploratory data analysis to generate insights for a business stakeholder." - ] - }, - { - "cell_type": "markdown", - "id": "8b0f1668", - "metadata": {}, - "source": [ - "### Business Problem" - ] - }, - { - "cell_type": "markdown", - "id": "dce55d1d", - "metadata": {}, - "source": [ - "Your company now sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of your company's new movie studio can use to help decide what type of films to create." 
- ] - }, - { - "cell_type": "markdown", - "id": "d3d557bf", - "metadata": {}, - "source": [ - "### The Data" - ] - }, - { - "cell_type": "markdown", - "id": "ca34efb7", - "metadata": {}, - "source": [ - "In the folder `zippedData` are movie datasets from:\n", - "\n", - "* [Box Office Mojo](https://www.boxofficemojo.com/)\n", - "* [IMDB](https://www.imdb.com/)\n", - "* [Rotten Tomatoes](https://www.rottentomatoes.com/)\n", - "* [TheMovieDB](https://www.themoviedb.org/)\n", - "* [The Numbers](https://www.the-numbers.com/)\n", - "\n", - "Because it was collected from various locations, the different files have different formats. Some are compressed CSV (comma-separated values) or TSV (tab-separated values) files that can be opened using spreadsheet software or `pd.read_csv`, while the data from IMDB is located in a SQLite database.\n", - "\n", - "![movie data erd](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-2-project-v3/main/movie_data_erd.jpeg)\n", - "\n", - "Note that the above diagram shows ONLY the IMDB data. You will need to look carefully at the features to figure out how the IMDB data relates to the other provided data files.\n", - "\n", - "It is up to you to decide what data from this to use and how to use it. If you want to make this more challenging, you can scrape websites or make API calls to get additional data. 
If you are feeling overwhelmed or behind, we recommend you use only the following data files:\n", - "\n", - "* `im.db.zip`\n", - " * Zipped SQLite database (you will need to unzip then query using SQLite)\n", - " * `movie_basics` and `movie_ratings` tables are most relevant\n", - "* `bom.movie_gross.csv.gz`\n", - " * Compressed CSV file (you can open without expanding the file using `pd.read_csv`)" - ] - }, - { - "cell_type": "markdown", - "id": "5ace6e4f", - "metadata": {}, - "source": [ - "### Key Points" - ] - }, - { - "cell_type": "markdown", - "id": "c9d2edeb", - "metadata": {}, - "source": [ - "* **Your analysis should yield three concrete business recommendations.** The ultimate purpose of exploratory analysis is not just to learn about the data, but to help an organization perform better. Explicitly relate your findings to business needs by recommending actions that you think the business should take.\n", - "\n", - "* **Communicating about your work well is extremely important.** Your ability to provide value to an organization - or to land a job there - is directly reliant on your ability to communicate with them about what you have done and why it is valuable. Create a storyline your audience (the head of the new movie studio) can follow by walking them through the steps of your process, highlighting the most important points and skipping over the rest.\n", - "\n", - "* **Use plenty of visualizations.** Visualizations are invaluable for exploring your data and making your findings accessible to a non-technical audience. Spotlight visuals in your presentation, but only ones that relate directly to your recommendations. Simple visuals are usually best (e.g. bar charts and line graphs), and don't forget to format them well (e.g. labels, titles)." 
- ] - }, - { - "cell_type": "markdown", - "id": "474e2ec3", - "metadata": {}, - "source": [ - "## Deliverables" - ] - }, - { - "cell_type": "markdown", - "id": "eaeda85f", - "metadata": {}, - "source": [ - "There are three deliverables for this project:\n", - "\n", - "* A **non-technical presentation**\n", - "* A **Jupyter Notebook**\n", - "* A **GitHub repository**" - ] - }, - { - "cell_type": "markdown", - "id": "a7f8e274", - "metadata": {}, - "source": [ - "### Non-Technical Presentation" - ] - }, - { - "cell_type": "markdown", - "id": "540d5c27", - "metadata": {}, - "source": [ - "The non-technical presentation is a slide deck presenting your analysis to business stakeholders.\n", - "\n", - "* ***Non-technical*** does not mean that you should avoid mentioning the technologies or techniques that you used, it means that you should explain any mentions of these technologies and avoid assuming that your audience is already familiar with them.\n", - "* ***Business stakeholders*** means that the audience for your presentation is the company, not the class or teacher. Do not assume that they are already familiar with the specific business problem.\n", - "\n", - "The presentation describes the project ***goals, data, methods, and results***. It must include at least ***three visualizations*** which correspond to ***three business recommendations***.\n", - "\n", - "We recommend that you follow this structure, although the slide titles should be specific to your project:\n", - "\n", - "1. Beginning\n", - " * Overview\n", - " * Business Understanding\n", - "2. Middle\n", - " * Data Understanding\n", - " * Data Analysis\n", - "3. End\n", - " * Recommendations\n", - " * Next Steps\n", - " * Thank You\n", - " * This slide should include a prompt for questions as well as your contact information (name and LinkedIn profile)\n", - "\n", - "You will give a live presentation of your slides and submit them in PDF format on Canvas. 
The slides should also be present in the GitHub repository you submit with a file name of `presentation.pdf`.\n", - "\n", - "The graded elements of the presentation are:\n", - "\n", - "* Presentation Content\n", - "* Slide Style\n", - "* Presentation Delivery and Answers to Questions\n", - "\n", - "See the [Grading](#grading) section for further explanation of these elements.\n", - "\n", - "For further reading on creating professional presentations, check out:\n", - "\n", - "* [Presentation Content](https://github.com/learn-co-curriculum/dsc-project-presentation-content)\n", - "* [Slide Style](https://github.com/learn-co-curriculum/dsc-project-slide-design)" - ] - }, - { - "cell_type": "markdown", - "id": "d27915ba", - "metadata": {}, - "source": [ - "### Jupyter Notebook" - ] - }, - { - "cell_type": "markdown", - "id": "2d5d45ea", - "metadata": {}, - "source": [ - "The Jupyter Notebook is a notebook that uses Python and Markdown to present your analysis to a data science audience.\n", - "\n", - "* ***Python and Markdown*** means that you need to construct an integrated `.ipynb` file with Markdown (headings, paragraphs, links, lists, etc.) and Python code to create a well-organized, skim-able document.\n", - " * The notebook kernel should be restarted and all cells run before submission, to ensure that all code is runnable in order.\n", - " * Markdown should be used to frame the project with a clear introduction and conclusion, as well as introducing each of the required elements.\n", - "* ***Data science audience*** means that you can assume basic data science proficiency in the person reading your notebook. This differs from the non-technical presentation.\n", - "\n", - "Along with the presentation, the notebook also describes the project ***goals, data, methods, and results***. 
It must include at least ***three visualizations*** which correspond to ***three business recommendations***.\n", - "\n", - "You will submit the notebook in PDF format on Canvas as well as in `.ipynb` format in your GitHub repository.\n", - "\n", - "The graded elements for the Jupyter Notebook are:\n", - "\n", - "* Business Understanding\n", - "* Data Understanding\n", - "* Data Preparation\n", - "* Data Analysis\n", - "* Visualization\n", - "* Code Quality\n", - "\n", - "See the [Grading](#grading) section for further explanation of these elements." - ] - }, - { - "cell_type": "markdown", - "id": "2027aa4c", - "metadata": {}, - "source": [ - "### GitHub Repository" - ] - }, - { - "cell_type": "markdown", - "id": "b8057390", - "metadata": {}, - "source": [ - "The GitHub repository is the cloud-hosted directory containing all of your project files as well as their version history.\n", - "\n", - "This repository link will be the project link that you include on your resume, LinkedIn, etc. for prospective employers to view your work. Note that we typically recommend that 3 links are highlighted (out of 5 projects) so don't stress too much about getting this one to be perfect! There will also be time after graduation for cosmetic touch-ups.\n", - "\n", - "A professional GitHub repository has:\n", - "\n", - "1. `README.md`\n", - " * A file called `README.md` at the root of the repository directory, written in Markdown; this is what is rendered when someone visits the link to your repository in the browser\n", - " * This file contains these sections:\n", - " * Overview\n", - " * Business Understanding\n", - " * Include stakeholder and key business questions\n", - " * Data Understanding and Analysis\n", - " * Source of data\n", - " * Description of data\n", - " * Three visualizations (the same visualizations presented in the slides and notebook)\n", - " * Conclusion\n", - " * Summary of conclusions including three relevant findings\n", - "2. 
Commit history\n", - " * Progression of updates throughout the project time period, not just immediately before the deadline\n", - " * Clear commit messages\n", - " * Commits from all team members (if a group project)\n", - "3. Organization\n", - " * Clear folder structure\n", - " * Clear names of files and folders\n", - " * Easily-located notebook and presentation linked in the README\n", - "4. Notebook(s)\n", - " * Clearly-indicated final notebook that runs without errors\n", - " * Exploratory/working notebooks (can contain errors, redundant code, etc.) from all team members (if a group project)\n", - "5. `.gitignore`\n", - " * A file called `.gitignore` at the root of the repository directory instructs Git to ignore large, unnecessary, or private files\n", - " * Because it starts with a `.`, you will need to type `ls -a` in the terminal in order to see that it is there\n", - " * GitHub maintains a [Python .gitignore](https://github.com/github/gitignore/blob/master/Python.gitignore) that may be a useful starting point for your version of this file\n", - " * To tell Git to ignore more files, just add a new line to `.gitignore` for each new file name\n", - " * Consider adding `.DS_Store` if you are using a Mac computer, as well as project-specific file names\n", - " * If you are running into an error message because you forgot to add something to `.gitignore` and it is too large to be pushed to GitHub [this blog post](https://medium.com/analytics-vidhya/tutorial-removing-large-files-from-git-78dbf4cf83a?sk=c3763d466c7f2528008c3777192dfb95)(friend link) should help you address this\n", - "\n", - "You wil submit a link to the GitHub repository on Canvas.\n", - "\n", - "See the [Grading](#grading) section for further explanation of how the GitHub repository will be graded.\n", - "\n", - "For further reading on creating professional notebooks and `README`s, check out [this reading](https://github.com/learn-co-curriculum/dsc-repo-readability-v2-2)." 
- ] - }, - { - "cell_type": "markdown", - "id": "f19694e7", - "metadata": {}, - "source": [ - "## Grading" - ] - }, - { - "cell_type": "markdown", - "id": "06e9cfb7", - "metadata": {}, - "source": [ - "***To pass this project, you must pass each project rubric objective.*** The project rubric objectives for Phase 2 are:\n", - "\n", - "1. Data Communication\n", - "2. Authoring Jupyter Notebooks\n", - "3. Data Manipulation and Analysis with `pandas`" - ] - }, - { - "cell_type": "markdown", - "id": "a4c04769", - "metadata": {}, - "source": [ - "### Data Communication" - ] - }, - { - "cell_type": "markdown", - "id": "0834a4ee", - "metadata": {}, - "source": [ - "Communication is a key \"soft skill\". In [this survey](https://www.payscale.com/data-packages/job-skills), 46% of hiring managers said that recent college grads were missing this skill.\n", - "\n", - "Because \"communication\" can encompass such a wide range of contexts and skills, we will specifically focus our Phase 2 objective on Data Communication. We define Data Communication as:\n", - "\n", - "> Communicating basic data analysis results to diverse audiences via writing and live presentation\n", - "\n", - "To further define some of these terms:\n", - "\n", - "* By \"basic data analysis\" we mean that you are filtering, sorting, grouping, and/or aggregating the data in order to answer business questions. This project does not involve inferential statistics or machine learning, although descriptive statistics such as measures of central tendency are encouraged.\n", - "* By \"results\" we mean your ***three visualizations and recommendations***.\n", - "* By \"diverse audiences\" we mean that your presentation and notebook are appropriately addressing a business and data science audience, respectively.\n", - "\n", - "Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment." 
- ] - }, - { - "cell_type": "markdown", - "id": "276dff7c", - "metadata": {}, - "source": [ - "#### Exceeds Objective" - ] - }, - { - "cell_type": "markdown", - "id": "e87c2713", - "metadata": {}, - "source": [ - "Creates and describes appropriate visualizations for given business questions, where each visualization fulfills all elements of the checklist\n", - "\n", - "> This \"checklist\" refers to the Data Visualization checklist within the larger Phase 2 Project Checklist" - ] - }, - { - "cell_type": "markdown", - "id": "b4e8a4c7", - "metadata": {}, - "source": [ - "#### Meets Objective (Passing Bar)" - ] - }, - { - "cell_type": "markdown", - "id": "bc4e21d0", - "metadata": {}, - "source": [ - "Creates and describes appropriate visualizations for given business questions\n", - "\n", - "> This objective can be met even if all checklist elements are not fulfilled. For example, if there is some illegible text in one of your visualizations, you can still meet this objective" - ] - }, - { - "cell_type": "markdown", - "id": "d0403eb9", - "metadata": {}, - "source": [ - "#### Approaching Objective" - ] - }, - { - "cell_type": "markdown", - "id": "22dd4ad6", - "metadata": {}, - "source": [ - "Creates visualizations that are not related to the business questions, or uses an inappropriate type of visualization\n", - "\n", - "> Even if you create very compelling visualizations, you cannot pass this objective if the visualizations are not related to the business questions\n", - "\n", - "> An example of an inappropriate type of visualization would be using a line graph to show the correlation between two independent variables, when a scatter plot would be more appropriate" - ] - }, - { - "cell_type": "markdown", - "id": "aa1b808d", - "metadata": {}, - "source": [ - "#### Does Not Meet Objective" - ] - }, - { - "cell_type": "markdown", - "id": "a8a64869", - "metadata": {}, - "source": [ - "Does not submit the required number of visualizations" - ] - }, - { - "cell_type": 
"markdown", - "id": "db2e0ce8", - "metadata": {}, - "source": [ - "### Authoring Jupyter Notebooks" - ] - }, - { - "cell_type": "markdown", - "id": "91cc89b5", - "metadata": {}, - "source": [ - "According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), 74.1% of data scientists use a Jupyter development environment, which is more than twice the percentage of the next-most-popular IDE, Visual Studio Code. Jupyter Notebooks allow for reproducible, skim-able code documents for a data science audience. Comfort and skill with authoring Jupyter Notebooks will prepare you for job interviews, take-home challenges, and on-the-job tasks as a data scientist.\n", - "\n", - "The key feature that distinguishes *authoring Jupyter Notebooks* from simply *writing Python code* is the fact that Markdown cells are integrated into the notebook along with the Python cells in a notebook. You have seen examples of this throughout the curriculum, but now it's time for you to practice this yourself!\n", - "\n", - "Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment." 
- ] - }, - { - "cell_type": "markdown", - "id": "b9272672", - "metadata": {}, - "source": [ - "#### Exceeds Objective" - ] - }, - { - "cell_type": "markdown", - "id": "efc937e5", - "metadata": {}, - "source": [ - "Uses Markdown and code comments to create a well-organized, skim-able document that follows all best practices\n", - "\n", - "> Refer to the [repository readability reading](https://github.com/learn-co-curriculum/dsc-repo-readability-v2-2) for more tips on best practices" - ] - }, - { - "cell_type": "markdown", - "id": "d01725ea", - "metadata": {}, - "source": [ - "#### Meets Objective (Passing Bar)" - ] - }, - { - "cell_type": "markdown", - "id": "2c854f50", - "metadata": {}, - "source": [ - "Uses some Markdown to create an organized notebook, with an introduction at the top and a conclusion at the bottom" - ] - }, - { - "cell_type": "markdown", - "id": "3e0b3385", - "metadata": {}, - "source": [ - "#### Approaching Objective" - ] - }, - { - "cell_type": "markdown", - "id": "67767f89", - "metadata": {}, - "source": [ - "Uses Markdown cells to organize, but either uses only headers and does not provide any explanations or justifications, or uses only plaintext without any headers to segment out sections of the notebook\n", - "\n", - "> Headers in Markdown are delineated with one or more `#`s at the start of the line. 
You should have a mixture of headers and plaintext (text where the line does not start with `#`)" - ] - }, - { - "cell_type": "markdown", - "id": "195ef62a", - "metadata": {}, - "source": [ - "#### Does Not Meet Objective" - ] - }, - { - "cell_type": "markdown", - "id": "709181b9", - "metadata": {}, - "source": [ - "Does not submit a notebook, or does not use Markdown cells at all to organize the notebook" - ] - }, - { - "cell_type": "markdown", - "id": "290335d1", - "metadata": {}, - "source": [ - "### Data Manipulation and Analysis with `pandas`" - ] - }, - { - "cell_type": "markdown", - "id": "2c0aae32", - "metadata": {}, - "source": [ - "`pandas` is a very popular data manipulation library, with over 2 million downloads on Anaconda (`conda install pandas`) and over 19 million downloads on PyPI (`pip install pandas`) at the time of this writing. In our own internal data, we see that the overwhelming majority of Flatiron School DS grads use `pandas` on the job in some capacity.\n", - "\n", - "Unlike in base Python, where the Zen of Python says \"There should be one-- and preferably only one --obvious way to do it\", there is often more than one valid way to do something in `pandas`. However there are still more efficient and less efficient ways to use it. Specifically, the best `pandas` code is *performant* and *idiomatic*.\n", - "\n", - "Performant `pandas` code utilizes methods and broadcasting rather than user-defined functions or `for` loops. For example, if you need to strip whitespace from a column containing string data, the best approach would be to use the [`pandas.Series.str.strip` method](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html) rather than writing your own function or writing a loop. Or if you want to multiply everything in a column by 100, the best approach would be to use broadcasting (e.g. `df[\"column_name\"] * 100`) instead of a function or loop. 
You can still write your own functions if needed, but only after checking that there isn't a built-in way to do it.\n", - "\n", - "Idiomatic `pandas` code has variable names that are meaningful words or abbreviations in English, that are related to the purpose of the variables. You can still use `df` as the name of your DataFrame if there is only one main DataFrame you are working with, but as soon as you are merging multiple DataFrames or taking a subset of a DataFrame, you should use meaningful names. For example, `df2` would not be an idiomatic name, but `movies_and_reviews` could be.\n", - "\n", - "We also recommend that you rename all DataFrame columns so that their meanings are more understandable, although it is fine to have acronyms. For example, `\"col1\"` would not be an idiomatic name, but `\"USD\"` could be.\n", - "\n", - "Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment." 
- ] - }, - { - "cell_type": "markdown", - "id": "e070c91b", - "metadata": {}, - "source": [ - "#### Exceeds Objective" - ] - }, - { - "cell_type": "markdown", - "id": "20092dcd", - "metadata": {}, - "source": [ - "Uses `pandas` to prepare data and answer business questions in an idiomatic, performant way" - ] - }, - { - "cell_type": "markdown", - "id": "882b158d", - "metadata": {}, - "source": [ - "#### Meets Objective (Passing Bar)" - ] - }, - { - "cell_type": "markdown", - "id": "c2c426e6", - "metadata": {}, - "source": [ - "Successfully uses `pandas` to prepare data in order to answer business questions\n", - "\n", - "> This includes projects that _occasionally_ use base Python when `pandas` methods would be more appropriate (such as using `enumerate()` on a DataFrame), or occasionally performs operations that do not appear to have any relevance to the business questions" - ] - }, - { - "cell_type": "markdown", - "id": "88d1667b", - "metadata": {}, - "source": [ - "#### Approaching Objective" - ] - }, - { - "cell_type": "markdown", - "id": "ec132034", - "metadata": {}, - "source": [ - "Uses `pandas` to prepare data, but makes significant errors\n", - "\n", - "> Examples of significant errors include: the result presented does not actually answer the stated question, the code produces errors, the code _consistently_ uses base Python when `pandas` methods would be more appropriate, or the submitted notebook contains significant quantities of code that is unrelated to the presented analysis (such as copy/pasted code from the curriculum or StackOverflow)" - ] - }, - { - "cell_type": "markdown", - "id": "c5e3c86b", - "metadata": {}, - "source": [ - "#### Does Not Meet Objective" - ] - }, - { - "cell_type": "markdown", - "id": "d9566206", - "metadata": {}, - "source": [ - "Unable to prepare data using `pandas`\n", - "\n", - "> This includes projects that successfully answer the business questions, but do not use `pandas` (e.g. 
use only base Python, or use some other tool like R, Tableau, or Excel)" - ] - }, - { - "cell_type": "markdown", - "id": "b0923637", - "metadata": {}, - "source": [ - "## Getting Started" - ] - }, - { - "cell_type": "markdown", - "id": "8e37e815", - "metadata": {}, - "source": [ - "Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP.\n", - "\n", - "Next, you will need to complete the [***Project Proposal***](#project_proposal) which must be reviewed by your instructor before you can continue with the project.\n", - "\n", - "Then, you will need to create a GitHub repository. There are three options:\n", - "\n", - "1. Look at the [Phase 2 Project Templates and Examples repo](https://github.com/learn-co-curriculum/dsc-project-template) and follow the directions in the MVP branch.\n", - "2. Fork the [Phase 2 Project Repository](https://github.com/learn-co-curriculum/dsc-phase-2-project-v3), clone it locally, and work in the `student.ipynb` file. Make sure to also add and commit a PDF of your presentation to your repository with a file name of `presentation.pdf`.\n", - "3. Create a new repository from scratch by going to [github.com/new](https://github.com/new) and copying the data files from one of the above resources into your new repository. This approach will result in the most professional-looking portfolio repository, but can be more complicated to use. So if you are getting stuck with this option, try one of the above options instead." - ] - }, - { - "cell_type": "markdown", - "id": "290d61a5", - "metadata": {}, - "source": [ - "## Summary" - ] - }, - { - "cell_type": "markdown", - "id": "ac002279", - "metadata": {}, - "source": [ - "This project will give you a valuable opportunity to develop your data science skills using real-world data. 
The end-of-phase projects are a critical part of the program because they give you a chance to bring together all the skills you've learned, apply them to realistic projects for a business stakeholder, practice communication skills, and get feedback to help you improve. You've got this!" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python (learn-env)", - "language": "python", - "name": "learn-env" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/student.ipynb b/student.ipynb deleted file mode 100644 index d3bb34af..00000000 --- a/student.ipynb +++ /dev/null @@ -1,48 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Final Project Submission\n", - "\n", - "Please fill out:\n", - "* Student name: \n", - "* Student pace: self paced / part time / full time\n", - "* Scheduled project review date/time: \n", - "* Instructor name: \n", - "* Blog post URL:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Your code here - remember to use markdown cells for comments as well!" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}
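Note on the removed rubric text: its description of performant, idiomatic `pandas` code (a vectorized string method such as `Series.str.strip`, and broadcasting such as `df["column_name"] * 100`) can be illustrated with a minimal sketch. The DataFrame contents and the column names `title`, `rating_fraction`, and `rating_percent` are hypothetical and chosen only for illustration; `movies_and_reviews` is borrowed from the rubric's example of a meaningful variable name.

```python
import pandas as pd

# Hypothetical sample data; values and column names are illustrative only
movies_and_reviews = pd.DataFrame(
    {
        "title": ["  Alpha ", "Beta  ", " Gamma"],
        "rating_fraction": [0.5, 0.25, 0.75],
    }
)

# Vectorized string method: strips whitespace from every value
# without a for loop or a user-defined function
movies_and_reviews["title"] = movies_and_reviews["title"].str.strip()

# Broadcasting: the scalar 100 is applied to each element of the column
movies_and_reviews["rating_percent"] = movies_and_reviews["rating_fraction"] * 100

print(movies_and_reviews)
```

Both operations work on the whole column at once, which is the style the rubric contrasts with row-by-row loops.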