Skip to content

raugan/data-bank-sql-case-study

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🏦 Data Bank — SQL Case Study

SQL Status License Case Study

Overview

Data Bank is a fictional neo-bank that ties customer cloud data storage limits directly to their account balance. Launched as Case Study #4 of Danny Ma's 8 Week SQL Challenge, this project analyses 500 customers across three tables — node assignments, regions, and transactions — to answer business questions about distributed network security, transaction behaviour, and data storage provisioning under four different allocation models (including a daily compounding interest variant). The project also includes a PowerPoint presentation built from the query results, targeting both investors and the internal management team.

Dataset

Table Description Rows
regions The 5 geographic regions where Data Bank nodes operate 5
customer_nodes Each period a customer was assigned to a specific node; end_date = '9999-12-31' marks their current active assignment 3,500
customer_transactions All deposits, withdrawals, and purchases made via Data Bank debit card 5,868

Data source note: the case study webpage's sample table shows region 1 = Africa. The canonical dataset (the embedded DB Fiddle linked from the case study page) has region 1 = Australia and region 3 = Africa. This repo uses the canonical data throughout; a comment in databank_setup.sql flags the discrepancy.

Repo Structure

data-bank-sql-case-study/
├── README.md
├── .gitignore
├── LICENSE
├── sql/
│   ├── databank_setup.sql         # schema, full dataset, data quality notes
│   └── databank_solutions.sql     # all case study questions answered and commented
├── screenshots/
└── assets/
    ├── extension-questions.md          # written analysis for the Extension section
    └── data-bank-presentation.pptx    # investor + management PowerPoint

Questions Answered

A. Customer Nodes Exploration

  1. How many unique nodes are there on the Data Bank system?
  2. What is the number of nodes per region?
  3. How many customers are allocated to each region?
  4. How many days on average are customers reallocated to a different node?
  5. Median, 80th and 95th percentile for reallocation days, per region

B. Customer Transactions

  1. Unique count and total amount for each transaction type
  2. Average total deposit counts and amounts per customer
  3. Monthly count of customers making more than 1 deposit and at least 1 purchase or withdrawal
  4. Closing balance for each customer at the end of each month
  5. Percentage of customers who increased their closing balance by more than 5%

C. Data Allocation Challenge

Three provisioning models built and compared:

  • Option 1: allocation based on end-of-previous-month closing balance
  • Option 2: allocation based on average running balance during the month
  • Option 3: real-time allocation based on running balance at each transaction

Plus the three sub-components required to build them: running balance column, monthly closing balance, and per-customer min/avg/max running balance.

D. Extra Challenge — Daily Compounding Interest

A fourth provisioning model applying 6% annual interest compounded daily, using a recursive CTE date spine (118 days × 500 customers) with a daily POWER() calculation. Simple (non-compounding) interest is excluded by design.

Extension

Open-ended analysis for two audiences:

  • Headline security insights for investors — drawn from Part A node data
  • Data provisioning comparison for management — comparing all four options

See assets/extension-questions.md and assets/data-bank-presentation.pptx.

SQL Concepts Demonstrated

  • Window functions: ROW_NUMBER(), LAG(), SUM() OVER(), AVG() OVER()
  • Manual percentile calculation in MySQL 8 (workaround for missing PERCENTILE_CONT WITHIN GROUP)
  • Recursive CTEs for generating a date spine across a date range
  • CROSS JOIN for building a customer × date grid
  • LEFT JOIN with aggregated CTEs for sparse time-series data
  • POWER() for daily compound interest calculation
  • GREATEST() for non-negative data allocation floor
  • Sentinel value handling (end_date = '9999-12-31')
  • Conditional aggregation with CASE and boolean SUM

Findings

  • All 5 nodes are active in every region — the architecture has no node concentration risk by design. Customers are reallocated every 15 days on average, with a 95th-percentile maximum of 28 days, meaning even the slowest-rotating customers move within a month.
  • Australia leads with 110 customers across the 5 regions, while Europe has the fewest at 88 — no single region holds more than 22% of the customer base.
  • Deposits dominate transaction activity: 2,671 deposits totalling $1,359,168, versus 1,617 purchases and 1,580 withdrawals. The average customer makes 5.3 deposits at $508.61 each.
  • Only 23.4% of customers (117 out of 500) grew their closing balance by more than 5% across their active period — suggesting the majority of customers are net spending down their balances over time.
  • The three provisioning options differ by an order of magnitude in their monthly storage demands: Option 1 peaks at ~$240,602 (March), Option 3 peaks at ~$962,883 (February), and the compounding interest model peaks at ~$8,412,729 (March) — making Option 3 roughly 4× and the compounding model roughly 35× more infrastructure-intensive than the conservative Option 1.

What I Learned

The data allocation challenge (Part C) required thinking carefully about what each option actually means computationally, not just analytically. Option 3 in particular — summing all positive running balance snapshots rather than a single period-end figure — produces numbers that look alarming until you understand why: it's accumulating storage demand across every transaction event in the month, not per customer per month. Getting that semantics right in the query required tracing through a few customers manually before trusting the aggregate result.

Part D's recursive CTE approach also surfaced a MySQL-vs-PostgreSQL portability gap worth noting: the case study's default DB Fiddle runs PostgreSQL, which has PERCENTILE_CONT WITHIN GROUP built in. MySQL 8 doesn't, so A5 required a manual ROW_NUMBER + CEIL/FLOOR approach instead. It produces the same result but is meaningfully more verbose — the kind of thing worth flagging in a technical interview if asked why the query looks more complex than it needs to be.

How to Run

  1. Clone this repo:
    git clone https://github.com/raugan/data-bank-sql-case-study.git
    cd data-bank-sql-case-study
  2. Load the schema and data:
    mysql -u root -p < sql/databank_setup.sql
  3. Run the solutions:
    mysql -u root -p < sql/databank_solutions.sql

Both scripts were tested end-to-end on a fresh MySQL 8.0 instance before being committed.

Source

Case study by Danny Ma — 8 Week SQL Challenge, Case Study #4: Data Bank

About

SQL case study analysing Data Bank, a fictional neo-bank that ties cloud storage to account balance, using MySQL. Covers node security, transaction analysis, and four data provisioning models including daily compounding interest. Part of the 8 Week SQL Challenge by Danny Ma.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors