GitHub - vijayagopalsb/exploratory-data-analysis: Exploratory Data Analysis (EDA) is a critical step in any data science project. It involves summarizing the dataset, detecting patterns, and uncovering relationships before applying ML Models.

Exploratory Data Analysis (EDA) Pipeline

Exploratory Data Analysis (EDA) is a critical step in any data science project. It involves summarizing the dataset, detecting patterns, and uncovering relationships before applying ML Models.

Project Overview

This project implements an Exploratory Data Analysis (EDA) Pipeline using a Chain of Responsibility pattern. The pipeline consists of multiple handlers that sequentially process a dataset, performing various data analysis tasks such as handling missing values, univariate analysis, bivariate analysis, and correlation analysis.

Directory Structure

exploratory-data-analysis/
│── main.py
│── logging_config.py
│── README.md
│── LICENSE
|
└── handlers/
    │── base_handler.py
    │── understand_data.py
    │── handle_missing_values.py
    │── univariate_analysis.py
    │── bivariate_analysis.py
    │── correlation_analysis.py

Installation & Setup

Prerequisites

Ensure you have Python 3.8+ installed and the required libraries:

pip install pandas numpy seaborn matplotlib

Running the Project

python main.py

Components & Handlers

Base Handler (base_handler.py)

Defines the base class EDAHandler, which serves as the foundation for all other handlers.
Understanding Dataset (understand_data.py)
- Logs dataset information such as column names, data types, and missing values.
- First step in the EDA pipeline.
Handling Missing Values (handle_missing_values.py)
- Replaces missing values using a specified strategy (mean, median, mode).
- Passes the processed data to the next handler.
Univariate Analysis (univariate_analysis.py)
- Generates histograms for numeric columns.
- Displays basic distribution of individual features.
Bivariate Analysis (bivariate_analysis.py)
- Analyzes relationships between the target column and numerical variables.
- Converts categorical variables to string format to avoid plotting issues.
Correlation Analysis (correlation_analysis.py)
- Computes and visualizes correlation matrices using heatmaps.
Logging Configuration (logging_config.py)
- Manages logging of pipeline execution steps and errors.

EDA Pipeline Execution

The pipeline follows a Chain of Responsibility pattern:

# Initialize Handlers
correlation_analysis = CorrelationAnalysis()

bivariate_analysis = BivariateAnalysis(target_column="survived", next_handler=correlation_analysis)

univariate_analysis = UnivariateAnalysis(next_handler=bivariate_analysis)

handle_missing_values = HandleMissingValues(strategy="mean", next_handler=univariate_analysis)

pipeline = UnderstandDataset(next_handler=handle_missing_values)


# Start the pipeline with the dataset
pipeline.handle(df)

Example Log Output

Upon execution, logs provide insights into the pipeline process:

2025-03-28 21:31:30,544 - INFO ->>> Starting EDA Pipeline ...
2025-03-28 21:31:30,544 - INFO ->>> Step 1: -->> Understanding the dataset ...
2025-03-28 21:31:30,544 - INFO ->>>
Display Features of Dataset:

RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    category
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
2025-03-28 21:31:30,560 - INFO ->>> None
2025-03-28 21:31:30,561 - INFO ->>>
Summary Statistics:
2025-03-28 21:31:30,575 - INFO ->>>           survived      pclass   sex         age       sibsp       parch        fare  ...  class  who adult_male deck  embark_town alive alone
count   891.000000  891.000000   891  714.000000  891.000000  891.000000  891.000000  ...    891  891        891  203          889   891   891
unique         NaN         NaN     2         NaN         NaN         NaN         NaN  ...      3    3          2    7            3     2     2
top            NaN         NaN  male         NaN         NaN         NaN         NaN  ...  Third  man       True    C  Southampton    no  True
freq           NaN         NaN   577         NaN         NaN         NaN         NaN  ...    491  537        537   59          644   549   537
mean      0.383838    2.308642   NaN   29.699118    0.523008    0.381594   32.204208  ...    NaN  NaN        NaN  NaN          NaN   NaN   NaN
std       0.486592    0.836071   NaN   14.526497    1.102743    0.806057   49.693429  ...    NaN  NaN        NaN  NaN          NaN   NaN   NaN
min       0.000000    1.000000   NaN    0.420000    0.000000    0.000000    0.000000  ...    NaN  NaN        NaN  NaN          NaN   NaN   NaN
25%       0.000000    2.000000   NaN   20.125000    0.000000    0.000000    7.910400  ...    NaN  NaN        NaN  NaN          NaN   NaN   NaN
50%       0.000000    3.000000   NaN   28.000000    0.000000    0.000000   14.454200  ...    NaN  NaN        NaN  NaN          NaN   NaN   NaN
75%       1.000000    3.000000   NaN   38.000000    1.000000    0.000000   31.000000  ...    NaN  NaN        NaN  NaN          NaN   NaN   NaN
max       1.000000    3.000000   NaN   80.000000    8.000000    6.000000  512.329200  ...    NaN  NaN        NaN  NaN          NaN   NaN   NaN

[11 rows x 15 columns]
2025-03-28 21:31:30,579 - INFO ->>> 
Missing Values:
2025-03-28 21:31:30,579 - INFO ->>> survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
2025-03-28 21:31:30,579 - INFO ->>> Step 2: -->> Handling missing values using mean strategy ...
2025-03-28 21:31:30,596 - INFO ->>> Step 3: -->> Performimg Univariate Analysis ...
2025-03-28 21:31:34,699 - INFO ->>> Step 4: -->> Performing bivariate analysis with target 'survived'...
2025-03-28 21:31:34,700 - INFO ->>> Seaborn version: 0.13.2, Matplotlib version: 3.10.1
2025-03-28 21:31:34,700 - INFO ->>> Original type of 'survived': int64
2025-03-28 21:31:34,700 - INFO ->>> Sample of 'survived': [0, 1, 1, 1, 0]
2025-03-28 21:31:34,700 - INFO ->>> Plotting pclass vs survived...
2025-03-28 21:31:34,700 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:31:38,734 - INFO ->>> Plotting age vs survived...
2025-03-28 21:31:38,735 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:31:40,756 - INFO ->>> Plotting sibsp vs survived...
2025-03-28 21:31:42,255 - INFO ->>> Plotting parch vs survived...
2025-03-28 21:31:42,256 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:31:43,735 - INFO ->>> Plotting fare vs survived...
2025-03-28 21:31:43,736 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:31:46,462 - INFO ->>> Step 5: -->> Performing correlation analysis...
2025-03-28 21:31:48,645 - INFO ->>> EDA Pipeline Completed Successfully!
PS D:\my_github_projects> & "C:/Users/VIJAYAGOPAL S/AppData/Local/Programs/Python/Python312/python.exe" d:/my_github_projects/exploratory-data-analysis/main.py
2025-03-28 21:39:14,359 - INFO ->>> Starting EDA Pipeline ...
2025-03-28 21:39:14,359 - INFO ->>> Step 1: -->> Understanding the dataset ...
2025-03-28 21:39:14,359 - INFO ->>>
Display Features of Dataset:

RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    category
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
2025-03-28 21:39:14,369 - INFO ->>> None
2025-03-28 21:39:14,369 - INFO ->>>
Summary Statistics:
2025-03-28 21:39:14,382 - INFO ->>>           survived      pclass   sex         age       sibsp       parch        fare embarked  class  who adult_male deck  embark_town alive alone
count   891.000000  891.000000   891  714.000000  891.000000  891.000000  891.000000      889    891  891        891  203          889   891   891
unique         NaN         NaN     2         NaN         NaN         NaN         NaN        3      3    3          2    7            3     2     2
top            NaN         NaN  male         NaN         NaN         NaN         NaN        S  Third  man       True    C  Southampton    no  True
freq           NaN         NaN   577         NaN         NaN         NaN         NaN      644    491  537        537   59          644   549   537
mean      0.383838    2.308642   NaN   29.699118    0.523008    0.381594   32.204208      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
std       0.486592    0.836071   NaN   14.526497    1.102743    0.806057   49.693429      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
min       0.000000    1.000000   NaN    0.420000    0.000000    0.000000    0.000000      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
25%       0.000000    2.000000   NaN   20.125000    0.000000    0.000000    7.910400      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
50%       0.000000    3.000000   NaN   28.000000    0.000000    0.000000   14.454200      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
75%       1.000000    3.000000   NaN   38.000000    1.000000    0.000000   31.000000      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
max       1.000000    3.000000   NaN   80.000000    8.000000    6.000000  512.329200      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
2025-03-28 21:39:14,394 - INFO ->>> 
Missing Values:
2025-03-28 21:39:14,394 - INFO ->>> survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
2025-03-28 21:39:14,394 - INFO ->>> Step 2: -->> Handling missing values using mean strategy ...
2025-03-28 21:39:14,394 - INFO ->>> Step 3: -->> Performimg Univariate Analysis ...
2025-03-28 21:39:18,726 - INFO ->>> Step 4: -->> Performing bivariate analysis with target 'survived'...
2025-03-28 21:39:18,726 - INFO ->>> Seaborn version: 0.13.2, Matplotlib version: 3.10.1
2025-03-28 21:39:18,727 - INFO ->>> Original type of 'survived': int64
2025-03-28 21:39:18,728 - INFO ->>> Sample of 'survived': [0, 1, 1, 1, 0]
2025-03-28 21:39:18,728 - INFO ->>> Plotting pclass vs survived...
2025-03-28 21:39:18,729 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:39:20,743 - INFO ->>> Plotting age vs survived...
2025-03-28 21:39:20,744 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:39:22,164 - INFO ->>> Plotting sibsp vs survived...
2025-03-28 21:39:22,164 - INFO ->>> Plotting sibsp vs survived...
2025-03-28 21:39:22,164 - INFO ->>> Plotting sibsp vs survived...
2025-03-28 21:39:22,164 - INFO ->>> Plotting sibsp vs survived...
2025-03-28 21:39:22,164 - INFO ->>> Plotting sibsp vs survived...
2025-03-28 21:39:22,164 - INFO ->>> Plotting sibsp vs survived...
2025-03-28 21:39:22,164 - INFO ->>> Plotting sibsp vs survived...
2025-03-28 21:39:22,165 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:39:23,652 - INFO ->>> Plotting parch vs survived...
2025-03-28 21:39:23,653 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:39:24,926 - INFO ->>> Plotting fare vs survived...
2025-03-28 21:39:24,926 - INFO ->>> Type of 'survived' before plotting: int64
2025-03-28 21:39:26,041 - INFO ->>> Step 5: -->> Performing correlation analysis...
2025-03-28 21:39:28,204 - INFO ->>> EDA Pipeline Completed Successfully!

Future Enhancements

- Outlier Detection & Treatment

- Feature Engineering & Selection

- Automated Report Generation

Documentation

📚 Documentation - Setup Guide - Model Training

Contributors

- Vijayagopal S - [GitHub](https://github.com/vijayagopalsb)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Exploratory Data Analysis (EDA) Pipeline

Project Overview

Directory Structure

Installation & Setup

Prerequisites

Running the Project

Components & Handlers

EDA Pipeline Execution

Example Log Output

Future Enhancements

Documentation

Contributors

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
handlers		handlers
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
logging_config.py		logging_config.py
main.py		main.py

Folders and files

Latest commit

History

Repository files navigation

Exploratory Data Analysis (EDA) Pipeline

Project Overview

Directory Structure

Installation & Setup

Prerequisites

Running the Project

Components & Handlers

EDA Pipeline Execution

Example Log Output

Future Enhancements

Documentation

Contributors

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages