This project analyzes the Titanic dataset to predict passenger survival using various machine learning models. The data is preprocessed and explored, and several classification algorithms are trained and evaluated on it.
The Titanic dataset used in this project is fetched from Seaborn: `titanic.csv`
- Features include `age`, `fare`, `pclass`, `sex`, `embarked`, and others.
- Languages & Libraries: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Plotly
- ML Models: Logistic Regression, KNN, Random Forest, SVM, Decision Tree, Naive Bayes
- Other Tools: Streamlit for visualization, GridSearchCV for hyperparameter tuning
- Handling missing values (mean/mode imputation)
- Dropping columns with a high proportion of missing values
- Encoding categorical variables using one-hot encoding
- Normalization (MinMaxScaler, StandardScaler)
- Count plots for survival distribution
- Scatter plots for fare vs. age
- Histograms of fares for different passenger classes
- Boxplots for age and fare distribution
- FacetGrid visualizations for survival analysis
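The preprocessing steps listed above (dropping sparse columns, mean/mode imputation, one-hot encoding, scaling) could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `preprocess` helper, the 50% missing-value threshold, the tiny sample frame, and the choice of which scaler goes on which column are all assumptions. The real data would come from something like `sns.load_dataset("titanic")`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def preprocess(df, categorical=("sex", "embarked"), max_missing=0.5):
    """Hypothetical helper: drop sparse columns, impute, encode, and scale."""
    df = df.copy()

    # Drop columns where more than `max_missing` of the values are missing
    # (the 0.5 threshold is an assumption, not taken from the project).
    df = df.loc[:, df.isna().mean() <= max_missing]

    # Mean imputation for numeric columns, mode imputation for the others.
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].mean())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])

    # One-hot encode the categorical columns that survived the drop step.
    cat_cols = [c for c in categorical if c in df.columns]
    df = pd.get_dummies(df, columns=cat_cols, dtype=float)

    # Scale numeric features. Which scaler applies to which column is assumed.
    # In a real workflow the scalers should be fitted on the training split only.
    df["age"] = StandardScaler().fit_transform(df[["age"]]).ravel()
    df["fare"] = MinMaxScaler().fit_transform(df[["fare"]]).ravel()
    return df


# Small made-up frame with Titanic-like columns (not real passenger data).
sample = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "female"],
    "embarked": ["S", "C", None, "S"],
    "deck": [None, "C", None, None],  # 75% missing, so it gets dropped
})

clean = preprocess(sample)
print(clean.columns.tolist())
```

Passing the categorical column names explicitly keeps the helper independent of how pandas infers string dtypes.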
Models were trained using a pipeline approach:
- Preprocessing (Scaling + Encoding)
- Splitting Data (80% train, 20% test)
- Model Training (Logistic Regression, KNN, etc.)
- Hyperparameter Tuning (GridSearchCV for Logistic Regression)
- Evaluation Metrics:
  - Accuracy
  - Precision, Recall, F1 Score
  - Confusion Matrix
  - Cross-validation scores
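The pipeline steps above can be illustrated with a short scikit-learn sketch. It runs on synthetic data rather than the real dataset, and the column choices, the `C` grid, the 5-fold setting, and the random seeds are assumptions made for the example. The scores it produces say nothing about the accuracy reported for this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned Titanic data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "fare": rng.uniform(5, 500, n),
    "pclass": rng.integers(1, 4, n),
    "sex": rng.choice(["male", "female"], n),
    "embarked": rng.choice(["S", "C", "Q"], n),
})
# Toy label tied to sex and class, with roughly 10% of labels flipped as noise.
base = (X["sex"] == "female") | (X["pclass"] == 1)
noise = rng.random(n) < 0.1
y = (base ^ noise).astype(int)

# Preprocessing: scale numeric columns, one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["pclass", "sex", "embarked"]),
])
model = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 80% train / 20% test split, stratified on the label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Hyperparameter tuning for Logistic Regression (the grid is an assumption).
search = GridSearchCV(model, {"clf__C": [0.01, 0.1, 1, 10]},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluation on the held-out test split.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary")
cm = confusion_matrix(y_test, y_pred)
cv_scores = cross_val_score(search.best_estimator_, X_train, y_train, cv=5)

print("best params:", search.best_params_)
print("accuracy:", accuracy, "precision:", precision,
      "recall:", recall, "f1:", f1)
print("confusion matrix:\n", cm)
print("cross-validation scores:", cv_scores)
```

Keeping the scaler and encoder inside the `Pipeline` means they are refitted on each training fold during the grid search, so no information from validation folds or the test set leaks into preprocessing.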
- Logistic Regression was the best model, achieving the highest accuracy of 83.7079%.
- After hyperparameter tuning, Logistic Regression reached an accuracy of 84.26966%.
- The trained model is saved using `joblib` (`titanic_trained_model.pkl`).
- The trained model is tested using sample data.
- Streamlit is used for interactive visualizations.
- Confusion matrices, survival rate visualizations, and EDA graphs are included.
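Saving the model with `joblib` and checking it against sample data, as mentioned above, might look like the sketch below. The toy classifier, its two made-up features, and the temporary directory are assumptions for illustration. Only the file name `titanic_trained_model.pkl` comes from this project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy model standing in for the trained Titanic classifier.
# Hypothetical feature columns: [is_female, pclass].
X = np.array([[0.0, 3], [1.0, 1], [0.0, 2], [1.0, 3], [0.0, 1], [1.0, 2]])
y = np.array([0, 1, 0, 1, 0, 1])
model = LogisticRegression().fit(X, y)

# Saved to a temporary directory here so the example leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "titanic_trained_model.pkl")
joblib.dump(model, path)

# Reload the model and run it on one sample row.
restored = joblib.load(path)
sample = np.array([[1.0, 2]])
prediction = restored.predict(sample)
print("predicted class:", prediction[0])
```

A file saved this way should be loaded with the same scikit-learn version that produced it, since pickled estimators are not guaranteed to be compatible across versions.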
|Titanic-Insights/
|dashboard/
|-- dataset/
|-- titanic.csv
|-- cleaned_data.csv
|-- notebook/
|-- titanic.ipynb
|-- titanic_trained_model.pkl
|-- app.py # Streamlit app
|-- requirements.txt
|-- README.md
- Clone the repository:
git clone https://github.com/UFAQUE123/Titanic-Insights.git
- Install dependencies:
pip install -r requirements.txt
- Run the Streamlit app:
streamlit run app.py
- Feature engineering and proper preprocessing significantly improve model performance.
- Logistic Regression is the best-performing model for this dataset.
- The project demonstrates an end-to-end machine learning pipeline, from data preprocessing to deployment.
🚀 Future Work: Feature selection for better interpretability, and deploying a web-based ML model interface.
📌 Author: UFAQUE SHADAB
📧 Contact: [email protected]