longway2go-ai/Advanced-Data-Preprocessing-App

🚀 Advanced Data Preprocessing App

Streamlit App | Python 3.8+ | License: MIT

Transform your raw data into machine learning-ready datasets with professional-grade preprocessing techniques in just a few clicks.

🎯 What This App Does

The Advanced Data Preprocessing App automates the cleaning and preparation of tabular data for machine learning. It is aimed at data scientists, analysts, and ML engineers who want to replace repetitive manual preprocessing with a configurable, repeatable workflow.

✨ Key Features

  • 🧹 Intelligent Data Cleaning: Remove duplicates, handle missing values, fix data formats
  • ⚙️ Advanced Feature Engineering: Extract datetime features, create interactions, apply transformations
  • 🏷️ Smart Categorical Encoding: Automatic encoding strategies based on data characteristics
  • 📏 Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler options
  • 🎯 Feature Selection: Remove low-variance and highly correlated features
  • 📉 Dimensionality Reduction: PCA and t-SNE implementations with visualization
  • 📊 Interactive Visualizations: Real-time plots and data exploration
  • 📥 ML-Ready Export: Download processed datasets ready for any ML framework

🚀 Live Demo

Try the App Now →

📋 Supported File Formats

  • CSV files (.csv)
  • Excel files (.xlsx, .xls)
  • Maximum file size: 200 MB
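
As a rough illustration of how the two supported formats can be told apart, the sketch below picks a pandas reader from the file extension and enforces the 200 MB limit. The `load_table` helper and `MAX_BYTES` constant are hypothetical names used for this example only; the app's own loading code may differ.

```python
from pathlib import Path

import pandas as pd

MAX_BYTES = 200 * 1024 * 1024  # the 200 MB limit stated above


def load_table(path):
    """Read a CSV or Excel file into a DataFrame, chosen by file extension."""
    path = Path(path)
    if path.stat().st_size > MAX_BYTES:
        raise ValueError(f"{path.name} exceeds the 200 MB limit")
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        # Needs an Excel engine such as openpyxl (.xlsx) or xlrd (.xls).
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")
```

Calling `load_table("data.csv")` returns a regular pandas DataFrame, so the result can be passed to any of the steps described below.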

🛠️ Preprocessing Capabilities

Data Cleaning

  • ✅ Duplicate Removal: Eliminates identical rows
  • ✅ Missing Value Handling: Mean, median, KNN imputation strategies
  • ✅ Format Standardization: Date parsing, data type optimization
  • ✅ Outlier Management: IQR and Z-score methods
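
The cleaning steps above can be sketched with pandas and scikit-learn. This is a minimal illustration, not the app's implementation: the `clean_numeric` function and its parameters are hypothetical, only the IQR rule is shown, and clipping outliers (rather than dropping rows) is an assumption made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def clean_numeric(df, strategy="median", iqr_factor=1.5):
    """Drop duplicate rows, impute missing numeric values, clip IQR outliers."""
    out = df.drop_duplicates().reset_index(drop=True)
    num_cols = out.select_dtypes(include="number").columns

    if strategy == "knn":
        # KNN imputation fills each gap from the most similar rows.
        out[num_cols] = KNNImputer(n_neighbors=2).fit_transform(out[num_cols])
    else:
        # "mean" or "median": fill each column with its own statistic.
        out[num_cols] = out[num_cols].fillna(out[num_cols].agg(strategy))

    # IQR rule: clip values outside [Q1 - k*IQR, Q3 + k*IQR].
    q1 = out[num_cols].quantile(0.25)
    q3 = out[num_cols].quantile(0.75)
    iqr = q3 - q1
    out[num_cols] = out[num_cols].clip(
        lower=q1 - iqr_factor * iqr, upper=q3 + iqr_factor * iqr, axis=1
    )
    return out
```

A Z-score variant would replace the quantile bounds with `mean ± k * std`; the IQR form is shown because it is less sensitive to the very outliers it is trying to detect.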

Feature Engineering

  • ✅ DateTime Features: Extract year, month, day, weekday, hour, weekend flags
  • ✅ Text Features: Character count, word count, text statistics
  • ✅ Numeric Interactions: Create multiplication features between variables
  • ✅ Mathematical Transformations: Log and square root transformations
  • ✅ Binning: Convert continuous variables to categorical ranges
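
To make these feature types concrete, here is a small pandas sketch that derives one feature of each kind. The `engineer_features` function, its arguments, and the generated column names are all hypothetical; the app may name and select features differently.

```python
import numpy as np
import pandas as pd


def engineer_features(df, date_col, text_col, num_a, num_b):
    """Add datetime, text, interaction, transform, and binned features."""
    out = df.copy()

    # Datetime parts, including a weekend flag (Saturday=5, Sunday=6).
    dates = pd.to_datetime(out[date_col])
    out[f"{date_col}_year"] = dates.dt.year
    out[f"{date_col}_month"] = dates.dt.month
    out[f"{date_col}_weekday"] = dates.dt.weekday
    out[f"{date_col}_is_weekend"] = (dates.dt.weekday >= 5).astype(int)

    # Simple text statistics; missing text counts as empty.
    text = out[text_col].fillna("").astype(str)
    out[f"{text_col}_char_count"] = text.str.len()
    out[f"{text_col}_word_count"] = text.str.split().str.len()

    # Multiplicative interaction between two numeric columns.
    out[f"{num_a}_x_{num_b}"] = out[num_a] * out[num_b]

    # log1p and sqrt assume non-negative inputs.
    out[f"{num_a}_log"] = np.log1p(out[num_a])
    out[f"{num_a}_sqrt"] = np.sqrt(out[num_a])

    # Equal-width binning into three labelled ranges.
    out[f"{num_a}_bin"] = pd.cut(out[num_a], bins=3, labels=["low", "mid", "high"])
    return out
```

`np.log1p` is used instead of a plain logarithm so that zero values do not produce negative infinity.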

Machine Learning Preparation

  • ✅ Categorical Encoding: One-hot, label, and frequency encoding
  • ✅ Feature Scaling: Multiple scaling strategies for optimal performance
  • ✅ Feature Selection: Variance-based and correlation-based filtering
  • ✅ Dimensionality Reduction: PCA for linear reduction, t-SNE for visualization
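
The sketch below chains one option from each of these four stages using scikit-learn: one-hot encoding, variance and correlation filtering, standard scaling, and PCA. The `prepare_for_ml` function and its default thresholds are assumptions chosen for illustration, not values taken from the app.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler


def prepare_for_ml(df, corr_threshold=0.95, n_components=2):
    """One-hot encode, drop constant and highly correlated columns, scale, run PCA."""
    # One-hot encode every non-numeric column.
    encoded = pd.get_dummies(df, dtype=float)

    # Variance filter: drop columns whose variance is zero.
    selector = VarianceThreshold(threshold=0.0)
    selector.fit(encoded)
    kept = encoded.loc[:, selector.get_support()]

    # Correlation filter: for each highly correlated pair, drop the later column.
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    kept = kept.drop(columns=to_drop)

    # Standardize to zero mean and unit variance, then project with PCA.
    scaled = StandardScaler().fit_transform(kept)
    components = PCA(n_components=n_components).fit_transform(scaled)
    return kept.columns.tolist(), components
```

Scaling comes before PCA because PCA is driven by variance, so unscaled columns with large units would otherwise dominate the components.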

📖 How to Use

1. 📁 Upload Your Data

  • Click "Choose your data file"
  • Upload CSV or Excel files up to 200 MB
  • Preview your data structure and statistics

2. 🎯 Configure Processing

  • Select Target Column (optional): Choose what you want to predict
  • Choose Task Type: Classification, Regression, or Exploration
  • Select Preprocessing Steps: Enable the techniques you need

3. 🛠️ Customize Settings

  • Data Cleaning: Configure missing value strategies and outlier handling
  • Feature Engineering: Enable datetime, text, and interaction features
  • Encoding & Scaling: Choose optimal encoding and scaling methods
  • Advanced Options: Set up feature selection and dimensionality reduction

4. 🚀 Process & Download

  • Click "Start Preprocessing"
  • Monitor progress and view processing logs
  • Download your ML-ready dataset

💻 Local Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

# Clone the repository
git clone https://github.com/longway2go-ai/Advanced-Data-Preprocessing-App.git
cd Advanced-Data-Preprocessing-App

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py

The app will open in your browser at http://localhost:8501

📊 Example Use Cases

🏠 Real Estate Price Prediction

  • Input: Property data with mixed types (dates, categories, numbers)
  • Processing: Handle missing values, encode neighborhoods, scale prices
  • Output: Clean dataset ready for regression models
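
A workflow like this one can be approximated outside the app with a scikit-learn `ColumnTransformer`. The sketch below uses a tiny, made-up property table; the column names and values are hypothetical and exist only to show median imputation, one-hot encoding of neighborhoods, and standard scaling side by side.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical property data; column names are illustrative only.
homes = pd.DataFrame({
    "area_sqft": [850.0, 1200.0, np.nan, 1500.0],
    "bedrooms": [2, 3, 3, 4],
    "neighborhood": ["north", "south", "north", "east"],
})

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column group to its own transformer; force a dense result.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["area_sqft", "bedrooms"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
    ],
    sparse_threshold=0,
)

features = preprocess.fit_transform(homes)
```

The resulting `features` array has two scaled numeric columns followed by one indicator column per neighborhood, and can be fed directly to a regression model.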

📧 Email Spam Classification

  • Input: Email data with text content and metadata
  • Processing: Extract text features, encode categorical variables
  • Output: Encoded dataset ready for classification algorithms

🛒 Customer Segmentation

  • Input: Customer transaction data with timestamps
  • Processing: Extract datetime features, create interaction terms, apply PCA
  • Output: Reduced-dimension dataset suitable for clustering

📈 Stock Market Analysis

  • Input: Financial time series with multiple indicators
  • Processing: Handle outliers, create technical indicators, scale features
  • Output: Normalized dataset ready for time series modeling

🔧 Technical Specifications

Built With

  • Frontend: Streamlit with custom CSS styling
  • Backend: Python with pandas, scikit-learn, scipy
  • Visualization: Plotly, Matplotlib, Seaborn
  • Machine Learning: scikit-learn preprocessing and decomposition modules

Performance

  • Memory Efficient: Optimized for large datasets
  • Fast Processing: Vectorized operations with NumPy and pandas
  • Scalable: Works with datasets from small samples to 200 MB files

Browser Compatibility

  • βœ… Chrome (recommended)
  • βœ… Firefox
  • βœ… Safari
  • βœ… Edge

📚 API Reference

The app is built around a ComprehensiveDataPreprocessor class. The full pipeline and the individual steps can each be called directly:

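# `preprocessor` is an instance of ComprehensiveDataPreprocessor;
# `config_dict` is a dict of the preprocessing steps to apply.

# Main preprocessing pipeline: returns the processed DataFrame and a report
processed_df, report = preprocessor.process_data(
    file_path="data.csv",
    preprocessing_choices=config_dict,
    target_column="target_variable"
)

# Individual preprocessing steps, each applied to a DataFrame `df`
preprocessor.handle_missing_values_advanced(df)
preprocessor.encode_categorical_advanced(df)
preprocessor.apply_scaling(df)
preprocessor.dimensionality_reduction(df)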

🤝 Contributing

We welcome contributions! Here's how you can help:

  1. 🍴 Fork the repository
  2. 🌿 Create a feature branch: git checkout -b feature/amazing-feature
  3. 💾 Commit changes: git commit -m 'Add amazing feature'
  4. 📤 Push to branch: git push origin feature/amazing-feature
  5. 🔄 Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add docstrings for new functions
  • Test with various dataset types
  • Update README if adding new features

πŸ› Bug Reports

Found a bug? Please create an issue with:

  • Description: Clear description of the problem
  • Steps to Reproduce: Detailed steps to recreate the issue
  • Expected vs Actual: What should happen vs what actually happens
  • Dataset Info: File type, size, and structure (anonymized)
  • Browser/Environment: Your system details

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🌟 Acknowledgments

  • Streamlit Team: For the amazing framework
  • Scikit-learn Contributors: For comprehensive ML preprocessing tools
  • Plotly Team: For interactive visualization capabilities
  • Open Source Community: For the countless libraries that make this possible

Made with ❤️ for the Data Science Community by Arnab