Transform your raw data into machine learning-ready datasets with professional-grade preprocessing techniques in just a few clicks.
The Advanced Data Preprocessing App is a comprehensive tool that automates the tedious and time-consuming process of cleaning and preparing data for machine learning. Whether you're a data scientist, analyst, or ML engineer, this app saves you hours of manual preprocessing work.
- Intelligent Data Cleaning: Remove duplicates, handle missing values, fix data formats
- Advanced Feature Engineering: Extract datetime features, create interactions, apply transformations
- Smart Categorical Encoding: Automatic encoding strategies based on data characteristics
- Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler options
- Feature Selection: Remove low-variance and highly correlated features
- Dimensionality Reduction: PCA and t-SNE implementations with visualization
- Interactive Visualizations: Real-time plots and data exploration
- ML-Ready Export: Download processed datasets ready for any ML framework
- CSV files (.csv)
- Excel files (.xlsx, .xls)
- Maximum file size: 200MB
- Duplicate Removal: Eliminates identical rows
- Missing Value Handling: Mean, median, KNN imputation strategies
- Format Standardization: Date parsing, data type optimization
- Outlier Management: IQR and Z-score methods
- DateTime Features: Extract year, month, day, weekday, hour, weekend flags
- Text Features: Character count, word count, text statistics
- Numeric Interactions: Create multiplication features between variables
- Mathematical Transformations: Log and square root transformations
- Binning: Convert continuous variables to categorical ranges
- Categorical Encoding: One-hot, label, and frequency encoding
- Feature Scaling: Multiple scaling strategies for optimal performance
- Feature Selection: Variance-based and correlation-based filtering
- Dimensionality Reduction: PCA for linear reduction, t-SNE for visualization
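To make the cleaning steps above concrete, here is a minimal sketch of duplicate removal, median imputation, and IQR-based outlier clipping using plain pandas. It is not the app's actual code: the function name `clean_frame`, the `iqr_factor` parameter, and the sample columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def clean_frame(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: dedupe rows, impute numeric gaps, clip IQR outliers."""
    # Duplicate removal: keep the first copy of each identical row.
    out = df.drop_duplicates().reset_index(drop=True)

    for col in out.select_dtypes(include="number").columns:
        # Median imputation is less sensitive to extreme values than the mean.
        out[col] = out[col].fillna(out[col].median())

        # IQR rule: values beyond iqr_factor * IQR from the quartiles are clipped.
        q1, q3 = out[col].quantile([0.25, 0.75])
        spread = q3 - q1
        out[col] = out[col].clip(q1 - iqr_factor * spread, q3 + iqr_factor * spread)
    return out


# Illustrative data: one missing price, one extreme price, one duplicated row.
raw = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, np.nan, 13.0, 500.0, 10.0],
    "rooms": [2, 3, 2, 3, 4, 3, 2],
})
cleaned = clean_frame(raw)
print(cleaned)
```

Clipping keeps every row and only caps extreme values; dropping the offending rows instead is an equally valid choice when outliers are likely data errors.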
- Click "Choose your data file"
- Upload CSV or Excel files up to 200MB
- Preview your data structure and statistics
- Select Target Column (optional): Choose what you want to predict
- Choose Task Type: Classification, Regression, or Exploration
- Select Preprocessing Steps: Enable the techniques you need
- Data Cleaning: Configure missing value strategies and outlier handling
- Feature Engineering: Enable datetime, text, and interaction features
- Encoding & Scaling: Choose optimal encoding and scaling methods
- Advanced Options: Set up feature selection and dimensionality reduction
- Click "Start Preprocessing"
- Monitor progress and view processing logs
- Download your ML-ready dataset
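Outside the browser, the upload and preview steps above roughly correspond to reading the file with pandas and summarizing its structure. The sketch below assumes that mapping; `load_table` and `summarize` are hypothetical helper names, not functions from the app.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path):
    """Hypothetical loader: pick a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        # pandas needs an Excel engine installed (for example openpyxl for .xlsx).
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")


def summarize(df):
    """Return a basic preview: shape, column dtypes, and missing-value counts."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
    }


# Demo with a throwaway CSV file that has one missing price.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text("city,price\nOslo,10\nRome,\n")
    table = load_table(sample_path)
    summary = summarize(table)

print(summary)
```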
- Python 3.8 or higher
- pip package manager
# Clone the repository
git clone https://github.com/yourusername/data-preprocessing-app.git
cd data-preprocessing-app
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the app
streamlit run app.py
The app will open in your browser at http://localhost:8501
- Input: Property data with mixed types (dates, categories, numbers)
- Processing: Handle missing values, encode neighborhoods, scale prices
- Output: Clean dataset ready for regression models
- Input: Email data with text content and metadata
- Processing: Extract text features, encode categorical variables
- Output: Balanced dataset ready for classification algorithms
- Input: Customer transaction data with timestamps
- Processing: Extract datetime features, create interaction terms, apply PCA
- Output: Reduced-dimension dataset suited to clustering
- Input: Financial time series with multiple indicators
- Processing: Handle outliers, create technical indicators, scale features
- Output: Normalized dataset ready for time series modeling
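As an illustration of the customer-transaction workflow above (datetime features, an interaction term, then PCA), here is a minimal sketch with pandas and scikit-learn. The synthetic data, the column names, and the choice of two components are assumptions for the example, not the app's defaults.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic transactions spanning roughly eight weeks (illustrative only).
rng = np.random.default_rng(0)
n_rows = 120
transactions = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=n_rows, freq=pd.Timedelta(hours=11)),
    "amount": rng.gamma(2.0, 50.0, size=n_rows),
    "items": rng.integers(1, 10, size=n_rows),
})

ts = transactions["timestamp"]
features = pd.DataFrame({
    # Datetime features extracted through the .dt accessor.
    "month": ts.dt.month,
    "weekday": ts.dt.weekday,
    "hour": ts.dt.hour,
    "is_weekend": (ts.dt.weekday >= 5).astype(int),
    "amount": transactions["amount"],
    "items": transactions["items"],
    # Interaction term: product of two numeric columns.
    "amount_x_items": transactions["amount"] * transactions["items"],
})

# PCA is scale-sensitive, so standardize before projecting.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print(reduced.shape)
print(pca.explained_variance_ratio_)
```

The two-column `reduced` array can be passed directly to a clustering algorithm such as `sklearn.cluster.KMeans`.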
- Frontend: Streamlit with custom CSS styling
- Backend: Python with pandas, scikit-learn, scipy
- Visualization: Plotly, Matplotlib, Seaborn
- Machine Learning: scikit-learn preprocessing and decomposition modules
- Memory Efficient: Optimized for large datasets
- Fast Processing: Vectorized operations with NumPy and pandas
- Scalable: Works with datasets from small samples to 200MB files
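For a sense of how the scikit-learn and pandas pieces listed above fit together, the sketch below chains one-hot encoding, variance-based and correlation-based filtering, and standard scaling. It is an assumption-laden illustration rather than the app's pipeline: the sample columns and the 0.95 correlation cutoff are chosen only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Illustrative property data with a constant column and a redundant column.
df = pd.DataFrame({
    "neighborhood": ["north", "south", "north", "east", "south", "north"],
    "area": [50.0, 80.0, 65.0, 120.0, 95.0, 70.0],
    "area_sqft": [538.0, 861.0, 700.0, 1292.0, 1023.0, 753.0],  # near-copy of area
    "constant": [1.0] * 6,
})

# One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["neighborhood"], dtype=float)

# Variance-based filtering: drop columns that never change.
selector = VarianceThreshold(threshold=0.0)
selector.fit(encoded)
kept = encoded.loc[:, selector.get_support()]

# Correlation-based filtering: look only at the upper triangle so that
# one column from each highly correlated pair is dropped, not both.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = kept.drop(columns=to_drop)

# Standardize the remaining features to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(selected),
    columns=selected.columns,
    index=selected.index,
)
print(scaled.columns.tolist())
```

`MinMaxScaler` or `RobustScaler` can be swapped in for `StandardScaler` without changing the rest of the sketch.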
- Chrome (recommended)
- Firefox
- Safari
- Edge
The app is built around a ComprehensiveDataPreprocessor class with methods for:
# Main preprocessing pipeline
processed_df, report = preprocessor.process_data(
    file_path="data.csv",
    preprocessing_choices=config_dict,
    target_column="target_variable"
)
# Individual preprocessing steps
preprocessor.handle_missing_values_advanced(df)
preprocessor.encode_categorical_advanced(df)
preprocessor.apply_scaling(df)
preprocessor.dimensionality_reduction(df)
We welcome contributions! Here's how you can help:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Commit changes: git commit -m 'Add amazing feature'
- Push to branch: git push origin feature/amazing-feature
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add docstrings for new functions
- Test with various dataset types
- Update README if adding new features
Found a bug? Please create an issue with:
- Description: Clear description of the problem
- Steps to Reproduce: Detailed steps to recreate the issue
- Expected vs Actual: What should happen vs what actually happens
- Dataset Info: File type, size, and structure (anonymized)
- Browser/Environment: Your system details
This project is licensed under the MIT License - see the LICENSE file for details.
- Streamlit Team: For the amazing framework
- Scikit-learn Contributors: For comprehensive ML preprocessing tools
- Plotly Team: For interactive visualization capabilities
- Open Source Community: For the countless libraries that make this possible
Made with ❤️ for the Data Science Community by Arnab