TELUGUSCRIPTER/Day-5

📊 Python & Pandas Assignment: DataFrame Column Selection by Data Type


🏫 Assignment Title Page

  • Course: Data Analytics with Python & Pandas
  • Topic: Column Filtering using pandas.DataFrame.select_dtypes()
  • Assignment ID: PY-PD-01
  • Level: Beginner to Intermediate
  • Author: Gowtham (Workspace: Assigenmt - 1)
  • Date: May 2026

📝 1. Introduction

When working with real-world datasets, data often arrives in mixed formats containing numbers, strings, dates, and boolean flags. Before performing any mathematical computations, machine learning training, or text processing, a data scientist must isolate columns of specific types.

The Pandas library in Python provides a highly efficient, built-in method called pandas.DataFrame.select_dtypes() to solve this problem. Instead of manually looping over columns and checking their types one-by-one, select_dtypes() allows you to subset a DataFrame based on column data types using high-level groups or specific type classes.
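As a rough illustration of the contrast described above, the sketch below selects numeric columns twice: once with a manual loop and once with select_dtypes(). The three-column frame is a made-up example for this comparison, and pandas.api.types.is_numeric_dtype is only one possible way to write the manual check.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small illustrative frame (assumed data, not the assignment's dataset)
df = pd.DataFrame({
    "Age": [29, 34, 45],
    "PurchaseAmount": [150.50, 89.20, 420.75],
    "Name": ["Alice", "Bob", "Charlie"],
})

# Manual approach: test each column's dtype one by one
manual_cols = [col for col in df.columns if is_numeric_dtype(df[col])]

# Same selection in a single call
auto_cols = list(df.select_dtypes(include="number").columns)

print(manual_cols)
print(auto_cols)
```

For this frame both approaches pick the same two columns; the select_dtypes() version simply states the intent more directly.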


🎯 2. Objective

The primary objectives of this assignment are to:

  1. Understand how Pandas represents and tracks data types (dtypes) within a DataFrame.
  2. Master the usage of the pandas.DataFrame.select_dtypes() method.
  3. Learn how to selectively include or exclude columns based on their data types (numeric, object/string, and boolean).
  4. Build a professional, GitHub-ready Python project that showcases clean, well-commented code.

📖 3. Theory & Explanation

In Pandas, a DataFrame is composed of one or more Series (columns). Each column has an associated data type (dtype), which dictates what operations can be performed on it.

Common Pandas Data Types:

  • int64 / int32: Integer numbers (e.g., 29, 45, 100).
  • float64 / float32: Floating-point decimal numbers (e.g., 150.50, 89.20).
  • object: Generally represents text / strings, or mixed python objects.
  • bool: Boolean values (True or False).
  • datetime64: Date and time values.
  • category: A fixed set of possible values, often text labels (memory-efficient for columns with many repeated values).

The select_dtypes() method returns a subset of the DataFrame's columns based on the column data types. It checks each column's dtype against the criteria specified in the arguments and returns a new DataFrame containing only the matching columns.
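To see which dtype Pandas has assigned to each column before filtering, you can inspect the dtypes attribute. The frame below is a small assumed example, not the assignment's dataset.

```python
import pandas as pd

# Illustrative frame with one integer, one float, and one boolean column
df = pd.DataFrame({
    "Age": [29, 34, 45],
    "PurchaseAmount": [150.50, 89.20, 420.75],
    "IsPremium": [True, False, True],
})

# One dtype per column, returned as a Series indexed by column name
print(df.dtypes)

# The dtype of a single column (a Series)
print(df["PurchaseAmount"].dtype)
```

These per-column dtypes are what select_dtypes() compares against its include and exclude arguments.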


⚙️ 4. Syntax Explanation

The syntax of the select_dtypes() method is simple and expressive:

DataFrame.select_dtypes(include=None, exclude=None)

Both parameters default to None, but at least one of them must be supplied; calling select_dtypes() with neither raises a ValueError.
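A minimal sketch of that rule, using a two-column frame made up for the purpose. It also shows the related case where the same type appears in both include and exclude, which Pandas rejects as well.

```python
import pandas as pd

# Tiny illustrative frame (assumed data)
df = pd.DataFrame({"Age": [29, 34], "PurchaseAmount": [150.50, 89.20]})

# Case 1: neither include nor exclude is given
try:
    df.select_dtypes()
    empty_call_failed = False
except ValueError as exc:
    empty_call_failed = True
    print(f"ValueError: {exc}")

# Case 2: the same type is requested in both include and exclude
try:
    df.select_dtypes(include="number", exclude="number")
    overlap_failed = False
except ValueError as exc:
    overlap_failed = True
    print(f"ValueError: {exc}")
```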


📋 5. Parameters Explanation

Parameter | Type | Description
include | scalar or list-like | A selection of dtypes or strings to be included in the result. At least one of include or exclude must be supplied.
exclude | scalar or list-like | A selection of dtypes or strings to be excluded from the result.

Valid Data Type Inputs (Strings or Objects):

  • To select numeric types, use 'number' or np.number (includes all integers and floats).
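  • To select strings, use 'object'. Note that this returns every object-dtype column, not only strings, and that pandas versions which store text in a dedicated string dtype may need 'string' or 'str' instead.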
  • To select booleans, use 'bool'.
  • To select datetimes, use 'datetime' or 'datetime64'.
  • To select categories, use 'category'.

⚠️ Note: You can pass lists of types, such as include=['int64', 'float64'] or include=['number', 'bool'].
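The sketch below tries several of these inputs on a small assumed frame. The Signup column is hypothetical and is included only so there is a datetime column to select; the last call combines include and exclude to keep integer columns only.

```python
import numpy as np
import pandas as pd

# Illustrative frame; "Signup" is a hypothetical datetime column
df = pd.DataFrame({
    "CustomerID": [101, 102],
    "PurchaseAmount": [150.50, 89.20],
    "IsPremium": [True, False],
    "Signup": pd.to_datetime(["2026-01-05", "2026-02-10"]),
})

# A list of type names: numeric columns plus boolean flags
numeric_and_flags = df.select_dtypes(include=["number", "bool"])

# The NumPy class np.number behaves like the string 'number'
same_as_number = df.select_dtypes(include=np.number)

# Datetime columns only
dates_only = df.select_dtypes(include="datetime")

# include and exclude together: numeric columns that are not float64
ints_only = df.select_dtypes(include="number", exclude="float64")

for name, frame in [
    ("numeric_and_flags", numeric_and_flags),
    ("same_as_number", same_as_number),
    ("dates_only", dates_only),
    ("ints_only", ints_only),
]:
    print(name, list(frame.columns))
```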


💻 6. Example Programs & Outputs

Here is how select_dtypes() works across different scenarios. The following examples represent the actual code implemented in main.py.

🗃️ Sample Dataset

We define a sample customer database:

import pandas as pd

# Sample customer database with integer, float, text, and boolean columns
data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Evan Wright'],
    'Age': [29, 34, 45, 28, 52],
    'PurchaseAmount': [150.50, 89.20, 420.75, 310.00, 75.60],
    'IsPremium': [True, False, True, True, False],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Australia'],
    'NewsletterSubscribed': [True, True, False, True, False]
}

# Build the DataFrame used by every example in this section
df = pd.DataFrame(data)

🔹 Example 1: Selecting Numeric Columns (include='number')

Extracts integer and floating-point columns. Ideal before computing mathematical averages, sums, or correlations.

numeric_df = df.select_dtypes(include='number')

Output:

   CustomerID  Age  PurchaseAmount
0         101   29          150.50
1         102   34           89.20
2         103   45          420.75
3         104   28          310.00
4         105   52           75.60

🔹 Example 2: Selecting String / Object Columns (include='object')

Extracts text columns. Useful when cleaning strings, stripping whitespace, or encoding categorical variables for machine learning.

object_df = df.select_dtypes(include='object')

Output:

            Name    Country
0  Alice Johnson        USA
1      Bob Smith     Canada
2  Charlie Brown         UK
3   Diana Prince        USA
4    Evan Wright  Australia

🔹 Example 3: Selecting Boolean Columns (include='bool')

Extracts truth-value columns. Perfect for filtering customers, calculating conversion rates, or performing logical flag checks.

boolean_df = df.select_dtypes(include='bool')

Output:

   IsPremium  NewsletterSubscribed
0       True                  True
1      False                  True
2       True                 False
3       True                  True
4      False                 False

🔹 Example 4: Excluding Columns (exclude='number')

Extracts all columns except the specified ones. In this case, we exclude all numeric types to keep only descriptive/categorical properties.

non_numeric_df = df.select_dtypes(exclude='number')

Output:

            Name  IsPremium    Country  NewsletterSubscribed
0  Alice Johnson       True        USA                  True
1      Bob Smith      False     Canada                  True
2  Charlie Brown       True         UK                 False
3   Diana Prince       True        USA                  True
4    Evan Wright      False  Australia                 False

🏒 7. Real-World Use Case

Imagine you are a Junior Data Scientist at an e-commerce platform preparing a raw transactions table for a Machine Learning Model.

  1. Model Inputs: Many Machine Learning algorithms (like Linear Regression or XGBoost) expect numeric input. You can use .select_dtypes(include='number') to pull out columns like Age and PurchaseAmount for the model. Note that identifier columns such as CustomerID are numeric too, so they are selected as well and usually need to be dropped by name.
  2. Feature Engineering: You want to convert customer strings (Country) into dummy indicators (one-hot encoding). You isolate these variables using .select_dtypes(include='object') and then apply the encoding step to them.
  3. Auditing: You need to ensure no free-text fields or raw labels slip into a mathematical calculation step. .select_dtypes(exclude='object') removes those columns up front, although boolean columns remain and may still need separate handling.
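The three steps above can be sketched as follows. This is only an illustration on a shortened, assumed version of the sample data, not the project's actual pipeline. Text columns are isolated here with exclude=['number', 'bool'] rather than include='object', an assumption made so the selection does not depend on how the installed pandas version stores strings.

```python
import pandas as pd

# Shortened, assumed version of the customer table
df = pd.DataFrame({
    "CustomerID": [101, 102, 103],
    "Name": ["Alice Johnson", "Bob Smith", "Charlie Brown"],
    "Age": [29, 34, 45],
    "PurchaseAmount": [150.50, 89.20, 420.75],
    "IsPremium": [True, False, True],
    "Country": ["USA", "Canada", "UK"],
})

# 1. Model inputs: numeric columns, minus the identifier
numeric_features = df.select_dtypes(include="number").drop(columns=["CustomerID"])

# 2. Feature engineering: isolate text columns, then one-hot encode Country
text_columns = df.select_dtypes(exclude=["number", "bool"])
country_dummies = pd.get_dummies(text_columns["Country"], prefix="Country")

# 3. Auditing: the combined model input should contain no text columns
model_input = pd.concat([numeric_features, country_dummies], axis=1)
leftover_text = model_input.select_dtypes(exclude=["number", "bool"])

print(list(model_input.columns))
print(leftover_text.shape[1])  # number of remaining text columns
```

Name is left out of the model input on purpose: it is free text with one distinct value per row, so one-hot encoding it would add nothing useful.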

🌟 8. Advantages of select_dtypes()

  1. Readability & Elegance: Replaces long loops and complex conditional statements with one clear, readable line of code.
  2. Speed & Efficiency: The selection is decided from each column's dtype metadata rather than by inspecting individual values, so the check scales with the number of columns and avoids hand-written Python loops over the data.
  3. Dynamic Flexibility: If your schema updates (e.g., a new numeric column is added to the source database), select_dtypes() handles it automatically without hardcoded column index adjustments.
  4. Error Prevention: Safeguards downstream processes (e.g., you won't accidentally try to compute the mathematical average of Name).
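A small sketch of the "Dynamic Flexibility" and "Error Prevention" points. The helper numeric_means and the LoyaltyPoints column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Illustrative frame (assumed data)
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [29, 34],
    "PurchaseAmount": [150.50, 89.20],
})

def numeric_means(frame):
    """Average every numeric column; text columns such as Name are skipped."""
    return frame.select_dtypes(include="number").mean()

before = numeric_means(df)

# Schema change: a new numeric column appears (hypothetical column)
df["LoyaltyPoints"] = [120, 45]

# The same helper picks it up without any code change
after = numeric_means(df)

print(list(before.index))
print(list(after.index))
```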

🚀 9. Step-by-Step Execution Guide

Follow these steps to run the project locally on your machine.

🔌 Prerequisites

Ensure you have Python 3.8+ installed on your system.

📥 Step 1: Install Pandas

Run the following pip command in your terminal to install the necessary library:

pip install -r requirements.txt

Alternatively, you can install pandas directly:

pip install pandas

⚑ Step 2: Run the Project

Execute the main program using Python:

python main.py

🐙 10. Git & GitHub Pushing Instructions

Upload this professional project to your GitHub account to showcase your skills:

Step 1: Initialize Git Repository

git init

Step 2: Add Files to Staging Area

git add .

Step 3: Commit Changes

git commit -m "Initial commit"

Step 4: Rename Branch to main

git branch -M main

Step 5: Add Remote Repository

Replace <repo-url> with your actual GitHub repository URL (e.g., https://github.com/username/pandas-select-dtypes.git):

git remote add origin <repo-url>

Step 6: Push to GitHub

git push -u origin main

🏁 11. Conclusion

The pandas.DataFrame.select_dtypes() method is a fundamental utility in every data wrangler's toolkit. It streamlines the data preprocessing lifecycle, improves code clarity, and allows dynamic operations on columns based on their functional characteristics. Mastering this method is a key milestone for any aspiring Python developer or data scientist!

About

Data Cleaning and Pandas Fundamental
