- Course: Data Analytics with Python & Pandas
- Topic: Column Filtering using `pandas.DataFrame.select_dtypes()`
- Assignment ID: PY-PD-01
- Level: Beginner to Intermediate
- Author: Gowtham (Workspace: Assigenmt - 1)
- Date: May 2026
When working with real-world datasets, data often arrives in mixed formats containing numbers, strings, dates, and boolean flags. Before performing any mathematical computations, machine learning training, or text processing, a data scientist must isolate columns of specific types.
The Pandas library in Python provides a highly efficient, built-in method called pandas.DataFrame.select_dtypes() to solve this problem. Instead of manually looping over columns and checking their types one-by-one, select_dtypes() allows you to subset a DataFrame based on column data types using high-level groups or specific type classes.
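To make the contrast with manual looping concrete, here is a minimal sketch. The three-column frame is hypothetical and exists only for this comparison; `is_numeric_dtype` comes from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical three-column frame, used only for this comparison.
df = pd.DataFrame({
    "Age": [29, 34, 45],
    "PurchaseAmount": [150.50, 89.20, 420.75],
    "Name": ["Alice", "Bob", "Charlie"],
})

# Manual approach: loop over the columns and test each dtype.
numeric_cols = []
for col in df.columns:
    if is_numeric_dtype(df[col]):
        numeric_cols.append(col)
manual = df[numeric_cols]

# The same subset in a single call.
selected = df.select_dtypes(include="number")

print(manual.columns.tolist())    # ['Age', 'PurchaseAmount']
print(selected.columns.tolist())  # ['Age', 'PurchaseAmount']
```

The two checks are not identical in general: `is_numeric_dtype` also treats boolean columns as numeric, whereas `'number'` does not select them. They agree here because the frame has no boolean column.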
The primary objectives of this assignment are to:
- Understand how Pandas represents and tracks data types (`dtypes`) within a DataFrame.
- Master the usage of the `pandas.DataFrame.select_dtypes()` method.
- Learn how to selectively include or exclude columns based on their data types (numeric, object/string, and boolean).
- Build a professional, GitHub-ready Python project that showcases clean, well-commented code.
In Pandas, a DataFrame is composed of one or more Series (columns). Each column has an associated data type (dtype), which dictates what operations can be performed on it.
- `int64` / `int32`: Integer numbers (e.g., 29, 45, 100).
- `float64` / `float32`: Floating-point decimal numbers (e.g., 150.50, 89.20).
- `object`: Generally represents text/strings or mixed Python objects.
- `bool`: Boolean values (`True` or `False`).
- `datetime64`: Date and time values.
- `category`: A finite set of text values (efficient for repeated values).
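A quick way to see these dtypes is to inspect the `dtypes` attribute. The sketch below uses a hypothetical frame (the `SignupDate` and `Tier` columns are illustrative and not part of the assignment data) covering several of the types listed above.

```python
import pandas as pd

# Hypothetical frame with one column for several of the dtype families above.
df = pd.DataFrame({
    "Age": [29, 34, 45],
    "PurchaseAmount": [150.50, 89.20, 420.75],
    "IsPremium": [True, False, True],
    "SignupDate": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-15"]),
    "Tier": pd.Categorical(["gold", "silver", "gold"]),
})

# One dtype per column; this is what select_dtypes() matches against.
print(df.dtypes)

# A single column's dtype can be read directly.
print(df["Age"].dtype)             # int64
print(df["PurchaseAmount"].dtype)  # float64
```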
The `select_dtypes()` method returns a subset of the DataFrame's columns based on the column data types. It checks each column's dtype against the criteria specified in the arguments and returns a new DataFrame with the matching columns.
The syntax of the select_dtypes() method is simple and expressive:
DataFrame.select_dtypes(include=None, exclude=None)

Both arguments are optional, but at least one of them must be provided.
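The sketch below illustrates that rule on a hypothetical two-column frame. `raises_value_error` is a helper written only for this example; it is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"Age": [29, 34], "IsPremium": [True, False]})

def raises_value_error(**kwargs):
    """Hypothetical helper: report whether select_dtypes rejects the arguments."""
    try:
        df.select_dtypes(**kwargs)
    except ValueError:
        return True
    return False

# Calling with neither argument is rejected.
print(raises_value_error())                                  # True

# The same dtype cannot be both included and excluded.
print(raises_value_error(include="bool", exclude="bool"))    # True

# Supplying just one of the two arguments is fine.
print(raises_value_error(include="bool"))                    # False
```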
| Parameter | Type | Description |
|---|---|---|
| `include` | scalar or list-like | A selection of dtypes or strings to be included in the result. At least one of `include` or `exclude` must be supplied. |
| `exclude` | scalar or list-like | A selection of dtypes or strings to be excluded from the result. |
- To select numeric types, use `'number'` or `np.number` (includes all integers and floats).
- To select strings, use `'object'` (this also returns any other object-dtype columns).
- To select booleans, use `'bool'`.
- To select datetimes, use `'datetime'` or `'datetime64'`.
- To select categories, use `'category'`.

⚠️ Note: You can pass lists of types, such as `include=['int64', 'float64']` or `include=['number', 'bool']`.
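As a rough illustration of these selectors, the following sketch applies a list, a combined `include`/`exclude`, and the datetime and category aliases. The frame is hypothetical; `SignupDate` and `Tier` are illustrative columns that do not appear in the assignment data.

```python
import pandas as pd

# Hypothetical frame; column names are illustrative only.
df = pd.DataFrame({
    "Age": [29, 34, 45],
    "PurchaseAmount": [150.50, 89.20, 420.75],
    "IsPremium": [True, False, True],
    "SignupDate": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-15"]),
    "Tier": pd.Categorical(["gold", "silver", "gold"]),
})

# A list selects the union of the listed types.
num_or_bool = df.select_dtypes(include=["number", "bool"])
print(num_or_bool.columns.tolist())  # ['Age', 'PurchaseAmount', 'IsPremium']

# include and exclude can be combined: numeric columns, minus floats.
ints_only = df.select_dtypes(include="number", exclude="float64")
print(ints_only.columns.tolist())    # ['Age']

# Datetime and categorical columns have their own aliases.
dates = df.select_dtypes(include="datetime")
cats = df.select_dtypes(include="category")
print(dates.columns.tolist())        # ['SignupDate']
print(cats.columns.tolist())         # ['Tier']
```

Note that `'number'` does not pick up the boolean column; it has to be requested separately, as in the first call.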
Here is how select_dtypes() works across different scenarios. The following examples represent the actual code implemented in main.py.
We define a sample customer database:
import pandas as pd

data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Evan Wright'],
    'Age': [29, 34, 45, 28, 52],
    'PurchaseAmount': [150.50, 89.20, 420.75, 310.00, 75.60],
    'IsPremium': [True, False, True, True, False],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Australia'],
    'NewsletterSubscribed': [True, True, False, True, False]
}
df = pd.DataFrame(data)  # the DataFrame used in every example below

Passing `include='number'` extracts integer and floating-point columns. This is ideal before computing averages, sums, or correlations.
numeric_df = df.select_dtypes(include='number')

Output:
CustomerID Age PurchaseAmount
0 101 29 150.50
1 102 34 89.20
2 103 45 420.75
3 104 28 310.00
4 105 52 75.60
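A minimal sketch of the follow-up step, computing aggregates on the numeric subset. Dropping `CustomerID` is an assumption made here: it is stored as an integer but acts as an identifier, so its average carries no meaning.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105],
    "Name": ["Alice Johnson", "Bob Smith", "Charlie Brown", "Diana Prince", "Evan Wright"],
    "Age": [29, 34, 45, 28, 52],
    "PurchaseAmount": [150.50, 89.20, 420.75, 310.00, 75.60],
    "IsPremium": [True, False, True, True, False],
})

numeric_df = df.select_dtypes(include="number")

# CustomerID is numeric but is an identifier, so drop it before aggregating.
measures = numeric_df.drop(columns="CustomerID")

means = measures.mean()   # per-column average
totals = measures.sum()   # per-column total
print(means)
print(totals)
```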
Passing `include='object'` extracts text columns. This is useful when cleaning strings, stripping whitespace, or encoding categorical variables for machine learning.

object_df = df.select_dtypes(include='object')

Output:
Name Country
0 Alice Johnson USA
1 Bob Smith Canada
2 Charlie Brown UK
3 Diana Prince USA
4 Evan Wright Australia
Passing `include='bool'` extracts truth-value columns. This is useful for filtering customers, calculating conversion rates, or performing logical flag checks.

boolean_df = df.select_dtypes(include='bool')

Output:
IsPremium NewsletterSubscribed
0 True True
1 False True
2 True False
3 True True
4 False False
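As one possible follow-up, the boolean subset can be turned into rates and filters. This sketch relies on the fact that the mean of a boolean column is the share of `True` values.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105],
    "IsPremium": [True, False, True, True, False],
    "NewsletterSubscribed": [True, True, False, True, False],
})

boolean_df = df.select_dtypes(include="bool")

# Mean of a boolean column = fraction of rows that are True.
rates = boolean_df.mean()
print(rates)

# A boolean column can also be used directly as a row filter.
premium_ids = df.loc[df["IsPremium"], "CustomerID"].tolist()
print(premium_ids)  # [101, 103, 104]
```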
Passing `exclude=...` extracts all columns except those of the specified types. In this case, we exclude all numeric types to keep only descriptive/categorical properties.

non_numeric_df = df.select_dtypes(exclude='number')

Output:
Name IsPremium Country NewsletterSubscribed
0 Alice Johnson True USA True
1 Bob Smith False Canada True
2 Charlie Brown True UK False
3 Diana Prince True USA True
4 Evan Wright False Australia False
Imagine you are a junior data scientist at an e-commerce platform preparing a raw transactions table for a machine learning model.
- Model Inputs: Machine learning algorithms (like linear regression or XGBoost) generally expect numeric inputs. You can use `.select_dtypes(include='number')` to feed columns like `Age` and `PurchaseAmount` into the model.
- Feature Engineering: You want to convert customer strings (`Country`) into dummy indicators (one-hot encoding). You isolate these variables using `.select_dtypes(include='object')` before applying the preprocessing step.
- Auditing: You need to ensure no private text fields or raw labels slip into a mathematical calculation step. `.select_dtypes(exclude='object')` guards against such errors.
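The three steps above might be sketched as follows. This is an illustration under stated assumptions, not the project's actual pipeline: `CustomerID` and `Name` are dropped as non-features, `pd.get_dummies` stands in for whatever encoder is really used, and the text columns are found with `exclude=['number', 'bool']` so the result does not depend on how the installed pandas version stores strings.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105],
    "Name": ["Alice Johnson", "Bob Smith", "Charlie Brown", "Diana Prince", "Evan Wright"],
    "Age": [29, 34, 45, 28, 52],
    "PurchaseAmount": [150.50, 89.20, 420.75, 310.00, 75.60],
    "Country": ["USA", "Canada", "UK", "USA", "Australia"],
})

# Model inputs: numeric columns, minus the identifier (assumed not a feature).
numeric_features = df.select_dtypes(include="number").drop(columns="CustomerID")

# Feature engineering: isolate the text columns, drop free-text Name,
# and one-hot encode what remains (Country).
text_df = df.select_dtypes(exclude=["number", "bool"])
country_dummies = pd.get_dummies(text_df.drop(columns="Name"))

# Assemble the feature matrix.
X = pd.concat([numeric_features, country_dummies], axis=1)

# Auditing: confirm that nothing non-numeric slipped through.
leftovers = X.select_dtypes(exclude=["number", "bool"])
print(X.columns.tolist())
print(leftovers.shape[1])  # 0
```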
- Readability & Elegance: Replaces long loops and complex conditional statements with one clear, readable line of code.
- Speed & Efficiency: The check is made against each column's dtype rather than its individual values, so it avoids element-by-element Python loops.
- Dynamic Flexibility: If your schema updates (e.g., a new numeric column is added to the source database), `select_dtypes()` handles it automatically without hardcoded column index adjustments.
- Error Prevention: Safeguards downstream processes (e.g., you won't accidentally try to compute the mathematical average of `Name`).
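The flexibility point can be illustrated with a small sketch. `numeric_summary` and the `LoyaltyPoints` column are hypothetical names introduced only for this example.

```python
import pandas as pd

def numeric_summary(frame):
    """Hypothetical helper: mean of every numeric column, whatever columns exist."""
    return frame.select_dtypes(include="number").mean()

df = pd.DataFrame({
    "Name": ["Alice Johnson", "Bob Smith", "Charlie Brown"],
    "Age": [29, 34, 45],
    "PurchaseAmount": [150.50, 89.20, 420.75],
})
before = numeric_summary(df)

# Schema change: a new numeric column appears in the source data.
df["LoyaltyPoints"] = [120, 80, 300]
after = numeric_summary(df)

print(before.index.tolist())  # ['Age', 'PurchaseAmount']
print(after.index.tolist())   # ['Age', 'PurchaseAmount', 'LoyaltyPoints']
```

The helper needed no changes to pick up the new column, and `Name` never reaches the `mean()` call.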
Follow these steps to run the project locally on your machine.
Ensure you have Python 3.8+ installed on your system.
Run the following pip command in your terminal to install the necessary library:
pip install -r requirements.txt

Alternatively, you can install pandas directly:

pip install pandas

Execute the main program using Python:

python main.py

Upload this professional project to your GitHub account to showcase your skills:

git init
git add .
git commit -m "Initial commit"
git branch -M main

Replace `<repo-url>` with your actual GitHub repository URL (e.g., https://github.com/username/pandas-select-dtypes.git):

git remote add origin <repo-url>
git push -u origin main

The `pandas.DataFrame.select_dtypes()` method is a fundamental utility in every data wrangler's toolkit. It streamlines the data preprocessing lifecycle, improves code clarity, and allows dynamic operations on columns based on their functional characteristics. Mastering this method is a key milestone for any aspiring Python developer or data scientist!