Pandas Series

Subsetting a DataFrame in Pandas

Learn different techniques to filter and subset pandas DataFrames efficiently

Subsetting a DataFrame in Pandas

Master the art of data filtering with practical examples using the Iris dataset. Learn different techniques to filter and subset pandas DataFrames efficiently for better data analysis workflows.

Importing Packages and Datasets

import pandas as pd

# Fetching data from URL as CSV by mentioning values of various parameters
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header=None,
                   index_col=False,
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'iris_class'])

# Unique classes of iris datasets
data.iris_class.unique()

Output:

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Subsetting Techniques

Basic Filtering

data_setosa = data[data.iris_class == 'Iris-setosa']
data_versicolor = data[data.iris_class == 'Iris-versicolor']
data_virginica = data[data.iris_class == 'Iris-virginica']

Key Insights

Now we can look at descriptive statistics summary for each subset and make inferences:

  • Each subset is of the same size (50 records)
  • Average Sepal and Petal Length is lowest in setosa and highest in virginica
  • This demonstrates clear species differentiation in the dataset

Descriptive Statistics

Iris Setosa

data_setosa.describe().T
Metric count mean std min 25% 50% 75% max
sepal_length 50.0 5.006 0.352490 4.3 4.800 5.0 5.200 5.8
sepal_width 50.0 3.418 0.381024 2.3 3.125 3.4 3.675 4.4
petal_length 50.0 1.464 0.173511 1.0 1.400 1.5 1.575 1.9
petal_width 50.0 0.244 0.107210 0.1 0.200 0.2 0.300 0.6

Iris Versicolor

data_versicolor.describe().T
Metric count mean std min 25% 50% 75% max
sepal_length 50.0 5.936 0.516171 4.9 5.600 5.90 6.3 7.0
sepal_width 50.0 2.770 0.313798 2.0 2.525 2.80 3.0 3.4
petal_length 50.0 4.260 0.469911 3.0 4.000 4.35 4.6 5.1
petal_width 50.0 1.326 0.197753 1.0 1.200 1.30 1.5 1.8

Iris Virginica

data_virginica.describe().T
Metric count mean std min 25% 50% 75% max
sepal_length 50.0 6.588 0.635880 4.9 6.225 6.50 6.900 7.9
sepal_width 50.0 2.974 0.322497 2.2 2.800 3.00 3.175 3.8
petal_length 50.0 5.552 0.551895 4.5 5.100 5.55 5.875 6.9
petal_width 50.0 2.026 0.274650 1.4 1.800 2.00 2.300 2.5

Advanced Filtering Patterns

Multiple Conditions

# Filter with multiple conditions using boolean operators
large_setosa = data[(data.iris_class == 'Iris-setosa') & (data.sepal_length > 5.0)]

# Using query method (more readable for complex conditions)
large_setosa_query = data.query("iris_class == 'Iris-setosa' and sepal_length > 5.0")

# Multiple OR conditions
large_flowers = data[(data.sepal_length > 6.5) | (data.petal_length > 5.0)]

Using .loc and .iloc for Advanced Selection

# Label-based selection with .loc
setosa_sepal_data = data.loc[data.iris_class == 'Iris-setosa', ['sepal_length', 'sepal_width']]

# Position-based selection with .iloc
first_10_rows = data.iloc[:10, :]

# Combining boolean indexing with .loc
large_petals = data.loc[(data.petal_length > 4.0) & (data.petal_width > 1.5)]

String-based Filtering

# Filter by string patterns
setosa_variants = data[data.iris_class.str.contains('setosa')]

# Case-insensitive filtering
setosa_case_insensitive = data[data.iris_class.str.lower().str.contains('setosa')]

# Filter by string length
long_names = data[data.iris_class.str.len() > 12]

Numerical Range Filtering

# Filter by numerical ranges
medium_sepal_length = data[data.sepal_length.between(5.0, 6.0)]

# Using quantiles for filtering
q75 = data.sepal_length.quantile(0.75)
top_quartile = data[data.sepal_length >= q75]

# Filter outliers using IQR method
Q1 = data.sepal_length.quantile(0.25)
Q3 = data.sepal_length.quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data.sepal_length < (Q1 - 1.5 * IQR)) | 
                (data.sepal_length > (Q3 + 1.5 * IQR))]

Performance Tips and Best Practices

1. Use Vectorized Operations

# Good: Vectorized operation
filtered_data = data[data.sepal_length > 5.0]

# Avoid: Loop-based filtering (slow)
# filtered_rows = []
# for idx, row in data.iterrows():
#     if row['sepal_length'] > 5.0:
#         filtered_rows.append(row)

2. Chain Conditions Efficiently

# Efficient chaining with parentheses
complex_filter = data[
    (data.iris_class == 'Iris-setosa') & 
    (data.sepal_length > 5.0) & 
    (data.petal_width < 0.3)
]

# Use query() for very complex conditions
complex_query = data.query(
    "iris_class == 'Iris-setosa' and sepal_length > 5.0 and petal_width < 0.3"
)

3. Memory-Efficient Filtering

# For large datasets, consider using categorical data types
data_categorical = data.copy()
data_categorical['iris_class'] = data_categorical['iris_class'].astype('category')

# This reduces memory usage for string columns with repeated values
print(f"Original memory usage: {data.memory_usage(deep=True).sum()} bytes")
print(f"Categorical memory usage: {data_categorical.memory_usage(deep=True).sum()} bytes")

4. Index-based Filtering for Performance

# Set index for faster filtering on frequently used columns
data_indexed = data.set_index('iris_class')

# Now filtering by iris_class is much faster
setosa_indexed = data_indexed.loc['Iris-setosa']

Common Filtering Patterns

Filter by Multiple Values

# Filter by multiple specific values
species_of_interest = ['Iris-setosa', 'Iris-virginica']
filtered_species = data[data.iris_class.isin(species_of_interest)]

# Exclude specific values
not_versicolor = data[~data.iris_class.isin(['Iris-versicolor'])]

Filter by Missing Values

# Filter rows with missing values
has_missing = data[data.isnull().any(axis=1)]

# Filter rows without missing values
complete_rows = data[data.notnull().all(axis=1)]

# Filter specific column for missing values
missing_sepal_length = data[data.sepal_length.isnull()]

Conditional Replacement During Filtering

# Create filtered copy with conditional modifications
filtered_modified = data[data.iris_class == 'Iris-setosa'].copy()
filtered_modified.loc[filtered_modified.sepal_length > 5.0, 'size_category'] = 'large'
filtered_modified.loc[filtered_modified.sepal_length <= 5.0, 'size_category'] = 'small'

Quick Reference: Filtering Cheat Sheet

Operation Syntax Use Case
Single condition df[df['col'] > value] Basic filtering
Multiple AND conditions df[(df['col1'] > val1) & (df['col2'] < val2)] Complex filtering
Multiple OR conditions df[(df['col1'] > val1) | (df['col2'] < val2)] Alternative conditions
String contains df[df['col'].str.contains('pattern')] Text pattern matching
Value in list df[df['col'].isin([val1, val2])] Multiple value matching
Between values df[df['col'].between(min_val, max_val)] Range filtering
Query method df.query("col > value and col2 < value2") Readable complex conditions
Label-based df.loc[condition, columns] Specific rows and columns

This subsetting technique is fundamental to many other pandas operations:

  • Data Loading Patterns - Filter data during or after loading
  • Indexing and Sorting - Combine with sorting for ordered subsets
  • Join Operations - Filter before joins for better performance
  • Groupby Operations - Subset data before grouping for targeted analysis

Master these filtering patterns and you’ll handle 90% of data subsetting challenges efficiently!


Combining and Reshaping DataFrames

Sample Data

import pandas as pd
import numpy as np

students = pd.DataFrame({
    "StudentID": [1, 2, 3, 4],
    "Name":      ["Alice", "Bob", "Charlie", "David"],
    "Age":       [20, 22, 19, 21]
})

scores = pd.DataFrame({
    "StudentID": [1, 2, 3, 5],
    "Subject":   ["Math", "Science", "Math", "History"],
    "Score":     [85, 90, 78, 92]
})

merge — Joining DataFrames

# Inner join — only matching StudentIDs
merged_inner = pd.merge(students, scores, on="StudentID", how="inner")

# Left join — keep all students, NaN where no score exists
merged_left = pd.merge(students, scores, on="StudentID", how="left",
                       suffixes=("_stu", "_scr"))
how Keeps
inner Only matching rows on both sides
left All rows from left, NaN for unmatched right
right All rows from right, NaN for unmatched left
outer All rows from both, NaN where no match
# Join on different column names
pd.merge(students, scores, left_on="StudentID", right_on="StudentID")

concat — Stacking DataFrames

# Vertical stack (axis=0) — append rows
concat_vertical = pd.concat([students, students], axis=0, ignore_index=True)

# Horizontal stack (axis=1) — append columns
concat_horizontal = pd.concat([students, scores], axis=1)

# With keys to identify source
concat_keyed = pd.concat({"Set1": students, "Set2": scores}, axis=1, join="inner")

merge vs concat: use merge when joining on a key column; use concat when stacking rows or columns directly.

melt — Wide to Long

wide_df = pd.DataFrame({
    "StudentID": [1, 2],
    "Math":      [85, 90],
    "Science":   [88, 92]
})

melted = pd.melt(
    wide_df,
    id_vars=["StudentID"],       # columns to keep as-is
    value_vars=["Math", "Science"],  # columns to unpivot
    var_name="Subject",
    value_name="Score"
)

Output:

   StudentID  Subject  Score
0          1     Math     85
1          2     Math     90
2          1  Science     88
3          2  Science     92

pivot — Long to Wide

long_df = pd.DataFrame({
    "StudentID": [1, 1, 2, 2],
    "Subject":   ["Math", "Science", "Math", "Science"],
    "Score":     [85, 88, 90, 92]
})

pivoted = long_df.pivot(index="StudentID", columns="Subject", values="Score")

Output:

Subject    Math  Science
StudentID               
1            85       88
2            90       92

pivot raises ValueError if there are duplicate (index, columns) combinations. Use pivot_table instead.

pivot_table — pivot with Aggregation

# Handles duplicates by aggregating
pivot_table = pd.pivot_table(
    long_df,
    index="StudentID",
    columns="Subject",
    values="Score",
    aggfunc="mean",
    fill_value=0
)

Common Exceptions

# merge: mismatched key names raises KeyError
try:
    pd.merge(students, scores, left_on="Name", right_on="Subject")
except KeyError as e:
    print("KeyError:", e)

# pivot: duplicate entries raises ValueError
long_df_dup = pd.DataFrame({
    "StudentID": [1, 1, 1],
    "Subject":   ["Math", "Math", "Science"],
    "Score":     [85, 87, 88]
})
try:
    long_df_dup.pivot(index="StudentID", columns="Subject", values="Score")
except ValueError as e:
    print("ValueError:", e)  # use pivot_table with aggfunc instead

Quick Reference

Operation Function Direction
Join on key pd.merge(df1, df2, on=key) Side by side
Stack rows pd.concat([df1, df2], axis=0) Vertical
Stack columns pd.concat([df1, df2], axis=1) Horizontal
Wide → Long pd.melt(df, id_vars=..., value_vars=...) Unpivot
Long → Wide df.pivot(index=..., columns=..., values=...) Pivot
Long → Wide + agg pd.pivot_table(df, aggfunc=...) Pivot with duplicates