45+ Python Pandas Interview Questions and Answers 2026

Pandas interview questions and python pandas interview questions show up in data analyst, data scientist, ML engineer, and Python backend loops wherever tabular data matters. Interviewers expect more than read_csv—they probe groupby vs transform, merge row explosions, vectorization vs apply, and how you clean messy production CSVs. If you searched panda express interview questions, that usually refers to the restaurant chain's hiring process, not this library; this guide covers the pandas Python package for data analysis only.

Below are 45 questions with elaborate answers; technical sections include a strong answer sample you can say aloud. Pair with Python developer interview questions for language fundamentals, PostgreSQL interview questions when data is loaded from Postgres, data science interview questions for statistics and ML breadth, SQL technical interview questions when interviewers compare joins to merge, and Kafka interview questions for pipeline context.

NOTE

Prep target: Master groupby, merge, and cleaning on a real dataset. Senior loops add performance, validate= on merges, and when Polars or SQL would replace pandas.

Tested on: Ubuntu 25.04 (Plucky Puffin); kernel 6.14.0-37-generic; Python 3.12; pandas 2.2.3.

Interview context and how to prepare

What do pandas interviews actually test?

Pandas interviews test whether you can manipulate labeled tables correctly and efficiently—not whether you memorized every method name.

Layer	What interviewers probe
Basics	Series, DataFrame, index, dtypes
Selection	`loc`, `iloc`, boolean indexing
Aggregation	`groupby`, `agg`, `transform`
Combine	`merge`, `join`, `concat`
Reshape	`pivot`, `melt`, `stack`
Cleaning	missing values, duplicates, types
Performance	vectorization, dtypes, chunking
Time series	`resample`, `shift`, `rolling`

Role	Emphasis
Data analyst	groupby, merge, pivot, SQL analogy
Data scientist	feature prep, leakage-safe splits
ML engineer	large data, parquet, memory

Pandas vs SQL — when do interviewers compare them?

Task	SQL	pandas
Large remote data	Server executes	Pull subset or use warehouse
Joins	`JOIN`	`merge`
Group by	`GROUP BY`	`groupby().agg()`
Window functions	`OVER()`	`transform`, `rolling`
Exploration	Heavier iteration	Notebook-friendly

Many loops give SQL + pandas for the same business question—see SQL interviews.

What is a typical pandas interview loop?

Round	Format	Focus
Screening	30 min	Projects, stack, pandas vs Spark
Live coding	45–60 min	groupby revenue, anti-join customers
Take-home	2–4 hr	EDA notebook with clean narrative
ML follow-up	45 min	Features from DataFrame, leakage

Most pandas screens center on groupby, merge, and missing data—know row counts before and after each step.

What is a realistic 3–5 week pandas prep plan?

Week	Focus	Output
1	DataFrame basics, selection, dtypes	10 timed pandas practice problems
2	groupby agg/transform, pivot	Cohort summary notebook
3	merge/join, duplicates, validate	Anti-join + duplicate key drill
4	Cleaning, time series, resample	Messy CSV → analysis-ready
5	Performance + mock	Replace apply with vectorized solution

Work on one Kaggle-style CSV end to end.

DataFrame and Series fundamentals

What is pandas?

pandas is an open-source Python library for tabular data manipulation and analysis. Core structures:

Series — one-dimensional labeled array
DataFrame — two-dimensional table (columns + index)

Built on NumPy; integrates with CSV, Parquet, SQL, Excel.

A strong answer is:

pandas is Python's standard labeled table library—DataFrame for rows and columns, Series for single columns—with alignment by index at the heart of operations.

Series vs DataFrame?

	Series	DataFrame
Dimensions	1D	2D
Columns	One (name optional)	Many
Index	Row labels	Row labels shared
From dict	`{label: value}`	`{col: list}`

A DataFrame column is a Series sharing the DataFrame index.

A strong answer is:

Series is one column or vector; DataFrame is a table of Series aligned on the same index—I think Series for single-metric ops, DataFrame for relational shape.

How do you create a DataFrame?

Common methods:

python


import pandas as pd

df = pd.DataFrame({"city": ["A", "B"], "sales": [10, 20]})
df = pd.read_csv("data.csv")
df = pd.read_parquet("data.parquet")

From NumPy: pd.DataFrame(arr, columns=[...], index=[...]).

A strong answer is:

From dicts for tests, read_csv/read_parquet in production—I set dtypes on read when I know the schema to save memory and errors.

What is the Index in pandas?

The Index labels rows (and can label columns). Enables alignment in arithmetic and joins without positional guessing.

Index type	Use
RangeIndex	Default 0..n-1
DatetimeIndex	Time series
MultiIndex	Hierarchical keys

reset_index() moves index to columns; set_index() promotes columns.

A strong answer is:

Index is the alignment backbone—operations line up on labels, which is why bad indexes after groupby cause silent bugs until I reset_index deliberately.

Why do dtypes matter in pandas?

dtypes control memory and correctness:

dtype	When
`int64` / `Int64`	Integers (`Int64` nullable)
`float64`	Continuous
`category`	Low-cardinality strings
`datetime64[ns]`	Timestamps
`string`	Text (pandas 1+ nullable string)

df.dtypes and df.info() for audit; astype() for conversion.

A strong answer is:

Wrong dtypes inflate memory and break joins—I use category for repeated strings, parse dates on load, and nullable Int64 when missing integers are real.

How do you explore a DataFrame quickly?

Method	Purpose
`head()` / `tail()`	Sample rows
`info()`	dtypes, non-null counts, memory
`describe()`	Numeric summary stats
`shape`	(rows, cols)
`value_counts()`	Categorical frequency

First step on any new dataset in interviews and on the job.

A strong answer is:

info for dtypes and nulls, describe for numeric range, value_counts for keys—before any merge I verify uniqueness assumptions with value_counts on join keys.

SettingWithCopyWarning — what causes it?

Assigning to a slice that might be a view triggers SettingWithCopyWarning—changes may not propagate to the original DataFrame.

Fix:

python


subset = df[df["active"]].copy()
subset["score"] = 0

Or use .loc on the original: df.loc[df["active"], "score"] = 0.

A strong answer is:

I use .copy() when building a filtered subset I'll mutate, or assign with .loc on the parent DataFrame—never chain filters then assign without copy.

loc vs iloc?

	`loc`	`iloc`
Selection	Label-based	Integer position
Slice end	Inclusive	Exclusive
Rows & cols	`df.loc[rows, cols]`	`df.iloc[i, j]`

Boolean indexing: df.loc[df["sales"] > 10].

A strong answer is:

loc for label slices and boolean masks; iloc for positional access—I avoid mixing chained brackets without loc after a filter.

What is boolean indexing?

Filter rows with a boolean Series aligned to the index:

python


high = df[df["amount"] > 100]
mask = (df["country"] == "IN") & (df["amount"] > 0)
filtered = df.loc[mask]

Use &, |, ~ with parentheses—not Python and/or.

A strong answer is:

Boolean masks filter with vectorized comparisons—I parenthesize each condition and use & for AND when combining masks.

GroupBy, aggregation, and reshape

What is groupby and split-apply-combine?

groupby splits rows by key, applies a function per group, combines results.

Phases:

Split — partition by city, user_id, etc.
Apply — sum, mean, custom
Combine — assemble result index

python

df.groupby("city", as_index=False)["sales"].sum()

A strong answer is:

groupby is split-apply-combine—I name the key columns, choose agg vs transform based on output shape, and reset_index when I need flat columns for merge.

groupby agg vs transform?

Method	Output shape	Use
`agg`	One row per group	Summary tables
`transform`	Same rows as input	Group stats per row
`apply`	Flexible; often slower	Custom group logic

python


# agg: one row per city
df.groupby("city")["sales"].sum()

# transform: each row gets its city's mean
df["city_mean"] = df.groupby("city")["sales"].transform("mean")

A strong answer is:

agg reduces to one row per group; transform broadcasts group stats back for row-level ratios—I skip apply when agg/transform express the logic.

What are named aggregations?

pandas 0.25+ style for multiple metrics with clear column names:

python


df.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_price=("price", "mean"),
    orders=("order_id", "nunique"),
)

Cleaner than dict of lists for interviews.

A strong answer is:

Named aggregations make readable summary tables in one groupby call—total_sales equals sum of sales per region without manual rename hacks.

What is a MultiIndex after groupby?

Grouping by multiple keys yields a hierarchical index:

python

df.groupby(["region", "product"])["sales"].sum()

Call reset_index() before merge to avoid alignment bugs.

A strong answer is:

MultiIndex after multi-key groupby—I reset_index immediately if the next step is merge or export to SQL expecting flat columns.

pivot vs pivot_table?

	`pivot`	`pivot_table`
Aggregation	No duplicate index pairs	Handles duplicates with `aggfunc`
Use	Strict reshape	Cross-tab summaries

python

pd.pivot_table(df, index="region", columns="month", values="sales", aggfunc="sum")

Like Excel pivot tables / SQL PIVOT conceptually.

A strong answer is:

pivot_table for revenue by region and month with duplicates allowed; plain pivot only when index pairs are unique.

melt and wide-to-long reshape?

melt unpivots wide columns to rows:

python

pd.melt(df, id_vars=["id"], value_vars=["Q1", "Q2"], var_name="quarter", value_name="revenue")

Opposite of pivot—tidy data for plotting and groupby.

A strong answer is:

melt turns wide quarter columns into long format for groupby and charts—I use id_vars for entity keys I keep constant.

Merge, join, and combine

merge vs join vs concat?

API	Behavior
`pd.merge(left, right, on=...)`	SQL-style joins on columns
`df.join(other)`	Index-based join
`pd.concat([a,b])`	Stack vertically or horizontally

python


pd.merge(orders, customers, on="customer_id", how="left")
pd.concat([df2024, df2025], ignore_index=True)

A strong answer is:

merge for key columns like SQL; join when indexes already align; concat to stack batches with same schema.

Explain inner, left, right, and outer merge.

`how`	Result
inner	Keys in both
left	All left keys; NaN if no match right
right	All right keys
outer	All keys from either side

Anti-join (rows in left not in right): left merge + indicator + filter _merge == 'left_only'.

A strong answer is:

I pick how based on which keys must survive—left for preserve-all-orders, inner for matched-only analysis, indicator for anti-join customers without orders.

What causes merge row explosion?

Duplicate keys on one or both sides create a Cartesian product within each key—row count multiplies silently.

Prevent:

python

pd.merge(a, b, on="id", validate="many_to_one")

Check before merge: df["id"].is_unique or value_counts().

A strong answer is:

Duplicate join keys inflate rows—I validate many_to_one or dedupe keys first and assert output row count matches expectations.

Merging on multiple keys?

python


pd.merge(
    df1, df2,
    on=["customer_id", "order_date"],
    how="inner",
)

Column names differ: left_on=["a_id"], right_on=["b_id"].

A strong answer is:

Multi-key merge when composite identity matters—I confirm both keys are clean dtypes and no nulls unless intentional.

concat axis=0 vs axis=1?

`axis`	Effect
0	Stack rows (append tables)
1	Side-by-side columns (align index)

Use ignore_index=True when appending batches without meaningful index.

A strong answer is:

axis=0 to append monthly files; axis=1 to add feature columns aligned on the same index—I watch duplicate indexes on axis=1 joins.

Data cleaning and missing values

How do you handle missing data?

Method	When
`isna()` / `notna()`	Detect
`dropna(subset=[...])`	Remove incomplete rows
`fillna(value)`	Impute constant
`fillna(method="ffill")`	Forward fill (time series)
`interpolate()`	Numeric gap fill

Interviewers want justification—mean imputation vs drop vs model-based.

A strong answer is:

I quantify missingness first—drop when MCAR and small; fill with domain rule or group median; never blind global mean on skewed targets without saying why.

How do you find and remove duplicates?

python


df.duplicated(subset=["email"], keep="first")
df.drop_duplicates(subset=["order_id"], keep="last")

keep=False marks all duplicates in a group.

A strong answer is:

duplicated with subset on business key—I keep first or last explicitly based on which record is authoritative, not default blindly.

Parsing and converting dtypes?

python


df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["ts"] = pd.to_datetime(df["ts"], utc=True, errors="coerce")
df["code"] = df["code"].astype("category")

errors="coerce" turns bad values into NaT/NaN for audit.

A strong answer is:

to_datetime with utc when events cross zones; coerce errors so bad strings surface as nulls I can count in info().

What is the .str accessor?

Vectorized string ops on text columns:

python


df["email"].str.lower()
df["name"].str.contains("test", na=False)
df["code"].str.split("-", expand=True)

Prefer .str over apply for strings.

A strong answer is:

str accessor is vectorized string logic—lower, contains, split—faster and clearer than apply with lambdas on text columns.

When is apply appropriate?

apply runs Python function per row/column—flexible but slow on large data.

Use when:

Logic not expressible with vectorized ops
Small DataFrames
Prototyping

Replace with NumPy/pandas vectorized ops in production paths.

A strong answer is:

apply is last resort—I try vectorized arithmetic, np.where, or .str first and reserve apply for small custom group logic apply can't vectorize.

Performance, time series, and advanced topics

Vectorization vs iterrows — why does it matter?

Approach	Speed
Vectorized	C/NumPy under hood
`iterrows` / row apply	Python per row—very slow

python


# Vectorized
df["tax"] = df["price"] * 0.18

# Avoid on large data
df["tax"] = df.apply(lambda r: r["price"] * 0.18, axis=1)

Interviewers often ask for 10–100x intuition.

A strong answer is:

Vectorized ops batch in C—iterrows is for debugging small samples, not production transforms on millions of rows.

How do you reduce DataFrame memory?

Technique	Effect
`category` dtype	Compress string columns
Downcast ints/floats	`pd.to_numeric(..., downcast="integer")`
Load only needed columns	`usecols` in `read_csv`
`read_parquet`	Columnar, compressed

df.memory_usage(deep=True) to measure.

A strong answer is:

category for low-cardinality strings, parquet instead of CSV for pipelines, and usecols on read—I measure with memory_usage before claiming OOM fixes.

How does pandas handle time series?

Tool	Use
`pd.to_datetime`	Parse timestamps
`set_index(datetime)`	Time-indexed Series
`resample("D").sum()`	Aggregate to daily
`shift(1)`	Lag for MoM growth
`rolling(7).mean()`	Moving average

Timezone-aware dtypes avoid DST bugs.

A strong answer is:

Parse to datetime, set index, resample for calendar buckets—shift for period-over-period and rolling for smoothed trends.

rolling vs expanding windows?

	`rolling(n)`	`expanding()`
Window	Fixed last n rows	From start to current
Use	Moving avg, volatility	Cumulative metrics

python

df["ma7"] = df["sales"].rolling(7, min_periods=1).mean()

A strong answer is:

rolling for last-N behavior; expanding for cumulative from series start—I set min_periods when early windows are sparse.

Pandas vs Polars in 2026 interviews?

	pandas	Polars
Engine	Python/NumPy	Rust, multi-thread
API	Mutable, index-centric	Lazy/eager, expression API
Best for	EDA, sklearn pipelines	Large in-memory ETL

pandas remains default interview expectation; Polars is a senior differentiator for performance questions.

A strong answer is:

pandas for interviews and sklearn compatibility; Polars when I need lazy multi-core on GB-scale single-node data—I can explain both without dismissing pandas.

Reading large CSV files?

python


for chunk in pd.read_csv("big.csv", chunksize=100_000):
    process(chunk)

Or filter at source with SQL/Parquet column pruning.

A strong answer is:

chunksize for out-of-core aggregation, or switch to parquet/SQL when the full table won't fit—I never iterrows a file that should be chunked or queried down.

Scenarios and live coding

Scenario: Top 3 products by revenue per region.

python


(
    df.groupby("region", as_index=False)
    .apply(lambda g: g.nlargest(3, "revenue"))
    .reset_index(drop=True)
)

Or sort + groupby.head(3) in pandas 1.5+:

python

df.sort_values("revenue", ascending=False).groupby("region").head(3)

A strong answer is:

Sort by revenue descending, groupby head 3 per region—or nlargest inside groupby—I verify ties handling if interviewer cares.

Scenario: Customers who never placed an order.

python


merged = customers.merge(orders[["customer_id"]].drop_duplicates(),
                         on="customer_id", how="left", indicator=True)
never_ordered = merged[merged["_merge"] == "left_only"]

Alternative: customers[~customers["id"].isin(orders["customer_id"])] when ids unique.

A strong answer is:

Left merge with indicator left_only, or isin negation—I confirm customer_id uniqueness so I don't drop valid customers.

Scenario: Month-over-month revenue growth.

python


monthly = df.groupby(pd.Grouper(key="date", freq="ME"))["revenue"].sum()
growth = monthly.pct_change()

Or diff() / shift(1) for absolute change.

A strong answer is:

Grouper to month buckets, sum revenue, pct_change for MoM—I sort by date first and handle first month NaN explicitly.

Scenario: Clean a messy CSV in an interview.

Checklist:

read_csv with na_values, dtype hints
drop_duplicates on business key
to_datetime / to_numeric with errors="coerce"
dropna or impute with stated rule
astype("category") where appropriate
Document before/after shape and null counts

A strong answer is:

I narrate each cleaning step with shape and null counts—parse dates, coerce numerics, dedupe on order_id, then aggregate—so interviewer sees audit trail.

What is the pipe method for readable chains?

.pipe(func, *args) passes DataFrame through functions—method chaining style:

python


(
    df.pipe(load_raw)
    .pipe(clean_orders)
    .pipe(add_revenue_column)
    .groupby("region")["revenue"].sum()
)

Improves readability in notebooks and interviews.

A strong answer is:

pipe names transformation steps in a left-to-right pipeline—cleaner than nested temp variables in live coding.

How does pandas interoperate with NumPy?

.values / .to_numpy() get array backing; many ops delegate to NumPy.

Watch: nullable dtypes and index alignment—NumPy arrays lose labels.

A strong answer is:

to_numpy when feeding sklearn without index—I'm careful that nullables become object or need fill before numpy-only APIs.

How do you export results?

Format	Use
`to_csv`	Universal, larger
`to_parquet`	Analytics pipelines
`to_sql`	Warehouse load
`to_dict("records")`	JSON APIs

A strong answer is:

parquet for downstream pipelines, csv for human handoff—I reset_index before export if flat files need key columns.

Scenario: Avoid data leakage when engineering features in pandas.

Rules:

Fit aggregations only on train split, apply to test
No future data in shift for prediction targets
Group stats computed inside cross-validation folds

python


# Wrong: global mean including test rows
# Right: train.groupby(...).transform saved mapping applied to test

Link to data science interviews.

A strong answer is:

Any group-level stat for ML must be fit on training data only—I never compute global target means on the full DataFrame before split.

How do you test pandas transformation code?

pytest with small fixture DataFrames
pandas.testing.assert_frame_equal for output compare
Property checks: row count, key uniqueness, no null in required cols

A strong answer is:

assert_frame_equal on expected output for pure functions—I build minimal frames covering edge cases like duplicate keys and null join keys.

Final prep checklist

What should you rehearse before pandas interviews?

Checklist:

loc / iloc / boolean indexing
groupby agg vs transform with reset_index
merge types + anti-join + validate=
pivot_table and melt
Missing data strategy out loud
Vectorization over apply/iterrows
Time series resample, shift, rolling
One top-N per group scenario
One messy CSV cleaning narrative
Python interviews language refresh
SQL interviews join analogy
Data science interviews if ML follow-up

A strong answer is:

I time-box groupby + merge + cleaning on one dataset daily, state row counts after every join, and rehearse one leakage-safe feature story for ML roles.

Pattern cheat sheet (quick reference)

Task	pandas approach
Filter rows	Boolean mask + `loc`
Summary by group	`groupby().agg()`
Per-row group stat	`transform`
SQL inner join	`merge(how="inner")`
Customers not in orders	Left merge + `indicator`
Prevent join explosion	`validate="many_to_one"`
Wide → long	`melt`
Cross-tab	`pivot_table`
Fast math column	Vectorized ops, not `apply`
Large CSV	`chunksize` or Parquet
MoM growth	`Grouper` + `pct_change`

References

pandas interview prep

pandas documentation

On-site prep

Summary

Pandas interviews test groupby, merge semantics, and vectorized cleaning—state row counts after every join and explain why you chose merge over join. Pair with SQL for warehouse logic and data science interviews when modeling follows.

Interview context and how to prepare

DataFrame and Series fundamentals

GroupBy, aggregation, and reshape

Merge, join, and combine

Data cleaning and missing values

Performance, time series, and advanced topics

Scenarios and live coding

Final prep checklist

Pattern cheat sheet (quick reference)

References

Summary

Related Articles

Django Interview Questions and Answers for Experienced Developers

Python Developer Interview Questions and Answers

Data Science Interview Questions and Answers

Search GoLinuxCloud