Pandas interview questions and python pandas interview questions show up in data analyst, data scientist, ML engineer, and Python backend loops wherever tabular data matters. Interviewers expect more than read_csv—they probe groupby vs transform, merge row explosions, vectorization vs apply, and how you clean messy production CSVs. If you searched panda express interview questions, that usually refers to the restaurant chain's hiring process, not this library; this guide covers the pandas Python package for data analysis only.
Below are 45 questions with elaborate answers; technical sections include a strong answer sample you can say aloud. Pair with Python developer interview questions for language fundamentals, PostgreSQL interview questions when data is loaded from Postgres, data science interview questions for statistics and ML breadth, SQL technical interview questions when interviewers compare joins to merge, and Kafka interview questions for pipeline context.
validate= on merges, and when Polars or SQL would replace pandas.
Tested on: Ubuntu 25.04 (Plucky Puffin); kernel 6.14.0-37-generic; Python 3.12; pandas 2.2.3.
Interview context and how to prepare
What do pandas interviews actually test?
Pandas interviews test whether you can manipulate labeled tables correctly and efficiently—not whether you memorized every method name.
| Layer | What interviewers probe |
|---|---|
| Basics | Series, DataFrame, index, dtypes |
| Selection | loc, iloc, boolean indexing |
| Aggregation | groupby, agg, transform |
| Combine | merge, join, concat |
| Reshape | pivot, melt, stack |
| Cleaning | missing values, duplicates, types |
| Performance | vectorization, dtypes, chunking |
| Time series | resample, shift, rolling |
| Role | Emphasis |
|---|---|
| Data analyst | groupby, merge, pivot, SQL analogy |
| Data scientist | feature prep, leakage-safe splits |
| ML engineer | large data, parquet, memory |
Pandas vs SQL — when do interviewers compare them?
| Task | SQL | pandas |
|---|---|---|
| Large remote data | Server executes | Pull subset or use warehouse |
| Joins | JOIN |
merge |
| Group by | GROUP BY |
groupby().agg() |
| Window functions | OVER() |
transform, rolling |
| Exploration | Heavier iteration | Notebook-friendly |
Many loops give SQL + pandas for the same business question—see SQL interviews.
What is a typical pandas interview loop?
| Round | Format | Focus |
|---|---|---|
| Screening | 30 min | Projects, stack, pandas vs Spark |
| Live coding | 45–60 min | groupby revenue, anti-join customers |
| Take-home | 2–4 hr | EDA notebook with clean narrative |
| ML follow-up | 45 min | Features from DataFrame, leakage |
Most pandas screens center on groupby, merge, and missing data—know row counts before and after each step.
What is a realistic 3–5 week pandas prep plan?
| Week | Focus | Output |
|---|---|---|
| 1 | DataFrame basics, selection, dtypes | 10 timed pandas practice problems |
| 2 | groupby agg/transform, pivot | Cohort summary notebook |
| 3 | merge/join, duplicates, validate | Anti-join + duplicate key drill |
| 4 | Cleaning, time series, resample | Messy CSV → analysis-ready |
| 5 | Performance + mock | Replace apply with vectorized solution |
Work on one Kaggle-style CSV end to end.
DataFrame and Series fundamentals
What is pandas?
pandas is an open-source Python library for tabular data manipulation and analysis. Core structures:
Series— one-dimensional labeled arrayDataFrame— two-dimensional table (columns + index)
Built on NumPy; integrates with CSV, Parquet, SQL, Excel.
A strong answer is:
pandas is Python's standard labeled table library—DataFrame for rows and columns, Series for single columns—with alignment by index at the heart of operations.
Series vs DataFrame?
| Series | DataFrame | |
|---|---|---|
| Dimensions | 1D | 2D |
| Columns | One (name optional) | Many |
| Index | Row labels | Row labels shared |
| From dict | {label: value} |
{col: list} |
A DataFrame column is a Series sharing the DataFrame index.
A strong answer is:
Series is one column or vector; DataFrame is a table of Series aligned on the same index—I think Series for single-metric ops, DataFrame for relational shape.
How do you create a DataFrame?
Common methods:
import pandas as pd
df = pd.DataFrame({"city": ["A", "B"], "sales": [10, 20]})
df = pd.read_csv("data.csv")
df = pd.read_parquet("data.parquet")From NumPy: pd.DataFrame(arr, columns=[...], index=[...]).
A strong answer is:
From dicts for tests, read_csv/read_parquet in production—I set dtypes on read when I know the schema to save memory and errors.
What is the Index in pandas?
The Index labels rows (and can label columns). Enables alignment in arithmetic and joins without positional guessing.
| Index type | Use |
|---|---|
| RangeIndex | Default 0..n-1 |
| DatetimeIndex | Time series |
| MultiIndex | Hierarchical keys |
reset_index() moves index to columns; set_index() promotes columns.
A strong answer is:
Index is the alignment backbone—operations line up on labels, which is why bad indexes after groupby cause silent bugs until I reset_index deliberately.
Why do dtypes matter in pandas?
dtypes control memory and correctness:
| dtype | When |
|---|---|
int64 / Int64 |
Integers (Int64 nullable) |
float64 |
Continuous |
category |
Low-cardinality strings |
datetime64[ns] |
Timestamps |
string |
Text (pandas 1+ nullable string) |
df.dtypes and df.info() for audit; astype() for conversion.
A strong answer is:
Wrong dtypes inflate memory and break joins—I use category for repeated strings, parse dates on load, and nullable Int64 when missing integers are real.
How do you explore a DataFrame quickly?
| Method | Purpose |
|---|---|
head() / tail() |
Sample rows |
info() |
dtypes, non-null counts, memory |
describe() |
Numeric summary stats |
shape |
(rows, cols) |
value_counts() |
Categorical frequency |
First step on any new dataset in interviews and on the job.
A strong answer is:
info for dtypes and nulls, describe for numeric range, value_counts for keys—before any merge I verify uniqueness assumptions with value_counts on join keys.
SettingWithCopyWarning — what causes it?
Assigning to a slice that might be a view triggers SettingWithCopyWarning—changes may not propagate to the original DataFrame.
Fix:
subset = df[df["active"]].copy()
subset["score"] = 0Or use .loc on the original: df.loc[df["active"], "score"] = 0.
A strong answer is:
I use .copy() when building a filtered subset I'll mutate, or assign with .loc on the parent DataFrame—never chain filters then assign without copy.
loc vs iloc?
loc |
iloc |
|
|---|---|---|
| Selection | Label-based | Integer position |
| Slice end | Inclusive | Exclusive |
| Rows & cols | df.loc[rows, cols] |
df.iloc[i, j] |
Boolean indexing: df.loc[df["sales"] > 10].
A strong answer is:
loc for label slices and boolean masks; iloc for positional access—I avoid mixing chained brackets without loc after a filter.
What is boolean indexing?
Filter rows with a boolean Series aligned to the index:
high = df[df["amount"] > 100]
mask = (df["country"] == "IN") & (df["amount"] > 0)
filtered = df.loc[mask]Use &, |, ~ with parentheses—not Python and/or.
A strong answer is:
Boolean masks filter with vectorized comparisons—I parenthesize each condition and use & for AND when combining masks.
GroupBy, aggregation, and reshape
What is groupby and split-apply-combine?
groupby splits rows by key, applies a function per group, combines results.
Phases:
- Split — partition by
city,user_id, etc. - Apply — sum, mean, custom
- Combine — assemble result index
df.groupby("city", as_index=False)["sales"].sum()A strong answer is:
groupby is split-apply-combine—I name the key columns, choose agg vs transform based on output shape, and reset_index when I need flat columns for merge.
groupby agg vs transform?
| Method | Output shape | Use |
|---|---|---|
agg |
One row per group | Summary tables |
transform |
Same rows as input | Group stats per row |
apply |
Flexible; often slower | Custom group logic |
# agg: one row per city
df.groupby("city")["sales"].sum()
# transform: each row gets its city's mean
df["city_mean"] = df.groupby("city")["sales"].transform("mean")A strong answer is:
agg reduces to one row per group; transform broadcasts group stats back for row-level ratios—I skip apply when agg/transform express the logic.
What are named aggregations?
pandas 0.25+ style for multiple metrics with clear column names:
df.groupby("region").agg(
total_sales=("sales", "sum"),
avg_price=("price", "mean"),
orders=("order_id", "nunique"),
)Cleaner than dict of lists for interviews.
A strong answer is:
Named aggregations make readable summary tables in one groupby call—total_sales equals sum of sales per region without manual rename hacks.
What is a MultiIndex after groupby?
Grouping by multiple keys yields a hierarchical index:
df.groupby(["region", "product"])["sales"].sum()Call reset_index() before merge to avoid alignment bugs.
A strong answer is:
MultiIndex after multi-key groupby—I reset_index immediately if the next step is merge or export to SQL expecting flat columns.
pivot vs pivot_table?
pivot |
pivot_table |
|
|---|---|---|
| Aggregation | No duplicate index pairs | Handles duplicates with aggfunc |
| Use | Strict reshape | Cross-tab summaries |
pd.pivot_table(df, index="region", columns="month", values="sales", aggfunc="sum")Like Excel pivot tables / SQL PIVOT conceptually.
A strong answer is:
pivot_table for revenue by region and month with duplicates allowed; plain pivot only when index pairs are unique.
melt and wide-to-long reshape?
melt unpivots wide columns to rows:
pd.melt(df, id_vars=["id"], value_vars=["Q1", "Q2"], var_name="quarter", value_name="revenue")Opposite of pivot—tidy data for plotting and groupby.
A strong answer is:
melt turns wide quarter columns into long format for groupby and charts—I use id_vars for entity keys I keep constant.
Merge, join, and combine
merge vs join vs concat?
| API | Behavior |
|---|---|
pd.merge(left, right, on=...) |
SQL-style joins on columns |
df.join(other) |
Index-based join |
pd.concat([a,b]) |
Stack vertically or horizontally |
pd.merge(orders, customers, on="customer_id", how="left")
pd.concat([df2024, df2025], ignore_index=True)A strong answer is:
merge for key columns like SQL; join when indexes already align; concat to stack batches with same schema.
Explain inner, left, right, and outer merge.
how |
Result |
|---|---|
| inner | Keys in both |
| left | All left keys; NaN if no match right |
| right | All right keys |
| outer | All keys from either side |
Anti-join (rows in left not in right): left merge + indicator + filter _merge == 'left_only'.
A strong answer is:
I pick how based on which keys must survive—left for preserve-all-orders, inner for matched-only analysis, indicator for anti-join customers without orders.
What causes merge row explosion?
Duplicate keys on one or both sides create a Cartesian product within each key—row count multiplies silently.
Prevent:
pd.merge(a, b, on="id", validate="many_to_one")Check before merge: df["id"].is_unique or value_counts().
A strong answer is:
Duplicate join keys inflate rows—I validate many_to_one or dedupe keys first and assert output row count matches expectations.
Merging on multiple keys?
pd.merge(
df1, df2,
on=["customer_id", "order_date"],
how="inner",
)Column names differ: left_on=["a_id"], right_on=["b_id"].
A strong answer is:
Multi-key merge when composite identity matters—I confirm both keys are clean dtypes and no nulls unless intentional.
concat axis=0 vs axis=1?
axis |
Effect |
|---|---|
| 0 | Stack rows (append tables) |
| 1 | Side-by-side columns (align index) |
Use ignore_index=True when appending batches without meaningful index.
A strong answer is:
axis=0 to append monthly files; axis=1 to add feature columns aligned on the same index—I watch duplicate indexes on axis=1 joins.
Data cleaning and missing values
How do you handle missing data?
| Method | When |
|---|---|
isna() / notna() |
Detect |
dropna(subset=[...]) |
Remove incomplete rows |
fillna(value) |
Impute constant |
fillna(method="ffill") |
Forward fill (time series) |
interpolate() |
Numeric gap fill |
Interviewers want justification—mean imputation vs drop vs model-based.
A strong answer is:
I quantify missingness first—drop when MCAR and small; fill with domain rule or group median; never blind global mean on skewed targets without saying why.
How do you find and remove duplicates?
df.duplicated(subset=["email"], keep="first")
df.drop_duplicates(subset=["order_id"], keep="last")keep=False marks all duplicates in a group.
A strong answer is:
duplicated with subset on business key—I keep first or last explicitly based on which record is authoritative, not default blindly.
Parsing and converting dtypes?
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["ts"] = pd.to_datetime(df["ts"], utc=True, errors="coerce")
df["code"] = df["code"].astype("category")errors="coerce" turns bad values into NaT/NaN for audit.
A strong answer is:
to_datetime with utc when events cross zones; coerce errors so bad strings surface as nulls I can count in info().
What is the .str accessor?
Vectorized string ops on text columns:
df["email"].str.lower()
df["name"].str.contains("test", na=False)
df["code"].str.split("-", expand=True)Prefer .str over apply for strings.
A strong answer is:
str accessor is vectorized string logic—lower, contains, split—faster and clearer than apply with lambdas on text columns.
When is apply appropriate?
apply runs Python function per row/column—flexible but slow on large data.
Use when:
- Logic not expressible with vectorized ops
- Small DataFrames
- Prototyping
Replace with NumPy/pandas vectorized ops in production paths.
A strong answer is:
apply is last resort—I try vectorized arithmetic, np.where, or .str first and reserve apply for small custom group logic apply can't vectorize.
Performance, time series, and advanced topics
Vectorization vs iterrows — why does it matter?
| Approach | Speed |
|---|---|
| Vectorized | C/NumPy under hood |
iterrows / row apply |
Python per row—very slow |
# Vectorized
df["tax"] = df["price"] * 0.18
# Avoid on large data
df["tax"] = df.apply(lambda r: r["price"] * 0.18, axis=1)Interviewers often ask for 10–100x intuition.
A strong answer is:
Vectorized ops batch in C—iterrows is for debugging small samples, not production transforms on millions of rows.
How do you reduce DataFrame memory?
| Technique | Effect |
|---|---|
category dtype |
Compress string columns |
| Downcast ints/floats | pd.to_numeric(..., downcast="integer") |
| Load only needed columns | usecols in read_csv |
read_parquet |
Columnar, compressed |
df.memory_usage(deep=True) to measure.
A strong answer is:
category for low-cardinality strings, parquet instead of CSV for pipelines, and usecols on read—I measure with memory_usage before claiming OOM fixes.
How does pandas handle time series?
| Tool | Use |
|---|---|
pd.to_datetime |
Parse timestamps |
set_index(datetime) |
Time-indexed Series |
resample("D").sum() |
Aggregate to daily |
shift(1) |
Lag for MoM growth |
rolling(7).mean() |
Moving average |
Timezone-aware dtypes avoid DST bugs.
A strong answer is:
Parse to datetime, set index, resample for calendar buckets—shift for period-over-period and rolling for smoothed trends.
rolling vs expanding windows?
rolling(n) |
expanding() |
|
|---|---|---|
| Window | Fixed last n rows | From start to current |
| Use | Moving avg, volatility | Cumulative metrics |
df["ma7"] = df["sales"].rolling(7, min_periods=1).mean()A strong answer is:
rolling for last-N behavior; expanding for cumulative from series start—I set min_periods when early windows are sparse.
Pandas vs Polars in 2026 interviews?
| pandas | Polars | |
|---|---|---|
| Engine | Python/NumPy | Rust, multi-thread |
| API | Mutable, index-centric | Lazy/eager, expression API |
| Best for | EDA, sklearn pipelines | Large in-memory ETL |
pandas remains default interview expectation; Polars is a senior differentiator for performance questions.
A strong answer is:
pandas for interviews and sklearn compatibility; Polars when I need lazy multi-core on GB-scale single-node data—I can explain both without dismissing pandas.
Reading large CSV files?
for chunk in pd.read_csv("big.csv", chunksize=100_000):
process(chunk)Or filter at source with SQL/Parquet column pruning.
A strong answer is:
chunksize for out-of-core aggregation, or switch to parquet/SQL when the full table won't fit—I never iterrows a file that should be chunked or queried down.
Scenarios and live coding
Scenario: Top 3 products by revenue per region.
(
df.groupby("region", as_index=False)
.apply(lambda g: g.nlargest(3, "revenue"))
.reset_index(drop=True)
)Or sort + groupby.head(3) in pandas 1.5+:
df.sort_values("revenue", ascending=False).groupby("region").head(3)A strong answer is:
Sort by revenue descending, groupby head 3 per region—or nlargest inside groupby—I verify ties handling if interviewer cares.
Scenario: Customers who never placed an order.
merged = customers.merge(orders[["customer_id"]].drop_duplicates(),
on="customer_id", how="left", indicator=True)
never_ordered = merged[merged["_merge"] == "left_only"]Alternative: customers[~customers["id"].isin(orders["customer_id"])] when ids unique.
A strong answer is:
Left merge with indicator left_only, or isin negation—I confirm customer_id uniqueness so I don't drop valid customers.
Scenario: Month-over-month revenue growth.
monthly = df.groupby(pd.Grouper(key="date", freq="ME"))["revenue"].sum()
growth = monthly.pct_change()Or diff() / shift(1) for absolute change.
A strong answer is:
Grouper to month buckets, sum revenue, pct_change for MoM—I sort by date first and handle first month NaN explicitly.
Scenario: Clean a messy CSV in an interview.
Checklist:
read_csvwithna_values,dtypehintsdrop_duplicateson business keyto_datetime/to_numericwitherrors="coerce"dropnaor impute with stated ruleastype("category")where appropriate- Document before/after
shapeand null counts
A strong answer is:
I narrate each cleaning step with shape and null counts—parse dates, coerce numerics, dedupe on order_id, then aggregate—so interviewer sees audit trail.
What is the pipe method for readable chains?
.pipe(func, *args) passes DataFrame through functions—method chaining style:
(
df.pipe(load_raw)
.pipe(clean_orders)
.pipe(add_revenue_column)
.groupby("region")["revenue"].sum()
)Improves readability in notebooks and interviews.
A strong answer is:
pipe names transformation steps in a left-to-right pipeline—cleaner than nested temp variables in live coding.
How does pandas interoperate with NumPy?
.values / .to_numpy() get array backing; many ops delegate to NumPy.
Watch: nullable dtypes and index alignment—NumPy arrays lose labels.
A strong answer is:
to_numpy when feeding sklearn without index—I'm careful that nullables become object or need fill before numpy-only APIs.
How do you export results?
| Format | Use |
|---|---|
to_csv |
Universal, larger |
to_parquet |
Analytics pipelines |
to_sql |
Warehouse load |
to_dict("records") |
JSON APIs |
A strong answer is:
parquet for downstream pipelines, csv for human handoff—I reset_index before export if flat files need key columns.
Scenario: Avoid data leakage when engineering features in pandas.
Rules:
- Fit aggregations only on train split, apply to test
- No future data in
shiftfor prediction targets - Group stats computed inside cross-validation folds
# Wrong: global mean including test rows
# Right: train.groupby(...).transform saved mapping applied to testLink to data science interviews.
A strong answer is:
Any group-level stat for ML must be fit on training data only—I never compute global target means on the full DataFrame before split.
How do you test pandas transformation code?
- pytest with small fixture DataFrames
pandas.testing.assert_frame_equalfor output compare- Property checks: row count, key uniqueness, no null in required cols
A strong answer is:
assert_frame_equal on expected output for pure functions—I build minimal frames covering edge cases like duplicate keys and null join keys.
Final prep checklist
What should you rehearse before pandas interviews?
Checklist:
- loc / iloc / boolean indexing
- groupby agg vs transform with reset_index
- merge types + anti-join +
validate= - pivot_table and melt
- Missing data strategy out loud
- Vectorization over apply/iterrows
- Time series resample, shift, rolling
- One top-N per group scenario
- One messy CSV cleaning narrative
- Python interviews language refresh
- SQL interviews join analogy
- Data science interviews if ML follow-up
A strong answer is:
I time-box groupby + merge + cleaning on one dataset daily, state row counts after every join, and rehearse one leakage-safe feature story for ML roles.
Pattern cheat sheet (quick reference)
| Task | pandas approach |
|---|---|
| Filter rows | Boolean mask + loc |
| Summary by group | groupby().agg() |
| Per-row group stat | transform |
| SQL inner join | merge(how="inner") |
| Customers not in orders | Left merge + indicator |
| Prevent join explosion | validate="many_to_one" |
| Wide → long | melt |
| Cross-tab | pivot_table |
| Fast math column | Vectorized ops, not apply |
| Large CSV | chunksize or Parquet |
| MoM growth | Grouper + pct_change |
References
pandas interview prep
On-site prep
- Python developer interviews
- Data science interview questions
- SQL technical interview questions
- Kafka interview questions
- Full stack developer interviews
- Interview Questions category
Summary
Pandas interviews test groupby, merge semantics, and vectorized cleaning—state row counts after every join and explain why you chose merge over join. Pair with SQL for warehouse logic and data science interviews when modeling follows.

