Pandas Interview Questions and Answers

Pandas and python pandas interview questions for 2026: DataFrame, groupby, merge, cleaning, vectorization, time series, performance, and data analyst prep with answers.

Published

Updated

Tech reviewed byDeepak Prasad

Pandas Interview Questions and Answers

Pandas interview questions and python pandas interview questions show up in data analyst, data scientist, ML engineer, and Python backend loops wherever tabular data matters. Interviewers expect more than read_csv—they probe groupby vs transform, merge row explosions, vectorization vs apply, and how you clean messy production CSVs. If you searched panda express interview questions, that usually refers to the restaurant chain's hiring process, not this library; this guide covers the pandas Python package for data analysis only.

Below are 45 questions with elaborate answers; technical sections include a strong answer sample you can say aloud. Pair with Python developer interview questions for language fundamentals, PostgreSQL interview questions when data is loaded from Postgres, data science interview questions for statistics and ML breadth, SQL technical interview questions when interviewers compare joins to merge, and Kafka interview questions for pipeline context.

NOTE
Prep target: Master groupby, merge, and cleaning on a real dataset. Senior loops add performance, validate= on merges, and when Polars or SQL would replace pandas.

Tested on: Ubuntu 25.04 (Plucky Puffin); kernel 6.14.0-37-generic; Python 3.12; pandas 2.2.3.


Interview context and how to prepare

What do pandas interviews actually test?

Pandas interviews test whether you can manipulate labeled tables correctly and efficiently—not whether you memorized every method name.

Layer What interviewers probe
Basics Series, DataFrame, index, dtypes
Selection loc, iloc, boolean indexing
Aggregation groupby, agg, transform
Combine merge, join, concat
Reshape pivot, melt, stack
Cleaning missing values, duplicates, types
Performance vectorization, dtypes, chunking
Time series resample, shift, rolling
Role Emphasis
Data analyst groupby, merge, pivot, SQL analogy
Data scientist feature prep, leakage-safe splits
ML engineer large data, parquet, memory
Pandas vs SQL — when do interviewers compare them?
Task SQL pandas
Large remote data Server executes Pull subset or use warehouse
Joins JOIN merge
Group by GROUP BY groupby().agg()
Window functions OVER() transform, rolling
Exploration Heavier iteration Notebook-friendly

Many loops give SQL + pandas for the same business question—see SQL interviews.

What is a typical pandas interview loop?
Round Format Focus
Screening 30 min Projects, stack, pandas vs Spark
Live coding 45–60 min groupby revenue, anti-join customers
Take-home 2–4 hr EDA notebook with clean narrative
ML follow-up 45 min Features from DataFrame, leakage

Most pandas screens center on groupby, merge, and missing data—know row counts before and after each step.

What is a realistic 3–5 week pandas prep plan?
Week Focus Output
1 DataFrame basics, selection, dtypes 10 timed pandas practice problems
2 groupby agg/transform, pivot Cohort summary notebook
3 merge/join, duplicates, validate Anti-join + duplicate key drill
4 Cleaning, time series, resample Messy CSV → analysis-ready
5 Performance + mock Replace apply with vectorized solution

Work on one Kaggle-style CSV end to end.


DataFrame and Series fundamentals

What is pandas?

pandas is an open-source Python library for tabular data manipulation and analysis. Core structures:

  • Series — one-dimensional labeled array
  • DataFrame — two-dimensional table (columns + index)

Built on NumPy; integrates with CSV, Parquet, SQL, Excel.

A strong answer is:

pandas is Python's standard labeled table library—DataFrame for rows and columns, Series for single columns—with alignment by index at the heart of operations.

Series vs DataFrame?
Series DataFrame
Dimensions 1D 2D
Columns One (name optional) Many
Index Row labels Row labels shared
From dict {label: value} {col: list}

A DataFrame column is a Series sharing the DataFrame index.

A strong answer is:

Series is one column or vector; DataFrame is a table of Series aligned on the same index—I think Series for single-metric ops, DataFrame for relational shape.

How do you create a DataFrame?

Common methods:

python
import pandas as pd

df = pd.DataFrame({"city": ["A", "B"], "sales": [10, 20]})
df = pd.read_csv("data.csv")
df = pd.read_parquet("data.parquet")

From NumPy: pd.DataFrame(arr, columns=[...], index=[...]).

A strong answer is:

From dicts for tests, read_csv/read_parquet in production—I set dtypes on read when I know the schema to save memory and errors.

What is the Index in pandas?

The Index labels rows (and can label columns). Enables alignment in arithmetic and joins without positional guessing.

Index type Use
RangeIndex Default 0..n-1
DatetimeIndex Time series
MultiIndex Hierarchical keys

reset_index() moves index to columns; set_index() promotes columns.

A strong answer is:

Index is the alignment backbone—operations line up on labels, which is why bad indexes after groupby cause silent bugs until I reset_index deliberately.

Why do dtypes matter in pandas?

dtypes control memory and correctness:

dtype When
int64 / Int64 Integers (Int64 nullable)
float64 Continuous
category Low-cardinality strings
datetime64[ns] Timestamps
string Text (pandas 1+ nullable string)

df.dtypes and df.info() for audit; astype() for conversion.

A strong answer is:

Wrong dtypes inflate memory and break joins—I use category for repeated strings, parse dates on load, and nullable Int64 when missing integers are real.

How do you explore a DataFrame quickly?
Method Purpose
head() / tail() Sample rows
info() dtypes, non-null counts, memory
describe() Numeric summary stats
shape (rows, cols)
value_counts() Categorical frequency

First step on any new dataset in interviews and on the job.

A strong answer is:

info for dtypes and nulls, describe for numeric range, value_counts for keys—before any merge I verify uniqueness assumptions with value_counts on join keys.

SettingWithCopyWarning — what causes it?

Assigning to a slice that might be a view triggers SettingWithCopyWarning—changes may not propagate to the original DataFrame.

Fix:

python
subset = df[df["active"]].copy()
subset["score"] = 0

Or use .loc on the original: df.loc[df["active"], "score"] = 0.

A strong answer is:

I use .copy() when building a filtered subset I'll mutate, or assign with .loc on the parent DataFrame—never chain filters then assign without copy.

loc vs iloc?
loc iloc
Selection Label-based Integer position
Slice end Inclusive Exclusive
Rows & cols df.loc[rows, cols] df.iloc[i, j]

Boolean indexing: df.loc[df["sales"] > 10].

A strong answer is:

loc for label slices and boolean masks; iloc for positional access—I avoid mixing chained brackets without loc after a filter.

What is boolean indexing?

Filter rows with a boolean Series aligned to the index:

python
high = df[df["amount"] > 100]
mask = (df["country"] == "IN") & (df["amount"] > 0)
filtered = df.loc[mask]

Use &, |, ~ with parentheses—not Python and/or.

A strong answer is:

Boolean masks filter with vectorized comparisons—I parenthesize each condition and use & for AND when combining masks.


GroupBy, aggregation, and reshape

What is groupby and split-apply-combine?

groupby splits rows by key, applies a function per group, combines results.

Phases:

  1. Split — partition by city, user_id, etc.
  2. Apply — sum, mean, custom
  3. Combine — assemble result index
python
df.groupby("city", as_index=False)["sales"].sum()

A strong answer is:

groupby is split-apply-combine—I name the key columns, choose agg vs transform based on output shape, and reset_index when I need flat columns for merge.

groupby agg vs transform?
Method Output shape Use
agg One row per group Summary tables
transform Same rows as input Group stats per row
apply Flexible; often slower Custom group logic
python
# agg: one row per city
df.groupby("city")["sales"].sum()

# transform: each row gets its city's mean
df["city_mean"] = df.groupby("city")["sales"].transform("mean")

A strong answer is:

agg reduces to one row per group; transform broadcasts group stats back for row-level ratios—I skip apply when agg/transform express the logic.

What are named aggregations?

pandas 0.25+ style for multiple metrics with clear column names:

python
df.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_price=("price", "mean"),
    orders=("order_id", "nunique"),
)

Cleaner than dict of lists for interviews.

A strong answer is:

Named aggregations make readable summary tables in one groupby call—total_sales equals sum of sales per region without manual rename hacks.

What is a MultiIndex after groupby?

Grouping by multiple keys yields a hierarchical index:

python
df.groupby(["region", "product"])["sales"].sum()

Call reset_index() before merge to avoid alignment bugs.

A strong answer is:

MultiIndex after multi-key groupby—I reset_index immediately if the next step is merge or export to SQL expecting flat columns.

pivot vs pivot_table?
pivot pivot_table
Aggregation No duplicate index pairs Handles duplicates with aggfunc
Use Strict reshape Cross-tab summaries
python
pd.pivot_table(df, index="region", columns="month", values="sales", aggfunc="sum")

Like Excel pivot tables / SQL PIVOT conceptually.

A strong answer is:

pivot_table for revenue by region and month with duplicates allowed; plain pivot only when index pairs are unique.

melt and wide-to-long reshape?

melt unpivots wide columns to rows:

python
pd.melt(df, id_vars=["id"], value_vars=["Q1", "Q2"], var_name="quarter", value_name="revenue")

Opposite of pivot—tidy data for plotting and groupby.

A strong answer is:

melt turns wide quarter columns into long format for groupby and charts—I use id_vars for entity keys I keep constant.


Merge, join, and combine

merge vs join vs concat?
API Behavior
pd.merge(left, right, on=...) SQL-style joins on columns
df.join(other) Index-based join
pd.concat([a,b]) Stack vertically or horizontally
python
pd.merge(orders, customers, on="customer_id", how="left")
pd.concat([df2024, df2025], ignore_index=True)

A strong answer is:

merge for key columns like SQL; join when indexes already align; concat to stack batches with same schema.

Explain inner, left, right, and outer merge.
how Result
inner Keys in both
left All left keys; NaN if no match right
right All right keys
outer All keys from either side

Anti-join (rows in left not in right): left merge + indicator + filter _merge == 'left_only'.

A strong answer is:

I pick how based on which keys must survive—left for preserve-all-orders, inner for matched-only analysis, indicator for anti-join customers without orders.

What causes merge row explosion?

Duplicate keys on one or both sides create a Cartesian product within each key—row count multiplies silently.

Prevent:

python
pd.merge(a, b, on="id", validate="many_to_one")

Check before merge: df["id"].is_unique or value_counts().

A strong answer is:

Duplicate join keys inflate rows—I validate many_to_one or dedupe keys first and assert output row count matches expectations.

Merging on multiple keys?
python
pd.merge(
    df1, df2,
    on=["customer_id", "order_date"],
    how="inner",
)

Column names differ: left_on=["a_id"], right_on=["b_id"].

A strong answer is:

Multi-key merge when composite identity matters—I confirm both keys are clean dtypes and no nulls unless intentional.

concat axis=0 vs axis=1?
axis Effect
0 Stack rows (append tables)
1 Side-by-side columns (align index)

Use ignore_index=True when appending batches without meaningful index.

A strong answer is:

axis=0 to append monthly files; axis=1 to add feature columns aligned on the same index—I watch duplicate indexes on axis=1 joins.


Data cleaning and missing values

How do you handle missing data?
Method When
isna() / notna() Detect
dropna(subset=[...]) Remove incomplete rows
fillna(value) Impute constant
fillna(method="ffill") Forward fill (time series)
interpolate() Numeric gap fill

Interviewers want justification—mean imputation vs drop vs model-based.

A strong answer is:

I quantify missingness first—drop when MCAR and small; fill with domain rule or group median; never blind global mean on skewed targets without saying why.

How do you find and remove duplicates?
python
df.duplicated(subset=["email"], keep="first")
df.drop_duplicates(subset=["order_id"], keep="last")

keep=False marks all duplicates in a group.

A strong answer is:

duplicated with subset on business key—I keep first or last explicitly based on which record is authoritative, not default blindly.

Parsing and converting dtypes?
python
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["ts"] = pd.to_datetime(df["ts"], utc=True, errors="coerce")
df["code"] = df["code"].astype("category")

errors="coerce" turns bad values into NaT/NaN for audit.

A strong answer is:

to_datetime with utc when events cross zones; coerce errors so bad strings surface as nulls I can count in info().

What is the .str accessor?

Vectorized string ops on text columns:

python
df["email"].str.lower()
df["name"].str.contains("test", na=False)
df["code"].str.split("-", expand=True)

Prefer .str over apply for strings.

A strong answer is:

str accessor is vectorized string logic—lower, contains, split—faster and clearer than apply with lambdas on text columns.

When is apply appropriate?

apply runs Python function per row/column—flexible but slow on large data.

Use when:

  • Logic not expressible with vectorized ops
  • Small DataFrames
  • Prototyping

Replace with NumPy/pandas vectorized ops in production paths.

A strong answer is:

apply is last resort—I try vectorized arithmetic, np.where, or .str first and reserve apply for small custom group logic apply can't vectorize.


Performance, time series, and advanced topics

Vectorization vs iterrows — why does it matter?
Approach Speed
Vectorized C/NumPy under hood
iterrows / row apply Python per row—very slow
python
# Vectorized
df["tax"] = df["price"] * 0.18

# Avoid on large data
df["tax"] = df.apply(lambda r: r["price"] * 0.18, axis=1)

Interviewers often ask for 10–100x intuition.

A strong answer is:

Vectorized ops batch in C—iterrows is for debugging small samples, not production transforms on millions of rows.

How do you reduce DataFrame memory?
Technique Effect
category dtype Compress string columns
Downcast ints/floats pd.to_numeric(..., downcast="integer")
Load only needed columns usecols in read_csv
read_parquet Columnar, compressed

df.memory_usage(deep=True) to measure.

A strong answer is:

category for low-cardinality strings, parquet instead of CSV for pipelines, and usecols on read—I measure with memory_usage before claiming OOM fixes.

How does pandas handle time series?
Tool Use
pd.to_datetime Parse timestamps
set_index(datetime) Time-indexed Series
resample("D").sum() Aggregate to daily
shift(1) Lag for MoM growth
rolling(7).mean() Moving average

Timezone-aware dtypes avoid DST bugs.

A strong answer is:

Parse to datetime, set index, resample for calendar buckets—shift for period-over-period and rolling for smoothed trends.

rolling vs expanding windows?
rolling(n) expanding()
Window Fixed last n rows From start to current
Use Moving avg, volatility Cumulative metrics
python
df["ma7"] = df["sales"].rolling(7, min_periods=1).mean()

A strong answer is:

rolling for last-N behavior; expanding for cumulative from series start—I set min_periods when early windows are sparse.

Pandas vs Polars in 2026 interviews?
pandas Polars
Engine Python/NumPy Rust, multi-thread
API Mutable, index-centric Lazy/eager, expression API
Best for EDA, sklearn pipelines Large in-memory ETL

pandas remains default interview expectation; Polars is a senior differentiator for performance questions.

A strong answer is:

pandas for interviews and sklearn compatibility; Polars when I need lazy multi-core on GB-scale single-node data—I can explain both without dismissing pandas.

Reading large CSV files?
python
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    process(chunk)

Or filter at source with SQL/Parquet column pruning.

A strong answer is:

chunksize for out-of-core aggregation, or switch to parquet/SQL when the full table won't fit—I never iterrows a file that should be chunked or queried down.


Scenarios and live coding

Scenario: Top 3 products by revenue per region.
python
(
    df.groupby("region", as_index=False)
    .apply(lambda g: g.nlargest(3, "revenue"))
    .reset_index(drop=True)
)

Or sort + groupby.head(3) in pandas 1.5+:

python
df.sort_values("revenue", ascending=False).groupby("region").head(3)

A strong answer is:

Sort by revenue descending, groupby head 3 per region—or nlargest inside groupby—I verify ties handling if interviewer cares.

Scenario: Customers who never placed an order.
python
merged = customers.merge(orders[["customer_id"]].drop_duplicates(),
                         on="customer_id", how="left", indicator=True)
never_ordered = merged[merged["_merge"] == "left_only"]

Alternative: customers[~customers["id"].isin(orders["customer_id"])] when ids unique.

A strong answer is:

Left merge with indicator left_only, or isin negation—I confirm customer_id uniqueness so I don't drop valid customers.

Scenario: Month-over-month revenue growth.
python
monthly = df.groupby(pd.Grouper(key="date", freq="ME"))["revenue"].sum()
growth = monthly.pct_change()

Or diff() / shift(1) for absolute change.

A strong answer is:

Grouper to month buckets, sum revenue, pct_change for MoM—I sort by date first and handle first month NaN explicitly.

Scenario: Clean a messy CSV in an interview.

Checklist:

  1. read_csv with na_values, dtype hints
  2. drop_duplicates on business key
  3. to_datetime / to_numeric with errors="coerce"
  4. dropna or impute with stated rule
  5. astype("category") where appropriate
  6. Document before/after shape and null counts

A strong answer is:

I narrate each cleaning step with shape and null counts—parse dates, coerce numerics, dedupe on order_id, then aggregate—so interviewer sees audit trail.

What is the pipe method for readable chains?

.pipe(func, *args) passes DataFrame through functions—method chaining style:

python
(
    df.pipe(load_raw)
    .pipe(clean_orders)
    .pipe(add_revenue_column)
    .groupby("region")["revenue"].sum()
)

Improves readability in notebooks and interviews.

A strong answer is:

pipe names transformation steps in a left-to-right pipeline—cleaner than nested temp variables in live coding.

How does pandas interoperate with NumPy?

.values / .to_numpy() get array backing; many ops delegate to NumPy.

Watch: nullable dtypes and index alignment—NumPy arrays lose labels.

A strong answer is:

to_numpy when feeding sklearn without index—I'm careful that nullables become object or need fill before numpy-only APIs.

How do you export results?
Format Use
to_csv Universal, larger
to_parquet Analytics pipelines
to_sql Warehouse load
to_dict("records") JSON APIs

A strong answer is:

parquet for downstream pipelines, csv for human handoff—I reset_index before export if flat files need key columns.

Scenario: Avoid data leakage when engineering features in pandas.

Rules:

  • Fit aggregations only on train split, apply to test
  • No future data in shift for prediction targets
  • Group stats computed inside cross-validation folds
python
# Wrong: global mean including test rows
# Right: train.groupby(...).transform saved mapping applied to test

Link to data science interviews.

A strong answer is:

Any group-level stat for ML must be fit on training data only—I never compute global target means on the full DataFrame before split.

How do you test pandas transformation code?
  • pytest with small fixture DataFrames
  • pandas.testing.assert_frame_equal for output compare
  • Property checks: row count, key uniqueness, no null in required cols

A strong answer is:

assert_frame_equal on expected output for pure functions—I build minimal frames covering edge cases like duplicate keys and null join keys.


Final prep checklist

What should you rehearse before pandas interviews?

Checklist:

  • loc / iloc / boolean indexing
  • groupby agg vs transform with reset_index
  • merge types + anti-join + validate=
  • pivot_table and melt
  • Missing data strategy out loud
  • Vectorization over apply/iterrows
  • Time series resample, shift, rolling
  • One top-N per group scenario
  • One messy CSV cleaning narrative
  • Python interviews language refresh
  • SQL interviews join analogy
  • Data science interviews if ML follow-up

A strong answer is:

I time-box groupby + merge + cleaning on one dataset daily, state row counts after every join, and rehearse one leakage-safe feature story for ML roles.


Pattern cheat sheet (quick reference)

Task pandas approach
Filter rows Boolean mask + loc
Summary by group groupby().agg()
Per-row group stat transform
SQL inner join merge(how="inner")
Customers not in orders Left merge + indicator
Prevent join explosion validate="many_to_one"
Wide → long melt
Cross-tab pivot_table
Fast math column Vectorized ops, not apply
Large CSV chunksize or Parquet
MoM growth Grouper + pct_change

References

pandas interview prep

On-site prep


Summary

Pandas interviews test groupby, merge semantics, and vectorized cleaning—state row counts after every join and explain why you chose merge over join. Pair with SQL for warehouse logic and data science interviews when modeling follows.

Deepak Prasad

R&D Engineer

Founder of GoLinuxCloud with more than 15 years of expertise in Linux, Python, Go, Laravel, DevOps, Kubernetes, Git, Shell scripting, OpenShift, AWS, Networking, and Security. With extensive …