42+ Data Science Interview Questions and Answers 2026

Data science interviews blend statistics, machine learning judgment, SQL and Python execution, and product sense. Hiring managers want to hear how you reason about noisy data, design experiments, and explain trade-offs—not only textbook definitions.

Below are 42 questions grouped by topic: statistics and probability, experimentation, ML fundamentals, coding (SQL and pandas), and behavioral or case-style prompts. For deeper practice, see the dedicated pandas interview questions (groupby, merge, cleaning, vectorization) and, when analytics pipelines read from OLTP sources, the PostgreSQL interview questions. Try each question out loud before opening the answer.

NOTE

Role split: Product analytics loops weight SQL, metrics, and A/B tests. Applied ML loops add model design, feature engineering, and deployment trade-offs. Research loops go deeper on theory. Read the job description (JD) before you optimize your prep.

Interview context and how to prepare

What do data science technical interviews actually test?

Data science interviews usually test whether you can turn messy data into a reliable business or product decision. Interviewers are not only checking if you know formulas or algorithm names.

They usually score four areas:

Statistical reasoning — Can you reason about uncertainty, sampling bias, confidence intervals, p-values, experiment design, and whether a result is trustworthy?
ML judgment — Can you choose the right model, metric, validation strategy, and explain trade-offs such as accuracy vs interpretability or precision vs recall?
Execution — Can you write SQL, Python, or pandas under time pressure without overcomplicating the solution?
Communication — Can you explain assumptions, limitations, trade-offs, and the business impact in simple language?

A strong answer usually connects the technical method to a decision.

For example, instead of saying:

“I would train a random forest and check accuracy.”

A stronger answer is:

“I would first define the business goal and the cost of false positives vs false negatives. If missing a positive case is expensive, I would optimize recall or F1 instead of accuracy. I would also check performance by segment before recommending deployment.”

That type of answer shows practical judgment, not just textbook knowledge.

What interview formats should you expect?

A typical data science interview process has 3–6 rounds, depending on the company, seniority, and role type.

Round	Duration	What it tests
Recruiter screen	20–30 min	Background, role fit, salary range, notice period
SQL / Python screen	45–60 min	Joins, aggregations, window functions, data cleaning, basic algorithms
Statistics & ML	45–90 min	Probability, hypothesis testing, metrics, model selection, validation
Product / business case	45–60 min	Metrics, funnel analysis, A/B testing, recommendation design
Behavioral round	30–60 min	Collaboration, ownership, conflict, stakeholder communication
Project deep-dive	45–60 min	Past work, impact, trade-offs, technical decisions
Take-home assignment	2–6 hours	End-to-end analysis, code quality, assumptions, communication

The exact format depends on the role:

Role type	Extra focus
Product data scientist	Metrics, experimentation, product sense, stakeholder communication
ML data scientist	Modeling, validation, feature engineering, deployment risks
Analytics data scientist	SQL, dashboards, funnel analysis, business recommendations
Senior data scientist	Ambiguity, project leadership, trade-offs, measurable impact

If you get a take-home assignment, treat it like a small production project:

Add a short README
State assumptions clearly
Keep the notebook or script reproducible
Explain why you chose each metric
Include limitations and next steps
Avoid unnecessary complexity

A good take-home is not only correct. It should be easy for the reviewer to run, understand, and trust.

What is a realistic 4–6 week prep plan?

A realistic data science interview prep plan should balance SQL, statistics, ML, product thinking, and communication. Do not spend all your time memorizing machine learning algorithms.

Week	Focus	Output
1	Statistics refresh	Explain p-values, confidence intervals, distributions, sampling bias, and correlation vs causation
2	SQL	Solve 15–20 timed SQL problems using joins, aggregations, CTEs, and window functions
3	pandas + Python	Practice groupby, merge, filtering, missing data, sorting, and basic data transformation
4	ML fundamentals	Review classification metrics, regression metrics, bias-variance, regularization, trees, boosting, and model validation
5	A/B testing + product cases	Practice 3–5 cases involving funnels, experiments, conversion drops, churn, or recommendations
6	Mock interviews + behavioral stories	Complete 2 mock loops and prepare 6–8 STAR stories from past projects

For most mid-level roles, the highest-ROI prep is:

Timed SQL practice
Statistics and A/B testing
Explaining ML trade-offs clearly
Project stories with measurable business impact

A good weekly routine is:

Activity	Frequency
SQL practice	4–5 days/week
Statistics review	3 days/week
ML concept review	2–3 days/week
Product case practice	2 days/week
Behavioral/story practice	1–2 days/week

When practicing, do not only check whether your answer is correct. Also ask:

Did I clarify the goal before solving?
Did I mention assumptions?
Did I choose the right metric?
Did I explain trade-offs?
Did I connect the answer to a business decision?

That is the difference between a technically correct answer and an interview-ready answer.

How do product analytics, applied ML, and research DS roles differ in interviews?

Data science interviews are not the same for every role. A product analytics role may feel very different from an applied ML or research data science role.

Lane	Heavier rounds	Lighter rounds
Product analytics / product DS	SQL, metrics, experimentation, funnels, A/B testing, stakeholder communication	Deep learning theory, complex model architecture
Applied ML / ML data scientist	Feature engineering, model evaluation, validation, deployment risks, ML system design	Ad-hoc SQL trivia, pure probability puzzles
Research DS / research scientist	Theory, papers, novel methods, mathematical reasoning, experimental design	Dashboarding, routine reporting, basic business metrics

For a product analytics role, expect questions like:

“A metric dropped by 10%. How would you investigate?”
“How would you design an experiment for a new recommendation feature?”
“Which metric would you use for user retention?”
“How would you explain the result to a product manager?”

For an applied ML role, expect more questions like:

“How would you handle data leakage?”
“Which metric would you choose for an imbalanced classification problem?”
“How would you monitor model performance after deployment?”
“What would you do if offline performance is good but online performance is poor?”

For a research DS role, expect deeper questions around:

Model assumptions
Statistical validity
Paper discussion
Mathematical reasoning
Novel method design
Experimental rigor

In 2026, many interviews also include LLM-assisted workflow questions, especially for mid-level and senior candidates.

Examples:

“How would you use an LLM to speed up exploratory data analysis?”
“Where would you not trust an LLM in a data science pipeline?”
“How would you validate LLM-generated SQL or Python?”
“How would you prevent hallucinated insights in an automated reporting workflow?”

A strong answer should not say, “I will use ChatGPT to solve it.”

A better answer is:

“I may use an LLM to speed up boilerplate SQL, summarize documentation, or generate initial analysis ideas. But I would still validate the query, inspect the data, test assumptions, check edge cases, and review whether the final recommendation is statistically and business-wise sound.”

That answer shows the right balance: you can use modern tools, but you still own the reasoning, validation, and final decision.

Statistics and probability

What is the difference between correlation and causation?

Correlation means two variables move together. When one changes, the other also tends to change, either in the same direction or the opposite direction.

Causation means changing one variable directly produces a change in another variable.

Correlation does not imply causation because the observed relationship may instead be explained by:

Confounding variables — a third variable affects both
Reverse causality — Y may be causing X instead of X causing Y
Selection bias — the observed sample is not representative
Coincidence — two variables move together by chance

Example:

Ice cream sales and drowning incidents may be correlated, but buying ice cream does not cause drowning. Both increase during summer, so weather/season is the confounder.

In an interview, a strong answer should also explain how you would test causality:

Run a randomized controlled experiment if possible
Use A/B testing for product changes
Use quasi-experimental methods such as difference-in-differences, matching, instrumental variables, or regression discontinuity when randomization is not possible
Check whether the causal story makes business and domain sense

A good interview response is:

“Correlation is useful for finding relationships, but I would not make a product or business decision from correlation alone. I would check for confounders and, if possible, design an experiment to estimate the causal effect.”
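A small simulation can make the confounding story concrete. The sketch below assumes made-up coefficients and noise levels (they are illustrative, not real data): temperature drives both ice cream sales and drownings, and the two outcomes have no direct link to each other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """Residuals of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)

# Hypothetical data-generating process: temperature is the only common cause
temperature = [rng.gauss(0, 1) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 1) for t in temperature]
drownings = [1.5 * t + rng.gauss(0, 1) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
# Correlation that remains once temperature is regressed out of both variables
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print(f"raw correlation: {raw_corr:.2f}")
print(f"correlation after removing temperature: {partial_corr:.2f}")
```

The raw correlation is strong, while the correlation left after controlling for temperature is close to zero. In practice you rarely observe every confounder, which is why a randomized experiment is preferred when it is possible.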

What is a p-value, and what is a common misinterpretation?

A p-value is the probability of observing a result at least as extreme as the one in your sample, assuming the null hypothesis is true.

For example, if the null hypothesis says there is no difference between variant A and variant B, a p-value tells you how surprising your observed result would be under that assumption.

Common misinterpretation:

“p = 0.03 means there is a 97% chance the effect is real.”

That is incorrect. A p-value is not the probability that the null hypothesis is false. It is also not the probability that your result will repeat in the future.

In interviews, pair p-values with:

Effect size — how large is the difference?
Confidence interval — what range of values is plausible?
Sample size — is the test underpowered or overpowered?
Practical significance — does the result matter for the business?
Experiment design — was the data collected correctly?

Example:

A new checkout page improves conversion from 10.0% to 10.1% with p < 0.05. The result may be statistically significant, but the business impact may be too small to justify engineering effort.

A strong interview answer is:

I would not look at the p-value alone. I would also check the effect size, confidence interval, business impact, and whether the experiment design was valid.
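To see how a tiny lift can still be statistically significant, here is a minimal sketch of a pooled two-proportion z-test applied to the checkout example. The traffic figure (2,000,000 users per arm) is an assumption chosen for illustration; the source example does not state a sample size.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in proportions using a pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value

# Hypothetical traffic: 2,000,000 users per arm, 10.0% vs 10.1% conversion
lift, z_stat, p_value = two_proportion_z_test(200_000, 2_000_000, 202_000, 2_000_000)
print(f"absolute lift = {lift:.4f}, z = {z_stat:.2f}, p = {p_value:.4f}")
```

With this much traffic the p-value falls well below 0.05 even though the absolute lift is only 0.1 percentage points, so the decision still hinges on whether that lift is worth the engineering effort.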

Explain Type I and Type II errors.

Type I and Type II errors describe two ways a statistical decision can be wrong.

Error	Meaning	Example
Type I error	False positive: you reject a true null hypothesis	You conclude a feature improved conversion when it actually did not
Type II error	False negative: you fail to reject a false null hypothesis	You miss a real improvement because the test did not detect it

A simple way to remember:

Type I error = false alarm
Type II error = missed detection

The significance level (alpha) controls the Type I error rate. For example, alpha = 0.05 means you accept a 5% chance of a false positive when the null hypothesis is true.

The power of a test is related to Type II error:

Term	Meaning
beta	Type II error rate: probability of missing a real effect
Power	1 - beta: probability of detecting a real effect

There is usually a trade-off:

Lowering alpha reduces false positives
But it can increase false negatives if sample size stays the same
Increasing sample size can improve power

Example:

In fraud detection, a company may tolerate more Type I errors because false positives can be manually reviewed. But Type II errors may be expensive because real fraud is missed.

In medical testing, the trade-off depends on the disease, treatment risk, and cost of missing a true case.

A strong interview answer connects the error type to cost:

“I would choose the threshold based on the cost of false positives versus false negatives. The right threshold is a business and risk decision, not only a statistical one.”
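The alpha, power, and sample size trade-off can be checked numerically. This is a rough sketch using the normal approximation for a two-sided two-proportion test; the baseline rate, treatment rate, and sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_control, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions."""
    norm = NormalDist()
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treatment * (1 - p_treatment) / n_per_arm)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(p_treatment - p_control) / se
    # Probability that the test statistic lands in either rejection region
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

# Hypothetical scenario: 10% baseline conversion, 11% in treatment
power_base = power_two_proportions(0.10, 0.11, n_per_arm=5_000, alpha=0.05)
power_strict = power_two_proportions(0.10, 0.11, n_per_arm=5_000, alpha=0.01)
power_more_data = power_two_proportions(0.10, 0.11, n_per_arm=15_000, alpha=0.01)

for label, power in [("alpha=0.05, n=5,000", power_base),
                     ("alpha=0.01, n=5,000", power_strict),
                     ("alpha=0.01, n=15,000", power_more_data)]:
    print(f"{label}: power={power:.2f}, beta={1 - power:.2f}")
```

Tightening alpha at a fixed sample size lowers power (more Type II errors), and adding data recovers it.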

What is the Central Limit Theorem and why does it matter?

The Central Limit Theorem says that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, even if the original data is not normally distributed.

This works under important assumptions, such as:

Observations are independent or approximately independent
The sample is reasonably representative
The underlying distribution has finite variance
The sample size is large enough for the approximation to be useful

Why it matters:

It justifies normal approximations for sample means
It supports many confidence intervals and hypothesis tests
It explains why averages become more stable with larger samples
It is widely used in A/B testing and business metric analysis

Important clarification:

The CLT does not mean the original data becomes normal. It means the distribution of the sample mean becomes approximately normal as sample size grows.

Example:

Individual order values in an ecommerce store may be highly skewed because some customers place very large orders. But if you repeatedly take large random samples and calculate the average order value, those sample averages will often look approximately normal.

A strong interview answer is:

The CLT lets us reason about averages using normal approximations, but I would still check sample size, independence, skew, outliers, and whether the metric is suitable for that approximation.
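A quick simulation illustrates the distinction between the raw data and the sample mean. The sketch below uses an exponential distribution with mean 50 as a stand-in for skewed order values; that distribution and the sample sizes are assumptions for illustration only.

```python
import random

def skewness(values):
    """Sample skewness: the third standardized moment (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)

# Individual "order values": exponential with mean 50, heavily right-skewed
orders = [rng.expovariate(1 / 50) for _ in range(20_000)]

# Average order value from 2,000 independent samples of 100 orders each
sample_means = [
    sum(rng.expovariate(1 / 50) for _ in range(100)) / 100
    for _ in range(2_000)
]

skew_raw = skewness(orders)
skew_means = skewness(sample_means)
print(f"skewness of individual orders: {skew_raw:.2f}")
print(f"skewness of sample means:      {skew_means:.2f}")
```

The individual orders stay strongly skewed no matter how many you collect, while the distribution of the sample means is much closer to symmetric.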

When would you use Bayesian vs frequentist inference?

Frequentist inference treats parameters as fixed but unknown. Probability is interpreted as long-run frequency.

It is common when:

You need standard hypothesis tests
You are running classical A/B tests
You want p-values and confidence intervals
The organization already uses frequentist experimentation
Regulatory or reporting standards expect frequentist methods

Bayesian inference treats parameters as uncertain and represents them using probability distributions. You start with a prior belief and update it with observed data.

It is useful when:

You have meaningful prior information
Data is sparse
You want probability-style statements
You need shrinkage across many groups
You are monitoring results sequentially
Stakeholders ask questions like “What is the probability that variant B is better?”

Example:

For a standard website A/B test with large traffic, a frequentist test may be simple and acceptable.

For estimating conversion rates across many small regions, Bayesian methods can help because low-volume regions can borrow strength from the overall population instead of producing noisy estimates.

A strong comparison:

Approach	Good for	Interview wording
Frequentist	Standard tests, large samples, established reporting	“What would happen over repeated samples?”
Bayesian	Prior knowledge, sparse data, probability statements	“Given the data, what is the probability of the effect?”

A good interview answer is:

“I would use frequentist methods when the experiment is standard, high-volume, and the company expects p-values. I would consider Bayesian methods when prior information matters, data is sparse, or stakeholders need probability-based decision statements.”
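Both Bayesian ideas mentioned above, probability statements and shrinkage, fit in a few lines. This is a minimal sketch assuming a Beta-Binomial model with a uniform Beta(1, 1) prior for the A/B comparison; the counts, the prior rate, and the prior strength are hypothetical.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a Beta(1, 1) prior, the posterior is Beta(1 + successes, 1 + failures)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

def shrunk_rate(conversions, visitors, prior_rate, prior_strength):
    """Posterior mean with a Beta prior centred on prior_rate, worth prior_strength pseudo-visitors."""
    return (conversions + prior_rate * prior_strength) / (visitors + prior_strength)

# Hypothetical test: A converts 100/1,000, B converts 130/1,000
p_better = prob_b_beats_a(100, 1_000, 130, 1_000)
print(f"P(B > A) ~ {p_better:.3f}")

# A tiny region with 3 conversions from 10 visitors is pulled toward a 10% overall rate
small_region = shrunk_rate(3, 10, prior_rate=0.10, prior_strength=100)
print(f"raw rate = {3 / 10:.2f}, shrunk rate = {small_region:.3f}")
```

The first number answers the stakeholder question directly. The second shows how a low-volume region borrows strength from the overall rate instead of reporting a noisy 30%.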

You flip a fair coin until you get heads. What is the expected number of flips?

This follows a geometric distribution.

Each flip has probability:

Outcome	Probability
Heads	0.5
Tails	0.5

If the probability of success is p, the expected number of trials until the first success is:

Expected value = 1 / p

For a fair coin:

Expected flips = 1 / 0.5 = 2

So the expected number of flips is 2.

One way to reason about it:

Sequence	Probability	Number of flips
H	1/2	1
TH	1/4	2
TTH	1/8	3
TTTH	1/16	4

The weighted average of these outcomes equals 2.

Follow-up answers:

Question	Answer
Expected flips	2
Variance	(1 - p) / p² = 2
Probability it takes more than 3 flips	(1/2)³ = 1/8

In interviews, probability puzzles often test whether you can identify the distribution before calculating.

A strong answer is:

This is a geometric distribution because we are waiting for the first success. Since p = 0.5, the expected number of flips is 1/p = 2.
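If you want to sanity-check the answer, a short simulation and a partial sum of the series from the table above both land near 2. The number of trials and the seed are arbitrary choices.

```python
import random

def flips_until_heads(rng, p_heads=0.5):
    """Simulate flips until the first head and return how many flips it took."""
    flips = 1
    while rng.random() >= p_heads:  # this flip was tails, so flip again
        flips += 1
    return flips

rng = random.Random(7)
trials = [flips_until_heads(rng) for _ in range(100_000)]

simulated_mean = sum(trials) / len(trials)
share_over_three = sum(t > 3 for t in trials) / len(trials)

# Partial sum of k * (1/2)^k, the weighted average of the outcomes
series_mean = sum(k * 0.5 ** k for k in range(1, 60))

print(f"simulated mean flips: {simulated_mean:.3f}")
print(f"series approximation: {series_mean:.6f}")
print(f"share needing more than 3 flips: {share_over_three:.3f}")
```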

Experimentation and A/B testing

How do you design an A/B test for a new checkout button?

I would start by defining the decision the test must support: should we ship the new checkout button or keep the current one?

A good A/B test design includes:

Hypothesis — “Changing the checkout button from blue to green increases checkout completion.”
Primary metric — checkout completion rate, defined clearly as completed orders / users who reached checkout.
Guardrail metrics — revenue per user, refund rate, page load time, payment errors, support tickets.
Randomization unit — user-level or account-level, not session-level, to avoid the same user seeing both variants.
Sample size — calculate using baseline conversion rate, minimum detectable effect, power, and significance level.
Duration — run long enough to cover weekday/weekend behavior and normal traffic cycles.
Analysis plan — decide the stopping rule, primary metric, exclusion rules, and segment checks before launch.

Also document exclusions such as bots, internal users, test orders, unsupported countries, or payment methods out of scope.
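The sample size and duration steps can be sketched with the standard normal-approximation formula for comparing two proportions. The baseline rate, minimum detectable effect, and daily traffic below are hypothetical inputs, not values from a real experiment.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    norm = NormalDist()
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

# Hypothetical inputs: 10% baseline completion, detect a 1 percentage point lift
n_needed = sample_size_per_arm(baseline=0.10, mde_abs=0.01)

# Hypothetical traffic: 4,000 eligible users per day across both arms
daily_users = 4_000
days_to_run = ceil(2 * n_needed / daily_users)
weeks_to_run = ceil(days_to_run / 7)  # round up to whole weeks to cover weekday/weekend cycles

print(f"users per arm: {n_needed}")
print(f"run for about {days_to_run} days, rounded up to {weeks_to_run} full week(s)")
```

Halving the minimum detectable effect roughly quadruples the required sample, which is often the deciding factor in whether a test is feasible.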

Before trusting the result, I would check for sample ratio mismatch. If traffic was supposed to split 50/50 but the actual split is 60/40, there may be a tracking or assignment bug.
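A sample ratio mismatch check is a chi-square goodness-of-fit test on the assignment counts. The sketch below handles the two-arm case with the standard library only; the counts are hypothetical, and the strict 0.001 threshold is a convention some teams use rather than a fixed rule.

```python
from math import erfc, sqrt

def srm_check(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2))
    return chi2, erfc(sqrt(chi2 / 2))

# Hypothetical counts for a planned 50/50 split
chi2_ok, p_ok = srm_check(5_050, 4_950)
chi2_bad, p_bad = srm_check(6_000, 4_000)

for label, p in [("5,050 vs 4,950", p_ok), ("6,000 vs 4,000", p_bad)]:
    verdict = "possible SRM, investigate assignment" if p < 0.001 else "split looks consistent"
    print(f"{label}: p = {p:.4g} -> {verdict}")
```

A small imbalance is compatible with random assignment, while a 60/40 split at this volume is not, so the result of that experiment should not be trusted until the cause is found.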

A strong interview answer is:

I would not only check whether the green button wins. I would verify that the experiment was assigned correctly, the metric was defined correctly, and no guardrail metric got worse.

Why is peeking at A/B test results early dangerous?

Peeking is dangerous when you repeatedly check the test result and stop as soon as p < 0.05.

Each look is like running another hypothesis test. If you keep checking until the result becomes significant, the false positive rate becomes higher than the planned alpha level.

Example:

If you plan for alpha = 0.05 but check the result every day and stop on the first significant day, your actual false positive risk may be much higher than 5%.

Good mitigations:

Use a fixed sample size and analyze once at the end
Use sequential testing if early stopping is required
Define stopping rules before the experiment starts
Monitor guardrail metrics during the test, but do not declare a winner early without the right method

It is fine to monitor a running test for bugs, outages, sample ratio mismatch, or severe guardrail regressions. The problem is making a “winner” decision from repeated unplanned significance checks.

A strong interview answer is:

Peeking inflates false positives. I would monitor the experiment for health issues, but I would only make the final decision using the planned stopping rule or a valid sequential testing method.
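The inflation is easy to demonstrate with simulated A/A tests, where there is no true difference and every "winner" is a false positive. This sketch assumes a unit-variance, normally distributed metric so each batch can be summarized by its sum; the number of looks, batch size, and number of simulations are arbitrary.

```python
import random
from statistics import NormalDist

def run_aa_test(rng, n_looks, batch_size, z_crit):
    """Simulate one A/A test; return (significant at any look, significant at final look)."""
    cumulative_diff = 0.0
    hit_any_look = False
    z = 0.0
    for look in range(1, n_looks + 1):
        # Under the null with unit-variance data, the difference between the two arms'
        # batch sums is Normal(0, 2 * batch_size)
        cumulative_diff += rng.gauss(0.0, (2 * batch_size) ** 0.5)
        z = cumulative_diff / (2 * batch_size * look) ** 0.5
        if abs(z) > z_crit:
            hit_any_look = True
    return hit_any_look, abs(z) > z_crit

rng = random.Random(1)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05
results = [run_aa_test(rng, n_looks=20, batch_size=500, z_crit=z_crit)
           for _ in range(4_000)]

peeking_fpr = sum(any_look for any_look, _ in results) / len(results)
fixed_fpr = sum(final for _, final in results) / len(results)

print(f"false positive rate, stop at first significant look: {peeking_fpr:.3f}")
print(f"false positive rate, single analysis at the end:     {fixed_fpr:.3f}")
```

The single planned analysis stays near the nominal 5%, while stopping at the first significant look produces several times as many false winners.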

How do you handle multiple comparisons across many metrics?

Multiple comparisons become a problem when you test many metrics, segments, or variants and then highlight only the significant results.

If you test 20 independent metrics at alpha = 0.05 and none of them has a real effect, you should still expect about one false positive by chance.

Good ways to handle this:

Pre-specify one primary metric
Treat secondary metrics as diagnostic unless corrected
Use Bonferroni correction when you want strict false positive control
Use Benjamini–Hochberg when controlling false discovery rate is more appropriate
Use hierarchical testing where secondary metrics are tested only if the primary metric wins

Example:

If a checkout experiment affects 25 metrics, I would not claim success because one small segment improved. I would first check the primary metric, then inspect guardrails and segments as supporting evidence.

A strong interview answer is:

I would separate decision metrics from diagnostic metrics. The launch decision should depend mainly on the pre-defined primary metric and guardrails, not on cherry-picked significant results.
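Both corrections are short enough to write by hand. The sketch below implements Bonferroni and the Benjamini-Hochberg step-up procedure on a hypothetical list of ten p-values, and also computes the chance of at least one false positive across 20 independent null metrics.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is <= (k / m) * q
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected

# Chance of at least one false positive across 20 independent null metrics
family_wise_risk = 1 - (1 - 0.05) ** 20

# Hypothetical p-values from ten secondary metrics
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

naive_hits = sum(p < 0.05 for p in p_values)
bonferroni_hits = sum(bonferroni(p_values))
bh_hits = sum(benjamini_hochberg(p_values))

print(f"P(at least one false positive in 20 tests) = {family_wise_risk:.2f}")
print(f"significant: naive={naive_hits}, Benjamini-Hochberg={bh_hits}, Bonferroni={bonferroni_hits}")
```

On this list the uncorrected threshold flags five metrics, Benjamini-Hochberg keeps two, and Bonferroni keeps one, which shows how the corrections trade discoveries for stricter error control.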

When should you not run an A/B test?

You should not always run an A/B test. Sometimes the design is invalid, too slow, or ethically wrong.

Skip or delay an A/B test when:

Traffic is too low to reach useful power in a reasonable time
The change is a bug fix, security fix, or legal requirement
Ethical or fairness risks make random exposure inappropriate
Network effects break independence, such as marketplaces, social products, or collaboration tools
Outcomes have long lag, such as 90-day retention or customer lifetime value
The launch is irreversible and you cannot maintain a control group
The expected effect is too small to matter even if statistically significant

Possible alternatives:

Before/after analysis with caution
Difference-in-differences
Geographic rollout
Cluster-level experiment
Synthetic control
Holdout group
Phased rollout with monitoring

A strong interview answer is:

I would not force an A/B test if the assumptions are broken. I would choose the design based on traffic, risk, independence, and how quickly the outcome can be measured.

A key product metric dropped 20% overnight. How do you investigate?

I would treat this as a structured debugging and product analytics problem.

Steps:

Verify the metric — check logging, dashboards, timezone boundaries, numerator/denominator changes, and data delays.
Check scope — is the drop global or limited to one platform, country, browser, app version, or user segment?
Inspect the funnel — find where users dropped: landing page, signup, checkout, payment, activation, or retention.
Check experiments and feature flags — see whether a treatment, rollout, or configuration change affected only some users.
Review deployments — compare the timing with frontend, backend, data pipeline, tracking, or ML model releases.
Look for external causes — outages, holidays, payment provider issues, marketing changes, competitor events, or app store problems.
Communicate clearly — share what changed, who is affected, what has been ruled out, and the next action.

A useful interview structure is:

Question	What it tells you
Is the drop real?	Data issue vs product issue
Who is affected?	Segment or global problem
Where in the funnel?	Likely failure point
What changed recently?	Deployment, experiment, or external trigger
What is the business impact?	Severity and escalation path

A strong interview answer is:

First I would confirm whether the drop is real or caused by instrumentation. Then I would segment the metric, locate the funnel step where the drop starts, compare it with recent launches or experiments, and communicate impact while investigation continues.

Machine learning fundamentals

Explain the bias-variance tradeoff.

The bias-variance tradeoff explains two common sources of model error.

Problem	Meaning	Symptom
High bias	Model is too simple and misses patterns	High train error and high validation error
High variance	Model is too sensitive to training data	Low train error but high validation error

A simple way to remember:

High bias = underfitting
High variance = overfitting

The expected squared prediction error decomposes as:

Expected error = bias² + variance + irreducible noise

Ways to reduce high bias:

Add useful features
Use a more flexible model
Reduce excessive regularization
Improve feature representation

Ways to reduce high variance:

Add more training data
Use regularization
Use cross-validation
Simplify the model
Use bagging or ensembling
Remove noisy or leaking features

A strong interview answer is:

I would compare train and validation performance. If both are poor, I suspect high bias. If train is strong but validation is weak, I suspect high variance.
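The train versus validation comparison can be illustrated with a hand-rolled k-nearest-neighbours regressor, where k controls flexibility. Everything here is a synthetic assumption: the data come from a noisy sine curve, and k-NN is used only because it is easy to write without external libraries.

```python
import math
import random

def knn_predict(x_query, train, k):
    """Average the targets of the k training points closest to x_query."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x_query))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, train, k):
    """Mean squared error of k-NN predictions on data."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)

def make_data(rng, n):
    """Noisy samples from y = sin(x); a synthetic, illustrative dataset."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

rng = random.Random(3)
train, valid = make_data(rng, 200), make_data(rng, 200)

errors = {}
for k in (1, 15, 200):
    # (train MSE, validation MSE) for each level of flexibility
    errors[k] = (mse(train, train, k), mse(valid, train, k))
    print(f"k={k:>3}: train MSE={errors[k][0]:.3f}, validation MSE={errors[k][1]:.3f}")
```

With k = 1 the model memorizes the training set (zero train error, worse validation error), which is the high-variance pattern. With k = 200 it predicts the global mean everywhere, so both errors are high, which is the high-bias pattern. A moderate k sits between the two.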

How do you detect and prevent overfitting?

Overfitting happens when a model learns noise or accidental patterns from the training data instead of generalizable patterns.

How to detect it:

Training score is much better than validation/test score
Validation error increases while training error keeps decreasing
Model performs well offline but poorly in production
Performance changes a lot across cross-validation folds

How to prevent it:

Use proper train/validation/test splits
Use cross-validation
Add regularization such as L1 or L2
Use early stopping
Reduce model complexity
Add more data if possible
Remove noisy or irrelevant features
Use dropout for neural networks
Use data augmentation where appropriate

Always mention data leakage. If information from the future, target, or validation set leaks into training, the model may look excellent during validation but fail in production.

A strong interview answer is:

I would first check whether the validation setup is realistic. If there is no leakage and the train-validation gap is large, I would reduce variance using regularization, simpler models, early stopping, or more data.

What is the difference between L1 and L2 regularization?

L1 and L2 regularization both penalize large model weights, but they behave differently.

	L1 regularization	L2 regularization
Also called	Lasso	Ridge
Penalty	Sum of absolute weights	Sum of squared weights
Effect	Can shrink some weights to zero	Shrinks weights smoothly
Useful for	Feature selection, sparse models	Stable models, correlated features
Interpretability	Often more interpretable	Usually keeps more features

Elastic Net combines L1 and L2 regularization.

Use L1 when you have many features and want a sparse model. Use L2 when features are correlated and you want smoother weight shrinkage.

Important practical note: regularization is sensitive to feature scale, so features are usually standardized before applying L1 or L2 in linear models.

A strong interview answer is:

L1 can remove features by pushing weights to zero, while L2 usually keeps all features but shrinks their impact.
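The different shrinkage behaviour is easiest to see in a simplified setting. The sketch below assumes orthonormal features, where the penalized least-squares solutions reduce to per-coefficient rules: soft-thresholding for an L1 penalty of `lam * |w|` and proportional shrinkage for an L2 penalty of `(lam / 2) * w²`. The starting weights and `lam` are made up.

```python
def lasso_shrink(weight, lam):
    """Soft-thresholding: the L1 solution for one coefficient under orthonormal features."""
    if weight > lam:
        return weight - lam
    if weight < -lam:
        return weight + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(weight, lam):
    """Proportional shrinkage: the L2 solution for one coefficient under orthonormal features."""
    return weight / (1 + lam)

# Hypothetical unregularized least-squares coefficients
ols_weights = [3.0, -1.2, 0.4, -0.1, 0.05]
lam = 0.5

l1_weights = [lasso_shrink(w, lam) for w in ols_weights]
l2_weights = [ridge_shrink(w, lam) for w in ols_weights]

print("OLS:", ols_weights)
print("L1: ", [round(w, 3) for w in l1_weights])
print("L2: ", [round(w, 3) for w in l2_weights])
```

L1 zeroes out the three smallest coefficients, while L2 scales every coefficient down but leaves all of them non-zero. With correlated features the solutions no longer have this simple form, but the qualitative behaviour is the same.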

What are key assumptions of linear regression?

Classical linear regression has several important assumptions:

Linearity — the relationship between features and target is approximately linear.
Independence — errors are independent, which is especially important in time series or repeated-user data.
Homoscedasticity — residuals have roughly constant variance.
No strong multicollinearity — features should not be almost duplicates of each other.
Normality of residuals — mainly important for confidence intervals and p-values, especially with small samples.

If assumptions are violated, possible fixes include:

Transforming the target or features
Adding interaction terms
Using robust standard errors
Removing or combining highly correlated features
Choosing a different model family

For prediction-only tasks, mild assumption violations may be acceptable if validation performance is strong. For inference, assumptions matter more because coefficients, p-values, and confidence intervals need to be trustworthy.

See simple linear regression in Python for practical fitting.

Random Forest vs Gradient Boosting — when do you pick each?

Random Forest and Gradient Boosting are both tree-based ensemble methods, but they work differently.

	Random Forest	Gradient Boosting
Main idea	Builds many trees independently and averages them	Builds trees sequentially to correct previous errors
Strength	Robust default, less tuning, handles noise well	Often higher accuracy with careful tuning
Risk	Can be less accurate on structured/tabular tasks than tuned boosting	Can overfit if learning rate, depth, or iterations are poorly tuned
Training	Easier to parallelize	More sequential
Good first choice	Strong baseline	Performance-focused model

I would usually start with a simple baseline such as logistic regression or a Random Forest. Then I would try Gradient Boosting if extra lift justifies the added tuning and monitoring cost.

A strong interview answer is:

I would choose Random Forest when I need a robust baseline quickly. I would choose Gradient Boosting when predictive performance matters more and I have time to tune and validate carefully.

When is unsupervised learning appropriate?

Use unsupervised learning when labels are missing, expensive, unreliable, or not clearly defined.

Common use cases:

Method	Use case
Clustering	Customer segmentation, user behavior groups
Dimensionality reduction	Visualization, noise reduction, feature compression (for example with PCA or UMAP)
Anomaly detection	Fraud, equipment failure, suspicious activity
Topic modeling	Grouping documents or support tickets

The main challenge is validation. Since there is no true label, you should validate results using:

Business interpretability
Cluster stability
Segment usefulness
Domain expert review
Downstream performance

Example:

Customer clusters are only useful if marketing, product, or sales teams can act on them. A cluster that is mathematically clean but not business-interpretable may not be valuable.

See supervised learning algorithms for the labeled counterpart.

How do you handle class imbalance?

Class imbalance happens when one class is much rarer than another, such as fraud, churn, disease detection, or failure prediction.

Do not rely on accuracy alone. A model can get high accuracy by mostly predicting the majority class.

Better approaches:

Use metrics such as precision, recall, F1-score, PR-AUC, and confusion matrix
Use stratified train/test splits
Apply class weights in the loss function
Undersample the majority class or oversample the minority class
Use SMOTE carefully and only on the training set
Tune the decision threshold on a validation set
Consider anomaly detection for extremely rare events

The right metric depends on business cost.

Example:

In fraud detection, false positives may create manual review work, but false negatives may allow real fraud. In medical screening, missing a true positive may be much more costly than sending a false alarm for follow-up.

A strong interview answer is:

I would first define the cost of false positives and false negatives. Then I would choose the metric and threshold based on that cost, not accuracy alone.
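The accuracy trap and the class-weight idea can both be shown with a toy dataset. The counts below are hypothetical, and the weighting rule shown, `n_samples / (n_classes * class_count)`, is one common inverse-frequency heuristic rather than the only option.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall for the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    return correct / len(y_true), true_pos / positives

def balanced_class_weights(y_true):
    """Weights inversely proportional to class frequency."""
    classes = sorted(set(y_true))
    n = len(y_true)
    return {c: n / (len(classes) * y_true.count(c)) for c in classes}

# Hypothetical dataset: 10 fraud cases in 1,000 transactions
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "model" that never predicts fraud

accuracy, recall = accuracy_and_recall(y_true, always_negative)
weights = balanced_class_weights(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
print(f"class weights: {weights}")
```

The do-nothing model scores 99% accuracy while catching no fraud at all. The weights make each rare positive count about as much in the loss as 99 negatives, which pushes a trained model to pay attention to the minority class.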

Model evaluation and metrics

Explain precision and recall. When does each matter?

Precision answers: “Of the cases predicted positive, how many were actually correct?”

Precision = TP / (TP + FP)

Recall answers: “Of all actual positive cases, how many did the model catch?”

Recall = TP / (TP + FN)

Priority	Example	Why
High precision	Spam filter, fraud alert, sales lead scoring	False positives are costly or annoying
High recall	Cancer screening, security threat detection, high-risk fraud	Missing a true case is costly

F1-score is the harmonic mean of precision and recall. It is useful when you need one summary metric, especially for imbalanced classification.

However, interviews expect more than formulas. A strong answer should connect the metric to business cost:

“If false positives are expensive, I optimize precision. If false negatives are expensive, I optimize recall. The final threshold should be chosen using validation data and business cost.”
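The threshold trade-off can be seen by scoring the same predictions at two cut-offs. The labels, scores, and thresholds below are hypothetical.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F1 when predicting positive for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels and model scores
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]

strict = precision_recall_f1(y_true, scores, threshold=0.70)
lenient = precision_recall_f1(y_true, scores, threshold=0.35)

for name, (p, r, f1) in [("threshold 0.70", strict), ("threshold 0.35", lenient)]:
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
```

The strict threshold gives perfect precision but catches only 3 of 5 positives. The lenient threshold catches all 5 at the cost of 2 false positives. Which one is better depends on the relative cost of each error type.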

ROC-AUC vs PR-AUC — which do you report for imbalanced data?

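ROC-AUC measures how well the model ranks positives above negatives across thresholds. It is the area under the ROC curve, which plots:

True positive rate (y-axis)
False positive rate (x-axis)

PR-AUC is the area under the precision-recall curve, which plots:

Precision (y-axis)
Recall (x-axis)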

For heavily imbalanced data, PR-AUC is often more informative because it focuses on performance for the positive class.

Example:

In fraud detection, 99.5% of transactions may be legitimate. A model can have a good ROC-AUC while still producing too many false positives or missing too much fraud. PR-AUC gives a clearer view of how well the model handles the rare positive class.

In an interview, say:

Use ROC-AUC when classes are reasonably balanced or ranking quality is the main concern.
Use PR-AUC when the positive class is rare and positive-class performance matters most.
Also report a confusion matrix at the chosen threshold.

A strong answer is:

For imbalanced data, I would prioritize PR-AUC, but I would still choose the operating threshold using precision, recall, confusion matrix, and business cost.
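To see the gap between the two metrics, here is a small sketch using scikit-learn on synthetic scores. The data is an assumption built to mimic the 0.5% fraud rate in the example above, and `average_precision_score` is used as one common summary of the PR curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, assumed data: 0.5% positives, as in the fraud example.
n_neg, n_pos = 19_900, 100
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
# Positives score higher on average, but the two distributions overlap.
y_score = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(2.0, 1.0, n_pos)])

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)
baseline = y_true.mean()  # a random ranker scores roughly the positive rate on PR-AUC

print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}  positive rate={baseline:.3f}")
```

On data like this, ROC-AUC tends to look strong while PR-AUC stays much lower, because even a small false positive rate produces many false alarms when negatives outnumber positives by about 200 to 1.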

Why use k-fold cross-validation?

K-fold cross-validation estimates how well a model generalizes to unseen data.

The dataset is split into k folds. The model trains on k - 1 folds and validates on the remaining fold. This repeats until each fold has been used as validation once.

Benefits:

More stable estimate than a single train/test split
Better use of limited data
Helps compare models and hyperparameters
Shows whether performance changes a lot across folds

Important caveats:

Do not randomly shuffle time series data. Use time-based splits or forward chaining.
Do not fit preprocessing on the full dataset before cross-validation.
Use pipelines so scaling, encoding, imputation, and feature selection are learned only from the training fold.
Keep a final untouched test set if you need an unbiased final estimate.

A strong interview answer is:

Cross-validation gives a more reliable estimate of generalization, but it must be designed to match the data. For time series or grouped data, such as many rows per user, I would avoid random splits that leak information.
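The pipeline caveat can be sketched with scikit-learn as follows. The dataset is synthetic and the model choice is only an assumption for illustration; the point is that the scaler sits inside the pipeline, so each fold fits it on that fold's training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real tabular dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Preprocessing is part of the estimator, so it is refit inside every fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"fold AUCs: {scores.round(3)}")
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")
```

For time-ordered data, `TimeSeriesSplit` could replace the shuffled splitter, and `GroupKFold` is one option when several rows belong to the same user.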

RMSE vs MAE for regression — when does each matter?

MAE and RMSE both measure regression error, but they behave differently.

Metric	Meaning	Best when
MAE	Average absolute error	You want the typical error size and do not want a few outliers to dominate
RMSE	Square root of average squared error	Large errors should be penalized more (also in the target's unit)

Use MAE when you want a simple, interpretable error metric.

Example:

If MAE is 5 minutes for delivery time prediction, the model is off by about 5 minutes on average.

Use RMSE when large mistakes are disproportionately costly.

Example:

In demand forecasting, a few very large prediction errors may cause stockouts or excess inventory, so RMSE may be more useful.

A strong interview answer is:

I would use MAE when I care about typical error and interpretability. I would use RMSE when large errors are much more costly than small errors.
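A small worked example makes the difference concrete. The error values below are hypothetical; both models have the same MAE, but only one of them has a large miss.

```python
import math


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical delivery-time errors in minutes for two models.
steady = [5, -5, 5, -5, 5]   # always off by 5 minutes
spiky = [1, -1, 1, -1, 21]   # usually close, with one large miss

for name, errors in [("steady", steady), ("spiky", spiky)]:
    print(f"{name}: MAE={mae(errors):.2f}  RMSE={rmse(errors):.2f}")
```

Both models have an MAE of 5 minutes, but the RMSE of the second model is about 9.4 because the single 21-minute miss is squared before averaging.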

What is model calibration and why does it matter?

A model is calibrated when predicted probabilities match real-world frequencies.

Example:

If 1,000 users are each assigned a conversion probability around 0.80, then about 80% of those users should actually convert.

Calibration matters when the probability itself is used for decisions, not just ranking.

Common examples:

Credit risk scoring
Insurance pricing
Medical risk prediction
Fraud review thresholds
Marketing budget allocation
Lead scoring

A model can rank users well but still be poorly calibrated. For example, it may correctly rank high-risk users above low-risk users, but its predicted probabilities may be too high or too low.

Ways to improve calibration:

Platt scaling
Isotonic regression
Calibration on a separate holdout set
Periodic recalibration when data drifts

A strong interview answer is:

Calibration matters when predicted probabilities drive decisions. If a model says 80% risk, that number should mean something operationally, not just help with ranking.
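One simple way to check calibration is a reliability table: bin the predictions and compare the mean predicted probability with the observed rate in each bin. The sketch below uses made-up scores and a hypothetical helper name, `reliability_table`.

```python
from collections import defaultdict


def reliability_table(probs, outcomes, n_bins=5):
    """Bin predictions into equal-width bins; compare mean prediction to observed rate."""
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        rows.append((idx, len(pairs), mean_pred, observed))
    return rows


# Hypothetical scores: the model says 0.9 for ten users, but only six convert.
probs = [0.9] * 10 + [0.1] * 10
outcomes = [1] * 6 + [0] * 4 + [1] * 1 + [0] * 9

table = reliability_table(probs, outcomes)
for idx, count, mean_pred, observed in table:
    print(f"bin {idx}: n={count} mean_pred={mean_pred:.2f} observed={observed:.2f}")
```

In this toy data the model still ranks the two groups correctly, but the high-score bin is overconfident (0.90 predicted against 0.60 observed), which is the ranking-versus-calibration gap described above.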

Feature engineering and data cleaning

How do you handle missing data?

First, I would understand why the data is missing before filling anything in.

Type	Meaning	Example
MCAR	Missing completely at random	Random logging failure
MAR	Missingness depends on observed data	Income missing more often for younger users
MNAR	Missingness depends on the missing value itself	High-income users avoid entering income

Common tactics:

Drop rows if missingness is small and random
Drop columns if missingness is very high and not useful
Use mean/median imputation for numeric features
Use mode or “unknown” category for categorical features
Add a missingness indicator when missing itself may carry signal
Use model-based imputation for more complex cases

Important interview point: fit imputation only on the training data, then apply it to validation/test data. Otherwise, you leak information from validation or test data into training.

A strong answer is:

I would first check the missingness pattern. Then I would choose deletion, imputation, or missingness indicators based on whether the missing values are random and whether missingness itself is predictive.
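The leakage point can be sketched in pandas as follows. The column name and values are hypothetical; the median is learned from the training rows only and then reused on the test rows, with a missingness flag kept alongside.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split with a numeric column that has gaps.
train = pd.DataFrame({"income": [40.0, 55.0, np.nan, 70.0, np.nan]})
test = pd.DataFrame({"income": [np.nan, 1000.0]})

# Learn the fill value from the training data only.
train_median = train["income"].median()


def impute(df: pd.DataFrame, fill_value: float) -> pd.DataFrame:
    out = df.copy()
    # Record the flag before filling, because missingness itself may carry signal.
    out["income_missing"] = out["income"].isna().astype(int)
    out["income"] = out["income"].fillna(fill_value)
    return out


train_imputed = impute(train, train_median)
test_imputed = impute(test, train_median)  # reuses the training median
print(test_imputed)
```

Computing the median over train and test together would have pulled the extreme test value into the fill value, which is a small version of the leakage described above.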

See pandas dropna for syntax patterns.

What are filter, wrapper, and embedded feature selection?

Feature selection removes weak, noisy, redundant, or expensive features to improve model simplicity and sometimes performance.

Method	Idea	Example	Trade-off
Filter	Score features before modeling	Correlation, chi-square, mutual information	Fast, but may miss interactions
Wrapper	Try feature subsets using model performance	Recursive feature elimination	More accurate, but expensive
Embedded	Model selects features during training	Lasso, tree importance	Good balance, but model-dependent

A practical workflow:

Remove constant or near-constant features
Remove duplicate or extremely high-null features
Check highly correlated features
Train a baseline model
Use embedded importance or cross-validated selection

Important: feature selection should happen inside the training process or cross-validation pipeline. Selecting features using the full dataset can leak information from validation/test data.

A strong answer is:

I would start with simple cleanup, then use model-based or cross-validated feature selection. I would avoid selecting features on the full dataset before splitting.

What is multicollinearity and why is it a problem?

Multicollinearity happens when predictor variables are highly correlated with each other.

Example:

If a model includes both total_spend and average_monthly_spend, those features may carry overlapping information.

Why it matters:

Coefficients become unstable in linear models
It becomes hard to interpret individual feature effects
Standard errors can increase
Small data changes may produce large coefficient changes

How to detect it:

Correlation matrix
Variance inflation factor (VIF)
Large coefficient changes when features are added or removed

How to fix it:

Drop one of the correlated features
Combine features
Use PCA
Use regularization such as Ridge
Use tree-based models if interpretation of individual coefficients is not required

A useful interview distinction:

“Multicollinearity is mainly a problem for interpretation and coefficient stability. It may be less harmful if the goal is only prediction and validation performance is strong.”
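VIF can be computed directly from its definition: regress each feature on the others and take 1 / (1 - R²). The sketch below uses NumPy and synthetic data; the relationship between the two spend features is an assumption made for illustration.

```python
import numpy as np


def vif(X: np.ndarray) -> list[float]:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the others."""
    n, k = X.shape
    result = []
    for j in range(k):
        target = X[:, j]
        # Remaining columns plus an intercept term.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r2 = 1.0 - residual.var() / target.var()
        result.append(1.0 / (1.0 - r2))
    return result


rng = np.random.default_rng(0)
total_spend = rng.normal(1200, 300, 500)
# Assumed relationship: monthly average is total / 12 plus a little noise.
average_monthly_spend = total_spend / 12 + rng.normal(0, 2, 500)
tenure_months = rng.normal(24, 6, 500)  # unrelated feature

X = np.column_stack([total_spend, average_monthly_spend, tenure_months])
vifs = vif(X)
for name, value in zip(["total_spend", "average_monthly_spend", "tenure_months"], vifs):
    print(f"{name}: VIF={value:.1f}")
```

The two spend features should show very large VIF values while the unrelated feature stays near 1. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.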

How do you encode high-cardinality categorical features?

High-cardinality categorical features have many unique values, such as user IDs, product IDs, cities, ZIP codes, or search terms.

Common encoding options:

Method	Use when	Risk
One-hot encoding	Low cardinality	Too many sparse columns
Rare-category grouping	Many infrequent categories	May lose useful detail
Target encoding	Category has predictive signal	Target leakage if done incorrectly
Hashing	Very high cardinality or memory limits	Collisions reduce interpretability
Embeddings	Deep learning or large-scale recommender systems	More complexity and data needed

For high-cardinality features, I would usually avoid plain one-hot encoding unless the number of categories is manageable.

If using target encoding, it must be done carefully:

Use out-of-fold or cross-fitted encoding
Smooth rare categories toward the global mean
Fit encoding only on training data
Handle unseen categories during inference

A strong interview answer is:

For high-cardinality categories, I would consider target encoding with out-of-fold validation, hashing, or rare-category grouping. The main risk is leakage, so encoding must be fitted only from training folds.
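A minimal sketch of out-of-fold target encoding with smoothing is shown below, using pandas and NumPy. The function name `oof_target_encode`, the random fold assignment, and the smoothing value are all assumptions for illustration rather than a standard API.

```python
import numpy as np
import pandas as pd


def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Encode `col` with smoothed target means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    fold_id = rng.integers(0, n_splits, size=len(df))  # random fold per row
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        in_fold = fold_id == fold
        fit = df[~in_fold]
        prior = fit[target].mean()
        stats = fit.groupby(col)[target].agg(["sum", "count"])
        # Shrink small categories toward the prior (mean of the fit rows).
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fit rows fall back to the prior.
        encoded[in_fold] = df.loc[in_fold, col].map(smoothed).fillna(prior).to_numpy()
    return encoded


# Hypothetical data: city C appears once, so a naive mean would memorize its label.
df = pd.DataFrame({
    "city": ["A"] * 50 + ["B"] * 50 + ["C"],
    "converted": [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40 + [1],
})
df["city_te"] = oof_target_encode(df, "city", "converted")
print(df.groupby("city")["city_te"].mean().round(3))
```

Because the single row for city C is never used to encode itself, its value falls back to roughly the global rate instead of 1.0. At inference time, the encoding would be refit on the full training set and unseen categories mapped to the global mean.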

SQL for data scientists

Why is SQL still the top elimination round for data scientists?

SQL is still a top elimination round because most business data lives in warehouses such as BigQuery, Snowflake, Redshift, or Databricks.

Interviewers use SQL to test whether you can extract reliable data before doing analysis or modeling.

They usually check:

Correct grain: user, session, order, event, or account
Joins without duplicate rows or double-counting
Aggregations with the right numerator and denominator
Window functions for ranking, deduplication, retention, and running totals
Ability to reason about nulls, filters, dates, and edge cases

A strong interview answer is:

SQL tests whether I can define the metric correctly and pull the right dataset. A model built on the wrong grain or a duplicated join will produce misleading results.

Full drill set: SQL technical interview questions.

How would you compute weekly cohort retention in SQL?

A high-level pattern for weekly cohort retention is:

Assign each user to a signup week
Deduplicate activity to one row per user per active week
Join activity back to the user cohort
Calculate the week offset from signup
Count active users per cohort and week offset
Divide by the original cohort size

The key is to keep the grain correct:

Step	Grain
Cohort table	One row per user
Activity table	One row per user per active week
Final output	One row per cohort week and retention week

Use functions such as DATE_TRUNC, DATEDIFF, or dialect-specific date functions depending on the SQL engine.

Common mistakes:

Counting events instead of users
Not deduplicating multiple events in the same week
Using session-level grain when the metric is user-level
Accidentally excluding users with no later activity from the denominator, for example by counting cohort size after an inner join to activity

A strong interview answer is:

I would first create a clean cohort table and deduplicated weekly activity table. Then I would compute week offsets and divide active users by the original cohort size.
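The steps above can be sketched end to end with Python's built-in `sqlite3` module. The table names, columns, and rows are hypothetical, and the date arithmetic uses SQLite's own functions, so other engines would use `DATE_TRUNC` or `DATEDIFF` instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, signup_date TEXT);
CREATE TABLE events (user_id INTEGER, event_date TEXT);
-- Hypothetical data; 2024-01-01 and 2024-01-08 are Mondays.
INSERT INTO users VALUES (1, '2024-01-01'), (2, '2024-01-03'), (3, '2024-01-08');
INSERT INTO events VALUES
  (1, '2024-01-02'), (1, '2024-01-02'), (1, '2024-01-09'),
  (2, '2024-01-04'),
  (3, '2024-01-10'), (3, '2024-01-17');
""")

query = """
WITH cohort AS (            -- one row per user, truncated to the Monday of signup week
  SELECT user_id,
         date(signup_date, 'weekday 0', '-6 days') AS cohort_week
  FROM users
),
activity AS (               -- one row per user per active week
  SELECT DISTINCT user_id,
         date(event_date, 'weekday 0', '-6 days') AS active_week
  FROM events
),
cohort_size AS (            -- denominator comes from users, not from activity
  SELECT cohort_week, COUNT(*) AS n_users
  FROM cohort
  GROUP BY cohort_week
)
SELECT c.cohort_week,
       CAST((julianday(a.active_week) - julianday(c.cohort_week)) / 7 AS INTEGER) AS week_offset,
       COUNT(DISTINCT a.user_id) AS active_users,
       ROUND(1.0 * COUNT(DISTINCT a.user_id) / s.n_users, 2) AS retention
FROM cohort c
JOIN activity a ON a.user_id = c.user_id AND a.active_week >= c.cohort_week
JOIN cohort_size s ON s.cohort_week = c.cohort_week
GROUP BY c.cohort_week, week_offset, s.n_users
ORDER BY c.cohort_week, week_offset;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

User 1 has two events on the same day, but the deduplicated activity table counts that user once per week. User 2 never returns, and still counts in the week-one denominator because cohort size is computed from the users table.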

How do you get the latest event per user in SQL?

A common pattern is to rank events within each user and keep the first row.

sql


WITH ranked AS (
  SELECT
    e.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_time DESC
    ) AS rn
  FROM events e
)
SELECT *
FROM ranked
WHERE rn = 1;

This pattern is useful for:

Latest event per user
Most recent order per customer
Last login per account
Highest-value transaction per user (order by amount instead of event time)

If there can be ties, add a deterministic tie-breaker:

sql

ORDER BY event_time DESC, event_id DESC

Some SQL engines, such as Snowflake, BigQuery, Teradata, and Databricks, also support QUALIFY, which can make this shorter:

sql


SELECT *
FROM events
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY user_id
  ORDER BY event_time DESC
) = 1;

A strong interview answer is:

I would use ROW_NUMBER() partitioned by user and ordered by event time descending. I would also add a tie-breaker if event times are not unique.

Python and pandas

How does pandas groupby-apply differ from SQL GROUP BY?

Both pandas groupby and SQL GROUP BY split data by keys and compute aggregations, but they are used in different contexts.

Area	SQL GROUP BY	pandas groupby
Runs where	Database or warehouse	In Python memory
Best for	Large data extraction and aggregation	EDA, feature engineering, smaller local analysis
Output	Query result table	DataFrame or Series
Flexibility	Strong for set-based operations	Flexible Python-side transformations

Example:

python

df.groupby("country")["revenue"].agg(["sum", "mean", "count"])

Watch these common mistakes:

NA group keys are dropped by default unless you use dropna=False
.apply() is flexible but can be slower than built-in aggregations
Grouping after filtering may change denominators
Multi-index output may need cleanup with reset_index()
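The null-key behavior is easy to demonstrate. This sketch uses a tiny hypothetical orders table where one row has no country.

```python
import pandas as pd

# Hypothetical orders; one row has no country.
df = pd.DataFrame({
    "country": ["US", "US", "IN", None],
    "revenue": [10.0, 20.0, 5.0, 7.0],
})

default = df.groupby("country")["revenue"].sum()
with_na = df.groupby("country", dropna=False)["revenue"].sum()

# The default silently drops the row with the missing key.
print(default.sum(), with_na.sum(), df["revenue"].sum())

# Named aggregation plus reset_index() gives a flat, readable result.
summary = (
    df.groupby("country", dropna=False)
      .agg(total_revenue=("revenue", "sum"), orders=("revenue", "count"))
      .reset_index()
)
print(summary)
```

Comparing the grouped total with the raw column total is a quick way to catch rows lost to missing group keys.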

A strong interview answer is:

I would use SQL to reduce and extract the right dataset, then pandas groupby for local exploration or feature engineering. I would also check null group keys and row counts.

What mistakes do candidates make with pandas merge?

Common pandas merge mistakes include:

Many-to-many joins inflating row counts
Using the wrong join type, such as inner when unmatched rows matter
Merging on float keys without rounding or cleaning
Ignoring duplicate keys before the merge
Ignoring null join keys
Not checking row counts before and after the merge

A good merge checklist:

Check df.shape before and after
Check whether join keys are unique
Inspect unmatched rows when using left/right joins
Use validate= when you expect one-to-one, one-to-many, or many-to-one joins
Use clear suffixes for overlapping column names

Example:

python

orders.merge(users, on="user_id", how="left", validate="many_to_one")

A strong interview answer is:

After every merge, I check row counts, duplicate keys, null keys, and whether the join relationship matches my expectation.
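The checklist can be sketched as follows. The two DataFrames are hypothetical; `validate=` and `indicator=` are existing `merge` parameters, and `pd.errors.MergeError` is what pandas raises when the stated relationship does not hold.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 99]})
users = pd.DataFrame({"user_id": [10, 20], "plan": ["pro", "free"]})

merged = orders.merge(users, on="user_id", how="left",
                      validate="many_to_one", indicator=True)
assert len(merged) == len(orders)  # a left many-to-one join must not add rows

# Rows from orders that found no matching user.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["order_id", "user_id"]])

# A duplicated key on the right side breaks the many-to-one assumption.
users_dup = pd.concat([users, users.iloc[[0]]], ignore_index=True)
try:
    orders.merge(users_dup, on="user_id", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print(f"MergeError: {exc}")
```

Failing loudly on a duplicated key is usually preferable to discovering inflated revenue totals later in the analysis.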

Why do interviewers care about vectorization in pandas?

Interviewers care about vectorization because it shows you can write pandas code that works beyond tiny toy datasets.

In pandas, vectorized column operations are usually faster and cleaner than looping row by row in Python.

Prefer:

Boolean indexing
Built-in arithmetic on columns
.map() for simple mappings
.where() or np.where() for conditional logic
Built-in string and datetime accessors
groupby().agg() instead of custom row loops

Use .apply() only when built-in vectorized operations are not enough.

Also mention memory:

Read only needed columns
Use chunked reads for large files
Use appropriate dtypes
Push heavy aggregation to SQL when the dataset is too large for memory

A strong interview answer is:

I avoid row-by-row loops in pandas when a column-wise operation exists. It is usually faster, easier to read, and more scalable.
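Here is a short sketch comparing a row loop with the vectorized options listed above. The columns and the region mapping are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "revenue": [120.0, 30.0, 0.0, 75.0],
    "country": ["us", "in", "us", "de"],
})

# Row-by-row version: works, but runs a Python-level loop.
slow = [("high" if row.revenue >= 75 else "low") for row in df.itertuples()]

# Vectorized equivalents operate on whole columns at once.
df["tier"] = np.where(df["revenue"] >= 75, "high", "low")      # conditional logic
df["country_upper"] = df["country"].str.upper()                # string accessor
df["region"] = df["country"].map({"us": "AMER", "in": "APAC", "de": "EMEA"})
paying = df[df["revenue"] > 0]                                 # boolean indexing

print(df)
```

On four rows the difference is invisible, but on millions of rows the column-wise versions are typically much faster because the loop runs in compiled code rather than in the Python interpreter.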

Product sense and case studies

How would you define success metrics for a new recommendation feature?

I would start by clarifying the product goal. A recommendation feature can optimize for engagement, purchases, retention, discovery, or user satisfaction, so the metric should match the decision.

A good metric framework:

North star metric — long-term product goal, such as retention, GMV, watch time, or repeat purchases.
Primary experiment metric — measurable during the test window, such as click-through rate, conversion rate, add-to-cart rate, or session depth.
Quality metrics — saves, long clicks, repeat usage, completion rate, or user ratings.
Guardrail metrics — latency, revenue per user, diversity, complaint rate, unsubscribe rate, creator fairness, or refund rate.
Counter metrics — check whether the recommendation improves one surface while hurting another.

Avoid vanity metrics. More clicks are not always better if users click low-quality recommendations and leave quickly.

A strong answer should define:

Numerator
Denominator
Attribution window
User segment
Experiment unit
Guardrail thresholds

Example:

“For a shopping recommendation feature, I might use add-to-cart rate or purchase conversion as the primary metric, but I would guardrail refund rate, latency, and recommendation diversity. I would not ship only because clicks increased.”

Sketch a high-level recommendation system architecture.

A high-level recommendation system usually has these parts:

Data collection — user events, item catalog, search history, purchases, ratings, impressions, and skips.
Feature pipeline — user features, item features, context features, and point-in-time correct training data.
Candidate generation — quickly retrieve a smaller set of possible items using collaborative filtering, embeddings, popularity, or rules.
Ranking model — score and order candidates using predicted relevance, conversion, or long-term value.
Serving layer — low-latency API with caching, freshness controls, and fallback recommendations.
Evaluation — offline metrics, online A/B testing, guardrails, and long-term impact.
Monitoring — data drift, model performance, latency, popularity bias, fairness, and freshness.

Senior candidates should discuss trade-offs:

Trade-off	Example
Accuracy vs latency	Better ranking may be too slow for real-time serving
Personalization vs diversity	Too much personalization can create filter bubbles
Exploration vs exploitation	Show proven items vs discover new interests
Freshness vs stability	New content needs exposure without hurting relevance

A strong interview answer is:

I would separate candidate generation from ranking. Candidate generation keeps serving fast, while the ranking model optimizes relevance. I would evaluate with both offline metrics and online experiments because offline ranking gains may not always improve business outcomes.

How do you analyze a conversion funnel drop-off?

I would first define the funnel clearly, then locate where and for whom the drop happens.

Steps:

Define funnel steps — for example: landing page → signup → email verification → checkout → payment success.
Define the time window — same session, 24 hours, 7 days, or first user journey.
Calculate step-to-step conversion and overall conversion.
Segment the drop — device, browser, country, traffic source, user cohort, app version, and experiment bucket.
Find the largest absolute loss — not only the lowest percentage conversion.
Check recent changes — deployments, tracking changes, payment issues, page speed, pricing, or onboarding changes.
Use qualitative evidence — session replay, support tickets, surveys, and UX review.

A common mistake is focusing only on percentages. A 70% drop at a low-volume step may matter less than a 10% drop at a high-volume step.

A strong answer is:

I would quantify where the biggest user loss happens, segment it to find who is affected, then combine data with qualitative evidence before recommending a product change.
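The difference between the worst conversion rate and the largest absolute loss can be shown with a few lines of Python. The user counts are hypothetical and assume one fixed time window.

```python
# Hypothetical user counts reaching each step in the chosen window.
funnel = [
    ("landing page", 100_000),
    ("signup", 50_000),
    ("email verification", 45_000),
    ("checkout", 9_000),
    ("payment success", 7_200),
]

rows = []
for (prev_step, prev_n), (step, n) in zip(funnel, funnel[1:]):
    rows.append({
        "transition": f"{prev_step} -> {step}",
        "step_conversion": n / prev_n,
        "users_lost": prev_n - n,
    })

overall = funnel[-1][1] / funnel[0][1]
worst_rate = min(rows, key=lambda r: r["step_conversion"])
worst_volume = max(rows, key=lambda r: r["users_lost"])

for r in rows:
    print(f"{r['transition']}: {r['step_conversion']:.0%} converted, {r['users_lost']:,} lost")
print(f"overall conversion: {overall:.1%}")
print(f"lowest rate: {worst_rate['transition']}")
print(f"largest loss: {worst_volume['transition']}")
```

In this made-up funnel the lowest rate is at checkout, but the most users are lost between landing page and signup, so the two views point to different priorities.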

LLMs, MLOps, and communication

How do you explain a model result to a non-technical stakeholder?

I would explain the result in terms of the decision the stakeholder needs to make, not the algorithm.

A good structure:

Start with the business question
Give the recommendation first
Explain the evidence in plain language
State confidence and limitations
Explain the expected impact
Clarify the decision or next step

Avoid unnecessary jargon. Instead of saying:

“The posterior probability increased after recalibration.”

Say:

“Based on the latest data, this customer segment is more likely to churn than we previously estimated.”

A strong answer is:

I would lead with the recommendation, explain the evidence simply, state what could change the conclusion, and confirm what decision the stakeholder wants to make.

How are LLMs and MLOps changing data science interviews in 2026?

In 2026, many data science interviews include questions about LLM-assisted workflows, model monitoring, and production readiness, especially for mid-level and senior roles.

For LLMs, interviewers may ask how you would use them in daily work.

Good uses:

Drafting SQL or pandas for exploration
Summarizing tickets, logs, or user feedback
Generating first-pass analysis ideas
Writing documentation or experiment summaries
Speeding up repetitive workflow steps

But LLMs do not replace fundamentals. You still need to validate:

Query correctness
Data definitions
Statistical assumptions
Causal claims
Privacy risks
Hallucinated explanations

For MLOps, expect questions about what happens after a model is trained.

Important topics:

Model monitoring
Data drift
Performance decay
Feature freshness
Reproducible pipelines
Versioned data, code, and models
Bias and fairness checks
Rollback plans
Documentation

A strong answer is:

I would use LLMs to speed up analysis and documentation, but I would not trust outputs blindly. For production models, I would monitor drift, performance, latency, and business impact after launch.

Final checklist before your interview

15+ timed SQL problems (window functions included)
5 pandas exercises — groupby, merge, missing data
Can explain p-value, bias-variance, precision/recall without notes
One A/B test design story with guardrails and peeking awareness
One metric drop investigation framework
Two STAR stories — technical win + mistake you owned
Read the JD: product analytics vs applied ML emphasis

Pattern recognition across statistics, SQL, and business framing beats memorizing isolated definitions.
