Data Science Interview Questions and Answers

Data science and data scientist interview questions for 2026: statistics, A/B testing, machine learning, SQL, Python pandas, product metrics, behavioral prep, and LLM-era workflows.

Tech reviewed by Deepak Prasad

Data science interviews blend statistics, machine learning judgment, SQL and Python execution, and product sense. Hiring managers want to hear how you reason about noisy data, design experiments, and explain trade-offs—not only textbook definitions.

Below are 42 questions grouped by topic: statistics and probability, experimentation, ML fundamentals, coding (SQL and pandas), and behavioral or case-style prompts. For dedicated pandas interview questions (groupby, merge, cleaning, vectorization), see pandas interview questions. For PostgreSQL interview questions when analytics pipelines read from OLTP sources, see PostgreSQL interview questions. Try each question out loud before opening the answer.

NOTE
Role split: Product analytics loops weight SQL, metrics, and A/B tests. Applied ML loops add model design, feature engineering, and deployment trade-offs. Research loops go deeper on theory. Read the JD before you optimize prep.

Interview context and how to prepare

What do data science technical interviews actually test?

Data science interviews usually test whether you can turn messy data into a reliable business or product decision. Interviewers are not only checking if you know formulas or algorithm names.

They usually score four areas:

  1. Statistical reasoning — Can you reason about uncertainty, sampling bias, confidence intervals, p-values, experiment design, and whether a result is trustworthy?
  2. ML judgment — Can you choose the right model, metric, validation strategy, and explain trade-offs such as accuracy vs interpretability or precision vs recall?
  3. Execution — Can you write SQL, Python, or pandas under time pressure without overcomplicating the solution?
  4. Communication — Can you explain assumptions, limitations, trade-offs, and the business impact in simple language?

A strong answer usually connects the technical method to a decision.

For example, instead of saying:

“I would train a random forest and check accuracy.”

A stronger answer is:

“I would first define the business goal and the cost of false positives vs false negatives. If missing a positive case is expensive, I would optimize recall or F1 instead of accuracy. I would also check performance by segment before recommending deployment.”

That type of answer shows practical judgment, not just textbook knowledge.

What interview formats should you expect?

A typical data science interview process has 3–6 rounds, depending on the company, seniority, and role type.

Round | Duration | What it tests
Recruiter screen | 20–30 min | Background, role fit, salary range, notice period
SQL / Python screen | 45–60 min | Joins, aggregations, window functions, data cleaning, basic algorithms
Statistics & ML | 45–90 min | Probability, hypothesis testing, metrics, model selection, validation
Product / business case | 45–60 min | Metrics, funnel analysis, A/B testing, recommendation design
Behavioral round | 30–60 min | Collaboration, ownership, conflict, stakeholder communication
Project deep-dive | 45–60 min | Past work, impact, trade-offs, technical decisions
Take-home assignment | 2–6 hours | End-to-end analysis, code quality, assumptions, communication

The exact format depends on the role:

Role type | Extra focus
Product data scientist | Metrics, experimentation, product sense, stakeholder communication
ML data scientist | Modeling, validation, feature engineering, deployment risks
Analytics data scientist | SQL, dashboards, funnel analysis, business recommendations
Senior data scientist | Ambiguity, project leadership, trade-offs, measurable impact

If you get a take-home assignment, treat it like a small production project:

  • Add a short README
  • State assumptions clearly
  • Keep the notebook or script reproducible
  • Explain why you chose each metric
  • Include limitations and next steps
  • Avoid unnecessary complexity

A good take-home is not only correct; it is also easy for the reviewer to run, understand, and trust.

What is a realistic 4–6 week prep plan?

A realistic data science interview prep plan should balance SQL, statistics, ML, product thinking, and communication. Do not spend all your time memorizing machine learning algorithms.

Week | Focus | Output
1 | Statistics refresh | Explain p-values, confidence intervals, distributions, sampling bias, and correlation vs causation
2 | SQL | Solve 15–20 timed SQL problems using joins, aggregations, CTEs, and window functions
3 | pandas + Python | Practice groupby, merge, filtering, missing data, sorting, and basic data transformation
4 | ML fundamentals | Review classification metrics, regression metrics, bias-variance, regularization, trees, boosting, and model validation
5 | A/B testing + product cases | Practice 3–5 cases involving funnels, experiments, conversion drops, churn, or recommendations
6 | Mock interviews + behavioral stories | Complete 2 mock loops and prepare 6–8 STAR stories from past projects

For most mid-level roles, the highest-ROI prep is:

  1. Timed SQL practice
  2. Statistics and A/B testing
  3. Explaining ML trade-offs clearly
  4. Project stories with measurable business impact

A good weekly routine is:

Activity | Frequency
SQL practice | 4–5 days/week
Statistics review | 3 days/week
ML concept review | 2–3 days/week
Product case practice | 2 days/week
Behavioral/story practice | 1–2 days/week

When practicing, do not only check whether your answer is correct. Also ask:

  • Did I clarify the goal before solving?
  • Did I mention assumptions?
  • Did I choose the right metric?
  • Did I explain trade-offs?
  • Did I connect the answer to a business decision?

That is the difference between a technically correct answer and an interview-ready answer.

How do product analytics, applied ML, and research DS roles differ in interviews?

Data science interviews are not the same for every role. A product analytics role may feel very different from an applied ML or research data science role.

Lane | Heavier rounds | Lighter rounds
Product analytics / product DS | SQL, metrics, experimentation, funnels, A/B testing, stakeholder communication | Deep learning theory, complex model architecture
Applied ML / ML data scientist | Feature engineering, model evaluation, validation, deployment risks, ML system design | Ad-hoc SQL trivia, pure probability puzzles
Research DS / research scientist | Theory, papers, novel methods, mathematical reasoning, experimental design | Dashboarding, routine reporting, basic business metrics

For a product analytics role, expect questions like:

  • “A metric dropped by 10%. How would you investigate?”
  • “How would you design an experiment for a new recommendation feature?”
  • “Which metric would you use for user retention?”
  • “How would you explain the result to a product manager?”

For an applied ML role, expect more questions like:

  • “How would you handle data leakage?”
  • “Which metric would you choose for an imbalanced classification problem?”
  • “How would you monitor model performance after deployment?”
  • “What would you do if offline performance is good but online performance is poor?”

For a research DS role, expect deeper questions around:

  • Model assumptions
  • Statistical validity
  • Paper discussion
  • Mathematical reasoning
  • Novel method design
  • Experimental rigor

In 2026, many interviews also include LLM-assisted workflow questions, especially for mid-level and senior candidates.

Examples:

  • “How would you use an LLM to speed up exploratory data analysis?”
  • “Where would you not trust an LLM in a data science pipeline?”
  • “How would you validate LLM-generated SQL or Python?”
  • “How would you prevent hallucinated insights in an automated reporting workflow?”

A strong answer should not say, “I will use ChatGPT to solve it.”

A better answer is:

“I may use an LLM to speed up boilerplate SQL, summarize documentation, or generate initial analysis ideas. But I would still validate the query, inspect the data, test assumptions, check edge cases, and review whether the final recommendation is sound both statistically and for the business.”

That answer shows the right balance: you can use modern tools, but you still own the reasoning, validation, and final decision.
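
One way to make "validate the query" concrete is to compare an LLM-drafted aggregate against an independent check on the source table. The sketch below uses Python's built-in sqlite3 module. The tables (orders, user_emails), their rows, and the drafted query are all hypothetical, chosen so that the join silently duplicates rows and inflates revenue.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL);
    CREATE TABLE user_emails (user_id INTEGER, email TEXT);
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
    -- User 1 has two email rows, so a join on user_id duplicates their orders.
    INSERT INTO user_emails VALUES (1, 'a@example.com'), (1, 'b@example.com'),
                                   (2, 'c@example.com');
""")

# A query as an assistant might draft it (assumed, not from any real tool).
drafted_sql = """
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN user_emails AS e ON o.user_id = e.user_id
"""

# Independent check: the same total computed from the fact table alone.
source_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
drafted_total = conn.execute(drafted_sql).fetchone()[0]

totals_match = abs(drafted_total - source_total) < 1e-9
print(f"source={source_total}, drafted={drafted_total}, match={totals_match}")
```

The same idea extends to row counts before and after each join, null checks on join keys, and spot checks on a handful of known users.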


Statistics and probability

What is the difference between correlation and causation?

Correlation means two variables move together. When one changes, the other also tends to change, either in the same direction or the opposite direction.

Causation means changing one variable directly produces a change in another variable.

Correlation does not imply causation because the relationship may be caused by:

  • Confounding variables — a third variable affects both
  • Reverse causality — Y may be causing X instead of X causing Y
  • Selection bias — the observed sample is not representative
  • Coincidence — two variables move together by chance

Example:

Ice cream sales and drowning incidents may be correlated, but buying ice cream does not cause drowning. Both increase during summer, so weather/season is the confounder.

In an interview, a strong answer should also explain how you would test causality:

  • Run a randomized controlled experiment if possible
  • Use A/B testing for product changes
  • Use quasi-experimental methods such as difference-in-differences, matching, instrumental variables, or regression discontinuity when randomization is not possible
  • Check whether the causal story makes business and domain sense

A good interview response is:

“Correlation is useful for finding relationships, but I would not make a product or business decision from correlation alone. I would check for confounders and, if possible, design an experiment to estimate the causal effect.”
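
The ice cream example can be reproduced with a small simulation. This is a sketch on synthetic data: the variable names, noise levels, and sample size are assumptions. Both series are generated from temperature only, so any correlation between them comes from the confounder, and it should vanish once temperature is regressed out.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """Residuals of y after a simple least-squares fit on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)
temperature = [rng.gauss(0, 1) for _ in range(5000)]       # the confounder
ice_cream = [t + rng.gauss(0, 0.5) for t in temperature]   # driven by temperature
drownings = [t + rng.gauss(0, 0.5) for t in temperature]   # also driven by temperature

raw_corr = pearson(ice_cream, drownings)
adjusted_corr = pearson(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))
print(f"raw correlation: {raw_corr:.2f}, after adjusting: {adjusted_corr:.2f}")
```

Adjusting for a measured confounder only helps when the confounder is known and recorded, which is why randomization remains the stronger design.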

What is a p-value, and what is a common misinterpretation?

A p-value is the probability of observing a result at least as extreme as the one in your sample, assuming the null hypothesis is true.

For example, if the null hypothesis says there is no difference between variant A and variant B, a p-value tells you how surprising your observed result would be under that assumption.

Common misinterpretation:

“p = 0.03 means there is a 97% chance the effect is real.”

That is incorrect. A p-value is not the probability that the null hypothesis is false. It is also not the probability that your result will repeat in the future.

In interviews, pair p-values with:

  • Effect size — how large is the difference?
  • Confidence interval — what range of values is plausible?
  • Sample size — is the test underpowered or overpowered?
  • Practical significance — does the result matter for the business?
  • Experiment design — was the data collected correctly?

Example:

A new checkout page improves conversion from 10.0% to 10.1% with p < 0.05. The result may be statistically significant, but the business impact may be too small to justify engineering effort.

A strong interview answer is:

I would not look at the p-value alone. I would also check the effect size, confidence interval, business impact, and whether the experiment design was valid.
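
The checkout example can be worked through with a two-proportion z-test. This is a minimal sketch using only the standard library; the traffic of 2,000,000 users per arm is an assumption added for illustration, since the example above does not state a sample size.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in proportions, plus a 95% CI."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval of the difference.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (p_b - p_a - 1.96 * se_diff, p_b - p_a + 1.96 * se_diff)
    return p_b - p_a, p_value, ci

# Assumed traffic: 2,000,000 users per arm, converting at 10.0% and 10.1%.
lift, p_value, ci = two_proportion_test(200_000, 2_000_000, 202_000, 2_000_000)
print(f"lift={lift:.4f}, p={p_value:.4f}, 95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

With this much traffic the result is statistically significant, yet the interval still describes a lift well under a quarter of a percentage point, which is the effect-size question the business has to answer.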

Explain Type I and Type II errors.

Type I and Type II errors describe two ways a statistical decision can be wrong.

Error | Meaning | Example
Type I error | False positive: you reject a true null hypothesis | You conclude a feature improved conversion when it actually did not
Type II error | False negative: you fail to reject a false null hypothesis | You miss a real improvement because the test did not detect it

A simple way to remember:

  • Type I error = false alarm
  • Type II error = missed detection

The significance level (alpha) is the Type I error rate you are willing to accept. For example, alpha = 0.05 means accepting a 5% chance of a false positive when the null hypothesis is true.

The power of a test is related to Type II error:

Term | Meaning
Power | Probability of detecting a real effect
1 - beta | Power
beta | Type II error rate

There is usually a trade-off:

  • Lowering alpha reduces false positives
  • But it can increase false negatives if sample size stays the same
  • Increasing sample size can improve power

Example:

In fraud detection, a company may tolerate more Type I errors because false positives can be manually reviewed. But Type II errors may be expensive because real fraud is missed.

In medical testing, the trade-off depends on the disease, treatment risk, and cost of missing a true case.

A strong interview answer connects the error type to cost:

“I would choose the threshold based on the cost of false positives versus false negatives. The right threshold is a business and risk decision, not only a statistical one.”
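
The trade-off between alpha, sample size, and power can be checked numerically. The sketch below uses a normal approximation for a two-sided test on two proportions. The 10% baseline and 11% treatment rate are hypothetical, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_control, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions."""
    norm = NormalDist()
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treatment * (1 - p_treatment) / n_per_arm)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    z_effect = abs(p_treatment - p_control) / se
    # Probability the test statistic lands beyond the critical value;
    # the negligible opposite-tail term is ignored.
    return 1 - norm.cdf(z_crit - z_effect)

# Hypothetical baseline of 10% and a true treatment rate of 11%.
for n in (2_000, 10_000, 20_000):
    for alpha in (0.05, 0.01):
        power = power_two_proportions(0.10, 0.11, n, alpha)
        print(f"n={n:>6} alpha={alpha}: power={power:.2f}, beta={1 - power:.2f}")
```

Reading the output row by row shows both effects described above: a stricter alpha lowers power at a fixed sample size, and a larger sample recovers it.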

What is the Central Limit Theorem and why does it matter?

The Central Limit Theorem says that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, even if the original data is not normally distributed.

This works under important assumptions, such as:

  • Observations are independent or approximately independent
  • The sample is reasonably representative
  • The underlying distribution has finite variance
  • The sample size is large enough for the approximation to be useful

Why it matters:

  • It justifies normal approximations for sample means
  • It supports many confidence intervals and hypothesis tests
  • It explains why averages become more stable with larger samples
  • It is widely used in A/B testing and business metric analysis

Important clarification:

The CLT does not mean the original data becomes normal. It means the distribution of the sample mean becomes approximately normal as sample size grows.

Example:

Individual order values in an ecommerce store may be highly skewed because some customers place very large orders. But if you repeatedly take large random samples and calculate the average order value, those sample averages will often look approximately normal.

A strong interview answer is:

The CLT lets us reason about averages using normal approximations, but I would still check sample size, independence, skew, outliers, and whether the metric is suitable for that approximation.
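
A quick simulation illustrates the clarification above. It is a sketch under assumed conditions: order values are drawn from an exponential distribution with mean 50, and skewness is used as a rough indicator of how far a distribution is from symmetric.

```python
import random

def skewness(values):
    """Sample skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)

# Skewed "order values": exponential with mean 50 (an assumed distribution).
orders = [rng.expovariate(1 / 50) for _ in range(5000)]

# Means of 2,000 independent samples of 100 orders each.
sample_means = [
    sum(rng.expovariate(1 / 50) for _ in range(100)) / 100
    for _ in range(2000)
]

skew_raw = skewness(orders)
skew_means = skewness(sample_means)
print(f"skewness of raw orders:   {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```

The raw values stay heavily skewed no matter how many are collected, while the sample means are close to symmetric. With heavier tails, a sample size of 100 may not be enough.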

When would you use Bayesian vs frequentist inference?

Frequentist inference treats parameters as fixed but unknown. Probability is interpreted as long-run frequency.

It is common when:

  • You need standard hypothesis tests
  • You are running classical A/B tests
  • You want p-values and confidence intervals
  • The organization already uses frequentist experimentation
  • Regulatory or reporting standards expect frequentist methods

Bayesian inference treats parameters as uncertain and represents them using probability distributions. You start with a prior belief and update it with observed data.

It is useful when:

  • You have meaningful prior information
  • Data is sparse
  • You want probability-style statements
  • You need shrinkage across many groups
  • You are monitoring results sequentially
  • Stakeholders ask questions like “What is the probability that variant B is better?”

Example:

For a standard website A/B test with large traffic, a frequentist test may be simple and acceptable.

For estimating conversion rates across many small regions, Bayesian methods can help because low-volume regions can borrow strength from the overall population instead of producing noisy estimates.

A strong comparison:

Approach | Good for | Interview wording
Frequentist | Standard tests, large samples, established reporting | “What would happen over repeated samples?”
Bayesian | Prior knowledge, sparse data, probability statements | “Given the data, what is the probability of the effect?”

A good interview answer is:

“I would use frequentist methods when the experiment is standard, high-volume, and the company expects p-values. I would consider Bayesian methods when prior information matters, data is sparse, or stakeholders need probability-based decision statements.”
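
To show what a probability-style statement looks like in practice, here is a minimal Beta-Binomial sketch. It assumes uniform Beta(1, 1) priors and hypothetical conversion counts, and it estimates the probability that variant B beats variant A by sampling from the two posteriors.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta posterior: prior pseudo-counts plus observed successes/failures.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical data: 100/1000 conversions for A, 130/1000 for B.
p_b_better = prob_b_beats_a(100, 1000, 130, 1000)
print(f"P(B > A | data) is roughly {p_b_better:.3f}")
```

An informative prior would replace the two 1s with pseudo-counts from historical data. That choice should be stated and defended, since it directly changes the answer when data is sparse.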

You flip a fair coin until you get heads. What is the expected number of flips?

This follows a geometric distribution.

Each flip has probability:

Outcome | Probability
Heads | 0.5
Tails | 0.5

If the probability of success is p, the expected number of trials until the first success is:

Expected value = 1 / p

For a fair coin:

Expected flips = 1 / 0.5 = 2

So the expected number of flips is 2.

One way to reason about it:

Sequence | Probability | Number of flips
H | 1/2 | 1
TH | 1/4 | 2
TTH | 1/8 | 3
TTTH | 1/16 | 4

Continuing the pattern, the probability-weighted average over all possible sequences, 1(1/2) + 2(1/4) + 3(1/8) + ..., converges to 2.

Follow-up answers:

Question | Answer
Expected flips | 2
Variance | (1 - p) / p² = 2
Probability it takes more than 3 flips | (1/2)³ = 1/8

In interviews, probability puzzles often test whether you can identify the distribution before calculating.

A strong answer is:

This is a geometric distribution because we are waiting for the first success. Since p = 0.5, the expected number of flips is 1/p = 2.
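
Both the series argument and the follow-up answers can be sanity-checked in a few lines. The sketch below sums a truncated version of the series and also simulates the experiment. The seed and number of runs are arbitrary choices.

```python
import random

p = 0.5

# Truncated series: sum of k * P(first head on flip k).
# The tail beyond k = 60 is negligible.
expected_from_series = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 61))

def flips_until_heads(rng):
    """Simulate one run: count flips up to and including the first head."""
    flips = 1
    while rng.random() >= p:   # tails, which happens with probability 1 - p
        flips += 1
    return flips

rng = random.Random(7)
runs = [flips_until_heads(rng) for _ in range(100_000)]
simulated_mean = sum(runs) / len(runs)
share_more_than_3 = sum(r > 3 for r in runs) / len(runs)

print(f"series: {expected_from_series:.6f}")
print(f"simulated mean: {simulated_mean:.3f}")
print(f"share of runs needing more than 3 flips: {share_more_than_3:.3f}")
```

Offering a quick simulation like this is a reasonable fallback in an interview when the closed-form answer does not come to mind immediately.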


Experimentation and A/B testing

How do you design an A/B test for a new checkout button?

I would start by defining the decision the test must support: should we ship the new checkout button or keep the current one?

A good A/B test design includes:

  1. Hypothesis — “Changing the checkout button from blue to green increases checkout completion.”
  2. Primary metric — checkout completion rate, defined clearly as completed orders / users who reached checkout.
  3. Guardrail metrics — revenue per user, refund rate, page load time, payment errors, support tickets.
  4. Randomization unit — user-level or account-level, not session-level, to avoid the same user seeing both variants.
  5. Sample size — calculate using baseline conversion rate, minimum detectable effect, power, and significance level.
  6. Duration — run long enough to cover weekday/weekend behavior and normal traffic cycles.
  7. Analysis plan — decide the stopping rule, primary metric, exclusion rules, and segment checks before launch.

Also document exclusions such as bots, internal users, test orders, unsupported countries, or payment methods out of scope.

Before trusting the result, I would check for sample ratio mismatch. If traffic was supposed to split 50/50 but the actual split is 60/40, there may be a tracking or assignment bug.

A strong interview answer is:

I would not only check whether the green button wins. I would verify that the experiment was assigned correctly, the metric was defined correctly, and no guardrail metric got worse.
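
Two of the steps above, sample size and the sample ratio mismatch check, reduce to short calculations. The sketch below uses standard normal-approximation formulas with only the standard library. The function names, the 10% baseline, and the 1 percentage point minimum detectable effect are assumptions for illustration.

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided test on two proportions."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

def srm_p_value(n_control, n_treatment, expected_share=0.5):
    """Chi-square (1 df) p-value for a sample ratio mismatch check."""
    total = n_control + n_treatment
    exp_c, exp_t = total * expected_share, total * (1 - expected_share)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    return erfc(sqrt(chi2 / 2))   # survival function of chi-square with 1 df

# Hypothetical inputs: 10% baseline completion, 1 percentage point MDE.
n_needed = sample_size_per_arm(baseline=0.10, mde_abs=0.01)
srm_bad = srm_p_value(6000, 4000)   # planned 50/50, observed 60/40
srm_ok = srm_p_value(5020, 4980)    # small imbalance, consistent with chance

print(f"users needed per arm: {n_needed}")
print(f"SRM p-value for 60/40: {srm_bad:.3g}, for 5020/4980: {srm_ok:.3f}")
```

A very small SRM p-value is a reason to pause and debug assignment or logging before reading any metric from the experiment.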

Why is peeking at A/B test results early dangerous?

Peeking is dangerous when you repeatedly check the test result and stop as soon as p < 0.05.

Each look is like running another hypothesis test. If you keep checking until the result becomes significant, the false positive rate becomes higher than the planned alpha level.

Example:

If you plan for alpha = 0.05 but check the result every day and stop on the first significant day, your actual false positive risk may be much higher than 5%.

Good mitigations:

  • Use a fixed sample size and analyze once at the end
  • Use sequential testing if early stopping is required
  • Define stopping rules before the experiment starts
  • Monitor guardrail metrics during the test, but do not declare a winner early without the right method

It is fine to monitor a running test for bugs, outages, sample ratio mismatch, or severe guardrail regressions. The problem is making a “winner” decision from repeated unplanned significance checks.

A strong interview answer is:

Peeking inflates false positives. I would monitor the experiment for health issues, but I would only make the final decision using the planned stopping rule or a valid sequential testing method.
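
The inflation can be estimated with a simulation of A/A tests, where no real effect exists. This sketch makes a simplifying assumption: each day contributes an equal-sized batch, so the cumulative z-score at each look is a scaled sum of independent standard normal increments. The 14 daily looks and the simulation count are arbitrary.

```python
import random
from math import sqrt

def false_positive_rates(n_sims=4000, n_looks=14, z_crit=1.96, seed=1):
    """Simulate A/A tests: stop at first significant look vs. one final look."""
    rng = random.Random(seed)
    stopped_early = 0
    significant_at_end = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed = False
        for look in range(1, n_looks + 1):
            # Under the null, each equal-sized daily batch adds an independent
            # standard-normal increment; the cumulative z is sum / sqrt(looks).
            running_sum += rng.gauss(0, 1)
            z = running_sum / sqrt(look)
            if abs(z) > z_crit:
                crossed = True
        stopped_early += crossed                  # any look was "significant"
        significant_at_end += abs(z) > z_crit     # only the final look counts
    return stopped_early / n_sims, significant_at_end / n_sims

peeking_rate, single_look_rate = false_positive_rates()
print(f"false positive rate with daily peeking: {peeking_rate:.3f}")
print(f"false positive rate with one final look: {single_look_rate:.3f}")
```

The single-look rate stays near the planned 5%, while stopping at the first significant daily look produces a false positive rate several times higher.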

How do you handle multiple comparisons across many metrics?

Multiple comparisons become a problem when you test many metrics, segments, or variants and then highlight only the significant results.

If you test 20 independent metrics at alpha = 0.05 and none of them has a real effect, you should still expect about one false positive by chance (20 × 0.05 = 1).

Good ways to handle this:

  • Pre-specify one primary metric
  • Treat secondary metrics as diagnostic unless corrected
  • Use Bonferroni correction when you want strict false positive control
  • Use Benjamini–Hochberg when controlling false discovery rate is more appropriate
  • Use hierarchical testing where secondary metrics are tested only if the primary metric wins

Example:

If a checkout experiment affects 25 metrics, I would not claim success because one small segment improved. I would first check the primary metric, then inspect guardrails and segments as supporting evidence.

A strong interview answer is:

I would separate decision metrics from diagnostic metrics. The launch decision should depend mainly on the pre-defined primary metric and guardrails, not on cherry-picked significant results.
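
The two corrections named above are short enough to write out by hand. The sketch below implements Bonferroni and Benjamini–Hochberg in plain Python and applies them to ten hypothetical p-values. In real work a statistics library would normally be used instead.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * q (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for ten secondary metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
naive = [p <= 0.05 for p in p_values]

print(f"uncorrected rejections: {sum(naive)}")
print(f"Bonferroni rejections:  {sum(bonf)}")
print(f"BH rejections:          {sum(bh)}")
```

Five metrics look significant without correction, but fewer survive either procedure, and Benjamini–Hochberg is the less conservative of the two.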

When should you not run an A/B test?

An A/B test is not always the right tool. Sometimes the design is invalid, too slow, or ethically wrong.

Skip or delay an A/B test when:

  • Traffic is too low to reach useful power in a reasonable time
  • The change is a bug fix, security fix, or legal requirement
  • Ethical or fairness risks make random exposure inappropriate
  • Network effects break independence, such as marketplaces, social products, or collaboration tools
  • Outcomes have long lag, such as 90-day retention or customer lifetime value
  • The launch is irreversible and you cannot maintain a control group
  • The expected effect is too small to matter even if statistically significant

Possible alternatives:

  • Before/after analysis with caution
  • Difference-in-differences
  • Geographic rollout
  • Cluster-level experiment
  • Synthetic control
  • Holdout group
  • Phased rollout with monitoring

A strong interview answer is:

I would not force an A/B test if the assumptions are broken. I would choose the design based on traffic, risk, independence, and how quickly the outcome can be measured.

A key product metric dropped 20% overnight. How do you investigate?

I would treat this as a structured debugging and product analytics problem.

Steps:

  1. Verify the metric — check logging, dashboards, timezone boundaries, numerator/denominator changes, and data delays.
  2. Check scope — is the drop global or limited to one platform, country, browser, app version, or user segment?
  3. Inspect the funnel — find where users dropped: landing page, signup, checkout, payment, activation, or retention.
  4. Check experiments and feature flags — see whether a treatment, rollout, or configuration change affected only some users.
  5. Review deployments — compare the timing with frontend, backend, data pipeline, tracking, or ML model releases.
  6. Look for external causes — outages, holidays, payment provider issues, marketing changes, competitor events, or app store problems.
  7. Communicate clearly — share what changed, who is affected, what has been ruled out, and the next action.

A useful interview structure is:

Question | What it tells you
Is the drop real? | Data issue vs product issue
Who is affected? | Segment or global problem
Where in the funnel? | Likely failure point
What changed recently? | Deployment, experiment, or external trigger
What is the business impact? | Severity and escalation path

A strong interview answer is:

First I would confirm whether the drop is real or caused by instrumentation. Then I would segment the metric, locate the funnel step where the drop starts, compare it with recent launches or experiments, and communicate impact while investigation continues.
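
The "check scope" step usually starts with a segment-level comparison. The sketch below uses made-up daily numbers by platform, so every figure and segment name is hypothetical. It computes the relative change in conversion rate per segment alongside the overall change.

```python
# Hypothetical daily checkout data by platform: (users, conversions).
yesterday = {"web": (40_000, 4_000), "ios": (30_000, 3_300), "android": (30_000, 3_000)}
today = {"web": (40_200, 4_010), "ios": (29_900, 3_270), "android": (30_100, 1_000)}

def rate(users, conversions):
    return conversions / users

def relative_change_by_segment(before, after):
    """Relative change in conversion rate for each segment."""
    changes = {}
    for segment in before:
        r_before = rate(*before[segment])
        r_after = rate(*after[segment])
        changes[segment] = (r_after - r_before) / r_before
    return changes

changes = relative_change_by_segment(yesterday, today)
worst_segment = min(changes, key=changes.get)

# Overall rate: total conversions divided by total users across segments.
overall_before = rate(*map(sum, zip(*yesterday.values())))
overall_after = rate(*map(sum, zip(*today.values())))
overall_change = (overall_after - overall_before) / overall_before

print(f"overall change: {overall_change:.1%}")
for segment, change in changes.items():
    print(f"{segment:>8}: {change:.1%}")
```

Here an overall drop of about 20% is almost entirely explained by one platform, which points the investigation at that platform's recent release or tracking changes.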


Machine learning fundamentals

Explain the bias-variance tradeoff.

The bias-variance tradeoff explains two common sources of model error.

Problem | Meaning | Symptom
High bias | Model is too simple and misses patterns | High train error and high validation error
High variance | Model is too sensitive to training data | Low train error but high validation error

A simple way to remember:

  • High bias = underfitting
  • High variance = overfitting

For squared-error loss, the expected prediction error decomposes as:

Error = bias² + variance + irreducible noise

Ways to reduce high bias:

  • Add useful features
  • Use a more flexible model
  • Reduce excessive regularization
  • Improve feature representation

Ways to reduce high variance:

  • Add more training data
  • Use regularization
  • Use cross-validation to tune model complexity and hyperparameters
  • Simplify the model
  • Use bagging or ensembling
  • Remove noisy or leaking features

A strong interview answer is:

I would compare train and validation performance. If both are poor, I suspect high bias. If train is strong but validation is weak, I suspect high variance.
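
The train-versus-validation diagnostic can be demonstrated with a tiny k-nearest-neighbors regressor, where k controls flexibility. This is a sketch on synthetic data: the noisy sine target, the noise level, and the three values of k are assumptions chosen to make the pattern visible.

```python
import math
import random

def knn_predict(x, train, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, train, k):
    """Mean squared error of k-NN predictions on `data`."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)

def make_data(rng, n):
    """Noisy sine wave: y = sin(2*pi*x) + Gaussian noise (synthetic)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.4)) for x in xs]

rng = random.Random(3)
train, valid = make_data(rng, 400), make_data(rng, 400)

results = {}
for k in (1, 15, 400):   # very flexible, moderate, very rigid
    results[k] = (mse(train, train, k), mse(valid, train, k))
    print(f"k={k:>3}: train MSE={results[k][0]:.3f}, "
          f"validation MSE={results[k][1]:.3f}")
```

With k = 1 the model memorizes the training set (zero train error, worse validation error), which is the high-variance pattern. With k = 400 it predicts the global mean, so both errors are high, which is the high-bias pattern.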

How do you detect and prevent overfitting?

Overfitting happens when a model learns noise or accidental patterns from the training data instead of generalizable patterns.

How to detect it:

  • Training score is much better than validation/test score
  • Validation error increases while training error keeps decreasing
  • Model performs well offline but poorly in production
  • Performance changes a lot across cross-validation folds

How to prevent it:

  • Use proper train/validation/test splits
  • Use cross-validation
  • Add regularization such as L1 or L2
  • Use early stopping
  • Reduce model complexity
  • Add more data if possible
  • Remove noisy or irrelevant features
  • Use dropout for neural networks
  • Use data augmentation where appropriate

Always mention data leakage. If information from the future, target, or validation set leaks into training, the model may look excellent during validation but fail in production.

A strong interview answer is:

I would first check whether the validation setup is realistic. If there is no leakage and the train-validation gap is large, I would reduce variance using regularization, simpler models, early stopping, or more data.
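
Early stopping, one of the preventions listed above, is simple to express as a rule on the validation curve. The sketch below uses hypothetical loss values and an illustrative helper name. Training frameworks generally ship their own version of this logic.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (stop_epoch, best_epoch): stop once validation loss has not
    improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical learning curves: training loss keeps falling,
# validation loss bottoms out and then rises.
train_losses = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32, 0.27, 0.23, 0.20, 0.18]
val_losses = [0.95, 0.78, 0.66, 0.60, 0.58, 0.59, 0.61, 0.64, 0.68, 0.73]

stop_epoch, best_epoch = best_epoch_with_patience(val_losses, patience=3)
gap_at_stop = val_losses[stop_epoch] - train_losses[stop_epoch]
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
print(f"train-validation gap at stop: {gap_at_stop:.2f}")
```

The widening gap between the two curves after the best epoch is the same signal described in the detection list: training error keeps decreasing while validation error increases.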

What is the difference between L1 and L2 regularization?

L1 and L2 regularization both penalize large model weights, but they behave differently.

Aspect | L1 regularization | L2 regularization
Also called | Lasso | Ridge
Penalty | Sum of absolute weights | Sum of squared weights
Effect | Can shrink some weights to zero | Shrinks weights smoothly
Useful for | Feature selection, sparse models | Stable models, correlated features
Interpretability | Often more interpretable | Usually keeps more features

Elastic Net combines L1 and L2 regularization.

Use L1 when you have many features and want a sparse model. Use L2 when features are correlated and you want smoother weight shrinkage.

Important practical note: regularization is sensitive to feature scale, so features are usually standardized before applying L1 or L2 in linear models.

A strong interview answer is:

L1 can remove features by pushing weights to zero, while L2 usually keeps all features but shrinks their impact.
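
The different behavior is easiest to see in the special case of orthonormal features, where both penalties have closed-form solutions in terms of the unregularized weights. The sketch below assumes that special case, with the objectives ½‖y − Xw‖² + λ‖w‖₁ for L1 and ½‖y − Xw‖² + (λ/2)‖w‖² for L2. The weights and λ are hypothetical.

```python
def lasso_orthonormal(ols_weights, lam):
    """Soft-thresholding: the L1 solution when features are orthonormal."""
    return [
        (abs(w) - lam) * (1 if w > 0 else -1) if abs(w) > lam else 0.0
        for w in ols_weights
    ]

def ridge_orthonormal(ols_weights, lam):
    """Uniform shrinkage: the L2 solution when features are orthonormal."""
    return [w / (1 + lam) for w in ols_weights]

# Hypothetical unregularized (OLS) weights.
ols_weights = [3.0, -1.5, 0.4, -0.2, 0.05]
lam = 0.5

lasso_weights = lasso_orthonormal(ols_weights, lam)
ridge_weights = ridge_orthonormal(ols_weights, lam)

print("OLS:  ", ols_weights)
print("Lasso:", lasso_weights)
print("Ridge:", [round(w, 3) for w in ridge_weights])
```

L1 subtracts a fixed amount and clips at zero, so small weights disappear entirely. L2 rescales every weight by the same factor, so none reaches exactly zero. With correlated features there is no such closed form, but the qualitative behavior is similar.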

What are key assumptions of linear regression?

Classical linear regression has several important assumptions:

  1. Linearity — the relationship between features and target is approximately linear.
  2. Independence — errors are independent, which is especially important in time series or repeated-user data.
  3. Homoscedasticity — residuals have roughly constant variance.
  4. No strong multicollinearity — features should not be almost duplicates of each other.
  5. Normality of residuals — mainly important for confidence intervals and p-values, especially with small samples.

If assumptions are violated, possible fixes include:

  • Transforming the target or features
  • Adding interaction terms
  • Using robust standard errors
  • Removing or combining highly correlated features
  • Choosing a different model family

For prediction-only tasks, mild assumption violations may be acceptable if validation performance is strong. For inference, assumptions matter more because coefficients, p-values, and confidence intervals need to be trustworthy.
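
A basic residual check for homoscedasticity can be done without any libraries. The sketch below fits a line by ordinary least squares on synthetic data whose noise deliberately grows with x, then compares residual variance in the lower and upper halves of the feature range. The data-generating process and the split point are assumptions.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Synthetic data whose noise grows with x (a deliberate violation).
rng = random.Random(11)
xs = [rng.uniform(0, 10) for _ in range(1000)]
ys = [1 + 2 * x + rng.gauss(0, 0.2 + x) for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

low = [r for x, r in zip(xs, residuals) if x < 5]
high = [r for x, r in zip(xs, residuals) if x >= 5]
variance_ratio = variance(high) / variance(low)   # near 1 if homoscedastic

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"residual variance ratio (high x / low x): {variance_ratio:.1f}")
```

The slope estimate is still close to the true value, but a ratio far from 1 signals that default standard errors would be unreliable, which is where robust standard errors or a transformation come in.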

See simple linear regression in Python for practical fitting.

Random Forest vs Gradient Boosting — when do you pick each?

Random Forest and Gradient Boosting are both tree-based ensemble methods, but they work differently.

Aspect | Random Forest | Gradient Boosting
Main idea | Builds many trees independently and averages them | Builds trees sequentially to correct previous errors
Strength | Robust default, less tuning, handles noise well | Often higher accuracy with careful tuning
Risk | Can be less accurate on structured/tabular tasks than tuned boosting | Can overfit if learning rate, depth, or iterations are poorly tuned
Training | Easier to parallelize | More sequential
Good first choice | Strong baseline | Performance-focused model

I would usually start with a simple baseline such as logistic regression or a Random Forest. Then I would try Gradient Boosting if extra lift justifies the added tuning and monitoring cost.

A strong interview answer is:

I would choose Random Forest when I need a robust baseline quickly. I would choose Gradient Boosting when predictive performance matters more and I have time to tune and validate carefully.

When is unsupervised learning appropriate?

Use unsupervised learning when labels are missing, expensive, unreliable, or not clearly defined.

Common use cases:

Method | Use case
Clustering | Customer segmentation, user behavior groups
Dimensionality reduction | PCA, UMAP, visualization, noise reduction
Anomaly detection | Fraud, equipment failure, suspicious activity
Topic modeling | Grouping documents or support tickets

The main challenge is validation. Since there is no true label, you should validate results using:

  • Business interpretability
  • Cluster stability
  • Segment usefulness
  • Domain expert review
  • Downstream performance

Example:

Customer clusters are only useful if marketing, product, or sales teams can act on them. A cluster that is mathematically clean but not business-interpretable may not be valuable.

See supervised learning algorithms for the labeled counterpart.

How do you handle class imbalance?

Class imbalance happens when one class is much rarer than another, such as fraud, churn, disease detection, or failure prediction.

Do not rely on accuracy alone. A model can get high accuracy by mostly predicting the majority class.

Better approaches:

  • Use metrics such as precision, recall, F1-score, PR-AUC, and confusion matrix
  • Use stratified train/test splits
  • Apply class weights in the loss function
  • Undersample the majority class or oversample the minority class
  • Use SMOTE carefully and only on the training set
  • Tune the decision threshold on a validation set
  • Consider anomaly detection for extremely rare events

The right metric depends on business cost.

Example:

In fraud detection, false positives may create manual review work, but false negatives may allow real fraud. In medical screening, missing a true positive may be much more costly than sending a false alarm for follow-up.

A strong interview answer is:

I would first define the cost of false positives and false negatives. Then I would choose the metric and threshold based on that cost, not accuracy alone.
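
The accuracy trap and cost-based threshold tuning can both be shown on a toy validation set. Everything in this sketch is hypothetical: the labels, the model scores, the candidate thresholds, and the assumption that a missed positive costs 20 times as much as a false alarm.

```python
def confusion(labels, scores, threshold):
    """Return (tp, fp, fn, tn) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp, fp, fn, tn

def best_threshold(labels, scores, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest total misclassification cost."""
    def total_cost(t):
        _, fp, fn, _ = confusion(labels, scores, t)
        return cost_fp * fp + cost_fn * fn
    return min(candidates, key=total_cost)

# Hypothetical validation set: 2 positives among 100 rows (2% positive rate).
labels = [1, 1] + [0] * 98
scores = [0.70, 0.35] + [0.60, 0.40, 0.30] + [0.10] * 95

# A threshold above every score means always predicting the majority class.
tp, fp, fn, tn = confusion(labels, scores, threshold=1.01)
majority_accuracy = (tp + tn) / len(labels)
majority_recall = tp / (tp + fn)

# Assumed costs: a missed positive is 20x as costly as a false alarm.
chosen = best_threshold(labels, scores, cost_fp=1, cost_fn=20,
                        candidates=[0.2, 0.35, 0.5, 0.65])

print(f"majority-class accuracy={majority_accuracy:.2f}, recall={majority_recall:.2f}")
print(f"cost-minimizing threshold: {chosen}")
```

The always-negative model scores 98% accuracy while catching nothing. Once costs are explicit, the chosen threshold moves well below 0.5 to avoid the expensive misses.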


Model evaluation and metrics

Explain precision and recall. When does each matter?

Precision answers: “Of the cases predicted positive, how many were actually correct?”

Precision = TP / (TP + FP)

Recall answers: “Of all actual positive cases, how many did the model catch?”

Recall = TP / (TP + FN)

Priority | Example | Why
High precision | Spam filter, fraud alert, sales lead scoring | False positives are costly or annoying
High recall | Cancer screening, security threat detection, high-risk fraud | Missing a true case is costly

F1-score is the harmonic mean of precision and recall. It is useful when you need one summary metric, especially for imbalanced classification.

However, interviews expect more than formulas. A strong answer should connect the metric to business cost:

“If false positives are expensive, I optimize precision. If false negatives are expensive, I optimize recall. The final threshold should be chosen using validation data and business cost.”
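
The formulas above translate directly into code. This is a minimal sketch with hypothetical confusion-matrix counts; the zero-division guards are a design choice, not part of the definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot earn a high F1 by excelling at only one of them.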

ROC-AUC vs PR-AUC — which do you report for imbalanced data?

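ROC-AUC measures how well the model ranks positives above negatives across thresholds. It is the area under the ROC curve, which plots:

  • True positive rate (recall) on the y-axis
  • False positive rate on the x-axis

PR-AUC is the area under the precision-recall curve, which plots:

  • Precision on the y-axis
  • Recall on the x-axis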

For heavily imbalanced data, PR-AUC is often more informative because it focuses on performance for the positive class.

Example:

In fraud detection, 99.5% of transactions may be legitimate. Because the false positive rate is divided by that very large pool of negatives, a model can have a good ROC-AUC while still producing too many false positives or missing too much fraud. PR-AUC gives a clearer view of how well the model handles the rare positive class.

In an interview, say:

  • Use ROC-AUC when classes are reasonably balanced or ranking quality is the main concern.
  • Use PR-AUC when the positive class is rare and positive-class performance matters most.
  • Also report a confusion matrix at the chosen threshold.

A strong answer is:

For imbalanced data, I would prioritize PR-AUC, but I would still choose the operating threshold using precision, recall, confusion matrix, and business cost.
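
A small worked example shows how far the two numbers can diverge. This sketch uses plain Python on invented scores; `roc_auc` counts correctly ordered positive/negative pairs, and `average_precision` is a step-wise summary of the precision-recall curve, used here as a stand-in for PR-AUC rather than an exact area computation.

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision measured at the rank of each true positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Made-up data: 2 positives among 20 rows, with 2 negatives ranked above them.
labels = [0, 0, 1, 1] + [0] * 16
scores = [0.95, 0.90, 0.85, 0.80] + [0.50 - 0.02 * i for i in range(16)]

auc = roc_auc(labels, scores)          # 32 of 36 pairs ordered correctly
ap = average_precision(labels, scores)  # precision 1/3 and 2/4 at the two hits
```

ROC-AUC comes out near 0.89 while average precision is near 0.42: two high-scoring negatives barely dent the ROC view but halve precision at the top of the list.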

Why use k-fold cross-validation?

K-fold cross-validation estimates how well a model generalizes to unseen data.

The dataset is split into k folds. The model trains on k - 1 folds and validates on the remaining fold. This repeats until each fold has been used as validation once.

Benefits:

  • More stable estimate than a single train/test split
  • Better use of limited data
  • Helps compare models and hyperparameters
  • Shows whether performance changes a lot across folds

Important caveats:

  • Do not randomly shuffle time series data. Use time-based splits or forward chaining.
  • Do not fit preprocessing on the full dataset before cross-validation.
  • Use pipelines so scaling, encoding, imputation, and feature selection are learned only from the training fold.
  • Keep a final untouched test set if you need an unbiased final estimate.

A strong interview answer is:

Cross-validation gives a more reliable estimate of generalization, but it must be designed to match the data. For time series, or for data grouped by user or another entity, I would avoid random splits that leak information between training and validation.
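
The splitting logic can be sketched without any library. The two generators below are illustrative helpers (hypothetical names): one produces standard k-fold splits and assumes the rows are already in random order, the other produces forward-chaining splits for time-ordered rows.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx); every index lands in validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def forward_chaining_indices(n_samples, n_splits):
    """Yield expanding-window splits: train on the past, validate on the next block."""
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        val = list(range(i * block, (i + 1) * block))
        yield train, val


folds = list(kfold_indices(10, 3))
time_splits = list(forward_chaining_indices(10, 4))
```

In the forward-chaining version every training index precedes every validation index, so the model never sees the future. In practice, scikit-learn's `KFold`, `GroupKFold`, and `TimeSeriesSplit` cover these cases.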

RMSE vs MAE for regression — when does each matter?

MAE and RMSE both measure regression error, but they behave differently.

Metric | Meaning | Best when
MAE | Average absolute error | You want typical error in the same unit as the target
RMSE | Square root of average squared error | Large errors should be penalized more

Use MAE when you want a simple, interpretable error metric.

Example:

If MAE is 5 minutes for delivery time prediction, the model is off by about 5 minutes on average.

Use RMSE when large mistakes are disproportionately costly.

Example:

In demand forecasting, a few very large prediction errors may cause stockouts or excess inventory, so RMSE may be more useful.

A strong interview answer is:

I would use MAE when I care about typical error and interpretability. I would use RMSE when large errors are much more costly than small errors.
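
A quick numeric check makes the difference visible. In this sketch the delivery times are invented; both prediction sets have the same MAE, but one concentrates all of its error in a single large miss.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical delivery times in minutes.
actual = [30, 30, 30, 30, 30]
steady = [35, 25, 35, 25, 35]  # always off by 5 minutes
spiky = [30, 30, 30, 30, 55]   # perfect four times, off by 25 once

steady_mae, steady_rmse = mae(actual, steady), rmse(actual, steady)
spiky_mae, spiky_rmse = mae(actual, spiky), rmse(actual, spiky)
```

Both models have an MAE of 5 minutes, but RMSE is 5 for the steady model and about 11.2 for the spiky one, because squaring makes the single 25-minute miss dominate.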

What is model calibration and why does it matter?

A model is calibrated when predicted probabilities match real-world frequencies.

Example:

If 1,000 users are each assigned a conversion probability around 0.80, then about 80% of those users should actually convert.

Calibration matters when the probability itself is used for decisions, not just ranking.

Common examples:

  • Credit risk scoring
  • Insurance pricing
  • Medical risk prediction
  • Fraud review thresholds
  • Marketing budget allocation
  • Lead scoring

A model can rank users well but still be poorly calibrated. For example, it may correctly rank high-risk users above low-risk users, but its predicted probabilities may be too high or too low.

Ways to improve calibration:

  • Platt scaling
  • Isotonic regression
  • Calibration on a separate holdout set
  • Periodic recalibration when data drifts

A strong interview answer is:

Calibration matters when predicted probabilities drive decisions. If a model says 80% risk, that number should mean something operationally, not just help with ranking.
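
One simple way to check calibration is a reliability table: bucket the predicted probabilities and compare the average prediction in each bucket with the observed rate. This is a minimal sketch on invented data; `reliability_table` is a hypothetical helper and the equal-width binning is an assumption.

```python
def reliability_table(y_true, probs, n_bins=5):
    """Compare mean predicted probability with the observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, label))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        rows.append({
            "bin": i,
            "count": len(members),
            "mean_predicted": sum(p for p, _ in members) / len(members),
            "observed_rate": sum(label for _, label in members) / len(members),
        })
    return rows


# Made-up holdout data: the model says 0.85 for ten users, but only five convert.
y_holdout = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8
p_holdout = [0.85] * 10 + [0.15] * 10

table = reliability_table(y_holdout, p_holdout)
```

Here the high-score bucket predicts about 0.85 but only half of those users convert, so the model ranks correctly yet is overconfident. Scikit-learn offers `calibration_curve` for the same check and `CalibratedClassifierCV` for Platt scaling or isotonic regression.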


Feature engineering and data cleaning

How do you handle missing data?

First, I would understand why the data is missing rather than filling it immediately.

Type | Meaning | Example
MCAR | Missing completely at random | Random logging failure
MAR | Missingness depends on observed data | Income missing more often for younger users
MNAR | Missingness depends on the missing value itself | High-income users avoid entering income

Common tactics:

  • Drop rows if missingness is small and random
  • Drop columns if missingness is very high and not useful
  • Use mean/median imputation for numeric features
  • Use mode or “unknown” category for categorical features
  • Add a missingness indicator when missing itself may carry signal
  • Use model-based imputation for more complex cases

Important interview point: fit imputation only on the training data, then apply it to validation/test data. Otherwise, you leak information from validation or test data into training.

A strong answer is:

I would first check the missingness pattern. Then I would choose deletion, imputation, or missingness indicators based on whether the missing values are random and whether missingness itself is predictive.
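
The fit-on-train rule can be sketched in a few lines. This example uses plain Python with `None` marking a missing value; the income figures and the helper names `fit_median_imputer` and `impute` are assumptions for illustration.

```python
from statistics import median


def fit_median_imputer(train_values):
    """Learn the fill value from training rows only (None marks missing)."""
    observed = [v for v in train_values if v is not None]
    return median(observed)


def impute(values, fill_value):
    """Apply a previously fitted fill value and return a missingness flag."""
    imputed = [fill_value if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


# Hypothetical income column, in thousands.
train_income = [40, None, 55, 60, None, 45]
test_income = [None, 200, 50]

fill = fit_median_imputer(train_income)              # uses 40, 45, 55, 60 only
test_filled, test_flags = impute(test_income, fill)  # the test outlier never affects fill
```

The test set contains an outlier of 200, but because the median was learned from training rows alone, the fill value stays at 50. The flag column keeps the information that a value was missing, which matters when missingness is itself predictive.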

See pandas dropna for syntax patterns.

What are filter, wrapper, and embedded feature selection?

Feature selection removes weak, noisy, redundant, or expensive features to improve model simplicity and sometimes performance.

Method | Idea | Example | Trade-off
Filter | Score features before modeling | Correlation, chi-square, mutual information | Fast, but may miss interactions
Wrapper | Try feature subsets using model performance | Recursive feature elimination | More accurate, but expensive
Embedded | Model selects features during training | Lasso, tree importance | Good balance, but model-dependent

A practical workflow:

  1. Remove constant or near-constant features
  2. Remove duplicate or extremely high-null features
  3. Check highly correlated features
  4. Train a baseline model
  5. Use embedded importance or cross-validated selection

Important: feature selection should happen inside the training process or cross-validation pipeline. Selecting features using the full dataset can leak information from validation/test data.

A strong answer is:

I would start with simple cleanup, then use model-based or cross-validated feature selection. I would avoid selecting features on the full dataset before splitting.

What is multicollinearity and why is it a problem?

Multicollinearity happens when predictor variables are highly correlated with each other.

Example:

If a model includes both total_spend and average_monthly_spend, those features may carry overlapping information.

Why it matters:

  • Coefficients become unstable in linear models
  • It becomes hard to interpret individual feature effects
  • Standard errors can increase
  • Small data changes may produce large coefficient changes

How to detect it:

  • Correlation matrix
  • Variance inflation factor (VIF)
  • Large coefficient changes when features are added or removed

How to fix it:

  • Drop one of the correlated features
  • Combine features
  • Use PCA
  • Use regularization such as Ridge
  • Use tree-based models if interpretation of individual coefficients is not required

A useful interview distinction:

“Multicollinearity is mainly a problem for interpretation and coefficient stability. It may be less harmful if the goal is only prediction and validation performance is strong.”
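
As a rough illustration of detection, the sketch below flags highly correlated feature pairs using a hand-written Pearson correlation. The feature values are invented and the 0.9 cutoff is an arbitrary assumption. For exactly two predictors, VIF reduces to 1 / (1 - r²), which is what the last line computes; with more predictors VIF needs a regression of each feature on all the others.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical features: total spend is roughly 12x the monthly average.
features = {
    "total_spend": [125, 235, 370, 470, 610],
    "average_monthly_spend": [10, 20, 30, 40, 50],
    "tenure_years": [3, 1, 4, 1, 5],
}

flagged = []
for a, b in combinations(features, 2):
    r = pearson(features[a], features[b])
    if abs(r) > 0.9:  # assumed cutoff for "highly correlated"
        flagged.append((a, b, r))

r_spend = pearson(features["total_spend"], features["average_monthly_spend"])
vif_two_features = 1 / (1 - r_spend ** 2)  # valid only for the two-predictor case
```

Only the two spend features are flagged, and their VIF is far above the commonly quoted warning levels of 5 or 10. For real work, `statsmodels` provides `variance_inflation_factor`.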

How do you encode high-cardinality categorical features?

High-cardinality categorical features have many unique values, such as user IDs, product IDs, cities, ZIP codes, or search terms.

Common encoding options:

Method | Use when | Risk
One-hot encoding | Low cardinality | Too many sparse columns
Rare-category grouping | Many infrequent categories | May lose useful detail
Target encoding | Category has predictive signal | Target leakage if done incorrectly
Hashing | Very high cardinality or memory limits | Collisions reduce interpretability
Embeddings | Deep learning or large-scale recommender systems | More complexity and data needed

For high-cardinality features, I would usually avoid plain one-hot encoding unless the number of categories is manageable.

If using target encoding, it must be done carefully:

  • Use out-of-fold or cross-fitted encoding
  • Smooth rare categories toward the global mean
  • Fit encoding only on training data
  • Handle unseen categories during inference

A strong interview answer is:

For high-cardinality categories, I would consider target encoding with out-of-fold validation, hashing, or rare-category grouping. The main risk is leakage, so encoding must be fitted only from training folds.
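
The out-of-fold idea can be sketched as follows. Each row is encoded with statistics computed from the other folds only, rare categories are shrunk toward the fold's global mean, and unseen categories fall back to that mean. The helper names, the modulo-based fold assignment, and the smoothing weight of 2 are all simplifying assumptions.

```python
from collections import defaultdict


def smoothed_means(categories, targets, prior, weight):
    """Per-category target mean, shrunk toward the prior for small categories."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + prior * weight) / (counts[c] + weight) for c in counts}


def out_of_fold_target_encode(categories, targets, n_folds=2, weight=2.0):
    """Encode each row using only rows from the other folds."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        val_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        train_targets = [targets[i] for i in train_idx]
        prior = sum(train_targets) / len(train_targets)
        table = smoothed_means(
            [categories[i] for i in train_idx], train_targets, prior, weight
        )
        for i in val_idx:
            # Categories unseen in the other folds fall back to the prior.
            encoded[i] = table.get(categories[i], prior)
    return encoded


# Made-up data: city "C" appears only once.
cities = ["A", "A", "A", "A", "B", "B", "C"]
converted = [1, 1, 0, 1, 0, 0, 1]

encoded = out_of_fold_target_encode(cities, converted)
```

The single "C" row is encoded with the global mean of the other fold (2/3) rather than its own label, which is exactly the leakage this scheme is meant to prevent.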


SQL for data scientists

Why is SQL still the top elimination round for data scientists?

SQL is still a top elimination round because most business data lives in warehouses such as BigQuery, Snowflake, Redshift, or Databricks.

Interviewers use SQL to test whether you can extract reliable data before doing analysis or modeling.

They usually check:

  • Correct grain: user, session, order, event, or account
  • Joins without duplicate rows or double-counting
  • Aggregations with the right numerator and denominator
  • Window functions for ranking, deduplication, retention, and running totals
  • Ability to reason about nulls, filters, dates, and edge cases

A strong interview answer is:

SQL tests whether I can define the metric correctly and pull the right dataset. A model built on the wrong grain or a duplicated join will produce misleading results.

Full drill set: SQL technical interview questions.

How would you compute weekly cohort retention in SQL?

A high-level pattern for weekly cohort retention is:

  1. Assign each user to a signup week
  2. Deduplicate activity to one row per user per active week
  3. Join activity back to the user cohort
  4. Calculate the week offset from signup
  5. Count active users per cohort and week offset
  6. Divide by the original cohort size

The key is to keep the grain correct:

Step | Grain
Cohort table | One row per user
Activity table | One row per user per active week
Final output | One row per cohort week and retention week

Use functions such as DATE_TRUNC, DATEDIFF, or dialect-specific date functions depending on the SQL engine.

Common mistakes:

  • Counting events instead of users
  • Not deduplicating multiple events in the same week
  • Using session-level grain when the metric is user-level
  • Accidentally excluding users with no later activity

A strong interview answer is:

I would first create a clean cohort table and deduplicated weekly activity table. Then I would compute week offsets and divide active users by the original cohort size.
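
The steps above can be sketched end to end with Python's built-in `sqlite3` module so the query is runnable anywhere. The `users` and `events` tables and their rows are invented, and the date handling is SQLite-specific: `date(x, 'weekday 0', '-6 days')` stands in for a Monday-based `DATE_TRUNC('week', x)`, and `julianday` differences stand in for `DATEDIFF`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER, signup_date TEXT);
CREATE TABLE events (user_id INTEGER, event_date TEXT);
INSERT INTO users VALUES
  (1, '2024-01-01'), (2, '2024-01-03'), (3, '2024-01-05'), (4, '2024-01-04');
INSERT INTO events VALUES
  (1, '2024-01-02'), (1, '2024-01-02'), (1, '2024-01-09'),
  (2, '2024-01-04'), (2, '2024-01-10'), (2, '2024-01-11'),
  (3, '2024-01-06');
""")

RETENTION_SQL = """
WITH cohort AS (            -- one row per user
  SELECT user_id, date(signup_date, 'weekday 0', '-6 days') AS cohort_week
  FROM users
),
activity AS (               -- one row per user per active week
  SELECT DISTINCT user_id, date(event_date, 'weekday 0', '-6 days') AS active_week
  FROM events
),
cohort_size AS (            -- denominator includes users with no activity
  SELECT cohort_week, COUNT(*) AS n_users FROM cohort GROUP BY cohort_week
)
SELECT
  c.cohort_week,
  CAST((julianday(a.active_week) - julianday(c.cohort_week)) / 7 AS INTEGER) AS week_offset,
  COUNT(DISTINCT c.user_id) AS active_users,
  ROUND(1.0 * COUNT(DISTINCT c.user_id) / s.n_users, 2) AS retention
FROM cohort c
JOIN activity a ON a.user_id = c.user_id AND a.active_week >= c.cohort_week
JOIN cohort_size s ON s.cohort_week = c.cohort_week
GROUP BY c.cohort_week, week_offset, s.n_users
ORDER BY c.cohort_week, week_offset;
"""

rows = conn.execute(RETENTION_SQL).fetchall()
```

User 4 never generates an event but still counts in the cohort size of 4, so week 0 retention is 3/4 and week 1 retention is 2/4. User 1's duplicate event on the same day does not inflate the count because activity is deduplicated first.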

How do you get the latest event per user in SQL?

A common pattern is to rank events within each user and keep the first row.

```sql
WITH ranked AS (
  SELECT
    e.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_time DESC
    ) AS rn
  FROM events e
)
SELECT *
FROM ranked
WHERE rn = 1;
```

This pattern is useful for:

  • Latest event per user
  • Most recent order per customer
  • Last login per account
  • Highest-value transaction per user (order by the amount column instead of event_time)

If there can be ties, add a deterministic tie-breaker:

```sql
ORDER BY event_time DESC, event_id DESC
```

Some SQL engines, such as Snowflake and BigQuery, also support QUALIFY, which filters on a window function directly and makes this shorter:

```sql
SELECT *
FROM events
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY user_id
  ORDER BY event_time DESC
) = 1;
```

A strong interview answer is:

I would use ROW_NUMBER() partitioned by user and ordered by event time descending. I would also add a tie-breaker if event times are not unique.


Python and pandas

How does pandas groupby-apply differ from SQL GROUP BY?

Both pandas groupby and SQL GROUP BY split data by keys and compute aggregations, but they are used in different contexts.

Area | SQL GROUP BY | pandas groupby
Runs where | Database or warehouse | In Python memory
Best for | Large data extraction and aggregation | EDA, feature engineering, smaller local analysis
Output | Query result table | DataFrame or Series
Flexibility | Strong for set-based operations | Flexible Python-side transformations

Example:

```python
df.groupby("country")["revenue"].agg(["sum", "mean", "count"])
```

Watch these common mistakes:

  • NA group keys are dropped by default unless you use dropna=False
  • .apply() is flexible but can be slower than built-in aggregations
  • Grouping after filtering may change denominators
  • Multi-index output may need cleanup with reset_index()

A strong interview answer is:

I would use SQL to reduce and extract the right dataset, then pandas groupby for local exploration or feature engineering. I would also check null group keys and row counts.
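
The dropped-key pitfall is easy to demonstrate. This sketch assumes pandas 1.1 or later (where `groupby` accepts `dropna`) and uses an invented four-row frame with one missing country.

```python
import pandas as pd

# Hypothetical orders; one row has no country recorded.
df = pd.DataFrame({
    "country": ["US", "US", "IN", None],
    "revenue": [100.0, 50.0, 30.0, 20.0],
})

# Default behaviour: the row with a missing key silently disappears.
default_totals = df.groupby("country")["revenue"].sum()

# dropna=False keeps the missing key as its own group.
with_missing = df.groupby("country", dropna=False)["revenue"].sum()

# Named aggregation plus reset_index gives flat, readable columns.
summary = (
    df.groupby("country", dropna=False)
    .agg(total_revenue=("revenue", "sum"), orders=("revenue", "count"))
    .reset_index()
)
```

The default totals add up to 180 while the frame holds 200 in revenue, which is the kind of gap a row-count or grand-total check catches.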

What mistakes do candidates make with pandas merge?

Common pandas merge mistakes include:

  • Many-to-many joins inflating row counts
  • Using the wrong join type, such as inner when unmatched rows matter
  • Merging on float keys without rounding or cleaning
  • Ignoring duplicate keys before the merge
  • Ignoring null join keys
  • Not checking row counts before and after the merge

A good merge checklist:

  • Check df.shape before and after
  • Check whether join keys are unique
  • Inspect unmatched rows when using left/right joins
  • Use validate= when you expect one-to-one, one-to-many, or many-to-one joins
  • Use clear suffixes for overlapping column names

Example:

```python
orders.merge(users, on="user_id", how="left", validate="many_to_one")
```

A strong interview answer is:

After every merge, I check row counts, duplicate keys, null keys, and whether the join relationship matches my expectation.
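
Here is a minimal sketch of that checklist on invented frames. It uses `validate=` to enforce the expected relationship and `indicator=True` to surface unmatched rows; `dup_users` is a deliberately broken lookup table used to show the failure mode.

```python
import pandas as pd

# Hypothetical tables: user 30 has orders but no user record.
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 30]})
users = pd.DataFrame({"user_id": [10, 20], "plan": ["free", "pro"]})

merged = orders.merge(
    users, on="user_id", how="left", validate="many_to_one", indicator=True
)

# A left join onto unique keys should never change the row count.
row_count_preserved = len(merged) == len(orders)

# Rows that found no match on the right side.
unmatched = merged[merged["_merge"] == "left_only"]

# If the lookup table has duplicate keys, validate raises instead of
# silently inflating rows.
dup_users = pd.concat([users, users.iloc[[0]]], ignore_index=True)
try:
    orders.merge(dup_users, on="user_id", how="left", validate="many_to_one")
    duplicate_keys_caught = False
except pd.errors.MergeError:
    duplicate_keys_caught = True
```

Without `validate=`, the merge against `dup_users` would quietly return five rows instead of three, which is how double-counted revenue usually starts.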

Why do interviewers care about vectorization in pandas?

Interviewers care about vectorization because it shows you can write pandas code that works beyond tiny toy datasets.

In pandas, vectorized column operations are usually faster and cleaner than looping row by row in Python.

Prefer:

  • Boolean indexing
  • Built-in arithmetic on columns
  • .map() for simple mappings
  • .where() or np.where() for conditional logic
  • Built-in string and datetime accessors
  • groupby().agg() instead of custom row loops

Use .apply() only when built-in vectorized operations are not enough.

Also mention memory:

  • Read only needed columns
  • Use chunked reads for large files
  • Use appropriate dtypes
  • Push heavy aggregation to SQL when the dataset is too large for memory

A strong interview answer is:

I avoid row-by-row loops in pandas when a column-wise operation exists. It is usually faster, easier to read, and more scalable.
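
A short side-by-side comparison, assuming pandas and NumPy are available. The frame and the revenue cutoff of 50 are invented; both versions produce the same labels, but only the second operates on whole columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.0, 40.0], "qty": [3, 1, 2]})

# Row-by-row version: calls a Python function once per row.
df["tier_slow"] = df.apply(
    lambda row: "high" if row["price"] * row["qty"] >= 50 else "low", axis=1
)

# Vectorized version: column arithmetic plus np.where for the condition.
df["revenue"] = df["price"] * df["qty"]
df["tier"] = np.where(df["revenue"] >= 50, "high", "low")

same_result = (df["tier"] == df["tier_slow"]).all()
```

On three rows the difference is invisible; on millions of rows the vectorized version is typically much faster because the loop runs in compiled code rather than in Python.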


Product sense and case studies

How would you define success metrics for a new recommendation feature?

I would start by clarifying the product goal. A recommendation feature can optimize for engagement, purchases, retention, discovery, or user satisfaction, so the metric should match the decision.

A good metric framework:

  1. North star metric — long-term product goal, such as retention, GMV, watch time, or repeat purchases.
  2. Primary experiment metric — measurable during the test window, such as click-through rate, conversion rate, add-to-cart rate, or session depth.
  3. Quality metrics — saves, long clicks, repeat usage, completion rate, or user ratings.
  4. Guardrail metrics — latency, revenue per user, diversity, complaint rate, unsubscribe rate, creator fairness, or refund rate.
  5. Counter metrics — check whether the recommendation improves one surface while hurting another.

Avoid vanity metrics. More clicks are not always better if users click low-quality recommendations and leave quickly.

A strong answer should define:

  • Numerator
  • Denominator
  • Attribution window
  • User segment
  • Experiment unit
  • Guardrail thresholds

Example:

“For a shopping recommendation feature, I might use add-to-cart rate or purchase conversion as the primary metric, but I would guardrail refund rate, latency, and recommendation diversity. I would not ship only because clicks increased.”

Sketch a high-level recommendation system architecture.

A high-level recommendation system usually has these parts:

  1. Data collection — user events, item catalog, search history, purchases, ratings, impressions, and skips.
  2. Feature pipeline — user features, item features, context features, and point-in-time correct training data.
  3. Candidate generation — quickly retrieve a smaller set of possible items using collaborative filtering, embeddings, popularity, or rules.
  4. Ranking model — score and order candidates using predicted relevance, conversion, or long-term value.
  5. Serving layer — low-latency API with caching, freshness controls, and fallback recommendations.
  6. Evaluation — offline metrics, online A/B testing, guardrails, and long-term impact.
  7. Monitoring — data drift, model performance, latency, popularity bias, fairness, and freshness.

Senior candidates should discuss trade-offs:

Trade-off | Example
Accuracy vs latency | Better ranking may be too slow for real-time serving
Personalization vs diversity | Too much personalization can create filter bubbles
Exploration vs exploitation | Show proven items vs discover new interests
Freshness vs stability | New content needs exposure without hurting relevance

A strong interview answer is:

I would separate candidate generation from ranking. Candidate generation keeps serving fast, while the ranking model optimizes relevance. I would evaluate with both offline metrics and online experiments because offline ranking gains may not always improve business outcomes.

How do you analyze a conversion funnel drop-off?

I would first define the funnel clearly, then locate where and for whom the drop happens.

Steps:

  1. Define funnel steps — for example: landing page → signup → email verification → checkout → payment success.
  2. Define the time window — same session, 24 hours, 7 days, or first user journey.
  3. Calculate step-to-step conversion and overall conversion.
  4. Segment the drop — device, browser, country, traffic source, user cohort, app version, and experiment bucket.
  5. Find the largest absolute loss — not only the lowest percentage conversion.
  6. Check recent changes — deployments, tracking changes, payment issues, page speed, pricing, or onboarding changes.
  7. Use qualitative evidence — session replay, support tickets, surveys, and UX review.

A common mistake is focusing only on percentages. A 70% drop at a low-volume step may lose fewer users than a 10% drop at a high-volume step.

A strong answer is:

I would quantify where the biggest user loss happens, segment it to find who is affected, then combine data with qualitative evidence before recommending a product change.
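
The difference between the worst rate and the biggest loss is easy to see in numbers. This sketch uses invented user counts for the example funnel; `step_report` is a hypothetical helper.

```python
# Hypothetical user counts at each funnel step.
funnel = [
    ("landing", 10000),
    ("signup", 4000),
    ("email_verified", 3600),
    ("checkout", 900),
    ("payment_success", 810),
]


def step_report(steps):
    """Step-to-step conversion rate and absolute users lost between steps."""
    rows = []
    for (prev_name, prev_n), (name, n) in zip(steps, steps[1:]):
        rows.append({
            "step": f"{prev_name} -> {name}",
            "conversion": n / prev_n,
            "users_lost": prev_n - n,
        })
    return rows


report = step_report(funnel)
worst_rate = min(report, key=lambda r: r["conversion"])
biggest_loss = max(report, key=lambda r: r["users_lost"])
overall_conversion = funnel[-1][1] / funnel[0][1]
```

In this made-up funnel the lowest conversion rate is at checkout (25%), but the largest absolute loss is at signup (6,000 users), so the two views point to different priorities until segmentation narrows things down.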


LLMs, MLOps, and communication

How do you explain a model result to a non-technical stakeholder?

I would explain the result in terms of the decision the stakeholder needs to make, not the algorithm.

A good structure:

  1. Start with the business question
  2. Give the recommendation first
  3. Explain the evidence in plain language
  4. State confidence and limitations
  5. Explain the expected impact
  6. Clarify the decision or next step

Avoid unnecessary jargon. Instead of saying:

“The posterior probability increased after recalibration.”

Say:

“Based on the latest data, this customer segment is more likely to churn than we previously estimated.”

A strong answer is:

I would lead with the recommendation, explain the evidence simply, state what could change the conclusion, and confirm what decision the stakeholder wants to make.

How are LLMs and MLOps changing data science interviews in 2026?

In 2026, many data science interviews include questions about LLM-assisted workflows, model monitoring, and production readiness, especially for mid-level and senior roles.

For LLMs, interviewers may ask how you would use them in daily work.

Good uses:

  • Drafting SQL or pandas for exploration
  • Summarizing tickets, logs, or user feedback
  • Generating first-pass analysis ideas
  • Writing documentation or experiment summaries
  • Speeding up repetitive workflow steps

But LLMs do not replace fundamentals. You still need to validate:

  • Query correctness
  • Data definitions
  • Statistical assumptions
  • Causal claims
  • Privacy risks
  • Hallucinated explanations

For MLOps, expect questions about what happens after a model is trained.

Important topics:

  • Model monitoring
  • Data drift
  • Performance decay
  • Feature freshness
  • Reproducible pipelines
  • Versioned data, code, and models
  • Bias and fairness checks
  • Rollback plans
  • Documentation

A strong answer is:

I would use LLMs to speed up analysis and documentation, but I would not trust outputs blindly. For production models, I would monitor drift, performance, latency, and business impact after launch.
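
As one possible way to make "monitor drift" concrete, the sketch below computes the population stability index (PSI) between a training sample and a live sample of one feature. PSI is a swapped-in technique chosen for illustration, not something prescribed above; the bin edges, the data, and the small floor used to avoid dividing by zero are all assumptions.

```python
import math


def bin_shares(values, edges, floor=1e-6):
    """Share of values per bin; bins are split at the given ascending edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # number of edges at or below v
        counts[idx] += 1
    # Floor empty bins so the log ratio below stays defined.
    return [max(c / len(values), floor) for c in counts]


def psi(expected, actual, edges):
    """Population stability index between a reference and a live sample."""
    e_shares = bin_shares(expected, edges)
    a_shares = bin_shares(actual, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Hypothetical feature values at training time and in production.
train_values = list(range(100))
live_same = list(range(100))
live_shifted = [v + 30 for v in range(100)]
edges = [25, 50, 75]  # assumed quartile edges from the training data

psi_same = psi(train_values, live_same, edges)
psi_shifted = psi(train_values, live_shifted, edges)
```

An unchanged distribution gives a PSI of 0, while the shifted sample scores far above 0.25, a commonly quoted rule-of-thumb level for a significant shift. A check like this would usually run per feature on a schedule and feed an alert, alongside performance and latency monitoring.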


Final checklist before your interview

  • 15+ timed SQL problems (window functions included)
  • 5 pandas exercises — groupby, merge, missing data
  • Can explain p-value, bias-variance, precision/recall without notes
  • One A/B test design story with guardrails and peeking awareness
  • One metric drop investigation framework
  • Two STAR stories — technical win + mistake you owned
  • Read the JD: product analytics vs applied ML emphasis

Pattern recognition across statistics, SQL, and business framing beats memorizing isolated definitions.

Deepak Prasad

R&D Engineer

Founder of GoLinuxCloud with more than 15 years of expertise in Linux, Python, Go, Laravel, DevOps, Kubernetes, Git, Shell scripting, OpenShift, AWS, Networking, and Security. With extensive …