SQL Technical Interview Questions and Answers

SQL technical interview questions for 2026: joins, window functions, CTEs, GROUP BY traps, practice problems, and data analyst prep with sample answers.

Published

Updated

Tech reviewed byDeepak Prasad

SQL Technical Interview Questions and Answers

SQL technical interviews reward queries you can write under time pressure—joins, aggregations, subqueries, window functions, and the NULL and top-N patterns that trip people up in live loops. Interviewers care less about keyword recall and more about whether you can turn a vague business question into correct, readable SQL.

Below are 47 questions with sample answers in PostgreSQL-style syntax (most examples also work on MySQL 8+, BigQuery, and Snowflake with minor tweaks). For DBMS interview questions (ER model, normalization, ACID, keys—theory before engine specifics), see DBMS interview questions. For dedicated PostgreSQL interview questions (MVCC, vacuum, index types, replication, DBA topics), see PostgreSQL interview questions. For pandas interview questions when loops compare SQL joins to merge, see pandas interview questions. Open each answer after you try the question yourself; snippets assume tables like employees or orders exist — later sections link to deeper SQL tutorials on this site when you want more practice.

NOTE
Prep target: Work through 35–50 practice problems with emphasis on window functions—they appear often in senior data and backend loops.

Interview context and how to prepare

What do SQL technical interviews actually test?

SQL interviews are less about memorizing syntax and more about thinking in relations — which rows you need, at what grain, and how to combine tables without lying with numbers.

Three layers interviewers probe (often in that order):

Layer What it means Example signal
Vocabulary You know clauses, join types, keys, NULL rules Explain INNER vs LEFT without guessing
Correctness Output matches the business question No double-counted revenue; ties handled explicitly
Decomposition You break hard problems into readable steps CTEs, window frames, narrated assumptions

Role differences: Data analyst and data engineer loops are usually SQL-heavy (live coding, take-homes). Software engineer loops may use SQL as a filter — still expect joins, aggregates, and one window-function problem at many companies.

What separates strong candidates: They clarify the question (ties, NULLs, timezone, dialect), state grain aloud ("one row per user per day"), and write SQL another engineer can maintain.

What format should you expect?

Formats vary by company and role, but most SQL screens combine time pressure with communication — they want to hear how you think, not only see a final SELECT.

Common formats:

Format What happens How to prepare
Live coding Shared editor, 30–60 min, 1–3 problems Talk through assumptions; leave time to test edge cases
Take-home Larger schema, documented deliverable Write clean SQL + short README of assumptions
Explain / debug Optimize or fix a broken query Read execution plan basics; spot wrong join grain first
Whiteboard SQL Pseudocode + key clauses aloud Practice saying PARTITION BY, GROUP BY, filters out loud

Always clarify up front:

  • Dialect — PostgreSQL, MySQL 8+, BigQuery, Snowflake (syntax for dates, strings, limits)
  • Tie behavior — "second highest" with duplicate salaries?
  • NULL handling — are missing values excluded or their own category?
  • Timezone — for DATE(event_at) on global events

Interviewers forgive minor syntax slips if logic and communication are solid.

What is a realistic 3–5 week prep plan?

Three to five weeks is enough for most candidates if you practice writing, not only read tutorials. Medium-difficulty pattern recognition is the bar at many data and backend interviews.

Week Focus Target output
1 SELECT, WHERE, GROUP BY, JOINs 10 easy problems; explain grain on each
2 Subqueries, CTEs, HAVING traps 10 medium problems; rewrite one nested query as CTE
3 Window functionsROW_NUMBER, RANK, LAG, running totals 15 medium; five patterns from memory
4 Hard: gaps, retention, YoY, median 5–10 timed problems
5 Mock screens + explain answers aloud 2 full timed mocks with verbal walkthrough

Daily habit: For each problem, write assumptions in a comment, solve, then ask "what if NULL?" and "what if tie?"

Medium difficulty defines the interview bar — easy builds confidence; hard teaches edge cases.

Do you need to know every SQL dialect?

No — and nobody expects encyclopedic dialect knowledge in a 45-minute screen. Pick one primary dialect (PostgreSQL or MySQL 8+ are common) and know the transfer differences that actually change interview answers.

Feature PostgreSQL MySQL 8+
String concat || or CONCAT() CONCAT()
Limit rows LIMIT n LIMIT n
Boolean native true / false historically TINYINT(1)
Date arithmetic date - date, INTERVAL '7 days' DATEDIFF(), DATE_ADD()
Median PERCENTILE_CONT window-function workaround

Interview strategy: State your dialect at the start ("I'll write PostgreSQL-style"). If you forget DATEDIFF vs interval syntax, say so and keep the logic correct — interviewers care more about joins and grain than memorizing date functions.

For fundamentals, see getting started with SQL on this site.


Fundamentals and core concepts

What is SQL, and how does it relate to RDBMS?

SQL (Structured Query Language) is the standard language for defining, querying, and manipulating relational data — tables with rows and columns linked by keys. An RDBMS (relational database management system) is the engine that stores data and executes SQL: PostgreSQL, MySQL, SQL Server, Oracle, SQLite, and cloud warehouses like BigQuery and Snowflake all speak SQL with dialect differences.

Why interviewers ask this: They want to know you understand SQL is not "a database" — it is the interface to one.

SQL statement categories (useful vocabulary in system-design tangents):

Category Purpose Examples
DDL Define structure CREATE, ALTER, DROP
DML Change data INSERT, UPDATE, DELETE
DQL Read data SELECT (most interview focus)
DCL Permissions GRANT, REVOKE
TCL Transactions COMMIT, ROLLBACK

Most technical screens are DQL-heavySELECT with joins and aggregates — but senior roles may ask how migrations (DDL) or transactions (TCL) affect production behavior.

A strong answer is:

SQL is the language; the RDBMS is the engine. Interviews focus on SELECT and relational reasoning, but I know DDL/DML/TCL matter for how data is stored, updated, and kept consistent in production.

What is the difference between PRIMARY KEY and FOREIGN KEY?

Keys define identity and relationships between tables — the foundation of every join you write in an interview.

PRIMARY KEY

  • Uniquely identifies each row in a table
  • Implies NOT NULL (no duplicate keys, no missing keys)
  • One primary key per table (can be composite — multiple columns together)

FOREIGN KEY

  • Column(s) in a child table that reference a key in a parent table
  • Enforces referential integrity — you cannot insert a child row with a parent id that does not exist (unless constraints are deferred or disabled)

Interview example: orders.customer_idcustomers.customer_id. A JOIN between orders and customers relies on this relationship.

Practical tip: Interview schemas sometimes omit explicit FK constraints but still have logical keys. Ask: "Which column links these tables?" before joining on the wrong id.

See SQL CREATE TABLE for how keys are declared in DDL.

A strong answer is:

A primary key uniquely identifies rows in a table; a foreign key links a child table to a parent and enforces referential integrity. I confirm join keys before writing JOINs, even when FK constraints are not shown on the schema.

What is the difference between DELETE, TRUNCATE, and DROP?

All three remove data or structure, but at different scope and with different safety implications — a common fundamentals question for backend and data roles.

Command What it removes WHERE clause? Typical rollback Speed
DELETE Matching rows Yes Often yes (inside a transaction) Slower (row-level logging)
TRUNCATE All rows in table No (table-wide) Dialect-dependent Fast (deallocates pages)
DROP Entire table (structure + data) N/A Depends on DB and transaction DDL — metadata change

Details interviewers probe:

  • DELETE without WHERE deletes every row but keeps the table — still logged row-by-row in many engines
  • TRUNCATE often resets identity/serial counters and may not fire row-level triggers
  • DROP removes the table object from the catalog. Rollback behavior depends on the database and transaction context — PostgreSQL can roll back many DDL statements inside a transaction; other engines differ. After commit, recovery usually means backup/restore or CREATE again — do not assume undo

Production caution: TRUNCATE and DROP are common incident causes; mention transactions and backups on senior loops.

A strong answer is:

DELETE removes specific rows and can often roll back in a transaction; TRUNCATE wipes all rows quickly without a WHERE; DROP removes the table object. I choose based on scope, and I treat DROP rollback as dialect-dependent — after commit you typically need recreation or restore.

What is the difference between WHERE and HAVING?

Both filter rows, but at different stages of query execution — mixing them up is one of the most common SQL mistakes in interviews and production.

Execution order (simplified):

  1. FROM / JOIN — assemble rows
  2. WHERE — filter individual rows before grouping
  3. GROUP BY — form groups
  4. HAVING — filter groups after aggregation
  5. SELECT / ORDER BY — project and sort
sql
SELECT department_id, AVG(salary) AS avg_sal
FROM employees
WHERE status = 'active'           -- row filter: only active employees enter the average
GROUP BY department_id
HAVING AVG(salary) > 80000;       -- group filter: only departments above threshold

Classic trap: Putting a non-aggregated column in HAVING that belongs in WHERE (e.g., HAVING status = 'active'). Some databases allow it but it confuses readers and may hurt performance.

Cleaner pattern for complex filters: Aggregate in a CTE, then WHERE on the alias in the outer query.

A strong answer is:

WHERE filters rows before GROUP BY; HAVING filters groups after aggregation. I put row-level conditions in WHERE and aggregate thresholds in HAVING — or use a CTE when the logic gets hard to read.

How does NULL behave in SQL comparisons?

NULL means unknown or missing — not zero, not empty string '', not false. Three-valued logic (TRUE, FALSE, UNKNOWN) drives many interview traps.

Rules to internalize:

Expression Result Why
NULL = NULL UNKNOWN (not TRUE) Cannot confirm two unknowns are equal
WHERE col = NULL Wrong idiom Use WHERE col IS NULL
COUNT(*) Counts all rows in group Includes rows where columns are NULL
COUNT(col) Counts non-NULL values only NULLs skipped
AVG(col) Averages non-NULL values NULLs ignored, not treated as zero

Interview follow-ups:

  • SUM of no rows returns NULL — use COALESCE(SUM(x), 0) when you need zero
  • NOT IN (subquery) breaks if the subquery returns NULL — prefer NOT EXISTS
  • Outer joins produce NULL for non-matching columns — anti-joins use WHERE right.key IS NULL

See ISNULL / NULL handling for dialect-specific functions.

A strong answer is:

NULL is unknown — I never compare with = NULL; I use IS NULL. I remember COUNT(column) skips NULLs, and I avoid NOT IN when the subquery might return NULL.

When do you use COALESCE in SQL?

COALESCE(expr1, expr2, …) returns the first non-NULL argument — the standard tool for NULL-safe display and fallbacks in reports.

Common interview uses:

Scenario Pattern
Show zero instead of NULL COALESCE(SUM(amount), 0) when no rows match
Fallback value COALESCE(nickname, name)
Outer join reports COALESCE(b.order_count, 0) when the right side is NULL
Safer division setup Often paired with NULLIFCOALESCE(x / NULLIF(y, 0), 0)

What COALESCE does not fix: It does not replace proper IS NULL filters in WHERE. COALESCE(col, 0) = 0 matches both NULL and zero — know if that is intended.

vs IFNULL / NVL: Dialect-specific aliases; COALESCE is portable SQL.

See server COALESCE examples for more patterns.

A strong answer is:

I use COALESCE for the first non-NULL fallback — especially zero instead of NULL after outer joins or empty aggregates — and I pair it with NULLIF when I need safe division, not as a substitute for correct NULL filters.


JOINs

Explain INNER, LEFT, RIGHT, and FULL OUTER JOIN.

Joins combine rows from two tables based on a join condition — usually equality on keys (ON a.id = b.a_id). The join type decides what happens when no match exists on one side.

Join type Rows in result When to use
INNER Only rows with a match in both tables Default "related" data — orders with customers who exist
LEFT All rows from left table + matches from right; NULLs if no match Keep all drivers even if no trips; find missing matches (anti-join)
RIGHT Mirror of LEFT (all right + left matches) Rare in practice — rewrite as LEFT by swapping tables
FULL OUTER All rows from both; NULLs where no match Compare two snapshots; find rows only in A or only in B

How to narrate in an interview: "I need every customer, even those with zero orders — that's a LEFT JOIN from customers to orders."

Venn intuition: INNER is the overlap; OUTER types keep non-overlapping rows with NULL-filled columns.

Drill syntax: LEFT JOIN examples, RIGHT JOIN, FULL OUTER JOIN.

A strong answer is:

INNER keeps only matching rows; LEFT keeps all left rows with NULLs when there is no right match; FULL OUTER keeps both sides. I pick the type based on whether I need unmatched rows and say that aloud before coding.

When do you use a self-join?

A self-join joins a table to itself — aliased as two different roles — when relationships exist between rows in the same table.

Common interview patterns:

Pattern Example
Hierarchy Employee → manager (both in employees)
Sequential events Compare each row to the previous row (often replaced by LAG)
Pairing Find users who referred each other, duplicate pairs
sql
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;

LEFT JOIN keeps employees with no manager (CEO). INNER JOIN would drop them.

Modern alternative: LAG / recursive CTEs for hierarchies — but self-joins still appear in screens because they test whether you understand join logic.

A strong answer is:

I self-join when rows in one table relate to other rows in the same table — like employee to manager. I use clear aliases and LEFT JOIN when the parent row might not exist.

How do you find rows in A that have no match in B?

An anti-join returns rows in table A with no corresponding row in table B — "customers who never ordered," "orders without payments."

Three idioms (know all three; prefer NOT EXISTS when NULLs are possible):

sql
-- 1. LEFT JOIN + NULL filter (most visual)
SELECT a.*
FROM orders a
LEFT JOIN payments b ON a.order_id = b.order_id
WHERE b.order_id IS NULL;

-- 2. NOT EXISTS (often efficient; NULL-safe)
SELECT a.*
FROM orders a
WHERE NOT EXISTS (
  SELECT 1 FROM payments b WHERE b.order_id = a.order_id
);

-- 3. NOT IN — only when subquery column is NOT NULL guaranteed
SELECT a.*
FROM orders a
WHERE a.order_id NOT IN (SELECT order_id FROM payments);

NULL trap with NOT IN: If the subquery returns any NULL order_id, the outer NOT IN may return no rows (UNKNOWN logic). NOT EXISTS avoids this.

See EXISTS examples for more patterns.

A strong answer is:

I use LEFT JOIN with WHERE right.key IS NULL or NOT EXISTS for rows in A with no match in B. I avoid NOT IN when the subquery might contain NULLs.

What is a CROSS JOIN and when is it dangerous?

A CROSS JOIN produces the Cartesian product: every row in A paired with every row in B. Row count = |A| × |B|.

Intentional uses:

  • Build a date spine × product list for zero-filled reporting grids
  • Generate combinations (small dimension tables only)

Why it is dangerous: A missing ON clause on an implicit join, or forgetting a filter, can turn 1,000 × 1,000 rows into one million rows — a classic production outage pattern.

Interview signal: When you see unexpectedly huge row counts, check for accidental cross joins.

A strong answer is:

CROSS JOIN pairs every row in A with every row in B. I use it deliberately for small dimension combos like date × product, and I watch for missing join conditions that create accidental Cartesian explosions.

UNION vs JOIN — when to use each?

They solve different problems — interviewers test whether you confuse stacking rows with combining columns.

Operation Direction Requirement Dedup?
JOIN Horizontal — adds columns from another table Related keys N/A
UNION / UNION ALL Vertical — stacks result sets Same number and compatible types of columns UNION yes; UNION ALL no

Examples:

  • JOIN: Add customer_name to each order row
  • UNION ALL: Combine archived_orders and orders into one report

UNION ALL is faster when you do not need deduplication — prefer it unless duplicates must be removed.

Review SQL UNION examples.

A strong answer is:

JOIN combines related tables horizontally on keys; UNION stacks compatible result sets vertically. I use UNION ALL unless I explicitly need deduplication.


Aggregations and GROUP BY traps

What is the difference between COUNT(*) and COUNT(column)?

COUNT seems simple but encodes NULL semantics — senior interviewers use it to test attention to detail.

Expression What it counts
COUNT(*) Rows in the group (or whole table if no GROUP BY) — includes rows where columns are NULL
COUNT(column) Rows where column IS NOT NULL only
COUNT(DISTINCT column) Distinct non-NULL values

Example: 10 users in a group, 3 with NULL email:

  • COUNT(*) = 10
  • COUNT(email) = 7

When it matters: "How many users provided an email?" → COUNT(email). "How many users?" → COUNT(*).

Do not use COUNT(DISTINCT order_id) as a band-aid for join double-counting — fix grain first (see the double-counting question in this section).

A strong answer is:

COUNT(*) counts rows; COUNT(column) counts non-NULL values in that column. I pick based on whether NULL means missing data I should exclude from the count.

How do joins cause double-counting in aggregates?

Double-counting happens when you join before aggregating at the wrong grain — one parent row matches many child rows, so parent metrics get multiplied.

Classic mistake:

sql
-- WRONG: order total counted once per line item
SELECT o.order_id, SUM(o.order_total) AS revenue
FROM orders o
JOIN order_items i ON o.order_id = i.order_id
GROUP BY o.order_id;

If an order has 5 line items, order_total is summed five times.

How to fix (pick grain first):

Where the metric lives Strategy
Order-level column (orders.order_total) Do not join to order_items before SUM unless you need line detail — aggregate at order grain, then join to customers or dates
Line-item metric (quantity × price) Aggregate child rows first to order grain, then join to parent tables
sql
-- Metric comes from line items: aggregate to order grain FIRST, then join via orders
WITH order_revenue AS (
  SELECT order_id, SUM(quantity * unit_price) AS order_total
  FROM order_items
  GROUP BY order_id
)
SELECT c.customer_id, SUM(r.order_total) AS customer_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_revenue r ON o.order_id = r.order_id
GROUP BY c.customer_id;

-- Order total already on orders table: no line-item join needed for revenue
SELECT SUM(order_total) FROM orders;

If you must join one-to-many for filters (e.g., "orders containing product X"), use a distinct order list or subquery at order grain before aggregating order-level dollars.

COUNT(DISTINCT order_id) only when the business metric is literally "count distinct orders" — not a band-aid for inflated SUM.

Interview habit: Say grain aloud — "one row per order," "one row per user per day" — before writing GROUP BY.

A strong answer is:

Double-counting happens when I join one-to-many before aggregating parent metrics. I aggregate line items to order grain first, or keep order-level totals without the line join, then join to customers or time — and I state grain explicitly.

Write a query to find the second highest salary.

Clarify tie behavior first: if two people tie for highest, is there a "second" salary distinct value, or the next rank?

Subquery approach (distinct second value):

sql
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Returns NULL if everyone has the same salary.

Window approach (handles ties with ranking):

sql
WITH ranked AS (
  SELECT salary,
         DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
  FROM employees
)
SELECT DISTINCT salary
FROM ranked
WHERE rnk = 2;
  • DENSE_RANK — 1, 1, 2 when two people tie for top (good for "second distinct salary level")
  • ROW_NUMBER — arbitrary tie-break; use only if interviewer wants exactly one row

Classic interview pattern for second-highest salary with tie handling.

A strong answer is:

I clarify tie rules, then use MAX with a subquery for the second distinct value or DENSE_RANK = 2 when ties matter. I mention NULL when all salaries are equal.

How do you get top 3 salaries per department?

Wrong approach: GROUP BY department_id + LIMIT 3LIMIT applies to the final result, not per group.

Right approach: window function + filter:

sql
WITH ranked AS (
  SELECT *,
         DENSE_RANK() OVER (
           PARTITION BY department_id
           ORDER BY salary DESC
         ) AS rnk
  FROM employees
)
SELECT *
FROM ranked
WHERE rnk <= 3;

Function choice:

Function Top-3 behavior with ties
ROW_NUMBER Exactly 3 rows; ties broken arbitrarily
RANK May return more than 3 rows if ties span rank 3
DENSE_RANK Top 3 salary levels per department

Common pattern for ranking within departments—appears constantly in data interviews.

See select first row in group for related patterns.

A strong answer is:

I partition by department, rank by salary with ROW_NUMBER or DENSE_RANK depending on tie rules, then filter rnk <= 3 in an outer query — never GLOBAL LIMIT after GROUP BY.

Can you use WHERE on an aggregate like AVG(salary)?

No — aggregate expressions are not available in WHERE because filtering happens before grouping.

Options:

sql
-- HAVING (after GROUP BY)
SELECT department_id, AVG(salary) AS avg_sal
FROM employees
GROUP BY department_id
HAVING AVG(salary) > 80000;

-- CTE + WHERE (often clearer in interviews)
WITH dept_avg AS (
  SELECT department_id, AVG(salary) AS avg_sal
  FROM employees
  GROUP BY department_id
)
SELECT * FROM dept_avg WHERE avg_sal > 80000;

CTEs read top-to-bottom like your explanation — interviewers follow the logic more easily than nested HAVING with multiple aggregates.

A strong answer is:

I cannot filter aggregates in WHERE — only in HAVING after GROUP BY, or in an outer query on a CTE. I prefer CTE plus WHERE when the logic has multiple steps.

What is the difference between DISTINCT and GROUP BY?

Both can reduce duplicate rows, but they serve different purposes — confusing them is a common interview mistake.

DISTINCT GROUP BY
Primary job Remove duplicate result rows Form groups for aggregation
Aggregates No aggregates required Usually paired with SUM, COUNT, AVG
All columns Applies to full selected row shape Groups by explicit key columns
sql
-- DISTINCT: unique country values
SELECT DISTINCT country FROM customers;

-- GROUP BY: count per country
SELECT country, COUNT(*) AS n
FROM customers
GROUP BY country;

Similar-looking output: SELECT DISTINCT country vs SELECT country FROM customers GROUP BY country — often the same rows, but meaning differs: one deduplicates; the other defines groups (and could add aggregates).

Misuse: SELECT DISTINCT customer_id, SUM(amount) without GROUP BY is invalid — use GROUP BY customer_id.

A strong answer is:

DISTINCT removes duplicate rows in the result; GROUP BY defines groups for aggregation. They can look similar without aggregates, but I use GROUP BY when I need counts or sums per key.

How do you use CASE WHEN in SQL interviews?

CASE WHEN is the workhorse for conditional logic in queries — especially conditional aggregation and bucketing without pivot syntax.

Patterns interviewers expect:

sql
-- Conditional counts (CTR-style)
SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks

-- Conditional sums
SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_revenue

-- Bucketing users
CASE
  WHEN total_spend >= 1000 THEN 'high'
  WHEN total_spend >= 100 THEN 'medium'
  ELSE 'low'
END AS spend_tier

-- Pivot-style metrics in one query
SUM(CASE WHEN product = 'A' THEN revenue ELSE 0 END) AS product_a_revenue,
SUM(CASE WHEN product = 'B' THEN revenue ELSE 0 END) AS product_b_revenue

Tips: Always include ELSE 0 (or ELSE NULL) in conditional sums when you mean "only count this branch." Use CASE in ORDER BY for custom sort orders.

See SQL IF in SELECT for related patterns.

A strong answer is:

I use CASE WHEN for conditional counts and sums, user bucketing, and pivot-style metrics in one GROUP BY — with explicit ELSE so NULLs do not surprise me in aggregates.


Subqueries and CTEs

When do you prefer a CTE over a nested subquery?

Both break problems into steps; CTEs (WITH) usually win in interviews for readability and reuse.

Prefer a CTE when:

  • Logic has multiple steps — filter events → aggregate per user → rank
  • You reference the same intermediate set twice
  • You want the interviewer to follow top-to-bottom narrative

Nested subquery is fine when:

  • Single simple existence check (WHERE EXISTS (...))
  • One-off scalar subquery (WHERE salary > (SELECT AVG...))
sql
WITH active_users AS (
  SELECT user_id FROM users WHERE status = 'active'
),
monthly AS (
  SELECT user_id, DATE_TRUNC('month', event_at) AS month, COUNT(*) AS events
  FROM events
  WHERE user_id IN (SELECT user_id FROM active_users)
  GROUP BY 1, 2
)
SELECT * FROM monthly;

See SQL WITH clause examples. In production, correlated subqueries per row can be slow — rewrite as JOIN + GROUP BY when scale matters.

A strong answer is:

I use CTEs for multi-step logic and readability in interviews; nested subqueries for simple scalar or EXISTS checks. I rewrite correlated subqueries as joins when performance matters at scale.

What is a correlated subquery? Give an example.

A correlated subquery references columns from the outer query — the inner query is re-evaluated per outer row (conceptually a nested loop).

sql
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (
  SELECT AVG(salary)
  FROM employees
  WHERE department_id = e.department_id  -- correlated: depends on outer e
);

Returns employees earning above their department average.

Equivalent JOIN pattern (often clearer):

sql
WITH dept_avg AS (
  SELECT department_id, AVG(salary) AS avg_sal
  FROM employees
  GROUP BY department_id
)
SELECT e.name, e.salary
FROM employees e
JOIN dept_avg d ON e.department_id = d.department_id
WHERE e.salary > d.avg_sal;

In live interviews, write the version you can explain fastest — then mention the alternative.

A strong answer is:

A correlated subquery references the outer row — like comparing each employee to their department average. I can write it as a subquery or as JOIN to a grouped CTE; I pick whichever is clearer to explain.

EXISTS vs IN — which do you use?

Both test membership; behavior diverges with NULLs and performance on large sets.

Construct Best for NULL caution
IN (list) Small static sets: country IN ('US', 'CA') Subquery with NULL → UNKNOWN
EXISTS Correlated "has at least one related row" Stops at first match; NULL-safe pattern
sql
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.customer_id
    AND o.order_date >= CURRENT_DATE - INTERVAL '30 days'
);

SELECT 1 inside EXISTS is idiomatic — the engine only checks row existence.

See EXISTS examples and NOT IN operator for traps.

A strong answer is:

IN for small static lists; EXISTS for correlated has-row checks, especially when the subquery might contain NULLs. I use NOT EXISTS instead of NOT IN when NULLs are possible.

What is a recursive CTE used for?

A recursive CTE walks hierarchical or graph-like data by repeatedly joining a working set to itself until no new rows appear.

Structure (memorize this shape):

  1. Anchor member — base rows (roots)
  2. UNION ALL
  3. Recursive member — join anchor/recursive set to source table
  4. Termination — recursive part returns no new rows
sql
WITH RECURSIVE emp_tree AS (
  SELECT employee_id, manager_id, name, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL
  UNION ALL
  SELECT e.employee_id, e.manager_id, e.name, t.depth + 1
  FROM employees e
  JOIN emp_tree t ON e.manager_id = t.employee_id
)
SELECT * FROM emp_tree;

Uses: org charts, category trees, bill of materials, generating date sequences (gap detection).

Harder interview topic — fewer loops require writing one from scratch, but recognizing the pattern helps.

A strong answer is:

Recursive CTEs have an anchor query plus a recursive UNION ALL join — I use them for hierarchies like org charts or generating date spines when the database has no built-in generator.


Window functions (high priority)

When do you use a window function instead of GROUP BY?

GROUP BY collapses rows into groups — you lose row-level detail. Window functions (OVER (...)) compute aggregates across a set of rows while keeping every row.

Use GROUP BY when: You only need one row per group (total revenue per day).

Use window functions when: You need both detail and aggregate — each order row plus customer lifetime spend, daily revenue plus running total, employee row plus department rank.

sql
SELECT order_id, customer_id, order_total,
       SUM(order_total) OVER (PARTITION BY customer_id) AS customer_lifetime
FROM orders;

Interview signal: If the question says "for each row, also show…" — think window, not GROUP BY.

See SQL ranking functions for the full family.

A strong answer is:

GROUP BY when I only need one row per group; window functions when I need row-level detail plus a ranking, running total, or comparison to the group — OVER PARTITION BY is the clue.

What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?

All assign an ordering within a partition; they differ on ties.

Function Ties get same rank? Next rank after tie Interview use
ROW_NUMBER No — unique 1,2,3 N/A Pick exactly one row per group (latest event)
RANK Yes Skips (1,1,3) Olympic-style ranking
DENSE_RANK Yes No skip (1,1,2) Top N distinct values per group

Example salaries 100, 100, 90:

  • ROW_NUMBER → 1, 2, 3 (tie broken arbitrarily)
  • RANK → 1, 1, 3
  • DENSE_RANK → 1, 1, 2

Clarify with interviewer before coding top-N problems.

A strong answer is:

ROW_NUMBER gives a unique rank for deduping one row per group; RANK and DENSE_RANK handle ties, with DENSE_RANK leaving no gaps — I ask which tie behavior the question expects.

Write a running total of daily revenue.

A running total is a cumulative SUM over an ordered window — classic analytics pattern.

sql
SELECT order_date,
       revenue,
       SUM(revenue) OVER (
         ORDER BY order_date
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM daily_revenue
ORDER BY order_date;

Frame clause matters:

  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — sum one physical row at a time through the current row (clear running total)
  • Default frames vary by database and may use RANGE instead of ROWS — with duplicate ORDER BY values, peers can be included in the frame in surprising ways
  • In interviews, always specify ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for an unambiguous running total

Variants: PARTITION BY region for per-region running totals.

A strong answer is:

I use SUM() OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and specify ROWS explicitly — default RANGE frames and duplicate order keys can change results otherwise.

What is the difference between ROWS and RANGE in a window frame?

Window frames control which rows participate in a window aggregate. ROWS and RANGE sound similar but behave differently — especially with duplicate ORDER BY values.

Frame Counts Duplicate sort keys
ROWS Physical row offsets (1 row before, current row) Each row is separate — running total moves row by row
RANGE Logical peer group at same sort value All rows tied on ORDER BY key share the same frame boundary

Example: Two rows on 2024-01-01 with revenue 100 and 50.

  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW → second row includes only first + itself (150 cumulative on row 2)
  • Default RANGE behavior may treat both same-date rows as peers — both see the full day's total in the frame

Interview rule: For a clear running total, write:

sql
SUM(revenue) OVER (
  ORDER BY order_date
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

If multiple rows share the same ORDER BY value and row-by-row order matters, add a tie-breaker — e.g., ORDER BY order_date, id — so ROWS frames are fully deterministic.

Senior candidates who explain ROWS vs RANGE stand out on analytics loops.

A strong answer is:

ROWS counts physical row offsets; RANGE groups peers with the same ORDER BY value. I specify ROWS BETWEEN for an unambiguous running total because duplicate dates make RANGE frames behave differently.

How do you calculate day-over-day change with LAG?

LAG reads a previous row in the window; LEAD reads the next row — essential for period-over-period metrics.

sql
WITH with_prev AS (
  SELECT order_date,
         revenue,
         LAG(revenue) OVER (ORDER BY order_date) AS prev_revenue
  FROM daily_revenue
)
SELECT order_date,
       revenue,
       revenue - prev_revenue AS dod_change,
       100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0) AS dod_pct
FROM with_prev;

For readability, compute previous revenue once in a CTE when the same LAG() expression is reused — the inline version with repeated LAG() is also correct.

Common uses:

  • Day-over-day / month-over-month growth
  • Compare each event to the previous event per user (PARTITION BY user_id)
  • Funnel analysis with LEAD for "time to next step"

First row has NULL LAG — handle with COALESCE or filter if needed.

A strong answer is:

LAG pulls the previous row in order — I use it for day-over-day change, often with NULLIF on the denominator for percent change, and PARTITION BY user when the series is per entity.

How do you compute each row as percent of category total?

Divide row value by window aggregate over the partition — no self-join needed.

sql
SELECT category,
       product,
       sales,
       100.0 * sales / SUM(sales) OVER (PARTITION BY category) AS pct_of_category
FROM product_sales;

Tips:

  • Multiply by 100.0 for percentage; omit if interviewer wants ratio 0–1
  • SUM(...) OVER (PARTITION BY category) computes the category total on every row in that category
  • Same pattern works for percent of grand total — omit PARTITION BY

A strong answer is:

I divide row sales by SUM(sales) OVER (PARTITION BY category) — clarify whether they want a ratio or percentage, and use 100.0 for percent.

Why can you not put ROW_NUMBER() in the WHERE clause?

Window functions are evaluated after WHERE, GROUP BY, and HAVING — you cannot filter on them in the same query level.

SQL order of operations (simplified):

FROMWHEREGROUP BYHAVINGwindow functionsSELECTWHERE on outer query

Dedup pattern — latest row per user:

sql
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
  FROM events
)
SELECT * FROM ranked WHERE rn = 1;

Some databases support QUALIFY (BigQuery, Snowflake) to filter windows inline — mention if you know the dialect.

See select first row in group.

A strong answer is:

Window functions run after WHERE, so I filter ROW_NUMBER in an outer query or CTE — latest row per user is PARTITION BY key ORDER BY time DESC, then WHERE rn = 1.

What does NTILE do?

NTILE(n) splits an ordered partition into n buckets of roughly equal size — quartiles, deciles, percentile bands.

sql
SELECT customer_id, total_spend,
       NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_totals;

Use cases in analytics interviews:

  • Customer segmentation (top quartile spenders)
  • Load balancing buckets
  • A/B test stratification

Caveat: Bucket counts are balanced, not bucket ranges — for true percentile cutoffs use PERCENT_RANK or PERCENTILE_CONT (dialect-specific).

A strong answer is:

NTILE splits rows into N ordered buckets — I use it for quartile-style segmentation; I note it balances row counts, not value ranges.


Common live-coding patterns

Find numbers that appear at least three times consecutively.

Consecutive means adjacent in a defined order — usually by id or timestamp, not numeric value alone. Clarify the ordering column before coding.

LAG approach (readable):

sql
WITH flagged AS (
  SELECT num,
         LAG(num, 1) OVER (ORDER BY id) AS prev1,
         LAG(num, 2) OVER (ORDER BY id) AS prev2
  FROM logs
)
SELECT DISTINCT num
FROM flagged
WHERE num = prev1 AND num = prev2;

Self-join alternative: Join logs to itself on id+1 and id+2 — same logic, more verbose.

Confirm whether the output should be distinct num values or full rows before you pick DISTINCT or window deduplication.

A strong answer is:

I confirm the ordering column, then use LAG to compare each row to the two prior rows in that order — or self-join on id+1 and id+2 — and return nums where all three match.

Find duplicate email addresses.

Find values that appear more than once — pure GROUP BY + HAVING:

sql
SELECT email
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

To list all rows with duplicate emails (not just the email string):

sql
SELECT u.*
FROM users u
JOIN (
  SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
) d ON u.email = d.email;

Delete duplicates keeping one row: ROW_NUMBER() in a CTE, delete where rn > 1 (dialect permitting). See select duplicate records and delete duplicate rows.

A strong answer is:

GROUP BY email HAVING COUNT(*) > 1 finds duplicate values; if I need all duplicate rows or to keep one copy, I join back or use ROW_NUMBER in a CTE.

Find the Nth highest salary (parameterized).

Parameterize rank instead of hard-coding 2 for second highest:

sql
WITH ranked AS (
  SELECT salary,
         DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
  FROM employees
)
SELECT salary FROM ranked WHERE rnk = :n LIMIT 1;

Discuss with interviewer:

  • Multiple employees can share rank N → DENSE_RANK may return one salary value with many people
  • ROW_NUMBER = N returns one arbitrary row among ties
  • Empty result if fewer than N distinct salary levels exist

A strong answer is:

I rank salaries with DENSE_RANK or ROW_NUMBER depending on tie rules, filter WHERE rnk = N, and discuss what happens when ties exist or N exceeds distinct values.

How would you compute Day-7 user retention?

Retention = fraction of a cohort that returns on a target day after signup (or first action).

Steps:

  1. Define cohort — users by signup date
  2. Define return event — any activity on day 7 (or exactly day 7 — clarify!)
  3. retained / cohort_size per cohort date
sql
WITH cohort AS (
  SELECT user_id, DATE(signup_at) AS signup_date
  FROM users
),
activity AS (
  SELECT user_id, DATE(event_at) AS activity_date
  FROM events
)
SELECT c.signup_date,
       COUNT(DISTINCT c.user_id) AS cohort_size,
       COUNT(DISTINCT a.user_id) AS retained_d7,
       1.0 * COUNT(DISTINCT a.user_id)
         / NULLIF(COUNT(DISTINCT c.user_id), 0) AS d7_retention_rate
FROM cohort c
LEFT JOIN activity a
  ON c.user_id = a.user_id
 AND a.activity_date = c.signup_date + INTERVAL '7 days'
GROUP BY c.signup_date;

Clarify: Calendar day 7 vs "any time in week 2" changes the join. Report rate (0–1 or multiply by 100 for percent), not only numerator and denominator. Rolling retention windows use the same cohort logic with a different day offset.

A strong answer is:

I define the cohort by signup date, LEFT JOIN activity on the target return day, and return cohort_size, retained_d7, and d7_retention_rate with NULLIF on the denominator — clarifying whether day 7 is an exact calendar day or a window.

How do you calculate rolling 7-day active users?

Rolling 7-day active users = distinct users with at least one event in the last 7 days ending on each report date — a staple analytics metric (DAU/WAU cousins).

Approach: For each calendar day, count distinct user_id where event_date is between report_date - 6 days and report_date (7 days inclusive — clarify definition).

sql
WITH daily_users AS (
  SELECT DATE(event_at) AS event_date, user_id
  FROM events
  GROUP BY 1, 2
),
spine AS (
  SELECT DISTINCT event_date AS report_date FROM daily_users
)
SELECT s.report_date,
       COUNT(DISTINCT d.user_id) AS rolling_7d_active_users
FROM spine s
JOIN daily_users d
  ON d.event_date BETWEEN s.report_date - INTERVAL '6 days' AND s.report_date
GROUP BY s.report_date
ORDER BY s.report_date;

Alternatives: RANGE window on dates (dialect-specific), or self-join with date inequality (watch performance).

Production note: The spine above uses only dates that appear in events — fine for a short interview answer. In production, use a real date spine (generate_series, recursive CTE, or dim_date) so days with zero activity still appear in the report.

Clarify with interviewer: 7 days inclusive vs trailing 168 hours; timezone for DATE(event_at).

Pairs naturally with Day-7 retention (cohort) — rolling active is calendar-window, not cohort-based.

A strong answer is:

I dedupe user-day events, then for each report date count distinct users with activity in the trailing 7 calendar days using BETWEEN on dates — and I clarify inclusive days and timezone up front.

Calculate click-through rate (CTR) by day.

CTR = clicks / impressions — usually aggregated per day or campaign.

sql
SELECT DATE(event_at) AS event_date,
       SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks,
       SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions,
       1.0 * SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END)
         / NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0) AS ctr
FROM events
GROUP BY DATE(event_at);

Proactive mentions:

  • NULLIF(..., 0) prevents divide-by-zero on days with no impressions
  • Use 1.0 * or CAST for floating-point division
  • Alternative: COUNT(*) FILTER (WHERE event_type = 'click') in PostgreSQL

A strong answer is:

I sum clicks and impressions per day with conditional aggregation, divide with NULLIF to avoid zero impressions, and use floating-point division for CTR.

How do you compute a median in SQL?

Median = middle value of ordered data — dialect support varies.

PostgreSQL / SQL Server (ordered-set aggregate):

sql
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;

Portable window approach (interview-friendly):

sql
WITH ordered AS (
  SELECT salary,
         ROW_NUMBER() OVER (ORDER BY salary) AS rn,
         COUNT(*) OVER () AS cnt
  FROM employees
)
SELECT AVG(salary) AS median_salary
FROM ordered
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2);

Even count averages the two middle values. State your dialect — interviewers accept logic with minor syntax fixes.

A strong answer is:

In PostgreSQL I use PERCENTILE_CONT(0.5); otherwise I rank rows and average the middle one or two. I say which dialect I am assuming.

Find missing dates in a daily series.

Detect gaps by comparing a complete date spine to facts.

Concept:

sql
SELECT d.day
FROM dates d
LEFT JOIN daily_metrics m ON d.day = m.day
WHERE m.day IS NULL;

Build the spine:

  • PostgreSQL: generate_series
  • Recursive CTE from min to max date
  • Warehouse dimension table dim_date

Tests whether you think in complete grids — common in analytics screens ("which days did we not report revenue?").

A strong answer is:

I generate a continuous date series, LEFT JOIN facts, and filter WHERE the fact key IS NULL — using generate_series or a recursive CTE depending on dialect.


Design, performance, and senior depth

What is database normalization and when do you denormalize?

Normalization organizes tables to reduce redundancy and update anomalies — each fact lives in one place, linked by keys.

Normal forms (interview shorthand):

Form Idea
1NF Atomic columns; no repeating groups
2NF No partial dependency on part of a composite key
3NF No transitive dependency (non-key → non-key)

OLTP / application databases favor normalized schemas — orders, customers, line items as separate tables.

When to denormalize:

  • Analytics / warehouses — star schema with fact + dimension tables trades storage for faster reads
  • Read-heavy dashboards — pre-joined wide tables or materialized aggregates
  • Latency-sensitive reports when join cost at query time exceeds duplication cost

Interview framing: "Normalized for writes and integrity; denormalized for read performance when we measure the trade-off."

A strong answer is:

Normalization reduces redundancy and update bugs through 1NF–3NF; I denormalize for analytics read performance when measured join cost justifies duplication — star schemas in warehouses, not random duplicate columns in OLTP.

When does an index help, and when does it hurt?

An index is a lookup structure (often B-tree) that speeds find operations at the cost of write maintenance and storage.

Helps when:

  • WHERE filters on high-selectivity columns (few rows match)
  • JOIN keys and foreign keys used in lookups
  • ORDER BY columns that align with index order (covering indexes avoid table lookups)

Hurts or helps little when:

  • Tiny tables (full scan is fine)
  • Low-selectivity columns (gender on balanced data) — optimizer may ignore index
  • Heavy INSERT/UPDATE/DELETE workloads (every index must update)
  • Functions on indexed columnsWHERE YEAR(created_at) = 2024 often cannot use index on created_at (use range on raw column instead)

Composite index tip: Column order matters — put equality filters first, then range.

A strong answer is:

Indexes speed selective WHERE and JOIN keys; they hurt heavy write tables and wrong-column queries. I avoid wrapping indexed columns in functions and match composite index order to query filters.

What is ACID in transactions?

Transactions group SQL statements into one logical unit — critical for payments, inventory, and any multi-step write.

Property Meaning Example
Atomicity All statements commit or none do Transfer debit + credit both succeed or both roll back
Consistency DB moves between valid states Constraints satisfied after commit
Isolation Concurrent transactions do not see each other's partial work Levels: READ COMMITTED, REPEATABLE READ, SERIALIZABLE
Durability Committed data survives crash Write-ahead log on disk

Interview depth: Mention isolation anomalies — dirty read, non-repeatable read, phantom read — and that higher isolation costs concurrency.

See SQL transactions explained.

A strong answer is:

ACID means transactions are all-or-nothing, keep valid state, isolate concurrent sessions, and persist after commit. I pick isolation level based on consistency needs vs concurrency.

How do you approach optimizing a slow SELECT?

Correctness first — a fast wrong answer is worse than a slow right one. Then optimize with evidence, not guesses.

Structured approach:

  1. EXPLAIN / EXPLAIN ANALYZE — sequential scan vs index scan, row estimates vs actuals, join order
  2. Check join keys — missing index on FK columns is a common root cause
  3. Reduce selected columns — avoid SELECT * in production pipelines
  4. Filter early — push predicates; rewrite WHERE YEAR(d) = 2024 → date range
  5. Fix grain — accidental row explosion from joins inflates work
  6. Pre-aggregate — materialized views or summary tables for dashboards

In interviews: Write the correct query, narrate one optimization you would check on EXPLAIN, and mention measuring before/after.

A strong answer is:

I make the query correct, run EXPLAIN ANALYZE, fix bad join grain and missing indexes, push filters without functions on indexed columns, and only then consider materialized views for repeated heavy reports.

What is a final-week checklist before SQL interviews?

Use this as a readiness audit — not a substitute for timed practice.

Problem volume and patterns:

  • 35+ problems done (easy + medium + a few hard)
  • Top-N per group without hesitation (ROW_NUMBER / DENSE_RANK)
  • NULL, COUNT(*) vs COUNT(col), HAVING traps internalized
  • Five window patterns cold: rank, running sum, LAG, % of total, dedupe latest row
  • Know ROWS vs RANGE for running totals; CASE WHEN for conditional aggregation
  • One retention and one rolling active users or gap problem explained end-to-end

Communication:

  • Practice explaining queries aloud while typing
  • Drill stating grain, dialect, and tie behavior before coding

On-site tutorials to review:

Related interview prep: technical specialist, Salesforce data engineer, Interview Questions category.

A strong answer is:

In the final week I time-box medium problems, drill top-N and window patterns, practice narrating grain and NULL handling aloud, and review joins and GROUP BY until the syntax matches how I explain the logic.


Pattern cheat sheet (quick reference)

Pattern Tool
Top N per group ROW_NUMBER / DENSE_RANK + filter
Second highest MAX subquery or DENSE_RANK = 2
Running total SUM() OVER (ORDER BY … ROWS BETWEEN …)
ROWS vs RANGE frames Specify ROWS for row-by-row running totals
Conditional metrics SUM(CASE WHEN … THEN 1 ELSE 0 END)
Rolling 7-day active Distinct users with events in trailing 7-day window
Period-over-period LAG / LEAD
Anti-join LEFT JOIN … WHERE right.key IS NULL or NOT EXISTS
Dedupe latest row ROW_NUMBER partitioned by key, rn = 1
Avoid double-count Aggregate at correct grain in CTE first

Good luck — pattern recognition beats memorizing 200 isolated answers.

Deepak Prasad

R&D Engineer

Founder of GoLinuxCloud with more than 15 years of expertise in Linux, Python, Go, Laravel, DevOps, Kubernetes, Git, Shell scripting, OpenShift, AWS, Networking, and Security. With extensive …