SQL technical interviews reward queries you can write under time pressure—joins, aggregations, subqueries, window functions, and the NULL and top-N patterns that trip people up in live loops. Interviewers care less about keyword recall and more about whether you can turn a vague business question into correct, readable SQL.
Below are 47 questions with sample answers in PostgreSQL-style syntax (most examples also work on MySQL 8+, BigQuery, and Snowflake with minor tweaks). For DBMS interview questions (ER model, normalization, ACID, keys—theory before engine specifics), see DBMS interview questions. For dedicated PostgreSQL interview questions (MVCC, vacuum, index types, replication, DBA topics), see PostgreSQL interview questions. For pandas interview questions when loops compare SQL joins to merge, see pandas interview questions. Open each answer after you try the question yourself; snippets assume tables like employees or orders exist — later sections link to deeper SQL tutorials on this site when you want more practice.
Interview context and how to prepare
What do SQL technical interviews actually test?
SQL interviews are less about memorizing syntax and more about thinking in relations — which rows you need, at what grain, and how to combine tables without lying with numbers.
Three layers interviewers probe (often in that order):
| Layer | What it means | Example signal |
|---|---|---|
| Vocabulary | You know clauses, join types, keys, NULL rules | Explain INNER vs LEFT without guessing |
| Correctness | Output matches the business question | No double-counted revenue; ties handled explicitly |
| Decomposition | You break hard problems into readable steps | CTEs, window frames, narrated assumptions |
Role differences: Data analyst and data engineer loops are usually SQL-heavy (live coding, take-homes). Software engineer loops may use SQL as a filter — still expect joins, aggregates, and one window-function problem at many companies.
What separates strong candidates: They clarify the question (ties, NULLs, timezone, dialect), state grain aloud ("one row per user per day"), and write SQL another engineer can maintain.
What format should you expect?
Formats vary by company and role, but most SQL screens combine time pressure with communication — they want to hear how you think, not only see a final SELECT.
Common formats:
| Format | What happens | How to prepare |
|---|---|---|
| Live coding | Shared editor, 30–60 min, 1–3 problems | Talk through assumptions; leave time to test edge cases |
| Take-home | Larger schema, documented deliverable | Write clean SQL + short README of assumptions |
| Explain / debug | Optimize or fix a broken query | Read execution plan basics; spot wrong join grain first |
| Whiteboard SQL | Pseudocode + key clauses aloud | Practice saying PARTITION BY, GROUP BY, filters out loud |
Always clarify up front:
- Dialect — PostgreSQL, MySQL 8+, BigQuery, Snowflake (syntax for dates, strings, limits)
- Tie behavior — "second highest" with duplicate salaries?
- NULL handling — are missing values excluded or their own category?
- Timezone — for
DATE(event_at)on global events
Interviewers forgive minor syntax slips if logic and communication are solid.
What is a realistic 3–5 week prep plan?
Three to five weeks is enough for most candidates if you practice writing, not only read tutorials. Medium-difficulty pattern recognition is the bar at many data and backend interviews.
| Week | Focus | Target output |
|---|---|---|
| 1 | SELECT, WHERE, GROUP BY, JOINs |
10 easy problems; explain grain on each |
| 2 | Subqueries, CTEs, HAVING traps |
10 medium problems; rewrite one nested query as CTE |
| 3 | Window functions — ROW_NUMBER, RANK, LAG, running totals |
15 medium; five patterns from memory |
| 4 | Hard: gaps, retention, YoY, median | 5–10 timed problems |
| 5 | Mock screens + explain answers aloud | 2 full timed mocks with verbal walkthrough |
Daily habit: For each problem, write assumptions in a comment, solve, then ask "what if NULL?" and "what if tie?"
Medium difficulty defines the interview bar — easy builds confidence; hard teaches edge cases.
Do you need to know every SQL dialect?
No — and nobody expects encyclopedic dialect knowledge in a 45-minute screen. Pick one primary dialect (PostgreSQL or MySQL 8+ are common) and know the transfer differences that actually change interview answers.
| Feature | PostgreSQL | MySQL 8+ |
|---|---|---|
| String concat | || or CONCAT() |
CONCAT() |
| Limit rows | LIMIT n |
LIMIT n |
| Boolean | native true / false |
historically TINYINT(1) |
| Date arithmetic | date - date, INTERVAL '7 days' |
DATEDIFF(), DATE_ADD() |
| Median | PERCENTILE_CONT |
window-function workaround |
Interview strategy: State your dialect at the start ("I'll write PostgreSQL-style"). If you forget DATEDIFF vs interval syntax, say so and keep the logic correct — interviewers care more about joins and grain than memorizing date functions.
For fundamentals, see getting started with SQL on this site.
Fundamentals and core concepts
What is SQL, and how does it relate to RDBMS?
SQL (Structured Query Language) is the standard language for defining, querying, and manipulating relational data — tables with rows and columns linked by keys. An RDBMS (relational database management system) is the engine that stores data and executes SQL: PostgreSQL, MySQL, SQL Server, Oracle, SQLite, and cloud warehouses like BigQuery and Snowflake all speak SQL with dialect differences.
Why interviewers ask this: They want to know you understand SQL is not "a database" — it is the interface to one.
SQL statement categories (useful vocabulary in system-design tangents):
| Category | Purpose | Examples |
|---|---|---|
| DDL | Define structure | CREATE, ALTER, DROP |
| DML | Change data | INSERT, UPDATE, DELETE |
| DQL | Read data | SELECT (most interview focus) |
| DCL | Permissions | GRANT, REVOKE |
| TCL | Transactions | COMMIT, ROLLBACK |
Most technical screens are DQL-heavy — SELECT with joins and aggregates — but senior roles may ask how migrations (DDL) or transactions (TCL) affect production behavior.
A strong answer is:
SQL is the language; the RDBMS is the engine. Interviews focus on SELECT and relational reasoning, but I know DDL/DML/TCL matter for how data is stored, updated, and kept consistent in production.
What is the difference between PRIMARY KEY and FOREIGN KEY?
Keys define identity and relationships between tables — the foundation of every join you write in an interview.
PRIMARY KEY
- Uniquely identifies each row in a table
- Implies NOT NULL (no duplicate keys, no missing keys)
- One primary key per table (can be composite — multiple columns together)
FOREIGN KEY
- Column(s) in a child table that reference a key in a parent table
- Enforces referential integrity — you cannot insert a child row with a parent id that does not exist (unless constraints are deferred or disabled)
Interview example: orders.customer_id → customers.customer_id. A JOIN between orders and customers relies on this relationship.
Practical tip: Interview schemas sometimes omit explicit FK constraints but still have logical keys. Ask: "Which column links these tables?" before joining on the wrong id.
See SQL CREATE TABLE for how keys are declared in DDL.
A strong answer is:
A primary key uniquely identifies rows in a table; a foreign key links a child table to a parent and enforces referential integrity. I confirm join keys before writing JOINs, even when FK constraints are not shown on the schema.
What is the difference between DELETE, TRUNCATE, and DROP?
All three remove data or structure, but at different scope and with different safety implications — a common fundamentals question for backend and data roles.
| Command | What it removes | WHERE clause? |
Typical rollback | Speed |
|---|---|---|---|---|
| DELETE | Matching rows | Yes | Often yes (inside a transaction) | Slower (row-level logging) |
| TRUNCATE | All rows in table | No (table-wide) | Dialect-dependent | Fast (deallocates pages) |
| DROP | Entire table (structure + data) | N/A | Depends on DB and transaction | DDL — metadata change |
Details interviewers probe:
DELETEwithoutWHEREdeletes every row but keeps the table — still logged row-by-row in many enginesTRUNCATEoften resets identity/serial counters and may not fire row-level triggersDROPremoves the table object from the catalog. Rollback behavior depends on the database and transaction context — PostgreSQL can roll back many DDL statements inside a transaction; other engines differ. After commit, recovery usually means backup/restore orCREATEagain — do not assume undo
Production caution: TRUNCATE and DROP are common incident causes; mention transactions and backups on senior loops.
A strong answer is:
DELETE removes specific rows and can often roll back in a transaction; TRUNCATE wipes all rows quickly without a WHERE; DROP removes the table object. I choose based on scope, and I treat DROP rollback as dialect-dependent — after commit you typically need recreation or restore.
What is the difference between WHERE and HAVING?
Both filter rows, but at different stages of query execution — mixing them up is one of the most common SQL mistakes in interviews and production.
Execution order (simplified):
FROM/JOIN— assemble rowsWHERE— filter individual rows before groupingGROUP BY— form groupsHAVING— filter groups after aggregationSELECT/ORDER BY— project and sort
SELECT department_id, AVG(salary) AS avg_sal
FROM employees
WHERE status = 'active' -- row filter: only active employees enter the average
GROUP BY department_id
HAVING AVG(salary) > 80000; -- group filter: only departments above thresholdClassic trap: Putting a non-aggregated column in HAVING that belongs in WHERE (e.g., HAVING status = 'active'). Some databases allow it but it confuses readers and may hurt performance.
Cleaner pattern for complex filters: Aggregate in a CTE, then WHERE on the alias in the outer query.
A strong answer is:
WHERE filters rows before GROUP BY; HAVING filters groups after aggregation. I put row-level conditions in WHERE and aggregate thresholds in HAVING — or use a CTE when the logic gets hard to read.
How does NULL behave in SQL comparisons?
NULL means unknown or missing — not zero, not empty string '', not false. Three-valued logic (TRUE, FALSE, UNKNOWN) drives many interview traps.
Rules to internalize:
| Expression | Result | Why |
|---|---|---|
NULL = NULL |
UNKNOWN (not TRUE) | Cannot confirm two unknowns are equal |
WHERE col = NULL |
Wrong idiom | Use WHERE col IS NULL |
COUNT(*) |
Counts all rows in group | Includes rows where columns are NULL |
COUNT(col) |
Counts non-NULL values only | NULLs skipped |
AVG(col) |
Averages non-NULL values | NULLs ignored, not treated as zero |
Interview follow-ups:
SUMof no rows returns NULL — useCOALESCE(SUM(x), 0)when you need zeroNOT IN (subquery)breaks if the subquery returns NULL — preferNOT EXISTS- Outer joins produce NULL for non-matching columns — anti-joins use
WHERE right.key IS NULL
See ISNULL / NULL handling for dialect-specific functions.
A strong answer is:
NULL is unknown — I never compare with = NULL; I use IS NULL. I remember COUNT(column) skips NULLs, and I avoid NOT IN when the subquery might return NULL.
When do you use COALESCE in SQL?
COALESCE(expr1, expr2, …) returns the first non-NULL argument — the standard tool for NULL-safe display and fallbacks in reports.
Common interview uses:
| Scenario | Pattern |
|---|---|
| Show zero instead of NULL | COALESCE(SUM(amount), 0) when no rows match |
| Fallback value | COALESCE(nickname, name) |
| Outer join reports | COALESCE(b.order_count, 0) when the right side is NULL |
| Safer division setup | Often paired with NULLIF — COALESCE(x / NULLIF(y, 0), 0) |
What COALESCE does not fix: It does not replace proper IS NULL filters in WHERE. COALESCE(col, 0) = 0 matches both NULL and zero — know if that is intended.
vs IFNULL / NVL: Dialect-specific aliases; COALESCE is portable SQL.
See server COALESCE examples for more patterns.
A strong answer is:
I use COALESCE for the first non-NULL fallback — especially zero instead of NULL after outer joins or empty aggregates — and I pair it with NULLIF when I need safe division, not as a substitute for correct NULL filters.
JOINs
Explain INNER, LEFT, RIGHT, and FULL OUTER JOIN.
Joins combine rows from two tables based on a join condition — usually equality on keys (ON a.id = b.a_id). The join type decides what happens when no match exists on one side.
| Join type | Rows in result | When to use |
|---|---|---|
| INNER | Only rows with a match in both tables | Default "related" data — orders with customers who exist |
| LEFT | All rows from left table + matches from right; NULLs if no match | Keep all drivers even if no trips; find missing matches (anti-join) |
| RIGHT | Mirror of LEFT (all right + left matches) | Rare in practice — rewrite as LEFT by swapping tables |
| FULL OUTER | All rows from both; NULLs where no match | Compare two snapshots; find rows only in A or only in B |
How to narrate in an interview: "I need every customer, even those with zero orders — that's a LEFT JOIN from customers to orders."
Venn intuition: INNER is the overlap; OUTER types keep non-overlapping rows with NULL-filled columns.
Drill syntax: LEFT JOIN examples, RIGHT JOIN, FULL OUTER JOIN.
A strong answer is:
INNER keeps only matching rows; LEFT keeps all left rows with NULLs when there is no right match; FULL OUTER keeps both sides. I pick the type based on whether I need unmatched rows and say that aloud before coding.
When do you use a self-join?
A self-join joins a table to itself — aliased as two different roles — when relationships exist between rows in the same table.
Common interview patterns:
| Pattern | Example |
|---|---|
| Hierarchy | Employee → manager (both in employees) |
| Sequential events | Compare each row to the previous row (often replaced by LAG) |
| Pairing | Find users who referred each other, duplicate pairs |
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;LEFT JOIN keeps employees with no manager (CEO). INNER JOIN would drop them.
Modern alternative: LAG / recursive CTEs for hierarchies — but self-joins still appear in screens because they test whether you understand join logic.
A strong answer is:
I self-join when rows in one table relate to other rows in the same table — like employee to manager. I use clear aliases and LEFT JOIN when the parent row might not exist.
How do you find rows in A that have no match in B?
An anti-join returns rows in table A with no corresponding row in table B — "customers who never ordered," "orders without payments."
Three idioms (know all three; prefer NOT EXISTS when NULLs are possible):
-- 1. LEFT JOIN + NULL filter (most visual)
SELECT a.*
FROM orders a
LEFT JOIN payments b ON a.order_id = b.order_id
WHERE b.order_id IS NULL;
-- 2. NOT EXISTS (often efficient; NULL-safe)
SELECT a.*
FROM orders a
WHERE NOT EXISTS (
SELECT 1 FROM payments b WHERE b.order_id = a.order_id
);
-- 3. NOT IN — only when subquery column is NOT NULL guaranteed
SELECT a.*
FROM orders a
WHERE a.order_id NOT IN (SELECT order_id FROM payments);NULL trap with NOT IN: If the subquery returns any NULL order_id, the outer NOT IN may return no rows (UNKNOWN logic). NOT EXISTS avoids this.
See EXISTS examples for more patterns.
A strong answer is:
I use LEFT JOIN with WHERE right.key IS NULL or NOT EXISTS for rows in A with no match in B. I avoid NOT IN when the subquery might contain NULLs.
What is a CROSS JOIN and when is it dangerous?
A CROSS JOIN produces the Cartesian product: every row in A paired with every row in B. Row count = |A| × |B|.
Intentional uses:
- Build a date spine × product list for zero-filled reporting grids
- Generate combinations (small dimension tables only)
Why it is dangerous: A missing ON clause on an implicit join, or forgetting a filter, can turn 1,000 × 1,000 rows into one million rows — a classic production outage pattern.
Interview signal: When you see unexpectedly huge row counts, check for accidental cross joins.
A strong answer is:
CROSS JOIN pairs every row in A with every row in B. I use it deliberately for small dimension combos like date × product, and I watch for missing join conditions that create accidental Cartesian explosions.
UNION vs JOIN — when to use each?
They solve different problems — interviewers test whether you confuse stacking rows with combining columns.
| Operation | Direction | Requirement | Dedup? |
|---|---|---|---|
| JOIN | Horizontal — adds columns from another table | Related keys | N/A |
| UNION / UNION ALL | Vertical — stacks result sets | Same number and compatible types of columns | UNION yes; UNION ALL no |
Examples:
- JOIN: Add
customer_nameto eachorderrow - UNION ALL: Combine
archived_ordersandordersinto one report
UNION ALL is faster when you do not need deduplication — prefer it unless duplicates must be removed.
Review SQL UNION examples.
A strong answer is:
JOIN combines related tables horizontally on keys; UNION stacks compatible result sets vertically. I use UNION ALL unless I explicitly need deduplication.
Aggregations and GROUP BY traps
What is the difference between COUNT(*) and COUNT(column)?
COUNT seems simple but encodes NULL semantics — senior interviewers use it to test attention to detail.
| Expression | What it counts |
|---|---|
COUNT(*) |
Rows in the group (or whole table if no GROUP BY) — includes rows where columns are NULL |
COUNT(column) |
Rows where column IS NOT NULL only |
COUNT(DISTINCT column) |
Distinct non-NULL values |
Example: 10 users in a group, 3 with NULL email:
COUNT(*) = 10COUNT(email) = 7
When it matters: "How many users provided an email?" → COUNT(email). "How many users?" → COUNT(*).
Do not use COUNT(DISTINCT order_id) as a band-aid for join double-counting — fix grain first (see the double-counting question in this section).
A strong answer is:
COUNT(*) counts rows; COUNT(column) counts non-NULL values in that column. I pick based on whether NULL means missing data I should exclude from the count.
How do joins cause double-counting in aggregates?
Double-counting happens when you join before aggregating at the wrong grain — one parent row matches many child rows, so parent metrics get multiplied.
Classic mistake:
-- WRONG: order total counted once per line item
SELECT o.order_id, SUM(o.order_total) AS revenue
FROM orders o
JOIN order_items i ON o.order_id = i.order_id
GROUP BY o.order_id;If an order has 5 line items, order_total is summed five times.
How to fix (pick grain first):
| Where the metric lives | Strategy |
|---|---|
Order-level column (orders.order_total) |
Do not join to order_items before SUM unless you need line detail — aggregate at order grain, then join to customers or dates |
Line-item metric (quantity × price) |
Aggregate child rows first to order grain, then join to parent tables |
-- Metric comes from line items: aggregate to order grain FIRST, then join via orders
WITH order_revenue AS (
SELECT order_id, SUM(quantity * unit_price) AS order_total
FROM order_items
GROUP BY order_id
)
SELECT c.customer_id, SUM(r.order_total) AS customer_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_revenue r ON o.order_id = r.order_id
GROUP BY c.customer_id;
-- Order total already on orders table: no line-item join needed for revenue
SELECT SUM(order_total) FROM orders;If you must join one-to-many for filters (e.g., "orders containing product X"), use a distinct order list or subquery at order grain before aggregating order-level dollars.
COUNT(DISTINCT order_id) only when the business metric is literally "count distinct orders" — not a band-aid for inflated SUM.
Interview habit: Say grain aloud — "one row per order," "one row per user per day" — before writing GROUP BY.
A strong answer is:
Double-counting happens when I join one-to-many before aggregating parent metrics. I aggregate line items to order grain first, or keep order-level totals without the line join, then join to customers or time — and I state grain explicitly.
Write a query to find the second highest salary.
Clarify tie behavior first: if two people tie for highest, is there a "second" salary distinct value, or the next rank?
Subquery approach (distinct second value):
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);Returns NULL if everyone has the same salary.
Window approach (handles ties with ranking):
WITH ranked AS (
SELECT salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
)
SELECT DISTINCT salary
FROM ranked
WHERE rnk = 2;DENSE_RANK— 1, 1, 2 when two people tie for top (good for "second distinct salary level")ROW_NUMBER— arbitrary tie-break; use only if interviewer wants exactly one row
Classic interview pattern for second-highest salary with tie handling.
A strong answer is:
I clarify tie rules, then use MAX with a subquery for the second distinct value or DENSE_RANK = 2 when ties matter. I mention NULL when all salaries are equal.
How do you get top 3 salaries per department?
Wrong approach: GROUP BY department_id + LIMIT 3 — LIMIT applies to the final result, not per group.
Right approach: window function + filter:
WITH ranked AS (
SELECT *,
DENSE_RANK() OVER (
PARTITION BY department_id
ORDER BY salary DESC
) AS rnk
FROM employees
)
SELECT *
FROM ranked
WHERE rnk <= 3;Function choice:
| Function | Top-3 behavior with ties |
|---|---|
ROW_NUMBER |
Exactly 3 rows; ties broken arbitrarily |
RANK |
May return more than 3 rows if ties span rank 3 |
DENSE_RANK |
Top 3 salary levels per department |
Common pattern for ranking within departments—appears constantly in data interviews.
See select first row in group for related patterns.
A strong answer is:
I partition by department, rank by salary with ROW_NUMBER or DENSE_RANK depending on tie rules, then filter rnk <= 3 in an outer query — never GLOBAL LIMIT after GROUP BY.
Can you use WHERE on an aggregate like AVG(salary)?
No — aggregate expressions are not available in WHERE because filtering happens before grouping.
Options:
-- HAVING (after GROUP BY)
SELECT department_id, AVG(salary) AS avg_sal
FROM employees
GROUP BY department_id
HAVING AVG(salary) > 80000;
-- CTE + WHERE (often clearer in interviews)
WITH dept_avg AS (
SELECT department_id, AVG(salary) AS avg_sal
FROM employees
GROUP BY department_id
)
SELECT * FROM dept_avg WHERE avg_sal > 80000;CTEs read top-to-bottom like your explanation — interviewers follow the logic more easily than nested HAVING with multiple aggregates.
A strong answer is:
I cannot filter aggregates in WHERE — only in HAVING after GROUP BY, or in an outer query on a CTE. I prefer CTE plus WHERE when the logic has multiple steps.
What is the difference between DISTINCT and GROUP BY?
Both can reduce duplicate rows, but they serve different purposes — confusing them is a common interview mistake.
| DISTINCT | GROUP BY | |
|---|---|---|
| Primary job | Remove duplicate result rows | Form groups for aggregation |
| Aggregates | No aggregates required | Usually paired with SUM, COUNT, AVG |
| All columns | Applies to full selected row shape | Groups by explicit key columns |
-- DISTINCT: unique country values
SELECT DISTINCT country FROM customers;
-- GROUP BY: count per country
SELECT country, COUNT(*) AS n
FROM customers
GROUP BY country;Similar-looking output: SELECT DISTINCT country vs SELECT country FROM customers GROUP BY country — often the same rows, but meaning differs: one deduplicates; the other defines groups (and could add aggregates).
Misuse: SELECT DISTINCT customer_id, SUM(amount) without GROUP BY is invalid — use GROUP BY customer_id.
A strong answer is:
DISTINCT removes duplicate rows in the result; GROUP BY defines groups for aggregation. They can look similar without aggregates, but I use GROUP BY when I need counts or sums per key.
How do you use CASE WHEN in SQL interviews?
CASE WHEN is the workhorse for conditional logic in queries — especially conditional aggregation and bucketing without pivot syntax.
Patterns interviewers expect:
-- Conditional counts (CTR-style)
SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks
-- Conditional sums
SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_revenue
-- Bucketing users
CASE
WHEN total_spend >= 1000 THEN 'high'
WHEN total_spend >= 100 THEN 'medium'
ELSE 'low'
END AS spend_tier
-- Pivot-style metrics in one query
SUM(CASE WHEN product = 'A' THEN revenue ELSE 0 END) AS product_a_revenue,
SUM(CASE WHEN product = 'B' THEN revenue ELSE 0 END) AS product_b_revenueTips: Always include ELSE 0 (or ELSE NULL) in conditional sums when you mean "only count this branch." Use CASE in ORDER BY for custom sort orders.
See SQL IF in SELECT for related patterns.
A strong answer is:
I use CASE WHEN for conditional counts and sums, user bucketing, and pivot-style metrics in one GROUP BY — with explicit ELSE so NULLs do not surprise me in aggregates.
Subqueries and CTEs
When do you prefer a CTE over a nested subquery?
Both break problems into steps; CTEs (WITH) usually win in interviews for readability and reuse.
Prefer a CTE when:
- Logic has multiple steps — filter events → aggregate per user → rank
- You reference the same intermediate set twice
- You want the interviewer to follow top-to-bottom narrative
Nested subquery is fine when:
- Single simple existence check (
WHERE EXISTS (...)) - One-off scalar subquery (
WHERE salary > (SELECT AVG...))
WITH active_users AS (
SELECT user_id FROM users WHERE status = 'active'
),
monthly AS (
SELECT user_id, DATE_TRUNC('month', event_at) AS month, COUNT(*) AS events
FROM events
WHERE user_id IN (SELECT user_id FROM active_users)
GROUP BY 1, 2
)
SELECT * FROM monthly;See SQL WITH clause examples. In production, correlated subqueries per row can be slow — rewrite as JOIN + GROUP BY when scale matters.
A strong answer is:
I use CTEs for multi-step logic and readability in interviews; nested subqueries for simple scalar or EXISTS checks. I rewrite correlated subqueries as joins when performance matters at scale.
EXISTS vs IN — which do you use?
Both test membership; behavior diverges with NULLs and performance on large sets.
| Construct | Best for | NULL caution |
|---|---|---|
IN (list) |
Small static sets: country IN ('US', 'CA') |
Subquery with NULL → UNKNOWN |
EXISTS |
Correlated "has at least one related row" | Stops at first match; NULL-safe pattern |
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= CURRENT_DATE - INTERVAL '30 days'
);SELECT 1 inside EXISTS is idiomatic — the engine only checks row existence.
See EXISTS examples and NOT IN operator for traps.
A strong answer is:
IN for small static lists; EXISTS for correlated has-row checks, especially when the subquery might contain NULLs. I use NOT EXISTS instead of NOT IN when NULLs are possible.
What is a recursive CTE used for?
A recursive CTE walks hierarchical or graph-like data by repeatedly joining a working set to itself until no new rows appear.
Structure (memorize this shape):
- Anchor member — base rows (roots)
UNION ALL- Recursive member — join anchor/recursive set to source table
- Termination — recursive part returns no new rows
WITH RECURSIVE emp_tree AS (
SELECT employee_id, manager_id, name, 1 AS depth
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.name, t.depth + 1
FROM employees e
JOIN emp_tree t ON e.manager_id = t.employee_id
)
SELECT * FROM emp_tree;Uses: org charts, category trees, bill of materials, generating date sequences (gap detection).
Harder interview topic — fewer loops require writing one from scratch, but recognizing the pattern helps.
A strong answer is:
Recursive CTEs have an anchor query plus a recursive UNION ALL join — I use them for hierarchies like org charts or generating date spines when the database has no built-in generator.
Window functions (high priority)
When do you use a window function instead of GROUP BY?
GROUP BY collapses rows into groups — you lose row-level detail. Window functions (OVER (...)) compute aggregates across a set of rows while keeping every row.
Use GROUP BY when: You only need one row per group (total revenue per day).
Use window functions when: You need both detail and aggregate — each order row plus customer lifetime spend, daily revenue plus running total, employee row plus department rank.
SELECT order_id, customer_id, order_total,
SUM(order_total) OVER (PARTITION BY customer_id) AS customer_lifetime
FROM orders;Interview signal: If the question says "for each row, also show…" — think window, not GROUP BY.
See SQL ranking functions for the full family.
A strong answer is:
GROUP BY when I only need one row per group; window functions when I need row-level detail plus a ranking, running total, or comparison to the group — OVER PARTITION BY is the clue.
What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?
All assign an ordering within a partition; they differ on ties.
| Function | Ties get same rank? | Next rank after tie | Interview use |
|---|---|---|---|
| ROW_NUMBER | No — unique 1,2,3 | N/A | Pick exactly one row per group (latest event) |
| RANK | Yes | Skips (1,1,3) | Olympic-style ranking |
| DENSE_RANK | Yes | No skip (1,1,2) | Top N distinct values per group |
Example salaries 100, 100, 90:
ROW_NUMBER→ 1, 2, 3 (tie broken arbitrarily)RANK→ 1, 1, 3DENSE_RANK→ 1, 1, 2
Clarify with interviewer before coding top-N problems.
A strong answer is:
ROW_NUMBER gives a unique rank for deduping one row per group; RANK and DENSE_RANK handle ties, with DENSE_RANK leaving no gaps — I ask which tie behavior the question expects.
Write a running total of daily revenue.
A running total is a cumulative SUM over an ordered window — classic analytics pattern.
SELECT order_date,
revenue,
SUM(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM daily_revenue
ORDER BY order_date;Frame clause matters:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW— sum one physical row at a time through the current row (clear running total)- Default frames vary by database and may use
RANGEinstead ofROWS— with duplicateORDER BYvalues, peers can be included in the frame in surprising ways - In interviews, always specify
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWfor an unambiguous running total
Variants: PARTITION BY region for per-region running totals.
A strong answer is:
I use SUM() OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and specify ROWS explicitly — default RANGE frames and duplicate order keys can change results otherwise.
What is the difference between ROWS and RANGE in a window frame?
Window frames control which rows participate in a window aggregate. ROWS and RANGE sound similar but behave differently — especially with duplicate ORDER BY values.
| Frame | Counts | Duplicate sort keys |
|---|---|---|
ROWS |
Physical row offsets (1 row before, current row) | Each row is separate — running total moves row by row |
RANGE |
Logical peer group at same sort value | All rows tied on ORDER BY key share the same frame boundary |
Example: Two rows on 2024-01-01 with revenue 100 and 50.
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW→ second row includes only first + itself (150 cumulative on row 2)- Default RANGE behavior may treat both same-date rows as peers — both see the full day's total in the frame
Interview rule: For a clear running total, write:
SUM(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)If multiple rows share the same ORDER BY value and row-by-row order matters, add a tie-breaker — e.g., ORDER BY order_date, id — so ROWS frames are fully deterministic.
Senior candidates who explain ROWS vs RANGE stand out on analytics loops.
A strong answer is:
ROWS counts physical row offsets; RANGE groups peers with the same ORDER BY value. I specify ROWS BETWEEN for an unambiguous running total because duplicate dates make RANGE frames behave differently.
How do you calculate day-over-day change with LAG?
LAG reads a previous row in the window; LEAD reads the next row — essential for period-over-period metrics.
WITH with_prev AS (
SELECT order_date,
revenue,
LAG(revenue) OVER (ORDER BY order_date) AS prev_revenue
FROM daily_revenue
)
SELECT order_date,
revenue,
revenue - prev_revenue AS dod_change,
100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0) AS dod_pct
FROM with_prev;For readability, compute previous revenue once in a CTE when the same LAG() expression is reused — the inline version with repeated LAG() is also correct.
Common uses:
- Day-over-day / month-over-month growth
- Compare each event to the previous event per user (
PARTITION BY user_id) - Funnel analysis with
LEADfor "time to next step"
First row has NULL LAG — handle with COALESCE or filter if needed.
A strong answer is:
LAG pulls the previous row in order — I use it for day-over-day change, often with NULLIF on the denominator for percent change, and PARTITION BY user when the series is per entity.
How do you compute each row as percent of category total?
Divide row value by window aggregate over the partition — no self-join needed.
SELECT category,
product,
sales,
100.0 * sales / SUM(sales) OVER (PARTITION BY category) AS pct_of_category
FROM product_sales;Tips:
- Multiply by
100.0for percentage; omit if interviewer wants ratio 0–1 SUM(...) OVER (PARTITION BY category)computes the category total on every row in that category- Same pattern works for percent of grand total — omit
PARTITION BY
A strong answer is:
I divide row sales by SUM(sales) OVER (PARTITION BY category) — clarify whether they want a ratio or percentage, and use 100.0 for percent.
Why can you not put ROW_NUMBER() in the WHERE clause?
Window functions are evaluated after WHERE, GROUP BY, and HAVING — you cannot filter on them in the same query level.
SQL order of operations (simplified):
FROM → WHERE → GROUP BY → HAVING → window functions → SELECT → WHERE on outer query
Dedup pattern — latest row per user:
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM events
)
SELECT * FROM ranked WHERE rn = 1;Some databases support QUALIFY (BigQuery, Snowflake) to filter windows inline — mention if you know the dialect.
See select first row in group.
A strong answer is:
Window functions run after WHERE, so I filter ROW_NUMBER in an outer query or CTE — latest row per user is PARTITION BY key ORDER BY time DESC, then WHERE rn = 1.
What does NTILE do?
NTILE(n) splits an ordered partition into n buckets of roughly equal size — quartiles, deciles, percentile bands.
SELECT customer_id, total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_totals;Use cases in analytics interviews:
- Customer segmentation (top quartile spenders)
- Load balancing buckets
- A/B test stratification
Caveat: Bucket counts are balanced, not bucket ranges — for true percentile cutoffs use PERCENT_RANK or PERCENTILE_CONT (dialect-specific).
A strong answer is:
NTILE splits rows into N ordered buckets — I use it for quartile-style segmentation; I note it balances row counts, not value ranges.
Common live-coding patterns
Find numbers that appear at least three times consecutively.
Consecutive means adjacent in a defined order — usually by id or timestamp, not numeric value alone. Clarify the ordering column before coding.
LAG approach (readable):
WITH flagged AS (
SELECT num,
LAG(num, 1) OVER (ORDER BY id) AS prev1,
LAG(num, 2) OVER (ORDER BY id) AS prev2
FROM logs
)
SELECT DISTINCT num
FROM flagged
WHERE num = prev1 AND num = prev2;Self-join alternative: Join logs to itself on id+1 and id+2 — same logic, more verbose.
Confirm whether the output should be distinct num values or full rows before you pick DISTINCT or window deduplication.
A strong answer is:
I confirm the ordering column, then use LAG to compare each row to the two prior rows in that order — or self-join on id+1 and id+2 — and return nums where all three match.
Find duplicate email addresses.
Find values that appear more than once — pure GROUP BY + HAVING:
SELECT email
FROM users
GROUP BY email
HAVING COUNT(*) > 1;To list all rows with duplicate emails (not just the email string):
SELECT u.*
FROM users u
JOIN (
SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
) d ON u.email = d.email;Delete duplicates keeping one row: ROW_NUMBER() in a CTE, delete where rn > 1 (dialect permitting). See select duplicate records and delete duplicate rows.
A strong answer is:
GROUP BY email HAVING COUNT(*) > 1 finds duplicate values; if I need all duplicate rows or to keep one copy, I join back or use ROW_NUMBER in a CTE.
Find the Nth highest salary (parameterized).
Parameterize rank instead of hard-coding 2 for second highest:
WITH ranked AS (
SELECT salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
)
SELECT salary FROM ranked WHERE rnk = :n LIMIT 1;Discuss with interviewer:
- Multiple employees can share rank N →
DENSE_RANKmay return one salary value with many people ROW_NUMBER = Nreturns one arbitrary row among ties- Empty result if fewer than N distinct salary levels exist
A strong answer is:
I rank salaries with DENSE_RANK or ROW_NUMBER depending on tie rules, filter WHERE rnk = N, and discuss what happens when ties exist or N exceeds distinct values.
How would you compute Day-7 user retention?
Retention = fraction of a cohort that returns on a target day after signup (or first action).
Steps:
- Define cohort — users by signup date
- Define return event — any activity on day 7 (or exactly day 7 — clarify!)
- retained / cohort_size per cohort date
WITH cohort AS (
SELECT user_id, DATE(signup_at) AS signup_date
FROM users
),
activity AS (
SELECT user_id, DATE(event_at) AS activity_date
FROM events
)
SELECT c.signup_date,
COUNT(DISTINCT c.user_id) AS cohort_size,
COUNT(DISTINCT a.user_id) AS retained_d7,
1.0 * COUNT(DISTINCT a.user_id)
/ NULLIF(COUNT(DISTINCT c.user_id), 0) AS d7_retention_rate
FROM cohort c
LEFT JOIN activity a
ON c.user_id = a.user_id
AND a.activity_date = c.signup_date + INTERVAL '7 days'
GROUP BY c.signup_date;Clarify: Calendar day 7 vs "any time in week 2" changes the join. Report rate (0–1 or multiply by 100 for percent), not only numerator and denominator. Rolling retention windows use the same cohort logic with a different day offset.
A strong answer is:
I define the cohort by signup date, LEFT JOIN activity on the target return day, and return cohort_size, retained_d7, and d7_retention_rate with NULLIF on the denominator — clarifying whether day 7 is an exact calendar day or a window.
How do you calculate rolling 7-day active users?
Rolling 7-day active users = distinct users with at least one event in the last 7 days ending on each report date — a staple analytics metric (DAU/WAU cousins).
Approach: For each calendar day, count distinct user_id where event_date is between report_date - 6 days and report_date (7 days inclusive — clarify definition).
WITH daily_users AS (
SELECT DATE(event_at) AS event_date, user_id
FROM events
GROUP BY 1, 2
),
spine AS (
SELECT DISTINCT event_date AS report_date FROM daily_users
)
SELECT s.report_date,
COUNT(DISTINCT d.user_id) AS rolling_7d_active_users
FROM spine s
JOIN daily_users d
ON d.event_date BETWEEN s.report_date - INTERVAL '6 days' AND s.report_date
GROUP BY s.report_date
ORDER BY s.report_date;Alternatives: RANGE window on dates (dialect-specific), or self-join with date inequality (watch performance).
Production note: The spine above uses only dates that appear in events — fine for a short interview answer. In production, use a real date spine (generate_series, recursive CTE, or dim_date) so days with zero activity still appear in the report.
Clarify with interviewer: 7 days inclusive vs trailing 168 hours; timezone for DATE(event_at).
Pairs naturally with Day-7 retention (cohort) — rolling active is calendar-window, not cohort-based.
A strong answer is:
I dedupe user-day events, then for each report date count distinct users with activity in the trailing 7 calendar days using BETWEEN on dates — and I clarify inclusive days and timezone up front.
Calculate click-through rate (CTR) by day.
CTR = clicks / impressions — usually aggregated per day or campaign.
SELECT DATE(event_at) AS event_date,
SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks,
SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions,
1.0 * SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END)
/ NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0) AS ctr
FROM events
GROUP BY DATE(event_at);Proactive mentions:
NULLIF(..., 0)prevents divide-by-zero on days with no impressions- Use
1.0 *orCASTfor floating-point division - Alternative:
COUNT(*) FILTER (WHERE event_type = 'click')in PostgreSQL
A strong answer is:
I sum clicks and impressions per day with conditional aggregation, divide with NULLIF to avoid zero impressions, and use floating-point division for CTR.
How do you compute a median in SQL?
Median = middle value of ordered data — dialect support varies.
PostgreSQL / SQL Server (ordered-set aggregate):
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;Portable window approach (interview-friendly):
WITH ordered AS (
SELECT salary,
ROW_NUMBER() OVER (ORDER BY salary) AS rn,
COUNT(*) OVER () AS cnt
FROM employees
)
SELECT AVG(salary) AS median_salary
FROM ordered
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2);Even count averages the two middle values. State your dialect — interviewers accept logic with minor syntax fixes.
A strong answer is:
In PostgreSQL I use PERCENTILE_CONT(0.5); otherwise I rank rows and average the middle one or two. I say which dialect I am assuming.
Find missing dates in a daily series.
Detect gaps by comparing a complete date spine to facts.
Concept:
SELECT d.day
FROM dates d
LEFT JOIN daily_metrics m ON d.day = m.day
WHERE m.day IS NULL;Build the spine:
- PostgreSQL:
generate_series - Recursive CTE from min to max date
- Warehouse dimension table
dim_date
Tests whether you think in complete grids — common in analytics screens ("which days did we not report revenue?").
A strong answer is:
I generate a continuous date series, LEFT JOIN facts, and filter WHERE the fact key IS NULL — using generate_series or a recursive CTE depending on dialect.
Design, performance, and senior depth
What is database normalization and when do you denormalize?
Normalization organizes tables to reduce redundancy and update anomalies — each fact lives in one place, linked by keys.
Normal forms (interview shorthand):
| Form | Idea |
|---|---|
| 1NF | Atomic columns; no repeating groups |
| 2NF | No partial dependency on part of a composite key |
| 3NF | No transitive dependency (non-key → non-key) |
OLTP / application databases favor normalized schemas — orders, customers, line items as separate tables.
When to denormalize:
- Analytics / warehouses — star schema with fact + dimension tables trades storage for faster reads
- Read-heavy dashboards — pre-joined wide tables or materialized aggregates
- Latency-sensitive reports when join cost at query time exceeds duplication cost
Interview framing: "Normalized for writes and integrity; denormalized for read performance when we measure the trade-off."
A strong answer is:
Normalization reduces redundancy and update bugs through 1NF–3NF; I denormalize for analytics read performance when measured join cost justifies duplication — star schemas in warehouses, not random duplicate columns in OLTP.
When does an index help, and when does it hurt?
An index is a lookup structure (often B-tree) that speeds find operations at the cost of write maintenance and storage.
Helps when:
WHEREfilters on high-selectivity columns (few rows match)JOINkeys and foreign keys used in lookupsORDER BYcolumns that align with index order (covering indexes avoid table lookups)
Hurts or helps little when:
- Tiny tables (full scan is fine)
- Low-selectivity columns (
genderon balanced data) — optimizer may ignore index - Heavy
INSERT/UPDATE/DELETEworkloads (every index must update) - Functions on indexed columns —
WHERE YEAR(created_at) = 2024often cannot use index oncreated_at(use range on raw column instead)
Composite index tip: Column order matters — put equality filters first, then range.
A strong answer is:
Indexes speed selective WHERE and JOIN keys; they hurt heavy write tables and wrong-column queries. I avoid wrapping indexed columns in functions and match composite index order to query filters.
What is ACID in transactions?
Transactions group SQL statements into one logical unit — critical for payments, inventory, and any multi-step write.
| Property | Meaning | Example |
|---|---|---|
| Atomicity | All statements commit or none do | Transfer debit + credit both succeed or both roll back |
| Consistency | DB moves between valid states | Constraints satisfied after commit |
| Isolation | Concurrent transactions do not see each other's partial work | Levels: READ COMMITTED, REPEATABLE READ, SERIALIZABLE |
| Durability | Committed data survives crash | Write-ahead log on disk |
Interview depth: Mention isolation anomalies — dirty read, non-repeatable read, phantom read — and that higher isolation costs concurrency.
See SQL transactions explained.
A strong answer is:
ACID means transactions are all-or-nothing, keep valid state, isolate concurrent sessions, and persist after commit. I pick isolation level based on consistency needs vs concurrency.
How do you approach optimizing a slow SELECT?
Correctness first — a fast wrong answer is worse than a slow right one. Then optimize with evidence, not guesses.
Structured approach:
EXPLAIN/EXPLAIN ANALYZE— sequential scan vs index scan, row estimates vs actuals, join order- Check join keys — missing index on FK columns is a common root cause
- Reduce selected columns — avoid
SELECT *in production pipelines - Filter early — push predicates; rewrite
WHERE YEAR(d) = 2024→ date range - Fix grain — accidental row explosion from joins inflates work
- Pre-aggregate — materialized views or summary tables for dashboards
In interviews: Write the correct query, narrate one optimization you would check on EXPLAIN, and mention measuring before/after.
A strong answer is:
I make the query correct, run EXPLAIN ANALYZE, fix bad join grain and missing indexes, push filters without functions on indexed columns, and only then consider materialized views for repeated heavy reports.
What is a final-week checklist before SQL interviews?
Use this as a readiness audit — not a substitute for timed practice.
Problem volume and patterns:
- 35+ problems done (easy + medium + a few hard)
- Top-N per group without hesitation (
ROW_NUMBER/DENSE_RANK) - NULL,
COUNT(*)vsCOUNT(col),HAVINGtraps internalized - Five window patterns cold: rank, running sum,
LAG, % of total, dedupe latest row - Know ROWS vs RANGE for running totals; CASE WHEN for conditional aggregation
- One retention and one rolling active users or gap problem explained end-to-end
Communication:
- Practice explaining queries aloud while typing
- Drill stating grain, dialect, and tie behavior before coding
On-site tutorials to review:
Related interview prep: technical specialist, Salesforce data engineer, Interview Questions category.
A strong answer is:
In the final week I time-box medium problems, drill top-N and window patterns, practice narrating grain and NULL handling aloud, and review joins and GROUP BY until the syntax matches how I explain the logic.
Pattern cheat sheet (quick reference)
| Pattern | Tool |
|---|---|
| Top N per group | ROW_NUMBER / DENSE_RANK + filter |
| Second highest | MAX subquery or DENSE_RANK = 2 |
| Running total | SUM() OVER (ORDER BY … ROWS BETWEEN …) |
| ROWS vs RANGE frames | Specify ROWS for row-by-row running totals |
| Conditional metrics | SUM(CASE WHEN … THEN 1 ELSE 0 END) |
| Rolling 7-day active | Distinct users with events in trailing 7-day window |
| Period-over-period | LAG / LEAD |
| Anti-join | LEFT JOIN … WHERE right.key IS NULL or NOT EXISTS |
| Dedupe latest row | ROW_NUMBER partitioned by key, rn = 1 |
| Avoid double-count | Aggregate at correct grain in CTE first |
Good luck — pattern recognition beats memorizing 200 isolated answers.

