Kafka interview questions in 2026 go far past "what is a message broker?" Teams using Apache Kafka for event-driven microservices, audit logs, and real-time pipelines expect you to explain partitions and consumer groups, defend acks=all with min.insync.replicas, and describe how you would fix consumer lag or design exactly-once processing. Kafka interview questions for Java developer loops often pair broker theory with Spring Kafka configuration—producer retries, listener concurrency, and error handlers.
Below are 45 questions for interview questions on Kafka, kafka interview questions and answers, and senior backend or data-engineering roles. Answers are written so you can learn while prepping—tables, diagrams in prose, small simulations, and a strong answer after each topic. Pair this guide with PostgreSQL interview questions for logical replication and CDC slot behavior on source databases, Kubernetes interview questions when brokers run on StatefulSets, Spring Boot interview questions for microservice integration, Java interview questions part 2 for concurrency fundamentals, Python developer interviews when consumers are written in Python, and SQL technical interviews for downstream analytics stores.
Tested on: Ubuntu 25.04 (Plucky Puffin); kernel 6.14.0-37-generic; Python 3.13.3.
Interview context and how to prepare
What do Kafka interviews actually test?
Apache Kafka interviews test whether you understand a distributed commit log—not a classic queue that deletes messages after one consumer reads them.
| Layer | What interviewers probe |
|---|---|
| Architecture | Brokers, topics, partitions, leaders, replicas |
| Producers | Keys, partitioning, acks, idempotence |
| Consumers | Groups, offsets, rebalancing, lag |
| Durability | Replication, ISR, unclean leader election |
| Semantics | At-most-once, at-least-once, exactly-once |
| Operations | Retention, compaction, monitoring, capacity |
| Integration | Spring Kafka, Schema Registry, Connect |
| Role | Emphasis |
|---|---|
| Java backend | Spring producer/consumer config, error handling |
| Platform / SRE | Broker tuning, KRaft, multi-AZ, upgrades |
| Data engineer | Connect, Streams, schema evolution, pipelines |
Kafka interview questions for Java developer — what is different from generic messaging questions?
Java developer loops add client integration depth on top of broker concepts:
| Generic Kafka question | Java-specific follow-up |
|---|---|
| Consumer groups | @KafkaListener concurrency vs partitions |
| Serialization | JsonSerializer, Avro + Schema Registry |
| Error handling | DefaultErrorHandler, DLT (dead-letter topic) |
| Transactions | KafkaTransactionManager with Spring |
| Testing | @EmbeddedKafka, Testcontainers |
Interviewers often ask you to sketch Spring Boot properties for a producer and consumer in the same service—see Spring Boot interviews for broader microservice context.
What is a typical Kafka interview loop?
| Round | Duration | Focus |
|---|---|---|
| Recruiter / HM | 30 min | Event-driven experience, throughput, on-call |
| Fundamentals | 45–60 min | Topics, partitions, consumer groups, offsets |
| Deep technical | 60–90 min | acks, EOS, replication, failure scenarios |
| System design | 45–60 min | Order service events, CDC, metrics pipeline |
| Live exercise | 45 min | Partition assignment math, config fix, lag triage |
| Behavioral | 30 min | Outage stories, rollout discipline |
Live rounds often stress consumer groups and partitions together with delivery semantics—at-least-once, at-most-once, and exactly-once trade-offs.
What is a realistic 4–6 week prep plan?
| Week | Focus | Output |
|---|---|---|
| 1 | Core model — log, topic, partition, broker | Draw producer → topic → consumer group |
| 2 | Producers — keys, acks, retries, idempotence |
Configure a test producer; list trade-offs |
| 3 | Consumers — groups, commits, rebalancing | Run two consumers; observe partition assignment |
| 4 | Durability — ISR, min.insync.replicas, replication |
Explain leader failure walkthrough |
| 5 | EOS + Schema Registry / Avro basics | Document read-process-write with transactions |
| 6 | Java integration + scenarios | Spring @KafkaListener sample; rehearse lag debug |
Run a local KRaft broker or Docker Compose stack and produce/consume with kafka-console-producer / kafka-console-consumer at least once.
Architecture and core concepts
What is Apache Kafka and how is it different from a traditional message queue?
Kafka is a distributed event streaming platform built around an append-only partitioned log. Producers append records; consumers pull and track position via offsets.
| Aspect | Traditional queue (e.g. RabbitMQ) | Kafka |
|---|---|---|
| Message lifecycle | Often deleted after ack | Retained by policy (time/size/compaction) |
| Consumption | Push to consumers | Consumers pull at their pace |
| Replay | Usually not supported | Seek to older offsets |
| Fan-out | Queues or exchanges | Multiple independent consumer groups |
| Ordering | Per queue | Per partition only |
Kafka fits event sourcing, audit trails, stream processing, and decoupled microservices where many teams read the same history.
A strong answer is:
Kafka is a durable, replayable log with pub/sub via consumer groups—not a delete-on-read queue—so multiple services can consume the same topic at different speeds and rewind when needed.
Explain Kafka architecture and its main components.
| Component | Role |
|---|---|
| Broker | Server storing partitions, serving produce/fetch requests |
| Cluster | Multiple brokers for scale and fault tolerance |
| Topic | Named logical stream (e.g. orders) |
| Partition | Ordered, immutable log shard—unit of parallelism |
| Replica | Copy of a partition for durability |
| Leader / follower | One leader serves reads/writes; followers replicate |
| Producer | Publishes records with optional key |
| Consumer | Reads from assigned partitions |
| Consumer group | Cooperative consumers sharing load |
| Controller | Manages leader election (KRaft or legacy ZK mode) |
| ZooKeeper / KRaft | Cluster metadata (KRaft is standard in new deployments) |
Data flow: producer → partition leader → followers replicate → consumers fetch.
A strong answer is:
Brokers host partition leaders and replicas; producers write to leaders; consumer groups divide partitions among members; the controller handles metadata and leader election.
What is a topic and what is a partition?
A topic is a logical category of records. Each topic is split into partitions—physical logs on disk.
| Concept | Detail |
|---|---|
| Partition | Ordered sequence with monotonic offsets (0, 1, 2, …) |
| Parallelism | More partitions → more concurrent consumers (up to 1:1 per group) |
| Ordering | Guaranteed within a partition, not across partitions |
| Key routing | Same key → same partition (default murmur2 hash) |
Partition count is chosen at topic creation—increasing partitions is possible but does not reorder existing keys' history.
Partition assignment by key (simulation):
def partition_for_key(key: str, num_partitions: int) -> int:
# Kafka uses murmur2; this models stable "same key → same partition"
return hash(key) % num_partitions
keys = ["user-42", "user-42", "user-99", "user-42"]
parts = [partition_for_key(k, 6) for k in keys]
print(parts[0] == parts[1] == parts[3], parts)When you click Run, the first value should be True—the same key maps to the same partition index, which is how per-key ordering is preserved.
A strong answer is:
A topic is the stream name; partitions are the ordered shards that enable scale—I choose partition count and keys so ordering and throughput match product rules.
How does Kafka handle message ordering and delivery guarantees?
Ordering: Kafka guarantees order per partition. If you need all events for order-123 ordered, use key = order-123 so they land in one partition.
Delivery is not one setting—it is a stack of choices:
| Layer | Knob |
|---|---|
| Producer | acks, retries, idempotence |
| Broker | Replication, ISR, min.insync.replicas |
| Consumer | When offsets are committed vs when work runs |
There is no global order across partitions—design for partition-local order or use a single partition (limits throughput).
A strong answer is:
Ordering is per partition via key routing; end-to-end delivery semantics come from producer acks, replication, and consumer offset timing together.
What is KRaft and how does it differ from ZooKeeper?
KRaft (Kafka Raft) stores cluster metadata in Kafka itself using a Raft quorum—ZooKeeper is no longer required for new Kafka versions (3.3+ production-ready path; 4.x removes ZK).
| ZooKeeper mode (legacy) | KRaft | |
|---|---|---|
| Metadata | External ZK ensemble | Internal metadata log |
| Operations | Two systems to patch/secure | Single Kafka operational model |
| Scaling metadata | ZK limits at very large scale | Designed for millions of partitions |
Interviewers in 2026 expect you to know new clusters use KRaft and migration projects move off ZK.
A strong answer is:
KRaft embeds metadata in Kafka with Raft consensus, simplifying operations— I treat ZooKeeper as legacy and KRaft as the default for greenfield clusters.
Producers
What happens internally when a producer sends a message to Kafka?
Simplified produce path:
- Serializer turns key/value into bytes
- Partitioner picks partition (key hash or sticky batching)
- Producer batches records for throughput
- Request sent to partition leader broker
- Leader appends to log; followers replicate
- Broker responds based on
ackssetting - On failure, producer retries (may duplicate without idempotence)
Batching (linger.ms, batch.size) trades latency for throughput.
A strong answer is:
The producer serializes, partitions, batches, and sends to the leader; replication and ack level determine when the produce call succeeds and whether retries can create duplicates.
Difference between acks=0, acks=1, and acks=all?
| Setting | Behavior | Trade-off |
|---|---|---|
acks=0 |
Fire-and-forget; no broker ack | Highest throughput; may lose data |
acks=1 |
Leader persisted ack | Fast; loss if leader dies before replication |
acks=all |
All ISR replicas must ack | Strongest durability; higher latency |
Pair acks=all with min.insync.replicas ≥ 2 on the broker and topic replication factor ≥ 3 for meaningful safety—otherwise all can still ack with one replica.
Use acks=0 only for metrics or logs where loss is acceptable.
A strong answer is:
acks=0 is fire-and-forget, acks=1 waits for the leader, acks=all waits for the full ISR—I use all for critical events with proper replication and min.insync.replicas.
Why do producer keys matter?
The key determines partition (when non-null):
| Key | Effect |
|---|---|
| Set | hash(key) % numPartitions — colocate related events |
| Null | Round-robin / sticky partitioning — no key-based order |
Use cases:
orderIdas key — all lifecycle events for one order stay ordereduserId— per-user ordering- Null key — maximum spread when order does not matter
Hot keys can create partition skew—one partition overloaded while others idle.
A strong answer is:
Keys route related events to the same partition for ordering; I avoid hot keys and null keys when I need predictable locality.
What is an idempotent producer?
Enable with enable.idempotence=true (Kafka producer). Broker assigns Producer ID (PID) and tracks sequence numbers per partition—duplicate retries are deduplicated within that producer session.
| Scope | Guarantee |
|---|---|
| Idempotent producer | No duplicate writes to one partition from retries |
| Transactions | Atomic writes across partitions + offset commits |
Idempotence requires acks=all, retries > 0, and appropriate max.in.flight.requests.per.connection (≤ 5 with idempotence enabled).
A strong answer is:
Idempotent producers dedupe retry batches per partition via sequence numbers—necessary but not sufficient alone for end-to-end exactly-once.
Which producer settings do interviewers expect you to know?
| Property | Purpose |
|---|---|
bootstrap.servers |
Broker seed list |
key.serializer / value.serializer |
Byte format |
acks |
Durability vs latency |
retries / delivery.timeout.ms |
Resilience |
enable.idempotence |
Dedupe retries |
compression.type |
lz4, zstd, gzip — bandwidth vs CPU |
linger.ms / batch.size |
Batching tuning |
Misconfigured max.in.flight.requests.per.connection with retries and no idempotence can reorder batches—classic interview trap.
A strong answer is:
I tune acks, idempotence, and in-flight requests together, and I batch with linger/batch.size only after measuring latency impact.
Consumers, groups, and offsets
What is a consumer group and how does it distribute work?
A consumer group shares one group.id. Each partition is assigned to exactly one consumer in the group at a time.
| Consumers | Partitions | Result |
|---|---|---|
| 3 consumers | 6 partitions | ~2 partitions each |
| 6 consumers | 6 partitions | 1:1 — max parallelism for this group |
| 8 consumers | 6 partitions | 2 idle consumers |
Different groups reading the same topic are independent—each maintains its own offsets (fan-out).
Consumer group assignment simulation:
def assign_partitions(partitions, consumers):
assignments = {c: [] for c in consumers}
for i, p in enumerate(partitions):
assignments[consumers[i % len(consumers)]].append(p)
return assignments
parts = [f"P{i}" for i in range(6)]
print(assign_partitions(parts, ["C0", "C1", "C2"]))When you click Run, you should see each consumer name mapped to two partition labels—round-robin style assignment similar in spirit to range/round-robin assignors.
A strong answer is:
A consumer group divides partitions among members so each partition is processed once per group; extra consumers stay idle unless I add partitions.
How are consumer offsets managed?
An offset is the consumer's position in a partition log. Committed offsets are stored in the internal topic __consumer_offsets.
| Commit mode | Behavior |
|---|---|
| Auto commit | Periodic commit — simple; risk if process after commit |
| Manual commit | commitSync / commitAsync after successful processing |
| Transactional | Offsets committed atomically with producer transaction |
At-least-once pattern: process record, then commit offset.
At-most-once pattern: commit offset, then process (may lose on crash).
A strong answer is:
Offsets are the consumer's bookmark—I commit after successful processing for at-least-once, and I disable careless auto-commit on critical pipelines.
What triggers a rebalance and how do you minimize impact?
Rebalance redistributes partitions when group membership changes:
- Consumer joins or leaves
- Consumer exceeds
session.timeout.ms/max.poll.interval.ms - Topic partition count changes
- Cooperative sticky assignor revokes subset incrementally
| Protocol | Impact |
|---|---|
| Eager (range) | Revoke all, reassign — stop-the-world pauses |
| Cooperative sticky | Incremental revoke — preferred in production |
Mitigations:
- Right-size
session.timeout.ms,heartbeat.interval.ms,max.poll.interval.ms - Avoid long processing in poll loop—pause and resume or decouple processing
- Static membership (
group.instance.id) for rolling restarts
A strong answer is:
Rebalances happen on membership or timeout changes—I use cooperative rebalancing, tune poll intervals, and keep processing off the poll thread for long work.
What is consumer lag and how do you debug it?
Lag = difference between log end offset and consumer committed/current offset—how far behind a consumer is.
| Cause | Investigation |
|---|---|
| Slow processing | JVM GC, DB calls in listener, thread pool exhaustion |
| Too few consumers | Consumers < partitions |
| Hot partition | Skewed key distribution |
| Rebalance storm | Frequent join/leave during deploy |
| Downstream bottleneck | Sink cannot keep pace |
Tools: kafka-consumer-groups.sh --describe, Burrow, Datadog, Prometheus exporters.
Fix: scale consumers (up to partition count), optimize handler, increase partitions with key strategy review, fix poison messages.
A strong answer is:
Lag is how far behind consumption is—I find whether the bottleneck is processing time, partition skew, or rebalance churn, then scale or fix handlers with metrics, not guesses.
What are max.poll.interval.ms and session.timeout.ms?
| Setting | Purpose |
|---|---|
session.timeout.ms |
Heartbeat failure → consumer considered dead |
heartbeat.interval.ms |
Must be < session timeout (typically ~1/3) |
max.poll.interval.ms |
Max time between poll() calls before rebalance |
If your listener processes one message for five minutes without polling, the consumer is kicked out—even if heartbeats still run (depending on version/config).
Pattern: poll often, hand off to worker pool, use pause/resume for backpressure.
A strong answer is:
session.timeout detects dead members; max.poll.interval caps time between polls—I never block the poll loop on long synchronous work.
What is isolation.level=read_committed?
Consumers default to read_uncommitted—see all messages including those from aborted transactions.
read_committed hides aborted transactional batches—required for exactly-once consume-transform-produce loops so consumers do not read rolled-back data.
Pair with transactional producers and disabled auto-commit.
A strong answer is:
read_committed lets consumers see only committed transactional records—part of exactly-once pipelines with transactional producers.
Replication, durability, and fault tolerance
How does Kafka ensure data durability and fault tolerance?
Each partition has replication factor N—one leader and N−1 followers on different brokers.
| Mechanism | Role |
|---|---|
| Leader replication | Followers fetch from leader |
| ISR (in-sync replicas) | Replicas caught up enough to be promotable |
| Leader election | New leader chosen from ISR on failure |
unclean.leader.election.enable=false |
Avoid data loss from out-of-sync promotion |
Producer acks=all waits for ISR acks—not merely any replica.
A strong answer is:
Replication across brokers plus ISR tracking and safe leader election keeps partitions available after node loss—I pair that with acks=all and min.insync.replicas.
What is the in-sync replica set (ISR)?
ISR = replicas whose lag is below replica.lag.time.max.ms (or byte lag thresholds in older configs).
Only ISR members can become leader if unclean.leader.election.enable=false.
If ISR shrinks to one replica:
- Cluster still serves traffic
- No redundancy until followers catch up
- Risk if that single broker fails
Monitor ISR shrink events—they predict durability risk.
A strong answer is:
ISR is the set of replicas safe to promote— I alert when ISR size drops and I never enable unclean election on critical topics.
What is min.insync.replicas?
Broker/topic setting min.insync.replicas defines how many replicas must ack a write for acks=all to succeed.
Example: replication.factor=3, min.insync.replicas=2
- Tolerates one broker loss without stopping writes
- Producer with
acks=allfails if only one ISR member available—prefer failing writes over silent data loss
Interview trap: acks=all with min.insync.replicas=1 and RF=3 still allows single-replica commits.
A strong answer is:
min.insync.replicas enforces how many replicas must ack before a write counts—I set it to 2 with RF=3 so one broker can die without weakening durability.
What happens when a Kafka broker fails?
| Scenario | Effect |
|---|---|
| Follower dies | ISR may shrink; leaders continue |
| Leader dies | Controller elects new leader from ISR |
| Multiple brokers die | Partitions with no ISR leader go offline |
| Controller failure | Failover to standby controller (KRaft quorum) |
Clients refresh metadata and discover new leaders—brief produce/fetch errors during election.
Operations: rack awareness (broker.rack) for cross-AZ placement, monitor under-replicated partitions.
A strong answer is:
Follower loss reduces redundancy; leader loss triggers ISR election; clients retry after metadata refresh—I design RF and rack awareness so single-AZ loss does not take topics offline.
Delivery semantics and exactly-once
Explain at-most-once, at-least-once, and exactly-once semantics.
| Semantic | Guarantee | Typical implementation |
|---|---|---|
| At-most-once | May lose; no duplicates | Commit offset before process; acks=0 |
| At-least-once | No loss; may duplicate | Process then commit; retries without idempotence |
| Exactly-once | Neither loss nor duplicate | Idempotent producer + transactions / Streams EOS |
Delivery semantics simulation:
def at_most_once(process, commit, events):
lost = []
for e in events:
commit(e)
if e == "crash":
lost.append("unprocessed-after-commit")
break
process(e)
return lost
def at_least_once(process, commit, events):
dup = []
processed = set()
for e in events:
if e != "crash":
process(e)
processed.add(e)
else:
for p in processed:
process(p)
dup.append(p)
break
commit(e)
return dup
events = ["A", "B", "crash"]
print("at-most-once lost:", at_most_once(lambda x: None, lambda x: None, events))
print("at-least-once dups:", at_least_once(lambda x: None, lambda x: None, events))When you click Run, you should see at-most-once report a lost-after-commit case and at-least-once list duplicates after a simulated crash—illustrating why idempotent handlers matter.
A strong answer is:
At-most-once commits early; at-least-once commits after work and may retry duplicates; exactly-once needs broker and application cooperation—I default to at-least-once with idempotent consumers unless EOS is justified.
How do you achieve exactly-once processing in Kafka?
Broker-side pieces:
- Idempotent producer — dedupe per-partition retries
- Transactions —
initTransactions,beginTransaction,commitTransaction sendOffsetsToTransaction— atomic offset commit with output records- Consumer
isolation.level=read_committed
Application-side: processing must be deterministic; external sinks need idempotent writes or transactional stores—EOS in Kafka does not magically dedupe your database.
Kafka Streams offers processing.guarantee=exactly_once_v2 packaging the pattern.
A strong answer is:
Exactly-once combines idempotent producers, transactional writes, and committed offset reads—I still make downstream sinks idempotent because EOS stops at Kafka's boundary.
Walk through a consume-transform-produce transaction.
Steps:
- Producer
initTransactions() - Consumer polls records
beginTransaction()- Process and produce output records
sendOffsetsToTransactionwith consumed offsetscommitTransaction()— all visible or none
On failure: abortTransaction() — consumers with read_committed never see partial output.
Requires unique transactional.id per producer instance (fence zombies after failover).
A strong answer is:
I wrap output records and input offset commits in one transaction so a crash cannot double-publish or skip commits visible to downstream readers.
Why do you still need idempotent consumers with at-least-once?
Most teams run at-least-once (simpler than full transactions). Retries and rebalance redelivery mean duplicate delivery is normal.
Consumer strategies:
| Strategy | Example |
|---|---|
| Natural idempotence | Upsert by primary key |
| Dedup store | Redis/DB of processed event IDs |
| Transactional DB | Unique constraint on event_id |
"Exactly-once" in interviews often means effective exactly-once—at-least-once transport + idempotent processing.
A strong answer is:
At-least-once will redeliver—I design handlers to dedupe on business keys or store processed IDs so duplicates are harmless.
Schema Registry, Connect, and stream processing
What is Schema Registry and why use Avro with Kafka?
Schema Registry (often Confluent) stores Avro/JSON/Protobuf schemas with versioning.
| Benefit | Detail |
|---|---|
| Compatibility | BACKWARD, FORWARD, FULL modes |
| Evolution | Add fields with defaults safely |
| Compact payloads | Avro binary + schema id in message |
Producers send schema ID; consumers fetch schema from registry—contract between teams.
Interview talking point: breaking schema change blocked by compatibility mode—coordinate with consumers before deploy.
A strong answer is:
Schema Registry versions contracts between producers and consumers—I use backward-compatible Avro evolution and test compatibility in CI.
What is Kafka Connect?
Kafka Connect is a framework for source and sink connectors:
| Type | Example |
|---|---|
| Source | Debezium CDC from PostgreSQL → Kafka |
| Sink | S3, Elasticsearch, JDBC sink |
Runs as distributed workers with offset tracking in Kafka—fits ETL without custom consumer boilerplate.
Pair with SQL interviews when discussing CDC and warehouse loads.
A strong answer is:
Kafka Connect moves data in and out with connectors—I use CDC sources and managed sinks instead of one-off consumers when the pattern is standard.
What is Kafka Streams at interview level?
Kafka Streams is a Java library for stream processing on top of Kafka:
- Stateful operations (aggregations, joins) with changelog topics
- Exactly-once v2 processing guarantee option
- No separate cluster like Flink—runs as your app
Use when team is JVM-heavy and logic fits Kafka-centric topologies; use Flink/Spark when you need complex event-time windowing at massive scale.
A strong answer is:
Kafka Streams embeds processing in Java apps with state stores and EOS options—I choose it for JVM microservices, Flink when operability needs a dedicated cluster.
What is log compaction vs time-based retention?
| Policy | Behavior |
|---|---|
| delete (time/size) | Old segments removed after retention.ms |
| compact | Keep latest record per key; tombstones remove keys |
Compaction suits changelog topics—config snapshots, __consumer_offsets, compacted state topics in Streams.
Not a substitute for infinite raw event history—compacted topics are key-value latest views.
A strong answer is:
Time retention drops old data by age; compaction keeps the latest value per key for changelog-style topics like config or state stores.
Operations, tuning, and monitoring
How do retention settings affect topics?
| Setting | Effect |
|---|---|
retention.ms |
Max age before segment delete |
retention.bytes |
Max size per partition |
segment.ms / segment.bytes |
File roll size |
Long retention enables replay for new consumer groups or reprocessing—costs disk.
Tiered storage (vendor/cloud features) moves old segments to object storage.
A strong answer is:
Retention balances replay needs and disk cost—I size retention per topic based on compliance and consumer onboarding, not one global default.
What metrics do you monitor in production Kafka?
| Metric | Signals |
|---|---|
| Under-replicated partitions | Replication lag |
| Offline partitions | Availability incident |
| Consumer lag | Processing backlog |
| Request latency (produce/fetch) | Broker load |
| ISR shrink/expand | Durability risk |
| Disk usage | Retention pressure |
Alert on lag SLO breach and URP > 0 sustained—not only broker up/down.
A strong answer is:
I monitor lag, under-replication, and disk—I tie alerts to consumer SLOs and broker health, not just process running.
How do you plan topic partition count and broker capacity?
| Factor | Guidance |
|---|---|
| Target throughput | Partitions parallelize consumers |
| Ordering | Fewer partitions if strict global order (bottleneck) |
| Consumer count | Partitions ≥ max consumers in group |
| Broker load | Each partition has file handles and leader CPU |
| Future growth | Easier to add partitions than shrink |
Load test with expected message size and compression; watch disk I/O and network.
A strong answer is:
I size partitions for peak consumer parallelism and key ordering needs, then load-test brokers—partition count is hard to reduce later.
How do you handle poison messages?
Poison message — fails processing every retry, blocks partition or causes infinite loop.
Patterns:
| Pattern | Detail |
|---|---|
| Retry topic | Limited retries with backoff |
| DLT (dead-letter topic) | Spring DeadLetterPublishingRecoverer |
| Quarantine store | Manual triage |
| Skip with metric | Only for non-critical data |
Always log key, offset, partition, stack trace; alert on DLT rate.
A strong answer is:
I cap retries, route failures to a dead-letter topic with metadata, and alert—never spin forever on the same offset without visibility.
How do you secure Kafka in production?
| Layer | Control |
|---|---|
| Network | TLS encryption in transit |
| Authentication | SASL (SCRAM, OAuth/OIDC) |
| Authorization | ACLs or RBAC (managed offerings) |
| Multi-tenant | Separate topics + ACLs per team |
Never expose plaintext brokers to the public internet; rotate credentials; least-privilege ACLs per producer/consumer principal.
A strong answer is:
TLS plus SASL authentication and topic-level ACLs—I give each service only produce or consume rights on the topics it needs.
Comparisons and system design
Kafka vs RabbitMQ — when do you pick each?
| Factor | RabbitMQ | Kafka |
|---|---|---|
| Model | Smart broker, queues | Dumb broker, smart consumers |
| Routing | Exchanges, bindings | Topics/partitions |
| Retention | Delete on ack | Configurable log |
| Replay | Limited | First-class |
| Throughput | Moderate | Very high horizontal scale |
| Task queues | Excellent | Possible but not primary fit |
Pick RabbitMQ for RPC-style work queues; Kafka for event streams, audit logs, and multiple independent readers.
A strong answer is:
RabbitMQ for task routing and low-latency queues; Kafka for durable event streams, replay, and high-throughput fan-out—I do not force all messaging through one tool.
How do you design events for microservices?
Practices:
| Practice | Why |
|---|---|
| Past-tense names | OrderPlaced, not PlaceOrder |
| Versioned payloads | Schema evolution |
| Include metadata | eventId, occurredAt, correlationId |
| Idempotent handlers | At-least-once reality |
| Bounded context topics | Avoid god topic |
Align with full stack and Spring Boot service boundaries.
A strong answer is:
Events are contracts—I name them as facts, version schemas, include correlation IDs, and design consumers to tolerate redelivery.
What is change data capture (CDC) with Kafka?
CDC streams database row changes to Kafka—often Debezium on Kafka Connect.
| Use | Benefit |
|---|---|
| Cache invalidation | Downstream read models update |
| Search index | Elasticsearch sync |
| Warehouse | Snowflake/BigQuery ingestion |
| Decouple services | Without dual writes |
Challenge: ordering per key, schema changes, initial snapshot load.
A strong answer is:
CDC publishes database changes as events—I use Debezium for near-real-time sync instead of brittle dual-write patterns between services and DBs.
Java and Spring Boot integration
How do you configure a Kafka producer in Spring Boot?
Add spring-kafka and configure:
spring.kafka.bootstrap-servers=localhost:9092
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.springframework.kafka.support.serializer.JsonSerializer
spring.kafka.producer.acks=all
spring.kafka.producer.properties.enable.idempotence=true@Service
@RequiredArgsConstructor
public class OrderEventPublisher {
private final KafkaTemplate<String, OrderPlacedEvent> kafkaTemplate;
public void publish(OrderPlacedEvent event) {
kafkaTemplate.send("orders", event.orderId(), event);
}
}Use orderId as key for partition locality. Configure retries and delivery timeout explicitly for critical topics.
A strong answer is:
Spring Kafka wraps KafkaTemplate with serializers and producer props—I set acks, idempotence, and keys explicitly for durable ordered per-order events.
How do you configure a @KafkaListener consumer in Spring Boot?
@Component
@Slf4j
public class OrderListener {
@KafkaListener(topics = "orders", groupId = "billing-service", concurrency = "3")
public void onOrder(OrderPlacedEvent event) {
log.info("billing {}", event.orderId());
}
}| Setting | Note |
|---|---|
concurrency |
Listener threads—≤ partitions for efficiency |
groupId |
Consumer group per logical service |
autoStartup |
Control lifecycle |
| Error handler | DefaultErrorHandler + DLT |
Disable enable-auto-commit for manual ack mode when you need at-least-once discipline:
@KafkaListener(...)
public void listen(ConsumerRecord<String, OrderPlacedEvent> record, Acknowledgment ack) {
process(record.value());
ack.acknowledge();
}A strong answer is:
I match listener concurrency to partitions, use explicit ack when needed, and wire error handlers to dead-letter topics instead of infinite retry loops.
How do you test Kafka integration in Java?
| Approach | Use |
|---|---|
@EmbeddedKafka |
Fast unit/integration tests in JVM |
| Testcontainers | Real broker behavior in CI |
KafkaTemplate + @KafkaListener |
End-to-end in test context |
Assert with KafkaTestUtils.getSingleRecord or awaitility on side effects.
Mirror Spring Boot testing pyramid—many handler unit tests, few full broker tests.
A strong answer is:
I unit-test handlers without Kafka, use EmbeddedKafka or Testcontainers for wiring tests, and keep broker tests focused on serialization and ack behavior.
Senior scenarios and final prep
Scenario: Consumer lag spiked after a deployment — how do you respond?
| Step | Action |
|---|---|
| 1 | Confirm which group/topic/partition—dashboard or kafka-consumer-groups |
| 2 | Correlate with deploy time—new code slower? rebalance storm? |
| 3 | Check hot partitions — skewed keys |
| 4 | Thread/GC logs on consumers; DB latency in handler |
| 5 | Roll back if regression; scale consumers if CPU-bound and partitions allow |
| 6 | Temporary partition increase only with key strategy review |
| 7 | Post-incident: add lag alert, load test, poison message guard |
A strong answer is:
I isolate the group and partition skew, compare release timing, inspect handler latency and rebalances, then fix code or scale consumers with metrics proving the bottleneck.
What should you rehearse the week before a Kafka interview?
Checklist:
- Whiteboard producer → topic partitions → consumer group
- Explain
acks=all+min.insync.replicaswith numbers - Contrast at-least-once vs exactly-once with idempotent consumer
- Walk rebalance triggers and cooperative protocol
- Partition key strategy for one real domain (orders, payments)
- Spring Kafka producer +
@KafkaListener+ DLT pattern - One lag or broker failure STAR story
- Java concurrency refresh for poll/thread issues
- Spring Boot microservice context
A strong answer is:
I rehearse one event pipeline end to end—keys, acks, consumer group, failure handling, and Java config—so answers sound like production war stories, not glossary recitation.
Pattern cheat sheet (quick reference)
| Need | Kafka starting point |
|---|---|
| Per-key ordering | Producer key → partition |
| Max consumer parallelism | Partition count ≥ consumers |
| Durability | acks=all, RF=3, min.insync.replicas=2 |
| Dedupe retries | enable.idempotence=true |
| At-least-once | Process then commit offset |
| EOS in Kafka | Transactions + read_committed |
| Schema evolution | Schema Registry + backward compatible Avro |
| Poison messages | DLT + limited retries |
| Lag triage | Group describe, skew, handler time, rebalances |
| Java integration | Spring KafkaTemplate + @KafkaListener |
| New clusters | KRaft metadata mode |
References
Official Apache Kafka documentation
On-site prep
- Spring Boot interview questions for experienced developers
- Java interview questions — part 2
- Python developer interview questions
- Full stack developer interviews
- SQL technical interview questions
- Azure developer interviews
- Amazon AWS interview questions
- Git interview questions
- Interview Questions category
Summary
Kafka interviews connect partitions, consumer groups, and offsets to delivery semantics and operational trade-offs—not broker definitions alone. Java-heavy loops add Spring Kafka configuration, listener concurrency, and dead-letter handling. Answer aloud and compare your structure to each section. Pair with Spring Boot and SQL when events land in analytics stores.

