A Salesforce Data Engineer interview goes deeper than a generic data pipeline loop. You are expected to reason about CRM data models, Data Cloud ingestion, cloud ETL patterns, and operating pipelines in production—not only naming tools on a resume.
Below are 45 questions organized from role context through SQL, platform specifics, system design, and behavioral prompts. Each answer stays collapsed until you are ready to check your work after answering out loud.
If you are preparing for a Salesforce Developer loop (Apex, LWC, triggers, governor limits), see the dedicated Salesforce developer interview questions guide—this page stays focused on data engineering depth.
Role context and interview process
What does a Salesforce Data Engineer actually own in 2026?
A Salesforce Data Engineer usually owns the data pipelines and models that make CRM, analytics, Data Cloud, and AI features reliable.
You are not only moving rows between systems. You are responsible for making sure customer, sales, marketing, and engagement data is accurate, timely, governed, and usable.
Typical ownership includes:
- Building batch and near-real-time pipelines
- Ingesting data from Salesforce CRM, external systems, APIs, files, and event streams
- Designing bronze / silver / gold style data layers
- Modeling CRM entities like Account, Contact, Lead, Opportunity, Case, and Activity
- Handling data quality checks, SLAs, backfills, retries, and schema changes
- Supporting identity resolution and unified customer/profile views in Data Cloud
- Partnering with analytics, product, security, and platform teams
- Protecting PII with retention, masking, audit, and access controls
A strong candidate does not only say:
I build ETL pipelines.A better answer connects pipelines to business reliability:
I build and operate trusted data pipelines that ingest CRM and engagement data, transform it into analytics-ready models, validate quality, and make it available for reporting, Data Cloud, AI, and downstream business workflows.A strong answer is:
I see the Salesforce Data Engineer role as owning reliable CRM-centric data pipelines from ingestion to serving. That includes data modeling, quality checks, backfills, SLA monitoring, schema evolution, governance, and making data trustworthy for analytics, Data Cloud, and AI use cases.
What is the typical Salesforce Data Engineer interview loop?
The exact loop depends on the team, but a Salesforce Data Engineer interview commonly tests SQL, data modeling, coding, system design, production ownership, and collaboration.
A typical loop may include:
- Recruiter screen — background, location, compensation, role fit
- Hiring manager round — projects, ownership, communication, business impact
- SQL round — joins, window functions, CTEs, aggregations, data quality queries
- Coding round — Python, Java, or Scala-style data problems
- Data modeling / system design — pipeline design, CDC, warehouse modeling, SLAs
- Behavioral round — incidents, ambiguity, conflict, leadership, Trust mindset
For senior roles, expect deeper discussion on:
- Pipeline reliability
- Cost and performance tuning
- Data contracts
- Multi-tenant data handling
- PII/governance
- Backfills and incident recovery
- Cross-team influence
For Salesforce-specific teams, prepare examples around:
- CRM data semantics
- Account / Contact / Opportunity modeling
- Data Cloud ingestion and identity resolution
- Marketing, Sales, Service, Tableau, MuleSoft, or Slack-adjacent data flows
A strong answer is:
I would expect SQL, data modeling, coding, system design, and behavioral rounds. For senior roles, I would prepare production stories around pipeline failures, backfills, SLA misses, data quality incidents, and trade-offs in architecture.
Do I need Apex and SOQL for a Data Engineer interview at Salesforce?
For most Salesforce Data Engineer roles, Apex is not the main focus. The core interview usually focuses on data engineering skills such as SQL, Python, Spark, cloud data platforms, orchestration, and distributed pipeline design.
However, Salesforce-specific knowledge helps because the data often comes from Salesforce products.
Useful Salesforce knowledge includes:
| Area | What to know |
|---|---|
| CRM objects | Account, Contact, Lead, Opportunity, Case, Activity |
| SOQL basics | How Salesforce data is queried at source |
| Data Cloud / Data 360 | Data Streams, DLOs, DMOs, identity resolution |
| CRM semantics | Opportunity stages, close dates, ownership, Account hierarchy |
| Integration | APIs, CDC, MuleSoft-style integration patterns |
| Governance | PII, consent, retention, audit, access control |
Apex is useful if the role is closer to Salesforce platform development or Sales/Service Cloud customization. But for data-platform roles, interviewers usually care more about whether you can build reliable data pipelines and understand CRM data correctly.
A strong answer is:
Apex is useful for Salesforce platform roles, but for a Data Engineer interview I would focus more on SQL, Python, Spark, cloud warehouses, orchestration, data modeling, CDC, and Data Cloud concepts. I would still learn basic CRM objects and SOQL so I understand the source data.
How should I structure a 4-week prep plan?
A good 4-week plan should prepare you for SQL, data modeling, pipeline design, coding, and production stories.
| Week | Focus | What to practice |
|---|---|---|
| 1 | SQL | Window functions, joins, anti-joins, CTEs, dedup, top-N, SCD checks |
| 2 | Data modeling | Account, Contact, Opportunity, stage history, customer identity, fact/dimension grain |
| 3 | Pipeline design | CDC, bronze/silver/gold, idempotency, backfills, schema evolution, SLAs |
| 4 | Coding + behavioral | Python data problems, system design outlines, STAR stories |
For SQL, practice CRM-style problems:
- Top opportunities by region
- Month-over-month revenue
- Closed opportunities missing stage history
- Duplicate contacts by email
- Latest status per account
- SCD Type 2 overlap detection
For system design, prepare:
- CRM CDC pipeline into a warehouse
- Customer identity unification pipeline
- Data quality framework
- Reliable backfill design
- Dashboard-serving mart with SLA
For behavioral, prepare stories on:
- Production incident
- Data quality issue
- Disagreement with stakeholder
- Performance improvement
- Ambiguous requirement
A strong answer is:
I would spend the first week on SQL, the second on CRM data modeling, the third on pipeline and system design, and the fourth on coding plus behavioral stories. I would prepare real examples with volume, latency, error rate, cost, and business impact.
SQL and data modeling (CRM-centric)
Write SQL to find the top 3 salaries in each department.
This is a classic window-function question. The interviewer wants to see whether you understand partitioning, ranking, and tie handling.
WITH ranked AS (
SELECT
employee_id,
department,
salary,
DENSE_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS salary_rank
FROM employees
)
SELECT
employee_id,
department,
salary
FROM ranked
WHERE salary_rank <= 3
ORDER BY department, salary DESC;Use DENSE_RANK() when ties should receive the same rank. For example, if two employees share the highest salary, both get rank 1.
Use ROW_NUMBER() if the interviewer wants exactly three rows per department, even when salaries tie.
Clarify these points before writing the final query:
- Should ties be included?
- Should null salaries be ignored?
- Should departments with fewer than three employees return fewer rows?
- Is salary stored as annual, monthly, or effective-dated compensation?
A strong answer is:
I would use a window function partitioned by department and ordered by salary descending. I would choose
DENSE_RANK()if ties should be included, orROW_NUMBER()if the requirement is exactly three rows per department.
When do you use INNER JOIN vs LEFT JOIN in analytics SQL?
Use an INNER JOIN when matching records must exist on both sides.
Use a LEFT JOIN when you want to preserve all rows from the left table, even if the right table has no match.
| Join type | Use when | Example |
|---|---|---|
INNER JOIN |
Both sides are required | Opportunities that must have valid Accounts |
LEFT JOIN |
Missing relationship is meaningful | All Accounts, even those with no Opportunities |
CRM example:
Account LEFT JOIN OpportunityThis is useful when you want to show all accounts and count open opportunities. Accounts with zero opportunities should still appear.
A common mistake is putting filters on the right table in the WHERE clause after a LEFT JOIN, which accidentally turns it into an inner join.
Problem pattern:
SELECT *
FROM account a
LEFT JOIN opportunity o
ON a.account_id = o.account_id
WHERE o.stage_name = 'Closed Won';Better pattern:
SELECT *
FROM account a
LEFT JOIN opportunity o
ON a.account_id = o.account_id
AND o.stage_name = 'Closed Won';The most important interview habit: state the grain before joining.
Example:
The output grain is one row per account.A strong answer is:
I use INNER JOIN when both tables must match, and LEFT JOIN when I need to preserve the left-side grain and detect missing relationships. In analytics SQL, I also watch filter placement because a
WHEREfilter on the right table can accidentally remove unmatched rows.
How do you calculate month-over-month sales growth by product category?
First define the grain:
One row per month and product category.Then aggregate revenue and compare it to the previous month using LAG().
WITH monthly AS (
SELECT
DATE_TRUNC('month', order_date) AS month_start,
product_category,
SUM(amount) AS revenue
FROM orders
GROUP BY 1, 2
),
growth AS (
SELECT
month_start,
product_category,
revenue,
LAG(revenue) OVER (
PARTITION BY product_category
ORDER BY month_start
) AS prev_revenue
FROM monthly
)
SELECT
month_start,
product_category,
revenue,
prev_revenue,
CASE
WHEN prev_revenue IS NULL OR prev_revenue = 0 THEN NULL
ELSE (revenue - prev_revenue) / prev_revenue
END AS mom_growth_rate
FROM growth
ORDER BY product_category, month_start;Explain edge cases:
- First month has no previous month
- Previous revenue can be zero
- Current month may be partial
- Refunds or credits may reduce revenue
- Multi-currency data needs conversion before aggregation
- Timezone affects month boundaries
Salesforce-style example:
For Opportunity revenue, I would clarify whether to use created date, close date, booked date, or stage transition date.A strong answer is:
I would aggregate revenue at month and category grain, use
LAG()to get previous-month revenue, then calculate the percentage change. I would also clarify partial months, refunds, currency conversion, and which business date defines the month.
How do you detect overlapping or gapped validity windows in an SCD Type 2 Account history table?
In an SCD Type 2 table, each business key should have non-overlapping validity windows.
Example grain:
One row per account version.Typical columns:
account_id, account_name, owner_id, valid_from, valid_to, is_currentUse LEAD() to compare each row with the next row for the same account.
WITH ordered AS (
SELECT
account_id,
valid_from,
valid_to,
LEAD(valid_from) OVER (
PARTITION BY account_id
ORDER BY valid_from
) AS next_valid_from
FROM account_history
)
SELECT
account_id,
valid_from,
valid_to,
next_valid_from
FROM ordered
WHERE next_valid_from IS NOT NULL
AND (
next_valid_from < valid_to
OR next_valid_from > valid_to
);This assumes valid_to is exclusive. If your model uses inclusive end dates, adjust the comparison.
For example:
Exclusive end: next_valid_from should equal valid_to
Inclusive end: next_valid_from should equal valid_to + 1 dayAlso validate:
- Only one
is_current = truerow per account - No null
valid_from - Current row has open-ended
valid_to - Deleted records are represented consistently
- Timezone is consistent
A strong answer is:
I would order SCD2 rows by
valid_fromper account and useLEAD()to compare each row with the next window. Then I would flag overlaps, gaps, multiple current rows, and inconsistent delete or open-ended records.
How do you model Opportunity stage history for daily pipeline-as-of and current pipeline?
First separate current pipeline from pipeline-as-of. They answer different business questions.
| Metric type | Meaning |
|---|---|
| Current pipeline | What is open right now? |
| Pipeline-as-of | What did the pipeline look like on a past date? |
Recommended model:
-
Opportunity dimension
opportunity_id- account/customer key
- owner
- region
- amount
- created date
- close date
- current status
-
Opportunity stage history fact
- grain: one row per stage transition
opportunity_idstage_namestage_changed_at- optional previous stage
- source event ID
-
Daily snapshot / as-of mart
- grain: one row per opportunity per as-of date
- latest stage as of that date
- amount as of that date
- owner/region as of that date
For as-of reporting:
- Build a date spine
- Join opportunities active on each date
- Pick the latest stage at or before the as-of timestamp
- Use SCD2 or snapshots for changing attributes like owner, region, and amount
Avoid double counting:
- Do not join raw stage history directly to opportunity totals without selecting one stage per as-of date
- Do not key customer facts only by mutable email
- Use stable surrogate keys when identity resolution can merge records
A strong answer is:
I would model stage changes as a transition fact and build a separate daily as-of mart. For each date, I would select the latest stage and attributes valid at that time, while keeping current pipeline as a separate model to avoid mixing current and historical semantics.
How do you validate Opportunities marked closed without matching stage history?
This is a data quality validation question. The goal is to detect Opportunities whose current state says closed, but stage history does not show a matching closed transition.
Example query:
SELECT
DATE(o.close_date) AS close_day,
COUNT(*) AS closed_missing_history
FROM opportunity o
LEFT JOIN opportunity_stage_history h
ON h.opportunity_id = o.opportunity_id
AND h.stage_name IN ('Closed Won', 'Closed Lost')
AND h.stage_changed_at <= o.close_date
WHERE o.is_closed = TRUE
AND h.opportunity_id IS NULL
GROUP BY 1
ORDER BY 1;What this checks:
- Opportunity is marked closed
- No matching Closed Won / Closed Lost stage history exists
- Stage transition should have happened on or before close date
In an interview, explain possible causes:
- Missing CDC event
- Late-arriving stage history
- Backfill gap
- Source system correction
- Manual data update
- Different stage naming convention
- Timezone mismatch around close date
Production handling:
- Alert if count crosses threshold
- Quarantine bad records if they break downstream metrics
- Track the issue by source system and ingestion date
- Reprocess from raw/bronze data if possible
- Backfill corrected history safely
- Add a data quality dashboard
A strong answer is:
I would LEFT JOIN closed Opportunities to closed-stage history and flag rows where the history is missing. Then I would check for late events, source corrections, timezone issues, and backfill gaps before quarantining or fixing the affected records.
What is the purpose of database indexing in a warehouse context?
In transactional databases, indexes help find rows quickly. In warehouses and lakehouses, similar goals are achieved through physical design features such as partitioning, clustering, sorting, statistics, pruning, and file layout.
The goal is to reduce scanned data and improve query performance.
In CRM analytics, common choices are:
| Design choice | Example |
|---|---|
| Partitioning | Partition Opportunity facts by close date or snapshot date |
| Clustering | Cluster by account_id, opportunity_id, or region |
| Sorting | Sort history tables by business key and event time |
| Statistics | Keep table stats updated for optimizer decisions |
| Materialized views | Precompute heavy dashboard aggregates |
Example:
A dashboard filtering Opportunity snapshots by date should not scan all historical partitions.Physical design matters when:
- Tables are very large
- Dashboards need low latency
- Joins are frequent
- Filters are selective
- Data is skewed by large customers or regions
Avoid over-optimizing too early. First check query plans, scan volume, join strategy, and dashboard SLA.
A strong answer is:
In a warehouse, the purpose is to reduce scanned data and improve query speed using indexes, partitioning, clustering, sorting, statistics, or materialized aggregates. I would design physical layout based on access patterns and validate with query plans.
What grain and keys do you use for a unified Customer dimension across Sales Cloud Contacts and a marketing email source?
For a unified customer dimension, define the grain carefully.
Recommended grain:
One row per resolved person or customer profile.Do not use raw email as the only customer key. Emails can change, be shared, be missing, or appear differently across systems.
Recommended design:
-
Unified customer dimension
customer_key- resolved profile attributes
- preferred email
- consent status
- created/updated timestamps
-
Identity link table
customer_key- source system
- source record ID
- phone
- external ID
- match confidence
- active flag
-
Source-specific dimensions
- Sales Cloud Contact
- Lead
- Marketing subscriber
- Web/app user
-
Facts
- Join through stable keys or identity bridge
- Avoid rewriting historical facts every time an identity merge changes
Important Salesforce/Data Cloud concepts:
- Source records can map to unified profiles
- Match rules decide which records belong together
- Reconciliation rules decide which field value wins
- Merge churn should be monitored
Example issue:
If two contacts merge into one unified customer, historical Opportunity facts should remain explainable and not double-count revenue.A strong answer is:
I would use one row per resolved customer profile with a stable surrogate
customer_key, and maintain an identity link table from source IDs, emails, phones, and external IDs to that key. Facts should not depend only on mutable email addresses.
Data pipelines, CDC, and Salesforce Data Cloud
When do you choose batch vs near-real-time ingestion for Salesforce CRM data?
Choose the ingestion pattern based on the freshness requirement, cost, source limits, and downstream use case.
| Choose batch ingestion | Choose near-real-time / CDC ingestion |
|---|---|
| Nightly reports | Operational dashboards |
| Historical backfills | Stage-change or case-status alerts |
| Large snapshots | Customer profile updates |
| Lower cost and simpler operations | Fresh data needed within minutes |
| Source API limits favor scheduled pulls | Downstream AI, segmentation, or automation needs fresh events |
Batch ingestion is easier to operate and good for reporting where a few hours of delay is acceptable.
Near-real-time ingestion is better when business users need fresh changes quickly, such as:
- Opportunity stage changes
- Case escalation alerts
- Marketing audience updates
- Customer profile refresh in Data Cloud
- Fraud/risk signals
- Sales pipeline coverage monitoring
A common production design is hybrid:
CDC stream for hot changes + nightly batch reconciliation for correctnessThe nightly batch helps catch missed events, source corrections, late-arriving records, and drift between source and target.
A strong answer is:
I choose batch when cost and simplicity matter more than freshness, and CDC or near-real-time ingestion when downstream users need updates within minutes. For important CRM data, I prefer a hybrid design: CDC for freshness and nightly reconciliation for correctness.
How do you design a CDC pipeline to be idempotent when events replay or arrive out of order?
An idempotent CDC pipeline produces the same final result even if the same event is processed more than once.
This is important because CDC systems can have:
- Duplicate events
- Retries
- Out-of-order delivery
- Consumer restarts
- Replays from a checkpoint
- Late-arriving updates
Key design elements:
-
Stable event key
- Use event ID, replay ID, transaction ID, or a composite key like source object ID + sequence + timestamp.
-
Append-only bronze layer
- Store raw events before transformation so replay and debugging are possible.
-
Deduplication logic
- Remove duplicates before writing conformed silver tables.
-
Monotonic merge rules
- Apply newer changes over older changes using commit timestamp or version number.
-
Checkpointing
- Store the last processed offset/replay token per object stream.
-
Dead-letter queue
- Send malformed or poison events to a DLQ instead of blocking the full pipeline.
-
Metrics
- Track lag, duplicate rate, late-event rate, failed records, and merge conflicts.
For Salesforce CRM data, do not rely only on email or name for identity. Use stable Salesforce record IDs, source system metadata, and defined identity rules.
A strong answer is:
I would make the CDC pipeline idempotent by storing raw events, deduplicating by stable event keys, applying version-aware merge logic, checkpointing per stream, and sending bad events to a DLQ. Reprocessing the same event should not change the final result incorrectly.
A new Opportunity field breaks your Data Stream schema validation. What do you do?
First protect the pipeline and downstream users. Then fix the schema properly.
Operational response:
-
Confirm the change
- Check source metadata, pipeline logs, schema validation error, and affected object.
-
Protect ingestion
- If possible, land the raw payload in bronze as JSON or semi-structured data so ingestion does not fully stop.
-
Assess downstream impact
- Find which silver/gold tables, dashboards, Data Cloud mappings, or ML features depend on the Opportunity schema.
-
Version the schema
- Add the new field as optional first, with a default null value where needed.
-
Update mappings and contracts
- Update Data Stream mapping, DLO/DMO mapping, dbt model, schema registry, or data contract.
-
Backfill if needed
- Backfill historical values if the source exposes field history or historical extracts.
-
Test before promotion
- Run compatibility tests, row-count checks, null checks, and downstream dashboard validation.
-
Communicate
- Notify consumers if the new field changes business logic or reporting definitions.
The key interview point is that schema evolution should not silently break gold tables or dashboards.
A strong answer is:
I would isolate the schema change, keep raw ingestion running if possible, add the new field in a backward-compatible way, update mappings and contracts, validate downstream models, and backfill historical values only if the business needs them.
What data quality checks do you embed in an ETL pipeline?
Data quality checks should run at multiple stages, not only at the final dashboard layer.
Common checks:
| Check type | Example |
|---|---|
| Completeness | Row count vs source extract |
| Null checks | Required keys like account_id or opportunity_id are not null |
| Uniqueness | No duplicate primary keys or event IDs |
| Referential integrity | Every Opportunity has a valid Account |
| Freshness | Latest ingested_at is within SLA |
| Valid values | Opportunity stage belongs to allowed stage list |
| Business rule | Closed Opportunity must have a terminal stage |
| Reconciliation | Source count and target count match within tolerance |
| Anomaly | Sudden drop or spike in records |
| Distribution | Stage mix or region mix changes unexpectedly |
Example Salesforce checks:
- Opportunities marked closed must have Closed Won or Closed Lost history
- Contacts must have valid Account relationship when required
- Duplicate Leads by email should be tracked
- Opportunity amount should not be negative unless credits are allowed
- Account hierarchy should not contain cycles
Failed checks should not be silently ignored. Depending on severity:
- Warn and continue
- Quarantine bad records
- Block promotion to gold
- Alert the data owner
- Open an incident if SLA or business reporting is affected
A strong answer is:
I would add checks for completeness, uniqueness, referential integrity, freshness, valid values, business rules, and anomalies. Critical failures should quarantine or block bad data, while alerts and lineage help owners understand which reports or consumers are affected.
How do you handle missing or late-arriving data in pipelines?
Missing and late-arriving data should be handled explicitly because CRM data often arrives out of order or gets corrected after the fact.
For late-arriving events:
- Use event time, not only ingestion time
- Define a watermark and allowed lateness window
- Reprocess affected partitions
- Update as-of marts safely
- Mark metrics as provisional if they may change
For missing batches:
- Compare source manifests with landed files
- Validate row counts and checksums
- Retry extraction with backoff
- Alert when SLA is at risk
- Backfill from the last successful checkpoint
For downstream reporting:
- Use
is_finalordata_statusflags when metrics can revise - Publish incident notes if dashboards are stale
- Show freshness timestamps in important reports
Salesforce-style example:
An Opportunity may be marked closed before the matching stage-history event arrives. The pipeline should detect this semantic gap, allow late correction, and avoid publishing misleading close-rate metrics.A strong answer is:
I handle late data with watermarks, allowed lateness, partition reprocessing, and provisional metrics. For missing data, I compare manifests, retry safely, backfill from checkpoints, and communicate freshness or SLA impact to downstream users.
What are DLOs and DMOs in Salesforce Data Cloud?
In Salesforce Data Cloud / Data 360, DLOs and DMOs are part of the ingestion and harmonization flow.
| Term | Meaning |
|---|---|
| DLO | Data Lake Object. It represents ingested or externally referenced source-aligned data. |
| DMO | Data Model Object. It represents harmonized data mapped to the Salesforce Customer 360 data model. |
Simple flow:
Source data → Data Stream → DLO → DMO → Identity Resolution → Unified Profile → ActivationExample:
- A marketing subscriber file lands as a DLO
- Fields from that DLO are mapped to standard objects such as Individual or Contact Point DMOs
- Identity resolution links records into unified profiles
- The unified data can be used for segmentation, analytics, activation, or AI use cases
Key interview point:
DLO is closer to source-shaped data.
DMO is closer to business/model-shaped data.Zero-copy federation can also allow Data Cloud/Data 360 to access external data without physically copying all of it into Salesforce, depending on the connector and use case.
A strong answer is:
A DLO stores or references source-aligned data after ingestion, while a DMO is the harmonized model object used in the Customer 360 model. The usual flow is source to DLO, mapping to DMO, identity resolution, unified profile, and activation.
How does identity resolution work in Data Cloud?
Identity resolution links multiple source records that represent the same real-world person, account, or customer.
High-level flow:
- Ingest data from sources
- Map source data to DMOs
- Define match rules
- Define reconciliation rules
- Run identity resolution
- Create unified profiles and link records
- Monitor match quality and merge behavior
Important concepts:
| Concept | Meaning |
|---|---|
| Match rules | Decide which records should be linked |
| Reconciliation rules | Decide which attribute value wins |
| Unified profile | The resolved customer/person/account view |
| Link objects | Preserve relationship between source records and unified records |
Example match signals:
- Exact email
- Phone number
- CRM Contact ID
- External customer ID
- Name + address
- Loyalty ID
Be careful with over-merging. A false positive can combine two different people into one profile, which is worse than leaving two duplicate profiles in many compliance and personalization scenarios.
Also explain that source records are not erased. Good identity systems preserve lineage so teams can trace a unified profile back to original CRM, marketing, or service records.
A strong answer is:
Identity resolution maps source records into a common model, applies match rules to link records, uses reconciliation rules to choose winning attribute values, and creates unified profiles while preserving source lineage. I would monitor false positives, merge churn, and compliance risk carefully.
A nightly orders pipeline failed at 2 AM. Walk through your response.
This is an incident-management question. The interviewer wants to see calm ownership, not panic debugging.
Step-by-step response:
-
Acknowledge the alert
- Check orchestrator status, failed task, logs, error message, and last successful run.
-
Assess impact
- Which tables, dashboards, ML jobs, or business users are affected?
- Is this a freshness issue or bad-data issue?
-
Stabilize
- Retry only if the failure is safe and likely transient.
- Pause downstream jobs if bad data may be published.
-
Find root cause
- Schema change
- Source API issue
- Credential expiry
- File missing
- Data skew
- Warehouse capacity issue
- Code deployment bug
-
Recover
- Restart from checkpoint
- Backfill missing partitions
- Re-run reconciliation checks
- Validate row counts and business metrics
-
Communicate
- Notify stakeholders about impact, ETA, and workaround.
- Update dashboard freshness notes if needed.
-
Prevent recurrence
- Add tests, alerts, schema contracts, retry logic, or runbook updates.
A strong answer is:
I would first assess whether users are seeing stale data or wrong data. Then I would stabilize the pipeline, protect downstream consumers, recover from the last good checkpoint, validate reconciliation checks, communicate impact, and add a guardrail to prevent the same failure.
System design
Design a system to count video views in real time.
Start by clarifying requirements:
- Views per second
- Exact vs approximate counts
- Unique views vs total views
- Per-video, per-user, per-region, or per-time-window metrics
- Latency target
- Fraud/bot filtering needs
High-level design:
Client/App → Event collector → Kafka/Kinesis/PubSub → Stream processor → Hot store + warehouse → API/dashboardComponents:
-
Event collector
- Receives view events with
event_id,video_id,user_id, and timestamp.
- Receives view events with
-
Message queue
- Buffers traffic spikes and decouples producers from consumers.
-
Stream processor
- Aggregates counts by
video_idand time window.
- Aggregates counts by
-
Hot store
- Redis or similar store for current counters.
-
Analytics store
- ClickHouse, BigQuery, Druid, Snowflake, or lakehouse tables for historical reporting.
-
Serving API
- Reads hot counters for live views and historical store for older windows.
Reliability concerns:
- Deduplicate by
event_id - Handle late events with watermarks
- Mitigate hot keys for viral videos
- Use idempotent writes
- Apply bot/fraud filtering
- Reconcile stream counts with raw events
For unique viewers, use exact distinct counts if scale allows, or approximate structures like HyperLogLog when approximate accuracy is acceptable.
A strong answer is:
I would ingest view events into a durable stream, aggregate them by video and time window, store live counters in a hot store, persist raw and aggregated data for analytics, and handle deduplication, late events, hot keys, and bot filtering.
Design a pipeline: Salesforce CRM CDC → Data Cloud + Azure Synapse with a 5-minute SLA.
A 5-minute SLA means the design must focus on low-latency ingestion, idempotent processing, and strong observability.
High-level design:
Salesforce CRM CDC → Event bus → Bronze raw lake → Silver conformed models → Synapse gold marts + Data Cloud ingestionSuggested flow:
-
CDC capture
- Subscribe to Salesforce CDC events for objects such as Account, Contact, Opportunity, and Case.
-
Streaming buffer
- Use Kafka, Azure Event Hubs, or another durable bus to absorb spikes and support replay.
-
Bronze layer
- Store raw CDC events in ADLS with metadata: source object, event ID, replay ID, schema version, and ingestion time.
-
Silver layer
- Deduplicate, validate, apply ordering rules, and upsert into conformed CRM entities.
-
Gold layer
- Build BI-ready tables in Synapse for dashboards and SLA-based reporting.
-
Data Cloud path
- Send or expose harmonized data to Data Cloud/Data 360 through Data Streams, connectors, or federation depending on the source and architecture.
-
Observability
- Monitor end-to-end lag, event backlog, failed events, DLQ count, freshness, and row-count reconciliation.
Key design points:
- Use idempotent upserts
- Keep raw events for replay
- Separate BI serving from profile activation
- Use DLQ for poison events
- Add schema evolution handling
- Reconcile CDC with scheduled batch extracts
A strong answer is:
I would capture Salesforce CDC events into a durable stream, land raw events in bronze, dedupe and merge into silver CRM models, publish gold marts to Synapse, and feed Data Cloud through the right ingestion or federation path. I would monitor lag, DLQ, freshness, and reconciliation against source.
How would you migrate petabyte-scale data from on-prem to cloud with minimal downtime?
For petabyte-scale migration, use a phased approach instead of a single cutover.
Migration plan:
-
Discovery
- Inventory datasets, owners, size, dependencies, SLAs, PII, and retention rules.
-
Target design
- Choose storage layout, partitioning, encryption, access model, and compute engine.
-
Initial bulk copy
- Run parallel transfer to object storage.
- Validate using checksums, row counts, and sample comparisons.
-
Incremental sync
- Use CDC, logs, timestamp-based extracts, or replication to capture changes after the initial copy.
-
Dual-run period
- Run old and new systems in parallel.
- Compare business metrics and critical queries.
-
Cutover
- Use feature flags, DNS/connection string changes, or application configuration.
-
Rollback plan
- Keep old system readable for a defined period.
- Define clear rollback triggers.
-
Post-cutover validation
- Compare counts, hashes, aggregates, and business invariants.
Special considerations:
- Historical field history
- Attachments and files
- Data retention policies
- PII masking/tokenization
- Network bandwidth
- Transfer cost
- Partition skew
- Query compatibility
A strong answer is:
I would do an initial bulk load, validate it, keep systems in sync with CDC or incremental replication, run both systems in parallel, reconcile business metrics, cut over with a rollback plan, and continue validation after migration.
How would you design a distributed job scheduler like Airflow?
A distributed scheduler coordinates workflows, dependencies, retries, and execution across many workers.
Core components:
| Component | Responsibility |
|---|---|
| Metadata DB | Stores DAGs, runs, task state, retries, schedules |
| Scheduler | Determines which tasks are ready to run |
| Executor | Sends tasks to workers |
| Workers | Execute tasks |
| Web UI/API | Shows status, logs, retries, and history |
| Log store | Stores task logs |
| Alerting | Notifies on failure or SLA miss |
High-level flow:
DAG definition → Scheduler → Queue → Worker → Metadata DB update → Logs/metricsDesign concerns:
- Task dependencies
- Retries and backoff
- Idempotent task execution
- Backfills
- Catchup behavior
- SLA monitoring
- Worker autoscaling
- Priority queues
- Dead task detection
- Secure secrets handling
For data platforms, tasks should be idempotent because retries are normal. For example, a load task should safely overwrite or merge a partition without creating duplicates.
A strong answer is:
I would design a scheduler with a metadata database, scheduler, executor, distributed workers, logs, retries, backfill support, and alerting. The most important reliability rule is that tasks should be idempotent because retries and reruns are expected.
How do you handle a sudden 10x traffic spike on your ingestion cluster?
A 10x ingestion spike should be handled with buffering, autoscaling, prioritization, and backpressure.
Immediate response:
-
Check queue lag
- Is the system dropping data or only falling behind?
-
Protect critical pipelines
- Prioritize revenue, CRM CDC, security, or customer-impacting data over batch reprocessing.
-
Scale consumers
- Add workers if partitions and downstream systems can handle it.
-
Apply backpressure
- Throttle non-critical producers or consumers.
-
Use durable buffering
- Let Kafka/Event Hubs/Kinesis absorb the burst within retention limits.
-
Protect downstream systems
- Avoid overwhelming the warehouse, database, Data Cloud ingestion, or APIs.
-
Monitor cost
- Autoscaling should have limits and alerts.
After the spike:
- Replay DLQ
- Validate missed records
- Reconcile counts
- Review partition strategy
- Check hot keys
- Update capacity planning
- Add traffic forecasts if spike was business-driven
A strong answer is:
I would first confirm whether data is being lost or only delayed. Then I would use queue buffering, scale consumers safely, apply backpressure to non-critical flows, protect downstream systems, prioritize critical SLAs, and reconcile data after the spike.
Design a multi-tenant lakehouse fed from Data Cloud where each BU needs PII isolation and audit logs.
A multi-tenant lakehouse must isolate data, enforce access controls, and provide auditability.
Clarify first:
- Number of business units
- Regulatory requirements
- Whether tenants can share any data
- Whether isolation must be logical or physical
- Who can access PII
- Retention and deletion requirements
Design options:
| Area | Design choice |
|---|---|
| Logical isolation | Separate schemas/catalogs per tenant |
| Physical isolation | Separate storage accounts/buckets for regulated tenants |
| Access control | IAM roles, RBAC, row-level security, column masking |
| PII protection | Encryption, tokenization, masking, consent filtering |
| Audit | Immutable access logs and query history |
| Lineage | Track source, transform, and serving tables |
| Observability | Freshness and DQ metrics by tenant |
High-level design:
Data Cloud / source systems → ingestion layer → tenant-aware bronze/silver/gold → governed access layer → BI/ML/activationImportant controls:
- Add
tenant_idto shared tables where logical isolation is used - Prevent broad service principals
- Encrypt data at rest and in transit
- Mask or tokenize PII columns
- Log who accessed what data and when
- Separate dev/test/prod data access
- Apply retention and deletion policies per tenant
A strong answer is:
I would design tenant isolation using separate schemas or storage for regulated tenants, enforce IAM and row/column-level controls, protect PII with masking or tokenization, and maintain immutable audit logs, lineage, and freshness metrics per tenant.
Design a data quality controls framework for an ETL platform.
A data quality framework should make checks reusable, visible, and enforceable across pipelines.
Core components:
-
Rule definition
- Rules written in SQL, YAML, dbt tests, Great Expectations, or a custom framework.
-
Rule types
- Null checks
- Uniqueness
- Referential integrity
- Freshness
- Accepted values
- Reconciliation
- Anomaly detection
- Business rules
-
Execution layer
- Run checks after bronze, silver, and gold stages.
-
Severity levels
- Warn
- Quarantine
- Block promotion
- Incident
-
Metadata and ownership
- Every rule has owner, table, column, SLA, severity, and linked dashboard.
-
Alerting
- Notify the right owner through Slack, email, PagerDuty, or ticketing.
-
Observability
- Track pass rate, failure trend, freshness, and data quality score.
-
Lineage integration
- Show which dashboards, marts, or consumers are affected.
Example pilot:
Start with Account and Opportunity because they are high-value CRM entities.Example checks:
- Opportunity must have valid Account
- Closed Opportunity must have terminal stage
- Account ID must be unique
- Required owner fields must not be null
- Daily Opportunity count should not drop suddenly
A strong answer is:
I would build a declarative data quality framework with reusable rules, severity levels, alerts, ownership, lineage, and trend reporting. Critical checks should block or quarantine bad data, while lower-severity checks warn owners before issues become incidents.
Coding and algorithms (data-focused)
Given a large log file, find the five most frequent error messages in the last hour.
This tests whether you can process large data without loading everything into memory.
Approach:
- Read the file line by line
- Parse timestamp and error message
- Keep only records from the last one hour
- Count messages using a hash map
- Return the top five using
Counter.most_common()or a min-heap
Simple Python approach:
from collections import Counter
from datetime import timedelta
def top_errors(lines, now, window=timedelta(hours=1)):
cutoff = now - window
counts = Counter()
for line in lines:
ts, msg = parse_log_line(line)
if ts >= cutoff:
counts[msg] += 1
return counts.most_common(5)For very large or streaming logs, improve the design:
- Use a min-heap of size 5 for top-K
- Use event time instead of processing time
- Handle out-of-order logs with a watermark
- Normalize error messages so IDs do not create fake unique errors
- Store rolling counts in Redis or a stream processor for real-time dashboards
Complexity:
| Area | Complexity |
|---|---|
| Time | O(n) |
| Space | O(u) where u is unique error messages in the window |
A Salesforce-style example:
Find the top failing ingestion errors in the last hour for Account and Opportunity CDC events.A strong answer is:
I would stream the log file line by line, filter records to the last hour, count error messages with a hash map, and return the top five. For production-scale logs, I would handle event-time ordering, message normalization, and rolling-window aggregation.
Validate whether a string of brackets is balanced.
This is a classic stack problem. It checks whether you understand last-in, first-out matching.
Use a stack:
def is_balanced(s: str) -> bool:
stack = []
pairs = {
')': '(',
']': '[',
'}': '{'
}
for ch in s:
if ch in '([{':
stack.append(ch)
elif ch in ')]}':
if not stack or stack.pop() != pairs[ch]:
return False
return not stackHow it works:
- Push opening brackets to the stack
- When a closing bracket appears, check if it matches the latest opening bracket
- If anything mismatches, return
False - At the end, stack must be empty
Examples:
| Input | Output | Reason |
|---|---|---|
{[]} |
True |
Properly nested |
{[}] |
False |
Wrong order |
((())) |
True |
All matched |
(() |
False |
Missing closing bracket |
Complexity:
| Area | Complexity |
|---|---|
| Time | O(n) |
| Space | O(n) in worst case |
Possible follow-ups:
- Ignore non-bracket characters
- Return the position of the first invalid bracket
- Validate brackets while streaming a file
- Track line and column number for parser errors
A strong answer is:
I would use a stack. Opening brackets are pushed, closing brackets must match the most recent opening bracket, and the stack must be empty at the end. This gives
O(n)time andO(n)worst-case space.
Find all pairs in an array that sum to a target.
First clarify whether the interviewer wants:
- One pair or all pairs
- Indexes or values
- Duplicate pairs allowed or not
- Same element reuse allowed or not
For all unique value pairs, use a hash set.
def two_sum_pairs(nums, target):
seen = set()
pairs = set()
for x in nums:
need = target - x
if need in seen:
pairs.add(tuple(sorted((need, x))))
seen.add(x)
return list(pairs)Example:
nums = [1, 2, 3, 4, 5, 3]
target = 6Possible output:
[(1, 5), (2, 4), (3, 3)]Complexity:
| Area | Complexity |
|---|---|
| Time | O(n) |
| Space | O(n) |
If the array is already sorted, you can use the two-pointer approach:
left at start, right at end
move left/right based on current sumTwo-pointer approach uses O(1) extra space but requires sorted input.
Salesforce/data example:
Find pairs of adjustment records whose amounts cancel each other out.A strong answer is:
I would clarify duplicates and whether indexes or values are required. For unsorted input, I would use a hash set for
O(n)time. For sorted input, I would use two pointers to reduce extra space.
Implement a function to flatten a nested dictionary.
Flattening a nested dictionary is common when semi-structured JSON lands in a bronze layer and needs to be converted into column-style paths.
Example input:
{
"account": {
"id": "A1",
"owner": {
"id": "U1"
}
}
}Expected output:
{
"account.id": "A1",
"account.owner.id": "U1"
}Recursive solution:
def flatten(obj, prefix=""):
out = {}
for key, value in obj.items():
path = f"{prefix}.{key}" if prefix else key
if isinstance(value, dict):
out.update(flatten(value, path))
else:
out[path] = value
return outComplexity:
| Area | Complexity |
|---|---|
| Time | O(n) where n is total keys |
| Space | O(n) for output and recursion stack |
Important follow-ups:
- How should lists be handled?
- What if keys already contain dots?
- Should null values be kept?
- Should deeply nested JSON use iterative traversal to avoid recursion depth issues?
- Should output preserve data types?
For CDC pipelines, this helps when raw Salesforce or external events land as JSON and you need predictable field paths for validation or schema inference.
A strong answer is:
I would recursively traverse the dictionary, build a dotted path for each nested key, and store leaf values in the output map. I would also clarify how to handle arrays, nulls, dotted keys, and very deep JSON.
Merge k sorted lists of records by timestamp.
This is a good data-engineering coding question because it resembles merging sorted CDC shards, partitioned files, or ordered event streams.
Use a min-heap of size k.
import heapq
def merge_k_lists(lists):
heap = []
for list_index, records in enumerate(lists):
if records:
heapq.heappush(
heap,
(records[0][0], list_index, 0)
)
result = []
while heap:
ts, list_index, record_index = heapq.heappop(heap)
records = lists[list_index]
result.append(records[record_index])
next_index = record_index + 1
if next_index < len(records):
heapq.heappush(
heap,
(records[next_index][0], list_index, next_index)
)
return resultAssumption:
Each record is sorted by timestamp, and timestamp is at index 0.Example record:
("2026-06-27T10:00:00Z", "event_1")Complexity:
| Area | Complexity |
|---|---|
| Time | O(N log k) |
| Space | O(k) excluding output |
Where:
N= total records across all listsk= number of sorted lists
Important follow-ups:
- What if timestamps are equal?
- Do we need stable ordering?
- Are records already sorted by event time or ingestion time?
- Can lists be streams instead of in-memory arrays?
- How do we handle late events?
A strong answer is:
I would use a min-heap containing the current record from each sorted list. Each pop gives the next earliest timestamp, and I push the next record from the same list. This gives
O(N log k)time and works well for merging ordered CDC shards.
How do you deduplicate a stream of CRM events in code?
To deduplicate a CRM event stream, use a stable event identifier and keep track of processed events.
Simple logic:
seen = set()
def process(event):
if event.id in seen:
return None
seen.add(event.id)
return transform(event)This works for small in-memory examples, but production streams need bounded and durable state.
Production design options:
| Option | Use case |
|---|---|
| In-memory set | Small coding exercise or short-lived process |
| TTL cache | Dedup within replay window |
| Redis / DynamoDB / RocksDB | Durable dedup state |
| Database unique constraint | Idempotent sink writes |
| Bloom filter | Memory-efficient approximate dedup |
| Bronze table dedup | Reprocess-safe batch/stream pipeline |
Important considerations:
- Event ID must be stable
- Dedup state should survive restarts
- State must be bounded using TTL or compaction
- Replays should not produce duplicate target rows
- Sink writes should be idempotent
- Checkpointing should align with successful writes
For Salesforce CDC, possible keys include:
event_id, replay_id, source_object_id, commit_timestamp, change_sequenceDo not deduplicate only by business fields like email or account name because those can change and may not uniquely identify events.
A strong answer is:
I would deduplicate by a stable event ID, store seen IDs in a bounded or durable state store, and make the sink idempotent with upserts or unique constraints. In production, the dedup state must survive retries and restarts.
Behavioral, leadership, and culture
Why Salesforce, and which company value resonates with you?
A strong answer should connect Salesforce's values to your actual data engineering work, not generic praise.
Good values to connect with data work:
| Salesforce value | How to connect it to data engineering |
|---|---|
| Trust | Data privacy, PII handling, encryption, access control, auditability |
| Customer Success | Reliable pipelines and metrics that business users can act on |
| Innovation | Data Cloud, Agentforce, AI-ready data, real-time customer context |
| Equality | Fair access to data, inclusive collaboration, clear documentation |
| Sustainability | Efficient pipelines, cost-aware compute, less wasteful processing |
A weak answer sounds like:
Salesforce is a big company and I want to work on interesting products.A stronger answer connects your experience:
Trust resonates with me because data engineering directly affects customer confidence. If CRM, consent, or PII pipelines are wrong, downstream analytics and AI decisions can be wrong. In my past work, I handled access controls, audit logs, data quality checks, and incident communication, so I see Trust as a daily engineering responsibility.You can also mention interest in Data Cloud / Data 360 and Agentforce if it is genuine:
I am interested in how Salesforce uses unified customer data to power analytics, automation, and AI agents. That kind of system needs strong data engineering foundations: identity resolution, freshness, lineage, governance, and quality checks.A strong answer is:
Trust resonates most with me because data engineering is directly tied to reliability, privacy, and customer confidence. I want to work on systems where clean, governed, and timely data powers customer-facing analytics, Data Cloud, and AI use cases.
Tell me about a technical disagreement with a senior engineer.
Use a STAR structure and show that you can disagree without making it personal.
A good example can come from:
- Streaming vs batch ingestion
- SCD2 vs daily snapshot
- SQL model vs denormalized mart
- Redis cache vs warehouse optimization
- Custom pipeline vs managed connector
- Strict data quality gate vs warning-only check
Example structure:
| STAR step | What to say |
|---|---|
| Situation | Team disagreed on whether a CRM mart should use SCD2 or daily snapshots |
| Task | You needed to support accurate as-of reporting without making the pipeline too complex |
| Action | You documented trade-offs, checked query patterns, ran a small benchmark, and reviewed SLA needs |
| Result | Team chose a hybrid model: SCD2 for core dimensions and daily snapshots for high-traffic dashboard metrics |
What interviewers want to hear:
- You listened first
- You used data, not ego
- You considered maintainability
- You aligned the decision with customer or business impact
- You preserved the relationship
Avoid saying:
The senior engineer was wrong, so I proved my approach was better.Say instead:
We had different assumptions, so I made the trade-offs visible and helped the team choose based on SLA, cost, and maintainability.A strong answer is:
I handled the disagreement by clarifying the requirement, documenting trade-offs, validating assumptions with a small test, and aligning the final decision with SLA, maintenance cost, and business impact rather than personal preference.
Describe a time you spotted a security risk in a data system.
Choose a story where you identified a real risk, acted responsibly, and improved the system.
Good examples:
- PII copied into a sandbox without masking
- Overly broad service principal access to storage
- API key committed to Git
- Data export shared with too many users
- Logs containing tokens, emails, or customer identifiers
- No audit trail for sensitive table access
- Production data available from a non-production notebook
Answer structure:
-
Risk found
- What did you notice?
-
Impact
- What could have gone wrong?
-
Action
- Who did you inform?
- Did you rotate credentials?
- Did you restrict access?
- Did you mask or delete exposed data?
-
Prevention
- What control did you add?
Example:
I found that a shared service principal had write access to multiple storage containers, including PII datasets that the pipeline did not need. I raised it with the platform and security teams, helped reduce permissions to least privilege, rotated credentials, and added a review step for new data access requests.Strong prevention actions:
- Secret scanning in CI
- IAM least-privilege review
- Token rotation
- Column masking
- Audit logging
- Access review workflow
- Policy-as-code checks
A strong answer is:
I identified the risk, assessed the blast radius, involved the right owners, fixed the exposure, and added a preventive control such as secret scanning, least-privilege IAM, masking, or audit logging so the issue would not repeat.
Tell me about a project that failed. What did you learn?
Pick a failure where you learned something meaningful and changed your behavior.
Good data-engineering examples:
- Pipeline launched without enough data quality gates
- Dashboard users lost trust after metric definitions changed
- Identity merge churn caused duplicate or missing customer counts
- Backfill took longer than expected because dependencies were not mapped
- Streaming pipeline missed late-arriving data
- Gold table design did not match actual business questions
Avoid blaming:
The stakeholder kept changing requirements.Better framing:
We did not define the data contract and success criteria clearly enough before building.Example answer:
In one project, we built a gold sales mart quickly, but we did not define the exact grain and metric ownership early. After launch, different teams interpreted pipeline value differently, and users lost confidence in the dashboard. I helped pause further rollout, documented metric definitions, added reconciliation checks, and created a sign-off process with business owners. The lesson was that trusted data products need contracts, ownership, and validation before wide release.What the interviewer wants:
- You take ownership
- You learned a practical lesson
- You changed your process
- You do not blame others
- You understand how data trust is lost and rebuilt
A strong answer is:
The project failed because we did not define ownership, grain, and data quality gates early enough. I learned to align on data contracts, success metrics, validation rules, and stakeholder sign-off before promoting data to trusted reporting layers.
How do you prioritize when two executives want conflicting high-priority data work?
This tests whether you can operate in a large enterprise without silently choosing favorites.
A good framework:
-
Clarify business impact
- Revenue, compliance, customer-facing impact, executive reporting, operational risk.
-
Clarify deadline
- Is it truly urgent or just important?
-
Estimate effort and dependencies
- Can one request unblock the other?
- Is there a shared data model or pipeline component?
-
Identify risk
- What happens if each request is delayed?
-
Offer phased delivery
- MVP metric first, full mart later.
- Temporary extract first, governed pipeline later.
-
Escalate transparently
- Present trade-offs to the decision maker with facts.
-
Document the decision
- Capture scope, timeline, owner, and follow-up.
Example trade-off matrix:
| Request | Impact | Deadline | Effort | Risk if delayed |
|---|---|---|---|---|
| Revenue dashboard fix | High | This week | Medium | Forecasting impact |
| New segmentation mart | Medium | Next month | High | Campaign delay |
Avoid:
I would work on whichever executive is louder.Better:
I would make the trade-off visible and ask for prioritization based on business impact and risk.A strong answer is:
I would compare business impact, urgency, effort, dependencies, and risk, then present a transparent trade-off matrix. If both cannot be done, I would propose phased delivery and get explicit prioritization rather than silently moving work around.
Describe leading through an ambiguous project crisis.
Choose a story where the problem was unclear, requirements changed, or multiple teams were blocked.
Good examples:
- Acquisition data had to be integrated quickly
- New CRM source had undocumented fields
- A dashboard broke before quarter-end reporting
- Data Cloud identity rules caused unexpected merge behavior
- Source system changed schema without notice
- Pipeline ownership was unclear during an incident
Strong answer structure:
-
Stabilize
- Identify immediate user impact and protect critical outputs.
-
Clarify known vs unknown
- Separate facts, assumptions, and open questions.
-
Create a short plan
- Time-box investigation.
- Assign owners.
- Define decision points.
-
Communicate frequently
- Explain what is affected, what is safe, and what is still unknown.
-
Deliver incrementally
- Provide a partial fix or MVP while deeper correction continues.
-
Close with learning
- Update runbooks, tests, contracts, or escalation paths.
Example:
During a CRM migration, we discovered late that several account hierarchy fields had inconsistent meanings across regions. I paused promotion to gold tables, created a data profiling report, met with regional owners, defined a temporary mapping for critical dashboards, and tracked unresolved fields in a risk register. We delivered the quarter-end dashboard on time with documented limitations, then followed up with a permanent contract.A strong answer is:
I lead ambiguous crises by separating facts from assumptions, stabilizing the highest-impact users first, assigning clear owners, communicating known risks, and delivering incremental fixes while building the longer-term solution.
Lead and senior IC depth
How do you ensure scalability and performance of a large Salesforce-related data platform?
A lead-level answer should cover architecture, operations, and team practices.
Key areas:
| Area | What to do |
|---|---|
| Data modeling | Define correct grain, keys, and ownership |
| Partitioning | Partition large facts by event date, snapshot date, or ingestion date |
| Skew handling | Watch large accounts, hot keys, and uneven partitions |
| Processing | Use bulk and distributed processing instead of row-by-row API pulls |
| CDC | Add backpressure, replay support, and idempotent merges |
| Serving | Pre-aggregate dashboard metrics and avoid heavy ad hoc queries on raw tables |
| Governance | Define standards, contracts, reviews, and ownership |
| Observability | Track freshness, lag, cost, quality, and SLA breaches |
Salesforce-specific scaling concerns:
- Account hierarchies can create skew
- Opportunity history can grow quickly
- CDC events can arrive in bursts
- Identity resolution can change profile counts
- Multi-tenant or business-unit data may require isolation
- PII controls can affect how data is joined and served
A lead candidate should also mention people and process:
- Establish design reviews
- Mentor engineers on data modeling and performance
- Create reusable pipeline templates
- Set performance budgets
- Run capacity reviews
- Practice incident game days
A strong answer is:
I would scale the platform through correct data grain, partitioning, skew handling, bulk processing, idempotent CDC design, pre-aggregated serving layers, governance standards, and strong observability. As a lead, I would also create reusable patterns and mentor teams to apply them consistently.
How do you manage technical debt on a data platform?
Technical debt in data platforms is dangerous because it often appears later as wrong metrics, slow pipelines, high cost, or poor trust.
Common data-platform debt:
- Unowned tables
- No data contracts
- No tests for critical transformations
- Duplicated pipeline logic
- Hardcoded business rules
- Manual backfills
- Poor naming conventions
- No lineage or documentation
- Expensive queries nobody owns
- Legacy pipelines no one wants to touch
Management framework:
-
Make debt visible
- Create a debt register with owner, impact, risk, and effort.
-
Prioritize by customer impact
- Fix debt that affects correctness, trust, SLA, cost, or security first.
-
Reserve capacity
- Allocate regular sprint capacity for debt reduction.
-
Prevent new debt
- Use PR templates, naming standards, tests, and data contracts.
-
Refactor safely
- Use a strangler pattern: run old and new pipelines in parallel until reconciliation passes.
-
Measure improvement
- Track pipeline runtime, failure rate, cost, test coverage, and incident count.
Example:
We replaced a legacy Opportunity mart by building a new model in parallel, validating row counts and business metrics for two weeks, then switching dashboards gradually.A strong answer is:
I manage technical debt by making it visible, ranking it by business impact and risk, reserving capacity to reduce it, and preventing new debt with standards, tests, contracts, and design reviews. For risky refactors, I run old and new pipelines in parallel until reconciliation passes.
When do you buy or configure vs build custom pipeline logic?
The decision should be based on reliability, maintenance cost, flexibility, and business differentiation.
Prefer managed or out-of-the-box options when:
- Standard connector supports the source well
- Required SLA is covered
- Transformations are simple
- Security and compliance controls are built in
- Long-term maintenance should be low
- The integration is not a competitive differentiator
Build custom logic when:
- You need complex CDC handling
- Identity matching rules are custom
- Transformations are highly business-specific
- You need advanced deduplication or ordering logic
- Multi-cloud or hybrid architecture is required
- Existing connector cannot meet latency, schema, or governance needs
Decision rubric:
| Question | Why it matters |
|---|---|
| Does a reliable connector already exist? | Avoid unnecessary custom code |
| What is the SLA? | Some managed tools may not meet low-latency needs |
| Who will own failures? | Custom code needs on-call ownership |
| How complex are transforms? | Complex logic may require custom pipelines |
| What is the total cost? | Include build, support, monitoring, and upgrades |
| Is there vendor lock-in risk? | Important for long-term architecture |
Avoid building custom pipelines only because it feels more flexible. Custom code also brings monitoring, alerting, testing, security, and upgrade responsibility.
A strong answer is:
I prefer managed connectors or configuration when they meet SLA, security, and transformation needs. I build custom logic only when the business requirements, latency, identity rules, or integration complexity cannot be handled safely by standard tools.
Where does MuleSoft fit in Salesforce enterprise integration?
MuleSoft fits in enterprise integration by providing reusable APIs and integration flows between Salesforce and other systems.
A common MuleSoft pattern is API-led connectivity:
| API layer | Purpose |
|---|---|
| System APIs | Connect to systems of record such as ERP, CRM, databases, or legacy apps |
| Process APIs | Combine and orchestrate business logic across systems |
| Experience APIs | Serve specific channels such as web, mobile, partner apps, or internal tools |
Where MuleSoft helps:
- Connect Salesforce CRM with ERP, billing, support, marketing, or data platforms
- Transform payloads between systems
- Apply routing and orchestration
- Enforce authentication and rate limits
- Provide reusable integration contracts
- Improve observability of enterprise APIs
- Reduce point-to-point integration sprawl
For data engineers, MuleSoft may be relevant in two ways:
- You consume APIs exposed by MuleSoft as pipeline sources.
- You coordinate event contracts and payload formats with integration teams.
When MuleSoft is better than point-to-point integration:
- Multiple consumers need the same business data
- Enterprise governance matters
- APIs need versioning and monitoring
- Direct source access is risky or inconsistent
- Integrations should be reusable across teams
A strong answer is:
MuleSoft provides API-led connectivity between Salesforce and enterprise systems. As a data engineer, I may consume MuleSoft APIs, coordinate event contracts, and rely on reusable System, Process, and Experience APIs instead of building fragile point-to-point integrations.
Describe a challenging CRM or data migration you led.
Use this as a hybrid behavioral and technical answer. Show leadership, planning, validation, and measurable results.
Good structure:
-
Scope
- What was migrated?
- From which system to which target?
- How much data?
- Which users or business processes were affected?
-
Challenges
- Duplicate contacts/accounts
- Bad account hierarchy
- Missing required fields
- Attachment migration
- Historical field history
- Downtime window
- Data ownership conflicts
- Consent and PII requirements
-
Plan
- Profiling
- Mapping
- Cleansing
- Test loads
- Reconciliation
- Parallel run
- Cutover
- Rollback
-
Execution
- Built migration pipelines
- Created validation queries
- Coordinated with business owners
- Ran mock cutovers
- Tracked defects and sign-offs
-
Result
- Match rate
- Downtime
- Error reduction
- Adoption
- Reporting accuracy
- Business outcome
Example:
I led a CRM migration where duplicate Accounts and inconsistent region mappings were breaking sales reporting. We profiled the data, created match rules, ran test loads, reconciled key metrics with business owners, and used a phased cutover. The final migration completed within the planned downtime window, and post-cutover reconciliation passed for the critical revenue and pipeline dashboards.A strong answer is:
I would describe the migration scope, data quality challenges, cutover plan, validation strategy, stakeholder coordination, and measurable result. The key is showing that I handled both the technical migration and the business trust needed for adoption.
How does data engineering support Agentforce and AI features in 2026?
Agentforce and AI features depend on trusted, governed, and timely data. Raw CRM dumps are not enough.
Data engineering supports AI by providing:
| Need | Data engineering responsibility |
|---|---|
| Trusted context | Clean and validated customer data |
| Unified profile | Identity resolution across CRM, service, marketing, and external sources |
| Freshness | CDC or near-real-time pipelines for important events |
| Governance | PII controls, consent, masking, access control, retention |
| Grounding | Relevant business data for AI responses and actions |
| Lineage | Trace where AI context came from |
| Feedback loop | Capture agent outcomes for evaluation and improvement |
Example:
If an AI agent recommends the next best action for a customer, it needs accurate account status, recent service cases, consent preferences, purchase history, and current engagement signals.Important risks:
- Stale data can cause wrong recommendations
- Poor identity resolution can mix customers incorrectly
- Missing consent can create compliance issues
- Bad lineage makes AI behavior hard to explain
- Unchecked feedback loops can reinforce bad decisions
A Salesforce Data Engineer should connect Agentforce/Data Cloud work to fundamentals:
- DLO/DMO mapping
- Unified profiles
- Identity resolution
- Data quality checks
- Real-time ingestion
- Secure access
- Audit and lineage
- Monitoring and feedback metrics
A strong answer is:
Data engineering supports Agentforce by providing trusted, fresh, unified, and governed customer data. I would focus on identity resolution, data quality, lineage, PII controls, real-time signals, and feedback loops so AI agents act on accurate and compliant context.
Final checklist before your loop
- Three project stories with metrics and trade-offs
- SQL pad ready: windows, CTEs, SCD2 audits on CRM schemas
- One system design narrative: CDC → bronze/silver/gold → Data Cloud + Synapse
- Coding practice: top-K, stacks, heaps, stream dedup
- Behavioral answers tied to Trust and collaboration
- Clarified which org/team your loop targets (platform vs product analytics)
For more interview prep on this site, see AWS, Java, and the full Interview Questions category.

