45+ Salesforce Data Engineer Interview Questions and Answers 2026

A Salesforce Data Engineer interview goes deeper than a generic data pipeline loop. You are expected to reason about CRM data models, Data Cloud ingestion, cloud ETL patterns, and operating pipelines in production—not only naming tools on a resume.

Below are 45 questions organized from role context through SQL, platform specifics, system design, and behavioral prompts. Each answer stays collapsed until you are ready to check your work after answering out loud.

If you are preparing for a Salesforce Developer loop (Apex, LWC, triggers, governor limits), see the dedicated Salesforce developer interview questions guide—this page stays focused on data engineering depth.

Role context and interview process

What does a Salesforce Data Engineer actually own in 2026?

A Salesforce Data Engineer usually owns the data pipelines and models that make CRM, analytics, Data Cloud, and AI features reliable.

You are not only moving rows between systems. You are responsible for making sure customer, sales, marketing, and engagement data is accurate, timely, governed, and usable.

Typical ownership includes:

Building batch and near-real-time pipelines
Ingesting data from Salesforce CRM, external systems, APIs, files, and event streams
Designing bronze / silver / gold style data layers
Modeling CRM entities like Account, Contact, Lead, Opportunity, Case, and Activity
Handling data quality checks, SLAs, backfills, retries, and schema changes
Supporting identity resolution and unified customer/profile views in Data Cloud
Partnering with analytics, product, security, and platform teams
Protecting PII with retention, masking, audit, and access controls

A strong candidate does not only say:

text

I build ETL pipelines.

A better answer connects pipelines to business reliability:

text

I build and operate trusted data pipelines that ingest CRM and engagement data, transform it into analytics-ready models, validate quality, and make it available for reporting, Data Cloud, AI, and downstream business workflows.

A strong answer is:

I see the Salesforce Data Engineer role as owning reliable CRM-centric data pipelines from ingestion to serving. That includes data modeling, quality checks, backfills, SLA monitoring, schema evolution, governance, and making data trustworthy for analytics, Data Cloud, and AI use cases.

What is the typical Salesforce Data Engineer interview loop?

The exact loop depends on the team, but a Salesforce Data Engineer interview commonly tests SQL, data modeling, coding, system design, production ownership, and collaboration.

A typical loop may include:

Recruiter screen — background, location, compensation, role fit
Hiring manager round — projects, ownership, communication, business impact
SQL round — joins, window functions, CTEs, aggregations, data quality queries
Coding round — Python, Java, or Scala-style data problems
Data modeling / system design — pipeline design, CDC, warehouse modeling, SLAs
Behavioral round — incidents, ambiguity, conflict, leadership, Trust mindset

For senior roles, expect deeper discussion on:

Pipeline reliability
Cost and performance tuning
Data contracts
Multi-tenant data handling
PII/governance
Backfills and incident recovery
Cross-team influence

For Salesforce-specific teams, prepare examples around:

CRM data semantics
Account / Contact / Opportunity modeling
Data Cloud ingestion and identity resolution
Marketing, Sales, Service, Tableau, MuleSoft, or Slack-adjacent data flows

A strong answer is:

I would expect SQL, data modeling, coding, system design, and behavioral rounds. For senior roles, I would prepare production stories around pipeline failures, backfills, SLA misses, data quality incidents, and trade-offs in architecture.

Do I need Apex and SOQL for a Data Engineer interview at Salesforce?

For most Salesforce Data Engineer roles, Apex is not the main focus. The core interview usually focuses on data engineering skills such as SQL, Python, Spark, cloud data platforms, orchestration, and distributed pipeline design.

However, Salesforce-specific knowledge helps because the data often comes from Salesforce products.

Useful Salesforce knowledge includes:

Area	What to know
CRM objects	Account, Contact, Lead, Opportunity, Case, Activity
SOQL basics	How Salesforce data is queried at source
Data Cloud / Data 360	Data Streams, DLOs, DMOs, identity resolution
CRM semantics	Opportunity stages, close dates, ownership, Account hierarchy
Integration	APIs, CDC, MuleSoft-style integration patterns
Governance	PII, consent, retention, audit, access control

Apex is useful if the role is closer to Salesforce platform development or Sales/Service Cloud customization. But for data-platform roles, interviewers usually care more about whether you can build reliable data pipelines and understand CRM data correctly.

A strong answer is:

Apex is useful for Salesforce platform roles, but for a Data Engineer interview I would focus more on SQL, Python, Spark, cloud warehouses, orchestration, data modeling, CDC, and Data Cloud concepts. I would still learn basic CRM objects and SOQL so I understand the source data.

How should I structure a 4-week prep plan?

A good 4-week plan should prepare you for SQL, data modeling, pipeline design, coding, and production stories.

Week	Focus	What to practice
1	SQL	Window functions, joins, anti-joins, CTEs, dedup, top-N, SCD checks
2	Data modeling	Account, Contact, Opportunity, stage history, customer identity, fact/dimension grain
3	Pipeline design	CDC, bronze/silver/gold, idempotency, backfills, schema evolution, SLAs
4	Coding + behavioral	Python data problems, system design outlines, STAR stories

For SQL, practice CRM-style problems:

Top opportunities by region
Month-over-month revenue
Closed opportunities missing stage history
Duplicate contacts by email
Latest status per account
SCD Type 2 overlap detection

For system design, prepare:

CRM CDC pipeline into a warehouse
Customer identity unification pipeline
Data quality framework
Reliable backfill design
Dashboard-serving mart with SLA

For behavioral, prepare stories on:

Production incident
Data quality issue
Disagreement with stakeholder
Performance improvement
Ambiguous requirement

A strong answer is:

I would spend the first week on SQL, the second on CRM data modeling, the third on pipeline and system design, and the fourth on coding plus behavioral stories. I would prepare real examples with volume, latency, error rate, cost, and business impact.

SQL and data modeling (CRM-centric)

Write SQL to find the top 3 salaries in each department.

This is a classic window-function question. The interviewer wants to see whether you understand partitioning, ranking, and tie handling.

sql


WITH ranked AS (
  SELECT
    employee_id,
    department,
    salary,
    DENSE_RANK() OVER (
      PARTITION BY department
      ORDER BY salary DESC
    ) AS salary_rank
  FROM employees
)
SELECT
  employee_id,
  department,
  salary
FROM ranked
WHERE salary_rank <= 3
ORDER BY department, salary DESC;

Use DENSE_RANK() when ties should receive the same rank. For example, if two employees share the highest salary, both get rank 1.

Use ROW_NUMBER() if the interviewer wants exactly three rows per department, even when salaries tie.

Clarify these points before writing the final query:

Should ties be included?
Should null salaries be ignored?
Should departments with fewer than three employees return fewer rows?
Is salary stored as annual, monthly, or effective-dated compensation?

A strong answer is:

I would use a window function partitioned by department and ordered by salary descending. I would choose DENSE_RANK() if ties should be included, or ROW_NUMBER() if the requirement is exactly three rows per department.

When do you use INNER JOIN vs LEFT JOIN in analytics SQL?

Use an INNER JOIN when matching records must exist on both sides.

Use a LEFT JOIN when you want to preserve all rows from the left table, even if the right table has no match.

Join type	Use when	Example
`INNER JOIN`	Both sides are required	Opportunities that must have valid Accounts
`LEFT JOIN`	Missing relationship is meaningful	All Accounts, even those with no Opportunities

CRM example:

text

Account LEFT JOIN Opportunity

This is useful when you want to show all accounts and count open opportunities. Accounts with zero opportunities should still appear.

A common mistake is putting filters on the right table in the WHERE clause after a LEFT JOIN, which accidentally turns it into an inner join.

Problem pattern:

sql


SELECT *
FROM account a
LEFT JOIN opportunity o
  ON a.account_id = o.account_id
WHERE o.stage_name = 'Closed Won';

Better pattern:

sql


SELECT *
FROM account a
LEFT JOIN opportunity o
  ON a.account_id = o.account_id
 AND o.stage_name = 'Closed Won';

The most important interview habit: state the grain before joining.

Example:

text

The output grain is one row per account.

A strong answer is:

I use INNER JOIN when both tables must match, and LEFT JOIN when I need to preserve the left-side grain and detect missing relationships. In analytics SQL, I also watch filter placement because a WHERE filter on the right table can accidentally remove unmatched rows.

How do you calculate month-over-month sales growth by product category?

First define the grain:

text

One row per month and product category.

Then aggregate revenue and compare it to the previous month using LAG().

sql


WITH monthly AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month_start,
    product_category,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY 1, 2
),
growth AS (
  SELECT
    month_start,
    product_category,
    revenue,
    LAG(revenue) OVER (
      PARTITION BY product_category
      ORDER BY month_start
    ) AS prev_revenue
  FROM monthly
)
SELECT
  month_start,
  product_category,
  revenue,
  prev_revenue,
  CASE
    WHEN prev_revenue IS NULL OR prev_revenue = 0 THEN NULL
    ELSE (revenue - prev_revenue) / prev_revenue
  END AS mom_growth_rate
FROM growth
ORDER BY product_category, month_start;

Explain edge cases:

First month has no previous month
Previous revenue can be zero
Current month may be partial
Refunds or credits may reduce revenue
Multi-currency data needs conversion before aggregation
Timezone affects month boundaries

Salesforce-style example:

text

For Opportunity revenue, I would clarify whether to use created date, close date, booked date, or stage transition date.

A strong answer is:

I would aggregate revenue at month and category grain, use LAG() to get previous-month revenue, then calculate the percentage change. I would also clarify partial months, refunds, currency conversion, and which business date defines the month.

How do you detect overlapping or gapped validity windows in an SCD Type 2 Account history table?

In an SCD Type 2 table, each business key should have non-overlapping validity windows.

Example grain:

text

One row per account version.

Typical columns:

text

account_id, account_name, owner_id, valid_from, valid_to, is_current

Use LEAD() to compare each row with the next row for the same account.

sql


WITH ordered AS (
  SELECT
    account_id,
    valid_from,
    valid_to,
    LEAD(valid_from) OVER (
      PARTITION BY account_id
      ORDER BY valid_from
    ) AS next_valid_from
  FROM account_history
)
SELECT
  account_id,
  valid_from,
  valid_to,
  next_valid_from
FROM ordered
WHERE next_valid_from IS NOT NULL
  AND (
    next_valid_from < valid_to
    OR next_valid_from > valid_to
  );

This assumes valid_to is exclusive. If your model uses inclusive end dates, adjust the comparison.

For example:

text

Exclusive end: next_valid_from should equal valid_to
Inclusive end: next_valid_from should equal valid_to + 1 day

Also validate:

Only one is_current = true row per account
No null valid_from
Current row has open-ended valid_to
Deleted records are represented consistently
Timezone is consistent

A strong answer is:

I would order SCD2 rows by valid_from per account and use LEAD() to compare each row with the next window. Then I would flag overlaps, gaps, multiple current rows, and inconsistent delete or open-ended records.

How do you model Opportunity stage history for daily pipeline-as-of and current pipeline?

First separate current pipeline from pipeline-as-of. They answer different business questions.

Metric type	Meaning
Current pipeline	What is open right now?
Pipeline-as-of	What did the pipeline look like on a past date?

Recommended model:

Opportunity dimension
- opportunity_id
- account/customer key
- owner
- region
- amount
- created date
- close date
- current status
Opportunity stage history fact
- grain: one row per stage transition
- opportunity_id
- stage_name
- stage_changed_at
- optional previous stage
- source event ID
Daily snapshot / as-of mart
- grain: one row per opportunity per as-of date
- latest stage as of that date
- amount as of that date
- owner/region as of that date

For as-of reporting:

Build a date spine
Join opportunities active on each date
Pick the latest stage at or before the as-of timestamp
Use SCD2 or snapshots for changing attributes like owner, region, and amount

Avoid double counting:

Do not join raw stage history directly to opportunity totals without selecting one stage per as-of date
Do not key customer facts only by mutable email
Use stable surrogate keys when identity resolution can merge records

A strong answer is:

I would model stage changes as a transition fact and build a separate daily as-of mart. For each date, I would select the latest stage and attributes valid at that time, while keeping current pipeline as a separate model to avoid mixing current and historical semantics.

How do you validate Opportunities marked closed without matching stage history?

This is a data quality validation question. The goal is to detect Opportunities whose current state says closed, but stage history does not show a matching closed transition.

Example query:

sql


SELECT
  DATE(o.close_date) AS close_day,
  COUNT(*) AS closed_missing_history
FROM opportunity o
LEFT JOIN opportunity_stage_history h
  ON h.opportunity_id = o.opportunity_id
 AND h.stage_name IN ('Closed Won', 'Closed Lost')
 AND h.stage_changed_at <= o.close_date
WHERE o.is_closed = TRUE
  AND h.opportunity_id IS NULL
GROUP BY 1
ORDER BY 1;

What this checks:

Opportunity is marked closed
No matching Closed Won / Closed Lost stage history exists
Stage transition should have happened on or before close date

In an interview, explain possible causes:

Missing CDC event
Late-arriving stage history
Backfill gap
Source system correction
Manual data update
Different stage naming convention
Timezone mismatch around close date

Production handling:

Alert if count crosses threshold
Quarantine bad records if they break downstream metrics
Track the issue by source system and ingestion date
Reprocess from raw/bronze data if possible
Backfill corrected history safely
Add a data quality dashboard

A strong answer is:

I would LEFT JOIN closed Opportunities to closed-stage history and flag rows where the history is missing. Then I would check for late events, source corrections, timezone issues, and backfill gaps before quarantining or fixing the affected records.

What is the purpose of database indexing in a warehouse context?

In transactional databases, indexes help find rows quickly. In warehouses and lakehouses, similar goals are achieved through physical design features such as partitioning, clustering, sorting, statistics, pruning, and file layout.

The goal is to reduce scanned data and improve query performance.

In CRM analytics, common choices are:

Design choice	Example
Partitioning	Partition Opportunity facts by close date or snapshot date
Clustering	Cluster by `account_id`, `opportunity_id`, or region
Sorting	Sort history tables by business key and event time
Statistics	Keep table stats updated for optimizer decisions
Materialized views	Precompute heavy dashboard aggregates

Example:

text

A dashboard filtering Opportunity snapshots by date should not scan all historical partitions.

Physical design matters when:

Tables are very large
Dashboards need low latency
Joins are frequent
Filters are selective
Data is skewed by large customers or regions

Avoid over-optimizing too early. First check query plans, scan volume, join strategy, and dashboard SLA.

A strong answer is:

In a warehouse, the purpose is to reduce scanned data and improve query speed using indexes, partitioning, clustering, sorting, statistics, or materialized aggregates. I would design physical layout based on access patterns and validate with query plans.

What grain and keys do you use for a unified Customer dimension across Sales Cloud Contacts and a marketing email source?

For a unified customer dimension, define the grain carefully.

Recommended grain:

text

One row per resolved person or customer profile.

Do not use raw email as the only customer key. Emails can change, be shared, be missing, or appear differently across systems.

Recommended design:

Unified customer dimension
- customer_key
- resolved profile attributes
- preferred email
- consent status
- created/updated timestamps
Identity link table
- customer_key
- source system
- source record ID
- email
- phone
- external ID
- match confidence
- active flag
Source-specific dimensions
- Sales Cloud Contact
- Lead
- Marketing subscriber
- Web/app user
Facts
- Join through stable keys or identity bridge
- Avoid rewriting historical facts every time an identity merge changes

Important Salesforce/Data Cloud concepts:

Source records can map to unified profiles
Match rules decide which records belong together
Reconciliation rules decide which field value wins
Merge churn should be monitored

Example issue:

text

If two contacts merge into one unified customer, historical Opportunity facts should remain explainable and not double-count revenue.

A strong answer is:

I would use one row per resolved customer profile with a stable surrogate customer_key, and maintain an identity link table from source IDs, emails, phones, and external IDs to that key. Facts should not depend only on mutable email addresses.

Data pipelines, CDC, and Salesforce Data Cloud

When do you choose batch vs near-real-time ingestion for Salesforce CRM data?

Choose the ingestion pattern based on the freshness requirement, cost, source limits, and downstream use case.

Choose batch ingestion	Choose near-real-time / CDC ingestion
Nightly reports	Operational dashboards
Historical backfills	Stage-change or case-status alerts
Large snapshots	Customer profile updates
Lower cost and simpler operations	Fresh data needed within minutes
Source API limits favor scheduled pulls	Downstream AI, segmentation, or automation needs fresh events

Batch ingestion is easier to operate and good for reporting where a few hours of delay is acceptable.

Near-real-time ingestion is better when business users need fresh changes quickly, such as:

Opportunity stage changes
Case escalation alerts
Marketing audience updates
Customer profile refresh in Data Cloud
Fraud/risk signals
Sales pipeline coverage monitoring

A common production design is hybrid:

text

CDC stream for hot changes + nightly batch reconciliation for correctness

The nightly batch helps catch missed events, source corrections, late-arriving records, and drift between source and target.

A strong answer is:

I choose batch when cost and simplicity matter more than freshness, and CDC or near-real-time ingestion when downstream users need updates within minutes. For important CRM data, I prefer a hybrid design: CDC for freshness and nightly reconciliation for correctness.

How do you design a CDC pipeline to be idempotent when events replay or arrive out of order?

An idempotent CDC pipeline produces the same final result even if the same event is processed more than once.

This is important because CDC systems can have:

Duplicate events
Retries
Out-of-order delivery
Consumer restarts
Replays from a checkpoint
Late-arriving updates

Key design elements:

Stable event key
- Use event ID, replay ID, transaction ID, or a composite key like source object ID + sequence + timestamp.
Append-only bronze layer
- Store raw events before transformation so replay and debugging are possible.
Deduplication logic
- Remove duplicates before writing conformed silver tables.
Monotonic merge rules
- Apply newer changes over older changes using commit timestamp or version number.
Checkpointing
- Store the last processed offset/replay token per object stream.
Dead-letter queue
- Send malformed or poison events to a DLQ instead of blocking the full pipeline.
Metrics
- Track lag, duplicate rate, late-event rate, failed records, and merge conflicts.

For Salesforce CRM data, do not rely only on email or name for identity. Use stable Salesforce record IDs, source system metadata, and defined identity rules.

A strong answer is:

I would make the CDC pipeline idempotent by storing raw events, deduplicating by stable event keys, applying version-aware merge logic, checkpointing per stream, and sending bad events to a DLQ. Reprocessing the same event should not change the final result incorrectly.

A new Opportunity field breaks your Data Stream schema validation. What do you do?

First protect the pipeline and downstream users. Then fix the schema properly.

Operational response:

Confirm the change
- Check source metadata, pipeline logs, schema validation error, and affected object.
Protect ingestion
- If possible, land the raw payload in bronze as JSON or semi-structured data so ingestion does not fully stop.
Assess downstream impact
- Find which silver/gold tables, dashboards, Data Cloud mappings, or ML features depend on the Opportunity schema.
Version the schema
- Add the new field as optional first, with a default null value where needed.
Update mappings and contracts
- Update Data Stream mapping, DLO/DMO mapping, dbt model, schema registry, or data contract.
Backfill if needed
- Backfill historical values if the source exposes field history or historical extracts.
Test before promotion
- Run compatibility tests, row-count checks, null checks, and downstream dashboard validation.
Communicate
- Notify consumers if the new field changes business logic or reporting definitions.

The key interview point is that schema evolution should not silently break gold tables or dashboards.

A strong answer is:

I would isolate the schema change, keep raw ingestion running if possible, add the new field in a backward-compatible way, update mappings and contracts, validate downstream models, and backfill historical values only if the business needs them.

What data quality checks do you embed in an ETL pipeline?

Data quality checks should run at multiple stages, not only at the final dashboard layer.

Common checks:

Check type	Example
Completeness	Row count vs source extract
Null checks	Required keys like `account_id` or `opportunity_id` are not null
Uniqueness	No duplicate primary keys or event IDs
Referential integrity	Every Opportunity has a valid Account
Freshness	Latest `ingested_at` is within SLA
Valid values	Opportunity stage belongs to allowed stage list
Business rule	Closed Opportunity must have a terminal stage
Reconciliation	Source count and target count match within tolerance
Anomaly	Sudden drop or spike in records
Distribution	Stage mix or region mix changes unexpectedly

Example Salesforce checks:

Opportunities marked closed must have Closed Won or Closed Lost history
Contacts must have valid Account relationship when required
Duplicate Leads by email should be tracked
Opportunity amount should not be negative unless credits are allowed
Account hierarchy should not contain cycles

Failed checks should not be silently ignored. Depending on severity:

Warn and continue
Quarantine bad records
Block promotion to gold
Alert the data owner
Open an incident if SLA or business reporting is affected

A strong answer is:

I would add checks for completeness, uniqueness, referential integrity, freshness, valid values, business rules, and anomalies. Critical failures should quarantine or block bad data, while alerts and lineage help owners understand which reports or consumers are affected.

How do you handle missing or late-arriving data in pipelines?

Missing and late-arriving data should be handled explicitly because CRM data often arrives out of order or gets corrected after the fact.

For late-arriving events:

Use event time, not only ingestion time
Define a watermark and allowed lateness window
Reprocess affected partitions
Update as-of marts safely
Mark metrics as provisional if they may change

For missing batches:

Compare source manifests with landed files
Validate row counts and checksums
Retry extraction with backoff
Alert when SLA is at risk
Backfill from the last successful checkpoint

For downstream reporting:

Use is_final or data_status flags when metrics can revise
Publish incident notes if dashboards are stale
Show freshness timestamps in important reports

Salesforce-style example:

text

An Opportunity may be marked closed before the matching stage-history event arrives. The pipeline should detect this semantic gap, allow late correction, and avoid publishing misleading close-rate metrics.

A strong answer is:

I handle late data with watermarks, allowed lateness, partition reprocessing, and provisional metrics. For missing data, I compare manifests, retry safely, backfill from checkpoints, and communicate freshness or SLA impact to downstream users.

What are DLOs and DMOs in Salesforce Data Cloud?

In Salesforce Data Cloud / Data 360, DLOs and DMOs are part of the ingestion and harmonization flow.

Term	Meaning
DLO	Data Lake Object. It represents ingested or externally referenced source-aligned data.
DMO	Data Model Object. It represents harmonized data mapped to the Salesforce Customer 360 data model.

Simple flow:

text

Source data → Data Stream → DLO → DMO → Identity Resolution → Unified Profile → Activation

Example:

A marketing subscriber file lands as a DLO
Fields from that DLO are mapped to standard objects such as Individual or Contact Point DMOs
Identity resolution links records into unified profiles
The unified data can be used for segmentation, analytics, activation, or AI use cases

Key interview point:

text

DLO is closer to source-shaped data.
DMO is closer to business/model-shaped data.

Zero-copy federation can also allow Data Cloud/Data 360 to access external data without physically copying all of it into Salesforce, depending on the connector and use case.

A strong answer is:

A DLO stores or references source-aligned data after ingestion, while a DMO is the harmonized model object used in the Customer 360 model. The usual flow is source to DLO, mapping to DMO, identity resolution, unified profile, and activation.

How does identity resolution work in Data Cloud?

Identity resolution links multiple source records that represent the same real-world person, account, or customer.

High-level flow:

Ingest data from sources
Map source data to DMOs
Define match rules
Define reconciliation rules
Run identity resolution
Create unified profiles and link records
Monitor match quality and merge behavior

Important concepts:

Concept	Meaning
Match rules	Decide which records should be linked
Reconciliation rules	Decide which attribute value wins
Unified profile	The resolved customer/person/account view
Link objects	Preserve relationship between source records and unified records

Example match signals:

Exact email
Phone number
CRM Contact ID
External customer ID
Name + address
Loyalty ID

Be careful with over-merging. A false positive can combine two different people into one profile, which is worse than leaving two duplicate profiles in many compliance and personalization scenarios.

Also explain that source records are not erased. Good identity systems preserve lineage so teams can trace a unified profile back to original CRM, marketing, or service records.

A strong answer is:

Identity resolution maps source records into a common model, applies match rules to link records, uses reconciliation rules to choose winning attribute values, and creates unified profiles while preserving source lineage. I would monitor false positives, merge churn, and compliance risk carefully.

A nightly orders pipeline failed at 2 AM. Walk through your response.

This is an incident-management question. The interviewer wants to see calm ownership, not panic debugging.

Step-by-step response:

Acknowledge the alert
- Check orchestrator status, failed task, logs, error message, and last successful run.
Assess impact
- Which tables, dashboards, ML jobs, or business users are affected?
- Is this a freshness issue or bad-data issue?
Stabilize
- Retry only if the failure is safe and likely transient.
- Pause downstream jobs if bad data may be published.
Find root cause
- Schema change
- Source API issue
- Credential expiry
- File missing
- Data skew
- Warehouse capacity issue
- Code deployment bug
Recover
- Restart from checkpoint
- Backfill missing partitions
- Re-run reconciliation checks
- Validate row counts and business metrics
Communicate
- Notify stakeholders about impact, ETA, and workaround.
- Update dashboard freshness notes if needed.
Prevent recurrence
- Add tests, alerts, schema contracts, retry logic, or runbook updates.

A strong answer is:

I would first assess whether users are seeing stale data or wrong data. Then I would stabilize the pipeline, protect downstream consumers, recover from the last good checkpoint, validate reconciliation checks, communicate impact, and add a guardrail to prevent the same failure.

System design

Design a system to count video views in real time.

Start by clarifying requirements:

Views per second
Exact vs approximate counts
Unique views vs total views
Per-video, per-user, per-region, or per-time-window metrics
Latency target
Fraud/bot filtering needs

High-level design:

text

Client/App → Event collector → Kafka/Kinesis/PubSub → Stream processor → Hot store + warehouse → API/dashboard

Components:

Event collector
- Receives view events with event_id, video_id, user_id, and timestamp.
Message queue
- Buffers traffic spikes and decouples producers from consumers.
Stream processor
- Aggregates counts by video_id and time window.
Hot store
- Redis or similar store for current counters.
Analytics store
- ClickHouse, BigQuery, Druid, Snowflake, or lakehouse tables for historical reporting.
Serving API
- Reads hot counters for live views and historical store for older windows.

Reliability concerns:

Deduplicate by event_id
Handle late events with watermarks
Mitigate hot keys for viral videos
Use idempotent writes
Apply bot/fraud filtering
Reconcile stream counts with raw events

For unique viewers, use exact distinct counts if scale allows, or approximate structures like HyperLogLog when approximate accuracy is acceptable.

A strong answer is:

I would ingest view events into a durable stream, aggregate them by video and time window, store live counters in a hot store, persist raw and aggregated data for analytics, and handle deduplication, late events, hot keys, and bot filtering.

Design a pipeline: Salesforce CRM CDC → Data Cloud + Azure Synapse with a 5-minute SLA.

A 5-minute SLA means the design must focus on low-latency ingestion, idempotent processing, and strong observability.

High-level design:

text

Salesforce CRM CDC → Event bus → Bronze raw lake → Silver conformed models → Synapse gold marts + Data Cloud ingestion

Suggested flow:

CDC capture
- Subscribe to Salesforce CDC events for objects such as Account, Contact, Opportunity, and Case.
Streaming buffer
- Use Kafka, Azure Event Hubs, or another durable bus to absorb spikes and support replay.
Bronze layer
- Store raw CDC events in ADLS with metadata: source object, event ID, replay ID, schema version, and ingestion time.
Silver layer
- Deduplicate, validate, apply ordering rules, and upsert into conformed CRM entities.
Gold layer
- Build BI-ready tables in Synapse for dashboards and SLA-based reporting.
Data Cloud path
- Send or expose harmonized data to Data Cloud/Data 360 through Data Streams, connectors, or federation depending on the source and architecture.
Observability
- Monitor end-to-end lag, event backlog, failed events, DLQ count, freshness, and row-count reconciliation.

Key design points:

Use idempotent upserts
Keep raw events for replay
Separate BI serving from profile activation
Use DLQ for poison events
Add schema evolution handling
Reconcile CDC with scheduled batch extracts

A strong answer is:

I would capture Salesforce CDC events into a durable stream, land raw events in bronze, dedupe and merge into silver CRM models, publish gold marts to Synapse, and feed Data Cloud through the right ingestion or federation path. I would monitor lag, DLQ, freshness, and reconciliation against source.

How would you migrate petabyte-scale data from on-prem to cloud with minimal downtime?

For petabyte-scale migration, use a phased approach instead of a single cutover.

Migration plan:

Discovery
- Inventory datasets, owners, size, dependencies, SLAs, PII, and retention rules.
Target design
- Choose storage layout, partitioning, encryption, access model, and compute engine.
Initial bulk copy
- Run parallel transfer to object storage.
- Validate using checksums, row counts, and sample comparisons.
Incremental sync
- Use CDC, logs, timestamp-based extracts, or replication to capture changes after the initial copy.
Dual-run period
- Run old and new systems in parallel.
- Compare business metrics and critical queries.
Cutover
- Use feature flags, DNS/connection string changes, or application configuration.
Rollback plan
- Keep old system readable for a defined period.
- Define clear rollback triggers.
Post-cutover validation
- Compare counts, hashes, aggregates, and business invariants.

Special considerations:

Historical field history
Attachments and files
Data retention policies
PII masking/tokenization
Network bandwidth
Transfer cost
Partition skew
Query compatibility

A strong answer is:

I would do an initial bulk load, validate it, keep systems in sync with CDC or incremental replication, run both systems in parallel, reconcile business metrics, cut over with a rollback plan, and continue validation after migration.

How would you design a distributed job scheduler like Airflow?

A distributed scheduler coordinates workflows, dependencies, retries, and execution across many workers.

Core components:

Component	Responsibility
Metadata DB	Stores DAGs, runs, task state, retries, schedules
Scheduler	Determines which tasks are ready to run
Executor	Sends tasks to workers
Workers	Execute tasks
Web UI/API	Shows status, logs, retries, and history
Log store	Stores task logs
Alerting	Notifies on failure or SLA miss

High-level flow:

text

DAG definition → Scheduler → Queue → Worker → Metadata DB update → Logs/metrics

Design concerns:

Task dependencies
Retries and backoff
Idempotent task execution
Backfills
Catchup behavior
SLA monitoring
Worker autoscaling
Priority queues
Dead task detection
Secure secrets handling

For data platforms, tasks should be idempotent because retries are normal. For example, a load task should safely overwrite or merge a partition without creating duplicates.

A strong answer is:

I would design a scheduler with a metadata database, scheduler, executor, distributed workers, logs, retries, backfill support, and alerting. The most important reliability rule is that tasks should be idempotent because retries and reruns are expected.

How do you handle a sudden 10x traffic spike on your ingestion cluster?

A 10x ingestion spike should be handled with buffering, autoscaling, prioritization, and backpressure.

Immediate response:

Check queue lag
- Is the system dropping data or only falling behind?
Protect critical pipelines
- Prioritize revenue, CRM CDC, security, or customer-impacting data over batch reprocessing.
Scale consumers
- Add workers if partitions and downstream systems can handle it.
Apply backpressure
- Throttle non-critical producers or consumers.
Use durable buffering
- Let Kafka/Event Hubs/Kinesis absorb the burst within retention limits.
Protect downstream systems
- Avoid overwhelming the warehouse, database, Data Cloud ingestion, or APIs.
Monitor cost
- Autoscaling should have limits and alerts.

After the spike:

Replay DLQ
Validate missed records
Reconcile counts
Review partition strategy
Check hot keys
Update capacity planning
Add traffic forecasts if spike was business-driven

A strong answer is:

I would first confirm whether data is being lost or only delayed. Then I would use queue buffering, scale consumers safely, apply backpressure to non-critical flows, protect downstream systems, prioritize critical SLAs, and reconcile data after the spike.

Design a multi-tenant lakehouse fed from Data Cloud where each BU needs PII isolation and audit logs.

A multi-tenant lakehouse must isolate data, enforce access controls, and provide auditability.

Clarify first:

Number of business units
Regulatory requirements
Whether tenants can share any data
Whether isolation must be logical or physical
Who can access PII
Retention and deletion requirements

Design options:

Area	Design choice
Logical isolation	Separate schemas/catalogs per tenant
Physical isolation	Separate storage accounts/buckets for regulated tenants
Access control	IAM roles, RBAC, row-level security, column masking
PII protection	Encryption, tokenization, masking, consent filtering
Audit	Immutable access logs and query history
Lineage	Track source, transform, and serving tables
Observability	Freshness and DQ metrics by tenant

High-level design:

text

Data Cloud / source systems → ingestion layer → tenant-aware bronze/silver/gold → governed access layer → BI/ML/activation

Important controls:

Add tenant_id to shared tables where logical isolation is used
Prevent broad service principals
Encrypt data at rest and in transit
Mask or tokenize PII columns
Log who accessed what data and when
Separate dev/test/prod data access
Apply retention and deletion policies per tenant

A strong answer is:

I would design tenant isolation using separate schemas or storage for regulated tenants, enforce IAM and row/column-level controls, protect PII with masking or tokenization, and maintain immutable audit logs, lineage, and freshness metrics per tenant.

Design a data quality controls framework for an ETL platform.

A data quality framework should make checks reusable, visible, and enforceable across pipelines.

Core components:

Rule definition
- Rules written in SQL, YAML, dbt tests, Great Expectations, or a custom framework.
Rule types
- Null checks
- Uniqueness
- Referential integrity
- Freshness
- Accepted values
- Reconciliation
- Anomaly detection
- Business rules
Execution layer
- Run checks after bronze, silver, and gold stages.
Severity levels
- Warn
- Quarantine
- Block promotion
- Incident
Metadata and ownership
- Every rule has owner, table, column, SLA, severity, and linked dashboard.
Alerting
- Notify the right owner through Slack, email, PagerDuty, or ticketing.
Observability
- Track pass rate, failure trend, freshness, and data quality score.
Lineage integration
- Show which dashboards, marts, or consumers are affected.

Example pilot:

text

Start with Account and Opportunity because they are high-value CRM entities.

Example checks:

Opportunity must have valid Account
Closed Opportunity must have terminal stage
Account ID must be unique
Required owner fields must not be null
Daily Opportunity count should not drop suddenly

A strong answer is:

I would build a declarative data quality framework with reusable rules, severity levels, alerts, ownership, lineage, and trend reporting. Critical checks should block or quarantine bad data, while lower-severity checks warn owners before issues become incidents.

Coding and algorithms (data-focused)

Given a large log file, find the five most frequent error messages in the last hour.

This tests whether you can process large data without loading everything into memory.

Approach:

Read the file line by line
Parse timestamp and error message
Keep only records from the last one hour
Count messages using a hash map
Return the top five using Counter.most_common() or a min-heap

Simple Python approach:

python


from collections import Counter
from datetime import timedelta

def top_errors(lines, now, window=timedelta(hours=1)):
    cutoff = now - window
    counts = Counter()

    for line in lines:
        ts, msg = parse_log_line(line)

        if ts >= cutoff:
            counts[msg] += 1

    return counts.most_common(5)

Output

For very large or streaming logs, improve the design:

Use a min-heap of size 5 for top-K
Use event time instead of processing time
Handle out-of-order logs with a watermark
Normalize error messages so IDs do not create fake unique errors
Store rolling counts in Redis or a stream processor for real-time dashboards

Complexity:

Area	Complexity
Time	`O(n)`
Space	`O(u)` where `u` is unique error messages in the window

A Salesforce-style example:

text

Find the top failing ingestion errors in the last hour for Account and Opportunity CDC events.

A strong answer is:

I would stream the log file line by line, filter records to the last hour, count error messages with a hash map, and return the top five. For production-scale logs, I would handle event-time ordering, message normalization, and rolling-window aggregation.

Validate whether a string of brackets is balanced.

This is a classic stack problem. It checks whether you understand last-in, first-out matching.

Use a stack:

python


def is_balanced(s: str) -> bool:
    stack = []
    pairs = {
        ')': '(',
        ']': '[',
        '}': '{'
    }

    for ch in s:
        if ch in '([{':
            stack.append(ch)
        elif ch in ')]}':
            if not stack or stack.pop() != pairs[ch]:
                return False

    return not stack

Output

How it works:

Push opening brackets to the stack
When a closing bracket appears, check if it matches the latest opening bracket
If anything mismatches, return False
At the end, stack must be empty

Examples:

Input	Output	Reason
`{[]}`	`True`	Properly nested
`{[}]`	`False`	Wrong order
`((()))`	`True`	All matched
`(()`	`False`	Missing closing bracket

Complexity:

Area	Complexity
Time	`O(n)`
Space	`O(n)` in worst case

Possible follow-ups:

Ignore non-bracket characters
Return the position of the first invalid bracket
Validate brackets while streaming a file
Track line and column number for parser errors

A strong answer is:

I would use a stack. Opening brackets are pushed, closing brackets must match the most recent opening bracket, and the stack must be empty at the end. This gives O(n) time and O(n) worst-case space.

Find all pairs in an array that sum to a target.

First clarify whether the interviewer wants:

One pair or all pairs
Indexes or values
Duplicate pairs allowed or not
Same element reuse allowed or not

For all unique value pairs, use a hash set.

python


def two_sum_pairs(nums, target):
    seen = set()
    pairs = set()

    for x in nums:
        need = target - x

        if need in seen:
            pairs.add(tuple(sorted((need, x))))

        seen.add(x)

    return list(pairs)

Output

Example:

python

nums = [1, 2, 3, 4, 5, 3]
target = 6

Output

Possible output:

text

[(1, 5), (2, 4), (3, 3)]

Complexity:

Area	Complexity
Time	`O(n)`
Space	`O(n)`

If the array is already sorted, you can use the two-pointer approach:

text

left at start, right at end
move left/right based on current sum

Two-pointer approach uses O(1) extra space but requires sorted input.

Salesforce/data example:

text

Find pairs of adjustment records whose amounts cancel each other out.

A strong answer is:

I would clarify duplicates and whether indexes or values are required. For unsorted input, I would use a hash set for O(n) time. For sorted input, I would use two pointers to reduce extra space.

Implement a function to flatten a nested dictionary.

Flattening a nested dictionary is common when semi-structured JSON lands in a bronze layer and needs to be converted into column-style paths.

Example input:

python


{
    "account": {
        "id": "A1",
        "owner": {
            "id": "U1"
        }
    }
}

Output

Expected output:

python


{
    "account.id": "A1",
    "account.owner.id": "U1"
}

Output

Recursive solution:

python


def flatten(obj, prefix=""):
    out = {}

    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key

        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value

    return out

Output

Complexity:

Area	Complexity
Time	`O(n)` where `n` is total keys
Space	`O(n)` for output and recursion stack

Important follow-ups:

How should lists be handled?
What if keys already contain dots?
Should null values be kept?
Should deeply nested JSON use iterative traversal to avoid recursion depth issues?
Should output preserve data types?

For CDC pipelines, this helps when raw Salesforce or external events land as JSON and you need predictable field paths for validation or schema inference.

A strong answer is:

I would recursively traverse the dictionary, build a dotted path for each nested key, and store leaf values in the output map. I would also clarify how to handle arrays, nulls, dotted keys, and very deep JSON.

Merge k sorted lists of records by timestamp.

This is a good data-engineering coding question because it resembles merging sorted CDC shards, partitioned files, or ordered event streams.

Use a min-heap of size k.

python


import heapq

def merge_k_lists(lists):
    heap = []

    for list_index, records in enumerate(lists):
        if records:
            heapq.heappush(
                heap,
                (records[0][0], list_index, 0)
            )

    result = []

    while heap:
        ts, list_index, record_index = heapq.heappop(heap)
        records = lists[list_index]

        result.append(records[record_index])

        next_index = record_index + 1

        if next_index < len(records):
            heapq.heappush(
                heap,
                (records[next_index][0], list_index, next_index)
            )

    return result

Output

Assumption:

text

Each record is sorted by timestamp, and timestamp is at index 0.

Example record:

python

("2026-06-27T10:00:00Z", "event_1")

Output

Complexity:

Area	Complexity
Time	`O(N log k)`
Space	`O(k)` excluding output

Where:

N = total records across all lists
k = number of sorted lists

Important follow-ups:

What if timestamps are equal?
Do we need stable ordering?
Are records already sorted by event time or ingestion time?
Can lists be streams instead of in-memory arrays?
How do we handle late events?

A strong answer is:

I would use a min-heap containing the current record from each sorted list. Each pop gives the next earliest timestamp, and I push the next record from the same list. This gives O(N log k) time and works well for merging ordered CDC shards.

How do you deduplicate a stream of CRM events in code?

To deduplicate a CRM event stream, use a stable event identifier and keep track of processed events.

Simple logic:

python


seen = set()

def process(event):
    if event.id in seen:
        return None

    seen.add(event.id)
    return transform(event)

Output

This works for small in-memory examples, but production streams need bounded and durable state.

Production design options:

Option	Use case
In-memory set	Small coding exercise or short-lived process
TTL cache	Dedup within replay window
Redis / DynamoDB / RocksDB	Durable dedup state
Database unique constraint	Idempotent sink writes
Bloom filter	Memory-efficient approximate dedup
Bronze table dedup	Reprocess-safe batch/stream pipeline

Important considerations:

Event ID must be stable
Dedup state should survive restarts
State must be bounded using TTL or compaction
Replays should not produce duplicate target rows
Sink writes should be idempotent
Checkpointing should align with successful writes

For Salesforce CDC, possible keys include:

text

event_id, replay_id, source_object_id, commit_timestamp, change_sequence

Do not deduplicate only by business fields like email or account name because those can change and may not uniquely identify events.

A strong answer is:

I would deduplicate by a stable event ID, store seen IDs in a bounded or durable state store, and make the sink idempotent with upserts or unique constraints. In production, the dedup state must survive retries and restarts.

Behavioral, leadership, and culture

Why Salesforce, and which company value resonates with you?

A strong answer should connect Salesforce's values to your actual data engineering work, not generic praise.

Good values to connect with data work:

Salesforce value	How to connect it to data engineering
Trust	Data privacy, PII handling, encryption, access control, auditability
Customer Success	Reliable pipelines and metrics that business users can act on
Innovation	Data Cloud, Agentforce, AI-ready data, real-time customer context
Equality	Fair access to data, inclusive collaboration, clear documentation
Sustainability	Efficient pipelines, cost-aware compute, less wasteful processing

A weak answer sounds like:

text

Salesforce is a big company and I want to work on interesting products.

A stronger answer connects your experience:

text

Trust resonates with me because data engineering directly affects customer confidence. If CRM, consent, or PII pipelines are wrong, downstream analytics and AI decisions can be wrong. In my past work, I handled access controls, audit logs, data quality checks, and incident communication, so I see Trust as a daily engineering responsibility.

You can also mention interest in Data Cloud / Data 360 and Agentforce if it is genuine:

text

I am interested in how Salesforce uses unified customer data to power analytics, automation, and AI agents. That kind of system needs strong data engineering foundations: identity resolution, freshness, lineage, governance, and quality checks.

A strong answer is:

Trust resonates most with me because data engineering is directly tied to reliability, privacy, and customer confidence. I want to work on systems where clean, governed, and timely data powers customer-facing analytics, Data Cloud, and AI use cases.

Tell me about a technical disagreement with a senior engineer.

Use a STAR structure and show that you can disagree without making it personal.

A good example can come from:

Streaming vs batch ingestion
SCD2 vs daily snapshot
SQL model vs denormalized mart
Redis cache vs warehouse optimization
Custom pipeline vs managed connector
Strict data quality gate vs warning-only check

Example structure:

STAR step	What to say
Situation	Team disagreed on whether a CRM mart should use SCD2 or daily snapshots
Task	You needed to support accurate as-of reporting without making the pipeline too complex
Action	You documented trade-offs, checked query patterns, ran a small benchmark, and reviewed SLA needs
Result	Team chose a hybrid model: SCD2 for core dimensions and daily snapshots for high-traffic dashboard metrics

What interviewers want to hear:

You listened first
You used data, not ego
You considered maintainability
You aligned the decision with customer or business impact
You preserved the relationship

Avoid saying:

text

The senior engineer was wrong, so I proved my approach was better.

Say instead:

text

We had different assumptions, so I made the trade-offs visible and helped the team choose based on SLA, cost, and maintainability.

A strong answer is:

I handled the disagreement by clarifying the requirement, documenting trade-offs, validating assumptions with a small test, and aligning the final decision with SLA, maintenance cost, and business impact rather than personal preference.

Describe a time you spotted a security risk in a data system.

Choose a story where you identified a real risk, acted responsibly, and improved the system.

Good examples:

PII copied into a sandbox without masking
Overly broad service principal access to storage
API key committed to Git
Data export shared with too many users
Logs containing tokens, emails, or customer identifiers
No audit trail for sensitive table access
Production data available from a non-production notebook

Answer structure:

Risk found
- What did you notice?
Impact
- What could have gone wrong?
Action
- Who did you inform?
- Did you rotate credentials?
- Did you restrict access?
- Did you mask or delete exposed data?
Prevention
- What control did you add?

Example:

text

I found that a shared service principal had write access to multiple storage containers, including PII datasets that the pipeline did not need. I raised it with the platform and security teams, helped reduce permissions to least privilege, rotated credentials, and added a review step for new data access requests.

Strong prevention actions:

Secret scanning in CI
IAM least-privilege review
Token rotation
Column masking
Audit logging
Access review workflow
Policy-as-code checks

A strong answer is:

I identified the risk, assessed the blast radius, involved the right owners, fixed the exposure, and added a preventive control such as secret scanning, least-privilege IAM, masking, or audit logging so the issue would not repeat.

Tell me about a project that failed. What did you learn?

Pick a failure where you learned something meaningful and changed your behavior.

Good data-engineering examples:

Pipeline launched without enough data quality gates
Dashboard users lost trust after metric definitions changed
Identity merge churn caused duplicate or missing customer counts
Backfill took longer than expected because dependencies were not mapped
Streaming pipeline missed late-arriving data
Gold table design did not match actual business questions

Avoid blaming:

text

The stakeholder kept changing requirements.

Better framing:

text

We did not define the data contract and success criteria clearly enough before building.

Example answer:

text

In one project, we built a gold sales mart quickly, but we did not define the exact grain and metric ownership early. After launch, different teams interpreted pipeline value differently, and users lost confidence in the dashboard. I helped pause further rollout, documented metric definitions, added reconciliation checks, and created a sign-off process with business owners. The lesson was that trusted data products need contracts, ownership, and validation before wide release.

What the interviewer wants:

You take ownership
You learned a practical lesson
You changed your process
You do not blame others
You understand how data trust is lost and rebuilt

A strong answer is:

The project failed because we did not define ownership, grain, and data quality gates early enough. I learned to align on data contracts, success metrics, validation rules, and stakeholder sign-off before promoting data to trusted reporting layers.

How do you prioritize when two executives want conflicting high-priority data work?

This tests whether you can operate in a large enterprise without silently choosing favorites.

A good framework:

Clarify business impact
- Revenue, compliance, customer-facing impact, executive reporting, operational risk.
Clarify deadline
- Is it truly urgent or just important?
Estimate effort and dependencies
- Can one request unblock the other?
- Is there a shared data model or pipeline component?
Identify risk
- What happens if each request is delayed?
Offer phased delivery
- MVP metric first, full mart later.
- Temporary extract first, governed pipeline later.
Escalate transparently
- Present trade-offs to the decision maker with facts.
Document the decision
- Capture scope, timeline, owner, and follow-up.

Example trade-off matrix:

Request	Impact	Deadline	Effort	Risk if delayed
Revenue dashboard fix	High	This week	Medium	Forecasting impact
New segmentation mart	Medium	Next month	High	Campaign delay

Avoid:

text

I would work on whichever executive is louder.

Better:

text

I would make the trade-off visible and ask for prioritization based on business impact and risk.

A strong answer is:

I would compare business impact, urgency, effort, dependencies, and risk, then present a transparent trade-off matrix. If both cannot be done, I would propose phased delivery and get explicit prioritization rather than silently moving work around.

Describe leading through an ambiguous project crisis.

Choose a story where the problem was unclear, requirements changed, or multiple teams were blocked.

Good examples:

Acquisition data had to be integrated quickly
New CRM source had undocumented fields
A dashboard broke before quarter-end reporting
Data Cloud identity rules caused unexpected merge behavior
Source system changed schema without notice
Pipeline ownership was unclear during an incident

Strong answer structure:

Stabilize
- Identify immediate user impact and protect critical outputs.
Clarify known vs unknown
- Separate facts, assumptions, and open questions.
Create a short plan
- Time-box investigation.
- Assign owners.
- Define decision points.
Communicate frequently
- Explain what is affected, what is safe, and what is still unknown.
Deliver incrementally
- Provide a partial fix or MVP while deeper correction continues.
Close with learning
- Update runbooks, tests, contracts, or escalation paths.

Example:

text

During a CRM migration, we discovered late that several account hierarchy fields had inconsistent meanings across regions. I paused promotion to gold tables, created a data profiling report, met with regional owners, defined a temporary mapping for critical dashboards, and tracked unresolved fields in a risk register. We delivered the quarter-end dashboard on time with documented limitations, then followed up with a permanent contract.

A strong answer is:

I lead ambiguous crises by separating facts from assumptions, stabilizing the highest-impact users first, assigning clear owners, communicating known risks, and delivering incremental fixes while building the longer-term solution.

Lead and senior IC depth

How do you ensure scalability and performance of a large Salesforce-related data platform?

A lead-level answer should cover architecture, operations, and team practices.

Key areas:

Area	What to do
Data modeling	Define correct grain, keys, and ownership
Partitioning	Partition large facts by event date, snapshot date, or ingestion date
Skew handling	Watch large accounts, hot keys, and uneven partitions
Processing	Use bulk and distributed processing instead of row-by-row API pulls
CDC	Add backpressure, replay support, and idempotent merges
Serving	Pre-aggregate dashboard metrics and avoid heavy ad hoc queries on raw tables
Governance	Define standards, contracts, reviews, and ownership
Observability	Track freshness, lag, cost, quality, and SLA breaches

Salesforce-specific scaling concerns:

Account hierarchies can create skew
Opportunity history can grow quickly
CDC events can arrive in bursts
Identity resolution can change profile counts
Multi-tenant or business-unit data may require isolation
PII controls can affect how data is joined and served

A lead candidate should also mention people and process:

Establish design reviews
Mentor engineers on data modeling and performance
Create reusable pipeline templates
Set performance budgets
Run capacity reviews
Practice incident game days

A strong answer is:

I would scale the platform through correct data grain, partitioning, skew handling, bulk processing, idempotent CDC design, pre-aggregated serving layers, governance standards, and strong observability. As a lead, I would also create reusable patterns and mentor teams to apply them consistently.

How do you manage technical debt on a data platform?

Technical debt in data platforms is dangerous because it often appears later as wrong metrics, slow pipelines, high cost, or poor trust.

Common data-platform debt:

Unowned tables
No data contracts
No tests for critical transformations
Duplicated pipeline logic
Hardcoded business rules
Manual backfills
Poor naming conventions
No lineage or documentation
Expensive queries nobody owns
Legacy pipelines no one wants to touch

Management framework:

Make debt visible
- Create a debt register with owner, impact, risk, and effort.
Prioritize by customer impact
- Fix debt that affects correctness, trust, SLA, cost, or security first.
Reserve capacity
- Allocate regular sprint capacity for debt reduction.
Prevent new debt
- Use PR templates, naming standards, tests, and data contracts.
Refactor safely
- Use a strangler pattern: run old and new pipelines in parallel until reconciliation passes.
Measure improvement
- Track pipeline runtime, failure rate, cost, test coverage, and incident count.

Example:

text

We replaced a legacy Opportunity mart by building a new model in parallel, validating row counts and business metrics for two weeks, then switching dashboards gradually.

A strong answer is:

I manage technical debt by making it visible, ranking it by business impact and risk, reserving capacity to reduce it, and preventing new debt with standards, tests, contracts, and design reviews. For risky refactors, I run old and new pipelines in parallel until reconciliation passes.

When do you buy or configure vs build custom pipeline logic?

The decision should be based on reliability, maintenance cost, flexibility, and business differentiation.

Prefer managed or out-of-the-box options when:

Standard connector supports the source well
Required SLA is covered
Transformations are simple
Security and compliance controls are built in
Long-term maintenance should be low
The integration is not a competitive differentiator

Build custom logic when:

You need complex CDC handling
Identity matching rules are custom
Transformations are highly business-specific
You need advanced deduplication or ordering logic
Multi-cloud or hybrid architecture is required
Existing connector cannot meet latency, schema, or governance needs

Decision rubric:

Question	Why it matters
Does a reliable connector already exist?	Avoid unnecessary custom code
What is the SLA?	Some managed tools may not meet low-latency needs
Who will own failures?	Custom code needs on-call ownership
How complex are transforms?	Complex logic may require custom pipelines
What is the total cost?	Include build, support, monitoring, and upgrades
Is there vendor lock-in risk?	Important for long-term architecture

Avoid building custom pipelines only because it feels more flexible. Custom code also brings monitoring, alerting, testing, security, and upgrade responsibility.

A strong answer is:

I prefer managed connectors or configuration when they meet SLA, security, and transformation needs. I build custom logic only when the business requirements, latency, identity rules, or integration complexity cannot be handled safely by standard tools.

Where does MuleSoft fit in Salesforce enterprise integration?

MuleSoft fits in enterprise integration by providing reusable APIs and integration flows between Salesforce and other systems.

A common MuleSoft pattern is API-led connectivity:

API layer	Purpose
System APIs	Connect to systems of record such as ERP, CRM, databases, or legacy apps
Process APIs	Combine and orchestrate business logic across systems
Experience APIs	Serve specific channels such as web, mobile, partner apps, or internal tools

Where MuleSoft helps:

Connect Salesforce CRM with ERP, billing, support, marketing, or data platforms
Transform payloads between systems
Apply routing and orchestration
Enforce authentication and rate limits
Provide reusable integration contracts
Improve observability of enterprise APIs
Reduce point-to-point integration sprawl

For data engineers, MuleSoft may be relevant in two ways:

You consume APIs exposed by MuleSoft as pipeline sources.
You coordinate event contracts and payload formats with integration teams.

When MuleSoft is better than point-to-point integration:

Multiple consumers need the same business data
Enterprise governance matters
APIs need versioning and monitoring
Direct source access is risky or inconsistent
Integrations should be reusable across teams

A strong answer is:

MuleSoft provides API-led connectivity between Salesforce and enterprise systems. As a data engineer, I may consume MuleSoft APIs, coordinate event contracts, and rely on reusable System, Process, and Experience APIs instead of building fragile point-to-point integrations.

Describe a challenging CRM or data migration you led.

Use this as a hybrid behavioral and technical answer. Show leadership, planning, validation, and measurable results.

Good structure:

Scope
- What was migrated?
- From which system to which target?
- How much data?
- Which users or business processes were affected?
Challenges
- Duplicate contacts/accounts
- Bad account hierarchy
- Missing required fields
- Attachment migration
- Historical field history
- Downtime window
- Data ownership conflicts
- Consent and PII requirements
Plan
- Profiling
- Mapping
- Cleansing
- Test loads
- Reconciliation
- Parallel run
- Cutover
- Rollback
Execution
- Built migration pipelines
- Created validation queries
- Coordinated with business owners
- Ran mock cutovers
- Tracked defects and sign-offs
Result
- Match rate
- Downtime
- Error reduction
- Adoption
- Reporting accuracy
- Business outcome

Example:

text

I led a CRM migration where duplicate Accounts and inconsistent region mappings were breaking sales reporting. We profiled the data, created match rules, ran test loads, reconciled key metrics with business owners, and used a phased cutover. The final migration completed within the planned downtime window, and post-cutover reconciliation passed for the critical revenue and pipeline dashboards.

A strong answer is:

I would describe the migration scope, data quality challenges, cutover plan, validation strategy, stakeholder coordination, and measurable result. The key is showing that I handled both the technical migration and the business trust needed for adoption.

How does data engineering support Agentforce and AI features in 2026?

Agentforce and AI features depend on trusted, governed, and timely data. Raw CRM dumps are not enough.

Data engineering supports AI by providing:

Need	Data engineering responsibility
Trusted context	Clean and validated customer data
Unified profile	Identity resolution across CRM, service, marketing, and external sources
Freshness	CDC or near-real-time pipelines for important events
Governance	PII controls, consent, masking, access control, retention
Grounding	Relevant business data for AI responses and actions
Lineage	Trace where AI context came from
Feedback loop	Capture agent outcomes for evaluation and improvement

Example:

text

If an AI agent recommends the next best action for a customer, it needs accurate account status, recent service cases, consent preferences, purchase history, and current engagement signals.

Important risks:

Stale data can cause wrong recommendations
Poor identity resolution can mix customers incorrectly
Missing consent can create compliance issues
Bad lineage makes AI behavior hard to explain
Unchecked feedback loops can reinforce bad decisions

A Salesforce Data Engineer should connect Agentforce/Data Cloud work to fundamentals:

DLO/DMO mapping
Unified profiles
Identity resolution
Data quality checks
Real-time ingestion
Secure access
Audit and lineage
Monitoring and feedback metrics

A strong answer is:

Data engineering supports Agentforce by providing trusted, fresh, unified, and governed customer data. I would focus on identity resolution, data quality, lineage, PII controls, real-time signals, and feedback loops so AI agents act on accurate and compliant context.

Final checklist before your loop

Three project stories with metrics and trade-offs
SQL pad ready: windows, CTEs, SCD2 audits on CRM schemas
One system design narrative: CDC → bronze/silver/gold → Data Cloud + Synapse
Coding practice: top-K, stacks, heaps, stream dedup
Behavioral answers tied to Trust and collaboration
Clarified which org/team your loop targets (platform vs product analytics)

For more interview prep on this site, see AWS, Java, and the full Interview Questions category.

Role context and interview process

SQL and data modeling (CRM-centric)

Data pipelines, CDC, and Salesforce Data Cloud

System design

Coding and algorithms (data-focused)

Behavioral, leadership, and culture

Lead and senior IC depth

Final checklist before your loop

Related Articles

Salesforce Developer Interview Questions and Answers

SQL Technical Interview Questions and Answers

C and C++ Interview Questions and Answers

Search GoLinuxCloud