Salesforce Data Engineer Interview Questions and Answers

Salesforce Data Engineer interview questions for 2026: SQL, Data Cloud, CDC pipelines, system design, coding, and behavioral prep for senior and lead roles.

Published

Updated

Tech reviewed byDeepak Prasad

Salesforce Data Engineer Interview Questions and Answers

A Salesforce Data Engineer interview goes deeper than a generic data pipeline loop. You are expected to reason about CRM data models, Data Cloud ingestion, cloud ETL patterns, and operating pipelines in production—not only naming tools on a resume.

Below are 45 questions organized from role context through SQL, platform specifics, system design, and behavioral prompts. Each answer stays collapsed until you are ready to check your work after answering out loud.

If you are preparing for a Salesforce Developer loop (Apex, LWC, triggers, governor limits), see the dedicated Salesforce developer interview questions guide—this page stays focused on data engineering depth.


Role context and interview process

What does a Salesforce Data Engineer actually own in 2026?

A Salesforce Data Engineer usually owns the data pipelines and models that make CRM, analytics, Data Cloud, and AI features reliable.

You are not only moving rows between systems. You are responsible for making sure customer, sales, marketing, and engagement data is accurate, timely, governed, and usable.

Typical ownership includes:

  • Building batch and near-real-time pipelines
  • Ingesting data from Salesforce CRM, external systems, APIs, files, and event streams
  • Designing bronze / silver / gold style data layers
  • Modeling CRM entities like Account, Contact, Lead, Opportunity, Case, and Activity
  • Handling data quality checks, SLAs, backfills, retries, and schema changes
  • Supporting identity resolution and unified customer/profile views in Data Cloud
  • Partnering with analytics, product, security, and platform teams
  • Protecting PII with retention, masking, audit, and access controls

A strong candidate does not only say:

text
I build ETL pipelines.

A better answer connects pipelines to business reliability:

text
I build and operate trusted data pipelines that ingest CRM and engagement data, transform it into analytics-ready models, validate quality, and make it available for reporting, Data Cloud, AI, and downstream business workflows.

A strong answer is:

I see the Salesforce Data Engineer role as owning reliable CRM-centric data pipelines from ingestion to serving. That includes data modeling, quality checks, backfills, SLA monitoring, schema evolution, governance, and making data trustworthy for analytics, Data Cloud, and AI use cases.

What is the typical Salesforce Data Engineer interview loop?

The exact loop depends on the team, but a Salesforce Data Engineer interview commonly tests SQL, data modeling, coding, system design, production ownership, and collaboration.

A typical loop may include:

  1. Recruiter screen — background, location, compensation, role fit
  2. Hiring manager round — projects, ownership, communication, business impact
  3. SQL round — joins, window functions, CTEs, aggregations, data quality queries
  4. Coding round — Python, Java, or Scala-style data problems
  5. Data modeling / system design — pipeline design, CDC, warehouse modeling, SLAs
  6. Behavioral round — incidents, ambiguity, conflict, leadership, Trust mindset

For senior roles, expect deeper discussion on:

  • Pipeline reliability
  • Cost and performance tuning
  • Data contracts
  • Multi-tenant data handling
  • PII/governance
  • Backfills and incident recovery
  • Cross-team influence

For Salesforce-specific teams, prepare examples around:

  • CRM data semantics
  • Account / Contact / Opportunity modeling
  • Data Cloud ingestion and identity resolution
  • Marketing, Sales, Service, Tableau, MuleSoft, or Slack-adjacent data flows

A strong answer is:

I would expect SQL, data modeling, coding, system design, and behavioral rounds. For senior roles, I would prepare production stories around pipeline failures, backfills, SLA misses, data quality incidents, and trade-offs in architecture.

Do I need Apex and SOQL for a Data Engineer interview at Salesforce?

For most Salesforce Data Engineer roles, Apex is not the main focus. The core interview usually focuses on data engineering skills such as SQL, Python, Spark, cloud data platforms, orchestration, and distributed pipeline design.

However, Salesforce-specific knowledge helps because the data often comes from Salesforce products.

Useful Salesforce knowledge includes:

Area What to know
CRM objects Account, Contact, Lead, Opportunity, Case, Activity
SOQL basics How Salesforce data is queried at source
Data Cloud / Data 360 Data Streams, DLOs, DMOs, identity resolution
CRM semantics Opportunity stages, close dates, ownership, Account hierarchy
Integration APIs, CDC, MuleSoft-style integration patterns
Governance PII, consent, retention, audit, access control

Apex is useful if the role is closer to Salesforce platform development or Sales/Service Cloud customization. But for data-platform roles, interviewers usually care more about whether you can build reliable data pipelines and understand CRM data correctly.

A strong answer is:

Apex is useful for Salesforce platform roles, but for a Data Engineer interview I would focus more on SQL, Python, Spark, cloud warehouses, orchestration, data modeling, CDC, and Data Cloud concepts. I would still learn basic CRM objects and SOQL so I understand the source data.

How should I structure a 4-week prep plan?

A good 4-week plan should prepare you for SQL, data modeling, pipeline design, coding, and production stories.

Week Focus What to practice
1 SQL Window functions, joins, anti-joins, CTEs, dedup, top-N, SCD checks
2 Data modeling Account, Contact, Opportunity, stage history, customer identity, fact/dimension grain
3 Pipeline design CDC, bronze/silver/gold, idempotency, backfills, schema evolution, SLAs
4 Coding + behavioral Python data problems, system design outlines, STAR stories

For SQL, practice CRM-style problems:

  • Top opportunities by region
  • Month-over-month revenue
  • Closed opportunities missing stage history
  • Duplicate contacts by email
  • Latest status per account
  • SCD Type 2 overlap detection

For system design, prepare:

  • CRM CDC pipeline into a warehouse
  • Customer identity unification pipeline
  • Data quality framework
  • Reliable backfill design
  • Dashboard-serving mart with SLA

For behavioral, prepare stories on:

  • Production incident
  • Data quality issue
  • Disagreement with stakeholder
  • Performance improvement
  • Ambiguous requirement

A strong answer is:

I would spend the first week on SQL, the second on CRM data modeling, the third on pipeline and system design, and the fourth on coding plus behavioral stories. I would prepare real examples with volume, latency, error rate, cost, and business impact.


SQL and data modeling (CRM-centric)

Write SQL to find the top 3 salaries in each department.

This is a classic window-function question. The interviewer wants to see whether you understand partitioning, ranking, and tie handling.

sql
WITH ranked AS (
  SELECT
    employee_id,
    department,
    salary,
    DENSE_RANK() OVER (
      PARTITION BY department
      ORDER BY salary DESC
    ) AS salary_rank
  FROM employees
)
SELECT
  employee_id,
  department,
  salary
FROM ranked
WHERE salary_rank <= 3
ORDER BY department, salary DESC;

Use DENSE_RANK() when ties should receive the same rank. For example, if two employees share the highest salary, both get rank 1.

Use ROW_NUMBER() if the interviewer wants exactly three rows per department, even when salaries tie.

Clarify these points before writing the final query:

  • Should ties be included?
  • Should null salaries be ignored?
  • Should departments with fewer than three employees return fewer rows?
  • Is salary stored as annual, monthly, or effective-dated compensation?

A strong answer is:

I would use a window function partitioned by department and ordered by salary descending. I would choose DENSE_RANK() if ties should be included, or ROW_NUMBER() if the requirement is exactly three rows per department.

When do you use INNER JOIN vs LEFT JOIN in analytics SQL?

Use an INNER JOIN when matching records must exist on both sides.

Use a LEFT JOIN when you want to preserve all rows from the left table, even if the right table has no match.

Join type Use when Example
INNER JOIN Both sides are required Opportunities that must have valid Accounts
LEFT JOIN Missing relationship is meaningful All Accounts, even those with no Opportunities

CRM example:

text
Account LEFT JOIN Opportunity

This is useful when you want to show all accounts and count open opportunities. Accounts with zero opportunities should still appear.

A common mistake is putting filters on the right table in the WHERE clause after a LEFT JOIN, which accidentally turns it into an inner join.

Problem pattern:

sql
SELECT *
FROM account a
LEFT JOIN opportunity o
  ON a.account_id = o.account_id
WHERE o.stage_name = 'Closed Won';

Better pattern:

sql
SELECT *
FROM account a
LEFT JOIN opportunity o
  ON a.account_id = o.account_id
 AND o.stage_name = 'Closed Won';

The most important interview habit: state the grain before joining.

Example:

text
The output grain is one row per account.

A strong answer is:

I use INNER JOIN when both tables must match, and LEFT JOIN when I need to preserve the left-side grain and detect missing relationships. In analytics SQL, I also watch filter placement because a WHERE filter on the right table can accidentally remove unmatched rows.

How do you calculate month-over-month sales growth by product category?

First define the grain:

text
One row per month and product category.

Then aggregate revenue and compare it to the previous month using LAG().

sql
WITH monthly AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month_start,
    product_category,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY 1, 2
),
growth AS (
  SELECT
    month_start,
    product_category,
    revenue,
    LAG(revenue) OVER (
      PARTITION BY product_category
      ORDER BY month_start
    ) AS prev_revenue
  FROM monthly
)
SELECT
  month_start,
  product_category,
  revenue,
  prev_revenue,
  CASE
    WHEN prev_revenue IS NULL OR prev_revenue = 0 THEN NULL
    ELSE (revenue - prev_revenue) / prev_revenue
  END AS mom_growth_rate
FROM growth
ORDER BY product_category, month_start;

Explain edge cases:

  • First month has no previous month
  • Previous revenue can be zero
  • Current month may be partial
  • Refunds or credits may reduce revenue
  • Multi-currency data needs conversion before aggregation
  • Timezone affects month boundaries

Salesforce-style example:

text
For Opportunity revenue, I would clarify whether to use created date, close date, booked date, or stage transition date.

A strong answer is:

I would aggregate revenue at month and category grain, use LAG() to get previous-month revenue, then calculate the percentage change. I would also clarify partial months, refunds, currency conversion, and which business date defines the month.

How do you detect overlapping or gapped validity windows in an SCD Type 2 Account history table?

In an SCD Type 2 table, each business key should have non-overlapping validity windows.

Example grain:

text
One row per account version.

Typical columns:

text
account_id, account_name, owner_id, valid_from, valid_to, is_current

Use LEAD() to compare each row with the next row for the same account.

sql
WITH ordered AS (
  SELECT
    account_id,
    valid_from,
    valid_to,
    LEAD(valid_from) OVER (
      PARTITION BY account_id
      ORDER BY valid_from
    ) AS next_valid_from
  FROM account_history
)
SELECT
  account_id,
  valid_from,
  valid_to,
  next_valid_from
FROM ordered
WHERE next_valid_from IS NOT NULL
  AND (
    next_valid_from < valid_to
    OR next_valid_from > valid_to
  );

This assumes valid_to is exclusive. If your model uses inclusive end dates, adjust the comparison.

For example:

text
Exclusive end: next_valid_from should equal valid_to
Inclusive end: next_valid_from should equal valid_to + 1 day

Also validate:

  • Only one is_current = true row per account
  • No null valid_from
  • Current row has open-ended valid_to
  • Deleted records are represented consistently
  • Timezone is consistent

A strong answer is:

I would order SCD2 rows by valid_from per account and use LEAD() to compare each row with the next window. Then I would flag overlaps, gaps, multiple current rows, and inconsistent delete or open-ended records.

How do you model Opportunity stage history for daily pipeline-as-of and current pipeline?

First separate current pipeline from pipeline-as-of. They answer different business questions.

Metric type Meaning
Current pipeline What is open right now?
Pipeline-as-of What did the pipeline look like on a past date?

Recommended model:

  1. Opportunity dimension

    • opportunity_id
    • account/customer key
    • owner
    • region
    • amount
    • created date
    • close date
    • current status
  2. Opportunity stage history fact

    • grain: one row per stage transition
    • opportunity_id
    • stage_name
    • stage_changed_at
    • optional previous stage
    • source event ID
  3. Daily snapshot / as-of mart

    • grain: one row per opportunity per as-of date
    • latest stage as of that date
    • amount as of that date
    • owner/region as of that date

For as-of reporting:

  • Build a date spine
  • Join opportunities active on each date
  • Pick the latest stage at or before the as-of timestamp
  • Use SCD2 or snapshots for changing attributes like owner, region, and amount

Avoid double counting:

  • Do not join raw stage history directly to opportunity totals without selecting one stage per as-of date
  • Do not key customer facts only by mutable email
  • Use stable surrogate keys when identity resolution can merge records

A strong answer is:

I would model stage changes as a transition fact and build a separate daily as-of mart. For each date, I would select the latest stage and attributes valid at that time, while keeping current pipeline as a separate model to avoid mixing current and historical semantics.

How do you validate Opportunities marked closed without matching stage history?

This is a data quality validation question. The goal is to detect Opportunities whose current state says closed, but stage history does not show a matching closed transition.

Example query:

sql
SELECT
  DATE(o.close_date) AS close_day,
  COUNT(*) AS closed_missing_history
FROM opportunity o
LEFT JOIN opportunity_stage_history h
  ON h.opportunity_id = o.opportunity_id
 AND h.stage_name IN ('Closed Won', 'Closed Lost')
 AND h.stage_changed_at <= o.close_date
WHERE o.is_closed = TRUE
  AND h.opportunity_id IS NULL
GROUP BY 1
ORDER BY 1;

What this checks:

  • Opportunity is marked closed
  • No matching Closed Won / Closed Lost stage history exists
  • Stage transition should have happened on or before close date

In an interview, explain possible causes:

  • Missing CDC event
  • Late-arriving stage history
  • Backfill gap
  • Source system correction
  • Manual data update
  • Different stage naming convention
  • Timezone mismatch around close date

Production handling:

  • Alert if count crosses threshold
  • Quarantine bad records if they break downstream metrics
  • Track the issue by source system and ingestion date
  • Reprocess from raw/bronze data if possible
  • Backfill corrected history safely
  • Add a data quality dashboard

A strong answer is:

I would LEFT JOIN closed Opportunities to closed-stage history and flag rows where the history is missing. Then I would check for late events, source corrections, timezone issues, and backfill gaps before quarantining or fixing the affected records.

What is the purpose of database indexing in a warehouse context?

In transactional databases, indexes help find rows quickly. In warehouses and lakehouses, similar goals are achieved through physical design features such as partitioning, clustering, sorting, statistics, pruning, and file layout.

The goal is to reduce scanned data and improve query performance.

In CRM analytics, common choices are:

Design choice Example
Partitioning Partition Opportunity facts by close date or snapshot date
Clustering Cluster by account_id, opportunity_id, or region
Sorting Sort history tables by business key and event time
Statistics Keep table stats updated for optimizer decisions
Materialized views Precompute heavy dashboard aggregates

Example:

text
A dashboard filtering Opportunity snapshots by date should not scan all historical partitions.

Physical design matters when:

  • Tables are very large
  • Dashboards need low latency
  • Joins are frequent
  • Filters are selective
  • Data is skewed by large customers or regions

Avoid over-optimizing too early. First check query plans, scan volume, join strategy, and dashboard SLA.

A strong answer is:

In a warehouse, the purpose is to reduce scanned data and improve query speed using indexes, partitioning, clustering, sorting, statistics, or materialized aggregates. I would design physical layout based on access patterns and validate with query plans.

What grain and keys do you use for a unified Customer dimension across Sales Cloud Contacts and a marketing email source?

For a unified customer dimension, define the grain carefully.

Recommended grain:

text
One row per resolved person or customer profile.

Do not use raw email as the only customer key. Emails can change, be shared, be missing, or appear differently across systems.

Recommended design:

  1. Unified customer dimension

    • customer_key
    • resolved profile attributes
    • preferred email
    • consent status
    • created/updated timestamps
  2. Identity link table

    • customer_key
    • source system
    • source record ID
    • email
    • phone
    • external ID
    • match confidence
    • active flag
  3. Source-specific dimensions

    • Sales Cloud Contact
    • Lead
    • Marketing subscriber
    • Web/app user
  4. Facts

    • Join through stable keys or identity bridge
    • Avoid rewriting historical facts every time an identity merge changes

Important Salesforce/Data Cloud concepts:

  • Source records can map to unified profiles
  • Match rules decide which records belong together
  • Reconciliation rules decide which field value wins
  • Merge churn should be monitored

Example issue:

text
If two contacts merge into one unified customer, historical Opportunity facts should remain explainable and not double-count revenue.

A strong answer is:

I would use one row per resolved customer profile with a stable surrogate customer_key, and maintain an identity link table from source IDs, emails, phones, and external IDs to that key. Facts should not depend only on mutable email addresses.


Data pipelines, CDC, and Salesforce Data Cloud

When do you choose batch vs near-real-time ingestion for Salesforce CRM data?

Choose the ingestion pattern based on the freshness requirement, cost, source limits, and downstream use case.

Choose batch ingestion Choose near-real-time / CDC ingestion
Nightly reports Operational dashboards
Historical backfills Stage-change or case-status alerts
Large snapshots Customer profile updates
Lower cost and simpler operations Fresh data needed within minutes
Source API limits favor scheduled pulls Downstream AI, segmentation, or automation needs fresh events

Batch ingestion is easier to operate and good for reporting where a few hours of delay is acceptable.

Near-real-time ingestion is better when business users need fresh changes quickly, such as:

  • Opportunity stage changes
  • Case escalation alerts
  • Marketing audience updates
  • Customer profile refresh in Data Cloud
  • Fraud/risk signals
  • Sales pipeline coverage monitoring

A common production design is hybrid:

text
CDC stream for hot changes + nightly batch reconciliation for correctness

The nightly batch helps catch missed events, source corrections, late-arriving records, and drift between source and target.

A strong answer is:

I choose batch when cost and simplicity matter more than freshness, and CDC or near-real-time ingestion when downstream users need updates within minutes. For important CRM data, I prefer a hybrid design: CDC for freshness and nightly reconciliation for correctness.

How do you design a CDC pipeline to be idempotent when events replay or arrive out of order?

An idempotent CDC pipeline produces the same final result even if the same event is processed more than once.

This is important because CDC systems can have:

  • Duplicate events
  • Retries
  • Out-of-order delivery
  • Consumer restarts
  • Replays from a checkpoint
  • Late-arriving updates

Key design elements:

  1. Stable event key

    • Use event ID, replay ID, transaction ID, or a composite key like source object ID + sequence + timestamp.
  2. Append-only bronze layer

    • Store raw events before transformation so replay and debugging are possible.
  3. Deduplication logic

    • Remove duplicates before writing conformed silver tables.
  4. Monotonic merge rules

    • Apply newer changes over older changes using commit timestamp or version number.
  5. Checkpointing

    • Store the last processed offset/replay token per object stream.
  6. Dead-letter queue

    • Send malformed or poison events to a DLQ instead of blocking the full pipeline.
  7. Metrics

    • Track lag, duplicate rate, late-event rate, failed records, and merge conflicts.

For Salesforce CRM data, do not rely only on email or name for identity. Use stable Salesforce record IDs, source system metadata, and defined identity rules.

A strong answer is:

I would make the CDC pipeline idempotent by storing raw events, deduplicating by stable event keys, applying version-aware merge logic, checkpointing per stream, and sending bad events to a DLQ. Reprocessing the same event should not change the final result incorrectly.

A new Opportunity field breaks your Data Stream schema validation. What do you do?

First protect the pipeline and downstream users. Then fix the schema properly.

Operational response:

  1. Confirm the change

    • Check source metadata, pipeline logs, schema validation error, and affected object.
  2. Protect ingestion

    • If possible, land the raw payload in bronze as JSON or semi-structured data so ingestion does not fully stop.
  3. Assess downstream impact

    • Find which silver/gold tables, dashboards, Data Cloud mappings, or ML features depend on the Opportunity schema.
  4. Version the schema

    • Add the new field as optional first, with a default null value where needed.
  5. Update mappings and contracts

    • Update Data Stream mapping, DLO/DMO mapping, dbt model, schema registry, or data contract.
  6. Backfill if needed

    • Backfill historical values if the source exposes field history or historical extracts.
  7. Test before promotion

    • Run compatibility tests, row-count checks, null checks, and downstream dashboard validation.
  8. Communicate

    • Notify consumers if the new field changes business logic or reporting definitions.

The key interview point is that schema evolution should not silently break gold tables or dashboards.

A strong answer is:

I would isolate the schema change, keep raw ingestion running if possible, add the new field in a backward-compatible way, update mappings and contracts, validate downstream models, and backfill historical values only if the business needs them.

What data quality checks do you embed in an ETL pipeline?

Data quality checks should run at multiple stages, not only at the final dashboard layer.

Common checks:

Check type Example
Completeness Row count vs source extract
Null checks Required keys like account_id or opportunity_id are not null
Uniqueness No duplicate primary keys or event IDs
Referential integrity Every Opportunity has a valid Account
Freshness Latest ingested_at is within SLA
Valid values Opportunity stage belongs to allowed stage list
Business rule Closed Opportunity must have a terminal stage
Reconciliation Source count and target count match within tolerance
Anomaly Sudden drop or spike in records
Distribution Stage mix or region mix changes unexpectedly

Example Salesforce checks:

  • Opportunities marked closed must have Closed Won or Closed Lost history
  • Contacts must have valid Account relationship when required
  • Duplicate Leads by email should be tracked
  • Opportunity amount should not be negative unless credits are allowed
  • Account hierarchy should not contain cycles

Failed checks should not be silently ignored. Depending on severity:

  • Warn and continue
  • Quarantine bad records
  • Block promotion to gold
  • Alert the data owner
  • Open an incident if SLA or business reporting is affected

A strong answer is:

I would add checks for completeness, uniqueness, referential integrity, freshness, valid values, business rules, and anomalies. Critical failures should quarantine or block bad data, while alerts and lineage help owners understand which reports or consumers are affected.

How do you handle missing or late-arriving data in pipelines?

Missing and late-arriving data should be handled explicitly because CRM data often arrives out of order or gets corrected after the fact.

For late-arriving events:

  • Use event time, not only ingestion time
  • Define a watermark and allowed lateness window
  • Reprocess affected partitions
  • Update as-of marts safely
  • Mark metrics as provisional if they may change

For missing batches:

  • Compare source manifests with landed files
  • Validate row counts and checksums
  • Retry extraction with backoff
  • Alert when SLA is at risk
  • Backfill from the last successful checkpoint

For downstream reporting:

  • Use is_final or data_status flags when metrics can revise
  • Publish incident notes if dashboards are stale
  • Show freshness timestamps in important reports

Salesforce-style example:

text
An Opportunity may be marked closed before the matching stage-history event arrives. The pipeline should detect this semantic gap, allow late correction, and avoid publishing misleading close-rate metrics.

A strong answer is:

I handle late data with watermarks, allowed lateness, partition reprocessing, and provisional metrics. For missing data, I compare manifests, retry safely, backfill from checkpoints, and communicate freshness or SLA impact to downstream users.

What are DLOs and DMOs in Salesforce Data Cloud?

In Salesforce Data Cloud / Data 360, DLOs and DMOs are part of the ingestion and harmonization flow.

Term Meaning
DLO Data Lake Object. It represents ingested or externally referenced source-aligned data.
DMO Data Model Object. It represents harmonized data mapped to the Salesforce Customer 360 data model.

Simple flow:

text
Source data → Data Stream → DLO → DMO → Identity Resolution → Unified Profile → Activation

Example:

  • A marketing subscriber file lands as a DLO
  • Fields from that DLO are mapped to standard objects such as Individual or Contact Point DMOs
  • Identity resolution links records into unified profiles
  • The unified data can be used for segmentation, analytics, activation, or AI use cases

Key interview point:

text
DLO is closer to source-shaped data.
DMO is closer to business/model-shaped data.

Zero-copy federation can also allow Data Cloud/Data 360 to access external data without physically copying all of it into Salesforce, depending on the connector and use case.

A strong answer is:

A DLO stores or references source-aligned data after ingestion, while a DMO is the harmonized model object used in the Customer 360 model. The usual flow is source to DLO, mapping to DMO, identity resolution, unified profile, and activation.

How does identity resolution work in Data Cloud?

Identity resolution links multiple source records that represent the same real-world person, account, or customer.

High-level flow:

  1. Ingest data from sources
  2. Map source data to DMOs
  3. Define match rules
  4. Define reconciliation rules
  5. Run identity resolution
  6. Create unified profiles and link records
  7. Monitor match quality and merge behavior

Important concepts:

Concept Meaning
Match rules Decide which records should be linked
Reconciliation rules Decide which attribute value wins
Unified profile The resolved customer/person/account view
Link objects Preserve relationship between source records and unified records

Example match signals:

  • Exact email
  • Phone number
  • CRM Contact ID
  • External customer ID
  • Name + address
  • Loyalty ID

Be careful with over-merging. A false positive can combine two different people into one profile, which is worse than leaving two duplicate profiles in many compliance and personalization scenarios.

Also explain that source records are not erased. Good identity systems preserve lineage so teams can trace a unified profile back to original CRM, marketing, or service records.

A strong answer is:

Identity resolution maps source records into a common model, applies match rules to link records, uses reconciliation rules to choose winning attribute values, and creates unified profiles while preserving source lineage. I would monitor false positives, merge churn, and compliance risk carefully.

A nightly orders pipeline failed at 2 AM. Walk through your response.

This is an incident-management question. The interviewer wants to see calm ownership, not panic debugging.

Step-by-step response:

  1. Acknowledge the alert

    • Check orchestrator status, failed task, logs, error message, and last successful run.
  2. Assess impact

    • Which tables, dashboards, ML jobs, or business users are affected?
    • Is this a freshness issue or bad-data issue?
  3. Stabilize

    • Retry only if the failure is safe and likely transient.
    • Pause downstream jobs if bad data may be published.
  4. Find root cause

    • Schema change
    • Source API issue
    • Credential expiry
    • File missing
    • Data skew
    • Warehouse capacity issue
    • Code deployment bug
  5. Recover

    • Restart from checkpoint
    • Backfill missing partitions
    • Re-run reconciliation checks
    • Validate row counts and business metrics
  6. Communicate

    • Notify stakeholders about impact, ETA, and workaround.
    • Update dashboard freshness notes if needed.
  7. Prevent recurrence

    • Add tests, alerts, schema contracts, retry logic, or runbook updates.

A strong answer is:

I would first assess whether users are seeing stale data or wrong data. Then I would stabilize the pipeline, protect downstream consumers, recover from the last good checkpoint, validate reconciliation checks, communicate impact, and add a guardrail to prevent the same failure.


System design

Design a system to count video views in real time.

Start by clarifying requirements:

  • Views per second
  • Exact vs approximate counts
  • Unique views vs total views
  • Per-video, per-user, per-region, or per-time-window metrics
  • Latency target
  • Fraud/bot filtering needs

High-level design:

text
Client/App → Event collector → Kafka/Kinesis/PubSub → Stream processor → Hot store + warehouse → API/dashboard

Components:

  1. Event collector

    • Receives view events with event_id, video_id, user_id, and timestamp.
  2. Message queue

    • Buffers traffic spikes and decouples producers from consumers.
  3. Stream processor

    • Aggregates counts by video_id and time window.
  4. Hot store

    • Redis or similar store for current counters.
  5. Analytics store

    • ClickHouse, BigQuery, Druid, Snowflake, or lakehouse tables for historical reporting.
  6. Serving API

    • Reads hot counters for live views and historical store for older windows.

Reliability concerns:

  • Deduplicate by event_id
  • Handle late events with watermarks
  • Mitigate hot keys for viral videos
  • Use idempotent writes
  • Apply bot/fraud filtering
  • Reconcile stream counts with raw events

For unique viewers, use exact distinct counts if scale allows, or approximate structures like HyperLogLog when approximate accuracy is acceptable.

A strong answer is:

I would ingest view events into a durable stream, aggregate them by video and time window, store live counters in a hot store, persist raw and aggregated data for analytics, and handle deduplication, late events, hot keys, and bot filtering.

Design a pipeline: Salesforce CRM CDC → Data Cloud + Azure Synapse with a 5-minute SLA.

A 5-minute SLA means the design must focus on low-latency ingestion, idempotent processing, and strong observability.

High-level design:

text
Salesforce CRM CDC → Event bus → Bronze raw lake → Silver conformed models → Synapse gold marts + Data Cloud ingestion

Suggested flow:

  1. CDC capture

    • Subscribe to Salesforce CDC events for objects such as Account, Contact, Opportunity, and Case.
  2. Streaming buffer

    • Use Kafka, Azure Event Hubs, or another durable bus to absorb spikes and support replay.
  3. Bronze layer

    • Store raw CDC events in ADLS with metadata: source object, event ID, replay ID, schema version, and ingestion time.
  4. Silver layer

    • Deduplicate, validate, apply ordering rules, and upsert into conformed CRM entities.
  5. Gold layer

    • Build BI-ready tables in Synapse for dashboards and SLA-based reporting.
  6. Data Cloud path

    • Send or expose harmonized data to Data Cloud/Data 360 through Data Streams, connectors, or federation depending on the source and architecture.
  7. Observability

    • Monitor end-to-end lag, event backlog, failed events, DLQ count, freshness, and row-count reconciliation.

Key design points:

  • Use idempotent upserts
  • Keep raw events for replay
  • Separate BI serving from profile activation
  • Use DLQ for poison events
  • Add schema evolution handling
  • Reconcile CDC with scheduled batch extracts

A strong answer is:

I would capture Salesforce CDC events into a durable stream, land raw events in bronze, dedupe and merge into silver CRM models, publish gold marts to Synapse, and feed Data Cloud through the right ingestion or federation path. I would monitor lag, DLQ, freshness, and reconciliation against source.

How would you migrate petabyte-scale data from on-prem to cloud with minimal downtime?

For petabyte-scale migration, use a phased approach instead of a single cutover.

Migration plan:

  1. Discovery

    • Inventory datasets, owners, size, dependencies, SLAs, PII, and retention rules.
  2. Target design

    • Choose storage layout, partitioning, encryption, access model, and compute engine.
  3. Initial bulk copy

    • Run parallel transfer to object storage.
    • Validate using checksums, row counts, and sample comparisons.
  4. Incremental sync

    • Use CDC, logs, timestamp-based extracts, or replication to capture changes after the initial copy.
  5. Dual-run period

    • Run old and new systems in parallel.
    • Compare business metrics and critical queries.
  6. Cutover

    • Use feature flags, DNS/connection string changes, or application configuration.
  7. Rollback plan

    • Keep old system readable for a defined period.
    • Define clear rollback triggers.
  8. Post-cutover validation

    • Compare counts, hashes, aggregates, and business invariants.

Special considerations:

  • Historical field history
  • Attachments and files
  • Data retention policies
  • PII masking/tokenization
  • Network bandwidth
  • Transfer cost
  • Partition skew
  • Query compatibility

A strong answer is:

I would do an initial bulk load, validate it, keep systems in sync with CDC or incremental replication, run both systems in parallel, reconcile business metrics, cut over with a rollback plan, and continue validation after migration.

How would you design a distributed job scheduler like Airflow?

A distributed scheduler coordinates workflows, dependencies, retries, and execution across many workers.

Core components:

Component Responsibility
Metadata DB Stores DAGs, runs, task state, retries, schedules
Scheduler Determines which tasks are ready to run
Executor Sends tasks to workers
Workers Execute tasks
Web UI/API Shows status, logs, retries, and history
Log store Stores task logs
Alerting Notifies on failure or SLA miss

High-level flow:

text
DAG definition → Scheduler → Queue → Worker → Metadata DB update → Logs/metrics

Design concerns:

  • Task dependencies
  • Retries and backoff
  • Idempotent task execution
  • Backfills
  • Catchup behavior
  • SLA monitoring
  • Worker autoscaling
  • Priority queues
  • Dead task detection
  • Secure secrets handling

For data platforms, tasks should be idempotent because retries are normal. For example, a load task should safely overwrite or merge a partition without creating duplicates.

A strong answer is:

I would design a scheduler with a metadata database, scheduler, executor, distributed workers, logs, retries, backfill support, and alerting. The most important reliability rule is that tasks should be idempotent because retries and reruns are expected.

How do you handle a sudden 10x traffic spike on your ingestion cluster?

A 10x ingestion spike should be handled with buffering, autoscaling, prioritization, and backpressure.

Immediate response:

  1. Check queue lag

    • Is the system dropping data or only falling behind?
  2. Protect critical pipelines

    • Prioritize revenue, CRM CDC, security, or customer-impacting data over batch reprocessing.
  3. Scale consumers

    • Add workers if partitions and downstream systems can handle it.
  4. Apply backpressure

    • Throttle non-critical producers or consumers.
  5. Use durable buffering

    • Let Kafka/Event Hubs/Kinesis absorb the burst within retention limits.
  6. Protect downstream systems

    • Avoid overwhelming the warehouse, database, Data Cloud ingestion, or APIs.
  7. Monitor cost

    • Autoscaling should have limits and alerts.

After the spike:

  • Replay DLQ
  • Validate missed records
  • Reconcile counts
  • Review partition strategy
  • Check hot keys
  • Update capacity planning
  • Add traffic forecasts if spike was business-driven

A strong answer is:

I would first confirm whether data is being lost or only delayed. Then I would use queue buffering, scale consumers safely, apply backpressure to non-critical flows, protect downstream systems, prioritize critical SLAs, and reconcile data after the spike.

Design a multi-tenant lakehouse fed from Data Cloud where each BU needs PII isolation and audit logs.

A multi-tenant lakehouse must isolate data, enforce access controls, and provide auditability.

Clarify first:

  • Number of business units
  • Regulatory requirements
  • Whether tenants can share any data
  • Whether isolation must be logical or physical
  • Who can access PII
  • Retention and deletion requirements

Design options:

Area Design choice
Logical isolation Separate schemas/catalogs per tenant
Physical isolation Separate storage accounts/buckets for regulated tenants
Access control IAM roles, RBAC, row-level security, column masking
PII protection Encryption, tokenization, masking, consent filtering
Audit Immutable access logs and query history
Lineage Track source, transform, and serving tables
Observability Freshness and DQ metrics by tenant

High-level design:

text
Data Cloud / source systems → ingestion layer → tenant-aware bronze/silver/gold → governed access layer → BI/ML/activation

Important controls:

  • Add tenant_id to shared tables where logical isolation is used
  • Prevent broad service principals
  • Encrypt data at rest and in transit
  • Mask or tokenize PII columns
  • Log who accessed what data and when
  • Separate dev/test/prod data access
  • Apply retention and deletion policies per tenant

A strong answer is:

I would design tenant isolation using separate schemas or storage for regulated tenants, enforce IAM and row/column-level controls, protect PII with masking or tokenization, and maintain immutable audit logs, lineage, and freshness metrics per tenant.

Design a data quality controls framework for an ETL platform.

A data quality framework should make checks reusable, visible, and enforceable across pipelines.

Core components:

  1. Rule definition

    • Rules written in SQL, YAML, dbt tests, Great Expectations, or a custom framework.
  2. Rule types

    • Null checks
    • Uniqueness
    • Referential integrity
    • Freshness
    • Accepted values
    • Reconciliation
    • Anomaly detection
    • Business rules
  3. Execution layer

    • Run checks after bronze, silver, and gold stages.
  4. Severity levels

    • Warn
    • Quarantine
    • Block promotion
    • Incident
  5. Metadata and ownership

    • Every rule has owner, table, column, SLA, severity, and linked dashboard.
  6. Alerting

    • Notify the right owner through Slack, email, PagerDuty, or ticketing.
  7. Observability

    • Track pass rate, failure trend, freshness, and data quality score.
  8. Lineage integration

    • Show which dashboards, marts, or consumers are affected.

Example pilot:

text
Start with Account and Opportunity because they are high-value CRM entities.

Example checks:

  • Opportunity must have valid Account
  • Closed Opportunity must have terminal stage
  • Account ID must be unique
  • Required owner fields must not be null
  • Daily Opportunity count should not drop suddenly

A strong answer is:

I would build a declarative data quality framework with reusable rules, severity levels, alerts, ownership, lineage, and trend reporting. Critical checks should block or quarantine bad data, while lower-severity checks warn owners before issues become incidents.


Coding and algorithms (data-focused)

Given a large log file, find the five most frequent error messages in the last hour.

This tests whether you can process large data without loading everything into memory.

Approach:

  1. Read the file line by line
  2. Parse timestamp and error message
  3. Keep only records from the last one hour
  4. Count messages using a hash map
  5. Return the top five using Counter.most_common() or a min-heap

Simple Python approach:

python
from collections import Counter
from datetime import timedelta

def top_errors(lines, now, window=timedelta(hours=1)):
    cutoff = now - window
    counts = Counter()

    for line in lines:
        ts, msg = parse_log_line(line)

        if ts >= cutoff:
            counts[msg] += 1

    return counts.most_common(5)
Output

For very large or streaming logs, improve the design:

  • Use a min-heap of size 5 for top-K
  • Use event time instead of processing time
  • Handle out-of-order logs with a watermark
  • Normalize error messages so IDs do not create fake unique errors
  • Store rolling counts in Redis or a stream processor for real-time dashboards

Complexity:

Area Complexity
Time O(n)
Space O(u) where u is unique error messages in the window

A Salesforce-style example:

text
Find the top failing ingestion errors in the last hour for Account and Opportunity CDC events.

A strong answer is:

I would stream the log file line by line, filter records to the last hour, count error messages with a hash map, and return the top five. For production-scale logs, I would handle event-time ordering, message normalization, and rolling-window aggregation.

Validate whether a string of brackets is balanced.

This is a classic stack problem. It checks whether you understand last-in, first-out matching.

Use a stack:

python
def is_balanced(s: str) -> bool:
    stack = []
    pairs = {
        ')': '(',
        ']': '[',
        '}': '{'
    }

    for ch in s:
        if ch in '([{':
            stack.append(ch)
        elif ch in ')]}':
            if not stack or stack.pop() != pairs[ch]:
                return False

    return not stack
Output

How it works:

  • Push opening brackets to the stack
  • When a closing bracket appears, check if it matches the latest opening bracket
  • If anything mismatches, return False
  • At the end, stack must be empty

Examples:

Input Output Reason
{[]} True Properly nested
{[}] False Wrong order
((())) True All matched
(() False Missing closing bracket

Complexity:

Area Complexity
Time O(n)
Space O(n) in worst case

Possible follow-ups:

  • Ignore non-bracket characters
  • Return the position of the first invalid bracket
  • Validate brackets while streaming a file
  • Track line and column number for parser errors

A strong answer is:

I would use a stack. Opening brackets are pushed, closing brackets must match the most recent opening bracket, and the stack must be empty at the end. This gives O(n) time and O(n) worst-case space.

Find all pairs in an array that sum to a target.

First clarify whether the interviewer wants:

  • One pair or all pairs
  • Indexes or values
  • Duplicate pairs allowed or not
  • Same element reuse allowed or not

For all unique value pairs, use a hash set.

python
def two_sum_pairs(nums, target):
    seen = set()
    pairs = set()

    for x in nums:
        need = target - x

        if need in seen:
            pairs.add(tuple(sorted((need, x))))

        seen.add(x)

    return list(pairs)
Output

Example:

python
nums = [1, 2, 3, 4, 5, 3]
target = 6
Output

Possible output:

text
[(1, 5), (2, 4), (3, 3)]

Complexity:

Area Complexity
Time O(n)
Space O(n)

If the array is already sorted, you can use the two-pointer approach:

text
left at start, right at end
move left/right based on current sum

Two-pointer approach uses O(1) extra space but requires sorted input.

Salesforce/data example:

text
Find pairs of adjustment records whose amounts cancel each other out.

A strong answer is:

I would clarify duplicates and whether indexes or values are required. For unsorted input, I would use a hash set for O(n) time. For sorted input, I would use two pointers to reduce extra space.

Implement a function to flatten a nested dictionary.

Flattening a nested dictionary is common when semi-structured JSON lands in a bronze layer and needs to be converted into column-style paths.

Example input:

python
{
    "account": {
        "id": "A1",
        "owner": {
            "id": "U1"
        }
    }
}
Output

Expected output:

python
{
    "account.id": "A1",
    "account.owner.id": "U1"
}
Output

Recursive solution:

python
def flatten(obj, prefix=""):
    out = {}

    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key

        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value

    return out
Output

Complexity:

Area Complexity
Time O(n) where n is total keys
Space O(n) for output and recursion stack

Important follow-ups:

  • How should lists be handled?
  • What if keys already contain dots?
  • Should null values be kept?
  • Should deeply nested JSON use iterative traversal to avoid recursion depth issues?
  • Should output preserve data types?

For CDC pipelines, this helps when raw Salesforce or external events land as JSON and you need predictable field paths for validation or schema inference.

A strong answer is:

I would recursively traverse the dictionary, build a dotted path for each nested key, and store leaf values in the output map. I would also clarify how to handle arrays, nulls, dotted keys, and very deep JSON.

Merge k sorted lists of records by timestamp.

This is a good data-engineering coding question because it resembles merging sorted CDC shards, partitioned files, or ordered event streams.

Use a min-heap of size k.

python
import heapq

def merge_k_lists(lists):
    heap = []

    for list_index, records in enumerate(lists):
        if records:
            heapq.heappush(
                heap,
                (records[0][0], list_index, 0)
            )

    result = []

    while heap:
        ts, list_index, record_index = heapq.heappop(heap)
        records = lists[list_index]

        result.append(records[record_index])

        next_index = record_index + 1

        if next_index < len(records):
            heapq.heappush(
                heap,
                (records[next_index][0], list_index, next_index)
            )

    return result
Output

Assumption:

text
Each record is sorted by timestamp, and timestamp is at index 0.

Example record:

python
("2026-06-27T10:00:00Z", "event_1")
Output

Complexity:

Area Complexity
Time O(N log k)
Space O(k) excluding output

Where:

  • N = total records across all lists
  • k = number of sorted lists

Important follow-ups:

  • What if timestamps are equal?
  • Do we need stable ordering?
  • Are records already sorted by event time or ingestion time?
  • Can lists be streams instead of in-memory arrays?
  • How do we handle late events?

A strong answer is:

I would use a min-heap containing the current record from each sorted list. Each pop gives the next earliest timestamp, and I push the next record from the same list. This gives O(N log k) time and works well for merging ordered CDC shards.

How do you deduplicate a stream of CRM events in code?

To deduplicate a CRM event stream, use a stable event identifier and keep track of processed events.

Simple logic:

python
seen = set()

def process(event):
    if event.id in seen:
        return None

    seen.add(event.id)
    return transform(event)
Output

This works for small in-memory examples, but production streams need bounded and durable state.

Production design options:

Option Use case
In-memory set Small coding exercise or short-lived process
TTL cache Dedup within replay window
Redis / DynamoDB / RocksDB Durable dedup state
Database unique constraint Idempotent sink writes
Bloom filter Memory-efficient approximate dedup
Bronze table dedup Reprocess-safe batch/stream pipeline

Important considerations:

  • Event ID must be stable
  • Dedup state should survive restarts
  • State must be bounded using TTL or compaction
  • Replays should not produce duplicate target rows
  • Sink writes should be idempotent
  • Checkpointing should align with successful writes

For Salesforce CDC, possible keys include:

text
event_id, replay_id, source_object_id, commit_timestamp, change_sequence

Do not deduplicate only by business fields like email or account name because those can change and may not uniquely identify events.

A strong answer is:

I would deduplicate by a stable event ID, store seen IDs in a bounded or durable state store, and make the sink idempotent with upserts or unique constraints. In production, the dedup state must survive retries and restarts.


Behavioral, leadership, and culture

Why Salesforce, and which company value resonates with you?

A strong answer should connect Salesforce's values to your actual data engineering work, not generic praise.

Good values to connect with data work:

Salesforce value How to connect it to data engineering
Trust Data privacy, PII handling, encryption, access control, auditability
Customer Success Reliable pipelines and metrics that business users can act on
Innovation Data Cloud, Agentforce, AI-ready data, real-time customer context
Equality Fair access to data, inclusive collaboration, clear documentation
Sustainability Efficient pipelines, cost-aware compute, less wasteful processing

A weak answer sounds like:

text
Salesforce is a big company and I want to work on interesting products.

A stronger answer connects your experience:

text
Trust resonates with me because data engineering directly affects customer confidence. If CRM, consent, or PII pipelines are wrong, downstream analytics and AI decisions can be wrong. In my past work, I handled access controls, audit logs, data quality checks, and incident communication, so I see Trust as a daily engineering responsibility.

You can also mention interest in Data Cloud / Data 360 and Agentforce if it is genuine:

text
I am interested in how Salesforce uses unified customer data to power analytics, automation, and AI agents. That kind of system needs strong data engineering foundations: identity resolution, freshness, lineage, governance, and quality checks.

A strong answer is:

Trust resonates most with me because data engineering is directly tied to reliability, privacy, and customer confidence. I want to work on systems where clean, governed, and timely data powers customer-facing analytics, Data Cloud, and AI use cases.

Tell me about a technical disagreement with a senior engineer.

Use a STAR structure and show that you can disagree without making it personal.

A good example can come from:

  • Streaming vs batch ingestion
  • SCD2 vs daily snapshot
  • SQL model vs denormalized mart
  • Redis cache vs warehouse optimization
  • Custom pipeline vs managed connector
  • Strict data quality gate vs warning-only check

Example structure:

STAR step What to say
Situation Team disagreed on whether a CRM mart should use SCD2 or daily snapshots
Task You needed to support accurate as-of reporting without making the pipeline too complex
Action You documented trade-offs, checked query patterns, ran a small benchmark, and reviewed SLA needs
Result Team chose a hybrid model: SCD2 for core dimensions and daily snapshots for high-traffic dashboard metrics

What interviewers want to hear:

  • You listened first
  • You used data, not ego
  • You considered maintainability
  • You aligned the decision with customer or business impact
  • You preserved the relationship

Avoid saying:

text
The senior engineer was wrong, so I proved my approach was better.

Say instead:

text
We had different assumptions, so I made the trade-offs visible and helped the team choose based on SLA, cost, and maintainability.

A strong answer is:

I handled the disagreement by clarifying the requirement, documenting trade-offs, validating assumptions with a small test, and aligning the final decision with SLA, maintenance cost, and business impact rather than personal preference.

Describe a time you spotted a security risk in a data system.

Choose a story where you identified a real risk, acted responsibly, and improved the system.

Good examples:

  • PII copied into a sandbox without masking
  • Overly broad service principal access to storage
  • API key committed to Git
  • Data export shared with too many users
  • Logs containing tokens, emails, or customer identifiers
  • No audit trail for sensitive table access
  • Production data available from a non-production notebook

Answer structure:

  1. Risk found

    • What did you notice?
  2. Impact

    • What could have gone wrong?
  3. Action

    • Who did you inform?
    • Did you rotate credentials?
    • Did you restrict access?
    • Did you mask or delete exposed data?
  4. Prevention

    • What control did you add?

Example:

text
I found that a shared service principal had write access to multiple storage containers, including PII datasets that the pipeline did not need. I raised it with the platform and security teams, helped reduce permissions to least privilege, rotated credentials, and added a review step for new data access requests.

Strong prevention actions:

  • Secret scanning in CI
  • IAM least-privilege review
  • Token rotation
  • Column masking
  • Audit logging
  • Access review workflow
  • Policy-as-code checks

A strong answer is:

I identified the risk, assessed the blast radius, involved the right owners, fixed the exposure, and added a preventive control such as secret scanning, least-privilege IAM, masking, or audit logging so the issue would not repeat.

Tell me about a project that failed. What did you learn?

Pick a failure where you learned something meaningful and changed your behavior.

Good data-engineering examples:

  • Pipeline launched without enough data quality gates
  • Dashboard users lost trust after metric definitions changed
  • Identity merge churn caused duplicate or missing customer counts
  • Backfill took longer than expected because dependencies were not mapped
  • Streaming pipeline missed late-arriving data
  • Gold table design did not match actual business questions

Avoid blaming:

text
The stakeholder kept changing requirements.

Better framing:

text
We did not define the data contract and success criteria clearly enough before building.

Example answer:

text
In one project, we built a gold sales mart quickly, but we did not define the exact grain and metric ownership early. After launch, different teams interpreted pipeline value differently, and users lost confidence in the dashboard. I helped pause further rollout, documented metric definitions, added reconciliation checks, and created a sign-off process with business owners. The lesson was that trusted data products need contracts, ownership, and validation before wide release.

What the interviewer wants:

  • You take ownership
  • You learned a practical lesson
  • You changed your process
  • You do not blame others
  • You understand how data trust is lost and rebuilt

A strong answer is:

The project failed because we did not define ownership, grain, and data quality gates early enough. I learned to align on data contracts, success metrics, validation rules, and stakeholder sign-off before promoting data to trusted reporting layers.

How do you prioritize when two executives want conflicting high-priority data work?

This tests whether you can operate in a large enterprise without silently choosing favorites.

A good framework:

  1. Clarify business impact

    • Revenue, compliance, customer-facing impact, executive reporting, operational risk.
  2. Clarify deadline

    • Is it truly urgent or just important?
  3. Estimate effort and dependencies

    • Can one request unblock the other?
    • Is there a shared data model or pipeline component?
  4. Identify risk

    • What happens if each request is delayed?
  5. Offer phased delivery

    • MVP metric first, full mart later.
    • Temporary extract first, governed pipeline later.
  6. Escalate transparently

    • Present trade-offs to the decision maker with facts.
  7. Document the decision

    • Capture scope, timeline, owner, and follow-up.

Example trade-off matrix:

Request Impact Deadline Effort Risk if delayed
Revenue dashboard fix High This week Medium Forecasting impact
New segmentation mart Medium Next month High Campaign delay

Avoid:

text
I would work on whichever executive is louder.

Better:

text
I would make the trade-off visible and ask for prioritization based on business impact and risk.

A strong answer is:

I would compare business impact, urgency, effort, dependencies, and risk, then present a transparent trade-off matrix. If both cannot be done, I would propose phased delivery and get explicit prioritization rather than silently moving work around.

Describe leading through an ambiguous project crisis.

Choose a story where the problem was unclear, requirements changed, or multiple teams were blocked.

Good examples:

  • Acquisition data had to be integrated quickly
  • New CRM source had undocumented fields
  • A dashboard broke before quarter-end reporting
  • Data Cloud identity rules caused unexpected merge behavior
  • Source system changed schema without notice
  • Pipeline ownership was unclear during an incident

Strong answer structure:

  1. Stabilize

    • Identify immediate user impact and protect critical outputs.
  2. Clarify known vs unknown

    • Separate facts, assumptions, and open questions.
  3. Create a short plan

    • Time-box investigation.
    • Assign owners.
    • Define decision points.
  4. Communicate frequently

    • Explain what is affected, what is safe, and what is still unknown.
  5. Deliver incrementally

    • Provide a partial fix or MVP while deeper correction continues.
  6. Close with learning

    • Update runbooks, tests, contracts, or escalation paths.

Example:

text
During a CRM migration, we discovered late that several account hierarchy fields had inconsistent meanings across regions. I paused promotion to gold tables, created a data profiling report, met with regional owners, defined a temporary mapping for critical dashboards, and tracked unresolved fields in a risk register. We delivered the quarter-end dashboard on time with documented limitations, then followed up with a permanent contract.

A strong answer is:

I lead ambiguous crises by separating facts from assumptions, stabilizing the highest-impact users first, assigning clear owners, communicating known risks, and delivering incremental fixes while building the longer-term solution.


Lead and senior IC depth

How do you ensure scalability and performance of a large Salesforce-related data platform?

A lead-level answer should cover architecture, operations, and team practices.

Key areas:

Area What to do
Data modeling Define correct grain, keys, and ownership
Partitioning Partition large facts by event date, snapshot date, or ingestion date
Skew handling Watch large accounts, hot keys, and uneven partitions
Processing Use bulk and distributed processing instead of row-by-row API pulls
CDC Add backpressure, replay support, and idempotent merges
Serving Pre-aggregate dashboard metrics and avoid heavy ad hoc queries on raw tables
Governance Define standards, contracts, reviews, and ownership
Observability Track freshness, lag, cost, quality, and SLA breaches

Salesforce-specific scaling concerns:

  • Account hierarchies can create skew
  • Opportunity history can grow quickly
  • CDC events can arrive in bursts
  • Identity resolution can change profile counts
  • Multi-tenant or business-unit data may require isolation
  • PII controls can affect how data is joined and served

A lead candidate should also mention people and process:

  • Establish design reviews
  • Mentor engineers on data modeling and performance
  • Create reusable pipeline templates
  • Set performance budgets
  • Run capacity reviews
  • Practice incident game days

A strong answer is:

I would scale the platform through correct data grain, partitioning, skew handling, bulk processing, idempotent CDC design, pre-aggregated serving layers, governance standards, and strong observability. As a lead, I would also create reusable patterns and mentor teams to apply them consistently.

How do you manage technical debt on a data platform?

Technical debt in data platforms is dangerous because it often appears later as wrong metrics, slow pipelines, high cost, or poor trust.

Common data-platform debt:

  • Unowned tables
  • No data contracts
  • No tests for critical transformations
  • Duplicated pipeline logic
  • Hardcoded business rules
  • Manual backfills
  • Poor naming conventions
  • No lineage or documentation
  • Expensive queries nobody owns
  • Legacy pipelines no one wants to touch

Management framework:

  1. Make debt visible

    • Create a debt register with owner, impact, risk, and effort.
  2. Prioritize by customer impact

    • Fix debt that affects correctness, trust, SLA, cost, or security first.
  3. Reserve capacity

    • Allocate regular sprint capacity for debt reduction.
  4. Prevent new debt

    • Use PR templates, naming standards, tests, and data contracts.
  5. Refactor safely

    • Use a strangler pattern: run old and new pipelines in parallel until reconciliation passes.
  6. Measure improvement

    • Track pipeline runtime, failure rate, cost, test coverage, and incident count.

Example:

text
We replaced a legacy Opportunity mart by building a new model in parallel, validating row counts and business metrics for two weeks, then switching dashboards gradually.

A strong answer is:

I manage technical debt by making it visible, ranking it by business impact and risk, reserving capacity to reduce it, and preventing new debt with standards, tests, contracts, and design reviews. For risky refactors, I run old and new pipelines in parallel until reconciliation passes.

When do you buy or configure vs build custom pipeline logic?

The decision should be based on reliability, maintenance cost, flexibility, and business differentiation.

Prefer managed or out-of-the-box options when:

  • Standard connector supports the source well
  • Required SLA is covered
  • Transformations are simple
  • Security and compliance controls are built in
  • Long-term maintenance should be low
  • The integration is not a competitive differentiator

Build custom logic when:

  • You need complex CDC handling
  • Identity matching rules are custom
  • Transformations are highly business-specific
  • You need advanced deduplication or ordering logic
  • Multi-cloud or hybrid architecture is required
  • Existing connector cannot meet latency, schema, or governance needs

Decision rubric:

Question Why it matters
Does a reliable connector already exist? Avoid unnecessary custom code
What is the SLA? Some managed tools may not meet low-latency needs
Who will own failures? Custom code needs on-call ownership
How complex are transforms? Complex logic may require custom pipelines
What is the total cost? Include build, support, monitoring, and upgrades
Is there vendor lock-in risk? Important for long-term architecture

Avoid building custom pipelines only because it feels more flexible. Custom code also brings monitoring, alerting, testing, security, and upgrade responsibility.

A strong answer is:

I prefer managed connectors or configuration when they meet SLA, security, and transformation needs. I build custom logic only when the business requirements, latency, identity rules, or integration complexity cannot be handled safely by standard tools.

Where does MuleSoft fit in Salesforce enterprise integration?

MuleSoft fits in enterprise integration by providing reusable APIs and integration flows between Salesforce and other systems.

A common MuleSoft pattern is API-led connectivity:

API layer Purpose
System APIs Connect to systems of record such as ERP, CRM, databases, or legacy apps
Process APIs Combine and orchestrate business logic across systems
Experience APIs Serve specific channels such as web, mobile, partner apps, or internal tools

Where MuleSoft helps:

  • Connect Salesforce CRM with ERP, billing, support, marketing, or data platforms
  • Transform payloads between systems
  • Apply routing and orchestration
  • Enforce authentication and rate limits
  • Provide reusable integration contracts
  • Improve observability of enterprise APIs
  • Reduce point-to-point integration sprawl

For data engineers, MuleSoft may be relevant in two ways:

  1. You consume APIs exposed by MuleSoft as pipeline sources.
  2. You coordinate event contracts and payload formats with integration teams.

When MuleSoft is better than point-to-point integration:

  • Multiple consumers need the same business data
  • Enterprise governance matters
  • APIs need versioning and monitoring
  • Direct source access is risky or inconsistent
  • Integrations should be reusable across teams

A strong answer is:

MuleSoft provides API-led connectivity between Salesforce and enterprise systems. As a data engineer, I may consume MuleSoft APIs, coordinate event contracts, and rely on reusable System, Process, and Experience APIs instead of building fragile point-to-point integrations.

Describe a challenging CRM or data migration you led.

Use this as a hybrid behavioral and technical answer. Show leadership, planning, validation, and measurable results.

Good structure:

  1. Scope

    • What was migrated?
    • From which system to which target?
    • How much data?
    • Which users or business processes were affected?
  2. Challenges

    • Duplicate contacts/accounts
    • Bad account hierarchy
    • Missing required fields
    • Attachment migration
    • Historical field history
    • Downtime window
    • Data ownership conflicts
    • Consent and PII requirements
  3. Plan

    • Profiling
    • Mapping
    • Cleansing
    • Test loads
    • Reconciliation
    • Parallel run
    • Cutover
    • Rollback
  4. Execution

    • Built migration pipelines
    • Created validation queries
    • Coordinated with business owners
    • Ran mock cutovers
    • Tracked defects and sign-offs
  5. Result

    • Match rate
    • Downtime
    • Error reduction
    • Adoption
    • Reporting accuracy
    • Business outcome

Example:

text
I led a CRM migration where duplicate Accounts and inconsistent region mappings were breaking sales reporting. We profiled the data, created match rules, ran test loads, reconciled key metrics with business owners, and used a phased cutover. The final migration completed within the planned downtime window, and post-cutover reconciliation passed for the critical revenue and pipeline dashboards.

A strong answer is:

I would describe the migration scope, data quality challenges, cutover plan, validation strategy, stakeholder coordination, and measurable result. The key is showing that I handled both the technical migration and the business trust needed for adoption.

How does data engineering support Agentforce and AI features in 2026?

Agentforce and AI features depend on trusted, governed, and timely data. Raw CRM dumps are not enough.

Data engineering supports AI by providing:

Need Data engineering responsibility
Trusted context Clean and validated customer data
Unified profile Identity resolution across CRM, service, marketing, and external sources
Freshness CDC or near-real-time pipelines for important events
Governance PII controls, consent, masking, access control, retention
Grounding Relevant business data for AI responses and actions
Lineage Trace where AI context came from
Feedback loop Capture agent outcomes for evaluation and improvement

Example:

text
If an AI agent recommends the next best action for a customer, it needs accurate account status, recent service cases, consent preferences, purchase history, and current engagement signals.

Important risks:

  • Stale data can cause wrong recommendations
  • Poor identity resolution can mix customers incorrectly
  • Missing consent can create compliance issues
  • Bad lineage makes AI behavior hard to explain
  • Unchecked feedback loops can reinforce bad decisions

A Salesforce Data Engineer should connect Agentforce/Data Cloud work to fundamentals:

  • DLO/DMO mapping
  • Unified profiles
  • Identity resolution
  • Data quality checks
  • Real-time ingestion
  • Secure access
  • Audit and lineage
  • Monitoring and feedback metrics

A strong answer is:

Data engineering supports Agentforce by providing trusted, fresh, unified, and governed customer data. I would focus on identity resolution, data quality, lineage, PII controls, real-time signals, and feedback loops so AI agents act on accurate and compliant context.


Final checklist before your loop

  • Three project stories with metrics and trade-offs
  • SQL pad ready: windows, CTEs, SCD2 audits on CRM schemas
  • One system design narrative: CDC → bronze/silver/gold → Data Cloud + Synapse
  • Coding practice: top-K, stacks, heaps, stream dedup
  • Behavioral answers tied to Trust and collaboration
  • Clarified which org/team your loop targets (platform vs product analytics)

For more interview prep on this site, see AWS, Java, and the full Interview Questions category.

Deepak Prasad

R&D Engineer

Founder of GoLinuxCloud with more than 15 years of expertise in Linux, Python, Go, Laravel, DevOps, Kubernetes, Git, Shell scripting, OpenShift, AWS, Networking, and Security. With extensive …