Top 50 Essential SQL Interview Questions for Senior Engineers in 2026

If you’re preparing for a backend, data, or full-stack interview in 2026, SQL is still the language that separates the engineers who understand data from the ones who just write queries.
I’ve sat on both sides of the interview table, and I can tell you that the questions have gotten harder. Companies are no longer satisfied with “can you write a SELECT join.”
They want to know if you understand what happens when your query hits a million rows, whether you know the difference between a clustered index seek and a table scan, and if you can reason through a transaction-locking scenario under pressure.
This guide gives you 50 real SQL interview questions broken down by difficulty tier. Each question comes with a working answer or at minimum a clear direction toward the solution. I have written these the way I would explain them to a junior engineer on my team who is preparing for an upcoming round.
TLDR
- 50 real SQL interview questions grouped by difficulty
- Beginner tier: SELECT, JOIN, GROUP BY, subqueries
- Intermediate tier: window functions, CTEs, indexing, query optimization
- Advanced tier: transactions, locking, distributed systems, data modeling
- Every answer includes working SQL where applicable
Comparison Table: SQL Difficulty Tiers
| Level | Concepts | Example Question |
|---|---|---|
| Basic | SELECT, JOIN, WHERE, GROUP BY, ORDER BY, subqueries | What is the difference between HAVING and WHERE? |
| Intermediate | Window functions, CTEs, indexing, EXPLAIN, normalization | How do ROW_NUMBER, RANK, and DENSE_RANK differ? |
| Advanced | Transactions, ACID, locking, isolation levels, distributed SQL | Explain optimistic vs pessimistic locking. |
| Expert | Data modeling, SCDs, fact/dimension tables, query optimization | How do you design a schema for a ride-sharing service? |
Basic SQL Interview Questions
These are questions I expect any engineer with 6 months of SQL experience to answer without hesitation. If you stumble here, the interview is over before it starts.
1. What is the difference between WHERE and HAVING?
WHERE filters rows before aggregation happens. HAVING filters groups after aggregation. Think of WHERE as the bouncer at the door who checks people before they enter the party, and HAVING as the host who only lets certain groups stay once they are inside.
-- WHERE filters before aggregation
SELECT department, COUNT(*)
FROM employees
WHERE status = 'active'
GROUP BY department;
-- HAVING filters after aggregation
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;
2. What is the difference between INNER JOIN and LEFT JOIN?
INNER JOIN returns only the rows where there is a match in both tables. LEFT JOIN returns all rows from the left table and matching rows from the right table, with NULL for non-matching right-side rows.
-- INNER JOIN: only matching customers with orders
SELECT c.name, o.order_id
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id;
-- LEFT JOIN: all customers, with or without orders
SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id;
If a customer has never placed an order, they show up in the LEFT JOIN result with NULL for order_id. They disappear entirely from the INNER JOIN result.
3. What is the difference between UNION and UNION ALL?
UNION removes duplicate rows, which typically requires a sort or hash step under the hood. UNION ALL keeps all rows including duplicates. UNION ALL is faster because it skips deduplication. Only use UNION when you specifically need unique rows.
-- UNION: unique rows only, slower due to deduplication
SELECT name FROM customers
UNION
SELECT name FROM suppliers;
-- UNION ALL: all rows, faster
SELECT name FROM customers
UNION ALL
SELECT name FROM suppliers;
4. How do you find duplicate records in a table?
Use GROUP BY with HAVING COUNT greater than 1, then join back to find the full rows.
-- Find duplicate emails
SELECT email, COUNT(*) as cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- Find full duplicate rows by joining back on the duplicated key
SELECT u.*
FROM users u
JOIN (
SELECT email
FROM users
GROUP BY email
HAVING COUNT(*) > 1
) d ON u.email = d.email;
5. What is the ORDER BY execution order in a SELECT statement?
The logical order is: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT. ORDER BY executes after SELECT, which is why some databases like MySQL let you order by an alias defined in the SELECT clause. The optimizer may physically reorder operations for performance, but logically ORDER BY is the final step before LIMIT.
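A quick illustration of the alias rule (order_items is a hypothetical table):

```sql
-- line_total is defined in SELECT, yet ORDER BY can reference it
-- because ORDER BY is evaluated after SELECT
SELECT name, price * quantity AS line_total
FROM order_items
ORDER BY line_total DESC;
```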
6. What is the difference between CHAR and VARCHAR?
CHAR is fixed length. VARCHAR is variable length. CHAR pads strings with spaces to the defined length, so 'hello' in a CHAR(10) column is stored as 'hello     ' with five trailing spaces. VARCHAR stores only the actual bytes plus 1-2 length bytes.
Use CHAR for fixed-length strings like country codes (US, IN, GB). Use VARCHAR for variable-length data like names, emails, addresses.
| Type | Storage | Use case |
|---|---|---|
| CHAR(10) | Always 10 bytes | Country codes, status flags |
| VARCHAR(255) | 1-255 bytes + overhead | Names, emails, addresses |
7. What is a primary key versus a unique constraint?
A primary key is a NOT NULL unique constraint. Only one primary key per table. It is the logical identifier for the row and typically becomes the foreign key in related tables. A unique constraint allows NULL values (multiple NULLs are allowed in most databases since NULL != NULL). The primary key is the physical anchor for the clustered index in most SQL databases.
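A minimal DDL sketch of the distinction (table and column names are illustrative):

```sql
CREATE TABLE users (
    id    BIGINT PRIMARY KEY,     -- implicitly NOT NULL and unique; one per table
    email VARCHAR(255) UNIQUE,    -- unique, but NULL allowed (multiple NULLs in most databases)
    phone VARCHAR(20) UNIQUE      -- a table can carry many unique constraints
);
```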
8. How do you handle NULL values in SQL?
Use IS NULL or IS NOT NULL for comparisons because NULL cannot be compared with equals (=) due to three-valued logic. The COALESCE function returns the first non-NULL argument. The IFNULL or NVL functions serve a similar purpose in specific databases.
-- Check for NULL
SELECT * FROM orders WHERE shipped_date IS NULL;
-- Replace NULL with default
SELECT COALESCE(shipping_address, 'No address provided') FROM customers;
-- IFNULL example (MySQL)
SELECT IFNULL(phone, 'Unknown') FROM contacts;
9. What is the difference between LIMIT and TOP?
LIMIT is used in MySQL, PostgreSQL, and SQLite. TOP is used in SQL Server. Both serve the same purpose of restricting the number of rows returned. The syntax differs.
-- MySQL, PostgreSQL, SQLite
SELECT * FROM products ORDER BY price DESC LIMIT 10;
-- SQL Server
SELECT TOP 10 * FROM products ORDER BY price DESC;
10. What is a subquery and when would you use one?
A subquery is a query nested inside another query. You can use them in WHERE clauses, FROM clauses, and SELECT clauses. Correlated subqueries reference the outer query and are evaluated per row. Non-correlated subqueries are independent and evaluated once.
-- Non-correlated: find products priced above average
SELECT name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);
-- Correlated: find each department's above-average salaries
SELECT name, salary, department_id
FROM employees e1
WHERE salary > (
SELECT AVG(salary)
FROM employees e2
WHERE e2.department_id = e1.department_id
);
11. How do you calculate the total sales per customer for the last 30 days?
SELECT
c.id,
c.name,
SUM(o.total) AS total_sales
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.created_at >= NOW() - INTERVAL '30 days' -- PostgreSQL; MySQL uses INTERVAL 30 DAY
GROUP BY c.id, c.name
ORDER BY total_sales DESC;
12. What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?
COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values. If a column has values 1, 2, NULL, NULL, then COUNT(column) returns 2 and COUNT(DISTINCT column) returns 2.
SELECT
COUNT(*) AS total_rows,
COUNT(region) AS rows_with_region,
COUNT(DISTINCT region) AS unique_regions
FROM stores;
13. What is a self JOIN and when would you use it?
A self JOIN joins a table to itself. Use it for hierarchical data like employee-manager relationships or for comparing rows within the same table.
-- Find employees and their managers
SELECT
e.name AS employee,
m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
14. How do you pivot rows into columns in SQL?
Use conditional aggregation (CASE WHEN inside an aggregate function) or the database-specific PIVOT operator. PostgreSQL, MySQL, and SQLite do not have a native PIVOT keyword, so you use conditional aggregation.
-- Pivot monthly sales into columns
SELECT
product_name,
SUM(CASE WHEN month = 'Jan' THEN revenue END) AS Jan,
SUM(CASE WHEN month = 'Feb' THEN revenue END) AS Feb,
SUM(CASE WHEN month = 'Mar' THEN revenue END) AS Mar
FROM monthly_sales
GROUP BY product_name;
-- SQL Server with PIVOT
SELECT *
FROM (SELECT product_name, month, revenue FROM monthly_sales) AS src
PIVOT (SUM(revenue) FOR month IN ([Jan], [Feb], [Mar])) AS pvt;
15. How do you unpivot columns into rows?
Use UNION ALL or UNPIVOT. The reverse of pivoting.
SELECT product, 'Jan' AS month, Jan AS revenue FROM sales
UNION ALL
SELECT product, 'Feb' AS month, Feb AS revenue FROM sales
UNION ALL
SELECT product, 'Mar' AS month, Mar AS revenue FROM sales;
Intermediate SQL Interview Questions
These questions separate the engineers who have been using SQL for real work from the ones who learned it from a tutorial and have never touched a production database with millions of rows.
16. What is the difference between ROW_NUMBER, RANK, and DENSE_RANK?
All three assign a ranking number to each row within a partition. ROW_NUMBER assigns unique sequential numbers starting from 1, even for ties. RANK assigns the same rank to ties but skips the next number. DENSE_RANK assigns the same rank to ties without skipping.
SELECT
name,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rank_num,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees;
-- If salaries are 100, 100, 90:
-- ROW_NUMBER: 1, 2, 3
-- RANK: 1, 1, 3 (skips 2)
-- DENSE_RANK: 1, 1, 2 (no skip)
17. How do you calculate a running total in SQL?
Use a window function with SUM over a window frame that stretches from the first row to the current row.
SELECT
date,
revenue,
SUM(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM daily_sales
ORDER BY date;
If you want a running total that resets when the region changes, add PARTITION BY region.
SELECT
region,
date,
revenue,
SUM(revenue) OVER (PARTITION BY region ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM regional_sales;
18. What is a CTE and when would you use it over a subquery?
A CTE (Common Table Expression) is a named temporary result set defined before the main query. Use CTEs when you want to improve readability, when you need to reference the same subquery multiple times in the main query, or when writing recursive queries.
-- CTE for clarity
WITH high_value_customers AS (
SELECT id, name
FROM customers
WHERE lifetime_value > 10000
)
SELECT
c.name,
SUM(o.total) AS total_orders
FROM high_value_customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
If you need the same subquery twice in a query, a CTE avoids duplicating the logic. CTEs also make recursive queries possible for hierarchical data like org charts.
19. How do you write a recursive CTE?
A recursive CTE has three parts: anchor member, recursive member, and a termination condition. The recursive member references the CTE itself.
-- Build the full reporting chain from the CEO down
WITH RECURSIVE org_chart AS (
-- Anchor: start with the CEO
SELECT id, name, manager_id, 1 AS depth
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive: join employees to their managers
SELECT e.id, e.name, e.manager_id, oc.depth + 1
FROM employees e
JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT * FROM org_chart ORDER BY depth, manager_id;
20. What is the difference between clustered and non-clustered indexes?
A clustered index stores the actual data rows sorted in the index order. A table can have only one clustered index because the data can be sorted in only one way. A non-clustered index creates a separate structure with the indexed values and pointers to the actual data rows.
Think of a clustered index like a dictionary where the words are stored alphabetically in the book itself. A non-clustered index is like the index at the back of a book: a separate list that points you to the right page.
In SQL Server, the primary key defaults to a clustered index. In MySQL with InnoDB, the primary key is the clustered index. In PostgreSQL, there is no concept of a clustered index in the same way; PostgreSQL uses heap storage and all indexes are non-clustered.
21. What is a covering index?
A covering index is a non-clustered index that includes all the columns needed to satisfy a query, so the database never has to look up the actual table pages. This eliminates the extra I/O step called a key lookup in SQL Server or a heap fetch in PostgreSQL.
-- Instead of just indexing user_id:
-- (INCLUDE is SQL Server / PostgreSQL 11+ syntax; MySQL would put
-- status into the index key instead)
CREATE INDEX idx_orders_covering
ON orders(user_id, order_date, total)
INCLUDE (status);
-- Now this query uses only the index:
SELECT order_date, total, status
FROM orders
WHERE user_id = 123;
22. How do you identify slow queries using EXPLAIN?
Run EXPLAIN (or EXPLAIN ANALYZE in PostgreSQL) before your query to see the query execution plan. Look for table scans instead of index seeks, nested loops on large datasets, and sort operations that spill to disk.
-- PostgreSQL
EXPLAIN ANALYZE
SELECT c.name, SUM(o.total)
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.region = 'APAC'
GROUP BY c.name;
-- MySQL
EXPLAIN SELECT c.name, SUM(o.total)
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.region = 'APAC'
GROUP BY c.name;
The key things to look for: type (should be range, ref, or eq_ref, not ALL for full table scan), key (the index being used), rows (how many rows are examined), and Extra (look for Using filesort or Using temporary, both are warning signs).
23. What is normalization and what are 1NF, 2NF, and 3NF?
Normalization is the process of organizing data to reduce redundancy and improve data integrity. Each form builds on the previous one.
1NF requires atomic values. No repeating groups. No array or list columns. Each cell contains only one value.
2NF requires 1NF plus that every non-key column is fully functionally dependent on the entire primary key. If the primary key is a single column, 2NF is automatically satisfied.
3NF requires 2NF plus that no non-key column depends on another non-key column. The transitive dependency rule.
A practical example: a table with student_id, course_id, and course_department violates 3NF because course_department depends on course_id, not directly on student_id. The fix is a separate courses table.
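The fix from that example, sketched as DDL (names are illustrative):

```sql
-- Before: course_department depended on course_id, a transitive dependency.
-- After: the dependency lives in its own table.
CREATE TABLE courses (
    id         INT PRIMARY KEY,
    department VARCHAR(50) NOT NULL
);
CREATE TABLE enrollments (
    student_id INT,
    course_id  INT REFERENCES courses(id),
    PRIMARY KEY (student_id, course_id)
);
```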
24. What is denormalization and when would you use it?
Denormalization is intentionally adding redundant data to a table to improve read performance. You trade write performance and data consistency for faster queries. Use it in read-heavy scenarios like reporting databases, data warehouses, and analytics systems where you query aggregated data far more often than you update individual rows.
A data warehouse is the most common example. You store pre-aggregated totals in a fact table rather than computing joins every time a report runs.
25. What is the difference between OLTP and OLAP?
OLTP (Online Transaction Processing) systems handle day-to-day operational data. They are write-heavy, normalized, and need to support concurrent transactions with ACID guarantees. An order processing system is a classic OLTP example.
OLAP (Online Analytical Processing) systems are read-heavy and used for reporting and analysis. They are often denormalized (star or snowflake schema), store historical data, and need to handle complex aggregation queries across large datasets. A data warehouse that powers your quarterly revenue dashboard is a classic OLAP example.
26. How do you handle slowly changing dimensions (SCDs) in data warehousing?
Type 1 SCD overwrites the old value. You lose history. Type 2 SCD creates a new row with the new value and keeps the old row with an effective date range. Type 3 SCD adds a previous_value column.
For tracking customer address changes where historical accuracy matters, Type 2 is the standard approach.
-- Type 2 SCD for a customer address change:
-- close the current row first, then insert the new one
UPDATE customer_addresses
SET effective_to = CURRENT_DATE - 1, is_current = FALSE
WHERE customer_id = 123 AND is_current = TRUE;
INSERT INTO customer_addresses (customer_id, address, effective_from, effective_to, is_current)
VALUES (123, 'New Address', CURRENT_DATE, '9999-12-31', TRUE);
27. What is a surrogate key and why would you use one?
A surrogate key is a system-generated key (usually auto-increment or UUID) that has no business meaning. Use a surrogate key instead of a natural key when the natural key could change (like a person’s name or an email address that can change), when the natural key is composite and unwieldy, or when you want to decouple the physical storage from business logic.
The alternative is a natural key that has real-world meaning like a country code (IN, US) or a vehicle VIN number. Natural keys have their place in dimension tables for readability, but surrogate keys are the default for fact tables.
28. How do window functions differ from GROUP BY?
GROUP BY collapses rows and reduces the number of output rows. Window functions operate on a window frame without collapsing rows. You still get one output row per input row, but with computed values from related rows.
-- GROUP BY: one row per department
SELECT department, AVG(salary) FROM employees GROUP BY department;
-- Window function: every employee row plus their department average
SELECT name, department, salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;
Every employee row is preserved in the window function output, with the department average attached as an additional column.
29. What is a fact table versus a dimension table?
A fact table stores measurable, numerical events (facts) like transactions, measurements, or events. A dimension table stores descriptive attributes that give context to those facts (who, what, where, when, how). Fact tables are narrow and deep. Dimension tables are wide and shallow.
A sales fact table might have order_id, product_id, customer_id, store_id, quantity, revenue. Dimension tables for products, customers, stores, and time provide names, descriptions, and hierarchies for each.
30. How do you optimize a query that joins multiple large tables?
First, make sure every join condition has an index on the foreign key column. Second, reduce the dataset early with filters before the join. Third, check the join order using EXPLAIN. Fourth, consider hints (Oracle, SQL Server) or settings (PostgreSQL join_collapse_limit) if the optimizer is choosing a bad plan. Fifth, check whether the join type is appropriate.
The ideal plan has index seeks on both sides of the join, not table scans. If you see a nested loop join on two million rows each, you have a problem.
Advanced SQL Interview Questions
These are the questions that separate senior engineers from the rest. Expect them in tech lead, staff engineer, and senior backend interviews at companies where the database is a first-class system.
31. Explain ACID properties in transactions.
Atomicity means the transaction succeeds or fails as a whole. There is no partial state. Consistency means the database moves from one valid state to another, honoring all constraints and triggers. Isolation means concurrent transactions do not interfere with each other. The isolation level determines how they interact. Durability means once a transaction commits, the changes survive system crashes.
Of these, isolation is the one that gets complicated in practice because most databases offer multiple isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE) with different performance and consistency trade-offs.
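The classic illustration of atomicity is a transfer between two accounts: both updates commit together or neither does (accounts is an assumed table):

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- both updates become durable together; a ROLLBACK or crash undoes both
```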
32. What is the difference between optimistic and pessimistic locking?
Pessimistic locking assumes conflict will happen and locks resources before using them. You acquire a lock and hold it until you are done. Useful when contention is high and conflicts are frequent. In SQL, you use SELECT FOR UPDATE or SELECT FOR SHARE.
Optimistic locking assumes conflict is rare and detects it at commit time using a version column. Each row has a version number that increments on update. When you try to update, you check that the version has not changed since you read it. If it has changed, you retry or abort. In SQL, this looks like WHERE id = ? AND version = ? followed by UPDATE … SET … version = version + 1.
Optimistic locking scales better in distributed systems because there are no locks to manage across nodes.
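Side by side, the two patterns look like this (a hypothetical products table with stock and version columns):

```sql
-- Pessimistic: lock the row before touching it
BEGIN;
SELECT stock FROM products WHERE id = 42 FOR UPDATE;
UPDATE products SET stock = stock - 1 WHERE id = 42;
COMMIT;

-- Optimistic: no lock held; the version check detects a conflict
UPDATE products
SET stock = stock - 1, version = version + 1
WHERE id = 42 AND version = 7;
-- zero rows affected means another writer won the race: re-read and retry
```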
33. What are the different isolation levels and what problems does each solve?
READ UNCOMMITTED allows dirty reads, where you can see uncommitted changes from other transactions. Nobody uses this in practice.
READ COMMITTED ensures you only see committed data. Each query in a transaction sees only rows committed before that specific query started. This can cause non-repeatable reads where the same row has different values if you read it twice in the same transaction.
REPEATABLE READ (the default in MySQL with InnoDB) ensures that if you read a row once, you get the same value on the next read within the same transaction. It uses MVCC to achieve this. However, it can still have phantom reads where a new row appears in a range scan if another transaction inserts it.
SERIALIZABLE is the strongest level. It effectively runs transactions sequentially, eliminating all anomalies but at the highest performance cost.
PostgreSQL uses MVCC and defaults to READ COMMITTED. MySQL InnoDB defaults to REPEATABLE READ. SQL Server lets you choose the isolation level per query.
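Setting the level is dialect-specific; a short sketch for the common engines:

```sql
-- PostgreSQL: must come before any query inside the transaction
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;  -- repeated reads see the same snapshot
COMMIT;

-- MySQL: applies to subsequent transactions in this session
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
```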
34. What is deadlock and how do you resolve it?
A deadlock occurs when two or more transactions hold locks that the other needs, and neither can proceed without the other releasing. The only way out is for one transaction to be rolled back. The database detects deadlocks automatically in most systems and terminates one of the transactions.
Prevention is better than detection. Always acquire locks in a consistent order across all transactions. If transaction A locks resources 1 then 2, transaction B must also lock 1 then 2, never 2 then 1. In practice, this means defining a lock ordering protocol for your application.
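A lock-ordering protocol in practice: every transaction that touches both rows acquires locks in ascending id order, so a cycle can never form (accounts is an assumed table):

```sql
BEGIN;
-- always lock the lower id first, regardless of transfer direction
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;
SELECT * FROM accounts WHERE id = 2 FOR UPDATE;
UPDATE accounts SET balance = balance - 50 WHERE id = 2;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;
COMMIT;
```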
35. How do you tune a slow-running query in production?
The process starts with identifying the query using slow query logs, pg_stat_statements in PostgreSQL, or sys.dm_exec_query_stats in SQL Server. Pull the execution plan and look for table scans, large nested loops, sorts that spill to disk, and missing indexes.
Add indexes for the columns in WHERE, JOIN, and ORDER BY clauses. Consider composite indexes and their column ordering. Rewrite the query to filter early, eliminate unnecessary columns from SELECT, and avoid wrapping indexed columns in functions in the WHERE clause; a predicate the engine can match directly against an index is called sargable, and a function on the column breaks that.
Check if statistics are stale using ANALYZE or UPDATE STATISTICS. Stale statistics cause the optimizer to choose bad plans. If the query is still slow, consider partitioning the table or materializing the result if it is called frequently.
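One rewrite worth memorizing: moving the function off the indexed column so the predicate stays sargable (YEAR() is MySQL/SQL Server syntax):

```sql
-- Non-sargable: the function hides created_at from the index
SELECT * FROM orders WHERE YEAR(created_at) = 2026;

-- Sargable rewrite: a bare range the index can seek into
SELECT * FROM orders
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01';
```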
36. What is a covering index versus a composite index?
A composite index is a single index on multiple columns. The order matters. The index helps queries that filter on the leftmost columns of the index. If you create INDEX (a, b, c), it helps queries filtering on a, or a+b, or a+b+c. It does not help queries filtering only on b or only on c.
A covering index is a non-clustered index that includes all columns needed by a query, eliminating the table lookup. A composite index can be a covering index if it includes the right columns.
37. How do you handle duplicate records in a large table?
For deduplication, identify candidate keys (usually a unique constraint on the columns that define a logical duplicate), then delete all but the one with the latest created_at or highest id.
-- Delete duplicates keeping the row with highest id
DELETE FROM orders
WHERE id NOT IN (
SELECT MAX(id)
FROM orders
GROUP BY customer_id, product_id, created_at
);
For large tables, run this in batches to avoid locking the table for too long.
-- Batch delete in chunks; repeat until zero rows are affected
DELETE FROM orders
WHERE id IN (
SELECT id FROM (
SELECT id,
ROW_NUMBER() OVER (PARTITION BY customer_id, product_id ORDER BY created_at DESC) AS rn
FROM orders
) sub
WHERE rn > 1
)
AND id IN (
-- wrapped in a derived table because MySQL rejects LIMIT
-- directly inside an IN subquery
SELECT id FROM (SELECT id FROM orders ORDER BY id LIMIT 10000) batch
);
38. What is a materialized view and when would you use it?
A materialized view is a pre-computed result set stored on disk. Unlike a regular view which is a saved query definition, a materialized view holds the actual data. Use materialized views when you have expensive aggregations that are queried frequently but data changes infrequently.
PostgreSQL supports materialized views natively. MySQL does not (you simulate them with summary tables refreshed by scheduled events). Oracle has native materialized views; SQL Server's indexed views serve a similar purpose.
The tradeoff is storage and maintenance overhead. You need to refresh the materialized view on a schedule or after data changes, which adds write latency.
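A PostgreSQL sketch (daily_revenue and orders are illustrative names):

```sql
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day, SUM(total) AS revenue
FROM orders
GROUP BY 1;

-- Refresh on a schedule; CONCURRENTLY avoids blocking readers
-- but requires a unique index on the view
CREATE UNIQUE INDEX ON daily_revenue (day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```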
39. What is query plan caching and when can it hurt you?
Most databases cache compiled query plans. If the same query text is executed again, the database reuses the cached plan. This saves compilation overhead.
It hurts you when parameter values cause different plans to be optimal. A query that returns 10 rows might benefit from an index seek, while the same query with a parameter that returns 100,000 rows might need a table scan. If the cached plan was compiled for the first execution, the second execution suffers. This is called parameter sniffing.
Workarounds include using OPTIMIZE FOR hints, PLAN guides, or rewriting the query to prevent plan reuse.
40. How do you design a schema for an e-commerce platform?
Start with the core entities: users, products, orders, payments, shipments.
Users table: id, email, password_hash, created_at. No multi-email patterns. One email per user.
Products table: id, name, description, price, inventory_count, seller_id. Products are inventory-tracked items.
Orders table: id, user_id, status, total, shipping_address_id, created_at. A parent order containing multiple line items.
Order_items table: id, order_id, product_id, quantity, price_at_purchase. Denormalizing price_at_purchase because the product price can change after the order.
Payments table: id, order_id, amount, payment_method, status, transaction_id. External payment references go here.
Shipments table: id, order_id, carrier, tracking_number, status.
Consider how you handle inventory. For high-traffic platforms, optimistic concurrency on inventory_count (check that quantity is still available before decrementing) beats row-level locking.
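That optimistic inventory check can be a single atomic statement (column names follow the sketch above):

```sql
-- Decrement only if stock is still available; the WHERE clause is the check
UPDATE products
SET inventory_count = inventory_count - 1
WHERE id = 42 AND inventory_count >= 1;
-- zero rows affected means the item sold out between page load and checkout
```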
Expert SQL Interview Questions
These are questions for principal engineers, staff engineers, and architects. They test systems thinking, not just SQL syntax.
41. How do you ensure data consistency across distributed databases?
This is one of the hardest problems in distributed systems. There is no way to get a single atomic commit across independent databases without extra coordination. You have a few patterns.
Two-phase commit (2PC) coordinates a commit across nodes in two phases: prepare then commit. It works but is blocking and can have coordinator failures that leave locks in an uncertain state.
Saga pattern breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that undoes the local change. If step 3 fails, you run compensating transactions for steps 2 and 1 in reverse order. This is what most modern microservices use.
Change Data Capture (CDC) streams database changes to a message broker like Kafka. Downstream consumers apply the changes to their own stores. This is eventual consistency, not immediate consistency, but it scales.
The CAP theorem states that during a network partition a distributed system must choose between consistency and availability; it cannot guarantee both. Most distributed databases choose availability under partition (AP) and give you eventual consistency.
42. How does query optimization work at a cost-based optimizer level?
The query optimizer generates multiple execution plans for your SQL query and picks the one with the lowest estimated cost. The cost model considers CPU cost (row processing), I/O cost (page reads), and memory usage. Statistics about table sizes, row counts, column cardinalities, and index selectivity feed into this estimation.
If your statistics are stale or missing, the optimizer guesses wrong. This is why running ANALYZE or UPDATE STATISTICS regularly matters. A bad plan because of bad statistics can be 100x slower than a good plan.
Modern optimizers (PostgreSQL planner, Oracle Cost-Based Optimizer, SQL Server optimizer) use rule-based and cost-based approaches. They will use indexes if the estimated cost is lower than a table scan, which depends on how selective the filter condition is.
43. What is sharding and when would you shard a database?
Sharding splits data across multiple database nodes by a shard key. The goal is horizontal scalability when a single database server can no longer handle the read/write load or storage requirements.
Choose a shard key with high cardinality and even data distribution. A bad shard key (like a region name if 80 percent of your users are in one region) creates hotspots, where one shard is overwhelmed while others sit idle.
Sharding adds significant complexity. Cross-shard queries (joining data across shards) are expensive or impossible in naive sharding implementations. Most teams avoid sharding until they have exhausted vertical scaling, read replicas, and caching layers.
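Routing is usually application-side. A minimal Python sketch, using a stable hash so the same key always maps to the same shard (the function name and shard count are illustrative):

```python
import hashlib

def shard_for(user_id: int, num_shards: int = 4) -> int:
    """Map a shard key to a shard number.

    Uses MD5 rather than Python's built-in hash(), which is salted
    per process and would route the same key differently on restart.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same key always lands on the same shard
assert shard_for(12345) == shard_for(12345)
```

Hash sharding distributes load evenly but makes range queries across keys expensive; range sharding is the usual alternative when those queries matter.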
44. How do you handle high-concurrency writes to the same table?
For high-concurrency writes to the same table, consider connection pooling to reduce the load on the database server. Use batch inserts instead of individual inserts when possible (INSERT INTO … VALUES (…), (…), (…) is far more efficient than N separate INSERT statements).
For write contention on the same row (like updating a shared counter), use atomic operations like UPDATE counters SET value = value + 1 WHERE id = ? instead of SELECT then UPDATE in separate statements, which creates a race condition. Or use optimistic locking with a version column.
If you need true serializability for financial transactions, consider SELECT FOR UPDATE with appropriate isolation levels, but understand that this reduces concurrency.
45. What is the difference between a stored procedure and a user-defined function?
Stored procedures are called imperatively with CALL or EXEC. They can have multiple statements, modify database state, and may or may not return a result set. Functions are called as part of an expression in SELECT or WHERE. They must return a value and, in most engines (SQL Server in particular), cannot modify database state; PostgreSQL functions are more permissive but still cannot manage transactions themselves.
Functions are composable. You can use them in SQL expressions everywhere. Procedures are not composable in the same way. Functions run in the same execution context as the calling query, which means they cannot do things that violate query-at-a-time semantics like committing transactions.
The practical difference: use functions for computations that transform data (like a haversine distance function). Use procedures for batch operations, complex validation logic, or workflows that need multiple steps.
46. How do you use SQL for time-series data?
For time-series data, use date_trunc to bucket by time period. Use window functions for running calculations. Consider partitioning by time range if the table is large (SQL Server and PostgreSQL support range partitioning).
For Grafana-style dashboards querying large time-series tables, a composite index on (device_id, timestamp), with timestamp as the last column, makes range queries efficient because the index orders by device first, then by time within each device.
If you need downsampling (storing 1-minute data for 24 hours, then hourly aggregates after that), use a scheduled job or trigger to populate an aggregate table.
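The time-bucketing pattern can be sketched with sqlite3, using SQLite's strftime as a stand-in for PostgreSQL's date_trunc (the `readings` table and its columns are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id INT, ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, "2026-01-01 10:05:00", 2.0),
        (1, "2026-01-01 10:40:00", 4.0),
        (1, "2026-01-01 11:10:00", 6.0),
    ],
)
# Bucket by hour and aggregate; in PostgreSQL this would be
# date_trunc('hour', ts) instead of strftime.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m-%d %H:00:00', ts) AS bucket, AVG(value)
    FROM readings
    GROUP BY bucket
    ORDER BY bucket
    """
).fetchall()
print(rows)  # [('2026-01-01 10:00:00', 3.0), ('2026-01-01 11:00:00', 6.0)]
```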
47. What is change data capture and how does it work?
CDC captures insert, update, and delete events from the database's own log: the write-ahead log (WAL) in PostgreSQL, the binlog in MySQL, or the transaction log in SQL Server. Tools like Debezium, Maxwell, or AWS DMS read these logs and publish changes to a stream.
Applications subscribe to the stream and apply changes to their own stores, build audit logs, update search indexes, or propagate to data warehouses. CDC is the backbone of the modern data stack: it decouples the operational database from analytical workloads, search indexes, and cache layers.
48. How do you secure SQL queries against injection?
Use parameterized queries or prepared statements exclusively. Never concatenate user input into SQL strings, even if you think the input is sanitized. Use whitelist validation for values that must be from a known set.
ORMs like SQLAlchemy, Hibernate, and Active Record generate parameterized queries, but raw SQL strings you write yourself need explicit parameter binding.
```python
# Safe: parameterized query
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))

# Unsafe: string concatenation (never do this)
cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")
```
The second form is a SQL injection vulnerability. If user_id comes from an HTTP parameter and someone sends `1; DROP TABLE users; --`, your database is gone.
49. What is a database connection pool and why does it matter?
A connection pool holds a set of open database connections that are reused across requests. Opening a new connection costs a TCP handshake, authentication, and session setup, typically 10-50 ms of overhead. A pool amortizes this cost across many requests.
Connection pools also limit concurrency. Your database has a maximum number of connections. A pool enforces this limit at the application layer, queuing requests when all connections are in use. Without a pool, your application can easily overwhelm the database by spawning unlimited connections under load.
Popular connection pools: HikariCP (Java), pgBouncer and PgPool-II (PostgreSQL), ProxySQL (MySQL), and built-in pool modules in Node.js (pg pool), Python (SQLAlchemy pool), and Rails (ActiveRecord pool).
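To make the mechanics concrete, here is a minimal illustrative pool built on `queue.Queue` and sqlite3. This is a sketch of the idea, not a production pool (real pools also handle health checks, reconnection, and timeouts per statement):

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal illustrative pool: pre-opens N connections and reuses them."""

    def __init__(self, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection move between threads.
            self._pool.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self, timeout=5.0):
        # Blocks (up to timeout) when all connections are in use, which is
        # how a pool caps concurrency at the application layer.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2)
conn = pool.acquire()
result = conn.execute("SELECT 1 + 1").fetchone()[0]
pool.release(conn)
print(result)  # 2
```

The blocking `acquire` is the important part: under load, requests queue for a connection instead of opening new ones and overwhelming the database.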
50. How do you test SQL queries before deploying to production?
Write unit tests that assert the query results against known data. For complex queries, create a test database with fixture data that covers edge cases (NULL values, empty tables, boundary conditions like exactly-zero quantities).
For PostgreSQL, use pgTAP or write tests in a framework like Jest with a test database. For MySQL, use MySQL Sandbox or containers to spin up an ephemeral test instance.
Use EXPLAIN ANALYZE to verify the query plan is what you expect before deployment. If you added an index specifically for this query, confirm the plan uses it.
Run the query against production-size data in a staging environment with real cardinality. A query that returns correct results on 100 test rows might time out on 10 million production rows because the index selectivity is different at scale.
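A unit test with fixture data covering those edge cases might look like this sketch, using sqlite3 and Python's unittest (the `line_items` schema and the revenue query are hypothetical):

```python
import sqlite3
import unittest

REVENUE_QUERY = """
    SELECT COALESCE(SUM(quantity * unit_price), 0)
    FROM line_items WHERE order_id = ?
"""

class RevenueQueryTest(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE line_items (order_id INT, quantity INT, unit_price REAL)"
        )
        # Fixture rows covering edge cases: a normal row, an exactly-zero
        # quantity, and a NULL price.
        self.conn.executemany(
            "INSERT INTO line_items VALUES (?, ?, ?)",
            [(1, 2, 10.0), (1, 0, 99.0), (2, 1, None)],
        )

    def test_normal_order(self):
        self.assertEqual(self.conn.execute(REVENUE_QUERY, (1,)).fetchone()[0], 20.0)

    def test_null_price_and_missing_order(self):
        # NULL prices and nonexistent orders should both yield 0, not NULL.
        self.assertEqual(self.conn.execute(REVENUE_QUERY, (2,)).fetchone()[0], 0)
        self.assertEqual(self.conn.execute(REVENUE_QUERY, (99,)).fetchone()[0], 0)
```

Each test runs against a fresh in-memory database, so fixtures stay isolated and the suite can run in CI without a shared database server.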
FAQ
Q1: What is the best way to prepare for SQL interview questions?
Practice writing real queries on real data. Set up a sandbox database with sample data and work through scenarios. Understanding beats memorization.
Q2: How many SQL interview questions should I know?
Know the fundamentals solidly. For senior roles, be comfortable with window functions, query optimization, and data modeling. The 50 questions in this guide cover what you need.
Q3: What is the difference between RANK and DENSE_RANK in SQL?
RANK skips numbers after a tie. DENSE_RANK does not. If three rows tie at rank 1, RANK gives the next row rank 4. DENSE_RANK gives it rank 2.
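That exact tie scenario can be verified with sqlite3 (SQLite 3.25+ supports window functions; the `scores` table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INT)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 100), ("b", 100), ("c", 100), ("d", 90)],
)
# Three rows tie at rank 1; RANK gives the next row 4, DENSE_RANK gives it 2.
rows = conn.execute(
    """
    SELECT name,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
    """
).fetchall()
print(rows)  # [('a', 1, 1), ('b', 1, 1), ('c', 1, 1), ('d', 4, 2)]
```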
Q4: What is a clustered index in SQL?
A clustered index sorts the actual data rows of a table by the index key. Only one clustered index per table. The primary key typically becomes the clustered index.
Q5: How do you optimize a slow SQL query?
Start with EXPLAIN to see the execution plan. Look for table scans. Add indexes on WHERE, JOIN, and ORDER BY columns. Filter early to reduce the dataset before joins. Check that statistics are up to date.
Q6: What is ACID in database transactions?
Atomicity (all or nothing), Consistency (valid state to valid state), Isolation (concurrent transactions do not interfere), Durability (committed changes survive crashes).
Q7: What is the difference between OLTP and OLAP databases?
OLTP handles live transaction processing and is write-heavy. OLAP handles analytics and reporting and is read-heavy. OLTP uses normalized schemas. OLAP uses denormalized star or snowflake schemas.
Q8: How do you handle duplicate records in SQL?
Identify duplicates using GROUP BY with HAVING COUNT(*) > 1. Keep the row with the highest id or latest created_at. Delete the rest in batches for large tables.
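The find-then-delete pattern can be sketched with sqlite3 (the `users` table is hypothetical; on very large tables the DELETE should be batched as noted above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@x.com",), ("a@x.com",), ("b@x.com",)],
)
# Identify duplicate emails.
dupes = conn.execute(
    "SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1"
).fetchall()
# Keep the row with the highest id per email, delete the rest.
conn.execute(
    "DELETE FROM users WHERE id NOT IN (SELECT MAX(id) FROM users GROUP BY email)"
)
print(dupes, conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])
# [('a@x.com', 2)] 2
```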