Database interview questions
Table of Contents
Lock-Based Concurrency Control
Lock-based protocols in DBMS ensure that a transaction cannot read or write a data item until it obtains the appropriate lock. By controlling access in this way, these protocols maintain serializability and prevent concurrency problems such as lost updates or dirty reads.
A lock is a mechanism used to manage concurrent access to database items. Each transaction must acquire the necessary lock before performing read or write operations. There are two main types of locks:
lock-S.lock-X.A transaction can acquire a lock only if it is compatible with existing locks held by other transactions:
Types of Lock-Based Protocols
insert,delete, or update operations. The lock is released only when the transaction completes.The moment a transaction has acquired all the locks it requires is called the lock point. After this point, it transitions from the growing to the shrinking phase.
Lock Upgrading and Downgrading
By enforcing these rules, the Two-Phase Locking protocol ensures consistency and prevents conflicts between concurrent transactions. Restricting when locks can be acquired, upgraded, or released helps maintain a serializable schedule, preserving the integrity of the database.
Optimistic vs Pessimistic Locking
Pessimistic locking is a concurrency control mechanism that assumes data conflicts are likely when multiple users or processes access the same resource simultaneously. To prevent these conflicts, it acquires locks on resources before performing operations, ensuring exclusive access for the duration of a transaction. Once a lock is obtained, no other transaction can modify or read the locked resource until the lock is released. This approach guarantees consistency but can lead to waiting times, reduced concurrency, and potential performance bottlenecks under heavy load.
Pessimistic locking can be applied at different levels of granularity — from database-level and table-level locks to row-level or even column-level locks. It is commonly used in environments where data integrity is critical and the cost of conflicts is high, such as in banking systems and financial transactions.
Optimistic locking takes the opposite approach to traditional locking mechanisms. It assumes that write conflicts are rare and that most transactions can proceed concurrently without interfering with one another. Instead of preventing access to shared data, the database allows multiple transactions to read and modify the same row at the same time.
Rather than locking rows during reads, optimistic locking detects conflicts only at the time an update is attempted. Each row is associated with a version identifier, such as a numeric counter, a timestamp, or a hash. This version is read together with the row and later used to verify that the data has not been modified by another transaction.
When an update occurs, the database checks whether the version stored in the table is still the same as the version that was originally read. If the versions match, the update succeeds and the version is updated atomically along with the data. If the versions do not match, it means another transaction has modified the row in the meantime, and the update is rejected to prevent lost updates.
Version identifiers can be implemented in several ways. A common approach is an integer version field that increments on every successful update. Some systems use a timestamp field such asupdated_at, which is overwritten on modification, while others use a checksum or hash derived from the row contents that is recomputed whenever the data changes.
The optimistic locking workflow begins by reading the row together with its current version.
1SELECT id, balance, version
2FROM accounts
3WHERE id = 1;The application receives the row data along with the version value.
1id = 1, balance = 100, version = 5When updating the row, the application includes the previously read version in the update condition. This ensures that the update will only succeed if no other transaction has modified the row since it was read.
1UPDATE accounts
2SET balance = 120, version = version + 1
3WHERE id = 1 AND version = 5;If exactly one row is affected, the update succeeds and the version is incremented. If zero rows are affected, the version check has failed, indicating a conflict. The application can then decide whether to retry the operation using the latest data or abort the transaction altogether.
Optimistic locking minimizes locking overhead and maximizes concurrency, making it well suited for high-read systems and workloads with low write contention, such as REST APIs, microservices, and distributed applications.
In summary, pessimistic locking prioritizes safety and consistency by restricting access through locks, while optimistic locking prioritizes performance and scalability by assuming that conflicts are infrequent and resolving them only when necessary.
Write-Ahead Logging (WAL)
Write-Ahead Logging (WAL) is a core durability mechanism in databases. Before any change is made to the actual data files on disk, the change must first be written to a log that is persisted safely. This design allows PostgreSQL to guarantee durability without slowing down every transaction with expensive disk writes.
When you commit a transaction, PostgreSQL does not immediately flush the modified data pages to disk. Instead, it follows a carefully optimized sequence. The database first records all changes made by the transaction in the WAL. These WAL records are then flushed to disk using fsync, ensuring the log is safely persisted. Once this flush completes, PostgreSQL reports a successful commit back to the application. The actual data pages are written to disk later by background processes.
The key insight is that the WAL contains enough information to fully reconstruct every committed transaction. As long as the WAL is durable, the database can always recover to a consistent state, even if the data files themselves are temporarily out of date.
If a crash occurs after the commit has been acknowledged but before the modified data pages are written to disk, PostgreSQL will replay the WAL during startup. This recovery process re-applies all committed changes, restoring the database to the exact state it was in before the crash. Although recovery may take longer in this scenario, crashes are relatively rare, making this trade-off highly effective in practice.
From a performance perspective, WAL provides several important advantages. Writes to the WAL are sequential, which is significantly faster than random writes to data pages scattered across disk. Multiple transactions can also be grouped together during WAL flushes, reducing the number of expensive disk sync operations. As a result, commit latency becomes more predictable and primarily depends on WAL write throughput rather than complex data page updates.
In short, WAL allows PostgreSQL to decouple transaction durability from data page persistence. By making the log durable first and deferring data writes, PostgreSQL achieves both strong crash safety and high transaction throughput.
How to Prevent Deadlocks
Deadlock prevention techniques ensure that circular wait conditions never occur. One common approach is to assign priorities to transactions based on timestamps.
Each transaction is assigned a timestamp when it starts. Older transactions have higher priority (smaller timestamp), while younger transactions have lower priority (larger timestamp). When a transaction requests a lock that conflicts with another, the system decides whether it should wait or abort based on these priorities.
Wait-Die Policy
This is a non-preemptive scheme where only the requesting transaction may be aborted. If an older transaction requests a lock held by a younger transaction, it is allowed to wait. If a younger transaction requests a lock held by an older transaction, it is forced to abort ("die") and restart later. A drawback is that younger transactions may be repeatedly aborted, which can lead to inefficiency. However, once a transaction has acquired all the locks it needs, it will never be aborted under this policy.
Wound-Wait Policy
This is a preemptive scheme where a transaction holding a lock may be aborted. If an older transaction requests a lock held by a younger transaction, the younger transaction is aborted ("wounded") and the older one proceeds. If a younger transaction requests a lock held by an older transaction, it must wait. This policy reduces unnecessary waiting but may result in more transaction rollbacks.
Avoiding Starvation
To prevent starvation, when a transaction is restarted after being aborted, it must retain its original timestamp. This ensures that its priority does not decrease over time and it will eventually complete.
Phantom Problem
A phantom read is a concurrency anomaly where the same query executed multiple times within a transaction returns a different set of rows. This occurs because another transaction inserts, deletes, or updates rows that satisfy the query condition.
For example, a transaction may query all accounts with a balance greater than 1000. If another transaction modifies the data such that new rows satisfy this condition, the original transaction will observe additional rows when the query is executed again. These newly appearing rows are called phantoms.
1SELECT * FROM accounts WHERE balance > 1000;Preventing phantom reads requires the database to protect not only the rows currently returned by the query, but also the range of values defined by the query condition.
Range / Next-Key Locking (Most Common)
Instead of locking individual rows, the database locks a range of values in an index that corresponds to the query condition.
1Query: balance > 1000
2Locks: (1000, +∞)This prevents inserts or updates that would introduce new rows into the range. As a result, no new rows can appear in subsequent queries, eliminating phantom reads. This technique is commonly used in storage engines such as InnoDB in MySQL.
Predicate Locking (Conceptual)
Predicate locking directly locks the logical condition of a query rather than specific rows or index ranges.
1LOCK WHERE balance > 1000This blocks any insert, delete, or update that would satisfy the predicate, fully preventing phantom reads. Although precise, this approach is expensive to implement and is rarely used directly in practice.
Serializable Isolation Level
At the Serializable isolation level, the database guarantees that concurrent transactions behave as if they were executed in a serial order. To enforce this, the database may use range locking (in lock-based systems) or conflict detection (in MVCC-based systems). Phantom reads are fully prevented under this isolation level.
MVCC with Conflict Detection
Some databases, such as PostgreSQL, rely on Multi-Version Concurrency Control (MVCC) instead of strict locking. These systems allow non-blocking reads, track dependencies between transactions, and abort transactions when a conflict (such as a phantom) would violate serializability. This approach avoids heavy locking while still ensuring correctness.
Why Row Locking Is Not Enough
Row-level locks only protect existing rows, not the gaps between them. This means new rows can still be inserted into the queried range.
1Existing rows: [2000, 3000]
2Another transaction inserts: 1500Since the new row was not previously locked, it appears in subsequent queries, resulting in a phantom read. Preventing this requires locking the range, not just the rows.
Multi-Version Concurrency Control (MVCC)
Multi-Version Concurrency Control (MVCC) is a database concurrency technique that allows multiple transactions to operate on the same data without blocking each other. Instead of overwriting existing data, the database maintains multiple versions of each record. This enables readers to access a consistent snapshot of the database while writers continue making changes concurrently.
When a transaction begins, it is assigned a timestamp (or transaction ID) that defines its view of the database. The transaction will only see data that was committed before this point in time. Any updates made after the transaction starts are invisible to it, ensuring a stable and repeatable view of the data throughout its execution.
In MVCC, every database record is associated with a version. Rather than modifying a record in place, write operations create a new version of that record. Concurrent reads always access the latest version that is visible to their transaction snapshot, not necessarily the absolute newest version in the system.
When an update occurs, the database creates a copy of the record and applies the changes to this new version. While the update is in progress, other transactions can continue reading the older version without being blocked. Once the write is successfully committed, the new version becomes visible to subsequent transactions, effectively advancing the version of the record. This process repeats for every update, forming a chain of versions over time.
Because reads and writes do not block each other, MVCC improves overall system performance, especially in read-heavy workloads. It also reduces the likelihood of deadlocks and minimizes lock management overhead, while still maintaining strong isolation guarantees for transactions.
Write conflicts are typically resolved at commit time. For example, if two transactions attempt to update the same record concurrently, the database may use a strategy such as "first committer wins," where one transaction succeeds and the other is aborted and must retry.
However, MVCC comes with trade-offs. Since old versions of records are retained, the database can grow in size over time, leading to storage bloat. To manage this, databases like PostgreSQL use a background process called VACUUM to reclaim space by removing obsolete versions that are no longer visible to any active transaction.
Internally, MVCC relies on tracking transaction visibility using metadata such as transaction IDs and status tables. When a transaction starts, it effectively takes a snapshot of which transactions are active or committed. This snapshot is then used to determine which versions of each record are visible, ensuring correctness without requiring expensive locking mechanisms.
Log-Based Database Recovery
Log-based recovery ensures that a database can recover correctly after a crash. Instead of directly relying on data pages, the database first records every change in a log, which is stored on stable storage so it survives crashes.
Each log record contains:
- Transaction ID (TID): identifies which transaction made the change
- LSN (Log Sequence Number): strictly increasing, determines the order of operations
- Action / update info: e.g., page ID, old value, new value, or COMMIT/ABORT
Example log records:
1LSN TID Action
2100 T1 Update P1: 100 → 200
3101 T2 Update P2: 50 → 75
4102 T1 Update P3: 10 → 15
5103 T1 COMMIT
6104 T3 Update P4: 5 → 8The Transaction Table (TT) keeps track of all active transactions in memory:
1TID lastLSN Status
2T1 103 Committed
3T2 101 Running
4T3 104 RunningThe Dirty Page Table (DPT) tracks memory pages modified but not yet flushed to disk:
1Page recLSN(Earliest LSN that made the page dirty to identify starting point for recovery)
2P1 100
3P3 102
4P4 104Recovery after crash:
1. Analysis phase: rebuild TT & DPT from the log
2. Redo phase: replay only committed transactions based on TT
1- T1 committed → redo LSN 100 (P1), 102 (P3)
2- T2 running → skip
3- T3 running → skip3. Undo phase: roll back uncommitted transactions using logs
1- T2 → revert P2 to old value
2- T3 → revert P4 to old valueDirty page flushing: during normal execution, dirty pages are written to disk in the background. The database ensures all logs affecting that page have already been flushed (pageLSN ≤ flushedLSN) before writing, so recovery can always reconstruct the correct state.
By combining logs, the Transaction Table, the Dirty Page Table, and WAL rules, the database achieves high performance and reliability. Even if a crash happens at any moment, it can safely redo committed transactions and undo uncommitted ones, restoring consistency.
Replication
Replication is the process of storing multiple copies of the same data across different machines (nodes). The goal is to improve:
• Fault tolerance — data is still available even if a server crashes
• High availability — requests can still be served during failures
• Durability — data is less likely to be permanently lost
• Read scalability — reads can be distributed across replicas
Replication with Consistent Hashing
In distributed databases using consistent hashing, data is assigned to nodes on a logical hash ring.
1Hash Ring:
2
3 [Node A]
4 / \
5 key=123 [Node B]
6 \ /
7 [Node C]To determine where data should be stored:
Step 1: Hash the key
Step 2: Move clockwise on the hash ring
Step 3: The first node encountered becomes the primary replica
Additional replicas are then placed on the next N - 1 nodes clockwise.
1Replication Factor = 3
2
3Key "user:15"
4
5Primary Replica -> Node A
6Replica Copy -> Node B
7Replica Copy -> Node CReplication Factor
The replication factor determines how many copies of the data exist.
1Replication Factor = 3
2
31 primary copy
42 replica copiesA higher replication factor improves fault tolerance, but also increases storage usage and network overhead.
Write Replication Strategies
1. Synchronous Replication
1Client
2 |
3 v
4Write -> Node A
5 |
6 +--> Node B
7 |
8 +--> Node C
9
10Success only after replicas acknowledgeIn synchronous replication, the write is considered successful only after multiple replicas confirm the write.
✔ Strong consistency
✔ Safer during node failures
❌ Higher latency due to network coordination
2. Asynchronous Replication
1Client
2 |
3 v
4Write -> Node A (returns immediately)
5
6Node A later replicates to:
7 -> Node B
8 -> Node CIn asynchronous replication, the primary node immediately responds to the client, then propagates changes to replicas afterward.
✔ Lower latency
✔ Higher throughput
❌ Replicas may temporarily contain stale data
❌ Risk of data loss if the primary crashes before replication completes
Read Strategies
1. Read From One Replica
The client reads from a single replica node. This is fast, but the replica may not yet contain the latest updates.
1Client -> Replica Node B✔ Fast reads
✔ Lower network overhead
❌ May return stale data
2. Quorum Reads
The client queries multiple replicas and selects the latest version using timestamps or version numbers.
1Client
2 |
3 +--> Node A
4 +--> Node B
5 +--> Node C
6
7Return newest version✔ Stronger consistency
✔ Better fault tolerance
❌ Increased latency and network usage
Eventual Consistency
Many distributed databases use eventual consistency. This means replicas may temporarily differ, but given enough time without new writes, all replicas will eventually converge to the same value.
1Time T1:
2Node A = Version 5
3Node B = Version 4
4
5Time T2:
6Node A = Version 5
7Node B = Version 5Summary
Replication improves availability, durability, and scalability by storing multiple copies of data across nodes. Systems must balance:
• Consistency
• Availability
• Latency
• Fault tolerance
Read Repair
Read repair is a mechanism used in distributed databases to detect and fix inconsistent replicas during a read operation. Instead of repairing stale replicas in the background, the system repairs them when data is read by a client.
Why Read Repair Exists
Read repair exists because many distributed databases use eventual consistency. In systems such as Apache Cassandra and Amazon DynamoDB, writes may not reach all replicas at the same time. Temporary inconsistency is therefore expected.
Problem Scenario
Assume the replication factor is3.
1Key X stored on:
2- Node A
3- Node B
4- Node CA client performs the following write:
1Write X = 10Suppose:
• Node A successfully stores the write
• Node B successfully stores the write
• Node C is temporarily down and misses the update
1Node A = 10
2Node B = 10
3Node C = 5 <-- stale replicaThe replicas are now inconsistent.
What Happens During a Read
The client now performs:
1Read XThe coordinator node reads from multiple replicas:
1Node A -> 10
2Node B -> 10
3Node C -> 5The system detects that Node C contains stale data.
Read repair now:
• Determines the latest correct value
• Sends the updated value back to stale replicas
• Repairs the inconsistency automatically
1Repair:
2Node C <- 10How the System Knows Which Value is Correct
Each stored value usually contains metadata such as:
• Timestamps
• Version numbers
• Vector clocks / version vectors
A common strategy is: Last Write Wins (LWW).
1Node A -> (10, timestamp=100)
2Node B -> (10, timestamp=100)
3Node C -> (5, timestamp=80)Since timestamp 100 is newer than 80, the system concludes that:
110 is the latest valid valueHow Inconsistency is Detected
Read repair only works because the coordinator reads from multiple replicas and compares their versions.
There is no global cache storing the “latest value”. The coordinator determines correctness dynamically during the read.
1Read from replicas:
2(A) -> (10, t=100)
3(B) -> (10, t=100)
4(C) -> (5, t=80)
5
6Mismatch detected -> repair triggeredQuorum Reads
Systems like Apache Cassandra do not always read from every replica.
Instead, they often use quorum reads.
1Replication Factor (N) = 3
2Read Quorum (R) = 2
3
4Read from any 2 replicasIf inconsistencies are detected among the replicas read, the stale replicas can be repaired.
Types of Read Repair
1. Blocking Read Repair
The coordinator repairs stale replicas before returning the response.
✔ Stronger consistency guarantees
✔ Client receives repaired data
❌ Higher read latency
2. Background (Asynchronous) Read Repair
The system immediately returns the newest value to the client, while repairs happen asynchronously in the background.
✔ Lower latency
✔ Faster reads
❌ Temporary inconsistency may still exist briefly
Summary
Read repair is a self-healing mechanism used in eventually consistent systems. During reads, the coordinator compares replicas, detects stale data, and repairs inconsistent nodes automatically.
It helps distributed databases gradually converge toward consistency without requiring every write to be fully synchronous.
Sharding
Sharding is a database scaling technique where data is split across multiple machines called shards. Instead of storing all data on a single database server, we horizontally partition the data so that each shard only stores a subset of it. This allows the system to scale storage and traffic beyond the limits of a single machine.
Choosing a Good Shard Key
The shard key determines how data is distributed across shards. Choosing the right shard key is critical because it directly affects scalability, performance, and query efficiency.
A good shard key should have high cardinality, meaning the field contains many unique values so data can be spread evenly across shards. It should also have an even distribution to prevent some shards from becoming much larger or busier than others. Finally, the shard key should align with common query patterns so that most requests can be routed directly to a single shard instead of broadcasting queries to every shard.
A bad shard key usually has low cardinality or uneven traffic patterns. This can lead to hot shards, where a small number of machines receive the majority of requests. It can also cause expensive scatter-gather queries, where the database must query every shard before combining the results.
Hash-Based Sharding
Modern systems commonly use hash-based sharding instead of range-based sharding. In hash-based sharding, we compute a hash of the shard key and distribute data based on the hash result.
1shard = hash(userId) % numberOfShardsHashing helps distribute data more evenly compared to range-based sharding, where sequential keys can overload specific shards. Many systems also use consistent hashing, which minimizes data movement whenever shards are added or removed.
Hotspots
Even with a good shard key, some shards can still receive significantly more traffic than others. This problem is known as a hotspot.
For example, imagine a social media platform where a celebrity profile receives millions of views and likes. If all requests for that user are routed to a single shard, that shard may become overloaded while other shards remain underutilized.
One solution is to use a composite shard key. Instead of sharding purely by userId, we can distribute requests using values such as:
1(userId + n)
2(userId + createdAt)Another common approach is to create dedicated shards for high-traffic users. Regular users continue using normal hash-based sharding, while celebrity accounts are isolated onto their own shards to avoid affecting the rest of the system.
Cross-Shard Operations
Cross-shard operations occur when a query requires data from multiple shards. These operations are expensive because the system must fan out the query to several machines, gather the responses, merge the results, and return the final output.
Queries that align with the shard key are efficient because they only need to access a single shard. For example, fetching data for a specific user is straightforward if the database is sharded by userId.
However, global queries become more difficult. For example, retrieving the top 10 most popular posts across the entire platform requires each shard to calculate its local top posts before the system aggregates and ranks the combined results.
One way to reduce the cost of cross-shard operations is caching. Frequently requested global queries can be precomputed and stored in a cache instead of recalculating them repeatedly.
Another optimization is denormalization. If the application repeatedly needs data from multiple shards together, we can duplicate some data across shards so that reads can be served locally from a single shard. The tradeoff is increased write complexity because updates now need to modify multiple copies of the same data.
Maintaining Consistency
Transactions become much more complicated once related data is split across multiple shards.
Consider a banking example where Alice transfers 5 dollars to Bob. In a single database, this operation can be executed atomically inside one transaction.
1BEGIN;
2
3UPDATE accounts
4SET balance = balance - 5
5WHERE name = 'alice';
6
7UPDATE accounts
8SET balance = balance + 5
9WHERE name = 'bob';
10
11COMMIT;Once Alice and Bob are stored on different shards, the transaction can no longer be executed atomically on a single database server.
Two-Phase Commit (2PC)
The textbook solution is Two-Phase Commit (2PC). In this protocol, a central coordinator first asks all participating shards whether they are ready to commit the transaction. If every shard agrees, the coordinator instructs them to commit.
While 2PC guarantees strong consistency, it is often avoided in large-scale production systems because it can be slow and fragile. If the coordinator or a shard crashes during the process, transactions may become stuck and locks may remain held for long periods of time.
Avoiding Cross-Shard Transactions
In practice, the best strategy is usually to design the system in a way that minimizes cross-shard transactions entirely. Related data should be colocated on the same shard whenever possible.
Saga Pattern
Another common approach is the Saga Pattern. Instead of one large atomic transaction, the workflow is broken into smaller independent operations. Each operation also defines a compensating action that can undo the change if a later step fails.
For example:
1Step 1: Deduct 5 dollars from Bob on shard B
2Step 2: Add 5 dollars to Alice on shard A
3
4If Step 2 fails:
5Compensating action -> Refund Bob his 5 dollarsThe Saga Pattern does not provide true atomicity, but it allows distributed systems to recover gracefully without leaving the system in an inconsistent state.
The key takeaway is that distributed databases introduce tradeoffs. Sharding improves scalability and performance, but it also creates challenges around query efficiency, hotspots, and consistency. Most production systems focus heavily on choosing the right shard key and designing data models that minimize expensive cross-shard operations.