# Cracking the SQL Interview: Real Questions and PostgreSQL Internals You Should Know

Over the past few months since the start of new year, I’ve been actively interviewing for technical roles, and each conversation proved to be both challenging and rewarding. In this post, I’ll share some of the **SQL questions** I was asked, along with deep-dive **PostgreSQL internals topics** that stood out in interviews.

If you’re preparing for data engineering, backend, or database-focused roles, this guide is for you.

### Real SQL Questions I Was Asked in Interviews

Here are some real SQL questions I encountered across multiple interviews. These test your ability to work with complex queries and understand database structures:

* Recursive Hierarchy with Levels - Return employee ID, name, manager name, and depth in the org chart (Recursive CTE)
    
* Find Managers with 2 or more employees
    
* Departmental Top 3 Salaries
    
* Customers who bought all products
    
* Rolling sums - Calculate a **7-day rolling sum** of sales for each product, ordered by date.
    
* Find continuous date ranges when there was sales activity every day without gaps.
    

The below are advanced topics came up in several interview rounds and often led to deep discussions.

### 1\. PostgreSQL Indexing: Internals, Use Cases, and Performance

#### Had low level discussions on indexes along with their use cases.

| Index Type | Use Case |
| --- | --- |
| B-Tree | Equality, Range queries |
| Hash | Equality only |
| GIN | Full-text, JSONB, array overlap |
| GiST | Geometric search, similarity search |
| BRIN | Large, naturally sorted tables (e.g., logs) |
| Partial | Index on filtered rows |
| Expression | Index on function output (e.g., `lower(name)`) |

* **Covering Indexes**: `CREATE INDEX ON emp (deptid) INCLUDE (salary)`
    
* **Multicolumn Indexes**: Consider column order (leading column matters!)
    
* **Bloom Index**: Used in specialized fuzzy match cases
    
* **Index Usage**: How planner choose Indexes, Analyze & Statistics effects on Index Usage.
    
* Index Maintenance: Need for vacuum &/ reindex, index Bloat, Fill factor & tuning.
    
* Other: Index Impact on Write Performance, OLTP & OLAP workloads, Finding unused index, Index usage metrics, Indexes and concurrent writes, etc.
    

### 2\. Reading PostgreSQL Execution Plans

How to interpret & understand execution plans in turn understanding query performance. What are the things to look into the plans.

```sql
EXPLAIN (ANALYZE, BUFFERS, COSTS, VERBOSE)
SELECT * FROM employee WHERE salary > 100000;
```

#### Things to know:

* **Seq Scan**: No useful index found
    
* **Index Scan**: Ideal for high-selectivity queries
    
* **Bitmap Heap Scan**: Used for broader matches, combines index results
    
* **Rows Removed by Filter**: Wasted work — suggests filter isn’t pushed down
    
* Joins & Join Methods (Nested Join, Merge Join & Hash Join): Use case scenarios, Memory Impacts, how sorting effects the join, hash table spills etc.
    
* Other Topics: Parallel Query, CTEs & subplans, cost estimates, watch for **loops &gt; 1** in nested loops, monitor **buffers read/hit** to spot I/O inefficiencies, etc.
    

### 3\. Partitioning

> I was asked: “How would you partition a 5B-row logs table by region and time?”

#### Few of the topics discussed during interviews were,

* Partitioning Methods - Range Partition, List Partition, Hash Partition, Composite Partition
    
* Partitioning Work - Declarative partitioning vs table inheritance, Planner routes inserts/update, default partitions.
    
* Performance Implications - Partition Pruning, indexing on partitions vs global index (Postgres doesn’t support native global index)
    
* Impact on Query Plans & when pruning fails.
    
* Maintenance - Adding / Removing partitions, Detaching / Attaching partitions, Archiving, Table bloat, constraints behavior with partitioned tables etc.
    

### 4\. Sharding PostgreSQL (Horizontal Scale)

> Interviewer: “What happens when a single node isn’t enough?”

**Sharding** = **horizontal partitioning** → splitting data across multiple *physical* databases/servers, not just partitions in one DB.

* Each **shard** holds a subset of rows.
    
* Common shard keys: customer ID, tenant ID, geographic region, time.
    
* Goal: keep each shard small & fast to query independently.
    

#### Manual Sharding Approaches:

* Application-controlled sharding (based on user\_id, region)
    
* Foreign Data Wrappers (`postgres_fdw`)
    
* Tools like **Citus**
    

#### Example Shard Strategy (Customer DB):

* Users A–M → shard\_1
    
* Users N–Z → shard\_2
    
* Metadata service maps user → shard
    

> You must handle picking a good shard key, **cross-shard joins**, **global transactions**, Distributed transactions, Data rebalancing, Consistency & failover, **replication lag** etc.

### 5\. Change Data Capture (CDC)

**Change Data Capture** means continuously capturing **row-level data changes** (INSERT, UPDATE, DELETE) from a source database and delivering them to downstream systems.

> “How do you build a real-time pipeline to stream changes?”

#### ✅ PostgreSQL CDC Techniques:

| **Technique** | **Latency** | **Setup Effort** | **Pros** | **Cons** | **When to Use** |
| --- | --- | --- | --- | --- | --- |
| 🔹 **Triggers + Audit Tables** | Low | Easy | Simple to implement for small DBs | Performance overhead at high TPS, hard to scale | When you just need an audit trail (e.g. small B2B SaaS) |
| 🔹 **Logical Replication** | Medium–Low | Native | Supports pub/sub; only committed changes; easy failover | Cannot replicate DDL; table must have PK | Multi-region read replicas, selective replication |
| 🔹 **WAL Stream Decoding** | Low | Medium–Hard | True real-time streaming; works well with Kafka, Debezium | Harder to manage slots, risk of WAL bloat | Enterprise streaming pipelines |
| 🔹 **pgoutput** (logical decoding plugin) | Efficient | Native, preferred | Default plugin for logical replication; Debezium uses it | Limited to logical replication; same WAL slot constraints | Best practice for Kafka CDC |

**Logical Replication (Built-in)**

* PostgreSQL 10+ supports *logical replication*.
    
* Publishes **row-level changes** as a stream.
    
* Use: `CREATE PUBLICATION` and `CREATE SUBSCRIPTION`.
    
* Works via WAL (Write-Ahead Log): changes are encoded as logical changes, not raw blocks.
    
* Good for replica clusters or feeding Kafka connectors.
    

**Logical Decoding**

* The foundation for logical replication.
    
* `pgoutput` is the default plugin.
    
* Or use plugins like `wal2json` or `decoderbufs`:
    
    * `wal2json`: output changes as JSON — easy for Kafka, Debezium.
        
    * `decoderbufs`: protobuf format.
        
* Tools read the WAL stream via a **replication slot**.
    

**Triggers-Based CDC**

* Implemented with `AFTER INSERT/UPDATE/DELETE` triggers.
    
* Simpler for small systems, but high overhead under heavy writes.
    

### 6\. Other concepts:

Few other things which I think it’s important to go over are,

* When to use **BRIN** over B-tree indexes
    
* Index design tradeoffs in high-write OLTP systems
    
* Join types and when to apply each
    
* Timeseries partitioning and pruning validation
    
* Postgres internals: MVCC, WAL, autovacuum
    
* Locking, concurrency control, and deadlocks
    
* Designing a hybrid search (full-text + semantic)
    
* Data ingestion pipelines with Kafka and Postgres
    
* Handling **SCDs**, surrogate vs natural keys
    
* Real-time event guarantees: exactly-once vs at-least-once delivery
    
* What indexing strategies do you use for analytical vs transactional workloads?
    

If you’re preparing for roles that involve SQL, data engineering, or backend systems, mastering both query skills and PostgreSQL internals can really set you apart. The questions span from hands-on SQL to system-level architecture — and each was an opportunity to demonstrate practical depth.

Let me know in the comments if you'd like to add more questions or topics to this list!
