Databases: SQL vs NoSQL, Indexing, and Query Optimization
How relational and non-relational databases differ in data modeling and guarantees, how B-tree and hash indexes accelerate lookups, and how query optimizers turn SQL into efficient execution plans.
Terminology
- Relational database (SQL): a database that organizes data into tables with rows and columns, enforces a fixed schema, and supports SQL for querying with ACID transaction guarantees
- Non-relational database (NoSQL): a broad category of databases that do not use the tabular relational model; includes document stores, key-value stores, column-family stores, and graph databases
- ACID: a set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee reliable transaction processing in relational databases
- BASE: a set of properties (Basically Available, Soft state, Eventually consistent) that describe the trade-offs many NoSQL systems make in favor of availability and partition tolerance
- Schema: the structure definition of a database, specifying tables, columns, data types, and constraints in SQL databases, or the expected document shape in NoSQL databases
- Index: a separate data structure maintained by the database that maps column values to row locations, enabling fast lookups without scanning every row
- B-tree index: a balanced tree structure where each node holds multiple keys in sorted order, optimized for range queries and equality lookups on disk-based storage
- Hash index: an index that uses a hash function to map keys directly to storage locations, providing $O(1)$ average-case equality lookups but no support for range queries
- Primary key: a column (or set of columns) that uniquely identifies each row in a table; the database automatically creates an index on the primary key
- Composite index: an index built on multiple columns, useful for queries that filter or sort on several fields simultaneously
- Query optimizer: the database component that analyzes a SQL query and chooses the most efficient execution plan based on available indexes, table statistics, and cost estimates
- Execution plan: the step-by-step strategy the database uses to execute a query, including which indexes to use, join order, and access methods
- Full table scan (sequential scan): reading every row in a table to find matching results; efficient for small tables but slow for large ones
- Selectivity: the fraction of rows that match a query predicate; high selectivity (few matching rows) makes index usage beneficial, low selectivity (many matching rows) may favor a full scan
- Denormalization: intentionally duplicating data across tables or documents to reduce the number of joins needed at query time, trading write complexity for read performance
- Sharding: splitting a database across multiple servers by partitioning rows based on a shard key, enabling horizontal scaling
What & Why
Almost every application needs to persist data. The first decision is which type of database to use. Relational databases (PostgreSQL, MySQL) store data in tables with a fixed schema, enforce relationships through foreign keys, and guarantee ACID transactions. They excel when data has clear structure, relationships matter, and correctness is critical: financial systems, inventory management, user accounts.
NoSQL databases trade some of these guarantees for flexibility and scale. Document stores (MongoDB) let each record have a different shape, making them natural for content management and catalogs. Key-value stores (Redis, DynamoDB) provide the fastest possible lookups by primary key. Column-family stores (Cassandra) handle massive write throughput across distributed nodes. Graph databases (Neo4j) model relationships as first-class citizens for social networks and recommendation engines.
The choice is not binary. Many production systems use both: a relational database for transactional data and a NoSQL store for caching, search, or analytics. The key is understanding the trade-offs: schema rigidity vs. flexibility, strong consistency vs. eventual consistency, vertical scaling vs. horizontal scaling.
Once you have chosen a database, performance depends heavily on indexing. Without indexes, every query requires scanning the entire table. A well-chosen index turns an $O(n)$ scan into an $O(\log n)$ lookup. But indexes are not free: they consume storage, slow down writes (every insert and update must also update the index), and the wrong index can actually hurt performance. Understanding how B-tree and hash indexes work, and when to use each, is essential for building fast systems.
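Both effects are easy to observe. The sketch below is a rough experiment using SQLite (which ships with Python's standard library); the `users` table, index name, and row count are illustrative assumptions, and absolute timings depend on the machine.

```python
import sqlite3
import time

def build(with_index: bool):
    """Insert 200k rows, then look one up; return (write_s, read_s)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    if with_index:
        # Every INSERT now also has to update this B-tree.
        conn.execute("CREATE INDEX idx_email ON users(email)")
    rows = ((f"user{i}@example.com",) for i in range(200_000))
    t0 = time.perf_counter()
    conn.executemany("INSERT INTO users (email) VALUES (?)", rows)
    write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    conn.execute(
        "SELECT id FROM users WHERE email = 'user123456@example.com'"
    ).fetchone()
    read_s = time.perf_counter() - t0
    return write_s, read_s

print("no index:   writes %.3fs, lookup %.6fs" % build(False))
print("with index: writes %.3fs, lookup %.6fs" % build(True))
```

The indexed table pays a visible penalty on the bulk insert and wins it back on the lookup, which drops from a full scan to a B-tree descent.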
How It Works
SQL vs NoSQL Data Models
A relational database stores an order as rows across normalized tables: an orders table, an order_items table, and a products table linked by foreign keys. Fetching a complete order requires joining these tables. The benefit is that product information is stored once and shared.
A document database stores the same order as a single JSON document containing the order details, line items, and embedded product snapshots. Fetching the order is a single read with no joins. The trade-off is that product information is duplicated, and updating a product name requires updating every document that references it.
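A minimal side-by-side sketch of the two shapes; every table, field, and value below is hypothetical:

```python
# Relational: normalized rows linked by foreign keys. Fetching the full
# order means joining orders -> order_items -> products.
orders      = [{"order_id": 1001, "customer_id": 7, "placed_at": "2021-02-14"}]
order_items = [{"order_id": 1001, "product_id": 501, "qty": 2},
               {"order_id": 1001, "product_id": 502, "qty": 1}]
products    = [{"product_id": 501, "name": "Keyboard", "price": 49.00},
               {"product_id": 502, "name": "Mouse", "price": 25.00}]

# Document: one self-contained record with product snapshots embedded.
# A single read returns everything, but renaming product 501 means
# rewriting every order document that embeds it.
order_document = {
    "order_id": 1001,
    "customer_id": 7,
    "placed_at": "2021-02-14",
    "items": [
        {"product_id": 501, "name": "Keyboard", "price": 49.00, "qty": 2},
        {"product_id": 502, "name": "Mouse", "price": 25.00, "qty": 1},
    ],
}
```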
B-Tree Indexes
B-trees are the default index structure in most relational databases. Each node holds multiple keys in sorted order and has a branching factor $B$ (typically hundreds to thousands). A B-tree with branching factor $B$ and $n$ keys has height $O(\log_B n)$. Because each node maps to a disk page, a lookup requires $O(\log_B n)$ disk reads. With $B = 500$ and 1 billion keys, the tree is only 3-4 levels deep.
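A quick sanity check of that depth claim, using the same numbers:

```python
import math

# Height of a B-tree with branching factor B holding n keys is about
# log_B(n); with B = 500 and n = 10^9 that is ~3.3, so 3-4 levels.
B, n = 500, 1_000_000_000
print(math.log(n, B))             # ~3.35
print(math.ceil(math.log(n, B)))  # at most 4 page reads per lookup
```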
B-trees support both equality lookups (`WHERE id = 42`) and range queries (`WHERE price BETWEEN 10 AND 50`) because keys are stored in sorted order. A range query descends to the first matching leaf and then walks the leaves in order; in most implementations (B+ trees) the leaves are linked together precisely to make this walk cheap.
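The descend-then-walk pattern can be mimicked with a sorted Python list standing in for the linked leaf chain; the values and bounds below are made up, and the full version appears as BTreeRangeQuery in the Implementation section:

```python
import bisect

# Sorted keys play the role of the B-tree's leaf level: binary-search
# for the start of the range, then walk forward in key order.
prices = [3, 12, 19, 27, 35, 48, 52, 70]
low, high = 10, 50
start = bisect.bisect_left(prices, low)   # first index with key >= low
stop = bisect.bisect_right(prices, high)  # first index with key > high
print(prices[start:stop])                 # [12, 19, 27, 35, 48]
```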
Hash Indexes
Hash indexes compute a hash of the key and map it directly to a storage bucket. Equality lookups are $O(1)$ on average. However, hash indexes cannot support range queries, ordering, or partial key matching. They are ideal for exact-match lookups on columns with high cardinality (many distinct values).
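A Python dict is a handy mental model for this trade-off; the keys and values below are illustrative:

```python
# A dict gives O(1) average equality lookups, like a hash index,
# but its keys have no usable order.
email_index = {f"user{i}@example.com": i for i in range(1_000)}

print(email_index["user42@example.com"])  # exact match: one hash probe
# A range or partial-key predicate degenerates to scanning every key:
print(sum(1 for k in email_index if k.startswith("user1")))  # full scan
```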
Query Optimization
When a SQL query arrives, the optimizer enumerates candidate execution plans, weighing several factors:
- Which indexes are available for the tables and columns in the query.
- Table statistics: row counts, value distributions, index selectivity.
- Join ordering: for multi-table queries, the order in which tables are joined dramatically affects performance.
- Access methods: index scan vs. full table scan vs. index-only scan.
The optimizer estimates the cost of each plan (in terms of disk I/O and CPU) and picks the cheapest one. This is why adding an index can speed up reads while the optimizer still chooses a full scan when selectivity is low: if the query matches most rows anyway, fetching them one at a time through the index costs more random I/O than reading the table sequentially.
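You can watch the decision being made. The sketch below uses SQLite's EXPLAIN QUERY PLAN; the schema is an illustrative assumption, and the exact plan strings vary across SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (email, age) VALUES (?, ?)",
    ((f"user{i}@example.com", i % 80) for i in range(100_000)),
)
conn.execute("CREATE INDEX idx_users_email ON users(email)")

def plan(sql: str) -> list[str]:
    # EXPLAIN QUERY PLAN rows carry a human-readable detail column last.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Equality on the indexed column: the planner takes the B-tree.
print(plan("SELECT * FROM users WHERE email = 'user42@example.com'"))
# e.g. ['SEARCH users USING INDEX idx_users_email (email=?)']

# No usable index on age, so the only option is a scan.
print(plan("SELECT * FROM users WHERE age > 30"))
# e.g. ['SCAN users']
```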
Complexity Analysis
Let $n$ be the number of rows, $B$ be the B-tree branching factor, and $k$ be the number of results in a range query.
| Operation | No Index | B-Tree Index | Hash Index |
|---|---|---|---|
| Equality lookup | $O(n)$ | $O(\log_B n)$ | $O(1)$ avg |
| Range query | $O(n)$ | $O(\log_B n + k)$ | $O(n)$ (not supported) |
| Insert | $O(1)$ | $O(\log_B n)$ | $O(1)$ avg |
| Delete | $O(n)$ | $O(\log_B n)$ | $O(1)$ avg |
| Ordered scan | $O(n \log n)$ sort | $O(n)$ leaf walk | $O(n \log n)$ sort |
The space overhead of a B-tree index is typically 1-3x the size of the indexed column data. For a table with $n$ rows and an index on a column of size $s$ bytes per value, the index occupies roughly

$$\text{index size} \approx n \cdot (s + p)$$

where $p$ is the pointer size (typically 6-8 bytes) for each entry pointing to the actual row. For example, indexing an 8-byte column over $10^8$ rows with 8-byte pointers takes on the order of 1.6 GB, before accounting for node fill factors.
Implementation
ALGORITHM BTreeLookup(root, searchKey)
INPUT: root: B-tree root node, searchKey: key to find
OUTPUT: record pointer if found, NIL otherwise
BEGIN
node <- root
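// Descend from the root to the leaf that could contain searchKey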
WHILE node IS NOT a leaf DO
i <- 0
WHILE i < node.keyCount AND searchKey >= node.keys[i] DO
i <- i + 1
END WHILE
node <- node.children[i]
END WHILE
// Search within the leaf node
FOR i <- 0 TO node.keyCount - 1 DO
IF node.keys[i] = searchKey THEN
RETURN node.pointers[i]
END IF
END FOR
RETURN NIL
END
ALGORITHM BTreeRangeQuery(root, low, high)
INPUT: root: B-tree root node, low: range start, high: range end
OUTPUT: list of matching record pointers
BEGIN
results <- empty list
// Find the leaf containing the low bound
node <- root
WHILE node IS NOT a leaf DO
i <- 0
WHILE i < node.keyCount AND low >= node.keys[i] DO
i <- i + 1
END WHILE
node <- node.children[i]
END WHILE
// Walk leaf nodes collecting results
WHILE node IS NOT NIL DO
FOR i <- 0 TO node.keyCount - 1 DO
IF node.keys[i] > high THEN
RETURN results
END IF
IF node.keys[i] >= low THEN
APPEND node.pointers[i] TO results
END IF
END FOR
node <- node.nextLeaf
END WHILE
RETURN results
END
ALGORITHM HashIndexLookup(hashTable, searchKey)
INPUT: hashTable: array of buckets, searchKey: key to find
OUTPUT: record pointer if found, NIL otherwise
BEGIN
bucketIndex <- HASH(searchKey) MOD LENGTH(hashTable)
bucket <- hashTable[bucketIndex]
FOR EACH entry IN bucket DO
IF entry.key = searchKey THEN
RETURN entry.pointer
END IF
END FOR
RETURN NIL
END
ALGORITHM SimpleQueryOptimizer(query, tableStats, availableIndexes)
INPUT: query with table, predicates, and sort order
OUTPUT: chosen execution plan
BEGIN
plans <- empty list
// Option 1: Full table scan
scanCost <- tableStats.rowCount * COST_PER_ROW_SCAN
APPEND (plan: "full_scan", cost: scanCost) TO plans
// Option 2: Check each available index
FOR EACH index IN availableIndexes DO
IF index covers any predicate in query THEN
selectivity <- ESTIMATE_SELECTIVITY(index, query.predicates)
matchingRows <- tableStats.rowCount * selectivity
indexCost <- LOG_B(tableStats.rowCount) * COST_PER_INDEX_READ
+ matchingRows * COST_PER_ROW_FETCH
APPEND (plan: "index_scan:" + index.name, cost: indexCost) TO plans
END IF
END FOR
// Pick the plan with the lowest estimated cost
bestPlan <- plan IN plans WITH MINIMUM cost
RETURN bestPlan
END
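Plugging numbers into that cost model shows where the crossover between plans sits. The constants below are invented for the sketch; they mirror the names in SimpleQueryOptimizer, not any real engine's tuning parameters.

```python
import math

# Hypothetical cost constants, in arbitrary units.
COST_PER_ROW_SCAN = 1.0    # sequential read of one row
COST_PER_INDEX_READ = 2.0  # one B-tree page read during the descent
COST_PER_ROW_FETCH = 4.0   # random I/O to fetch a row the index points at
ROWS, B = 10_000_000, 500

def scan_cost() -> float:
    return ROWS * COST_PER_ROW_SCAN

def index_cost(selectivity: float) -> float:
    matching = ROWS * selectivity
    return math.log(ROWS, B) * COST_PER_INDEX_READ + matching * COST_PER_ROW_FETCH

for sel in (0.0001, 0.01, 0.1, 0.5):
    winner = "index scan" if index_cost(sel) < scan_cost() else "full scan"
    print(f"selectivity {sel:>6}: {winner}")
# With these constants the full scan wins once roughly a quarter of the
# table matches, because every index hit pays for a random row fetch.
```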
Real-World Applications
- E-commerce platforms: relational databases store orders, inventory, and user accounts with ACID guarantees, while document stores or search engines (Elasticsearch) power product catalogs and full-text search
- Social media: graph databases model friend relationships and recommendations, while wide-column stores (Cassandra) handle high-throughput timeline writes across global data centers
- Financial systems: banks rely on relational databases with strong ACID transactions for account balances and transfers, where eventual consistency would be unacceptable
- Content management: document databases (MongoDB) store articles, blog posts, and media metadata with flexible schemas that evolve without migrations
- IoT and time-series: specialized time-series databases (TimescaleDB, InfluxDB) use columnar storage and time-partitioned indexes to handle millions of sensor readings per second
- Analytics warehouses: columnar databases (BigQuery, Redshift) store data by column rather than by row, enabling fast aggregation queries that scan only the columns needed
Key Takeaways
- SQL databases provide ACID transactions, enforced schemas, and powerful joins; NoSQL databases offer schema flexibility, horizontal scaling, and optimized access patterns for specific workloads
- B-tree indexes are the workhorse of relational databases, supporting both equality and range queries in $O(\log_B n)$ with branching factors large enough to keep trees shallow on disk
- Hash indexes provide $O(1)$ equality lookups but cannot support range queries, ordering, or partial matching
- Every index speeds up reads but slows down writes, because inserts and updates must maintain the index structure alongside the table data
- Query optimizers use table statistics and cost models to choose between full scans and index scans; an index is only helpful when the query is selective enough to benefit from it