Databases: SQL vs NoSQL, Indexing, and Query Optimization
How relational and non-relational databases differ in data modeling and guarantees, how B-tree and hash indexes accelerate lookups, and how query optimizers turn SQL into efficient execution plans.
Terminology
- Relational database (SQL): a database that organizes data into tables with rows and columns, enforces a fixed schema, and supports SQL for querying with ACID transaction guarantees
- Non-relational database (NoSQL): a broad category of databases that do not use the tabular relational model; includes document stores, key-value stores, column-family stores, and graph databases
- ACID: a set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee reliable transaction processing in relational databases
- BASE: a set of properties (Basically Available, Soft state, Eventually consistent) that describe the trade-offs many NoSQL systems make in favor of availability and partition tolerance
- Schema: the structure definition of a database, specifying tables, columns, data types, and constraints in SQL databases, or the expected document shape in NoSQL databases
- Index: a separate data structure maintained by the database that maps column values to row locations, enabling fast lookups without scanning every row
- B-tree index: a balanced tree structure where each node holds multiple keys in sorted order, optimized for range queries and equality lookups on disk-based storage
- Hash index: an index that uses a hash function to map keys directly to storage locations, providing $O(1)$ average-case equality lookups but no support for range queries
- Primary key: a column (or set of columns) that uniquely identifies each row in a table; the database automatically creates an index on the primary key
- Composite index: an index built on multiple columns, useful for queries that filter or sort on several fields simultaneously
- Query optimizer: the database component that analyzes a SQL query and chooses the most efficient execution plan based on available indexes, table statistics, and cost estimates
- Execution plan: the step-by-step strategy the database uses to execute a query, including which indexes to use, join order, and access methods
- Full table scan (sequential scan): reading every row in a table to find matching results; efficient for small tables but slow for large ones
- Selectivity: the fraction of rows that match a query predicate; a highly selective predicate (few matching rows) makes index usage beneficial, while a predicate that matches most rows may favor a full scan
- Denormalization: intentionally duplicating data across tables or documents to reduce the number of joins needed at query time, trading write complexity for read performance
- Sharding: splitting a database across multiple servers by partitioning rows based on a shard key, enabling horizontal scaling
What & Why
Almost every application needs to persist data. The first decision is which type of database to use. Relational databases (PostgreSQL, MySQL) store data in tables with a fixed schema, enforce relationships through foreign keys, and guarantee ACID transactions. They excel when data has clear structure, relationships matter, and correctness is critical: financial systems, inventory management, user accounts.
NoSQL databases trade some of these guarantees for flexibility and scale. Document stores (MongoDB) let each record have a different shape, making them natural for content management and catalogs. Key-value stores (Redis, DynamoDB) provide the fastest possible lookups by primary key. Column-family stores (Cassandra) handle massive write throughput across distributed nodes. Graph databases (Neo4j) model relationships as first-class citizens for social networks and recommendation engines.
The choice is not binary. Many production systems use both: a relational database for transactional data and a NoSQL store for caching, search, or analytics. The key is understanding the trade-offs: schema rigidity vs. flexibility, strong consistency vs. eventual consistency, vertical scaling vs. horizontal scaling.
Once you have chosen a database, performance depends heavily on indexing. Without indexes, every query requires scanning the entire table. A well-chosen index turns an $O(n)$ scan into an $O(\log n)$ lookup. But indexes are not free: they consume storage, slow down writes (every insert and update must also update the index), and the wrong index can actually hurt performance. Understanding how B-tree and hash indexes work, and when to use each, is essential for building fast systems.
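To make that difference concrete, here is a small Python sketch (toy data; a sorted list of keys stands in for the index) contrasting an $O(n)$ scan with an $O(\log n)$ binary-search lookup:

```python
import bisect

# Toy table: 100,000 (id, payload) rows, ids stored in no useful order.
rows = [(i * 7919 % 100_000, f"row-{i}") for i in range(100_000)]

def scan_lookup(rows, key):
    """Full table scan: touch every row until a match -- O(n)."""
    for rid, payload in rows:
        if rid == key:
            return payload
    return None

# A sorted (key, position) list models an index's sorted keys.
index = sorted((rid, pos) for pos, (rid, _) in enumerate(rows))
keys = [k for k, _ in index]

def index_lookup(rows, key):
    """Binary search over the index, then fetch the row -- O(log n)."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[index[i][1]][1]
    return None

assert scan_lookup(rows, 7919) == index_lookup(rows, 7919) == "row-1"
```

Real B-trees layer shallow, page-sized nodes on top of this sorted-key idea, so the logarithm base becomes the branching factor rather than 2.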
How It Works
SQL vs NoSQL Data Models
A relational database stores an order as rows across normalized tables: an orders table, an order_items table, and a products table linked by foreign keys. Fetching a complete order requires joining these tables. The benefit is that product information is stored once and shared.
A document database stores the same order as a single JSON document containing the order details, line items, and embedded product snapshots. Fetching the order is a single read with no joins. The trade-off is that product information is duplicated, and updating a product name requires updating every document that references it.
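The contrast can be sketched with SQLite's in-memory engine next to a plain Python dict playing the role of the document; the schema and values below are invented for illustration:

```python
import sqlite3

# Relational model: normalized tables, joined at read time.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products    (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders      (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (order_id   INTEGER REFERENCES orders(id),
                              product_id INTEGER REFERENCES products(id),
                              qty INTEGER);
    INSERT INTO products VALUES (1, 'Widget', 9.99);
    INSERT INTO orders VALUES (100, 'alice');
    INSERT INTO order_items VALUES (100, 1, 2);
""")
rows = db.execute("""
    SELECT o.id, o.customer, p.name, oi.qty, p.price
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.id
    JOIN products p     ON p.id = oi.product_id
""").fetchall()

# Document model: the same order as one self-contained document,
# with the product snapshot duplicated inside the line item.
order_doc = {
    "id": 100,
    "customer": "alice",
    "items": [{"product": {"name": "Widget", "price": 9.99}, "qty": 2}],
}

assert rows[0] == (100, "alice", "Widget", 2, 9.99)
```

Renaming "Widget" is a single UPDATE on products in the relational model, but means touching every document that embeds the snapshot in the document model.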
B-Tree Indexes
B-trees are the default index structure in most relational databases. Each node holds multiple keys in sorted order and has a branching factor $B$ (typically hundreds to thousands). A B-tree with branching factor $B$ and $n$ keys has height $O(\log_B n)$. Because each node maps to a disk page, a lookup requires $O(\log_B n)$ disk reads. With $B = 500$ and 1 billion keys, the tree is only 3-4 levels deep.
B-trees support both equality lookups (WHERE id = 42) and range queries (WHERE price BETWEEN 10 AND 50) because keys are sorted. Range queries walk the leaf nodes in order; in most implementations (B+ trees) the leaves are linked together, making this walk efficient.
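The sorted-keys property behind range queries can be sketched with Python's bisect module over a plain sorted list (toy prices, standing in for a B+ tree's leaf level):

```python
import bisect

# Sorted keys stand in for the linked leaf level of a B+ tree.
prices = [3, 12, 18, 25, 40, 47, 60, 75]

# WHERE price BETWEEN 10 AND 50: find the low bound in O(log n),
# then walk forward until a key exceeds the high bound -- O(log n + k).
lo = bisect.bisect_left(prices, 10)
hi = bisect.bisect_right(prices, 50)
matches = prices[lo:hi]

print(matches)  # [12, 18, 25, 40, 47]
```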
Hash Indexes
Hash indexes compute a hash of the key and map it directly to a storage bucket. Equality lookups are $O(1)$ on average. However, hash indexes cannot support range queries, ordering, or partial key matching. They are ideal for exact-match lookups on columns with high cardinality (many distinct values).
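A Python dict gives the flavor of a hash index; the email-to-row-location mapping below is made up:

```python
# Hash index sketch: key -> row location, O(1) average equality lookup.
email_index = {
    "ada@example.com": 17,
    "bob@example.com": 42,
    "eve@example.com": 99,
}
row_location = email_index.get("bob@example.com")  # direct bucket hit

# Hash order carries no meaning, so a prefix or range query degenerates
# into examining every entry -- the O(n) work the index was meant to avoid.
prefix_matches = sorted(k for k in email_index if k.startswith("b"))
```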
Query Optimization
When a SQL query arrives, the optimizer considers multiple execution strategies:
- Which indexes are available for the tables and columns in the query.
- Table statistics: row counts, value distributions, index selectivity.
- Join ordering: for multi-table queries, the order in which tables are joined dramatically affects performance.
- Access methods: index scan vs. full table scan vs. index-only scan.
The optimizer estimates the cost of each plan (in terms of disk I/O and CPU) and picks the cheapest one. This is why adding an index can speed up reads, yet the optimizer may still choose a full scan when the predicate is not selective enough (the query matches most rows anyway).
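SQLite's EXPLAIN QUERY PLAN makes this decision visible. A small sketch with an invented users table (the exact plan wording varies by SQLite version):

```python
import sqlite3

# Toy table to watch SQLite's optimizer choose between plans.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(i, f"user{i}@example.com") for i in range(1000)])
db.execute("CREATE INDEX idx_email ON users(email)")
db.execute("ANALYZE")  # collect the statistics the cost model uses

# Selective equality predicate: the optimizer picks the index.
selective = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'user42@example.com'"
).fetchone()

# A leading-wildcard LIKE defeats the index, forcing a full scan.
full_scan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email LIKE '%42%'"
).fetchone()

print(selective[3])  # e.g. SEARCH users USING INDEX idx_email (email=?)
print(full_scan[3])  # e.g. SCAN users
```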
Complexity Analysis
Let $n$ be the number of rows, $B$ be the B-tree branching factor, and $k$ be the number of results in a range query.
| Operation | No Index | B-Tree Index | Hash Index |
|---|---|---|---|
| Equality lookup | $O(n)$ | $O(\log_B n)$ | $O(1)$ avg |
| Range query | $O(n)$ | $O(\log_B n + k)$ | $O(n)$ (not supported) |
| Insert | $O(1)$ | $O(\log_B n)$ | $O(1)$ avg |
| Delete | $O(n)$ | $O(\log_B n)$ | $O(1)$ avg |
| Ordered scan | $O(n \log n)$ sort | $O(n)$ leaf walk | $O(n \log n)$ sort |
The space overhead of a B-tree index is typically 1-3x the size of the indexed column data. For a table with $n$ rows and an index on a column of size $s$ bytes per value:
$$\text{index size} \approx n \cdot (s + p)$$
where $p$ is the pointer size (typically 6-8 bytes) for each entry pointing to the actual row.
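As a back-of-envelope check of that estimate, with assumed sizes for a BIGINT column:

```python
n = 10_000_000  # rows
s = 8           # bytes per indexed value (a BIGINT)
p = 6           # bytes per row pointer

index_bytes = n * (s + p)
print(f"~{index_bytes / 1_000_000:.0f} MB")  # ~140 MB, before node overhead and fill factor
```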
Implementation
ALGORITHM BTreeLookup(root, searchKey)
INPUT: root: B-tree root node, searchKey: key to find
OUTPUT: record pointer if found, NIL otherwise
BEGIN
node <- root
WHILE node IS NOT a leaf DO
i <- 0
WHILE i < node.keyCount AND searchKey >= node.keys[i] DO
i <- i + 1
END WHILE
node <- node.children[i]
END WHILE
// Search within the leaf node
FOR i <- 0 TO node.keyCount - 1 DO
IF node.keys[i] = searchKey THEN
RETURN node.pointers[i]
END IF
END FOR
RETURN NIL
END
ALGORITHM BTreeRangeQuery(root, low, high)
INPUT: root: B-tree root node, low: range start, high: range end
OUTPUT: list of matching record pointers
BEGIN
results <- empty list
// Find the leaf containing the low bound
node <- root
WHILE node IS NOT a leaf DO
i <- 0
WHILE i < node.keyCount AND low >= node.keys[i] DO
i <- i + 1
END WHILE
node <- node.children[i]
END WHILE
// Walk leaf nodes collecting results
WHILE node IS NOT NIL DO
FOR i <- 0 TO node.keyCount - 1 DO
IF node.keys[i] > high THEN
RETURN results
END IF
IF node.keys[i] >= low THEN
APPEND node.pointers[i] TO results
END IF
END FOR
node <- node.nextLeaf
END WHILE
RETURN results
END
ALGORITHM HashIndexLookup(hashTable, searchKey)
INPUT: hashTable: array of buckets, searchKey: key to find
OUTPUT: record pointer if found, NIL otherwise
BEGIN
bucketIndex <- HASH(searchKey) MOD LENGTH(hashTable)
bucket <- hashTable[bucketIndex]
FOR EACH entry IN bucket DO
IF entry.key = searchKey THEN
RETURN entry.pointer
END IF
END FOR
RETURN NIL
END
ALGORITHM SimpleQueryOptimizer(query, tableStats, availableIndexes)
INPUT: query with table, predicates, and sort order
OUTPUT: chosen execution plan
BEGIN
plans <- empty list
// Option 1: Full table scan
scanCost <- tableStats.rowCount * COST_PER_ROW_SCAN
APPEND (plan: "full_scan", cost: scanCost) TO plans
// Option 2: Check each available index
FOR EACH index IN availableIndexes DO
IF index covers any predicate in query THEN
selectivity <- ESTIMATE_SELECTIVITY(index, query.predicates)
matchingRows <- tableStats.rowCount * selectivity
indexCost <- LOG_B(tableStats.rowCount) * COST_PER_INDEX_READ
+ matchingRows * COST_PER_ROW_FETCH
APPEND (plan: "index_scan:" + index.name, cost: indexCost) TO plans
END IF
END FOR
// Pick the plan with the lowest estimated cost
bestPlan <- plan IN plans WITH MINIMUM cost
RETURN bestPlan
END
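The cost comparison in SimpleQueryOptimizer above can be sketched in runnable Python; the cost constants and statistics are invented for illustration:

```python
import math

# Invented cost constants: index pages and row fetches are random reads,
# so they are weighted more heavily than sequential scanning.
COST_PER_ROW_SCAN = 1.0
COST_PER_INDEX_READ = 4.0
COST_PER_ROW_FETCH = 4.0

def choose_plan(row_count, branching, selectivity):
    """Compare a full scan against an index scan and return the cheaper plan."""
    scan_cost = row_count * COST_PER_ROW_SCAN
    matching = row_count * selectivity
    index_cost = (math.log(row_count, branching) * COST_PER_INDEX_READ
                  + matching * COST_PER_ROW_FETCH)
    if index_cost < scan_cost:
        return ("index_scan", index_cost)
    return ("full_scan", scan_cost)

# Selective predicate (0.1% of rows): the index wins easily.
print(choose_plan(1_000_000, 500, 0.001))
# Non-selective predicate (60% of rows): the full scan is cheaper.
print(choose_plan(1_000_000, 500, 0.6))
```

This mirrors the intuition at the end of the Query Optimization section: the same index is chosen or rejected purely on estimated selectivity.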
Real-World Applications
- E-commerce platforms: relational databases store orders, inventory, and user accounts with ACID guarantees, while document stores or search engines (Elasticsearch) power product catalogs and full-text search
- Social media: graph databases model friend relationships and recommendations, while wide-column stores (Cassandra) handle high-throughput timeline writes across global data centers
- Financial systems: banks rely on relational databases with strong ACID transactions for account balances and transfers, where eventual consistency would be unacceptable
- Content management: document databases (MongoDB) store articles, blog posts, and media metadata with flexible schemas that evolve without migrations
- IoT and time-series: specialized time-series databases (TimescaleDB, InfluxDB) use columnar storage and time-partitioned indexes to handle millions of sensor readings per second
- Analytics warehouses: columnar databases (BigQuery, Redshift) store data by column rather than by row, enabling fast aggregation queries that scan only the columns needed
Key Takeaways
- SQL databases provide ACID transactions, enforced schemas, and powerful joins; NoSQL databases offer schema flexibility, horizontal scaling, and optimized access patterns for specific workloads
- B-tree indexes are the workhorse of relational databases, supporting both equality and range queries in $O(\log_B n)$ with branching factors large enough to keep trees shallow on disk
- Hash indexes provide $O(1)$ equality lookups but cannot support range queries, ordering, or partial matching
- Every index speeds up reads but slows down writes, because inserts and updates must maintain the index structure alongside the table data
- Query optimizers use table statistics and cost models to choose between full scans and index scans; an index is only helpful when the query is selective enough to benefit from it