Databases: SQL vs NoSQL, Indexing, and Query Optimization
How relational and non-relational databases differ in data modeling and guarantees, how B-tree and hash indexes accelerate lookups, and how query optimizers turn SQL into efficient execution plans.
Terminology
- Relational database (SQL): a database that organizes data into tables with rows and columns, enforces a fixed schema, and supports SQL for querying with ACID transaction guarantees
- Non-relational database (NoSQL): a broad category of databases that do not use the tabular relational model; includes document stores, key-value stores, column-family stores, and graph databases
- ACID: a set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee reliable transaction processing in relational databases
- BASE: a set of properties (Basically Available, Soft state, Eventually consistent) that describe the trade-offs many NoSQL systems make in favor of availability and partition tolerance
- Schema: the structure definition of a database, specifying tables, columns, data types, and constraints in SQL databases, or the expected document shape in NoSQL databases
- Index: a separate data structure maintained by the database that maps column values to row locations, enabling fast lookups without scanning every row
- B-tree index: a balanced tree structure where each node holds multiple keys in sorted order, optimized for range queries and equality lookups on disk-based storage
- Hash index: an index that uses a hash function to map keys directly to storage locations, providing $O(1)$ average-case equality lookups but no support for range queries
- Primary key: a column (or set of columns) that uniquely identifies each row in a table; the database automatically creates an index on the primary key
- Composite index: an index built on multiple columns, useful for queries that filter or sort on several fields simultaneously
- Query optimizer: the database component that analyzes a SQL query and chooses the most efficient execution plan based on available indexes, table statistics, and cost estimates
- Execution plan: the step-by-step strategy the database uses to execute a query, including which indexes to use, join order, and access methods
- Full table scan (sequential scan): reading every row in a table to find matching results; efficient for small tables but slow for large ones
- Selectivity: the fraction of rows that match a query predicate; high selectivity (few matching rows) makes index usage beneficial, low selectivity (many matching rows) may favor a full scan
- Denormalization: intentionally duplicating data across tables or documents to reduce the number of joins needed at query time, trading write complexity for read performance
- Sharding: splitting a database across multiple servers by partitioning rows based on a shard key, enabling horizontal scaling
What & Why
Almost every application needs to persist data. The first decision is which type of database to use. Relational databases (PostgreSQL, MySQL) store data in tables with a fixed schema, enforce relationships through foreign keys, and guarantee ACID transactions. They excel when data has clear structure, relationships matter, and correctness is critical: financial systems, inventory management, user accounts.
NoSQL databases trade some of these guarantees for flexibility and scale. Document stores (MongoDB) let each record have a different shape, making them natural for content management and catalogs. Key-value stores (Redis, DynamoDB) provide the fastest possible lookups by primary key. Column-family stores (Cassandra) handle massive write throughput across distributed nodes. Graph databases (Neo4j) model relationships as first-class citizens for social networks and recommendation engines.
The choice is not binary. Many production systems use both: a relational database for transactional data and a NoSQL store for caching, search, or analytics. The key is understanding the trade-offs: schema rigidity vs. flexibility, strong consistency vs. eventual consistency, vertical scaling vs. horizontal scaling.
Once you have chosen a database, performance depends heavily on indexing. Without indexes, every query requires scanning the entire table. A well-chosen index turns an $O(n)$ scan into an $O(\log n)$ lookup. But indexes are not free: they consume storage, slow down writes (every insert and update must also update the index), and the wrong index can actually hurt performance. Understanding how B-tree and hash indexes work, and when to use each, is essential for building fast systems.
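Both effects are easy to observe. The sketch below is a rough experiment using SQLite (which ships with Python's standard library); the `users` table, index name, and row count are illustrative assumptions, and absolute timings depend on the machine.

```python
import sqlite3
import time

def build(with_index: bool):
    """Insert 200k rows, then look one up; return (write_s, read_s)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    if with_index:
        # Every INSERT now also has to update this B-tree.
        conn.execute("CREATE INDEX idx_email ON users(email)")
    rows = ((f"user{i}@example.com",) for i in range(200_000))
    t0 = time.perf_counter()
    conn.executemany("INSERT INTO users (email) VALUES (?)", rows)
    write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    conn.execute(
        "SELECT id FROM users WHERE email = 'user123456@example.com'"
    ).fetchone()
    read_s = time.perf_counter() - t0
    return write_s, read_s

print("no index:   writes %.3fs, lookup %.6fs" % build(False))
print("with index: writes %.3fs, lookup %.6fs" % build(True))
```

The indexed table pays a visible penalty on the bulk insert and wins it back on the lookup, which drops from a full scan to a B-tree descent.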
How It Works
SQL vs NoSQL Data Models
A relational database stores an order as rows across normalized tables: an orders table, an order_items table, and a products table linked by foreign keys. Fetching a complete order requires joining these tables. The benefit is that product information is stored once and shared.
A document database stores the same order as a single JSON document containing the order details, line items, and embedded product snapshots. Fetching the order is a single read with no joins. The trade-off is that product information is duplicated, and updating a product name requires updating every document that references it.
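A minimal side-by-side sketch of the two shapes; every table, field, and value below is hypothetical:

```python
# Relational: normalized rows linked by foreign keys. Fetching the full
# order means joining orders -> order_items -> products.
orders      = [{"order_id": 1001, "customer_id": 7, "placed_at": "2021-02-14"}]
order_items = [{"order_id": 1001, "product_id": 501, "qty": 2},
               {"order_id": 1001, "product_id": 502, "qty": 1}]
products    = [{"product_id": 501, "name": "Keyboard", "price": 49.00},
               {"product_id": 502, "name": "Mouse", "price": 25.00}]

# Document: one self-contained record with product snapshots embedded.
# A single read returns everything, but renaming product 501 means
# rewriting every order document that embeds it.
order_document = {
    "order_id": 1001,
    "customer_id": 7,
    "placed_at": "2021-02-14",
    "items": [
        {"product_id": 501, "name": "Keyboard", "price": 49.00, "qty": 2},
        {"product_id": 502, "name": "Mouse", "price": 25.00, "qty": 1},
    ],
}
```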
B-Tree Indexes
B-trees are the default index structure in most relational databases. Each node holds multiple keys in sorted order and has a branching factor $B$ (typically hundreds to thousands). A B-tree with branching factor $B$ and $n$ keys has height $O(\log_B n)$. Because each node maps to a disk page, a lookup requires $O(\log_B n)$ disk reads. With $B = 500$ and 1 billion keys, the tree is only 3-4 levels deep.
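A quick sanity check of that depth claim, using the same numbers:

```python
import math

# Height of a B-tree with branching factor B holding n keys is about
# log_B(n); with B = 500 and n = 10^9 that is ~3.3, so 3-4 levels.
B, n = 500, 1_000_000_000
print(math.log(n, B))             # ~3.35
print(math.ceil(math.log(n, B)))  # at most 4 page reads per lookup
```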
B-trees support both equality lookups (`WHERE id = 42`) and range queries (`WHERE price BETWEEN 10 AND 50`) because keys are stored in sorted order. A range query descends to the first matching leaf and then walks the leaves in order; in most implementations (B+ trees) the leaves are linked together precisely to make this walk cheap.
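The descend-then-walk pattern can be mimicked with a sorted Python list standing in for the linked leaf chain; the values and bounds below are made up, and the full version appears as BTreeRangeQuery in the Implementation section:

```python
import bisect

# Sorted keys play the role of the B-tree's leaf level: binary-search
# for the start of the range, then walk forward in key order.
prices = [3, 12, 19, 27, 35, 48, 52, 70]
low, high = 10, 50
start = bisect.bisect_left(prices, low)   # first index with key >= low
stop = bisect.bisect_right(prices, high)  # first index with key > high
print(prices[start:stop])                 # [12, 19, 27, 35, 48]
```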
Hash Indexes
Hash indexes compute a hash of the key and map it directly to a storage bucket. Equality lookups are $O(1)$ on average. However, hash indexes cannot support range queries, ordering, or partial key matching. They are ideal for exact-match lookups on columns with high cardinality (many distinct values).
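A Python dict is a handy mental model for this trade-off; the keys and values below are illustrative:

```python
# A dict gives O(1) average equality lookups, like a hash index,
# but its keys have no usable order.
email_index = {f"user{i}@example.com": i for i in range(1_000)}

print(email_index["user42@example.com"])  # exact match: one hash probe
# A range or partial-key predicate degenerates to scanning every key:
print(sum(1 for k in email_index if k.startswith("user1")))  # full scan
```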
Query Optimization
When a SQL query arrives, the optimizer enumerates candidate execution plans, weighing several factors:
- Which indexes are available for the tables and columns in the query.
- Table statistics: row counts, value distributions, index selectivity.
- Join ordering: for multi-table queries, the order in which tables are joined dramatically affects performance.
- Access methods: index scan vs. full table scan vs. index-only scan.
The optimizer estimates the cost of each plan (in terms of disk I/O and CPU) and picks the cheapest one. This is why adding an index can speed up reads while the optimizer still chooses a full scan when selectivity is low: if the query matches most rows anyway, fetching them one at a time through the index costs more random I/O than reading the table sequentially.
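You can watch the decision being made. The sketch below uses SQLite's EXPLAIN QUERY PLAN; the schema is an illustrative assumption, and the exact plan strings vary across SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (email, age) VALUES (?, ?)",
    ((f"user{i}@example.com", i % 80) for i in range(100_000)),
)
conn.execute("CREATE INDEX idx_users_email ON users(email)")

def plan(sql: str) -> list[str]:
    # EXPLAIN QUERY PLAN rows carry a human-readable detail column last.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Equality on the indexed column: the planner takes the B-tree.
print(plan("SELECT * FROM users WHERE email = 'user42@example.com'"))
# e.g. ['SEARCH users USING INDEX idx_users_email (email=?)']

# No usable index on age, so the only option is a scan.
print(plan("SELECT * FROM users WHERE age > 30"))
# e.g. ['SCAN users']
```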
Complexity Analysis
Let $n$ be the number of rows, $B$ be the B-tree branching factor, and $k$ be the number of results in a range query.
| Operation | No Index | B-Tree Index | Hash Index |
|---|---|---|---|
| Equality lookup | $O(n)$ | $O(\log_B n)$ | $O(1)$ avg |
| Range query | $O(n)$ | $O(\log_B n + k)$ | $O(n)$ (not supported) |
| Insert | $O(1)$ | $O(\log_B n)$ | $O(1)$ avg |
| Delete | $O(n)$ | $O(\log_B n)$ | $O(1)$ avg |
| Ordered scan | $O(n \log n)$ sort | $O(n)$ leaf walk | $O(n \log n)$ sort |
The space overhead of a B-tree index is typically 1-3x the size of the indexed column data. For a table with $n$ rows and an index on a column of size $s$ bytes per value, the index occupies roughly

$$\text{index size} \approx n \cdot (s + p)$$

where $p$ is the pointer size (typically 6-8 bytes) for each entry pointing to the actual row. For example, indexing an 8-byte column over $10^8$ rows with 8-byte pointers takes on the order of 1.6 GB, before accounting for node fill factors.
Implementation
ALGORITHM BTreeLookup(root, searchKey)
INPUT: root: B-tree root node, searchKey: key to find
OUTPUT: record pointer if found, NIL otherwise
BEGIN
node <- root
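// Descend from the root to the leaf that could contain searchKey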
WHILE node IS NOT a leaf DO
i <- 0
WHILE i < node.keyCount AND searchKey >= node.keys[i] DO
i <- i + 1
END WHILE
node <- node.children[i]
END WHILE
// Search within the leaf node
FOR i <- 0 TO node.keyCount - 1 DO
IF node.keys[i] = searchKey THEN
RETURN node.pointers[i]
END IF
END FOR
RETURN NIL
END
ALGORITHM BTreeRangeQuery(root, low, high)
INPUT: root: B-tree root node, low: range start, high: range end
OUTPUT: list of matching record pointers
BEGIN
results <- empty list
// Find the leaf containing the low bound
node <- root
WHILE node IS NOT a leaf DO
i <- 0
WHILE i < node.keyCount AND low >= node.keys[i] DO
i <- i + 1
END WHILE
node <- node.children[i]
END WHILE
// Walk leaf nodes collecting results
WHILE node IS NOT NIL DO
FOR i <- 0 TO node.keyCount - 1 DO
IF node.keys[i] > high THEN
RETURN results
END IF
IF node.keys[i] >= low THEN
APPEND node.pointers[i] TO results
END IF
END FOR
node <- node.nextLeaf
END WHILE
RETURN results
END
ALGORITHM HashIndexLookup(hashTable, searchKey)
INPUT: hashTable: array of buckets, searchKey: key to find
OUTPUT: record pointer if found, NIL otherwise
BEGIN
bucketIndex <- HASH(searchKey) MOD LENGTH(hashTable)
bucket <- hashTable[bucketIndex]
FOR EACH entry IN bucket DO
IF entry.key = searchKey THEN
RETURN entry.pointer
END IF
END FOR
RETURN NIL
END
ALGORITHM SimpleQueryOptimizer(query, tableStats, availableIndexes)
INPUT: query with table, predicates, and sort order
OUTPUT: chosen execution plan
BEGIN
plans <- empty list
// Option 1: Full table scan
scanCost <- tableStats.rowCount * COST_PER_ROW_SCAN
APPEND (plan: "full_scan", cost: scanCost) TO plans
// Option 2: Check each available index
FOR EACH index IN availableIndexes DO
IF index covers any predicate in query THEN
selectivity <- ESTIMATE_SELECTIVITY(index, query.predicates)
matchingRows <- tableStats.rowCount * selectivity
indexCost <- LOG_B(tableStats.rowCount) * COST_PER_INDEX_READ
+ matchingRows * COST_PER_ROW_FETCH
APPEND (plan: "index_scan:" + index.name, cost: indexCost) TO plans
END IF
END FOR
// Pick the plan with the lowest estimated cost
bestPlan <- plan IN plans WITH MINIMUM cost
RETURN bestPlan
END
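Plugging numbers into that cost model shows where the crossover between plans sits. The constants below are invented for the sketch; they mirror the names in SimpleQueryOptimizer, not any real engine's tuning parameters.

```python
import math

# Hypothetical cost constants, in arbitrary units.
COST_PER_ROW_SCAN = 1.0    # sequential read of one row
COST_PER_INDEX_READ = 2.0  # one B-tree page read during the descent
COST_PER_ROW_FETCH = 4.0   # random I/O to fetch a row the index points at
ROWS, B = 10_000_000, 500

def scan_cost() -> float:
    return ROWS * COST_PER_ROW_SCAN

def index_cost(selectivity: float) -> float:
    matching = ROWS * selectivity
    return math.log(ROWS, B) * COST_PER_INDEX_READ + matching * COST_PER_ROW_FETCH

for sel in (0.0001, 0.01, 0.1, 0.5):
    winner = "index scan" if index_cost(sel) < scan_cost() else "full scan"
    print(f"selectivity {sel:>6}: {winner}")
# With these constants the full scan wins once roughly a quarter of the
# table matches, because every index hit pays for a random row fetch.
```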
Real-World Applications
- E-commerce platforms: relational databases store orders, inventory, and user accounts with ACID guarantees, while document stores or search engines (Elasticsearch) power product catalogs and full-text search
- Social media: graph databases model friend relationships and recommendations, while wide-column stores (Cassandra) handle high-throughput timeline writes across global data centers
- Financial systems: banks rely on relational databases with strong ACID transactions for account balances and transfers, where eventual consistency would be unacceptable
- Content management: document databases (MongoDB) store articles, blog posts, and media metadata with flexible schemas that evolve without migrations
- IoT and time-series: specialized time-series databases (TimescaleDB, InfluxDB) use columnar storage and time-partitioned indexes to handle millions of sensor readings per second
- Analytics warehouses: columnar databases (BigQuery, Redshift) store data by column rather than by row, enabling fast aggregation queries that scan only the columns needed
Key Takeaways
- SQL databases provide ACID transactions, enforced schemas, and powerful joins; NoSQL databases offer schema flexibility, horizontal scaling, and optimized access patterns for specific workloads
- B-tree indexes are the workhorse of relational databases, supporting both equality and range queries in $O(\log_B n)$ with branching factors large enough to keep trees shallow on disk
- Hash indexes provide $O(1)$ equality lookups but cannot support range queries, ordering, or partial matching
- Every index speeds up reads but slows down writes, because inserts and updates must maintain the index structure alongside the table data
- Query optimizers use table statistics and cost models to choose between full scans and index scans; an index is only helpful when the query is selective enough to benefit from it