
Databases: SQL vs NoSQL, Indexing, and Query Optimization

How relational and non-relational databases differ in data modeling and guarantees, how B-tree and hash indexes accelerate lookups, and how query optimizers turn SQL into efficient execution plans.

2021-02-14
System Design · databases · indexing

Terminology

  • Relational database (SQL): a database that organizes data into tables with rows and columns, enforces a fixed schema, and supports SQL for querying with ACID transaction guarantees
  • Non-relational database (NoSQL): a broad category of databases that do not use the tabular relational model; includes document stores, key-value stores, column-family stores, and graph databases
  • ACID: a set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee reliable transaction processing in relational databases
  • BASE: a set of properties (Basically Available, Soft state, Eventually consistent) that describe the trade-offs many NoSQL systems make in favor of availability and partition tolerance
  • Schema: the structure definition of a database, specifying tables, columns, data types, and constraints in SQL databases, or the expected document shape in NoSQL databases
  • Index: a separate data structure maintained by the database that maps column values to row locations, enabling fast lookups without scanning every row
  • B-tree index: a balanced tree structure where each node holds multiple keys in sorted order, optimized for range queries and equality lookups on disk-based storage
  • Hash index: an index that uses a hash function to map keys directly to storage locations, providing $O(1)$ average-case equality lookups but no support for range queries
  • Primary key: a column (or set of columns) that uniquely identifies each row in a table; the database automatically creates an index on the primary key
  • Composite index: an index built on multiple columns, useful for queries that filter or sort on several fields simultaneously
  • Query optimizer: the database component that analyzes a SQL query and chooses the most efficient execution plan based on available indexes, table statistics, and cost estimates
  • Execution plan: the step-by-step strategy the database uses to execute a query, including which indexes to use, join order, and access methods
  • Full table scan (sequential scan): reading every row in a table to find matching results; efficient for small tables but slow for large ones
  • Selectivity: a measure of how narrowly a query predicate filters a table; a highly selective predicate matches only a small fraction of rows, which makes index usage beneficial, while a predicate that matches most rows may favor a full scan
  • Denormalization: intentionally duplicating data across tables or documents to reduce the number of joins needed at query time, trading write complexity for read performance
  • Sharding: splitting a database across multiple servers by partitioning rows based on a shard key, enabling horizontal scaling

What & Why

Almost every application needs to persist data. The first decision is which type of database to use. Relational databases (PostgreSQL, MySQL) store data in tables with a fixed schema, enforce relationships through foreign keys, and guarantee ACID transactions. They excel when data has clear structure, relationships matter, and correctness is critical: financial systems, inventory management, user accounts.

NoSQL databases trade some of these guarantees for flexibility and scale. Document stores (MongoDB) let each record have a different shape, making them natural for content management and catalogs. Key-value stores (Redis, DynamoDB) provide the fastest possible lookups by primary key. Column-family stores (Cassandra) handle massive write throughput across distributed nodes. Graph databases (Neo4j) model relationships as first-class citizens for social networks and recommendation engines.

The choice is not binary. Many production systems use both: a relational database for transactional data and a NoSQL store for caching, search, or analytics. The key is understanding the trade-offs: schema rigidity vs. flexibility, strong consistency vs. eventual consistency, vertical scaling vs. horizontal scaling.

Once you have chosen a database, performance depends heavily on indexing. Without indexes, every query requires scanning the entire table. A well-chosen index turns an $O(n)$ scan into an $O(\log n)$ lookup. But indexes are not free: they consume storage, slow down writes (every insert and update must also update the index), and the wrong index can actually hurt performance. Understanding how B-tree and hash indexes work, and when to use each, is essential for building fast systems.

How It Works

SQL vs NoSQL Data Models

A relational database stores an order as rows across normalized tables: an orders table, an order_items table, and a products table linked by foreign keys. Fetching a complete order requires joining these tables. The benefit is that product information is stored once and shared.

A document database stores the same order as a single JSON document containing the order details, line items, and embedded product snapshots. Fetching the order is a single read with no joins. The trade-off is that product information is duplicated, and updating a product name requires updating every document that references it.
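The contrast between the two models can be sketched concretely. The snippet below uses SQLite for the relational side and a plain JSON document for the document side; all table, column, and field names are illustrative, not from any particular system.

```python
import json
import sqlite3

# Relational model: normalized tables linked by foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (
        order_id INTEGER REFERENCES orders(id),
        product_id INTEGER REFERENCES products(id),
        quantity INTEGER
    );
    INSERT INTO products VALUES (1, 'Widget', 9.99);
    INSERT INTO orders VALUES (100, 'alice');
    INSERT INTO order_items VALUES (100, 1, 2);
""")

# Fetching a complete order requires joining the three tables.
row = conn.execute("""
    SELECT o.id, o.customer, p.name, p.price, oi.quantity
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.id
    JOIN products p ON p.id = oi.product_id
    WHERE o.id = 100
""").fetchone()
print(row)  # (100, 'alice', 'Widget', 9.99, 2) -- assembled from three tables

# Document model: the same order as one self-contained document,
# with the product snapshot duplicated inside it.
order_doc = json.dumps({
    "id": 100,
    "customer": "alice",
    "items": [{"product": {"name": "Widget", "price": 9.99}, "quantity": 2}],
})
# A single read returns everything; no joins needed.
print(json.loads(order_doc)["items"][0]["product"]["name"])
```

Note the duplication cost on the document side: if the product is renamed, every order document embedding it must be rewritten, whereas the relational model updates one row in `products`.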

B-Tree Indexes

[Figure: a three-level B-tree with root (20|40|60), internal nodes such as (5|10|15), (25|30|35), (45|50|55), and leaves such as (1,3,4), (21,23), (31,33), (51,53). A lookup of key=31 descends root → internal node (25|30|35) → leaf (31,33): 3 node accesses = 3 disk reads = $O(\log_B n)$.]

B-trees are the default index structure in most relational databases. Each node holds multiple keys in sorted order and has a branching factor $B$ (typically hundreds to thousands). A B-tree with branching factor $B$ and $n$ keys has height $O(\log_B n)$. Because each node maps to a disk page, a lookup requires $O(\log_B n)$ disk reads. With $B = 500$ and 1 billion keys, the tree is only 3-4 levels deep.
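The "3-4 levels" claim is easy to verify by plugging the numbers into the height formula:

```python
import math

# Height of a B-tree is about log_B(n): with branching factor 500
# and one billion keys, only a handful of levels are needed.
B, n = 500, 1_000_000_000
height = math.log(n, B)
print(f"log_{B}(n) = {height:.2f} -> about {math.ceil(height)} levels")
# log_500(10^9) ≈ 3.33, so roughly 3-4 disk reads per lookup
```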

B-trees support both equality lookups (WHERE id = 42) and range queries (WHERE price BETWEEN 10 AND 50) because keys are sorted. A range query locates the lower bound and then walks the leaf nodes in order; in most implementations (B+ trees) the leaves are linked together for exactly this purpose.

Hash Indexes

Hash indexes compute a hash of the key and map it directly to a storage bucket. Equality lookups are $O(1)$ on average. However, hash indexes cannot support range queries, ordering, or partial key matching. They are ideal for exact-match lookups on columns with high cardinality (many distinct values).
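Python's built-in structures make the trade-off tangible: a dict behaves like a hash index (direct bucket lookup, no ordering), while a sorted list of keys behaves like a B-tree's leaf level (binary search plus an ordered walk). The keys below are made up for illustration.

```python
import bisect

# A dict models a hash index: the key hashes straight to a bucket.
hash_index = {user_id: f"row-{user_id}" for user_id in (42, 7, 300, 15)}
print(hash_index[42])  # O(1) average equality lookup -> row-42

# A sorted key list models a B-tree's ordered keys: ranges fall out naturally.
sorted_keys = sorted(hash_index)          # [7, 15, 42, 300]
lo = bisect.bisect_left(sorted_keys, 10)  # first key >= 10
hi = bisect.bisect_right(sorted_keys, 50) # first key > 50
print(sorted_keys[lo:hi])  # keys in [10, 50] -> [15, 42]

# The same range query against the dict alone would require probing
# every possible key (or scanning all entries) -- there is no order to exploit.
```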

Query Optimization

When a SQL query arrives, the optimizer considers multiple execution strategies:

  1. Which indexes are available for the tables and columns in the query.
  2. Table statistics: row counts, value distributions, index selectivity.
  3. Join ordering: for multi-table queries, the order in which tables are joined dramatically affects performance.
  4. Access methods: index scan vs. full table scan vs. index-only scan.

The optimizer estimates the cost of each plan (in terms of disk I/O and CPU) and picks the cheapest one. This is why adding an index can speed up reads but the optimizer might still choose a full scan if the index selectivity is too low (the query matches most rows anyway).
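You can watch an optimizer make this choice with SQLite's EXPLAIN QUERY PLAN (the table and index names below are illustrative). In SQLite's plan output, a "SEARCH ... USING INDEX" line indicates an index lookup, while "SCAN" indicates a full table scan; exact wording varies slightly between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INTEGER)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Equality predicate on an indexed column: the planner picks the index.
indexed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'a@example.com'"
).fetchall()

# Predicate on an unindexed column: nothing to use but a full scan.
unindexed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE age > 30"
).fetchall()

# Each plan row is (id, parent, notused, detail); detail holds the access method.
print(indexed[0][3])    # expect a SEARCH ... USING INDEX idx_users_email line
print(unindexed[0][3])  # expect a SCAN line
```

Real optimizers in PostgreSQL or MySQL expose the same information through EXPLAIN, including the cost estimates discussed above.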

Complexity Analysis

Let $n$ be the number of rows, $B$ be the B-tree branching factor, and $k$ be the number of results in a range query.

Operation       | No Index            | B-Tree Index       | Hash Index
Equality lookup | $O(n)$              | $O(\log_B n)$      | $O(1)$ avg
Range query     | $O(n)$              | $O(\log_B n + k)$  | not supported
Insert          | $O(1)$              | $O(\log_B n)$      | $O(1)$ avg
Delete          | $O(n)$              | $O(\log_B n)$      | $O(1)$ avg
Ordered scan    | $O(n \log n)$ sort  | $O(n)$ leaf walk   | $O(n \log n)$ sort

The space overhead of a B-tree index is typically 1-3x the size of the indexed column data. For a table with $n$ rows and an index on a column of size $s$ bytes per value:

$\text{Index size} \approx n \times (s + p)$

where $p$ is the pointer size (typically 6-8 bytes) for each entry pointing to the actual row.
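Plugging illustrative numbers into the estimate shows why large indexes matter for capacity planning (the figures below are assumptions, not from any particular system):

```python
# Index size estimate: n * (s + p), per the formula above.
n = 100_000_000  # rows
s = 8            # bytes per indexed value (e.g. a BIGINT column)
p = 8            # bytes per row pointer

index_size_bytes = n * (s + p)
print(f"{index_size_bytes / 1e9:.1f} GB")  # -> 1.6 GB for a single-column index
```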

Implementation

ALGORITHM BTreeLookup(root, searchKey)
INPUT: root: B-tree root node, searchKey: key to find
OUTPUT: record pointer if found, NIL otherwise
BEGIN
  node <- root

  WHILE node IS NOT a leaf DO
    i <- 0
    WHILE i < node.keyCount AND searchKey >= node.keys[i] DO
      i <- i + 1
    END WHILE
    node <- node.children[i]
  END WHILE

  // Search within the leaf node
  FOR i <- 0 TO node.keyCount - 1 DO
    IF node.keys[i] = searchKey THEN
      RETURN node.pointers[i]
    END IF
  END FOR

  RETURN NIL
END


ALGORITHM BTreeRangeQuery(root, low, high)
INPUT: root: B-tree root node, low: range start, high: range end
OUTPUT: list of matching record pointers
BEGIN
  results <- empty list

  // Find the leaf containing the low bound
  node <- root
  WHILE node IS NOT a leaf DO
    i <- 0
    WHILE i < node.keyCount AND low >= node.keys[i] DO
      i <- i + 1
    END WHILE
    node <- node.children[i]
  END WHILE

  // Walk leaf nodes collecting results
  WHILE node IS NOT NIL DO
    FOR i <- 0 TO node.keyCount - 1 DO
      IF node.keys[i] > high THEN
        RETURN results
      END IF
      IF node.keys[i] >= low THEN
        APPEND node.pointers[i] TO results
      END IF
    END FOR
    node <- node.nextLeaf
  END WHILE

  RETURN results
END


ALGORITHM HashIndexLookup(hashTable, searchKey)
INPUT: hashTable: array of buckets, searchKey: key to find
OUTPUT: record pointer if found, NIL otherwise
BEGIN
  bucketIndex <- HASH(searchKey) MOD LENGTH(hashTable)
  bucket <- hashTable[bucketIndex]

  FOR EACH entry IN bucket DO
    IF entry.key = searchKey THEN
      RETURN entry.pointer
    END IF
  END FOR

  RETURN NIL
END


ALGORITHM SimpleQueryOptimizer(query, tableStats, availableIndexes)
INPUT: query with table, predicates, and sort order
OUTPUT: chosen execution plan
BEGIN
  plans <- empty list

  // Option 1: Full table scan
  scanCost <- tableStats.rowCount * COST_PER_ROW_SCAN
  APPEND (plan: "full_scan", cost: scanCost) TO plans

  // Option 2: Check each available index
  FOR EACH index IN availableIndexes DO
    IF index covers any predicate in query THEN
      selectivity <- ESTIMATE_SELECTIVITY(index, query.predicates)
      matchingRows <- tableStats.rowCount * selectivity
      indexCost <- LOG_B(tableStats.rowCount) * COST_PER_INDEX_READ
                   + matchingRows * COST_PER_ROW_FETCH
      APPEND (plan: "index_scan:" + index.name, cost: indexCost) TO plans
    END IF
  END FOR

  // Pick the plan with the lowest estimated cost
  bestPlan <- plan IN plans WITH MINIMUM cost
  RETURN bestPlan
END
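The SimpleQueryOptimizer pseudocode above translates directly into a runnable sketch. The cost constants, the assumed branching factor of 500, and the caller-supplied selectivities are all made-up illustrative values, standing in for the statistics a real optimizer would maintain.

```python
import math

# Illustrative per-operation cost weights (a real optimizer tunes these).
COST_PER_ROW_SCAN = 1.0
COST_PER_INDEX_READ = 4.0
COST_PER_ROW_FETCH = 2.0

def choose_plan(row_count, covering_indexes):
    """Pick the cheapest plan. covering_indexes is a list of
    (index_name, estimated_selectivity) pairs for indexes that
    cover some predicate of the query."""
    # Option 1: full table scan.
    plans = [("full_scan", row_count * COST_PER_ROW_SCAN)]

    # Option 2: descend an index, then fetch the matching rows.
    for name, selectivity in covering_indexes:
        matching_rows = row_count * selectivity
        cost = (math.log(row_count, 500) * COST_PER_INDEX_READ
                + matching_rows * COST_PER_ROW_FETCH)
        plans.append((f"index_scan:{name}", cost))

    return min(plans, key=lambda plan: plan[1])

# A selective predicate (0.1% of rows) makes the index the clear winner...
print(choose_plan(1_000_000, [("idx_email", 0.001)])[0])  # index_scan:idx_email
# ...but one matching 90% of rows makes the full scan cheaper.
print(choose_plan(1_000_000, [("idx_status", 0.9)])[0])   # full_scan
```

This is the behavior described earlier: the same index helps or is ignored depending entirely on how selective the query is.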

Real-World Applications

  • E-commerce platforms: relational databases store orders, inventory, and user accounts with ACID guarantees, while document stores or search engines (Elasticsearch) power product catalogs and full-text search
  • Social media: graph databases model friend relationships and recommendations, while wide-column stores (Cassandra) handle high-throughput timeline writes across global data centers
  • Financial systems: banks rely on relational databases with strong ACID transactions for account balances and transfers, where eventual consistency would be unacceptable
  • Content management: document databases (MongoDB) store articles, blog posts, and media metadata with flexible schemas that evolve without migrations
  • IoT and time-series: specialized time-series databases (TimescaleDB, InfluxDB) use columnar storage and time-partitioned indexes to handle millions of sensor readings per second
  • Analytics warehouses: columnar databases (BigQuery, Redshift) store data by column rather than by row, enabling fast aggregation queries that scan only the columns needed

Key Takeaways

  • SQL databases provide ACID transactions, enforced schemas, and powerful joins; NoSQL databases offer schema flexibility, horizontal scaling, and optimized access patterns for specific workloads
  • B-tree indexes are the workhorse of relational databases, supporting both equality and range queries in $O(\log_B n)$ with branching factors large enough to keep trees shallow on disk
  • Hash indexes provide $O(1)$ equality lookups but cannot support range queries, ordering, or partial matching
  • Every index speeds up reads but slows down writes, because inserts and updates must maintain the index structure alongside the table data
  • Query optimizers use table statistics and cost models to choose between full scans and index scans; an index is only helpful when the query is selective enough to benefit from it