Joins are a concept in relational databases, allowing the combination of data from two or more tables. Joins establish a relationship between two tables using a common key, which is used to lookup data from one table to another. There are different types of joins and different algorithms that can be used to retrieve the data. In this article we will be focusing on discussing the differences between each algorithm. Nested Loop Joins A Nested Loop Join algorithm is a straightforward method for

Marcelo Altmann
2024-08-21 · 7 min read
Joins are a concept in relational databases, allowing the combination of data from two or more tables. Joins establish a relationship between two tables using a common key, which is used to lookup data from one table to another.
There are different types of joins and different algorithms that can be used to retrieve the data. In this article we will be focusing on discussing the differences between each algorithm.
A Nested Loop Join algorithm is a straightforward method for performing JOIN operations between two tables in a database.
FOR each row in OuterTable DO
FOR each row in InnerTable DO
IF OuterTable.row matches InnerTable.row based on join condition THEN
Add the combined row to the result set
END IF
END FOR
END FOR
O(M*N) time complexity, where M and N are the sizes of the two tables.There are some variations of Nested Loop, such as:
Block Nested Loop - Where some rows from OuterTable are buffered in memory, making it a batch lookup in the InnerTable, reducing the amount of times we need to scan rows from the InnerTable
Index Nested Loop - If there are indexes available in the InnerTable to satisfy the ON condition, the inner loop can be adjusted to use an index loop-up. This reduces the complexity to O(M*log N).
A Hash Join is a more efficient algorithm for performing JOIN operations between two tables, especially when dealing with large datasets. It is particularly effective when there are no indexes on the join columns and when the join condition is an equality condition. Here’s how it works:
// Build phase
FOR each row in BuildTable DO
Compute hash value for the join key
Insert row into HashTable based on hash value
END FOR
// Probe phase
FOR each row in ProbeTable DO
Compute hash value for the join key
IF hash value exists in HashTable THEN
Retrieve matching rows from HashTable
FOR each matching row DO
Combine rows from ProbeTable and BuildTable
Add the combined row to the result set
END FOR
END IF
END FOR
Hash Join is typically used:
O(M*N) ) for large tables, with an average time complexity of O(M + N), where M and N are the sizes of the two tablesReadyset is a fast in-memory cache for your queries, with incremental view maintenance that does not require invalidation, instead as new data is ingested in your database, the cache entries are automatically updated. The latency for a query that hits a cache hit (it’s found in cache) is in the figures of sub milliseconds. However, when we hit a cache miss we need to perform what we call up-query or replay, which basically means traverse the dataflow graph upwards to find a node that has materialized the data we are requesting. In certain cases we need to go back all the way up to our base node, which consists in the RocksDB table containing a copy of all the records.
A Normal JOIN or single table look-up is done via Index Lookup which is fast. Some types of joins that have predicates in the WHERE clause coming from different sides of the join are not fully indexed and require extra steps to ensure we are returning just the records matching the ON condition and also the predicates in the WHERE clause. We call those straddle joins and they are disabled by default.
The current algorithm used to resolve straddle joins is Nested Loop Join Algorithm.
We are thrilled to announce that starting at Readyset release stable-240627 we have implemented the Hash Join algorithm used in those joins operations to take advantage of the algorithm efficiency. The performance benefits from Hash Join are orders of magnitude faster if compared to the previous release.
To demonstrate such performance improvements, we will use a sample pair of tables and run some batch data into it and check how many people named X have an active salary:
Schema:
CREATE TABLE `person` (
`ID` int DEFAULT NULL,
`name` char(11) DEFAULT NULL
);
CREATE TABLE `salary` (
`salary` int DEFAULT NULL,
`person_id` int DEFAULT NULL,
`active` tinyint DEFAULT '1'
);
Data load - For each person, we insert a salary of 100k as inactive and 200k as active.
for f in $(seq 1 1000)
do
mysql -e "set sql_log_bin=0; INSERT INTO person VALUES (${f}, 'person_name'); INSERT INTO salary VALUES (100000, '${f}', 0), (120000, ${f}, 1);" test
done
Query:
SELECT COUNT(*)
FROM person
JOIN salary ON person.ID = salary.person_id
WHERE person.name = 'person_name'
AND salary.active=1;
Results:
| Batch Size | Readyset + Nested Loop (Cache Miss) | Readyset + Hash Join (Cache Miss) | Readyset (Cache Hit) | MySQL |
|---|---|---|---|---|
| 1k | 2,01sec | 0,03sec | 0,00sec | 0,03sec |
| 10k | 0,05sec | 0,01sec | 0,00sec | 0,01sec |
| 100k | 3 min 12,11sec | 0,20sec | 0,00sec | 0,20sec |
Readyset version stable-240627 hash join algorithm delivers a significantly higher speed for handling cache misses during join operations that involve predicates from both sides.
The increase in performance becomes more evident as the volume of data grows, highlighting the advantages of this version with larger datasets.
Queries resulting in a cache hit do not require a replay, they are served directly from cache and continue to experience sub millisecond latency delivered by Readyset.
Modern applications demand instant performance, even under unpredictable load. Readyset helps you eliminate slow queries, stabilize latency, and scale confidently.
Revolutionize your database performance with Readyset
Serve requests at sub-millisecond latencies with the modern database scaling and query caching system for MySQL and PostgreSQL.
Join our newsletter
Stay updated with the latest news, insights, and developments from Readyset — straight to your inbox.