Application Level Joins of DynamoDB NoSQL Data


written by: mic amiethyst

You probably don’t want to do this [#].  There are alternatives to manually joining DynamoDB NoSQL documents.  These include denormalizing the data, which is quite common in NoSQL environments, as well as using Amazon’s Elastic MapReduce (EMR) service, which does have a join feature.  However, using EMR just for joins is like buying a car when all you need is a radio.  If you don’t need EMR, and you either can’t denormalize the data or just haven’t yet, then this strategy may be for you.

Have you come to NoSQL technologies only after learning SQL for years?  If so, then you’ve probably wanted to JOIN different documents (like tables, but NoSQLier) together.  If you’ve started your career in NoSQL instead, this will give you an extra mental tool to go along with the code.

At a high level, we are going to read data from two different DynamoDB documents, run the same kind of join algorithm that traditional databases do (in a simple, un-optimized way [footnote to accounting of OS level page faults in real databases]), and finally write the results back to another DynamoDB document.  This is analogous to a point-in-time snapshot of the data, a view in the SQL sense, or a kind of simple Extract-Transform-Load (ETL) operation.  Note that in a traditional DB we would have stronger guarantees about data staleness, since locks could be enforced.

With these constraints in mind, we will proceed.
 

Read DynamoDB Data

Assuming that you have your AWS credentials in place and that your DynamoDB table already exists, you can find a great example in the AWS docs themselves at [#].  A more detailed example using DynamoDBMapper can be found at [#].
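
As a quick illustration (not a substitute for those docs), here is a minimal Python/boto3 sketch that scans two tables into memory.  The table names CreditCards and Mortgages are placeholders for whatever your documents are actually called, and a Scan is the simplest, not the cheapest, way to pull everything down.

    import boto3

    dynamodb = boto3.resource("dynamodb")

    def scan_all(table_name):
        """Read every item from a DynamoDB table, following pagination."""
        table = dynamodb.Table(table_name)
        response = table.scan()
        items = response["Items"]
        # A single Scan call returns at most 1 MB of data, so keep paging
        # until there is no LastEvaluatedKey left.
        while "LastEvaluatedKey" in response:
            response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
            items.extend(response["Items"])
        return items

    credit_cards = scan_all("CreditCards")  # will become R, the smaller set
    mortgages = scan_all("Mortgages")       # will become S, the larger set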
 

Join Document Data

This is the real core of the problem.  We will be using the same algorithm that RDBMS engines use [#], the classic hash join.  Fundamentally, the DynamoDB documents are just key-value pairs, so we are going to use some loops to find matching key-value pairs (the equivalent of the WHERE clause) and, when we do, merge the two matching documents’ key-value pairs together.  The hash part of the hash join just means that the values being matched on are hashed in a preprocessing step, so the search loops can compare ints instead of doing more complex equality checks.
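
In Python, the general shape might look like the sketch below, where the built-in dict plays the role of the hash table.  The function name and signature are mine for illustration, not anything from the AWS SDK.

    def hash_join(smaller, larger, key):
        """Join two lists of documents (dicts) on a shared key."""
        # Build phase: index the smaller set by the join key.  A Python dict
        # is itself a hash table, so inserting here is the hashing step.
        build = {}
        for doc in smaller:
            build.setdefault(doc[key], []).append(doc)

        # Probe phase: look up each document of the larger set by the same
        # key and merge the key-value pairs of every match.
        joined = []
        for doc in larger:
            for match in build.get(doc[key], []):
                joined.append({**match, **doc})
        return joined

Calling something like hash_join(credit_cards, mortgages, "name") on the data scanned earlier would produce the merged documents we write back at the end.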

We begin with two sets of documents (hashes) and call the larger set S and the smaller set R.  It’s confusing that the smaller set isn’t called S, but convention happens.  We are going to use a (very small) example as input data.  Imagine a credit card bank acquiring a real estate bank for the first time.  The parent bank will have credit card information and the real estate bank will have mortgage data.

[Figure: the two input document sets, the parent bank’s credit card documents (R) and the real estate bank’s mortgage documents (S)]

Let’s run a manual join for people who are customers of both banks.  First we pre-process the R document by hashing the name, which is the key we are joining on.  To keep things readable we will hash names by truncating them down to a single letter.  This gives us:

[Figure: the hashed R table, each credit card document keyed by the first letter of the customer’s name]

Putting this table together is sometimes called the build phase.  The next is the probe phase.

The next step is to loop through S’s entries and probe the hashed R table for matches.  This probe action is just looping through the hashed table looking for a match on the hashed input name.  When we find a match, we merge the two documents and put the combined document into an output array.
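
Here is how the build and probe phases of this toy example might look in code.  Aside from Richard’s Mortgage# of 10, the values below (Jane, the CreditCard# fields, the other Mortgage# values) are invented stand-ins for the tables pictured above.

    def toy_hash(name):
        """The readability hash from this article: keep only the first letter."""
        return name[0]

    # R, the smaller set: the parent bank's credit card documents (values invented).
    r_docs = [
        {"name": "Jane", "CreditCard#": 1},
        {"name": "Richard", "CreditCard#": 2},
    ]

    # Build phase: pair each R document with its hashed name.  Keeping the whole
    # document alongside the hash is what lets us "unhash" the name later.
    hashed_r = [(toy_hash(doc["name"]), doc) for doc in r_docs]

    # S, the larger set: the mortgage documents.
    s_docs = [
        {"name": "Richard", "Mortgage#": 10},
        {"name": "Chris", "Mortgage#": 11},
        {"name": "Donna", "Mortgage#": 12},
    ]

    # Probe phase: loop through S, hash each name, and walk the hashed R table
    # looking for a matching hash.  A real hash table turns this inner loop
    # into a constant-time lookup.
    output = []
    for s_doc in s_docs:
        probe = toy_hash(s_doc["name"])
        for r_hash, r_doc in hashed_r:
            if r_hash == probe:
                # Merge the matching documents; r_doc carries the full name,
                # which restores ("unhashes") it in the result.
                output.append({**r_doc, **s_doc})

    # output == [{"name": "Richard", "CreditCard#": 2, "Mortgage#": 10}]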

Let’s work out what that would look like.  Our first entry in S is {name: Richard, Mortgage#: 10}.  The name hashes down to R.  We compare it to the first entry in the hashed R table and find a J, so no match.  We compare it to the next entry and find an R, so it is a match.  We merge the fields, unhash the name via a hashtable lookup, and get an output so far of:

[Figure: the output so far, Richard’s credit card and mortgage fields merged into a single document]

We continue with the next entry in S, Chris.  This hashes down to C and has no match in R’s hashed data.  We finish with the last entry in S, Donna.  This hashes down to D and also has no match.  That leaves our initial output data as our final result.  Now we just need to get that result back into DynamoDB, using a Create Table request [#] and a Batch Write Item request to populate our new table [#].
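
Writing the result back might look something like the sketch below, reusing the output list built above.  The output table name and key schema are assumptions; boto3’s batch_writer takes care of splitting the items into BatchWriteItem requests and retrying any unprocessed items.

    import boto3

    dynamodb = boto3.resource("dynamodb")

    # Create the output table (name and key schema are placeholders).
    joined_table = dynamodb.create_table(
        TableName="JoinedCustomers",
        KeySchema=[{"AttributeName": "name", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "name", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    joined_table.wait_until_exists()

    # Batch-write the joined documents into the new table.
    with joined_table.batch_writer() as batch:
        for doc in output:
            batch.put_item(Item=doc)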

Conclusion

We have basically taken the RDBMS-level join algorithm and done a lift-and-shift into the application layer.  To make use of it we had to pull all of the needed data onto one machine to hash and search it, and then write the results back into the cloud.  This algorithm is not especially efficient and should be used only for small data sets; see the wiki entry for other, more efficient but more complex algorithms [#].  Even this one was simplified by assuming there was always enough memory to slurp the two DynamoDB tables into memory.  That simplicity is what makes it a great introduction to application-level joins.

Recall the downsides of this approach.  These include paying for DynamoDB reads and writes, data-consistency issues caused by timing during the initial bulk read, and needing to keep “foreign keys” consistent in the application layer instead of in the SQL RDBMS.

The alternatives are to de-normalize your data or to use Amazon EMR (Elastic MapReduce).  Either would remove the need to do application-level joins at all.