Intro to MongoDB

This article might come as a surprise to some -- an Intro to Mongo article in 2015? Yes, MongoDB had its heyday several years ago, but as consultants who see a lot of different companies, one thing is clear to us: there are a lot of Mongo databases out there, and not enough people who know even the basics of how to use them. In addition, a lot of naysayers denounced the tool, pointing at its lack of a schema as proof that it's a tool for amateurs and others too lazy to learn SQL.

So What's Changed?

Every piece of technology goes through an adoption curve. The curve looks a lot like this:

[Figure: the technology adoption curve, a.k.a. the hype cycle]

We all remember MongoDB in that Early Adopter phase, right around the Peak of Inflated Expectations. The ride down to the Trough of Disillusionment was swift for Mongo, and at that point many people wrote it off as a dead technology. Mongo, however, is now cruising along the Slope of Enlightenment. People understand its strengths and weaknesses better. They aren't trying to use it as a schema-free SQL database; they're using it for its intended purpose: document storage.

Before we get too far into the guts of MongoDB, let's remember what its strengths are, compared to a traditional database:

  • Stores full documents atomically at one location
  • Scales horizontally automatically as you add nodes
  • Provides a highly available service with replicated data
  • Stores files across multiple nodes for high availability
  • Supports aggregation and map-reduce operations

Your traditional SQL database is limited to one machine. While it can replicate to a number of others, the total size of the database cannot exceed what one machine can hold. It's a scale-up architecture. With MongoDB, the assumption is that the data will end up consuming more than one machine's worth of resources. It's a scale-out architecture.
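
To give a flavor of what scale-out looks like in practice, here is roughly how you enable sharding from the mongo shell, assuming a sharded cluster (config servers plus a mongos router) is already running; the database, collection, and key names here are hypothetical:

$ sh.enableSharding("mydb")                              // allow collections in "mydb" to be distributed
$ sh.shardCollection("mydb.sensordata", { deviceId: 1 }) // split sensordata across shards by deviceId

The choice of shard key matters a great deal: a key with many evenly distributed values (like a device id) spreads load across shards, while a monotonically increasing one (like a timestamp) funnels all new writes to a single shard.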

ACID and CAP

ACID compliance (Atomicity, Consistency, Isolation, Durability) is what makes modern RDBMSs so powerful. It gives us guarantees about the consistency of the data as we read and write it. When you hear developers talking about "transactions," that's what they mean: a way to apply a set of changes atomically, so that they either all succeed or all fail together.
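
MongoDB, for its part, does not offer multi-document transactions, but any operation on a single document is atomic. In this hypothetical account update, either both changes apply or neither does:

$ db.accounts.update(
...     { _id: "alice" },
...     { $inc: { balance: -50 }, $push: { history: { amount: -50, at: new Date() } } }
... )

This is one reason that storing full documents atomically at one location matters so much: if everything that must change together lives in one document, you don't need a transaction to change it safely.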

The CAP theorem says that any distributed system can provide only two of the following three guarantees:

  • Consistency - All nodes see the same data at the same time
  • Availability - Every request gets a response about success or failure
  • Partition-tolerance - System continues to operate despite arbitrary message loss or failure of some part of the system.

The traditional RDBMS has chosen CA: if some part of the database is unavailable, the whole thing becomes unavailable. MongoDB has chosen AP. Consistency can still be achieved in such a system, given enough time -- that's why it's called eventual consistency. What this means, however, is that if one node in the cluster accepts a write for document X, another node may return stale data for some period of time until consistency is achieved.
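
You also get some per-operation control over this tradeoff. As a rough sketch (the exact options vary by version, and the "events" collection here is hypothetical), you can ask that a majority of replicas acknowledge a write, or explicitly allow a read from a secondary that may be slightly behind:

$ db.events.insert({ type: "click" }, { writeConcern: { w: "majority", wtimeout: 5000 } })
$ db.events.find({ type: "click" }).readPref("secondary")  // may return stale data for a while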

This does not mean that one system is necessarily better than the other. It's a design tradeoff you must make, no different from choosing your UI framework or backend programming language. For a lot of applications, perfect consistency simply isn't necessary:

  • Sensor Data Collection -- loosely structured, high volumes, high variety. Think biometric sensors, the "internet of things," etc.
  • Ad Targeting -- latency guarantees are far more important than consistency guarantees.
  • High-Frequency Trading -- again, latency matters far more than consistency, as long as the data is mostly consistent.
  • Survey Site -- custom surveys stored as custom documents. Other people's answers have no bearing on yours.
  • Call Records -- fixed documents that may capture any number of variables.
  • Caching -- one of the most compelling cases. Data that is difficult or expensive to collect can be stored in a document for later use, and the absence of a record is OK (see the sketch just below).
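
To make the caching case concrete, here's a minimal sketch of a cache-aside pattern in the mongo shell. computeExpensiveReport() is a hypothetical stand-in for whatever slow or expensive work produces the data:

$ var key = "report:2015-02";
$ var cached = db.cache.findOne({ _id: key });
$ if (cached === null) {
...     var value = computeExpensiveReport();  // hypothetical expensive computation
...     db.cache.insert({ _id: key, value: value, createdAt: new Date() });
... }

A missing cache document just means we do the work once and store it; a stale or absent entry never breaks correctness.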

There are a variety of cases where perfect consistency is desired, such as financial transaction data. Again, this is a design decision of your project.

Playing with MongoDB

One of the coolest things MongoDB has done is create a demo site. You can try MongoDB in your browser, without dealing with an installation!

For the examples here, we're going to pretend that we're collecting biometric sensor data. Let's assume you have some sort of futuristic device that can measure everything about your body. It's to support more tailored advertising, of course.

Baby Steps

In MongoDB, you organize your data at two levels. A database is a group of collections, and a collection is a group of documents. To start using a database, just issue a simple command:

$ use test
$ db
test

OK, we're in the test database. Collections are created implicitly the first time you insert data into them. In this example, we're going to store the data in a time-series fashion; that is, each document we insert represents the data collected since the previous document:

$ first = { ts: 10000, heart: { beats: 14 }, eyes: { left: { blinks: 3 }, right: { blinks: 3 }} }
{
    "ts" : 10000,
    "heart" : {
        "beats" : 14
    },
    "eyes" : {
        "left" : {
            "blinks" : 3
        },
        "right" : {
            "blinks" : 3
        }
    }
}
$ db.metrics.insert(first)
WriteResult({ "nInserted" : 1 })

We've created our first document! Let's get it back out:

$ db.metrics.find()
{
    "_id" : ObjectId("54d54bf51cdcaf3824ff7059"),
    "heart" : {
        "beats" : 14
    },
    "eyes" : {
        "right" : {
            "blinks" : 3
        },
        "left" : {
            "blinks" : 3
        }
    },
    "ts" : 10000
}

The _id Field

The find() operation shows something interesting -- the _id field. It's the unique identifier for that record, and MongoDB generates one automatically (as an ObjectId) if you don't supply it. You can set or otherwise customize this id by passing it in yourself:

$ db.foo.insert({ _id: "hello_world", hello: "world"})
WriteResult({ "nInserted" : 1 })
$ db.foo.find()
{ "_id" : "hello_world", "hello" : "world" }

One of the interesting things about the Mongo shell is that it's JavaScript. You can actually write something like this:

$ for ( var i = 0; i < 10; i++ ) { db.foo.insert({ hello: "world "+i }) }
9
$ db.foo.find()
{ "_id" : "hello_world", "hello" : "world" }
{ "_id" : ObjectId("54d54ecc40694710ae63aefb"), "hello" : "world 0" }
{ "_id" : ObjectId("54d54ecf40694710ae63aefd"), "hello" : "world 1" }
{ "_id" : ObjectId("54d54ed11cdcaf3824ff7081"), "hello" : "world 2" }
{ "_id" : ObjectId("54d54ed41cdcaf3824ff7083"), "hello" : "world 3" }
{ "_id" : ObjectId("54d54ed71cdcaf3824ff7085"), "hello" : "world 4" }
{ "_id" : ObjectId("54d54edc40694710ae63af00"), "hello" : "world 5" }
{ "_id" : ObjectId("54d54edf40694710ae63af04"), "hello" : "world 6" }
{ "_id" : ObjectId("54d54ee41cdcaf3824ff7087"), "hello" : "world 7" }
{ "_id" : ObjectId("54d54ee740694710ae63af06"), "hello" : "world 8" }
{ "_id" : ObjectId("54d54eea1cdcaf3824ff7089"), "hello" : "world 9" }

Back to the Metrics

So, let's generate some biometric data for our sensor scenario:

$ function getRand(start, end) {
...     return Math.floor(start + (Math.random() * (end - start + 1)));
... }
$
$ for ( var i = 0; i < 1000; i++ ) {
...     var rec = {
...         ts: 10000 + (i * 5),
...         heart: {
...             beats: 6 + getRand(0, 15)
...         },
...         eyes: {
...             left: {
...                 blinks: 3 + (i % 3)
...             },
...             right: {
...                 blinks: 3 + (i % 3)
...             }
...         }
...     };
...     db.metrics.insert(rec);
... }
$ db.metrics.find().limit(3)
{ "_id" : ObjectId("54d5530e3ac6e2a4782b088e"), "ts" : 14845, "heart" : { "beats" : 20 }, "eyes" : { "left" : { "blinks" : 3 }, "right" : { "blinks" : 3 } } }
{ "_id" : ObjectId("54d5530e3ac6e2a4782b088f"), "ts" : 14850, "heart" : { "beats" : 8 }, "eyes" : { "left" : { "blinks" : 4 }, "right" : { "blinks" : 4 } } }
{ "_id" : ObjectId("54d5530e3ac6e2a4782b0890"), "ts" : 14855, "heart" : { "beats" : 20 }, "eyes" : { "left" : { "blinks" : 5 }, "right" : { "blinks" : 5 } } }

Notice that we also added a limit() to the find() query; it works as you would expect. Now, suppose I want to look up a specific record, say the one at ts = 12485:

$ db.metrics.find({ts: 12485}).pretty();
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b06b6"),
    "ts" : 12485,
    "heart" : {
        "beats" : 7
    },
    "eyes" : {
        "left" : {
            "blinks" : 5
        },
        "right" : {
            "blinks" : 5
        }
    }
}

Notice another neat trick here. Above, the JSON came back all on one line, which is a bit difficult to read. Appending the pretty() method pretty-prints it. You can also match against entire nested documents:

$ db.metrics.find({ heart: { beats: 19 }}).limit(2).pretty()
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b089b"),
    "ts" : 14910,
    "heart" : {
        "beats" : 19
    },
    "eyes" : {
        "left" : {
            "blinks" : 4
        },
        "right" : {
            "blinks" : 4
        }
    }
}
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b089f"),
    "ts" : 14930,
    "heart" : {
        "beats" : 19
    },
    "eyes" : {
        "left" : {
            "blinks" : 5
        },
        "right" : {
            "blinks" : 5
        }
    }
}

So, that's the basics of finding records by exact values. Because MongoDB is schema-free, however, you're allowed to run queries that don't match your data's shape at all:

$ db.metrics.find({ weight: 200 })
$

We found no results because no records have a weight field. This is how you can keep documents with different schemas in one collection and still find the relevant records.
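
If you want to distinguish "this field is absent" from "this field has some value," the $exists operator does exactly that:

$ db.metrics.find({ weight: { $exists: true } }).count()
0
$ db.metrics.find({ weight: { $exists: false } }).count()
1001

All 1,001 documents (our first insert plus the thousand generated earlier) lack a weight field.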

Now, let's try something more interesting. Let's try to find the records with more than 9 beats:

$ db.metrics.find({ heart: { beats: {$gt: 9} } })
$ db.metrics.find({ "heart.beats": {$gt: 9} }).limit(2).pretty()
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b088e"),
    "ts" : 14845,
    "heart" : {
        "beats" : 20
    },
    "eyes" : {
        "left" : {
            "blinks" : 3
        },
        "right" : {
            "blinks" : 3
        }
    }
}
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b0890"),
    "ts" : 14855,
    "heart" : {
        "beats" : 20
    },
    "eyes" : {
        "left" : {
            "blinks" : 5
        },
        "right" : {
            "blinks" : 5
        }
    }
}

Here we have our first bit of Mongo weirdness. The first query returns nothing because querying with a full subdocument -- { heart: { beats: {$gt: 9} } } -- asks for documents whose heart field exactly equals that object, $gt and all. To reach inside a nested document and apply an operator to one of its values, you use the dotted notation instead. We can also search between two values:

$ db.metrics.find({ "heart.beats": {$gte: 18, $lt: 21} }).limit(2).pretty()
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b088e"),
    "ts" : 14845,
    "heart" : {
        "beats" : 20
    },
    "eyes" : {
        "left" : {
            "blinks" : 3
        },
        "right" : {
            "blinks" : 3
        }
    }
}
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b0890"),
    "ts" : 14855,
    "heart" : {
        "beats" : 20
    },
    "eyes" : {
        "left" : {
            "blinks" : 5
        },
        "right" : {
            "blinks" : 5
        }
    }
}

In addition, we can use projections to indicate which fields should be sent back:

$ db.metrics.find({ ts: 14845, heart: { beats: 20 }}, { _id: 0, "heart.beats": 1})
{ "heart" : { "beats" : 20 } }

Using the Data

OK, we can insert, find, and manipulate data. Let's start actually using it by figuring out our average heart rate. How would we do this? We need the total number of beats, divided by the number of minutes elapsed. This is where MongoDB starts getting fun. We'll use the part of the tool known as the aggregation pipeline. It looks a bit like this:

$ db.metrics.aggregate([{ $group: { _id: "answer", sumOfBeats: { $sum: "$heart.beats" } } }])
{ "_id" : "answer", "sumOfBeats" : 13587 }

The aggregation framework has some wonkiness in its syntax, but it gets easier with time. First, we call .aggregate(), which takes an array of pipeline stages. Our first (and only) stage is a $group operation, and the object we hand it describes the exact shape of the result. We must specify an _id field manually because $group requires one: it's the grouping key, and grouping on a constant like "answer" lumps every document into a single group. Then we tell MongoDB to sum across "heart.beats". The dollar sign in "$heart.beats" is what tells the aggregation framework to treat the string as a field path -- "the value of the heart.beats field" -- rather than as a literal string.
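
You can see the difference the dollar sign makes by using both forms in one query: $sum: 1 adds the literal number 1 for every document (in other words, a count), while $sum: "$heart.beats" adds up the values of that field. With the 1,001 documents we inserted above:

$ db.metrics.aggregate([{ $group: { _id: "answer", docs: { $sum: 1 }, totalBeats: { $sum: "$heart.beats" } } }])
{ "_id" : "answer", "docs" : 1001, "totalBeats" : 13587 }

Now, let's get the time boundaries in there: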

$ db.metrics.aggregate([{ $group: { _id: "answer", sumOfBeats: { $sum: "$heart.beats" }, startTime: { $min: "$ts" }, endTime: { $max: "$ts"} } }])
{ "_id" : "answer", "sumOfBeats" : 13587, "startTime" : 10000, "endTime" : 14995 }

So along with $sum, we can also use $min and $max (much like their SQL equivalents). Now, how do we actually get BPM? Let's add another stage to the pipeline. From here on, we'll format the query to make it easier to read:

$ db.metrics.aggregate(
...     [
...         {
...             $group: {
...                 _id: "answer",
...                 sumOfBeats: { $sum: "$heart.beats" },
...                 startTime: { $min: "$ts" },
...                 endTime: { $max: "$ts"}
...             }
...         }, {
...             $project: {
...                 sumOfBeats: 1,
...                 elapsedTime: { $subtract: ["$endTime", "$startTime"]}
...             }
...         }
... ])
{ "_id" : "answer", "sumOfBeats" : 13587, "elapsedTime" : 4995 }

So now we have time and sum of beats. Let's go one step further!

$ db.metrics.aggregate(
...     [
...         {
...             $group: {
...                 _id: "answer",
...                 sumOfBeats: { $sum: "$heart.beats" },
...                 startTime: { $min: "$ts" },
...                 endTime: { $max: "$ts"}
...             }
...         }, {
...             $project: {
...                 bpm: {
...                     $divide: [
...                         "$sumOfBeats",
...                         {
...                             $divide: [
...                                 {$subtract: ["$endTime", "$startTime"]},
...                                 60
...                             ]
...                         }
...                     ]
...                 }
...             }
...         }
... ])
{ "_id" : "answer", "bpm" : 163.2072072072072 }

Now we have our BPM. Yes, it's extremely high, but then again the generated data includes plenty of entries with more than 19 beats per five-second interval. Maybe this person was running?
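
If you want to sanity-check that, count how many of those high readings there are; the exact number will differ on your run, since the data was generated randomly:

$ db.metrics.count({ "heart.beats": { $gt: 19 } })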

Conclusion

That's the basics of using MongoDB. It's a lightning-fast storage engine for full documents. You can do many of the query operations you already know from SQL, and more besides, in a map-reduce-inspired pipeline framework.

While the hype cycle may be over for MongoDB, it's still a valuable tool to have in your belt. It offers a nearly effortless way to store documents (or objects) and analyze them.

In addition, Mongo scales much more easily than a SQL database; it's nearly trivial to expand your document database to a dozen machines. Mongo focuses on speed, flexibility, and scalability, at the cost of things like multi-document transactions and relational semantics. That's a design tradeoff you may well want to make for your application.