MongoTips by John Nunemaker

<h1>Another ObjectId Trick (2012-05-30)</h1>

<p>In which I share why and when you should use objects for _id.</p>

<p>A while back I posted <a href="http://mongotips.com/b/a-few-objectid-tricks/">A Few ObjectId Tricks</a>. While not a new trick, this one is relatively new to me and quite useful.</p>

<p>As you know, the key _id is automatically indexed for each document in a collection. What you might not know is that <strong>you can use objects for _id</strong>.</p>

<p>It seems obvious now, but I had not thought of it. A few days back, I re-read foursquare&#8217;s article on <a href="http://engineering.foursquare.com/2011/02/09/mongodb-strategies-for-the-disk-averse/">MongoDB Strategies for the Disk-Averse</a>. They make heavy use of objects as _id in their analytics.</p>

<p>I work on <a href="http://get.gaug.es/">Gauges</a>, a web analytics tool, and I use several techniques to store data more efficiently. What I have been doing up until now is string _id&#8217;s with multiple values smushed together:</p>

<pre><code class="javascript">{"_id": "&lt;oid&gt;:&lt;hash&gt;"}</code></pre>

<p>oid is typically a <span class="caps">BSON</span> object id and hash is some kind of hash of whatever I&#8217;m doing a write for. I do writes against _id, and the hash determines uniqueness without storing the full url of the page we are tracking.</p>

<p>The crappy part is that strings can&#8217;t be serialized as efficiently as an object id or an integer. The foursquare article pointed out that I could just store my _id&#8217;s like this:</p>

<pre><code class="javascript">{"_id": {"i": &lt;oid&gt;, "h": &lt;hash&gt;}}</code></pre>

<p>So what is the benefit of objects as _id&#8217;s? I see at least four.</p>

<h2>1. Fewer indexes</h2>

<p>If you were not using objects or concatenated values for _id, those values would need to live in keys inside the document. If you wanted to write against those keys, you would need to index them. This means you would have an _id that is indexed and also a secondary compound index (ie: [[i, 1], [h, 1]]).</p>

<p>More indexes mean more writes and more <span class="caps">RAM</span>. Using an object for _id or a concatenated value saves this extra index, which saves you writes and <span class="caps">RAM</span> on the server side.</p>
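<p>To make that concrete, here is a minimal sketch of the two approaches, using the same pre-2.0 Ruby driver as the examples below. The collection and field names are made up for illustration:</p>

<pre><code class="ruby">require 'rubygems'
require 'mongo'

col  = Mongo::Connection.new.db('test')['hits']
oid  = BSON::ObjectId.new
hash = 123456789

# Without objects as _id: the values live in their own keys, so
# writing against them requires a secondary compound index on top
# of the automatic _id index.
col.create_index([['i', Mongo::ASCENDING], ['h', Mongo::ASCENDING]])
col.update({:i =&gt; oid, :h =&gt; hash}, {'$inc' =&gt; {:v =&gt; 1}}, :upsert =&gt; true)

# With an object as _id: the automatic _id index covers the write,
# so the secondary index (and the RAM it eats) goes away.
col.update({:_id =&gt; {:i =&gt; oid, :h =&gt; hash}}, {'$inc' =&gt; {:v =&gt; 1}}, :upsert =&gt; true)</code></pre>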
<h2>2. Document Size</h2>

<p>Object ids and integers can be serialized in <span class="caps">BSON</span> more efficiently than strings. Switching from a string that is a mashup of object id and hash to an object can cut several bytes from each document.</p>

<pre><code class="ruby">require 'rubygems'
require 'bson'

oid = BSON::ObjectId.new

puts BSON.serialize(:_id =&gt; "#{oid}:123456789").size         # 49
puts BSON.serialize(:_id =&gt; {:i =&gt; oid, :h =&gt; 123456789}).size # 37</code></pre>

<p>In the example above, the size difference was 37 versus 49 bytes. That saves almost a quarter of the document size, simply by using an object instead of a string mashup. Your mileage may vary, but applied across the millions of documents we track every month, this is a non-trivial amount of savings.</p>

<h2>3. More Query-able (than concatenated values)</h2>

<p>The example below shows a range query using objects as _id, with greater than or equal to and less than or equal to. This would be painful with concatenated values in some scenarios and impossible in others.</p>

<pre><code class="ruby">require 'pp'
require 'rubygems'
require 'mongo'

conn = Mongo::Connection.new
db   = conn.db('test')
col  = db['test']
col.remove

oid = BSON::ObjectId.new

(1..3).each do |day|
  col.save(:_id =&gt; {:i =&gt; oid, :d =&gt; day})
end

# Get all documents
pp col.find.to_a
# [{"_id"=&gt;{"i"=&gt;BSON::ObjectId('4fc62a0f4c114f273c000001'), "d"=&gt;1}},
#  {"_id"=&gt;{"i"=&gt;BSON::ObjectId('4fc62a0f4c114f273c000001'), "d"=&gt;2}},
#  {"_id"=&gt;{"i"=&gt;BSON::ObjectId('4fc62a0f4c114f273c000001'), "d"=&gt;3}}]

# Only get those matching a given day for a given oid
pp col.find({
  :_id =&gt; {
    '$gte' =&gt; {:i =&gt; oid, :d =&gt; 1},
    '$lte' =&gt; {:i =&gt; oid, :d =&gt; 1},
  },
}).to_a
# [{"_id"=&gt;{"i"=&gt;BSON::ObjectId('4fc62a0f4c114f273c000001'), "d"=&gt;1}}]</code></pre>

<p>We could change the $lte :d value to 2 or 3 and those documents would then be included as well. Sure, you can query roughly the same thing using strings, but you would need to generate all the _id&#8217;s and use a $in query to pull them all out. An added bonus is that you get the documents back sorted ascending, which is what I have wanted when using objects as _id&#8217;s.</p>

<p><small><strong>Note</strong>: if you try to execute the code above on Ruby 1.8, it will not work. Ruby 1.8 hashes are not ordered, and ordering is important for _id objects. Use <span class="caps">BSON</span>::OrderedHash instead of a plain hash if you are on 1.8. The same applies in any other language whose hashes or dictionaries are not ordered by default.</small></p>
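<p>For those stuck on 1.8, the fix looks like this. A minimal sketch with the same driver; the collection name is made up:</p>

<pre><code class="ruby">require 'rubygems'
require 'mongo'

col = Mongo::Connection.new.db('test')['test']
oid = BSON::ObjectId.new

# A plain hash is unordered on Ruby 1.8, so build the _id with
# BSON::OrderedHash to guarantee :i serializes before :d.
id = BSON::OrderedHash.new
id[:i] = oid
id[:d] = 1

col.save(:_id =&gt; id)</code></pre>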
<h2>4. Objects are easier to grok</h2>

<p>Values that are concatenated are not very intent revealing. With objects, however, the keys reveal what the values are.</p>

<p>Granted, if you shorten the keys to save space, they might not reveal as much, but typically in this scenario you create a mapping of short to long keys, which can be used to quickly deduce the key names and thus what the values are.</p>

<p>This point might not be a major one, but objects definitely feel cleaner and more obvious to me than concatenated values.</p>

<h1>Lower Lock % and Number of Slow Queries (2012-01-29)</h1>

<p>In which I share a simple trick for speeding up writes.</p>

<p>Gauges tracks several websites. Some get a lot of traffic and others don&#8217;t. The sites that get a lot of traffic tend to stay hot and sit in <span class="caps">RAM</span>. The sites that get little traffic eventually get pushed out of <span class="caps">RAM</span>.</p>

<p>This is why, for Gauges, we can have 1GB of <span class="caps">RAM</span> on the server and over 14GB of indexes, yet Mongo hums along. Better yet, of that 1GB of <span class="caps">RAM</span>, <strong>we only use around 175MB</strong>.</p>

<h2><span class="caps">RAM</span>, What Needs to Fit There</h2>

<p>You have probably heard that MongoDB recommends keeping all data in <span class="caps">RAM</span> if you can. If not, they suggest at least keeping your indexes in <span class="caps">RAM</span>.</p>

<p>What I haven&#8217;t heard people say as much is that <strong>you really just need to keep your active data or indexes in <span class="caps">RAM</span></strong>. For stats, the active data is the current day or week, which is a much smaller data set than all of the data for all time.</p>

<h2>Write Heavy</h2>

<p>The other interesting thing about Gauges is that we are extremely write heavy, as you would expect for a stats app. Requests from the tracking script loading on a website are <strong>over 95%</strong> of all requests to Gauges.</p>

<p>Some of these track requests are for sites that rarely get hit, whose data has been pushed out of <span class="caps">RAM</span> and is just sitting on disk.</p>

<h2>Global Lock is Global</h2>

<p>As you probably know, MongoDB has a global lock. The longer your writes take, the higher your lock percentage is. <strong>Updating documents that are in <span class="caps">RAM</span> is super fast</strong>.</p>

<p>Updating a document that has been pushed to disk means it first has to be read from disk into memory, updated, then written back to disk. This operation is slow and happens while inside the lock.</p>

<p>Updating a lot of random documents that rarely get updated and have been pushed out of <span class="caps">RAM</span> <strong>can lead to slow writes and a high lock percentage</strong>.</p>

<h2>More Reads Make For Faster Writes</h2>

<p>The trick to lowering your lock percentage, and thus having faster updates, is to <strong>query the document you are going to update before you perform the update</strong>. Querying before doing an upsert might seem counterintuitive at first glance, but it makes sense when you think about it.</p>

<p>The read ensures that whatever document you are going to update is in <span class="caps">RAM</span>. This means the update, which happens immediately after the read, always updates the document in <span class="caps">RAM</span>, which is super fast. I think of it as <strong>warming the database for the document you are about to update</strong>.</p>
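<p>In code, the pattern is tiny. A minimal sketch with the pre-2.0 Ruby driver; the collection and field names are made up and the real Gauges code differs:</p>

<pre><code class="ruby">require 'rubygems'
require 'mongo'

col = Mongo::Connection.new.db('stats')['page_views']
id  = {:i =&gt; BSON::ObjectId.new, :h =&gt; 123456789}

# Warm the database: this read pulls the document into RAM if it
# was sitting on disk. The result is thrown away.
col.find_one(:_id =&gt; id)

# The upsert now touches a document that is already in memory, so
# the time spent inside the lock stays small.
col.update({:_id =&gt; id}, {'$inc' =&gt; {:views =&gt; 1}}, :upsert =&gt; true)</code></pre>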
<p>Based on these graphs, I am pretty sure you will be able to tell that it was the evening of January 27th when I started pushing the query-before-update changes out:</p>

<p><strong>Lock Percentage</strong><br />
<img src="/assets/4f24cf21dabe9d6067004981/article_full_width/mongo_lock_percentage.png" alt="" /></p>

<p><strong>Slow Queries</strong><br />
<img src="/assets/4f24cf22dabe9d3d2400f6d4/article_full_width/mongo_slow_queries.png" alt="" /></p>

<p>Obviously, this dramatically increased the number of queries we perform, but it has added less than a few milliseconds to our application response time, and the database is happier.</p>

<p><strong>Number of Reads/Writes</strong><br />
<img src="/assets/4f24cf22dabe9d3d2400f6c6/article_full_width/mongo_ops.png" alt="" /><br />
<small>Reads are blue and writes are yellow.</small></p>

<p>Granted, my explanation above is simplistic, but you get the gist. Querying before updating ensures that the updated document is in <span class="caps">RAM</span> and that the update is fast. I remember reading about this somewhere and kept thinking I should try it. Finally, I did, and it definitely helped.</p>

<p><strong>Note:</strong> At the time of this writing, we are running MongoDB 1.8.x. MongoDB 2.x has significant improvements with regards to locking and pulling documents from disk, but that is not to say this technique won&#8217;t still help.</p>

<p>If you want to learn more about how we use MongoDB for Gauges, you can check out my <a href="/b/mongodb-for-analytics/">MongoDB for Analytics presentation</a>.</p>

<h1>MongoDB for Analytics (2011-12-01)</h1>

<p>Just over a month ago, I presented on storing stats in MongoDB at MongoChi 2011. 10gen posted the video recently, so I thought I would share it here. The first 5 minutes or so are a bit rough, but it improves after that. I&#8217;ll be presenting the same thing at MongoSV next week if you want to catch it live.</p>

<p>I&#8217;ve also embedded the slides below.</p>

<script async class="speakerdeck-embed" data-id="4e9c82ee14adad0051005495" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>

<h1>Replacing RabbitMQ with MongoDB (2011-10-08)</h1>

<p>Interesting post on the Boxed Ice blog about how they are using MongoDB capped collections for a queue system.</p>

<p>Until today I did not realize you could index a capped collection. Not sure why I was confused about that, but now I totally understand how capped collections would work great for a queue.</p>
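<p>For anyone else who was confused like me, here is a minimal sketch of the idea in Ruby. The names and sizes are made up, and this is not Boxed Ice&#8217;s actual code:</p>

<pre><code class="ruby">require 'rubygems'
require 'mongo'

db = Mongo::Connection.new.db('queues')

# Capped collections are fixed-size and preserve insertion order,
# which is what makes them queue friendly.
jobs = db.create_collection('jobs', :capped =&gt; true, :size =&gt; 10 * 1024 * 1024)

# And yes, capped collections can be indexed like any other.
jobs.create_index('done')

jobs.insert(:payload =&gt; 'send_email', :done =&gt; 0)
next_job = jobs.find_one(:done =&gt; 0)</code></pre>

<p>An integer flag stands in for deleting documents here, because you cannot remove documents from a capped collection or grow them in place.</p>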
<h1>Counters Everywhere, Part 2 (2011-09-13)</h1>

<p>For those that follow me here, but not over at <a href="http://railstips.org">RailsTips</a>, I recently published a new post on storing stats in MongoDB. The topic of this post is storing large ranges of data.</p>

<h1>MongoDB 2.0 Released (2011-09-13)</h1>

<p>Yesterday, MongoDB 2.0 was released. Personally, I am excited about the concurrency improvements and the more compact index storage. Also, for the haters, journaling is now on by default. Great release!</p>

<h1>Counters Everywhere, Storing Stats in Mongo (2011-06-28)</h1>

<p>Over at RailsTips, I posted on how I use a single document to store a lot of good stuff in <a href="http://gaug.es">Gaug.es</a>, the real-time stat app we are making at <a href="http://orderedlist.com">Ordered List</a>.</p>

<h1>Customizing Your Mongo Shell Prompt (2011-06-28)</h1>

<p>Sometimes I think all I do is link to Kristina, but she has written several great posts, so enjoy another.</p>

<h1>MongoHQ Funded (2011-06-25)</h1>

<p>MongoHQ has raised $417K from a few investment firms to help grow the service.</p>

<h1>Geo Spatial Indexing (2011-06-25)</h1>

<p>In-depth post on how Mongo&#8217;s geospatial indexing works.</p>

<h1>A Few ObjectId Tricks (2011-03-02)</h1>

<p>In which I share a few object id tricks.</p>

<p>One of the things I was not aware of until recently is how handy <a href="http://mongodb.org/">MongoDB&#8217;s</a> object ids actually are. Below are a few tips based on some things I have been doing lately.</p>

<h2>To and From Strings</h2>

<p>First off, it is quite simple to switch back and forth between object id and string, which is useful in <span class="caps">JSON</span>/<span class="caps">XML</span> serialization and in finding Mongo documents from params.</p>

<pre><code class="ruby">id = BSON::ObjectId.new
# =&gt; BSON::ObjectId('4d6e5acebcd1b3fac9000002')

id.to_s
# =&gt; "4d6e5acebcd1b3fac9000002"

BSON::ObjectId.from_string(id.to_s)
# =&gt; BSON::ObjectId('4d6e5acebcd1b3fac9000002')</code></pre>

<h2>Generation Time</h2>

<p>Switching back and forth between strings is simple and obvious. Something a little more interesting is that pretty much every driver supports extracting the generation time from an object id. This means you can stop storing created_at in your Mongo documents and instead just pull it from the object id. We do this in <a href="http://gaug.es">Gaug.es</a> quite often.</p>

<pre><code class="ruby">BSON::ObjectId.new.generation_time
# =&gt; 2011-03-02 15:01:08 UTC</code></pre>

<p>The generation time is <span class="caps">UTC</span>, and you can easily use ActiveSupport&#8217;s awesome TimeZone stuff to move the time into different zones.</p>

<pre><code class="ruby">id = BSON::ObjectId.new
id.generation_time.in_time_zone(Time.zone)</code></pre>

<p>Miss created_at? Just make a simple method that wraps this and returns the generation time in the current zone.</p>

<pre><code class="ruby">def created_at
  id.generation_time.in_time_zone(Time.zone)
end</code></pre>

<h2>Date Range Queries</h2>

<p>Where it gets even more interesting is that you can also create object id&#8217;s from a time. This means you can use object id&#8217;s in range queries, say to find out how many people signed up for your sweet new app today.</p>

<pre><code class="ruby">sites = Gauges.db['sites'] # get Mongo::Collection

start_at = Time.zone.now.beginning_of_day
time     = BSON::ObjectId.from_time(start_at)

sites.find('_id' =&gt; {'$gte' =&gt; time}).count</code></pre>

<p>Above I just did a count query, but you could iterate the documents as with any normal Mongo query. The benefit is that _id is automatically indexed, so you do not need an extra field plus an extra index.</p>
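<p>You can bound both ends of the range too. For example, counting signups for one specific day might look like this (a sketch with a hard-coded day, continuing the snippet above and leaning on ActiveSupport just like it does):</p>

<pre><code class="ruby">start_at = Time.zone.parse('2011-03-01').beginning_of_day
end_at   = start_at + 1.day

# Both bounds come from object ids generated from times, so the
# automatic _id index still covers the whole query.
sites.find('_id' =&gt; {
  '$gte' =&gt; BSON::ObjectId.from_time(start_at),
  '$lt'  =&gt; BSON::ObjectId.from_time(end_at),
}).count</code></pre>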
<p>Nothing earth shattering here, but I have found these really helpful on <a href="http://gaug.es">Gaug.es</a> and thought I would share.</p>

<h1>MongooseJS 1.0 (2011-02-01)</h1>

<p>Mongoose, the MongoDB object modeling tool for node.js that powers <a href="https://www.learnboost.com/">LearnBoost</a>, just hit 1.0.</p>

<h1>MongoDB 1.8 Map/Reduce Changes (2011-01-27)</h1>

<p>Map/reduce output changed in 1.7.5 (and thus on up). This post does a great job explaining why temporary collections are no longer an option and how to use map/reduce with the updates.</p>

<h1>Single Server Durability (2011-01-27)</h1>

<p>The MongoDB team has released 1.7.5 (unstable), which includes single server durability. Using the <code>--dur</code> option on startup turns on journaling. The <a href="http://www.mongodb.org/display/DOCS/Journaling">documentation on journaling</a> is also a good read.</p>

<h1>Why Command Helpers Suck (2011-01-25)</h1>

<p>Kristina Chodorow posts about why the command helpers are harmful and what you should be doing instead.</p>

<blockquote>
<p>Every command helper is just a find() in disguise! This means you can do (almost) anything with a database command that you could with a query.</p>
</blockquote>
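<p>Her point is easy to demonstrate from Ruby. A minimal sketch, with the count command standing in for any helper; the database and collection names are made up:</p>

<pre><code class="ruby">require 'rubygems'
require 'mongo'

db = Mongo::Connection.new.db('test')

# The command helper...
db.command(:count =&gt; 'users')

# ...is just a find_one against the virtual $cmd collection.
db['$cmd'].find_one(:count =&gt; 'users')</code></pre>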