Wednesday, February 26, 2014

Notes on MongoDB space allocation

This describes what I learned about space allocation in MongoDB. I am new to it, so I might be wrong, and I welcome corrections. I make some comparisons to MySQL given my many years working with it. MongoDB creates several types of files -- database files, the journal, temp files for index creation, temp files for repair and the oplog for replication. I only discuss database files in this post.

I have seen some results from comparisons with TokuMX that suggest MongoDB uses a lot more space than desired. I can't trust everything I read on the web, and maybe you can't trust what I write, but I won't discount the claims. Understanding space allocation makes it easier to figure out whether these claims might be an issue for a given workload. Here is one example where MongoDB used 10X more space than TokuMX. Here are other examples from Tokutek. And this is an example from a production workload that got between 3X and 4X better compression with TokuMX.

MongoDB uses 1+ files per database and there can be many collections per database. With MySQL/InnoDB there are options to use a file per instance or a file per table, and at times it would have been nice to use a few files per database. With MongoDB each file in the database has a fixed size and is preallocated to that full size by a background thread. Note that InnoDB will incrementally grow files and there have been stalls in the past from that (since fixed by Facebook, Percona and Oracle). Preallocation means that when a database requires space from N files it is likely that N+1 files will be created. If you are testing MongoDB to see how much space it might use for your data, be careful to use a dataset that isn't too small or the space from the preallocated file will overestimate the amount of space needed.

The sizes for database files are 64M, 128M, 256M, 512M, 1G, 2G, 2G, 2G and so on, unless the smallfiles option is set, in which case the size pattern is 16M, 32M, 64M, 128M, 256M, 512M, 512M, 512M and so on. Even with the smallfiles option it might not be a best practice to have very small databases, where small means the average database is less than 512M, because a lot of space will be wasted by file preallocation. File preallocation can be disabled by the noprealloc option but that is not recommended. I think the per-database write lock can be held during file allocation and there would be more stalls.
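
To make the progression and the preallocation overhead concrete, here is a minimal sketch in C++. The function and constants are mine, derived from the sizes listed above, not from MongoDB source.

    #include <cstdint>
    #include <cstdio>

    // Sketch of the datafile size progression described above: each new file
    // doubles in size until the cap, after which every file is cap-sized.
    int64_t datafileSize(int fileNum, bool smallfiles) {
        const int64_t MB = 1024 * 1024;
        const int64_t first = smallfiles ? 16 * MB : 64 * MB;
        const int64_t cap = smallfiles ? 512 * MB : 2048 * MB;
        int64_t size = first << fileNum;   // 64M, 128M, 256M, ...
        return size < cap ? size : cap;    // ... capped at 2G (512M with smallfiles)
    }

    int main() {
        int64_t total = 0;
        for (int i = 0; i < 8; i++) {
            total += datafileSize(i, false);
            printf("file %d: %4lld MB, total %5lld MB\n", i,
                   (long long)(datafileSize(i, false) >> 20),
                   (long long)(total >> 20));
        }
        return 0;
    }

Per the paragraph above, also expect one more preallocated file beyond what the loaded data needs, which is why a too-small test dataset overstates the space MongoDB will use per document.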

Space allocation for documents is a hot topic. Note that documents are stored contiguously in a file -- a 4MB document will be stored from file offset X to file offset X+4MB. Compare this to InnoDB, which uses no more than half of a page (16KB by default) per row and then stores large columns in overflow pages, with each spilled column stored on its own page(s). There are some cases where InnoDB can waste a lot of space from spilled blob columns that are not too large. For example, when a spilled blob column is 4KB it will use a 16KB page for itself. MongoDB might be more space efficient in that case.

Free space lists are maintained to store all unused space that can store records of size 32, 64, 128, 256, 512, 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M and 16M. Space is added to these lists when a file is created and probably when records are deleted or moved. Entries on the free space lists can have variable sizes but allocation requests are quantized (see below) to reduce the number of possible sizes. When InnoDB compression is used and there are either tables with different compression factors or both compressed and uncompressed tables, InnoDB has to manage free pages with different sizes, but the number of sizes is much smaller (16k, 8k, 4k, 2k). We had one performance bug with code in InnoDB that tried to merge adjacent pages on one list to promote them to the next larger list. I wonder how the MongoDB code handles that.
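
Here is a sketch of how I assume the free list lookup works, using the bucket sizes listed above. The names are mine, not MongoDB source.

    // Free-list bucket sizes as listed above; one list of free records
    // exists per bucket.
    static const int kBucketSizes[] = {
        32, 64, 128, 256, 512, 1024, 2048, 4096,
        8192, 16384, 32768, 65536, 131072, 262144, 524288,
        1048576, 2097152, 4194304, 16777216 };
    static const int kNumBuckets = sizeof(kBucketSizes) / sizeof(kBucketSizes[0]);

    // Return the index of the smallest bucket whose records can hold a
    // request of the given size, or -1 when it exceeds the largest bucket.
    int bucketFor(int bytes) {
        for (int i = 0; i < kNumBuckets; i++)
            if (bytes <= kBucketSizes[i])
                return i;
        return -1;
    }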

A record must be moved when it is updated and the post-update size exceeds the space originally allocated. This is why padding can be important. When space is needed for a new or a moved record there are two algorithms to decide how much space to allocate. The original algorithm uses a padding factor, so that X * paddingFactor bytes are allocated for an X byte document. The padding factor is managed per collection and can be queried but cannot be set by a DBA. It took me a few minutes to realize I could not set it. Well, it can be set during compact but not during repairDatabase (more on this later). The padding factor is kept between 1.0 and 2.0. It is decreased by 0.001 after some updates that are done in place and increased by 0.001 * (min(nIndexes, 7) + 3) after some updates that are not done in place, where "some" means a sample of 25% of the updates. Even when the padding factor is 1.0 there is likely to be some padding because space allocation is quantized. The amount of space to allocate is rounded up to a multiple of X bytes, where X is slab-size / 16 and the slab sizes are listed above (32, 64, 128, ...). There might be an open bug where allocation requests that are already a power of 2 get rounded up too much when quantized.
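
Based on that description, allocation with the original algorithm looks something like the sketch below. It reuses kBucketSizes and bucketFor from the sketch above, and the function names and the 32-byte floor are my assumptions, not MongoDB source.

    #include <algorithm>

    // Round an allocation request up to a multiple of (bucket size / 16),
    // as described above.
    int quantize(int bytes) {
        int idx = bucketFor(bytes);
        if (idx < 0)
            return bytes;                          // beyond the largest bucket; not expected
        int quantum = std::max(kBucketSizes[idx] / 16, 32);  // assumption: 32-byte floor
        return ((bytes + quantum - 1) / quantum) * quantum;
    }

    // Original algorithm: pad by the per-collection padding factor, then quantize.
    int allocationSize(int recordBytes, double paddingFactor) {
        return quantize(static_cast<int>(recordBytes * paddingFactor));
    }

    // Padding factor dynamics as described above, applied to a ~25% sample
    // of updates and clamped to [1.0, 2.0].
    double updatePaddingFactor(double pf, bool updatedInPlace, int nIndexes) {
        if (updatedInPlace)
            pf -= 0.001;
        else
            pf += 0.001 * (std::min(nIndexes, 7) + 3);
        return std::max(1.0, std::min(2.0, pf));
    }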

The new space allocation algorithm is powerOf2Sizes. The space allocated for a document is rounded up to the next power of 2. I haven't used this in production, but I assume that on average it uses 1.5X more bytes than needed, and that the overhead from a fragmented b-tree is similar. With the old approach (padding factor) there isn't much padding on a new collection and with the new approach (powerOf2Sizes) there is a lot of padding. As powerOf2Sizes can be set per collection, I expect that a common pattern will be to set it prior to a reload for collections that will get frequent changes after the load, and to not set it for collections that will be read-only after a load. Note that using powerOf2Sizes for a read-only collection will waste a lot of space. It can take time for MongoDB to settle on a good value for the padding factor of a collection, and the good value can be lost during a dump and reload. It is preserved by the compact command but not by repairDatabase or an explicit dump and reload. There might be a few feature requests here.
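
A sketch of the powerOf2Sizes rule, including the large-document behavior described in the comments below (rounding to the next 1MB boundary past 4MB). Again, the name and the 32-byte floor are my assumptions, not MongoDB source.

    // powerOf2Sizes: round the request up to the next power of two. Per the
    // comments below, beyond 4MB the request is instead rounded up to the
    // next 1MB boundary, so an 8MB+1 document gets 9MB rather than 16MB.
    int powerOf2AllocationSize(int bytes) {
        if (bytes > 4 * 1024 * 1024)
            return ((bytes + (1 << 20) - 1) >> 20) << 20;  // next 1MB boundary
        int size = 32;                                     // assumption: 32-byte floor
        while (size < bytes)
            size *= 2;
        return size;
    }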

MongoDB doesn't support compression for database files. Support for compression of read-only collections will be much easier to implement than for read-write collections. Maybe we will see something similar to MyISAM, where compression is only implemented for read-only tables. Note that TokuMX supports compression for read-write collections. InnoDB is able to do compression for read-write tables, and a big reason this was possible is the ability to do a page split when a page does not compress by the expected amount. This requires clustered indexes and MongoDB doesn't use clustered indexes.

MongoDB doesn't return space to the OS when there is free space at the tail of a large file. InnoDB also lacks this feature, but it has never been a priority for me. The compact command defragments one collection and there are 1+ collections per database. The docs state that it can use up to 2G of extra space when running. I assume it might be writing into new extents when copying out from fragmented extents, so the result is that compact can leave the database with more files on completion, but with much more free space in the old files. repairDatabase appears to have similar behavior in terms of file allocation. It defragments all collections in a database. It might be OK to describe this as an incremental dump and reload, where space from the input files is released as new files are created. I like the incremental nature with respect to file allocation.

Last but definitely not least, there are no checksums on database pages. There are checksums on journal writes. Page checksums are extremely important and have helped identify bad storage devices on many occasions. I prefer not to support a database running at scale without them.
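
For readers who have not relied on them, the idea is simple: compute a checksum when a page is written, store it in the page, then recompute and compare on read. A minimal sketch with a hypothetical page layout; a real implementation would use something like CRC32 rather than the FNV-1a hash used here for brevity.

    #include <cstdint>
    #include <cstddef>

    // Hypothetical page layout: a checksum in the header protects the payload.
    struct Page {
        uint32_t checksum;
        char data[4096 - sizeof(uint32_t)];
    };

    static uint32_t fnv1a(const char* p, size_t n) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < n; i++) {
            h ^= static_cast<unsigned char>(p[i]);
            h *= 16777619u;
        }
        return h;
    }

    // On write: seal the page with a checksum over the payload.
    void sealPage(Page* pg) { pg->checksum = fnv1a(pg->data, sizeof(pg->data)); }

    // On read: false means the page changed between write and read,
    // often a sign of a bad storage device.
    bool verifyPage(const Page* pg) {
        return pg->checksum == fnv1a(pg->data, sizeof(pg->data));
    }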

8 comments:

  1. Nice detailed report. One thing I noticed while benchmarking, and that I'm interested in hearing others' opinions on w.r.t. powerOf2Sizes, is how much it impacts the working set a system can maintain, because with mmap() any fragmentation on disk will also become fragmentation in memory (less so for very large documents). Reminds me of jemalloc's size classes, but once jemalloc gets past 2MB it switches from exponential sizing to linear sizing. Did you happen to learn how MongoDB allocates very large documents, if it's done differently? For example, presumably an 8MB document gets allocated 8MB. How much does an 8MB + 1 byte document get?

  2. With powerOf2Sizes, beyond X MB allocation is rounded up to the next MB. I think X == 4. So an 8MB+1 document gets 9MB.

    int NamespaceDetails::getRecordAllocationSize( int minRecordSize ) {

        if ( isCapped() )
            return minRecordSize;

        if ( _paddingFactor == 0 ) {
            warning() << "implicit upgrade of paddingFactor of very old collection" << endl;
            setPaddingFactor(1.0);
        }
        verify( _paddingFactor >= 1 );

        if ( isUserFlagSet( Flag_UsePowerOf2Sizes ) ) {
            // quantize to the nearest bucketSize (or nearest 1mb boundary for large sizes).
            return quantizePowerOf2AllocationSpace(minRecordSize);
        }

        // adjust for padding factor
        return static_cast<int>(minRecordSize * _paddingFactor);
    }

  3. How many dirty 4k pages get written back to disk when a small counter gets updated in place on that 8MB doc? I don't know that yet.

  4. Should just be the one the counter lies in, unless it's on a boundary. The serialized BSON format doesn't require you to change anything in other pages when you increment one number.

    $push and $addToSet need to update the array length stored at the beginning of the array, so they'll hit that page in addition to adding however many pages at the end.

  5. Actually, they'll update multiple sizes depending on the nesting of the document.

  6. Nope, I'm wrong, apparently they rewrite the whole array: https://jira.mongodb.org/browse/SERVER-12886

    1. This is likely to greatly limit the ability to use an array in a document for a read-write workload, instead of something like a relationship table in SQL.

  7. Fixed mistakes courtesy of feedback from MongoDB experts -- the free list has a bucket for 16M, journal writes have checksums, typo in paddingFactor changes.
