Performance tuning MongoDB

In Database, Distributed Systems, Software Development by Prabhu Missier

MongoDB is a highly scalable, distributed document database which can be scaled horizontally without changes to application logic. However, there are ways in which the performance of this very powerful database, with its JSON-like documents, can be tuned even further. Read on.

1) Examine query patterns
Optimizing performance starts with understanding your database’s query patterns and profile. Look at the query patterns and see whether documents can be embedded instead of relying on application-level joins, which are expensive.
Once the query patterns are understood, you could cache certain intermediate results or create indexes on fields which are regularly queried.

You can start by looking at the MongoDB log. By default MongoDB logs all queries taking longer than 100 ms, but this threshold can be overridden. Look for lines containing the word “COMMAND”, with the execution time in milliseconds at the end.
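As a rough illustration of scanning the log, here is a minimal sketch in plain JavaScript. It assumes the older plain-text log format in which the duration is the last token on the line; MongoDB 4.4 and later write structured JSON log entries, so the parsing would need adapting there. The function name and the sample lines are hypothetical and abbreviated.

```javascript
// Minimal sketch: pick slow COMMAND entries out of plain-text mongod log lines.
// Assumption: the duration is the final token on the line, e.g. "215ms".
function findSlowCommands(lines, thresholdMs) {
  const slow = [];
  for (const line of lines) {
    if (!line.includes("COMMAND")) continue;
    const match = /(\d+)ms\s*$/.exec(line);
    if (match && Number(match[1]) >= thresholdMs) {
      slow.push({ durationMs: Number(match[1]), line });
    }
  }
  // Slowest entries first
  return slow.sort((a, b) => b.durationMs - a.durationMs);
}

// Hypothetical, abbreviated log lines for illustration only.
const sampleLines = [
  '2019-03-01T10:00:01.000+0000 I COMMAND [conn12] command test.user command: find { find: "user" } planSummary: COLLSCAN docsExamined:100000 nreturned:12 215ms',
  "2019-03-01T10:00:02.000+0000 I NETWORK [conn13] end connection",
  '2019-03-01T10:00:03.000+0000 I COMMAND [conn12] command test.user command: find { find: "user" } planSummary: IXSCAN { country: 1 } docsExamined:12 nreturned:12 104ms',
];

const slowCommands = findSlowCommands(sampleLines, 100);
console.log(slowCommands.map((entry) => entry.durationMs));
```

Sorting by duration puts the worst offenders at the top, which is usually where tuning effort pays off first.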

In the same vein, MongoDB provides an explain facility which you can append to your queries to get detailed JSON output about query behaviour. For example:

db.user.find({ country: 'AU', city: 'Melbourne' }).explain('executionStats');

Two very important metrics to look at are the number of documents returned and the number of documents actually scanned. If you are scanning far more documents than you are returning, it’s time to look at what could be wrong. Creating an index on the queried fields may help.
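The comparison of scanned versus returned documents can be sketched as a small helper. It reads the nReturned, totalDocsExamined, totalKeysExamined and executionTimeMillis fields that explain reports under executionStats. The helper names, the sample numbers and the ratio threshold of 10 are assumptions chosen for illustration.

```javascript
// Summarize the executionStats section of an explain() result.
function summarizeExplain(explainOutput) {
  const stats = explainOutput.executionStats;
  const returned = stats.nReturned;
  const examined = stats.totalDocsExamined;
  // When nothing was returned, fall back to the raw examined count
  // so that we never divide by zero.
  const ratio = returned === 0 ? examined : examined / returned;
  return {
    returned,
    examined,
    keysExamined: stats.totalKeysExamined,
    timeMs: stats.executionTimeMillis,
    ratio,
  };
}

// Assumed rule of thumb: flag queries that examine many more documents than they return.
function needsAttention(summary, maxRatio = 10) {
  return summary.ratio > maxRatio;
}

// Hypothetical explain output for an unindexed query.
const unindexed = {
  executionStats: { nReturned: 25, totalDocsExamined: 50000, totalKeysExamined: 0, executionTimeMillis: 180 },
};
// Hypothetical explain output for the same query once an index is in place.
const indexed = {
  executionStats: { nReturned: 25, totalDocsExamined: 25, totalKeysExamined: 25, executionTimeMillis: 2 },
};

console.log(summarizeExplain(unindexed).ratio, needsAttention(summarizeExplain(unindexed)));
console.log(summarizeExplain(indexed).ratio, needsAttention(summarizeExplain(indexed)));
```

A ratio close to 1 means the query touches little more than what it returns. The right threshold depends on the workload.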

2) Pay attention to database schemas
JSON documents allow you to model different types of data, be it time-series data, key-value pairs, columnar structures or graph edges and nodes. Whatever the data, it pays to understand what type of data you are dealing with and to model the relationships within it.
Should all the data be embedded in the same document, or should there be multiple documents with references to each other?

3) Embedding and Referencing
MongoDB has a very flexible schema, but this doesn’t mean that planning a schema or doing data modelling is unnecessary. On the contrary, the flexibility afforded by MongoDB should enable you to design creative schemas which optimize the relationships between documents. Figure out the schema at the very beginning of a project, and look at how sub-documents can be embedded or references between documents can be defined.

For example, consider a one-to-many relationship between a product and the review comments about it. Does it make sense for the product document to hold a reference to the documents containing the comments, or to embed all the comments in the product document itself? Remember, if data is frequently accessed together, that is a case for storing it together in the same document.
Embedding generally provides better performance for read operations due to data locality.

Look for patterns in the data that could be queried by the client application. Should the entire document be returned or just a few fields? Also which are the fields that are never part of the result set? Does that indicate a possible tweak to the schema?

Referencing makes sense when there’s a many-to-many relationship. Referencing should also be used when a document is frequently accessed but contains data that is rarely used, or when some parts of a document are updated often while the rest never changes.
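The two modelling options for the product and review example can be sketched as plain document shapes. The field names and values here are hypothetical. The attachReviews function stands in for the application-level join that the referenced model requires; in a real deployment that step would cost an extra query.

```javascript
// Option 1: embedded. The reviews live inside the product document,
// so a single read returns everything.
const embeddedProduct = {
  _id: "p1",
  name: "Kettle",
  reviews: [
    { author: "ann", rating: 5, text: "Boils fast" },
    { author: "bob", rating: 3, text: "A bit loud" },
  ],
};

// Option 2: referenced. Reviews are separate documents pointing back at the product.
const product = { _id: "p1", name: "Kettle" };
const reviews = [
  { _id: "r1", productId: "p1", author: "ann", rating: 5, text: "Boils fast" },
  { _id: "r2", productId: "p1", author: "bob", rating: 3, text: "A bit loud" },
  { _id: "r3", productId: "p2", author: "cat", rating: 4, text: "Solid" },
];

// Application-level join: the extra work the client does with the referenced model.
function attachReviews(productDoc, reviewDocs) {
  const own = reviewDocs
    .filter((review) => review.productId === productDoc._id)
    .map(({ author, rating, text }) => ({ author, rating, text }));
  return { ...productDoc, reviews: own };
}

const joined = attachReviews(product, reviews);
// Both models can produce the same view; they differ in how many reads it takes.
console.log(JSON.stringify(joined) === JSON.stringify(embeddedProduct));
```

The referenced shape keeps the product document small and lets reviews change independently, at the price of the join step.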

4) Try and determine memory usage
Database performance is best when the working set, which consists of the frequently accessed data and indexes, fits into RAM. Once the working set exceeds the available RAM, read activity from disk increases and performance starts dropping. When this happens you might need to increase the RAM on your server instance, or set up your database instances to autoscale if this option is available to you.
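One rough signal of memory pressure is how full the storage engine cache is. The sketch below assumes the WiredTiger storage engine and reads the cache counters found under wiredTiger.cache in the output of db.serverStatus(). The helper name and the sample numbers are hypothetical.

```javascript
// Compute how full the WiredTiger cache is from a serverStatus()-style object.
function cacheUsage(serverStatus) {
  const cache = serverStatus.wiredTiger.cache;
  const usedBytes = cache["bytes currently in the cache"];
  const maxBytes = cache["maximum bytes configured"];
  return {
    usedBytes,
    maxBytes,
    fillRatio: usedBytes / maxBytes,
    pagesReadIntoCache: cache["pages read into cache"],
  };
}

// Hypothetical sample: 3 GiB used out of a 4 GiB cache.
const GIB = 1024 ** 3;
const sampleStatus = {
  wiredTiger: {
    cache: {
      "bytes currently in the cache": 3 * GIB,
      "maximum bytes configured": 4 * GIB,
      "pages read into cache": 120000,
    },
  },
};

const usage = cacheUsage(sampleStatus);
console.log(`cache is ${(usage.fillRatio * 100).toFixed(0)}% full`);
```

A cache that stays nearly full while the count of pages read into it keeps climbing suggests the working set no longer fits in memory. Treat it as a hint to investigate, not a verdict.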

Another option is to shard the data across multiple servers using a shard key. Since data is split across servers depending on the value of the shard key, queries that include the shard key can be routed to just the servers holding the relevant data, and overall throughput is higher.
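Why the shard key makes access more targeted can be shown with a toy model. This is not MongoDB’s actual routing code; the chunk ranges, shard names and function name are all made up for illustration. The idea is that a query carrying the shard key maps to one range, while a query without it has to be sent to every shard.

```javascript
// Toy chunk map: each chunk covers [min, max) of the shard key and lives on one shard.
const chunks = [
  { min: "A", max: "H", shard: "shard0" },
  { min: "H", max: "P", shard: "shard1" },
  { min: "P", max: "\uffff", shard: "shard2" },
];

// Decide which shards a query must visit, given the name of the shard key field.
function targetShards(query, shardKeyField) {
  const value = query[shardKeyField];
  if (value === undefined) {
    // No shard key in the query: every shard has to be asked.
    return [...new Set(chunks.map((chunk) => chunk.shard))];
  }
  // Shard key present: only the chunk whose range contains the value is relevant.
  return chunks
    .filter((chunk) => value >= chunk.min && value < chunk.max)
    .map((chunk) => chunk.shard);
}

console.log(targetShards({ country: "AU" }, "country"));
console.log(targetShards({ city: "Melbourne" }, "country"));
```

The first query lands on a single shard, the second fans out to all three. This is why the shard key should match the fields your most frequent queries filter on.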

5) Data replication
A mainstay of database performance is data replication. In MongoDB a replica set consists of a primary node and secondary nodes which replicate the primary’s data. The primary handles the writes, while reads can optionally be served by the secondaries.
Replica sets improve reliability by offering redundancy, and can help spread the read load. Replica sets also allow the application to read from the closest server, thereby reducing read latency.
In the event of the primary node failing, MongoDB elects a new primary from amongst the nodes in the replica set, thereby ensuring continued availability.
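Reading from the closest member is typically requested through the connection string. The sketch below builds such a string with the standard replicaSet and readPreference options. The host names, database name, replica set name and helper function are hypothetical.

```javascript
// Build a replica set connection string with a read preference.
function buildReplicaSetUri({ hosts, database, replicaSet, readPreference }) {
  const params = new URLSearchParams({ replicaSet, readPreference });
  return `mongodb://${hosts.join(",")}/${database}?${params.toString()}`;
}

// Hypothetical three-member replica set.
const uri = buildReplicaSetUri({
  hosts: ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
  database: "shop",
  replicaSet: "rs0",
  // "nearest" asks the driver to read from the lowest-latency member.
  readPreference: "nearest",
});

console.log(uri);
```

Keep in mind that secondaries replicate asynchronously, so reads that are not served by the primary may return slightly stale data.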

6) Build appropriate indexes
MongoDB benefits from indexes just like its relational counterparts. Indexes allow quick retrievals, as opposed to having to scan the entire collection for a result. The majority of your indexes are likely to be on single fields, but compound indexes can also be created.

For example:

db.user.createIndex({ country: 1 });
db.user.createIndex({ country: 1, city: 1 });

Indexes are also helpful when returning sorted results. However, the index should match the sort order. Let’s say you have an index on country name but return results sorted by country name and city name. The country index alone cannot deliver that order, so MongoDB has to sort the results itself in memory. Latency could shoot up, and in some cases the sort could exceed MongoDB’s in-memory sort limit (32 MB in older releases, 100 MB from MongoDB 4.4 onwards). In such a case you would need to define a compound index on both the country and city names.
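The matching rule between an index and a sort can be sketched as a small check. This is a simplified version, assuming the query has no equality filters that would let the sort skip leading index fields: the sort fields must be a prefix of the index fields, in the same order, with directions either all matching or all reversed. The function name is hypothetical.

```javascript
// Simplified check: can this index deliver documents in the requested sort order?
function indexSupportsSort(indexKeys, sortSpec) {
  const indexEntries = Object.entries(indexKeys);
  const sortEntries = Object.entries(sortSpec);
  if (sortEntries.length === 0 || sortEntries.length > indexEntries.length) {
    return false;
  }
  // Sort fields must line up with the leading index fields, in order.
  const sameFields = sortEntries.every(([field], i) => indexEntries[i][0] === field);
  if (!sameFields) return false;
  // The index can be walked forwards (same directions) or backwards (all reversed).
  const forwards = sortEntries.every(([, dir], i) => dir === indexEntries[i][1]);
  const backwards = sortEntries.every(([, dir], i) => dir === -indexEntries[i][1]);
  return forwards || backwards;
}

const sortByCountryAndCity = { country: 1, city: 1 };
console.log(indexSupportsSort({ country: 1 }, sortByCountryAndCity));
console.log(indexSupportsSort({ country: 1, city: 1 }, sortByCountryAndCity));
```

With this rule, the single-field index fails the check for the two-field sort, while the compound index passes it.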

If you’ve built the right indexes and performance is still not fast enough, you could also try rebuilding them.

7) Creating multiple connection objects

A single database connection object is typically reused by all queries and updates. However, commands sent over one connection are queued and processed in the order they are received.

Slow-running complex queries could thus end up being a bottleneck, and your application wouldn’t be responsive if these bottlenecks start cropping up every now and then.
An option is to create multiple connection objects. For instance, one connection object could be used by slow-running queries and another by fast queries.
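The effect of separating slow and fast queries can be shown with a toy model. It assumes, as described above, that each connection handles its commands strictly one after another. The durations are made-up numbers and no database is involved.

```javascript
// Given the run times of queued commands, return the time at which each one finishes
// when they are processed strictly in order on a single connection.
function finishTimes(durationsMs) {
  let clock = 0;
  return durationsMs.map((duration) => (clock += duration));
}

// One shared connection: two fast queries queue up behind a slow report.
const shared = finishTimes([800, 5, 5]);

// Two connections: the slow report no longer blocks the fast queries.
const slowLane = finishTimes([800]);
const fastLane = finishTimes([5, 5]);

console.log({ shared, slowLane, fastLane });
```

On the shared connection the fast queries finish only after the 800 ms report; on their own connection they are done within 10 ms. The total work is the same, only the waiting changes.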

8) Setting timeouts
MongoDB commands can run for as long as they need to. This could end up slowing down your application’s response times and eventually cause the application to hang.

For instance, Node.js will happily keep waiting for the outcome of an asynchronous callback.
The right thing to do is to use the maxTimeMS() method, whereby you tell MongoDB to time out if a command takes longer than expected. Remember to set a reasonable timeout period for a command which could take considerable time.
For example:

db.user.find({ city: /^A.+/i }).maxTimeMS(100);

This sets a maximum time of 100 ms for a query which finds documents where the city name starts with the letter ‘A’ (the i flag makes the match case-insensitive).
The timeout set applies only to the query which invoked maxTimeMS() and is not global.
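When the limit is exceeded the server aborts the operation and reports an error, which the application should be ready to handle. The sketch below recognises that error by its code 50 (MaxTimeMSExpired) and falls back to a default value. The helper names and the sample error objects are hypothetical, and runQuery stands for whatever function issues your query.

```javascript
// The server reports an exceeded maxTimeMS limit with error code 50 (MaxTimeMSExpired).
const MAX_TIME_MS_EXPIRED = 50;

function isTimeLimitError(err) {
  return Boolean(err) && (err.code === MAX_TIME_MS_EXPIRED || err.codeName === "MaxTimeMSExpired");
}

// Run a query function and return a fallback value if it hits the time limit.
// Any other error is rethrown so that real failures are not hidden.
async function runWithFallback(runQuery, fallbackValue) {
  try {
    return await runQuery();
  } catch (err) {
    if (isTimeLimitError(err)) return fallbackValue;
    throw err;
  }
}

// Hypothetical error objects shaped like the ones a driver would raise.
const timeoutError = { code: 50, codeName: "MaxTimeMSExpired", message: "operation exceeded time limit" };
const duplicateKeyError = { code: 11000, codeName: "DuplicateKey", message: "duplicate key" };

console.log(isTimeLimitError(timeoutError), isTimeLimitError(duplicateKeyError));
```

Returning a fallback suits cases such as optional search suggestions. For queries whose results matter, surfacing the timeout to the caller is usually the better choice.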

These are a few general ways in which performance can be monitored and enhanced while using MongoDB. However every application has unique requirements and constraints and these have to be taken into consideration while deciding on a specific strategy to increase database performance for your application.