This is my favourite and toughest production horror story. It started around 9 PM when we got an alert that our production MongoDB went down. We thought it was because of a new cron job we'd just added, so we removed that change and restarted the database. It worked fine until 10 PM, and that's when things got weird.

Our db setup

It's important to understand our setup. In MongoDB terms, a database is just a named group of collections, and we stored each account's data in its own database, with collections specific to that account. We weren't running a cluster at the time; it was a single server.
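To make that layout concrete, here is a minimal sketch of it with pymongo. The connection string, account IDs, and collection names are made up for illustration; they aren't our real schema.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string

# One database per account, e.g. account_1001, account_1002, ...
# Each database holds that account's own collections (contacts, orders, ...).
db = client["account_1001"]
db["contacts"].insert_one({"name": "Alice"})
db["orders"].create_index("created_at")  # every collection and index becomes its own file on disk
```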

How does Mongo store data?

MongoDB uses the WiredTiger storage engine to manage data on disk. WiredTiger creates a separate file for each collection and for each index. So with 1,000 accounts, 30 collections per account, and 3 indexes per collection, that's 30 + 90 = 120 files per account, or roughly 120,000 files in total. When a collection is accessed, WiredTiger opens its file and reads the data, and to stay efficient it keeps file handles open after first use instead of reopening them for every operation.
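You can do that arithmetic yourself and compare it with what the server reports. The snippet below is a hedged sketch: the connection string is a placeholder, and the "files currently open" counter comes from the WiredTiger section of serverStatus (the exact metric layout can vary between MongoDB versions).

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string

# Back-of-the-envelope file count: one file per collection plus one per index.
accounts, colls_per_account, indexes_per_coll = 1000, 30, 3
estimated_files = accounts * colls_per_account * (1 + indexes_per_coll)
print("estimated data/index files:", estimated_files)  # 1000 * 30 * 4 = 120,000

# Compare with what WiredTiger reports it actually has open right now.
status = client.admin.command("serverStatus")
print("files currently open:", status["wiredTiger"]["connection"]["files currently open"])
```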

So what happened that night?

After restarting Mongo, the database would run fine until one of the older cron jobs kicked in; partway through, the database would crash and restart. We had several crons that looped over all the databases and ran queries against each one. As they did, WiredTiger ended up opening around 300k files to read and write data. But an OS limits how many files a process can hold open at once, and beyond that limit (which varies from OS to OS) it refuses to open any more. Apparently, WiredTiger doesn't handle that situation gracefully, and it would simply crash.
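The crons themselves were unremarkable; the problem was their shape. Roughly speaking (this is an illustrative sketch, not our actual job), they looked like the loop below, and touching every collection in every database is exactly what forces WiredTiger to open all of those files.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string

# A nightly job shaped roughly like ours (names and query are illustrative).
for db_name in client.list_database_names():
    if db_name in ("admin", "local", "config"):
        continue  # skip MongoDB's internal databases
    db = client[db_name]
    for coll_name in db.list_collection_names():
        # Any read or write here opens the collection's file and its index files.
        db[coll_name].count_documents({"updated_at": {"$exists": True}})
```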

What MongoDB docs say:

Their docs clearly advise against having more than 10,000 collections on a single server because it causes problems. With roughly 30,000 collections spread across our account databases, we were well past that.
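If you want to see where a deployment stands against that guidance, a quick count is enough. As before, the connection string is a placeholder.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string

# Total collections across all non-internal databases, to compare against the
# ~10,000-collections-per-server guidance mentioned above.
total = sum(
    len(client[name].list_collection_names())
    for name in client.list_database_names()
    if name not in ("admin", "local", "config")
)
print(f"total collections: {total}")
```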

How we figured it out:

At first, we thought it was a memory issue, which we had seen before. Our usual fix was to upgrade to a bigger server, but this time that didn't work, so we knew we had to look deeper into the logs. I pulled the logs of our MongoDB container for the window between the moment it stopped and the moment it restarted.

In those logs, one line stood out as the likely cause: WiredTiger was failing to open files because the mongod process had hit the operating system's open-file limit.

I quickly set up a new MongoDB server and created lots of dummy collections. I ran basic write operations against them and managed to recreate the same crash. At first, I tried to fix it by increasing the ulimit (the OS's open-file limit), but that didn't help: raising it far enough to keep roughly 300,000 files open at once wasn't a realistic option for our setup.
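The reproduction doesn't need anything fancy. A rough sketch of that kind of load test (against a throwaway instance, with made-up names and counts) looks like this; each dummy collection plus its default _id index adds files that WiredTiger eventually has to open.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # throwaway test instance (hypothetical)

# Spread a large number of dummy collections across many databases and write to
# each one, so WiredTiger opens the file behind every collection and index.
for account in range(1000):
    db = client[f"dummy_account_{account}"]
    for i in range(30):
        db[f"collection_{i}"].insert_one({"account": account, "n": i})
```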

This test confirmed that the issue wasn't unique to our production environment; it was a fundamental problem with how we were using MongoDB. Our database structure, with so many collections, was pushing MongoDB beyond its practical limits.

Short term fix

Since it was late at night, we needed a quick fix. We decided to mark accounts that hadn't logged in recently as inactive. This way, the cron jobs wouldn't run on them, so files for those collections wouldn't be opened.
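In practice that came down to a flag the crons could filter on. The sketch below shows the idea; the metadata collection, field names, and the 90-day cutoff are assumptions for illustration, not our exact implementation.

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
accounts = client["app_meta"]["accounts"]          # hypothetical account-metadata collection

# Flag accounts with no recent login as inactive so the nightly crons skip them
# and never touch (or open) the files behind their collections.
cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # assumed threshold
accounts.update_many({"last_login": {"$lt": cutoff}}, {"$set": {"active": False}})

# Crons then iterate only over the databases of active accounts.
active_dbs = [a["db_name"] for a in accounts.find({"active": True}, {"db_name": 1})]
```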

Long term fix

The next week, we moved our single-server setup to a replica set with one primary and two secondary nodes. Sharding the database didn't make sense, both because it wouldn't solve the "too many open files" problem and for other business reasons. The best long-term answer seems to be running separate clusters for groups of accounts and routing MongoDB queries to the right cluster at the application level.
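Here is a hedged sketch of what that routing could look like in application code. The cluster URIs, the modulo split, and the helper name db_for_account are all hypothetical; a real setup would more likely read the mapping from an account-to-cluster table.

```python
from pymongo import MongoClient

# Hypothetical clusters, each holding the databases of a subset of accounts.
CLUSTER_URIS = {
    "cluster_a": "mongodb://cluster-a.example.com:27017",
    "cluster_b": "mongodb://cluster-b.example.com:27017",
}
clients = {name: MongoClient(uri) for name, uri in CLUSTER_URIS.items()}

def db_for_account(account_id: int):
    """Pick the cluster for this account (simple modulo split for illustration)."""
    cluster = "cluster_a" if account_id % 2 == 0 else "cluster_b"
    return clients[cluster][f"account_{account_id}"]

orders = db_for_account(1001)["orders"]  # this query goes to that account's cluster
```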