Imagine working on a feature of your product. That important feature that people pay for gladly. You know what I'm talking about.
Now picture this - every once in a while the intricate machinery backing that code grinds to a halt. Seemingly without a reason. There I was, in the wee hours of night musing why it wouldn't work.
The core of our system is built around celery and RabbitMQ. If you don't know it - RabbitMQ is a message broker written in Erlang implementing AMQP. It enables you to connect various components of your app asynchronously and implement, with proper configuration, a very nice map/reduce system. For languages such as Python this is a great feature.
I was trying to figure out why Celery stopped every once in a while. Everything seemed normal, but any call to
apply_async just hanged there, waiting.
Because of how mission critical this was, the whole system often got rebooted before I got my hands on it - making it even more frustrating to debug.
Eventually, I got lucky. It stopped just when I was at the keyboard and way beyond working hours for most of our customers. I started checking, step by step, what was going on.
I knew that the problem can be fixed by restarting server running the RabbitMQ broker. Somehow, problem was between what RabbitMQ was doing and how our project was interacting with it.
One of the extra features of RabbitMQ is flow control - it's a way for RabbitMQ to tell connected services that it's unable to cope with the amount of data it's being sent and that producers need to back off. Unfortunately, Celery doesn't support that extension.
For a while, I was thinking that therefore it has to be a problem with Celery. But there was almost nothing about this mentioned anywhere on the internet. Surely, if this indeed was a problem worth fixing, there would be at least a few posts on Stack Overflow about this.
Also, this really wasn't a problem I witnessed before and I had been using RabbitMQ for a few years now. In other words - something specific to this RabbitMQ instance was causing the bug.
There is a slew of RabbitMQ commands that can help you narrow down your scope. First, you need to check where the problem seems to be when the RabbitMQ enters flow control.
The important thing to know is this - there are multiple reasons RabbitMQ may go into flow control state, but if it's not recovering it's either blocked on CPU or disk I/O.
I started by issuing this command:
sudo rabbitmqctl show_connections
It returned a list of connections made to our RabbitMQ node. All were either blocking or blocked.
Time to check where it stemmed from. I run:
sudo rabbitmqctl show_channels
Which showed me all channels as running. Also, running:
Showed RabbitMQ I/O at a pretty sensible level. So now I knew, RabbitMQ was being blocked by a process hogging CPU. For anyone who had at least a bit of exposure to Erlang and RabbitMQ this may seem very strange.
When you have eliminated the impossible, whatever remains, however improbable, must be the truth
Now - how do you find what's hogging CPU inside RabbitMQ?
I know Erlang a bit and most people using RabbitMQ don't know it at all. But I feel nowhere near competent to go and read RabbitMQ source code and start profiling it - really, I wouldn't even know where to start.
Fortunately, there's some nice tools built into RabbitMQ that can help you. It's a tiny bit dark magic, but inside RabbitMQ exists
Running this command:
sudo rabbitmqctl eval "rabbit_diagnostics:maybe_stuck()."
Will go through all processes inside RabbitMQ. Don't be scared by massive amount of reported processes - these are Erlang processes which aren't OS level processes. For a normally running RabbitMQ output should be looking so:
2016-01-21 05:03:28 There are 32033 processes. 2016-01-21 05:03:28 Investigated 0 processes this round, 5000ms to go. 2016-01-21 05:03:28 Investigated 0 processes this round, 4500ms to go. 2016-01-21 05:03:29 Investigated 0 processes this round, 4000ms to go. 2016-01-21 05:03:29 Investigated 0 processes this round, 3500ms to go. 2016-01-21 05:03:30 Investigated 0 processes this round, 3000ms to go. 2016-01-21 05:03:30 Investigated 0 processes this round, 2500ms to go. 2016-01-21 05:03:31 Investigated 0 processes this round, 2000ms to go. 2016-01-21 05:03:31 Investigated 0 processes this round, 1500ms to go. 2016-01-21 05:03:32 Investigated 0 processes this round, 1000ms to go. 2016-01-21 05:03:32 Investigated 0 processes this round, 500ms to go. 2016-01-21 05:03:33 Found 0 suspicious processes. ok
In my case there was a mention of a webmachine process getting stuck. Ok, that seemed to ring a bell - webmachine is a component responsible for running REST inside RabbitMQ - if you enabled management plugin. Which I did.
After a week of search I was happy to find anything, however doubtful I was of the fact, that somehow, REST plugin can cause problems RabbitMQ server. I mean, it's just a plugin and this is all written in Erlang, so surely, it's bulletproof, isn't it?
Well, I was wrong. Disabling the plugin not only fixed the issue. It also cut memory usage by 70%.