Debugging memory corruption in a production Rails app using Mozilla rr

TL;DR: We had multiple crashes/hangs in our Ruby on Rails web application that we couldn’t reproduce in staging/development environments. In production, we used Mozilla rr to debug and narrow the issue down to an interaction between the Ruby VM GC and the syck gem. If you are on a Ruby version newer than 2.1 and are using the syck gem, please migrate to psych.

Background


Freshdesk is a Ruby on Rails, multi-tenant, sharded application that is hosted on the AWS platform. Our application is a monolith; it is single-threaded (and multi-process) for frontend web requests and multi-threaded for background jobs.

Even though our application is sharded, we have seen a single shard failure affect the entire application. So, to improve reliability and minimize the blast radius, we split the infrastructure into two fundamental entities, “shells” and “layers,” instead of running the entire infrastructure in one very large cluster.

Shells: These are isolated, self-contained entities, typically serving requests meant for one or more specific shards (along with their own datastores, like Redis, etc.). So any shard/datastore failure is isolated to the Shell it belongs to.

Layers: Within a Shell, if we distributed incoming requests evenly to all available machines, isolated latency spikes for some requests could take down the entire Shell. To minimize this, we break the Shell into multiple clusters of machines called Layers. Each Layer serves only a subset of the requests, typically partitioned by path (/some/path/ goes to Layer1, /other/path goes to Layer2, etc.). So any failure due to sudden latency spikes for a specific path is isolated to a specific Layer within a Shell.

At a high level, this is similar to Shopify’s Pod Architecture.

The Problem

It all started with alerts from one of the Layers (in a Shell) that some requests were failing with errors. The failing requests erred at random places in the code, with no clear pattern, often at places with little chance of failing.

Further analysis showed that these errors were clustered around a few machines. Once a machine started showing an error, more requests failed on that machine at random locations. These errors persisted until a restart or a Ruby VM crash.

This “appeared” to be a classic heap corruption problem. But we are on Ruby; we shouldn’t have to worry about memory management, since the Ruby VM handles it. Then again, we also knew that quite a few of our gems contain C/C++ code, which might not be interacting well with the Ruby VM ecosystem.

Minimizing the impact

The first thing to find out in a bug like this is whether we can consistently reproduce it in a controlled setup. We tried running the full test suite in the staging/development environment. We ran it multiple times and couldn’t reproduce the problem.

At that point, we realized that we were looking at an involved, time-consuming debugging effort. So we wanted to minimize the blast radius of the impact (as well as narrow our debugging effort to as few areas as possible). Though this cluster was already serving only a subset of our requests (path-based routing), we further divided it into two clusters and started sending a subset of the problematic requests (by path) to the new cluster.

The intention was to trace the specific request(s) that caused this problem. After shifting various requests back and forth, we were able to find a single path that reliably reproduced the issue (it took multiple hours for the problem to show up).

Having narrowed down the problem, we wanted a temporary mitigation in place until we found a permanent fix. We ended up running a script on each machine that watched for an elevated error rate and restarted the application gracefully.
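
The watchdog was conceptually simple. Here is a minimal sketch of the idea in Ruby; the log path, error pattern, threshold, and restart command are hypothetical placeholders (ours was tailored to our logging and Passenger setup):

# watchdog.rb: hypothetical sketch, restart the app when the error rate spikes
ERROR_LOG = "/var/log/someapp/error.log" # assumed log location
THRESHOLD = 50                           # errors per interval; tune for your traffic
INTERVAL  = 60                           # seconds

last_pos = File.size(ERROR_LOG)
loop do
  sleep INTERVAL
  # count the error lines appended since the last check
  new_errors = File.open(ERROR_LOG) do |f|
    f.seek(last_pos)
    f.each_line.count { |line| line.include?("ERROR") }
  end
  last_pos = File.size(ERROR_LOG)
  if new_errors > THRESHOLD
    # graceful restart: Passenger drains in-flight requests and respawns workers
    system("passenger-config", "restart-app", "/data/someapp/current")
  end
end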

On a side note, we didn’t have Envoy at that time. If we had, we could have enabled outlier detection to automatically eject problematic hosts and recycle them at the infra level.

Narrowing down the problem

Now, with the temporary mitigation in place, we started looking for a permanent fix.

We tried a few easier things in the new cluster, which didn’t help: 

  • Upgrade Ruby to the latest released version 
  • Upgrade all the gems to the latest released version 

Remember that this problem did not produce a SEGFAULT/crash immediately; the application continued to run for multiple hours, erring out on some, not all, requests. It continued to produce errors until the problematic application process was restarted or crashed with a SEGFAULT (pointing to some random operation within the Ruby VM).

To get more insight into the crash, we disabled the above workaround on one of the machines and enabled coredumps. To be safe, rather than distributing requests evenly to this coredump-enabled machine, we sampled a certain percentage of requests and sent them to it.
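
Enabling coredumps amounted to something like the following (a sketch; the core file location/pattern is illustrative):

# remove the core file size limit for the app processes
ulimit -c unlimited

# write cores to a known location (illustrative pattern: program name + pid)
sysctl -w kernel.core_pattern=/data/cores/core.%e.%p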

Once we had a few callstacks, the offending callstacks were, as expected, all over the place. For example, one such crash:


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 17327.17570]
0x000055a2009a52b2 in rb_get_alloc_func (klass=94154489136320) at vm_method.c:554
554 rb_alloc_func_t allocator = RCLASS_EXT(klass)->allocator;


(gdb) bt
#0 0x000055a2009a52b2 in rb_get_alloc_func (klass=94154489136320) at vm_method.c:554
#1 0x000055a2008cb58f in rb_obj_alloc (klass=94154328666000) at object.c:1815
#2 0x000055a2008cb686 in rb_class_new_instance (argc=1, argv=0x7fe8247ae140, klass=94154328666000) at object.c:1861
#3 0x000055a200998b41 in call_cfunc_m1 (func=0x55a2008cb667 <rb_class_new_instance>, recv=94154328666000, argc=1, argv=0x7fe8247ae140) at vm_insnhelper.c:1208
#4 0x000055a200999845 in vm_call_cfunc_with_frame (th=0x55a21d62c900, reg_cfp=0x7fe8248ab6c0, ci=0x55a21266f438) at vm_insnhelper.c:1380
#5 0x000055a200999b5c in vm_call_cfunc (th=0x55a21d62c900, reg_cfp=0x7fe8248ab6c0, ci=0x55a21266f438) at vm_insnhelper.c:1473
#6 0x000055a20099edb7 in vm_exec_core (th=0x55a21d62c900, initial=0) at insns.def:1040
#7 0x000055a2009b062d in vm_exec (th=0x55a21d62c900) at vm.c:1440
#8 0x000055a2009ae68f in invoke_block_from_c (th=0x55a21d62c900, block=0x55a217e377c0, self=94154332221600, argc=0, argv=0x55a20385c510, blockptr=0x0, cref=0x0, defined_class=94154340325760, splattable=0) at vm.c:818
#9 0x000055a2009aec09 in vm_invoke_proc (th=0x55a21d62c900, proc=0x55a217e377c0, self=94154332221600, defined_class=94154340325760, argc=0, argv=0x55a20385c510, blockptr=0x0) at vm.c:883
#10 0x000055a2009aecca in rb_vm_invoke_proc (th=0x55a21d62c900, proc=0x55a217e377c0, argc=0, argv=0x55a20385c510, blockptr=0x0) at vm.c:902
#11 0x000055a2009bd8ad in thread_start_func_2 (th=0x55a21d62c900, stack_start=0x7fe8247ad000) at thread.c:577
#12 0x000055a2009bbf89 in thread_start_func_1 (th_ptr=0x55a21d62c900) at thread_pthread.c:869
#13 0x00007fe844856de5 in start_thread (arg=0x7fe8247ac700) at pthread_create.c:308
#14 0x00007fe843e43bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Looking at more coredumps with such callstacks, we suspected heap/memory corruption that led to errors much later. So we started looking at tools that could help catch the corruption early (say, as soon as a particular heap corruption happened, rather than when it manifested later as errors/crashes).

Another way to look at these tools: they convert these kinds of delayed crashes into “immediate” crashes, often with enough information to find the issue.

Based on our past experience and familiarity, we narrowed down the list of tools to try, ordered from the least performance impact (w.r.t. CPU) upward.

Running these tools involved quite a few changes, starting with recompiling the Ruby VM (which required some minor changes) with the appropriate tooling, all the way up to modifying our Ruby application server, Phusion Passenger.

Once we had this automated setup in place, we ran our application under these tools (one after another) in staging/development. 

Unfortunately, everything passed. There were no issues reported. 

So we wondered if we could run it in production, in the new cluster we had set up. We ran benchmarks and found that the latency overhead was within a reasonable range, roughly 1x to 2x of our response time (~150 ms max). So we deployed this on one of the instances.

Again, everything passed. 

Remember that these tools may not spot logical errors. Say our application has a single thread, and two variables A and B point to the same memory area by design (for example, a class allocates memory, does something with it, returns that address to the caller, but also retains the address internally). To find such errors, the tools require additional information, which in turn requires a deeper understanding of, and integration with, the language/runtime.

In our specific case, we needed deeper coordination with the Ruby VM runtime/GC, feeding enough information to these tools to find logical errors, if any. Remember that Ruby also allows gems with native code to create/use/destroy Ruby objects via the C API, so whatever we used had to work with those gems as well. We couldn’t find such tools (other than enabling the Ruby VM GC debug logs, which didn’t help us debug this).
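
For reference, those GC debug checks are compile-time options in MRI; enabling them looks something like this (a sketch; the macro levels shown are illustrative):

# rebuild Ruby with GC debugging/verification compiled in (illustrative levels)
./configure cppflags="-DRGENGC_DEBUG=2 -DGC_DEBUG=1 -DRGENGC_CHECK_MODE=2"
make && make install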

So we were looking for a tool that would allow us to rewind in the event of a crash and verify whether each mutation was “legal”/logically allowed at every step, and finally find the offending code.

Enter Mozilla rr

One such tool we had experimented with in the past is Mozilla rr, which is a kind of “magic” tool. It can record an execution, including all the external calls (in our case, Redis, Memcached, MySQL, and various gRPC/HTTP calls), and replay it later for debugging purposes. The replay is deterministic and produces the same results on every run (which allows workflows such as: found a problem at a specific memory location? Set a watchpoint on that memory location, reverse-cont or restart the gdb run, etc.). It also integrates with gdb, which is very nice, as our other gdb extensions can be used as-is.

But we hadn’t really used rr in production or on a large Ruby on Rails application. So we decided to evaluate it in the staging environment.

Cloud and CPU features

Mozilla rr depends on certain CPU features (most importantly, precise hardware performance counters). We were not sure if the cloud provider we are on (AWS) supported those features; they are usually disabled for security and other reasons. So we booted a reasonably large machine to check if we could use it.

NOTE: This was in mid-2018 


[root@ip-172-16-20-17 ec2-user]# curl http://instance-data/latest/meta-data/instance-type
i3.16xlarge
 
[root@ip-172-16-20-17 ec2-user]# cat /proc/cpuinfo | grep "model name"
model name      : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 
[root@ip-172-16-20-17 ec2-user]# dmesg | grep -i perf
[   22.307535] Performance Events: Broadwell events, core PMU driver.
 
[root@ip-172-16-20-17 ec2-user]# perf list hw
 
List of pre-defined events (to be used in -e):
 
 branch-instructions OR branches                    [Hardware event]
 branch-misses                                      [Hardware event]
 bus-cycles                                         [Hardware event]
 cache-misses                                       [Hardware event]
 cache-references                                   [Hardware event]
 cpu-cycles OR cycles                               [Hardware event]
 instructions                                       [Hardware event]
 
 
[root@ip-172-16-20-17 ec2-user]# perf stat ls
 
Performance counter stats for 'ls':
 
         0.625179      task-clock (msec)         #    0.633 CPUs utilized
                0      context-switches          #    0.000 K/sec
                0      cpu-migrations            #    0.000 K/sec
               82      page-faults               #    0.131 M/sec
        1,674,090      cycles                    #    2.678 GHz
        1,150,692      instructions              #    0.69  insn per cycle
          238,616      branches                  #  381.676 M/sec
            9,649      branch-misses             #    4.04% of all branches
 
      0.000987276 seconds time elapsed
 
[root@ip-172-16-20-17 obj]# rr record ls
rr: Saving execution to trace directory `/root/.local/share/rr/ls-0'.
[FATAL /home/ec2-user/rr/rr/src/PerfCounters.cc:324:check_working_counters() errno: EOPNOTSUPP]
Got 0 branch events, expected at least 500.
 
The hardware performance counter seems to not be working. Check
that hardware performance counters are working by running
 perf stat -e r5101c4 true
and checking that it reports a nonzero number of events.
If performance counters seem to be working with 'perf', file an
rr issue, otherwise check your hardware/OS/VM configuration. Also
check that other software is not using performance counters on
this CPU.
Aborted
 
# Mozilla rr needs r5101c4
[root@ip-172-16-20-17 ec2-user]# perf stat -e r5101c4 ls
 
Performance counter stats for 'ls':
 
                0      r5101c4
 
      0.001176665 seconds time elapsed

Unfortunately, rr was not supported. We tried other instance types to no avail. 

We corresponded with AWS to see if they would add support for this particular hardware perf event, as various other events were already supported. AWS acknowledged that this was under consideration, and a month or so later, AWS added support for it.

The following was run on an i3.metal instance; the counter is likely supported on other large instance types as well.


[root@ip-172-16-20-17 ec2-user]# perf stat -e r5101c4 true

Performance counter stats for 'true':

            58,936      r5101c4

       0.000669296 seconds time elapsed

Once this was ready, we had to make a few more minor tweaks: back-porting certain kernel patches (we couldn’t upgrade the kernel for other reasons) and recompiling the kernel (Amazon Linux 1), and modifying a few settings in Phusion Passenger (direct spawning) to make the setup work. We also compiled Ruby with -O0 to make debugging easier (sketched below).
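
Building Ruby without optimizations looks roughly like this (a sketch; these are Ruby’s standard configure variables, and the exact debug flags are illustrative):

# build Ruby with optimizations off and full debug info
./configure optflags="-O0" debugflags="-ggdb3"
make && make install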

Once we automated this setup and verified it in staging, we were ready to deploy it in production in a limited manner. 

Before that, we wanted to verify the performance impact. Our tests showed a ~0.4x increase in latency, which was well within our limits. We also made sure this was deployed on only one machine, and used request sampling to further minimize the latency effect.

So our Rails app launch script was something like this (launched via Nginx + Passenger workers):


exec /usr/local/bin/rr -F record -n \
--disable-cpuid-features 0x80050440,0x40140400 --disable-cpuid-features-ext 0xffffd854,0xffffffff,0xf3ffffff \
--disable-cpuid-features-xsave 0xfffffffe \
"/usr/local/bin/ruby" "$@"

We disabled certain CPUID features so that the recording would also replay on a dev laptop (the machine replaying a trace must support the same CPU features as the machine that recorded it).

Notes on the setup are in this gist.

Using Mozilla rr 

With everything set up, we deployed this on one production machine and waited for the crash to occur. Soon, we had multiple crashes (we run multiple application processes), which was good enough for us, so we took the machine out of traffic within a few hours.

One very nice feature of rr is the ability to record, pack, and copy a recording, then replay it on a completely different machine from the one it was recorded on. This allowed us to turn off the expensive i3.metal as soon as we were done.

So after the crash, we ‘rr pack’ed the recording, ‘scp’ed it to an independent EC2 machine (not part of the prod environment), and could ‘rr replay’ it there. Note that the replay could be done on a laptop as well, i.e., on a machine entirely outside the environment the trace was recorded in; we chose not to do that, as the recording contained production data.
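
The copy step looked roughly like this (host and paths are illustrative):

# make the trace self-contained, so it no longer depends on files of the recording machine
rr pack ~/.local/share/rr/ruby-0

# ship it to the analysis machine
tar czf ruby-0.tar.gz -C ~/.local/share/rr ruby-0
scp ruby-0.tar.gz analysis-host:/rr_trace/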

One additional thing to remember is that we needed to copy the rootfs of the EC2 machine. Since we were going to replay on a completely different machine, likely with different OS/libraries, gdb would need the original libraries, along with the debug symbols. At that point (~2018), we didn’t run our application within docker/containers, so we had to copy the rootfs manually (see this gist for an example command to copy the rootfs). If you are already using containers, then you just need the image alone for the replay.

After recording, to replay it, we can invoke something like:


# Run this from the directory where amzlinux_rootfs is present

# Once: bind-mount the kernel filesystems into the rootfs
for f in /proc /sys /dev; do sudo mount --bind $f ./amzlinux_rootfs/$f; done

# Enter the rootfs and replay
sudo chroot ./amzlinux_rootfs/ /bin/bash -c "rr replay /rr_trace/ruby-0/ -- -ex 'handle SIGPIPE nostop noprint pass' -ex 'source /gdb_ruby.py'"

# now we should be in gdb prompt, use gdb commands to run/stop/inspect/watch/reverse-cont etc

Or, if you want to use a GUI debugger (like Emacs GUD or VS Code, etc.):


# Start a gdb server on port 50505 using the above instructions
sudo chroot ./amzlinux_rootfs/ /bin/bash -c "rr replay -s 50505 -k /rr_trace/ruby-0/ -- -ex 'handle SIGPIPE nostop noprint pass' -ex 'source /gdb_ruby.py'"

# In Emacs, start gdb client and connect to it
(gdb) set sysroot /home/surki/tmp/some_app_crash/rr/amzlinux_rootfs
(gdb) target remote :50505
(gdb) source ~/tmp/some_app_crash/rr/amzlinux_rootfs/gdb_ruby.py
(gdb) continue

Ruby GC detour 

If you know Ruby GC internals or want to skip to root cause analysis, please skip this section. 

The analysis that follows requires a deeper understanding of Ruby VM internals, as well as Ruby VM GC internals. I strongly recommend going through the Ruby Under a Microscope book.

For the GC, various resources are available; please see here and here. Notes based on them:

  • Mark and sweep with tweaks to handle compatibility issues with C extensions / MRI design 
  • Two types of GC: Minor and Major 
  • Minor GC 
    • Restricted Generational GC 
    • Objects are classified into “old” and “new” generations 
    • “new” objects are promoted to “old” if they survive 3 GC cycles 
    • Mark is only done for “new” objects (== less work) 
    • To detect “old” objects referring to “new” objects, “write barriers” are used (i.e., access to the objects is done via indirection, so the runtime/GC can “detect” which objects are being referred to) 
      • But not all Ruby objects can have a write barrier (a C extension referring to a pointer directly, etc.), so objects are classified into “protected” (= objects that have a write barrier) and “unprotected” (= objects that don’t have a write barrier) 
      • Objects can transition into and out of the protected/unprotected classification 
      • Unprotected objects are never promoted to the “old” generation. Any “shady” (unprotected) objects referenced by “old” objects are added to the “remembered set” and are marked on every minor GC 
    • Sweeping is done lazily 
    • Note that Ruby’s GC is a non-compacting, non-copying GC. So even though there are “old” and “new” classifications, those are merely flags, not completely separate arenas. 
    • This is expected to improve the overall “throughput” 
  • Major GC 
    • Restricted Incremental GC 
    • The objective is to minimize the long STW pause time of mark phase 
    • This is done by doing GC mark in multiple steps with ruby code execution intermixed (“GC mark step” => “Ruby code execution” => “GC mark step” => “Ruby code execution” etc) 
    • On each mark step, it marks one set of objects 
    • To detect new objects being referenced by already marked objects, “write barrier” is used (same as above “protected” and “unprotected” classification) 
      • So at the end of the mark phase, objects that are both “marked and unprotected” are processed in a single shot 
      • This may lead to high pause time, depending on how many “unprotected” objects are there 
    • This is expected to improve latency, though the overall GC time will be equal to or more than a single STW collection (due to write-barrier cost, etc.); the snippet after this list shows how to observe these counters from Ruby
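
These generations and counters are visible from Ruby itself. A quick way to poke at them (the key names shown are from Ruby 2.2; they vary slightly across versions):

# Observe generational GC activity from within Ruby (2.2+)
GC.start(full_mark: false)  # request a minor GC

stats = GC.stat
puts stats[:minor_gc_count]                    # minor GCs so far
puts stats[:major_gc_count]                    # major GCs so far
puts stats[:old_objects]                       # objects promoted to the "old" generation
puts stats[:remembered_wb_unprotected_objects] # "shady" objects in the remembered set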

The root cause

NOTE: As part of this, we developed a gdb extension for Ruby that allows us to look into various Ruby internals (like callstacks, dumping objects, etc.). You can find it here.

Let’s start with a crash replay. 


sudo chroot ./amzlinux_rootfs/ /bin/bash -c "rr replay /rr_trace/ruby-0/ -- -ex 'handle SIGPIPE nostop noprint pass' -ex 'source /gdb_ruby.py'"
...
...
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 17327.17570]
0x000055a2009a52b2 in rb_get_alloc_func (klass=94154489136320) at vm_method.c:554
554 rb_alloc_func_t allocator = RCLASS_EXT(klass)->allocator;

(rr)

C callstack 


(rr) bt
#0 0x000055a2009a52b2 in rb_get_alloc_func (klass=94154489136320) at vm_method.c:554
#1 0x000055a2008cb58f in rb_obj_alloc (klass=94154328666000) at object.c:1815
#2 0x000055a2008cb686 in rb_class_new_instance (argc=1, argv=0x7fe8247ae140, klass=94154328666000) at object.c:1861
#3 0x000055a200998b41 in call_cfunc_m1 (func=0x55a2008cb667 <rb_class_new_instance>, recv=94154328666000, argc=1, argv=0x7fe8247ae140) at vm_insnhelper.c:1208
#4 0x000055a200999845 in vm_call_cfunc_with_frame (th=0x55a21d62c900, reg_cfp=0x7fe8248ab6c0, ci=0x55a21266f438) at vm_insnhelper.c:1380
#5 0x000055a200999b5c in vm_call_cfunc (th=0x55a21d62c900, reg_cfp=0x7fe8248ab6c0, ci=0x55a21266f438) at vm_insnhelper.c:1473
#6 0x000055a20099edb7 in vm_exec_core (th=0x55a21d62c900, initial=0) at insns.def:1040
#7 0x000055a2009b062d in vm_exec (th=0x55a21d62c900) at vm.c:1440
#8 0x000055a2009ae68f in invoke_block_from_c (th=0x55a21d62c900, block=0x55a217e377c0, self=94154332221600, argc=0, argv=0x55a20385c510, blockptr=0x0, cref=0x0, defined_class=94154340325760, splattable=0) at vm.c:818
#9 0x000055a2009aec09 in vm_invoke_proc (th=0x55a21d62c900, proc=0x55a217e377c0, self=94154332221600, defined_class=94154340325760, argc=0, argv=0x55a20385c510, blockptr=0x0) at vm.c:883
#10 0x000055a2009aecca in rb_vm_invoke_proc (th=0x55a21d62c900, proc=0x55a217e377c0, argc=0, argv=0x55a20385c510, blockptr=0x0) at vm.c:902
#11 0x000055a2009bd8ad in thread_start_func_2 (th=0x55a21d62c900, stack_start=0x7fe8247ad000) at thread.c:577
#12 0x000055a2009bbf89 in thread_start_func_1 (th_ptr=0x55a21d62c900) at thread_pthread.c:869
#13 0x00007fe844856de5 in start_thread (arg=0x7fe8247ac700) at pthread_create.c:308
#14 0x00007fe843e43bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Corresponding Ruby callstack (using gdb_ruby.py extension): 


(rr) ruby_bt
/data/someapp/releases/20180703094818/lib/middleware/global_restriction.rb:12:in `call'
/data/someapp/shared/bundler_gems/ruby/2.2.0/gems/newrelic_rpm-3.17.2.327/lib/new_relic/agent/instrumentation/middleware_tracing.rb:96:in `call'
/data/someapp/releases/20180703094818/vendor/gems/rack-openid/lib/rack/openid.rb:104:in `call'
/data/someapp/shared/bundler_gems/ruby/2.2.0/gems/newrelic_rpm-3.17.2.327/lib/new_relic/agent/instrumentation/middleware_tracing.rb:96:in `call'
/data/someapp/shared/bundler_gems/ruby/2.2.0/gems/actionpack-3.2.22.3/lib/action_dispatch/middleware/best_standards_support.rb:17:in `call'
...
...
...
/usr/local/lib/ruby/gems/2.1.0/bundler/gems/passenger-c4b44afe25ea/src/ruby_supportlib/phusion_passenger/request_handler/thread_handler.rb:160:in `accept_and_process_next_request'
/usr/local/lib/ruby/gems/2.1.0/bundler/gems/passenger-c4b44afe25ea/src/ruby_supportlib/phusion_passenger/request_handler/thread_handler.rb:113:in `main_loop'
/usr/local/lib/ruby/gems/2.1.0/bundler/gems/passenger-c4b44afe25ea/src/ruby_supportlib/phusion_passenger/request_handler.rb:416:in `block (3 levels) in start_threads'
/usr/local/lib/ruby/gems/2.1.0/bundler/gems/passenger-c4b44afe25ea/src/ruby_supportlib/phusion_passenger/utils.rb:113:in `block in create_thread_and_abort_on_exception'

So in this recording, it crashed in a middleware before the request reached the application controller (remember that other recordings crashed at different places).

Looking at the corresponding source code (line number 12 is the crash): 


...
10 def call(env)
11 ...
12 req = Rack::Request.new(env)
13 ...
14 ...
...
21 end
...

and the corresponding YARV disassembly (the crash is at instruction offset “0024”):


== disasm: <RubyVM::InstructionSequence:call@/data/someapp/releases/20180703094818/lib/middleware/global_restriction.rb>
local table (size: 5, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 5] env<Arg> [ 4] ip [ 3] req [ 2] req_path
0000 trace 8 ( 9)
0002 trace 1 ( 10)
0004 putself
0005 opt_send_without_block <callinfo!mid:blacklisted_ips, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0007 opt_send_without_block <callinfo!mid:ip_list, argc:0, ARGS_SIMPLE>
0009 setlocal_OP__WC__0 4
0011 trace 1 ( 11)
0013 getinlinecache 22, <is:0>
0016 getconstant :Rack
0018 getconstant :Request
0020 setinlinecache <is:0>
0022 getlocal_OP__WC__0 5
0024 opt_send_without_block <callinfo!mid:new, argc:1, ARGS_SIMPLE>
0026 setlocal_OP__WC__0 3
0028 trace 1 ( 12)
0030 getlocal_OP__WC__0 3
0032 opt_send_without_block <callinfo!mid:path_info, argc:0, ARGS_SIMPLE>
0034 setlocal_OP__WC__0 2
0036 trace 1 ( 13)
0038 getlocal_OP__WC__0 5
0040 getlocal_OP__WC__0 3
0042 opt_send_without_block <callinfo!mid:ip, argc:0, ARGS_SIMPLE>
0044 opt_aset_with <callinfo!mid:[]=, argc:2, ARGS_SIMPLE>, "CLIENT_IP"
0047 pop
0048 trace 1 ( 14)
...
...

So, while our application is processing a request, as part of middleware execution, we create an object (Rack::Request.new(env)), which internally calls the rb_get_alloc_func C function inside the Ruby VM, which crashes while traversing the class hierarchy.

Dumping the class hierarchy (utilizing our gdb_ruby.py extension):


(rr) python print_class_info(gdb.parse_and_eval('klass'))
T_CLASS : (struct RClass *)0x55a203506f90 allocator = 0x0 __classpath__ => "Rack::Request",
T_CLASS : (struct RClass *)0x55a201d42c60 allocator = 0x0 __classid__ => Object, __classpath__ => "Object", @parent_name => nil,
T_CLASS : (struct RClass *)0x55a203a3e120 allocator = 0x0 __classid__ => RedisWrapper, __classpath__ => "Redis::RedisWrapper",
T_CLASS : (struct RClass *)0x55a209782430 allocator = 0x0 __classid__ => Constants, __classpath__ => "Socket::Constants",
T_CLASS : (struct RClass *)0x55a203c8a3e8 allocator = 0x0 __classid__ => ObjectMixin, __classpath__ => "PP::ObjectMixin",
T_CLASS : (struct RClass *)0x55a20a723fb0 allocator = 0x0 __classid__ => MessageSending, __classpath__ => "Delayed::MessageSending",
T_CLASS : (struct RClass *)0x55a20ce104c0 Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/gdb_ruby.py", line 323, in print_class_info
sys.stdout.write("allocator = %s " % (cl['ptr']['allocator']))
gdb.MemoryError: Cannot access memory at address 0x48
Error while executing Python code.

As we can see, the class hierarchy is “corrupted”: Rack::Request’s superclass chain wanders through unrelated classes and ends at unreadable memory.

Now we can use the rr magic. We can set a gdb watchpoint on the corrupted address (to find out who changed it, when, and how), run backwards (using reverse-cont), and we will be stopped when someone mutates that variable/address.
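
In practice, that looks something like this (the address and expression here are illustrative):

# watch the superclass pointer of a suspicious class object (illustrative address)
(rr) watch -l ((struct RClass *)0x55a20ce104c0)->super
(rr) reverse-continue
# ... execution runs backwards and stops at the write that last mutated that location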

Note that this is a lot more complicated than it sounds. In Ruby, memory management is done by the runtime, so it is quite possible that this memory location had gone through hundreds of HTTP requests or thousands of allocations before we ran into the issue. In other words, the memory could have been corrupted hours earlier, then allocated multiple times without being used in a way that caused a crash, until finally it was allocated and used in a way that did.

What followed involved multiple runs forwards and backwards in gdb/rr, with tens of watchpoints spread around. My notes from a few days of debugging this and arriving at the root cause:

  • rb_get_alloc_func crashes while traversing class hierarchy 
    • Why? The “Delayed::MessageSending” object’s super class (“V8::Conversion::Object”) was freed 
    • Why was it freed? 
      • During last minor GC mark phase (at rr event: x), this object was not marked, so the sweep phase freed it 
      • Why was it not marked when it had a live reference? 
        • Normal GC (for reference) 
          • In rgengc_rememberset_mark, for page 0x55a201ced8d0, we are calling gc_mark_children for object 0x55a201d42c60 (“Object”) 
          • Then calling gc_mark(RCLASS_SUPER(0x55a201d42c60)) 
            • RCLASS_SUPER(0x55a201d42c60) == 0x55a203a3e120 (“Redis::RedisWrapper”) 
            • gc_mark_ptr() 
              • rgengc_check_relation() 
              • gc_mark_set() 
              • gc_aging() 
              • gc_grey() 
                • push_to_mark_stack(0x55a203a3e120) 
        • Problematic GC 
          • In rgengc_rememberset_mark, for page 0x55a201ced8d0, we are calling gc_mark_children for object 0x55a201d42c60 (“Object”) 
          • Then calling gc_mark(RCLASS_SUPER(0x55a201d42c60)) 
            • RCLASS_SUPER(0x55a201d42c60) == 0x55a203a3e120 (“Redis::RedisWrapper”) 
            • gc_mark_ptr() 
              • rgengc_check_relation() 
              • gc_mark_set() 
                • returns early since markbit is already set 
                • Why? 
                  • Markbit is set from the previous gc run, it is not reset 
                  • Why is it not reset? 
                  • lazy gc_sweep finished with “page->free_slots > 0” but page->freelist is NULL. 
                  • This led to the GC assuming there were no more free slots in the entire heap, which incorrectly triggered a (minor) GC without finishing the current sweep phase 
                  • Then during this gc cycle, in mark phase, some objects’ children were ignored (since the parent is already marked, children were ignored assuming they were already processed) 
                  • This led to those unmarked but live (child) objects collected, leading to crash later 
                  • How did we end up having an empty slot? 
                  • Syck gem memset the object to zero here 
                  • Then sometime later during normal Ruby operation, we ran out of free slots and called heap_prepare 
                  • That called gc_sweep_continue which in turn called gc_sweep_step, which in turn called gc_sweep_page, in which: 
                  • this place detected the object memzero’d by syck as a T_NONE object and increased the empty_slots value 
                  • this place set page->free_slots > 0, but page->freelist and heap->free_pages are still NULL (this happens when freed_slots is zero and empty_slots > 0) 
                  • this (and this) place broke out of sweep loop assuming a free slot is found (but heap->free_pages is NULL) 
                  • Then this place triggered a minor GC, causing the inconsistent state 
                  • Is that normal to have page->free_slots > 0 and page->freelist/heap->free_pages == NULL? 
                  • IOW, should it be fixed in RubyVM? 
                  • Is syck doing the right thing? (by doing RVALUE copy and memsetting the slot to 0?) 
                  • What is the correct way to do it? 

To summarize: 

  • Syck gem memset the object to zero here 
    • Which appears to be incorrect, operating directly on an object managed by Ruby 
  • Then sometime later during normal Ruby operation, we ran out of free slots and called heap_prepare 
    • This place detected the object memzero’d by syck as a T_NONE object (T_NONE is 0) and assumed it to be free, when in actuality it was still being used by syck 
    • This triggered a minor GC 
  • This object got allocated to somebody else later and everything went downhill from there 

On further analysis of the Syck gem, we came up with the following simple program to reproduce the issue (note that Syck likely has other issues in addition to the one exercised below):


require 'syck'

my_str = <<-'EOF'
---
foo: &foo_ref
bar: *foo_ref
EOF

loop do
  something = Syck::load(my_str)
  # Use Psych instead to avoid the crash:
  # something = Psych::load(my_str)
end

Why was it not crashing earlier? We were storing some columns in our database as YAML (and were using Syck for serializing/deserializing them). This was behind a feature flag, and had only recently been enabled selectively for a few, long after we upgraded Ruby to 2.2.

The fix 

This is an issue of the Syck gem not being updated to stay in sync with the latest Ruby VM GC changes. Specifically, this broke when the Ruby GC was updated as part of the Ruby 2.2 release.

We decided to tackle this with a two-pronged approach:

  1. Since the Syck gem was deprecated, we decided to migrate to the Psych gem (a sketch of the idea follows below). 
  2. Until the Psych migration was done (which involved data fixes), we downgraded the Ruby version to 2.1.9 in this new cluster, giving our customers an error-free experience until we shipped the fix. 
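
The migration itself is conceptually simple. A hypothetical sketch (the model and column names are placeholders, and real data may need fixes for Syck/Psych formatting differences):

require 'syck'
require 'psych'

# Hypothetical: re-serialize a YAML column that was written by Syck, using Psych.
# SomeModel and :serialized_column are placeholders for an ActiveRecord model/column.
SomeModel.find_each do |record|
  data = Syck.load(record.serialized_column)  # read with the legacy engine
  record.update_column(:serialized_column, Psych.dump(data))
end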

Conclusion

We have seen how difficult it is to debug memory corruption. And not just memory mutation: it is also difficult to answer questions like “what request/where/who made this DB call?” or “what request/where/who modified this Redis key?”, etc. Traditional tools demand upfront knowledge of what to watch (memory locations, parameters/values, etc.), which is not always easy to come by. With a tool like rr, it becomes trivial to answer these questions. If you are trying to answer any such questions, Mozilla rr is something you should definitely consider using.

Acknowledgments

We want to thank the Mozilla rr team, specifically Robert O’Callahan, for working with us and AWS to make rr work on EC2. Pernosco provides an even better, “omniscient” debugging experience on top of rr; we haven’t tried it yet, but you may want to check it out.

We also want to thank csfrancis for this gist, which formed the initial basis for our gdb_ruby extension.