Adventures with the ruby garbage collector in NMatrix

NMatrix is a high-performance linear algebra library for ruby, currently entering its first beta version. Think NumPy, but for, you know, a sensible programming language. A few months ago, I worked on resolving some insidious segfaults resulting from the way we interacted with the ruby garbage collector in C code. I'm still no expert in ruby garbage collection in C modules, but we learned a lot of interesting things from the experience.

I'd meant to blog about this a long time ago and never got around to it, but I was revisiting some of this in preparation for GSoC 2014 (we've been selected as a mentoring organization!). Here's a perhaps meandering description of some of what we learned.

Garbage collection in ruby C modules

Initially I was a bit surprised that I might have to interact with the garbage collector at all from C code. Most of the time, though, it's a fairly pleasant experience, especially if you're passing ruby objects around between a lot of different functions. Interacting with the garbage collector means that you don't need to worry about keeping track of ruby objects and free()ing them when you're done with them.

Normally, if you're creating a ruby object from a custom class that you've defined in C code, you use the macro Data_Wrap_Struct(cls, mark_fct, free_fct, mystruct). In this function's signature,

  • cls is a VALUE (a typedef that is kind of like a pointer to a ruby object most of the time) that is the class object for the resulting object
  • mystruct is a struct that is the C representation of the object.
  • free_fct is a function that the garbage collector calls to free the structure's storage. This is fairly straightforward and just needs to free up any storage in the struct that you previously malloced (or if you're being a good rubyist, that you previously allocated using ruby's macros ALLOC or ALLOC_N).
  • mark_fct is a function that take one of these wrapped up struct and tells the garbage collector how to keep track of (mark) its components.

mark_fct warrants a bit more discussion:

mark_fct is easy if your struct doesn't store any ruby objects; it can just be NULL. It gets slightly more complicated if your struct does store ruby objects internally. Ruby has a mark and sweep garbage collector. mark_fct is called during the garbage collector's mark phase, and its job is to in turn mark all internally stored objects. Ruby provides a number of convenient functions to do this. The simplest is rb_gc_mark(VALUE), which just takes a VALUE and marks it.

We need to make sure that the marking function is associated with a struct via Data_Wrap_Struct right away because ruby garbage collection might potentially happen during any call to the ruby C API. (Actually, I think it may only happen when using the ruby memory allocation macros (or calling functions/using macros that use them), but there might be other cases I'm not aware of.)

This is all fairly straightforward so far and has the awesome (in my opinion) benefit of making some objects in C sort of garbage collected!

Running into problems

Here's a little code snippet that looks like it might cause problems:

//make an array of VALUEs
VALUE* n = ALLOC_N(VALUE, 10);

for (int i = 0; i < 10; i++) {  
    n[i] = rb_yield(INT2FIX(i));
}

// ... do something with n

This is just a simple iterator (perhaps from an implementation of a map method) that yields an index to the calling code and stores the result back in the array n. At any point during the rb_yield, garbage collection might run and collect all our stored results in n because they're not in scope in the ruby code any more, and we haven't wrapped them in a struct with a marking function.

Not a problem! Ruby also does something seriously cool and looks on the stack for things that look like VALUEs or VALUE*s and marks them automatically. So we're still ok.

If this is some other data structure and not a VALUE or VALUE* on the stack, though, the garbage collector may run and collect any VALUEs in that struct. For example:

//make a struct with an array of values
typedef struct holder {VALUE* n;} holder;

holder* h = ALLOC(holder);  
h->n = ALLOC_N(VALUE, 10);

for (int i = 0; i < 10; i++) {  
    h->n[i] = rb_yield(INT2FIX(i));
}

// ... do something with h

Then at some later point when these are used, boom: segfault. These are particularly nice because they're often nondeterministic and depend on when exactly the garbage collector runs, which in turn depends on how much free memory there is on the system.

One solution to this problem might be to wrap whatever data structure has the VALUEs using Data_Wrap_Struct before starting the loop, but if the struct you're constructing isn't finished, you can run into trouble with segfaulting during the marking loop as well.

A more commonly seen solution is just to put a pointer or a value on the stack. So in the last example, this would look like:

//make a struct with an array of values
typedef struct holder {VALUE* n;} holder;

holder* h = ALLOC(holder);  
h->n = ALLOC_N(VALUE, 10);  
VALUE* hn = h->n;  // note extra variable here

for (int i = 0; i < 10; i++) {  
    h->n[i] = rb_yield(INT2FIX(i));
}

// ... do something with h

This works great... until you turn on compiler optimizations. NMatrix is intended to be a high-performance scientific computing library, so we compile with pretty aggressive optimizations. This unfortunately means that it's hard to guarantee that anything is on the stack when you think it is, and in simple cases like this last on, the variable hn will almost certainly be completely removed since it's never used.

Volatile variables

C and C++ have a rather cryptic keyword volatile (note: very different from java's volatile) intended for declaring variables that might be changed by hardware. Essentially, volatile tells the compiler that the value of this variable might be changed by some code or hardware about which it has no knowledge, so it shouldn't do any optimizations that assume when it will and won't be changed. This happens to mean that volatile is also useful for this unintended case:

//make a struct with an array of values
typedef struct holder {VALUE* n;} holder;

holder* h = ALLOC(holder);  
h->n = ALLOC_N(VALUE, 10);  
volatile VALUE* volatile hn = h->n;  // volatile pointer to volatile VALUE just in case!!

for (int i = 0; i < 10; i++) {  
    h->n[i] = rb_yield(INT2FIX(i));
}

// ... do something with holder

Now this works fine: the garbage collector will not collect the array of values being returned from the rb_yield, and thus no segfault!

However, this is not ideal for several reasons:

First, I'm not a huge fan of using volatile for something other than its intended purpose. While it's probably fine, it may mean that it's not doing precisely what we think it is in all cases, and it's also possible that the behavior may change in the future.

Second, we have to add a lot of useless volatile variables everywhere. On the one hand, this isn't really adding a lot of extra code, since we're going to have to add something to fix the problem. On the other, this particular solution massively detracts from the readability of the code. While I was researching this problem in other projects, almost every time I saw someone introduce volatile variables for this reason, I saw someone else comment something to the effect of "wtf is this doing?!? what does this mean?". Moreover, when you have complicated data structures with lots of variables, this ends up being a lot of volatile variables. Both the readability and lots of variables problems could potentially be fixed with liberal use of preprocessor macros.

Third, we're preventing compiler optimizations! Not all of them, but some. This was something we'd hoped to avoid. It's not clear to me how much of a performance hit this actually is. I didn't profile because at the time the code was segfaulting in the relevant places, and I was pretty confused about what was going on, so I didn't have a good baseline. It'd be interesting to go back and see whether this is actually any slower not that I've got a better handle on what's going on.

Registering ruby objects

The solution we arrived at in the end is actually pretty simple: create a static structure wrapped with Data_Wrap_Struct before the code uses any VALUEs. Then any time there's a VALUE being used in a place where it could potentially cause problems, add it to the static structure so the garbage collector can find it, and then remove it when the danger has passed. We implemented this static structure with a linked list acting as a stack (since most of the time VALUEs could be removed in the reverse order that they were added). We then added functions nm_gc_register(VALUE) and nm_gc_unregister(VALUE) that add a VALUE to the stack and remove it, respectively. We also added helper functions for registering more complicated data structures.

This does cause some slowdown in the code because of the extra function calls. At this point, I had figured out enough to be able to profile it. In the places using nm_gc_(un)register the most, computations took a little under 5% longer. In other places, the effect was smaller. Not bad, but something to keep an eye on in case we end up making performance improvements elsewhere that make the registration a more significant time sink.

There's a lot of cases where we just didn't know enough about compiler optimizations to tell whether it was going to be necessary to register a VALUE or not. For instance, is there ever a case where an optimization could occur that would cause one of the parameters of a function not to be on the stack? I have no idea. For these cases, we made a preprocessor macro called NM_CONSERVATIVE() that just deletes whatever is in the parentheses and put the registration/unregistration calls inside the macro. That way, they're not sitting around causing slowdown, but if we start seeing segfaults again, we can quickly redefine the macro and see if any of these cases are to blame. We can even write a script to remove them one at a time and see what particular registration (or lack thereof) is to blame.

If you want to have a look at all of this in action, check out the NMatrix source. The static data structure and helper functions are found in ext/nmatrix/ruby_nmatrix.c. The registration functions and NM_CONSERVATIVE are used all over the place, but that same file has several nice examples of usage too.