Nic Crane


October 4, 2023

As a package maintainer, I’m constantly disappointed when folks mention Arrow bugs they’re aware of but haven’t reported. Not disappointed with the individual in question, but disappointed with the fact that we’re not at the point where we’ve created an environment in which folks are happy to just report bugs immediately. This is not an Arrow-specific problem, and I find myself behaving in exactly the same way with other open source projects. If I’m not entirely sure something is a bug, I’m not going to risk looking foolish publicly by complaining, but the irony of this is that I don’t judge users who make those mistakes with Arrow, as usually it means we need to improve our docs to be more clear, and that in itself is valuable feedback.

I really love interacting with people who use the package, as without that interaction, development and maintenance can feel like shouting into the void. I like being able to solve problems that make other people’s lives easier, and I thrive off that social energy. I implemented bindings for dplyr::across() because someone commented on Twitter that they’d love to see it as a feature. Last night I got home from a friend’s birthday drinks, saw a user question on Stack Overflow which had an easy win, and within an hour had a pull request up which implemented a new function and fixed the particular issue. I am not promising this level of responsiveness in perpetuity, but I’m still at the point where I find this kind of thing exciting and energising.

One particular bug which has haunted me for the past 6 months, is a particularly irritating one whereby when working with a partitioned CSV dataset in arrow (I’m using lowercase to denote the R package, rather than Arrow the overall project), if the user manually specified a schema, the variable used for partitioning did not appear in the resulting dataset. This is a huge problem IMO - while we can sit here all day talking about the virtues of Parquet, in reality, a lot of our users will be using CSVs, and it’s issues like these that can rule out being able to use arrow at all in some cases.

When I opened the original issue based on another user issue, I knew it was important, but felt a bit stuck. It wasn’t immediately obvious to anyone what the source of the error was. I’d assumed it must be a C++ error and flagged it as such, but nobody had taken a look at it, and I’m always hesitant to mindlessly tag folks on issues when I don’t feel like I’ve done the work to investigate (though to be fair, didn’t really know what “the work” should be in this case).

I’d ended up assuming that this bit of functionality just didn’t work with CSV datasets, and had been working around it, until I was presenting about arrow at New York Open Statistical Programming Meetup, and someone asked about it again. I take 1 user question as representative of 99 other people with the same issue who aren’t being so vocal about it, and felt like it needed to be fixed. I am unashamed to admit that I occasionally have the taste for a bit of melodrama, and publicly declared it to a few of my fellow contributors as “my white whale”, and so set out to find the source, even if it required me to delve deep into the guts of Arrow’s C++ libraries, a task which can often send me down endless rabbit holes and chasing red herrings (this sentence has become quite the nature park…)

My original exploration didn’t result in much useful - the arrow package does some cool things with R6 objects to binding them to Arrow C++ objects, but accessing the inner properties of these bound objects would mean manually creating print methods for every single one of them, and when you don’t know in which class the problem lies, this becomes, frankly, a massive pain in the arse. I still didn’t have enough to go on to take it to an Arrow C++ contributor and ask for their help, but showing I’d done some of “the work” to at least make an effort myself.

And then collaborative debugging saved me! I had a catchup with the fantastic Dane Pitkin, and I asked for his help just walking through the problem. Dane’s main contributions to Arrow have been to Python, though he has a ton of previous C++ experience, even if he isn’t a regular contributor to the Arrow C++ library. I walked through the problem with him, and the steps I’d taken so far to try to figure things out, and the fact that I still needed to figure out if the problem was in R or C++. Dane commented that the object bindings we’d been looking at had little surface area for the problem to be in R - most of them were straight-up mappings from the C++ object to an R6 object with no extension. This was my first big clue! I remembered that there’s a bit of open_dataset() where we do some manual reconfiguration of specified options, which involves a whole load of R code - something I’ll come back to later. Dane also suggested I check out Stack Overflow to see if people were complaining about the issue in C++ there. I was sceptical that I’d find anything - lots of these bugs are more often surfaced in the R and Python libraries - but realised that this wasn’t the dead end that I’d thought. It suddenly occurred to me that if I could reproduce the bug in PyArrow, then the problem must lie in the C++ library, but if I couldn’t, then the problem lay in the R code.

Fifteen minutes later, and I had confirmed it was an R problem. I happened to mention on Slack the problem I was having, steps I’d taken so far to investigate, and potential ideas to look at next, and ended up engaging in a bit more collaborative debugging, this time with the wonderful Dewey Dunnington, who mentioned more disparities between PyArrow and R in terms of how we construct schemas, which put me on the path of testing the schema values at different points in Dataset creation and able to rule that out. At that point, with a smaller problem space to explore, the only logical thing left to look into was the R code which sets up the options for the various Arrow classes, and I ended up spotting the rogue instantiation of CSVReadOptions which just needed to have the partitioning column excluded (it relates to the reading in of the individual files which make up the dataset, and so has no “knowledge” of the partitions, and so previously raised an error as it treated them as an extraneous column).

One pull request later, and the problem that I’d given myself a week to look at had been solved in less than a day! This is probably one of the most gratifying bugs I’d worked on all year; there was a user with a problem to solve, a bug which had been annoying me for ages, the chance to fall into the puzzle-like aspects of debugging, and some great opportunities for collaboration with folks whose help here I really appreciated. This is one of the things I enjoy most about being a software engineer; this process of starting off feeling entirely clueless about something, and having to work out where I need to be and how I’m going to get there, and then doing it. Actually, in the abstract, that’s probably one of the things I enjoy most about being a human :)