One of my goals for next year is to get a deeper understanding of R’s C API. I’m making a start on this by reading Chapter 14 of Deep R Programming by Marek Gagolewski: “Interfacing compiled code”. It’s a great resource, though the chapter preface states “we assume basic knowledge of the C language”. I do not have this knowledge My C knowledge is fairly limited, and so this blog post will pull out some of the details from that chapter, especially bits where I’ve had to go “wtf is that?!” or remind myself by googling. My current level of C knowledge at the moment is pretty minimal; the main things I know are:
- C is a compiled language (as opposed to an interpreted language) and C code needs compiling before it can be run
- A lot of R’s internals are based on C
- C is a statically-type language; this means that variable types are defined when the variable is created and cannot later change other than via explicit manual casting
And that’s basically it! I imagine there are other bits I don’t realise I know which I’ve picked up from being an R package maintainer and dabbling in a few C++ tutorials, but I’ll try to explain everything as much as possible. OK, let’s do this!
The book chapter provides an example package with a simple C function implemented, and walks us through the code step-by-step. Great!
Section 14.1.1 starts off with an example of a C function defined in src/cfunc.h
. I guess the first thing to note is the location of the file - in the package’s src
directory. This is where any compiled code needs to go, typically C or C++ code, or even Fortran if you’re really going old-school.
Header files and source files
Another thing to note here is the file name, which ends in .h
. C code can be divided into header files (ending in .h
) and source files (ending in .c
). Header files contain the function declarations (including variable types) and other things like macros (named bits of code for the pre-processor to work with). They’re sometimes referred to as the interface - they contain information about functions’ inputs and outputs - including the argument names and types.
Once-only includes
The first couple of lines of code in the header file contain these lines:
#ifndef __CFUNS_H
#define __CFUNS_H
and the final line is
#endif
What’s happening here is that, often we can end up with projects containing multiple files, some of which source each other, and include them. We don’t want to end up with duplication of the headers which have been included, otherwise the compiler will raise an error, so we put them in an #ifndef
wrapper, and give them a name. Basically, what we’re saying to the preprocessor here is that if this name hasn’t already been defined, defined it and include this code, but if it’s already defined then skip this.
Includes
The third line in src/cfunc.h
is:
#include <stddef.h>
This allows for the inclusion of a file from the C standard library, which has a few different variables included. The key one for us here is size_t
, which is commonly used for iterating over items in arrays - we need this as we’ll be including a for
loop in the definition of our function.
The preprocessor
Above I casually mentioned the C preprocessor a couple of times without defining it. A succinct and perhaps naive summary is that there are multiple steps in the compilation of C code. One of these phases is preprocessing and it involves things like processing any additional files we’ve said we wanted to include, and replacing macros with their definitions.
Declarations
So, the declaration for the function in the example looks like this
double my_c_sum(const double* x, size_t n);
In words, this means that:
- it is a function which returns an object of type double
- the function name is
my_c_sum
- the first argument is called
x
x
is a pointer to a variable of type doublex
is aconst
variable, which means it won’t be modified in the body of the function- the second argument is called
n
n
is of typesize_t
This concept of a pointer just means that x
contains the memory address of the double that we pass in, rather than a copy of the values in it. This prevents us from copying the values unnecessarily.
Source file
Source files contain the body of the function, sometimes called the implementation.
The code in the book chapter goes on to show the content of the file src/cfuncs.c
. The first line of this file is:
#include "cfuns.h"
This is including the header file we discussed above. The rest of the code in the source file contains the definition of my_c_sum
:
/* computes the sum of all elements in an array x of size n */
double my_c_sum(const double* x, size_t n)
{
double s = 0.0;
for (size_t i = 0; i < n; ++i) {
/* this code does not treat potential missing values specially
(they are kinds of NaNs); to fix this, add:
if (ISNA(x[i])) return NA_REAL; // #include <R.h> */
+= x[i];
s }
return s;
}
The function signature here is identical to how it is defined in the header file. The for
loop uses the n
argument which was passed in to represent the size of the array to loop through. In numerous other languages we’d calculate the size of the array in the body of the function, but in C you cannot have an array of unknown size, and so it must be passed in as a parameter. I think this is to do with how the C compiler allocates memory; more modern C does have the concept of variable-length arrays.
The chapter goes on to discuss further examples which then show how to include a wrapper which can be called by R. I won’t discuss this here, as the text there is all explained well, and the contents are more specific to R’s C API, and not specifically just C-related topics.
Here’s a summary of the C-related topics mentioned here:
- header files and source files
- includes
- once-only includes
- the preprocessor
- variable-length arrays
const
variables- pointers
- statically-typed languages
- compiled and interpreted languages