User-defined functions in arrow

Author: Nic Crane
Published: May 18, 2021

library(arrow)
library(dplyr)

register_scalar_function(
  name = "add_one",
  function(context, trip_distance) {
    trip_distance + 1
  },
  in_type = schema(
    trip_distance = float64()
  ),
  out_type = float64(),
  auto_convert = TRUE
)
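
Before benchmarking, it can be useful to sanity-check the registered function on a small in-memory Arrow Table. This is a minimal sketch assuming the registration above has already run and that arrow and dplyr are loaded; the column values are made up for illustration:

```r
# Build a tiny Arrow Table with a trip_distance column
tbl <- arrow_table(trip_distance = c(1.5, 2.0, 3.25))

# The registered UDF can be called inside a dplyr verb,
# just like a built-in compute function
tbl |>
  mutate(x = add_one(trip_distance)) |>
  collect()
```

Because auto_convert = TRUE was set during registration, the R function receives ordinary R vectors rather than Arrow Arrays, so the body can use plain vectorised R arithmetic.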
results <- bench::mark(
  compute = open_dataset("~/data/nyc-taxi") |>
    filter(year == 2019, month == 9) |>
    transmute(x = trip_distance + 1) |>
    collect(),
  udf = open_dataset("~/data/nyc-taxi") |>
    filter(year == 2019, month == 9) |>
    transmute(x = add_one(trip_distance)) |>
    collect(),
  iterations = 10,
  check = FALSE
)
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
results
# A tibble: 2 × 13
  expression min        median     `itr/sec` mem_alloc  `gc/sec` n_itr  n_gc
  <bnch_xpr> <bench_tm> <bench_tm>     <dbl> <bnch_byt>    <dbl> <int> <dbl>
1 <language> 0.6321745  0.6625475      1.49   6372712      0.744    10     5
2 <language> 1.3555642  1.4408504      0.690 55933824      7.11     10   103
# ℹ 5 more variables: total_time <bench_tm>, result <list>, memory <list>,
#   time <list>, gc <list>

The query using Arrow’s built-in compute function took about half the time of the one using a UDF. Crucially, significantly more memory was allocated by R when using the UDF, and many more garbage collections were performed, which leads me to conclude that the UDF is being run after the results have been pulled back into R.