Posit DS Lab - Parquet

Today I was on Posit’s DS Lab talking about Parquet! In this blog post, I’ve included the notes I wrote to prep for the session, which cover a lot of what we talked about today.

What is Parquet and why should you care?

Smaller than equivalent CSV files so you can work with and share bigger datasets more easily
Way faster to work with than CSVs
Stores data types so less room for error or custom code needed to convert things when you read it in

How?

Smaller than equivalent CSV - internally uses encoding and compression and is a binary format (made for computers not humans)
Faster - data stored in small pieces and in columns so can read efficiently and operate in parallel
Data types - Parquet files store metadata internally about the data itself and about how it’s been saved

We’ll look at these more closely later!

What packages do I need to work with Parquet files?

Pick one of…

arrow - for all features, including working with multi-file datasets
nanoparquet - for a no-dependency package which has most (but not all) features but can be slower on larger datasets
duckdb - for SQL-oriented workflows
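The duckdb route doesn’t come up again in these notes, so here is a minimal sketch of what it can look like. It assumes the {DBI} and {duckdb} packages are installed; the temporary file and the use of mtcars are stand-ins I’ve picked for illustration, not part of the session.

```r
library(DBI)

# Write a small example file (mtcars is a stand-in for real data)
path <- tempfile(fileext = ".parquet")
arrow::write_parquet(mtcars, path)
# Use forward slashes so the path is safe inside a SQL string on any OS
path <- normalizePath(path, winslash = "/")

# Start an in-memory DuckDB database
con <- dbConnect(duckdb::duckdb())

# DuckDB's read_parquet() table function lets SQL query the file directly
cyl_counts <- dbGetQuery(
  con,
  sprintf(
    "SELECT cyl, COUNT(*) AS n FROM read_parquet('%s') GROUP BY cyl ORDER BY cyl",
    path
  )
)

dbDisconnect(con, shutdown = TRUE)
cyl_counts
```

The query runs inside DuckDB, and only the small aggregated result comes back to R as a data frame.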

Today we’ll focus on {arrow}, with some examples using {nanoparquet}

Using {arrow} vs. using {nanoparquet}

	{arrow}	{nanoparquet}
Dependencies	C++ library (pre-built binaries available)	None
Read/write single files	✅	✅
Multi-file & partitioned datasets	✅	❌
Larger-than-memory data	✅	❌
Remote files (S3, HTTP)	✅	❌
Filter rows before reading	✅	❌
Append to existing files	❌	✅
Full Parquet type support	✅	Most types
Nested types (lists of lists)	✅	❌

The arrow package contains the full functionality, but nanoparquet is great when you have small files and really simple use cases or want to append data to an existing Parquet file.
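Here is a small sketch of the append use case. It assumes a recent version of {nanoparquet} that includes append_parquet(); the file and the two data frames are made up for the example.

```r
# Prefix calls with nanoparquet:: because {arrow} exports
# read_parquet() and write_parquet() under the same names
path <- tempfile(fileext = ".parquet")

first_batch  <- data.frame(id = 1:3, label = c("a", "b", "c"))
second_batch <- data.frame(id = 4:5, label = c("d", "e"))

# Create the file from the first batch
nanoparquet::write_parquet(first_batch, path)

# Add the rows of the second batch to the existing file
nanoparquet::append_parquet(second_batch, path)

combined <- nanoparquet::read_parquet(path)
nrow(combined)
```

The appended data frame needs the same columns and types as the file it is being added to.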

How do you open a Parquet file?

You can use read_parquet() from arrow (or nanoparquet!)

library(arrow)


Attaching package: 'arrow'

The following object is masked from 'package:utils':

    timestamp

library(tibble)

taxi <- read_parquet(
  "https://arrow-datasets.s3.amazonaws.com/nyc-taxi-tiny/year=2019/month=1/part-0.parquet"
)
# Below is the same file but locally
taxi <- read_parquet("taxi-mini.parquet")

# it's read in as a tibble automatically
class(taxi)

[1] "tbl_df"     "tbl"        "data.frame"

taxi

# A tibble: 7,667 × 22
   vendor_name pickup_datetime     dropoff_datetime    passenger_count
   <chr>       <dttm>              <dttm>                        <int>
 1 CMT         2019-01-25 22:15:29 2019-01-25 23:03:20               2
 2 CMT         2019-01-18 14:39:02 2019-01-18 14:45:36               1
 3 VTS         2019-01-24 15:02:06 2019-01-24 15:24:38               1
 4 VTS         2019-01-01 11:23:33 2019-01-01 12:04:09               1
 5 CMT         2019-01-14 13:33:39 2019-01-14 13:46:13               2
 6 CMT         2019-01-26 19:26:48 2019-01-26 19:31:46               1
 7 VTS         2019-01-15 10:03:54 2019-01-15 10:38:38               1
 8 CMT         2019-01-26 10:17:39 2019-01-26 10:22:08               1
 9 VTS         2019-01-09 15:47:31 2019-01-09 15:54:03               1
10 VTS         2019-01-13 16:59:56 2019-01-13 17:14:49               1
# ℹ 7,657 more rows
# ℹ 18 more variables: trip_distance <dbl>, pickup_longitude <dbl>,
#   pickup_latitude <dbl>, rate_code <chr>, store_and_fwd <chr>,
#   dropoff_longitude <dbl>, dropoff_latitude <dbl>, payment_type <chr>,
#   fare_amount <dbl>, extra <dbl>, mta_tax <dbl>, tip_amount <dbl>,
#   tolls_amount <dbl>, total_amount <dbl>, improvement_surcharge <dbl>,
#   congestion_surcharge <dbl>, pickup_location_id <int>, …

How do you save a Parquet file?

Use write_parquet() from arrow (same name in nanoparquet!)

write_parquet(taxi, "new_taxi.parquet")

Size/speed compared to CSVs

What if we’re working with a really big Parquet file? This one is 135MB.

# fs <- S3FileSystem$create(anonymous = TRUE)
# df <- read_parquet(fs$path("arrow-datasets/nyc-taxi/year=2019/month=1/part-0.parquet"))
# write_parquet(df, "taxi.parquet")
taxi_big <- read_parquet("taxi.parquet")
nrow(taxi_big)

[1] 7667255

fs::file_size("taxi.parquet")

135M

If we write it to CSV, how long does it take?

library(tictoc)
tic()
readr::write_csv(taxi_big, "taxi_big.csv")
toc()

28.291 sec elapsed

It took a while last time I tried! And, how big is the CSV?

fs::file_size("taxi_big.csv")

924M

From 135MB to 924MB!

And how long does it take to read the whole file?

tic()
from_csv <- readr::read_csv("taxi_big.csv")

Rows: 7667255 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): vendor_name, rate_code, store_and_fwd, payment_type
dbl  (12): passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_...
lgl   (4): pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_lat...
dttm  (2): pickup_datetime, dropoff_datetime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

toc()

6.914 sec elapsed

tic()
from_parquet <- read_parquet("taxi.parquet")
toc()

0.462 sec elapsed

Waaay faster! And what if we only want data from certain columns?

tic()
just_times <- read_parquet("taxi.parquet", col_select = c(pickup_datetime, dropoff_datetime))
toc()

0.183 sec elapsed

Even faster!
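Column selection has a row-wise counterpart. A sketch, assuming {dplyr} is installed and using mtcars as a stand-in for the taxi data: opening the file with open_dataset() instead of read_parquet() lets you filter and select before anything is pulled into R.

```r
library(arrow)
library(dplyr)

# Small example file; swap in your own Parquet file
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# open_dataset() only sets up a reference to the file;
# filter() and select() are applied by Arrow when collect() runs
six_cyl <- open_dataset(path) |>
  filter(cyl == 6) |>
  select(mpg, cyl) |>
  collect()

six_cyl
```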

Data types in Parquet files

Parquet has its own data types so it can be language-agnostic. In other words, files you write in R can be read in Python, Java, Rust, etc., and vice versa.

The Arrow R package handles the mapping between R types and Parquet types automatically, which means you can read and write data to/from Parquet files in R and it’ll remain the same.

Why do types matter?

1. Ambiguous data

Let’s say I have a data frame containing zip codes. When I first create it, it’s character data in R.

zip_data <- data.frame(state = c("New York", "New Jersey"), zip = c("11213", "07001"))
zip_data

       state   zip
1   New York 11213
2 New Jersey 07001

But what if I write it to a CSV and read it back into R?

write.csv(zip_data, "zip.csv", row.names = FALSE)
read.csv("zip.csv")

       state   zip
1   New York 11213
2 New Jersey  7001

The CSV reader has tried to guess the type and assumed it’s an integer, so we lose the leading 0 from the New Jersey zip.

If we look at the raw CSV file, it’s actually been saved as a string.

"state","zip"
"New York","11213"
"New Jersey","07001"

But CSV readers still have to guess data types when they read the file back in, and getting the right answer can take extra work. Luckily I know that readr has a CSV reader which handles things better.

library(readr)
read_csv("zip.csv")

Rows: 2 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, zip

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 2
  state      zip  
  <chr>      <chr>
1 New York   11213
2 New Jersey 07001

But I’d rather not have to know this! And what if I’m collaborating with people and sharing data? They might read it in with the base R reader, and our data gets messed up.
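For reference, this is the kind of custom code the base R reader needs to get it right: a sketch that declares the column type by hand with colClasses. The temporary file is just for the example.

```r
zip_data <- data.frame(
  state = c("New York", "New Jersey"),
  zip = c("11213", "07001")
)

csv_path <- tempfile(fileext = ".csv")
write.csv(zip_data, csv_path, row.names = FALSE)

# Tell read.csv() not to guess: zip must stay a character column
fixed <- read.csv(csv_path, colClasses = c(zip = "character"))
fixed$zip
```

It works, but everyone who reads the file has to remember to do it.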

With Parquet, storing the data type with the data means that we’re always going to get the correct type.

write_parquet(zip_data, "zip.parquet")
read_parquet("zip.parquet")

# A tibble: 2 × 2
  state      zip  
  <chr>      <chr>
1 New York   11213
2 New Jersey 07001

2. Ordered factors

The diamonds dataset from ggplot2 has cut as an ordered factor. It’s easy to plot cut in order when we load the dataset directly from the ggplot2 package data.

library(ggplot2)
ggplot(diamonds, aes(x = cut)) + geom_bar() + theme_minimal()

But in real life we do a lot more loading of data from files, so what happens if we save the data to a CSV and load it back in before plotting it?

library(readr)

write_csv(diamonds, "diamonds.csv")
diamonds_from_csv <- read_csv("diamonds.csv")

Rows: 53940 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cut, color, clarity
dbl (7): carat, depth, table, price, x, y, z

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ggplot(diamonds_from_csv, aes(x = cut)) + geom_bar() + theme_minimal()

CSV files don’t have any concept of ordered factors - data is just saved as strings, so we’d need to tell R that cut is an ordered factor before plotting it to get it right. Parquet, on the other hand, saves data types, so we can write the data to Parquet and read it back into R and it keeps track of the fact it’s an ordered factor.

library(arrow)

write_parquet(diamonds, "diamonds.parquet")
diamonds_from_parquet <- read_parquet("diamonds.parquet")

ggplot(diamonds_from_parquet, aes(x = cut)) + geom_bar() + theme_minimal()
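If you’d rather check this without looking at a plot, here is a small sketch that tests the round trip directly. The sizes data frame is invented for the example.

```r
library(arrow)

level_order <- c("small", "medium", "large")
sizes <- data.frame(
  size = factor(c("small", "large", "medium"), levels = level_order, ordered = TRUE)
)

csv_path <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")
write.csv(sizes, csv_path, row.names = FALSE)
write_parquet(sizes, parquet_path)

from_csv <- read.csv(csv_path)
from_parquet <- read_parquet(parquet_path)

# Does each round trip keep the ordered factor?
csv_ordered <- is.ordered(from_csv$size)
parquet_ordered <- is.ordered(from_parquet$size)
c(csv = csv_ordered, parquet = parquet_ordered)

# The manual fix the CSV version needs before plotting
from_csv$size <- factor(from_csv$size, levels = level_order, ordered = TRUE)
```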

3. List columns

And what about reading/writing list columns? Here’s one from the starwars dataset from dplyr.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

starwars_subset <- starwars |> select(name, films)
starwars_subset

# A tibble: 87 × 2
   name               films    
   <chr>              <list>   
 1 Luke Skywalker     <chr [5]>
 2 C-3PO              <chr [6]>
 3 R2-D2              <chr [7]>
 4 Darth Vader        <chr [4]>
 5 Leia Organa        <chr [5]>
 6 Owen Lars          <chr [3]>
 7 Beru Whitesun Lars <chr [3]>
 8 R5-D4              <chr [1]>
 9 Biggs Darklighter  <chr [1]>
10 Obi-Wan Kenobi     <chr [6]>
# ℹ 77 more rows

Base R gets us an error

starwars_subset |>
  write.csv("starsub.csv")

Error in `utils::write.table()`:
! unimplemented type 'list' in 'EncodeElement'

read.csv("starsub.csv")

Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'starsub.csv'

  X           name films
1 1 Luke Skywalker    NA

With readr the data is gone.

starwars_subset |>
  write_csv("starsub.csv")

starsub_csv <- read_csv("starsub.csv")

Rows: 87 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
lgl (1): films

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

starsub_csv$films[1]

[1] NA

We could transform it into e.g. a single string separated by semi-colons before writing and split it back out after reading, but with Parquet, it’s preserved.

starwars_subset |>
  write_parquet("starsub.parquet")

starsub_parquet <- read_parquet("starsub.parquet")

starsub_parquet$films[1]

<list<character>[1]>
[[1]]
[1] "A New Hope"              "The Empire Strikes Back"
[3] "Return of the Jedi"      "Revenge of the Sith"    
[5] "The Force Awakens"
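For comparison, the semi-colon workaround for CSV mentioned above could look something like this in base R. It’s only a sketch: the films list is a cut-down example, and it would break if any title contained a semi-colon.

```r
films <- list(
  c("A New Hope", "Return of the Jedi"),
  "The Force Awakens"
)

# Before writing: collapse each character vector into a single string
flattened <- vapply(films, paste, character(1), collapse = ";")

# After reading: split each string back into a character vector
restored <- strsplit(flattened, ";", fixed = TRUE)

identical(restored, films)
```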

How do R types map to Arrow types (and vice versa)?

These docs, written by Danielle Navarro, have excellent information about translation between types:

https://arrow.apache.org/docs/r/articles/data_types.html#list-of-default-translations
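You can also ask arrow directly. A minimal sketch, assuming a version of {arrow} recent enough to have infer_type(); the example data frame is made up.

```r
library(arrow)

example <- data.frame(
  n = 1:2,
  x = c(1.5, 2.5),
  s = c("a", "b"),
  d = as.Date(c("2024-01-01", "2024-01-02"))
)

# infer_type() reports the Arrow type an R vector would be converted to;
# ToString() turns that type into a printable label
arrow_types <- vapply(
  example,
  function(col) infer_type(col)$ToString(),
  character(1)
)
arrow_types
```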

Column-oriented

If we think about how data is stored in memory, it’s in a one-dimensional structure - one value after another.

When we are doing analytics, we’re typically asking questions like, “what’s the mean value of this column”, or “show me only values greater than X”, and so we’re thinking about things in terms of taking data from a column and doing something with it.

If data is stored in a row-oriented format, we need to skip between different places to retrieve values for a column, but if data is stored in a column-oriented format, we can just retrieve the chunk where our data is stored, which is more efficient.
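A toy illustration of the difference in plain R, using lists to mimic the two layouts. This is only a mental model with invented values, not how Parquet or Arrow are implemented.

```r
# Row-oriented: each record keeps all of its fields together
rows <- list(
  list(id = 1L, fare = 10.5, tip = 2.0),
  list(id = 2L, fare = 7.0,  tip = 0.0),
  list(id = 3L, fare = 22.5, tip = 4.5)
)

# To average one field we have to visit every record and pick it out
mean_fare_rows <- mean(vapply(rows, function(r) r$fare, numeric(1)))

# Column-oriented: each field is stored as one contiguous vector
cols <- list(
  id   = c(1L, 2L, 3L),
  fare = c(10.5, 7.0, 22.5),
  tip  = c(2.0, 0.0, 4.5)
)

# The values we need are already sitting next to each other
mean_fare_cols <- mean(cols$fare)

all.equal(mean_fare_rows, mean_fare_cols)
```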

Parquet file chunking

A Parquet file is divided into smaller components.

row groups, which contain…
column chunks, which contain…
pages

The file also contains metadata in its footer
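To see row groups in action, here is a sketch that forces a small file into several of them. It relies on the chunk_size argument of arrow’s write_parquet(), which sets how many rows go into each row group, and uses mtcars (32 rows) as example data.

```r
library(arrow)

path <- tempfile(fileext = ".parquet")

# Ask for at most 10 rows per row group; mtcars has 32 rows
write_parquet(mtcars, path, chunk_size = 10)

# nanoparquet can report how the file was laid out
info <- nanoparquet::read_parquet_info(path)
info$num_rows
info$num_row_groups
```

With 32 rows and 10 rows per group, I’d expect four row groups. In real files you would normally leave the default alone, as tiny row groups add overhead.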

Parquet file footers

The footer is the key to how Parquet files work. It’s the last thing that’s written to the file when saving a file, but the first thing which is read when it’s being read.

It contains:

The schema - the column name/type mapping
Metadata for the row groups - where they are in the file and what encoding/compression is used; this makes it faster to read subsets
Statistics - things like minimum and maximum values for columns, and how many null values - so if you’re filtering, something like Arrow can use this to skip certain sections, or count rows without too much effort
Additional metadata - like the Arrow-equivalent schema, or in Python, pandas metadata
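A quick sketch of putting the footer to use from arrow: open_dataset() gives you the schema and the row count without loading the columns into R. As far as I know both are answered from file metadata, but treat that as my understanding rather than a guarantee. mtcars is example data.

```r
library(arrow)

path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Opening the dataset does not read the data into R
ds <- open_dataset(path)

# Column names and Arrow types
ds$schema

# Number of rows in the file
nrow(ds)
```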

Reading Parquet metadata

You can do this using the nanoparquet package!

nanoparquet::read_parquet_info("taxi-mini.parquet")

# A data frame: 1 × 7
  file_name         num_cols num_rows num_row_groups file_size parquet_version
  <chr>                <int>    <dbl>          <int>     <dbl>           <int>
1 taxi-mini.parquet       22     7667              1    204709               1
# ℹ 1 more variable: created_by <chr>

nanoparquet::read_parquet_schema("taxi-mini.parquet")

# A data frame: 23 × 14
   file_name r_col name  r_type type  type_length repetition_type converted_type
   <chr>     <int> <chr> <chr>  <chr>       <int> <chr>           <chr>         
 1 taxi-min…    NA sche… <NA>   <NA>           NA REQUIRED        <NA>          
 2 taxi-min…     1 vend… chara… BYTE…          NA OPTIONAL        UTF8          
 3 taxi-min…     2 pick… POSIX… INT64          NA OPTIONAL        TIMESTAMP_MIL…
 4 taxi-min…     3 drop… POSIX… INT64          NA OPTIONAL        TIMESTAMP_MIL…
 5 taxi-min…     4 pass… double INT64          NA OPTIONAL        <NA>          
 6 taxi-min…     5 trip… double DOUB…          NA OPTIONAL        <NA>          
 7 taxi-min…     6 pick… double DOUB…          NA OPTIONAL        <NA>          
 8 taxi-min…     7 pick… double DOUB…          NA OPTIONAL        <NA>          
 9 taxi-min…     8 rate… chara… BYTE…          NA OPTIONAL        UTF8          
10 taxi-min…     9 stor… chara… BYTE…          NA OPTIONAL        UTF8          
# ℹ 13 more rows
# ℹ 6 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#   precision <int>, field_id <int>, children <list>

nanoparquet::read_parquet_pages("taxi-mini.parquet")

# A data frame: 44 × 14
   file_name         row_group column page_type       page_header_offset
   <chr>                 <int>  <int> <chr>                        <dbl>
 1 taxi-mini.parquet         0      0 DICTIONARY_PAGE                  4
 2 taxi-mini.parquet         0      0 DATA_PAGE                       34
 3 taxi-mini.parquet         0      1 DICTIONARY_PAGE               1423
 4 taxi-mini.parquet         0      1 DATA_PAGE                    48607
 5 taxi-mini.parquet         0      2 DICTIONARY_PAGE              61281
 6 taxi-mini.parquet         0      2 DATA_PAGE                   108551
 7 taxi-mini.parquet         0      3 DICTIONARY_PAGE             121227
 8 taxi-mini.parquet         0      3 DATA_PAGE                   121278
 9 taxi-mini.parquet         0      4 DICTIONARY_PAGE             124298
10 taxi-mini.parquet         0      4 DATA_PAGE                   128776
# ℹ 34 more rows
# ℹ 9 more variables: uncompressed_page_size <int>, compressed_page_size <int>,
#   crc <int>, num_values <int>, encoding <chr>,
#   definition_level_encoding <chr>, repetition_level_encoding <chr>,
#   data_offset <dbl>, page_header_length <int>
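If you want everything in one go, nanoparquet also has read_parquet_metadata(). A sketch, assuming a recent {nanoparquet} where it returns a named list of data frames; the file is a throwaway example written from mtcars.

```r
path <- tempfile(fileext = ".parquet")
nanoparquet::write_parquet(mtcars, path)

# One call returns the footer contents as a named list
meta <- nanoparquet::read_parquet_metadata(path)

# The components of the list
names(meta)

# One row per row group in the file
meta$row_groups
```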

Disclaimer: prose below here was generated with Claude, but checked by me.

Arrow vs. Parquet vs. Feather

This is the distinction that confuses everyone:

Parquet = a file format for storing data on disk (columnar, compressed, self-describing)
Arrow columnar format = a specification for how columnar data is laid out in memory - designed for zero-copy reads and fast analytics. Multiple implementations exist across languages (C++, Rust, Java, Go, etc.)
Arrow IPC format = a way to write Arrow-formatted data to disk or send it between processes, preserving the in-memory layout. Very fast to read because there’s no decoding step - the data on disk is already in Arrow format
Feather = an older name for Arrow IPC files (V1 is deprecated; V2 = Arrow IPC)
The {arrow} R package = an R interface to the Arrow C++ library, giving you tools to read/write Parquet, Arrow IPC, and CSV files, plus a dplyr backend for larger-than-memory data

In practice: Parquet is the best default for storing and sharing data. Arrow IPC can be faster to read/write but files are larger and less widely supported outside the Arrow ecosystem.
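A small sketch for trying both formats side by side. write_feather() and read_feather() are the {arrow} functions for Arrow IPC (Feather V2) files; mtcars and the temporary paths are example choices, and on a dataset this tiny the file sizes won’t tell you much.

```r
library(arrow)

parquet_path <- tempfile(fileext = ".parquet")
ipc_path <- tempfile(fileext = ".arrow")

# Same data, two on-disk formats
write_parquet(mtcars, parquet_path)
write_feather(mtcars, ipc_path)  # Feather V2, i.e. an Arrow IPC file

from_parquet <- read_parquet(parquet_path)
from_ipc <- read_feather(ipc_path)

# Compare on-disk sizes in bytes
c(parquet = file.size(parquet_path), ipc = file.size(ipc_path))
```

To compare speed, wrap the reads in tic()/toc() as above and use a file big enough for the difference to show.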

What you can and can’t store in Parquet

All standard R types (numeric, integer, character, logical, Date, POSIXct), list columns, and factor levels can be stored in Parquet. Custom R classes are preserved via metadata when round-tripping in R.

Nested data example - list columns survive the round-trip:

nested_data <- tibble(
  species = c("cat", "dog", "bird"),
  sounds  = list(
    c("meow", "purr", "hiss"),
    c("woof", "bark", "growl"),
    c("tweet", "squawk")
  )
)

write_parquet(nested_data, "nested.parquet")
read_parquet("nested.parquet")

# A tibble: 3 × 2
  species            sounds
  <chr>   <list<character>>
1 cat                   [3]
2 dog                   [3]
3 bird                  [2]

What you can’t store:

Arbitrary R objects (ggplot objects, model fits, environments), because everything must map to Parquet’s type system
Custom R class metadata in a portable form: it is R-specific and won’t auto-translate to Python

Partitioning

Partitioning splits a dataset into separate files based on column values. When you filter, Arrow only reads the relevant files.

write_dataset(
  taxi_big,
  path = "taxi-partitioned",
  format = "parquet",
  partitioning = "payment_type"
)

list.files("taxi-partitioned", recursive = TRUE)

[1] "payment_type=Cash/part-0.parquet"         
[2] "payment_type=Credit%20card/part-0.parquet"
[3] "payment_type=Dispute/part-0.parquet"      
[4] "payment_type=No%20charge/part-0.parquet"

taxi_ds <- open_dataset("taxi-partitioned")

taxi_ds |>
  filter(payment_type == "Credit card") |>
  summarise(avg_tip = mean(tip_amount, na.rm = TRUE)) |>
  collect()

# A tibble: 1 × 1
  avg_tip
    <dbl>
1    2.55

Rules of thumb for partitioning:

Partition on columns you frequently filter by
Aim for individual files between 20 MB and 2 GB
Avoid more than ~10,000 partition files
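To check a layout against these rules of thumb, you can list the files and their sizes. A sketch using mtcars partitioned by cyl as a stand-in; with real data you’d point list.files() at your own dataset directory.

```r
library(arrow)

# Write a small partitioned dataset to a fresh temporary directory
ds_dir <- tempfile("mtcars-by-cyl-")
write_dataset(mtcars, ds_dir, format = "parquet", partitioning = "cyl")

files <- list.files(ds_dir, recursive = TRUE, full.names = TRUE)

# One row per file: which partition it belongs to and its size in MB
sizes <- data.frame(
  partition = basename(dirname(files)),
  mb = file.size(files) / 1024^2
)
sizes

# How many files are there in total?
length(files)
```

These example files are far below the 20 MB guideline, which is exactly the sort of thing this check is meant to catch.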

Bonus: analysing with Arrow

You can use dplyr verbs on an Arrow dataset - the work happens in Arrow, not R, until you collect().

taxi_ds <- open_dataset("taxi-partitioned")

taxi_ds |>
  filter(trip_distance > 0, fare_amount > 0) |>
  group_by(payment_type) |>
  summarise(
    n = n(),
    avg_fare = mean(fare_amount, na.rm = TRUE),
    avg_tip = mean(tip_amount, na.rm = TRUE)
  ) |>
  collect()

# A tibble: 4 × 4
  payment_type       n avg_fare  avg_tip
  <chr>          <int>    <dbl>    <dbl>
1 Credit card  5461817     12.5 2.54    
2 Cash         2112585     11.3 0.000315
3 Dispute         7498     13.9 0.00550 
4 No charge      24026     39.3 0.00254
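To see that nothing is computed until collect(), you can look at what the pipeline returns before that step. A self-contained sketch, again with mtcars standing in for the taxi data:

```r
library(arrow)
library(dplyr)

path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Build the query, but don't run it yet
query <- open_dataset(path) |>
  filter(mpg > 15) |>
  group_by(cyl) |>
  summarise(n = n(), avg_mpg = mean(mpg))

# Still a description of the work, not a data frame
class(query)

# collect() runs the query in Arrow and returns the result to R
result <- collect(query)
class(result)
```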

For much more on this, see Scaling Up With R and Arrow.

Where can I read more?

Scaling Up With R and Arrow - free online book
R for Data Science: Arrow chapter
Arrow R package docs
posit::conf(2024) Arrow workshop
nanoparquet