---
title: "Why Parquet"
description: |
  The file format used to store data directly impacts how that data can be
  used later. Parquet compresses to very small file sizes, is fast to process
  and analyse, and is well integrated into the R and Python ecosystems.
date: "2025-03-10"
categories:
- database
- organise
---

## Context and problem statement

The core aim of Seedcase is to structure and organise data in a modern,
FAIR, and standardised way so that it can be more easily used and shared
later on. How we store the data directly affects how it can be used. There
is a large variety of file formats, so we need to decide on the one that
best fits our needs. So the question is:

*What file format should we use for storing data organised and
structured by Seedcase software?*

## Decision drivers

::: content-hidden
List some reasons for why we need to make this decision and what things
have arisen that impact work.
:::

- The format will mainly be used for storing datasets of varying sizes.
- We don't do "transactional" data processing or data entry, so we don't
  need to worry much about [ACID](https://en.wikipedia.org/wiki/ACID)
  compliance, nor about row-level write speeds.
- Stores the schema within the format itself, and can also handle
[schema evolution](https://en.wikipedia.org/wiki/Schema_evolution).
- Suitable for both local, single-node processing and remote,
distributed processing.
- Has good compressed file sizes.
- Is relatively simple to use and understand for those doing data
  analysis, and integrates easily with software like R and Python.

## Considered options

Based on our needs, we considered the following options:

- [Parquet](https://parquet.apache.org/)
- [Avro](https://avro.apache.org/)
- [SQL (e.g. SQLite)](https://www.sqlite.org/)

Commonly used file formats that we don't consider are:

- CSV, JSON, and other text-based formats are some of the most commonly
  used file formats for data. However, they don't store the data schema,
  don't compress well, and aren't efficient for later analysis, so we
  don't consider them.
- [ORC](https://orc.apache.org) is a format similar to Parquet, but it is
  used mostly within distributed computing and Big Data environments such
  as [Hadoop](https://hadoop.apache.org/) or [Hive](https://hive.apache.org/)
  systems. We don't use these technologies, so we are not considering ORC.
- [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) is a
  common distributed file system used in Big Data environments. We don't
  consider it because it is a file system rather than a file format, both
  Parquet and Avro are natively supported within HDFS, and Hadoop is not a
  common part of the problems we aim to solve.
- Non-embedded SQL engines (like
[Postgres](https://www.postgresql.org/) or
[MySQL](https://www.mysql.com/)) because they require a server to
use and aren't stored as a file on a local filesystem.
- Excel or similar spreadsheet programs, as they aren't technically open
  source, nor do they store a schema or handle larger datasets well.

### Parquet

[Parquet](https://parquet.apache.org) is a column-oriented binary file
format, meaning the data is laid out and stored by "column" rather than by
"row", as it is in spreadsheet-style data formats such as CSV. The Parquet
file format is optimized for compression and storage as well as for very
large-scale data processing and analysis. It is designed to handle complex
data structures and is natively supported by many big data processing
frameworks like [Spark](https://spark.apache.org/) or
[Hadoop](https://hadoop.apache.org/).
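
As a concrete illustration, below is a minimal sketch of writing and reading
a Parquet file from Python with the pandas and pyarrow packages (the file
name and columns are made up for illustration):

```python
import pandas as pd
import pyarrow.parquet as pq

# A small, made-up table of research measurements.
df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "glucose": [5.4, 6.1, 5.8],
    "visit": ["baseline", "baseline", "follow-up"],
})

# Writing to Parquet is a single call; the data is stored column by column
# and compressed (pandas uses the pyarrow engine here, if it is installed).
df.to_parquet("measurements.parquet")

# The schema travels with the file, so it can be read back without any
# separate metadata files.
print(pq.read_schema("measurements.parquet"))
print(pd.read_parquet("measurements.parquet"))
```

The equivalent in R is just as short with the arrow package, via
`arrow::write_parquet()` and `arrow::read_parquet()`.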

:::::::::::: columns
::: column
#### Benefits

- Designed for batch data processing, which is how most research data
analysis is performed.
- Stores the schema within the format itself, and can also handle
schema evolution.
- Of the options considered, it has the best compressed file sizes because
  it stores data by column rather than by row, so similar values are stored
  next to each other and compress well, unlike row-based formats such as
  CSV.
- Integrates very easily into both the R and Python ecosystems for data
  analysis via popular packages like [arrow](https://arrow.apache.org/).
- Is natively supported by the powerful in-process SQL engine
  [DuckDB](https://duckdb.org/), which is a great tool for data analysis
  (a short sketch follows these lists).
- Has plugins for most common data processing tools, including SQLite.
- Can handle complex, nested data structures.
- Designed to handle datasets with many columns relative to rows (though
  still with lots of rows). Much research data falls into this category,
  for instance -omic type data, where there are often many hundreds of
  columns and maybe a few thousand rows.
:::

::: column
#### Drawbacks

- Not particularly well designed for inserting new rows into the
dataset. This isn't strictly an issue for our purposes, since we
don't do "transactional" data processing or entry.
- Is in a binary format, so is not human-readable. This is a drawback
for people who want to quickly look at their data.
- Not very good at row-level scans or lookups, though this isn't a
common use case in research data analysis.
:::
::::::::::::
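
As a concrete illustration of the DuckDB point above, here is a minimal
sketch that runs SQL directly against the made-up Parquet file from the
earlier example:

```python
import duckdb

# DuckDB can query a Parquet file on disk directly, reading only the
# columns that the query actually needs.
result = duckdb.sql("""
    SELECT visit, avg(glucose) AS mean_glucose
    FROM 'measurements.parquet'
    GROUP BY visit
""").df()
print(result)
```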

### Avro

[Avro](https://avro.apache.org/) is a row-oriented binary file format
that was designed to be a compact and fast format for serializing data
within the Hadoop system. It is also designed to be a data exchange
format to more effectively move data between systems.
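
As a rough comparison, here is a minimal sketch of writing and reading Avro
from Python. It assumes the third-party fastavro package (the official avro
package works in a similar way), and the schema and file name are made up
for illustration:

```python
from fastavro import parse_schema, reader, writer

# The schema must be declared up front and is embedded in the file.
schema = parse_schema({
    "name": "Measurement",
    "type": "record",
    "fields": [
        {"name": "participant_id", "type": "int"},
        {"name": "glucose", "type": "float"},
    ],
})

records = [
    {"participant_id": 1, "glucose": 5.4},
    {"participant_id": 2, "glucose": 6.1},
]

# Avro writes record by record (row by row), which is why appending new
# records is cheap compared to Parquet.
with open("measurements.avro", "wb") as out:
    writer(out, schema, records)

with open("measurements.avro", "rb") as infile:
    avro_reader = reader(infile)
    print(avro_reader.writer_schema)  # The schema travels with the file.
    for record in avro_reader:
        print(record)
```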

::::::::: columns
::: column
#### Benefits

- Can handle unstructured data.
- Has the schema stored within the format itself.
- Very good file size compression.
- Better at writing new rows to the dataset than Parquet.
- Has better schema evolution features than Parquet.
- For individual row lookups, Avro is faster than Parquet since it
stores data by row, not column.
:::

::: column
#### Drawbacks

- Not as fast as Parquet at reading data.
- Doesn't have as good compressed file sizes as Parquet because it
stores data by row, not column.
- Isn't as well integrated into common research analysis tools like R
and Python.
:::
:::::::::

### SQL (e.g. SQLite)

SQL is the standard language for relational databases and is used throughout
much of the internet, and [SQLite](https://sqlite.org) is a popular embedded
SQL database used in many applications. SQLite is a file-based database
format, which means that a single file contains the entire database.

::::: columns
::: column
#### Benefits

- Is a well established, classic relational, embedded SQL database
format.
- Very fast at reading and writing data (though not as fast as Parquet
or Avro for reading).
- Can handle unstructured data.
- Integrates with a wide range of software tools.
- Great for row-level scans, insertions, and lookups. However, this
isn't a common use case for our purposes.
:::

::: column
#### Drawbacks

- Is a relational database, which requires some knowledge of SQL to
use.
- It is row-oriented, so isn't as good of a format for processing or
analysing data compared to Parquet.
- Doesn't have as good compressed file sizes as Parquet or Avro.
- Doesn't integrate well with the [Frictionless Data
Package](/why-frictionless-data/index.qmd) standard.
- While it can integrate with R and Python, it does require some additional
  knowledge and code compared to using a file format like Parquet (a short
  sketch follows these lists).
:::
:::::
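
To illustrate the extra steps mentioned in the drawbacks above, here is a
minimal sketch of storing and querying a small, made-up table with SQLite
from Python, using the standard library sqlite3 module together with pandas:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "glucose": [5.4, 6.1, 5.8],
})

# Compared to a single to_parquet()/read_parquet() call, SQLite needs a
# connection, a table name, and some SQL to get the data back out.
with sqlite3.connect("measurements.sqlite") as con:
    df.to_sql("measurements", con, if_exists="replace", index=False)
    back = pd.read_sql("SELECT * FROM measurements WHERE glucose > 5.5", con)

print(back)
```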

## Decision outcome

We've decided to use the Parquet file format for storing our data, as it
best fits our needs: mainly storing and processing large amounts of data
for research purposes.

### Consequences

- Since Parquet is not a text-based format, it is not human-readable.
  This means that people wanting to have a quick look at their data won't
  be able to do so by simply opening the file. A consequence is that
  Parquet will not be ideal for some use cases, so people who have smaller
  datasets may not use our solutions.
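
One small caveat: although a Parquet file can't be opened in a text editor,
a quick peek is still only a couple of lines of code, for example with
DuckDB (the file name is made up for illustration):

```python
import duckdb

# Show the first few rows of a Parquet file without loading it all.
duckdb.sql("SELECT * FROM 'measurements.parquet' LIMIT 5").show()
```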

## Resources used for this post

::: content-hidden
List the resources used to write this post
:::

- [Big Data File Formats
Explained](https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33/)
- [The Architect’s Guide to Data and File
Formats](https://thenewstack.io/the-architects-guide-to-data-and-file-formats/)
- [Big Data File Formats: A Comprehensive
Guide](https://risingwave.com/blog/big-data-file-formats-a-comprehensive-guide/)
- [What are the pros and cons of the Apache Parquet format compared to
other
formats?](https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-the-apache-parquet-format-compared-to-other-format)
- [Performance comparison of different file formats and storage
engines in the Hadoop
ecosystem](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines)
- [Should you use
Parquet?](https://blog.matthewrathbone.com/2019/12/20/parquet-or-bust.html)
- [CSV vs Parquet vs Avro: Choosing the Right Tool for the Right
Job](https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8)
- [Avro vs.
Parquet](https://www.snowflake.com/trending/avro-vs-parquet)