Big Data and Parquet

Efficient data storage with the Parquet format: pros and cons for your Big Data workflows.

Choosing the file format to suit your needs

Selecting the appropriate file format for your data storage and processing needs is one of the most important decisions you have to make when starting a greenfield project with any machine learning functionality.

For typical data processing needs, we can divide the most popular file formats into two categories: row-based formats and column-based formats.

The difference between them is, at first sight, pretty straightforward. The row-based approach is very popular and simple: every row in a file (either a line or multiple lines separated by a special separator) represents a single record in our dataset. With the latter, column-based approach, the data is split into chunks in which the values of a single column are stored together, on a single line (a simplification).

Let’s say we have a dataset with the following three entities (encoded in JSON for readability):

[
  {
    "id":1,
    "name":"Skipper",
    "role":"Leader"
  },
  {
    "id": 2,
    "name": "Kowalski",
    "role": "Strategist"
  },
  {
    "id": 3,
    "name": "Rico",
    "role": "Explosives Specialist"
  }
]

Row-based data layout:
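
A simplified sketch of the three records above, stored one after another (each complete record on its own line, much like CSV):

1,Skipper,Leader
2,Kowalski,Strategist
3,Rico,Explosives Specialist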

Column-based data layout:
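
A simplified sketch of the same records, with all the values of each column stored together:

1,2,3
Skipper,Kowalski,Rico
Leader,Strategist,Explosives Specialist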

At this point, you might ask: what’s the point of keeping data in a columnar way versus good old rows in a file? It turns out there are multiple advantages, which I will try to explain below.

Advantages of Parquet as your column-based format of choice

One of the main reasons column-based formats gained popularity, especially in the data science field, is that data gathered for ML pipelines usually comes from denormalized SQL stores or other data sources and contains many columns (often counted in the hundreds) that need to be converted into a stream of just the information we are interested in. If you want to read a subset of the data, selecting only a small number of fields/columns and additionally filtering them by some value, a columnar format like Parquet stands out from the crowd.
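
As a minimal sketch of such a pruned read in pandas (the file name team.parquet is hypothetical, matching the JSON example above):

import pandas as pd

# Read only the columns we need; the Parquet reader can skip the rest
# of the file instead of parsing every record in full.
df = pd.read_parquet("team.parquet", columns=["id", "name"])

# Filter the pruned result by value.
leaders = df[df["id"] == 1]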

Read optimization, aka column pruning

With your data stored in a column-based file format, you can easily get the data for selected columns without having to read entire rows just to pick them out (as in the row-based approach). This is a huge performance improvement over row-based file formats. Both pandas and Spark dataframes can use column-based formats to speed up reads and query execution. Spark automatically applies column pruning when building an execution plan, so you don’t even have to specify explicitly which columns you will read.
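
To see the pruning happen, here is a small PySpark sketch (again with the hypothetical team.parquet); the ReadSchema entry of the printed plan contains only the selected column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# No explicit column list at read time; a later select() is enough
# for Spark to prune the Parquet scan down to a single column.
names = spark.read.parquet("team.parquet").select("name")
names.explain()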

Improved disk seeks

In the column-based approach, a single column can be spread across multiple blocks on disk, each containing a subset of the data for that column. This greatly improves read speed when searching for data within a given range, because the blocks we are not interested in can simply be skipped.
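
These skipping decisions are driven by the per-block statistics Parquet stores (more on them below). A minimal sketch of inspecting them with pyarrow, assuming the same hypothetical file:

import pyarrow.parquet as pq

meta = pq.ParquetFile("team.parquet").metadata
for i in range(meta.num_row_groups):
    stats = meta.row_group(i).column(0).statistics
    # min/max reveal whether a queried range can occur in this
    # row group at all; non-matching groups are skipped entirely.
    print(i, stats.min, stats.max)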

Schema inclusion and metadata

Although there are row-based formats with a schema, like Avro, having a schema alongside your data is a huge advantage that automates a lot of tedious work during processing. Parquet stores extensive metadata together with your actual data, keeping information about data types, row groupings, and more.

Apart from that, Parquet keeps per-column statistics such as min, max, and null count, which can be used by different frameworks, for example for statistical purposes.
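
Both the schema and the metadata can be inspected without touching the actual data; a short pyarrow sketch (hypothetical file name):

import pyarrow.parquet as pq

# The schema travels with the file: column names and types.
print(pq.read_schema("team.parquet"))

# File-level metadata: row count, number of row groups, and so on.
meta = pq.ParquetFile("team.parquet").metadata
print(meta.num_rows, meta.num_row_groups)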

Complex data types

With Parquet, you can store complex data structures like arrays, dictionaries, or even nested schemas within your columns.
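
A minimal sketch of writing such nested columns from pandas with the pyarrow engine (which maps Python lists and dicts to Parquet’s nested list and struct types; all names are made up):

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "skills": [["leadership", "karate"], ["science"]],  # array column
    "profile": [{"rank": 1, "species": "penguin"},      # nested struct
                {"rank": 2, "species": "penguin"}],
})
df.to_parquet("team_nested.parquet")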

Popularity

The Parquet file format is gaining popularity, especially when compared to traditional uncompressed data storage formats like CSV.

Compression improvements

Compression algorithms work much better when they can find repeating patterns, which is much easier with column-based data, where values of the same type are stored together.

Also, when the file structure allows splitting the data into smaller files, compression can be even more effective.

Compared to simple CSV and other row-based solutions, the stored data is often several times smaller.
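
A quick sketch to check this yourself (file names are made up, and the exact ratio depends heavily on the data):

import os
import pandas as pd

# Highly repetitive data compresses extremely well in Parquet.
df = pd.DataFrame({"role": ["Leader", "Strategist"] * 500_000})

df.to_csv("team.csv", index=False)
df.to_parquet("team.parquet", compression="snappy")  # default codec
df.to_parquet("team_gzip.parquet", compression="gzip")

for name in ("team.csv", "team.parquet", "team_gzip.parquet"):
    print(name, os.path.getsize(name), "bytes")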

Disadvantages

Of course, Parquet, like other column-based solutions, has its drawbacks. You should probably think twice about choosing a column-based solution if you want to read whole records for processing, or if you want to restructure and modify the schema often.

The disadvantages of column-based solutions include:

  • reading whole records is expensive (Parquet attempts to solve this issue with extensive metadata and clever row groups)
  • raw readability: probably nothing beats the ability to load a simple CSV file into an Excel spreadsheet ;)
  • it is not easy to change the schema over time
  • mutability is hard
  • columnar formats need to remember more about where the data is, and because of that they use more memory than a comparable row format
  • the data is spread around, so you need to do more work to gather it all back together into a complete record
  • Parquet shines with large datasets; having a file with a couple of kB of data probably won’t give you any of the aforementioned advantages and can even increase the space taken on disk compared to the CSV solution

Benchmarking

Multiple attempts have been made at benchmarking Parquet with specific data sources and comparing the results with other solutions like CSV or Avro. One of the most popular is the one available on the Cloudera website, where you can find a comparison with the Avro file format. Its conclusion states:

Overall, Parquet showed either similar or better results on every test. The query-performance differences on the larger datasets in Parquet’s favor are partly due to the compression results; when querying the wide dataset, Spark had to read 3.5x less data for Parquet than Avro. Avro did not perform well when processing the entire dataset, as suspected.

Summing up

A row-based approach (like CSV) can be more efficient for queries that must access most of the columns, or that read only a fraction of the rows in the dataset. On the other hand, column-oriented formats shine when you have to execute analytical queries: accessing only a small subset of the available columns/fields, but most of the records.

Looking for more quality tech content about Big Data, Machine Learning or Stream Processing? Join Data Times, a monthly dose of tech news curated by SoftwareMill's engineers.
