Introduction

Data ingestion is often a major bottleneck in data science. For almost 50 years, Comma-Separated Values (CSV) has been the format of choice for tabular data. The industry adopted CSV before parallel computing was mainstream. Given the ubiquity of CSV and the pervasive need to deal with it in real workflows, where speed, accuracy, and fault tolerance are a must, we decided to build a CSV reader that runs in parallel.

The simplicity of CSV is enticing. CSV is conceptually easy to parse and human-readable. Spreadsheet programs and COBOL-era legacy databases can at least write CSV. Indeed, it has become widely used to exchange tabular data. Unfortunately, the RFC standard is so loosely followed in practice that malformed CSV files proliferate. The format lacks a universally accepted schema, so even "proper" CSV files may have ambiguous semantics that each application may interpret differently.

Despite these shortcomings, the CSV format is in common use, and the community therefore needs robust tools to process CSV files. We set out to build ParaText, a fast, memory-efficient, generic multi-core text reader written in C++. Its CSV reader is the first component built on this library infrastructure.

Highlights of ParaText include:

  • the core library is written in C++ but exposes a language-independent interface, so bindings for other languages can easily be added.
  • workers have compact scratch space so the memory footprint is minimal.
  • integer, floating-point, text, and categorical data types are supported.
  • a simple type inference scheme that treats every column as numeric until non-numeric data or quoted numbers are encountered.
  • the narrowest type possible is always used for numbers.
  • fields can span multiple lines.
  • a categorical data encoding scheme that maps repeated strings to integers to keep the memory footprint low. When representing categorical data as a Python sequence of strings, the same string object is reused for identical strings.
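To make the type inference and categorical encoding ideas above concrete, here is a minimal Python sketch of both: a column starts as numeric and is demoted only when a non-numeric field appears, and repeated category strings are mapped to small integers. This is an illustration of the scheme, not ParaText's actual C++ implementation.

```python
def infer_column(values):
    """Infer the narrowest type for a column of raw CSV fields.

    Every column starts under the narrowest numeric assumption (int)
    and is widened to float, then demoted to text, only when a field
    forces it -- a simplified version of the scheme described above.
    """
    kind = "int"
    for v in values:
        if kind == "int":
            try:
                int(v)
                continue  # still fits the narrowest type
            except ValueError:
                kind = "float"
        if kind == "float":
            try:
                float(v)
                continue  # widened, but still numeric
            except ValueError:
                kind = "text"
        if kind == "text":
            break  # no further demotion possible
    return kind


def intern_categories(values):
    """Encode repeated strings as small integers.

    Returns (codes, levels): one integer per row plus the list of
    distinct strings, so each distinct string is stored only once.
    """
    codes, mapping = [], {}
    for v in values:
        codes.append(mapping.setdefault(v, len(mapping)))
    return codes, list(mapping)
```

Reusing one object per distinct string, as ParaText does when handing categorical columns to Python, keeps the footprint proportional to the number of levels rather than the number of rows.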

ParaText returns an iterator over the columns of a loaded data set. This iterator can be used to populate many popular data structures:

  • Python dictionary of arrays (DictFrame)
  • Pandas DataFrames
  • Dato SFrames
  • Spark DataFrames

The iterator frees a column's memory after it has been visited. This ensures the peak memory footprint is as small as possible.
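The pattern behind this can be sketched in a few lines of Python. The generator below is a hypothetical stand-in for ParaText's column iterator (the real bindings differ in names and signatures): it hands out one column at a time and drops its own reference so the memory can be reclaimed as soon as the consumer moves on.

```python
def column_iterator(columns):
    """Yield (name, data) pairs one column at a time.

    Popping each column before yielding releases the iterator's own
    reference, so once the consumer is done with a column its memory
    can be freed -- keeping the peak footprint near one column's worth.
    Hypothetical sketch; not ParaText's actual API.
    """
    for name in list(columns):
        yield name, columns.pop(name)


# Populate a DictFrame-style dict of arrays from the iterator.
source = {"a": [1, 2], "b": [3, 4]}
frame = {}
for name, data in column_iterator(source):
    frame[name] = data
```

The same loop body could instead insert each column into a Pandas DataFrame or any of the other target structures listed above.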

We performed extensive benchmarks of ParaText against commonly used CSV readers for Python and R, including Pandas read_csv, SparkCSV, Dato SFrame read_csv, R read.csv, readr's read_csv (R), data.table's fread (R), and NumPy loadtxt. ParaText is competitive with binary readers such as HDF5, NPY, Feather, and Pickle, which do not need the extra steps of parsing file data, inferring types, or checking for malformed input. We compare performance with and without warm filesystem and disk caches, and we place strong emphasis on ensuring our benchmarks are reproducible.

First, we show that ParaText approaches the fundamental capabilities of the storage system. Second, ParaText often outperforms other CSV readers. Third, ParaText has a compact memory footprint. Fourth, Dato's and Spark's approach of interacting with Pandas by passing serialized messages has serious performance issues that could affect the interactive data science experience. Fifth, ParaText is competitive with binary readers such as HDF5, NPY, and Feather, which do minimal-to-no parsing.

Full details of how to reproduce the benchmarks are in the Appendices.
