Damian Eads, Wise.io, Inc. (rev 3 June 2016)

Abstract

Any modern data workflow requires reading data from disk. Despite extensive progress in distributed databases and filesystems, there is still a strong need to rapidly read text files stored on a single machine. Yet few modern text file readers take advantage of multi-core architectures, which leaves much of the bandwidth of high-performance storage systems unused. ParaText, introduced here, reads text files in parallel on a single multi-core machine to consume more of that bandwidth. The alpha release includes a parallel CSV reader with Python bindings, along with some simple parallel text utilities. In our tests on commodity hardware, ParaText loads a CSV file from a cold filesystem into memory at up to 2.5 GB/second, and at up to 2.8 GB/second with a warm filesystem cache. Without storing the data, ParaText parses a CSV file and performs simple out-of-core computations on it at 4.2 GB/second. ParaText uses much less memory than the competing approaches we tested. ParaText can load a 1 TB CSV file in ~7 minutes on a single Amazon EC2 instance with 250 GB of RAM, leaving 100 GB to spare. It takes less than 30 minutes for ParaText to read, parse, and sum the columns of a ~5 TB CSV file. ParaText's strategies for moving data between language layers are designed to be fast relative to common approaches. We benchmarked 10 methods on 11 data sets across 3 different tasks.
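
To make the idea of reading a text file in parallel concrete, here is a minimal Python sketch of one common technique: split the file into byte ranges, align each range to a line boundary, and have each worker sum the columns of the lines that begin in its range. This is an illustration only, not ParaText's implementation or API. The function names (`parallel_column_sums`, `_sum_chunk`) are hypothetical, and the sketch assumes a header row, purely numeric fields, and no quoted fields containing newlines. It uses threads for simplicity; in CPython a process pool would be needed for real parsing speedups.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _sum_chunk(path, start, end, num_cols):
    """Sum each column over the lines that begin in the byte range [start, end)."""
    totals = [0.0] * num_cols
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard through the next newline. This
            # lands on a line boundary without dropping a line that happens
            # to begin exactly at `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break  # end of file
            text = line.decode().strip()
            if not text:
                continue  # skip blank lines
            for i, field in enumerate(text.split(",")):
                totals[i] += float(field)
    return totals


def parallel_column_sums(path, num_workers=4):
    """Return {column name: sum}, computed over byte-range chunks of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.readline()
        data_start = f.tell()  # first byte after the header row
    names = header.decode().strip().split(",")

    # Divide the data region into roughly equal byte ranges, one per worker.
    step = max(1, (size - data_start + num_workers - 1) // num_workers)
    bounds = [(s, min(s + step, size)) for s in range(data_start, size, step)]

    totals = [0.0] * len(names)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(
            lambda b: _sum_chunk(path, b[0], b[1], len(names)), bounds
        )
        # Combine the per-chunk partial sums into one total per column.
        for partial in partials:
            for i, value in enumerate(partial):
                totals[i] += value
    return dict(zip(names, totals))
```

Because every line is claimed by exactly one worker (the one whose range contains the line's first byte), the result does not depend on the number of workers. Nothing but the running totals is kept in memory, which mirrors the "parse and sum without storing the data" style of computation described above.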
