Conclusion

ParaText is a fast, memory-efficient reader of text files for multi-core machines, with interfaces for C++ and Python. ParaText uses multiple threads to put otherwise unused I/O bandwidth to work during text parsing. In our benchmarks, ParaText had higher throughput and lower memory use than every other CSV reader we tested. It parses CSV files at up to 2.5 GB/second in a cold state and up to 4 GB/second out-of-core in a warm state. It can parse and perform simple computations on a 5 TB CSV file in under 30 minutes. In a production deployment hosted on AWS, it is substantially cheaper than every other CSV reader. We invite the community to contribute to ParaText by adding support for more text formats (e.g., LIBSVM, ARFF), other text algorithms, and other languages (e.g., Java, R).
