Methods
We compared Wise.io ParaText against 7 other CSV readers commonly used for data science: Pandas read_csv, R's built-in read.csv, readr for R, data.table's fread for R, NumPy's loadtxt, DataBricks SparkCSV, and Dato SFrame read_csv.
| Software | Version | Installation Method |
|---|---|---|
| ParaText | 0.1.1 | python setup.py build install |
| Pandas read_csv | 0.18.0 | bundled with Anaconda |
| R read.csv | 3.0.2 | bundled with R |
| readr read_csv | 0.2.2 | installed via CRAN |
| data.table fread | 1.9.6 | installed via CRAN |
| NumPy loadtxt | 1.10.4 | bundled with Anaconda |
| DataBricks SparkCSV | com.databricks:spark-csv_2.11:1.4.0 | bundled with Spark |
| Dato SFrame.read_csv | 1.9 | pip install sframe |
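
To make the comparison concrete, here is a minimal sketch of how the load time of a single CSV reader might be measured. It is not the benchmark harness used for these results: the helper names (`time_reader`, `stdlib_csv_reader`, `write_sample_csv`) are hypothetical, Python's standard `csv` module stands in for the readers listed above, and the sketch assumes Python 3 rather than the Python 2.7.11 used in the study.

```python
import csv
import os
import tempfile
import time


def time_reader(reader, path, repeats=3):
    """Return the best wall-clock time in seconds of reader(path) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - start)
    return best


def stdlib_csv_reader(path):
    """Stand-in reader (assumption): parse the whole file with the csv module."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def write_sample_csv(path, rows=1000, cols=5):
    """Write a small numeric CSV with a header row so the sketch runs end to end."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["c%d" % i for i in range(cols)])
        for r in range(rows):
            writer.writerow([r * cols + c for c in range(cols)])


if __name__ == "__main__":
    fd, sample_path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        write_sample_csv(sample_path)
        seconds = time_reader(stdlib_csv_reader, sample_path)
        print("csv module: %.4f s" % seconds)
    finally:
        os.remove(sample_path)
```

Any of the readers in the table could be passed in place of `stdlib_csv_reader`, for example a one-argument wrapper around `pandas.read_csv`. Taking the best of several runs reduces noise from the operating system's file cache, although a cold-cache measurement would need the cache to be cleared between runs.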
We also compared ParaText against the following binary readers of numeric and data frame data.
| Format | Software | Version | Installation Method |
|---|---|---|---|
| HDF5 | h5py | 2.5.0 | Anaconda |
| Feather | feather | 0.2.0 | pip install feather-format |
| NPY | NumPy | 1.10.4 | Anaconda |
| Pickle | Python | 2.7.11 | Anaconda |
Additionally, the following software was required:
| Software | Version | Installation Method |
|---|---|---|
| Ubuntu | 14.04 | Launch ami-fce3c696 via AWS |
| Spark | PySpark 1.6.1, Hadoop 2.6 | Download binary |
| Anaconda | 4.0.0 | Download binary |
| NumPy | 1.10.4 | Anaconda |
| mdadm | 3.2.5 | apt-get install mdadm |
| g++ | 4.8.4 | apt-get install g++ |
| python | 2.7.11 | Anaconda |
| R | 3.0.2 | bundled with standard AWS Ubuntu AMI |
| SWIG | 2.0.11 | apt-get install swig |
| libcurl | 2.0.11 | apt-get install libcurl4-openssl-dev |
| Java (Open JDK) | 1.7.0_101 | apt-get install openjdk-7-jre |
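
When reproducing a setup like this, it can help to record the versions actually installed and compare them with the tables above. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`). The function names `installed_version` and `environment_report` are hypothetical, and the package names in the usage example are illustrative.

```python
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a Python package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages):
    """Collect the interpreter version, platform string, and package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    # Illustrative package names; adjust to the readers being benchmarked.
    for key, value in environment_report(["numpy", "pandas", "h5py"]).items():
        print("%s: %s" % (key, value))
```

This covers only Python packages. System tools such as g++, SWIG, or mdadm would need to be checked separately, for example through their own `--version` flags.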