Methods
We compared Wise.io ParaText against seven other CSV readers commonly used for data science: Pandas read_csv, R's built-in read.csv, readr for R, fread for R, NumPy’s text loader, DataBricks SparkCSV, and Dato SFrame read_csv.
Software | Version | Installation Method |
---|---|---|
ParaText | 0.1.1 | python setup.py build install |
Pandas read_csv | 0.18.0 | bundled with Anaconda |
R read.csv | 3.0.2 | bundled with R |
readr read_csv | 0.2.2 | installed via CRAN |
data.table fread | 1.9.6 | installed via CRAN |
NumPy loadtxt | 1.10.4 | bundled with Anaconda |
DataBricks SparkCSV | com.databricks:spark-csv_2.11:1.4.0 | bundled with Spark |
Dato SFrame.read_csv | 1.9 | pip install sframe |
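
For readers who want to run a similar comparison, here is a minimal sketch of a timing harness. It is not the procedure used for the results in this post: the helper names `time_reader`, `stdlib_csv_reader`, and `write_sample_csv` are hypothetical, the best-of-N policy is an assumption, and the sketch targets Python 3 rather than the Python 2.7 listed here. Any callable that takes a file path can be timed, so a reader from the table above (for example `pandas.read_csv`) could be passed in place of the standard-library reader.

```python
import csv
import time


def time_reader(reader, path, repeats=3):
    """Return the best wall-clock time in seconds of reader(path) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)  # the result is discarded; only the load time is measured
        best = min(best, time.perf_counter() - start)
    return best


def stdlib_csv_reader(path):
    """Baseline reader: parse the whole file with Python's built-in csv module."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def write_sample_csv(path, rows):
    """Write a small numeric CSV (one header line plus `rows` data lines) for a smoke test."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["a", "b", "c"])
        for i in range(rows):
            writer.writerow([i, i * 0.5, i * 2])
```

Taking the minimum over a few runs reduces noise from other processes, but repeated runs mostly measure warm-cache reads because the operating system keeps the file in its page cache. A cold-cache measurement needs the cache dropped between runs.
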
We also compared ParaText against 5 binary readers of numeric and data frame data.
Format | Software | Version | Installation Method |
---|---|---|---|
HDF5 | h5py | 2.5.0 | Anaconda |
Feather | feather | 0.2.0 | pip install feather-format |
NPY | NumPy | 1.10.4 | Anaconda |
Pickle | Python | 2.7.11 | Anaconda |
Additionally, the following software was required:
Software | Version | Installation Method |
---|---|---|
Ubuntu | 14.04 | Launch ami-fce3c696 via AWS |
Spark | PySpark 1.6.1, Hadoop 2.6 | Download binary |
Anaconda | 4.0.0 | Download binary |
NumPy | 1.10.4 | Anaconda |
mdadm | 3.2.5 | apt-get install mdadm |
g++ | 4.8.4 | apt-get install g++ |
python | 2.7.11 | Anaconda |
R | 3.0.2 | bundled with standard AWS Ubuntu AMI |
SWIG | 2.0.11 | apt-get install swig |
libcurl | 2.0.11 | apt-get install libcurl4-openssl-dev |
Java (OpenJDK) | 1.7.0_101 | apt-get install openjdk-7-jre |
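
When reproducing a setup like this, it helps to record which versions are actually installed so they can be checked against the tables above. The following is a small sketch of how that might be done from Python. The function name `describe_environment` and its default module list are assumptions for illustration, and it only covers Python packages that expose a `__version__` attribute, not system tools such as mdadm or g++.

```python
import importlib
import platform


def describe_environment(modules=("numpy", "pandas", "h5py")):
    """Return a dict mapping component names to the versions found on this machine."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "not installed"
        else:
            # Most packages expose __version__; fall back if one does not.
            report[name] = getattr(module, "__version__", "unknown")
    return report
```

Printing the returned dictionary next to the benchmark results makes it easier to tell later whether a timing difference comes from the reader or from a changed library version.
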