Having too much data is a problem every data practitioner wishes to have, but it is a problem nonetheless.
A single solution might not fit all requirements, so different solutions can work in different scenarios. This notebook aims to introduce and describe some of these techniques. We are going to use the Riiid! Answer Correctness Prediction dataset in our experiments, since it has over 100 million rows and 10 columns and should result in an out-of-memory error when read with the basic pd.read_csv.
Different packages have their own ways of reading data. These are the methods we are going to explore :
It is also very common to convert the dataset into another format which uses less disk space and/or can be read faster on subsequent reads.
Before exploring the various methods, let's first confirm that reading the dataset with the default pandas settings fails.
It should result in a memory error, confirming that the pandas defaults don't work.
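A minimal sketch of that check, assuming the CSV lives at `input/riiid_train.csv` (the path is a placeholder for your setup):

```python
import pandas as pd

def try_default_read(path):
    """Attempt a plain pd.read_csv and report whether it fit in memory."""
    try:
        df = pd.read_csv(path)  # default dtype inference, no hints
        return df.shape         # only reached if the frame fit in RAM
    except MemoryError:
        return None             # ~100M rows with inferred dtypes overflows

# try_default_read("input/riiid_train.csv") is expected to hit MemoryError
# on a typical machine; a small file returns its (rows, cols) shape instead.
```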
Pandas is probably the most popular way of reading datasets; it offers a lot of options, flexibility and functions for reading and processing data.
One of the challenges of using pandas for large datasets is its conservative nature when inferring the data types of the columns, which often results in unnecessarily large memory usage for the pandas dataframe. You can pre-define optimal data types for the columns (based on prior knowledge or sample inspection) and provide them explicitly while reading the dataset.
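A sketch of the idea on a small synthetic sample; the dtype mapping below is illustrative, not the official Riiid schema:

```python
import numpy as np
import pandas as pd

# Hypothetical dtype hints you might pass for the Riiid columns:
dtypes = {
    "user_id": "int32",
    "content_id": "int16",
    "answered_correctly": "int8",
}

# Compare memory usage with and without the hints on synthetic data.
sample = pd.DataFrame({
    "user_id": np.arange(1000),                      # inferred as int64
    "answered_correctly": np.random.randint(0, 2, 1000),
})
default_mem = sample.memory_usage(deep=True).sum()
slim = sample.astype({"user_id": "int32", "answered_correctly": "int8"})
slim_mem = slim.memory_usage(deep=True).sum()
print(default_mem, slim_mem)  # the down-cast frame is several times smaller

# The same mapping can be passed straight to the reader:
# data = pd.read_csv("input/riiid_train.csv", dtype=dtypes)
```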
Datatable (heavily inspired by R’s data.table) can read large datasets fairly quickly and is often faster than pandas. It is specifically meant for data processing of tabular datasets with emphasis on speed and support for large sized data.
The command line to install datatable is :
- With internet :
pip install datatable==0.11.0 > /dev/null
- With a wheel file :
pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null
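Once installed, `dt.fread` is the entry point. The sketch below builds a tiny stand-in CSV so it is self-contained; with the real data you would point `fread` at `riiid_train.csv`. The import is guarded in case datatable is not installed:

```python
import csv

# Tiny stand-in CSV (illustrative column names).
with open("tiny.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user_id", "answered_correctly"])
    w.writerows([[1, 0], [2, 1]])

try:
    import datatable as dt
    frame = dt.fread("tiny.csv")   # multi-threaded CSV reader
    df = frame.to_pandas()         # convert when the pandas API is needed
    print(frame.shape)
except ImportError:
    print("datatable not installed; see the pip commands above")
```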
File Formats :
It is common to convert a dataset into a format which is easier or faster to read or smaller in size to store. There are various formats in which datasets can be stored though not all will be readable across different packages. Let’s look at how these datasets can be converted into different formats.
The csv format is pretty much the standard format in which datasets are shared; moreover, almost all methods can read data from csv.
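As a baseline, a csv round trip in pandas is a single call each way (the filename here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "answered_correctly": [0, 1]})
df.to_csv("riiid_sample.csv", index=False)  # plain text, widely readable
back = pd.read_csv("riiid_sample.csv")
print(back.equals(df))  # True
```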
It is common to store data in feather (binary) format specifically for pandas. It significantly improves reading speed of datasets.
Reading the dataset from feather :
data = pd.read_feather("input/riiid_train.feather")
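A small round-trip sketch; feather support in pandas is backed by pyarrow, so the call is guarded in case it is missing (the filename is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "answered_correctly": [0, 1]})

try:
    df.to_feather("riiid_sample.feather")      # one-time conversion
    back = pd.read_feather("riiid_sample.feather")
    print(back.equals(df))
except ImportError:
    print("pyarrow is required for feather support")
```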
HDF5 is a high-performance data management suite to store, manage and process large and complex data.
data = pd.read_hdf("input/riiid_train.h5", "riiid_train")
Read more: https://www.hdfgroup.org/solutions/hdf5
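A minimal write/read sketch; HDF5 support in pandas is provided by PyTables (the `tables` package), so the call is guarded in case it is missing (filename and key are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "answered_correctly": [0, 1]})

try:
    df.to_hdf("riiid_sample.h5", key="riiid_train", mode="w")
    back = pd.read_hdf("riiid_sample.h5", "riiid_train")
    print(back.equals(df))
except ImportError:
    print("install tables (PyTables) for HDF5 support")
```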
Datatable uses the .jay (binary) format, which makes reading datasets blazingly fast.
data = dt.fread("input/riiid_train.jay")
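A sketch of the one-time conversion to .jay, again on a tiny self-contained stand-in and with the datatable import guarded; with the real data you would `fread` riiid_train.csv once and save it as .jay:

```python
import csv

with open("tiny_jay_src.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user_id", "answered_correctly"])
    w.writerows([[1, 0], [2, 1]])

try:
    import datatable as dt
    frame = dt.fread("tiny_jay_src.csv")
    frame.to_jay("riiid_sample.jay")     # one-time conversion
    fast = dt.fread("riiid_sample.jay")  # subsequent reads are near-instant
    print(fast.shape)
except ImportError:
    print("datatable not installed; see the pip commands above")
```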
In the Hadoop ecosystem, parquet was popular as the primary file format for tabular datasets, and it is now used extensively with Spark. It has become more accessible and efficient over the years and is also supported by pandas.
data = pd.read_parquet("input/riiid_train.parquet")
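A small round-trip sketch; parquet support in pandas requires pyarrow (or fastparquet), so the call is guarded (the filename is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "answered_correctly": [0, 1, 1]})

try:
    df.to_parquet("riiid_sample.parquet")  # one-time conversion
    back = pd.read_parquet("riiid_sample.parquet")
    print(back.equals(df))
except ImportError:
    print("install pyarrow (or fastparquet) for parquet support")
```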
Python objects can be stored in the form of pickle files and pandas has inbuilt functions to read and write dataframes as pickle objects.
data = pd.read_pickle("input/riiid_train.pkl.gzip")
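A round-trip sketch with no extra dependencies; note that `.gzip` is not one of the extensions pandas infers compression from (those are `.gz`, `.bz2`, `.zip`, `.xz`), so the compression is passed explicitly to match the filename used above:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "answered_correctly": [0, 1]})

# Explicit compression, since ".gzip" is not auto-inferred.
df.to_pickle("riiid_sample.pkl.gzip", compression="gzip")
back = pd.read_pickle("riiid_sample.pkl.gzip", compression="gzip")
print(back.equals(df))  # True
```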
Each method has its own set of pros and cons. Some examples are:
- Pandas requires a lot more RAM to handle large datasets.
- Dask can be slow at times especially with transformations that cannot be parallelized.
- Datatable doesn’t have a very exhaustive set of data processing functions.
It is recommended to experiment with the various options and pick whichever is most appropriate for your use case. So don't shy away from testing and experimenting.