Data Processing Study Reveals Quick, Economical Way to Process Big Data

Following its study of several Python DataFrame libraries, the financial technology company Code Willing determined that Ray and PySpark were the fastest and most cost-efficient libraries for big data processing when used with CASFS on AWS.

CASFS is the brainchild of Code Willing. Faced with time and cost constraints when processing enormous amounts of data, its data scientists needed a tool for large-scale data analysis that no one else had. So they built it, and CASFS was born.

The study benchmarked five libraries, Pandas, Polars, Dask, Ray, and PySpark, across three dataset sizes:

  • 1 file – 2,524,365 rows x 20 columns
  • 10 files – 23,746,635 rows x 20 columns
  • 100 files – 241,313,625 rows x 20 columns

Single-machine runs used an r5.24xlarge instance with 96 cores and 768 GB of RAM. Cluster runs used ten r5.2xlarge machines with a total of 80 cores and 640 GB of RAM.
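The study's actual benchmark harness is not shown in this summary. As a rough illustration only, a per-library timing loop of this kind typically times one representative workload per run; the sketch below uses pandas on a small synthetic dataset as a stand-in (the column names, workload, and sizes here are hypothetical, not the study's):

```python
import time

import numpy as np
import pandas as pd


def time_workload(df: pd.DataFrame) -> float:
    """Time a representative group-by aggregation on a DataFrame."""
    start = time.perf_counter()
    df.groupby("key")["value"].mean()
    return time.perf_counter() - start


# Synthetic stand-in for one input file (the study's files had 20 columns
# and millions of rows; this is deliberately tiny so it runs anywhere).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=100_000),
    "value": rng.random(100_000),
})

elapsed = time_workload(df)
print(f"pandas group-by took {elapsed:.4f}s")
```

In a real comparison, the same workload would be re-expressed in each library's API (Polars, Dask, Ray, PySpark) and timed on identical hardware, which is what makes the per-library numbers comparable.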

These tests showed that Ray and PySpark offered the best combination of performance and memory efficiency for big data processing.

While Ray and PySpark proved best for large-scale data processing, Code Willing’s study also identified where the remaining libraries fit best in the data analysis workflow.

For the study’s in-depth findings, head over to their website and review the full report here.

Link to Dalton’s whitepaper: