Apache Airflow: An Integral Part of the CASFS+ Cloud Workspace

When building CASFS+, Code Willing wanted to provide clients with as much functionality as possible out of the box, without requiring them to download additional extensions or add-ons. One result of this standard is the inclusion of Apache Airflow, one of the most popular open-source tools for building robust ETL pipelines.

When working in the CASFS+ Cloud Workspace, clients immediately have access to Apache Airflow; no configuring or downloading is needed. By using Airflow with CASFS+, users can easily launch multiple nodes in their network, each with its own pre-configured and isolated Airflow instance.

After completing CASFS+, Code Willing put its own platform to work solving problems for its clients. To better understand how Code Willing specifically leverages Airflow in CASFS+, here are three situations in which the company uses Airflow and CASFS+ together to deliver ETL solutions to clients:

Data Processing

The data processing “Directed Acyclic Graphs” (DAGs) in the Code Willing CASFS+ environment are the core of its Airflow ETL pipelines. These DAGs are responsible for extracting, transforming, and loading (ETL) raw client data into processed, usable datasets.

Airflow assists in this process by monitoring the CASFS+ file system in real time for the arrival of dependent files and launching the corresponding tasks as soon as those files land. This ensures that Code Willing’s clients always receive their time-sensitive production data as soon as possible.
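As a rough illustration, a file-triggered ETL DAG of this kind can be built from Airflow’s standard `FileSensor` followed by a processing task. The DAG id, file path, schedule, and transform logic below are illustrative assumptions for the sketch, not Code Willing’s actual pipeline:

```python
# Hypothetical sketch of a file-triggered ETL DAG. The dag_id, filepath,
# and transform logic are illustrative assumptions, not Code Willing's code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def transform_and_load():
    # Placeholder for the real extract/transform/load logic.
    ...


with DAG(
    dag_id="client_etl",            # assumed name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Block until the dependent raw file lands on the CASFS+ file system.
    wait_for_raw = FileSensor(
        task_id="wait_for_raw_file",
        filepath="/casfs/incoming/raw_data.csv",  # assumed path
        poke_interval=30,                         # re-check every 30 seconds
    )

    # Run the ETL step as soon as the file arrives.
    etl = PythonOperator(
        task_id="transform_and_load",
        python_callable=transform_and_load,
    )

    wait_for_raw >> etl
```

The sensor-then-task pattern is what lets the pipeline react to file arrival rather than running blindly on a fixed clock.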

Data Quality

Several data quality check DAGs are implemented to test the output of the Data Processing stage. These checks cover several things, from validating the format of the processed data to ensuring that files in the ETL stage arrive on time. In addition, Airflow has been configured to send both emails and Slack messages to the support team whenever an alert, such as a failed data quality check or a late file, is raised.
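To make this concrete, a quality check of this kind can be as simple as a pair of functions that validate record format and flag late files, which a DAG task would then run (with alerting wired up via Airflow callbacks). The field names and rules below are assumptions for the sketch:

```python
# Hypothetical data quality checks; the field names and rules are
# illustrative assumptions, not Code Willing's actual validation logic.
from datetime import datetime


def check_record_format(record: dict) -> bool:
    """A record passes if it has the expected fields and parseable values."""
    required = {"symbol", "price", "date"}
    if not required.issubset(record):
        return False
    try:
        datetime.strptime(record["date"], "%Y-%m-%d")
        float(record["price"])
    except (ValueError, TypeError):
        return False
    return True


def file_is_late(arrived_at: datetime, deadline: datetime) -> bool:
    """Flag files that land after their expected delivery deadline."""
    return arrived_at > deadline


# In an Airflow DAG, these checks would run inside tasks whose failure
# triggers the Slack message / email alert to the support team.
good = {"symbol": "AAPL", "price": "187.4", "date": "2024-03-01"}
bad = {"symbol": "AAPL", "price": "n/a", "date": "2024-03-01"}
print(check_record_format(good))  # True
print(check_record_format(bad))   # False
```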

System Health Check Alerts

Finally, Code Willing has also considered the possibility that the entire Airflow service could fail. If this happens, the data quality DAGs discussed above become useless, as the service can no longer run them to verify that data is arriving as expected. To cover this situation, Code Willing also monitors Airflow externally, regularly polling its health API endpoint for errors. As with the data quality checks above, if the Airflow service is deemed to be “down,” the Code Willing team receives Slack and email notifications about the issue.
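An external probe like this can be sketched against Airflow’s standard `/health` endpoint, whose JSON reports the status of the metadata database and the scheduler. The probe below is a minimal sketch under that assumption; how the actual monitoring and alerting are wired up is not described in detail by the source:

```python
# Sketch of an external health probe for Airflow's /health endpoint.
# The JSON shape ({"metadatabase": {...}, "scheduler": {...}}) matches
# Airflow's webserver; the alerting hookup is left as an assumption.
import json
from urllib.request import urlopen


def airflow_is_healthy(payload: dict) -> bool:
    """Both the metadata database and the scheduler must report healthy."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def probe(base_url: str) -> bool:
    """Poll the health endpoint; treat any connection error as 'down'."""
    try:
        with urlopen(f"{base_url}/health", timeout=10) as resp:
            return airflow_is_healthy(json.load(resp))
    except OSError:
        return False  # unreachable service counts as down -> raise an alert


# If probe(...) returns False, fire the Slack/email notification.
sample = {"metadatabase": {"status": "healthy"},
          "scheduler": {"status": "unhealthy"}}
print(airflow_is_healthy(sample))  # False
```

Because the probe runs outside Airflow, it still fires even when the scheduler itself is the component that has died.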

Conclusion

Code Willing’s goal is to maintain a high level of expertise in the most popular data engineering and analysis tools. Building this expertise allows Code Willing to effectively support clients who use well-known third-party tools and infrastructure in CASFS+. In addition, leveraging these open-source tools allows Code Willing to be both flexible and effective in meeting its clients’ needs. In particular, Apache Airflow has been instrumental in Code Willing’s ability to manage time-sensitive ETL pipelines for its clients, so much so that it is now part of the CASFS+ Cloud Workspace suite of tools by default.

Data Processing Study Reveals Quick, Economical Way to Process Big Data

Following its study of several Python DataFrame libraries, the financial technology company Code Willing determined that the fastest and most cost-efficient libraries for big data processing while using CASFS on AWS were Ray and PySpark.

CASFS is the brainchild of Code Willing. Faced with time and cost constraints when processing enormous amounts of data, its data scientists needed something no one else had for their large-scale data analysis process. So they built it, and CASFS was born.

The study compared five libraries — Pandas, Polars, Dask, Ray, and PySpark — each processing three separate dataset sizes:

  • 1 file – 2,524,365 rows x 20 columns
  • 10 files – 23,746,635 rows x 20 columns
  • 100 files – 241,313,625 rows x 20 columns

They processed the data on an r5.24xlarge machine with 96 cores and 768 GB of RAM. For the code run on a cluster, they used 10 r5.2xlarge machines with a total of 80 cores and 640 GB of RAM.

Through these individual tests, they determined that Ray and PySpark offered the best performance and memory efficiency for big data processing.
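While the actual code and numbers live in the linked whitepaper, a comparison like this typically reduces to a small timing harness run once per library. The harness below is a generic sketch of that shape, with a stand-in worker instead of any real DataFrame library:

```python
# Generic benchmark-harness sketch. The study's real code and results are
# in the whitepaper; this only illustrates the shape of such a comparison.
import time


def benchmark(label, process, files):
    """Time a library-specific `process(files)` callable; report seconds."""
    start = time.perf_counter()
    result = process(files)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# A real worker would load each file with Pandas/Polars/Dask/Ray/PySpark
# and aggregate; this stand-in just sums row counts to stay self-contained.
def stub_process(files):
    return sum(rows for _, rows in files)


dataset = [("file_1.csv", 2_524_365)]  # row count from the study's 1-file case
total, seconds = benchmark("stub", stub_process, dataset)
print(total)  # 2524365
```

Swapping a different library-specific worker into `process` while holding the harness and datasets fixed is what makes the per-library timings comparable.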

While Ray and PySpark proved best for large-scale data processing, Code Willing’s study also determined where the remaining libraries work best in the data analysis process.

To see the more in-depth findings of the study, you can head over to their website and review the full report at the link below.

Link to Dalton’s whitepaper: https://www.codewilling.com/whitepapers/dataframe.html