When building CASFS+, Code Willing wanted to give clients as much as possible out of the box, without requiring them to download additional extensions or add-ons. One result of this approach is the built-in use of Apache Airflow, one of the most popular open-source tools for building robust ETL pipelines.
When working in the CASFS+ Cloud Workspace, clients have immediate access to Apache Airflow; no configuration or downloads are needed. By using Airflow with CASFS+, users can easily launch multiple nodes in their network, each with its own pre-configured, isolated Airflow instance.
After completing CASFS+, Code Willing began using its own platform to solve problems for its clients. To show how Airflow is used within CASFS+ in practice, here are three situations in which Code Willing combines the two to deliver data ETL solutions to clients:
Data Processing
The data processing Directed Acyclic Graphs (DAGs) in Code Willing's CASFS+ environment are the core of its Airflow ETL pipelines. These DAGs are responsible for extracting, transforming, and loading (ETL) raw client data into processed, usable datasets.
Airflow supports this process by monitoring the CASFS+ file system in real time for the arrival of dependency files and launching the downstream ETL tasks as soon as those files land. This ensures that Code Willing's clients always receive their time-sensitive production data as quickly as possible.
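As a rough illustration only, not Code Willing's actual code, the sketch below shows how a file-driven ETL DAG of this kind could look in Airflow 2. The CASFS+ mount path, dataset names, and transform logic are placeholder assumptions:

```python
# Hypothetical file-driven ETL DAG. Paths, names, and ETL logic are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def transform_and_load(ds, **context):
    """Placeholder ETL step: read the raw file, clean it, write the processed dataset."""
    raw_path = f"/casfs/raw/prices_{ds}.csv"             # illustrative input location
    out_path = f"/casfs/processed/prices_{ds}.parquet"   # illustrative output location
    # ... read raw_path, normalize/clean, write out_path back to the CASFS+ mount ...


with DAG(
    dag_id="casfs_daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",     # Airflow 2.4+ style schedule argument
    catchup=False,
) as dag:
    # Block until the raw input file lands on the CASFS+ mount.
    wait_for_raw_file = FileSensor(
        task_id="wait_for_raw_file",
        filepath="/casfs/raw/prices_{{ ds }}.csv",  # illustrative, templated path
        poke_interval=60,        # check every minute
        timeout=60 * 60 * 6,     # give up after six hours
        mode="reschedule",       # free the worker slot between checks
    )

    run_etl = PythonOperator(
        task_id="transform_and_load",
        python_callable=transform_and_load,
    )

    wait_for_raw_file >> run_etl
```

The sensor-then-task pattern shown here is what makes the pipeline event-driven: the heavy ETL work only starts once the upstream file actually exists, rather than at a fixed time.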
Data Quality
Second, several data quality check DAGs test the output of the Data Processing stage. These checks range from validating the format of the processed data to ensuring that files at each ETL stage arrive on time. In addition, Airflow is configured to send both emails and Slack messages to the support team whenever an alert, such as a failed data quality check or a late file, is raised.
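A minimal sketch of such a quality-check DAG might look like the following. The file path, column names, Slack webhook URL, and alert address are all illustrative assumptions rather than Code Willing's actual configuration:

```python
# Hypothetical data-quality DAG with Slack and email alerting on failure.
from datetime import datetime

import pandas as pd
import requests

from airflow import DAG
from airflow.operators.python import PythonOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def notify_slack(context):
    """Failure callback: post the failing DAG/task to a Slack channel."""
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Data quality check failed: {ti.dag_id}.{ti.task_id}"},
        timeout=10,
    )


def check_processed_file(ds, **context):
    """Raise if the processed dataset is missing columns or empty."""
    path = f"/casfs/processed/prices_{ds}.parquet"   # illustrative path
    df = pd.read_parquet(path)
    required = {"symbol", "timestamp", "price"}      # illustrative schema
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"{path} is missing columns: {missing}")
    if df.empty:
        raise ValueError(f"{path} contains no rows")


with DAG(
    dag_id="casfs_data_quality",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "email": ["support@example.com"],    # placeholder address
        "email_on_failure": True,            # requires SMTP configured for Airflow
        "on_failure_callback": notify_slack, # Slack alert on any task failure
    },
) as dag:
    PythonOperator(
        task_id="check_processed_file",
        python_callable=check_processed_file,
    )
```

Because the check simply raises an exception when data looks wrong, the same failure-callback and email machinery handles every kind of alert uniformly.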
System Health Check Alerts
Finally, Code Willing has also considered the possibility that the entire Airflow service could fail. If this happens, the data quality DAGs discussed above no longer help, because the service can no longer run them to confirm that data is arriving as expected. To cover this case, Code Willing also monitors Airflow externally, regularly polling its health-check API endpoints for errors. As with the data quality checks above, if the Airflow service is deemed to be "down," the Code Willing team receives Slack and email notifications about the issue.
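Because this watchdog must live outside Airflow itself, it can be as simple as a small script run by cron that polls the Airflow webserver's /health endpoint. The sketch below is a hypothetical example with placeholder URLs and addresses, not Code Willing's actual monitoring code:

```python
# Standalone watchdog sketch, run outside Airflow (e.g. via cron), since an
# in-Airflow check cannot report Airflow's own outage. All URLs/addresses are placeholders.
import smtplib
from email.message import EmailMessage

import requests

AIRFLOW_HEALTH_URL = "https://airflow.example.com/health"        # placeholder
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY"   # placeholder
ALERT_EMAIL = "support@example.com"                              # placeholder


def airflow_is_healthy() -> bool:
    """Return True only if the webserver responds and both components report healthy."""
    try:
        resp = requests.get(AIRFLOW_HEALTH_URL, timeout=10)
        resp.raise_for_status()
        health = resp.json()
        return (
            health["metadatabase"]["status"] == "healthy"
            and health["scheduler"]["status"] == "healthy"
        )
    except (requests.RequestException, KeyError, ValueError):
        return False


def send_alerts(message: str) -> None:
    # Slack notification via an incoming webhook.
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    # Email notification via a local SMTP relay (assumed to exist).
    email = EmailMessage()
    email["Subject"] = "Airflow health check failed"
    email["To"] = ALERT_EMAIL
    email["From"] = "airflow-watchdog@example.com"
    email.set_content(message)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(email)


if __name__ == "__main__":
    if not airflow_is_healthy():
        send_alerts(f"Airflow appears to be down or unhealthy at {AIRFLOW_HEALTH_URL}")
```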
Conclusion
Code Willing's goal is to maintain a high level of expertise in the most popular data engineering and analysis tools. Building this expertise allows Code Willing to effectively support clients when they use well-known third-party tools and infrastructure in CASFS+. Leveraging these open-source tools also keeps Code Willing flexible and effective in meeting its clients' needs. In particular, Apache Airflow has been instrumental in Code Willing's ability to manage time-sensitive data ETL pipelines for its clients, so much so that it is now part of the CASFS+ Cloud Workspace suite of tools by default.