Workshop : Big Data Carpentry with Python
Submitted by Saket Bhushan (@saket) on Thursday, 31 August 2017
Automation is not a replacement for manual labour, but an aid to it. At Sosio, we have cut our data processing time by 50% by automating monotonous, simple tasks. While automating mundane tasks speeds up processes and saves time and energy, automation is not always an easy answer. It is often complex, requires human intervention, and, even when set up successfully, needs constant monitoring and review.
There are myriad data pipelining frameworks and libraries available for every use case imaginable. The complexity of handling such diversity in tooling, and the uniqueness of each problem statement, lead to duplicated effort and reinvention of the wheel.
The session will primarily help the audience understand pipeline frameworks, workflow automation, and the relevant Pythonic toolsets that support them. We will go through some common design patterns, tradeoffs, and the available libraries and frameworks for designing such systems. We will focus on the reusability, consistency, availability, idempotency, and scalability of these systems.
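Of the properties listed above, idempotency is perhaps the easiest to show in a few lines. The sketch below is only an illustration under stated assumptions, not code from the workshop: `run_once` and `word_count` are hypothetical names, and the "skip if the output already exists" check is one common way to make a pipeline step safe to rerun.

```python
import json
import tempfile
from pathlib import Path


def run_once(target: Path, compute) -> bool:
    """Run compute() and store its result in target, unless target exists.

    Returns True if the work was done and False if it was skipped.
    (Hypothetical helper, for illustration only.)
    """
    if target.exists():
        return False  # output already present, so a rerun changes nothing
    result = compute()
    # Write to a temporary file and rename it afterwards, so a crash
    # mid-write does not leave a half-written target behind.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(target)
    return True


def word_count():
    # Stand-in for an expensive processing step.
    return {"words": len("big data carpentry with python".split())}


workdir = Path(tempfile.mkdtemp())
target = workdir / "word_count.json"

first = run_once(target, word_count)   # does the work
second = run_once(target, word_count)  # skipped: output already exists
```

Running the step twice leaves the same file on disk as running it once, which is what makes retries after a failure safe.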
We will take up basic data pipelining concepts as well as practical use cases for data pipelines with Python. We will cover some of the popular task and data workflow tools, such as Celery, Luigi, and Airflow, and touch on some overarching concepts in building a data pipeline.
The principles can be applied to archival, warehousing and analytics, and low-latency hot storage data.
We will solve a few example problems during the workshop to make these points concrete. Much of what is presented is based on our experience of trying different libraries and learning the hard way what did not work and what made things easy for us.
By the end of the session, one should be comfortable with:
- Assessing if a pipeline framework is right for your dataset.
- Comparing pipeline tools and writing tasks.
- Parallelising and scaling tasks.
- Approaching data pipelining with a Python toolset.
Specifically, we will be talking about:
- Understanding a queue, constructs of producer and consumer
- Writing and deploying tasks using Celery
- Scaling Celery workers and monitoring with Flower
- First Steps with Dask
- Data pipelines and DAGs
- First steps with Luigi and Airflow
- Custom and Advanced Tasks with Luigi and Airflow
- Pipelines and Spark Streaming - listening to the Twitter stream
- Pipelines and Django Channels - pub/sub and data flow
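The first topic above, a queue with a producer and a consumer, can be sketched with the standard library alone. This is a minimal illustration, not Celery code: the in-process `queue.Queue` stands in for the message broker that sits between producers and workers, and the names `producer`, `consumer`, `run_pipeline`, and `SENTINEL` are assumptions made for this example.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker telling the consumer to stop


def producer(q, items):
    """Put each item on the queue, then signal that no more will come."""
    for item in items:
        q.put(item)  # blocks while the queue is full
    q.put(SENTINEL)


def consumer(q, results):
    """Take items off the queue and process them until the sentinel arrives."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * item)  # stand-in for real work


def run_pipeline(items):
    # A small bound makes the producer wait whenever the consumer lags.
    q = queue.Queue(maxsize=2)
    results = []
    threads = [
        threading.Thread(target=producer, args=(q, items)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


squares = run_pipeline([1, 2, 3, 4])
```

With a single consumer the results keep the order in which items were produced. Adding more consumers would spread the work, but would require one sentinel per consumer and would no longer guarantee ordering.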
Participants are expected to have:
- Intermediate understanding of Python
- Basic understanding of Bash commands
- Basics of deployment and working with remote servers
- Interest in Data and Systems
Saket is the founder of Sosio. Sosio caters to the large-scale data needs of enterprises and non-profits. He has been semi-active at tech conferences, attending and delivering talks across the globe. In his personal capacity, he has introduced Python to more than 500 individuals and conducted training sessions at corporate houses like Oracle. In his previous life, he spent a good chunk of his time optimising computational mechanics algorithms.