Twitter ETL Project with Apache Airflow and AWS

This project demonstrates an ETL pipeline that extracts tweets from X (formerly Twitter), transforms them into structured data, and loads them to an AWS S3 bucket. The entire process is orchestrated using Apache Airflow running on an AWS EC2 instance.

The pipeline runs entirely on a Free Tier EC2 instance (Ubuntu AMI, t2.micro) with an added 2 GiB swap file to compensate for the instance's 1 GiB of RAM.
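On Ubuntu, the swap file mentioned above can be created with standard commands; this is a sketch (requires root), not the exact commands used in the project:

```shell
# Allocate a 2 GiB swap file and restrict access to root
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile

# Format it as swap and enable it
sudo mkswap /swapfile
sudo swapon /swapfile

# Persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Verify: the swap line should show ~2 GiB
free -h
```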

Category: Airflow

Client: HealthWell Inc.

Date: Jul 14, 2024

Project Duration: 2 weeks

Architecture

  1. Extraction: Tweets are retrieved from the X API using Tweepy

  2. Transformation: Data is structured into a Pandas DataFrame with fields: user, text, like_count, retweet_count, created_at

  3. Loading: Data is saved to an S3 bucket (rume-airflow-bucket) using s3fs

  4. Orchestration: The ETL process is scheduled and managed with Airflow DAGs
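The transformation step (2) can be sketched as a small Pandas function. The output field names match those listed above; the input key names (`favorite_count`, `screen_name`) are assumptions based on the classic Twitter API response shape, and `raw_tweets` stands in for the objects returned by Tweepy:

```python
import pandas as pd

def transform_tweets(raw_tweets):
    """Flatten raw tweet dicts into the DataFrame schema used by the pipeline."""
    records = [
        {
            "user": t["user"]["screen_name"],
            "text": t["text"],
            "like_count": t["favorite_count"],
            "retweet_count": t["retweet_count"],
            "created_at": t["created_at"],
        }
        for t in raw_tweets
    ]
    return pd.DataFrame(records)

# Example with a stubbed tweet (Tweepy returns richer objects)
sample = [{
    "user": {"screen_name": "healthwell"},
    "text": "hello",
    "favorite_count": 3,
    "retweet_count": 1,
    "created_at": "2024-07-14",
}]
df = transform_tweets(sample)

# Loading (step 3): with s3fs installed, Pandas can write straight to S3,
# e.g. df.to_csv("s3://rume-airflow-bucket/refined_tweets.csv")
# (the object key here is hypothetical).
```

With s3fs on the path, Pandas resolves the `s3://` URL itself, so no explicit boto3 client is needed for the load step.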

Notes

  • This project uses AWS Free Tier for EC2 and S3

  • Swap memory was added to handle t2.micro instance limitations

  • Ensure your Twitter API token has permissions to access the user timeline
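The orchestration described in step 4 of the architecture can be sketched as a minimal DAG. `run_twitter_etl` and the `twitter_etl` module are hypothetical names for the function wrapping the extract-transform-load steps; the daily schedule is also an assumption:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module containing the extract-transform-load function
from twitter_etl import run_twitter_etl

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="twitter_etl_dag",
    default_args=default_args,
    start_date=datetime(2024, 7, 14),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_twitter_etl",
        python_callable=run_twitter_etl,
    )
```

Dropping this file into the Airflow `dags/` folder on the EC2 instance is enough for the scheduler to pick it up.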
