Twitter ETL Project with Apache Airflow and AWS

This project demonstrates an ETL pipeline that extracts tweets from X (formerly Twitter), transforms them into structured data, and loads them to an AWS S3 bucket. The entire process is orchestrated using Apache Airflow running on an AWS EC2 instance.

The pipeline runs entirely on a Free Tier EC2 instance (Ubuntu AMI, t2.micro) with an added 2 GiB swap file to compensate for the instance's 1 GiB of RAM.
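On Ubuntu, the swap file mentioned above can be created with standard commands; this is a sketch (requires root), not the exact commands used in the project:

```shell
# Allocate a 2 GiB swap file and restrict access to root
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile

# Format it as swap and enable it
sudo mkswap /swapfile
sudo swapon /swapfile

# Persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Verify: the swap line should show ~2 GiB
free -h
```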

Category: Airflow

Client: HealthWell Inc.

Date: Jul 14, 2024

Project Duration: 2 weeks

Architecture

  1. Extraction: Tweets are retrieved from the X API using Tweepy

  2. Transformation: Data is structured into a Pandas DataFrame with fields: user, text, like_count, retweet_count, created_at

  3. Loading: Data is saved to an S3 bucket (rume-airflow-bucket) using s3fs

  4. Orchestration: The ETL process is scheduled and managed with Airflow DAGs
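The transformation step (2) can be sketched as a small Pandas function. The output field names match those listed above; the input key names (`favorite_count`, `screen_name`) are assumptions based on the classic Twitter API response shape, and `raw_tweets` stands in for the objects returned by Tweepy:

```python
import pandas as pd

def transform_tweets(raw_tweets):
    """Flatten raw tweet dicts into the DataFrame schema used by the pipeline."""
    records = [
        {
            "user": t["user"]["screen_name"],
            "text": t["text"],
            "like_count": t["favorite_count"],
            "retweet_count": t["retweet_count"],
            "created_at": t["created_at"],
        }
        for t in raw_tweets
    ]
    return pd.DataFrame(records)

# Example with a stubbed tweet (Tweepy returns richer objects)
sample = [{
    "user": {"screen_name": "healthwell"},
    "text": "hello",
    "favorite_count": 3,
    "retweet_count": 1,
    "created_at": "2024-07-14",
}]
df = transform_tweets(sample)

# Loading (step 3): with s3fs installed, Pandas can write straight to S3,
# e.g. df.to_csv("s3://rume-airflow-bucket/refined_tweets.csv")
# (the object key here is hypothetical).
```

With s3fs on the path, Pandas resolves the `s3://` URL itself, so no explicit boto3 client is needed for the load step.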

Notes

  • This project uses AWS Free Tier for EC2 and S3

  • Swap memory was added to handle t2.micro instance limitations

  • Ensure your Twitter API token has permissions to access the user timeline
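The orchestration described in step 4 of the architecture can be sketched as a minimal DAG. `run_twitter_etl` and the `twitter_etl` module are hypothetical names for the function wrapping the extract-transform-load steps; the daily schedule is also an assumption:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module containing the extract-transform-load function
from twitter_etl import run_twitter_etl

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="twitter_etl_dag",
    default_args=default_args,
    start_date=datetime(2024, 7, 14),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_twitter_etl",
        python_callable=run_twitter_etl,
    )
```

Dropping this file into the Airflow `dags/` folder on the EC2 instance is enough for the scheduler to pick it up.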
