Cloud-Based ETL Pipelines for Social Media Analytics
Authors: Parth Yangandul, Sakshi Soni
Abstract: The rapid expansion of social media has produced massive volumes of user-generated content that offer valuable insights for businesses, researchers, and policymakers. However, extracting, processing, and analysing this data presents challenges in scalability, efficiency, and cost. This research proposes a cloud-based ETL (Extract, Transform, Load) pipeline for large-scale social media data, covering efficient extraction, transformation, and structured storage for further analysis. The study will explore data extraction techniques using the Reddit API, optimizing for rate limits and scalability. The transformation process will involve text cleaning, metadata structuring, and sentiment classification to enhance data quality. For storage, AWS S3, Redshift, and NoSQL databases will be evaluated on performance, query speed, and cost efficiency. To handle real-time and batch processing, the research will implement Apache Spark and compare its effectiveness across different analytics scenarios. Apache Airflow will orchestrate the ETL workflows, Docker will containerize the pipeline components, and Terraform will provision the infrastructure. Performance will be assessed in terms of processing speed, cost, scalability, and accuracy. Additionally, Power BI and Google Data Studio will be used for visualization and reporting. This research aims to provide a scalable, cloud-native ETL solution that enhances social media data analytics, benefiting data engineers, businesses, and researchers.
Index Terms: ETL, Cloud Infrastructure, Social Media Analytics, Data Pipelines, Automation.
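As an illustration of the transformation stage outlined in the abstract (text cleaning, metadata structuring, and sentiment classification), a minimal Python sketch is given below. It is not the proposed implementation: the function names, the output schema, and the tiny word lists are hypothetical, the input fields (id, subreddit, title, selftext, created_utc, score) are assumed to follow the usual shape of a Reddit post record, and the lexicon-based scorer is only a placeholder for whichever sentiment classifier the pipeline eventually adopts.

```python
import html
import re
from datetime import datetime, timezone

# Patterns used by the (hypothetical) cleaning step.
URL_RE = re.compile(r"https?://\S+")
MARKUP_RE = re.compile(r"[*_~`>#]+")   # common Markdown markers
WHITESPACE_RE = re.compile(r"\s+")

# Placeholder lexicon: an assumption for illustration, not a real model.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "slow", "hate"}


def clean_text(raw: str) -> str:
    """Unescape HTML entities, drop URLs and Markdown markers, normalize case and spacing."""
    text = html.unescape(raw)
    text = URL_RE.sub(" ", text)
    text = MARKUP_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip().lower()


def classify_sentiment(cleaned: str) -> str:
    """Toy lexicon scorer standing in for the sentiment classifier."""
    tokens = re.findall(r"[a-z']+", cleaned)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def transform(post: dict) -> dict:
    """Map one raw post (assumed field names) to a flat, load-ready record."""
    cleaned = clean_text(f"{post.get('title', '')} {post.get('selftext', '')}")
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    return {
        "post_id": post["id"],
        "subreddit": post["subreddit"],
        "created_at": created.isoformat(),
        "score": int(post.get("score", 0)),
        "text": cleaned,
        "sentiment": classify_sentiment(cleaned),
    }
```

Each record returned by transform is a flat dictionary, so a batch of them could be serialized as JSON lines or loaded into a columnar table without further reshaping.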
