Portfolio

Data Engineering Projects

Showcasing scalable data solutions across cloud platforms

01 🚀 GCP E-commerce Data Pipeline

GCP · Pipeline · Personal Project

End-to-end batch processing with medallion architecture

An end-to-end batch-processing data pipeline built on Google Cloud Platform (GCP). The project implements a medallion architecture (Bronze, Silver, and Gold layers), ingesting simulated e-commerce data and transforming it for analytical consumption, and highlights best practices in data governance, scalability, and orchestration.

Key Features

  • Simulated Raw Data Ingestion: Uses Google Cloud Functions to generate raw CSV data and land it in GCS
  • Bronze Layer: Ingests the raw CSV into GCS as Parquet using Dataproc PySpark
  • Silver Layer: Cleans, transforms, and conforms Bronze data into the Silver layer in GCS using Dataproc PySpark (see the PySpark sketch after this list)
  • Gold Layer: Curates data for analytics in BigQuery using dbt (data build tool) for dimensional modeling
  • Orchestration: Leverages Apache Airflow (Google Cloud Composer) for end-to-end pipeline scheduling and management (see the DAG sketch after this list)
  • Visualization: Integrates with Metabase for business intelligence dashboards
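
To make the Bronze-to-Silver step concrete, here is a minimal sketch of what the Dataproc PySpark job could look like; the bucket paths and columns (order_id, order_date, amount) are illustrative placeholders rather than the project's actual schema.

# silver_orders.py - sketch of the Bronze -> Silver Dataproc PySpark job.
# Bucket paths and columns are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

BRONZE_PATH = "gs://example-bronze/orders/"   # hypothetical bucket
SILVER_PATH = "gs://example-silver/orders/"   # hypothetical bucket

spark = SparkSession.builder.appName("bronze_to_silver_orders").getOrCreate()

# Read the raw Parquet landed by the Bronze job.
orders = spark.read.parquet(BRONZE_PATH)

# Cleanse and conform: deduplicate, enforce types, drop invalid records.
silver = (
    orders.dropDuplicates(["order_id"])
          .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
          .withColumn("amount", F.col("amount").cast("double"))
          .filter(F.col("amount") > 0)
)

# Write the conformed table back to GCS, partitioned for downstream reads.
silver.write.mode("overwrite").partitionBy("order_date").parquet(SILVER_PATH)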
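
For orchestration, a simplified sketch of how the Cloud Composer DAG could chain the layers is shown below; the project ID, region, cluster name, and script URIs are placeholders, and the dbt step is reduced to a BashOperator for brevity.

# dags/ecommerce_medallion.py - sketch of the Composer DAG wiring the layers together.
# Project, region, cluster, and GCS URIs are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PROJECT_ID = "example-project"
REGION = "us-central1"

def pyspark_job(uri: str) -> dict:
    """Build a Dataproc PySpark job spec for the given script."""
    return {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": "etl-cluster"},
        "pyspark_job": {"main_python_file_uri": uri},
    }

with DAG(
    dag_id="ecommerce_medallion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    bronze = DataprocSubmitJobOperator(
        task_id="bronze_ingest",
        project_id=PROJECT_ID,
        region=REGION,
        job=pyspark_job("gs://example-code/bronze_ingest.py"),
    )
    silver = DataprocSubmitJobOperator(
        task_id="silver_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job=pyspark_job("gs://example-code/silver_orders.py"),
    )
    gold = BashOperator(
        task_id="gold_dbt_run",
        bash_command="dbt run --project-dir /home/airflow/gcs/data/dbt",
    )

    bronze >> silver >> gold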

Technologies Used

GCP · Google Cloud Storage · Dataproc · BigQuery · Cloud Functions · Cloud Composer · PySpark · dbt · Apache Airflow · Python · Metabase

02 ☁️ Multi-Cloud Data Workload Transition

Azure · GCP · Migration · Kloudone

Large-scale cloud platform migration

Led the transition of large-scale data workloads to a new cloud platform (Azure/GCP), ensuring minimal disruption to downstream processes. This involved comprehensive migration strategies, rigorous data integrity validation, and performance optimization to streamline ETL data workflows in the new environment.

Role & Impact

Played a pivotal role in ensuring continuity during a major platform shift, helping maintain processing efficiency and enhance data integrity across diverse datasets.

Key Highlights

  • Managed the migration of critical data workloads, minimizing downtime to near-zero
  • Conducted extensive performance validation and optimization across cloud platforms
  • Enhanced data validation processes, ensuring higher data integrity and consistency (a simplified validation sketch follows this list)
  • Implemented rollback strategies and contingency plans for risk mitigation
  • Coordinated with cross-functional teams to ensure seamless transition
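
To illustrate the kind of parity check involved, the sketch below compares row counts between the source and target platforms; the connection objects and table names are placeholders, and a fuller validation would also compare schemas and sampled values.

# validate_migration.py - simplified sketch of post-migration parity checks.
# source_conn / target_conn stand in for DB-API connections to the old and new platforms;
# the table list is illustrative, not an actual migration inventory.
from typing import Iterable, List

TABLES = ["orders", "customers", "payments"]  # placeholder table names

def row_count(conn, table: str) -> int:
    """Return the row count for a table on a given platform."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

def validate_tables(source_conn, target_conn, tables: Iterable[str]) -> List[str]:
    """Compare row counts table by table and collect any mismatches."""
    mismatches = []
    for table in tables:
        src, tgt = row_count(source_conn, table), row_count(target_conn, table)
        if src != tgt:
            mismatches.append(f"{table}: source={src} target={tgt}")
    return mismatches

# Usage (connections omitted):
#   issues = validate_tables(source_conn, target_conn, TABLES)
#   if issues:
#       raise RuntimeError("Migration validation failed: " + "; ".join(issues))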

Zero Data Loss · Minimal Downtime · 100% Data Integrity

Technologies Used

Microsoft Azure · Google Cloud Platform · ETL Tools · SQL · Performance Optimization · Data Validation · Python

03 ❄️ Scalable Data Pipeline with Snowflake & Azure

Azure · Snowflake · Pipeline · DXC Technology

Optimized data processing and warehousing solution

Optimized and maintained scalable data pipelines that process upstream data from Azure Blob Storage and store it efficiently in Snowflake. The work involved designing new ETL workflows and enhancing existing ones to meet evolving business requirements, significantly improving data accessibility for analytics and reporting. Managed end-to-end data processing across Snowflake-based pipelines and Databricks-powered transformations.

Role & Impact

Directly contributed to enhancing the reliability and efficiency of critical data flows, ensuring timely and accurate data delivery for business insights. Improved query performance and reduced data processing costs through optimization techniques.

Key Highlights

  • Designed and implemented new ETL workflows from scratch to meet business needs (see the loading sketch after this list)
  • Enhanced existing pipelines with performance optimizations and monitoring
  • Managed data processing across multi-architecture pipelines (Snowflake, Databricks)
  • Ensured efficient and reliable data processing from diverse sources
  • Implemented data quality checks and validation frameworks
  • Reduced data processing time through query optimization and partitioning strategies
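
One way this load pattern can look is sketched below, using the Snowflake Python connector to run a COPY INTO from an external stage pointing at Azure Blob Storage, followed by a basic quality gate; the account, stage, and table names are placeholders, not the actual pipeline objects.

# load_to_snowflake.py - sketch of loading staged Azure Blob data into Snowflake.
# Account, credentials, stage, and table names are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="***",            # in practice, pulled from a secrets manager
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)

cur = conn.cursor()
try:
    # Load new files from the external stage that points at the Azure Blob container.
    cur.execute("""
        COPY INTO RAW.ORDERS
        FROM @AZURE_BLOB_STAGE/orders/
        FILE_FORMAT = (TYPE = 'PARQUET')
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
    # Basic data quality gate: fail the run if nothing was loaded.
    cur.execute("SELECT COUNT(*) FROM RAW.ORDERS")
    if cur.fetchone()[0] == 0:
        raise RuntimeError("No rows present in RAW.ORDERS after load")
finally:
    cur.close()
    conn.close()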

30% Cost Reduction · 40% Faster Queries · 99.9% Uptime

Technologies Used

Snowflake · Microsoft Azure · Azure Blob Storage · Databricks · PySpark · SQL · ETL/ELT · Python

04 🤖 Automated Operational Task Optimization

Automation · Optimization · Kloudone

Reducing manual interventions through automation

Designed and implemented automation solutions for key operational tasks, significantly reducing manual interventions and improving system stability. This initiative led to a 10% reduction in manual effort and streamlined daily data operations, allowing the team to focus on higher-value tasks.

Role & Impact

Enhanced the overall efficiency and reliability of data workflows by automating repetitive and error-prone manual tasks. This resulted in improved system stability, reduced operational overhead, and faster incident response times.

Key Highlights

  • Identified and automated recurring operational processes across multiple workflows
  • Improved system stability and reduced manual overhead by 10%
  • Fine-tuned data ingestion and transformation processes for efficiency
  • Implemented monitoring and alerting systems for proactive issue detection (see the retry-and-alert sketch after this list)
  • Created reusable automation frameworks for future projects
  • Documented automation procedures for knowledge sharing
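
As an illustration of the kind of automation described above, here is a simplified retry-and-alert wrapper; the webhook URL and the example task are placeholders, not the production setup.

# run_task.py - sketch of a retry-and-alert wrapper for routine operational tasks.
# The webhook URL and the example task are placeholders.
import logging
import time

import requests

ALERT_WEBHOOK = "https://example.invalid/alerts"  # hypothetical alerting endpoint
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ops-automation")

def alert(message: str) -> None:
    """Send a failure notification to the alerting channel."""
    requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=10)

def run_with_retries(task, retries: int = 3, delay_sec: int = 60) -> None:
    """Run a task, retrying on failure and alerting if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            task()
            log.info("%s succeeded on attempt %d", task.__name__, attempt)
            return
        except Exception as exc:
            log.warning("Attempt %d of %s failed: %s", attempt, task.__name__, exc)
            time.sleep(delay_sec)
    alert(f"{task.__name__} failed after {retries} attempts")
    raise RuntimeError(f"{task.__name__} exhausted retries")

def refresh_staging_tables() -> None:
    """Placeholder for an automated operational task (e.g. a daily refresh)."""

if __name__ == "__main__":
    run_with_retries(refresh_staging_tables)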

10% Less Manual Work · 50+ Tasks Automated · 2x Faster Operations

Technologies Used

Python · Automation Scripts · ETL/ELT Processes · Monitoring Tools · Shell Scripting

Have a Data Engineering Challenge?

Let's discuss how I can help build scalable solutions for your business