
Data Engineering

Summary

Welcome to the Data Engineering section of my portfolio! This page highlights my journey into Data Engineering, from academic foundations to hands-on professional experiences. I’ll showcase my expertise in designing scalable data pipelines, creating efficient ETL workflows, and leveraging cloud technologies to manage and analyze large datasets. My contributions have helped optimize data processes and transform raw data into actionable insights across industries such as retail, healthcare, and analytics.

Education

Bachelor's in Computer Science - LDRP Institute of Technology

My introduction to the world of data began with learning the core principles of database design and programming fundamentals. It was here that I discovered the potential of structured data and developed a curiosity for solving real-world problems through efficient data management.

  • Core programming languages and algorithms.

  • Database design and management for structured data.

  • Problem-solving through efficient software development techniques.

Master’s in Information Technology - Griffith University (Specialization: Data Analytics)

My postgraduate studies allowed me to dive deeper into the world of data and analytics, emphasizing hands-on projects and real-world applications. During this time, I:

  • Explored Big Data Analytics, gaining experience with processing and analyzing massive datasets.

  • Mastered data modeling techniques for designing efficient storage and retrieval systems.

  • Worked on end-to-end projects, from data pipeline creation to cloud integration for scalable solutions.

  • Strengthened skills in visualizing data insights through interactive dashboards and reports.

Work Experience

  • I’ve had the opportunity to explore different aspects of data engineering through several internships, each contributing to my skills and understanding of the field.

Moba Mobile Automation (Python/ML Intern): I started my journey in data engineering here, focusing heavily on data exploration using libraries like Pandas, NumPy, and scikit-learn. I gained a solid understanding of data cleaning, data validation, feature engineering, and exploratory data analysis. This experience laid a strong foundation for my career, as I worked on real-world datasets and applied machine learning algorithms for forecasting.


Max Kelsen (Machine Learning Intern): At Max Kelsen, my focus shifted to computer vision and healthcare data. I was responsible for managing image data pipelines, running and troubleshooting workflows, and solving issues related to data processing. This role introduced me to data orchestration, giving me hands-on experience that further strengthened my skills in building reliable data pipelines.


Cinefly (Data Analyst Intern): My time at Cinefly involved working with video data, a challenge I had never faced before. I leveraged various GCP APIs (like the Video Intelligence API and the Natural Language API) to analyze the data, and stored everything in Cloud SQL and BigQuery. I used Tableau to create detailed reports and dashboards, helping the team make data-driven decisions. This role gave me invaluable experience in handling complex data types and integrating multiple GCP services.

Data Pipelines

Extract: I’ve handled data extraction from a variety of sources (see the sketch after this list). Over the years, I’ve used:

  • APIs (REST and GraphQL) for fetching live and batch data.

  • File uploads for handling formats like CSV, Excel, and JSON.

  • Databases like PostgreSQL and MySQL for pulling structured data.

  • JSON data processing for dealing with dynamic datasets.

  • Web scraping using Python libraries like BeautifulSoup and Scrapy to gather data from websites.
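
Below is a minimal sketch of what that extraction work typically looks like in Python, using requests and pandas; the API URL and file paths are hypothetical stand-ins rather than sources from any specific project.

```python
import pandas as pd
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical REST endpoint


def extract_from_api(url: str) -> pd.DataFrame:
    """Fetch a batch of JSON records from a REST API and flatten them into columns."""
    response = requests.get(url, params={"limit": 1000}, timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())


def extract_from_csv(path: str) -> pd.DataFrame:
    """Read an uploaded CSV file into a DataFrame."""
    return pd.read_csv(path)


if __name__ == "__main__":
    orders = extract_from_api(API_URL)
    customers = extract_from_csv("uploads/customers.csv")  # hypothetical upload path
    print(f"Extracted {len(orders)} orders and {len(customers)} customers")
```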

Transform: A lot of my transformation work comes from the data exploration projects I’ve done, especially on Kaggle. This part involves making raw data usable for analysis or further processing (a short sketch follows the list). My experience includes:

  • Cleaning data: Fixing missing values, removing duplicates, and ensuring consistency.

  • Exploring data: Looking for patterns, trends, and insights through exploratory data analysis (EDA).

  • Feature engineering and selection: Creating new features and picking the ones that matter the most.

  • Normalizing data: Scaling it to make sure it’s ready for analysis or modeling.
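
As a rough illustration of those steps, here is a small pandas/scikit-learn sketch; the order_date column is a hypothetical example of a derived feature rather than a field from a real dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Cleaning: remove duplicates and fill missing numeric values with the column median.
    df = df.drop_duplicates()
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Feature engineering: derive a new column (assumes a hypothetical 'order_date' field).
    if "order_date" in df.columns:
        df["order_month"] = pd.to_datetime(df["order_date"]).dt.month

    # Normalizing: scale numeric features so they are ready for analysis or modeling.
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    return df
```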

 

Load: For storing transformed data, I’ve worked with both data warehouses and traditional databases (see the sketch after this list):

  • Cloud storage: BigQuery, Google Cloud Storage

  • Databases: PostgreSQL, MySQL, MongoDB
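
A minimal load step, assuming the google-cloud-bigquery client library and a hypothetical project.dataset.table target:

```python
import pandas as pd
from google.cloud import bigquery


def load_to_bigquery(df: pd.DataFrame, table_id: str) -> None:
    """Append a transformed DataFrame to a BigQuery table."""
    client = bigquery.Client()  # uses application-default credentials
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # block until the load job finishes


# Hypothetical target table:
# load_to_bigquery(transformed_df, "my-project.analytics.orders")
```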

Orchestration: Automation has been key in my projects (a short sketch follows the list). I’ve mostly used:

  • Apache Airflow: I’ve worked on creating workflows for scheduling and monitoring pipelines.

  • Google Cloud Workflows: A GCP-native solution I’ve used for automating data pipelines.

  • Apache NiFi: I’ve explored NiFi for simpler flow-based pipeline tasks.
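
For Airflow in particular, a bare-bones daily DAG looks roughly like this; the DAG id and the callable body are placeholders rather than a pipeline from any one project.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_pipeline():
    # In a real DAG this would call the extract / transform / load steps sketched above.
    print("extract -> transform -> load")


with DAG(
    dag_id="orders_daily",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",       # run once per day
    catchup=False,
) as dag:
    PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)
```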

 

ETL and ELT

  • While I’ve focused a lot on ETL, I’ve also implemented ELT workflows where raw data is loaded into the warehouse first, followed by transformations within the warehouse itself. This approach has been particularly useful when working with modern cloud data warehouses like BigQuery and Azure Synapse; a minimal sketch of this pattern follows below.
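
Here is a minimal ELT sketch against BigQuery, with hypothetical bucket and table names: the raw JSON is loaded as-is, and the cleanup happens afterwards with SQL inside the warehouse.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Load raw files straight into the warehouse, with no upfront transformation.
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/orders/*.json",   # hypothetical GCS path
    "my-project.raw.orders",              # hypothetical raw table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()

# 2. Transform inside the warehouse with SQL.
client.query(
    """
    CREATE OR REPLACE TABLE `my-project.analytics.orders_clean` AS
    SELECT DISTINCT order_id, customer_id, DATE(order_ts) AS order_date, amount
    FROM `my-project.raw.orders`
    WHERE amount IS NOT NULL
    """
).result()
```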

Cloud Integration

GCP: I have experience working with GCP, Azure, and AWS, with GCP being the most prominent in my work. I started using Google Cloud Platform out of personal interest and then during my internship at Cinefly, where I worked on automating workflows and processing data. Since then, I’ve used GCP extensively in my personal projects, particularly for building data pipelines using services like Data Fusion and Dataflow.

For data storage, I’ve worked with Cloud SQL and BigQuery, the latter serving as my main data warehouse. I also rely on Looker Studio for creating visualizations, as it integrates well with BigQuery for real-time reporting.
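
To give a feel for the Dataflow side, here is a rough Apache Beam pipeline (the SDK Dataflow runs) with hypothetical project, bucket, and table names; it runs locally by default and on Dataflow when passed --runner=DataflowRunner and a region.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project and temp bucket.
options = PipelineOptions(project="my-project", temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read raw CSV" >> beam.io.ReadFromText(
            "gs://my-bucket/raw/events.csv", skip_header_lines=1
        )
        | "Split columns" >> beam.Map(lambda line: line.split(","))
        | "Drop bad rows" >> beam.Filter(lambda cols: len(cols) == 3)
        | "To dicts" >> beam.Map(
            lambda c: {"user_id": c[0], "event": c[1], "ts": c[2]}
        )
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="user_id:STRING,event:STRING,ts:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```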

 

Azure: I’ve completed the AZ-900 (Azure Fundamentals) course and explored various APIs and machine learning services on the platform.

AWS: In my current role, my company uses AWS, so I’ve gained some experience working with AWS services, particularly for data pipelines and data warehousing.

Database Management

Data Modeling: In data engineering, I focus on creating scalable, well-structured data models that facilitate efficient querying and reporting. I’ve implemented both star and snowflake schema designs, ensuring data is normalized and optimized for performance.

Schema Design: I’ve developed and maintained schemas that support data consistency, flexibility, and scalability. I’ve worked on both relational (PostgreSQL, MySQL) and NoSQL (MongoDB) databases, ensuring that schema designs align with the specific needs of the data and applications.
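
As a small illustration of the star-schema side of that modeling work, here is a hypothetical retail-style DDL executed from Python with psycopg2; the table names and connection string are placeholders, not a schema from a specific project.

```python
import psycopg2

# Hypothetical retail star schema: one fact table surrounded by dimension tables.
DDL = """
CREATE TABLE IF NOT EXISTS dim_customer (
    customer_key SERIAL PRIMARY KEY,
    customer_id  TEXT UNIQUE NOT NULL,
    name         TEXT,
    country      TEXT
);

CREATE TABLE IF NOT EXISTS dim_date (
    date_key  INT PRIMARY KEY,      -- e.g. 20240131
    full_date DATE NOT NULL,
    month     INT,
    year      INT
);

CREATE TABLE IF NOT EXISTS fact_sales (
    sale_id      BIGSERIAL PRIMARY KEY,
    customer_key INT REFERENCES dim_customer (customer_key),
    date_key     INT REFERENCES dim_date (date_key),
    quantity     INT,
    amount       NUMERIC(12, 2)
);
"""

with psycopg2.connect("dbname=analytics user=etl") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```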

Database Security: I’ve implemented basic database security measures, such as encryption, access control, and ensuring that sensitive data is properly masked or anonymized when necessary.
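
On the masking side specifically, a simple approach is hashing PII columns before the data leaves the pipeline; the column names and salt below are hypothetical, and salted SHA-256 is just one reasonable pseudonymization choice.

```python
import hashlib

import pandas as pd


def mask_pii(df: pd.DataFrame, columns: list, salt: str) -> pd.DataFrame:
    """Replace sensitive columns with salted SHA-256 hashes before loading."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(str).apply(
            lambda value: hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        )
    return out


# Hypothetical usage:
# customers = mask_pii(customers, ["email", "phone"], salt="rotate-this-secret")
```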

Collaboration and Reporting

  • In my experience, effective collaboration has been key to achieving success in any project. I’ve learned that regular communication with the team and clients helps keep everything on track. Whether it’s through daily standups to share progress or weekly meetings to discuss upcoming tasks, these moments have been essential for staying aligned and ensuring a smooth workflow. I’ve found that clear communication fosters stronger teamwork and helps solve problems more quickly.

  • I’ve used a variety of tools to facilitate collaboration, including Slack, Teams, Atlassian, Zoom, Pipedrive, WorkJam, and Workday. These platforms have supported seamless communication, task management, and client interactions, enabling me to work efficiently in cross-functional teams.

  • For reporting and visualization, I’ve used Tableau as my go-to tool, creating interactive dashboards to help the team and business make data-driven decisions. In addition to Tableau, I have experience with Looker Studio, Gephi, and Excel, using them to present insights in a clear and actionable format.
