Open to opportunities

// Data Engineer · Analyst · AI Builder

Gautam Pai

Pipelines, models, and insights — end to end.

MS in Business Analytics & AI at UT Dallas. I build the data infrastructure that makes analytics possible — from 700M-row/day Spark pipelines to ML models and LLM-powered agents.

700M+ rows/day processed
$15K monthly compute saved
90% manual refresh effort cut
2.5 yrs industry experience

// 00. about

Who I am

I'm a Data Engineer and ML practitioner originally from India, now pursuing my MS in Business Analytics & AI at the University of Texas at Dallas. My path into data started with electronics engineering, moved through analytics, and evolved into building the pipelines and systems that make data actually useful at scale.

At Cognizant, I spent 2.5 years going from writing SQL reports to architecting PySpark pipelines that processed 700M+ rows daily — learning along the way that the best data work lives at the intersection of engineering rigor and business impact.

Outside of work, I enjoy experimenting with LLMs and agents, exploring ML research, and building side projects that solve real problems. I'm currently looking for Data Engineering or ML Engineering roles where I can keep building things that matter.

Currently
MS Student @ UT Dallas (graduating May 2026)
Based in
Dallas, TX
Looking for
Data Engineering · ML Engineering · Analytics roles
Background
2.5 yrs at Cognizant — DE + DA roles
Interests
LLM Agents · Spark Optimization · MLOps

// 01. experience

Where I've worked

Full-stack data career — from pipeline architecture to BI dashboards and ML delivery.

Data Engineer
Cognizant Technology Solutions
Nov 2022 – May 2024
  • Designed and maintained PySpark/Databricks ingestion pipelines processing 700M+ rows/day from AWS S3 and Delta Lake, enabling enterprise-scale BI and analytics workloads.
  • Implemented Delta Live Tables to automate batch and streaming workflows, cutting manual refresh effort by 90%.
  • Tuned Spark jobs via partitioning and caching — 30% faster execution and $15K/month in compute cost savings.
  • Re-architected dimensional models and star schemas, improving report query performance by 60% across business units.
  • Deployed CI/CD automation using AWS EC2, Lambda, and GitHub Actions, streamlining production DE pipelines.
  • Built automated data validation, profiling, and anomaly-detection checks that reduced critical data errors by 50%.
  • Collaborated with data architects on schema evolution, governance standards, and long-term data model improvements.
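To give a flavor of the validation and anomaly-detection checks mentioned above, here is a minimal plain-Python sketch: a null-rate check on a column and a z-score check on a daily row count. The function names, the threshold of 3 standard deviations, and the sample numbers are illustrative assumptions, not the production checks.

```python
from statistics import mean, stdev
from typing import List, Optional


def null_rate(values: List[Optional[float]]) -> float:
    """Fraction of missing (None) values in a column."""
    return sum(v is None for v in values) / len(values)


def is_anomalous(history: List[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's value if it sits more than z_threshold sample standard
    deviations away from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # No historical variation: any change at all counts as an anomaly.
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Hypothetical daily row counts, in millions of rows.
row_counts = [700.0, 705.0, 698.0, 702.0, 699.0]

print(is_anomalous(row_counts, 701.0))   # a typical day
print(is_anomalous(row_counts, 350.0))   # half the usual volume
print(null_rate([1.0, None, 3.0, None]))
```

A z-score check like this is cheap to run after every load, but it assumes roughly stable daily volumes; seasonal data would need a different baseline.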
Data Analyst · Programmer Analyst
Cognizant Technology Solutions
Nov 2021 – Oct 2022
  • Built ETL pipelines in SSIS to ingest structured and semi-structured data from multiple sources into Snowflake and SQL Server.
  • Developed automated SSRS reporting pipelines with dynamic parameters, reducing manual reporting effort by 70%.
  • Performed root-cause analysis on anomalies affecting predictive pipelines and business KPIs.
  • Supported Agile delivery cycles, producing analytical outputs and post-deployment validation across 5+ teams.
Data Analyst Intern
Tevatron Technologies Pvt Ltd
Jun 2020 – Sep 2020
  • Automated data cleaning and preprocessing scripts in Python, cutting manual prep time by 80%.
  • Designed template-based dashboards and validation reports, standardizing analysis workflows across 5+ analyst teams.

// 02. projects

Things I've built

ML engineering, deep learning, LLM-powered apps, and data visualization.

AI / LLM · In progress

AI-Powered Academic Advisor

Academic advising assistant on Google Cloud Vertex AI + Claude Agent. Integrates LLM prompting with structured academic data and rule-based logic — reducing hallucination on structured course and degree rules via prompt engineering and API-based workflows.

Vertex AI · Claude · Python · GCP · Prompt Engineering
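
One way to picture the rule-based grounding described above: resolve degree rules from structured data first, then hand the model only the verified facts. The sketch below is a simplified illustration, not the project's code. The course codes, the `PREREQUISITES` table, and both function names are hypothetical, and no Vertex AI or Claude API call is shown.

```python
from typing import Dict, Set

# Hypothetical structured degree rules: course -> prerequisite courses.
PREREQUISITES: Dict[str, Set[str]] = {
    "DATA 101": set(),
    "DATA 201": {"DATA 101"},
    "DATA 301": {"DATA 201", "STAT 200"},
}


def missing_prerequisites(course: str, completed: Set[str]) -> Set[str]:
    """Rule-based check: prerequisites for `course` that are not yet completed."""
    if course not in PREREQUISITES:
        raise KeyError(f"Unknown course: {course}")
    return PREREQUISITES[course] - completed


def build_grounded_prompt(course: str, completed: Set[str]) -> str:
    """Embed the rule-derived verdict in the prompt so the model explains it
    instead of guessing it."""
    missing = sorted(missing_prerequisites(course, completed))
    verdict = "eligible" if not missing else "not eligible"
    facts = [
        f"Course requested: {course}",
        f"Eligibility (computed by rules, do not change): {verdict}",
        f"Missing prerequisites: {', '.join(missing) if missing else 'none'}",
    ]
    return "Answer using only the facts below.\n" + "\n".join(
        f"- {fact}" for fact in facts
    )


print(build_grounded_prompt("DATA 301", {"DATA 101", "DATA 201"}))
```

The design choice is that the eligibility decision never depends on the model: the rules compute it, and the LLM is only asked to phrase the answer.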
Deep Learning

Brain Tumor Classification

Custom CNN in PyTorch for multi-class MRI tumor classification. Achieved 94%+ F1-score via ablation studies across optimizers (Adam vs SGD) and batch sizes. Deployed batch inference on AWS EC2 with predictions streamed into Databricks for real-time clinical review.

PyTorch · CNN · AWS EC2 · Databricks · Data Augmentation
GitHub →
Deep Learning

Handwritten Character Recognition

CNN-based model classifying handwritten letters and digits across MNIST and USPS datasets — achieving 98.5% accuracy. Built with PyTorch using batch normalization and dropout regularization for robust generalization across both datasets.

PyTorch · CNN · MNIST · USPS · Batch Normalization
GitHub →
ML Risk Modeling

Loan Default Prediction

Ensemble classification models predicting loan default risk on 50K+ applications — 98.3% accuracy. Feature engineering in Spark SQL (credit ratios, risk tiers) with class-imbalance handling. Insights delivered via Power BI backed by Delta Lake.

Random Forest · Gradient Boosting · Spark SQL · Delta Lake · Power BI
GitHub →
Machine Learning

Water Potability Prediction

Predicted water potability from key quality parameters using six ML algorithms — achieving accuracy from 92.2% to 99.8% across models. Conducted comprehensive EDA and visualization on the Kaggle dataset; tuned hyperparameters via RandomizedSearchCV.

Scikit-learn · Random Forest · AdaBoost · Gradient Boost · EDA
GitHub →
Machine Learning

Kickstarter Success Prediction

Predictive model forecasting Kickstarter campaign success to help creators optimize project strategies. Compared Decision Tree, Random Forest, Gradient Boost, and AdaBoost — achieving up to 85.78% accuracy with hyperparameter tuning.

Scikit-learn · Pandas · NumPy · Gradient Boost · AdaBoost
GitHub →
ML · Agriculture

NPK Scanner

Supervised ML model and web application to detect plant nutrient (NPK: nitrogen, phosphorus, potassium) deficiencies from images, achieving 87.6% accuracy. Built end-to-end from model training through deployment as an interactive web app.

Python · Supervised ML · Web App · Image Analysis
Analytics & BI

Business Analytics with Python

End-to-end retail analytics covering market basket analysis (MLxtend), customer segmentation via K-means clustering, and regression modeling on transaction data. Documented insights to improve targeted marketing and customer value.

Python · MLxtend · K-means · Pandas · Scikit-learn
GitHub →
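
As a quick illustration of the market basket metrics behind this kind of analysis (support, confidence, and lift), here is a plain-Python sketch on a made-up set of baskets. It does not use MLxtend and is not the project code; the item names and the `support` and `rule_metrics` helpers are hypothetical.

```python
from typing import Dict, FrozenSet, List


def support(baskets: List[FrozenSet[str]], items: FrozenSet[str]) -> float:
    """Fraction of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def rule_metrics(
    baskets: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Dict[str, float]:
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Assumes both sides of the rule occur in at least one basket,
    otherwise the divisions below would fail.
    """
    both = support(baskets, antecedent | consequent)
    confidence = both / support(baskets, antecedent)
    lift = confidence / support(baskets, consequent)
    return {"support": both, "confidence": confidence, "lift": lift}


# Made-up transactions for illustration only.
baskets = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk"}),
]

print(rule_metrics(baskets, frozenset({"bread"}), frozenset({"butter"})))
```

A lift above 1 suggests the two items are bought together more often than independent purchases would predict, which is the signal used to pick cross-sell candidates.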
Analytics & BI

Mutual Funds Performance Comparator

Tableau dashboard comparing 15-year returns of mutual funds against a market index. Formulated an asset allocation strategy targeting a 20% CAGR across different index instruments.

Tableau · Financial Analysis · CAGR · Dashboard
GitHub →
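
For reference, the CAGR behind the 20% target follows the standard formula: (end value / start value)^(1 / years) - 1. Below is a minimal Python sketch with made-up figures; the function names and the 100,000 starting amount are illustrative and not taken from the dashboard.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.20 means 20% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def value_after(start_value: float, annual_rate: float, years: float) -> float:
    """Value of an investment compounding at annual_rate for the given years."""
    return start_value * (1.0 + annual_rate) ** years


# Hypothetical example: what 100,000 must grow to over 15 years at 20% CAGR.
target = value_after(100_000, 0.20, 15)
print(f"Target value: {target:,.0f}")
print(f"Implied CAGR: {cagr(100_000, target, 15):.2%}")
```

Running the two functions back to back is a handy sanity check: the implied CAGR of the computed target should round-trip to the rate that produced it.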
Analytics & BI

Sales Performance Dashboard

Power BI dashboard providing comprehensive insights into sales, profit, orders, and profit margin. Designed for executive-level visibility with drill-through capability across product lines and time periods.

Power BI · DAX · KPI Design · Data Modeling
GitHub →

// 03. skills

What I work with

Spanning data engineering, analytics, ML, and cloud infrastructure.

Languages & Processing
Python · PySpark · Pandas · SQL · PostgreSQL · MySQL · C++
Big Data & ETL
Apache Spark · Databricks · Delta Lake · Delta Live Tables · Airflow · Kafka · SSIS · Hadoop
Cloud — AWS
S3 · EC2 · Glue · Lambda · Redshift · Athena · EMR · CloudWatch · IAM
Cloud — Azure
Data Factory · Synapse Analytics · ADLS · Azure Functions
Data Warehousing
Dimensional Modeling · Star Schema · Snowflake · BigQuery · Redshift · OLAP
ML & Visualization
PyTorch · Scikit-learn · Power BI · Tableau · QuickSight · DAX

// 04. certifications

Credentials & licenses

AWS Data Engineer Associate
Amazon Web Services: Data Ingestion, Transformation, Management, Security & Governance
Databricks Data Engineer Associate
Databricks: Lakehouse, ETL, Spark SQL, Production Pipelines, Data Governance
Power BI Data Analyst Associate
Microsoft: Data Preparation, Modeling, Visualization & Analysis

// 05. education

Where I studied

MS — Business Analytics & AI (STEM)
The University of Texas at Dallas
Graduate Certificate: Applied Machine Learning
Aug 2024 – May 2026
BE — Electronics & Communication
Visvesvaraya Technological University
Aug 2017 – Jun 2021

// 06. blog

Thoughts & writing

Notes on data engineering, ML, and things I've learned building in the field.

Data Engineering
How I made Spark jobs 30% faster and cheaper using partitioning and caching
A practical walkthrough of the Spark tuning techniques that saved $15K/month at Cognizant — partitioning strategies, caching decisions, and common pitfalls.
Coming soon
LLM Engineering
Reducing hallucination in LLM agents with rule-based grounding
Lessons from building the AI Academic Advisor — how combining prompt engineering with structured data lookup dramatically improves factual accuracy.
Coming soon
Career
From Data Analyst to Data Engineer: what actually changed
The skills, mindset shifts, and technical gaps I had to fill when transitioning from BI and SQL to building production-grade pipelines at scale.
Coming soon

// 07. contact

Let's connect

Open to data engineering, ML engineering, and analytics opportunities. Let's talk.

Email gautampaiuni@gmail.com
Phone (469) 237-4827
LinkedIn linkedin.com/in/gautam-pai
GitHub github.com/gautam-pai