Open to opportunities

// Data · AI · Full-Stack

Gautam Pai

Building with data, end to end.

I'm pursuing an MS in Business Analytics & AI at UT Dallas. I design pipelines, build ML models, and ship production apps, from data infrastructure processing 700M+ rows/day to LLM agents and full-stack finance tools.

700M+ rows/day processed
$15K monthly compute saved
90% reporting effort cut
2.5 yrs industry experience

// 00. about

Who I am

I'm a data and AI practitioner originally from India, now pursuing my MS in Business Analytics & AI at the University of Texas at Dallas. My background spans the full data stack — from building production pipelines and BI systems to training ML models and shipping full-stack applications.

At Cognizant, I spent 2.5 years working across data engineering and analytics — architecting PySpark pipelines processing 700M+ rows daily, designing Power BI dashboards used by leadership, and building ETL systems that eliminated manual work at scale.

I like building things that are actually used. I'm currently working on FinTrack, a self-hosted personal finance app with AI-powered rebalancing advice, a Telegram bot, and live portfolio tracking. I'm looking for roles in data engineering, ML, or AI development where I can keep building things that matter.

Currently
MS Student @ UT Dallas (graduating May 2026)
Based in
Dallas, TX
Looking for
Data Engineering · ML Engineering · AI / Analytics roles
Background
2.5 yrs at Cognizant — pipelines, dashboards, ML
Interests
LLM Agents · Spark Optimization · Full-Stack Data Apps

// 01. experience

Where I've worked

2.5 years at Cognizant spanning data engineering, analytics, and pipeline delivery at enterprise scale.

Associate — Cognizant Technology Solutions
Data & Analytics · Enterprise Scale
Nov 2021 – May 2024
  • Designed and maintained PySpark/Databricks ingestion pipelines processing 700M+ rows/day from AWS S3 and Delta Lake, enabling enterprise-scale BI and analytics workloads.
  • Implemented Delta Live Tables to automate batch and streaming workflows, cutting manual refresh effort by 90%.
  • Tuned Spark jobs via partitioning and caching — 30% faster execution and $15K/month in compute cost savings.
  • Re-architected dimensional models and star schemas, improving report query performance by 60% across business units.
  • Designed Power BI dashboards for leadership, cutting ad-hoc reporting requests by 40% and accelerating reporting cycles by 60%.
  • Built ETL pipelines (SSIS) integrating data from multiple sources into Snowflake and SQL Server, plus SSRS and Power BI dashboards that cut manual reporting time by 70%.
  • Deployed CI/CD automation using AWS EC2, Lambda, and GitHub Actions, streamlining production data pipelines.
  • Conducted anomaly detection and data profiling checks that reduced critical reporting errors by 50%.
Data Analyst Intern — Tevatron Technologies
Analytics & Dashboard Automation
Jun 2020 – Sep 2020
  • Automated data cleaning and preprocessing scripts in Python, cutting manual prep time by 80%.
  • Designed template-based dashboards and validation reports, standardizing analysis workflows across 5+ analyst teams.

// 02. projects

Things I've built

Full-stack apps, ML models, LLM agents, and data visualizations.

AI / LLM

AI-Powered Academic Advisor

Academic advising assistant built on Google Cloud Vertex AI + Claude Agent. Uses a hybrid RAG pipeline combining Pinecone vector DB (semantic search) with Neo4j graph DB (relationship traversal) to ground LLM responses in structured academic data — reducing hallucination on course prerequisites and degree requirements.

Vertex AI · Claude · Pinecone · Neo4j · RAG · Python · GCP
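To illustrate the hybrid retrieval idea, here is a minimal sketch in plain Python. It is not the project's code and does not call Pinecone, Neo4j, or Vertex AI: an in-memory cosine-similarity search stands in for the vector store, and a dictionary stands in for the graph. The course codes, embeddings, and function names are all hypothetical.

```python
import math

# Hypothetical toy data: hand-made 3-d "embeddings" and a prerequisite graph.
COURSE_VECTORS = {
    "STAT101": [1.0, 0.0, 0.0],
    "ML201": [0.1, 1.0, 0.0],
    "DL301": [0.0, 0.9, 0.4],
}
PREREQS = {"DL301": ["ML201"], "ML201": ["STAT101"], "STAT101": []}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec, top_k=1):
    """Stand-in for the vector DB: rank courses by similarity to the query."""
    ranked = sorted(
        COURSE_VECTORS,
        key=lambda course: cosine(query_vec, COURSE_VECTORS[course]),
        reverse=True,
    )
    return ranked[:top_k]


def prerequisite_chain(course):
    """Stand-in for the graph DB: direct and transitive prerequisites."""
    seen, stack = [], list(PREREQS.get(course, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(PREREQS.get(current, []))
    return seen


def build_context(query_vec):
    """Combine both retrievers: semantic hits plus their prerequisite facts."""
    return {course: prerequisite_chain(course) for course in semantic_search(query_vec)}
```

The dictionary returned by `build_context` is what would be placed in the LLM prompt, so that prerequisite answers come from explicit graph edges rather than from similarity alone.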
Deep Learning

Brain Tumor Classification

Custom CNN in PyTorch for multi-class MRI tumor classification. Achieved 94%+ F1-score via ablation studies across optimizers (Adam vs SGD) and batch sizes. Deployed batch inference on AWS EC2 with predictions streamed into Databricks for real-time clinical review.

PyTorch · CNN · AWS EC2 · Databricks · Data Augmentation
GitHub →
Deep Learning

Handwritten Character Recognition

CNN classifying handwritten letters and digits across MNIST and USPS datasets — 98.5% accuracy. Built with PyTorch using batch normalization and dropout regularization for robust generalization.

PyTorch · CNN · MNIST · USPS · Batch Normalization
GitHub →
ML Risk Modeling

Loan Default Prediction

Ensemble classification models predicting loan default risk on 50K+ applications — 98.3% accuracy. Feature engineering in Spark SQL (credit ratios, risk tiers) with class-imbalance handling. Insights delivered via Power BI backed by Delta Lake.

Random Forest · Gradient Boosting · Spark SQL · Delta Lake · Power BI
GitHub →
Machine Learning

Water Potability Prediction

Predicted water potability using six ML algorithms, with accuracy ranging from 92.2% to 99.8% across models. Comprehensive EDA and visualization on a Kaggle dataset; hyperparameter tuning via RandomizedSearchCV.

Scikit-learn · Random Forest · AdaBoost · Gradient Boost · EDA
GitHub →
Machine Learning

Kickstarter Success Prediction

Predictive model forecasting Kickstarter campaign success. Compared Decision Tree, Random Forest, Gradient Boost, and AdaBoost — achieving up to 85.78% accuracy with hyperparameter tuning.

Scikit-learn · Pandas · NumPy · Gradient Boost · AdaBoost
GitHub →
ML · Agriculture

NPK Scanner

Supervised ML model and web application detecting plant NPK deficiencies from images — 87.6% accuracy. Built end-to-end from model training through deployment as an interactive web app.

Python · Supervised ML · Web App · Image Analysis
Analytics & BI

Business Analytics with Python

End-to-end retail analytics — market basket analysis (MLxtend), customer segmentation via K-means clustering, and regression modeling on transaction data to improve targeted marketing.

Python · MLxtend · K-means · Pandas · Scikit-learn
GitHub →
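For readers unfamiliar with market basket analysis, the three standard rule metrics (support, confidence, lift) can be sketched with the standard library alone. The project itself used MLxtend; the code below does not reproduce its API, and the transactions and function names are made up for illustration.

```python
# Hypothetical toy transactions; each set is one shopping basket.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    ante = support(antecedent, transactions)
    cons = support(consequent, transactions)
    if ante == 0 or cons == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = both / ante
    return {"support": both, "confidence": confidence, "lift": confidence / cons}
```

A lift below 1 (as for bread -> milk in this toy data) means the two items appear together less often than independence would predict; rules with lift above 1 are the ones typically considered for targeted marketing.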
Analytics & BI

Mutual Funds Performance Comparator

Tableau dashboard comparing 15-year returns of mutual funds against a market index. Formulated an optimal asset allocation strategy to achieve a 20% CAGR target across index instruments.

Tableau · Financial Analysis · CAGR
GitHub →
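For reference, CAGR (compound annual growth rate) is the constant yearly rate that turns a starting value into an ending value over a given number of years. The sketch below is a generic illustration in Python, not part of the Tableau workbook; the function names are hypothetical.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def required_end_value(start_value, target_cagr, years):
    """Ending value needed to hit a target CAGR over the given horizon."""
    return start_value * (1 + target_cagr) ** years
```

As a sense of scale, a 20% CAGR sustained for 15 years multiplies the starting value by roughly 15.4, since 1.2 ** 15 is about 15.41.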
Analytics & BI

Sales Performance Dashboard

Power BI dashboard providing executive-level visibility into sales, profit, orders, and profit margin — with drill-through capability across product lines and time periods.

Power BI · DAX · KPI Design · Data Modeling
GitHub →

// 03. skills

What I work with

Spanning data engineering, analytics, ML, full-stack, and cloud infrastructure.

Languages & Processing
Python · PySpark · Pandas · SQL · PostgreSQL · MySQL · JavaScript · C++
Big Data & ETL
Apache Spark · Databricks · Delta Lake · Delta Live Tables · Airflow · Kafka · SSIS · Hadoop
AI & ML
PyTorch · Scikit-learn · RAG · Pinecone · Neo4j · Vertex AI · Claude API · Groq
Cloud — AWS & Azure
S3 · EC2 · Glue · Lambda · Redshift · Athena · EMR · Data Factory · Synapse · ADLS
Full-Stack & Backend
Flask · FastAPI · Supabase · SQLAlchemy · PostgreSQL · REST APIs · Chart.js · Railway · HTML/CSS
Visualization & BI
Power BI · DAX · Tableau · QuickSight · Dimensional Modeling · Star Schema

// 04. certifications

Credentials & licenses

AWS
AWS Data Engineer Associate
Amazon Web Services — Data Ingestion, Transformation, Management, Security & Governance
DBX
Databricks Data Engineer Associate
Databricks — Lakehouse, ETL, Spark SQL, Production Pipelines, Data Governance
PBI
Power BI Data Analyst Associate
Microsoft — Data Preparation, Modelling, Visualization & Analysis

// 05. education

Where I studied

MS — Business Analytics & AI (STEM)
The University of Texas at Dallas
Graduate Certificate: Applied Machine Learning
Aug 2024 – May 2026
BE — Electronics & Communication
Visvesvaraya Technological University
Aug 2017 – Jun 2021

// 06. blog

Thoughts & writing

Notes on data engineering, ML, and things I've learned building in the field.

Full-Stack · Python · Finance · AI
Building FinTrack: A Full-Stack Personal Finance OS in Flask
From expense tracking to a complete financial OS — XIRR, spending forecasts, a financial health score, debt tracking, AI insights, and a Telegram bot. Architecture decisions, tricky bugs, and what I'd do differently.
Read →
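For context on the XIRR metric mentioned above: XIRR is the annualised rate at which the net present value of irregularly dated cashflows is zero. The sketch below is a generic bisection-based illustration, not FinTrack's implementation. The function names, search bounds, and the 365-day year convention are assumptions, and the cashflow list is assumed to be sorted with the earliest date first.

```python
from datetime import date


def xnpv(rate, cashflows):
    """Net present value of (date, amount) pairs, discounted from the first date."""
    t0 = cashflows[0][0]
    return sum(
        amount / (1 + rate) ** ((day - t0).days / 365)
        for day, amount in cashflows
    )


def xirr(cashflows, low=-0.99, high=10.0, tol=1e-9, max_iter=200):
    """Annualised rate where xnpv is zero, located by bisection on [low, high]."""
    f_low, f_high = xnpv(low, cashflows), xnpv(high, cashflows)
    if f_low * f_high > 0:
        raise ValueError("no sign change in the search interval")
    for _ in range(max_iter):
        mid = (low + high) / 2
        f_mid = xnpv(mid, cashflows)
        if abs(f_mid) < tol:
            return mid
        if f_low * f_mid < 0:
            high = mid  # root lies in the lower half
        else:
            low, f_low = mid, f_mid  # root lies in the upper half
    return (low + high) / 2
```

For example, investing 1000 on 2023-01-01 and receiving 1100 on 2024-01-01 (365 days later) gives an XIRR of about 10%. Bisection is slower than Newton's method but cannot diverge once the bracket contains a sign change.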
Full-Stack · FastAPI · Health · AI
Building HabiTrack: AI-Powered Health & Habits Tracking with FastAPI
Groq vision-based metric extraction from smart scale photos, a 15-achievement system, calendar heatmaps, and what I learned switching from Flask to FastAPI for a server-rendered app.
Read →
Data Engineering
How I cut Spark job runtimes by 30% using partitioning and caching
A practical walkthrough of the Spark tuning techniques that saved $15K/month at Cognizant — partitioning strategies, caching decisions, and common pitfalls.
Coming soon
LLM Engineering
RAG + Knowledge Graphs: why I combined Pinecone and Neo4j for academic advising
Vector search is great for semantics, but graphs are better for rules. How hybrid retrieval improved accuracy on structured academic queries.
Coming soon

// 07. contact

Let's connect

Open to data engineering, ML, and AI opportunities. Let's talk.

Email: gautampaiuni@gmail.com
Phone: (469) 237-4827
LinkedIn: linkedin.com/in/gautam-pai
GitHub: github.com/gautam-pai