Gautam Pai — Data & AI

// 00. about

Who I am

I'm a data and AI practitioner originally from India, now pursuing my MS in Business Analytics & AI at the University of Texas at Dallas. My background spans the full data stack — from building production pipelines and BI systems to training ML models and shipping full-stack applications.

At Cognizant, I spent 2.5 years working across data engineering and analytics — architecting PySpark pipelines processing 700M+ rows daily, designing Power BI dashboards used by leadership, and building ETL systems that eliminated manual work at scale.

I like building things that are actually used. I'm currently working on FinTrack, a self-hosted personal finance app with AI-powered rebalancing advice, a Telegram bot, and live portfolio tracking. I'm looking for roles where I can keep building things that matter, across data engineering, ML, or AI development.

Currently
MS Student @ UT Dallas (graduating May 2026)

Based in
Dallas, TX

Looking for
Data Engineering · ML Engineering · AI / Analytics roles

Background
2.5 yrs at Cognizant — pipelines, dashboards, ML

Interests
LLM Agents · Spark Optimization · Full-Stack Data Apps

GitHub

github.com/gautam-pai

// 01. experience

Where I've worked

2.5 years at Cognizant spanning data engineering, analytics, and pipeline delivery at enterprise scale.

Associate — Cognizant Technology Solutions

Data & Analytics · Enterprise Scale

Nov 2021 – May 2024

Designed and maintained PySpark/Databricks ingestion pipelines processing 700M+ rows/day from AWS S3 and Delta Lake, enabling enterprise-scale BI and analytics workloads.
Implemented Delta Live Tables to automate batch and streaming workflows, cutting manual refresh effort by 90%.
Tuned Spark jobs via partitioning and caching — 30% faster execution and $15K/month in compute cost savings.
Re-architected dimensional models and star schemas, improving report query performance by 60% across business units.
Designed Power BI dashboards for leadership, cutting ad-hoc reporting requests by 40% and accelerating reporting cycles by 60%.
Built SSIS ETL pipelines integrating data from multiple sources into Snowflake and SQL Server, plus SSRS and Power BI dashboards that cut manual reporting time by 70%.
Deployed CI/CD automation using AWS EC2, Lambda, and GitHub Actions, streamlining production data pipelines.
Conducted anomaly detection and data profiling checks that reduced critical reporting errors by 50%.

Data Analyst Intern — Tevatron Technologies

Analytics & Dashboard Automation

Jun 2020 – Sep 2020

Automated data cleaning and preprocessing scripts in Python, cutting manual prep time by 80%.
Designed template-based dashboards and validation reports, standardizing analysis workflows across 5+ analyst teams.

// 02. projects

Things I've built

Full-stack apps, ML models, LLM agents, and data visualizations.

⭐ Featured · Full-Stack · Live · PWA

FinTrack — Personal Finance OS

Full personal finance OS — 147 API routes, 16 build phases, and a two-audit security review. Tracks expenses (with Splitwise-style splits and a trip ledger), income, investments, credit cards, debts, and goals. Features a FIRE calculator, portfolio stress tester, weighted spending forecast, 5-pillar financial health score, stock research tab with AI analysis, and a Telegram bot.

Installable as a PWA with offline support via Service Worker. Broker sync (Zerodha + Groww), 3-provider live price fallback (Finnhub → Yahoo → CoinGecko), multi-currency with live rates, and AI insights (Groq GPT-OSS 120B) with smart cache invalidation. Refactored from a 5,800-line monolith into 8 blueprints with a pytest suite.
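The three-provider price fallback described above can be sketched roughly as follows. This is an illustrative sketch, not FinTrack's actual code: `fetch_price`, `failing_provider`, and `fixed_provider` are hypothetical names, and the real Finnhub, Yahoo, and CoinGecko clients are replaced by stand-in callables.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical provider signature: takes a ticker symbol and returns a price,
# or raises on failure. Real API clients are deliberately not shown.
PriceProvider = Callable[[str], float]


def fetch_price(
    symbol: str,
    providers: Sequence[Tuple[str, PriceProvider]],
) -> Tuple[str, float]:
    """Try each provider in order; return (provider_name, price) from the first success."""
    errors: List[str] = []
    for name, provider in providers:
        try:
            price = provider(symbol)
        except Exception as exc:  # any provider failure triggers the next fallback
            errors.append(f"{name}: {exc}")
            continue
        if price is not None and price > 0:
            return name, price
        errors.append(f"{name}: invalid price {price!r}")
    raise RuntimeError(f"all providers failed for {symbol}: " + "; ".join(errors))


# Stand-in providers, for illustration only.
def failing_provider(symbol: str) -> float:
    raise TimeoutError("request timed out")


def fixed_provider(symbol: str) -> float:
    return 101.5


if __name__ == "__main__":
    chain = [("primary", failing_provider), ("secondary", fixed_provider)]
    print(fetch_price("AAPL", chain))
```

Keeping the providers as an ordered list of callables means the chain order can be changed or extended without touching the fallback logic.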

Python · Flask · Supabase · PostgreSQL · Chart.js · Groq LLaMA · Telegram Bot · Finnhub API · Railway

Live Demo → · Blog Post →

⭐ Featured · Full-Stack · Live

HabiTrack — Health & Habits Tracker

Multi-user health and habits tracker built with FastAPI + Supabase + SQLAlchemy. AI-powered metric extraction from smart scale photos via Groq Vision — photograph your Fitdays report and all 27 body composition metrics are extracted automatically. Features daily habit streaks, a 15-achievement system, monthly calendar heatmaps, 6 body composition trend charts, and goal tracking with live progress.

Built invite-only with middleware-level account management. Switched from Flask to FastAPI to explore async routing, Pydantic validation, and proper ORM migrations — each feature lives in its own router.

Python · FastAPI · Supabase · SQLAlchemy · Groq Vision · Chart.js · Railway · Jinja2

Live Demo → · Blog Post →

AI / LLM

AI-Powered Academic Advisor

Academic advising assistant built on Google Cloud Vertex AI + Claude Agent. Uses a hybrid RAG pipeline combining Pinecone vector DB (semantic search) with Neo4j graph DB (relationship traversal) to ground LLM responses in structured academic data — reducing hallucination on course prerequisites and degree requirements.
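The hybrid retrieval idea above (semantic search to find relevant courses, then graph traversal to pull in their prerequisites) can be illustrated with a toy, in-memory sketch. Everything here is an assumption for illustration: the course codes, vectors, and function names are hypothetical, and plain dictionaries stand in for Pinecone and Neo4j.

```python
import math
from typing import Dict, List

# Toy stand-ins: a real system would query a vector DB and a graph DB instead.
COURSE_VECTORS: Dict[str, List[float]] = {
    "ML101": [0.9, 0.1, 0.0],
    "DB200": [0.1, 0.9, 0.0],
    "STAT50": [0.6, 0.2, 0.2],
}
PREREQS: Dict[str, List[str]] = {
    "ML101": ["STAT50"],
    "STAT50": [],
    "DB200": [],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec: List[float], top_k: int = 1) -> List[str]:
    """Return the top_k course codes most similar to the query vector."""
    ranked = sorted(
        COURSE_VECTORS,
        key=lambda code: cosine(query_vec, COURSE_VECTORS[code]),
        reverse=True,
    )
    return ranked[:top_k]


def all_prerequisites(course: str) -> List[str]:
    """Depth-first traversal collecting direct and transitive prerequisites."""
    seen: List[str] = []
    stack = list(PREREQS.get(course, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(PREREQS.get(current, []))
    return seen


def build_context(query_vec: List[float]) -> str:
    """Combine semantic hits with graph facts into text to ground an LLM prompt."""
    lines = []
    for course in semantic_search(query_vec):
        prereqs = all_prerequisites(course)
        lines.append(f"{course}: prerequisites = {', '.join(prereqs) or 'none'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_context([1.0, 0.0, 0.0]))
```

The point of the split is that prerequisite chains come from explicit edges rather than from similarity scores, so the grounding text stays exact even when the semantic match is fuzzy.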

Vertex AI · Claude · Pinecone · Neo4j · RAG · Python · GCP

Deep Learning

Brain Tumor Classification

Custom CNN in PyTorch for multi-class MRI tumor classification. Achieved 94%+ F1-score via ablation studies across optimizers (Adam vs SGD) and batch sizes. Deployed batch inference on AWS EC2 with predictions streamed into Databricks for real-time clinical review.

PyTorch · CNN · AWS EC2 · Databricks · Data Augmentation

GitHub →

Deep Learning

Handwritten Character Recognition

CNN classifying handwritten letters and digits across MNIST and USPS datasets — 98.5% accuracy. Built with PyTorch using batch normalization and dropout regularization for robust generalization.

PyTorch · CNN · MNIST · USPS · Batch Normalization

GitHub →

ML Risk Modeling

Loan Default Prediction

Ensemble classification models predicting loan default risk on 50K+ applications — 98.3% accuracy. Feature engineering in Spark SQL (credit ratios, risk tiers) with class-imbalance handling. Insights delivered via Power BI backed by Delta Lake.

Random Forest · Gradient Boosting · Spark SQL · Delta Lake · Power BI

GitHub →

Machine Learning

Water Potability Prediction

Predicted water potability using six ML algorithms, with accuracy ranging from 92.2% to 99.8% across models. Comprehensive EDA and visualization on a Kaggle dataset; hyperparameter tuning via RandomizedSearchCV.

Scikit-learn · Random Forest · AdaBoost · Gradient Boost · EDA

GitHub →

Machine Learning

Kickstarter Success Prediction

Predictive model forecasting Kickstarter campaign success. Compared Decision Tree, Random Forest, Gradient Boost, and AdaBoost — achieving up to 85.78% accuracy with hyperparameter tuning.

Scikit-learn · Pandas · NumPy · Gradient Boost · AdaBoost

GitHub →

ML · Agriculture

NPK Scanner

Supervised ML model and web application detecting plant NPK deficiencies from images — 87.6% accuracy. Built end-to-end from model training through deployment as an interactive web app.

Python · Supervised ML · Web App · Image Analysis

Analytics & BI

Business Analytics with Python

End-to-end retail analytics — market basket analysis (MLxtend), customer segmentation via K-means clustering, and regression modeling on transaction data to improve targeted marketing.

Python · MLxtend · K-means · Pandas · Scikit-learn

GitHub →

Analytics & BI

Mutual Funds Performance Comparator

Tableau dashboard comparing 15-year returns of mutual funds against a market index. Formulated an asset allocation strategy targeting a 20% CAGR across index instruments.
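For reference, CAGR (compound annual growth rate) is the constant yearly rate that would grow a starting value into an ending value over a given number of years. A minimal sketch of the calculation follows; the function names and sample figures are illustrative assumptions, not values from the dashboard.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def required_final_value(start_value: float, target_cagr: float, years: float) -> float:
    """Ending value needed to hit a target CAGR over the given horizon."""
    return start_value * (1 + target_cagr) ** years


if __name__ == "__main__":
    # Illustrative numbers only.
    print(f"CAGR: {cagr(100.0, 144.0, 2):.2%}")
    print(f"Value needed for 20% CAGR over 2 years: {required_final_value(100.0, 0.20, 2):.2f}")
```

The second helper inverts the formula, which is handy when working backward from a target rate to the portfolio value it implies.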

Tableau · Financial Analysis · CAGR

GitHub →

Analytics & BI

Sales Performance Dashboard

Power BI dashboard providing executive-level visibility into sales, profit, orders, and profit margin — with drill-through capability across product lines and time periods.

Power BI · DAX · KPI Design · Data Modeling

GitHub →

// 03. skills

What I work with

Spanning data engineering, analytics, ML, full-stack, and cloud infrastructure.

Languages & Processing

Python · PySpark · Pandas · SQL · PostgreSQL · MySQL · JavaScript · C++

Big Data & ETL

Apache Spark · Databricks · Delta Lake · Delta Live Tables · Airflow · Kafka · SSIS · Hadoop

AI & ML

PyTorch · Scikit-learn · RAG · Pinecone · Neo4j · Vertex AI · Claude API · Groq

Cloud — AWS & Azure

S3 · EC2 · Glue · Lambda · Redshift · Athena · EMR · Data Factory · Synapse · ADLS

Full-Stack & Backend

Flask · FastAPI · Supabase · SQLAlchemy · PostgreSQL · REST APIs · Chart.js · Railway · HTML/CSS

Visualization & BI

Power BI · DAX · Tableau · QuickSight · Dimensional Modeling · Star Schema

// 04. certifications

Credentials & licenses

AWS

AWS Data Engineer Associate

Amazon Web Services — Data Ingestion, Transformation, Management, Security & Governance

DBX

Databricks Data Engineer Associate

Databricks — Lakehouse, ETL, Spark SQL, Production Pipelines, Data Governance

PBI

Power BI Data Analyst Associate

Microsoft — Data Preparation, Modelling, Visualization & Analysis

// 05. education

Where I studied

MS — Business Analytics & AI (STEM)

The University of Texas at Dallas

Graduate Certificate: Applied Machine Learning

Aug 2024 – May 2026

BE — Electronics & Communication

Visvesvaraya Technological University

Aug 2017 – Jun 2021

// 06. blog

Thoughts & writing

Notes on data engineering, ML, and things I've learned building in the field.

Full-Stack · Python · Finance · AI

Building FinTrack: From Expense Tracker to Personal Finance OS

16 phases, 147 API routes, a PWA, a security audit, and a blueprint refactor. Everything I learned building a self-hosted finance app — weighted forecasts, merchant normalisation, split expense extraction, the production bug that taught me to pin dependencies, and more.

Read →

Full-Stack · FastAPI · Health · AI

Building HabiTrack: AI-Powered Health & Habits Tracking with FastAPI

Groq vision-based metric extraction from smart scale photos, a 15-achievement system, calendar heatmaps, and what I learned switching from Flask to FastAPI for a server-rendered app.

Read →

Data Engineering

How I reduced Spark compute costs by 30% using partitioning and caching

A practical walkthrough of the Spark tuning techniques that saved $15K/month at Cognizant — partitioning strategies, caching decisions, and common pitfalls.

Coming soon

LLM Engineering

RAG + Knowledge Graphs: why I combined Pinecone and Neo4j for academic advising

Vector search is great for semantics, but graphs are better for rules. How hybrid retrieval improved accuracy on structured academic queries.

Coming soon

// 07. contact

Let's connect

Open to data engineering, ML, and AI opportunities. Let's talk.

Email
gautampaiuni@gmail.com

Phone
(469) 237-4827

LinkedIn
linkedin.com/in/gautam-pai

GitHub
github.com/gautam-pai