AI / ML · Industrial Automation

MLOps Pipelines

Distributed training and inference infrastructure for predictive models on industrial telemetry. Cut end-to-end training time by roughly 9× (from 6+ hours to under 40 minutes), made the full ML lifecycle reproducible, and replaced manual deployments with automated promotion gates.

The Problem
01

Monolithic Training Scripts

Model training ran as single-process Python scripts on a single machine. Retraining a production model took 6–8 hours and blocked all experimentation during that window — a hard ceiling on iteration speed.

02

No Reproducibility

There was no tracking of hyperparameters, dataset versions, or model artifacts. When a model degraded in production, the team had no reliable way to reproduce the last known-good state.

03

Manual Deployment

Deploying a new model version meant SSHing into a server, copying files, and restarting a process. Rollbacks were a fire drill. Mean time to deploy was measured in hours; mean time to roll back was even longer.

What We Built
01

Distributed Training with Ray

Migrated training workloads to Ray Train with automatic data parallelism across a 16-node cluster on AWS. End-to-end training time dropped from 6+ hours to under 40 minutes for the largest model family — a 9× speedup with no changes to model code.
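Data parallelism, the strategy named above, splits the data across workers, has each worker compute gradients on its own shard, and averages the results before the weight update. The pure-Python sketch below illustrates that idea for a one-parameter linear model. It is not Ray Train code, and the function names (`shard`, `local_gradient`, `data_parallel_gradient`) and the toy data are hypothetical.

```python
from statistics import fmean


def shard(samples, num_workers):
    """Split samples round-robin into num_workers shards."""
    return [samples[i::num_workers] for i in range(num_workers)]


def local_gradient(w, samples):
    """Gradient of mean squared error for y ~ w * x on one shard."""
    return fmean([2 * (w * x - y) * x for x, y in samples])


def data_parallel_gradient(w, samples, num_workers):
    """Average the per-shard gradients, as an all-reduce step would.

    With equal-sized shards this equals the full-batch gradient.
    """
    shards = shard(samples, num_workers)
    return fmean([local_gradient(w, s) for s in shards])


# Toy data following y = 3x: 16 samples split over 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 17)]

w = 0.0
for _ in range(200):
    w -= 0.005 * data_parallel_gradient(w, data, num_workers=4)
```

Because the averaged gradient matches the single-process gradient, the model code itself does not need to change; only the way batches are distributed does.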

02

Airflow-Orchestrated DAGs

Defined the full ML lifecycle — data preprocessing, feature engineering, training, evaluation, and promotion — as Airflow DAGs. Every run is logged, every step is retryable, and failures send alerts instead of silently corrupting state.
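The lifecycle above can be pictured as a dependency graph whose steps run in order and are retried on failure. The sketch below uses only the Python standard library to show that shape. It is not Airflow's API; the step names mirror the text, while `run_pipeline`, `max_retries`, and the simulated flaky step are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
PIPELINE = {
    "preprocess": set(),
    "feature_engineering": {"preprocess"},
    "train": {"feature_engineering"},
    "evaluate": {"train"},
    "promote": {"evaluate"},
}


def run_pipeline(tasks, dag=PIPELINE, max_retries=2):
    """Run tasks in dependency order, retrying each failed step.

    Returns a log of (step, attempt, status) tuples and stops at the
    first step that exhausts its retries, so later steps never run
    against bad state.
    """
    log = []
    for step in TopologicalSorter(dag).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[step]()
                log.append((step, attempt, "success"))
                break
            except Exception:
                log.append((step, attempt, "failed"))
        else:
            # Retries exhausted: a real pipeline would send an alert here.
            log.append((step, attempt, "aborted"))
            return log
    return log


# Simulate a training step that fails once, then succeeds on retry.
calls = {"train": 0}


def flaky_train():
    calls["train"] += 1
    if calls["train"] == 1:
        raise RuntimeError("transient worker failure")


tasks = {name: (lambda: None) for name in PIPELINE}
tasks["train"] = flaky_train
log = run_pipeline(tasks)
```

The log records every attempt, which is the property the text relies on: a failed step is visible and retried rather than silently skipped.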

03

MLflow Experiment Tracking

Instrumented all training runs with MLflow tracking: parameters, metrics, artifacts, and dataset hashes are logged automatically. Any production model can be reproduced exactly from its run ID. The experiment comparison UI replaced ad-hoc spreadsheets.
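One way to produce the dataset hashes mentioned above is to hash a canonical encoding of the training rows and store the digest next to the run's parameters and metrics. The sketch below is a standard-library illustration under that assumption; how the hashes were computed in this project is not stated, and `dataset_hash`, `build_run_record`, and the sample values are hypothetical.

```python
import hashlib
import json


def dataset_hash(rows):
    """SHA-256 over a canonical JSON encoding of the rows."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the encoding independent of dict key order.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def build_run_record(params, metrics, rows):
    """Bundle the facts needed to identify and reproduce a run."""
    return {
        "params": dict(params),
        "metrics": dict(metrics),
        "dataset_sha256": dataset_hash(rows),
    }


# Hypothetical telemetry rows and training settings.
rows = [
    {"sensor": "pump-1", "temp_c": 71.5},
    {"sensor": "pump-2", "temp_c": 68.0},
]
record = build_run_record({"lr": 0.001, "epochs": 20}, {"val_mae": 0.42}, rows)
```

With MLflow, a record like this would typically be passed to calls such as `mlflow.log_params`, `mlflow.log_metrics`, and `mlflow.set_tag` inside an active run, so the digest travels with the run ID.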

04

SageMaker Registry & Automated Deployment

Models that pass the evaluation gates are registered in SageMaker Model Registry with full lineage. Deployment is a single DAG step that creates an endpoint version, runs a canary evaluation, and shifts traffic automatically once the canary passes.
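The gate-then-canary decision described above can be sketched as a pair of plain functions. This is an illustrative model of the logic only, not SageMaker's API; the metric names, the `canary_tolerance` value, and the stage labels are assumptions, and all metrics are treated as lower-is-better.

```python
def passes_gate(candidate, baseline, max_regression=0.0):
    """True if every metric is within the allowed regression.

    Metrics are lower-is-better; the baseline defines which are checked.
    """
    return all(
        candidate[name] <= baseline[name] + max_regression
        for name in baseline
    )


def canary_rollout(candidate, baseline, canary_metrics, canary_tolerance=0.02):
    """Return the traffic split after the offline gate and canary check."""
    # Offline evaluation gate: the candidate must not regress at all.
    if not passes_gate(candidate, baseline):
        return {"baseline": 1.0, "candidate": 0.0, "stage": "rejected"}
    # Canary check on live traffic: allow a small tolerance for noise.
    if not passes_gate(canary_metrics, baseline, canary_tolerance):
        return {"baseline": 1.0, "candidate": 0.0, "stage": "rolled_back"}
    return {"baseline": 0.0, "candidate": 1.0, "stage": "promoted"}


# Hypothetical metrics for the current and candidate models.
baseline = {"mae": 0.50, "p95_latency_s": 0.20}
candidate = {"mae": 0.45, "p95_latency_s": 0.18}
canary = {"mae": 0.46, "p95_latency_s": 0.21}
decision = canary_rollout(candidate, baseline, canary)
```

Keeping the decision in a pure function makes it easy to test separately from the step that actually updates endpoint traffic.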

Technology
Training
  • Ray Train 2.x
  • PyTorch
  • Data parallel
  • AWS EC2 cluster
Orchestration
  • Apache Airflow 2.8
  • Custom operators
  • S3 artifact store
Tracking
  • MLflow 2.x
  • Feast feature store
  • DynamoDB
  • Redshift
Serving
  • AWS SageMaker
  • Evidently AI
  • Grafana
  • GitHub Actions
