Production Operations Engineer
Deploying and operating hyperscale GPU platforms for AI training and inference across 17+ global data center campuses.
- First-wave AI data center deployment and production stabilization
- Owned production health for NVIDIA, AMD, and MTIA training and inference environments
- Global incident command for high-severity, multi-region events
- Root cause analysis (RCA) across Linux, network, storage, firmware, and OCP hardware
- Python and Bash diagnostics for identifying fleet-scale failure patterns
- Analyzed telemetry, utilization, hardware failure rates, and SLA performance to identify systemic risk
- Partnered with AI, architecture, hardware, firmware, and vendor teams to move platforms from prototype to mass production
- Authored global runbooks, standardized operating processes, and mentored engineers in a 24/7 environment