PLAYER ONE // PRODUCTION OPERATIONS ENGINEER

KELVIN NGO // INFRASTRUCTURE RELIABILITY QUEST

A playable resume about keeping hyperscale AI infrastructure healthy, resilient, and ready for the next mission.

MOVE: ← → / A D · JUMP: SPACE / W · TOUCH CONTROLS INCLUDED

PLAYER STATUS: PRODUCTION READY

INTERACTIVE EXPERIENCE

Traverse the infrastructure.

Reach each beacon to unlock a career milestone. The experience is optional; every detail appears in the Quest Log below.

PLAYER: KELVIN

ROLE: PROD OPS

DATA SHARDS: 0 / 7

UPTIME: 99.99%

START: DFW // 2017

QUEST CLEAR: KEEP BUILDING →

QUEST LOG

Nearly a decade of infrastructure missions.

No controller required. Here is the complete career path.

LEVEL 03

Meta

2022 - PRESENT // META

Production Operations Engineer

Deploying and operating hyperscale GPU platforms for AI training and inference across 17+ global data center campuses.

First-wave AI data center deployment and production stabilization
Owned production health for NVIDIA, AMD, and MTIA training and inference environments
Global incident command for high-severity, multi-region events
Root cause analysis (RCA) across Linux, network, storage, firmware, and OCP hardware
Python and Bash diagnostics for fleet-scale failure patterns
Analyzed telemetry, utilization, hardware failure rates, and SLA performance to identify systemic risk
Partnered with AI, architecture, hardware, firmware, and vendor teams to move platforms from prototype to mass production
Authored global runbooks, standardized operating processes, and mentored engineers in a 24/7 environment

LEVEL 02

athenahealth

2018 - 2022 // ATHENAHEALTH

Senior Data Center Engineer

Built the large-scale infrastructure operations foundation that later translated directly into hyperscale AI platform reliability.

Planned data hall designs and deployment strategies across three large data centers
Led the Dallas shutdown and Richardson migration while protecting service continuity
Maintained the firmware catalog and deployed fleet updates across approximately 10,000 servers
Configured and troubleshot Linux servers, hardware, storage, networking, and performance issues
Managed vendors, structured cabling, operating budgets, and burst labor
Built and trained the onsite team and documented repeatable operations in Confluence SOPs

LEVEL 01

TD Ameritrade

2017 - 2018 // TD AMERITRADE

NOC Technician II

Developed the observability, escalation, and disciplined incident-response habits required for mission-critical infrastructure.

Monitored trading services for 11M+ client accounts holding more than $1T in assets
Resolved hardware, software, application, and automated-job incidents
Coordinated system-outage response across internal support groups and service providers
Provided first-level diagnosis for automated job applications and production workflows
Communicated impact, escalation status, and restoration progress during business-critical events

PLAYER INVENTORY

Tools equipped for production.

Systems & Automation

Linux · Python · Bash · Ansible · Grafana · Nagios · JIRA · Confluence

Compute & Hardware

OCP · NVIDIA · AMD · MTIA · Dell · HPE · Supermicro · IPMI / RACADM

Network & Storage

Cisco · Fiber · Pure Storage · NetApp · RAID · DNS · DHCP · VMware

Leadership

Incident Command · RCA · Mentoring · Platform Launch · Technical Programs

NEW MISSION AVAILABLE

Let's build reliable infrastructure.

EMAIL KELVIN · LINKEDIN ↗