PLAYER ONE // PRODUCTION OPERATIONS ENGINEER

KELVIN NGO // INFRASTRUCTURE RELIABILITY QUEST

A playable resume about keeping hyperscale AI infrastructure healthy, resilient, and ready for the next mission.

MOVE: ← → / A D  ·  JUMP: SPACE / W  ·  TOUCH CONTROLS INCLUDED

Kelvin Ngo as an original pixel-art production engineer character
PLAYER STATUS: PRODUCTION READY

INTERACTIVE EXPERIENCE

Traverse the infrastructure.

Reach each beacon to unlock a career milestone. The experience is optional; every detail appears in the Quest Log below.

PLAYER: KELVIN
ROLE: PROD OPS
DATA SHARDS: 0 / 7
UPTIME: 99.99%
START: DFW // 2017
QUEST CLEAR: KEEP BUILDING →

QUEST LOG

Nearly a decade of infrastructure missions.

No controller required. Here is the complete career path.

LEVEL 02

2018 - 2022 // ATHENAHEALTH

Senior Data Center Engineer

Built the large-scale infrastructure operations foundation that later translated directly into hyperscale AI platform reliability.

  • Planned data hall designs and deployment strategies across three large data centers
  • Led the Dallas shutdown and Richardson migration while protecting service continuity
  • Maintained the firmware catalog and deployed fleet updates across approximately 10,000 servers
  • Configured and troubleshot Linux servers, hardware, storage, networking, and performance issues
  • Managed vendors, structured cabling, operating budgets, and burst labor
  • Built and trained the onsite team and documented repeatable operations in Confluence SOPs

LEVEL 01

2017 - 2018 // TD AMERITRADE

NOC Technician II

Developed the observability, escalation, and disciplined incident-response habits required for mission-critical infrastructure.

  • Monitored trading services for 11M+ client accounts holding more than $1T in assets
  • Resolved hardware, software, application, and automated-job incidents
  • Coordinated system-outage response across internal support groups and service providers
  • Provided first-level diagnosis for automated job applications and production workflows
  • Communicated impact, escalation status, and restoration progress during business-critical events

PLAYER INVENTORY

Tools equipped for production.

Systems & Automation

Linux · Python · Bash · Ansible · Grafana · Nagios · JIRA · Confluence

Compute & Hardware

OCP · NVIDIA · AMD · MTIA · Dell · HPE · Supermicro · IPMI / RACADM

Network & Storage

Cisco · Fiber · Pure Storage · NetApp · RAID · DNS · DHCP · VMware

Leadership

Incident Command · RCA · Mentoring · Platform Launch · Technical Programs

NEW MISSION AVAILABLE

Let's build reliable infrastructure.