Created: 2026-05-22T17:50
Cover letter
Dear Anthropic Fellows Program Hiring Team,
Anthropic's mission — building reliable, interpretable, and steerable AI systems — sits at the intersection of the most consequential engineering and scientific challenges of our time. The question of how to make powerful AI systems trustworthy is not an abstract one for me: it is the thread connecting two decades of work, from hand-coding backpropagation through time in C++ at UC Berkeley in 2004 to building a production RL post-training workbench in 2026 that benchmarks GRPO, DPO, and ten other algorithms across TRL, VeRL, OpenRLHF, and NeMo RL. I am applying for the ML Systems & Performance Fellows workstream because the engineering problems in this space — building the infrastructure that makes empirical AI research tractable — are exactly the class of problems I am built to work on.
---
**Technical Foundation**
My ML systems work spans the full stack from training infrastructure to evaluation to serving. The most directly relevant project is my RL post-training workbench, a three-phase platform covering the complete RLHF/DPO pipeline. The Reward Lab module supports designing and A/B testing reward functions (RLVR, learned, and hybrid) across four standard datasets: GSM8K, MATH, HumanEval, and UltraFeedback. The Playground runs real TRL-powered GRPO and DPO training with live SSE metric streaming on Apple Silicon (MPS) and CUDA. The Arena performs head-to-head framework benchmarking across TRL, VeRL, OpenRLHF, and NeMo RL with GPU passthrough in Docker containers. Across the platform, I implemented 12 RL algorithms — PPO, GRPO, DAPO, REINFORCE, REINFORCE++, RLOO, DPO, SimPO, IPO, KTO, ORPO, and SPPO — with algorithm-specific metric profiles, cross-tab workflow lineage tracking, and standardized throughput, memory, and convergence benchmarking. This is the kind of infrastructure that makes empirical comparison between training approaches rigorous rather than anecdotal.
On the evaluation side, I built aeval, a local-first model evaluation platform with five core eval types (factuality, reasoning, instruction-following, safety, and code generation), adversarial safety testing with refusal detection, and data contamination detection via SHA-256 hashing. Statistical rigor was a first-class concern: the platform implements bootstrap confidence intervals, Welch's t-test, Cohen's d effect size, and saturation detection, with CI/CD integration for regression detection and automated safety gates. The stack — FastAPI orchestrator, TimescaleDB, Redis job queue, Next.js dashboard, Ollama — was designed for operational reliability, not just research convenience.
My earlier work on the BRAIN protein structure prediction platform (originally built in C++ with custom BPTT at UC Berkeley in 2004, rewritten in PyTorch in 2026) covers models from 413 parameters to 8 billion, a roughly 19-million-fold increase in scale, across five neural architectures: feedforward, GRU, Transformer, ESM-2, and multi-task models. The platform includes MLflow experiment tracking, Optuna hyperparameter optimization, FastAPI serving, and 823 automated tests across six Docker containers. The original system produced a paper on artificial neural networks for protein secondary structure prediction that was accepted at NeurIPS 2014, which gives me a baseline understanding of what peer-reviewed empirical research requires.
At Intuit, I operated at a different kind of scale: the ICE platform I managed reached 675M+ engagements in FY23 across QuickBooks, TurboTax, Mint, Mailchimp, and Credit Karma, with throughput scaled from 6K to 50K TPS via an RSocket migration supporting approximately 1.5M concurrent connections at sub-25ms TP99. This is not ML systems work, but it is the engineering discipline (instrumentation, benchmarking, capacity planning, performance optimization) that transfers directly to the infrastructure problems in the ML Systems & Performance workstream.
---
**Why This Workstream**
The ML Systems & Performance workstream is specifically looking for engineers who can balance research exploration with operational reliability, work across distributed systems and high-performance computing, and contribute to training, fine-tuning, and evaluation of large language models. The project examples listed — CPU simulators for accelerator workloads, new accelerator backends on open-source projects, on-demand infrastructure for compute-heavy research, synthetic data pipelines — are exactly the class of infrastructure problems I find most interesting: they are not glamorous, but they are what determines whether empirical research is reproducible and whether research velocity compounds over time.
Anthropic's recent acquisition of Stainless and the expanding developer platform investment signal that the engineering infrastructure layer is becoming a strategic priority, not just a support function. The compute deal with SpaceX and the $15K/month compute allocation for fellows suggest that the ML Systems workstream is positioned to work on problems with real resource constraints and real performance requirements — not toy benchmarks.
---
**Selected Prior Experience**
- **RL Workbench (2026):** Built 3-phase post-training platform implementing 12 RL algorithms (PPO, GRPO, DAPO, DPO, SimPO, and others) with live SSE metric streaming, GPU Docker passthrough, and standardized benchmarking across TRL, VeRL, OpenRLHF, and NeMo RL — directly relevant to Anthropic's RLHF/RLAIF alignment infrastructure.
- **aeval (2025–2026):** Built production model evaluation platform with adversarial safety testing, refusal detection, data contamination detection, and statistical rigor (bootstrap CIs, Welch's t-test, Cohen's d) — FastAPI, TimescaleDB, Redis, Ollama stack with CI/CD regression detection.
- **BRAIN ML Platform (UC Berkeley 2004; rewritten 2026):** PyTorch platform spanning 413 to 8B parameters across five architectures, with MLflow + Optuna, FastAPI serving, 823 automated tests, and six Docker containers; paper published at NeurIPS 2014.
- **AutoEval — Automated Visual Evaluation for Robot Model Training (2025):** Repurposed screen capture and multimodal AI pipeline to score model outputs (grasp poses, segmentation maps, bounding boxes) against natural-language rubrics; reduced evaluation cycles from 72 hours to approximately 4 minutes using Claude/GPT-4V for spatial reasoning with structured PASS/FAIL reporting.
- **Intuit ICE Platform (2021–2024):** Scaled platform to 675M+ engagements and 50K TPS via an RSocket migration supporting ~1.5M concurrent connections at sub-25ms TP99; delivered ICE Self-Service platform reducing developer onboarding from 2–3 weeks to minutes.
- **Java and Python SDK Starter Kits (Intuit):** Extended SDKs with scaffolding templates, Gradle/Maven build configurations, testing frameworks, and CI/CD integration — enabling developers to reach production-ready microservices in minutes.
- **Splunk Search Service (2019–2021):** Owned Go microservices for Search Service and Search Catalog (PostgreSQL metadata); delivered Scheduler Service end-to-end in ~4 months; achieved up to 10x query performance improvements for beta enterprise customers.
---
Anthropic's stated view — that AI research is an empirical science with as much in common with physics and biology as with traditional computer science — matches how I have approached ML systems work: instrument everything, benchmark rigorously, make comparisons reproducible, and build infrastructure that compounds research velocity rather than constraining it. The Fellows program's emphasis on producing a public output aligns with my prior NeurIPS publication and my current practice of building systems with the documentation and rigor required for external review.
I would welcome the opportunity to discuss how my background maps to the specific infrastructure problems the ML Systems & Performance workstream is pursuing.
Sincerely,
**O. Felix Amoruwa**
famoruwa@berkeley.edu | 909-731-9011 | felixamoruwa.info