role
cursor / Product Manager, Agent Harness
model
anthropic/claude-sonnet-4.6
created
2026-05-20T01:48
Cover letter

Dear Cursor Hiring Team,

Cursor is doing something rare: treating the automation of coding not as a feature to ship but as a foundational research and engineering problem worth solving from first principles. That framing resonates with me directly — I have spent the last two years building agent orchestration frameworks, RL post-training workbenches, and evaluation infrastructure from scratch, not as adjacent work but as the core product. When I read the Agent Harness role, I recognized the exact problem space I have been living in.

## Technical Foundation

My AI/ML work is hands-on and recent. In 2025–2026 I built an RL post-training workbench covering the full RLHF/DPO pipeline: a Reward Lab for designing and A/B testing reward functions (RLVR, learned, hybrid) across GSM8K, MATH, HumanEval, and UltraFeedback; a Playground running real TRL-powered GRPO and DPO training with live SSE metric streaming on Apple Silicon (MPS) and CUDA; and an Arena for head-to-head framework benchmarking across TRL, VeRL, OpenRLHF, and NeMo RL with GPU passthrough in Docker containers. I implemented 12 RL algorithms — PPO, GRPO, DAPO, REINFORCE, REINFORCE++, RLOO, DPO, SimPO, IPO, KTO, ORPO, SPPO — with algorithm-specific metric profiles and standardized throughput, memory, and convergence benchmarking. This is the kind of empirical, measurement-first work the Agent Harness role requires.

On evaluation specifically, I built aeval — a local-first model evaluation platform with five core eval types (factuality, reasoning, instruction-following, safety, code generation), adversarial safety testing with refusal detection, and data contamination detection via SHA-256 hashing. Statistical rigor was non-negotiable: bootstrap confidence intervals, Welch's t-test, Cohen's d effect size, and saturation detection. CI/CD integration with regression detection and automated safety gates rounds out the stack (FastAPI orchestrator, TimescaleDB, Redis job queue, Next.js dashboard, Ollama). Defining what "good" means and building the harness to measure it is exactly what I built aeval to do.

For multi-agent orchestration, I designed and implemented OpenClaw — a multi-agent gateway protocol with subagent delegation, profile management, and session switching — enabling coordinated AI agent workflows across distinct domain agents in StreamIO. I understand the coordination problems that arise when subagents share context, and I have made concrete architectural decisions about how to scope agent authority, handle delegation failures, and route tasks to the right subagent.

My ML foundation goes back further. My NeurIPS 2014 paper on neural networks for protein secondary structure prediction grew out of a system I hand-coded in C++ in 2004, with custom backpropagation through time. In 2026 I rewrote that system in PyTorch and scaled it from 413 parameters to 8B (a roughly 19-million-fold increase). I am comfortable at every layer of the stack, from algorithm implementation to infrastructure to product.

## Why This Role

The through-line in my career is building infrastructure that makes complex systems observable, steerable, and trustworthy for developers — from scaling Intuit's ICE platform to 675M+ engagements and 50K TPS, to designing developer SDKs and self-service portals that reduced onboarding from weeks to minutes. The Agent Harness sits at exactly that intersection: making agents that are technically capable also feel reliable and controllable to the developers using them.

What excites me most about this specific role is the evaluation and trace analysis work. Reading agent traces to identify where agents loop, hallucinate, or take unproductive paths — and turning those patterns into concrete product improvements — is the kind of empirical feedback loop I built aeval and the RL Workbench to support. I also find the multi-agent coordination problem genuinely hard and interesting: when developers spin up fleets of agents in parallel, the context-sharing and conflict-avoidance primitives become load-bearing, and getting them wrong produces exactly the failure modes (loops, contradictory edits, stalled tasks) that erode developer trust.

## Selected Relevant Experience

- **RL Workbench (2026):** Built 3-phase post-training platform benchmarking GRPO/DPO across TRL, VeRL, OpenRLHF, and NeMo RL; implemented 12 RL algorithms with cross-tab workflow lineage tracking and standardized convergence benchmarking — directly applicable to Cursor's real-time RL training on user data.

- **aeval (2025–2026):** Built model evaluation platform with adversarial safety testing, refusal detection, bootstrap confidence intervals, and automated CI/CD safety gates — the same evaluation discipline the Agent Harness role requires for defining and measuring agent quality.

- **OpenClaw multi-agent orchestration:** Designed gateway protocol, subagent delegation, profile management, and session switching for coordinated AI agent workflows — directly relevant to multi-agent coordination and the primitives for agent extensibility.

- **AutoEval — Automated Visual Evaluation for Robot Model Training (2025):** Repurposed screen capture and multimodal AI pipeline to score model outputs against natural-language rubrics, reducing evaluation cycles from 72 hours to ~4 minutes — zero-integration architecture with structured PASS/FAIL reports and confidence scores.

- **Intuit ICE Platform — Developer Frameworks & Platform Infrastructure (2021–2024):** Delivered ICE Self-Service platform (DevPortal, GitOps config, ICE Playground) reducing developer onboarding from 2–3 weeks to minutes; scaled throughput from 6K to 50K TPS via RSocket migration supporting ~1.5M concurrent connections with sub-25ms TP99.

- **Intuit SDK Starter Kits:** Extended Java and Python SDKs with scaffolding templates, build configurations, testing frameworks, and CI/CD integration — empowering developers to go from zero to production-ready microservice in minutes.

- **Splunk Search Orchestration (2019–2021):** Owned Search Service (Go microservices), Search Catalog (PostgreSQL metadata service), and SPL/SPL2; delivered Scheduler Service end-to-end in ~4 months and achieved up to 10x query performance improvements for a Fortune 500 beta customer.

## Closing

Cursor's mission — automating coding — is meaningful precisely because it is hard. The Agent Harness is where that mission either holds together or falls apart for the developer using it. I want to work on that problem: reading traces, defining what failure looks like, building the measurement infrastructure that drives research priorities, and shipping agent behavior that developers can trust. My background spans the full range this role requires — RL implementation, evaluation platform design, multi-agent orchestration, and developer platform product management at scale — and I would welcome the chance to bring it to bear here.

Thank you for your consideration.

**O. Felix Amoruwa**
famoruwa@berkeley.edu | 909-731-9011 | felixamoruwa.info