cursor / Product Manager, Agent Harness
brief / art_ggbaoWbLjKI
model: anthropic/claude-sonnet-4.6
created: 2026-05-20T01:40
Company snapshot
Cursor is an AI-first code editor built on a VS Code fork and developed by Anysphere Inc., focused on automating coding for professional developers. It has gained significant traction in developer tools, reportedly reaching $100M+ ARR faster than nearly any software company in history (per public reporting circa 2024-2025). Recent engineering milestones include shipping Composer 2 (the agentic coding workflow), training a proprietary frontier coding model, and applying real-time RL on user data to improve agent quality. The team is intentionally small, flat, and research-adjacent: engineering and research are tightly coupled, and PMs are expected to operate at a deeply technical level. Among developers, Cursor has a strong reputation for shipping fast while maintaining high product quality. Specific internal team structures and named individuals are not publicly confirmed.
Team stack
- Core editor: VS Code fork (TypeScript/Electron). Confirmed by the public product.
- Agent harness: likely a TypeScript/Python orchestration layer managing LLM calls, tool use, file system access, and terminal interaction (inferred from the JD's description of agent decomposition and tool primitives).
- Model serving: proprietary frontier model plus third-party LLMs (GPT-4, Claude, likely others). Inferred from public product behavior.
- Evaluation infrastructure: likely a Python-based harness with custom benchmarking pipelines; the JD explicitly calls out building eval/benchmarking systems.
- Extensibility: MCP (Model Context Protocol) integration, explicitly named in the JD.
- RL training pipeline: real-time RL on user data is mentioned in the JD; framework specifics are not public but likely PyTorch-based.
- Observability/tracing: agent trace analysis at scale is implied; specific tooling (Jaeger, custom, etc.) is uncertain.
- Multi-agent coordination layer: described in the JD as an active area; specifics are not public.
Likely questions (10)
| area | question | why |
|---|---|---|
| system_design | Walk us through how you would design an agent harness that can decompose a complex multi-file refactor into subtasks, handle partial failures mid-execution, and recover gracefully without losing context. What are the key primitives? | JD explicitly lists 'owning the agent planning and execution framework: how agents decompose tasks, decide what tools to use, and recover when a step fails' as a core responsibility. |
| domain | You're analyzing agent traces at scale and notice a pattern where agents loop on a specific class of file-system operations. Walk us through your process: how do you identify the failure mode, quantify its frequency, and turn it into a concrete product or research fix? | JD calls out 'analyzing agent traces at scale: identifying where agents get stuck, loop, hallucinate, or take unproductive paths' as a primary workstream. |
| domain | How would you define and build an evaluation framework for agent quality? What metrics would you use to distinguish 'the agent completed the task' from 'the agent completed the task well,' and how do you avoid Goodhart's Law? | JD explicitly asks for 'strong intuition for evaluation and measurement' and lists 'building evaluation and benchmarking systems' as an example project. |
| system_design | Describe how you would architect multi-agent coordination when several subagents are executing in parallel across overlapping files and systems. How do you handle context sharing, conflict resolution, and avoiding redundant or contradictory edits? | JD names 'shaping multi-agent coordination: how subagents share context and avoid conflicts when executing in parallel' as a specific example project. |
| domain | You've worked with GRPO, DPO, PPO, and other RL algorithms in your workbench. How would you think about applying real-time RL on user interaction data to improve agent behavior in a product like Cursor? What reward signal would you use, and what are the risks? | JD states Cursor is 'training agents through real-time RL on user data' and asks for 'experience with reinforcement learning' — directly maps to candidate's RL Workbench evidence. |
| behavioral | Tell us about a time you had to make a hard product tradeoff with incomplete information — specifically in an AI or agent system where empirical results were ambiguous. How did you decide, and what happened? | JD explicitly states 'comfortable in a research-adjacent environment where the roadmap is shaped by empirical results, not just customer requests' and 'making hard tradeoffs with incomplete information.' |
| coding | Given a log of agent tool calls and LLM outputs for a failed coding task, write a script (Python or TypeScript) to parse the trace, identify the first divergence from expected behavior, and output a structured failure report. Walk us through your approach. | JD states 'you'll be reading agent traces' and 'deeply technical — comfortable reading code, analyzing traces, and reasoning about system behavior at a low level.' This is a practical screen for that claim. |
| domain | How would you design the developer-facing UX for observing and steering a running agent — real-time progress, the ability to redirect mid-task, and guardrails — without creating so much friction that developers just turn it off? | JD lists 'designing how developers observe and steer agents: real-time progress, guardrails, the ability to redirect mid-task' as a core example project. |
| culture | Cursor is a small, flat team that ships code fast and values spirited debate. How do you operate in an environment where you're expected to write code, read traces, and challenge research decisions — not just write specs? | JD explicitly says 'this is not a role where you write specs and hand them off' and the company description emphasizes flat org, small team, truth-seeking, and shipping code. |
| domain | Walk us through how you defined the primitives for tool use and external service integration in your OpenClaw multi-agent framework — gateway protocol, subagent delegation, session management. What would you do differently for a production-scale agent harness serving millions of developers? | JD asks for experience defining 'primitives for agent extensibility: how agents use tools, access codebase context, call external services via MCPs and plugins' — directly maps to candidate's OpenClaw evidence. |
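
For the coding question in the table above, a minimal practice sketch may help. Everything about the trace format here is an assumption made for illustration: one JSON object per line with hypothetical `tool` and `status` fields, and a reference plan given as an ordered list of tool names. Real agent traces will look different, and "first divergence" may need a richer definition than a step-by-step comparison against a plan.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class FailureReport:
    step: int                     # index of the first divergent step
    tool: str                     # tool the agent actually called
    expected_tool: Optional[str]  # tool the reference plan expected, if any
    reason: str                   # "error_status", "unexpected_tool", or "extra_step"


def first_divergence(trace_lines, expected_tools):
    """Return a FailureReport for the first step that departs from the plan, or None."""
    for i, line in enumerate(trace_lines):
        event = json.loads(line)  # hypothetical schema: {"tool": ..., "status": ...}
        expected = expected_tools[i] if i < len(expected_tools) else None
        if event.get("status") != "ok":
            return FailureReport(i, event["tool"], expected, "error_status")
        if expected is None:
            return FailureReport(i, event["tool"], None, "extra_step")
        if event["tool"] != expected:
            return FailureReport(i, event["tool"], expected, "unexpected_tool")
    return None


# Illustrative data only: the agent repeats an edit instead of running tests.
trace = [
    '{"tool": "read_file", "status": "ok"}',
    '{"tool": "edit_file", "status": "ok"}',
    '{"tool": "edit_file", "status": "ok"}',
    '{"tool": "run_tests", "status": "error"}',
]
plan = ["read_file", "edit_file", "run_tests"]

report = first_divergence(trace, plan)
print(json.dumps(asdict(report)))
```

On this sample the report points at step 2 (the repeated `edit_file`) rather than the later `run_tests` error, which is the distinction the question is probing: the visible failure is often downstream of the first wrong step. Note that this sketch does not flag a trace that simply stops early; a fuller version would also compare trace length against the plan.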
Talking points
- Built a production RL post-training workbench (RL Workbench, 2026) implementing 12 algorithms — PPO, GRPO, DAPO, DPO, SimPO, and others — with live SSE metric streaming, cross-framework benchmarking (TRL, VeRL, OpenRLHF, NeMo RL), and GPU Docker passthrough. This is direct hands-on experience with the RL training pipeline Cursor is actively using to improve agent quality, not just familiarity with the concepts.
- Designed and built aeval, a local-first AI model evaluation platform with 5 eval types, adversarial safety testing, refusal detection, bootstrap confidence intervals, Welch's t-test, and Cohen's d effect size — plus CI/CD integration with automated regression detection. This maps directly to the JD's requirement to build evaluation and benchmarking systems that 'drive engineering and research priorities,' with demonstrated statistical rigor.
- Architected OpenClaw, a multi-agent orchestration framework with gateway protocol, subagent delegation, profile management, and session switching — deployed in a production application (StreamIO) serving real users across multiple industry verticals. This is a working implementation of the agent coordination primitives the JD describes, including context sharing and task decomposition across agents.
- At Intuit, scaled the ICE developer platform to 675M+ engagements in FY23 and reduced developer onboarding from 2-3 weeks to minutes — demonstrating the ability to own a complex developer-facing platform end-to-end, instrument it with telemetry (SQL, BigQuery), and drive measurable outcomes. Directly relevant to Cursor's need for a PM who can own the agent harness as a developer platform, not just a feature.
- NeurIPS-published researcher (2014) with a 20-year arc from hand-coded BPTT in C++ to 8B-parameter PyTorch models — signals genuine depth in ML fundamentals, not just product-layer familiarity. Combined with the RL Workbench and aeval projects, positions the candidate as credibly research-adjacent in the way the JD requires.
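
To back up the statistics named in the aeval talking point, it may be worth being able to write them from memory. The sketch below uses only the Python standard library and is not aeval's actual code: the function names are hypothetical, Cohen's d uses the pooled standard deviation (one of several conventions), and the bootstrap is a plain percentile interval on the difference of means.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d effect size using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    diffs = sorted(
        mean(rng.choices(a, k=len(a))) - mean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Illustrative per-task scores for two model variants (made-up numbers).
scores_a = [0.8, 0.9, 0.7, 0.85, 0.75]
scores_b = [0.6, 0.65, 0.7, 0.55, 0.6]

t_stat, dof = welch_t(scores_a, scores_b)
effect = cohens_d(scores_a, scores_b)
ci_low, ci_high = bootstrap_ci(scores_a, scores_b)
print(f"t={t_stat:.3f} df={dof:.2f} d={effect:.3f} CI=({ci_low:.3f}, {ci_high:.3f})")
```

The sketch stops at the t statistic and degrees of freedom because the standard library has no Student's t distribution; turning them into a p-value would need something like SciPy. In an interview, the useful framing is why the three are reported together: the test addresses whether a difference is likely real, the effect size how large it is, and the interval how uncertain the estimate remains.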