jobsearch v0.0.1


brief / art_ggbaoWbLjKI

role: cursor / Product Manager, Agent Harness
model: anthropic/claude-sonnet-4.6
created: 2026-05-20T01:40

Company snapshot

Cursor is an AI-first code editor built on a VS Code fork and developed by Anysphere Inc., aimed at automating coding work for professional developers. The product has gained significant traction in the developer tools space, reportedly reaching $100M+ ARR faster than nearly any software company in history (per public reporting circa 2024-2025). Recent engineering milestones include shipping Composer 2 (their agentic coding workflow), training a proprietary frontier coding model, and applying real-time RL on user data to improve agent quality. The team is intentionally small, flat, and research-adjacent: engineering and research are tightly coupled, and PMs are expected to operate at a deeply technical level. Among developers, Cursor has a strong reputation for shipping fast and maintaining high product quality. Specific internal team structures and named individuals are not publicly confirmed.

Team stack

Core editor: VS Code fork (TypeScript/Electron); confirmed by public product.
Agent harness: likely a TypeScript/Python orchestration layer managing LLM calls, tool use, file system access, and terminal interaction (based on JD description of agent decomposition and tool primitives).
Model serving: proprietary frontier model plus third-party LLMs (GPT-4, Claude, likely others); inferred from public product behavior.
Evaluation infrastructure: likely a Python-based harness with custom benchmarking pipelines; JD explicitly calls out building eval/benchmarking systems.
Extensibility: MCP (Model Context Protocol) integration; explicitly named in JD.
RL training pipeline: real-time RL on user data mentioned in JD; framework specifics not public but likely PyTorch-based.
Observability/tracing: agent trace analysis at scale implied; specific tooling (Jaeger, custom, etc.) uncertain.
Multi-agent coordination layer: described in JD as an active area; specifics not public.

Likely questions (10)

1. area: system_design
   question: Walk us through how you would design an agent harness that can decompose a complex multi-file refactor into subtasks, handle partial failures mid-execution, and recover gracefully without losing context. What are the key primitives?
   why: JD explicitly lists 'owning the agent planning and execution framework: how agents decompose tasks, decide what tools to use, and recover when a step fails' as a core responsibility.

2. area: domain
   question: You're analyzing agent traces at scale and notice a pattern where agents loop on a specific class of file-system operations. Walk us through your process: how do you identify the failure mode, quantify its frequency, and turn it into a concrete product or research fix?
   why: JD calls out 'analyzing agent traces at scale: identifying where agents get stuck, loop, hallucinate, or take unproductive paths' as a primary workstream.

3. area: domain
   question: How would you define and build an evaluation framework for agent quality? What metrics would you use to distinguish 'the agent completed the task' from 'the agent completed the task well,' and how do you avoid Goodhart's Law?
   why: JD explicitly asks for 'strong intuition for evaluation and measurement' and lists 'building evaluation and benchmarking systems' as an example project.

4. area: system_design
   question: Describe how you would architect multi-agent coordination when several subagents are executing in parallel across overlapping files and systems. How do you handle context sharing, conflict resolution, and avoiding redundant or contradictory edits?
   why: JD names 'shaping multi-agent coordination: how subagents share context and avoid conflicts when executing in parallel' as a specific example project.

5. area: domain
   question: You've worked with GRPO, DPO, PPO, and other RL algorithms in your workbench. How would you think about applying real-time RL on user interaction data to improve agent behavior in a product like Cursor? What reward signal would you use, and what are the risks?
   why: JD states Cursor is 'training agents through real-time RL on user data' and asks for 'experience with reinforcement learning'; this maps directly to the candidate's RL Workbench evidence.

6. area: behavioral
   question: Tell us about a time you had to make a hard product tradeoff with incomplete information, specifically in an AI or agent system where empirical results were ambiguous. How did you decide, and what happened?
   why: JD explicitly states 'comfortable in a research-adjacent environment where the roadmap is shaped by empirical results, not just customer requests' and 'making hard tradeoffs with incomplete information.'

7. area: coding
   question: Given a log of agent tool calls and LLM outputs for a failed coding task, write a script (Python or TypeScript) to parse the trace, identify the first divergence from expected behavior, and output a structured failure report. Walk us through your approach.
   why: JD states 'you'll be reading agent traces' and 'deeply technical — comfortable reading code, analyzing traces, and reasoning about system behavior at a low level.' This is a practical screen for that claim.

8. area: domain
   question: How would you design the developer-facing UX for observing and steering a running agent (real-time progress, the ability to redirect mid-task, and guardrails) without creating so much friction that developers just turn it off?
   why: JD lists 'designing how developers observe and steer agents: real-time progress, guardrails, the ability to redirect mid-task' as a core example project.

9. area: culture
   question: Cursor is a small, flat team that ships code fast and values spirited debate. How do you operate in an environment where you're expected to write code, read traces, and challenge research decisions, not just write specs?
   why: JD explicitly says 'this is not a role where you write specs and hand them off' and the company description emphasizes flat org, small team, truth-seeking, and shipping code.

10. area: domain
    question: Walk us through how you defined the primitives for tool use and external service integration in your OpenClaw multi-agent framework: gateway protocol, subagent delegation, session management. What would you do differently for a production-scale agent harness serving millions of developers?
    why: JD asks for experience defining 'primitives for agent extensibility: how agents use tools, access codebase context, call external services via MCPs and plugins'; this maps directly to the candidate's OpenClaw evidence.
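
Practice sketch for the trace-analysis questions above (the coding question and the looping-agent question). This is one possible approach, not a known Cursor format or tool. Everything here is an assumption made for illustration: the trace is taken to be JSON Lines, the field names (tool, args, status, error) are hypothetical, and "expected behavior" is defined narrowly as "every tool call succeeds and no identical call repeats loop_threshold times in a row."

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterable, List, Optional


@dataclass
class FailureReport:
    step: int    # zero-based index of the first divergent record
    kind: str    # "tool_error" or "loop"
    tool: str    # tool name at the divergent step
    detail: str  # error message or loop description


def parse_trace(lines: Iterable[str]) -> List[dict]:
    """Parse a JSON Lines trace (hypothetical format), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def first_divergence(records: List[dict], loop_threshold: int = 3) -> Optional[FailureReport]:
    """Return the earliest tool error or run of identical calls, or None."""
    prev_key = None
    repeats = 0
    for i, rec in enumerate(records):
        # Two calls count as identical when tool name and arguments match.
        key = (rec["tool"], json.dumps(rec.get("args", {}), sort_keys=True))
        repeats = repeats + 1 if key == prev_key else 1
        prev_key = key
        if rec.get("status") == "error":
            return FailureReport(i, "tool_error", rec["tool"], rec.get("error", ""))
        if repeats >= loop_threshold:
            return FailureReport(i, "loop", rec["tool"],
                                 f"{repeats} identical consecutive calls")
    return None


# Made-up trace: three identical list_dir calls, then a failed edit.
trace = """\
{"tool": "read_file", "args": {"path": "a.py"}, "status": "ok"}
{"tool": "list_dir", "args": {"path": "."}, "status": "ok"}
{"tool": "list_dir", "args": {"path": "."}, "status": "ok"}
{"tool": "list_dir", "args": {"path": "."}, "status": "ok"}
{"tool": "edit_file", "args": {"path": "a.py"}, "status": "error", "error": "patch failed"}
"""

report = first_divergence(parse_trace(trace.splitlines()))
print(json.dumps(asdict(report), indent=2))
```

On the sample trace the report points at step 3 (the third identical list_dir call), not the later edit_file error, because the loop is the earlier divergence. In an interview it is worth stating that trade-off aloud: a real harness would likely need fuzzier loop detection (near-identical arguments, cycles longer than one call) and a richer notion of "expected" than success flags.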

Talking points