jobsearch v0.0.1


interviewer_questions / art_AJqWFV3Wvjs

role: openai / Product Manager, API Agents
model: anthropic/claude-sonnet-4.6
created: 2026-05-20T19:00

Interviewer

The interviewer profile provided is not a standard LinkedIn bio but rather a structured list of interview questions spanning system design, SDK/coding, domain knowledge, behavioral, and culture/values categories. This appears to be the actual interview question set for the PM, API Agents loop at OpenAI — likely representing a technical PM interviewer or panel with deep platform and agentic systems expertise. The questions signal someone who thinks rigorously about API primitives, developer trust, abstraction tradeoffs, and the philosophical boundary of agent autonomy. The interviewer appears to prioritize both technical depth (state management, rate limiting, SDK design) and product judgment (wrong abstractions, data-driven decisions, backward compatibility). Based on the question set, the interviewer likely has an engineering or technical PM background with direct experience building developer platforms or SDKs.

My profile through their lens

Felix is unusually well-positioned for this role from this interviewer's perspective: he has actually built a multi-agent orchestration framework (OpenClaw), shipped real streaming pipelines with event-driven architectures, and has Staff PM experience on a developer platform at Intuit scale (675M+ engagements). The RL Workbench project — benchmarking GRPO/DPO across TRL, VeRL, OpenRLHF, and NeMo RL — gives him peer-level credibility on post-training topics that OpenAI researchers care about. His NeurIPS publication adds research legitimacy. The interviewer will likely probe whether Felix's hands-on building experience translates into principled API design thinking, or whether it stays at the implementation level. The gap to watch is that Felix's agent work is largely self-directed founder work, not shipping APIs to millions of external developers — the interviewer will test whether he can generalize from builder to platform designer.

Questions they may ask (20)

category | question | why | how to prepare
resume_deep_dive Walk me through OpenClaw — specifically the gateway protocol and subagent delegation model. What primitives did you expose, and what did you learn about where the abstraction broke down? The interviewer's system design question directly asks about agent execution primitives and state management. Felix built OpenClaw, which is the closest analog on his resume to what OpenAI's Agents API does. The interviewer will use this to assess whether Felix thinks like a platform designer or just an implementer. Prepare a crisp architectural walkthrough of OpenClaw: gateway protocol design, how subagent delegation works, what state is managed where, and — critically — one honest example of where the abstraction was wrong or incomplete. Connect it explicitly to what you'd do differently designing a public API for millions of developers.
resume_deep_dive At Intuit, you scaled ICE to 675M engagements and 50K TPS via rSocket migration. What were the hardest product decisions you made during that scaling — not engineering decisions, but product decisions about what to expose to developers and what to hide? The interviewer cares about developer platform product judgment at scale. Felix's ICE work is his strongest evidence of shipping developer infrastructure at consumer scale, but the interviewer will want to see product thinking, not just execution metrics. Prepare 2-3 specific product decisions from ICE scaling: what you chose to abstract away from developers, what you chose to expose, and what tradeoffs you made between flexibility and opinionation. Tie this to the low-level primitives vs. higher-level abstractions question the interviewer explicitly asks.
resume_deep_dive Your RL Workbench benchmarks 12 algorithms across TRL, VeRL, OpenRLHF, and NeMo RL. From a product perspective, what did you learn about what developers actually need when doing RL post-training — and what's missing from the current tooling ecosystem? OpenAI does post-training and alignment work centrally, and this PM role touches model capability delivery to developers. Felix's RL Workbench is direct evidence of hands-on engagement with the post-training stack. The interviewer will use this to assess domain depth and product intuition about the developer experience gap. Prepare a crisp 3-point answer: (1) what the biggest developer friction points are in RL post-training today, (2) what abstraction or API primitive would most reduce that friction, and (3) how you'd think about the tradeoff between framework-agnostic vs. opinionated tooling. Reference specific pain points you hit building the workbench.
resume_deep_dive You built aeval with bootstrap confidence intervals, Welch's t-test, and Cohen's d for model evaluation. How did you decide what statistical rigor was 'enough' for a developer-facing eval platform — and how would you apply that thinking to OpenAI's API evaluation tooling? The interviewer's behavioral questions probe data-driven decision making. Felix's aeval project is direct evidence of quantitative rigor in AI evaluation. This question tests whether Felix can connect his builder experience to product decisions about what to expose to developers. Prepare a concrete answer about the tradeoff between statistical correctness and developer usability in eval tooling — where did you simplify, where did you not, and why. Then connect to what you'd prioritize if you were designing OpenAI's eval API primitives.
technical_domain Design a durable, resumable multi-step agent execution API. What primitives do you expose, how do you handle partial failures mid-run, and how do you think about state management for developers who need to debug or replay a failed run? This is the interviewer's explicit system design question. Felix has built OpenClaw (multi-agent orchestration), StreamIO's streaming pipeline (WebSocket + HLS + state management), and the RL Workbench (cross-tab workflow lineage tracking). The interviewer will expect Felix to draw on all of this to give a principled API design answer. Prepare a structured answer: (1) core primitives — run, step, checkpoint, tool_call, handoff; (2) state management model — where state lives, how it's serialized, idempotency keys; (3) partial failure handling — retry semantics, compensation, human-in-the-loop checkpoints; (4) developer debugging surface — event log, replay API. Reference your OpenClaw and StreamIO experience as concrete grounding.
technical_domain How would you design a rate-limiting and quota system for an agentic API where a single user request fans out into dozens of model calls, tool calls, and sub-agent invocations — and where cost and latency are non-deterministic? Felix scaled ICE to 50K TPS with sub-25ms TP99 and built multi-agent orchestration with subagent delegation. The interviewer will test whether Felix can translate that operational experience into a principled quota design for non-deterministic agentic workloads. Prepare a layered answer: token-bucket vs. sliding window for different resource types (tokens, tool calls, sub-agent spawns), how you handle fan-out attribution (parent run vs. child agent), how you expose quota state to developers in real-time, and how you handle graceful degradation vs. hard cutoffs. Reference your ICE TPS scaling experience.
technical_domain Sketch the interface design for a streaming agent run events SDK method. What events do you expose, how do you handle backpressure, and what's a good error taxonomy for agentic workflows? Felix built real-time HLS streaming pipelines with WebSocket communication layers, SSE metric streaming in the RL Workbench, and cross-platform SDKs. The interviewer is testing whether Felix can translate that implementation experience into principled SDK design thinking. Prepare a concrete event taxonomy: model_delta, tool_call_start, tool_call_result, handoff, checkpoint, error, run_complete. For backpressure: discuss async iterators, buffering strategies, and what happens when the consumer is slow. For error taxonomy: distinguish transient vs. permanent, tool errors vs. model errors vs. orchestration errors. Reference your SSE streaming work in the RL Workbench.
technical_domain How do you think about the tradeoff between flexible, composable low-level primitives versus opinionated higher-level abstractions in an agents SDK — and how has that tradeoff shifted as the ecosystem has matured? Felix has been on both sides: building low-level infrastructure (ICE, rSocket, OpenClaw gateway) and higher-level abstractions (SDK Starter Kits, DevPortal). The interviewer explicitly asks this question and will probe whether Felix has a principled view, not just anecdotes. Prepare a structured answer: early ecosystem needs primitives (escape hatches, composability), mature ecosystem benefits from opinionated abstractions (reduced cognitive load, best practices baked in). Use your Intuit SDK Starter Kit experience as a concrete example of when you chose opinionation, and your OpenClaw gateway as an example of when you chose composability. Reference how LangChain vs. raw API usage patterns have shifted as evidence.
gap_transition Your most recent Staff PM role ended in September 2024 and you've been running two founder ventures since. These are pre-revenue, self-directed products. How do you think about the transition back to a large-scale platform PM role where you're shipping APIs to millions of external developers rather than building for yourself? The interviewer will notice the gap between Felix's Intuit exit and now. The founder work is impressive technically but the audience is different — Felix has been building for himself and a small user base, not designing APIs for millions of developers with backward compatibility obligations. Acknowledge the difference directly and honestly: founder mode optimizes for speed and learning, platform PM mode optimizes for stability, developer trust, and ecosystem health. Frame your founder work as giving you the builder empathy that makes you a better platform PM — you've felt the pain of bad APIs firsthand. Emphasize your Intuit experience as the platform-at-scale anchor.
gap_transition OpenAI ships capability very fast and sometimes deprecates or pivots APIs that developers have built on. As a founder who has built on top of third-party APIs (Claude MCP SDK, Stripe, ElevenLabs), what's your personal experience with that pain — and how would you encode developer trust into your product decisions as a PM here? The interviewer explicitly asks this culture/values question. Felix has direct experience as an API consumer who has been subject to third-party API changes. This is a rare and genuine anchor for an authentic answer. Prepare a specific story from your founder experience where a third-party API change (or the fear of one) affected your product decisions. Then articulate a principled framework: versioning strategy, deprecation timelines, migration tooling, and how you'd balance research velocity against ecosystem stability. Be honest about the tension — don't pretend it's easy.
gap_transition Your NeurIPS paper is from 2014 — over a decade ago. How have you stayed current with AI research, and how would you engage with OpenAI's research teams as a peer rather than just a consumer of their outputs? The interviewer will notice the gap between the 2014 NeurIPS publication and today's research landscape. Felix's RL Workbench and aeval projects are strong evidence of current engagement, but the interviewer may probe whether Felix can hold his own in research conversations at OpenAI. Prepare a crisp answer that bridges 2014 to 2026: your RL Workbench benchmarking GRPO/DPO/PPO across current frameworks, your aeval statistical rigor work, and your deep learning education platform covering current architectures. Emphasize that you engage with research by building — not just reading papers — and give a specific example of a research insight that changed a product decision you made.
behavioral_situational Tell me about a time you drove alignment across research, engineering, and go-to-market teams on a technically ambiguous product decision. What was the disagreement, how did you structure the decision, and what was the outcome? This is the interviewer's explicit behavioral question. Felix's Intuit experience (CTO-level language assessment, ICE platform decisions, Mailchimp migration) and Splunk experience (RICE framework across 3 microservice backlogs) both have strong candidate stories here. Prepare the Intuit Service Language Assessment story: you analyzed 9 languages, synthesized usage data and developer feedback, and presented strategic recommendations to the CTO. Structure it as: what the disagreement was (likely engineering preference vs. strategic consolidation), how you structured the decision (data + framework), and what the outcome was. Emphasize the cross-functional alignment mechanics, not just the analysis.
behavioral_situational Describe a developer platform feature you shipped that you later realized was the wrong abstraction. What signals told you it was wrong, and what did you do about it? The interviewer explicitly asks this question. Felix's Intuit work (ICE Self-Service, SDK Starter Kits, MSaaS Drift Detection) and Splunk work (Search Service, Scheduler Service) both likely have examples of abstractions that needed revision. Prepare a specific story — ideally from Intuit — where a developer-facing abstraction you shipped turned out to be wrong. Be honest about the failure mode: was it too opinionated, too low-level, wrong mental model? What signals (usage data, developer feedback, support tickets) told you? What did you do — iterate, deprecate, add escape hatches? The interviewer wants intellectual honesty, not a success story with a thin failure veneer.
behavioral_situational Give me an example of a time you used quantitative data to override strong engineering or design intuition on a platform product. What data, what decision, and what happened? Felix explicitly mentions using SQL/BigQuery telemetry to prioritize developer pain points at Intuit, and building statistical rigor into aeval. The interviewer will probe whether Felix can tell a specific, credible story about data overriding intuition — not just describe a process. Prepare the Intuit telemetry story: you used usage data across ~20 mobile apps and 30+ product SKUs to prioritize developer pain points. Find a specific example where the data surprised you or contradicted what engineering believed. Structure it as: what the intuition was, what the data showed, what decision you made, and what happened. Quantify the outcome if possible.
behavioral_situational What's your personal view on where the boundary should be between what an AI agent is allowed to do autonomously versus what requires a human-in-the-loop checkpoint — and how would you encode that view into product decisions? The interviewer explicitly asks this culture/values question. Felix has built autonomous agents (OpenClaw, Fintellect Agents, AutoEval) and has thought about safety in aeval (adversarial safety testing, refusal detection). The interviewer wants a principled view, not a generic safety answer. Prepare a principled framework: irreversibility (can the action be undone?), blast radius (how many users/systems affected?), confidence threshold (how certain is the model?), and domain sensitivity (financial, medical, legal). Reference your Fintellect AI work — financial advice is a domain where you've had to think about this concretely. Then articulate how you'd encode this into API design: confirmation primitives, human-in-the-loop hooks, audit logs.
role_specific_scenario OpenAI is shipping GPT-5.x models at a rapid cadence. As the Agents PM, a major enterprise customer tells you their production agent workflow breaks every time a new model drops because tool-calling behavior changed subtly. How do you prioritize fixing this against new capability work, and what product changes would you make to prevent it? This combines the interviewer's backward compatibility question with Felix's enterprise platform experience at Intuit (MSaaS Drift Detection, configuration drift scanning). Felix has direct experience with the drift detection problem and should have a strong answer. Prepare a two-part answer: (1) prioritization framework — how you weigh developer trust against capability velocity, what the threshold is for blocking a model release; (2) product solution — model versioning with pinning, behavioral regression test suite as a first-class API primitive, structured tool-call schema versioning. Reference your MSaaS Drift Detection work at Intuit as a direct analog.
role_specific_scenario You're defining the roadmap for OpenAI's Agents API for the next 6 months. The research team has three new capabilities ready to ship: (1) persistent agent memory across sessions, (2) multi-agent handoff with shared context, (3) structured tool output validation. How do you prioritize these, and what's your sequencing rationale? This is a direct product prioritization scenario that maps to Felix's RICE framework experience at Splunk and his hands-on experience building all three of these capabilities (OpenClaw has handoff, Fintellect has RAG/memory, aeval has structured output validation). The interviewer will test whether Felix can apply a principled framework, not just pick based on personal preference. Prepare a structured prioritization answer using a framework (RICE or impact/effort/risk). Consider: which capability unblocks the most developer use cases, which has the highest production failure rate today, which is most foundational (others depend on it). Likely sequencing: tool output validation first (unblocks reliability), multi-agent handoff second (enables new use cases), persistent memory third (complex, privacy implications). Justify with developer pain point data.
motivation_fit You've been a founder for the past 18 months building AI products. Why OpenAI specifically, and why this role — what does being inside OpenAI give you that you can't get as a builder on top of the API? The interviewer will probe whether Felix is joining OpenAI because he's excited about the mission and the leverage of the platform role, or because his ventures haven't gained traction. This is a high-stakes motivation question. Prepare an honest, specific answer: what you've learned as a builder on top of OpenAI's APIs that makes you want to shape the platform itself, what specific decisions you'd make differently if you were inside, and why the leverage of defining primitives for millions of developers is more compelling to you now than building one product. Avoid generic 'mission-driven' language — be specific about what you'd change about the Agents API based on your builder experience.
motivation_fit OpenAI operates at a pace and ambiguity level that's different from Intuit or Splunk. You've been a founder, which has its own kind of ambiguity. How do you think about what's different about operating inside a frontier AI lab versus a large enterprise or a startup you control? Felix has operated in three very different environments: large enterprise (Intuit, Splunk, Kaiser), founder mode (StreamIO, Fintellect), and research-adjacent (NeurIPS, Berkeley Lab). The interviewer will probe whether Felix has a realistic model of what OpenAI's operating environment is like. Prepare a nuanced answer that acknowledges the differences: at Intuit you had process and scale but slower velocity; as a founder you had autonomy but no leverage; at OpenAI you'd have research velocity and platform leverage but less control over direction. Be honest about what you find energizing vs. challenging about each. Demonstrate that you've thought about this transition seriously.
unique_to_this_interviewer Based on your experience building OpenClaw and the RL Workbench, what's one thing about the current OpenAI Agents API or SDK that you think is the wrong abstraction — and what would you change? The interviewer's question set includes 'describe a developer platform feature that was the wrong abstraction.' Felix has direct experience as a power user of OpenAI's APIs and has built competing/complementary orchestration frameworks. This question tests whether Felix has genuine, specific product opinions — not just flattery. Prepare a specific, honest critique of the current Agents API or SDK — something you've actually hit as a builder. Examples: event streaming granularity, tool call error handling, lack of durable execution primitives, context window management in multi-agent handoffs. Frame it constructively: here's the problem, here's what I'd change, here's the tradeoff. Demonstrate that you've used the product deeply enough to have a real opinion.
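
As a concrete anchor for the streaming SDK question in the table above, the suggested event taxonomy, error taxonomy, and backpressure handling could be sketched roughly as follows. This is a minimal Python sketch for rehearsal purposes, not OpenAI's actual API: the names `EventType`, `ErrorSource`, `RunEvent`, `should_retry`, and `stream_run` are all hypothetical, and the bounded `asyncio.Queue` is just one assumed way to apply backpressure when the consumer is slow.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, AsyncIterator, Dict, List, Optional


class EventType(str, Enum):
    # Event names taken from the prep notes; the enum itself is hypothetical.
    MODEL_DELTA = "model_delta"
    TOOL_CALL_START = "tool_call_start"
    TOOL_CALL_RESULT = "tool_call_result"
    HANDOFF = "handoff"
    CHECKPOINT = "checkpoint"
    ERROR = "error"
    RUN_COMPLETE = "run_complete"


class ErrorSource(str, Enum):
    # Tool errors vs. model errors vs. orchestration errors.
    TOOL = "tool"
    MODEL = "model"
    ORCHESTRATION = "orchestration"


@dataclass(frozen=True)
class RunEvent:
    type: EventType
    payload: Dict[str, Any] = field(default_factory=dict)
    error_source: Optional[ErrorSource] = None
    transient: bool = False  # True: a retry may succeed; False: permanent


def should_retry(event: RunEvent) -> bool:
    """Only transient error events are worth retrying."""
    return event.type is EventType.ERROR and event.transient


async def stream_run(
    events: List[RunEvent], buffer_size: int = 2
) -> AsyncIterator[RunEvent]:
    """Yield events through a bounded queue (assumed backpressure strategy)."""
    queue: "asyncio.Queue[Optional[RunEvent]]" = asyncio.Queue(maxsize=buffer_size)

    async def produce() -> None:
        for ev in events:
            # put() waits while the queue is full, so a slow consumer
            # slows the producer instead of growing memory without bound.
            await queue.put(ev)
        await queue.put(None)  # sentinel: stream finished

    task = asyncio.create_task(produce())
    try:
        while (ev := await queue.get()) is not None:
            yield ev
    finally:
        task.cancel()  # stop the producer if the consumer exits early
```

A consumer would iterate with `async for event in stream_run(...)` and branch on `event.type`. The design choice worth being able to defend is that the retry decision lives in the error taxonomy (`transient`) rather than in each caller.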

Preparation priorities

  1. SYSTEM DESIGN FLUENCY: Prepare a complete, structured answer to the durable/resumable agent execution API design question. This is the highest-signal technical question, and Felix has direct experience with OpenClaw and StreamIO to anchor it. Spend the most prep time here.
  2. WRONG ABSTRACTION STORY: Identify and rehearse a specific, honest story from Intuit or Splunk about a developer platform feature that was the wrong abstraction. The interviewer explicitly asks this and will probe for intellectual honesty; a polished success story will not land.
  3. FOUNDER-TO-PLATFORM-PM TRANSITION: Prepare a crisp, honest narrative about why you're moving from founder mode to platform PM at OpenAI: what you've learned as a builder on top of the API, what you'd change, and why the leverage of the platform role is more compelling than continuing to build on top of it.
  4. AGENT AUTONOMY FRAMEWORK: Develop a principled, specific framework for where the human-in-the-loop boundary should be. Avoid generic safety language; use a concrete decision framework (irreversibility, blast radius, confidence, domain sensitivity) that you'd encode into API product decisions. Reference your Fintellect financial domain experience.
  5. DATA-DRIVEN DECISION STORY: Identify a specific Intuit telemetry story where quantitative data overrode engineering intuition, with specific numbers, a clear decision, and a measurable outcome. The interviewer asks this explicitly and Felix has the raw material (SQL/BigQuery, 20 mobile apps, 30+ SKUs) but needs to sharpen it into a crisp narrative.
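
To make priority 4 concrete, the four-factor autonomy framework (irreversibility, blast radius, confidence, domain sensitivity) could be encoded as a small policy function. This is a minimal sketch under stated assumptions: `ProposedAction`, `Decision`, and `checkpoint_policy` are hypothetical names, and the default thresholds (`max_blast_radius=10`, `min_confidence=0.9`) are illustrative placeholders, not recommended values.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTONOMOUS = "autonomous"
    REQUIRE_APPROVAL = "require_approval"


# Domains the prep notes call out as sensitive.
SENSITIVE_DOMAINS = {"financial", "medical", "legal"}


@dataclass(frozen=True)
class ProposedAction:
    reversible: bool         # irreversibility: can the action be undone?
    affected_entities: int   # blast radius: users/systems touched
    model_confidence: float  # confidence in [0, 1]
    domain: str              # e.g. "financial", "general"


def checkpoint_policy(
    action: ProposedAction,
    max_blast_radius: int = 10,   # illustrative threshold (assumption)
    min_confidence: float = 0.9,  # illustrative threshold (assumption)
) -> Decision:
    """Require a human checkpoint if any single factor trips."""
    if not action.reversible:
        return Decision.REQUIRE_APPROVAL
    if action.affected_entities > max_blast_radius:
        return Decision.REQUIRE_APPROVAL
    if action.model_confidence < min_confidence:
        return Decision.REQUIRE_APPROVAL
    if action.domain in SENSITIVE_DOMAINS:
        return Decision.REQUIRE_APPROVAL
    return Decision.AUTONOMOUS
```

For example, `checkpoint_policy(ProposedAction(True, 1, 0.95, "financial"))` returns `Decision.REQUIRE_APPROVAL` because the domain is sensitive, even though the other three factors pass. In an interview answer, the interesting discussion is which of these thresholds the platform should fix and which developers should be able to configure.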

⚠ Watch-outs