jobsearch v0.0.1

brief / art_geI4JoIzRXc

role: thinkingmachines / Research Product Manager
model: anthropic/claude-sonnet-4.6
created: 2026-05-20T03:28

Company snapshot

Thinking Machines Lab is a frontier AI research organization whose stated mission is advancing collaborative general intelligence and democratizing access to AI tools. The team includes builders behind widely used AI products and open-source projects such as ChatGPT, Character.ai, Mistral open-weights models, PyTorch, OpenAI Gym, Fairseq, and Segment Anything, a pedigree that spans both applied product and foundational research. The company appears to be in an active build-out phase, hiring across research, infrastructure, and applied product roles; the broad compensation band ($175K–$475K) suggests recruiting across seniority levels. Specific recent funding rounds, headcount, and named internal projects are not publicly confirmed, so they are omitted here to avoid fabrication. Judging from the JD and team pedigree, the engineering reputation is likely at the frontier-lab tier: fast-moving and research-first, with a high technical bar for all roles.

Team stack

Based on the JD and team pedigree, the stack likely includes: Python as the primary research and infrastructure language (given the PyTorch and Fairseq lineage); PyTorch for model development and post-training (likely, given the founding team's PyTorch contributions); large-scale GPU cluster orchestration, likely Kubernetes plus SLURM or a similar HPC scheduler (inferred from the 'compute and resource roadmaps' language in the JD); internal evaluation and data pipeline tooling (likely custom, given the emphasis on evals in the preferred qualifications); model serving infrastructure at scale (inferred from the 'production systems' integration language); and possibly Triton or CUDA-level tooling for inference optimization (uncertain, not confirmed in the JD). Collaboration tooling is likely Notion/Linear/Slack given the startup stage. The data campaign infrastructure stack is unknown.

Likely questions (10)

area | question | why
system_design | Walk us through how you would design a compute resource roadmap for a team running simultaneous pre-training, post-training RLHF, and eval campaigns. How do you identify and resolve GPU bottlenecks across competing priorities? | JD explicitly calls out 'compute and resource roadmaps, identifying bottlenecks' as a core responsibility; candidate's RL Workbench with GPU Docker passthrough and multi-framework benchmarking is directly relevant.
domain | You've benchmarked GRPO, DPO, PPO, and others in your RL Workbench. If a research team came to you debating whether to use GRPO vs. DPO for a new post-training run, what questions would you ask to help them scope the decision, and how would you translate that into a project plan? | Preferred qualifications call out post-training as a key domain; the JD asks RPMs to 'translate technical ideas into actionable, well-scoped plans'. This tests both domain depth and PM translation skill.
behavioral | Describe a time you had to maintain momentum on a complex, ambiguous technical project where the research direction was still being defined. How did you create clarity without over-constraining the science? | JD explicitly calls out 'creating clarity in fast-moving, ambiguous environments' and 'understanding the rhythm of research' as core to the role.
system_design | How would you design a milestone and progress-tracking system for a frontier model training run that spans data curation, pre-training, RLHF post-training, and safety evals, where each phase has different cadences and stakeholders? | JD calls out 'defining milestones and keeping teams aligned across model development, data campaigns, infrastructure, and product integration' as a primary responsibility.
domain | Your aeval platform includes bootstrap confidence intervals, Welch's t-test, and Cohen's d for statistical rigor. How would you advise a research team on when their eval results are statistically meaningful enough to make a go/no-go decision on a model release? | Preferred qualifications highlight evals as a key domain; the JD asks RPMs to 'synthesize and communicate progress across diverse technical teams'. This tests domain credibility in evals.
coding | Given a dataset of model evaluation runs with fields like model_id, eval_type, score, timestamp, and GPU_hours, write a SQL or Python query to identify which eval types show the highest variance across model versions and flag regressions beyond one standard deviation. | Candidate's background includes SQL/BigQuery at Intuit and aeval's statistical tooling; RPM at a research lab is expected to work directly with data to surface insights.
behavioral | Tell us about a time you had to align stakeholders across research, infrastructure, legal, and business development on a single technical initiative. What was your communication strategy and what broke down? | JD explicitly lists 'research, ML infrastructure, legal, and business development' as cross-functional partners the RPM must coordinate. This tests breadth of stakeholder management.
culture | Thinking Machines describes itself as scientists, engineers, and builders. How do you think about your own identity in that triad, and how do you earn credibility with researchers who may be skeptical of PMs? | The JD emphasizes 'thriving in deeply technical discussions' and 'understanding the rhythm of research'; the company's founding team is heavily researcher-identity. Cultural fit question probing how candidate positions themselves.
domain | Your NeurIPS 2014 paper used neural networks for protein secondary structure prediction. How has your mental model of what makes a good research contribution evolved from 2014 to today, especially given the shift to large-scale pre-training and RLHF? | Preferred qualifications call out past publications in AI as a differentiator; this question tests intellectual honesty, research taste, and ability to contextualize their own work within frontier AI trends.
behavioral | At Intuit you scaled ICE engagements 275% YoY to 675M+ and drove a platform that reduced onboarding from weeks to minutes. How would you apply that platform-thinking discipline to a research infrastructure context where the 'users' are ML researchers rather than software developers? | JD asks RPMs to 'support integration of new technologies into production systems'; candidate's Intuit platform experience is their strongest enterprise PM signal and needs to be bridged to a research-lab context.
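
Practice sketch for the coding question above. This is one possible answer, not a reference solution, and it rests on assumptions the question leaves open: each row is one run of one model version, sorting by timestamp gives the version order, and a "regression" means a score drop from the previous version that exceeds one sample standard deviation of that eval type's scores. The function name variance_and_regressions and the sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import stdev, variance


def variance_and_regressions(runs):
    """Rank eval types by score variance and flag large version-to-version drops.

    Assumptions (not given in the question): one row per model version per
    eval type, and ISO-formatted timestamps so string order is time order.
    """
    by_eval = defaultdict(list)
    for run in runs:
        by_eval[run["eval_type"]].append(run)

    report = []
    for eval_type, rows in by_eval.items():
        rows.sort(key=lambda r: r["timestamp"])
        scores = [r["score"] for r in rows]
        if len(scores) < 2:
            continue  # sample variance is undefined for a single run
        sd = stdev(scores)
        # A regression here is a drop from the previous version larger than 1 SD.
        regressions = [
            (prev["model_id"], curr["model_id"])
            for prev, curr in zip(rows, rows[1:])
            if prev["score"] - curr["score"] > sd
        ]
        report.append(
            {
                "eval_type": eval_type,
                "variance": variance(scores),
                "regressions": regressions,
            }
        )

    # Highest-variance eval types first.
    report.sort(key=lambda r: r["variance"], reverse=True)
    return report


# Hypothetical sample rows using the field names from the question.
runs = [
    {"model_id": "v1", "eval_type": "math", "score": 0.60, "timestamp": "2026-01-01", "GPU_hours": 4.0},
    {"model_id": "v2", "eval_type": "math", "score": 0.80, "timestamp": "2026-02-01", "GPU_hours": 4.5},
    {"model_id": "v3", "eval_type": "math", "score": 0.50, "timestamp": "2026-03-01", "GPU_hours": 5.0},
    {"model_id": "v1", "eval_type": "code", "score": 0.70, "timestamp": "2026-01-01", "GPU_hours": 2.0},
    {"model_id": "v2", "eval_type": "code", "score": 0.71, "timestamp": "2026-02-01", "GPU_hours": 2.0},
    {"model_id": "v3", "eval_type": "code", "score": 0.72, "timestamp": "2026-03-01", "GPU_hours": 2.1},
]

report = variance_and_regressions(runs)
for row in report:
    print(row["eval_type"], round(row["variance"], 4), row["regressions"])
```

In an interview it is worth stating the regression definition out loud before coding, since "beyond one standard deviation" could also be read as a deviation from the mean rather than a drop from the prior version. GPU_hours is carried in the rows but unused here; a follow-up could normalize variance by compute spent.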

Talking points