role

anthropic / Anthropic Fellows Program, ML Systems & Performance

model

anthropic/claude-sonnet-4.6

created

2026-05-21T22:42

Interviewer

The interviewer profile provided is a generic Anthropic interviewer rather than a named individual with a specific LinkedIn profile. Based on the ML Systems & Performance Fellows workstream context, the interviewer is likely a research engineer or ML systems researcher at Anthropic with deep expertise in training infrastructure, accelerator performance, distributed systems, and/or post-training pipelines. Given the workstream's focus on CPU simulators, accelerator backends, synthetic data pipelines, and on-demand infrastructure, the interviewer likely cares most about hands-on engineering depth, systems-level thinking, and the ability to translate research ideas into rigorous implementations. The interview loop shape (AI engineering screen + 4 onsites + research presentation) suggests this interviewer may be conducting a technical deep-dive or research discussion session. Expect probing on ML systems architecture, training/inference performance, and the candidate's ability to contribute to empirical research with a public output.

My profile through their lens

From an Anthropic ML Systems & Performance lens, Felix stands out primarily for his RL post-training workbench — a rare, hands-on artifact that benchmarks GRPO/DPO across TRL, VeRL, OpenRLHF, and NeMo RL with GPU Docker passthrough, which directly maps to the workstream's interest in training infrastructure and framework evaluation. His aeval platform (FastAPI, TimescaleDB, Redis, Ollama) demonstrates systems engineering rigor with statistical evaluation methodology, a signal that he can build research-quality tooling. The NeurIPS 2014 publication and the BRAIN protein structure platform (C++ BPTT in 2004, rewritten to 8B-param PyTorch in 2026) establish a long arc of ML research credibility that Anthropic researchers will respect. However, Felix's background is predominantly product management and applied engineering rather than core ML systems research (e.g., kernel optimization, custom CUDA, accelerator backends), which is the deepest technical layer this workstream targets. The interviewer will probe whether his systems work goes deep enough to contribute to accelerator simulation or HPC-level infrastructure projects.

Questions they may ask (22)

category	question	why	how to prepare
resume_deep_dive	Walk me through the architecture of your RL post-training workbench. Specifically, how did you handle GPU passthrough in Docker for the Arena benchmarking component, and what were the most surprising performance differences you observed across TRL, VeRL, OpenRLHF, and NeMo RL?	The RL workbench is the single most relevant artifact for this workstream. The interviewer will want to verify depth — not just that Felix built it, but that he understands the systems-level tradeoffs (memory bandwidth, throughput, convergence stability) across frameworks. GPU Docker passthrough is a specific engineering challenge that signals real HPC experience.	Prepare a concise architecture walkthrough with concrete benchmark numbers (throughput tokens/sec, memory footprint, convergence curves) for at least 2-3 algorithm/framework pairs. Be ready to explain why specific frameworks outperformed others on specific hardware configurations and what that implies for production training pipelines.
resume_deep_dive	Your aeval platform uses bootstrap confidence intervals, Welch's t-test, and Cohen's d for statistical rigor in model evaluation. How did you decide on this specific statistical stack, and how does your saturation detection mechanism work in practice?	Anthropic's ML Systems workstream cares about evaluation rigor, which is directly relevant to their model evaluation infrastructure. The interviewer will probe whether Felix's statistical choices were principled or cargo-culted, and whether saturation detection is a real algorithmic contribution or a heuristic threshold.	Be able to explain the statistical rationale for each method (why Welch's test rather than Student's t when variances or sample sizes differ, why report Cohen's d as an effect size alongside p-values rather than p-values alone) and describe the saturation detection algorithm concretely: what signal triggers it, what the false positive rate is, and how it integrates with the CI/CD regression detection pipeline.
resume_deep_dive	Your BRAIN platform spans from a hand-coded C++ BPTT network in 2004 to an 8B-parameter PyTorch system in 2026 — a 19-million-fold parameter scale increase. What architectural decisions in the 2026 rewrite were most constrained by compute budget, and how did you use Optuna HPO to navigate the search space efficiently?	This question probes whether Felix's ML engineering depth is genuine across the full stack — from low-level gradient computation to modern HPO. The NeurIPS publication gives him credibility, but the interviewer will want to verify that the 2026 rewrite reflects current best practices in ML systems, not just a portfolio project.	Prepare to discuss the HPO search space definition (which hyperparameters, what ranges, what sampler — TPE vs. CMA-ES), the compute budget constraints on Apple Silicon vs. cloud, and the tradeoffs between the 5 architectures (feedforward, GRU, Transformer, ESM-2, multi-task) for the protein structure prediction task specifically.
resume_deep_dive	At Intuit you scaled ICE from 6K to 50K TPS via rSocket migration supporting ~1.5M concurrent connections with sub-25ms TP99. What were the bottlenecks that made rSocket the right choice over gRPC or HTTP/2, and how did you instrument and validate the sub-25ms TP99 target in production?	The ML Systems workstream explicitly calls out comfort with large-scale distributed systems and high-performance computing. This question tests whether Felix's platform scaling experience translates to the kind of systems thinking needed for accelerator workload infrastructure at Anthropic.	Prepare a crisp explanation of rSocket's reactive streams backpressure model vs. gRPC's flow control, the specific bottleneck (likely head-of-line blocking or connection overhead), and the observability stack used to validate TP99 — ideally with a story about a production incident that revealed a latency regression.
technical_domain	The workstream mentions building a CPU simulator for accelerator workloads. If you were designing a cycle-accurate simulator for a TPU-like matrix multiply unit, what abstraction layers would you model, and how would you validate the simulator's fidelity against real hardware traces?	This is the most technically demanding project type listed in the workstream. The interviewer needs to assess whether Felix has the computer architecture depth to contribute to this project, which requires knowledge of systolic arrays, memory hierarchy simulation, and trace-driven validation — areas not explicitly on his resume.	Study the basics of systolic array architecture (Google TPU paper), roofline model analysis, and trace-driven simulation methodology. Even if you haven't built a CPU simulator, be able to reason about the abstraction layers (ISA, microarchitecture, memory hierarchy) and propose a validation approach using hardware performance counters.
technical_domain	You implemented 12 RL algorithms including PPO, GRPO, DAPO, and REINFORCE++ in your workbench. From a systems performance perspective, what are the key differences in memory access patterns and GPU utilization between on-policy algorithms like PPO and offline/hybrid algorithms like DPO, and how did those differences manifest in your benchmarks?	This is the intersection of Felix's strongest artifact (RL workbench) and the workstream's ML systems focus. The interviewer wants to see whether Felix thinks about RL algorithms as systems problems — memory bandwidth, batch construction, gradient accumulation — not just as mathematical formulations.	Prepare a comparison of PPO vs. DPO from a systems perspective: PPO's rollout buffer and value network forward pass vs. DPO's paired preference batch construction, the implications for GPU memory fragmentation, and how GRPO's group-relative reward computation changes the batch structure. Reference actual throughput numbers from your benchmarks.
technical_domain	Your AutoEval system reduces robot model evaluation cycles from 72 hours to ~4 minutes using screen capture + multimodal AI. What are the failure modes of using Claude/GPT-4V as an evaluator for spatial reasoning tasks like grasp pose assessment, and how would you build a calibration dataset to measure evaluator reliability?	Anthropic cares deeply about evaluation methodology — this is core to their safety research. The interviewer will probe whether Felix understands the limitations of LLM-as-judge evaluation, particularly for spatial/geometric tasks where vision models have known weaknesses.	Prepare a concrete discussion of LLM evaluator failure modes (spatial hallucination, reference frame confusion, confidence miscalibration) and a calibration methodology: human-labeled ground truth set, inter-rater reliability metrics (Cohen's kappa), and how you'd detect systematic bias in the evaluator across object categories or lighting conditions.
technical_domain	The workstream mentions adding backends for different accelerators on an open-source project. If you were adding an AMD ROCm backend to a PyTorch-based training framework that currently only supports CUDA, what would be your approach to handling the HIP/CUDA API surface differences, and where would you expect the most significant performance gaps?	This directly maps to a listed project type. The interviewer is testing whether Felix has the low-level systems knowledge to contribute to accelerator backend work, which requires understanding of the CUDA/HIP programming model, memory management, and kernel optimization.	Study the HIP porting guide and the key differences between CUDA and ROCm (warp vs. wavefront width: 32 lanes on NVIDIA GPUs, 64 on AMD CDNA parts such as MI300X; memory model; atomics). Be able to discuss the hipify toolchain, the differences in memory bandwidth and kernel launch overhead between MI300X and H100, and how you'd use rocprof for performance profiling.
gap_transition	Your most recent full-time role was Staff PM at Intuit, and your current work is founder/CEO of two startups. The Fellows program is explicitly a research role with an expectation of producing a public output like a paper. How do you think about the transition from product and engineering leadership to being an individual contributor researcher, and what does your research process look like day-to-day?	This is the most significant gap to address. Anthropic Fellows are expected to do empirical research and publish. Felix's background is product management and applied engineering — the interviewer will probe whether he can operate in a research mode (hypothesis-driven, rigorous evaluation, written communication of findings) rather than a shipping mode.	Prepare a concrete answer about your research process: how you formulated the research question for your RL workbench, how you designed the evaluation methodology, and what a paper submission from that work would look like. Reference your NeurIPS 2014 experience as evidence of research mode operation, and be honest about what you'd need to develop.
gap_transition	The ML Systems & Performance workstream mentions comfort with high-performance computing 'e.g. in trading.' Your background includes financial AI products but not HPC trading systems. How would you characterize the gap between your distributed systems experience (rSocket, 50K TPS) and the kind of low-latency, bare-metal HPC work common in trading infrastructure, and how would you close it during the fellowship?	The JD's parenthetical 'e.g. in trading' signals that the ideal candidate may have FPGA, kernel bypass networking, or RDMA experience. Felix's systems work is at the application layer (microservices, SDKs) rather than the hardware/OS layer. The interviewer will probe this gap directly.	Be honest about the gap while reframing your strengths: your rSocket work demonstrates understanding of backpressure and connection management at scale, and your RL workbench demonstrates GPU-level performance awareness. Propose a concrete learning plan for the fellowship (e.g., studying RDMA/InfiniBand for distributed training, profiling with Nsight Systems).
gap_transition	Your NeurIPS paper is from 2014, over a decade ago. The field has changed dramatically. How have you stayed current with ML research, and can you point to a specific paper from the last 12 months that directly influenced a technical decision in one of your recent projects?	Anthropic researchers will scrutinize the recency of Felix's research engagement. The 2014 NeurIPS paper is a strong signal, but the interviewer needs to verify that Felix is actively reading and applying current research, not just citing a decade-old credential.	Prepare 2-3 specific recent papers (2024-2026) that influenced your RL workbench or aeval design, e.g., the paper that introduced GRPO (DeepSeekMath; DeepSeek-R1 later popularized the algorithm), DAPO, or recent work on reward hacking. Be able to explain what you took from the paper, what you implemented, and what you found when you ran it.
behavioral_situational	Tell me about a time you had to make a significant technical architecture decision with incomplete information and tight time constraints. What was the decision, what data did you use to make it, and what would you do differently in retrospect?	The Fellows program is 4 months — fast-paced with a hard deadline for a public output. The interviewer wants to assess Felix's decision-making under uncertainty, which is critical for research projects where the right approach is often unclear.	Use the rSocket migration at Intuit or the framework selection for your RL workbench as the story. Be specific about what data you had, what you were missing, how you made the call, and — critically — be honest about what you'd do differently. Interviewers at research labs value intellectual honesty over polished narratives.
behavioral_situational	Describe a situation where your initial technical hypothesis was wrong and the data forced you to change direction. How did you handle the pivot, and what did you learn about your own research process?	Empirical research at Anthropic is explicitly framed as a science — hypothesis-driven, falsifiable. The interviewer wants evidence that Felix can handle negative results and iterate, not just ship features. This is a key differentiator between product mode and research mode.	Prepare a specific story from your RL workbench or aeval work where a benchmark result surprised you — e.g., a framework that underperformed expectations, or an evaluation metric that didn't correlate with human judgment. Frame the pivot as a learning, not a failure.
behavioral_situational	The Fellows program involves direct mentorship from Anthropic researchers. Describe a time you worked with a technical mentor or senior researcher who had significantly different views on the right approach. How did you navigate the disagreement, and what was the outcome?	The program structure requires Felix to work closely with mentors (Alwin Peng, Zygi Straznickas) who will have strong opinions about research direction. The interviewer wants to assess whether Felix can be a collaborative, intellectually humble research partner while also advocating for his own technical perspective.	Use a story from Intuit (working with engineering leadership on the rSocket migration or language assessment) or from your academic research. Emphasize how you used data to resolve the disagreement, not authority or persistence.
behavioral_situational	You're running two startups simultaneously while applying for a 4-month full-time fellowship. How will you manage the transition, and what happens to Streamio AI and Fintellect AI during the fellowship period?	This is a practical concern the interviewer will raise. The fellowship requires full-time commitment, and running two startups simultaneously signals potential distraction. The interviewer needs to be confident Felix will be fully present.	Be direct and specific: explain what 'founder mode' looks like for each company right now (are they revenue-generating? do they have co-founders or employees?), and articulate a concrete plan for maintaining them at low-maintenance mode during the fellowship. Avoid vague answers — specificity signals seriousness.
role_specific_scenario	One of the listed project types is 'building on-demand infrastructure for other infrastructure-heavy fellows projects.' If you were designing a self-service compute provisioning system for ML researchers who need GPU clusters for training runs of varying sizes (from single-GPU experiments to 64-GPU distributed runs), what would the architecture look like, and what are the top 3 failure modes you'd design against?	This directly maps to a listed project type and tests Felix's ability to think like an infrastructure engineer serving researchers — which combines his DevPortal/ICE platform experience with ML systems knowledge. The interviewer wants to see whether he can design for researcher ergonomics, not just enterprise developer ergonomics.	Design a system with: job queue (Redis/Ray), cluster autoscaling (Kubernetes + GPU node pools or Slurm), cost controls, and a simple API/CLI. The 3 failure modes to address: job preemption without checkpointing, GPU memory fragmentation across heterogeneous workloads, and cost runaway from zombie jobs. Reference your ICE self-service platform experience as a design analog.
role_specific_scenario	Suppose your fellowship project involves benchmarking a new RL post-training algorithm across multiple open-source frameworks (similar to your existing workbench). How would you design the experiment to produce results rigorous enough for a paper submission, and what would the paper's contribution claim be?	The program's explicit goal is a public output (paper). The interviewer wants to assess whether Felix can frame engineering work as a research contribution — identifying a novel finding, controlling for confounds, and making a falsifiable claim. This is the core skill gap between PM/engineer and researcher.	Prepare a concrete paper framing for your existing RL workbench work: what is the research question (e.g., 'Does framework choice significantly impact convergence stability for GRPO on reasoning tasks?'), what are the controlled variables, what is the novel finding, and what venue would you target (NeurIPS, ICLR, MLSys). Practice articulating this in 2 minutes.
role_specific_scenario	Anthropic recently acquired Stainless, an SDK tooling company. If you were a fellow contributing to the ML Systems workstream and were asked to evaluate whether Anthropic's internal training infrastructure tooling should be open-sourced (similar to how Stainless's SDK tools are developer-facing), what framework would you use to make that recommendation, and what data would you need?	This connects Felix's strongest PM credential (SDK/DevPortal work at Intuit) to a live Anthropic strategic question. The interviewer is testing whether Felix can apply product thinking to infrastructure decisions — a key skill for the ML Systems workstream where engineering and research strategy intersect.	Frame the analysis around: competitive moat (does open-sourcing help or hurt?), community leverage (does external contribution accelerate development?), safety considerations (does exposing training infrastructure create risk?), and developer ecosystem effects. Reference the Stainless acquisition signal and your ICE platform experience with developer adoption metrics.
motivation_fit	Anthropic's mission is AI safety — making AI reliable, interpretable, and steerable. Your background is primarily in developer platforms, applied AI products, and RL post-training. How does your work connect to AI safety specifically, and what safety-relevant research question would you most want to pursue during the fellowship?	This is the core motivation question for any Anthropic role. The ML Systems workstream is capabilities-adjacent (not pure safety research), but Anthropic still expects fellows to be genuinely motivated by safety. The interviewer will probe whether Felix has a coherent safety worldview or is primarily motivated by the research credential.	Prepare a specific safety-relevant research question that connects to your ML systems background — e.g., 'How do different RL post-training algorithms affect the reliability of refusal behavior under distribution shift?' or 'Can we build evaluation infrastructure that detects reward hacking before it manifests in deployment?' Connect it to your existing workbench work.
motivation_fit	The fellowship pays a weekly stipend and is 4 months with no guarantee of a full-time offer. Given your background as a Staff PM at Intuit and a founder of two companies, why is this the right next step for you right now, and what does success look like for you at the end of 4 months?	The interviewer will probe whether Felix is using the fellowship as a stepping stone to a full-time Anthropic role (legitimate) or as a resume credential while continuing to run his startups (a red flag for commitment). The compensation is significantly below what a Staff PM at Intuit would earn, which signals either strong mission alignment or a strategic calculation.	Be honest and specific: articulate why a 4-month research immersion at Anthropic is worth the opportunity cost, what you want to learn that you can't learn as a founder, and what a successful fellowship looks like (paper submitted, specific technical skills developed, potential full-time role). Avoid generic 'I want to work on AI safety' answers.
unique_to_this_interviewer	Given that this is the ML Systems & Performance workstream and the listed mentors include Alwin Peng and Zygi Straznickas, have you reviewed their published work, and is there a specific research direction from their prior work that you'd want to extend or challenge during the fellowship?	The JD explicitly notes that candidates may research mentors' prior work. An interviewer from this workstream will want to see that Felix has done his homework and can engage at a peer level with the research agenda, not just express general enthusiasm for Anthropic.	Research Alwin Peng's and Zygi Straznickas's published work (Google Scholar, Anthropic blog, arXiv). Identify one paper or project from each that connects to your RL workbench or aeval work. Prepare a 2-3 sentence synthesis of what you'd want to explore further and why your existing infrastructure work positions you to contribute.
unique_to_this_interviewer	The workstream mentions 'building complex synthetic data or environment pipelines' as a potential project. Your AutoEval system repurposed a streaming pipeline for robot model evaluation. How would you extend that architecture to generate synthetic training data for fine-tuning a vision-language model, and what are the quality control challenges specific to synthetic data at scale?	Synthetic data pipelines are a live research area at Anthropic (Constitutional AI, synthetic preference data). This question connects Felix's most novel engineering contribution (AutoEval's zero-integration screen capture approach) to a workstream project type, testing whether he can generalize the architecture to a new research problem.	Extend your AutoEval architecture: screen capture → multimodal AI scoring → structured output becomes screen capture → multimodal AI generation → structured synthetic example. Discuss quality control challenges: distribution shift between synthetic and real data, diversity collapse, reward hacking in the generation pipeline, and how you'd use your aeval statistical framework to measure synthetic data quality.
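
For the aeval question above, it may help to rehearse the statistical stack with a small worked example. The following is a minimal sketch, not the aeval implementation: it uses only the Python standard library, the function names are hypothetical, and the scores are made-up numbers. It computes Welch's t statistic with Welch-Satterthwaite degrees of freedom, Cohen's d with a pooled standard deviation, and a percentile bootstrap confidence interval for the difference in means.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error, no equal-variance assumption
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Effect size: mean difference divided by the pooled sample std dev."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b)."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(a, k=len(a))  # sample with replacement
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-prompt scores for two model versions (illustrative only).
model_a = [0.72, 0.68, 0.75, 0.71, 0.69, 0.74, 0.70, 0.73]
model_b = [0.65, 0.66, 0.70, 0.64, 0.67, 0.69, 0.63, 0.68]

t_stat, dof = welch_t(model_a, model_b)
d = cohens_d(model_a, model_b)
ci_low, ci_high = bootstrap_ci(model_a, model_b)
print(f"Welch t={t_stat:.2f}, df={dof:.1f}, d={d:.2f}, CI=({ci_low:.3f}, {ci_high:.3f})")
```

The three numbers answer different questions, which is a useful way to frame the rationale in an interview: the t statistic asks whether the difference is distinguishable from noise, Cohen's d asks how large it is relative to the spread, and the bootstrap interval gives a range for the difference without assuming normality.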

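For the PPO vs. DPO vs. GRPO question in the table above, the batch-structure point can be made concrete. As described in the DeepSeekMath paper, GRPO scores each sampled completion against the other completions drawn for the same prompt, so no value network is needed. The sketch below is illustrative only: the function name, the prompt keys, and the reward values are hypothetical, and the use of the population standard deviation with a small epsilon is an assumption (implementations differ on this detail).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std dev.

    Assumption: population std dev plus a small epsilon to avoid
    division by zero when every completion gets the same reward.
    """
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]


# Hypothetical rewards: 4 sampled completions per prompt (group size G = 4).
groups = {
    "prompt_0": [1.0, 0.0, 0.0, 1.0],
    "prompt_1": [0.0, 0.0, 0.0, 1.0],
}

advantages = {
    prompt: group_relative_advantages(rewards)
    for prompt, rewards in groups.items()
}

for prompt, adv in advantages.items():
    print(prompt, [round(a, 3) for a in adv])
```

The systems consequence worth rehearsing is that a batch must hold all G completions for a prompt together before any advantage can be computed, whereas PPO carries a value network forward pass and DPO batches fixed preference pairs.
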
Preparation priorities

1. RESEARCH PRESENTATION PREP: The interview loop includes a research presentation. Prepare a 15-20 minute presentation of your RL workbench as a research contribution — frame it with a research question, methodology, results with statistical rigor, and implications. This is the highest-leverage preparation activity.
2. SYSTEMS DEPTH CREDENTIALING: The workstream's hardest technical bar is accelerator-level systems work (CPU simulators, ROCm backends, HPC). Study the roofline model, systolic array architecture (TPU paper), and CUDA/HIP porting basics. You don't need to be an expert, but you need to reason fluently about these topics.
3. SAFETY MOTIVATION NARRATIVE: Prepare a specific, genuine answer to 'how does your work connect to AI safety?' that goes beyond surface-level enthusiasm. Identify a concrete safety-relevant research question you'd pursue, grounded in your RL post-training and evaluation work.
4. MENTOR RESEARCH: Read published work by Alwin Peng and Zygi Straznickas before any interview. Identify one specific connection between their research and your existing projects. This signals peer-level engagement and is a strong differentiator.
5. COMMITMENT CLARITY: Prepare a direct, specific answer to the dual-founder / fellowship commitment question. Know exactly what 'low-maintenance mode' looks like for Streamio AI and Fintellect AI, and be able to state it confidently in under 60 seconds.
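
To practice the roofline reasoning from priority 2, a short calculation can help. This is a minimal sketch under stated assumptions: the peak throughput and memory bandwidth figures are hypothetical round numbers rather than the specifications of any real accelerator, the function names are made up for illustration, and the byte count assumes each matrix is read or written exactly once with no cache reuse modeled.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumption: A and B are each read once and C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: limited by compute or by memory traffic."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical accelerator (not a real part): 100 TFLOP/s peak, 1 TB/s memory.
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12
RIDGE_POINT = PEAK_FLOPS / MEM_BANDWIDTH  # intensity where the two limits meet

for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
    ai = matmul_arithmetic_intensity(*shape)
    bound = attainable_flops(ai, PEAK_FLOPS, MEM_BANDWIDTH)
    regime = "compute-bound" if ai >= RIDGE_POINT else "memory-bound"
    print(f"{shape}: intensity={ai:.2f} FLOP/byte, "
          f"bound={bound / 1e12:.2f} TFLOP/s, {regime}")
```

Under these assumptions the large square matrix multiply sits well above the ridge point, while the single-row case (similar to batch-size-1 decoding) falls far below it. That contrast is the kind of argument the interviewer is likely to want when discussing simulator fidelity or backend performance gaps.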

⚠ Watch-outs

WATCH OUT — PRODUCT MANAGER FRAMING: Felix's resume is written in PM voice ('delivered,' 'led,' 'owned'). In a research engineering interview, this framing will raise doubts about whether he was the technical implementer or the product owner. For every technical claim, be prepared to go one level deeper than the resume bullet — specific code, specific algorithms, specific debugging stories. If asked 'did you write this code yourself?' the answer must be unambiguous.
WATCH OUT — DUAL STARTUP DISTRACTION: Running two active companies while applying for a full-time fellowship is a significant red flag for commitment. If the interviewer asks about this (and they will), a vague answer ('I'll manage it') will kill the candidacy. Prepare a specific, credible plan: are the startups revenue-generating? Do they have co-founders? What is the minimum viable maintenance mode? Be concrete.
WATCH OUT — RESEARCH RECENCY GAP: The NeurIPS paper is 2014. Citing it as primary research evidence without demonstrating current research engagement will signal that Felix is trading on a 12-year-old credential. Proactively reference 2024-2026 papers that influenced your recent work, and frame your RL workbench and aeval platform as current research contributions, not just engineering projects.
WATCH OUT — SYSTEMS DEPTH CEILING: Felix's strongest systems work is at the application/microservices layer (rSocket, GraphQL, SDK tooling). The ML Systems workstream's hardest projects (CPU simulators, accelerator backends) require computer architecture depth that is not evident on the resume. Don't overclaim — instead, demonstrate fast-learning credibility by showing you've already started closing the gap (e.g., 'I've been studying the TPU v4 architecture paper and the HIP porting guide in preparation for this application').