jobsearch v0.0.1

brief / art_8TXD1jmx5jA

role: anthropic / Anthropic Fellows Program, ML Systems & Performance
model: anthropic/claude-sonnet-4.6
created: 2026-05-22T17:50

Company snapshot

Anthropic is an AI safety company founded in 2021 and headquartered in San Francisco. Its core mission is building reliable, interpretable, and steerable AI systems, most visibly through its Claude model family. The company has grown rapidly, raising multi-billion-dollar rounds from Google and others and releasing the Claude 3, 3.5, and 3.7 model families with strong benchmark performance. Anthropic is known for foundational safety research (Constitutional AI, mechanistic interpretability, scaling laws) and is incorporated as a public benefit corporation. The Fellows Program expands its earlier AI Safety Fellows track into capabilities-adjacent workstreams including ML Systems & Performance, RL, and Economics. Recent internal project details beyond public research publications could not be confirmed.

Team stack

Based on the JD and public Anthropic research signals, the stack likely includes: Python (primary language, explicitly required); PyTorch (likely, given the LLM training and fine-tuning focus); JAX (likely used internally for model research); CUDA and accelerator tooling (explicitly mentioned, covering GPU, TPU, and custom accelerator backends); distributed systems infrastructure (large-scale training clusters, likely using NCCL, Ray, or custom frameworks); Docker or other containerization for reproducible research environments (inferred from project examples such as GPU passthrough); open-source model ecosystems (likely Hugging Face and vLLM); FastAPI or similar for serving and evaluation tooling (inferred from workstream project types); and cloud compute at scale (the ~$15k/month compute budget per fellow suggests AWS, GCP, Azure, or internal clusters). The CPU simulator and accelerator backend projects suggest low-level systems work (C/C++ possible, assembly-level understanding valued). Specific internal tooling names are unknown.

Likely questions (10)

area: system_design
question: Walk us through how you would design a CPU simulator for a novel ML accelerator workload: what abstractions would you model, and how would you validate correctness against real hardware?
why: The JD explicitly lists 'Building a CPU simulator for accelerator workloads' as a candidate project; this probes low-level systems thinking and hardware-software co-design intuition.

area: system_design
question: Your RL Workbench benchmarks TRL, VeRL, OpenRLHF, and NeMo RL. If you had to add a new accelerator backend (e.g., AMD ROCm or a custom ASIC) to one of these frameworks, what would your integration strategy be and where would the hardest failure modes be?
why: The JD lists 'Adding backends for different accelerators on an open source project' as a project type; this question directly tests whether the candidate can extend their existing RL Workbench experience to hardware-level integration.

area: domain
question: You implemented 12 RL algorithms including GRPO, DAPO, and REINFORCE++ in your RL Workbench. What were the most surprising throughput or memory efficiency differences you observed across TRL, VeRL, and OpenRLHF, and what systems-level factors drove those differences?
why: The ML Systems & Performance workstream explicitly values experience training/fine-tuning/evaluating LLMs and debugging training processes; this tests depth of empirical observation from the candidate's own work.

area: coding
question: Given a distributed training job where GPU utilization is consistently at 40% despite high batch sizes, walk me through your debugging methodology: what tools, metrics, and hypotheses would you pursue first?
why: The JD emphasizes 'analyzing and debugging model training processes' and high-performance computing; this is a core ML systems performance diagnostic question.

area: domain
question: In your aeval platform you used bootstrap confidence intervals, Welch's t-test, and Cohen's d for statistical rigor in model evaluation. How would you extend that framework to evaluate throughput and convergence benchmarks across RL training runs at scale, and what additional statistical concerns arise?
why: The workstream values engineering rigor alongside research exploration; this bridges the candidate's existing eval work to the performance benchmarking context Anthropic cares about.

area: system_design
question: Describe how you would build on-demand GPU infrastructure for a cohort of fellows running heterogeneous ML experiments, covering provisioning, cost controls, isolation, and observability.
why: The JD explicitly lists 'Building on demand infrastructure for other infrastructure heavy fellows projects' as a candidate project type.

area: behavioral
question: Tell me about a time you had to balance research exploration with operational reliability in a system you built. What tradeoffs did you make and what would you do differently?
why: The JD explicitly calls out the need to 'balance research exploration with engineering rigor and operational reliability' as a unique candidate criterion for this workstream.

area: domain
question: Your AutoEval project reduced robot model evaluation cycles from 72 hours to ~4 minutes using screen capture and multimodal AI. What are the validity threats to that evaluation approach, and how would you design a more rigorous benchmark to complement it?
why: The JD lists 'Building complex synthetic data or environment pipelines' as a project type; this probes the candidate's ability to think critically about eval validity, a core Anthropic research concern.

area: culture
question: Anthropic describes its research as 'big science': a small number of large-scale efforts rather than many small puzzles. Given that you've been a solo founder running multiple parallel projects, how would you adapt your working style to a highly collaborative, single-focus research environment?
why: The JD explicitly contrasts Anthropic's 'big science' model against fragmented research; the candidate's background is heavily solo/founder-mode, so this is a predictable culture-fit probe.

area: domain
question: Your original BRAIN system hand-coded BPTT in C++ in 2004. How does your understanding of backpropagation at that level inform how you think about gradient flow and numerical stability in modern large-scale training runs?
why: The ML Systems & Performance workstream values low-level systems understanding; the candidate's unique C++ BPTT history is a strong differentiator that interviewers will likely probe to assess depth versus breadth.
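
Prep sketch for the aeval question: one way to rehearse the answer is to apply the three named statistics (bootstrap confidence interval, Welch's t-test, Cohen's d) to throughput samples from two training frameworks. The code below is a minimal standard-library sketch, not the aeval implementation. The function names, the percentile bootstrap variant, the pooled-SD form of Cohen's d, and the throughput numbers are all assumptions made up for illustration.

```python
import math
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error, unequal variances
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


# Made-up throughput samples (tokens/s) for two hypothetical frameworks.
framework_a = [1510.0, 1495.0, 1523.0, 1502.0, 1488.0, 1517.0]
framework_b = [1420.0, 1461.0, 1398.0, 1445.0, 1432.0, 1409.0]

ci_a = bootstrap_ci(framework_a)
t_stat, dof = welch_t(framework_a, framework_b)
d = cohens_d(framework_a, framework_b)

print(f"95% bootstrap CI for mean of A: {ci_a[0]:.1f} to {ci_a[1]:.1f}")
print(f"Welch t = {t_stat:.2f}, df = {dof:.1f}, Cohen's d = {d:.2f}")
```

One concern worth raising in the answer: throughput samples taken within a single run are usually autocorrelated, so treating them as independent (as this sketch does) tends to understate the uncertainty. Comparing per-run or per-seed aggregates is one way to address that.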

Talking points