Senior AI Compute Infrastructure Engineer
Company and position
Kraken is a mission-focused company rooted in crypto values, aiming to accelerate the global adoption of crypto for financial freedom and inclusion. It is a fully remote company with employees in over 70 countries, developing premium crypto products for traders, institutions, and newcomers. The AI Compute and Infrastructure team powers model training, inference, evaluation, and experimentation across the exchange, owning the infrastructure layer for AI workloads with control, speed, reliability, and cost discipline.
Requirements
- 5+ years of infrastructure engineering experience, including GPU compute, ML infrastructure, distributed systems, high-performance computing, or large-scale production platforms.
- Hands-on experience operating GPU clusters or accelerator-backed infrastructure in production or production-like environments, including scheduling, orchestration, utilization monitoring, and cost optimization.
- Strong systems engineering fundamentals across Linux, networking, storage, containers, Kubernetes, distributed runtimes, and production debugging.
- Experience with ML serving frameworks such as vLLM, Triton Inference Server, TensorRT, TorchServe, KServe, Ray Serve, or equivalent.
- Proficiency in Python for infrastructure automation, tooling, debugging, integration, and operational workflows.
- Practical understanding of performance tradeoffs across batching, concurrency, memory usage, GPU utilization, model size, latency, throughput, availability, and cost.
- Track record of optimizing compute costs while maintaining clear performance, reliability, and availability expectations.
- Experience building observable systems with metrics, logs, traces, dashboards, alerts, and incident workflows.
- Comfortable working in high-stakes, always-on environments requiring uptime, throughput, correctness, and operational discipline.
- Clear communication skills to translate infrastructure tradeoffs for researchers, product teams, platform engineers, security stakeholders, and engineering leadership.
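As a minimal illustration of the batching tradeoff mentioned above (larger batches improve GPU throughput but increase per-request latency), consider a batched forward pass with a fixed launch overhead plus a per-item cost. All numbers below are hypothetical, chosen only to make the tradeoff visible; they are not benchmarks from this role or any specific hardware:

```python
# Hypothetical sketch of the batching latency/throughput tradeoff.
# Model: one batched forward pass costs a fixed overhead plus a
# per-item cost. Numbers are illustrative assumptions, not measurements.

def batch_latency_ms(batch_size: int,
                     overhead_ms: float = 10.0,
                     per_item_ms: float = 2.0) -> float:
    """Wall-clock time for one batched forward pass, in milliseconds."""
    return overhead_ms + per_item_ms * batch_size

def throughput_rps(batch_size: int) -> float:
    """Requests completed per second at a given batch size."""
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)

for bs in (1, 8, 32):
    print(f"batch={bs:>2}  latency={batch_latency_ms(bs):5.1f} ms  "
          f"throughput={throughput_rps(bs):6.1f} req/s")
```

Under these assumptions, moving from batch size 1 to 32 multiplies throughput roughly fivefold while raising per-request latency from 12 ms to 74 ms — the kind of tradeoff an inference operator tunes against latency SLOs, memory headroom, and cost.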
Nice to haves:
- Experience at a frontier AI lab, hyperscaler, high-frequency trading firm, research platform, or high-scale ML organization.
- Familiarity with custom silicon or specialized accelerators such as TPUs, AWS Trainium, Gaudi, or similar platforms.
- Background in capacity planning, procurement input, reserved capacity strategy, cloud accelerator economics, or GPU fleet cost management.
- Experience with distributed training frameworks such as DeepSpeed, Megatron-LM, FSDP, Ray, or equivalent.
- Experience debugging CUDA, NCCL, kernel, driver, runtime, memory, networking, or low-level performance issues.
- Experience with Rust, C++, Go, CUDA, or other systems languages used for performance-critical infrastructure.
- Experience in crypto, financial services, trading infrastructure, or security-sensitive production infrastructure.
Responsibilities
The opportunity:
- Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation, including drivers, runtimes, kernels, device plugins, node configuration, scheduling primitives, and workload isolation.
- Design infrastructure enabling Kraken teams to run models locally on GPUs to reduce dependency on external providers and contain compute costs.
- Build and improve scheduling, orchestration, placement, quota management, and utilization systems across heterogeneous accelerator environments.
- Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost using frameworks such as vLLM, Triton Inference Server, TensorRT, or equivalent.
- Partner with ML engineers and researchers to remove bottlenecks in training, evaluation, batch inference, online inference, deployment, and production debugging workflows.
- Build observability for GPU utilization, memory pressure, queue depth, saturation, token throughput, request latency, failed workloads, capacity pressure, and spend.
- Drive reliability, incident response, alerting, runbooks, and post-incident improvements for always-on AI compute infrastructure.
- Evaluate and integrate new hardware, cloud instance families, specialized accelerators, runtimes, schedulers, and serving frameworks as the AI infrastructure landscape evolves.
- Build tooling that makes GPU usage visible, accountable, and easier for internal teams to consume without needing to become infrastructure experts.
- Contribute to long-term architecture decisions balancing performance, cost efficiency, scalability, operational simplicity, and production safety.
Other information
Applications are accepted on an ongoing basis unless a specific deadline is stated. Applicants may redact or remove personal information such as age or dates of attendance on resumes. The company considers qualified applicants with criminal histories consistent with the San Francisco Fair Chance Ordinance. Kraken is an equal opportunity employer and does not tolerate discrimination or harassment based on protected characteristics as outlined by law.