Kraken

Site Reliability Engineer - AI Agents

Salary not disclosed
Senior · Full-time
#366996 · Added today
Source: Kraken

Tech Stack / Keywords

AI, Terraform, Cloud, AWS, CI/CD, LLM, Kubernetes, Security

Company and role

Payward is the parent company behind Kraken, NinjaTrader, Breakout, xStocks, Payward Services, and CF Benchmarks. Kraken, founded in 2011, is one of the world's longest-standing crypto platforms, trusted by over 10 million individuals and institutions globally. The AI Infrastructure team, part of the Data organization, builds, operates, and scales systems powering AI agents in production, ensuring reliability, observability, and scalability of agentic workflows. This team focuses on platform engineering, building APIs, SDKs, and platform capabilities for AI, Data, and Engineering teams to consume agent infrastructure as a service.


Requirements

  • 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or in a similar role supporting production systems
  • Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
  • Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
  • Strong understanding of platform engineering principles including developer experience and API-driven platform design
  • Proficiency with Infrastructure as Code tools, particularly Terraform
  • Experience with containerization and orchestration, especially Kubernetes and Docker
  • Solid understanding of cloud infrastructure, preferably AWS
  • Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
  • Experience designing and operating observability, monitoring, and alerting systems
  • Experience implementing incident response procedures and participating in on-call rotations
  • Strong collaboration skills across data, AI, and engineering teams
  • High ownership mindset in a fast-moving, high-stakes production environment

Nice to have:

  • Experience building or operating infrastructure for agent-based or LLM-powered systems
  • Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI)
  • Background in data infrastructure including Airflow, Kafka, Spark, or data lake tooling
  • Experience with CI/CD pipelines and deployment automation for AI/ML workloads
  • Exposure to evaluation frameworks and model performance monitoring at scale
  • Experience working in fast-moving 0→1 environments or platform-building teams
  • Experience building SDKs, developer tooling, or internal platform products focused on usability and adoption
  • Experience with Cloudflare's cloud platform and product ecosystem including networking, security, performance, and Zero Trust solutions

Responsibilities

  • Design, build, and operate the infrastructure layer supporting AI agent workflows in production
  • Ensure reliability, scalability, and observability of agentic systems across internal and external products
  • Design and develop platform services, APIs, SDKs, and self-service capabilities for engineering teams
  • Manage and maintain compute, orchestration, and serving infrastructure for model inference and agent execution
  • Implement monitoring, alerting, and incident response procedures tailored to AI/ML workloads
  • Utilize Infrastructure as Code tools such as Terraform to provision and manage AWS cloud infrastructure
  • Build and maintain CI/CD pipelines for rapid, reliable deployment of AI services and agent workflows
  • Define and implement guardrails, failure handling, and recovery patterns for agentic and LLM-powered systems
  • Collaborate with AI and Data Engineering teams to harden experimental agent prototypes into production systems
  • Manage containerized workloads using Kubernetes for deployment, scaling, and orchestration of AI services
  • Implement access controls and security best practices across AI infrastructure environments
  • Document architecture, runbooks, and best practices for team knowledge sharing

Additional information

Applications are accepted on an ongoing basis unless a specific deadline is stated. Applicants may redact or remove personal identifying information from resumes. The employer considers qualified applicants with criminal histories consistent with the San Francisco Fair Chance Ordinance. The company is an equal opportunity employer and does not tolerate discrimination or harassment based on protected characteristics as outlined by law.
