Principal AI Engineer
Company and position
We are partnering with a US-based health-tech company on the takeover of a production AI-powered mobile coaching platform. The platform is built around a Python AI core (atlas-ai) which runs a FastAPI chat surface with intent routing and tool-using agents, a LangGraph-based agent framework with multi-agent orchestration, a Celery + Redis task queue for asynchronous agent flows, MongoDB for fitness-plan storage, Redis for conversation state, and Postgres for LangGraph checkpoints. The product includes several agent personas such as onboarding, chat, plan creation, in-workout smart adjust, plan smart adjust, and habit formation, with direct OpenAI integration for the LLM layer.
Requirements
- 7+ years of production Python experience at senior level or above
- Deep LangGraph experience including state graphs, checkpoints, interrupts, multi-agent supervision, subgraphs
- Strong LangChain ecosystem knowledge (chains, tools, memory, output parsers, callbacks)
- Production FastAPI experience including streaming responses, dependency injection, middleware, async patterns
- Celery + Redis broker in production with task ordering, retries, idempotency, priority queues, dead-letter handling
- Concurrency in Python: asyncio (gather, structured concurrency, cancellation), threading boundaries, mixing sync and async code safely
- Multi-datastore operations (MongoDB, Redis, Postgres) within a single service, including transaction boundaries across stores
- OpenAI API at scale including rate limits, retries with exponential backoff, fallback model routing, streaming, tool/function calling
- Agent design patterns: ReAct, plan-and-execute, supervisor patterns, tool-use loops, multi-turn state, interrupt resumption
- Prompt engineering with evaluation, A/B testing, version control of prompts, regression detection
- Token cost optimisation: prompt caching, model tiering, context window trimming, summary memory
- Production LLM observability: per-route token spend, prompt-level tracing, drift monitoring
- Testing discipline: pytest (including pytest-asyncio), property-based testing, snapshot tests for prompts, eval-based tests for agents
- Pydantic v2 fluency, type-hinted code throughout
Nice to have:
- RAG production experience (vector stores: Pinecone, Qdrant, pgvector)
- Production incident command for LLM-powered systems
- ML engineering background (model serving, feature engineering)
- Anthropic / Claude API experience in addition to OpenAI
- Data pipeline experience (Airflow, Dagster, Prefect)
- Domain knowledge in fitness / health / wearables
Responsibilities
First 90 days:
- Audit atlas-ai: agent flows, LangGraph state machines, Celery topology, datastore usage, OpenAI integration patterns
- Produce a written assessment of operational risk including failure modes, race conditions, retry semantics, idempotency, checkpoint integrity
- Quantify token cost per agent flow and per user session
- Identify highest-risk subsystems and propose stabilization plans
- Build or harden an evaluation harness for agent flows including golden cases, regression suites, hallucination/safety tests
- Lead knowledge-transfer sessions with the client's AI team
Ongoing responsibilities:
- Set the technical direction for the AI core
- Lead design for new agent flows and major changes to existing ones
- Own the production health of the AI surface with platform/SRE support
- Hire and mentor the AI squad (~10 engineers at full scale)
- Represent the AI core in cross-team architecture conversations with the client
Offer
- 100% remote work
- B2B engagement
- Rate up to PLN 180 per hour
- Start in July
apreel