Production Systems Engineer
Company and position
Our client is one of the world’s leading financial organizations, processing millions of transactions monthly and ensuring the integrity of global financial operations. We are on a mission to strengthen resilience and enable rapid recovery from large-scale technology failures.
As a Production Systems Engineer – Mass Recovery, you will identify real service dependencies, model complex failure scenarios, and support recovery during major technology incidents. This role is not about theoretical planning – it’s about creating actionable solutions grounded in system behavior under stress.
You’ll be part of a multi-disciplinary Service Management team responsible for resilience, recovery, and major incident response. Together, we ensure the smooth operation of critical services across application, platform, and infrastructure layers in a highly complex technology environment.
Requirements
- Strong experience in at least one of the following: Production Engineering, Site Reliability Engineering (SRE), or Infrastructure/Platform Engineering.
- Solid understanding of virtualization platforms (ESX), cloud providers, storage systems, big data technologies, and networking fundamentals.
- Experience with CMDB platforms (ServiceNow) and their constraints, and with observability tools (AppDynamics, Splunk).
- Ability to correlate and analyze data from multiple systems to identify patterns and risks.
- Experience working under pressure during major incident response.
- Excellent communication skills in English, both written and verbal.
Nice to have:
- Previous experience in financial technology disciplines or global banking environments.
- Exposure to Disaster Recovery or Mass Recovery planning/execution.
- Skills in data extraction and manipulation.
- Experience with large-scale distributed systems.
- Familiarity with Jira and Confluence.
Responsibilities
- Develop and maintain accurate service dependency models across applications, platforms, and infrastructure layers.
- Identify and document shared failure domains (e.g., ESX, Storage, Networks).
- Define and simulate blast radius scenarios for critical systems.
- Correlate service failures to uncover common root causes and dependencies.
- Deliver data-driven insights for incident management and recovery leadership.
- Validate and challenge existing data sources (CMDB, ServiceNow) to ensure accuracy based on real-world system behavior.
- Identify gaps in resilience capabilities, including unrealistic Recovery Time Objectives (RTOs) and missing recovery options.
- Work across diverse tooling (ServiceNow, observability platforms, infrastructure systems) to extract and combine relevant data.
- Collaborate on designing fault-tolerant, recoverable architectures.
What we offer
- Flexible cooperation model – choose the form that suits you best (B2B, employment contract, etc.)
- Hybrid work setup – 8 days a month from the office in Kraków
- Collaborative team culture – work alongside experienced professionals eager to share knowledge
- Continuous development – access to training platforms and growth opportunities
- Comprehensive benefits – including Interpolska Health Care, Multisport card, Warta Insurance, and more
- High-quality equipment – laptop and essential software provided
- Sharing the costs of sports activities
- Private medical care
- Sharing the costs of professional training & courses
- Life insurance
Mindbox S.A.