(Senior) Network Reliability Engineer (NRE) - AI GPU Clusters
200 - 250 PLN/hour · B2B
Senior · Full-time · B2B
#365150 · Added today
Source: nofluffjobs.com
Tech Stack / Keywords
AI, Security, Communication skills, Go, Python, Bash, Linux, Ubuntu, Networking, TCP, DNS, BGP, HPS, GPU, Prometheus, Grafana, Ansible, Salt, AWX, Relational database, GitLab, VLAN, PXE
Company and position
Margo works on AI GPU cluster infrastructure for a client based in California, USA. The role involves supporting and scaling production environments for that AI infrastructure.
Requirements
- Experience with Go or Python
- Strong scripting skills (Bash, Python)
- Hands-on experience with Linux systems (Ubuntu/Debian)
- Hands-on experience with GPU and HPC infrastructure (preferred)
- Knowledge of networking (LAN/VLAN, TCP/IP, DNS, BGP, load-balancing, IPv6)
- Experience with boot systems like PXE
- Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic)
- Comfortable with Infrastructure-as-Code tools (Ansible, Salt, AWX)
- Experience managing relational databases (MariaDB)
- Understanding of CI/CD pipelines (GitLab)
- Comfortable with English (written and spoken)
- Proactive and solution-oriented mindset
- Passion for automation and continuous improvement
- Strong collaboration and communication skills
- Ability to work independently and in a team
- Willingness to mentor and share knowledge
Responsibilities
- Build large-scale AI infrastructure, including monitoring, diagnosis, and remediation of production incidents
- Troubleshoot high-impact production issues in collaboration with other engineering teams
- Participate in an on-call rotation to handle incidents and ensure service continuity
- Implement and maintain observability solutions to monitor AI infrastructure and application health
- Contribute to AI infrastructure lifecycle management across different environments and countries
- Promote and apply best practices in stability, resiliency, scalability, and security
- Maintain clear technical documentation for tools and procedures
- Contribute to system and tool evolution based on production feedback
- Collaborate closely with development teams to ensure infrastructure readiness
- Participate in team rituals and knowledge-sharing initiatives
Offer
- International environment
- Sport subscription
- Private healthcare
- Flat structure
- International projects
- Free coffee
- Canteen
- Bike parking
- Playroom
- Shower
- Free snacks
- Modern office
- No dress code
Other information
The client is based in California, USA. Working hours start around 18:00 CEST. This is a long-term project lasting at least one year.