Senior HPC Cluster Administrator - Deep Learning Frameworks Infrastructure
221,250 - 383,500 PLN / month · Employment contract (gross)
292,500 - 507,000 PLN / month · Employment contract (gross)
Senior · Full-time · Employment contract
#325155 · Posted 19 days ago · 34
Source: NVIDIA
Tech Stack / Keywords
Networking, Linux, Embedded, Ansible, Terraform, CI/CD, GitLab, Prometheus
Company and position
NVIDIA's Deep Learning Frameworks (DLFW) Infrastructure team manages large-scale GPU compute clusters supporting deep learning training, inference, and high-performance computing workloads, including DGX/HGX platforms and Grace Blackwell systems.
Requirements
- BS/MS in CS, EE, CE, or equivalent hands-on experience
- 5+ years of experience deploying and administering large-scale HPC or ML training clusters
- Deep expertise in Linux systems administration at scale
- Strong scripting and automation skills in Python and/or Bash
- Hands-on experience with Slurm (scheduling, accounting, cgroup configuration)
- Proficiency with configuration management and IaC (Ansible required; Terraform a plus)
- Experience with container technologies (Docker, Apptainer/Singularity, Kubernetes)
- Solid understanding of high-speed networking (InfiniBand, RoCE, RDMA, EFA)
- Experience with distributed/parallel filesystems and storage architecture
- Ability to own problems end-to-end and communicate clearly with engineering and management stakeholders
Nice to have:
- Experience with NVIDIA GPU infrastructure tools (DCGM, nvidia-smi, MIG, NVSwitch diagnostics)
- Familiarity with cluster management platforms (Colossus, Bright Cluster Manager, xCAT, or similar)
- Experience supporting large-scale distributed deep learning workloads (PyTorch, JAX, Megatron)
- Knowledge of BMC/IPMI/Redfish for out-of-band management and hardware lifecycle
- Background in MLOps tooling or ML platform engineering
Responsibilities
- Own the full lifecycle of GPU compute clusters including procurement, provisioning, configuration management, monitoring, and deprecation across heterogeneous Linux environments (DGX, HGX, embedded systems)
- Design and scale storage solutions (NFS, Lustre, WekaFS, or equivalent) with a roadmap for capacity and performance growth
- Lead automation of infrastructure using IaC tools (Ansible, Terraform) and CI/CD pipelines (GitLab)
- Manage and optimize job scheduling via Slurm, including fair-share policies, reservation management, and MIG/GPU partitioning strategies
- Maintain and improve observability stacks (Prometheus, Grafana, DCGM) and drive proactive resolution of hardware and software incidents
- Collaborate with ML engineers and software teams to tune cluster configuration for large-scale distributed training workloads
- Evaluate and introduce new technologies such as networking fabrics (InfiniBand, NVLink, EFA/RDMA), storage tiers, and container runtimes to improve performance and reliability
- Mentor junior engineers and contribute to team-wide engineering standards
Offer
- Remote work from Poland
- Competitive base salary range for Poland: 221,250 PLN - 383,500 PLN for Level 3, and 292,500 PLN - 507,000 PLN for Level 4
Other information
Only candidates located in Poland are considered for this position.