Senior Solutions Architect, HPC and AI
~24,375 - 42,250 PLN/month
Senior · Full-time
#296232 · Added 2 months ago · 57
Source: NVIDIA
Tech Stack / Keywords
AI, Python, Profiling, SOLID, CUDA, Cloud, LLM, PyTorch
Company and position
We are seeking a Senior Solutions Architect with strong hands-on experience in deploying, debugging, and optimizing training and inference workloads on large-scale GPU clusters. The role supports customers and partners across Europe in training models on groundbreaking GPU infrastructure, focusing on High Performance Computing and AI workloads.
Requirements
- BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or a related engineering field, or equivalent practical experience.
- 8+ years of experience in accelerated computing technologies at cluster scale, ideally including work with NVIDIA platforms.
- Strong programming skills in at least one of the following languages: C, C++, or Python.
- Practical experience identifying and resolving bottlenecks in large-scale training workloads or parallel applications.
- Hands-on experience in profiling and debugging large parallel applications.
- Solid understanding of CPU and GPU architectures, CUDA, parallel filesystems, and high-speed interconnects.
- Experience working with large compute clusters, with an understanding of internal scheduling and resource management mechanisms (e.g., SLURM or cloud-based clusters).
- Proficient knowledge of training pipelines and frameworks, including their internal operations and performance attributes.
Nice to have:
- Experience debugging training pipelines running on thousands of GPUs in production environments.
- Hands-on experience with performance profiling and optimization using tools such as Nsight Systems and Nsight Compute.
- Good understanding of NCCL, MPI, and low-level communication libraries.
- Ability to debug stability issues across the entire stack: parallel application, training frameworks, runtime libraries, schedulers, and hardware.
- Solid understanding of internal workings of LLM frameworks such as PyTorch, Megatron-LM, or NeMo, and their impact on compute layers like CPUs, GPUs, network, and storage.
- Understanding of inference tools such as vLLM, Dynamo, TensorRT-LLM, Red Hat Inference Server, or SGLang.
Responsibilities
- Collaborate with NVIDIA’s training framework developers and product teams to stay ahead of the latest features and help partners adopt them effectively.
- Assist with deploying, debugging, and improving the efficiency of AI workloads on large-scale NVIDIA platforms.
- Benchmark new framework features, analyze performance, and share actionable insights with customers and internal teams.
- Work directly with external customers to solve cluster performance and stability issues, identify bottlenecks, and implement effective solutions.
- Build expertise and guide customers in scaling workloads efficiently and reliably on the latest generation of NVIDIA GPUs.
- Contribute to Europe’s Sovereign AI initiative by helping customers implement advanced resiliency features within AI training pipelines.
Offer
- Base salary range for Poland: 292,500 PLN - 507,000 PLN for Level 4, and 375,000 PLN - 650,000 PLN for Level 5.
NVIDIA
30 active offers