Back to results

High Performance Compute (HPC) Site Reliability Engineer (Senior Level Role) – Leading AI Company


£120,000 - £140,000

The client:Our client is a start-up who offers an innovative form of GPU computing infrastructure and Cloud-Native Orchestration solutions to Technology and AI firms worldwide. 🧑🏿‍💻👨🏼‍💻👩🏾‍💻

What they need: They are looking for a senior HPC SRE to play a key role guaranteeing the reliability of their HPC environments. You will have an opportunity to design and build HPC infrastructure, as well as setting up clusters from scratch. Key skills include HPC architecture, networking technologies (ie TCP, UDP, IPv4 or MP-BGP), and network protocols (ie RoCE or RDMA).

In this position you will:

  • Get to set up HPC clusters from scratch.
  • You will take ownership of investigations into high-priority incidents, identify solutions, and prepare Root Cause Analysis (RCA).
  • Work in the AI space, where you will get exposure to the latest Machine Learning technology.
  • Please apply to find out more.

Contact:
Harry Kemp

Share Job
Know someone who may be interested?

Upload your CV
One of our consultants will contact you
to find out more about you