Job Title: Senior Linux AI Support Engineer
Job Overview:
We are seeking a highly skilled Senior Linux AI Support Engineer to provide after-hours support for a high-performance computing (HPC) AI environment hosted on bare-metal, on-premises infrastructure. The ideal candidate will be experienced in Linux system administration, troubleshooting HPC clusters, and optimizing performance in an AI-driven computational setting.
Key Responsibilities:
- Monitor, maintain, and troubleshoot Linux-based HPC infrastructure outside of regular business hours.
- Provide incident response and technical support for HPC cluster failures, performance degradation, and user-reported issues.
- Manage bare-metal servers, ensuring reliability, security, and optimal resource utilization.
- Deploy, configure, and upgrade Linux OS and HPC software stacks as needed.
- Collaborate with AI engineers, researchers, and IT teams to optimize workloads and resource scheduling.
- Maintain automated monitoring and alerting systems to proactively detect failures.
- Perform log analysis, debugging, and root cause analysis for complex system issues.
- Ensure compliance with security policies, access controls, and data integrity standards.
- Document solutions, operational procedures, and troubleshooting guides to support continuous improvement.
- Contribute to automation efforts using scripts, configuration management, and infrastructure as code (IaC).
Required Skills & Experience:
- 6-8 years of experience in Linux system administration, preferably in an HPC or AI-driven environment running RHEL- and Debian-based distributions.
- Deep understanding of bare-metal infrastructure concepts and management (networking, storage, provisioning).
- Strong knowledge of containerization (Docker, Singularity) and orchestration tools.
- High level of proficiency in scripting and programming (Bash, Python, Go) and automation tools (Ansible, Puppet).
- Familiarity with NAS systems and file-sharing protocols (NFS, SMB/CIFS).
- Troubleshooting expertise in performance tuning, kernel optimizations, and system-level debugging.
- Strong problem-solving skills with the ability to work independently in high-pressure situations.
- Excellent communication skills for coordinating with remote teams and end-users.
Preferred Skills & Experience:
- Red Hat Certified Engineer (RHCE) or equivalent Linux certification.
- Experience with NVIDIA GPUs and tool stacks.
- HPC-related certifications, coursework in AI computing, or relevant experience.
- Hands-on experience with AI/ML workloads in HPC environments is a plus.
- Experience with Kubernetes or HPC workload schedulers (e.g., Slurm, PBS, Grid Engine).
Benefits
At Deloitte, we know that great people make a great organization. We value our people and offer employees a broad range of benefits. Learn more about what working at Deloitte can mean for you.