Job Title: Senior Linux AI Support Engineer
Job Overview:
We are seeking a highly skilled Senior Linux AI Support Engineer to provide after-hours support for a high-performance computing (HPC) AI environment hosted on bare-metal, on-premises infrastructure. The ideal candidate will be experienced in Linux system administration, troubleshooting HPC clusters, and optimizing performance in an AI-driven computational setting.
Key Responsibilities:
- Monitor, maintain, and troubleshoot Linux-based HPC infrastructure outside of regular business hours.
- Provide incident response and technical support for HPC cluster failures, performance degradation, and user-reported issues.
- Manage bare-metal servers, ensuring reliability, security, and optimal resource utilization.
- Deploy, configure, and upgrade Linux OS and HPC software stacks as needed.
- Collaborate with AI engineers, researchers, and IT teams to optimize workloads and resource scheduling.
- Maintain automated monitoring and alerting systems to proactively detect failures.
- Perform log analysis, debugging, and root cause analysis for complex system issues.
- Ensure compliance with security policies, access controls, and data integrity standards.
- Document solutions, operational procedures, and troubleshooting guides to support continuous improvement.
- Contribute to automation efforts using scripts, configuration management, and infrastructure as code (IaC).
Required Skills & Experience:
- 6-8 years of experience in Linux system administration, preferably in an HPC or AI-driven environment running RHEL- and Debian-based distributions.
- Deep understanding of bare-metal infrastructure concepts and management (networking, storage, provisioning).
- Strong knowledge of containerization (Docker, Singularity) and orchestration tools.
- High level of proficiency in scripting and programming (Bash, Python, Go) and automation tools (Ansible, Puppet).
- Familiarity with NAS systems and file-sharing protocols (NFS, SMB/CIFS).
- Troubleshooting expertise in performance tuning, kernel optimizations, and system-level debugging.
- Strong problem-solving skills with the ability to work independently in high-pressure situations.
- Excellent communication skills for coordinating with remote teams and end-users.
Preferred Skills & Experience:
- Red Hat Certified Engineer (RHCE) or equivalent Linux certification.
- Experience with NVIDIA GPUs and their tool stacks.
- HPC-related certifications, coursework in AI computing, or relevant experience.
- Hands-on experience with AI/ML workloads in HPC environments is a plus.
- Experience with Kubernetes or HPC workload schedulers (e.g., Slurm, PBS, Grid Engine).
Benefits to help you thrive
At Deloitte, we know that great people make a great organization. Our comprehensive rewards program helps us deliver a distinctly Deloitte experience that empowers our professionals to thrive mentally, physically, and financially, and to live their purpose. To support our professionals and their loved ones, we offer a broad range of benefits. Eligibility requirements may be based on role, tenure, type of employment, and/or other criteria. Learn more about what working at Deloitte can mean for you.