E2E Networks Private Limited
South Asia

Accelerated Computing Specialist

Chennai, India
2026-03-03

Role Description

Experience: 2 years | Location: Kancheepuram, Chennai

We are looking for an Accelerated Computing Specialist to join our Cloud Platform team. In this role, you will work at the intersection of cloud infrastructure, Linux systems, and GPU computing, learning how to build, manage, and troubleshoot AI/ML clusters used for large-scale model training and inference. This position is ideal for engineering graduates who are passionate about Linux, cloud computing, GPUs, or AI and eager to gain hands-on experience with real production workloads in a rapidly growing cloud environment.

### Roles & Responsibilities

You will start by learning and gradually take ownership of the following:

**Cloud Platform & Infrastructure:**

- Provide L1–L2 operational support for cloud compute, storage, and networking.
- Monitor VMs, containers, and GPU instances for availability and performance.
- Troubleshoot issues such as connectivity failures, storage mount problems, or GPU driver errors.
- Assist in configuring Load Balancers (ALBs) and Ingress Controllers in Kubernetes clusters.
- Participate in infrastructure automation using Bash, Python, and Terraform.

**GPU and AI Workload Support:**

- Learn how to launch and validate GPU clusters used for AI/ML workloads.
- Understand Slurm job scheduling for distributed training workloads.
- Help configure clusters for LLM training (e.g., Llama 3 models) using tools such as DGCX Bench.
- Monitor and maintain vLLM inference endpoints, ensuring high availability and restart readiness.
- Verify cluster health: all workers accessible, GPUs visible via `nvidia-smi`, and InfiniBand active (`ibstat` shows only IB connections).

**Reliability & Automation:**

- Maintain documentation and contribute to Root Cause Analysis (RCA) reports.
- Support incident response and resolution for GPU-based workloads.
- Work with the team to automate deployments and checks for:
  - Training cluster launches (8xH100, 8xH200)
  - Notebook availability and restarts
  - Inference readiness and autoscaling

**Continuous Learning:**

- Stay updated with GPU computing trends (H100, H200, InfiniBand, CUDA, AI inference).
- Explore how LLMs (Llama, Mistral, Falcon) are trained and deployed on GPU clusters.
- Learn how cloud orchestration, MLOps, and DevOps come together in production AI environments.

### Technical Foundation (Preferred Skills)

We don't expect you to know everything from Day 1, but you should have curiosity and strong basics in at least one area below:

**Core Cloud Skills**

- Operating Systems: Linux (Ubuntu/Debian/CentOS)
- Networking: DNS, NAT, VPN, load balancer basics
- Containers: Docker, Kubernetes, Helm
- Storage: block and object (S3 APIs)
- Monitoring: Prometheus, Grafana, ELK
- Automation: Bash, Python, Git, Ansible, Terraform

**GPU / AI Computing Concepts (Good to Know)**

- NVIDIA GPU tools: CUDA, `nvidia-smi`, GPU scheduling basics
- ML frameworks: TensorFlow, PyTorch, ONNX, Hugging Face
- Cluster scheduling: introduction to Slurm
- LLM workloads: basic idea of how inference endpoints serve AI models

### Key Skills & Qualifications

- B.E / B.Tech / MCA (Computer Science, IT, ECE, or related fields)
- Academic or internship projects in Linux, Cloud, ML, or GPU computing
- Participation in hackathons, open-source projects, or Kaggle/AI community work
- Familiarity with GitHub or any DevOps pipeline
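To give a flavour of the cluster health checks described above, here is a minimal Python sketch of one such check: counting the GPUs a node reports via `nvidia-smi`. The expected GPU count and the helper names are illustrative assumptions, not part of the role description or any E2E Networks tooling.

```python
import subprocess

# Assumption for illustration: an 8-GPU node (e.g. 8xH100 or 8xH200).
EXPECTED_GPUS = 8


def count_visible_gpus(csv_output: str) -> int:
    """Count GPUs listed in the output of
    `nvidia-smi --query-gpu=name --format=csv,noheader`
    (one GPU name per non-empty line)."""
    return len([line for line in csv_output.splitlines() if line.strip()])


def node_gpu_check() -> bool:
    """Run nvidia-smi on the local node and compare the number of
    visible GPUs against the expected count."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return count_visible_gpus(result.stdout) == EXPECTED_GPUS
```

A fuller health check would also probe worker reachability and InfiniBand state (e.g. via `ibstat`) across all nodes, typically driven by an automation tool such as Ansible.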
