Overview: Our client is seeking an experienced AI Cloud Data Centre Operations Manager with expertise in GPU architecture to join their dynamic team. In this role, you will be responsible for overseeing and optimizing the operations of the company's AI-focused cloud data centers, ensuring efficient resource management, uptime, and scalability of GPU-intensive systems. Your knowledge of GPU architecture and data center management will be key to optimizing AI workloads and supporting mission-critical applications.

Key Responsibilities:
Data Center Operations Management: Lead and manage day-to-day operations of AI-focused data centers, ensuring optimal performance, availability, and reliability of all hardware and software systems.
GPU Resource Optimization: Develop strategies for effective utilization of GPU resources in alignment with business and operational goals. Monitor and optimize GPU workloads to maximize performance and cost-efficiency.
Infrastructure Planning and Scaling: Collaborate with infrastructure and development teams to plan, deploy, and scale data center resources to support AI workloads, considering current and future requirements.
Performance Monitoring & Troubleshooting: Implement monitoring tools and dashboards for real-time analysis of GPU and overall data center performance. Troubleshoot and resolve issues to maintain high availability and minimal downtime.
System Upgrades & Maintenance: Schedule and oversee hardware and software upgrades, including GPU infrastructure, ensuring compatibility and system optimization.
Security & Compliance: Ensure data center operations meet industry standards for security and regulatory compliance, implementing best practices for data protection and cybersecurity.
Vendor and Stakeholder Management: Engage with vendors for hardware procurement and support, and collaborate with cross-functional teams for project planning and execution.

Key Qualifications:
Education: Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree preferred.
Experience: 5+ years of experience in data center operations, with a focus on AI or cloud environments. Experience with managing GPU-intensive systems is essential.
Certifications in Data Center Management or cloud platforms.

Technical Expertise:
Strong understanding of GPU architecture and experience optimizing workloads for AI and ML applications.
Proficiency in data center infrastructure management (DCIM) tools and monitoring systems.
Knowledge of cloud computing platforms (AWS, Google Cloud, Azure) and hybrid cloud environments.

Desired Skills:
Strong problem-solving and analytical skills, especially in troubleshooting data center and GPU performance issues.
Excellent communication and leadership skills, with the ability to work cross-functionally and drive projects to completion.
Familiarity with security standards and regulatory compliance requirements for data center operations.

Interested applicants, please send your latest resume to:
Jacqueline Ng [email protected]

We regret that only shortlisted candidates will be notified.