We help the world run better
Our company culture is focused on helping our employees enable innovation by building breakthroughs together. How? We focus every day on building the foundation for tomorrow and creating a workplace that embraces differences, values flexibility, and is aligned to our purpose-driven and future-focused work. We offer a highly collaborative, caring team environment with a strong focus on learning and development, recognition for your individual contributions, and a variety of benefit options for you to choose from. Apply now!
WHAT YOU WILL DO:
----------------
We are seeking a highly skilled and motivated individual to join our team as an Infrastructure, Kubernetes, and ML Solutions Specialist. In this role, you will be responsible for setting up and managing the GPU computing infrastructure connected to an OpenStack-based private cloud, provisioning GPU resources in Kubernetes (K8S), distributing model training tasks across different K8S GPU nodes, and overseeing the overall infrastructure and operations for our ML solutions.
Responsibilities:
- Design Infrastructure: Collaborate with the AI team to design and implement a robust and scalable GPU computing infrastructure connected to the OpenStack-based private cloud. Evaluate hardware options, configure GPUs, and optimize performance for K8S deployment.
- GPU Provisioning in K8S: Develop and implement GPU provisioning workflows in Kubernetes, ensuring seamless integration with the other platform components like Kubeflow. Automate the allocation and release of GPUs to training tasks.
- Cross-Node GPU Scheduling: Implement GPU-aware scheduling policies in Kubernetes to distribute model training tasks across different GPU nodes. Optimize GPU resource utilization and minimize job queuing times.
- Nvidia Driver Management: Install and manage Nvidia GPU drivers on Kubernetes worker nodes within the OpenStack private cloud. Keep drivers up to date and ensure compatibility with supported GPU models and Kubernetes versions.
- Infrastructure and Operations Management: Oversee the overall infrastructure and operations for our ML solutions. This includes monitoring system performance, ensuring high availability, performing routine maintenance, and troubleshooting issues related to the infrastructure and ML workflows.
- ML Solution Release and Deployment: Collaborate closely with the AI team to deploy and manage ML solutions on the platform. This involves deploying trained models, setting up inference pipelines, and ensuring the scalability and reliability of the ML solutions.
- Collaboration: Work closely with the AI team, infrastructure teams, and other stakeholders to understand ML solution requirements, provide technical guidance, and optimize infrastructure and operations based on their needs.
- Documentation and Training: Create and maintain documentation, including setup procedures, best practices, and troubleshooting guides for resource provisioning, Kubernetes management, and ML solution deployment. Train 24x7 support team members on infrastructure management and best practices.
- Stay Updated: Keep up to date with the latest advancements in resource provisioning, service mesh, Kubernetes, and ML technologies. Evaluate emerging technologies and recommend potential enhancements to the infrastructure and operations.
WHAT YOU BRING:
----------------
- Bachelor's degree in computer science, engineering, or a related field. Advanced degrees are a plus.
- Solid understanding of GPU computing concepts and frameworks used in AI/ML, such as CUDA, Kubeflow, Spark, or PyTorch.
- Experience in setting up and managing GPU computing infrastructure connected to an OpenStack-based private cloud.
- Strong knowledge of Kubernetes and container orchestration concepts.
- Proficiency in provisioning GPUs in Kubernetes, including GPU device plugins and resource quotas.
- Familiarity with Nvidia GPU drivers and their installation and management in Kubernetes clusters within an OpenStack environment.
- Experience with GPU-aware scheduling policies and workload distribution in Kubernetes.
- Understanding of ML solution deployment and management, including model deployment and inference pipeline setup.
- Proficiency in scripting languages (e.g., Python, Bash) for automation and infrastructure management.
- Proven experience in infrastructure and operations management for ML solutions.
- Problem-solving skills and the ability to diagnose and resolve complex technical issues.
- Excellent communication and collaboration skills to work effectively with cross-functional teams.
- Strong attention to detail and ability to manage multiple priorities in a fast-paced environment.
- Join our dynamic team and contribute to the cutting-edge
We build breakthroughs together
SAP innovations help more than 400,000 customers worldwide work together more efficiently and use business insight more effectively. Originally known for leadership in enterprise resource planning (ERP) software, SAP has evolved to become a market leader in end-to-end business application software and related services for database, analytics, intelligent technologies, and experience management. As a cloud company with 200 million users and more than 100,000 employees worldwide, we are purpose-driven and future-focused, with a highly collaborative team ethic and commitment to personal development. Whether connecting global industries, people, or platforms, we help ensure every challenge gets the solution it deserves. At SAP, we build breakthroughs, together.
We win with inclusion
SAP’s culture of inclusion, focus on health and well-being, and flexible working models help ensure that everyone – regardless of background – feels included and can run at their best. At SAP, we believe we are made stronger by the unique capabilities and qualities that each person brings to our company, and we invest in our employees to inspire confidence and help everyone realize their full potential. We ultimately believe in unleashing all talent and creating a better and more equitable world.
SAP is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to the values of Equal Employment Opportunity and provide accessibility accommodations to applicants with physical and/or mental disabilities. If you are interested in applying for employment with SAP and need accommodation or special assistance to navigate our website or to complete your application, please send an e-mail with your request to the Recruiting Operations Team: Careers@sap.com
For SAP employees: Only permanent roles are eligible for the SAP Employee Referral Program, according to the eligibility rules set in the SAP Referral Policy. Specific conditions may apply for roles in Vocational Training.
EOE AA M/F/Vet/Disability:
Qualified applicants will receive consideration for employment without regard to their age, race, religion, national origin, ethnicity, gender (including pregnancy, childbirth, et al.), sexual orientation, gender identity or expression, protected veteran status, or disability.
Successful candidates might be required to undergo a background verification with an external vendor.
Requisition ID: 375434 | Work Area: Software-Development Operations | Expected Travel: 0 - 10% | Career Status: Professional | Employment Type: Regular Full Time | Additional Locations: #LI-Hybrid.