API-Platform SRE
Company: Alibaba Cloud
Location: Sunnyvale
Posted on: June 1, 2025
Job Description:
The Alibaba Cloud Open Platform team is responsible for cloud
enterprise-level capabilities such as API Platform, and enterprise
solutions like Landing Zone/Well-Architected Framework.

Description
Maintaining system reliability and ensuring core system availability
are critical for Open Platform. The goal of this role is to establish
a system reliability framework that combines technology and
management, including but not limited to the following:
1. Develop reliability standards and metrics that cover aspects such
as robust architecture design, engineering quality, release
management, and production environment operations, ensuring
reliability is integrated into the full Alibaba Cloud development
lifecycle.
2. Drive major reliability governance initiatives, such as full-stack
disaster recovery, gradual rollout, incident response and mitigation
(1-5-10), loss prevention, etc., to quickly mitigate reliability
risks.
3. Build a reliability platform that supports change automation, red
team/blue team exercises, incident response collaboration, risk
scanning, monitoring, etc., to simplify reliability engineering.
4. Handle production environment incidents, including incident
response, incident coordination, incident detection, incident
recovery, and postmortem analysis.
5. Provide technical support to ensure customer business continuity.

Responsibilities
--- Daily maintenance of applications,
databases, and middleware; troubleshooting and addressing customer
inquiries.
--- Collaborate with cloud product teams to develop business-critical
reliability/on-call plans based on customer requirements for key
business periods.
--- Participate in technical design and implementation of business
platforms, identify bottlenecks, and propose solutions.
--- Build high-quality, reusable infrastructure; improve product
quality and engineering efficiency.
--- Stay updated on cutting-edge technologies, and leverage them in
the team's services and infrastructure.

Position Requirement

Basic Qualifications
--- Bachelor's Degree in Computer
Science, Information Systems, Computer Engineering, or a related
field.
--- 5+ years of Systems Engineering, DevOps, Site Reliability
Engineering (SRE), or Enterprise Production experience. Understand
and follow SRE/DevOps best practices.
--- 3+ years of experience operating in a 24/7 production
environment. Proficient with SRE tools, such as at least one
scripting language, monitoring tools, IaC tools, etc. Experienced in
troubleshooting large-scale distributed systems.
--- Good team player, able to influence the team and improve team
productivity and morale.
--- Good communication skills, proficient in Chinese.

Preferred Qualifications
--- 3+ years of experience with cloud computing technologies, with
in-depth understanding and/or hands-on experience in at least one of
the major cloud areas: Compute/Storage/Network/Database/IAM.
--- SRE experience at other major cloud providers (e.g.
AWS/GCP/Azure).

The pay range for this
position at commencement of employment is expected to be between
$133,200/year and $219,600/year. However, base pay offered may vary
depending on multiple individualized factors, including market
location, job-related knowledge, skills, and experience.

If hired,
employee will be in an "at-will position" and the Company reserves
the right to modify base salary (as well as any other discretionary
payment or compensation program) at any time, including for reasons
related to individual performance, Company or individual
department/team performance, and market factors.