Site Reliability Engineer or Platform Engineer

Zuddl is a modular platform for events and webinars that helps event marketers plan and execute events that drive growth. Event teams from global organizations like Microsoft, Google, ServiceNow, Zylo, Postman, TransPerfect and the United Nations trust Zuddl. Our modular approach to event management lets B2B marketers and conference organizers decide which components they need to build the perfect event and scale their event program. Zuddl is an outcome-oriented platform with a focus on flexibility, and is more partner, less vendor.

FUNDING

Zuddl, part of the Y Combinator 2020 batch, has raised $13.35 million in Series A funding led by Alpha Wave Incubation and Qualcomm Ventures, with participation from our existing investors GrowX Ventures and Waveform Ventures.

About the Role

We are seeking a Senior Platform Reliability Engineer who will be the primary guardian of our platform's stability and performance. This role requires someone who is hands-on, who continuously thinks about everything that could go wrong with our platform and proactively works to prevent it. You'll be responsible for ensuring our infrastructure can handle our growing scale while maintaining reliability for critical customer events.

Key Responsibilities

Platform Risk Management

Continuously monitor and identify potential failure points across the entire system
Develop and maintain comprehensive risk assessment frameworks
Proactively highlight critical issues based on risk analysis and impact assessment
Prioritize and escalate issues based on business impact and customer scale

Performance & Load Testing

Design and implement comprehensive load testing strategies
Run regular performance tests as part of the release pipeline
Conduct black-box testing to identify performance regressions before they reach production
Monitor and analyze performance during large-scale customer events (1000+ attendees)
Establish baseline metrics and thresholds for performance monitoring

Infrastructure Monitoring & Alerting

Build and maintain robust monitoring systems for platform health
Create automated alerting for performance degradation and system issues
Develop dashboards and reports for platform reliability metrics
Monitor third-party service dependencies and assess their impact on our platform

Process Development

Establish best practices for infrastructure changes and deployments
Create standardized procedures for risk assessment and mitigation
Develop training materials and guidelines for development teams
Build frameworks for capacity planning and scaling decisions

Disaster Recovery & Business Continuity (Good to have)

Develop and maintain disaster recovery plans for various failure scenarios
Create runbooks for incident response and system recovery procedures
Plan for multi-region deployments and failover strategies
Manage third-party service risks and backup strategies
Ensure business continuity during service outages (e.g., cloud provider issues, third-party service failures)

Release Pipeline Integration (Good to have)

Integrate performance and reliability checks into the deployment process
Create automated gates that prevent problematic releases from going live
Develop checklists and processes for infrastructure changes
Ensure proper testing procedures are followed before production deployments

Required Qualifications

6+ years of experience in DevOps, SRE, or Platform Engineering roles
Extensive experience with large-scale distributed systems and high-traffic applications
Strong expertise in load testing tools and methodologies (JMeter, k6, Locust, Artillery, etc.)
Proficiency in monitoring and observability tools (Prometheus, Grafana, Datadog, New Relic, ELK stack)
Experience with AWS and infrastructure as code
Knowledge of database performance optimization and troubleshooting
Experience with CI/CD pipelines and release management processes
Experience with disaster recovery planning and incident response
Background in capacity planning and performance optimization
Knowledge of third-party service integration and dependency management
Experience with multi-region deployments and distributed architectures

WHY YOU WANT TO WORK HERE

Competitive compensation
Employee Friendly ESOPs
Remote Working
Flexible Leave Program
Home Workstation Setup
A culture built on trust, transparency, and integrity
Ground-floor opportunity at a fast-growing Series A startup
