Zuddl is a modular platform for events and webinars that helps event marketers plan and execute events that drive growth. Event teams from global organizations like Microsoft, Google, ServiceNow, Zylo, Postman, TransPerfect and the United Nations trust Zuddl. Our modular approach to event management lets B2B marketers and conference organizers decide which components they need to build the perfect event and scale their event program. Zuddl is an outcome-oriented platform with a focus on flexibility, and is more partner, less vendor.

FUNDING

Zuddl, part of the Y Combinator 2020 batch, has raised $13.35 million in Series A funding led by Alpha Wave Incubation and Qualcomm Ventures, with participation from our existing investors GrowX Ventures and Waveform Ventures.

About the Role

We are seeking a Senior Platform Reliability Engineer who will be the primary guardian of our platform's stability and performance. This role requires someone who is hands-on, who continuously thinks about everything that could go wrong with our platform, and who proactively works to prevent it. You'll be responsible for ensuring our infrastructure can handle our growing scale while maintaining reliability for critical customer events.

Key Responsibilities

Platform Risk Management

  • Continuously monitor and identify potential failure points across the entire system
  • Develop and maintain comprehensive risk assessment frameworks
  • Proactively highlight critical issues based on risk analysis and impact assessment
  • Prioritize and escalate issues based on business impact and customer scale

Performance & Load Testing

  • Design and implement comprehensive load testing strategies
  • Run regular performance tests as part of the release pipeline
  • Conduct black-box testing to identify performance regressions before they reach production
  • Monitor and analyze performance during large-scale customer events (1000+ attendees)
  • Establish baseline metrics and thresholds for performance monitoring

Infrastructure Monitoring & Alerting

  • Build and maintain robust monitoring systems for platform health
  • Create automated alerting for performance degradation and system issues
  • Develop dashboards and reports for platform reliability metrics
  • Monitor third-party service dependencies and assess their impact on our platform

Process Development

  • Establish best practices for infrastructure changes and deployments
  • Create standardized procedures for risk assessment and mitigation
  • Develop training materials and guidelines for development teams
  • Build frameworks for capacity planning and scaling decisions

Disaster Recovery & Business Continuity (Good to have)

  • Develop and maintain disaster recovery plans for various failure scenarios
  • Create runbooks for incident response and system recovery procedures
  • Plan for multi-region deployments and failover strategies
  • Manage third-party service risks and backup strategies
  • Ensure business continuity during service outages (e.g., cloud provider issues, third-party service failures)

Release Pipeline Integration (Good to have)

  • Integrate performance and reliability checks into the deployment process
  • Create automated gates that prevent problematic releases from going live
  • Develop checklists and processes for infrastructure changes
  • Ensure proper testing procedures are followed before production deployments


Required Qualifications

  • 6+ years of experience in DevOps, SRE, or Platform Engineering roles
  • Extensive experience with large-scale distributed systems and high-traffic applications
  • Strong expertise in load testing tools and methodologies (JMeter, k6, Locust, Artillery, etc.)
  • Proficiency in monitoring and observability tools (Prometheus, Grafana, Datadog, New Relic, ELK stack)
  • Experience with AWS and infrastructure as code
  • Knowledge of database performance optimization and troubleshooting
  • Experience with CI/CD pipelines and release management processes
  • Experience with disaster recovery planning and incident response
  • Background in capacity planning and performance optimization
  • Knowledge of third-party service integration and dependency management
  • Experience with multi-region deployments and distributed architectures


WHY YOU WANT TO WORK HERE

  1. Competitive compensation
  2. Employee-friendly ESOPs
  3. Remote Working
  4. Flexible Leave Program
  5. Home Workstation Setup
  6. A culture built on trust, transparency, and integrity
  7. Ground-floor opportunity at a fast-growing Series A startup