Site Reliability Engineer (SRE)

Apply Now

Location

Remote (US) / Hybrid (Noida, India)

About Us 

Tobi Cloud is a fast-growing software company that helps transportation fleets run smoothly. Our easy-to-use tools help fleet owners schedule rides, manage dispatch, handle billing, and more.

Our customers rely on us every day, so keeping our platform running well is central to what we do. We're looking for a Site Reliability Engineer who is curious, helpful, and ready to learn.

If you like solving hard problems and helping people, you may be a great fit.

About the Role

We are looking for a passionate Site Reliability Engineer (SRE) to join our growing engineering organization and partner closely with our product engineering squads. You will focus on making our systems more reliable, observable, scalable, and resilient while empowering developers to ship with confidence and speed. 

This is a hands-on SRE role where you’ll spend significant time writing code (infrastructure, tooling, observability, automation), improving production reliability, and working shoulder-to-shoulder with product engineers to raise the bar on operational excellence. 

Key Responsibilities 

  • Partner with product engineering squads to design, build, and operate highly reliable services 
  • Own and improve production reliability end-to-end: 
    • Define and measure SLOs/SLIs, error budgets, and reliability goals 
    • Lead incident response, postmortems, and follow-up action items 
    • Participate in the on-call rotation and drive rapid, effective resolution of production issues 
  • Build and maintain world-class observability: 
    • Create comprehensive dashboards, alerts, metrics, structured logging, and distributed tracing 
    • Enable squads to understand system behavior and debug effectively 
  • Develop automation, tooling, and infrastructure as code to reduce toil and increase developer velocity 
  • Collaborate closely with Staff Engineers / Team Leads to: 
    • Embed reliability best practices into the development lifecycle 
    • Review architectural decisions with a production lens 
    • Mentor engineers on operational excellence, observability, and an on-call mindset 
  • Champion modern engineering and DevOps practices: 
    • CI/CD pipelines, progressive delivery (feature flags, canaries, blue-green) 
    • Infrastructure as code (Terraform, Pulumi, CDK) 
    • Effective use of AI-assisted tools to accelerate scripting, debugging, and documentation 
  • Proactively identify and eliminate classes of failure through chaos engineering, capacity planning, and performance tuning 
  • Help evolve our technical strategy for reliability, scalability, and cost-efficiency 

Requirements 

  • 5+ years of professional experience in SRE, DevOps, or software engineering with a strong focus on production systems 
  • Deep hands-on experience operating distributed cloud systems (AWS / GCP / Azure — at least one in depth, preferably AWS) 
  • Proficiency in at least one modern programming language used for tooling & automation (Go, Python, TypeScript/JavaScript, Rust) 
  • Strong observability expertise: 
    • Building dashboards and alerts (Grafana, Groundcover, Datadog, New Relic, Prometheus, etc.) 
    • Distributed tracing (OpenTelemetry, Jaeger, Zipkin) 
    • Structured logging and metrics at scale 
  • Proven track record of incident management, postmortems, and driving reliability improvements 
  • Experience defining and working with SLOs, SLIs, and error budgets 
  • Comfort with infrastructure as code and modern DevOps practices (CI/CD, GitOps, containers/Kubernetes) 
  • Excellent collaboration skills — you enjoy partnering with product engineers and teaching reliability concepts 
  • Bias toward automation and reducing manual toil 

Nice-to-Haves 

  • Previous on-call leadership or incident commander experience 
  • Background in performance engineering or capacity planning at scale 
  • Familiarity with service meshes, API gateways, or zero-trust networking 
  • Contributions to open-source reliability/observability tools 
  • Experience mentoring or embedding within product squads 

Why You’ll Like Working Here 

You'll get real hands-on learning from your first day, and you'll work on a friendly team where your ideas matter. Because we value flexibility, we offer remote and hybrid work in a supportive, low-ego environment.
