Site Reliability Engineer (SRE)

Location

Remote (US) / Hybrid (Noida, India)

About Us

Tobi Cloud is a fast-growing software company that helps transportation fleets run smoothly. Our easy-to-use tools help fleet owners schedule rides, manage dispatch, handle billing, and more.

Our customers rely on us every day. As a result, our support team plays a big role in keeping things running well. We’re looking for a Site Reliability Engineer who is curious, helpful, and ready to learn.

You don’t need a long list of qualifications. Instead, if you like solving problems and helping people, you may be a great fit.

About the Role

We are looking for a passionate Site Reliability Engineer (SRE) to join our growing engineering organization and partner closely with our product engineering squads. You will focus on making our systems more reliable, observable, scalable, and resilient while empowering developers to ship with confidence and speed.

This is a hands-on SRE role where you’ll spend significant time writing code (infrastructure, tooling, observability, automation), improving production reliability, and working shoulder-to-shoulder with product engineers to raise the bar on operational excellence.

Key Responsibilities

Partner with product engineering squads to design, build, and operate highly reliable services

Own and improve production reliability end-to-end:
- Define and measure SLOs/SLIs, error budgets, and reliability goals
- Lead incident response, postmortems, and follow-up action items
- Participate in the on-call rotation and drive rapid, effective resolution of production issues

Build and maintain world-class observability:
- Create comprehensive dashboards, alerts, metrics, structured logging, and distributed tracing
- Enable squads to understand system behavior and debug effectively

Develop automation, tooling, and infrastructure as code to reduce toil and increase developer velocity

Collaborate closely with Staff Engineers / Team Leads to:
- Embed reliability best practices into the development lifecycle
- Review architectural decisions with a production lens
- Mentor engineers on operational excellence, observability, and on-call mindset

Champion modern engineering and DevOps practices:
- CI/CD pipelines, progressive delivery (feature flags, canaries, blue-green)
- Infrastructure as code (Terraform, Pulumi, CDK)
- Effective use of AI-assisted tools to accelerate scripting, debugging, and documentation

Proactively identify and eliminate classes of failure through chaos engineering, capacity planning, and performance tuning

Help evolve our technical strategy for reliability, scalability, and cost-efficiency

Requirements

5+ years of professional experience in SRE, DevOps, or software engineering with a strong focus on production systems

Deep hands-on experience operating distributed cloud systems (AWS / GCP / Azure — at least one in depth, preferably AWS)

Proficiency in at least one modern programming language used for tooling & automation (Go, Python, TypeScript/JavaScript, Rust)

Strong observability expertise:
- Building dashboards and alerts (Grafana, Groundcover, Datadog, New Relic, Prometheus, etc.)
- Distributed tracing (OpenTelemetry, Jaeger, Zipkin)
- Structured logging and metrics at scale

Proven track record of incident management, postmortems, and driving reliability improvements

Experience defining and working with SLOs, SLIs, and error budgets

Comfort with infrastructure as code and modern DevOps practices (CI/CD, GitOps, containers/Kubernetes)

Excellent collaboration skills — you enjoy partnering with product engineers and teaching reliability concepts

Bias toward automation and reducing manual toil

Nice-to-Haves

Previous on-call leadership or incident commander experience

Background in performance engineering or capacity planning at scale

Familiarity with service meshes, API gateways, or zero-trust networking

Contributions to open-source reliability/observability tools

Experience mentoring or embedding within product squads

Why You’ll Like Working Here

You’ll get real hands-on learning from your first day. In addition, you’ll work on a small, friendly team where your ideas matter. Because we value flexibility, you can work remotely in a supportive and low-ego environment.

