2025 SRE Report (Part 1): From Uptime to Performance and Beyond

Jan 29, 2025

By Tanny Kang, COO and Lucas Martinez

The 2025 SRE Report by Catchpoint delivers a comprehensive analysis of evolving trends in Site Reliability Engineering (SRE). This year’s findings highlight the transition from uptime to performance as the gold standard, a surprising rise in toil despite advancements in automation, challenges with organizational alignment, and the complexities of fragmented observability systems. The report offers essential takeaways for SRE teams and organizations aiming to enhance incident response, operational efficiency, and system resilience.

Key Findings from the 2025 SRE Report

  1. Slow Is the New Down: Performance Is Now Non-Negotiable

“Uptime is no longer a meaningful measure of success—performance is the gold standard. Slow is the new down.”

The report establishes a fundamental shift in reliability metrics: the distinction between a slow service and downtime no longer matters. While only 21% of respondents had heard the phrase “Slow is the new down,” the majority (53%) agreed with its sentiment. A slow system can lead to outcomes just as severe as downtime, such as customers abandoning their purchases in e-commerce or distributed systems failing to maintain consensus due to timeouts.

This paradigm shift is reshaping SRE best practices:

  • Building Services: Decoupling of synchronous and asynchronous components to ensure responsiveness under stress. Graceful degradation can maintain partial functionality during incidents.

  • Handling Data: Techniques like precomputation, caching, and I/O optimization to enhance data access speeds and overall performance.

  • Operating Services: Replacement of binary pass/fail monitoring with performance-focused observability to proactively address system issues.
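To make the last point concrete, here is a minimal sketch of how a performance-focused measure differs from a binary pass/fail one. The `Probe` type, the sample data, and the 1000 ms threshold are illustrative assumptions, not figures from the report.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    ok: bool            # did the request succeed at all?
    latency_ms: float   # how long the request took


def binary_availability(probes):
    """Pass/fail view: fraction of probes that returned successfully."""
    return sum(p.ok for p in probes) / len(probes)


def performance_availability(probes, slow_threshold_ms=1000.0):
    """'Slow is the new down' view: a probe is good only if it
    succeeded AND finished within the latency threshold."""
    good = sum(p.ok and p.latency_ms <= slow_threshold_ms for p in probes)
    return good / len(probes)


# Hypothetical sample: three successes (one of them very slow) and one failure.
probes = [
    Probe(True, 120.0),
    Probe(True, 2500.0),
    Probe(True, 300.0),
    Probe(False, 0.0),
]

print(binary_availability(probes))       # 0.75
print(performance_availability(probes))  # 0.5
```

The same traffic looks 75% healthy under a pass/fail check but only 50% healthy once the slow response is counted as a failure.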

The report emphasizes the growing adoption of Service Level Objectives (SLOs) and Experience Level Objectives (XLOs) as essential tools for defining and tracking performance. These objectives, combined with tools like burndown charts, provide actionable insights for managing error budgets and resource allocation.
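As a rough illustration of the error-budget arithmetic behind an SLO, the sketch below computes how much budget a service has and how much it has used. The function name, the 99% target, and the event counts are assumptions chosen for the example; the report does not prescribe a formula.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize an error budget for one measurement window.

    slo_target   -- required fraction of good events, e.g. 0.99
    total_events -- all events observed in the window
    bad_events   -- events that failed or were too slow
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # The budget is the number of bad events the SLO tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    remaining = allowed_bad - bad_events
    # Fraction of the budget already spent; above 1.0 means the SLO is breached.
    consumed = bad_events / allowed_bad if allowed_bad else float("inf")

    return {
        "allowed_bad": allowed_bad,
        "remaining": remaining,
        "consumed": consumed,
    }


# Hypothetical window: 10,000 requests, 50 bad, against a 99% SLO.
budget = error_budget(0.99, 10_000, 50)
print(f"budget consumed: {budget['consumed']:.0%}")
```

Plotting the consumed fraction over successive windows gives the kind of burndown view the report mentions: in this example about half of the budget is spent, which leaves room for planned risk such as a feature rollout.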

Takeaway: Performance optimization must be a continuous, collaborative effort, integrating internal, external, and multi-party perspectives.


  2. Toil Is Rising: The Double-Edged AI Sword

“Paradoxically, the free time created by expediting valuable activities may end up being filled with toilsome tasks.”

For the first time in five years, the report reveals an increase in toil—manual, repetitive tasks—with the median level rising from 14% in 2024 to 20% in 2025. Operational tasks account for much of this growth, despite expectations that automation and AI would reduce toil.

Key drivers of increased toil:

  • AI’s Mixed Impact: AI introduces operational challenges like maintaining models, managing GPU clusters, and supervising outputs. These tasks often require significant manual oversight, adding complexity.

  • Short-Term Priorities: Pressure to deliver features and reduce costs deprioritizes long-term investments in operational efficiency, leaving teams overwhelmed by routine tasks.

The report calls for organizations to benchmark toil levels and evaluate the true impact of AI. By making long-term investments in operational improvements, teams can reduce toil and refocus on high-value activities.
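One simple way to benchmark toil is to track each engineer's share of time spent on manual, repetitive work and report the team median, which can then be compared with the report's figures. The sketch below assumes a hypothetical weekly time log; the names and hours are invented for illustration.

```python
from statistics import median


def toil_percent(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil, as a percentage."""
    return 100 * toil_hours / total_hours


def median_toil(time_log: dict) -> float:
    """Median toil percentage across a team.

    time_log maps a person to (toil_hours, total_hours) for the period.
    """
    return median(toil_percent(toil, total) for toil, total in time_log.values())


# Hypothetical week for a three-person team.
week = {
    "alice": (8, 40),
    "bob": (4, 40),
    "carol": (10, 40),
}

print(median_toil(week))  # 20.0
```

Recording the same number every quarter shows whether a new tool or AI assistant is actually reducing toil or only shifting it elsewhere.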

Takeaway: The unexpected rise in toil emphasizes the importance of rethinking how teams manage operational tasks. To truly reduce manual workloads, organizations must balance the benefits of AI with the need for thoughtful implementation and sustained investment in scalable practices.


  3. Organizational Misalignment: The Agility vs. Stability Conflict

“It’s a classic case of agility versus stability: businesses want updates, new features, and revenue growth, whereas practitioners prioritize reliability and resilience.”

The report highlights the ongoing tension between delivering features and maintaining reliability. While 58% of respondents believe OKRs are clearly communicated, 41% of frontline teams report always or often feeling pressured to prioritize features over stability.

Factors driving this misalignment:

  • Shifting Priorities: Leadership changes and evolving business goals leave reliability practitioners uncertain about alignment.

  • Communication Gaps: Clear OKRs alone aren’t enough; two-way communication is critical for addressing resource constraints and bridging misaligned goals.

  • Resistance to Change: While organizations recognize the importance of reliability, they often avoid making the adjustments necessary to fully support it.

Takeaway: Transparent communication and adaptability are critical. Organizations must foster collaboration to align priorities and resolve conflicts between agility and stability.


  4. Observability Is Fragmented: Toward a Unified View of System Health

“Rather than fixating on reducing tools, focus on whether their value justifies their cost.”

The report finds that most organizations use 2 to 10 observability tools, highlighting the need for specialized monitoring solutions across diverse technology stacks. However, fragmented systems present challenges:

  • Over-Consolidation: Reducing tools too aggressively can create blind spots and limit critical insights.

  • Tool Overload: Too many tools without proper integration can overwhelm teams with noise and redundant data.

Need for Visibility: Integrated observability tools are essential to provide a unified view of system health. Combining logs, metrics, and traces enables faster root-cause analysis, proactive issue detection, and a holistic understanding of performance. This approach ensures visibility across internal systems and external dependencies, such as APIs or third-party services.
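As a small sketch of what a unified view can mean in practice, the example below joins log lines and trace spans on a shared trace ID so that one lookup shows both what happened and where the time went. The record shapes, field names, and service names are assumptions for illustration, not the schema of any particular tool.

```python
from collections import defaultdict


def unify_by_trace(logs, spans):
    """Group log messages and trace spans under their shared trace ID."""
    view = defaultdict(lambda: {"logs": [], "spans": []})
    for log in logs:
        view[log["trace_id"]]["logs"].append(log["message"])
    for span in spans:
        view[span["trace_id"]]["spans"].append((span["name"], span["duration_ms"]))
    return dict(view)


def slowest_span(view, trace_id):
    """Return the (name, duration_ms) of the slowest span in one trace."""
    return max(view[trace_id]["spans"], key=lambda s: s[1])


# Hypothetical records for a single checkout request.
logs = [
    {"trace_id": "t1", "message": "checkout started"},
    {"trace_id": "t1", "message": "payment provider timeout, retrying"},
]
spans = [
    {"trace_id": "t1", "name": "cart-service", "duration_ms": 40},
    {"trace_id": "t1", "name": "payments-api", "duration_ms": 1800},
]

view = unify_by_trace(logs, spans)
print(slowest_span(view, "t1"))  # ('payments-api', 1800)
print(view["t1"]["logs"])
```

Here the slowest span points at a third-party dependency, and the log line next to it explains why, which is the kind of cross-signal context that separate tools make hard to assemble.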

Takeaway: Organizations must balance tool diversity with meaningful integration to achieve actionable insights. A value-driven approach to observability ensures better monitoring, faster resolutions, and alignment with reliability and business goals.


Key Takeaways

  1. Surprising Trends in Toil Levels

The unexpected rise in toil, despite automation advancements, suggests that technology alone cannot solve operational inefficiencies. This trend underscores the importance of rethinking workflows and balancing short-term priorities with investments in long-term improvements. Toil reduction should be approached not as a one-time fix but as an ongoing effort that aligns operational practices with reliability goals.


  2. Observability Requires a Shift in Focus

While organizations often debate the number of tools in their observability stack, the real focus should be on ensuring those tools deliver meaningful value. The challenge isn’t just about reducing or adding tools—it’s about leveraging the right combination of metrics, logs, and traces to deliver actionable insights. A fresh look at how observability tools are integrated and how they align with business objectives can help organizations achieve clarity without adding unnecessary complexity.


  3. AI’s Role Is Evolving

AI presents both challenges and opportunities, but its success lies in how it’s implemented. Instead of viewing AI as a quick fix, organizations should focus on building trust in its outputs through proper oversight and training. When used effectively, AI can go beyond toil reduction to enable proactive incident response, smarter resource allocation, and greater system resilience, ultimately transforming how teams approach reliability and performance.


Conclusion: Adapting to the Future of SRE

The 2025 SRE Report reveals that the landscape of site reliability engineering continues to evolve, with performance, toil reduction, and organizational alignment taking center stage. These findings highlight the need for organizations to adopt balanced, strategic approaches that prioritize long-term resilience over short-term gains.

In Part 2, we’ll dive deeper into the human and organizational aspects of SRE, focusing on skill development, proactive incident preparedness, and addressing internal misalignments. These sections offer actionable guidance for building resilient teams, improving incident response, and aligning reliability efforts across all levels of an organization.

©Vibranium Labs - All rights reserved.