How New Relic Prioritizes Reliability to Eliminate Interruptions to Digital Experiences

By Ashley Puls

For software companies, the need for reliability is obvious.

But actually being reliable—that’s difficult.

For New Relic, one of our most important goals is to eliminate interruptions to digital experiences. Airlines, retailers, streaming giants, and other businesses rely on the New Relic observability platform to help them operate services that can scale reliably, so they can in turn deliver exceptional experiences for their users. We’re experts on keeping our own system—and our customers’ systems—up and running.

Our customers watch New Relic dashboards during their deployments and biggest moments, when traffic can surge to as much as 100x the average rate. They use New Relic alerts to notify them when systems might be down or to help pinpoint the source of an issue in their architecture. Some integrate New Relic with their continuous integration and continuous deployment (CI/CD) pipeline so they can deploy quickly and with confidence.
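
As an illustration of that last point, here is a minimal sketch of a CI/CD hook that records a deployment marker through New Relic's REST (v2) deployments endpoint. The application ID, environment variables, and payload values are placeholders; the details of any given pipeline will differ.

```python
# Sketch: post a deployment marker from a CI/CD pipeline using New
# Relic's REST (v2) deployments endpoint. APP_ID, the environment
# variables, and the payload values are placeholders.
import os

import requests

APP_ID = "123456"  # hypothetical New Relic application ID
URL = f"https://api.newrelic.com/v2/applications/{APP_ID}/deployments.json"

payload = {
    "deployment": {
        "revision": os.environ.get("GIT_SHA", "abc1234"),
        "changelog": "Deployed via CI",
        "description": "Automated deployment marker",
        "user": "ci-bot",
    }
}

resp = requests.post(
    URL,
    json=payload,
    headers={"X-Api-Key": os.environ["NEW_RELIC_API_KEY"]},
    timeout=10,
)
resp.raise_for_status()  # fail the pipeline step if the marker wasn't recorded
```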

Reliability helps executives achieve their ultimate business goal of protecting and growing revenue. At New Relic, we continuously work to raise the reliability bar; over the last year, we improved our reliability by more than 50%. To improve our own uptime, we focused on a few key areas that delivered a strong return on investment (ROI). These best practices can help companies keep their websites and systems up and running 24 hours a day, seven days a week, 365 days a year, while delivering a stable and uninterrupted experience for customers.

Pillars for reliability

Reliability is only as strong as your weakest link: any service or deployment can cause an outage on a given day. No company is immune to outages, but there are several things you can do to minimize the potential for incidents and to reduce the severity and duration of any that do occur.
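
To see why the weakest link dominates, consider a quick back-of-the-envelope calculation. The service names and availability figures below are hypothetical:

```python
# Illustrative only: the availability of a request path that depends on
# several services in series is the product of their availabilities.
# These service names and numbers are hypothetical.
services = {
    "gateway": 0.9999,
    "auth": 0.999,
    "ingest": 0.999,
    "storage": 0.9995,
    "query": 0.999,
}

composite = 1.0
for availability in services.values():
    composite *= availability

hours_down = (1 - composite) * 24 * 365
print(f"Composite availability: {composite:.4%} "
      f"(~{hours_down:.0f} hours of downtime per year)")
# Composite availability: 99.6405% (~31 hours of downtime per year)
```

Even with every individual service at "three nines" or better, the path as a whole falls well short of three nines, and the single weakest service sets the ceiling.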

Here are some steps New Relic takes to improve the reliability of its platform.

Metrics to understand trends

If you’re going to improve something, you first have to measure it. At New Relic, we use data to monitor progress and trends—including reliability.

When an incident occurs, we measure several aspects and then compile all the information in a New Relic custom event for the owning team, leader, and organization. This enables us to use New Relic dashboards to track key metrics such as mean time to detection (MTTD) and mean time to resolution (MTTR). We also track time to notify—how long it takes from the onset of an incident to when we post about it on our status page—so we can quickly inform customers about potential impact. Leaders review these dashboards on a daily or weekly cadence and take action accordingly.
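
As a rough illustration, here is how an incident summary might be pushed to the New Relic Event API as a custom event. The event type and attribute names are invented for this sketch and are not New Relic's internal schema:

```python
# Sketch: compile incident measurements into a custom event via the
# New Relic Event API. The event type and attribute names here are
# illustrative, not New Relic's internal schema.
import os

import requests

ACCOUNT_ID = "1234567"  # hypothetical account ID
URL = f"https://insights-collector.newrelic.com/v1/accounts/{ACCOUNT_ID}/events"

incident_event = {
    "eventType": "IncidentRetro",   # illustrative custom event type
    "team": "ingest-pipeline",
    "severity": 2,
    "minutesToDetect": 4,           # feeds MTTD dashboards
    "minutesToResolve": 38,         # feeds MTTR dashboards
    "minutesToNotify": 9,           # time until the status-page post
}

resp = requests.post(
    URL,
    json=[incident_event],          # the API accepts a JSON array of events
    headers={"Api-Key": os.environ["NEW_RELIC_LICENSE_KEY"]},
    timeout=10,
)
resp.raise_for_status()
```

From there, a NRQL query along the lines of SELECT average(minutesToDetect) FROM IncidentRetro FACET team SINCE 1 month ago could feed the dashboards leaders review; the exact query depends on the schema you record.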

Building resilience into our systems

No organization is flawless. Hardware will sometimes fail, and bugs will sometimes get released into production. The key is making sure that when that happens, there is minimal or no customer impact.

At New Relic, major architectural changes go through a design approval process that includes auditing for resilience. Our systems are designed to handle backpressure, leverage partitions to avoid hot spots, degrade gracefully, and scale horizontally. In fact, we run a cell architecture to reduce the blast radius of any potential failure. If there is an issue with one cell instance, traffic can be routed to other cell instances.
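
To make the cell idea concrete, here is a deliberately simplified sketch of cell-aware routing. The cell names, hash-based placement, and health-check flags are all illustrative, not a description of New Relic's actual router:

```python
# Simplified sketch of cell-based routing: customers are pinned to a
# cell (the blast radius), and traffic fails over to a healthy cell
# when theirs is down. All names and mechanisms are illustrative.
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c"]
healthy = {"cell-a": True, "cell-b": True, "cell-c": True}

def home_cell(customer_id: str) -> str:
    """Deterministically pin a customer to one cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

def route(customer_id: str) -> str:
    """Prefer the customer's home cell; fail over if it is unhealthy."""
    cell = home_cell(customer_id)
    if healthy[cell]:
        return cell
    fallbacks = [c for c in CELLS if healthy[c]]
    if not fallbacks:
        raise RuntimeError("no healthy cells")
    return fallbacks[0]

healthy["cell-b"] = False    # simulate a cell failure
print(route("customer-42"))  # traffic is routed around the unhealthy cell
```

The payoff is that a failure in one cell affects only the customers pinned to it, and even those customers can be rerouted.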

We also work to avoid single points of failure. For example, our routing layer leverages two systems in an active-active architecture: if one system goes down, traffic is automatically routed through the other.
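
Here is a toy illustration of the active-active idea, with both routers serving live traffic in steady state. The names and health bookkeeping are invented for the sketch:

```python
# Sketch of active-active routing: both systems serve live traffic, so
# losing one simply shifts its share to the other. There is no cold
# standby to warm up. Names are illustrative.
import itertools

ROUTERS = ["router-1", "router-2"]
healthy = {"router-1": True, "router-2": True}
_rr = itertools.cycle(ROUTERS)

def pick_router() -> str:
    """Round-robin across healthy routers."""
    for _ in range(len(ROUTERS)):
        candidate = next(_rr)
        if healthy[candidate]:
            return candidate
    raise RuntimeError("no healthy routers")

print(pick_router())        # router-1
print(pick_router())        # router-2 -- both carry traffic in steady state
healthy["router-2"] = False
print(pick_router())        # router-1 -- traffic shifts automatically
```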

Weekly chaos engineering to verify resilience

Chaos engineering, a practice popularized by Netflix's Chaos Monkey, intentionally breaks things to see how a system responds. In addition to standard unit and integration tests, New Relic runs weekly chaos tests to verify that our systems truly are resilient to failures. Every week, we deliberately break our pre-production system; if we find flaws, we file tickets, teams work on the fixes, and we test again a couple of weeks later.

Our chaos experiments include everything from killing a percentage of our Kubernetes pods to failing over a percentage of datastores. After each experiment, we release the executed script so that teams can easily re-run the game day for their own services. This has proven to be a great tool for tuning configurations, verifying auto-scaling algorithms, and ensuring services are resilient to common failures.
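
For a flavor of what a pod-kill experiment can look like, here is a sketch using the official Kubernetes Python client. The namespace, label selector, and kill fraction are hypothetical, and a script like this should only ever point at a pre-production cluster:

```python
# Sketch of a pod-kill chaos experiment using the official Kubernetes
# Python client. Namespace, label selector, and kill fraction are
# illustrative; run against pre-production only.
import random

from kubernetes import client, config

NAMESPACE = "staging"                  # pre-production cluster only
LABEL_SELECTOR = "app=ingest-service"  # hypothetical service label
KILL_FRACTION = 0.25                   # delete 25% of matching pods

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
if not pods:
    raise SystemExit("no pods matched the selector")

victims = random.sample(pods, max(1, int(len(pods) * KILL_FRACTION)))
for pod in victims:
    print(f"chaos: deleting {pod.metadata.name}")
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```

Failing over a percentage of datastores follows the same pattern: pick victims, break them deliberately, and watch whether alerts, auto-scaling, and failover behave as designed.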

Knowledge sharing and avoiding repeat incidents

Just like our customers, we use New Relic to help detect any degradations in our system and speed up mitigation. After restoring service, our teams hold blameless retrospectives to determine why the incident occurred and how we can prevent a similar interruption in the future. Our teams also look for patterns in our architecture and processes that contribute to the scope or length of incidents, and we prioritize work, even cross-organizationally, to correct those patterns. Incidents with valuable lessons are also brought to our weekly operations meeting so that other teams can learn without having to experience the same failure.

The bottom line? Once you have an issue, you should never have that issue again. New Relic focuses on sharing lessons and best practices within teams and across the organization to ensure issues don’t happen twice. As soon as an incident occurs, we discuss the causes and ticket any work that would prevent a similar incident.

We also use bar raisers—engineers and leaders with extensive experience in reliability—to conduct an audit to confirm whether we’ve successfully prevented the issue from recurring. All that information is brought to an operations meeting with leadership and employees across engineering where people present their team’s incidents and key takeaways.

Focus time for teams who need it

If a team is not meeting its internal service availability targets, it's important to dig in and understand why.

Sometimes the team might have been pushing to meet a deadline and failed to resolve a major system risk. In other cases, a team may not have the technical knowledge required to solve the problem. At New Relic, we investigate, then make sure the team has focused time to fix the issue holistically. We bring in reliability experts to work with the team on a go-forward plan, and we will delay new feature releases to give the team focus time. Since establishing this program, we have seen a substantial decrease in the number of teams that need focus time.

Why reliability is crucial for customers

The digital customer experience (DCX) is critical to business success, and it becomes more important every day. Outages not only damage a brand's reputation but also directly impact revenue, especially if they occur during moments of peak demand.

The 2023 Observability Forecast found that the median annual cost of high-business-impact outages was $7.75 million. Three in five (61%) respondents said critical business app outages cost at least $100,000 per hour of downtime, 32% said at least $500,000, and 21% said at least $1 million. These costs can be catastrophic for digital businesses, especially in industries like retail, transportation, finance, and media and entertainment where uptime is paramount.
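
A little arithmetic shows how quickly those figures compound. Using the report's $100,000-per-hour floor and a few illustrative availability levels:

```python
# Back-of-the-envelope math using the Observability Forecast's
# $100,000-per-hour floor for critical app outages. The availability
# levels are illustrative.
COST_PER_HOUR = 100_000
HOURS_PER_YEAR = 24 * 365

for uptime in (0.999, 0.9995, 0.9999):
    downtime_hours = (1 - uptime) * HOURS_PER_YEAR
    print(f"{uptime:.4%} uptime -> {downtime_hours:5.1f} h/yr "
          f"-> ${downtime_hours * COST_PER_HOUR:,.0f}")
# 99.9000% uptime ->   8.8 h/yr -> $876,000
# 99.9500% uptime ->   4.4 h/yr -> $438,000
# 99.9900% uptime ->   0.9 h/yr -> $87,600
```

Each additional fraction of a nine is worth hundreds of thousands of dollars a year at that floor, and far more for businesses at the $500,000 or $1 million per hour end of the survey.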

Reliability is and will continue to be a top priority for New Relic. We depend on New Relic for our own observability; we understand that consistent uptime is mission critical. By following our lead, customers can deploy with confidence knowing our system—and theirs—will stay up and running.
