




New Relic SLI Query Builder





My role: Product design lead
My task: UX / UI design; cross-functional alignment with engineering on technical constraints
Team: Product manager, core engineers, content designer




At a glance...



The challenge



For engineers and SREs, defining a CDF (cumulative distribution function)-based SLI requires advanced knowledge of query syntax, metrics, and attribute schemas.

This technical prerequisite becomes an obstacle that leads to a high abandonment rate in the SLI setup process. The SLI query builder aims to solve this problem for our users.


 

Design strategy



A context-aware query builder that surfaces only valid options, auto-fills deterministic parameters, and warns before users save a broken SLI.

The impact



Increased service level adoption: the CDF-based query builder filled a technical gap in the service level creation flow, with a $2M CRR impact on the business.

Reduced SLI misconfiguration, contributing to 1,500+ beta users and $1.52M MRR at GA.









The problem


Where teams got stuck

Setting up a custom SLI in New Relic means writing NRQL queries from scratch, and to do that you need to know the metric type, the right function, and your attribute names. Here are some of the details users need to know when writing NRQL:

  • Understanding what a cumulative distribution function (CDF) metric is and how it differs from an event-based metric
  • Knowing which function to use — getCdfCount, getField, count, or sum — and when
  • Knowing their own attribute names within their telemetry schema
  • Choosing a threshold value that is statistically meaningful for their service

Think of it like building a filter for a spreadsheet you've never seen before. You know what you want to find, but you don't know the column names, the data format, or which operators apply.
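To make that concrete, here is roughly what a hand-written "good events" query looks like. This is an illustrative sketch: the metric name, entity GUID, and 0.5s threshold are hypothetical placeholders, and exact syntax varies by account schema.

```sql
// Counts responses that completed within 0.5s for a CDF-based latency SLI.
// 'myService.responseTime' and the entity GUID are placeholder values.
SELECT getCdfCount(myService.responseTime, 0.5)
FROM Metric
WHERE entity.guid = 'MY_SERVICE_GUID'
```

Every token in that query is a decision the user had to get right on their own before the builder existed.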

Customer interviews and usage data told a consistent story. Most users who hit the blank query form gave up before finishing. Those who pushed through often got it wrong in a subtle way — configuring their good and bad event definitions identically, which made the SLI show 100% compliance at all times. The SLI looked correct but measured nothing.




Definition



Jobs-to-be-Done

Rather than writing for a generic persona, I focused on specific outcomes real users needed:
  • SREs: When I'm setting an SLO for a service, I want to define an SLI without needing to know NRQL syntax, so I can get a reliability baseline in minutes, not hours
  • Developers: When I'm a developer who owns a service but isn't a reliability expert, I want to understand what each function choice means in plain language, so I can make an informed decision without consulting documentation
  • Developers: When I'm configuring good vs. bad event thresholds, I want to be warned if my definitions will produce meaningless data, so I can catch mistakes before they corrupt my SLO compliance metrics




Technical constraints

Engineering had already defined how each function worked under the hood — getCdfCount needed two inputs from the user, while getField needed one (the second was always the same, so we hard-coded it). But the bigger challenge wasn't the technical constraints — it was the range of users we had to serve.

Experienced SREs who were comfortable writing NRQL by hand still needed that freedom. So the goal was to make the experience more approachable without taking away control from users who already knew what they were doing.








Understand the current setup flow

To understand why users dropped off during custom query setup, I mapped out every decision a user had to work through to produce a valid SLI query. The number of decisions involved surprised us.





Because query syntax is contingent (each valid choice depends on the ones before it), users are far better served by a guided structure than by a blank form.

What became clear was that every input depended on the one before it: the query has to be built in a strict order.

That made the direction obvious: the form couldn't be flat. Each answer unlocked the next question, so the UI had to work the same way. It also revealed exactly where the system could do the work instead of the user: anywhere the answer was deterministic, we could pre-fill it.








Error-proofing guardrail

In our usage data, we started noticing a pattern: some SLIs were reporting compliance values above 100% — which is statistically impossible. The system was already surfacing a warning on these accounts (shown below), but users weren't sure what had gone wrong or how to fix it.

When we investigated, the cause was consistent: users had defined their good and bad event queries identically. The SLI was measuring the same thing twice, so it always reported 100% or even higher compliance — regardless of what the service was actually doing. That's what led to the pre-save warning in the builder. The goal was to catch it at the moment of configuration — before any misleading data was ever generated.
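Sketched with hypothetical metric names, the failure mode looks like this: when the good and bad event queries are identical, the good-to-total ratio is always 1, so compliance pins at (or above) 100%.

```sql
// Good events
SELECT getCdfCount(myService.responseTime, 0.5) FROM Metric
// Bad events: identical to the good query, so the SLI measures nothing
SELECT getCdfCount(myService.responseTime, 0.5) FROM Metric
```

The pre-save check compares the two definitions before the SLI is ever persisted, so no misleading compliance data is generated.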














Design approach
A query is essentially a series of decisions. The design strategy was to surface those decisions one at a time, in the right order, with only valid options at each step.



1. Show only what's relevant


When a user selected a metric type, the builder immediately filtered the available functions to only those valid for that context. getCdfCount only appeared for distribution metrics. This removed the need to know which functions apply where — the system knew.

















2. One decision at a time


The query builder adjusted to the user's input at each step: conditional logic at work. Choose getField and you'd get one field to fill in; the second parameter is always count, so we pre-filled it. Choose getCdfCount and two parameters are required: attributeName and thresholdValue. Either way, the form showed exactly what each function needed, nothing more.
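With hypothetical metric names, the two resulting query shapes differ only in what the user must supply:

```sql
// getField: user picks the metric; the second parameter is always
// 'count', so the builder pre-filled it
SELECT getField(myService.responseTime, count) FROM Metric

// getCdfCount: user supplies both attributeName and thresholdValue
SELECT getCdfCount(myService.responseTime, 0.5) FROM Metric
```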












3. Build confidence through transparency


As users filled in each field, the query and chart updated side by side in real time. The chart did the reassurance work, offering a quick visual confirmation that the data looked right, so users didn't need to understand the NRQL to feel confident in what they'd configured.








4. Prevent errors before they happen


The builder detected four specific conditions where a user's good and bad event definitions would produce identical results, making the SLI statistically meaningless. If we detected a conflict, a warning would appear — telling the user exactly what was wrong.

In observability, a broken SLI is worse than no SLI — it gives you false confidence. The warning was there to make sure what users saved was actually meaningful, not just technically valid.







The final design









What I learned



  • Transparency builds confidence. Showing the generated query in real time, even when users didn't need to read it, gave them a sense of control. They could see the system was doing what they intended. That trust is what makes users commit to a configuration they didn't write themselves.

  • Schema-awareness is the real leverage point in query UX. Showing only valid options for a given context reduces cognitive load better than any tooltip.

  • "Easy to use" and "hard to get wrong" are different problems. The builder made configuration easier. The warning system made it safer. Both were necessary — the first gets users to a result, the second ensures the result is trustworthy.
      





•  •  •





Business impact


  • Increased service level adoption: the CDF-based query builder filled a technical gap in the service level creation flow, with a $2M CRR impact on the business.

  • Reduced SLI misconfiguration, contributing to 1,500+ beta users and $1.52M MRR at GA.











About & contact

© 2026  I-Chieh Pan