




New Relic SLI Query Builder





My role: Product design lead
My task: UX / UI design; cross-functional alignment with engineering on technical constraints
Team: Product manager, core engineers, content designer




At a glance...



The challenge



For engineers and SREs, defining a CDF (cumulative distribution function)-based SLI requires advanced knowledge of query syntax, metrics, and attribute schemas.

This technical prerequisite becomes an obstacle that leads to a high abandonment rate in the SLI setup process. The SLI query builder aims to solve this problem for our users.


 

Design strategy



A context-aware query builder that surfaces only valid options, auto-fills deterministic parameters, and warns before users save a broken SLI.

The impact



Increased service level adoption: the CDF-based query builder filled a technical gap in the service level creation flow, with a $2M CRR impact on the business.

Reduced SLI misconfiguration, contributing to 1,500+ beta users and $1.52M MRR at GA.









The problem


Where teams got stuck

Setting up a custom SLI in New Relic means writing NRQL queries from scratch, and to do that you need to know the metric type, the right function, and your attribute names. Here are some of the details users need to know when writing NRQL:

  • Understanding what a cumulative distribution function (CDF) metric is and how it differs from an event-based metric
  • Knowing which function to use — getCdfCount, getField, count, or sum — and when
  • Knowing their own attribute names within their telemetry schema
  • Choosing a threshold value that is statistically meaningful for their service

Think of it like building a filter for a spreadsheet you've never seen before. You know what you want to find, but you don't know the column names, the data format, or which operators apply.
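To make that concrete, here is roughly what a hand-written "good events" query looks like. This is an illustrative sketch: the metric name, entity GUID, and 0.5s threshold are hypothetical placeholders, and exact syntax varies by account schema.

```sql
// Counts responses that completed within 0.5s for a CDF-based latency SLI.
// 'myService.responseTime' and the entity GUID are placeholder values.
SELECT getCdfCount(myService.responseTime, 0.5)
FROM Metric
WHERE entity.guid = 'MY_SERVICE_GUID'
```

Every token in that query is a decision the user had to get right on their own before the builder existed.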

Customer interviews and usage data told a consistent story. Most users who hit the blank query form gave up before finishing. Those who pushed through often got it wrong in a subtle way — configuring their good and bad event definitions identically, which made the SLI show 100% compliance at all times. The SLI looked correct but measured nothing.




Definition



Jobs-to-be-Done

Rather than writing for a generic persona, I focused on specific outcomes real users needed:
  • SREs: When I'm setting an SLO for a service, I want to define an SLI without needing to know NRQL syntax, so I can get a reliability baseline in minutes, not hours
  • Developers: When I'm a developer who owns a service but isn't a reliability expert, I want to understand what each function choice means in plain language, so I can make an informed decision without consulting documentation
  • Developers: When I'm configuring good vs. bad event thresholds, I want to be warned if my definitions will produce meaningless data, so I can catch mistakes before they corrupt my SLO compliance metrics




Technical constraints

Engineering had already defined how each function worked under the hood — getCdfCount needed two inputs from the user, while getField needed one (the second was always the same, so we hard-coded it). But the bigger challenge wasn't the technical constraints — it was the range of users we had to serve.

Experienced SREs who were comfortable writing NRQL by hand still needed that freedom. So the goal was to make the experience more approachable without taking away control from users who already knew what they were doing.








Understand the current setup flow

To understand why users dropped off during custom query setup, I mapped out every decision a user had to work through to produce a valid SLI query. The number of decisions involved surprised us.





Because query syntax is contingent (each valid choice depends on the ones before it), users are far better served by a guided structure than by a blank form.

What became clear was that every input depended on the one before it: the query has to be built in a strict order.

That made the direction obvious: the form couldn't be flat. Each answer unlocked the next question, so the UI had to work the same way. It also revealed exactly where the system could do the work instead of the user: anywhere the answer was deterministic, we could pre-fill it.








Error-proofing guardrail

In our usage data, we started noticing a pattern: some SLIs were reporting compliance values above 100% — which is statistically impossible. The system was already surfacing a warning on these accounts (shown below), but users weren't sure what had gone wrong or how to fix it.

When we investigated, the cause was consistent: users had defined their good and bad event queries identically. The SLI was measuring the same thing twice, so it always reported 100% or even higher compliance — regardless of what the service was actually doing. That's what led to the pre-save warning in the builder. The goal was to catch it at the moment of configuration — before any misleading data was ever generated.
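Sketched with hypothetical metric names, the failure mode looks like this: when the good and bad event queries are identical, the good-to-total ratio is always 1, so compliance pins at (or above) 100%.

```sql
// Good events
SELECT getCdfCount(myService.responseTime, 0.5) FROM Metric
// Bad events: identical to the good query, so the SLI measures nothing
SELECT getCdfCount(myService.responseTime, 0.5) FROM Metric
```

The pre-save check compares the two definitions before the SLI is ever persisted, so no misleading compliance data is generated.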














Design approach
A query is essentially a series of decisions. The design strategy was to surface those decisions one at a time, in the right order, with only valid options at each step.



1. Show only what's relevant


When a user selected a metric type, the builder immediately filtered the available functions to only those valid for that context. getCdfCount only appeared for distribution metrics. This removed the need to know which functions apply where — the system knew.

















2. One decision at a time


The query builder adjusted to the user's input at each step: conditional logic at work. Choose getField and you'd get one field to fill in; the second parameter is always count, so we pre-filled it. Choose getCdfCount and two parameters are required: attributeName and thresholdValue. Either way, the form showed exactly what each function needed, nothing more.
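With hypothetical metric names, the two resulting query shapes differ only in what the user must supply:

```sql
// getField: user picks the metric; the second parameter is always
// 'count', so the builder pre-filled it
SELECT getField(myService.responseTime, count) FROM Metric

// getCdfCount: user supplies both attributeName and thresholdValue
SELECT getCdfCount(myService.responseTime, 0.5) FROM Metric
```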












3. Build confidence through transparency


As users filled in each field, the query and chart updated side by side in real time. The chart did the reassurance work, offering a quick visual confirmation that the data looked right, so users didn't need to understand the NRQL to feel confident in what they'd configured.








4. Prevent errors before they happen


The builder detected four specific conditions where a user's good and bad event definitions would produce identical results, making the SLI statistically meaningless. If we detected a conflict, a warning would appear — telling the user exactly what was wrong.

In observability, a broken SLI is worse than no SLI — it gives you false confidence. The warning was there to make sure what users saved was actually meaningful, not just technically valid.







The final design









What I learned



  • Transparency builds confidence. Showing the generated query in real time, even when users didn't need to read it, gave them a sense of control. They could see the system was doing what they intended. That trust is what makes users commit to a configuration they didn't write themselves.

  • Schema-awareness is the real leverage point in query UX. Showing only valid options for a given context reduces cognitive load better than any tooltip.

  • "Easy to use" and "hard to get wrong" are different problems. The builder made configuration easier. The warning system made it safer. Both were necessary — the first gets users to a result, the second ensures the result is trustworthy.
      





•  •  •





Business impact


  • Increased service level adoption: the CDF-based query builder filled a technical gap in the service level creation flow, with a $2M CRR impact on the business.

  • Reduced SLI misconfiguration, contributing to 1,500+ beta users and $1.52M MRR at GA.











About & contact

© 2026  I-Chieh Pan