Serverless Chats

Episode #90: Full-Stack Observability with the New Relic Explorer with Buddy Brewer

Mar 1 '21

About Buddy Brewer

Buddy Brewer is the Field CTO for New Relic in the Americas. In this role, he helps customers get long-term value out of New Relic. Buddy has over 20 years of experience leading engineering and product management teams building tools to help developers and operations professionals deliver better digital experiences. A former entrepreneur in the observability space, Buddy has helped companies across every geography and industry in the world improve their software’s speed, quality, and user experience.
LinkedIn: https://www.linkedin.com/in/bbrewer/
Twitter: @bbrewer
Personal Website: BuddyBrewer.com
New Relic Free Tier: https://newrelic.com/signup
New Relic Explorer: https://newrelic.com/platform/full-stack-observability

Watch this video on YouTube: https://youtu.be/Y4n3fE8g9Ec

This episode is sponsored by New Relic.

Transcript

Jeremy: Hi everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm chatting with Buddy Brewer. Hey Buddy, thanks for joining me.

Buddy: Hey Jeremy. Thanks for having me.

Jeremy: You are a Field CTO at New Relic so I'd love it if you could tell the listeners a little bit about yourself and what's new with New Relic.

Buddy: Yeah. Been with New Relic for a couple years now and in this Field CTO role I get to spend lots of time with our customers to help them get long-term value out of our observability platform. I'm an engineer by trade. Started my career as a software developer like many of our customers in New Relic. Spent substantially all of my career in product development in various capacities. Engineering, leading engineering teams, product management. And like I said, now I spend most of my time with customers helping them tackle their own observability challenges in their businesses. We're doing a lot right now with New Relic to help people make sense out of the volume of data and to help people pull all of the different types of metrics, events, logs, and traces that go into all this observability into views that they can actually use to help their customers get better experiences in a world where software architectures are just ... They're just becoming more complex by the month.

Jeremy: Right. Well, awesome. First of all, I want to thank New Relic for sponsoring this episode and for the amazing amount of support that they give to us here at Serverless Chats and what we do. So thank you very much for that. Now, you mentioned these tools that you're working on to be able to observe modern applications. And the new tool that was recently launched is the New Relic Explorer. I've looked at this thing. This is absolutely fascinating. It does all kinds of really great things. But I'd love it if you could tell the listeners a little bit more about that product.

Buddy: Yeah. It's part of our full stack observability product in the New Relic One platform. So it's an in-place upgrade that everyone who uses full stack observability today gets. And what it does is it takes all of the information across all of the different dimensions that people are used to seeing in New Relic One, it pulls them together into new views that help people make sense at a macro level of what's going on in the health of their software across all of the dimensions that matter today. So infrastructure, front end, the application logic. All of that stuff in single views. And there's another part of New Relic Explorer that helps people understand in realtime what the key changes are that are happening in a way that requires zero configuration, which is really important to our customers today because the software architectures and the underlying containers and everything that serve those are changing so fast that people just don't have time to manually configure things today like they used to be able to.

Jeremy: Yeah, right. And one of the things too with cloud infrastructures, you've got all this telemetry data coming in from all these different places and most of the time ... I mean, I know at least what I had been doing is using a bunch of different dashboards and basically jumping between different things trying to figure out what's healthy, what's not healthy. And I love these new views that are in the New Relic Explorer because it actually shows you the changing ... If a problem is getting worse and worse and worse, it gives you this growing bubble. So these visualizations are really, really helpful. So I think that's Lookout right? That does that?

Buddy: That's right. Yeah, that's Lookout. The way that I think of Lookout is imagine if you could take something like the Unix diff command and apply it to all of your telemetry data comparing now versus any point in the past. Whereas the Unix diff command is a text console rendering, what Lookout does is it renders all of this in a visual display in a web browser so that you can see ... Like you said, you had these bubbles that really display two dimensions at the same time. The volume of data, whatever it is that you're looking at for a piece of data. A lot of people use this to visualize changes in errors or throughput or latency but it could also be order volume or really any metric that you want. That's the first dimension. And then the second dimension is the magnitude of changes. Right?

Jeremy: Right.

Buddy: What it helps you do is to zero-in, not just on the things that are red ... Because in environments where folks have thousands, or even tens of thousands for some of our enterprise customers, containers running on any given day, the nature of that design and the fault tolerance inherent in that architecture ensures that on any given day there's going to be stuff that's red. Right?

Jeremy: Right.

Buddy: So if a customer calls in about a problem, you log in, you see some things that are red. Well, some of that stuff was red yesterday. What Lookout helps you do is to focus specifically on those things that changed from healthy to not healthy around the same time as a customer-impacting problem. And then you can see all of the different pieces that also correlate to those changes so you could pull it all out and focus just on the things that matter.
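
Illustration (not part of the conversation): the "diff for telemetry" idea Buddy describes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how Lookout is implemented. The entity names, the error counts, the `Change` class, the `diff_telemetry` function, and the 50% threshold are all hypothetical. It compares a baseline window against the current window and keeps only the entities whose signal moved, so something that was already red yesterday and has not changed drops out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Change:
    entity: str
    before: float
    now: float

    @property
    def volume(self) -> float:
        # First dimension: how much of the signal there is right now.
        return self.now

    @property
    def relative_change(self) -> float:
        # Second dimension: magnitude of change versus the baseline window.
        if self.before == 0:
            return float("inf") if self.now > 0 else 0.0
        return (self.now - self.before) / self.before


def diff_telemetry(
    baseline: dict[str, float],
    current: dict[str, float],
    min_change: float = 0.5,  # hypothetical threshold: 50% movement
) -> list[Change]:
    """Return entities whose metric moved by at least min_change, biggest movers first."""
    changes = []
    for entity in sorted(set(baseline) | set(current)):
        change = Change(entity, baseline.get(entity, 0.0), current.get(entity, 0.0))
        if abs(change.relative_change) >= min_change:
            changes.append(change)
    return sorted(changes, key=lambda c: abs(c.relative_change), reverse=True)


# Hypothetical error counts per entity: yesterday versus the last hour.
yesterday = {"checkout-api": 4, "search": 120, "cart": 10}
today = {"checkout-api": 40, "search": 118, "cart": 10}

changed = diff_telemetry(yesterday, today)
for change in changed:
    print(change.entity, change.volume, change.relative_change)
```

In this made-up data, "search" has the most errors in absolute terms, but it looked the same yesterday, so only "checkout-api" is reported as changed.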

Jeremy: Yeah. That's super helpful because, again, it's one of those things where ... I mean, I've worked as an SRE in the past and you get these constant errors sometimes that keep coming up and they're just kind of there. But sometimes it's the severity of the errors. It's how bad was it yesterday versus how bad is it today? Of course, we wouldn't leave a problem that long. But seeing those changes over time and seeing that growing bit of it, I think is just incredibly helpful from that sort of global view standpoint.

And then the other thing that's part of this, which I think is another really cool representation, is the Navigator piece. And this basically uses a red, yellow, and green sort of ... What is it? A hexagonal or an octagon or something like that. But basically shows these little blocks that show you what's healthy and what's not healthy and then you can dive down into each one of those to see more detail.

Buddy: That's right. And what we did was we designed that view to pack an order of magnitude more information density into a screen compared to the view that we had prior to this. Those views continue to be part of the product, but again, this New Relic Explorer is an in-place upgrade that everyone gets that you can use in addition to all the things that you already have with New Relic. But the first piece is that order of magnitude more information density on a screen. The other thing that it does is it summarizes all of this in a way that you can use it as ... You think of it like the new mission control for New Relic. So for your game day dashboard on the major event that you knew was coming and you wanted to be 100% situationally aware about everything going on in your software when that happens. Whether it's a major advertising campaign that you expect to bring a lot of traffic to your site or it's a big calendar event or major event in the news if you're media or something like that. That mission control that allows you to see everything in a single view.

And then we have some flexibility where you can customize that view or create multiples of them that align to specific teams. We call those workloads. So you can take the different workloads that are running in your architecture, all of its constituent pieces, from the front end components, the back end components, the infrastructure, all of it, you can visualize in a single view that aligns to specific teams that work on them. So everyone can get their own tailored mission control. And then another piece of all of this ... Because everything that I've talked about so far has been this extremely high altitude look at what's going on in the software, which you need. But as soon as you notice something that requires you to take action, you very quickly need to switch to a lower altitude. So one of the things that we built into New Relic Navigator is this concept of related entities.

An entity for us is any component in your architecture that makes your software go. Whether it's a Docker container or some other piece of infrastructure or it's a web application that's built in JavaScript or it's back end logic written in Node or Go or PHP or anything. All of those individual discrete components, we call those entities. They all have a health condition, an alert status. They can have events, logs, and traces that are associated with all of them. But one of the things that is a really critical piece of data that we have at New Relic that this release helps to expose is the relationships between all of those. So it's not just a simple linear relationship or even a tree structure. It's a connected graph of all of these different components. So if one thing is red, the thing that you click on might not be the root cause. And the impact of it being red might not be limited to just the pieces that are connecting to that piece. There could be a number of other services that depend on it that are being upstream or downstream impacted.

So every time you click on something we show you all of the upstream and downstream relationships so you can follow it. What you used to have to do is say I'm going to go see what's happening in the application tier, and then you sort out what you can sort out and then you had to pop out of that and then go into a different view and look at all of your front end and just stitch this together in your head. In the worst case, developers responding to problems had to load up five or six different tabs in their web browser and click back and forth between all of these different things. The related entities, New Relic Explorer, the hexagons in Navigator, all of that stuff is designed to help people with that problem so that you can go straight to what the root cause is by just navigating the shortest path through that graph instead of having to pop out and start your search over again.
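
Illustration (not part of the conversation): a minimal sketch of the "connected graph of entities" idea, assuming a tiny hand-written dependency map. The entity names and the `DEPENDS_ON` structure are hypothetical, and the choice that "downstream" means "things this entity depends on" is an assumption made for the example. Nothing here reflects New Relic's actual data model. It shows how upstream and downstream relationships fall out of one graph, and how a breadth-first search gives the shortest path between two entities.

```python
from collections import deque

# Hypothetical entities; an edge A -> B means "A calls (depends on) B".
DEPENDS_ON = {
    "web-frontend": ["checkout-api", "search-api"],
    "checkout-api": ["payments", "orders-db"],
    "search-api": ["search-index"],
    "payments": ["orders-db"],
    "orders-db": [],
    "search-index": [],
}


def _reach(start, graph):
    """Collect every node reachable from start by following edges in graph."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def downstream(entity):
    """Entities this one depends on, directly or transitively."""
    return _reach(entity, DEPENDS_ON)


def upstream(entity):
    """Entities that depend on this one, directly or transitively."""
    reverse = {name: [] for name in DEPENDS_ON}
    for caller, callees in DEPENDS_ON.items():
        for callee in callees:
            reverse[callee].append(caller)
    return _reach(entity, reverse)


def shortest_path(start, goal):
    """Breadth-first search for the shortest dependency path from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in DEPENDS_ON[path[-1]]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# If orders-db turns red, who could be affected, and how does the front end reach it?
affected = upstream("orders-db")
path = shortest_path("web-frontend", "orders-db")
print(sorted(affected), path)
```

In this toy graph a red "orders-db" can affect "payments", "checkout-api", and "web-frontend", even though the front end never talks to the database directly.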

Jeremy: Right. Yeah. And I wish I only had to open four or five tabs. I mean, you see it's a lot more than that. And you're searching through logs and trying to find that. Now, the other thing that's really cool ... And there are a lot of distributed tracing products out there now and it's very, very cool, where you can go and see how data is moving through different components in your applications, which is really, really helpful. But what's crazy, I think, about this visualization in Navigator and Lookout and everything that New Relic has done, it gives you the ability to connect, like you said, through that graph multiple services that might be sharing things or different data coming from different places. And all of that stuff is instrumented pretty much automatically. I mean, depending on which service you're using. But all of that stuff ... It's not like you have to go in and instrument all of these little tiny bits. This data's just being collected, these traces are being done. And then this really cool service just visualizes all of it for you.

Buddy: That's right. Full stack observability, which is where this new functionality lives. Like I mentioned, this isn't a new product, it's an enhancement to an existing one. Full stack observability exists on top of another part of our platform which we call the Telemetry Data Platform. We've been building that for so many years. We only in last July exposed it as its own product. Priced really simply just on ingest. And we have a free tier by the way that anybody can sign up for and you can ingest 100 gigabytes per month for free with no charge. And one of the great things about the Telemetry Data Platform is it's a high volume, scalable place to put all of your telemetry data, agnostic to whether it's in a metric or an event or a log or a trace. You can just put it all into this single platform. Which is the first step that you have to do if you ever want to have a shot at tearing down all these silos between all the different pieces of information.

So take traces for an example. Having the Telemetry Data Platform enables us to do things like show logs in context. Because the logs are in the same data store as all of the trace data. So if you click on a trace, and even if you click on a span inside of the trace, then if you generated any log events that happened just during the context of that span of that trace, we can display it inline. What you used to have to do is you had to go into a different tab in your web browser and start over again and maybe try to use some kind of a trace ID or span ID or something like that that you hope was also indexed in your logging tool and that you could go find it there. Having all of it inside of one data store means that if you're looking at something that's a particular type of data like a trace, you can see other types of related data like logs in the same context and in the same view.
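
Illustration (not part of the conversation): a minimal sketch of "logs in context", assuming spans and log records sit in one place and both carry a trace ID and a span ID. The record shapes, field names, and helper functions below are hypothetical and are not New Relic's schema or API. The point is only that once both kinds of data share the same identifiers and the same store, finding the logs for a span is a direct lookup instead of a search in a second tool.

```python
from collections import defaultdict

# Hypothetical records; field names are illustrative only.
spans = [
    {"trace_id": "t1", "span_id": "s1", "name": "GET /checkout"},
    {"trace_id": "t1", "span_id": "s2", "name": "charge-card"},
]
logs = [
    {"trace_id": "t1", "span_id": "s2", "message": "card declined"},
    {"trace_id": "t1", "span_id": "s1", "message": "request received"},
    {"trace_id": "t9", "span_id": "s7", "message": "unrelated"},
]


def index_logs(log_records):
    """Group log messages by the (trace_id, span_id) pair they were emitted under."""
    index = defaultdict(list)
    for record in log_records:
        index[(record["trace_id"], record["span_id"])].append(record["message"])
    return index


def logs_for_span(span, index):
    """Return the log messages emitted during the given span, if any."""
    return index.get((span["trace_id"], span["span_id"]), [])


log_index = index_logs(logs)
charge_logs = logs_for_span(spans[1], log_index)
print(spans[1]["name"], charge_logs)
```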

Jeremy: Right. Yeah. And I think an important question would be the simplification of this. Everybody wants things to be simpler and have these really simple views and good ways to represent and visualize their data. But one of the things is that if you were an SRE or you were an ops person in the past, you probably were familiar with all these different tools that you were using and you knew exactly what you needed to see and how different things ran and stuff like that. But it's not just ops people or SREs or people who are just always worried about the infrastructure that are impacted now by a lot of these changes because I think we've made a big shift in the way that we develop applications and who's responsible for the lifecycle of those applications. I'd love to talk about that evolving role. I guess we would call them modern developers maybe? That modern developers building for the cloud and building these complex systems. What kinds of responsibilities do you think have shifted to them?

Buddy: It's interesting how the nature of the role of software development has changed. I started my career 100% front end developer. And specifically building tools for front end developers to reason about the health of the front end experience. And then you had back end developers. And you could meet someone at these networking events and talk about which parts of the stack that you work on. But those lines are fading. There was a report that came out last year. I think it was UBS. It compared how many developers identify as different types of developers, front end engineer, back end engineer, et cetera. And the thing that was remarkable about it is specifically people who identified as a full stack engineer, 55% of the respondents identified as full stack engineers. So more than half.

Now, in 2015, five years ago, it was only 29%. So it's the majority and also the fastest growing cohort of engineering role. And that's how you end up with the situation we were talking about before where you've got so many tabs open in your browser is because all of these tools have been built for specific slices of the application architecture. Logging tools, front end analysis tools, back end analysis tools, infrastructure analysis tools. All of that. Full stack observability, the product that we offer with New Relic aims at being a full stack analysis tool. So again, in that single tab you can see the relationships between all of these different tiers. And we did that specifically in response to what we saw as this broader trend, both from the analysts but also talking to our own customers and realizing ... New Relic's been in this business now for 13 years. Started in 2008. So we've seen a lot of this evolution firsthand among our customers and they were asking us for this. They wanted simpler views that connected all of these different pieces together because increasingly what happens, somebody gets a notification that there's a problem that they have to solve and it's not just in a slice of the architecture.

They're on a team that is designed to do everything that it takes to deliver a particular part of the customer experience. So if something goes wrong with that experience, whether it's in the infrastructure tier or the application tier or the front end tier, they're accountable to finding it and fixing it. So we're building tools to help people do that better.

Jeremy: Yeah. And I think that's interesting in terms of that evolution where even when ... Let's say AWS started with EC2s and things like that, the virtual machines, back in 2008, 2007, somewhere around there. You started building applications that way and I think you had very traditional ops people setting up the networking for people and setting up an EC2 and I don't think a lot of people were doing CI/CD, at least not like they are now. So you take that code that a developer would write and someone would set up that instance or that environment for you to dump the code into. And then as we moved towards things like containers, developers are now responsible for packaging their own containers and requiring the resources they need or the packages they need, things like that. And then moving even further down the line to serverless where in most cases you don't even have an ops person involved right?

Buddy: Yeah.

Jeremy: I mean, there's nothing for them to set up sometimes. So that change in how we're building applications ... Do you think that that change of how developers are getting closer and closer to the infrastructure, that that's sort of one of those things that prompts a need for this full stack observability?

Buddy: Oh yeah. Yeah, for sure. And it's affecting everyone. It has crossed the chasm. This is not just cloud native startups that are adopting this. Substantially every large enterprise that I talk to, and I speak to usually multiple per week, are somewhere along this journey of cloud migration. And that includes shifting workloads from data centers or traditional monoliths decomposing into microservices, moving from data centers into cloud. Orchestrating all of this with containers on Kubernetes. And increasingly across the board, cloud native and traditional enterprise, like you said, moving to serverless because there's certain economies and efficiencies that you get out of being able to take advantage of that layer of abstraction that companies of all sizes and across all industries ... Not just gaming and super high tech and media and commerce but also financial services and travel. Just everybody is moving toward this and adopting it and they're looking for tools that can help reason about the connections between all of these different pieces that they're now responsible for.

There's another component to this that we have been and continue to work hard on at New Relic, which was a point that you touched on earlier. This notion of when you have so many components that you have to manage and all of these things are moving and changing so rapidly, you don't have time to undertake these expensive manual tasks to create all of this instrumentation. So we've been for over a year now progressively opening up our platform to accept other types of third-party data, not just our own agent technology that we've been building since 2008, but things like Prometheus and OpenTelemetry. You can just point exporters at our endpoints and make it easier to get that data on board. Taking all of our instrumentation logic and making it easy to wrap that in automatable frameworks like Terraform scripts and stuff so you could actually build observability in as code and deploy it at scale.

We've been talking about New Relic Explorer which is our release that we're talking about today that's all about the visualization. But it's enabled by a tremendous amount of work that we've done and continues to be underway to simplify the instrumentation. Because as the architectures themselves become more complex, it obviously gets harder and harder to keep up. Frankly, a lot of our customers have issues where they can't instrument fast enough to keep up with the change that's happening in their infrastructure. So as a result they have all of these dark areas of their application that are critical to delivering the experiences to their customers, but they don't have observability into what's going on inside of it. So we've been doing a lot of work to give people tools to get leverage on that problem too so that they can add instrumentation to all the pieces that matter.

Jeremy: Right. And I know that New Relic has done a ton of work on instrumenting Lambda functions for example. Like being able to instrument these things where you can't necessarily run the agents. And I know there's been a lot of really cool innovations in the serverless space around some of that stuff. But I'm curious just from a developer perspective, and maybe you have some experience here of seeing some of your customers do this, how much is your average developer who's maybe building a cloud application working on a team, how much is that developer actually going in and using these observability tools to see what's going on? Is that something where they need to be heavily involved in that or are you still seeing a good separation between the ops team in that regard?

Buddy: It's evolving. We're seeing it change in a couple of dimensions. Developers use observability data and monitoring tools far more than they did years ago. Although, there's always been a segment who needed to do that. New Relic, one of the things that we're known for as a company is being the monitoring platform that is the most developer-friendly. Our CEO, Lou, is a programmer at heart who still writes code on the weekends even as the CEO of a public company. It's part of our DNA. And I think we've always had a natural affinity through that to the types of adopters who are developers, who both write the code and they deploy the code. What we're seeing is, and as our company has grown, that cohort of developers who are responsible for both writing the code and deploying and managing it are exploding in size and scale and the number of those people that are out there. So we've just been riding that wave, if you will, of developers who continue to be responsible for how all of that stuff actually materializes. And it's now becoming essentially the standard way of operating, like I mentioned before, not just for cloud native startups but at large enterprises as well.

And that blur between ops and dev is fading to the point that it's really difficult to see as we sit here in 2021. That's on the role side. One of the other things that's changing, I think, that's really interesting about just the way people are using telemetry data is the set of use cases that it's relevant to. Historically all of this data about what's happening in your software, the canonical use case for when you need that is when something's on fire. Right?

Jeremy: Right.

Buddy: The mean time to resolution. How fast can I get a problem solved? How quickly can I take something that's red and turn it green? It's the classic use case for New Relic and for any other tool in this space. But what we're seeing that's evolving is people are using this telemetry data outside of that context more and more frequently as part of their day-to-day software development. For example, how do you choose where to tune and target your reduction of technical debt? If we can present observability to you that helps you understand not just the parts that are the slowest ... Because sometimes things are slow but they're asynchronous, they don't matter, or whatever. But what are the parts that are the slowest that are actually impacting customer experiences in a way that damages your business or damages your brand? So we're seeing people use that telemetry data. Nothing's on fire. But they want to use it in order to better plan and prioritize their development work.

Or another example is we've seen a lot of development in this field of chaos engineering. And testing resiliency not just by looking at the data and evaluating the architecture or doing things like load tests, but actually intentionally breaking things and then seeing how the system reacts in response to that. A tool like New Relic Navigator is really good for being able to see ... Or actually Lookout might even be the best of the features that we're releasing now that help people with this. Where you can spot these changes that maybe you didn't anticipate so you didn't set up threshold alerting or something on. But you can in realtime see how all of the different pieces of your application change when you go in and you test the resilience of your system by breaking it.

So we're seeing continued convergence of the roles, which brings more and more folks into looking at observability data. But we're also seeing a widening of the number of use cases beyond just the traditional fire fighting. It's a cliché, but it's true. As more and more businesses essentially become digital businesses, the data that describes how your digital experiences are working is taking on more and more strategic value to those companies, at least the forward-thinking ones. And so they're looking for ways to leverage that data in new and creative ways beyond traditional fire fighting.

Jeremy: Yeah. I want to talk to you about resiliency and a little bit about chaos engineering, but before we move on from this role, I'm really curious, you've been doing this for quite some time and as you see this evolve, where's the line for developers? How far should we push them down this getting into the ops role? I know we said it's very blurred, but is it at setting up their own automation? Is it at doing networking or actually touching infrastructure? How far do you think a modern developer really needs to go down that path?

Buddy: Well, it's different for different organizations and there's not a single pattern that you can apply. It's a complicated enough problem that everybody kind of needs to tailor their approach to fit the dynamics of the environment that they operate in. Sometimes things like regulatory compliance come into play and all the rest of that. But I think probably at the highest altitude, the broad trend is we're seeing developers become accountable by default to all of it until they reach the point where they can trade off the management to a third party like a cloud provider. So for example, the networking and things like that. If you can trade that off to your cloud provider, but when it comes to defining all of the infrastructure let's do that in code in an immutable way so that I can automate and deploy and do all of that and handle it as a developer. So developer and operations, we've been saying this for over 10 years now, but it's gotten to the point now as I sit here in 2021 where companies of all sizes and across all industries ... Increasingly I go in and I talk to people and you used to kind of ... In the prep, it's like is this going to be a meeting with the development group or is this the operations group?

We just don't talk about it that way anymore. People are accountable to all of it. There might be specialties within the teams where maybe somebody has a time bias. They spend a little bit more of their time in one category versus the other. But the fact is that most people I talk to today, they've got some level of accountability across all of that stuff. Which is one of the reasons why ... There's only so much that a human being can do at any given time.

Jeremy: Right. Exactly.

Buddy: So the way that a lot of people are gaining leverage on that is by trading some of those pieces off to third parties like the cloud providers.

Jeremy: Yeah. No, I think that makes a ton of sense. Another thing I think ... You mentioned something about complexity in there. And one of the things we're seeing quite a bit of now, which is a very popular way to develop software, is to go down the microservices route and get rid of those old monoliths. So as people are building more microservices and you have multiple teams, which means they probably don't always follow the same standards and some might be written in different languages, some might be running in different environments, the complexity of the data that's coming from that and all of that information, trying to organize all of it. This is just one of those things where something like New Relic I think captures ... It kind of captures that perfectly right? Where it's like you've got all this chaos and you try to make sense of it. So just your thoughts on microservices and the role of some of these tools now to make sense of all that data.

Buddy: Yeah. It's been, I think, really great for engineering teams who've moved to these architectures that it allows them to decouple things and move faster. As more and more businesses move their revenue toward digital they necessarily have to scale up their headcount of people in engineering which means you now have an organizational problem of how do you keep all of these people in this increasingly large organization productive without creating so many dependencies on each other that they all grind to a halt. So microservices are really great for that. Of course, it's also true that the magnitude of data and complexity that engineers are responsible for and accountable to is scaling at a rate that's faster than headcount. So engineers are having to take on more today and they're having to do more with less. But microservices help them at least manage the dependencies versus the old monolith architecture. This is the reason why organizations are moving away from monoliths and toward microservices today.

It does, like you said, create a whole new set of problems for observability platforms like New Relic One to solve for. The relationships, there's orders of magnitude more of these relationships, orders of magnitude more components. All of that data has to be tracked. It all has to be managed and visualized in a way that allows folks to look at all of this at multiple altitudes so that you can see what's happening overall. The mission control kind of thing that we were talking about earlier. But it can't stop there. You have to be able to get very quickly to the individual metrics, events, the logs, the traces, and all of that stuff that are happening right around where a problem comes up. For New Relic, it's required us, in order to continue to serve our customers in the face of all this change, to change almost everything about our platform. We went from having probably a dozen different discrete products ... We had a real user monitoring product, a synthetic monitoring product, a mobile app monitoring, APM, infrastructure, logs. We had all of these different discrete products. In order to keep pace with all this and continue to serve our customers, like we talked about earlier with the evolving role of the engineer toward full stack responsibilities, we had to bring all of that together into a single product.

That change happened because of changes that are happening in the organizations that we serve and in the broader application architectures. Another massive change that we made is we decoupled ... Well, we stopped counting hosts for one thing. We used to price all of this by units that just don't really make sense anymore in the modern era. How do you count up how many hosts that you've got in a world where it's going to be different an hour from now? So we stopped. We switched to ... You do know how many engineers you have. So full stack observability is priced by the seat. And then we decoupled the data. Because, like I said a minute ago, the amount of data that organizations are having to manage is scaling at a rate that's much faster than their headcount. So we took the data and we actually carved that out as a separate thing in our Telemetry Data Platform. Priced it very aggressively. 25 cents a gigabyte and then we give people 100 gigabytes a month for free if they sign up for our free tier. So that you can, in an economically feasible way, track all of this data across all of these different services.

That goes back to the point that I was making a few minutes ago about the big problem that a lot of organizations have is that they just don't have observability across all of their application. There's a couple of reasons for that. One, if the instrumentation is too complex they can't instrument fast enough to keep up and so we're working on that. We've done a number of things to make things simpler and we continue to make investments there. The other is sometimes it's just not economically feasible. So we built our whole pricing model and packaging around making it actually feasible for people to be able to generate all this telemetry. Now, of course, once it shows up in the database, it's incumbent on us to help our customers make sense of all of that, hence things like New Relic Explorer which we're talking about today.
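As a rough illustration of the data pricing Buddy describes (25 cents a gigabyte, with 100 gigabytes a month free), here is a minimal Python sketch of the arithmetic. It assumes the free allowance is simply subtracted from the month's ingest before billing, and it leaves out per-seat charges entirely; the function and constant names are hypothetical and not part of any New Relic API.

```python
# Figures quoted in the episode; actual pricing may differ.
FREE_GB_PER_MONTH = 100.0   # free monthly ingest allowance
PRICE_PER_GB_USD = 0.25     # price per gigabyte of ingested data


def monthly_data_cost(ingested_gb: float) -> float:
    """Estimate one month's ingest cost in USD.

    Assumption (not stated in the episode): only gigabytes beyond the
    free allowance are billed. Seat licenses are not included.
    """
    billable_gb = max(0.0, ingested_gb - FREE_GB_PER_MONTH)
    return billable_gb * PRICE_PER_GB_USD


if __name__ == "__main__":
    for gb in (50, 100, 500):
        print(f"{gb} GB ingested -> ${monthly_data_cost(gb):.2f}")
```

Under these assumptions, anything at or below 100 GB a month costs nothing, and 500 GB comes to 400 billable gigabytes at 25 cents each.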

Jeremy: Yeah. And that's one of the things I was going to say. Abstraction is very hard. When you're trying to abstract anything it's hard to find the right level. So how do you approach all of this data without oversimplifying it?

Buddy: Yeah. It's hard. We do some of the things that you would expect. We have, and we've always had, curated views that are informed by our 13 years of experience helping thousands of companies manage their own data. We also happen to be a provider of software at fairly large scale. So we have a lot of experience living in the same problem space obviously as our customers do. So we work hard to give people out-of-the-box views that help them understand what's going on in their software. And then of course we've got the ability for you to create custom dashboards like you would expect. We have a query language that allows you to interact directly with the high cardinality events that we store on behalf of our customers. Not everyone can do this, but every organization that we work with usually has a small number or sometimes a lot, but usually at least a few power users who understand the query language and can get in there with a scalpel and pull exactly what they need out.

The thing that we do that I think is unique to New Relic, though, among observability platforms, we also added a programmability layer about a year and a half ago. And what that allows you to do is to move beyond just dragging and dropping widgets from the palette to create custom dashboards and it moves beyond query languages and working with raw data toward the ability to actually write your own code in ReactJS, interact with our data model using GraphQL. So standards that lots of people know. And you can build truly tailored bespoke visualizations. So we've had customers do everything from combine operational data with weather data, geographic data for people who have physical points of presence in stores and things like that. You can build your own. We also have an app catalog and an ecosystem where you can go in and you can install things.

So that's how we manage it at New Relic. We try to bring a point of view. Every company in the world who has this problem, and it's a common problem, it's a balancing act that you will never be done with. You're always working on it. But we try really hard to provide people with out-of-the-box curated views that allow them to be immediately productive. But at the same time affording the flexibility, not just through custom dashboards and things like that, but actually a platform you can build applications on top of so that you can visualize any sort of way that you want but not make that ecosystem so convoluted to navigate and all of that stuff that it's impossible to find the pieces that solve 80% of the problem. So you can imagine if we took everything that anybody ever did custom and we made all of those first-class objects in the system, it would be such a huge haystack that you wouldn't be able to find the pieces to solve 80% of the problem. So we promote that to people as part of our out-of-box experience. But then we give you the flexibility if you want to, and many of our long-time customers have adopted this, to create truly custom applications to see the data exactly how you need to see it.

Jeremy: Right. Yeah. And I think when you have all this data coming in and you're collecting metrics and logs and traces, that's great to be able to look at all of that stuff independently but you just want to get to that root cause analysis. You want to be able to figure out what that root cause was and be able to jump in. So having those predefined views, I always find those to be helpful because if you just gave me like, "Hey, here's all the data. Just set up the alerts and the graphs and everything that you want," that usually doesn't get you very far until you can spend days and days and days digging into that. So having that top-level stuff and letting you dig in, I think, is a really good way to approach it.

Buddy: Yeah. Like I said, we try to bring that point of view. A little bit of a sidebar from our core discussion but for those in your audience who are interested in sort of historical trivia you may recall that New Relic got its start in 2008 building ... Our founder, Lou, built an APM product on top of Ruby on Rails, which was setting the world on fire back in 2008. Twitter was based on Rails. Some very major apps were based on Rails. One of the things that was a very defining characteristic of Rails was that it had a very strong point of view. You do not put your controllers in that directory, you put your controllers in this directory.

I think some of New Relic's early design intent, given the fact that it was built from within that Rails community, was to start with a point of view so that people could get productive as quickly as possible. It's one of the things people loved about Rails was you got that really fast out-of-box experience. So over time, as our business has grown ... And of course, now New Relic does way more than Ruby on Rails, although you still can monitor your Rails application with New Relic. You have all of this additional flexibility. But where the company got its start and one of the things that we were known for really early on was that developer productivity that came from bringing a point of view of once you get the ... Drop in the instrumentation and you log in and you immediately have insights. So we try to hold onto that even as we give people more tools to create all these custom applications and everything on top.

Jeremy: Yeah. And I think an opinionated approach to certain things with some flexibility, sometimes it can steer you in the wrong direction but you're right, it just gets people productive so much faster.

All right. I want to go back to the resiliency and some of that chaos engineering stuff because it seems like Lookout and Navigator, these are those perfect tools like you said for doing those chaos days or things like that. So what are some of your thoughts on building resiliency into these systems and how can New Relic Lookout and the other services underneath that, how can that help you make sure that you're building resilient systems?

Buddy: Yeah. A lot of what goes into New Relic Navigator and Lookout is having the ability to see in realtime what's changing in your application. So like I said, many of our early adopters for example, when we first started opening this up to a small set of customers before we reached our general availability launch of these features, in addition to using the new features for the production events that were coming from real customer traffic and things like that, they were also using it as a way to reason about what was happening in their software architecture when they're intentionally making changes. That was another one of the core use cases that we saw people using this for. And in particular with Lookout, because it's zero config, it's really good at helping people reason about these unknown unknowns in software which is one of the differentiating characteristics that gave rise to this notion of observability in the first place.

A lot of the distinctions that people draw between monitoring and observability is that monitoring was defined by this characteristic of, "I know all of the failure modes, I'm going to instrument them all with threshold-based alerting, and then I want to get a page when something breaks." And in observability, one of the defining characteristics of it was in contrast to the way that people used to do things. More and more of the way that software fails today, oftentimes it fails in a unique fashion because of all of the ... There are more variables in the equation anymore than you can count because of microservices and all that stuff. So you have to have a model that allows you to see what is changing and not just where the changes are but actually direct you toward the ones that are causing a customer impact without relying on you having analyzed all of the possible failure modes in advance.

So since Lookout isn't reliant on prior configuration or thresholds, it's just looking for changes and then correlating all of those changes to each other so you can see where the clusters are. Which is a lot of the problem space that people are solving in AIOps, for example. This is exploratory and realtime, versus a lot of AIOps, which is about sending you notifications, which we also do. But this is about seeing it when you're actually logged in and exploring what's happening in your software. Makes it highly useful for those situations when you're doing chaos engineering. And it's something that we're seeing. We're far from a state where everybody's doing that today. But again, we're seeing a lot of growth and increasingly companies who are solving for those types of use cases and it was one of the things that we designed New Relic Lookout to help people do.
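The contrast Buddy draws between threshold-based alerting and looking for changes can be sketched in a few lines of Python. This is only an illustration of the general idea, not how New Relic Lookout works: the z-score comparison against recent history is an assumed stand-in technique, it does not attempt the correlation step he mentions, and all names here are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def threshold_alert(value: float, limit: float) -> bool:
    """Classic monitoring: fire only when a pre-configured limit is crossed."""
    return value > limit


def deviates_from_baseline(
    history: Sequence[float], value: float, z_limit: float = 3.0
) -> bool:
    """Change detection with no configured limit.

    Flags a value that sits more than z_limit sample standard deviations
    away from the mean of the recent history (an assumed, simplified rule).
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline  # flat history: any move is a change
    return abs(value - baseline) / spread > z_limit


if __name__ == "__main__":
    recent_latency_ms = [100, 102, 98, 101, 99]  # made-up sample data
    new_latency_ms = 150
    print("threshold alert:", threshold_alert(new_latency_ms, limit=500))
    print("deviates:", deviates_from_baseline(recent_latency_ms, new_latency_ms))
```

With the made-up numbers above, a jump from roughly 100 ms to 150 ms stays well under a 500 ms threshold, so the static alert never fires, while the baseline comparison flags it as a change worth looking at.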

Jeremy: Yeah. Well, I think if you are at the point where you need to start doing chaos engineering, you probably have a lot of applications and a lot of services talking to one another. And I think just convincing some team, "Hey, by the way, we are going to break something in production to test it," if you're going to do that and you can actually convince some team members to let you do that, you better have a pretty good tool that's going to be able to capture and be able to observe what is actually breaking. And especially even if you have to revert quickly, at least be able to see the history of that and be able to go in and see okay, when this broke this particular service was no longer responding or something like that. So, are there any surprising things you found as people started adopting this stuff?

Buddy: Yeah. One of the things that I thought was most interesting when I was going through our feedback from our early access program ... We designed this for engineers to use. For people who are in the work every single day, to help them do their jobs better. It's common for us to see managers and directors and executives engage in our telemetry data but it's almost always rolled up to a summary that people are using to track things like SLOs and SLAs and maybe correlate that to some sort of a business outcome like conversion rate for a commerce company or ad impressions for media or something like that. Things that are at a higher altitude. One of the surprising things that we saw with the early access program for New Relic Explorer was we saw a use case where a manager who historically was unable because of all of the stuff that we've talked about so far, all of the complexity and everything, it's impossible for them to reason about it and do all of their other responsibilities as a manager.

So their job typically was air traffic control to get managerial leverage on a larger problem. So it's like, "Here's something going on over here. I'm going to send this to the person on my team who's responsible for it, ask them to look into it as part of their job as day-to-day manager." What we found in the early access program ... One of our use cases in specific that comes to mind was someone who hadn't actually rolled up their sleeves and done the root cause analysis in quite a while. Because the complexity required and everything, just didn't have time to do it. And he discovered an issue using New Relic Explorer. But before sending it on to the person on their team responsible for it, they went ahead and clicked in and said, "Let me just see if I can figure out what's going wrong here."

And for the first time in a long time they were actually able to perform root cause analysis on the thing and send it directly to the engineer outside of their team who was responsible for doing the work to actually file the ticket and fix it and all that stuff without having to task it out to an individual on their team to do the investigation. So it probably saved them, what? A day? Two days maybe? At least a day.

Jeremy: That's a lot of time.

Buddy: Yeah. So that was something that we didn't necessarily expect was that it was going to unlock the ability of folks who don't ordinarily do root cause analysis and detailed work to be able to actually navigate to what the root cause was in a way that they'd never been able to do before. It was actually one of the more, I think for me personally, hugely validating points that we had achieved what we had set out to in terms of making an interface that people could use efficiently. When not only the people in your target audience, but also people who weren't necessarily in your target audience were still able to diagnose a problem in realtime because the connections were there in the right place. I mean, the data's always been there. Collecting data's not hard. What's hard is making it all accessible at scale and connecting all of it and delivering insights. Not just piling a bunch of data into a data lake somewhere or something like that.

So when we got that story back that someone was able to, who doesn't ordinarily do this day-to-day, actually get in and diagnose a problem, it was surprising and it was also really validating for us and for the team.

Jeremy: Yeah. And I think that's amazing. No matter what level you are at, whether you're a developer or you're a manager or you're somewhere in between, not only do you reduce mean time to recovery and you can find those problems faster and figure out what the issue is, but that saves a lot of time. I can't tell you how many times I spent days looking through logs and all kinds of things trying to figure out exactly why every 100th time this thing runs something goes wrong. And being able to go and trace that and find that information quickly saves you time, saves you money, saves you mental anguish, I would think, for a lot of these things. So that's pretty cool.

Buddy: Yeah. Sure. We thought so.

Jeremy: Awesome. All right. Well, listen Buddy, I really appreciate you being here and sharing all this stuff about the New Relic Explorer and the New Relic One platform. So if people want to find out more about you, ask you some questions maybe, or they want to find out or sign up for New Relic One and use this new New Relic Explorer, how do they do that?

Buddy: Yeah. Well, for me personally, I'm most active these days on LinkedIn of the social platforms so you can find me there. Just Buddy Brewer. I'll be the one that pops up working at New Relic. And for New Relic, like I mentioned earlier, we have a free tier that is really easy and really the best place for someone to get started who's had no exposure to New Relic. You just go on our website. It's up at the top right. Click on sign up. And what you'll get is 100 gigabytes a month of ingest that you can put into the New Relic platform and one seat license for all of this stuff that we just talked about today. So you can actually ingest 100 gig of your own data every month and just go use New Relic Explorer and all the other parts of full stack observability.

Jeremy: Awesome. And you can find that at newrelic.com. Thanks again, Buddy.

Buddy: Thanks, Jeremy.
