ALTERNATE UNIVERSE DEV

The Bike Shed

296: Speedy Performance with Nate Berkopec

Nate Berkopec is the author of the Complete Guide to Rails Performance, the creator of the Rails Performance Workshop, and the co-maintainer of Puma. He talks with Steph about being known as "The Rails Speed Guy," and how he ended up with that title, publishing content, working on workshops, and also contributing to open source projects. (You could say he's kind of a busy guy!)

Transcript:

STEPH: All right. I'll kick us off with our fancy intro. Hello and welcome to another episode of The Bike Shed, a weekly podcast from your friends at thoughtbot about developing great software. I’m Steph Viccari. And this week, Chris is taking a break. But while he's away, I'm joined by Nate Berkopec, who is the owner of Speedshop, a Ruby on Rails performance consultancy. And, Nate, in addition to running a consultancy, you're the co-maintainer of Puma. You're also an author as you wrote a book called The Complete Guide to Rails Performance. And you run the workshop called The Rails Performance Workshop. So, Nate, I'm sensing a theme here.

NATE: Yeah, make code go fast.

STEPH: And you've been doing that for quite a while, haven't you?

NATE: Yeah. It's pretty much been since 2015, or so I think. It all started when I actually wrote a blog post about Turbolinks that got a lot of pick up. My hot take at the time was that Turbolinks is actually a good thing. That take has since become uncontroversial, but it was quite controversial in 2015. So I got a lot of pick up on that, and I realized I liked working on performance, and people seem to want to hear about it. So I've been in that groove ever since.

STEPH: When you started down the path of really focusing on performance, were you running your own consultancy at that point, or were you working for someone else?

NATE: I would say it didn't really kick off until I actually published The Complete Guide to Rails Performance. So after that came out, which was, I think, March of 2016…I hope I’m getting that right. It wasn't until after that point when it was like, oh, I'm the Rails performance guy now. And I started getting emails inbound about that. I didn't really have any time when I was actually working on the CGRP to do that sort of thing. I just made that my full-time job to actually write, and market, and publish that. So it wasn't until after that that I was like, oh, I'm a performance consultant now. This is the lane I've driven myself into. I don't think I really had that as a strategy when I was writing the book. I wasn't like, okay, this is what I'm going to do. I'm going to build some reputation around this, and then that'll help me be a better consultant with this. But that's what ended up happening.

STEPH: I see. So it sounds like it really started more as a passion and something that you wanted to share. And it has manifested to this point where you are the speed guy.

NATE: Yeah, I think you could say that. I think when I started writing about it, I just knew...I liked it. I liked the work of performance. In a lot of ways, performance is a much more concrete discipline than a lot of other sub-disciplines of programming where I joke my job is number go down. It's very measurable, and it's very clear when you've made a difference. You can say, “Hey, this number was this, and now it's this. Look what I did.” And I always loved that concreteness of performance work. It makes it actually a lot more like a real kind of engineering discipline where I think of performance engineering as clarifying requirements and the limitations and then building a project that meets the requirements while staying within those limitations and constraints. And that's often not quite as clear for other disciplines like general feature work. It's kind of hard to say sometimes, like, did you actually make the user's life better by implementing such and such? That's more of a guess. That's more of a less clear relationship. And with performance, nobody's going to wake up ten years from today and wish that their app was slower. So we can argue about the relative importance of performance in an application, but we don't really argue about whether or not we made it faster because we can prove that.

STEPH: Yeah. That's one area that working with different teams (as I tend to shift the clients that I'm working with every six months) where we often push hard around feature work to say, “How can we measure this? How can we know that we are delivering something valuable to users?” But as you said, that's really tricky. It's hard to evaluate. And then also, when you add on the fact that if I am leaving that project in six months, then I don't have the same insights to understand how something went for that team. So I can certainly appreciate the satisfaction that comes from knowing that, yes, you are delivering a faster app. And it's very measurable, given the time that you're there, whether it's a short time or if it's a long time that you're with that team.

NATE: Yeah, totally. My consulting engagements are often really short. I don't really do a lot of super long-term stuff, and that's usually fine because I can point to stuff and say, “Yep. This thing was at A, and now it's at B. And that's what you hired me to do, so now it's done.”

STEPH: I am curious; given that you have so many different facets where you are running your consultancy, you are also often publishing a lot of content and working on workshops and then also contributing to open source projects. What does a typical week look like for you?

NATE: Well, right now is actually a decent example. I have client work two or three days a week. And I'm actually working on a new product right now that I'm calling Sidekiq in Practice, which is a course/workshop about scaling Sidekiq from zero to 1000 jobs per second. And I'll spend the other days of the week working on that. My content is...I always struggle with how much time to spend on blogging specifically because it takes so much time for me to come up with a post and publish that. But the newsletter that I write, which I try to write two once a week, I haven't been doing so well with it lately. But I think I got 50 newsletters done in 2020 or something like that.

STEPH: Wow.

NATE: And so I do okay on the per-week basis. And it's all content I've never published anywhere else. So that actually is like 45 minutes of me sitting down on a Monday and being like rant, [chuckles] slam keyboard and rant and then hit send. And my open source work is mostly 15 minutes a day while I'm drinking morning coffee kind of stuff. So I try to spread myself around and do a lot of different stuff. And a lot of that means, I think, pulling back in terms of thinking how much you need to spend on something, especially with newsletters, email newsletters, it was very easy to overthink that and spend a lot of time revising and whatever. But some newsletter is better than no newsletter. And especially when it comes to content and marketing, I've learned that frequency and regularity is more important than each and every post is the greatest thing that's ever come out since sliced bread. So trying to build a discipline and a practice around doing that regularly is more important for me.

STEPH: I like that, some newsletter is better than no newsletter. I was listening to your chat with Brittany Martin on the Ruby on Rails podcast. And you said something very honest that I appreciated where you said, “Writing is really hard, and writing sucks.” And that made me laugh in the moment because even though I do enjoy writing, I still find it very hard to be disciplined, to sit down and make it happen. And then you go into that editor mode where you critique everything, and then you never really get it published because you are constantly fixing it. It sounds like...you've mentioned you set aside about 45 minutes on a Monday, and you crank out some work. How do you work through that inner critic? How do you get past it to the point where then you just publish?

NATE: You have to separate the steps. You have to not do editing and first drafting at the same time. And the reason why I say it sucks and it's hard is because I think a lot of people don't do a lot of regular writing, maybe get intimidated when they try to start. And they're like, “Wow, this is really hard. This is not fun.” And I'm just trying to say that's everybody's experience and if it doesn't get any better, because it doesn't, [chuckles] there's nothing wrong with you, that's just writing, it's hard. For me, especially with the newsletter, I just have to give myself permission not to edit and to just hit send when I'm done. I try to do some spell checking,, and that's it. I just let it go. I'm not going back and reading it through again and making sure that I was very clear and cogent in all my points and that there's a really good flow through that newsletter. I think it comes with a little bit of confidence in your own ideas and your own experience and knowledge, believing that that's worth sharing and that's worth somebody's time, even if it's not a perfect expression of what's in your head. Like, a 75% expression is good enough, especially in a newsletter format where it's like 500 to 700 words. And it's something that comes once a week. And maybe not everyone's amazing, but some of them are, enough of them are that people stay subscribed. So I think a combination of separating editing and first drafting and just having enough confidence and the basis where you have to say, “It doesn't have to be perfect every single time.”

STEPH: Yeah, I think that's something that I learned a while back to apply to my coding process where I had to separate those two steps of where I have to let the creator in me just create and write some code and make it work, and then come back to the editing process, and taking a similar approach with writing. As you may be familiar with thoughtbot, we're big advocates when it comes to sharing content and sharing things that we have learned throughout the week and different projects that we're working on. And often when people join thoughtbot, they're very excited to contribute to the blog. But it is daunting for that first post because you think it has to be this really grand novel. And it has to be something that is really going to appeal to everybody, and it's going to help everyone. And then over time, you learn it's like, oh well, actually it can be this very just small thing that I learned that maybe only helps 20 people, but it still helped those 20 people. And learning to publish more frequently versus going for those grand pieces is more favorable and often more helpful for people.

NATE: Yeah, totally. That's something that is difficult for people at first. But everything in my experience has led me to believe that frequency and regularity is just as, if not more important than the quality of any individual piece of content that I put out. So that's not to say that...I guess it's weird advice to give because people will take it too far the other way and think that means he's saying quality doesn't matter. No, of course, it does, but I think just everyone's internal biases are just way too tuned towards this thing must be perfect. I've also learned we're just really bad judges internally of what is useful and good for people. Stuff that I think is amazing and really interesting sometimes I'll put that out, and nobody cares. [chuckles] And the other stuff I put out that's just like the 45-minute banging out newsletter, people email me back and say, “This is the most helpful thing anyone’s ever read.” So that quality bias also assumes that you know what is good and actually we're not really good at that, knowing every time what our audience needs is actually really difficult.

STEPH: That's totally fair. And I have definitely run into that too, where I have something that I'm very proud of and excited to share, and I realize it relates to a very small group of people. But then there's something small that I do every day, and then I just happen to tweet about it or talk about it, and suddenly that's the thing that everybody's really excited about. So yeah, you never know. So share it all.

NATE: Yeah. And it's important to listen. I pay attention to what people get interested in from what I put out, and I will do more of that in the future.

STEPH: You mentioned earlier that you are working on another workshop focused on Sidekiq. What can you tell me about that?

NATE: So it's meant to be a guide to scaling Sidekiq from zero to 1000 requests per second. And it's meant to be a missing guide to all the things that happen, like the situations that can crop up operationally when you're working on an application that does a lot of work with Sidekiq. Whereas Mike Sidekiq, Wiki, or the docs are great about how do, you do this? What does this setting mean? And the basics of getting it just running, Sidekiq in practice, is meant to be the last half of that. How do you get it to run 1,000 jobs per second in a day-to-day application? So it's the collected wisdom and collected battle scars from five years of getting called in to fix people's Sidekiq installations and very much a product of what are the actual problems that people experience, and how do you fix and deal with those? So stuff about memory and managing Sidekiq memory usage, how to think about queues. Like, what should your queue structure be? How many should you have? Like, how do you organize jobs into queues, and how do you deal with problems like some client is dropping 10,000, 20,000 jobs into a queue. And now the other jobs I put in that queue have 20,000 jobs in front of them. And now this other job I've got will take three hours to get through that queue. How do you deal with problems like that? All the stuff that people have come to me over the years and that I've had to help them fix.

STEPH: That sounds really great. Because yeah, I find that teams who are often in this space with Sidekiq we just let it run until there's a fire. And then suddenly, we start to care as to how it's processing, and we care about our queue structure and how many workers that we have that are pulling from that queue. So that sounds really helpful. When you're building a workshop, do you often go back to any of those customers and pull more ideas from them, or do you find that you just have enough examples from your collective work with clients that that itself creates a course?

NATE: Usually, pretty much every chapter in the workshop I've probably implemented like three-plus times, so I don't really have to go back to any individual customer. I have had some interesting stuff with my current client, Gusto. And Gusto is going through some background job reorganization right now and actually started to implement a lot of the things that I'm advocating in the workshop actually without talking to me. It was a good validation of hey, we all actually think the same here. And a lot of the solutions that they were implementing were things that I was ready to put down into those workshops. So I'd like to see those solutions implemented and succeed. So I think a lot of the stuff in here has been pretty battle-tested.

STEPH: For the Rails Performance Workshop, you started off doing those live and in-person with teams, and then you have since switched to now it is a CLI course, correct?

NATE: That's correct. Yep.

STEPH: I love that very much. When you’ve talked about it, it does feel very appropriate in terms of developers and how we like to consume content and learn. So that is really novel and also, it seems like a really nice win for you. So then other people can take this course, but you are no longer the individual that has to deliver it to their team, that they can independently take the course and go through it on their own. Are you thinking about doing the same thing for the Sidekiq course, or what are your plans for that one?

NATE: Yeah, it's the exact same structure. So it's going to be delivered via the command line. Although I would say Sidekiq in practice has more text components. So it's going to be a combination of a very short manual or book, and some video, and some hands-on exercises. So, an equal blend between all three of those components. And it's a lot of stuff that I've learned over having to teach; I guess intermediate to advanced programming concepts for the last five years now that people learn at different paces. And one of the great things about this kind of format is you can pick it up, drop it off, and move at your own speed. Whereas a lot of times when I would do this in person, I think I would lose people halfway through because they would get stuck on something that I couldn't go back to because we only had four hours of the day. And if you deliver it in a class format, you're one person, and I've got 24 other people in this room. So it's infinitely pausable and replayable, and you can go back, or you can just skip ahead. If you've got a particular problem and you're like, hey, I just want to figure out how to fix such and such; you can do that. You can just come in and do a particular thing and then leave, and that's fine. So it's a good format that way. And I've definitely learned a lot from switching to pre-recorded and pre-prepared stuff rather than trying to do this all live in person.

STEPH: That is one of the lessons that I've learned as well from the couple of workshops that I've led is that doing them in person, there's a lot of energy. And I really enjoy that part where I get to see people respond to the content. And then I get a lot of great feedback from people about what type of questions they have, where they are getting stuck. And that part is so important to me that I always love doing them live first. But then you get to the point, as you'd mentioned, where if you have a room full of 20 people and you have two people that are stuck, how do you help them but then still keep the class going forward? And then, if you are trying to tailor this content for a wide audience…so maybe beginners could take the Rails Performance Workshop, or they could also take the Sidekiq course. But you also want the more senior engineers to get something out of it as well. It's a very challenging task to make that content scale for everyone.

NATE: Yeah. What you said there about getting feedback and learning was definitely something that I got out of doing the Rails Performance Workshop in person like three dozen times, was the ability to look over people's shoulders and see where they got stuck. Because people won't email me and say, “Hey, this thing is really confusing.” Or “It doesn't work the way you said it does for me.” But when I'm in the same room with them, I can look over their shoulder and be like, “Hey, you're stuck here.” People will not ask questions. And you can get past that in an in-person environment. Or there are even certain questions people will ask in person, but they won't take the time to sit down and email me about. So I definitely don't regret doing it in person for so long because I think I learned a lot about how to teach the material and what was important and how people...what were the problems that people would encounter and stuff like that. So that was useful. And definitely, the Rails Performance Workshop would not be in the place that it is today if I hadn't done that.

STEPH: Yeah, helping people feel comfortable asking questions is incredibly hard and something I've gone so far in the past where I've created an anonymous way for people to submit questions. So during class, even if you didn't want to ask a question in front of everybody, you could submit a question to this forum, and I would get notified. I could bring it up, and we could answer it together. And even taking that strategy, I found that people wouldn't ask questions. And I guess it circles back to that inner critic that we have that's also preventing us from sharing knowledge that we have with the world because we're always judging what we're going to share and what we're going to ask in front of our peers who we respect. So I can certainly relate to being able to look over someone's shoulder and say, “Hey, I think you're stuck. We should talk. Let me walk you through this or help you out.”

NATE: There are also weird dynamics around in-person, not necessarily in a small group setting. But I think one thing I really picked up on and learned from RailsConf2021 which was done online, was that in-person question asking requires a certain amount of confidence and bravado that you're not...People are worried about looking stupid, and they won't ask things in a public or semi-public setting that they think might make them look dumb. And so then the people that do end up asking questions are sometimes overconfident. They don't even ask a question. They just want to show off how smart they are about a particular issue. This is more of an issue at conferences. But the quality of questions that I got in the Q&A after RailsConf this year (They did it as Discord chats.) was way better. The quality of questions and discussion after my RailsConf talk was miles better than I've ever had at a conference before. Like, not even close. So I think experimenting with different formats around interaction is really good and interesting. Because it's clear there's no perfect format for everybody and experimenting with these different settings and different methods of delivery has been very useful to me.

STEPH: Yeah, that makes a ton of sense. And I'm really glad then for those opportunities where we're discovering that certain forums will help us get more feedback and questions from people because then we can incorporate that and to future conferences where people can speak up and ask questions, and not necessarily be the one that's very confident and enjoys hearing their own voice. For the Rails Performance Workshop, what are some of the general things that you dive into for that workshop? I'm curious, what is it like to attend that workshop? Although I guess one can't attend it anymore. But what is it like to take that workshop?

NATE: Well, you still can attend it in some sense because I do corporate bookings for it. So if you want to buy 20 seats, then I can come in and basically do a Q&A every week while everybody takes the workshop. Anyway, I still do that. I have one coming up in July, actually. But my overall approach to performance is to always start with monitoring. So the course starts with goals and monitoring and understanding where you want to go and where you are when it comes to performance. So the first module of the Rails Performance Workshop is actually really a group exercise that's about what are our performance requirements and how can we set those? Both high-level and low-levels. So what is our goal for page load time? How are we going to measure that? How are we going to use that to back into lower-level metrics? What is our goal for back-end response times? What is our goal for JavaScript bundle sizes? That all flows from a higher-level metric of how fast you want the page to load or how fast you want a route to change in a React app or something, and it talks about those goals. And then where should you even start with where those numbers should be? And then how are you going to measure it? What are the browser events that matter here? What tools are available to help you to get that data? Because without measurement, you don't really have a performance practice. You just have people guessing at what stuff is faster and what is not. And I teach performance as a scientific process as science and engineering. And so, in the scientific method, we have hypotheses. We test those hypotheses, and then we learn based on those tests of our hypotheses. So that requires us to A, have a hypothesis, so like, I think that doing X makes this faster. And I talk about how you generate hypotheses using profiling, using tools that will show you where all the time goes when you do this particular operation of your software—and then measuring what happens when you do that? And that's benchmarking. So if you think that getting rid of method X or changing method X will speed up the app, benchmarking tells you did you actually speed it up or not? And there are all sorts of little finer points to making sure that that hypothesis and that experiment is tested in a valid way.

I spend a lot of time in the workshop yapping about the differences between development/local environments and production environments and which ones matter. Because what differences matter, it's not often the ones that we think about, but instead it's differences like actually in Rails apps the asset packaging and asset pipeline performs very differently in production than it does in development, works very differently. And it makes it one of the primary reasons development is slower than production, so making sure that we understand how to change those settings to more production-like settings. I talk a lot about data. It’s the other primary difference between development and production is production has a million users, and development has 10. So when you call things like User.all, that behavior is very different in production than it is locally. So having decent production-like data is another big one that I like to harp on in the workshops. So it's a process in the workshop of you just go lesson by lesson. And it's a lot of video followed up by hands-on exercises that half of them are pre-baked problems where I'm like, hey, take a look at this Turbolinks app that I've given you and look at it in DevTools. And here's what you should see. And then the other half is like, go work on your application. And here are some pull requests I think you should probably go try on your app. So it's a combination of hands-on and videos of the actual experience going through it.

STEPH: I love how you start with a smaller application that everyone can look at and then start to learn how performant is this particular application that I'm looking at? Versus trying to assess, let’s say, their own application where there may be a number of other variables that they have to consider. That sounds really nice. You'd mentioned one of the first exercises is talking about setting some of those goals and perhaps some of those benchmarks that you want to meet in terms of how fast should this page load, or how quickly should a response from the API be? Do you have a certain set of numbers for those benchmarks, or is it something that is different for each product?

NATE: Well, to some extent, Google has suddenly given us numbers to work with. So as of this month, I think, June 2021, Google has started to use what they're calling Core Web Vitals in their ranking of search results. They've always tried to say it's not a huge ranking factor, et cetera, et cetera, but it does exist. It is being used. And that data is based on Chrome user telemetry. So every time you go to a website in Chrome, it measures three metrics and sends those back to Google. And those three metrics are Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS). And First Input Delay and Cumulative Layout shift are more important for your single-page apps kind of stuff. It's hard to screw those up with a Golden Path Rails app that just does Turbolinks or Hotwire or whatever. But Largest Contentful Paint is an easy one to screw up. So Google's line in the sand that they've drawn is 2.5 seconds for Largest Contentful Paint. So that's saying that from clicking on your website in a Google search result, it should take 2.5 seconds for the page to paint the largest component of that new page. That's often an image or a video or a large H1 tag or something like that. And that process then will help you to...to get to 2.5 seconds in Largest Contentful Paint; there are things that have to happen along the way. We have to download and execute all JavaScript. We have to download CSS. We have to send and receive back-end responses. In the case of a simple Hotwire app, it's one back-end response. But in the case of a single-page app, you got to download the document and then maybe download several XHR fetches or whatever. So there's a chain of events that has to happen there. And you have to walk that back now from 2.5 seconds in Largest Contentful Paint. So that's the line that I'm seeing getting drawn in the sand right now with Google's Core Web Vitals. So pretty much any meaningful web application performance metric can be walked back from that.

STEPH: Okay. That's super helpful. I wasn't aware of the Core Web Vitals and that particular stat that Google is using to then rank the sites. I was going to ask, this kind of blends in nicely into when do you start caring about performance? So if you have a new application that you are just starting to get to market, based on the fact that Google is going to start ranking you right away, you do have to care some right out of the gate. But I am curious, when do you start caring more about performance, and are there certain tools and benchmarking that you want to have in place from day one versus other things that you'll say, “Well, we can wait until we have X numbers of users or other conditions before we add more profiling?”

NATE: I'd say as an approach, I teach people not to have a performance strategy of monitoring. So if your strategy is to have dashboards and look at them regularly, you're going to lose. Eventually, you're not going to look at that dashboard, or more often, you just don't understand what you're looking at. You just install New Relic or Datadog or whatever, and you don't know how to turn a dashboard into actual action. Also, it seems to just wear teams out, and there's no clear mechanism when you just have a dashboard of turning that into oh, well, this has to now be something that somebody on our team has to go work on. Contrast that with bugs, so teams usually have very defined processes around bugs. So usually, what happens is you'll get an Exception Notification through Sentry or Bugsnag or whatever your preferred Exception Notification service is. That gets read by a developer. And then you turn that into a Jira ticket or a Kanban board or whatever. And then that is where work is done and prioritized. Contrast that with performance; there’s often no clear mechanism for turning metrics into stuff that people actually work on. So understanding at your organization how that's going to work and setting up a process that automatically will turn performance issues into actual work that people get done is important.

The way that I generally teach people to do this is to focus instead of dashboards and monitoring, on alerts, on automated thresholds that get tripped and then sends somebody's an email or put something in the Kanban board or whatever. It just has to be something that automatically gets fired. Different tools have different ways of doing this. Datadog has pretty much built their entire product around monitoring and what they call monitors. That's a perfectly fine way to do it, whatever your chosen performance monitoring tool, which I would say is a required thing. I don't think there's really any good excuse in 2021 for not having a performance monitoring tool. There are a million different ways to slice it. You can do it yourself with OpenTelemetry and then like statsD, I don't know, or pay someone else like everyone else does for Datadog or New Relic or AppSignal or whatever. But you got to have one installed. And then I would say you have to have some sort of automated alerting. Now that alerting means that you've also decided on thresholds. And that's the hard work that doesn't get done when your strategy is just monitoring. So it's very easy to just install a dashboard and say, “Hey, I have this average page time load dashboard. That means I'm paying attention to performance.” But if you don't have a clear answer to what number is good and what number is bad, then that dashboard cannot be turned into real action. So that's why I push monitoring so hard is because it allows people to ignore performance is all that matters, and it forces you to make the decision upfront as to what number matters. So that is what I would say, install some kind of performance monitoring. I don't really care what kind.

Nowadays, I also think there's probably no excuse to not have Real User Monitoring. So there's enough GDPR compliance Real User Monitoring now that I think everyone should be using it. So for industry terms, Real User Monitoring is just performance monitoring in the browser. So it's just users’ browser APIs and sends those back to you or your third-party provider, so having that so you actually are collecting back-end and front-end performance metrics. And then making decisions around what is bad and what is good. Probably everybody should just start with a page load time monitor, Largest Contentful Paint monitor. And if you've got a single-page app, probably hooking up some stuff around route changes or whatever your app...because you don't actually have page loads on every single time you navigate. You have to instrument whatever those interactions are. So having those up and then just drawing some lines that say, “Hey, we want our React route changes to always be one second or less.” So I will set an alert that if the 95th percentile is one second or more, I'm going to get alerted. There's a lot of different ways to do that, and everybody will have different needs there. But having a handful of automated monitors is probably a place to start.

STEPH: I like how you also focus on once you have decided those thresholds and have that monitoring in place, but then how do you make it actionable? Because I have certainly been part of teams where we get those alerts, but we don't necessarily...what you just mentioned, prioritize that work to get done until we have perhaps a user complaint about it. Or we start actually having pages that are timing out and not loading, and then they get bumped up in the priority queue. So I really like that idea that if we agree upon those thresholds and then we get alerted, we treat that alert as if it is a user that is letting us know that a page is too slow and that they are unable to use our application, so then we can prioritize that work.

NATE: And it's not all that dissimilar to bugs, really. And I think most teams have processes around correctness issues. And so, all that my strategy is really advocating for is to make performance fail loudly in the same way that most exceptions do. [chuckles] Once you get to that point, I think a lot of teams have processes around prioritization for bugs versus features and all that. And just getting performance into that conversation at least tends to make that solve itself.

STEPH: I'm curious, as you're joining teams and helping them with their performance issues, are there particular buckets or categories of performance issues that are the most common in terms of, let's say, 50% of issues are SQL-related N+1 issues? What tends to be the breakdown that you see?

NATE: So, when it comes to why something is slow in a Ruby application, I teach a method that I call DRM. And that doesn't have anything to do with actual DRM. It's just memorable because it reminds me of things I don't like. DRM stands for Database Ruby and Memory and in that order. So the most common issue is database, the second most common issue is issues with your Ruby code. The least common issue is memory. Specifically, I'm talking about allocation of objects, creating lots of objects. So probably 80% of your issues are in some way database-related. In Rails, it's 50% of those are probably N+1. And then 30% of database issues are probably what I would call unnecessary SQL. So it's not necessarily N+1, but it's a SQL query for information that you already had, or you could do in a more efficient way. So a common thing for unnecessary SQL would be people will filter an ActiveRecord::Collection like ten different ways when they could have just loaded the whole collection, filtered it with Ruby in the ten different ways afterwards, and that works really well if the collection that you're loading is like 10, 20. Turning that into one database query, plus a bunch of calls to innumerable methods is often way faster than doing that as ten separate database queries. Also, that tends to be a more robust approach. This doesn't happen in most companies, but what could happen is the database is like a shared resource. It's a resource that everybody is affected by. So a performance degradation to the database is the worst possible scenario because everything is affected. But if you screw up what's happening at an individual Rails process, then only that Rails process is affected. The blast radius is tiny. It's just that one request. So doing less stuff in the database while it can actually seem like, oh, that doesn't feel right. I'm supposed to do a lot of stuff in the database. It actually can reduce blast radiuses of performance issues because you're not doing it on this database that everyone has to have access to. There are a lot of areas of gray here. And I talk a lot in all my other material like why -- There's a lot of nuance here.

So database is the main stuff. Issues in how you write your Ruby code is probably the other one. Usually, that's just what I would call code that goes bump in the night. It's code that you don't know is running but actually is. Profilers are what help us figure that out. So oftentimes, I'll have someone open up a profiler on their controller action for the first time. And they're like, wait a minute, I had no idea that such and such was running during this controller action, and actually, we don't need to do that at all. So why is it here? So that's the second most common issue. And then the third issue that really doesn't actually come up all that often is object allocation, numbers of objects that get created. So primarily, this is a problem in index actions or actions transactions that deal with big collections. So in Ruby, we often get overly focused on garbage collection, but garbage collection doesn't take any time if you just don't create objects. And object creation itself takes time. So looking at code through the lens of what object does this code create? And trying to get rid of those object allocations can often be a pretty productive way to make stuff faster.

STEPH: You said a lot of amazing things there. So I'm debating on which one to follow up on. I think the one that stuck out to me the most where I have felt pain around this is you mentioned identifying code that goes bump in the night or code that is running, but it doesn't need to be run. And that is something that I've run into with applications where we have a code path that seems important, but yet I can't prove that it's being executed and exactly why it's there and what flow it's supporting. And I'm curious, do you have any tips or tricks in how you’ve helped teams identify that this code path isn't used and it's something that we can remove and then that itself will help speed up the performance of that particular endpoint?

NATE: Like, there's no performance cost to like 100 models in an application that never actually get used. There's really no performance downside to code in an app that doesn't actually ever get run. But instead, what happens is code gets added into callbacks that usually is probably the biggest offender that’s like, always do this thing after you do X. But then, two years later, you don't always need to do that thing after you do X. So the callbacks always run, but sometimes requirements change, and they don't always need to be run. So usually, it's enough to just pop the profiler now on something. And I have people look at it, and they're like, “I don't know why any of this is happening.” Like, it's usually a pretty big Eureka moment once we look at a flame graph for the first time and people understand how to read those, and they understand what they're looking at. But sometimes there's a bit of a process where especially in a bigger app where it's like, “Such and such is running, and this was an entire other team that's working on this. I have no idea what this even does.” So on bigger apps, there's going to be more learning that has to get done there. You have to learn about other parts of the application that maybe you've never learned about before. But profiling helps us to not only see what code is running but also what that relative importance is. Like, okay, maybe this one callback runs, and you don't know what it does, and it's probably unnecessary. But if it only takes 1% of the total time to run this action, that's probably less important than something that takes 20% of total time. And so profilers help us to not only just see all the code that's being run but also to know where that time goes and what time corresponds to what parts of the code.

STEPH: Yeah, that's often the code that makes me the most nervous is where it's code that I suspect is being run or maybe being run, but I don't understand why it's there and then figuring out if it can be removed and then figuring out ways to perhaps even log when a call is being made to that code to determine if it's truly in use or not or at least supported by a code path that a user is hitting. You have a blog post that I read recently that I really appreciated that talks about essentially gaming benchmarking where you talk about the importance of having context around benchmarks. So if someone says, “I've improved something where it is now 10% faster.” It's like, well, what is that 10% relative to? And if it's a tool that other people are using, what does that mean for them? Or did you improve something that was already very fast, and you made it 10% faster? Was that a really valuable use of your time?

NATE: Yeah. You know, something that I read recently that made me think of that again was this Hacker News post that went viral. That was like, how I optimize an AWS EC2 instance to take 1.5 million requests per second on my JSON API. And out of the box, it was like 500 requests per second, and then he got it to 1.5 million. And the whole article was presented with relative numbers. So it was like, “I made this change, and things got 33% faster. And if you do the whole thing right, 500 to 1.5 million requests per second, it's like my app is three times faster now,” or whatever. And that's true, but it would probably be more accurate to say, “I've taken three-millionth of a second out of every request in my app.” That's two ways of saying the same thing because latency and throughput are just related that way. But it's probably more accurate and more useful to say the absolute number, but it doesn't make for great blog posts, so that doesn't tend to get said. The kinds of improvements that were discussed in this article were really, really low-level stuff. That was like if you turn off...I think it was like turn off iptables or something like that. And it's like, that shaves a microsecond off of every time we make a syscall or something. And that is useful if your performance goal is to serve 1.5 million requests per second Hello World responses off of my EC2 instance, which is what this person admittedly was doing. But there's a tendency to walk that back to if I do all things in this article, my application will be three times faster. And that's just not what the evidence says. It's not what you were told. So there's just a tendency to use relative numbers when absolute numbers would be more useful to giving you the context of like, oh, well, this will improve my app or it won't. We get this a lot in Puma. We get benchmarks that are like, hey, this thing is going to help us to do 50,000 requests per second in Puma instead of 10,000. And another way of saying that is you took a couple of nanoseconds off of the overhead of every single request to Puma. And most Puma applications have a hundred millisecond response time. So it's like, yeah, I guess it's cool that you took a nanosecond off, and I’m sure it's going to help us have cool benchmarks, but none of our users are going to care. No one that's used Puma is going to care that their requests are one nanosecond faster now. So what did we really gain here?

STEPH: Yeah, it makes sense that people would want to share those more...I want to call them sparkly stats and something that catches your attention, but they're not necessarily something that's going to translate to us in the way that we hoped that they will in terms of it's not going to speed up our app 30% or have those same rewards or benefits. Speaking of Puma, how is it being a co-maintainer of Puma? And how do you balance that role with all of your other work?

NATE: Actually, it doesn't take all that much of my time. I try to spend about 15 minutes a day on it. And that's really possible because of the philosophy I have around open-source maintenance. I think that open source projects are fundamentally about collaboration and about sharing our hard-fought extractions and fixes and knowledge together. And it's not about a single super contributor or super maintainer who is just out of the goodness of their heart releasing all of their incredible work and time into the public domain or into a free software license. Puma is a pretty popular piece of Ruby software, so a lot of people use it. And I have things on my back burner of if I ever got 20 hours to work on Puma, here’s stuff I would do. But there are a lot of other people that have more time than me to work on Puma. And they're just as smart, and they have other tools they've got in their locker that I don't have. And I realized that it was more important that I actually find ways to recruit and then unblock those people than it was for me to devote as much time as I could to Puma. And so my work on Puma now is really just more like management than anything else. It's more trying to recruit new contributors and trying to give them what they need to help Puma. And contributing to open source is a really fraught experience for a lot of people, especially their first time. And I think we should also be really conscious of that. Like, 95% of software developers have really never contributed to open source in a meaningful way. And that's a huge talent pool of people that could be helping us that aren't. So I'm less concerned about the problems of the 5% that are currently contributing than I am about why there are 95% of us that don't do anything. So that's what gets me excited to work on Puma now, is trying to change that ratio.

STEPH: I really like that mindset of where you are there to provide guidance but then essentially help unblock others as they're making contributions to the project but then still be there to have the history and full context and also provide a path forward of a good direction for Puma to head. In regards to encouraging more people to contribute to open source projects, I've often heard people say how challenging that is, where they have an open-source project that they would really love people to contribute to but finding people is really hard or just letting people know that they're interested in contributions. Have you had any strategies that have been successful for you in encouraging people to contribute?

NATE: Yeah. So first thing, the easiest thing is we have a contributing.md file. So that's something I think more projects should adopt is have an actual file in your project that says everything about how to contribute. Like, what kinds of contributions do you want? Different projects have different things that they want. Like, Rails doesn't want to refactor PRs. Don't send a refactor PR to Rails because they'll reject it. Puma, I'm happy to accept those. So letting people know like, “Hey, here's how we work here. Here's the community we're creating, and here's how it works. Here's how to get involved.” And I think of it as hanging out the shingle and saying, “Yes, I want your contributions. Here's how to do it.” That alone puts you a step above other projects.

The second thing I would say is you need to have contributor-only communication channels. So we have Matrix chat. So Matrix is like this successor to IRC. So we have a chat channel basically, but it's like contributors only. I don't enforce that, but I just don't want support requests in there. I don't want people coming in there and being like, “My Puma config doesn't work.” And instead, it's just for people that want to contribute to Puma, and that want to help out. If you have a question and come in there, anyone can answer it.

And then finally, another thing that I've had success with is doing one-on-one stuff. So I will actually...I have a Calendly invite that I think is in contributing.md now that you can just book 30 minutes with me anytime about contributing to Puma. And I will get on a Zoom call with you and talk to you about what are your concerns? Where do I think you can help? And I give my time away that way. The way I see it is like if I do that 20 times and I create one super contributor to Puma, that is worth more than me spending 10 hours on Puma because that person can contribute 100, 200, 1,000 hours over their lifetime of contributing to Puma. So that's actually a much more higher leverage contribution, really from my perspective. It's actually helping other people contribute more.

STEPH: Yeah, that's huge to offer people to say, “Hey, you can book time with me, and I will walk you through and let you know where you can start making an impactful contribution right away,” or “Here are some areas that I think you'd be interested, to begin with.” That seems like such a nice onboarding for someone who says, “I'm interested, but I'm nervous,” or “I'm just not sure about where to get started.” Also, I love your complaint department voice for the person who their Puma config doesn't work. That was delightful. [chuckles]

NATE: I think it's a little bit part of my open-source philosophy that, especially at a certain scale like Puma is at that we really kind of over-prioritize users. And I'm not really here to do support; I'm here to make the project better. And users don't actually contribute to open source projects. Users use the thing, and that's great. That's the whole reason we're open-sourcing is so more people use it. But it's important not to prioritize that over people who want to make the project better. And I think a lot of times; people get caught up in this almost clout chasing of getting the most GitHub stars that they think they need and users they think they need. And you don't get paid for having users, and the product doesn't get any better either. So I don't prioritize users. I prioritize the quality of the project and getting contributors. And that will create a better project, which will then create more users. So I think it's easy to get sidetracked by people that ask for your time when they're not giving anything back to the project in return. And especially at Puma’s scale, we have enough people that want my time or the time of other maintainers at Puma so that they can contribute to the project. And putting user support requests ahead of that is not good for the project. It's not the biggest, long-term value increase we could be making, so I don't prioritize them anymore.

STEPH: Yep. That sounds like more the pursuit of sparkly stats and looking for all those GitHub stars or all of those likes. Well, Nate, if you're game, I have two listener questions that I'd like to run by you because I shared with some folks that you are going to be on The Bike Shed today. And they're very excited and have two questions that they'd like me to run by you. How does that sound?

NATE: Yeah, all right.

STEPH: So the first question is, are there any paradigms or trends in Rails that inherently hurt performance?

NATE: Yeah. I get this question a lot, and I will preface it with saying that I'm the performance guy, and I'm not the software design guy. And I get a lot of questions about does such and such software design...how does that impact performance? And usually, there's like a way to do anything in a performant way. And I'm just here to help people to find the performant way and not to prescribe “You must always do X, Y, or Z,” or “ActiveRecord is bad. Never use it.” That's not my job here. And in my experience, there's a fast way to do almost anything.

Now, one thing that I think is dying, I guess, or one approach or one common...I don't know what to call it. One common mistake that is clearly wrong is to not do any form of server-side rendering in a web application. So I am anti-client-side app. But there are ways to do that and to do it quickly. But rendering a basically blank document, which is what most of these applications will do when they're using Rails as a back-end…you'll serve this basically blank document or a document with maybe some Chrome in it. And then, the client-side app has to execute, compile JavaScript, make XHR requests, and then render the page. That is just by definition slower than serving somebody a server-side rendered page. Now, I am 100% agnostic on how you want to generate that server-side rendering. There are some people that are working on better ways to do that with Rails and client-side apps. Or you could just go the Hotwire Turbolinks way. And it's more progressive enhancement where the back-end is always just serving the server-side rendered response. And then you do some JavaScript on top of that.

So I think five years from now, nobody will be doing this approach of serving blank documents and then booting client-side apps into that. Or at least it will be seen as outdated enough that you should never design a project that way anymore. It's one of those few things where it's like, yeah, just by definition, you're adding more steps into a rendering flow. That means, by necessity, it has to be slower. So I think everybody should be thinking about server-side rendering in their project. Again, I’m totally agnostic on how you want to implement that. With React, whatever front-end flavor of the month you want to go with, there's plenty of ways to do that, but I just think you have to be prioritizing that now.

STEPH: All right. Well, I like that five-year projection of where we're headed. I have found that it's often the admin-side where people will still bring in a lot of JavaScript rendering, just to touch on a bit of what you're saying, in terms of let's favor the server-rendered HTML versus over-optimizing a space that one, probably isn't a profitable space in terms that we do want our admins to have a great experience for our product. But if they are not necessarily our users, then it also doesn't need to be anything that is over the top or fancy or probably uses a lot of JavaScript. And instead, we can start simple. And there's a number of times that I've been on projects where we have often walked the admin back to be more server-rendered because we got to a point where someone was very excited to make the admin very splashy and quick but then couldn't keep up with the requests because then they were having to prioritize the user experience first. So it was almost like optimizing the admin, but then it got left out in the cold. So then it's just sort of this poor experience.

NATE: Yes. Shopify famously walked back their admin from I think it was Backbone to Turbolinks. And I think that that has now moved back to React is my understanding. But Shopify is a huge company, so they have plenty of time and resources to be able to do that. But I just remember that happening at the time where I was like, oh wow, they just rolled the whole thing back to Turbolinks again. And now, with the consolidation that's gone on in the React world, it's a little bit easier to pipe a server-side rendering into a React app. Whereas with Backbone, it was like no one knew what you were doing. So there was less knowledge about how to server-side render this stuff. Now it doesn't seem to be so much of a problem. But yeah, I mean, Rails is really good at CRUD apps, and admin is like 99% CRUD. And adhering as closely as possible to the Rails Golden Path there in an admin seems to be the most productive way to work on that kind of feature.

STEPH: All right. Ready for your second question?

NATE: Yes.

STEPH: Okay. This one's a bit more in-depth. They also mentioned a particular project name. So I am going to swap it out with a different name. So on project cinnamon roll, we found a really gnarly time-consuming API endpoint that's getting hammered. And on a first pass, we addressed a couple of N+1 issues and tuned the performance, and felt pretty confident that they had addressed the issue. But it was still fairly slow. So then they took some additional incremental steps. So they swapped out to use OJ for serialization that shaved off an additional 10% but was still slow. They also went the route of going straight to Rails cache with a one-minute expiration. So that way, they could avoid mucking with cache busting because they confirmed with the client that data could be slightly stale. And this was great. It worked out well. So it dropped their average response time down to less than 70 milliseconds. With all that said, that journey took a few hours over a few days, and multiple production deploys. And had they gone straight to the cache, then they would have had a 15-minute fix with a single deploy. So this person's wondering, are there any other examples like that where, rather than taking these incremental seemingly obvious performance whims, there are situations where you want to be much more direct with your path?

NATE: I guess I'd say that profiling can help you to understand and form better hypotheses about what will make things faster and what won't. Because a profiler can't really lie to you about where time goes, either you spent 20% of your time in this method, or you didn't. So I don't spend any time in any of my material talking about what JSON serializer you use. Because really that's actually never...that's really never anybody's bottleneck. It's never a huge proportion of people's total percentage of time. And I know that because I've looked at enough profiles that the issues are usually in other places. So I would say that if your hypotheses that you're generating are not working, it's because you're not generating good enough hypotheses. And profiling is the place to do that. So having profilers running in production is probably the biggest level-upscale-like that most teams could take. So having profilers that you can access as on production servers as a user is probably the biggest level up that you could make to generating the hypotheses because that'll have real production data, real production servers, real production environment. And it's pretty common now that pretty much every team that I work with either has that already, or we work on implementing it. It's something that I've seen in production at GitHub and Shopify. You can do it yourself with rack-mini-profiler. It's all about setting up the authorization, just making sure that only authorized users get to see every single SQL query generated in the flame graph and all that. But other than that, there's no reason you shouldn't do it. So I would say that if you're not generating the right hypotheses or you don't...if the last hypothesis out of 10 is the one that works, you need better hypotheses, and the best way to do that is better profiling.

STEPH: Okay, better profiling. And yeah, it sounds like there's also a bit of experience in there in terms of things that you're used to seeing, that you've noticed that could be outliers in terms of that they're not necessarily the thing that you want to improve. Like you mentioned spending time on how you're serializing your JSON is not somewhere that you would look. But then there are other areas that you've gained experience that you know would be likely more beneficial to then focus on to form that hypothesis.

NATE: Yeah, that's a long way of saying experience pays off. I've had six years of doing this every single day. So I'm going to be pretty good at...that's what I get paid for. [laughs] So if I wasn't very good at that, I probably wouldn't be making any money at it.

STEPH: [laughs] All right. Well, thanks, Nate, so much for coming on the show today and talking so much about performance. On that note, I think it's a good place for us to wrap up. If people are interested in following along with what you're working on and they want to keep up with your latest and greatest workshops that are coming out, where can they find you on the internet?

NATE: speedshop.co is my site. @nateberkopec on Twitter. And speedshop.co has a link to my newsletter, which is where I'm actively thinking every week and publishing stuff too. So if you want to get the drip of news and thoughts, that's probably the best place to go.

STEPH: Perfect. All right. Well, thank you so much.

NATE: No problem.

STEPH: The show notes for this episode can be found at bikeshed.fm.

CHRIS: This show is produced and edited by Mandy Moore.

STEPH: If you enjoyed listening, one really easy way to support the show is to leave us a quick rating or a review in iTunes as it helps other people find the show.

CHRIS: If you have any feedback for this or any of our other episodes, you can reach us @bikeshed on Twitter. And I'm @christoomey.

STEPH: And I’m @SViccari.

CHRIS: Or you can email us at hosts@bikeshed.fm.

STEPH: Thanks so much for listening to The Bike Shed, and we'll see you next week.

Together: Bye.

Announcer: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.

Support The Bike Shed

Episode source