The Bike Shed

355: Test Performance

Sep 20 '22

Guest Geoff Harcourt, CTO of CommonLit, joins Joël to talk about a thing that comes up with a lot with clients: the performance of their test suite. It's often a concern because with test suites, until it becomes a problem, people tend to not treat it very well, and people ask for help on making their test suites faster. Geoff shares how he handles a scenario like this at CommonLit.

This episode is brought to you by Airbrake. Visit Frictionless error monitoring and performance insight for your app stack.

Transcript:

JOËL: Hello and welcome to another episode of The Bike Shed, a weekly podcast from your friends at thoughtbot about developing great software. I'm Joël Quenneville. And today, I'm joined by Geoff Harcourt, who is the CTO of CommonLit.

GEOFF: Hi, Joël.

JOËL: And together, we're here to share a little bit of what we've learned along the way. Geoff, can you briefly tell us what is CommonLit? What do you do?

GEOFF: CommonLit is a 501(c)(3) non-profit that delivers a literacy curriculum in English and Spanish to millions of students around the world. Most of our tools are free. So we take a lot of pride in delivering great tools to teachers and students who need them the most.

JOËL: And what does your role as CTO look like there?

GEOFF: So we have a small engineering team. There are nine of us, and we run a Rails monolith. I'd say a fair amount of the time; I'm hands down in the code. But I also do the things that an engineering head has to do, so working with vendors, and figuring out infrastructure, and hiring, and things like that.

JOËL: So that's quite a variety of things that you have to do. What is new in your world? What's something that you've encountered recently that's been fun or interesting?

GEOFF: It's the start of the school year in America, so traffic has gone from a very tiny amount over the summer to almost the highest load that we'll encounter all year. So we're at a new hosting provider this fall. So we're watching our infrastructure and keeping an eye on it.

The analogy that we've been using to describe this is like when you set up a bunch of plumbing, it looks like it all works, but until you really pump water through it, you don't see if there are any leaks. So things are in good shape right now, but it's a very exciting time of year for us.

JOËL: Have you ever done some actual plumbing yourself?

GEOFF: I am very, very bad at home repair. But I have fixed a toilet or two. I've installed a water filter but nothing else. What about you?

JOËL: I've done a little bit of it when I was younger with my dad. Like, I actually welded copper pipes and that kind of thing.

GEOFF: Oh, that's amazing. That's cool. Nice.

JOËL: So I've definitely felt that thing where you turn the water source back on, and it's like, huh, let's see, is this joint going to leak, or are we good?

GEOFF: Yeah, they don't have CI for plumbing, right?

JOËL: [laughs] You know, test it in production, right?

GEOFF: Yeah. [laughs] So we're really watching right now traffic starting to rise as students and teachers are coming back. And we're also figuring out all kinds of things that we want to do to do better monitoring of our application, so some of this is watching metrics to see if things happen. But some of this is also doing some simulated user activity after we do deploys. So we're using some automated browsers with Cypress to log into our application and do some user flows, and then report back on the results.

JOËL: So is this kind of like a feature test in CI, except that you're running it in production?

GEOFF: Yeah. Smoke test is the word that we've settled on for it, but we run it against our production server every time we deploy. And it's a small suite. It's nowhere as big as our big Capybara suite that we run in CI, but we're trying to get feedback in less than six minutes. That's sort of the goal.

In addition to running tests, we also take screenshots with a tool called Percy, and that's a visual regression testing tool. So we get to see the screenshots, and if they differ by more than one pixel, we get a ping that lets us know that maybe our CSS has moved around or something like that.

JOËL: Has that caught some visual bugs for you?

GEOFF: Definitely. The state of CSS at CommonLit was very messy when I arrived, and it's gotten better, but it still definitely needs some love. There are some false positives, but it's been really, really nice to be able to see visual changes on our production pages and then be able to approve them or know that there's something we have to go back and fix.

JOËL: I'm curious, for this smoke test suite, how long does it take to run?

GEOFF: We run it in parallel. It runs on Buildkite, which is the same tool that we use to orchestrate our CI, and the longest test takes about five minutes. It signs in as a teacher, creates an account. It creates a class; it invites the student to that class. It then logs out, logs in as that student creates the student account, signs in as the student, joins the class.

It then assigns a lesson to the student then the student goes and takes the lesson. And then, when the student submits the lesson, then the test is over. And that confirms all of the most critical flows that we would want someone to drop what they were doing if it's broken, you know, account creation, class creation, lesson creation, and students taking a lesson.

JOËL: So you're compressing the first few weeks of school into five minutes.

GEOFF: Yes. And I pity the school that has thousands of fake teachers, all named Aaron McCarronson at the school.

JOËL: [laughs]

GEOFF: But we go through and delete that data every once in a while. But we have a marketer who just started at CommonLit maybe a few weeks ago, and she thought that someone was spamming our signup form because she said, "I see hundreds of teachers named Aaron McCarronson in our user list."

JOËL: You had to admit that you were the spammer?

GEOFF: Yes, I did. [laughs] We now have some controls to filter those people out of reports. But it's always funny when you look at the list, and you see all these fake people there.

JOËL: Do you have any rate limiting on your site?

GEOFF: Yeah, we do quite a bit of it, actually. Some of it we do through Cloudflare. We have tools that limit a certain flow, like people trying to credential stuffing our password, our user sign-in forms. But we also do some further stuff to prevent people from hitting key endpoints. We use Rack::Attack, which is a really nice framework. Have you had to do that in client work with clients setting that stuff up?

JOËL: I've used Rack:Attack before.

GEOFF: Yeah, it's got a reasonably nice interface that you can work with. And I always worry about accidentally setting those things up to be too sensitive, and then you get lots of stuff back. One issue that we sometimes find is that lots of kids at the same school are sharing an IP address. So that's not the thing that we want to use for rate limiting. We want to use some other criteria for rate limiting.

JOËL: Right, right. Do you ever find that you rate limit your smoke tests? Or have you had to bypass the rate limiting in the smoke tests?

GEOFF: Our smoke tests bypass our rate limiting and our bot detection. So they've got some fingerprints they use to bypass that.

JOËL: That must have been an interesting day at the office.

GEOFF: Yes. [laughter] With all of these things, I think it's a big challenge to figure out, and it's similar when you're making tests for development, how to make tests that are high signal. So if a test is failing really frequently, even if it's testing something that's worthwhile, if people start ignoring it, then it stops having value as a piece of signal. So we've invested a ton of time in making our test suite as reliable as possible, but you sometimes do have these things that just require a change.

I've become a really big fan of...there's a Ruby driver for Capybara called Cuprite, and it doesn't control chrome with Chrome Driver or with Selenium. It controls it with the Chrome DevTools protocol, so it's like a direct connection into the browser. And we find that it's very, very fast and very, very reliable. So we saw that our Capybara specs got significantly more reliable when we started using this as our driver.

JOËL: Is this because it's not actually moving the mouse around and clicking but instead issuing commands in the background?

GEOFF: Yeah. My understanding of this is a little bit hazy. But I think that Selenium and ChromeDriver are communicating over a network pipe, and sometimes that network pipe is a little bit lossy. And so it results in asynchronous commands where maybe you don't get the feedback back after something happens. And CDP is what Chrome's team and I think what Puppeteer uses to control things directly. So it's great.

And you can even do things with it. Like, you can simulate different time zone for a user almost natively. You can speed up or slow down the traveling of time and the direction of time in the browser and all kinds of things like that. You can flip it into mobile mode so that the device reports that it's a touch browser, even though it's not. We have a set of mobile specs where we flip it with CDP into mobile mode, and that's been really good too.

Do you find when you're doing client work that you have a demand to build mobile-specific specs for system tests?

JOËL: Generally not, no.

GEOFF: You've managed to escape it.

JOËL: For something that's specific to mobile, maybe one or two tests that have a weird interaction that we know is different on mobile. But in general, we're not doing the whole suite under mobile and the whole suite under desktop.

GEOFF: When you hand off a project...it's been a while since you and I have worked together.

JOËL: For those who don't know, Geoff used to be with us at thoughtbot. We were colleagues.

GEOFF: Yeah, for a while. I remember my very first thoughtbot Summer Summit; you gave a really cool lightning talk about Eleanor of Aquitaine.

JOËL: [laughs]

GEOFF: That was great. So when you're handing a project off to a client after your ending, do you find that there's a transition period where you're educating them about the norms of the test suite before you leave it in their hands?

JOËL: It depends a lot on the client. With many clients, we're working alongside an existing dev team. And so it's not so much one big handoff at the end as it is just building that in the day-to-day, making sure that we are integrating with the team from the outset of the engagement.

So one thing that does come up a lot with clients is the performance of their test suite. That's often a concern because the test suite until it becomes a problem, people tend to not treat it very well. And by the time that you're bringing on an external consultant to help, generally, that's one of the areas of the code that's been a little bit neglected. And so people ask for help on making their test suite faster. Is that something that you've had to deal with at CommonLit as well?

GEOFF: Yeah, that's a great question. We have struggled a lot with the speed that our test suite...the time it takes for our test suite to run. We've done a few things to improve it. The first is that we have quite a bit of caching that we do in our CI suite around dependencies. So gems get cached separately from NPM packages and browser assets. So all three of those things are independently cached.

And then, we run our suites in parallel. Our Jest specs get split up into eight containers. Our Ruby non-system tests...I'd like to say unit tests, but we all know that some of those are actually integration tests.

JOËL: [laughs]

GEOFF: But those tests run in 15 containers, and they start the moment gems are built. So they don't wait for NPM packages. They don't wait for assets. They immediately start going. And then our system specs as soon as the assets are built kick off and start running. And we actually run that in 40 parallel containers so we can get everything finished.

So our CI suite can finish...if there are no dependency bumps and no asset bumps, our specs suite you can finish in just under five minutes. But if you add up all of that time, cumulatively, it's something like 75 minutes is the total execution as it goes. Have you tried FactoryDoctor before for speeding up test suites?

JOËL: This is the gem from Evil Martians?

GEOFF: Yeah, it's part of TestProf, which is their really, really unbelievable toolkit for improving specs, and they have a whole bunch of things. But one of them will tell you how many invocations of FactoryBot factories each factory got. So you can see a user factory was fired 13,000 times in the test suite. It can even do some tagging where it can go in and add metadata to your specs to show which ones might be candidates for optimization.

JOËL: I gave a talk at RailsConf this year titled Your Tests Are Making Too Many Database Calls.

GEOFF: Nice.

JOËL: And one of the things I talked about was creating a lot more data via factories than you think that you are. And I should give a shout-out to FactoryProf for finding those.

GEOFF: Yeah, it's kind of a silent killer with the test suite, and you really don't think that you're doing a whole lot with it, and then you see how many associations. How do you fight that tension between creating enough data that things are realistic versus the streamlining of not creating extraneous things or having maybe mystery guests via associations and things like that?

JOËL: I try to have my base factories be as minimal as possible. So if there's a line in there that I can remove, and the factory or the model still saves, then it should be removed. Some associations, you can't do that if there's a foreign key constraint, and so then I'll leave it in. But I am a very hardcore minimalist, at least with the base factory.

GEOFF: I think that makes a lot of sense. We use foreign keys all over the place because we're always worried about somehow inserting student data that we can't recover with a bug. So we'd rather blow up than think we recorded it. And as a result, sometimes setting up specs for things like a student answering a multiple choice question on a quiz ends up being this sort of if you give a mouse a cookie thing where it's you need the answer options. You need the question. You need the quiz. You need the activity. You need the roster, the students to be in the roster. There has to be a teacher for the roster. It just balloons out because everything has a foreign key.

JOËL: The database requires it, but the test doesn't really care. It's just like, give me a student and make it valid.

GEOFF: Yes, yeah. And I find that that challenge is really hard. And sometimes, you don't see how hard it is to enforce things like database integrity until you have a lot of concurrency going on in your application. It was a very rude surprise to me to find out that browser requests if you have multiple servers going on might not necessarily be served in the order that they were made.

JOËL: [laughs] So you're talking about a scenario where you're running multiple instances of your app. You make two requests from, say, two browser tabs, and somehow they get served from two different instances?

GEOFF: Or not even two browser tabs. Imagine you have a situation where you're auto-saving.

JOËL: Oooh, background requests.

GEOFF: Yeah. So one of the coolest features we have at CommonLit is that students can annotate and highlight a text. And then, the teachers can see the annotations and highlights they've made, and it's actually part of their assignment often to highlight key evidence in a passage. And those things all fire in the background asynchronously so that it doesn't block the student from doing more stuff.

But it also means that potentially if they make two changes to a highlight really quickly that they might arrive out of order. So we've had to do some things to make sure that we're receiving in the right order and that we're not blowing away data that was supposed to be there.

Just think about in a Heroku environment, for example, which is where we used to be, you'd have four dynos running. If dyno one takes too long to serve the thing for dyno two, request one may finish after request two. That was a very, very rude surprise to learn that the world was not as clean and neat as I thought.

JOËL: I've had to do something similar where I'm making a bunch of background requests to a server. And even with a single dyno, it is possible for your request to come back out of order just because of how TCP works. So if it's waiting for a packet and you have two of these requests that went out not too long before each other, there's no guarantee that all the packets for request one come back before all the packets from request two.

GEOFF: Yeah, what are the strategies for on the client side for dealing with that kind of out-of-order response?

JOËL: Find some way to effectively version the requests that you make. Timestamp is an easy one. Whenever a request comes in, you take the response from the latest timestamp, and that wins out.

GEOFF: Yeah, we've started doing some unique IDs. And part of the unique ID is the browser's timestamp. We figure that no one would try to hack themselves and intentionally screw up their own data by submitting out of order.

JOËL: Right, right.

GEOFF: It's funny how you have to pick something to trust. [laughs]

JOËL: I'd imagine, in this case, if somebody did mess around with it, they would really only just be screwing up their own UI. It's not like that's going to then potentially crash the server because of something, and then you've got a potential vector for a denial of service.

GEOFF: Yeah, yeah, that's always what we're worried about, and we have to figure out how to trust these sorts of requests as what's a valid thing and what is, as you're saying, is just the user hurting themselves as opposed to hurting someone else's stuff?

MID-ROLL AD:

Debugging errors can be a developer’s worst nightmare...but it doesn’t have to be. Airbrake is an award-winning error monitoring, performance, and deployment tracking tool created by developers for developers that can actually help cut your debugging time in half.

So why do developers love Airbrake? It has all of the information that web developers need to monitor their application - including error management, performance insights, and deploy tracking!

Airbrake’s debugging tool catches all of your project errors, intelligently groups them, and points you to the issue in the code so you can quickly fix the bug before customers are impacted.

In addition to stellar error monitoring, Airbrake’s lightweight APM helps developers to track the performance and availability of their application through metrics like HTTP requests, response times, error occurrences, and user satisfaction.

Finally, Airbrake Deploy Tracking helps developers track trends, fix bad deploys, and improve code quality.

Since 2008, Airbrake has been a staple in the Ruby community and has grown to cover all major programming languages. Airbrake seamlessly integrates with your favorite apps to include modern features like single sign-on and SDK-based installation. From testing to production, Airbrake notifiers have your back.

Your time is valuable, so why waste it combing through logs, waiting for user reports, or retrofitting other tools to monitor your application? You literally have nothing to lose. Head on over to airbrake.io/try/bikeshed to create your FREE developer account today!

GEOFF: You were talking about test suites. What are some things that you have found are consistently problems in real-world apps, but they're really, really hard to test in a test suite?

JOËL: Difficult to test or difficult to optimize for performance?

GEOFF: Maybe difficult to test.

JOËL: Third-party integrations. Anything that's over the network that's going to be difficult. Complex interactions that involve some heavy frontend but then also need a lot of backend processing potentially with asynchronous workers or something like that, there are a lot of techniques that we can use to make all those play together, but that means there's a lot of complexity in that test.

GEOFF: Yeah, definitely. I've taken a deep interest in what I'm sure there's a better technical term for this, but what I call network hostile environments or bandwidth hostile environments. And we see this a lot with kids. Especially during the pandemic, kids would often be trying to do their assignments from home. And maybe there are five kids in the house, and they're all trying to do their homework at the same time. And they're all sharing a home internet connection.

Maybe they're in the basement because they're trying to get some peace and quiet so they can do their assignment or something like that. And maybe they're not strongly connected. And the challenge of dealing with intermittent connectivity is such an interesting problem, very frustrating but very interesting to deal with.

JOËL: Have you explored at all the concept of Formal Methods to model or verify situations like that?

GEOFF: No, but I'm intrigued. Tell me more.

JOËL: I've not tried it myself. But I've read some articles on the topic. Hillel Wayne is a good person to follow for this.

GEOFF: Oh yeah.

JOËL: But it's really fascinating when you'll see, okay, here are some invariants and things. And then here are some things that you set up some basic properties for a system. And then some of these modeling languages will then poke holes and say, hey, it's possible for this 10-step sequence of events to happen that will then crash your server. Because you didn't think that it's possible for five people to be making concurrent requests, and then one of them fails and retries, whatever the steps are. So it's really good at modeling situations that, as developers, we don't always have great intuition, things like parallelism.

GEOFF: Yeah, that sounds so interesting. I'm going to add that to my list of reading for the fall. Once the school year calms down, I feel like I can dig into some technical topics again. I've got this book sitting right next to my desk, Designing Data-Intensive Applications. I saw it referenced somewhere on Twitter, and I did the thing where I got really excited about the book, bought it, and then didn't have time to read it. So it's just sitting there unopened next to my desk, taunting me.

JOËL: What's the 30-second spiel for what is a data-intensive app, and why should we design for it differently?

GEOFF: You know, that's a great question. I'd probably find out if I'd dug further into the book.

JOËL: [laughs]

GEOFF: I have found at CommonLit that we...I had a couple of clients at thoughtbot that dealt with data at the scale that we deal with here. And I'm sure there are bigger teams doing, quote, "bigger data" than we're doing. But it really does seem like one of our key challenges is making sure that we just move data around fast enough that nothing becomes a bottleneck.

We made a really key optimization in our application last year where we changed the way that we autosave students' answers as they go. And it resulted in a massive increase in throughput for us because we went from trying to store updated versions of the students' final answers to just storing essentially a draft and often storing that draft in local storage in the browser and then updating it on the server when we could.

And then, as a result of this, we're making key updates to the table where we store a student's answers much less frequently. And that has a huge impact because, in addition to being one of the biggest tables at CommonLit...it's got almost a billion recorded answers that we've gotten from students over the years. But because we're not writing to it as often, it also means that reads that are made from the table, like when the teacher is getting a report for how the students are doing in a class or when a principal is looking at how a school is doing, now, those queries are seeing less contention from ongoing writes. And so we've seen a nice improvement.

JOËL: One strategy I've seen for that sort of problem, especially when you have a very write-heavy table but that also has a different set of users that needs to read from it, is to set up a read replica. So you have your main that is being written to, and then the read replica is used for reports and people who need to look at the data without being in contention with the table being written.

GEOFF: Yeah, Rails multi-DB support now that it's native to the framework is excellent. It's so nice to be able to just drop that in and fire it up and have it work. We used to use a solution that Instacart had built. It was great for our needs, but it wasn't native to the framework.

So every single time we upgraded Rails, we had to cross our fingers and hope that it didn't, you know, whatever private APIs of ActiveRecord it was using hadn't broken. So now that that stuff, which I think was open sourced from GitHub's multi-database implementation, so now that that's all native in Rails, it's really, really nice to be able to use that.

JOËL: So these kinds of database tricks can help make the application much more performant. You'd mentioned earlier that when you were trying to make your test performant that you had introduced parallelism, and I feel like that's maybe a bit of an intimidating thing for a lot of people. How would you go about converting a test suite that's just vanilla RSpec, single-threaded, and then moving it in a direction of being more parallel?

GEOFF: There's a really, really nice tool called Knapsack, which has a free version. But the pro version, I feel like if you're spending any money at all on CI, it's immediately worth the cost. I think it's something like $75 a month for each suite that you run on it. And Knapsack does this dynamic allocation of tests across containers.

And it interfaces with several of the popular CI providers so that it looks at environment variables and can tell how many containers you're splitting across. It'll do some things, like if some of your containers start early and some of them start late, it will distribute the work so that they all end at the same time, which is really nice.

We've preferred CI providers that charge by the minute. So rather than just paying for a service that we might not be using, we've used services like Semaphore, and right now, we're on Buildkite, which charge by the minute, which means that you can decide to do as much parallelism as you want. You're just paying for the compute time as you run things.

JOËL: So that would mean that two minutes of sequential build time costs just the same as splitting it up in parallel and doing two simultaneous minutes of build time.

GEOFF: Yeah, that is almost true. There's a little bit of setup time when a container spins up. And that's one of the key things that we optimize. I guess if we ran 200 containers if we were like Shopify or something like that, we could technically make our CI suite finish faster, but it might cost us three times as much.

Because if it takes a container 30 seconds to spin up and to get ready, that's 30 seconds of dead time when you're not testing, but you're paying for the compute. So that's one of the key optimizations that we make is figuring out how many containers do we need to finish fast when we're not just blowing time on starting and finishing?

JOËL: Right, because there is a startup cost for each container.

GEOFF: Yeah, and during the work day when our engineers are working along, we spin up 200 EC2 machines or 150 EC2 machines, and they're there in the fleet, and they're ready to go to run CI jobs for us. But if you don't have enough machines, then you have jobs that sit around waiting to start, that sort of thing. So there's definitely a tension between figuring out how much parallelism you're going to do. But I feel like to start; you could always break your test suite into four pieces or two pieces and just see if you get some benefit to running a smaller number of tests in parallel.

JOËL: So, manually splitting up the test suite.

GEOFF: No, no, using something like Knapsack Pro where you're feeding it the suite, and then it's dividing up the tests for you. I think manually splitting up the suite is probably not a good practice overall because I'm guessing you'll probably spend more engineering time on fiddling with which tests go where such that it wouldn't be cost-effective.

JOËL: So I've spent a lot of time recently working to improve a parallel test suite. And one of the big problems that you have is trying to make sure that all of your parallel surfaces are being used efficiently, so you have to split the work evenly. So if you said you have 70 minutes worth of work, if you give 50 minutes to one worker and 20 minutes to the other, that means that your total test suite is still 50 minutes, and that's not good.

So ideally, you split it as evenly as possible. So I think there are three evolutionary steps on the path here. So you start off, and you're going to manually split things out. So you're going to say our biggest chunk of tests by time are the feature specs. We'll make them almost like a separate suite. Then we'll make the models and controllers and views their own thing, and that's roughly half and half, and run those. And maybe you're off by a little bit, but it's still better than putting them all in one.

It becomes difficult, though, to balance all of these because then one might get significantly longer than the other then, you have to manually rebalance it. It works okay if you're only splitting it among two workers. But if you're having to split it among 4, 8, 16, and more, it's not manageable to do this, at least not by hand.

If you want to get fancy, you can try to automate that process and record a timing file of how long every file takes. And then when you kick off the build process, look at that timing file and say, okay, we have 70 minutes, and then we'll just split the file so that we have roughly 70 divided by number of workers' files or minutes of work in each process. And that's what gems like parallel_tests do. And Knapsack's Classic mode works like this as well. That's decently good.

But the problem is you're working off of past information. And so if the test has changed or just if it's highly variable, you might not get a balanced set of workers. And as you mentioned, there's a startup cost, and so not all of your workers boot up at the same time. And so you might still have a very uneven amount of work done by each worker by statically determining the work to be done via a timing file.

So the third evolution here is a dynamic or a self-balancing approach where you just put all of the tests or the files in a queue and then just have every worker pull one or two tests when it's ready to work. So that way, if something takes a lot longer than expected, well, it's just not pulling more from the queue. And everybody else still pulls, and they end up all balancing each other out. And then ideally, every worker finishes work at exactly the same time. And that's how you know you got the most value you could out of your parallel processes.

GEOFF: Yeah, there's something about watching all the jobs finish in almost exactly, you know, within 10 seconds of each other. It just feels very, very satisfying. I think in addition to getting this dynamic splitting where you're getting either per file or per example split across to get things finishing at the same time, we've really valued getting fast feedback.

So I mentioned before that our Jest specs start the moment NPM packages get built. So as soon as there's JavaScripts that can be executed in test, those kick-off. As soon as our gems are ready, the RSpec non-system tests go off, and they start running specs immediately. So we get that really, really fast feedback.

Unfortunately, the browser tests take the longest because they have to wait for the most setup. They have the most dependencies. And then they also run the slowest because they run in the browser and everything. But I think when things are really well-oiled, you watch all of those containers end at roughly the same time, and it feels very satisfying.

JOËL: So, a few weeks ago, on an episode of The Bike Shed, I talked with Eebs Kobeissi about dependency graphs and how I'm super excited about it. And I think I see a dependency graph in what you're describing here in that some things only depend on the gem file, and so they can start working. But other things also depend on the NPM packages. And so your build pipeline is not one linear process or one linear process that forks into other linear processes; it's actually a dependency graph.

GEOFF: That is very true. And the CI tool we used to use called Semaphore actually does a nice job of drawing the dependency graph between all of your steps. Buildkite does not have that, but we do have a bunch of steps that have to wait for other steps to finish. And we do it in our wiki. On our repo, we do have a diagram of how all of this works.

We found that one of the things that was most wasteful for us in CI was rebuilding gems, reinstalling NPM packages (We use Yarn but same thing.), and then rebuilding browser assets. So at the very start of every CI run, we build hashes of a bunch of files in the repository. And then, we use those hashes to name Docker images that contain the outputs of those files so that we are able to skip huge parts of our CI suite if things have already happened.

So I'll give an example if Ruby gems have not changed, which we would know by the Gemfile.lock not having changed, then we know that we can reuse a previously built gems image that has the gems that just gets melted in, same thing with yarn.lock. If yarn.lock hasn't changed, then we don't have to build NPM packages. We know that that already exists somewhere in our Docker registry.

In addition to skipping steps by not redoing work, we also have started to experiment...actually, in response to a comment that Chris Toomey made in a prior Bike Shed episode, we've started to experiment with skipping irrelevant steps. So I'll give an example of this if no Ruby files have changed in our repository, we don't run our RSpec unit tests. We just know that those are valid. There's nothing that needs to be rerun.

Similarly, if no JavaScript has changed, we don't run our Jest tests because we assume that everything is good. We don't lint our views with erb-lint if our view files haven't changed. We don't lint our factories if the model or the database hasn't changed. So we've got all these things to skip key types of processing.

I always try to err on the side of not having a false pass. So I'm sure we could shave this even tighter and do even less work and sometimes finish the build even faster. But I don't want to ever have a thing where the build passes and we get false confidence.

JOËL: Right. Right. So you're using a heuristic that eliminates the really obvious tests that don't need to be run but the ones that maybe are a little bit more borderline, you keep them in. Shaving two seconds is not worth missing a failure.

GEOFF: Yeah. And I've read things about big enterprises doing very sophisticated versions of this where they're guessing at which CI specs might be most relevant and things like that. We're nowhere near that level of sophistication right now.

But I do think that once you get your test suite parallelized and you're not doing wasted work in the form of rebuilding dependencies or rebuilding assets that don't need to be rebuilt, there is some maybe not low, maybe medium hanging fruit that you can use to get some extra oomph out of your test suite.

JOËL: I really like that you brought up this idea of infrastructure and skipping. I think in my own way of thinking about improving test suites, there are three broad categories of approaches you can take. One variable you get to work with is that total number of time single-threaded, so you mentioned 70 minutes. You can make that 70 minutes shorter by avoiding database writes where you don't need them, all the common tricks that we would do to actually change the test themselves. Then we can change...as another variable; we get to work with parallelism, we talked about that.

And then finally, there's all that other stuff that's not actually executing RSpec like you said, loading the gems, installing NPM packages, Docker images. All of those, if we can skip work running migrations, setting up a database, if there are situations where we can improve the speed there, that also improves the total time.

GEOFF: Yeah, there are so many little things that you can pick at to...like, one of the slowest things for us is Elasticsearch. And so we really try to limit the number of specs that use Elasticsearch if we can. You actually have to opt-in to using Elasticsearch on a spec, or else we silently mock and disable all of the things that happen there.

When you're looking at that first variable that you were talking about, just sort of the overall time, beyond using FactoryDoctor and FactoryProf, is there anything else that you've used to just identify the most egregious offenders in a test suite and then figure out if they're worth it?

JOËL: One thing you can do is hook into Active Support notification to try to find database writes. And so you can find, oh, here's where all of the...this test is making way too many database writes for some reason, or it's making a lot, maybe I should take a look at it; it's a hotspot.

GEOFF: Oh, that's really nice. There's one that I've always found is like a big offender, which is people doing negative expectations in system specs.

JOËL: Oh, for their Capybara wait time.

GEOFF: Yeah. So there's a really cool gem, and the name of it is eluding me right now. But there's a gem that raises a special exception if Capybara waits the full time for something to happen. So it lets you know that those things exist. And so we've done a lot of like hunting for...Knapsack will report the slowest examples in your test suite. So we've done some stuff to look for the slowest files and then look to see if there are examples of these negative expectations that are waiting 10 seconds or waiting 8 seconds before they fail.

JOËL: Right. Some files are slow, but they're slow for a reason. Like, a feature spec is going to be much slower than a model test. But the model tests might be very wasteful and because you have so many of them, if you're doing the same pattern in a bunch of them or if it's a factory that's reused across a lot of them, then a small fix there can have some pretty big ripple effects.

GEOFF: Yeah, I think that's true. Have you ever done any evaluation of test suite to see what files or examples you could throw away?

JOËL: Not holistically. I think it's more on an ad hoc basis. You find a place, and you're like, oh, these tests we probably don't need them. We can throw them out. I have found dead tests, tests that are not executed but still committed to the repo.

GEOFF: [laughs]

JOËL: It's just like, hey, I'm going to get a lot of red in my diff today.

GEOFF: That always feels good to have that diff-y check-in, and it's 250 lines or 1,000 lines of red and 1 line of green.

JOËL: So that's been a pretty good overview of a lot of different areas related to performance and infrastructure around tests. Thank you so much, Geoff, for joining us today on The Bike Shed to talk about your experience at CommonLit doing this. Do you have any final words for our listeners?

GEOFF: Yeah. CommonLit is hiring a senior full-stack engineer, so if you'd like to work on Rails and TypeScript in a place with a great test suite and a great team. I've been here for five years, and it's a really, really excellent place to work. And also, it's been really a pleasure to catch up with you again, Joël.

JOËL: And, Geoff, where can people find you online?

GEOFF: I'm Geoff with a G, G-E-O-F-F Harcourt, @geoffharcourt. And that's my name on Twitter, and it's my name on GitHub, so you can find me there.

JOËL: And we'll make sure to include a link to your Twitter profile in the show notes.

The show notes for this episode can be found at bikeshed.fm. This show is produced and edited by Mandy Moore.

If you enjoyed listening, one really easy way to support the show is to leave us a quick rating or even a review in iTunes. It really helps other folks find the show.

If you have any feedback, you can reach us at @_bikeshed or reach me at @joelquen on Twitter or at hosts@bikeshed.fm via email. Thank you so much for listening to The Bike Shed, and we'll see you next week. Byeeeeeee!!!!!!

ANNOUNCER: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.

The Bike Shed Follow

355: Test Performance

The Bike Shed