
Serverless Chats

Episode #51: Globally Resilient Architectures with Adrian Hornsby

About Adrian Hornsby:

Adrian Hornsby is a Technical Evangelist working with AWS and passionate about everything cloud. Adrian has more than 15 years of experience in the IT industry, having worked as a software and system engineer, backend, web and mobile developer and part of DevOps teams where his focus has been on cloud infrastructure and site reliability, writing application software, deploying servers and managing large scale architectures. Today, Adrian tends to get super excited by AI and IoT, and especially in the convergence of both technologies.

Watch this episode on YouTube: https://youtu.be/6o2owe2VHMo
Transcript:

Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm speaking with Adrian Hornsby. Hey, Adrian, thanks for joining me.

Adrian: Hey, Jeremy, how are you?

Jeremy: So you are a principal developer advocate for architecture at AWS. So why don't you tell the listeners a little bit about your background and what it is you do at AWS?

Adrian: Okay, cool. So first of all, thanks for having me on your show. I'm a huge fan of your show. As for my background, it's a mix of industry and research. Actually, I started my career at the university doing some research, and then moved to Nokia research and eventually some startups, always around distributed systems and real-time networks and things like this. And the particular thing is that much of the work I've done has always been on AWS, since the very beginning. So it kind of felt very natural eventually to join AWS, which was about four years and a few months ago. I joined as a solutions architect, and then quickly moved into an evangelist role, mostly doing architectures and resiliency and a lot of breaking things, chaos engineering kinds of things.

Jeremy: Awesome. Well, speaking of resilient architectures, that's what I wanted to speak with you about today, because you have on your Medium blog, which is awesome, by the way. I mean, I go there-

Adrian: Thank you.

Jeremy: Every time I go, and I read something there, you think you know it all, and then you read something by Adrian, and you learn something new, which is absolutely amazing. But so I want to talk to you about this, because this is something I think that ties into serverless pretty well, is this idea that I think we take for granted, especially as serverless developers, we take for granted that there is a bunch of things happening for us behind the scenes.

And so we get a lot of this infrastructure management out of the box, we get some failover out of the box, we get some of these things. But that really only scratches the surface. And there's so much further we can go to build truly resilient applications. And you have an excellent series on your blog called the resilient architecture collection. And I'd love to go through these because I think that this is the kind of thing where if you start thinking about global distribution, you start thinking about latency. You and I have been having a lot of latency issues trying to record this episode, because you're all the way in Helsinki and I'm over in the United States. These are things to start thinking about.

So I want to jump in first with this idea of embracing failure at scale. And I love this idea because when we build small systems, we think about reliability, right? We try to get as many nines as we possibly can. But when you get to the level of global distribution, distributed systems that are sending messages between components, that are sending messages across the Atlantic Ocean or the Pacific Ocean, this data is going all over the place, this idea of failure, or at least partial failure has become the new normal.

Adrian: Yeah. So I think things have changed a lot in the last few years. I mean, before, you had the monolith application, and you were trying to make sure your monolithic application was always up and running, right? I think there was even some competition around uptimes; it was very popular back then to look at the uptimes of servers and say, "My server's been up for 16 years, wow, awesome." But now we've moved away slowly from monoliths to microservice architectures, and especially, I think, as we move to the cloud and we use more third-party services, systems become naturally more distributed, and they go over the internet, which is everything but a reliable source of communication.

So, you have network latency, you have network failures. So there's a lot more things that can go wrong. And I think understanding and accepting that anything, at any time can fail is actually a very important thing. Because it means that you accept failure as a first class citizen for your application. And then you need to write code and design applications so that at any moment in time, there can be failures, and that's called partial failure mode, as you said. And it's very different concept than what it used to be back in the day and that means that you need to design your application with different characteristics and different behavior.

Jeremy: Right. And so if you're designing your system with these different characteristics, and you're, you're forward thinking to this idea of resiliency, and again, you have a whole bunch of stuff that you do on chaos engineering as well, which is this idea of injecting failure into the system to see what happens when something breaks. But that is quite an investment, not only an investment in learning, right? You have to learn all these different parts of the cloud, and all these other failover systems and what's available from that standpoint, but also an investment in terms of building your application out that way.

So you mentioned in the article, this idea of the investment of building in this resiliency versus what that lost revenue might be if something fails. So if your billing service goes down, or your payment service goes down, and you can't charge credit cards anymore, if that's just the end of it, right? Like you just say, "Hey, we can't charge billing or we can't charge your credit card, so our site's down." Versus building something that says, "Well, we can't charge your credit card right now, but we can take your credit card number and we can calculate the order total and those sorts of things." So what is that trade off that companies should be looking for, in terms of, as you put it, lost revenue versus the investment in building these resilient architectures?

Adrian: Yeah, it's a very good question. I think first and foremost, always start from the business side. It's about understanding what the requirements are in terms of availability, because the more nines of availability you want, as a matter of fact, the more work you're going to have to put in, and the more resources you're going to have to use, and that resource is money, right?

And especially, I think the work around availability and reliability is not really linear. At the beginning you get a lot of gain with small work, but the more nines you want, the bigger the investment you have to make to gain each additional nine, right?

So it's increasingly hard to reach more nines. So, you have to really think about what it is that you want to achieve as a business. And I always tell customers to start from a customer point of view as well. Like, what kind of experience do we want? And as you said, maybe the ultimate experience is a fully working site, but what are the possibilities for you to maybe degrade the experience when you have an outage and still be able to deliver service? I always take the example of moving a website into a read-only mode. Whether it's Netflix or Prime Video or even Amazon, when something doesn't work, what kind of features can you still provide to your customer without giving them a blank screen and saying, "Oh, sorry, our database doesn't work. Therefore, you cannot use anything on our website"? I think there are tons of things that you can do in between. And it's all things that you have to take into consideration.

And then of course, it's like, where do you invest? A lot of people start with the infrastructure, but you have to realize that resiliency is not only infrastructure: it goes from the infrastructure, of course, to the network, the application, and also people. We've talked about people resiliency for some time and it's also very important.

Jeremy: Yeah, no, and I think that's interesting, too, about this idea of redundancy in there as well. Because obviously, redundancy is still a big part of it. I just, even if we build a system that says, "Hey, if the credit card system goes down, we can still accept credit cards." Really, what you'd like to be able to say is, "Well, the credit card system goes down in this region, and we can maybe failover to this region and still provide that service." And that degradation might be a latency increase, for example, right? And so I really love that idea of duplicating these components.

Obviously, there is a lot that goes into that when you think about duplicating components. You have databases that need to be replicated, you've got all kinds of other things that become-

Adrian: It's more complex.

Jeremy: Yeah, exactly it gets to be more... And it goes back to the investment and time, right? Like, what is the investment, is that something we want to be able to do is provide four nines or six nines of uptime, or whatever it is.

You actually outlined in this post the formula for this, and I don't want to get overly technical around this. But essentially, just to sum it up, if you want to get four nines, you need to have three separate instances or three separate components running, I guess, or regions running in order to get that. And so, is that something though, that some of those multiple nines are built into existing AWS services?

Adrian: So, well, you're touching a very big thing. I think the formula says simply that if you have one component, and this component is, for example, available 99% of the time, right? Which is not really good, because it means that you're accepting about three days of downtime per year, which is quite a lot.

So let's say you have an instance running somewhere. The simple fact that you duplicate that instance increases the availability to almost four nines, right? So you go from two nines to four nines. That gives you about 52 minutes of downtime per year, and then you do this another time, it gives you six nines, which is about 31 seconds per year. So I mean, this is not a new formula. It's been used for electrical components in many industries; in the nuclear industry there are sometimes six levels of redundancy to make sure that electricity always powers the plant and all this kind of stuff.
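
To make that math concrete, here is a minimal sketch of the availability formula Adrian is describing, assuming the redundant components fail independently (which real systems only approximate):

```python
# Availability of n independent redundant components, each with availability "a":
# the system is down only if all n fail, so A = 1 - (1 - a) ** n.
MINUTES_PER_YEAR = 365 * 24 * 60

def parallel_availability(a: float, n: int) -> float:
    return 1 - (1 - a) ** n

for n in (1, 2, 3):
    avail = parallel_availability(0.99, n)
    downtime_min = (1 - avail) * MINUTES_PER_YEAR
    print(f"{n} component(s): {avail:.4%} available, ~{downtime_min:,.1f} min downtime/year")

# 1 component : 99%       -> ~5,256 minutes (about 3.7 days) per year
# 2 components: 99.99%    -> ~53 minutes per year
# 3 components: 99.9999%  -> ~0.5 minutes (about 31 seconds) per year
```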

Now, and as you pointed out, there's also a problem with redundancy because it adds more complexity, right? So there's a trade-off between how many nines you want and how much complexity you're willing to accept. So the key there is automation, right? And, of course, if you think about AWS managed services, they already have this idea built in: they are using three Availability Zones under the hood, so exactly this duplication, this redundancy, to provide this kind of availability so people don't have to build it themselves.

So some services provide it, some don't. There are regional services and zonal services. I think the most important thing is to understand it a little bit. So even if you use a managed service, being a little curious about how things are built under the hood gives you a very good idea of your levels of availability, or possible availability, because of course AWS just provides infrastructure availability, right? It doesn't provide your application's availability.

So even if you use three AZs under the hood, if your application doesn't use them to the full extent of their capability, you won't have four nines of availability, right? So you have to go through the entire stack: you have to use the infrastructure, you have to use your network, you have to use the application, and then of course, people. Because if your application is deployed across three AZs, but when you deploy it you break it, well, you lose the benefits, right?

So it really is a synchronization of all the layers and of what you want to do. That's why it's complicated. And I think that's why it's also important to understand how things work.

Jeremy: Yeah. No, I mean, I think this idea to have repeatability of deployment, this is the infrastructure as code idea. And the other thing you go into, and you have a bunch on your blog about this as well, is this idea of immutable infrastructure, which is just, again, an entire podcast in and of itself, because it's a whole other probably deep thing that we could go down. But I think the basic idea behind that is just this thought that rather than me trying to update in place, which you always have problems, and of course, with EC2 instances it's an even bigger problem. But certainly with serverless applications, if you're just switching out a Lambda function, or something like that, but can you explain just quickly this idea of immutable infrastructure and how it relates to serverless?

Adrian: Right. I mean, immutability is a problem in computer science in general, not only in the cloud but also in programming languages. If you look at Python versus SQL, for example, and how you assign variables and how state can be shared between variables, it gives headaches to developers every day. So the idea of immutable infrastructure is very similar to that: you have a state in the cloud, you have an infrastructure running, why do you want to change it? Don't change it, keep it there. If you want to modify it, deploy something next to it, parallel to it, like a duplicate of it with the new version, and then slowly move traffic to that new version. That gives you two things. It gives you a working version that you protect, in case anything goes wrong during your deployment. Sometimes your deployment might work, but after an hour of traffic the cache warms up and all of a sudden you have issues.

Well, you can roll back very fast: you just need to move the routing back to the existing infrastructure instead of having to redeploy your old application and redo everything. When you have an outage, I think the most important thing is to react without overreacting, right? So the idea that you have something there that is safe to go back to, and that is protected, is actually very, very nice. And this is what we call immutable infrastructure. And there are many ways to do that, whether it's a canary deployment, A/B testing, or blue-green. I prefer canary, because it's a progressive rollout. But there are many ways to achieve these kinds of things.

Jeremy: Right, yeah. And those Canary deployments are built into API gateway if you're using Lambda functions, for example. But you still have other components that aren't Lambda functions, right? So let's say that you deploy a new version of, I don't know, an SQS queue or maybe a DynamoDB table or something like that. I mean, that's also where things get a little bit hairy, right? Where you start sharing things that are data related.

Adrian: Right, yeah. And I mean, this is very, very true. I think when you make a deployment, it's very important to understand what your deployment is going to affect. If it affects the database, definitely you're going to have to do something else; you cannot exclude the database from your canary. So you might not do a canary deployment, you might be doing a safer deployment, something with progressive rollouts within your existing infrastructure, because you need to do a schema update. But it's not all white or black. Most of the time when you do deployments, you do not do database schema changes. It's definitely important to try to make those as safe as possible, right? And it's not only serverless, it's any other infrastructure too, and on AWS you can do this in many different ways: Route 53 with weighted round-robin, ALBs support weights for target groups, API Gateway supports stages with canary releases, and actually even a Lambda function supports canary with alias weights, right?

So I think the most important thing is to understand that it's not all black and white. Sometimes you might want to do a deployment that is more problematic, but the important thing is to limit the number of those, right? At least that's how I feel about it. Sometimes you can do a mutation, sometimes not, but I think most of the time you should be deploying immutably.
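
As one concrete example of the weighted options Adrian lists, a Lambda alias can split traffic between two published versions. This is a minimal boto3 sketch; the function name, alias, and version numbers are hypothetical:

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical function/alias/versions for illustration.
# Keep the "live" alias pointing at version 5, but send 10% of
# invocations to the newly published version 6 (the canary).
lambda_client.update_alias(
    FunctionName="checkout-service",
    Name="live",
    FunctionVersion="5",
    RoutingConfig={"AdditionalVersionWeights": {"6": 0.10}},
)

# Once metrics look healthy, promote the canary by pointing the alias
# fully at version 6 and clearing the extra weight.
lambda_client.update_alias(
    FunctionName="checkout-service",
    Name="live",
    FunctionVersion="6",
    RoutingConfig={"AdditionalVersionWeights": {}},
)
```

Tools like CodeDeploy can automate this same alias-based traffic shifting with alarms and automatic rollback, but the alias weight is the underlying mechanism.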

Jeremy: Right. Alright, so let's move on to the next one which was avoiding cascading failures, right? And so this is another thing where from a small level, it's really not that big of a deal. It's like, "Oh, the queue backed up, and then so it maybe is more aggressive in trying to call some other third party API. And maybe that gets overwhelmed. And we could put some circuit breakers in there, we could do some of these other things." But there is a lot that can happen at scale, right?

Like if you have maybe 100 queue messages per second, or 100 queue messages per minute versus 10,000 queue messages per second, when those start backing up, and those start retrying, and then some of those get through to the next component, and then that component backs up and starts retrying. I mean, you just have this very, very vicious cycle that can happen. And a lot of that is not built in for you.

Adrian: No, it's a very good point you mention. I think the deadly part of the architecture in distributed systems is that you have many layers. And very often each of those layers has its own timeout and retry policies. And most of the time, the default timeouts are absurd. In Python, the requests library, for example, has an infinite default timeout. So that means if your third party doesn't answer, it will keep the connection open indefinitely, right?

Jeremy: Right.

Adrian: So that means your client has a different timeout, and very often it does, because client libraries are very often within five to 15 seconds. So your client will have a retry policy and eventually will retry to get the data. And that means all of a sudden you exhaust the number of connections on the backend side, your connection pool runs out of free connections, and that server becomes unreachable. And what does the client do? It retries against the other servers, and eventually you have these cascading failures, because all the clients are going to be retrying against all the servers one after another, and eventually everything runs out.

So it's very important in distributed systems to really understand the timeouts first and set them yourself. When you do an npm install or a pip install, you are installing a lot of libraries from other people who maybe didn't think about your particular use case. And very often I see teams not looking at those timeouts, or they use the system default, and what is the system default? Well, it's...

Jeremy: Don't even know.

Adrian: No one knows. So I mean, I think it's very much related to operational excellence: how is my application behaving? What are the defaults? What are the retry policies? Are you going to retry a hundred times? It makes no sense, you know?

Jeremy: Right.
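
For the Python requests library Adrian mentions, here is a minimal sketch of setting explicit timeouts instead of relying on the unbounded default; the endpoint and values are only illustrative:

```python
import requests

# requests has no default timeout: without this argument a call can hang
# indefinitely if the remote side never answers.
try:
    resp = requests.get(
        "https://api.example.com/orders",   # hypothetical endpoint
        timeout=(3.05, 10),                 # (connect timeout, read timeout) in seconds
    )
    resp.raise_for_status()
except requests.exceptions.Timeout:
    # Fail fast and let the caller decide whether to retry or degrade.
    raise
```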

Adrian: So you might want to retry once, twice, maybe three times. But don't create more problems if your system is experiencing issues; I think failing fast is very important, especially in distributed systems. And if you really have to retry, don't retry aggressively, maybe retry with an exponential back-off, and especially give the system some time to recover.

So the problem is that a lot of the time the library retries maybe even a few times per second. It's like, "Oh, you didn't get a response, let's retry, let's retry." It's like what the kids do in the car: dad, are we there yet? Are we there yet? Are we there yet? It's very annoying for the driver, and for the backend. So it's the same in a distributed system. What you want is either to do pub-sub, where you say, "Okay, let me know when you get the data," or you make a retry, and then the next time you ask after 10 seconds, then 20 seconds, and the longer you wait, the longer the interval between the retries. And that's what's called exponential back-off.

Jeremy: Yeah, well, the other thing with exponential back off too, is it can be tough if you have like 1000 requests to try to go through and they all fail, and then they all retry again in one second, and then they all retry in two seconds, and then four seconds, and then 16. So they keep doing the exponential thing. That's why this idea of using something like Jitter, which just randomizes when that request or the retry is going to be is a pretty cool thing as well.

Adrian: Yeah, exactly. And it's important, especially in distributed system, because you don't want to have all your distributed system to retry at the same time exponentially as well, you explained it very well.
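
A minimal sketch of that retry pattern: a small, capped number of attempts, exponential back-off, and full jitter so clients don't all retry in lockstep. The callable being retried is hypothetical:

```python
import random
import time

def call_with_backoff(fn, max_attempts=3, base_delay=0.2, max_delay=5.0):
    """Retry fn() a few times with exponential back-off and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail fast, don't pile on more load
            # Exponential back-off capped at max_delay, then full jitter:
            # sleep a random amount between 0 and the computed ceiling.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))

# Usage (hypothetical callable):
# result = call_with_backoff(lambda: fetch_order("12345"))
```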

Jeremy: So the other thing, you mentioned idempotency earlier, we talked about immutable infrastructure and that sort of stuff. Idempotency is another main issue when it comes to retries. And I talk about this all the time, because essentially, if you retry the same operation, just because it looks like it didn't complete, doesn't mean it didn't complete, right?

Adrian: Right. Exactly.

Jeremy: Because there's also a response that can fail, not just the request itself. So that's one of those things where, if people are unfamiliar with idempotency, it's just the idea that you can retry the same thing over and over and over again. Anytime you start seeing these failures, from a resiliency standpoint, we can buffer those and we can retry, and we talked about that, but what are other ways that we can deal with these failures so that we can respond back to our customer and let them know what's happening?

Adrian: Well, one is degradation, and you can degrade with stale data, right? So maybe you can't access the database, but hey, what was the last known version of that data? Maybe serve that, for example using a cache, right? A cache is a good way to serve requests to customers, even if your database is not working. And that's also why we often use CDNs, or actually you have caches at every layer: the client, the CDN, the backend, and even the database very often has a cache.

So, having all those layers of cache can also add some complexity and some problems. But the idea is that if you can't serve data immediately, maybe there's a version of it that you can serve. And then maybe dynamic data can be replaced with stale data. A good example: Netflix has a very nice UI, with a lot of different microservices for each of their recommendations. If one of them doesn't work, or if a few of them don't work, they fall back to data stored in a cache, for example, the most popular titles in the US today. Well, it's not dynamic. It's something that can be processed once a day, and then you serve it from the cache.

So that means that if your system is experiencing issues, instead of having all your customers query the database for their particular personalized profile, well, you serve stuff from the cache, right? That gives you a way to free some resources on your backend. So that's one thing to do, and that's used all over the place as well. And that's the idea of circuit breakers as well, right? You have a dependency that fails, and then, okay, if that dependency doesn't return, what do you serve? At Amazon, it's quite funny, but we love serving the cute dogs of Amazon. I don't know if you've seen this?

Jeremy: Yeah, I have, yeah.

Adrian: When Amazon doesn't work, service doesn't work, we return cute dogs and all that is on cache as well.

Jeremy: Yeah, so I think circuit breakers are one of those things where I don't think enough people use them, right? Because that's one of those things where when we start overwhelming a downstream resource, we have to do something to stop overwhelming it, right? So even with those exponential retries, or the exponential back off, and the retries, and the jitter, and all that stuff. If we keep trying the same thing over, and over, and over, and over and over again, eventually we're just going to build up so much load in our queues that it's going to take forever to work through it.

Adrian: Yeah.
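
For readers who haven't used the pattern, here is a very small sketch of a circuit breaker; the thresholds and the fallback (think of Adrian's "cute dogs" page) are placeholders, not a production implementation:

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; probe again after a cooldown."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()          # circuit open: don't hit the dependency at all
            self.opened_at = None          # cooldown elapsed: half-open, allow one probe
        try:
            result = fn()
            self.failures = 0              # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

# breaker = CircuitBreaker()
# page = breaker.call(fetch_recommendations, fallback=serve_cached_top_10)  # hypothetical callables
```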

Jeremy: So there is this thing called load shedding, a whole other crazy-

Adrian: Rejection.

Jeremy: And rejection and things like that. Can you explain that a little bit? Because I think that is definitely something that's not built in that you would have to manage yourself.

Adrian: Right. So I mean, the idea is to protect your backend as much as possible. And there are a few ways to do that. Of course, the clients can try to protect the backend by doing retries with back-off, but the backend itself, at some point, if it really is overwhelmed by requests, can do a few things, right? It can simply reject requests and say, "Okay, no, now I'm at capacity, and your API is not the priority, so I'm not going to deal with it," and that's rejection. And you can do load shedding, as you say.

You know how much time it takes to process a request, right? So basically you say, for my request not to reach a timeout, I need to process it in at most seven seconds, right? If it starts to take too long, if the latency for handling requests starts to increase, you can simply shed the load. You say, "No, I'm not taking any more requests now, because the latency for my requests is at its maximum," right? So that's another possibility.
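
A minimal sketch of that kind of latency-based load shedding; the seven-second budget comes from the example above, and the handler and metric are hypothetical simplifications:

```python
import time

LATENCY_BUDGET_SECONDS = 7.0   # the request-processing budget from the example
recent_latencies = []          # in a real service: a sliding window or percentile metric

def handle(request, process):
    # If recent requests are already near the latency budget, shed load up
    # front (HTTP 503) instead of queueing more work that will time out anyway.
    window = recent_latencies[-20:]
    if window and sum(window) / len(window) > LATENCY_BUDGET_SECONDS:
        return {"status": 503, "body": "overloaded, please retry later"}

    start = time.monotonic()
    response = process(request)
    recent_latencies.append(time.monotonic() - start)
    return response
```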

And then, I mean, another very important thing is rate limiting, right? And again, it sounds very simple, and I know people don't very often implement rate limiting for their own services, but they should. Because sometimes, one day, one of their own services might do something wrong and go into an infinite loop, because you did a deployment and a configuration was not right. And then you have an infinite loop of requests to a backend that totally destroys your backend. If you had had a rate limit in place for your internal services, that would have been avoided. And I like this idea of rate limiting even your own services, because then you can establish contracts between different services and different parts of your system. You say, "Okay, my backend is this. This is the API, and these are the contracts. When other teams accept to use my service, they agree to that contract." And then if they need more requests, they have to modify that contract.

So that means my backend team knows what is happening with my service. Because I've seen this happen a lot. You have a distributed architecture with different teams handling different services, and your service becomes popular and all of a sudden other teams start to use it, and they don't tell you about it. And then you have a marketing campaign and no one tells you about it. And the marketing campaign all of a sudden is worldwide and everyone downloads or connects to the same endpoint at the same time. That's because there was no contract between the teams. No one agreed, "Okay, my service can only handle a thousand requests per second for you. And if you want more, you need to modify your limits." In fact, this is also why we have a lot of limits on AWS: because we have so many distributed services, teams are forced to negotiate to make sure that we don't kill other people's services, right? So I like this idea of API contracts.
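
A small sketch of per-caller rate limiting in the spirit of those contracts, using a token bucket keyed by API key; the limits and key names are made up:

```python
import time
from collections import defaultdict

RATE_PER_SECOND = 1000     # the "contract" agreed with the calling team
BURST = 1000               # how much the caller can burst above steady state

class TokenBucket:
    def __init__(self):
        self.tokens = BURST
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst size.
        self.tokens = min(BURST, self.tokens + (now - self.updated) * RATE_PER_SECOND)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = defaultdict(TokenBucket)

def check_rate_limit(api_key: str) -> bool:
    """Return True if this caller is within its contract, False to reject (e.g. HTTP 429)."""
    return buckets[api_key].allow()

# if not check_rate_limit("marketing-team-key"):   # hypothetical key
#     return {"status": 429, "body": "rate limit exceeded"}
```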

Jeremy: Yeah, no, that rate limiting thing too is, this is something that I don't think people think about. You're right, like if I have a service that is my customer service, and some team is responsible for building that. And also the fallacy that serverless is infinitely scalable too if we think about that, like nothing is infinitely scalable, right? Things can be designed to scale really well, and handle load, and scale up quickly, like that is possible to do still a lot to think about. But the rate limiting point you make is really, really good. Because if I'm a team, to go back to that customer example. I build the customer service, and our marketing team comes along and says, "Oh, well, I need to look up this, you know, every time somebody signs up with some form, I need to check to see if they're already a customer and do something with that."

If that is some massive thing, where all of a sudden, now you're getting 10,000 requests per second. Well guess what? Your Ecommerce system that's also hitting your customer service, now all of a sudden, that can't get the data that it wants, right? And it's this noisy neighbor type effect in a sense, where you're depleting services, or you're depleting resources from your own services. So I love that idea of contracts, rate limiting. I mean, even giving, if an internal team is accessing your own service, handing out API keys with special rate limits and quotas and things like that, I think that makes a ton of sense. So I love that idea.

Adrian: Yeah. And it gives the team building the service a good understanding of what's required in terms of scalability. Because if you have only 1,000 requests per second, it defines the kind of architecture you can build. But if that becomes a lot more, if all of a sudden you realize, hey, you've given out a lot more API keys, and each of those API keys has a thousand requests per second, it can easily go to a hundred thousand per second at very large companies, and that's a different architecture. It might actually change the entire architecture, because all of a sudden you have other kinds of considerations to take into account. So it's super important, because it gives the team the ability to understand the service they're building, its scalability patterns, and to prepare for it. And it's what we call a cell at Amazon. You've heard the term cells, right?

Jeremy: Yeah.

Adrian: So, we define the size of a cell based on this kind of thing. The scalability patterns, the rate limiting, all these kinds of things that are necessary to serve customers well.

Jeremy: Right. And so speaking about system availability, we do need a way to know whether or not our systems are available, and that way is typically using health checks. So you have a whole another article on health checks in this series. The thing that I really liked though, was your description of shallow versus deep health checks. Because I think this is something that not everybody, it's not necessarily intuitive to some people.

Adrian: Yeah. So I'd say a shallow health check is, for example, I'm asking you, how are you? And you define "you" as only you and nothing else, right? So you tell me, I'm fine. But you could also decide, hey, there are all my family members as well in "you," and tell me, "Oh, I'm fine, but my wife is tired. My kid is at school." And this is a deep health check, because you go much deeper into what "you" is.

It's the same for an instance, the same for a system. When you ask the health of an instance, for example, you can say, "Oh, is my instance up and running?" Yeah, that's shallow. Okay, cool. You have access to the local network, you have access to the local disk. That's shallow, right? It's your immediate environment. But if the instance needs to talk to a database or a cache, or send API queries to a third-party dependency, well, that's kind of the second level, right? That's also part of its health. Because if it can't reach the database, well, maybe it can't do everything.

So the deep health check is around that: it's really understanding the dependencies, the second-level and sometimes even third-level dependencies of what you're trying to contact, and then reporting that. Because once you report that, you can adapt your queries. You can say, "Okay, my service doesn't have its database. So I won't do queries that change state, for example. I won't try to change my profile picture or change my name." I can't offer that, but maybe I can offer APIs that are read-only. And so this kind of health check gives the client the capability to degrade more wisely, right?

But of course, you have to be careful about what you tell the client is available. Because then you have hackers who can also understand how the system is built. So that's why, when you build a health check, very often it's unique to each company: how they work, what they report to the client, and things like this.

Jeremy: Yeah, I just think it's interesting, because I've seen a lot of people build a health check for like an API or an API Gateway, where they have the health check is just a Lambda function that just responds back and says the service is up and running. I think it was that like, I'm not sure what you're checking there other than that-

Adrian: Lambda works.

Jeremy: That Lambda works, right? And that the Lambda service is up and running, which is funny. But that's the kind of thing where if you were building like a serverless health check, like, you think, well, the infrastructure is up and running. But you could do things where if you are connecting to a database, like what's the... maybe you're collecting some metrics, what's the average database load? What's the average response time, though or the latency there? Are you able to connect to a third party service? How many failures have there been to a third party API in the last minute, or the last real rolling five minute window or something like that?

So I think that's important to understand, because then you can build rules around that, in order to decide whether or not a service is healthy enough for you to keep sending traffic to it.

Adrian: Exactly. I mean, even if it's healthy, for example, even if you can query the database, if it answers after seven seconds, is this healthy? You can have a deep health check that answers, I'm okay, but it takes seven seconds, and so it forces you to define thresholds as well. And this is what we discussed: okay, how fast do you want your service to answer? And that defines it, so this is a business requirement.

You say, "Okay, my customers needs to be able to access data in four seconds." If it's not, then you shed, you do something else. And that defines a lot of the default, or that you're going to have to put in your systems and then, and it just helps you understand and build the system a little bit more predictably.

Jeremy: Alright. So let me ask you this question. Let's say we build in these really great health checks and we set some thresholds. We say, if the data doesn't come back within four seconds, or whatever it is, then we want to route that to a different service or to a different region or something like that. What happens if all of your services are coming back with bad health checks?

Adrian: Yeah, this is a good point. You can have bad health checks like this when you make configuration mistakes, or you do a deployment and something doesn't work. And if everything fails at the same time, you have to assume that it's not actually broken, right? That's called failing open. It means, okay, it might be a health check problem, so let's continue sending traffic to the environment and hopefully things will work. And this is what we have in place on AWS: if you look at all the systems that implement health checks, like Route 53, the ELBs, Lambda, API Gateway, and all these kinds of things, if all the health checks fail at the same time, we fail open, so we assume it's more likely a health check problem than an infrastructure problem.

Jeremy: Right, yeah. And then the other thing too, that's kind of cool. And this is just something where I don't think people understand how powerful Route 53 is. Because if you think about your normal load checks, or your health checks, I know I always would think about application load balancers or elastic load balancers. But that is region specific, right? So if, again, I can't health check across 10 different regions or five different regions with an ELB. I need to do that at a higher level, and Route 53 has a ton of capabilities to do this.

Adrian: Right. So now, Route 53 allows you to do health checks on many different levels. And what's the nicest feature of Route 53 is that when it checks the health check, it uses eight regions by default from around the world, right?

So sometimes on the internet you have regional outages, and sometimes one route over the internet doesn't work, but it doesn't mean other routes from outside don't work. So Route 53 allows you to get around these kinds of regional outages, or intermittent regional outages over the internet, because it checks from eight regions, and then three Availability Zones for each of those regions. This is where the 18% comes from that's written in the blog. It's a bit weird, but the idea is that if 18% or fewer of the health checkers report the endpoint as healthy, it's considered unhealthy, because that's not enough. It means something is wrong.

Jeremy: Yeah, so alright. So then you figure out that a particular thing is healthy, you get this consensus, which is crazy, because you're right, this 18% is a weird number. So basically, a lot of the checkers can fail, as long as enough of them still see it running. But once it decides that something is healthy and it starts routing traffic there, there's also a bunch of other capabilities, where it's not just routing based off of it being available. That's one part of it, but then also you can route things based off of geographical distance, you can route things based on latency. Yeah, so how does some of that stuff work?

Adrian: So that's the case where you have two systems, or two environments, and then you want to switch between one environment and the other. And I would say multi-region, in that case, is really the idea that when you are a customer, you want to have data fast. And if you want to have data fast, then you want to have low latency. To have low latency, you have to have your data as close as possible to the end user, right? So in the last, let's say, five, six years, we've seen an explosion of multi-region architectures, because now we have global customers, right? App stores have exploded, basically we have customers around the world, and gaming is a very good example of that: players want to have as small a ping as possible, right? So the latency should be as small as possible.

So we have to have systems that are deployed very close to the customers, so in multiple regions. And then you need to figure out, for that user, how you route them to a particular backend or particular environment. And there you have different policies, right? These are the policies you were mentioning in Route 53: geographic, latency-based, weighted round-robin. And then, of course, all of that supports what's called failover. So if any of those fail, you can fail over to another region. Route 53 gives you very flexible and very powerful routing possibilities. It can be complex, but yeah, this is the idea behind it.
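
As an illustration of those policies, here is a hedged boto3 sketch that upserts latency-based records with health checks for two regional API Gateway endpoints; the hosted zone ID, health check IDs, and endpoint names are placeholders, and a real setup would also need the matching custom domain configuration:

```python
import boto3

route53 = boto3.client("route53")

def latency_record(set_id, region, target, health_check_id):
    # Latency-based routing: Route 53 answers with the record whose region is
    # closest (in measured latency) to the caller. Records whose health check
    # fails are taken out of rotation, which gives you regional failover.
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": set_id,
            "Region": region,
            "HealthCheckId": health_check_id,
            "ResourceRecords": [{"Value": target}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",   # hypothetical hosted zone
    ChangeBatch={
        "Changes": [
            latency_record("eu-west-1", "eu-west-1",
                           "abc123.execute-api.eu-west-1.amazonaws.com", "hc-eu-id"),
            latency_record("us-east-1", "us-east-1",
                           "def456.execute-api.us-east-1.amazonaws.com", "hc-us-id"),
        ]
    },
)
```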

Jeremy: The thing that's important, though about this multi region or these multi region architecture. And of course, if you're using latency based routing, or you're using geographic based routing, or even just Round-robin routing, you're looking at sort of this active-active type environment, right? So it's, this is not if this one fails, then shift all the traffic to this, this is I have a region in Europe and I have a region in the US, so I want to minimize my latency for customers based on that. So when you're designing those types of multi region active-active systems, especially from a serverless standpoint, you want to be using regional API's instead of edge optimized ones, correct?

Adrian: Correct. Yeah. So I mean, this is especially if you use API Gateway, right? API Gateway, when it was released, came with an integration with CloudFront, so basically you got a domain name which was, well, not regional, it was global. So you couldn't really use Route 53, which is also a DNS provider, to route traffic to that particular API Gateway. Then, I think about a year and a half ago, API Gateway released regional endpoints. That means now you can have an API Gateway without the CloudFront integration, and you can use Route 53 to route traffic directly to the API Gateway in your region. Actually, you can have several API Gateways in one region, as long as they have regional endpoints. So you can have, I would say, multi-API Gateway routing in one region via Route 53.

And this goes into a more complex discussion, because I see a lot of people design one application per region, right? So you use one API Gateway for your serverless application per region. But if you think about it, the blast radius is high, because you have one API Gateway for one region. What if that API Gateway has an issue? Well, you could have several API Gateways, you could have two, three, four, and so I think you need to think about how you shard an application. The idea of sharding is, okay... at Amazon we call that a cell. We say, "Okay, we have a cell," which has an API Gateway, maybe Lambda, DynamoDB. And that cell will take, let's say, 100 customers, right?

As long as there are fewer than 100 customers, we only have that particular cell. But as we get more customers, instead of growing the cell, which we know how it behaves, because we've been testing it, we understand the pattern, we can deploy it, it's repeatable, it's very well understood, well, we replicate that cell. So at some point we might have hundreds or thousands of cells in one region. And you can do this as a customer today, too. You can say, I want to have 10 API Gateway, Lambda, and DynamoDB sets per region, why not? Now, it doesn't mean you need to do it, you need to really understand the thing. But the idea of a regional endpoint for API Gateway is that you are not limited to one per region, right?

Jeremy: Right. Yeah, no, and I think that's one of those things too where just I've always fallen back on the edge optimized ones, because it was just there. But now with the new HTTP APIs, you don't have those. So I like that idea though, of being able to say, I have more control now of which endpoint my user gets to, I can control that latency a little bit better. So, interesting thing to think about, certainly, another thing to put on your list of things to learn and things to do.

Alright, so we talked a little bit about caching earlier. But you have a whole other article on this about caching for resiliency. So there are obviously a million different things to think about when it comes to caching, how many layers of caching do we need that kind of stuff. But there are some really good reasons for putting, let's say, a CDN in front of your application.

Adrian: Right. So yeah, I think the CDN is probably the first layer of cache that people tend to use, simply because it has massive security implications, right? A CDN, to explain a little bit, is a collection of servers that are globally distributed, closer to the customer. Each location is basically a Point of Presence, a PoP, as it's called for CDNs. They don't match the AWS regions; there are way more PoPs around the world than there are AWS regions. So that means they are way closer to the customers.

So each of these PoPs is basically an entry point into an application, right? When you use a CDN, the CDN uses traffic policies that make a customer's request go to the closest PoP available to the customer making that request. So it also allows you to have hundreds of entry points into your application dispersed around the world. That means, instead of having one entry point, for example the API Gateway, you put a CDN on top of it and you have hundreds of entry points hiding your API Gateway. So doing a DDoS attack on the API Gateway becomes a lot harder, because all of a sudden your attacker needs to attack hundreds of Points of Presence on the CDN to be able to do a DDoS attack, right? So people use CDNs to improve the latency of static content, but also to make things resilient to DDoS attacks. Because, well, it's much harder to attack 160 Points of Presence versus one, right?

Jeremy: Right, and CloudFront and AWS WAF have some of these things built into them too, to protect against things like UDP reflection attacks and SYN floods and some of those other things. So a lot of that is good protection just from, I guess, a resiliency and uptime standpoint, and we talked about overwhelming systems, right? So if you have a DDoS, or something that's happening that is overwhelming that system, being able to shed some of that at the CDN layer, because AWS is smart enough to pick that up, is just great. It's a good layer to have there. But there's more beyond just the, I guess, the hacker attack or the protection that you get there, depending on what type of data you're serving up.

So let's go back to the Netflix example, that top 10 movies in the US or whatever it is, top 10 shows in the US, that particular thing is pre-generated, and it is served from cache. And that gives us this idea of I guess, page caching even, like even if it's a short amount of time. So what are some of those things that you can do where you can reduce pressure on that downstream system or on your system? Like some of those different techniques for caching.

Adrian: Right. So, to continue on the CDN, like you said, a CDN is actually a cache, right? It's a cache layer to serve content. Primarily people use a CDN to serve video, or pictures, or static files, things like this. They often forget that a CDN, CloudFront for example, can also cache dynamic content, and dynamic content, even cached for a couple of seconds, can save your backend. Let's take the marketing campaign, for example. Let's say the marketing campaign is actually a dynamic list of something that changes, and people press refresh every second because they want to see who is winning the competition or something. If you don't cache even that dynamic content, the content of the list, even for a few seconds, that means everyone is going to query the backend, right?

So I think caching dynamic content, even for one or two seconds, is very, very important, right? Think about the possibilities of doing that. That's also something that CDNs can do, and people often forget about it. So, does that answer your question?
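
As a tiny illustration, a handler behind a CDN like CloudFront can ask it to cache a dynamic response for a second or two via a Cache-Control header; the payload and max-age are just examples, and the effective TTL also depends on the cache behavior's settings:

```python
import json

def handler(event, context):
    leaderboard = get_current_leaderboard()   # hypothetical function that hits the database

    return {
        "statusCode": 200,
        "headers": {
            # Let the CDN serve this response for 2 seconds before revalidating.
            # Even a 1-2 second TTL collapses thousands of identical refreshes
            # into a single request to the origin.
            "Cache-Control": "public, max-age=2",
        },
        "body": json.dumps(leaderboard),
    }
```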

Jeremy: I think it does. And I mean, I guess where I'm trying to go with this, too, is that there's just a million different things that you can do to cache, and there's multiple layers of cache. There's multiple strategies or caching patterns that you outline in here. And I think that, people need to go read the article, I think in order to really understand this stuff. I don't know how much justice we're doing it trying to explain some of it. But I think one of the things we should touch on, just in terms of caching in general, and this is a quote that you have in your article is that for every application out there, that there is an acceptable level of staleness in the data. So just what do you mean by that?

Adrian: So, it's exactly the idea of caching dynamic content. Well, even if you claim your application is very dynamic, and you claim that, no, I need to, I can't cache because for example, it's a top 10 list of real time trends on Twitter. Let's say Twitter trends, right?

Jeremy: Right.

Adrian: People expect that it's real time. So I would say, by default, if you think about real time, people wouldn't think, "Okay, I need to cache that." But if you have millions of clients around the world requesting that data, absolutely you're going to fake the real time. You might query your downstream server or service that tells you the trend, but if you have thousands of clients connecting at the same time, you don't want each of those clients to query your service; you will just serve it from cache, or make sure the requests are packed into one single request to the downstream service and then serve back the content.

So it's just this idea that for any application out there, even if you think it must be real time, it's very important to think about staleness. And staleness is how real time my data needs to be: even if it's maybe three seconds old, is it really too old to be usable? Because it's also something you can fall back on. If your database is not accessible, maybe you can serve back the Twitter trend from an hour ago, and you can say to your customers, "Oh, we're experiencing issues, this is the trend from one hour ago." And that's fine, it's a good UI. It's a good use of stale data. And why would a customer say, "Oh, you're cheating on us"? I think it's a good example of that.

Jeremy: But that's actually, that brings me to the caching patterns, because that is one of those things where, like you said, there's acceptable levels of staleness in data. So again, if a data is five minutes old, and we don't know how many tweets are about, I don't know, Tiger King, or some popular thing is happening on Twitter or whatever, if we don't know how many, what the most accurate count of tweets is for that, that's probably not going to kill us if that's five minutes old or a minute old or whatever.

But so there's a couple of different patterns here for caching. So obviously, we have cache-aside, inline caches, things like that. But the ones that I think are more interesting, and I'd like to talk about, are this idea of soft and hard Time To Live. Explain that.

Adrian: So a soft Time To Live is your requirement in terms of staleness, right? You say, my Twitter trends list, I want to refresh it every, let's say, 30 seconds. So you give it a soft TTL of 30 seconds. If my service requests the cache and the soft TTL has expired, and everything is fine, you go and query the service, right? But if the downstream service doesn't answer at that moment, you've passed the soft TTL and it doesn't give you the data. What do you do? Do you return a 404, or do you actually fall back and say, alright, my soft TTL has expired, but I'm still within the hard TTL, which is, say, one hour, right?

And then your service returns the data that's still within the hard TTL, and you say, "Oh, sorry, we just have one-hour-old data, because we're experiencing issues." So again, it's a possible degradation. And actually, quite often a cache can be used like this. I think it's all about how you create your cache and how you define your eviction policies and things like this.
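
A minimal sketch of that soft/hard TTL logic: refresh after the soft TTL when the backend is healthy, but keep serving the stale copy up to the hard TTL when it isn't. The TTL values mirror the example above; the fetch function is hypothetical:

```python
import time

SOFT_TTL = 30          # seconds: when we'd like to refresh
HARD_TTL = 60 * 60     # seconds: the oldest data we're still willing to serve

cache = {}  # key -> (value, stored_at)

def get(key, fetch):
    entry = cache.get(key)
    age = time.monotonic() - entry[1] if entry else None

    if entry and age < SOFT_TTL:
        return entry[0]                      # fresh enough: serve straight from cache

    try:
        value = fetch(key)                   # soft TTL expired: try to refresh
        cache[key] = (value, time.monotonic())
        return value
    except Exception:
        if entry and age < HARD_TTL:
            return entry[0]                  # backend is down: serve stale, degraded
        raise                                # older than the hard TTL: give up
```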

Jeremy: Right. And I think that that is something that a lot of people don't think about. I think the most common way to get rid of your cache is just to set a TTL on a Redis key or something like that. And then that expires, and you're like, "Oh, wait, now I can't fetch new data, what do I do?" So that's really interesting that those two ideas, there are something that people should be thinking about.

Alright. And then the other thing was this idea of requests coalescing, because this is another problem you have is if it takes one second to repopulate the cache or to run that request, if you have 1000 people requesting that before the first one completes, you need to be able to handle that in a certain way.

Adrian: Right, yeah. It's exactly like, you go, you have one requests, and then a thousand requests asking for the same data, but you don't have it in the cache, what do you do?

Jeremy: Right.

Adrian: So one way to do it is you say, "Okay, I'll take one of these requests, they're all the same, and I'll get the result and then use the same result for everyone." So you basically park the other 999 requests on the side and say, "Wait a second, I went to ask, and I'll give you the data." This is very related to idempotency and things like this. It's about understanding what requests you're making, what data can be returned from your API, and whether it's dynamic or static content. Actually, some frameworks support this kind of thing out of the box. So it's important at least to figure out if your framework supports it.
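
A small, single-process sketch of request coalescing: the first caller for a key does the real work, and concurrent callers wait for that same result instead of hitting the backend again. It's thread-based and simplified; the fetch function is hypothetical:

```python
import threading

_lock = threading.Lock()
_in_flight = {}   # key -> (event, result holder)

def coalesced_fetch(key, fetch):
    with _lock:
        pending = _in_flight.get(key)
        if pending is None:
            pending = (threading.Event(), {})
            _in_flight[key] = pending
            leader = True                    # this caller will do the real work
        else:
            leader = False                   # someone else is already fetching

    event, holder = pending
    if leader:
        try:
            holder["value"] = fetch(key)     # only one real call to the backend
        except Exception as exc:
            holder["error"] = exc
        finally:
            with _lock:
                del _in_flight[key]
            event.set()
    else:
        event.wait()                         # everyone else reuses the leader's result

    if "error" in holder:
        raise holder["error"]
    return holder["value"]
```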

Jeremy: Yeah. And I think there are some patterns that you can build fairly simply in order to do that even if you're using a Lambda function, for example. So, but anyways, very, very cool stuff.

Adrian: And actually, if you use DynamoDB with DAX, it supports this kind of request coalescing. Hard to pronounce for me.

Jeremy: Hard word to say. I think that, the last thing you point out is just be aware of where your caching happens, right? Because there's just so many layers of caching sometimes that it can cause a lot of problems.

Alright, so the last thing I want to talk to you about and then I'll let you go, is this article you wrote about building multi region active-active architecture on AWS that was serverless, right? So if you were doing that, just give me a quick overview how would you build an active-active multi region serverless application on AWS?

Adrian: You're asking me for a pill, "give me your pill to create an active-active..." So first of all, before giving you the solution: if you think about active-active, a lot of people think about active-active and they think about data replication, right?

Jeremy: Right.

Adrian: So active-active doesn't necessarily involve data replication, right? You can have active-active but federated as well. So that means you have local databases. And if you look at the Amazon retail site, you have Amazon UK, Amazon US, Amazon Germany; it's all an active-active system. All the regions are active, but they are federated.

So as a business you need to understand, okay, first, what is it that you want? Do you want multi-region with federation, so local databases, or do you really want to replicate all your data under the hood, right? If you really want to replicate the data under the hood, and this is what I used in that blog post, I was using DynamoDB global tables, which allow you to replicate the data across multiple regions. That means that if one region is experiencing issues, my data has been replicated asynchronously to the other regions, so I can fail over to another region to get that data.

And again, when you say multi-region, it doesn't mean multi-continent, because in the US you have multiple regions. So when you do multi-region, you need to be very, very careful about compliance in whichever part of the globe you're operating. If you do this in Europe, there's GDPR, so you probably don't want to have DynamoDB replicating customer data between Europe and the US, because all of a sudden you have different kinds of data governance, regulations, and laws and stuff like this.

So you can use multi-region within one continent, because you maybe want to serve customers faster, because you want to decrease latency. And at the end of the day, we did tests on the Amazon retail side: 100 milliseconds of latency reduced sales by 1%, and 100 milliseconds of latency is very, very little, right?

So when you are between, let's say, Germany, Ireland, France, or other regions in the EU, or even the east and west coasts of the US, you can easily get a 100-millisecond latency improvement by choosing the right region closest to the customer. Once you've done that, you can define the multiple regions from which you want to serve your data, within the applicable laws and governance. And then, of course, you can use Route 53 to direct traffic between the different regions, or there's also the newer Global Accelerator, which doesn't use DNS but uses IP Anycast to move traffic, so you avoid all the DNS caching, which is a problem. It's a whole other podcast if you want to talk about DNS caching and the problems of DNS. But yeah, you have multiple solutions to do this.
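
For the DynamoDB global tables setup Adrian mentions, here is a hedged boto3 sketch of adding a replica region to an existing table; it assumes the newer (2019.11.21) global tables version and a table that meets its prerequisites, and the table name is hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Add a replica of an existing table in another region; DynamoDB then keeps
# the replicas in sync with asynchronous, multi-active replication.
dynamodb.update_table(
    TableName="shopping-carts",               # hypothetical table
    ReplicaUpdates=[
        {"Create": {"RegionName": "eu-central-1"}},
    ],
)
```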

Now, word of warning: multi-region adds complexity as well, right? So it's very, very important, if you decide to go multi-region, that there is a very, very strong business case, or that you know really well what you're doing. And I always say to customers, start with maybe one region, automate it as much as possible so that you can just move your automation to another region, and then avoid data transfer between regions, right? Try to federate the data. I think federation works great because, for a service like Amazon.com, I mostly use the German Amazon, because it's closer to me and the delivery is obviously better. I rarely go to the US. So there's no reason for Amazon to transfer my data or my shopping cart between Germany and the US.

So if I go to the US store, I need to recreate a cart and authenticate again. It's still active-active; it would serve me better if I moved to the US, but how many times a year am I in the US to shop? Very little, at the end of the day. It's important to understand that maybe it's acceptable, if I'm in the US, that I'm actually routed to the German Amazon and I do my shopping from there, maybe twice a year, with increased latency, and that's okay. I don't need very complex systems to support the three days a year I'm in the US, and replicate all my data, and make sure everything is there. So it's very important to really have a strong understanding: if you're building multi-region, are you doing it for the right reasons?

Jeremy: Right. Yeah, no and I think that you made a really good point there where it's like that the Federated aspect of it, especially depending on where it is. Like if you're using three regions in the US, for some reason, then federated might not work, because you could get routed to different regions, you might want to use global tables in that case. But if you were just doing one region in the US, one region in Europe, one region in South America or something like that, then maybe just replicating, say, the login data, right? Like just the authentication data using a global table for that, but then federating the other data...

Adrian: The customer data.

Jeremy: Yeah, right. So some of that other stuff. So I think that's a really interesting approach. People have to go read your articles seriously. And I'm going to put all this into the show notes. So honestly, thank you for sharing not only here and dealing with the technical issues that we had, which might just have to be another blog post at some point, we'll discuss. But seriously, thank you for writing all those articles and sharing all that knowledge. And just giving people the insight into some of this stuff, which I think is not publicly available, it's not readily available, you kind of got to dig through that stuff to find out what is important and what's not important. And you do a great job of summarizing it and going deep on those things.

Adrian: Thank you very much.

Jeremy: So thank you very much for that. So if people want to get in touch with you and find out more about what you're working on, and your blog post, how do they do that?

Adrian: So I'm pretty much everywhere on the internet, Adhorn, A-D-H-O-R-N, whether it's Twitter, Medium, or even Dev.to. So yeah, I'm pretty much there. And you can... anybody can ping me on Twitter. I have open DMs. So if you want to talk about anything, I'm happy to answer. I'm much better at writing than I am at answering live questions. So I hope I did justice to what you expected. But again, thank you very much for having me on your show. I'm a huge fan of what you do, Jeremy. So thank you very much for everything.

Jeremy: Thank you. I am a fan of yours as well. So thanks again and we'll get all that information in the show notes.

Adrian: Thank you very much.

THIS EPISODE IS SPONSORED BY: Amazon Web Services (Innovator Island Workshop)
