20230507 - Engineering Blogs on Load Shedding
This newsletter collects 6 engineering blogs on load shedding. You can also go to https://www.techblogsearch.dev/ to search engineering blogs from different IT companies.
[DoorDash] [2023/03/14] Failure Mitigation for Microservices: An Intro to Aperture
[LinkedIn] [2022/02/18] Hodor: Detecting and addressing overload in LinkedIn microservices
[Netflix] [2020/11/02] Keeping Netflix Reliable Using Prioritized Load Shedding
[Amazon] [2019/01/01] Using load shedding to avoid overload
[Google] [2016/12/19] Using load shedding to survive a success disaster—CRE life lessons
[Netflix] [2012/02/29] Fault Tolerance in a High Volume, Distributed System
[DoorDash] [2023/03/14] Failure Mitigation for Microservices: An Intro to Aperture
TL;DR
Problem. Localized mitigation mechanisms like load shedding and circuit breakers are useful in preventing individual services from being overloaded, but they are not very effective in dealing with complex failures that involve interactions between services.
Solution. This blog evaluates Aperture, an open-source project that enables a global failure mitigation plan for DoorDash’s services.
[LinkedIn] [2022/02/18] Hodor: Detecting and addressing overload in LinkedIn microservices
TL;DR
Problem. LinkedIn lacked a standard server-side solution for ensuring good quality of service (QoS) for clients when services became overloaded to the point where they could not serve traffic with reasonable latency.
Solution. LinkedIn developed Hodor, a tool to detect and address overload in their microservices.
[Netflix] [2020/11/02] Keeping Netflix Reliable Using Prioritized Load Shedding
TL;DR
Problem. Apart from on/off circuit breakers, Netflix had no progressive way to shed load.
Solution. This blog proposes prioritized load shedding, which keeps Netflix’s systems reliable during heavy traffic by preemptively shedding the lowest-priority workloads so that higher-priority workloads continue running smoothly.
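To make the idea concrete, here is a minimal sketch of priority-based admission control in Python. It is not Netflix's implementation: the class name, the priority numbering (0 is most important), the utilization signal, and the threshold values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PriorityLoadShedder:
    """Hypothetical admission check: each priority level has a utilization
    threshold (0.0 to 1.0) at or above which its requests are rejected."""

    # Maps priority (0 = most important) to the utilization at which
    # that priority starts being shed. Values here are illustrative.
    shed_thresholds: Dict[int, float]

    def should_admit(self, priority: int, utilization: float) -> bool:
        # Unknown priorities are treated like the least important class,
        # so they are the first to be shed.
        default = min(self.shed_thresholds.values())
        threshold = self.shed_thresholds.get(priority, default)
        return utilization < threshold


# Assumed configuration: critical traffic is shed only at full saturation,
# while best-effort traffic is shed once utilization reaches 75%.
shedder = PriorityLoadShedder({0: 1.0, 1: 0.9, 2: 0.75})
```

With this configuration, `shedder.should_admit(2, 0.8)` rejects a best-effort request while `shedder.should_admit(1, 0.8)` still admits a more important one, so load is shed progressively rather than all at once.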
[Amazon] [2019/01/01] Using load shedding to avoid overload
TL;DR
This blog discusses how to use load shedding to avoid system overload in a cloud-based infrastructure. The article covers the general principles of load shedding, such as identifying high-priority requests, detecting overload, and shedding low-priority requests. It also provides examples of load shedding techniques used by AWS services.
[Google] [2016/12/19] Using load shedding to survive a success disaster—CRE life lessons
TL;DR
The article describes how Google used load shedding, a technique that drops less important traffic so that critical requests keep being served, to survive a success disaster, a situation in which too many users attempt to use an application simultaneously.
[Netflix] [2012/02/29] Fault Tolerance in a High Volume, Distributed System
TL;DR
Problem. Netflix faced various challenges in achieving fault tolerance in a high-volume distributed system.
Solution. Netflix employed a combination of fault tolerance approaches (such as network timeouts and retries, and circuit breakers) to provide a comprehensive protective barrier between user requests and underlying dependencies.
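For readers unfamiliar with circuit breakers, the following Python sketch shows the general pattern. It is an illustration only, not the code Netflix describes: the class and exception names, the failure threshold, and the reset timeout are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more work on a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder, and serve a fallback response when `CircuitOpenError` is raised.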
If you have any questions or comments, please contact plussmart2018@gmail.com.