Architecture Nugget - December 9, 2024

Aurora Serverless resource management, microservices communication protocols, AI integration in architecture, and solutions for Kafka load balancing.

Hey, and welcome to another Architecture Nugget, where I share TLDRs of architectural posts! This time, we have a newcomer—a fantastic YouTube video about microservices communication.

Before diving into the nuggets, I’d really appreciate it if you could share Architecture Nugget with your friends and colleagues if you find it useful. It means a lot to me to see new readers joining through your shares on social media. It keeps me motivated to send out more high-quality content!

Managing resources in auto-scaling databases is pretty tricky, especially when it comes to memory. Aurora Serverless solves it with a multi-level resource management system that lets databases scale up and down based on workload demands.

The solution works at three levels:

At the lowest level, there’s smart coordination between the hypervisor, kernel, and database engine. Traditional databases are like hungry hippos: they eat up all available memory. The AWS team had to tweak this behaviour so that the database engine gives memory back when it’s not needed, especially during scale-down operations.

The middle level handles resource management on individual hosts. Multiple database instances run on a single physical machine, each in its own virtual machine. A local decision-maker controls scaling speeds and targets to prevent resource exhaustion.

At the cluster level, it’s all about smart workload placement. The system uses three main strategies:

  1. Preventive placement: Mixing workloads with different scaling patterns

  2. Live migration: Moving workloads between hosts when needed

  3. Growth limits: Restricting scaling when resources are tight

What’s really smart is how these levels work independently. This makes the system more resilient—even if the cluster-wide control plane goes down, individual instances can still scale effectively. The trade-off between absolute performance and predictability is intentional, prioritising consistent behaviour over max scalability.

The implementation achieves impressive results—out of over 16 million scale-up events, only about 3,000 needed live migrations. That’s a 99.98% success rate for in-place scaling!
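As a quick sanity check of that percentage, here is the arithmetic written out. The two inputs are the rounded figures quoted above ("over 16 million" and "about 3,000"), so the result is approximate rather than the paper's exact number.

```python
# Rounded figures from the summary above; the exact counts are in the paper.
scale_up_events = 16_000_000
live_migrations = 3_000

# Share of scale-ups that needed a live migration, and the share that did not.
migration_rate = live_migrations / scale_up_events
in_place_rate = 1 - migration_rate

print(f"In-place scaling rate: {in_place_rate:.2%}")  # In-place scaling rate: 99.98%
```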

If you’re keen to dig deeper into the nitty-gritty details of database resource management and system design, check out the full VLDB paper on Resource Management in Aurora Serverless.

In lesson 201 of Software Architecture Monday, Mark Richards discusses three important ways microservices can communicate: REST with an API Gateway, request-reply messaging, and gRPC. The main problem he addresses is how to make service-to-service communication work well without using too much bandwidth, while also avoiding stamp coupling.

He uses the example of a wishlist service that needs customer names from a profile service, and the video walks through the pros and cons of each approach. REST is very common and easy, but the caller usually gets more data than it needs (stamp coupling). To fix this, you can use field selectors or create specialised endpoints, although creating new endpoints can break REST rules. For example:

GET /app/1.0/customer?field=name
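To make the field-selector idea concrete, here is a minimal Python sketch of what the profile service could do with that `field` query parameter. The record, its field names, and the `select_fields` helper are all hypothetical and are not taken from the video; a real service would do this inside its web framework.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical profile record; the field names are illustrative only.
CUSTOMER_PROFILE = {
    "id": 42,
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "address": "12 Example Street",
}

def select_fields(record, url):
    """Return only the fields named in the URL's `field` query parameter(s)."""
    requested = parse_qs(urlparse(url).query).get("field", [])
    if not requested:
        # No selector: the caller receives the full payload (stamp coupling risk).
        return dict(record)
    return {key: record[key] for key in requested if key in record}

print(select_fields(CUSTOMER_PROFILE, "/app/1.0/customer?field=name"))
# {'name': 'Ada Lovelace'}
```

The wishlist service gets only the name it asked for, so less data crosses the wire and it no longer depends on the rest of the profile contract.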

Request-reply messaging creates a private contract between services and provides automatic load balancing through request/reply queues. It helps save bandwidth and avoid stamp coupling, but it skips API Gateway benefits such as security and metrics.
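Here is a rough sketch of the request-reply pattern, using in-process Python queues as stand-ins for a real message broker. The queue names, message shape, and correlation ID field are assumptions made for illustration. With a real broker, several profile-service instances would consume from the same request queue, which is where the load balancing comes from.

```python
import queue
import threading
import uuid

request_queue = queue.Queue()  # stand-in for the broker's request queue
reply_queue = queue.Queue()    # stand-in for the reply queue

# Hypothetical data owned by the profile service.
CUSTOMER_NAMES = {"c-1": "Ada Lovelace"}

def profile_service():
    """Consume one request and reply with only the customer name."""
    request = request_queue.get()
    reply_queue.put({
        "correlation_id": request["correlation_id"],
        "name": CUSTOMER_NAMES.get(request["customer_id"]),
    })

def get_customer_name(customer_id, timeout=2.0):
    """Wishlist-side call: send a request, then wait for the matching reply."""
    correlation_id = str(uuid.uuid4())
    request_queue.put({"correlation_id": correlation_id, "customer_id": customer_id})
    reply = reply_queue.get(timeout=timeout)
    if reply["correlation_id"] != correlation_id:
        raise RuntimeError("reply does not match the request")
    return reply["name"]

worker = threading.Thread(target=profile_service, daemon=True)
worker.start()
print(get_customer_name("c-1"))  # Ada Lovelace
worker.join()
```

The reply carries only the name, which is the private, narrow contract the pattern is meant to give you.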

gRPC is the fastest option because it uses HTTP/2 and Protocol Buffers with strict contracts. However, you will need extra load-balancing tools (like gRPC-LB). It is mostly used for private APIs and east-west communication.

Each way has its best use. REST is good for north-south communication and when all data is required. Request-reply messaging is better for east-west and when only some data is needed. gRPC is best for high performance cases where services are tightly connected.

The video suggests REST with an API Gateway for talking to external users, request-reply messaging for internal service communication that needs less data, and gRPC for very fast internal workflows.

These days, software architects have a tough job of weaving AI into systems in a way that’s more than just a trendy add-on. The trick is to realize that when we talk about “AI,” we’re usually talking about Generative AI, which runs on Large Language Models (LLMs) – a special kind of machine learning model.

Here’s a simple way to see how LLMs work:

When it comes to integrating AI, architects have three big things to think about:

  1. Appropriateness:

  • Good uses: Natural language interfaces, content creation

  • Possible uses: Enhancing complex UI experiences

  • Bad uses: Financial calculations, regulatory compliance

  2. Implementation approach:

  3. Optimisation techniques:

  • RAG (Retrieval-Augmented Generation) pattern for mixing LLMs with knowledge bases

  • Picking models based on what you need, not just size

  • Fine-tuning for specific domains when needed
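To give a feel for the RAG pattern mentioned in the list above, here is a deliberately tiny Python sketch. The knowledge base, the word-overlap ranking (a stand-in for real vector search), and the prompt wording are all assumptions for illustration. The final call to an LLM is left out because it depends on the provider you choose.

```python
# Hypothetical knowledge base; in practice this would be a document or vector store.
KNOWLEDGE_BASE = [
    "Refunds are processed within five business days.",
    "Premium accounts include priority support.",
    "Passwords must be rotated every ninety days.",
]

def tokenize(text):
    """Lower-case the text and strip basic punctuation from each word."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, documents, top_k=1):
    """Rank documents by word overlap with the question (stand-in for vector search)."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, documents):
    """Combine the retrieved context with the question into one prompt."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How long do refunds take?", KNOWLEDGE_BASE)
print(prompt)  # this prompt would then be sent to whichever LLM API you use
```

The point of the pattern is that the model answers from retrieved facts rather than from whatever it happened to memorise during training.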

For practical steps, architects should:

  • Treat AI parts as systems that aren’t always predictable

  • Set up good validation methods

  • Weigh the pros and cons of model size vs. performance

  • Start with API-based solutions to test things out

  • Use RAG patterns when it makes sense
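One way to picture "treat AI parts as unpredictable" and "set up good validation methods" is a thin wrapper that checks every model reply and falls back to a safe default. The sketch below is only an illustration: the JSON reply shape, the labels, and the `fake_model` stub are made up, and a real system would plug in its own model call.

```python
import json

def validate_reply(raw_reply, allowed_labels):
    """Return the label if the reply is well-formed JSON with a known label, else None."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    if isinstance(label, str) and label in allowed_labels:
        return label
    return None

def classify_with_retries(call_model, text, allowed_labels, max_attempts=3):
    """Call the model until a reply passes validation or the attempts run out."""
    for _ in range(max_attempts):
        label = validate_reply(call_model(text), allowed_labels)
        if label is not None:
            return label
    return "needs_human_review"  # deterministic fallback when the model misbehaves

# Hypothetical stub standing in for an LLM API: the first reply is malformed.
replies = iter(["Sure! The label is spam", '{"label": "spam"}'])

def fake_model(text):
    return next(replies)

print(classify_with_retries(fake_model, "Win a free prize", {"spam", "ham"}))  # spam
```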

The main thing is to stop seeing AI as some kind of magic and start treating it like any other part of the architecture, with its own strengths and limits that need careful thought in system design.

For more on practical implementation patterns and detailed trade-off analyses, check out “Architectural Intelligence – The Next AI” for in-depth examples and case studies.

Handling uneven load distribution in high-throughput Kafka systems can be a real headache, especially when you’re dealing with millions of transactions every day. This common issue often leads to consumer lag and processing delays.

The main problem happens when some consumers get overloaded while others are just chilling. For example, in a fraud detection system processing 50,000 transactions per minute, some transactions need extra verification time, causing bottlenecks in certain partitions. Plus, when you’ve got different hardware capabilities across your AWS ECS instances, some processing 20,000 transactions/minute and others only 12,000, it gets even messier.

Here’s how to fix it with four smart approaches:

  • Lag-aware Producers: These clever producers check consumer lag before sending new messages. They’ll send more messages to partitions with lower lag and fewer to those struggling to keep up.

  • Lag-aware Consumers: These consumers keep an eye on their own performance and can trigger rebalancing when they’re falling behind. If a consumer’s getting swamped, it’ll hand off some partitions to less busy consumers.

  • Same-Queue Length Algorithm: This keeps the backlog even across all partitions. The system monitors queue lengths and adjusts message distribution to maintain balance.

  • Weighted Load Balancing: This approach assigns more traffic to faster consumers based on their processing capacity. If Consumer A can handle 20,000 transactions/minute and Consumer B only 10,000, the system distributes load accordingly.
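Two of the ideas above, lag-aware producing and weighted load balancing, can be sketched in plain Python. This is not Kafka client code: the function names and the inverse-lag weighting are assumptions chosen for illustration. In a real deployment the lag figures would come from your monitoring or admin tooling, and the choice would live in a custom partitioner or assignor.

```python
import random

def lag_aware_weights(partition_lag):
    """Give partitions with lower consumer lag a larger share of new messages."""
    # +1 avoids division by zero for a partition that is fully caught up.
    inverse = {p: 1.0 / (lag + 1) for p, lag in partition_lag.items()}
    total = sum(inverse.values())
    return {p: value / total for p, value in inverse.items()}

def pick_partition(partition_lag, rng=random):
    """Randomly pick a partition, favouring those with less lag."""
    weights = lag_aware_weights(partition_lag)
    partitions = list(weights)
    return rng.choices(partitions, weights=[weights[p] for p in partitions], k=1)[0]

def weighted_shares(capacity_per_minute, total_messages):
    """Split a batch across consumers in proportion to processing capacity."""
    total_capacity = sum(capacity_per_minute.values())
    # Floor division: any remainder would need separate handling in real code.
    return {
        consumer: total_messages * capacity // total_capacity
        for consumer, capacity in capacity_per_minute.items()
    }

# Partition 1 is far behind, so it should receive the smallest share.
print(lag_aware_weights({0: 0, 1: 99, 2: 9}))
print(pick_partition({0: 0, 1: 99, 2: 9}))

# Consumer A and Consumer B from the example above.
print(weighted_shares({"A": 20_000, "B": 10_000}, 30_000))  # {'A': 20000, 'B': 10000}
```

Inverse-lag weighting is just one simple choice. A production system would likely smooth the lag readings over time so that a brief spike does not swing the traffic back and forth.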

By using these strategies, you can keep processing even across your system, cut down on latency, and make better use of your resources. It’s especially great for systems that need real-time processing, like fraud detection, where every second counts.



Thank you for reading Architecture Nugget 🙂 
