Architecture Nugget - December 2, 2024
Insights on Slack's Unified Grid, CQRS Patterns, and Architectural Risks in Distributed Systems
Hey there!
Last week I couldn't publish either Architecture Nugget or Golang Nugget, for a few reasons. One of them is that we had a platform migration for both newsletters. :)
I moved them from Jekyll and MailerLite to Beehiiv to have them on separate domains, as I received feedback that having two newsletters on one domain can be a bit confusing.
Another change is that I'll try to send updates more often, but with shorter posts. If you’ve written or found something interesting, please send it to me. I'd be happy to share it with others if it's relevant.
With all that said, let’s jump into some cool architectural stuff!
Some risks of coordinating only sometimes (From AWS)
In cloud setups, you often find small groups of nodes (say, 1 to 9) that work together internally but avoid coordinating with other groups to keep things running smoothly. Every now and then, though, these systems have to sync up across groups or with central controllers, and that's when things can get dicey.
Here’s a look at what usually goes down in these sometimes-coordinating systems:
The big issues pop up when these systems hit correlated failures. They handle regular node hiccups just fine, but when something major like a power cut or network glitch happens, it causes a sudden spike in coordination traffic. These events don’t happen often, so the system’s response in such cases usually isn’t tested on a large scale.
Three main problems stand out:
Recovery capacity - Systems find it tough to keep running smoothly while leaving enough headroom to recover.
Coordination overload - Controllers suddenly have to deal with way more traffic than usual.
Quality degradation - During failures, outdated info can mess up scheduling decisions.
AWS has come up with some cool ways to deal with these challenges:
Keeping coordination constant no matter the workload.
Designing with clear limits on how much can go wrong.
Building systems that stay stable even if coordination fails.
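The first idea - keeping coordination work constant - is worth a tiny sketch. The following is a hypothetical illustration (the names and data shapes are mine, not from the article): a reconciler that re-applies the full desired state every cycle, so its cost is the same on a quiet day as during a mass failure, instead of reacting to deltas whose volume spikes exactly when things go wrong.

```python
# "Constant work" sketch: apply the entire desired state each cycle,
# regardless of how much of it actually changed.

def reconcile(desired_state: dict, actual_state: dict) -> dict:
    """Re-apply the full desired state; cost is independent of the delta size."""
    updated = dict(actual_state)
    for node, config in desired_state.items():
        updated[node] = config          # same cost whether 1 or 1,000 entries differ
    for node in list(updated):
        if node not in desired_state:   # drop nodes that are no longer desired
            del updated[node]
    return updated

desired = {"node-1": "v2", "node-2": "v2"}
actual = {"node-1": "v1", "node-3": "v1"}
print(reconcile(desired, actual))  # {'node-1': 'v2', 'node-2': 'v2'}
```

Because the worker's load never depends on how many nodes just failed, a correlated failure doesn't produce the coordination-traffic spike described above.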
The tricky part is the system’s bistable nature - it’s stable when things are normal but can also stabilise in an overloaded state. This makes bouncing back a real challenge.
If you’re curious about diving deeper into distributed systems design and control theory, the original article “Some risks of coordinating only sometimes” offers more insights into cluster sizing and theoretical frameworks like CAP and CALM.
Unified Grid (From Slack)
Slack hit a big architectural snag as their platform grew. It was initially set up for single-workspace users, but when Enterprise Grid customers started using multiple workspaces heavily, things got messy. The system assumed data was tied to individual workspaces, leading to performance hiccups and user experience issues with cross-workspace features.
The fix? A complete overhaul called Unified Grid, shifting from a workspace-focused to an org-wide architecture.
The implementation has some smart strategies:
Creating a New Boot API: Slack developed a new boot API to provide a unified view of all user data across multiple workspaces. This API aggregates data at the org level, allowing users to access all their channels and workspaces from a single interface, reducing context switching and simplifying data management.
Updating the API Framework: To support the Unified Grid, Slack updated their API framework to handle org-wide contexts. This change enabled APIs to access and manipulate data across multiple workspaces, laying the foundation for a more flexible and scalable architecture.
Implementing three patterns to fix broken APIs:
Direct Routing: For APIs migrated with Vitess, Slack allowed direct routing without major changes, leveraging new sharding schemes.
Workspace Selection Prompts: For workspace-specific actions, users were prompted to select a workspace, ensuring correct context for API calls.
Multi-Workspace Iteration: Slack iterated over relevant workspaces with a cap of 50 to maintain performance, especially for users with many workspaces.
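The third pattern is easy to picture in code. Here's a hedged sketch (the function and field names are illustrative, not Slack's actual API) of fanning a call out across a user's workspaces with a hard cap, so latency stays bounded even for users in hundreds of workspaces:

```python
# Iterate over a user's workspaces, capped at a fixed limit (50 in the article),
# and merge the per-workspace results into one org-wide response.

MAX_WORKSPACES = 50

def fetch_across_workspaces(workspace_ids, fetch_one):
    """Call fetch_one per workspace, capped at MAX_WORKSPACES, and merge results."""
    results = []
    for ws_id in workspace_ids[:MAX_WORKSPACES]:
        results.extend(fetch_one(ws_id))
    return results

# Example: a user in 60 workspaces, each contributing one channel;
# only the first 50 workspaces are queried.
channels = fetch_across_workspaces(
    [f"W{i}" for i in range(60)],
    lambda ws: [f"{ws}-general"],
)
print(len(channels))  # 50
```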
The team used a “prototyping the path” approach, starting small and gradually expanding.
To manage the transition, they created parallel test suites and developed new client-side data stores at the org level. The migration required updating thousands of APIs and permission checks, but the team broke this down into manageable chunks, first making things work for internal testing, then handling edge cases.
This architectural shift let Slack provide unified views across workspaces while maintaining performance, even for users in multiple workspaces. It’s a good example of when you need to break the “don’t rewrite” rule.
If you’re interested in more details about the migration process and technical challenges, check out the original article about Slack’s Unified Grid architecture transformation. It’s got loads more insights about testing strategies and client infrastructure changes.
CQRS: Command Query Responsibility Segregation
Handling complex systems with high performance and scalability needs can be a real puzzle. That's where CQRS (Command Query Responsibility Segregation) steps in - it's a smart pattern that splits reading and writing operations into separate models.
Here's how it works in practice. The implementation usually involves several key components:
Domain Models (Write Side):
public class Order : AggregateRoot
{
    public Guid Id { get; private set; }
    public OrderStatus Status { get; private set; }

    // State changes go through behaviour methods that also raise domain events.
    public void ShipOrder()
    {
        Status = OrderStatus.Shipped;
        AddEvent(new OrderShippedEvent(Id));
    }
}
Command Handlers:
public class ShipOrderCommandHandler : ICommandHandler<ShipOrderCommand>
{
    private readonly IRepository<Order> _repository;

    // Loads the aggregate, applies the command, and persists the result.
    public async Task Handle(ShipOrderCommand command)
    {
        var order = await _repository.GetAsync(command.OrderId);
        order.ShipOrder();
        await _repository.SaveAsync(order);
    }
}
Query Handlers:
public class GetOrderSummaryQueryHandler : IQueryHandler<GetOrderSummaryQuery, OrderSummary>
{
    private readonly IOrderSummaryRepository _repository;

    // Reads straight from the denormalised read model - no domain logic involved.
    public async Task<OrderSummary> Handle(GetOrderSummaryQuery query)
    {
        return await _repository.GetAsync(query.OrderId);
    }
}
The pattern works great for complex systems with lots of reads or writes, but it’s not always the best choice. For simple CRUD apps, it might be overkill. You’ll need to think about data consistency too - most CQRS systems use eventual consistency, where the read and write models sync up over time.
Common tools in the CQRS ecosystem include message brokers like Kafka or RabbitMQ, and you might use different databases for your read and write models (like PostgreSQL for writes and MongoDB for reads).
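To make the eventual-consistency point concrete, here's a small illustrative sketch (my own, not from the article) of the read side catching up with the write side: a domain event emitted by the write model is projected into a denormalised read store, so the two models converge over time.

```python
# Read-side projection sketch: events from the write model update the read model.

class OrderShippedEvent:
    def __init__(self, order_id):
        self.order_id = order_id

# Denormalised read model, e.g. what a document store like MongoDB would hold.
read_store = {"order-1": {"status": "Pending"}}

def project(event):
    """Event handler on the read side: apply the event to the read model."""
    if isinstance(event, OrderShippedEvent):
        read_store[event.order_id]["status"] = "Shipped"

# In production the event would arrive via a broker like Kafka or RabbitMQ.
project(OrderShippedEvent("order-1"))
print(read_store["order-1"]["status"])  # Shipped
```

Until `project` runs, queries see the stale "Pending" status - that window is exactly the eventual-consistency trade-off mentioned above.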
If you’re keen to learn more about advanced topics like event sourcing integration, debugging strategies, and legacy system migration, I’d recommend checking out the full article “CQRS: A Deep Dive into Command Query Responsibility Segregation”. It’s got loads more practical examples and implementation details that couldn’t fit in this summary.
Whatamix: Blendable feed construction (From Whatnot)
Feed functionality is super important for e-commerce apps, but building flexible, maintainable feed systems can be tricky. Whatnot tackled this challenge by creating Whatamix, a platform that uses directed acyclic graphs (DAGs) to construct different types of feeds.
The core problem was managing multiple feed types (For You, category-specific, followed content) while avoiding code duplication and maintaining consistent patterns. Each feed needed different content types, personalisation levels, and business rules.
Whatamix solves this by breaking feed construction into modular DAG nodes that handle specific tasks:
The system has four main node types:
Retrieval nodes - fetch candidates from various sources
Ranking nodes - handle feature hydration and model scoring
Business logic nodes - manage diversity and positioning
Budgeting nodes - handle things like bandits and ads
What’s clever is how Whatamix works in two phases:
DAG Construction - builds the node graph based on user context
DAG Execution - runs non-dependent nodes in parallel automatically
The system includes built-in observability (metrics, logging), A/B testing support through “feed params”, and error handling. Teams can work independently by creating self-contained sub-DAGs, and there’s a process for contributing new reusable nodes.
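The execution phase is essentially a topological traversal. As a rough sketch (assuming nothing about Whatnot's actual implementation; the node names are made up), Python's standard-library graphlib shows how "non-dependent nodes run in parallel" falls out of the DAG structure - every batch returned by get_ready() could be dispatched concurrently:

```python
# Level-by-level DAG execution: each "ready" batch has no unmet dependencies,
# so its nodes could all run in parallel.
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (illustrative feed-construction nodes)
dag = {
    "retrieval_followed": set(),
    "retrieval_category": set(),
    "ranking": {"retrieval_followed", "retrieval_category"},
    "diversity": {"ranking"},
}

ts = TopologicalSorter(dag)
ts.prepare()
order = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # all nodes runnable right now
    order.append(ready)
    ts.done(*ready)
print(order)
# [['retrieval_category', 'retrieval_followed'], ['ranking'], ['diversity']]
```

Both retrieval nodes land in the first batch, which is where the automatic parallelism comes from: the scheduler never has to be told which nodes are independent.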
Since there’s loads more detail about implementation specifics and future plans in the original article “Whatamix: Blendable feed construction”, I’d recommend checking it out if you’re interested in practical ML systems architecture.
Exploring Clean Architecture: A Practical Guide
Clean Architecture tackles common software development challenges like early tech decisions, rigid systems, and scattered business logic. It's a design approach that separates system responsibilities into distinct layers, making the codebase more maintainable and adaptable to change.
Here's how it works: the architecture organises code into layers, with business logic at the core.
The Domain Layer contains business entities and rules. Here’s a practical example:
public class Webinar
{
    // Private setters keep the entity's invariants inside the domain layer.
    public Guid Id { get; private set; }
    public string Name { get; private set; }
    public DateTime ScheduledOn { get; private set; }

    public Webinar(string name, DateTime scheduledOn)
    {
        Id = Guid.NewGuid();
        Name = name;
        ScheduledOn = scheduledOn;
    }
}
The Application Layer handles use cases through the CQRS pattern, separating read and write operations:
public class CreateWebinarCommandHandler : IRequestHandler<CreateWebinarCommand, Guid>
{
    private readonly IWebinarRepository _repository;

    public async Task<Guid> Handle(CreateWebinarCommand command, CancellationToken cancellationToken)
    {
        var webinar = new Webinar(command.Name, command.ScheduledOn);
        await _repository.Add(webinar, cancellationToken);
        return webinar.Id;
    }
}
The Infrastructure Layer manages external integrations like databases, while the Presentation Layer handles user interaction through APIs. Everything’s tied together using dependency injection, making the system flexible and testable.
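That dependency-injection point is the crux of the whole approach, so here's a minimal sketch (in Python for brevity; the names mirror the webinar example above but are illustrative): the use case depends only on an abstraction, and the composition root supplies the concrete infrastructure.

```python
# Dependency rule sketch: inner layers define abstractions,
# outer layers implement them, and wiring happens at the composition root.
from abc import ABC, abstractmethod
import uuid

class WebinarRepository(ABC):  # abstraction lives with the use case
    @abstractmethod
    def add(self, webinar): ...

class InMemoryWebinarRepository(WebinarRepository):  # infrastructure detail
    def __init__(self):
        self.items = {}
    def add(self, webinar):
        self.items[webinar["id"]] = webinar

def create_webinar(repo: WebinarRepository, name: str) -> str:
    """Use case: depends only on the abstraction, not on any database."""
    webinar = {"id": str(uuid.uuid4()), "name": name}
    repo.add(webinar)
    return webinar["id"]

repo = InMemoryWebinarRepository()  # composition root picks the concrete type
webinar_id = create_webinar(repo, "Clean Architecture 101")
print(repo.items[webinar_id]["name"])  # Clean Architecture 101
```

Swapping the in-memory repository for a real database touches only the composition root, which is exactly the testability and flexibility benefit described below.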
While this approach adds some complexity and potential code duplication, it offers significant benefits: easier testing, framework independence, and clear separation of concerns. The architecture particularly shines when requirements change frequently or when you need to swap out technical components without affecting business logic.
If you’d like to dive deeper into implementation details and practical examples, check out the original article “Exploring Clean Architecture: A Practical Guide” which contains comprehensive code samples and detailed explanations of each architectural layer.