How to Identify Requirements in a Scaling, Multi-Tenant Environment - Scalability

Gathering a shared set of requirements across engineering, product, and even finance helps to identify potential areas of collision, low-hanging fruit, and technical challenges.

As organizations grow and the number of engineering teams expands, having someone who gathers and harmonizes requirements, and who makes trade-offs across a number of dimensions, helps the platform team meet the needs of the entire business.

How would the requirements be identified?

In this example, I assume that, as a company builds out its on-premise platform, it may continue to use some public-cloud infrastructure, either on an interim basis or as part of a hybrid strategy.

The Audience

These documents will evolve over time and need to serve multiple audiences. This is what makes them both powerful and challenging.

The primary consumers will be the platform engineering team since they will use these to build the platforms.

However, these should also reflect requirements or input from all the stakeholders: product engineering, product managers, customer-facing teams, and executives.

I have found that the actual UX of the document matters: alignment only happens if stakeholders actually read and digest the information.

Personal Lessons

Many of these may feel like unknowns or wild guesses. It's better to put them down and call out "We don't know exactly but we're guessing we can handle this and will test." Socialize this.

This is how we approached testing distributed rate limiting and testing spikes on the billing platform. I went and looked for historical information on spikes. This was the bare minimum.

The converse was to find the failure points along dimensions such as transactions per second, and then get feedback or acknowledgment from different stakeholders, especially executives and the CEO, that those limits are considered okay.

The hard part is that scalability requirements (and many other infrastructure or platform requirements) are often ignored until something bad happens. Some ways to counter that:

Convert to relatable metrics (number of users, purchases, plans)
Repeat in communication that the system currently breaks at X -- talking about negative scenarios invites someone to raise a hand and cry foul, justified or not
Show metrics
Make those metrics visible
Ensure failure alerts are readable by a broader audience, not just those deep in the weeds
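
To illustrate the first two points, here is a minimal sketch of translating a technical breaking point into numbers a broader audience can relate to. The function name, the parameters, and every figure in it (50 transactions per second, 3 backend transactions per purchase, 2 purchases per active user per hour) are assumptions made up for the example, not measurements from a real system.

```python
def relatable_limit(max_tps: float,
                    tx_per_purchase: float,
                    purchases_per_user_per_hour: float) -> dict:
    """Translate a transactions-per-second ceiling into business terms.

    All three inputs are assumptions you would replace with your own
    measured or estimated values.
    """
    purchases_per_hour = max_tps * 3600 / tx_per_purchase
    active_users = purchases_per_hour / purchases_per_user_per_hour
    return {
        "purchases_per_hour": purchases_per_hour,
        "active_users": active_users,
    }


# Hypothetical numbers: the platform fails at about 50 TPS.
limit = relatable_limit(max_tps=50, tx_per_purchase=3,
                        purchases_per_user_per_hour=2)
print(f"We currently break at roughly {limit['purchases_per_hour']:,.0f} "
      f"purchases per hour (about {limit['active_users']:,.0f} active users).")
```

A sentence like the one printed here is usually easier to socialize than a raw TPS figure, and it gives stakeholders something concrete to accept or push back on.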

Areas of Consideration

Scalability

Define the scalability requirements for both the on-premise and cloud environments. This will include defining the number of users, transactions, and other metrics that the platform needs to be able to handle, and specifying the scaling mechanisms that will be used.

Capacity Planning

Determine the amount of resources needed to meet the anticipated demand. Think long term.

For example, a gaming company may currently have 100 CCU (concurrent users), but its scale-out plan for the next 5 years calls for 1000 CCU by year 5.
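
As a rough illustration of what that plan implies year over year, the sketch below computes the compound annual growth rate needed to get from 100 to 1000 CCU in 5 years. Smooth compound growth is an assumption made for the example; real growth is usually lumpier, and the function names are hypothetical.

```python
def annual_growth_rate(start: float, target: float, years: int) -> float:
    """Compound annual growth rate needed to go from start to target."""
    return (target / start) ** (1 / years) - 1


def project(start: float, rate: float, years: int) -> list:
    """Projected value at the end of each year, year 0 included."""
    return [start * (1 + rate) ** year for year in range(years + 1)]


rate = annual_growth_rate(start=100, target=1000, years=5)
print(f"Required growth: about {rate:.1%} per year")
for year, ccu in enumerate(project(100, rate, 5)):
    print(f"year {year}: ~{ccu:.0f} CCU")
```

Under this assumption the platform needs to absorb a bit under 60% more concurrent users every year, which is a more actionable planning number than the year-5 target alone.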

Depending on your environment, also consider what might be possible 20 years out. This doesn't mean the platform will build toward that in year 1. What it does mean is including this in the specification and requirements design to identify potential trade-offs or paint-yourself-into-a-corner scenarios.

Capacity can include requirements along a number of dimensions, including:

Users
Transactions
Data storage
Data processing
Usage patterns (seasonality, growth, spikes)
Locality requirements

Scalability Behavior

The actual architectural decisions should be determined by engineering. However, there are different ways a system may behave at scale. These requirements can be gathered across the team:

Session management
Load distribution
Redundancy
Failover

The implementation details don't matter at this stage, but the behavior that end-users expect should be incorporated.

Some of these patterns can include:

Caching under load -- perhaps the application is okay with stale data
Asynchronous processing -- can the requests be queued or not?
Horizontal scaling -- are there issues with provisioning new resources or is the environment already over-provisioned?
Data partitioning -- can data writes be sharded or partitioned?
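
To make the first pattern concrete, here is a minimal sketch of a cache that serves stale data when the backing service fails under load. It is meant to show the behavior stakeholders would be agreeing to, not how a production cache should be built. The class name, the `load_price` loader, and the `plan-basic` key are hypothetical.

```python
import time


class StaleTolerantCache:
    """Serve fresh data within the TTL; fall back to stale data on failure."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader      # function that fetches a value by key
        self.ttl = ttl_seconds
        self.clock = clock        # injectable so the example is deterministic
        self.store = {}           # key -> (value, fetched_at)

    def get(self, key):
        now = self.clock()
        cached = self.store.get(key)
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0], "fresh"
        try:
            value = self.loader(key)
        except Exception:
            if cached is not None:
                # The requirement question: is this acceptable to users?
                return cached[0], "stale"
            raise
        self.store[key] = (value, now)
        return value, "fresh"


# Hypothetical loader that starts failing, as if overloaded.
state = {"fail": False}

def load_price(key):
    if state["fail"]:
        raise TimeoutError("pricing service overloaded")
    return 42

now = [0.0]
cache = StaleTolerantCache(load_price, ttl_seconds=30, clock=lambda: now[0])
first = cache.get("plan-basic")    # loaded from the service
now[0] = 60.0                      # the entry is now past its TTL
state["fail"] = True
second = cache.get("plan-basic")   # service is failing, stale value served
print(first, second)
```

Whether the second call should return the old value or an error is exactly the kind of behavior question to put in front of product managers.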

Examples

For example, when working on the billing platform, we had to ask what would happen if there were a spike in billing requests. We knew we did not want to lose any transactions, so we had to evaluate how we did failover. This raised questions because we needed to ensure that the records didn't get out of sync. At the same time, we had to consider whether we wanted to queue the transactions in case the backend failed, so that we captured someone's purchase intent and could catch up later when the system came back up.
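
The queue-and-catch-up behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the system we built: `BillingBackend`, `BillingGateway`, and the idempotency key are hypothetical names, and an in-memory deque stands in for what would need to be a durable queue in practice.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    idempotency_key: str   # lets a replayed purchase be recognized
    amount_cents: int


class BillingBackend:
    """Hypothetical stand-in for the real billing system."""

    def __init__(self):
        self.up = True
        self.ledger = {}   # idempotency_key -> Purchase

    def charge(self, purchase):
        if not self.up:
            raise ConnectionError("billing backend unavailable")
        # setdefault makes a repeated charge a no-op, so records stay in sync.
        self.ledger.setdefault(purchase.idempotency_key, purchase)


class BillingGateway:
    """Captures purchase intent even when the backend is down."""

    def __init__(self, backend):
        self.backend = backend
        self.pending = deque()   # assumption: would be durable storage in practice

    def submit(self, purchase):
        try:
            self.backend.charge(purchase)
            return "charged"
        except ConnectionError:
            self.pending.append(purchase)
            return "queued"

    def drain(self):
        """Replay queued purchases in order; stop if the backend fails again."""
        replayed = 0
        while self.pending:
            try:
                self.backend.charge(self.pending[0])
            except ConnectionError:
                break
            self.pending.popleft()
            replayed += 1
        return replayed


backend = BillingBackend()
gateway = BillingGateway(backend)
backend.up = False
status = gateway.submit(Purchase("order-1", 999))   # backend down: queued
backend.up = True
replayed = gateway.drain()                          # catch up later
print(status, replayed, len(backend.ledger))
```

The idempotency key is one way to address the out-of-sync concern: replaying the same purchase twice leaves a single ledger entry.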

Performance Metrics

Define the performance metrics that the product needs to meet in order to be considered scalable. Most of these are probably already being tracked by SRE or systems engineers in a manner that they understand, using tools like Grafana or Prometheus.

However, aligning these with the product managers and with what users expect provides a valuable perspective. It helps prevent "it's slow" complaints from users that don't show up in internal metrics and make triage difficult.

Some metrics to consider:

Response times
Throughput
Error rates
Availability

For each of these, include how the metric is measured, why that method is used (if there could be questions), and what a poor value can mean, especially for end-users.
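
As one example of spelling out "how a given metric is measured", the sketch below derives the four metrics from a batch of request records. The record fields, the choice of the nearest-rank 95th percentile, treating HTTP 5xx as errors, and defining availability as the share of successful requests are all assumptions for illustration; your SRE team may measure each of these differently, which is the point of writing it down.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Compute the four metrics from request records in one time window."""
    latencies = [r["latency_ms"] for r in requests]
    errors = sum(1 for r in requests if r["status"] >= 500)  # assumption: 5xx = error
    total = len(requests)
    return {
        "p95_latency_ms": percentile(latencies, 95),
        "throughput_rps": total / window_seconds,
        "error_rate": errors / total,
        "availability": (total - errors) / total,  # request-based definition
    }


# Hypothetical sample: 20 requests over 10 seconds, one server error.
sample = [{"latency_ms": 10 * i, "status": 200} for i in range(1, 21)]
sample[-1]["status"] = 503
summary = summarize(sample, window_seconds=10)
print(summary)
```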

These should be referenced under the Scalability Testing section of the specification.

Scalability Testing

Define the testing methodologies that will be used to validate the product's ability to scale.

Types of load testing (synthetic, real, geographically distributed)
Performance metrics (see section above)
Test scenarios (common seasonal activity; upcoming plans from marketing; chaos)
Acceptance criteria or thresholds
Environments
Data
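
Acceptance criteria are easier to agree on when they are written down in a form that a test run can be checked against. Here is a minimal sketch of that idea. The threshold values and metric names are hypothetical placeholders, not recommendations.

```python
# Hypothetical thresholds: ("max", x) means the value must not exceed x,
# ("min", x) means it must be at least x.
THRESHOLDS = {
    "p95_latency_ms": ("max", 300),
    "error_rate": ("max", 0.01),
    "throughput_rps": ("min", 500),
}


def evaluate(results, thresholds):
    """Return a readable list of violations; an empty list means the run passed."""
    failures = []
    for metric, (kind, limit) in thresholds.items():
        value = results[metric]
        ok = value <= limit if kind == "max" else value >= limit
        if not ok:
            failures.append(f"{metric}={value} violates {kind} {limit}")
    return failures


# Hypothetical results from one load-test run.
run = {"p95_latency_ms": 250, "error_rate": 0.02, "throughput_rps": 620}
violations = evaluate(run, THRESHOLDS)
print(violations or "all acceptance criteria met")
```

Keeping the thresholds in one reviewable place also gives stakeholders a single spot to sign off on what counts as passing.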