In Agile/SAFe development, quality is sometimes sacrificed for scalability. Learn how to scale your system without diminishing quality.
Over the past few years, I have had the opportunity to watch and participate in a variety of projects involving the testing of numerous systems. Over time, these systems have grown into configurations designed to handle large numbers of users. On the front end, we allowed for load balancing and dynamic addressing so that multiple machines could handle an increasing number of simultaneous connections. On the back end, we allowed multiple machines to access our databases and replicated those databases to support a higher volume of transactions. As systems are built to scale, they naturally become more complex, because more moving parts are put into play. They take on a dynamic that all-in-one systems (most commonly what is used for day-to-day development work and testing) will never be able to approximate. This raises an important question: what do we prioritize, the scaling or the software quality? More to the point, do we actually have to make a choice?
What Challenges Come with Scalability?
Scalability comes down to how many users will be interacting with the system at any given time. How an organization gets to the point where it can support and leverage that level of use will vary, and in many cases a tradeoff will need to be made. For systems that will be very large and handle many concurrent connections, infrastructure and the ability to multiply a physical footprint are going to be the most important factors. This leads to a few questions:
• What does that look like?
• Does it mean that in-house replication of systems will be in order?
• Does it mean parallelizing systems and software components over multiple machines?
• Does it entail setting up large-scale configurations of systems in the cloud to provide the footprint needed to meet the needs of a growing organization?
It is tempting to think that, once the footprint and system requirements are met, most of the problem is solved. In my experience, and that of the development teams I have worked with, that assumption is often misleading. Larger systems require more moving parts, and those moving parts interact in ways that simpler all-in-one appliance systems do not. The desire for a simple all-in-one system that runs on a development machine is understandable. It makes a decent platform for developing and examining the workflows and interactions expected for most scenarios, but it is likely to be limited to a handful of users or flows at any given time.
From a development perspective, even with automation in the form of unit, integration, and end-to-end tests, odds are that most flows at this stage are run serially to make sure that they work. For initial development, feature checking, and proofs of concept with stub services, this makes sense. However, my experience shows that these systems and services are much less helpful when we need to crank up the number of concurrent transactions. What works for 5 interactions will not be sufficient for 50 simultaneous interactions, and certainly not for 5,000 or 500,000, as the sketch below illustrates.
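To make that difference concrete, here is a minimal Python sketch that contrasts running a workflow check serially with firing many copies of it concurrently. The endpoint, the workflow steps, and the user counts are all hypothetical; this is not a replacement for a real load-testing tool, only an illustration of why a handful of serial runs says little about behavior under concurrency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed available; any HTTP client would do

BASE_URL = "http://localhost:8080"  # hypothetical all-in-one dev instance


def checkout_workflow(user_id: int) -> float:
    """Run one end-to-end workflow and return its elapsed time in seconds."""
    start = time.perf_counter()
    # A stand-in for a real multi-step flow: log in, add an item, check out.
    requests.post(f"{BASE_URL}/login", json={"user": f"user{user_id}"}, timeout=10)
    requests.post(f"{BASE_URL}/cart", json={"item": "sku-123"}, timeout=10)
    requests.post(f"{BASE_URL}/checkout", timeout=10)
    return time.perf_counter() - start


def run_serially(count: int) -> list[float]:
    """One flow at a time: proves the steps work, not that they scale."""
    return [checkout_workflow(i) for i in range(count)]


def run_concurrently(count: int, workers: int) -> list[float]:
    """Many flows at once: starts to expose contention and timing issues."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checkout_workflow, range(count)))


if __name__ == "__main__":
    print("serial x5:", run_serially(5))
    print("concurrent x50:", run_concurrently(50, workers=50))
```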
Additionally, if an organization embraces Scaled Agile Framework (SAFe) principles, we have to consider a number of factors:
• Who owns development? Ownership is unlikely to rest with a single group; more likely it is spread throughout the entire organization.
• Who writes the modules (and who tests those modules), and at what level? Unit testing is helpful for individual components and simple workflow transactions, but it will certainly not be sufficient for handling large numbers of connections, transactions, volumes of data, or any combination of the three. Additionally, who will handle the integration between units?
• How will deployments work? What does a CI/CD pipeline look like for these types of large-scale applications? How is automated testing integrated? What level of load testing, performance testing, and security testing is put into place to examine the riskiest areas of these setups? (A sketch of one possible pipeline gate follows this list.)
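One way to wire load checks into a pipeline is to treat them as a gate: a short script that runs a burst of concurrent requests against a staging environment and fails the build when a latency budget is exceeded. The sketch below assumes a hypothetical staging URL and an illustrative p95 budget; the thresholds and request counts are placeholders, not recommendations.

```python
import statistics
import sys
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed available

TARGET = "http://staging.example.test/health"  # hypothetical staging endpoint
REQUEST_COUNT = 200
WORKERS = 20
P95_BUDGET_SECONDS = 0.5  # illustrative budget only


def timed_request(_: int) -> float:
    """Issue one request and return its elapsed time in seconds."""
    start = time.perf_counter()
    requests.get(TARGET, timeout=10)
    return time.perf_counter() - start


def main() -> int:
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        timings = list(pool.map(timed_request, range(REQUEST_COUNT)))
    p95 = statistics.quantiles(timings, n=100)[94]  # 95th percentile cut point
    print(f"p95 latency: {p95:.3f}s over {REQUEST_COUNT} requests")
    # A non-zero exit code is what most CI systems treat as a failed step.
    return 0 if p95 <= P95_BUDGET_SECONDS else 1


if __name__ == "__main__":
    sys.exit(main())
```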
These are not trivial questions, and oftentimes the solutions for adapting to larger projects like this add to the complexity. Moreover, this is still a monolithic view of an application, where we are considering just one aspect or section of functionality. In my experience, when an application scales, it doesn't just scale upward in the number of users or transactions; it also scales outward, requiring more pieces to hold everything together. Load balancers and more advanced network configurations come into play when you string together multiple machines behind a unified front end. These systems also frequently involve not just a single application, but multiple applications that need to work together. Who will take on that integration piece? Will it be a matter of simply configuring one application in a large-scale environment, or will it need to account for integration with other applications and services? Yes, a microservices architecture can help here and, to an extent, unify the methods of interaction, but those services still need to be examined within a larger framework: they need to be proven to interact effectively, and together they need to show that the system as a whole is working as intended in this larger environment.
Looking at Development in Larger Environments
One of the most challenging aspects of an application that must handle many interactions, span a number of machines, and coordinate many moving parts is how those components interact with one another.
As mentioned previously, a single all-in-one appliance model may work fine for simple workflows or single-user interactions. It will, however, be completely ineffective for a system that needs to scale up to handle a large number of transactions or users. Scaling a system means that each added component creates an additional layer of challenge. Beyond the architectural elements and how they fit together, we also need to consider which tests need to be configured to ensure that a load balancer is working correctly. How do we ensure that timing allows for database replication or parallel processing? How does a serial process compare to many parallel processes? And how do we interact with other groups and ensure their development efforts fit in with ours? The sketch below shows what two of these checks might look like.
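As a rough illustration, here is a minimal Python sketch of two such checks: one confirms that repeated requests through a load balancer are actually served by more than one backend, and one measures how long a change written to a primary database takes to appear on a read replica. The front-end URL, the `X-Backend-Host` header, and the replica lookup callable are all assumptions for the sake of the example; real systems expose this information in their own ways.

```python
import time
from collections import Counter

import requests  # assumed available

FRONTEND = "http://app.example.test/"   # hypothetical load-balanced front end
BACKEND_HEADER = "X-Backend-Host"       # assumes each backend labels its responses


def check_load_balancing(samples: int = 100, minimum_backends: int = 2) -> Counter:
    """Send repeated requests and count which backend served each one."""
    served_by = Counter()
    for _ in range(samples):
        response = requests.get(FRONTEND, timeout=10)
        served_by[response.headers.get(BACKEND_HEADER, "unknown")] += 1
    assert len(served_by) >= minimum_backends, (
        f"expected at least {minimum_backends} backends, saw {dict(served_by)}"
    )
    return served_by


def wait_for_replication(replica_has_marker, marker: str, timeout_s: float = 30.0) -> float:
    """Poll a read replica until a marker row written to the primary appears.

    `replica_has_marker` is any callable that returns True once the marker
    is visible on the replica; the return value is the observed lag in seconds.
    """
    start = time.perf_counter()
    while time.perf_counter() - start < timeout_s:
        if replica_has_marker(marker):
            return time.perf_counter() - start
        time.sleep(0.5)
    raise TimeoutError(f"marker {marker!r} never appeared on the replica")
```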
During the past several years, I had the opportunity to participate in these types of data transformation projects, and some of the steps we had to take made for interesting and, at times, frustrating adaptations. Running tests serially was going to take too much time, and the tools necessary to spin up large and robust environments would require a fairly sophisticated testing approach. Our solution at the time was to allow for parallel test runs and to focus on each system in the cluster. To that end, we created a system with a manager node that would spin up several worker nodes; each worker node would either handle a subset of tests or spread transactions across individual servers (creating parallel transactions to exercise the load balancing of the front end, or running queries and transactions against designated databases on the back end).
The more systems used in scaling the application surface, the more workers we designated to run these transactions. The tests we created worked in two dimensions. We could set up a large number of worker nodes and spread out our tests, running many unique tests simultaneously and cutting down the time needed to perform a build and deployment cycle. We could also replicate tests within these environments and shift the focus to running similar tests in parallel, checking that we could handle a large number of concurrent transactions. The sketch below shows the shape of that two-dimensional approach.
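The following is not our actual implementation, only a small Python sketch of the idea: a manager process hands work to a pool of workers, either spreading distinct suites across workers (to shorten wall-clock time) or replicating one suite many times (to generate concurrent load). The suite functions are placeholders; in practice each worker would drive a real test suite or a remote node rather than return a string.

```python
from multiprocessing import Pool

# Hypothetical suites; stand-ins for real test runs on worker nodes.
def login_suite() -> str:
    return "login: pass"

def search_suite() -> str:
    return "search: pass"

def checkout_suite() -> str:
    return "checkout: pass"

UNIQUE_SUITES = [login_suite, search_suite, checkout_suite]


def run_suite(suite) -> str:
    """Execute one suite; in a real system this would dispatch to a worker node."""
    return suite()


def spread_unique_tests(workers: int = 4) -> list[str]:
    """Dimension one: distribute distinct suites across workers to cut run time."""
    with Pool(processes=workers) as pool:
        return pool.map(run_suite, UNIQUE_SUITES)


def replicate_for_load(suite, copies: int = 50, workers: int = 10) -> list[str]:
    """Dimension two: run many copies of one suite at once to create concurrent load."""
    with Pool(processes=workers) as pool:
        return pool.map(run_suite, [suite] * copies)


if __name__ == "__main__":
    print(spread_unique_tests())
    print(replicate_for_load(checkout_suite))
```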
How Does This Affect Quality?
In my experience, the biggest challenge in balancing scalability and quality is that the matrix of things that need to be examined grows larger with each new component and each expansion of the original application. To go back to my all-in-one appliance example (this is, by the way, the terminology we use for our simpler systems), a small group can develop components, automate basic workflows, and create a series of repeatable and reliable tests to run in parallel with our manager-and-worker model for fast turnaround of a build to be deployed. In larger scaled environments, that model breaks down quickly, and it becomes impossible for a single development team to cover all of the possibilities. While truly exhaustive testing is never achievable (except for trivial subsystems or atomic interactions), it is possible to make a risk assessment and focus on the areas that are most critical. With systems built at larger scale, this risk assessment becomes more important, and the need to focus on the critical areas takes precedence.
For these large-scale systems, a divide-and-conquer approach is essential. This is also a scenario that requires an “all hands on deck” approach to development and testing, specifically when it comes to constructing a pipeline for integration and deployment. In an ideal situation, small integrations would be performed regularly, so that individual components can be implemented and verified incrementally, and large-scale testing efforts would not have to be repeated in full for every change.
Conclusion
When the goal is to make sure an application can serve a large number of users, or perform a large number of transactions simultaneously, the development and testing needs change, and those changes can feel insurmountable. With full-team engagement, a focus on smaller areas of development and deployment, and stories and deliverables broken down in a way that allows for multiple deliveries, even applications that need to be built to scale can be tamed. Make no mistake, it will take time and many hands working together to get to that point, but we need not sacrifice quality for the ability to scale. We do have to change the way we look at quality, we may need to shift the goalposts to handle those changes, and we may simply need to be patient while each side of the scale comes into balance.