Process Perfection

jroller.com | Apr 7th 2010

Well over a year ago, in a conversation with Alexis Richardson, I came up with a catchy acronym to articulate an idea that I had been kicking around as a simple way to respond to all of the Sturm und Drang in the press and the blogosphere about "lock-in", "data portability" and reliability of cloud computing providers. I said -- "You know what, mate, done properly, it would be like a RAID setup -- it would be an array of cloud providers. Umm, yeah, it would be RAIC! 'Redundant Array of Independent Cloud providers'". Alexis, as I recall, burst out laughing, and said something like "You better trademark that, Mark. That's great."

A few weeks later, I sat down, and wrote a blog post to try to describe the idea in some detail. That post has since become the most popular post on my blog, ever, but that's largely because people hotlink to the image of Wile E. Coyote that I included in it, apparently, and has little to do with the rest of the content. And as it turned out, it's good I didn't try to follow Alexis' advice about that trademark stuff, as an angry commenter let me know (quite correctly) that he was the first to publish the term. :D

Despite all that, the term has gotten some traction. I encounter it now from time to time in other people's writings, and I get a lot of questions about it. By and large, the questions are a consequence of my own laziness, er, bandwidth constraints. That first blog post was never intended to stand alone -- I meant to follow it up with one or more posts, expanding on the idea and explaining what I meant in more detail. Since I never got around to doing that, I can't blame anyone but myself if people are left confused, or have questions.

A few months ago, I was asked by a CSC colleague in Holland if I could contribute a chapter to a book that is being published (in Dutch) there in the coming year on cloud computing. I said, "Sure, I'll write about RAIC!" And so I did. What follows is the English-language input I provided.

Redundant Arrays of Independent Cloud computing providers – RAIC

At one point, many years ago, during the early period of what has since come to be known as the “client/server revolution”, the reliability of hard disk drives in mainframe systems was a powerful sales argument for manufacturers of such systems. Defending their markets from new and aggressive competitors, they made the argument that hard drives used by their competitors were too unreliable (by comparison) and offered unacceptable performance for mission critical work. This argument was helped immensely by the fact that it was, by and large, true.

In 1988, David A. Patterson, Garth A. Gibson and Randy Katz at the University of California, Berkeley, published a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)" at the SIGMOD (Association for Computing Machinery’s Special Interest Group on the Management Of Data) Conference [1]. This paper laid the foundation for a relatively simple, but extraordinarily effective response to the limitations of disk storage in low cost, client server systems. Simple queuing theory demonstrates that an array of service providers, working in parallel, provides higher bandwidth than any equivalent single service provider can. But the low cost disks being used in client server systems seemed unsuitable for such parallel arrays, because they were of relatively low quality, and correspondingly unreliable. The RAID idea was to combine N disks in a redundant manner. This would compensate for the inherent unreliability of the hardware, and allow systems to exploit parallelism for higher bandwidth. The Berkeley paper went on to outline several different implementation strategies, described as “levels”, defining five of them. The genius of the idea was that it took a perceived constraint – low cost, low quality disks – and leveraged it to produce a solution. In other words, RAID leveraged a core attribute of the new model to solve some of its constraints.

RAID was a tremendous success. The commercial implementations in the marketplace have often differed in many ways from the academic ideal embodied by the Berkeley paper, and the precise meaning of a particular “RAID level” has often been ambiguous as a result. But as a general concept for system design, RAID has served as one of the core building blocks of commercial IT in the last 20 years. Faced with an inflection point in the history of IT, where the economic advantages of client server systems were exerting enormous pressure on the industry to find a way to exploit them, the idea of RAID emerged as a central enabling technique. In a very real way, RAID helped pave the way for all of the subsequent developments that leveraged this potential, including the Internet, the Web, and what people have now begun calling “cloud computing.” RAID was a conceptual milestone in the design of IT systems.

Arguably, the pressure that is now being exerted on the IT industry by the economic advantages of cloud computing represents the next major inflection point in the history of technology. Like the client server inflection point before it, cloud computing presents us with a “perfect storm” of correlated factors, all of which have now come together to create a model of system design that is disruptive due to the business opportunities that it is enabling.

However, like the client server inflection point before it, there are significant gaps in the conceptual framework of cloud computing design and architectural patterns. These gaps manifest themselves as problems, constraints and challenges, some of which make the use of the new model untenable in certain use cases. Like the mainframe before it, entrenched models of computing have certain attributes – such as reliability – which are expressed and implemented in ways that cannot yet be replicated using the newer model. And like the client server inflection point before it, these problems and constraints are being held up by entrenched interests as justification for rejection of the new model – “this doesn’t work!”

RAIC – Redundant Arrays of Independent Cloud providers – is a conceptual response to some of these constraints. Like RAID before it, RAIC proposes a particular set of design patterns, which can be used to not only mitigate certain constraints, but also allow new potential benefits to accrue, particularly for enterprise customers.

Constraints, problems and challenges facing cloud computing

Cloud computing is a very young conceptual model. Arguably, a consensus on what the term means has still not been reached in the industry, and to the extent that any consensus does exist, however rough, it has only emerged in the last year. It is therefore hardly surprising that the model that it represents has a number of problems, gaps, constraints and challenges that have yet to be resolved.

Prominent amongst these are the following issues:

Reliability: Cloud computing providers have business models that are optimised for their initial and primary customer base: providers of consumer-facing Web services. As such, they offer levels of reliability that are suitable for the consumer Web. These levels of reliability are inadequate for many (if not all) transactional enterprise workloads. Moreover, due to constraints in their own business models, consumer-oriented cloud computing providers have proven reluctant to change this – they have been slow to offer a different set of terms to enterprise customers, and slow to offer any kind of guarantees or Service Level Agreements (SLAs), which are standard approaches in traditional outsourcing and hosting relationships in the enterprise market. Above and beyond that, increasing reliance on Internet-based sourcing providers calls into question the reliability of the Internet itself. In a world where an accident in the Mediterranean can take India, parts of Africa and Asia effectively offline for days [2], this question is more than academic.
Lock-in/out: As befits its relative youth, cloud computing is a domain that encompasses a broad and diverse array of solutions, many of them competing with one another as solutions for the same class of problem. These competitive solutions are, by their nature, largely incompatible with one another, and very often proprietary, so that there is little to no transparency for a customer into the implementation itself. Choosing a provider in such a context carries significant risks. Should that solution prove to be the loser in the competitive marketplace, customers that have committed to it will find themselves in a sub-optimal situation. The recent demise of the Platform as a Service provider Coghead [3] provides an object lesson in these sorts of risks.
Data portability: Closely correlated with the problem of lock-in is the problem of data formats. If a provider’s solution uses proprietary formats, a customer may have a significant data transformation burden to bear, should they decide to extract the data for storage elsewhere. Moreover, in some cases, providers who have optimised for the consumer-facing market are not prepared to even provide direct access to customer data. In some cloud business models, customer data is a valuable good, and earning revenue on it a key part of the provider’s own profit structure. There are unresolved debates about the boundaries of ownership of data to be drawn between a customer and a provider, and not all cloud providers have strategies that are acceptable for enterprise use cases.
Data size and the laws of physics: In an increasing number of cases, and as startling as it may seem, the speed of light is becoming a serious business constraint on the use of cloud computing providers. An example will serve to illustrate the problem: consider an enterprise using a cloud computing provider to host a Business Intelligence (BI) solution. In our (completely contrived and artificial) example, we will assume that the business in question has decided to store both raw and aggregate transactional data in the cloud. This is not as far-fetched as one might assume – in many businesses, large quantities of data are collected on individual transactions solely for the purpose of serving as the basis for later aggregated figures. In such a case, an argument can be made for an “elastic” solution architecture, where the resources required to collect and store the raw data do not run twenty four hours a day, seven days a week, but only as needed. So, to return to our fictional example, we have a business that is collecting vast quantities of data, and storing them with a cloud provider. Let us also assume that the business has been operating this way for some time, and has accumulated many terabytes of data in the cloud. What happens, however, if the sourcing relationship with that provider suddenly goes sour, and the business wants to terminate it? If the provider only offers an Internet-based “pipeline” to the customer’s data, the amount of time it will take to “pull the data out” of the cloud is a function of the amount of data and network bandwidth available. As one industry veteran put it, speaking at the first CloudCamp in London and describing a situation confronted by a start-up company he was involved with, “even if we had run a batch job round the clock for a month, we still would not have been able to extract all of our data.” [4] The laws of physics place an implacable limit on such things.
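To put rough numbers on the data-extraction problem, here is a back-of-envelope sketch in Python. The dataset size, link speed and utilisation figure are purely illustrative assumptions of ours; they are not taken from the CloudCamp talk quoted above, and `transfer_days` is a hypothetical helper name.

```python
def transfer_days(data_terabytes: float, link_megabits_per_s: float,
                  utilisation: float = 1.0) -> float:
    """Days needed to move a dataset over a network link.

    All inputs are hypothetical. `utilisation` is the fraction of the
    nominal link speed actually achieved (1.0 means the full line rate).
    """
    bits = data_terabytes * 1e12 * 8                    # decimal terabytes -> bits
    bits_per_second = link_megabits_per_s * 1e6 * utilisation
    return bits / bits_per_second / 86_400              # seconds -> days


# Assumed scenario: 50 TB behind a 100 Mbit/s link at 70% utilisation.
days = transfer_days(50, 100, utilisation=0.7)
print(f"{days:.0f} days")
```

Under those assumed figures the transfer takes roughly two months of uninterrupted copying, which is the kind of result the start-up in the anecdote ran into.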

What is RAIC, and how does it work

RAIC presents us with design patterns that can mitigate all of these constraints. In a nutshell, the idea is simply this: keep multiple, redundant copies of all data with multiple cloud providers. The design patterns embodied by the various RAID levels may provide templates for similar patterns here, but that is the stuff of future work. In this paper, we will limit the discussion to the implications of the simplest possible such model – the equivalent of RAID level 1, mirroring of data.

Conceptually, RAIC involves mirroring a business’s data with multiple cloud providers. Rather than establishing an “eggs in one basket” sourcing model, as shown in Figure 1, RAIC suggests a model where a business has a commercial relationship with multiple cloud providers, and writes all data to each and every one of these providers in parallel. Figure 2 depicts RAIC in action.

Figure 1 – the “all eggs in one basket” model

RAIC, like RAID before it, capitalises on existing technologies (such as, in the figures shown here, Virtual Private Networking (VPN) techniques), but also leverages attributes of the components themselves to enable the model itself. In RAIC’s case, the low cost of cloud computing services, and the lack of capital expenditures needed to enable the model, are what make it a viable solution.

Figure 2 – RAIC
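The mirroring model can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not an implementation of any real provider's API: `InMemoryProvider` is a hypothetical stand-in for a provider's storage service, and `MirroredStore` is a hypothetical name for the component that writes every object to every provider in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemoryProvider:
    """Hypothetical stand-in for one cloud provider's storage service."""

    def __init__(self, name):
        self.name = name
        self.available = True   # flip to False to simulate an outage
        self._objects = {}

    def put(self, key, value):
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        self._objects[key] = value

    def get(self, key):
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        return self._objects[key]


class MirroredStore:
    """RAID-1-style mirror: every write goes to every provider."""

    def __init__(self, providers):
        self.providers = list(providers)

    def put(self, key, value):
        # Issue the writes in parallel, then wait for all of them.
        with ThreadPoolExecutor(max_workers=len(self.providers)) as pool:
            futures = [pool.submit(p.put, key, value) for p in self.providers]
            for future in futures:
                future.result()   # re-raises if a provider rejected the write

    def get(self, key):
        # Any reachable mirror can serve the read.
        for provider in self.providers:
            try:
                return provider.get(key)
            except ConnectionError:
                continue
        raise ConnectionError("no provider reachable")


providers = [InMemoryProvider(n) for n in ("provider-a", "provider-b", "provider-c")]
store = MirroredStore(providers)
store.put("invoice-42", b"total=100")

providers[0].available = False      # simulate an outage at one provider
print(store.get("invoice-42"))      # still served by a surviving mirror
```

In this sketch a write fails if any single provider is unreachable; a real design would have to decide whether to tolerate that, retry, or repair the lagging mirror later.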

Advantages RAIC provides

Let us now revisit the constraints, problems and challenges listed earlier, and examine how the RAIC concept can mitigate each, in turn.

Reliability: This is the most obvious benefit of the model, and arguably the easiest to understand. Reliability becomes a function of the number of providers. More providers equate to higher reliability. Moreover, distributing the pool of providers across a number of geographies could enable a design that is resistant to transient, localised problems with the Internet. An enterprise using a global RAIC could effectively achieve the same aggregate reliability as the global Internet itself – this is the same argument as that made on behalf of the Content Distribution Network (CDN) concept, with the difference that CDN is a read-only solution, whereas a RAIC is a write-only design pattern. An event that caused the entire, global Internet to fail would be likely to be a cataclysm of such apocalyptic proportions that the failure of business systems might not be the highest priority issue.
Lock-in/out: Essentially, a RAIC system design eliminates the concern of lock-in. If a customer employing a RAIC strategy decides to terminate a commercial relationship with a provider, this presents no problem – the customer is no longer in a relationship of sole dependency with such a provider. Customers will, of course, have to balance risks involved in changes to reliability and availability (at least until one provider can be replaced by another), but this is a straightforward business decision, and one that RAIC enables the business to make, by breaking the sole dependency on a single provider.
Data portability, data size and the laws of physics: In our experience, this is the least intuitive of the advantages of a RAIC-like model, but in our judgement, the most compelling. Put simply, RAIC sidesteps these problems. It doesn’t solve them, per se – it enables a business to simply go around them. Consider the “we’re terminating our relationship” scenario suggested in the lock-in constraint. In a RAIC system, a customer would merely issue a “delete” job on their way out the door. There is no “data portability” problem, because RAIC eliminates the need to ever move the data in bulk. Similarly, this mitigates the problems posed by large datasets (vs. the laws of physics) in a straightforward way. If a business never needs to move its data, it need not be concerned with the fact that it isn’t feasible to do so. Of course, this assumes a pre-requisite: that a “delete” command to a cloud computing provider really does what the customer wants it to – that “delete” really means “delete”. But, in our experience, customers will find it easier to negotiate the terms of “delete” than to attempt to re-write a provider’s cost model, not to mention the implacable laws of physics.
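The claim that reliability becomes a function of the number of providers can be illustrated with a short calculation. The sketch below rests on an assumption that the text does not make explicit, namely that provider outages are statistically independent. The 99% availability figure is illustrative only, and `combined_availability` is a hypothetical helper name.

```python
from math import prod


def combined_availability(availabilities):
    """Probability that at least one provider is reachable.

    Assumes provider outages are statistically independent. That is an
    idealisation: shared networks or shared upstream suppliers would
    make real outages correlated and the true figure lower.
    """
    return 1 - prod(1 - a for a in availabilities)


# Illustrative figure: each provider is assumed to be reachable 99% of the time.
for n in (1, 2, 3):
    print(n, f"{combined_availability([0.99] * n):.6f}")
```

Under that assumption, each additional mirror multiplies the chance of a total outage by the unavailability of the new provider, so two 99% providers give roughly 99.99% and three give roughly 99.9999%.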

Like RAID before it, RAIC system designs hint at tremendous opportunity for optimisation, and new capabilities that might emerge as a consequence of the same. Consider the question of how to implement a mechanism to ensure that data is written in parallel to each of the cloud providers involved. We have not detailed any particular implementation strategies, nor do we intend to: these are left as an exercise for the reader. But allow us to explore some of the implications of various strategies for a moment, in order to highlight what we see as fertile ground for optimisation and the emergence of new capabilities.

A naïve implementation of RAIC might simply write all transactions to all providers concurrently – in parallel, but synchronously. This would be simple enough to do, and would work. But this is certainly not the only possible implementation strategy. It is almost as straightforward to imagine more complex implementations, using some form of asynchronous messaging. Imagine a system where transactions were first written to one provider, in a synchronous manner, and then propagated, using asynchronous messaging techniques, to the other providers. This is similar to the design patterns used to implement federated databases. By extension, it is simple to imagine any number of permutations of this sort of design, ranging from an intermediate messaging broker, to peer-to-peer quorum algorithms that distribute the role of the broker as well. Further, these various approaches clearly have complex, differing implications for the role of data in an overall system. Ideas like BASE [5], the CAP theorem [6] and “eventual consistency” [7] will all have a role to play here.
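As one hedged illustration of the second strategy, the following Python sketch writes synchronously to a primary provider and propagates each write to the other providers in the background. Plain dictionaries stand in for the providers, an in-process queue stands in for the messaging layer, and the class and method names are hypothetical. A real system would also need durable messaging, retries and conflict handling, none of which is shown here.

```python
import queue
import threading


class AsyncReplicator:
    """Write synchronously to a primary, propagate to the rest in the background."""

    def __init__(self, primary, replicas):
        self.primary = primary      # dict standing in for the primary provider
        self.replicas = replicas    # list of dicts standing in for the others
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def put(self, key, value):
        self.primary[key] = value           # synchronous: the caller waits for this
        self._pending.put((key, value))     # propagation is queued, not awaited

    def _drain(self):
        # Background loop: apply each queued write to every replica in order.
        while True:
            key, value = self._pending.get()
            for replica in self.replicas:
                replica[key] = value
            self._pending.task_done()

    def wait_until_consistent(self):
        """Block until every queued write has reached every replica."""
        self._pending.join()


primary, replica_a, replica_b = {}, {}, {}
store = AsyncReplicator(primary, [replica_a, replica_b])
store.put("order-1", "shipped")
# At this point the primary is up to date; the replicas may or may not be.
store.wait_until_consistent()
print(replica_a == replica_b == primary)
```

The window between `put` returning and the replicas catching up is where the consistency trade-offs mentioned above come into play.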

Broader implications for system design

These considerations imply that RAIC is only a starting point, a foundational design pattern that enables other, more complex patterns in turn.

We think it will be useful to explore some of these broader implications, and to place RAIC in a conceptual framework that relates it to other aspects of system design. However, before doing so, let us first make one thing clear with regard to RAIC’s relationship to the concept of cloud computing itself.

RAIC seems easiest to understand as a metaphor for data storage, and this is not a coincidence – ultimately, RAIC is about the storage of data with different providers. However, this sometimes leads people to assume that it is also only applicable at the Infrastructure as a Service layer of the SPI stack. This seems to be a natural consequence of the nature of the SPI stack and the separation of concerns that it seems to imply. For most people, “data storage” equates to things like “hard disks” and “databases”. Those are concepts, moreover, that one finds most prominently at the IaaS layer of the SPI stack. Ergo, RAIC equates to IaaS.

The problem with this is that it unnecessarily restricts the applicability of the pattern. RAIC is perhaps easiest to understand at the IaaS level, but that does not mean that it does not apply to the Platform as a Service (PaaS) or Software as a Service (SaaS) levels as well. Figure 3 demonstrates the pattern at the SaaS level.

Figure 3 – RAIC at the SaaS level of the SPI stack

In this figure, three SaaS providers of enterprise applications are being used in parallel by an organisation. In the simplest possible example, imagine an online spreadsheet that uses the APIs of these providers to store the associated data. Of course, this presents implementation challenges – in particular, with regard to a common user interface to these services, which would otherwise be seen as, and provided by, one of the service providers themselves – but the overall point should be clear. Similar examples can be contrived for e-mail, for example.

What this observation, as well as our earlier remarks about various implementation and data storage strategies, demonstrates is that RAIC is a manifestation of deep design principles, with broad applicability.

Consider the formal, scientific definition of “redundancy”. Redundancy in engineering means the precise duplication of components [8]. Strictly speaking, insofar as the various disks in a RAID system are not precisely identical (identical manufacturer, identical model, identical attributes such as size, etc.), then it is incorrect to speak of these components as being “redundant”. The general usage of the term focuses on isomorphism – components that are not identical, but structurally equivalent. Two SCSI disk drives, made by different manufacturers, seem to be an example. What this common interpretation misapprehends is that it is not the isomorphism of such components that enables their interchangeability – it is their isofunctional nature. “Isofunctional” is a term that means “behaving in the same manner” – more strictly, a component is isofunctional to another if, given the same inputs, they produce the same outputs. RAID arrays can (and do) have disks that are very different from one another (different manufacturers, different sizes, etc.), but they all behave the same way, due to their conformance with the SCSI standard interface. They are isofunctional. The more precise term for a “component that is isofunctional without being isomorphic” is, strangely enough, “degeneracy” [9]. Thus, for both RAID and RAIC, the usage of the term “redundant” is, strictly speaking, wrong. “Degenerate” would be the more accurate term. We suspect the terms would enjoy less popularity, however, were they more correct in this sense.
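The distinction can be made concrete with a small Python sketch. The two classes below are hypothetical stand-ins of our own devising: they are built quite differently inside (so they are not isomorphic), yet they return the same outputs for the same inputs (so they are isofunctional in the sense used above).

```python
class DictStore:
    """Keeps values in a hash map."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]


class LogStore:
    """Keeps an append-only log and scans it backwards on read."""

    def __init__(self):
        self._log = []

    def write(self, key, value):
        self._log.append((key, value))

    def read(self, key):
        for logged_key, value in reversed(self._log):
            if logged_key == key:
                return value
        raise KeyError(key)


def same_outputs(stores, operations):
    """Feed identical writes to each store and compare what the reads return."""
    results = []
    for store in stores:
        for key, value in operations:
            store.write(key, value)
        results.append([store.read(key) for key, _ in operations])
    return all(result == results[0] for result in results)


ops = [("a", 1), ("b", 2), ("a", 3)]
print(same_outputs([DictStore(), LogStore()], ops))
```

Either store could replace the other behind the same `write` and `read` calls, which is the property that makes components interchangeable in a RAID or RAIC array.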

We raise this point out of more than simple academic curiosity. The field where the term “degeneracy” is most commonly used, and where the most care is paid to a precise distinction between these terms, is biology.

Summary

Cloud computing is driving a rapid change in the overall complexity of IT system designs. It is our belief that “traditional” models of IT system design are being pushed to the outer limits of their utility by this change – we believe that these models are already beginning to fail us, and that this will increasingly become the case. When we search for alternative models of system design to guide us, we find biology to be the most promising source. Biomimesis is a term that describes the explicit attempt to design systems that mimic biological models [10]. One of the most profound differences between biological models and “traditional” IT systems is their relationship to duplication. Traditional IT system design approaches strive to eliminate “unnecessary” duplication – ranging from attempts to “normalise” database designs, to attempts to eliminate duplicate code in inheritance-based type systems, to the focus on reuse that characterises SOA. Biological systems, on the other hand, are rife with duplication – duplicate genes, duplicate cells, duplicate processes, duplicate organs, and so on. Often, this duplication is not isomorphic in nature, but isofunctional, as in the brain’s well-documented ability to heal after serious injuries by “rewiring” parts of itself to perform lost functions. It is in this context that biologists find themselves needing to be quite precise about the distinction between “redundancy” and “degeneracy”.

We do not think that the core of the RAIC concept – duplication – being the same as one of the core design principles in biological systems is a coincidence. We believe it is a consequence – a consequence of the pressure being exerted on us as designers to tame the ever-increasing complexity of our systems.

There is no silver bullet [11], and RAIC is not one. But we do believe it has significant value as a foundational design pattern, one that will enable businesses to exploit cloud computing models in a manner that is reliable enough to meet business goals, and that will also enable new, emergent capabilities, which we can now only dimly imagine.

[1] http://www-2.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf

[2] http://www.computerworld.com/s/article/9060339/Cable_damage_in_Mediterranean_disrupts_Internet_in_Mideast

[3] http://blogs.zdnet.com/SAAS/?p=668

[4] http://www.redmonk.com/jgovernor/2008/07/21/cloudcamp-london-the-inauguration/

[5] http://queue.acm.org/detail.cfm?id=1394128

[6] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495

[7] http://queue.acm.org/detail.cfm?id=1466448

[8] http://en.wikipedia.org/wiki/Redundancy_(engineering)

[9] http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=15929

[10] http://biomimicryinstitute.org/

[11] http://en.wikipedia.org/wiki/No_Silver_Bullet

Original Page: http://www.jroller.com/MasterMark/