
EMC Proven Professional Knowledge Sharing 2010

Above The Clouds - Best practices to Create a Sustainable Computing Infrastructure to Achieve Business Value and Growth

Paul Brant
EMC Corporation
[email protected]


Table of Contents

Table of Figures
Table of Equations
Abstract
Introduction
Sustainability Metrics and what does it mean to be Sustainable
The Challenges to achieve Sustainability
The concept of Green
IT Sustainability and how to measure it
The Carbon Footprint
Environment Pillar - Green Computing, Growing Sustainability
Standards and Regulations
Best Practice – In the US, Consider Executive Order 13423 and energy-efficiency legislation regulations
Best Practice – Use tools and resources to understand environmental impacts
IT facilities and Operations
Best Practice – Place Data Centers in locations of lower risk of natural disasters
Best Practice – Evaluate Power GRID and Network sustainability for IT Data Centers
Effectiveness Pillar
Services and Partnerships
Tools and Best Practices
Efficiency Pillar
Information Management
Best Practice – Implement integrated Virtualized management into environment
Best Practice - Having a robust Information Model
Best Practices in Root Cause Analysis
Best practice - Effective root-cause analysis technique must be capable of identifying all those problems automatically
Best Practice – Rules-based correlation using CCT
Best Practice - Reduction of Downstream Suppression
Self Organizing Systems
Best Practice - Dynamic Control in a Self Organized System
Best Practice – Utilize STR's when implementing adaptive controllers
Best Practice – Require System Centric Properties
Application
Best Practice – Architect a designed for Run solution
Storage
Compression, Archiving and Data Deduplication
Autonomic self healing systems
Storage Media – Flash Disks
Best Practice – Utilize low power flash technologies
Server Virtualization
Best Practice – Implement DRS
Network
Best Practice - Architect Your Network to Be the Orchestration Engine for Automated Service Delivery (The 5 S's)
Scalable
Simplified
Standardized
Shared
Secure
Best Practice - Select the Right Cloud Network Platform
Cloud network infrastructure
Cloud network operating system (OS)
Cloud network management systems
Cloud network security
Best Practice – Consider implementing layer 2 Locator/ID Separation
Best Practice – Build a Case to Maximize Cloud Investments
Best Practice - Service Providers - Maximize and sustain Cloud Investments
Best Practice - Enterprises - Maximize and sustain Cloud Investments
Best Practice – Understand Information Logistics and Energy transposition tradeoffs
Infrastructure Architectures
Data Center Tier Classifications
Cloud Overview
Cloud Layers of Abstraction
Cloud Type Architecture(s) Computing Concerns
Failure of Monocultures
Convenience vs. Control
General distrust of external service providers
Concern to virtualize the majority of servers and desktop workloads
Fully virtualized environments are hard to manage
Many environments can't be virtualized onto x86 and hypervisors
Concerns on security
Industry Standards
Applications support for virtualized environments, or only the one the vendor sells
Environmental Impact Concerns
Threshold Policy Concerns
Interoperability issues Concerns
Hidden Cost Concerns
Unexpected behavior concerns
Private Cloud
Best Practice – Implement a dynamic computing infrastructure
Best Practice – Implement an IT Service-Centric Approach
Best Practice – Implement a self-service based usage Model
Best Practice – Implement a minimally or self-managed platform
Best Practice – Implement a consumption-based billing methodology
Public Cloud
Community Cloud
Best Practice in Community Cloud – Use VM's
Best Practice in Community Cloud – Use Peer to Peer Networking
Best Practice in Community Cloud – Distributed Transactions
Best Practice in Community Cloud – Distributed Persistence Storage
Challenges in the federation of Public and Private Clouds
Lack of visibility
Multi-tenancy Issues
Cloud computing needs to cover its assets
Warehouse Scale Machines - Purposely Built Solution Options
Best Practice – WSC's must achieve high availability
Best Practice - WSC's must achieve cost efficiency
WSC (Warehouse Scale Computer) Attributes
One Data Center vs. Several Data Centers
Best Practice – Use Warehouse Scale Computer Architecture designs in certain scenarios
Architectural Overview of WSC's
Best Practice – Connect Storage Directly or via NAS in WSC environments
Best Practice – WSC should consider using non-standard Replication Models
Networking Fabric
Best Practice – For WSC's Create a Two level Hierarchy of networked switches
Handling Failures
Best Practice - Use Sharding and other requirements in WSC's
Best Practice – Implement application specific compression
Utility Computing
Grid computing
Cloud Type Architecture Summary
Infrastructure as a Service and more
Amazon Web services
Cloud computing
Grid Computing
Similarities and differences
Business Practices Pillar
Process Management and Improvement
Best Practice - Provide incentives that support your primary goals
Best Practice - Focus on effective resource utilization
Best Practice - Use virtualization to improve server utilization and increase operational efficiency
Best Practice - Drive quality up through compliance
Best Practice - Embrace change management
Best Practice - Invest in understanding your application workload and behavior
Best Practice - Right-size your server platforms to meet your application requirements
Best Practice - Evaluate and test servers for performance, power, and total cost of ownership
Best Practice - Converge on as small a number of stock-keeping units (SKUs) as you can
Best Practice - Take advantage of competitive bids from multiple manufacturers to foster innovation and reduce costs
Standards
Best Practice - Use standard interfaces to Cloud Architectures
Security
Best Practice – Determine if cloud vendors can deliver on their security claims
Best Practice - Adopt federated identity policies backed by strong authentication practices
Best Practice – Preserve segregation of administrator duties
Best Practice - Set clear security policies
Best Practice - Employ data encryption and tokenization
Best Practice - Manage policies for provisioning virtual machines
Best Practice – Require transparency into cloud operations to ensure multi-tenancy and data isolation
Governance
Best Practices – Do your due diligence of your SLA's
Compliance
Best Practice - Know Your Legal Obligations
Best Practice - Classify / Label your Data & Systems
Best Practice - External Risk Assessment
Best Practice - Do Your Diligence / External Reports
Best Practice - Understand Where the Data Will Be!
Best Practice - Track your applications to achieve compliance
Best Practice - With off-site hosting, keep your assets separate
Best Practice - Protect yourself against power disruptions
Best Practice - Ensure vendor cooperation in legal matters
Profitability
Business and Profit objectives to achieve Sustainability
Best Practice - Consumer Awareness and Transparency
Best Practice – Implement Efficiency Improvement
Best Practice - Product Innovation
Best Practice - Carbon Mitigation
Information Technology Sector Initiatives
Best Practice – Virtualization
Best Practice - Recycling e-Waste
Cloud Profitability and Economics
Cloud Computing Economics
Best Practice – Consider Elasticity as part of the business deciding metrics
Economics Pillar
Best Practice – Consider Efficiency as only one part of the Economic Sustainable equation
Conclusion
Appendix A – Green IT, SaaS, Cloud Computing Solutions
Appendix B – Abbreviations
Appendix B – References
Author's Biography
Index

Table of Figures

Figure 1 - Sustainability and Technology Interest Trends
Figure 2 – Achieving IT Sustainability
Figure 3 - US Energy Flows (Quadrillion BTUs)[21]
Figure 4 – Efficiency - Megawatts to Infowatts to Business Value Solutions
Figure 5 – IT Data Center Sustainability Taxonomy
Figure 6 – Top Level Sustainability Ontology (see notes below)
Figure 7 - U.S. Federal Emergency Management Agency – Disaster Map[22]
Figure 8 - U.S. Geological Survey Seismological Zones (0=lowest, 4=highest)
Figure 9 - U.S. NOAA Hurricane Activity in the United States
Figure 10 - Opportunities for efficiency improvements
Figure 11 - EPA Future Energy Use Projections
Figure 12 – Closed Loop System
Figure 13 – Sustainability Ontology – Self Organizing Systems
Figure 14 – A Sustainable Information transition lifecycle
Figure 15 – Self Organized VM application controller
Figure 16 - Energy in Electronic Integrated Circuits
Figure 17 - Moore's Law - Switching Energy
Figure 18 - Data by physical vs. Internet transfer
Figure 19 – Sustainability Ontology – Infrastructure Architectures
Figure 20 - Cloud Topology
Figure 21 - Cloud Computing Topology
Figure 22 - Using a Private Cloud to Federate disparate architectures
Figure 23 - Community Cloud
Figure 24 - Community Cloud Architecture
Figure 25 – Sustainability Ontology – Business Practices
Figure 26 - A continuous process helps maintain the effectiveness of controls as your environment changes
Figure 27 - Consistent and well-documented processes help ensure smooth changes in the production environment
Figure 28 – Provisioning for peak load
Figure 29 – Under Provisioning Option 1
Figure 30 – Under Provisioning Option 2

Table of Equations

Equation 1 – Computing Energy Efficiency
Equation 2 – Computing Energy Efficiency-Detailed
Equation 3 – Computing Energy Efficiency-Detailed as a function on PUE
Equation 4 – IT Long Term Sustainability Goal
Equation 5 – What is Efficient IT
Equation 6 – Linear model of a Control System
Equation 7 – Energy Consumed by a CMOS ASIC
Equation 8 – Power Consumed by a CMOS ASIC
Equation 9 – Cloud Computing - Cost Advantage
Equation 10 – Cloud Computing - Cost tradeoff for demand that varies over time


Disclaimer: The views, processes or methodologies published in this article are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes or methodologies.


Abstract

The IT industry is embarking on a new paradigm of service delivery. From the start, each information technology wave, from the mainframe and minicomputer to the PC/microprocessor and networked distributed computing, has offered new challenges and benefits. We are now embarking on a new wave, offering new methods and technologies to achieve sustainable growth and increased business value to companies ranging from small businesses to major multinational corporations.

There are various new technologies and approaches that businesses can now use to allow a

more efficient and sustainable growth path. For example, Cloud computing, Cloud services,

Private Clouds and Warehouse-Scale data center design methodologies are just a few of the

approaches to sustainable business growth. Other "services on demand" offerings are starting to make their mark on the corporate IT landscape as well.

Every business has its own requirements and challenges when creating a sustainable IT

business model that addresses the need for continued growth and scalability. Can this new

wave, fostered by burgeoning new technologies such as Cloud computing and the ever-accelerating information growth curve, turn the IT industry into a flat and level playing field?

What metrics should we implement to allow a systematic determination about technology

selection? What are the standards and best practices in the evaluation of each technology? Will

one technology fit all business, environmental and sustainable possibilities?

For example, to sustain business growth, IT consumers require specific standards so that data

and applications are not held captive by non-interoperable Cloud services providers. Otherwise,

we end up with walled gardens as we had with CompuServe, AOL, and Prodigy in the period

before the Internet and World Wide Web emerged. Data and application portability standards have

to be firmly in place, with solid Cloud service provider backing.

Data centers are changing at a rapid exponential pace, faster than at any other point in history.

However, with all the changes in data center facilities and the associated information

management technologies, IT professionals face numerous challenges in unifying their peers to

solve problems for their companies. Sometimes you may feel as if you are talking different

languages or living on different planets. What do virtual computers and three-phase power have

in common? Has your IT staff or department ever come to you asking for more power without


considering that additional power/cooling is required? Do you have thermal hot spots in places

you never expected or contemplated? Has virtualization changed your network architecture or

your security protocols? What exactly does Cloud computing mean to your data center?

Is Cloud computing or SAAS (Storage or Software as a service) being performed in your data

center already? More importantly, how do you align the different data center disciplines to

understand how new technologies will work together to solve data center sustainability

problems? One possible Best Practice is a standardized data center stack framework that would address the above issues, allowing Best Practices to achieve a sustained business-value growth trajectory.

How do we tier Data Center efficiency and map it back to business value and growth?

In 2008, American data centers collectively consumed more power than all the televisions in every home and every sports bar in America. This puts the scale of the problem in perspective. All of these questions will be addressed and possible solutions provided.

In summary, this article will go above the Cloud, offering Best Practices that will align with the

most important goal, creating a sustainable computing infrastructure to achieve business value

and growth.


Introduction

Deploying IT, SaaS, and Cloud Computing solutions to create a sustainable profitability model for businesses centers on identifying processes and technologies that create value propositions for all involved. This can be achieved by producing eco-centric business analytics, metrics, key performance indicators, and sustainability measures with the goal of supporting the development of green and sustainable business models (see the section titled "Sustainability Metrics and what does it mean to be Sustainable" for the definition of sustainability).

This can be a daunting task. The good news is that there is growing interest in sustainability as well as in various technologies. Some, such as Cloud computing, are on a major upward trend, as shown in Figure 1 - Sustainability and Technology Interest Trends, below. This trending information shows the number of hits on Google's search engine normalized to the topic of Sustainability, outlined in blue. It appears that Sustainability and Cloud computing are certainly on an upward trend. We will find out why.

Figure 1 - Sustainability and Technology Interest Trends (source: Google Trends)


According to Gartner[1], the top 10 strategic technologies for 2010 include:

Cloud Computing: Cloud computing is a style of computing that characterizes a model in which

providers deliver a variety of IT-enabled capabilities to consumers. We can exploit Cloud-based

services in a variety of ways to develop an application or a solution. Using Cloud resources

does not eliminate the costs of IT solutions, but does re-arrange some and reduce others. In

addition, enterprises consuming cloud services will increasingly act as cloud providers and

deliver application, information or business process services to customers and business

partners. Some have joked that Cloud computing is analogous to preferring to pay for the power we use, rather than buying a power plant! In addition, Gartner predicts that by 2012, 20 percent of businesses will own no IT assets.

Several interrelated trends are driving the movement toward decreased IT hardware assets,

such as virtualization, cloud-enabled services, and employees running personal desktops and

notebook systems on corporate networks. The need for computing hardware, either in a data

center or at the desktop, will not go away. However, if the ownership of hardware transitions to

third parties, there will be major shifts throughout the IT hardware industry. For example,

enterprise IT budgets either will shrink or be reallocated to more-strategic projects. Enterprise

IT staff will be either reduced or re-skilled to meet new requirements, and/or hardware

distribution will have to change radically to meet the requirements of the new IT hardware

sustainability model.

Advanced Analytics: Optimization and simulation use analytical tools and models to maximize

business process and decision effectiveness by examining alternative outcomes and scenarios,

before, during and after process implementation and execution. This can be viewed as a third

step in supporting operational business decisions. Fixed rules and prepared policies gave way

to more informed decisions powered by the right information delivered at the right time, whether

through customer relationship management (CRM), enterprise resource planning (ERP), or other

applications. The new step provides simulation, prediction, optimization and other analytics, not

simply information, to empower even more decision flexibility at the time and place of every

business process action. The new step looks into the future, predicting what can or will happen.

Client Computing: Virtualization is bringing new ways of packaging client computing

applications and capabilities. As a result, the choice of a particular PC hardware platform, and


eventually the OS platform, becomes less critical. Enterprises should proactively build a five to

eight year strategic client computing roadmap that outlines an approach to device standards,

ownership and support, operating system and application selection, deployment and update,

and management and security plans to manage diversity.

IT for Green: IT can enable many green initiatives. The use of IT, particularly among the white-

collar staff, can greatly enhance an enterprise’s green credentials. Common green initiatives

include the use of e-documents, reducing travel via teleconferencing, and supporting remote workers and teleworking. IT can also provide the analytic tools that others in the enterprise may use to

reduce energy consumption in the transportation of goods or other carbon management

activities.

According to Gartner, by 2014, most IT business cases will include carbon remediation costs.

Today, server virtualization and desktop power management demonstrate substantial savings in

energy costs, and those savings can help justify projects. Including carbon costs into business

cases provides an additional measure of savings, and prepares the organization for increased

scrutiny of its carbon impact.

Economic and political pressure to demonstrate responsibility for carbon dioxide emissions will

force more businesses to quantify carbon costs in business cases. Vendors will have to provide

carbon life cycle statistics for their products or face market share erosion. Incorporating carbon

costs in business cases will only slightly accelerate replacement cycles. A reasonable estimate

for the cost of carbon in typical IT operations is an incremental one or two percentage points of

overall cost. Therefore, carbon accounting will more likely shift market share than market size.

In 2012, 60 percent of a new PC's total life greenhouse gas emissions will have occurred before

the user first turns the machine on. Progress toward reducing the power needed to build a PC

has been slow. Over the course of its entire lifetime, a typical PC consumes 10 times its own

weight in fossil fuels, but around 80 percent of a PC's total energy usage still happens during

production and transportation.

Greater awareness among buyers and those that influence buying, greater pressure from eco-

labels, and increasing cost pressures and social pressure have raised the IT industry's awareness of

the problem of greenhouse gas emissions. Requests for proposal (RFPs) now frequently look


for both product and vendor environment-related criteria. Environmental awareness and

legislative requirements will increase recognition of production as well as usage-related carbon

dioxide emissions. Technology providers should expect to provide carbon dioxide emission data

to a growing number of customers.

Reshaping the Data Center: In the past, design principles for data centers were simple: Figure

out what you have, estimate growth for 15 to 20 years, then build to suit. Newly built data

centers often opened with huge areas of white floor space, fully powered and backed by an

uninterruptible power supply (UPS), water- and air-cooled, and mostly empty. However, costs are

actually lower if enterprises adopt a pod-based approach to data center construction and

expansion. If you expect to need 9,000 square feet during the life of a data center, then design

the site to support it, but only build what is needed for five to seven years. Cutting operating

expenses, a large portion of overall IT spending for most clients, frees up money to reallocate to

other projects or investments either in IT or in the business itself.

Social Computing: Workers do not want two distinct environments to support their work – one

for their own work products (whether personal or group) and another for accessing “external”

information. Enterprises must focus on use of social software and social media in the enterprise,

and participation and integration with externally facing enterprise-sponsored and public

communities. Do not ignore the role of the social profile to bring communities together.

Security – Activity Monitoring: Traditionally, security has focused on putting up a perimeter

fence to keep others out, but it has evolved to monitoring activities and identifying patterns that

would have been missed previously. Information security professionals face the challenge of

detecting malicious activity in a constant stream of discrete events that are usually associated

with an authorized user and are generated from multiple network, system and application

sources. At the same time, security departments are facing increasing demands for ever-greater

log analysis and reporting to support audit requirements. A variety of complementary (and

sometimes overlapping) monitoring and analysis tools help enterprises better detect and

investigate suspicious activity – often with real-time alerting or transaction intervention. By

understanding the strengths and weaknesses of these tools, enterprises can better understand

how to use them to defend the enterprise and meet audit requirements.


Flash Memory: Flash memory is not new, but it is moving up to a new tier in the storage

echelon. Flash memory is a semiconductor memory device, familiar from its use in USB

memory sticks and digital camera cards. It is much faster than rotating disk, but considerably

more expensive, although the differential is shrinking. As the price declines, the technology will

enjoy more than a 100 percent compound annual growth rate during the next few years and

become strategic in many IT areas including consumer devices, entertainment equipment and

other embedded IT systems. In addition, it offers a new layer of the storage hierarchy in servers

and client computers that has key advantages including space, heat, performance and

ruggedness.

Virtualization for Availability: Virtualization has been on the list of top strategic technologies in

previous years. It is on the list this year because Gartner emphasizes new elements such as live

migration for availability that have longer-term implications. Live migration is the movement of a

running virtual machine (VM), while its operating system and other software continue to execute

as if they remained on the original physical server. This takes place by replicating the state of

physical memory between the source and destination VMs, then, at some instant in time, one

instruction finishes execution on the source machine and the next instruction begins on the

destination machine.

However, if replication of memory continues indefinitely while execution of instructions remains on the source VM, then when the source VM fails, the next instruction simply takes place on the destination machine. If the destination VM were to fail, just pick a new destination and restart the continuous migration, making very high availability possible.

The key value proposition is to displace a variety of separate mechanisms with a single “dial”

that can be set to any level of availability from baseline to fault tolerance, all using a common

mechanism and permitting the settings to be changed rapidly as needed. We could dispense

with expensive high-reliability hardware, with fail-over cluster software and perhaps even fault-

tolerant hardware, but still meet availability needs. This is key to cutting costs, lowering

complexity, and increasing agility as needs shift.
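As a way to visualize the mechanism described above, here is a minimal sketch in Python of the continuous-replication failover pattern. The class and method names (ReplicatedVM, replicate, fail_source) are hypothetical illustrations, not any hypervisor's API; real live migration copies memory pages at the hypervisor level rather than snapshotting application state.

```python
# Illustrative sketch only: availability via continuous state replication.
# Names are hypothetical; real hypervisors replicate memory pages, not Python dicts.
import copy


class ReplicatedVM:
    def __init__(self, state):
        self.source_state = state      # state on the running source VM
        self.replica_state = None      # last state copied to the destination VM
        self.running_on = "source"

    def execute(self, instruction):
        """Run one instruction on whichever copy is currently authoritative."""
        target = self.source_state if self.running_on == "source" else self.replica_state
        instruction(target)

    def replicate(self):
        """Continuously (here, on demand) copy source memory state to the destination."""
        if self.running_on == "source":
            self.replica_state = copy.deepcopy(self.source_state)

    def fail_source(self):
        """Source host dies: execution resumes on the destination replica."""
        self.running_on = "destination"


vm = ReplicatedVM({"counter": 0})
vm.execute(lambda s: s.update(counter=s["counter"] + 1))
vm.replicate()                 # destination now holds counter == 1
vm.fail_source()               # source fails; no committed state is lost
vm.execute(lambda s: s.update(counter=s["counter"] + 1))
print(vm.replica_state)        # {'counter': 2}
```

In this toy model, turning the availability "dial" amounts to choosing how aggressively replicate() is driven, from periodic checkpoints up to continuous, per-write replication.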

Mobile Applications: By year-end 2010, 1.2 billion people will carry handsets capable of rich mobile commerce, providing an environment for the convergence of mobility and the web. There

are already many thousands of applications for platforms such as the Apple iPhone, in spite of


the limited market and need for unique coding. It may take a newer version designed to flexibly

operate on both full PC and miniature systems, but if the operating system interface and

processor architecture were identical, that enabling factor would create a huge turn upwards in

mobile application availability.

Sustainability Metrics and what does it mean to be Sustainable

What are these metrics and how does one wade through technology, environmental, business

and operational requirements looking for best practices in achieving IT Sustainability? All this

will be touched on as we go through the details.

First, what does “Sustainability” mean? Generally, some define it as:

“Meeting the needs of the present without compromising the ability of future generations to meet

their own needs 2."

Or

“Then I say the earth belongs to each generation during its course, fully and in its own right. The

second generation receives it clear of the debts and encumbrances, the third of the second, and

so on. For if the first could charge it with a debt, then the earth would belong to the dead and not

to the living generation. Then, no generation can contract debts greater than may be paid during

the course of its own existence.”3

This article will take an Information Technology (IT) centric approach to this idea of

sustainability. I believe that for the IT industry, including the decision makers, implementers,

vendors, and technologists in general, a definition of IT sustainability would be:

“A pro-active approach to ensure the long-term viability and integrity of the business by

optimizing IT resource needs, reducing environmental, energy and/or social impacts, and

managing resources while not compromising profitability to the business.”

One corollary to this definition is that developing a sustainable IT model would not only not compromise profitability but, by conforming to best practices, would actually increase it!



The four pillars or focal points to achieve IT sustainability are "The Environment," "Efficiency," "Effectiveness," and "Business Practices," as shown in Figure 2 – Achieving IT Sustainability. Achieving sustainability requires a focused effort on many fronts. All of these pillars will be discussed in great detail in the following sections.

The environment is not only an important social issue but is a responsibility for all of us to

manage. From the business perspective, we can address the environmental aspects by working

with the Manufacturing and Supply chain as well as following environmental Standards and

Regulations. As an EMC employee, I can attest that EMC's internal IT, Facilities, and Operations departments have made great strides in this particular area. It is important not to minimize what each of us can do as individuals, or what the industry as a whole can do.

Considering business practices and requirements is important as well. Understanding each business's operational, market, and growth models as they relate to Sustainability is a given. To be effective, it is important to have the appropriate tools and best practices to achieve sustainable growth. Lastly, and one can argue most importantly as it relates to the IT industry, efficiency is paramount to what we can do and to how the IT industry as a whole can make a difference.

Figure 2 – Achieving IT Sustainability


The Challenges to achieve Sustainability

It is interesting to note that every IT process has some interaction with energy, given the fact

that all IT technologies are electrical or mechanical in nature. It is always difficult to decide what metrics we should use to determine how to achieve sustainable growth. For example, one

aspect is the concept of energy efficiency as it relates to business value.

Figure 4 – Efficiency - Megawatts to Infowatts to Business Value Solutions illustrates the general inefficiencies of power delivery and IT technology today. The question is: what is the efficiency of information management per watt?

The efficiency challenge is multi-dimensional. Following the energy flow from the power plant to the application, you can trace the energy loss through the full power delivery path. As shown, at the source, the power plant, upwards of 70% of the energy entering the plant is lost through generation and power delivery to the Data Center. Given that most of the power consumed in the United States comes from fossil fuels, as shown in Figure 3 - US Energy Flows (Quadrillion BTUs), below, there is a major opportunity to reduce emissions by becoming more efficient.

Figure 3 - US Energy Flows (Quadrillion BTUs)[21]

Of the power entering the data center, 50% is lost. Fans and power supply conversions add to this loss within the data center. In terms of the data center facility, there are solutions

such as “Fifth Light”, “CHP (Combined Heat and Power)”, “Flywheel”, “Liquid Cooling” and other

technologies that can make a difference.


Within the data center facility, given typical under-utilization of Server, Storage and Network

Bandwidth, as well as the challenges of inefficient and zero-value applications, the Megawatt-to-Infowatt efficiency can be less than 1%. So, for example, for every 100 watts of power delivered, only about 0.3 watts actually perform useful information work. I believe the IT industry can do better.
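As a rough, back-of-the-envelope illustration of how these losses compound, consider the following Python sketch. The individual stage efficiencies are assumptions chosen to be consistent with the approximate figures quoted above (about 30% surviving generation and delivery, 50% surviving facility power and cooling, and only a few percent surviving under-utilized equipment and low-value applications); they are not measured values.

```python
# Back-of-the-envelope compounding of the Megawatt-to-Infowatt loss chain.
# Every stage efficiency below is an assumption, not a measurement.
stage_efficiency = {
    "generation and delivery": 0.30,      # ~70% lost before the data center
    "facility power and cooling": 0.50,   # ~50% lost inside the facility
    "IT utilization": 0.10,               # under-utilized servers, storage, network
    "useful application work": 0.20,      # inefficient or zero-value applications
}

remaining = 1.0
for stage, eff in stage_efficiency.items():
    remaining *= eff
    print(f"after {stage:<28} {remaining:7.3%} of source energy remains")

watts_at_source = 100.0
print(f"Of {watts_at_source:.0f} W at the source, roughly "
      f"{watts_at_source * remaining:.1f} W do useful information work.")
```

With these assumed stage values, the chain ends at about 0.3 W of useful work per 100 W at the source, in line with the figure quoted above.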

The good news is that there are solutions. However, there is no silver bullet. There is no one

technology, business, or process that will, by itself, fully address the efficiency and sustainability goal.

Figure 4 shows some examples of what can be done at all stages in the Megawatt-to-Infowatt efficiency cycle. These include virtualization, consolidation, and network and data optimization, as well as other environmental solutions, many of which are point technologies. However, with

these point technologies rolled out in tight orchestration, I am sure IT stakeholders can and will

make a difference.

For example, concerning virtualization, this technology alone cannot relieve the burden of rising

site infrastructure expenses and it can be argued that this technology alone cannot achieve

sustainability (see Equation 4 – IT Long Term Sustainability Goal). Virtualizing four or ten servers onto a single physical host will indeed cut power consumption and free up data

center capacity. However, for data centers nearing their limits, virtualization can play a key role

in delaying the time at which an expansion or new facility must be built, but this is not the total

solution.


Figure 4 – Efficiency - Megawatts to Infowatts to Business Value Solutions (source: EDS)

The problem is that the cost of electricity and site infrastructure TCO (Total Cost of Ownership) are greatly outpacing the cost of the server itself. This holds regardless of whether a single box is running one application or is virtualized to handle multiple tasks. When electricity and

infrastructure costs greatly exceed server cost, any IT deployment decisions based on server

cost alone will result in a wildly inaccurate perception of the true total cost. Even when

virtualization frees up wasted site capacity for additional servers without spending new money

on site infrastructure, the opportunity cost (i.e. ensuring that scarce resources are used

efficiently) of deploying the capacity is the same. Data center managers can be in a position of

building expensive new capacity sooner than they need to[7].



Furthermore, virtualization is a one-time benefit. After consolidating servers so that they are all

running at full capacity, and planning future deployments so that newly purchased servers will

also be fully utilized, data center operators are still faced with the reality that each year’s

generation of servers will most likely draw more power than the previous hardware release.

After virtualization has taken some of the slack out of underutilized IT hardware, the trend in

power growth will resume.

Conversely, it is also possible that virtualization may allow each new server to be so productive

that it’s worthwhile to divert a greater fraction of the IT budget to pay the increased site

infrastructure and electricity cost, but a business can’t make that decision without considering

the true total cost.
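To see why decisions based on server price alone mislead, here is a minimal, hypothetical back-of-the-envelope sketch in Python. Every input (server price, wattage, electricity rate, amortized infrastructure cost per kW, PUE, lifetime) is an illustrative placeholder, not a figure taken from this article.

```python
# Hypothetical true-total-cost comparison: server price vs. lifetime electricity
# and amortized site infrastructure. All inputs are illustrative placeholders.
def true_total_cost(server_price, server_watts, years=4, electricity_per_kwh=0.10,
                    pue=2.0, infrastructure_per_kw=10_000):
    """Return (server, electricity, infrastructure, total) cost over the server's life."""
    kw_at_wall = server_watts / 1000 * pue                  # facility overhead via PUE
    electricity = kw_at_wall * 24 * 365 * years * electricity_per_kwh
    infrastructure = kw_at_wall * infrastructure_per_kw     # amortized power/cooling build-out
    return server_price, electricity, infrastructure, server_price + electricity + infrastructure


server, power, infra, total = true_total_cost(server_price=3000, server_watts=400)
print(f"server {server:,.0f}  electricity {power:,.0f}  infrastructure {infra:,.0f}  total {total:,.0f}")
# With these placeholder inputs, electricity plus infrastructure exceed the server
# price several times over, which is the point about server-cost-only decisions.
```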

The concept of Green

The concept of "Green" also comes to mind. Sustainability and being green go hand in hand. Going green means change, but not all green solutions are efficient or sustainable. For example, one might say, plant trees to become green. Well, it would take 6.6 billion trees to offset the CO2 generated by all of the data centers in the world (Robert McFarlane, Principal Data Center and Financial Trading Floor Consultant). Planting trees is green, but not very efficient.

Green business models must reduce carbon dioxide. Developing green business models begins by determining how a company's products, services or solutions can be produced in contexts that reduce carbon dioxide (CO2) emissions. Market standards measure reductions in units of 1 million metric tons. The Greenhouse Gas Equivalencies Calculator, discussed in the "Standards and Regulations" section starting on page 32, translates difficult-to-understand statements into more commonplace terms, such as "is equivalent to avoiding the carbon dioxide emissions of X number of cars annually." This calculator also offers an excellent example of the analytics, metrics and intelligence measures that IT, SaaS and Cloud Computing solutions must deliver across the input and output chains in business models.
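To illustrate the kind of translation the calculator performs, the short sketch below converts an annual CO2 reduction into "cars taken off the road" terms. This is a minimal illustration, not the EPA tool itself; the per-vehicle factor (roughly 4.6 metric tons of CO2 per typical passenger car per year) is an assumed value chosen for the example and should be replaced with the calculator's published factors.

```python
# Minimal sketch of a greenhouse-gas "equivalency" translation.
# Assumption: ~4.6 metric tons CO2 per average passenger vehicle per year
# (illustrative value only; substitute the EPA calculator's published factors).
TONS_CO2_PER_CAR_PER_YEAR = 4.6

def co2_to_car_equivalents(tons_co2_avoided: float) -> float:
    """Translate an annual CO2 reduction (metric tons) into an equivalent
    number of passenger cars removed from the road for a year."""
    return tons_co2_avoided / TONS_CO2_PER_CAR_PER_YEAR

if __name__ == "__main__":
    avoided = 1_000_000  # example: a program that avoids 1 million metric tons of CO2
    print(f"{avoided:,} metric tons of CO2 avoided is roughly equivalent to "
          f"taking {co2_to_car_equivalents(avoided):,.0f} cars off the road for a year")
```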

⁵ Robert McFarlane, Principal Data Center and Financial Trading Floor Consultant


IT Sustainability and how to measure it

Even though this relatively newly named paradigm, “The Cloud” has a great deal of potential to

contribute to a sustainable IT structure, it is just a part of the whole sustainability picture. In the

following sections, Cloud computing and its variants will be discussed in detail.

As mentioned, emerging green IT, SaaS and Cloud Computing solutions offer great potential to

extend, leverage and strengthen a company's business model by applying measures that can

be reported and managed. Innovative IT solutions are emerging to address sustainability

analytics and carbon reduction metrics. Third parties that work with companies as they develop

green goods and services business models are well positioned to guide IT, SaaS and Cloud

Computing companies toward developing solutions that produce green or eco-centric business

metrics.

A Green IT, SaaS and Cloud Computing list, shown in Appendix A – Green IT, SaaS, Cloud Computing Solutions starting on page 165, is being developed as solution providers report. This list shows the capability of their IT solutions to produce CO2 and sustainability measures, business analytics, market intelligence, metrics and key performance indicators that can be applied by companies to develop green goods and services.

This list of metrics will become ever more important because executing profitable green business plans will depend upon generating more and deeper levels of complex greenhouse gas and sustainability measurements and metrics.

How does one measure computing efficiency? After all, if you cannot measure it, you cannot improve it.⁶

For a server, efficiency in its most basic form is shown in Equation 1 – Computing Energy Efficiency, below. Efficiency is the effective work done per unit of energy used, which is equivalent to the rate at which the work is done divided by the power used.

⁶ Lord Kelvin


Equation 1 – Computing Energy Efficiency

\[ \text{Efficiency} = \frac{\text{Work Done}}{\text{Energy Used}} = \frac{\text{Computing Speed}}{\text{Power}} \]

Breaking it down further, as shown in Equation 2 – Computing Energy Efficiency-Detailed,

below, the efficiency is also a function of the underlying hardware, its properties and the Data

Center as a whole.

Equation 2 – Computing Energy Efficiency-Detailed

\[ \text{Efficiency} = \frac{\text{Work Done}}{\text{Energy Used in Chips}} \times \frac{\text{Energy Used in Chips}}{\text{Energy Provided to Computers}} \times \frac{\text{Energy Provided to Computers}}{\text{Energy Entering the Building}} \]

This equation shows the dependency of all of the underlying hardware at all levels of the data

center stack, from the Server, Network, and all parts of the infrastructure.

Efficiency can also be expressed using the widely adopted Power Usage Effectiveness (PUE) metric, as shown in Equation 3 – Computing Energy Efficiency-Detailed as a function of PUE, below. The equation shows the dependencies among the business, the individual components, and the data center as a whole in establishing what efficiency means. For a more detailed discussion of PUE, please refer to the EMC Proven Professional Knowledge Sharing article titled "Crossing the Great Divide in Going Green: Challenges and Best Practices in Next Generation IT Equipment," EMC Knowledge Sharing, 2008.

Equation 3 – Computing Energy Efficiency-Detailed as a function of PUE

\[ \text{Efficiency} = \text{Computing Efficiency} \times \text{Computer Efficiency} \times \text{Data Center Efficiency}, \quad \text{where Data Center Efficiency} = \frac{1}{\text{PUE}} \]
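As a concrete (and hedged) illustration of how Equations 1 through 3 compose, the sketch below multiplies the three ratios together to obtain useful work per kilowatt-hour entering the building. The function names and sample numbers are assumptions chosen purely for illustration.

```python
# Minimal sketch of the efficiency chain in Equations 1-3.
# Function names and sample numbers are illustrative assumptions, not measured data.

def computing_efficiency(work_done: float, energy_in_chips_kwh: float) -> float:
    """Useful work per unit of energy consumed by the chips (Equation 1)."""
    return work_done / energy_in_chips_kwh

def computer_efficiency(energy_in_chips_kwh: float, energy_to_computers_kwh: float) -> float:
    """Fraction of the energy delivered to the IT equipment that reaches the chips."""
    return energy_in_chips_kwh / energy_to_computers_kwh

def datacenter_efficiency(pue: float) -> float:
    """Data center efficiency expressed as 1/PUE (Equation 3)."""
    return 1.0 / pue

def overall_efficiency(work_done, energy_in_chips_kwh, energy_to_computers_kwh, pue):
    """Overall efficiency = computing x computer x data center efficiency (Equation 2)."""
    return (computing_efficiency(work_done, energy_in_chips_kwh)
            * computer_efficiency(energy_in_chips_kwh, energy_to_computers_kwh)
            * datacenter_efficiency(pue))

if __name__ == "__main__":
    # Example: 1,000,000 transactions of useful work, 400 kWh consumed by the chips,
    # 500 kWh delivered to the servers, and a facility PUE of 2.0.
    print(overall_efficiency(1_000_000, 400, 500, 2.0))  # transactions per facility kWh
```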

When you think about a data center, what do you picture? Almost any aspect could be

imagined: mechanical & electrical systems, network infrastructure, storage, compute

environments, virtualization, applications, security, cloud, grid, fabric, unified computing, open

source, etc. Then consider how these items incorporate into areas of efficiency, sustainability, or

even a total carbon footprint [11].


The Carbon Footprint

The view of a data center quickly becomes significantly more complex, leading to challenges such as explaining to company executives how efficient a data center actually is. Where does someone start measuring these kinds of complexities? Are the right technologies in place to do so? Which metrics should you use for a particular industry and data center design? Data center professionals all over the world are asking the same questions and feeling the same pressures [13].

Data centers are changing at a faster pace than at any other point in history. Yet with all this change, data center facilities and IT professionals face numerous challenges in unifying their peers to solve problems for their companies. Has virtualization changed your network architecture? What about your security protocols? What exactly does Cloud computing mean to my data center? Is cloud computing being performed in your data center already? More importantly, how do I align the different data center disciplines to understand how new technologies work together to solve data center problems?

With ever increasing densities, sleep deprived data center IT professionals still have to keep the

data center operating, while facing additional challenges relating to power efficiencies and

interdepartmental communication.

To compound the problem, ‘Green’ has become the new buzzword in almost every facet of our

lives. Data centers are no exception to green marketing and are sometimes considered easy

targets due to large, concentrated power and water consumption. New green solutions

sometimes are not so green due to limited understanding of data center complexities. They may

disrupt cost saving and efficient technologies already in use.

Corporations are trying to calculate their carbon footprint, put goals in place to reduce it, and

may face pressure to apply a new solution without understanding the entire data center picture

and what options are available. Various government bodies around the world have seen the

increase in data center power consumption and realize it is only trending up. It is only a matter

of time before regulations are put into place that will cause data center operators to comply with

new rules, possibly beyond what a data center was originally designed for. Nevertheless, we all

know that the most visible pressure is that costs are rising, potentially reducing profitability.


Figure 5 – IT Data Center Sustainability Taxonomy

The recent economic uncertainty has everyone looking for ways to cut and optimize data

centers even further. Data centers have reached the CFO's radar and are under never ending

scrutiny to cut capital investments and operating expenses. So what are data center owners and

operators supposed to do? Invent their own standards? Metrics? Framework? Which industry

standards and metrics apply to your data center and will they help you show results to your

CFO? There has to be a better way.

With the advent of 'Cloud computing' and its multi-faceted variants, understanding data center interdependencies from top to bottom is a new priority. By doing so, users can analyze potential outsourcing, as an example, to a cloud technology solution. Figure 5 – IT Data Center Sustainability Taxonomy, shown above, outlines one approach to defining the metrics and moving parts needed to build a framework for understanding the challenges and methodologies required to achieve an efficient approach to IT data center architectures.


At the bottom of the stack are the sustainability metrics. It is imperative to understand all the metrics that can be used, from useful work as outlined in Figure 4, to lifecycle management with a "design for run" mindset, to more eco-centric metrics such as fuel and site selection. Design for run is the concept that you need to consider the full life cycle of IT technology: the energy used to create the product, the power consumed during the operational period, and the eventual disposition of the IT asset.

A carbon score results from these metrics. A carbon score can be localized to more specific data center relationships as outlined in the data center stack, such as the network, servers, storage, and so on. As a reference, the data center can also be mapped into a cloud model for a more variable metric score.

The output of the data center stack could be a consistent design and terminology definition, allowing you to map into a more eco-centric approach and define a more targeted focal point for optimization, such as the environment, real estate or physical data center as outlined. Out of that could come an approach to define a score and certification that businesses and governments can utilize to track and measure sustainability levels.

Therefore, what is the sustainability bottom line? What is the long-term metric for significantly improving sustainability by restoring the economic productivity of IT?

Equation 4 – IT Long Term Sustainability Goal

\[ \Delta\,\text{Efficiency Increase} \;\ge\; \Delta\,\text{Computational Performance Increase} \]

The long-term solution is defined in Equation 4 – IT Long Term Sustainability Goal, shown

above. The goal is to have the rate of energy efficiency increase equal, or exceed, the rate of

computational performance increase.

According to the Uptime Institute, which tracks the efficiency of IT equipment relative to computational performance, server compute performance has been increasing by a factor of three every two years, for a total factor of 27 over six years (3 x 3 x 3 = 27). However, energy efficiency has only been doubling every two years, for a total factor of eight over the same period (2 x 2 x 2 = 8).⁷ This means computational performance increased by a factor of 27 between 2000 and 2006, while energy efficiency went up by only a factor of eight.

This means that while power consumption per computational unit dropped dramatically over that six-year period (by roughly 88 percent), total power consumption still rose by a factor of more than 3.4 (27 / 8 ≈ 3.4).
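The arithmetic behind these figures can be replayed in a few lines. The sketch below simply restates the Uptime Institute ratios quoted above; no new data is introduced.

```python
# Replaying the performance-versus-efficiency arithmetic quoted above (2000-2006).
performance_gain = 3 ** 3   # performance tripled every two years over six years -> 27x
efficiency_gain = 2 ** 3    # efficiency doubled every two years over six years  -> 8x

# Efficiency = performance / power, so power scales as performance / efficiency.
total_power_growth = performance_gain / efficiency_gain   # ~3.4x more total power
power_per_unit_of_work = 1 / efficiency_gain              # each unit of work needs 1/8 the power

print(f"Total power grew by a factor of about {total_power_growth:.2f}")
print(f"Power per computational unit dropped by about {(1 - power_per_unit_of_work) * 100:.0f}%")
```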

Moore's Law is a major contributor. Moore's Law describes the doubling of the number of transistors on a piece of silicon every 18 months,⁸ resulting in a power density increase within chips that causes temperatures inside and around those chips to rise dramatically. Virtually everyone involved in large-scale IT computing is now aware of the resulting temperature and cooling problems data centers are experiencing, but may not fully understand the risks as they relate to sustainability.

In addition to a common framework as outlined in Figure 5 on page 27, an ontology describing what the interdependencies are, as they relate to a sustainable IT framework and to what defines achieving sustainability, is useful. It is outlined in Figure 6 – Top Level Sustainability Ontology, on page 30, below.

As defined in the figure, the four aspects, or pillars, of achieving sustainability are "Business Practices," "Environment," "Effectiveness," and "Efficiency." These four pillars are the common theme of this article, and we will cover each of them in detail.

⁷ Uptime Institute – The Invisible Crisis in the Data Center: The Economic Meltdown of Moore's Law
⁸ Gordon Moore, the Intel cofounder, originally predicted in 1965 a doubling every 24 months; real-world progress was faster.


Figure 6 – Top Level Sustainability Ontology (see notes below)

Notes on the figure above:

• A right-pointing arrow indicates an ontology sub-diagram, drilling down on that topic

• The sub-diagram "Figure 19 – Sustainability Ontology – Infrastructure Architectures" can be found on page 84

• The sub-diagram "Figure 13 – Sustainability Ontology – Self Organizing Systems" can be found on page 52

• The sub-diagram "Figure 25 – Sustainability Ontology – Business Practices" can be found on page 128


Environment Pillar- Green Computing, Growing Sustainability

Green Computing is the efficient use of computing resources; the primary objective is to account for the triple bottom line (People, Planet, and Profit), an expanded range of values and criteria for measuring organizational and societal success. Given that computing systems existed before concern over their environmental impact, green computing has generally been implemented retroactively, although some now consider it in the development phase. It is universal in nature, because ever more sophisticated modern computer systems rely upon people, networks and hardware. Therefore, the elements of a green solution may comprise items such as end-user satisfaction, management restructuring, regulatory compliance, disposal of electronic waste, telecommuting, virtualization of server resources, energy use, thin client solutions and return on investment.

Data centers are one of the greatest environmental concerns of the IT industry. They have increased in number over time as business demands have increased, with facilities housing an increasing amount of ever more powerful equipment. As data centers run into limits related to power, cooling and space, their ever-increasing operation has created a noticeable impact on power grids. Data center efficiency has become such an important global issue that it led to the creation of the Green Grid⁹, an international non-profit organization whose mandate is to increase the energy efficiency of data centers. The prevailing approach, virtualization, has improved efficiency, but it optimizes a flawed model that does not consider the whole system, one in which resource provision is disconnected from resource consumption [4].

For example, competing vendors must host significant redundancy in their data centers to

manage usage spikes and maintain the illusion of infinite resources. So, one would argue that

as an alternative, a more systemic approach is required, where resource consumption and

provision are connected, to minimize the environmental impact and allow sustainable growth.

⁹ Crossing the Great Divide in Going Green: Challenges and Best Practices in Next Generation IT Equipment, EMC Knowledge Sharing, 2008


Standards and Regulations

Information technology has enabled significant improvements in the standards of living of much

of the developed world, and through its contributions to greater transport and energy efficiency,

improved design, reduced materials consumption and other shifts in current practices, may offer

a key to long-term sustainability.

However, the production, purchase, use and disposal of electronic products have also had a

significantly negative environmental impact. As with all products, these impacts occur at multiple

stages of a product’s life: extraction and refining of raw materials, manufacturing to turn raw

materials into finished product, product use, including energy consumption and emissions, and

end-of-life collection, transportation, and recycling/disposal. Since computers and other

electronic products have supply chains and customer bases that span the globe, these

environmental impacts are widely distributed across time and distance.

Best Practice – In the US, Consider Executive Order 13423 and energy-efficiency legislation regulations

Executive Order 13423 (E.O.), "Strengthening Federal Environmental, Energy, and Transportation Management," signed in January 2007 by then-President Bush, requires that all federal agencies set an energy-efficiency and environmental-performance example by achieving a number of sustainability goals with target deadlines. To comply with this E.O., IT solution providers will have to change their current products and offerings, or create new ones, if they intend to supply government agencies. The suppliers' commercial customers will benefit as well, fulfilling the E.O.'s ultimate goal.

While the E.O. does not establish non-compliance penalties, other agencies, a number of states, and at least one city -- New York City -- have enacted legislation to fine companies that violate these new laws. The impact on IT and data centers is that, under the Clean Air Act, any source emitting more than 250 tons of a pollutant would be forced to follow certain regulations and could be exposed to significant financial penalties.

The E.O. does not make clear who would be responsible for carbon dioxide generation, the corporate power consumer or the power generation facility. One thing is certain, though: the costs will be passed on to the business. If electric utilities are charged for carbon dioxide production, they will either pass those charges on to their customers or increase overall electric utility rates.


Best Practice – Use tools and resources to understand environmental impacts

The EPEAT (Electronic Product Environmental Assessment Tool) program, launched in 2006 with the support of the U.S. EPA (Environmental Protection Agency), helps purchasers identify environmentally preferable electronic products. EPEAT developed its environmental performance criteria through an open, consensus-based, multi-stakeholder process, supported by the U.S. EPA, that included participants from the public and private purchasing sectors, manufacturers, environmental advocates, recyclers, technology researchers and other interested parties. Bringing these varied constituencies' needs and perspectives to bear on standard development enabled the resulting system not only to address significant environmental issues, but also to fit within the existing structures and practices of the marketplace, making it easy to use and therefore widely adopted.

To summarize EPEAT’s goals:

• Provide a credible assessment of electronic products based on agreed-upon criteria

• Evaluate products based on environmental performance throughout the life cycle

• Maintain a robust verification system to maintain the credibility of product declarations

• Help to harmonize numerous international environmental requirements

• Promote continuous improvement in the design of electronic products

• Lead to reduced impact on human and environmental health

For example, EPEAT cumulative benefits in the United States reflect that 101 million EPEAT-registered products have been sold in the US since the system's debut in July 2006. The benefits of US EPEAT purchasing have increased over time and will continue to be realized throughout the life of the products. The data in Table 1 - 2006 to 2008 EPEAT US Sales Environmental Benefits, on page 34, below, shows the benefits of these sales, year to year and cumulatively.

It is important to understand the standards and regulations, so that from the business as well as

from the purchasing perspective, we can make the appropriate decisions and continue on the

path to a sustainable future.


Table 1 - 2006 to 2008 EPEAT US Sales Environmental Benefits

IT facilities and Operations

A company needs to determine the most effective method for selecting a data center site that best supports sustainability. There are a number of key factors to consider. The site selection process is key for most companies, not only because the selected site/provider will be hosting mission-critical business services, but also because the chosen site will likely house those critical systems and platforms for the foreseeable future. Since you only perform site selection activities once or twice, it is important that all relevant factors be evaluated.

Geographical factors are often overlooked in site selection activities, or at best incompletely examined. Many data centers publish information about hardware reliability or facility security, but geography, as a measure of a facility's ability to serve and sustain its clients' needs, is often neglected [8].


Best Practice – Place Data Centers in locations of lower risk of natural disasters

The prevalence of natural disasters in U.S. regions is another factor by which companies can

measure data center operations as shown in Figure 7 - U.S. Federal Emergency Management

Agency – Disaster MAP, below. Enterprises that outsource or move data center operations to

other potential sites or locations can mitigate certain risks by choosing locations in areas

deemed low risk by historical and analytical data.

Figure 7 - U.S. Federal Emergency Management Agency – Disaster MAP[22]

We can predict where earthquakes may occur by using seismic zone data and fault line

analysis. Seismic zones are determined by compiling statistics about past earthquakes,

specifically magnitude and frequency. The map below titled Figure 8 - U.S. Geological Survey

Seismological Zones (0=lowest, 4=Highest), on page 36, illustrates U.S. seismic zones as

defined by the United States Geological Survey (USGS). For the purposes of illustrating seismic

activity in the United States, the USGS divides the country into zones, numbered from 0 to 4,

indicating occurrences of observed seismic activity and assumed probabilities for future activity.


Figure 8 - U.S. Geological Survey Seismological Zones (0=lowest, 4=Highest)

According to FEMA, flooding is a common event that can occur virtually anywhere in the United

States, including arid and semi-arid regions. The agency has defined flood zones according to

varying levels of risk. Based on current data, Texas maintains the distinction as the country’s

highest-risk flood zone. Note that FEMA is still collecting information and has not yet released

statistical data regarding Hurricane Katrina’s flood impact. Despite recent events, consider the

following flood facts from FEMA collected between 1960 and 1995:

• Texas had the most flood-related deaths during the past 36 years

• Total flood-related deaths in Texas were double that of California

• California ranked second to Texas in number of flood-related deaths

• Texas had more flood-related deaths than any other state 21 out of 36 years

Like flooding, tornadoes can occur in every US state. However, some areas are more prone to tornadoes than others; most United States tornadoes develop over the vast eastern plains.


In terms of hurricane activity, most occurrences affect coastal states, particularly those located along the eastern and gulf coasts of the United States. Based on weather patterns and historical data, as shown in Figure 9 – U.S. NOAA Hurricane Activity in the United States, below, much of the eastern United States, especially the Southeast and the Gulf Coast, is significantly susceptible to yearly hurricane activity.

Figure 9 - U.S. NOAA Hurricane Activity in the United States

Best Practice – Evaluate Power GRID and Network sustainability for IT Data Centers

Other factors should also be considered when performing site-selection activities. Beyond the geographical factors above, consider the availability of key resources such as power and network connectivity; a simple weighted-scoring sketch that combines these factors follows the lists below.

Power availability should be a major factor in any site-selection process. Recent headlines (New Orleans) point to the potentially disastrous effects of deficient power infrastructures. When evaluating the power infrastructure in a given area, it is important to ascertain several key factors, including:


• Access to more than one grid – Is the provider in question connected to more than one

feed from the energy company in question?

• Power grid maturity – Does the grid(s) in question also feed a large number of residential

developments? Is there major construction occurring within the area served by that grid?

• On-site power infrastructure – Is the data center equipped to support major power

requirements, and sustain itself should the main supply of power fail?

As with power, we must also consider the availability and quality of network and carrier backbones. Key factors include:

• Fiber backbone routes and their proximity to the datacenter – Are major carrier routes

proximate to the datacenter?

• Type of fiber in proximity – For that fiber that is proximate to the data center, is it a major

fiber route, or a smaller spur off the main backbone? How much fiber is already in place,

and how much of it is ‘lit’, or ready for service? How much of it is ‘dark’?

• Carrier presence – While the presence of a fiber backbone is important, it is also

important to understand the presence of the carrier/ telecommunication provider(s) in the

area, from a business and support perspective. A carrier may have fiber in the area, but

if they have little or no presence themselves, or rely on third parties for maintenance,

service to the data center may suffer accordingly.

• Carrier type – Simply having a carrier's backbone nearby does not necessarily mean that the carrier itself is a Tier 1 provider. This is of particular interest when internet access to/from the data center is being supplied via those carriers. In any event, it is usually important to understand which internet carriers currently provide service into the data center, and whether or not they are also the telecommunications/fiber provider(s). It is not enough that one or more carriers are present in a data center. At least one of the carriers (optimally, more than one) should be a Tier 1 provider, meaning that they peer directly with other major backbones at private and public peering exchanges. Only a relatively small number of carriers can claim this level of efficiency, and all smaller carriers must purchase access from the major carriers, which often introduces some level of latency.
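As referenced above, the sketch below combines a few of these geographic and resource factors into a single comparative score for candidate sites. It is only an illustration: the factor names, scales and weights are assumptions, not an established scoring standard, and a real site-selection team would substitute its own criteria and data.

```python
# Hypothetical weighted scoring of candidate data center sites.
# Factor names, 0-to-1 scales (1 = best) and weights are illustrative assumptions.
WEIGHTS = {
    "seismic": 0.20,     # derived from USGS seismic zone (zone 0 -> 1.0, zone 4 -> 0.0)
    "flood": 0.15,       # derived from FEMA flood zone risk
    "hurricane": 0.15,   # derived from NOAA hurricane exposure
    "power_grid": 0.30,  # multiple feeds, grid maturity, on-site generation
    "network": 0.20,     # fiber proximity, carrier presence, Tier 1 availability
}

def site_score(factors: dict) -> float:
    """Weighted sum of normalized factor scores (higher is better)."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)

if __name__ == "__main__":
    candidates = {
        "Site A": {"seismic": 0.9, "flood": 0.7, "hurricane": 0.8, "power_grid": 0.6, "network": 0.9},
        "Site B": {"seismic": 0.5, "flood": 0.9, "hurricane": 0.4, "power_grid": 0.9, "network": 0.7},
    }
    for name, factors in sorted(candidates.items(), key=lambda kv: -site_score(kv[1])):
        print(f"{name}: {site_score(factors):.2f}")
```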


Effectiveness Pillar

To achieve a sustainable and efficient business model, we must have an effective set of tools, best practices, partnerships, and service-level components. The following sections discuss these in detail.

Services and Partnerships

In order to achieve sustainability, we must have actionable plans for reaching that goal, and we must understand what resources can be brought to bear on new policies and procedures. As shown in Equation 5 – What is Efficient IT, outlined below, achieving an efficient IT environment means leveraging external or internal resources to create an assessment strategy, combined with a technical knowledge base of experience and best practices, some of which are outlined in this paper.

Equation 5 – What is Efficient IT

As shown in Figure 10 - Opportunities for efficiency improvements, below, applying the efficiency model in Equation 5 allows a company to achieve a sustainable growth path for its IT infrastructure. This can be done through consolidation and virtualization, implementing a tiered server and storage infrastructure model, and utilizing common tools, key performance indicators, resource management best practices and process automation.


Figure 10 - Opportunities for efficiency improvements

Tools and Best Practices

Executive management, CIOs, and IT personnel should review their IT management tool sets to establish whether investing in automation and better processes can reduce the percentage of IT staff time spent keeping the lights on. Such investments should also increase the utilization of IT assets.

Dealing with server sprawl, improving network utilization, and controlling the growth of enterprise storage will all help businesses extend their IT budgets for hardware, maintenance, licensing, software and staff.

With the latest generation of IT management tools and current best practices, both of which will be covered in the following sections, this effort should be fully compatible with preserving and improving business agility and a sustainable growth path [3].


Efficiency Pillar

According to the EPA (U.S. Environmental Protection Agency), the energy used by the nation's servers and data centers in 2006 was about 61 billion kilowatt-hours (kWh), or 1.5 percent of total U.S. electricity consumption, for a total electricity cost of about $4.5 billion. This estimated level of electricity consumption is more than the electricity consumed by the nation's color televisions and similar to the amount of electricity consumed by approximately 5.8 million average U.S. households. Federal servers and data centers alone account for approximately 6 billion kWh (10 percent) of this electricity use, for a total electricity cost of about $450 million annually.

The energy use of the nation’s servers and data centers in 2006 is estimated to have doubled

since 2000. The power and cooling infrastructure that supports IT equipment in data centers

also uses significant energy, accounting for 50 percent of the total consumption of data centers

[5]. Among the different types of data centers, more than one-third (38 percent) of electricity use

is attributable to the nation’s largest (i.e., enterprise-class) and most rapidly growing data

centers.

Under current efficiency trends, national energy consumption by servers and data centers could

nearly double again in another five years (i.e., by 2011) to more than 100 billion kWh,

representing a $7.4 billion annual electricity cost. The peak load on the power grid from these

servers and data centers is currently estimated to be approximately 7 gigawatts (GW),

equivalent to the output of about 15 base load power plants. If current trends continue, this

demand would rise to 12 GW by 2011, which would require an additional 10 power plants.

These forecasts indicate that unless we improve energy efficiency beyond current trends, the

federal government’s electricity cost for servers and data centers could be nearly $740 million

annually by 2011, with a peak load of approximately 1.2 GW. As shown in Figure 11 - EPA

Future Energy Use Projections, shown below, according to the EPA, given historical projections,

annual energy will double every four years.
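As a rough illustration of how such projections are built, the sketch below extrapolates energy use from a baseline with a fixed doubling period. The 2006 baseline of roughly 61 billion kWh comes from the EPA figures quoted above; the five-year doubling period is inferred from the "nearly double again by 2011" projection and is an assumption for illustration, not the EPA's actual model.

```python
# Rough extrapolation of data center energy use with a fixed doubling period.
# Baseline taken from the EPA figures quoted above; doubling period assumed for illustration.
BASE_YEAR = 2006
BASE_KWH_BILLIONS = 61.0      # ~61 billion kWh in 2006
DOUBLING_PERIOD_YEARS = 5.0   # "nearly double again by 2011"

def projected_use(year: int) -> float:
    """Projected annual energy use (billion kWh) assuming exponential growth."""
    return BASE_KWH_BILLIONS * 2 ** ((year - BASE_YEAR) / DOUBLING_PERIOD_YEARS)

if __name__ == "__main__":
    for year in (2006, 2011, 2016):
        print(f"{year}: ~{projected_use(year):.0f} billion kWh")
```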


Figure 11 - EPA Future Energy Use Projections¹⁰

However, there is good news. By implementing the "State of the Art" or "Best Practices" scenario models outlined in the following sections, and shown in the diagram, it is possible to achieve energy consumption sustainability by reversing the curve. Best practices such as server and storage consolidation, power management, virtualization, and data center facility infrastructure enhancements (improved transformers, UPSs, chillers, fans, pumps, free-air and liquid cooling), all outlined in Figure 4 – Efficiency - Megawatts to Infowatts to Business Value Solutions on page 22, will get us to where we want to go.

In order to be efficient in IT and to achieve sustainability, the approach can be summarized in three aspects: we can achieve efficiency by consolidating, optimizing, and automating. In the following sections, we will discuss best practices that allow businesses to achieve a continued profitability model while achieving sustainable growth.

¹⁰ Report to Congress on Server and Data Center Energy Efficiency, Public Law 109-431, August 2007

Information Management

Best Practice – Implement integrated virtualization management into the environment

In order to manage an efficient and sustainable infrastructure that utilizes the most recent architectural changes happening in the industry today, outlined in the section titled "Infrastructure Architectures" starting on page 80, some form of management system needs to be in place. As discussed in the "Infrastructure Architectures" section, the "Private" and "Hybrid" cloud architectures, for example, allow applications, servers, networks and storage to be dynamically modified, which makes it more of a challenge to understand what is where.

The Best Practice is to implement a management solution that allows a business to:

• Monitor virtualized, high-availability clustered and load-balanced configurations and

isolate problems

• View and monitor the critical application processes running on your physical or virtual

machines

• Identify when your critical hardware (server, PC, ESX, Hyper-V, etc.) is operating in a degraded state so you can proactively use VMotion (VMware's live virtual machine migration capability) to move your critical apps and avoid service disruption

• Monitor the status of virtual machines and track their movement (VMotion or Quick

Migrate) in real time

• Isolate problems when for example, using Microsoft Cluster Services and Symantec

VERITAS Clustering

The management system needs to understand, end to end, all of the levels of virtualization and integrate common information models in order to scale suitably. EMC's Ionix family of management software implements this type of functionality.

EMC Ionix Server Manager (EISM) software understands the virtualization abstraction stack. EISM implements detailed discovery of ESX servers, virtual machines, physical hosts with VMs, and VirtualCenter instances, and supports dynamic, ongoing discovery of added, deleted and moved VMs. EISM also understands dependencies and relationships, such as dynamic (real-time) discovery of VM topology and the associations of VMs with ESX servers and physical hosts.

It is also a best practice for the management application software to support more than one virtualization platform. EISM supports both VMware (ESX) and Microsoft (Hyper-V) virtualization platforms, as well as numerous clustering and load balancing solutions.
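The kind of virtualization-aware monitoring described above can be sketched generically. The snippet below is not the EMC Ionix API; the data model, names and health flags are invented to show how a management system might track VM-to-host relationships, follow live migrations, and flag VMs running on degraded hosts.

```python
# Generic sketch of virtualization-aware monitoring; not the EMC Ionix API.
# Data model, names and health flags are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    healthy: bool = True
    vms: list = field(default_factory=list)   # names of VMs currently placed on this host

@dataclass
class Inventory:
    hosts: dict = field(default_factory=dict)

    def record_vmotion(self, vm: str, src: str, dst: str) -> None:
        """Track a VM move so the topology stays current after a live migration."""
        self.hosts[src].vms.remove(vm)
        self.hosts[dst].vms.append(vm)

    def vms_at_risk(self):
        """VMs running on hosts reporting a degraded state."""
        return [(vm, host.name) for host in self.hosts.values() if not host.healthy for vm in host.vms]

if __name__ == "__main__":
    inv = Inventory(hosts={
        "esx-01": Host("esx-01", vms=["app-db", "app-web"]),
        "esx-02": Host("esx-02", vms=["reporting"]),
    })
    inv.hosts["esx-01"].healthy = False        # hardware degradation detected
    print(inv.vms_at_risk())                   # candidates for proactive VMotion
    inv.record_vmotion("app-db", "esx-01", "esx-02")
```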


Best Practice - Having a robust Information Model

A Best Practice in the management of Data Center networks is a methodology that allows a common management platform to support a common information model, which provides the key knowledge needed to automate management applications.

The ICIM Common Information Model is a potential solution. EMC's Smarts product suite of management applications, which is built on ICIM, illustrates how such a model can be used. We use this model to represent a networked infrastructure supporting complex business networks.

An information model, underlying a management platform, provides knowledge about managed

entities that is important to management applications, such as fault, performance, configuration,

security, and accounting. This information must be shared among applications for an integrated

OSS solution.

As a Best Practice, an information model must maintain detailed data about the managed system

at multiple layers, spanning infrastructure, applications, and the business services typical in a

Data Center network. A robust information model enables solutions at every level that includes

element management, network management, service management, and business management.

Having a common information model has many benefits. Benefits include faster application

development, and stored information that is maintained in one place, providing a single coherent

view of the managed system. Applications can access the parts of the model pertinent to its

operation, with consistent views to each application.

In a Data Center management system, agents collect operational data on managed elements

(network, systems, applications, etc.) and provide this data to the management system.

Another Best Practice is that an information model should represent the whole range of managed

logical and physical entities, from network elements at any layer through attached servers and

desktops, the applications that run on them, the middleware for application interaction, the

services the applications implement, the business processes the applications support, and the

end-users and customers of business processes.


The classes used by the DMTF (Distributed Management Task Force) CIM (Common Information Model) are an excellent starting point for representing the complete range of entities. An information model must also be able to describe the behaviors of managed entities. Since events and problem behaviors play a central role in management processing, such as real-time fault and performance management, network design and capacity planning, and other functions, formalizing them within the CIM is a key enabler for management automation.

In addition, Best Practices reflect that data structures or repositories should play an important

role in supporting the semantic model. They must be flexible enough to represent the rich set of

information for each class of managed entities. They must be flexible enough to represent the

often-complex web of relationships between entities (logical and physical) within individual

layers, across layers, and across technology domains that is so typical in a Data Center.

Another important Best Practice is to automate the discovery process about entities and their

relationships within and across technology domains as much as possible. The ability to

automatically populate the information repository is a best practice.

For example, in a Data Center, auto-discovery is particularly effective in environments supporting the TCP/IP protocol suite, including SNMP and other standard protocols that enable automatic discovery of a large class of logical and physical entities and relationships across ISO OSI network layers 1 through 7.

Another Best Practice is to have a modeling language that can describe as many entities of a managed environment as possible, along with their relationships within and across technology and business domains, in a consistent fashion. A high-level modeling language can simplify the development of managed-entity models as well as reduce errors.

The ICIM Common Information Model™ and its ICIM Repository provide excellent examples of a semantics-rich common information model and an efficient information repository that meet all the requirements presented in earlier sections. ICIM is based on the industry-standard DMTF CIM, a rich model for management information across networks and distributed systems.

CIM reflects a hierarchical object-oriented paradigm with relationship capabilities, allowing the

complex entities and relationships that exist in the real world to be depicted in the schema. ICIM


enhances the rich CIM semantics by adding behavioral modeling to the description of managed

entity classes to automate event correlation and health determination. This behavioral modeling

includes the description of the following information items:

• Events or exceptional conditions – These can be asynchronous alarms, expressions over MIB variables, or any other measurable or observable event.

• Authentic problems – These are the service-affecting problems that must be fixed to maximize availability and performance.

• Symptoms of authentic problems – These are the events that can be used to recognize that the problem occurred.

By adding behavioral modeling, ICIM provides rich semantics that can support more powerful

automation than any other management system.
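To make the idea of class-plus-behavior modeling concrete, the sketch below defines a tiny managed-entity hierarchy with relationships and problem/symptom declarations. The class names, attributes and relationships are invented for illustration and do not reflect the actual ICIM or DMTF CIM schemas.

```python
# Toy illustration of an information model with behavioral declarations.
# Class names, attributes and relationships are invented; this is not the ICIM/CIM schema.
from dataclasses import dataclass, field

@dataclass
class ManagedEntity:
    name: str
    related: list = field(default_factory=list)    # e.g. "hosts", "connects to"
    problems: dict = field(default_factory=dict)   # authentic problem -> expected symptoms

@dataclass
class Switch(ManagedEntity):
    pass

@dataclass
class Server(ManagedEntity):
    pass

if __name__ == "__main__":
    sw = Switch("switch-01",
                problems={"card failure": ["port down trap", "link loss on attached servers"]})
    srv = Server("server-01",
                 problems={"nic failure": ["interface down", "application timeouts"]})
    sw.related.append(srv)   # physical connectivity relationship
    # A management application could traverse these relationships and use the
    # problem/symptom declarations to drive automated correlation.
    print(sw.problems, "->", [entity.name for entity in sw.related])
```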

Best Practices in Root Cause Analysis

Cloud networks are becoming more difficult to manage. The number and heterogeneity of

hardware and software elements in networked systems are increasing exponentially, therefore

increasing the complexity of managing these systems in a similar growth pattern. The

introduction of each new technology adds to the list of potential problems that threaten the

delivery of network-dependent services.

Fixing a problem is often easy once it has been diagnosed. The difficulty lies in locating the root

cause of the myriad events that appear on the management console of a Data Center, cloud or

any infrastructure. It has been shown that 80 to 90 percent of downtime is spent analyzing data

and events in an attempt to identify the problem that needs to be corrected.

For Data Center managers charged with optimizing the availability and performance of large

multi-domain networked systems, it is not sufficient to collect, filter, and present data to

operators. Unscheduled downtime directly affects the bottom line. The need for applications that

apply intelligent analysis to pinpoint root-cause failures and performance problems automatically

is imperative, especially in the consumer driven video and audio vertical. Only when diagnosis is

automated can self-healing networks become a reality.


Many problems threaten service delivery. They include hardware failures, software failures,

congestion, loss of redundancy, and incorrect configurations.

Best Practice – An effective root-cause analysis technique must identify all of these problems automatically

This technique must work accurately for any environment and for any topology, including

interrelated logical and physical topologies, with or without redundancy. The solution must be able to

diagnose problems in any type of object - for example, a cable, a switch card, a server, or a

database application - at any layer, no matter how large or complex the infrastructure.

Just as important, accurate root-cause analysis is required to determine the appropriate corrective action. If management software cannot automate root-cause analysis, that task falls to operators. Because of the size, complexity, and heterogeneity of today's networks, and the volume of data and alarms, manual analysis is extremely slow and prone to error.

A Best Practice is to intelligently analyze, adapt, and automate using the Codebook Correlation Technology described in the following sections. This Best Practice translates directly into major business benefits for the customer, enabling organizations to introduce new services faster, to exceed service-level goals, and to increase profitability.

Rules-based Correlation Limitations and Challenges

Typically, legacy event managers focus on gathering and displaying ever more data to users, which by itself is ineffective, while also struggling with data volume, de-duplication and filtering. Because these event managers lack intelligence, their users often resort to developing custom scripts to capture their specific rules for event processing.

Using customized rules is a development-intensive approach doomed to fail for all but the simplest scenarios in simple, static networks.

In this approach, the developer begins by identifying all of the events. Events can include

alarms, alerts, SNMP traps, threshold violations, and other sources that can occur in the

managed system. The Management Platform and user then attempt to write network-specific

rules to process each of these events as they occur.


An organization willing to invest the effort necessary to write rules faces enormous challenges.

Typically, there are hundreds and even thousands of network devices in a Data Center network.

The number of rules required for a typical network, without accounting for delay or loss of

alarms, or for resilience, can easily reach millions. The development effort necessary to write

these rules would require many person-years, even for a small network.

Changes in the network configuration can render some rules obsolete and require writing new

ones. At the point in time when their proper functioning is needed most, i.e., when network

problems are causing loss and delay, rules-based systems are the least reliable given their

constant maintenance and update cycles.

Due to the overall complexity of development, attempts to add intelligent rules to an unintelligent

event manager have not been successful in practice. In fact, rules-based systems have

consistently failed to deliver a return on the investment (ROI) associated with the huge

development effort.

Best Practice – Codebook-based correlation using CCT

A Best Practice is utilizing Codebook Correlation Technology (CCT). CCT is a mathematically founded, next-generation approach to automating the correlation required for service assurance. CCT is able to automatically analyze any type of problem in any type of physical or logical object in any complex environment. It is also able to build intelligent analysis into off-the-shelf solutions and to automatically adapt that analysis to the managed environment, even as it changes. As a result, CCT provides fast results even in the largest of Data Center networks.

CCT solutions dynamically adapt to topology changes, since the analysis logic is automatically

generated. This eliminates the high maintenance costs required by rules-based systems that

demand continual reprogramming.

CCT provides an automated, accurate real-time analysis of root-cause problems and their

effects in networked systems. Other advantages include minimal development. CCT supports

off-the-shelf solutions that embed intelligent analysis and automatically adapt to the

environment.


Any development that is required consists of developing behavior models. The amount of effort depends not on the size of the managed environment, but on the number of problems that need to be diagnosed.

Since CCT consists of a simple distance computation between events and problem signatures, CCT solutions execute quickly. In addition, CCT utilizes minimal computing resources and network bandwidth, since it monitors only the symptoms that are needed to diagnose problems. Because CCT looks for the closest match between observed events and problem signatures, it can reach the correct cause even with incomplete information.
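The core computation can be sketched in a few lines: the codebook below maps each candidate problem to its expected symptom signature, and diagnosis simply picks the problem whose signature is closest (by Hamming distance) to the observed events. The problems, symptoms and signatures are invented for illustration and are not drawn from any actual product codebook.

```python
# Minimal sketch of codebook-style correlation: choose the problem whose expected
# symptom signature is closest (Hamming distance) to the observed symptom vector.
# Problems, symptoms and signatures below are illustrative assumptions.
SYMPTOMS = ["link_down", "ping_loss", "app_timeout", "disk_errors"]

CODEBOOK = {
    # problem               link_down  ping_loss  app_timeout  disk_errors
    "switch card failure": (1,         1,         1,           0),
    "server nic failure":  (0,         1,         1,           0),
    "storage degradation": (0,         0,         1,           1),
}

def diagnose(observed: dict) -> str:
    """Return the problem whose signature best matches the observed symptoms."""
    vector = tuple(1 if observed.get(symptom) else 0 for symptom in SYMPTOMS)
    def distance(signature):
        return sum(a != b for a, b in zip(vector, signature))
    return min(CODEBOOK, key=lambda problem: distance(CODEBOOK[problem]))

if __name__ == "__main__":
    # Incomplete information: the ping_loss event was delayed or lost in transit.
    print(diagnose({"link_down": True, "app_timeout": True}))   # -> "switch card failure"
```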

Leveraging CCT as a Best Practice to automate service assurance provides substantial

business benefits as well. These include the ability to roll out a new service more quickly,

achieve greater availability and performance of business critical systems.

Since CCT automatically generates its correlation logic for each specific topology, new Data

Center Network services can be managed immediately and new customers can be added to

new or existing services quickly. By eliminating the need for development, ongoing

maintenance, and manual diagnostic techniques, CCT enables IT organizations to be proactive

and to focus their attention on strategic initiatives that increase revenues and market share.

CCT provides a future-proof foundation for managing any type of complex infrastructure. This

gives CCT users the freedom to adopt new technology, with the assurance that it can be

managed effectively, intelligently, and automatically.

Best Practice - Reduction of Downstream Suppression

Reducing reliance on downstream event suppression is another Best Practice. Some management vendors implement what they present as root-cause analysis but is actually a form of downstream event suppression, a path-based technique used to reduce the number of alarms to process when analyzing hierarchical Data Center networks. Downstream suppression works as follows.

A polling device periodically polls Data Center devices to verify that they are reachable. When a

device fails to respond, downstream suppression does the following:


• Ignores failures from devices downstream (farther away from the "poller”) from the first

device

• Selects the device closest to the “poller” that fails to respond as the "root cause"

Downstream suppression requires that the network have a simple hierarchical architecture, with

only one possible path connecting the polling device to each managed device. This is typically

unrealistic. Today’s mission-critical Data Center networks leverage redundancy to increase

resilience. Downstream suppression does not work in redundant architectures because the

relationship of one node being downstream from another is undefined; there are multiple paths

between the manager and managed devices.
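For a strictly hierarchical (tree) topology, the technique can be sketched as a walk from the poller toward each leaf that reports only the first unreachable device on each branch. The topology and reachability data below are invented; the point is simply that this logic has no answer once a device has more than one upstream path.

```python
# Sketch of downstream event suppression on a strict tree topology.
# Topology and reachability data are invented for illustration.
TOPOLOGY = {            # parent -> children, rooted at the poller
    "poller": ["core-1"],
    "core-1": ["dist-1", "dist-2"],
    "dist-1": ["access-1", "access-2"],
    "dist-2": ["access-3"],
}
UNREACHABLE = {"dist-1", "access-1", "access-2", "access-3"}

def suppressed_root_causes(node="poller"):
    """Report the first unreachable device on each branch; everything below it is suppressed."""
    causes = []
    for child in TOPOLOGY.get(node, []):
        if child in UNREACHABLE:
            causes.append(child)           # downstream failures under this child are ignored
        else:
            causes.extend(suppressed_root_causes(child))
    return causes

if __name__ == "__main__":
    print(suppressed_root_causes())   # -> ['dist-1', 'access-3']
```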

The applicability of downstream suppression to today's Data Center networks is limited. The technique applies only to simple hierarchical networks with no redundancy, and addresses only one problem: Data Center node failure. Because of these limitations, downstream suppression offers little in the way of automating problem analysis, and certainly cannot claim to offer root-cause analysis.

Self Organizing Systems

As the size and complexity of computer systems grow, system administration has become the

predominant factor of ownership cost and a main cause for reduced system dependability. All of

these factors impede an IT department from achieving an efficient and sustainable operational

model.

The research community, in conjunction with the various hardware and software vendors, has

recognized the problem and there have been several advances in this area.

All of these approaches propose some form of self-managed, self-tuned system(s) that minimize

manual administrative tasks. As a result, computers, networks and storage systems are

increasingly being designed as closed loop systems, as shown in Figure 12 – Closed Loop

System. As shown, a controller can automatically adjust certain parameters of the system based

on feedback from the system. This system can be either hardware, software or a combination.


Figure 12 – Closed Loop System (a controller adjusts the system based on measurements fed back from the system)

Servers, storage systems, networks, and backup hardware are examples of such closed loop systems, aimed at managing energy consumption and maximizing the utilization of data centers. In addition, self-organizing systems can also meet performance goals in file servers, Internet services, databases and storage tiering through virtualization, using VMware and other virtualization product offerings. As shown in Figure 13 – Sustainability Ontology – Self Organizing Systems, on page 52, this methodology can be used in many additional scenarios, ranging from storage, application, and server consolidation to the various levels of virtualization that can be achieved through self-organizing system theory and applications.


Figure 13 – Sustainability Ontology – Self Organizing Systems

It is important that the resulting closed-loop system is stable (does not oscillate) and converges

quickly to the desired end state when applying dynamic control.

In order to achieve a more sustainable solution, a more rigorous approach is needed for

designing dynamically controlled systems. In particular, it is a best practice to use the tried-and-true approach of control theory, because it results in systems that can be shown to work beyond the narrow range of a particular experimental evaluation.


Computer and infrastructure system designers can take advantage of decades of experience in

the field and can apply well-understood and often automated methodologies for controller

design. Many computer management problems can be formulated, so that standard controllers

or systems are applied to solve them. Therefore, a best practice is that the systems community

should stick with systems design; in this case, systems that are acquiescent to dynamic

feedback-based control. This provides the necessary tunable system parameters and exports

the appropriate feedback metrics, so that a controller (hardware or software) can be applied

without destabilizing the system, while ensuring fast convergence to the desired goals.

Traditionally, control theory and/or feedback systems have been concerned with environments

that are governed by laws of physics (e.g., mechanical devices), which as a result allows designers to make assertions about the existence or non-existence of certain properties. This is not

necessarily the case with software systems. Checking whether a system is controllable or, even

more, building controllable systems is a challenging task often involving non-intuitive analysis

and system modifications.

As a first step, we propose a set of necessary and sufficient properties that any system must abide by to be controllable by a standard adaptive controller that needs little or no tuning for the specific system. This is the goal leading to the achievement of sustainability. These properties are derived from the theoretical foundations of a well-known family of adaptive controllers. From a control or feedback systems perspective, consider two specific and quite different IT management problems:

1) Enforcing soft performance goals in networked or storage services by dynamically adjusting the shares of competing workloads

2) Controlling the number of blades, storage subsystems, network nodes, etc., assigned to a workload to meet performance goals within power budgets

Best Practice - Dynamic Control in a Self Organized System

Many computer management problems are defined as online optimization problems. The

objective is to have a number of measurements obtained from the system converge to the desired goals by dynamically setting a number of system parameters (or actuators). The

problem is formalized as an objective function that has to be minimized.
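For concreteness, such an objective is often written as a quadratic cost that penalizes the deviation of the measurements from their goals together with abrupt actuator changes; the form and the weight matrices W and Q below are an illustrative assumption, not a formula taken from this article:

\min_{u(t)} \; J(t) = \left(y(t) - y_{ref}\right)^{T} W \left(y(t) - y_{ref}\right) + \left(u(t) - u(t-1)\right)^{T} Q \left(u(t) - u(t-1)\right)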


Existing research has shown that, in the general case, adaptive controllers are needed to trace

the varying behavior of computer systems and their changing workloads [9].

Best Practice – Utilize STRs when implementing adaptive controllers

Let us focus on one of the best-known families of adaptive controllers, Self-Tuning Regulators (STRs), which have been widely used in practice to solve on-line optimization control problems. Using this technology can help attain sustainability. The term “self-tuning” comes

from the fact that the controller parameters are automatically tuned to obtain the desired

properties of the closed-loop system. The design of closed loop systems involves many tasks

such as modeling, design of control law, implementation, and validation. STR controllers aim to

automate these tasks. Therefore, STRs can be used out-of-the-box for many practical cases.

Other types of adaptive controllers proposed in many feedback system design methodologies

require more intervention by the designer.

An STR consists of two basic modules or functions: the model estimation module and the control law module. The model estimation module estimates, on-line, a model that describes the measurements from the system as a function of a finite history of past actuator values and measurements. That model is then used by the control law module, which sets the actuator values. A best practice is to use a linear model of the following form for model estimation in the STR, as defined in Equation 6 – Linear model of a Control System, shown below:

Equation 6 – Linear model of a Control System

y(t) = \sum_{i=1}^{n} A_i \, y(t-i) + \sum_{i=0}^{n-1} B_i \, u(t - i - d_0)

where:

y(t) is a vector of the N measurements sampled at time t and

u(t) is a vector capturing the M actuator settings at time t.

Ai and Bi are the model parameters with dimensions compatible with those of vectors y(t) and

u(t).

n is the model order that captures how much history the model takes into account.

d0 is the delay between an actuation and the time the first effects of that actuation are observed.

The unknown model parameters Ai and Bi are estimated using Recursive Least-Squares (RLS)

estimation. This is a standard, computationally fast estimation technique that fits Equation 6 to a number of measurements so that the sum of squared errors between the measurements and


the model curve is minimized. Discrete-time models are assumed. One time unit in this discrete-time model corresponds to an invocation of the controller, i.e., sampling of system measurements, estimation of a model, and setting of the actuators. The relation between actuation

and observed system behavior is not always linear. For example, while throughput is a linear

function of the share of resources (e.g., CPU cycles) assigned to a workload, storage processor,

etc, the relation between latency and resource share is nonlinear as Little’s law indicates.

However, even in the case of nonlinear metrics, a linear model is often a good enough local

approximation to be utilized by a controller as the latter usually only makes small changes to

actuator settings. The advantage of linear models is that they can be estimated in computationally efficient ways, resulting in tractable control laws.
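To make the estimation step concrete, the following is a minimal sketch for the scalar case (one measurement, one actuator, n = 1, d0 = 1) using a standard recursive least-squares update with a forgetting factor; the simulated "true" system and the numerical constants are illustrative assumptions, not part of this article:

import numpy as np

# Scalar STR model-estimation sketch: y(t) = a*y(t-1) + b*u(t-1)  (n = 1, d0 = 1).
# The simulated "true" system and the forgetting factor are illustrative assumptions.

theta = np.zeros(2)              # estimated parameters [a, b]
P = np.eye(2) * 1000.0           # covariance of the estimate
lam = 0.98                       # forgetting factor, tracks slow drift

y_prev, u_prev = 0.0, 0.0
true_a, true_b = 0.7, 0.4        # unknown system used only to generate data

for t in range(200):
    u = np.sin(0.1 * t)                          # exciting actuator signal
    y = true_a * y_prev + true_b * u_prev        # "measured" system output

    phi = np.array([y_prev, u_prev])             # regressor of past values
    err = y - phi @ theta                        # prediction error
    gain = P @ phi / (lam + phi @ P @ phi)       # RLS gain
    theta = theta + gain * err                   # update parameter estimates
    P = (P - np.outer(gain, phi) @ P) / lam      # update covariance

    y_prev, u_prev = y, u

print(theta)    # converges toward [0.7, 0.4]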

The control law is essentially a function that, based on the estimated system model defined in

Equation 6 at time t, decides what the actuator values u(t) should be to minimize the objective

function. In other words, the STR derives u(t) from a closed-form expression as a function of

previous actuations, previous measurements and the estimated system measurements y(t).

From a systems perspective, the important point is that these computationally efficient

calculations can be performed on-line. The STR requires little system-specific tuning as it uses

a dynamically estimated model of the system and the control law automatically adapts to system

and workload dynamics. For this process to apply and for the resulting closed-loop system to be

stable and to have predictable convergence time, control theory has come up with a list of

necessary and sufficient properties that the target system must abide by.

Best Practice – Require System Centric Properties

The following paragraphs present guidelines about how one can verify whether each property is satisfied and what the challenges are in enforcing it.

Monotonic. The elements of matrix B0 in Equation 6 on page 54 must have known signs that

remain the same over time. The concept behind this property is that the real (non-estimated)

relation between any actuator and any measurement must be monotonic and of known sign.

This property usually refers to some physical law. Therefore, it is generally easy to check for it

and get the signs of B0. For example, in the long term, a process with a high fraction of CPU

cycles gets higher throughput and lower latency than one with a smaller fraction.


Accurate models. The estimated model in Equation 6 is a good enough local approximation of

the system’s behavior. As discussed, the model estimation is performed periodically. A

fundamental requirement is that the model around the current operating point of the system

captures the dynamic relation between actuators and measurements sufficiently. In practice,

this means that the estimated model must track only real system dynamics. We use the term

‘noise’ to describe deviations in the system behavior that are not captured by the model. It has

been shown that to ensure stability in linear systems where there is a known upper bound on

the noise amplitude, the model should be updated only when the model error exceeds twice the noise

bound.

There are three main sources for the previously discussed noise:

1) un-modeled system dynamics, due, for example, to contention on the network

2) a fundamentally volatile relation between certain actuators and measurements

3) quantization errors when a linear model is used to locally approximate the behavior of a nonlinear system within an operating range

Known system delay. We know the delay d0 of the feedback system, i.e., the delay, measured in sample periods, between an actuation and the time its first effects are observed.

Known system order. We know an upper bound on the order of the system. Known system delay ensures that the controller knows when to expect the first effects of its actuations, while known system order ensures that the model remembers sufficiently many prior measurements (y) to

capture the dynamics of the system. These properties are needed for the controller to observe

the effects of its actuations and then attempt to correct any error in subsequent actuations. If the

model order were less than the system order, then the controller would not remember having

ever actuated when the measurements finally are affected. The values of d0 and n are derived

experimentally. The designer is faced with a tradeoff: their values must be high enough to

capture the causal relations between actuation and measurements but not too high, so that the

STR remains computationally efficient. Note that d0 = 1 and n = 1 are ideal values.

Minimum phase. Recent actuations have a higher impact on the measurements than older actuations. A minimum phase system is one for which the effects of an actuation can be corrected or canceled by another, later actuation. It is possible to design STRs that deal with


non-minimum phase systems, but they involve experimentation and non-standard design

processes. In other words, without the minimum phase requirement, we cannot use off-the-shelf

controllers. Typically, physical systems are minimum phase. The causal effects of events in the

system fade as time passes by. Sometimes, this is not the case with computer systems. To

ensure this property, a designer must re-set any internal state that reflects older actuations.

Alternatively, the sample interval can always be increased until the system becomes minimum

phase. However, longer sampling intervals result in slower control response.

Linear independence. The elements of each of the vectors y(t) and u(t) must be linearly

independent. The quality of the estimated model is poor unless this property holds. The

predicted value for y(k) may be far from the actual measurements. The reason is that matrix

inversion in the RLS estimator may result in matrices with very large numbers, which in

combination with the limited resolution of floating point arithmetic of a CPU, result in models that

are not necessarily correct. Sometimes, simple intuition about a system may be sufficient to

ascertain if there are linear dependencies among actuators.

Zero-mean measurements and actuator values. The elements of each of the vectors y(k) and

u(k) should have a mean value close to 0. If the actuators or the measurements have a large

constant component to them, RLS tries to accurately predict this constant component and may

fail to capture the comparably small effect the values of the actuators have. If there is a large

constant component in the measurements and it is known, then you can simply subtract it from

the reported measurements. If unknown, you can easily estimate it using a moving average.

Comparable magnitudes of measurements and actuator values. The values of the elements in

y(k) and u(k) should not differ by more than one order of magnitude. If the measurement values

or the actuator values differ considerably, then RLS predicts more accurately the effects of the

higher value. You can easily solve this problem by scaling the measurements and actuators, so

that their values are comparable. This scaling factor can also be estimated using a moving

average.
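The last two properties can be enforced with simple pre-processing of the raw signals; a minimal sketch using an exponential moving average (the smoothing factor alpha is an assumption) might look like this:

# Sketch: center and scale raw measurements/actuator values before feeding the
# estimator, using exponential moving averages. The smoothing factor alpha is assumed.

class Normalizer:
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.mean = 0.0
        self.scale = 1.0

    def update(self, raw):
        # Track the (possibly unknown) constant component with a moving average
        self.mean += self.alpha * (raw - self.mean)
        # Track magnitude so measurements and actuators end up comparable
        self.scale += self.alpha * (abs(raw - self.mean) - self.scale)
        return (raw - self.mean) / max(self.scale, 1e-9)

latency_norm = Normalizer()
for raw_latency_ms in [52.0, 49.5, 61.2, 55.0, 48.7]:
    # values drift toward zero mean and comparable scale as the averages adapt
    print(latency_norm.update(raw_latency_ms))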

Application

To design a sustainable IT infrastructure, it is very important to understand the application and the resources it needs.


Best Practice – Architect a Designed for Run solution

The Designed for Run11 strategy provides enterprises a clear path for transformation and modernization while controlling costs and protecting the user experience. It considers the function and expense of the client's IT system and applications as a whole. Designed for Run ensures that the key factors that affect agility and TCO (rigidity, complexity, and resource utilization) are addressed from the start, in the planning phase.

The reason why we need to consider this methodology is that most organizations recognize the

need for IT modernization but fear the time, expense and risk involved. However, failure to

modernize will result in high costs, complex systems that are difficult to fix, outages that have a

business impact, and low resource utilization. These factors weigh heavily in a TCO equation

and compromise sustainability.

The Designed for Run approach offers these benefits:

• Reduces risk by bringing deep industry knowledge, expert planning and thorough testing

to every part of the process

• Extracts value within your existing IT, extending the life of legacy elements by integrating

them to support the business through modernization

• Lowers total cost of ownership by building maximum asset utilization into your IT system

and enabling you to anticipate operational IT expenses

The approach to a "Designed for Run" plan follows; it is defined in four phases: Plan, Build, Run, and Automate.

In the “Planning” phase, we focus on reducing risk. The objective is to reduce system

complexity, poor resource utilization, and outages. After assessing your existing IT applications

and infrastructure, we tie your IT strategy to your business strategy, determine the system's

level of maturity, and use existing industry frameworks to develop a blueprint for your system of

the future.

In the “Building” phase we focus on extracting value by designing high quality into the system,

building it for zero outages. Begin by implementing optimized processes for IT applications and then implement the elements of a modern architecture onto a modern infrastructure. This enables us to best use existing systems while gaining the speed, flexibility and innovation required in the 21st century.

11 HP trademark

In the “Run” phase, we focus on optimizing for total cost of ownership. After the transformation

is complete, there are no costly surprises for running the finished system. Rigorous, closed-loop

operational processes drive consistency and a relentless pursuit of incident prevention. You will enjoy complete visibility of your enterprise's IT systems, making it possible to avoid costly errors in budgeting and performance, thus increasing sustainability.

In the "Automate" phase, we minimize ongoing management effort by automating these processes.

Storage

Information management is at the core of achieving efficiency and sustainability. It is by now common knowledge that the digital footprint, the storage that both businesses and individuals utilize, is growing at an exponential rate. Something must be done to address this growth.

Storage has a life cycle, from creation to eventual archival or deletion. In order for any business

to achieve a sustainable growth path, having an information lifecycle management (ILM)

solution is a given. The technologies discussed in the following sections share a common approach to data lifecycle management, as shown in Figure 14 – A Sustainable Information transition lifecycle. The approach is to start with a highly usable, high-performance state, "Thick", where the data resides on a high-performance disk drive with standard provisioning. This leads to eventual "Thin" or virtual provisioning. Thin or virtual provisioning can also include performance as well as capacity tiering at a very granular level. The data can then become "Small" through data deduplication, compression and other redundancy-elimination methods.

Figure 14 – A Sustainable Information transition lifecycle


At some point in the data’s life cycle, performance may not be much of an issue and the

physical media can spin down in a smart way, going into a “Green” state to eventually be spun

up with hints on when the application may need it. Eventually, the data may never be touched again, or may be deleted, arriving at a "Gone" state. We will discuss all of these life stages.

Compression, Archiving and Data Deduplication

In order to achieve sustainability of data growth and therefore gain efficiency, it is imperative to

implement some form of technology to reduce the amount of data through its life cycle from

creation to its eventual archiving and deletion. One way is to use compression t reduce or

minimize the number of copies of data as well as the data footprint. The standard example is

when a person sends an email to twenty people with a file attachment. Each person will have a

copy of that file. If each recipient saves that file in the mail archive, there will be twenty copies.

This is obviously not a sustainable approach. Data deduplication is a technology that deals with

this un-sustainable data growth pattern. Data compression achieves efficiency and

sustainability.

Best Practice – Implement Deduplication Technology

Data deduplication is an application-specific form of data compression where redundant data is

eliminated, typically to improve storage utilization. In this process, duplicate data is deleted,

leaving only one copy. However, indexing or the ability to retrieve it from various sources of all

data is retained should it ever be required. Deduplication is able to reduce the required storage

capacity since only the unique data is stored. Backup applications generally benefit the most

from de-duplication due to the nature of repeated full backups of an existing file system or

multiple servers having similar images of an OS [10].

When implementing a data deduplication system, it is important to consider scalability to

achieve true sustainability. Performance should remain acceptable as the storage capacity grows and the deduplication granularity increases. The deduplication system should also not introduce data loss due to errors in the deduplication algorithm.


Best Practice – Implement a data-deduplication solution addressing scaling and hash collisions

It is critical that data deduplication solutions detect duplicate data elements, making the

determination that one file, block or byte is identical to another. Data deduplication products

determine this by processing every data element through a mathematical "hashing" algorithm to

create a unique identifier called a hash number. Each number is then compiled into a list,

defined as the hash index.

When the system processes new data elements, the resulting hash numbers are compared

against the hash numbers already in the index. If a new data element produces a hash number

identical to an entry already in the index, the new data is considered a duplicate, and it is not

saved to disk. A small reference "stub" that relates back to the identical data that has been

stored is put in the original place. If the new hash number is not already in the index, the data

element is considered new and stored to disk normally.

It is possible that a data element can produce an identical hash result even though the data is

not identical to the saved version. Such a false positive, also called a hash collision, can lead to

data loss. There are two ways to reduce false positives:

• Use more than one hashing algorithm on each data element. Using in or out-of-band

indexing with SHA-1 and MD5 algorithms is the best practice approach. This

dramatically reduces the potential for false positives.

• Another best practice to reduce collisions is to use a single hashing algorithm but

perform a bit-level comparison of data elements that register as identical.
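A minimal sketch of the hash-index lookup, combined with the bit-level comparison fallback described in the second bullet above, might look like the following; the SHA-1 choice matches the text, while the fixed 4 KB chunk size and the in-memory dictionaries are simplifications for illustration:

import hashlib

# Sketch of block-level deduplication with a hash index and a byte-level
# verification fallback to guard against hash collisions.

CHUNK = 4096
index = {}          # hash -> chunk id
store = {}          # chunk id -> actual bytes
next_id = 0

def write_chunk(data: bytes) -> int:
    """Return the id of the stored chunk, reusing an existing one if identical."""
    global next_id
    digest = hashlib.sha1(data).hexdigest()
    if digest in index:
        existing = index[digest]
        if store[existing] == data:          # bit-level comparison on a hash match
            return existing                  # true duplicate: keep only a reference
        # hash collision (extremely rare): fall through and store the new data
    chunk_id = next_id
    next_id += 1
    store[chunk_id] = data
    index[digest] = chunk_id
    return chunk_id

refs = [write_chunk(b"A" * CHUNK), write_chunk(b"A" * CHUNK), write_chunk(b"B" * CHUNK)]
print(refs)          # -> [0, 0, 1]: the duplicate block is stored only once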

The challenge with both approaches is that they require more processing power from the host system, whether on the source host (source deduplication) or on the target (target deduplication), which reduces index performance and slows the deduplication process. Source deduplication performs data reduction at the server, while target deduplication performs it at the target (VTL, disk, etc.). As the deduplication process becomes more granular and examines smaller chunks of data, the index becomes much larger, the probability of collisions increases, and any performance hit can be exacerbated.


Best Practice – Include and consider scaling and encryption in the deduplication process

Another issue is the relationship between deduplication, more traditional compression and

encryption in a company's storage infrastructure. Ordinary compression removes redundancy

from files, and encryption "scrambles" data so that it is completely random and unreadable.

Both compression and encryption play an important role in data storage, but eliminating

redundancy in the data can impair the deduplication process. Indexing and deduplication should

be performed first if encryption or traditional compression is required along with deduplication.

Each "chunk" of data (i.e., a file, block or bits) is processed using a hash algorithm, such as

MD5 or SHA-1, generating a unique reference for each piece. The resulting hash reference is

then compared to an index of other existing hash numbers. If that hash number is already in the

index, the data does not need to be stored again. If we have a new entry, the new hash number

is added to the index and the new data is stored.

The more granular a deduplication platform is, the larger an index will become. For example,

file-based deduplication may handle an index of millions, or even tens of millions, of unique

hash numbers. Block-based deduplication will involve many more unique pieces of data, often

numbering into the billions. Such granular deduplication demands more processing power to

accommodate the larger index. This can impair performance as the index scales unless the

hardware is designed to accommodate the index properly.

In rare cases, the hash algorithm may produce the same hash number for two different chunks

of data. When such a hash collision occurs, the system fails to store the new data because it

sees that hash number already. Such a "false positive" can result in data loss.

Best Practice – Utilize multiple hash algorithms, metadata hashing, compression and data reduction

It is a best practice to implement multiple hash algorithms and to examine metadata when identifying duplicate data, thereby preventing hash collisions and other abnormalities. Data deduplication is typically used in conjunction with other forms of data

reduction, such as compression and delta differencing. In data compression technology, which

has existed for about three decades, algorithms are applied to data in order to simplify large or

repetitious parts of a file.


Delta differencing, primarily used in archiving or backup, reduces the total volume of stored

data by saving only the changes to a file since its initial backup. For example, a file set may

contain 400 GB of data, but if only 100 MB of data has changed since the previous backup, then

only that 100 MB is saved. Delta differencing is frequently used in WAN-based backups to make

the most of available bandwidth in order to minimize the backup window. For additional

information on WAN or network sustainability efficiency options, please refer to the section titled

“Network” starting on page 72.
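As a simple illustration of delta differencing, the sketch below compares per-block checksums against the previous backup and keeps only the blocks that changed; the block size and checksum choice are assumptions for illustration:

import zlib

# Sketch of delta differencing: compare per-block checksums with the previous
# backup and transfer only the blocks that changed.

BLOCK = 4096

def block_signatures(data: bytes):
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def delta(previous: bytes, current: bytes):
    old_sigs = block_signatures(previous)
    changed = []
    for i in range(0, len(current), BLOCK):
        block = current[i:i + BLOCK]
        idx = i // BLOCK
        if idx >= len(old_sigs) or zlib.crc32(block) != old_sigs[idx]:
            changed.append((i, block))       # only this data crosses the WAN
    return changed

old = b"x" * (BLOCK * 4)
new = old[:BLOCK] + b"y" * BLOCK + old[2 * BLOCK:]
print(len(delta(old, new)), "of 4 blocks changed")   # -> 1 of 4 blocks changed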

Data deduplication also has ancillary benefits. Because deduplicated or compressed data is smaller, it can be backed up faster, resulting in smaller backup windows, reduced recovery point objectives

(RPOs) and faster recovery time objectives (RTOs). Disk archive platforms are able to store

considerably more files. If tape is the ultimate backup target, smaller backups also use fewer

tapes, resulting in lower media costs and fewer tape library slots being used.

For a virtual tape library (VTL), the reduction in disk space requirements translates into longer

retention periods for backups within the VTL itself, and is therefore more sustainable. Data

transfers are accomplished sooner, freeing the network for other tasks, allowing additional data

to be transferred or reducing costs using slower, less-expensive WANs. For additional

information on WAN or network sustainability efficiency options, please refer to the section titled

“Network”, starting on page 72.

In addition to archiving, flash storage, data compression and de-duplication, we will discuss other technologies that will advance a business's ability to achieve sustainability.

Best Practice – Use self-organizing systems theory for Storage

Storage virtualization products in the market today are a good first step, but enhancements are

needed. Storage virtualization is creating efficiencies by inserting a layer of abstraction between

data and storage hardware, and that same concept can be taken further to present a layer of

abstraction between data and the method in which data is stored. RAID is actually a well-known

form of data virtualization, because the linear sequence of bytes for data is transformed to stripe

the data across the array, and includes the necessary parity bits. RAID’s data virtualization

technique was designed over 20 years ago to improve data reliability and I/O performance.

Even though it is a reliable and proven technology, we continue to need RAID technology as we

transition more from structured data to large quantities of unstructured data.


Two of EMC's most recent technologies, "Virtual LUN" and "Fully Automated Storage Tiering", also known as FAST, are examples. This technology automatically and dynamically

moves data across storage tiers, so that it is in the right place at the right time simply by pooling

storage resources, defining the policy, and applying it to an application similar to the modeling

parameters Ai and Bi in Equation 6 – Linear model of a Control System on page 54.

FAST enables applications to remain optimized by eliminating trade-offs between capacity and

performance. Automated storage tiering dynamically monitors and automatically relocates data

to increase operational efficiency and lower costs.

By utilizing a “FAST” technology that implements transparent mobility (i.e. the application does

not know that a transfer is going on) and dynamically moving applications across different

storage types, great strides can be made to sustaining the ability to manage information. As of

this writing, Flash, Fiber Channel, and SATA drive technologies are all currently supported in

EMC’s implementation of FAST.

“Dispersal” is another new technology being considered in storage. It is a natural successor for

RAID for data virtualization because it can be configured with M of N fault tolerance, which can

provide much higher levels of data reliability than RAID. Dispersal essentially packetizes the

data (N packets), and only requires a subset (M packets) to perfectly recreate the data.

There will no longer be a tight coupling between hardware and the storage of the data packets.

This is one major change for data virtualization that will occur as Dispersal replaces RAID and

will eliminate the concept of having copies of data on hardware.
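A toy illustration of the M-of-N idea (here N = 3 and M = 2, using a single XOR parity packet) is sketched below; real dispersal implementations use general erasure codes such as Reed-Solomon rather than simple XOR:

# Toy M-of-N illustration (N = 3, M = 2): split data into two halves plus an XOR
# parity packet; any two of the three packets recreate the original.

def disperse(data: bytes):
    half = (len(data) + 1) // 2
    a, b = data[:half], data[half:].ljust(half, b"\0")
    parity = bytes(x ^ y for x, y in zip(a, b))
    return {"a": a, "b": b, "p": parity, "len": len(data)}

def rebuild(packets):
    a, b, p = packets.get("a"), packets.get("b"), packets.get("p")
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))     # recover a from b and parity
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, p))     # recover b from a and parity
    return (a + b)[: packets["len"]]

pkts = disperse(b"dispersal demo payload")
pkts_lost_a = {k: v for k, v in pkts.items() if k != "a"}   # lose one packet
print(rebuild(pkts_lost_a))                                  # original data restored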

Today’s RAID systems stripe data and parity bits across disks within an array, within an

appliance. When asked, “Where is my data?” the answer is typically “On this piece of

hardware." This gives people peace of mind: something that is intangible (since the data is actually virtualized) feels tangible because it is contained within a physical device.

The shift for IT storage administrators will be from asking, "Where is my data?" (since it will be virtualized across multiple devices in multiple locations) to asking, "Is my data protected?", because the


root of the first question is the second. Once people get comfortable with actually giving up the

control of knowing exactly where their data resides, they will realize the benefits of data

virtualization.

Increased fault tolerance is the largest benefit to storing data packets across multiple hardware

nodes. RAID is structured to provide disk drive fault tolerance. When a disk drive fails, the other

disks can reconstruct the data.

Dispersal provides not only disk-drive-level fault tolerance, but also device-level fault tolerance,

and even location fault tolerance. When an entire device fails, the data can be reconstructed

from virtualized data packets on other devices, whether centrally located or across multiple

sites. Self-organizing systems can be used to manage this reconstruction of virtualized data.

Current products focus on virtualizing storage pools and access from storage hardware. This is a

good step towards avoiding a silo style storage system. However, it appears there is still too

much management burden placed on the storage administrator. The management systems do

have self-discovery, but that simply means listing the hardware nodes that have been added to

the system. The burden is still on the storage administrator to determine where and how to

deploy those nodes.

Another step must occur to simplify the management burden; the system must evolve to self

organize. Self-organizing systems are made up of small units that can determine an inherent

order collectively. Instead of a storage administrator having to determine which pools and tiers

to add storage nodes to, the nodes themselves will evolve to contain metadata and rules, and

inherently place themselves within the storage tiers, as the system requires.

An example of metadata and rules could be related to disk characteristics: SSDs (Solid State Disks) are suited for tier 1 performance scenarios. Another example would be related to GPS

location – storage nodes could know which data center they have been installed within, and

determine which storage pools they need to join. These types of self-organizing system attributes are currently in development, for example in EMC's ATMOS cloud storage offerings.


Beyond self-organizing the storage system in terms of provisioning hardware, systems will also self-organize the tiers. Storage administrators will define the requirements for

tiers (QoS, data reliability, performance), and the storage nodes will self organize underneath

them. That means that when capacity and performance nodes are added to the system, the

system will also determine which tiers need those resources.

Emergent patterns will surface once the storage nodes have metadata and rules, and the

toolset for managing the system will change. Rather than managing the physical hardware, and

individual storage pools and access, management will occur at a system level – what storage is

required in which locations based on how information is dynamically moving across storage

nodes and tiers.

Autonomic self healing systems

The concept of “Self Healing” systems is a subset of self-organizing systems (see section “Self

Organizing Systems”, starting on page 50, for more information) in that it utilizes a similar closed

loop system theory construct. The fundamental approach or rules are:

• The System must know itself

• The System must be able to reconfigure itself within its operational environment

• The System must preemptively optimize itself

• The System must detect and respond to its own faults as they develop

• The System must detect and respond to intrusions and attacks

• The System must know its context of use

• The System must live in an open world, assuming the security requirements allow it

• The System must actively shrink the gap between user/business goals and IT solutions

to achieve sustainability

Autonomic computing is really about making systems self-managing. If you think about

biological systems like the human body, they are tremendously complex and very robust. The

human body, for example, is constantly making adjustments. Your heart rate is being controlled;

your breathing rate is controlled. All of these things happen beneath the level of conscious

control. Biological systems give a good example for thinking about computer systems. When we

take a look at the attributes of biological systems, we can find attributes that we wish our


computer systems had, like self-healing, self-configuring, and self-protecting. We can begin to

build the attributes that we see in biological systems into complex computer systems. In the

end, it translates into real customer benefits because these more complex systems are easier to

administer and control, and are more sustainable.

Best Practice – Utilize Autonomic self-healing systems for Storage

As it relates to Storage, in addition to FAST, Dispersal and RAID technology to store and

protect, another relatively new technology has been implemented and is commonly referred to

as "autonomic self-healing storage." This technology promises to substantially increase

reliability of disk systems. Autonomic self-healing storage is different from RAID, redundant

array of independent nodes (RAIN), snapshots, continuous data protection (CDP) and mirroring.

RAID, RAIN, etc, are designed to restore data from a failure situation.

These technologies, however, actually provide self-healing data, not self-healing storage: they restore data when there is a storage failure and mask storage failures from the applications. RAID and RAIN do not restore the actual storage hardware.

Autonomic self-healing systems transparently restore both the data and storage from a failure. It

has been statistically shown that as HDDs proliferate, so will the number of hard disk drive failures, which can lead to lost data. Analyzing what happens when an HDD fails illustrates the

issue:

If a hard disk drive fails, the drive must be physically replaced, either manually or from an online

pool of drives. Depending on the RAID set level, the HDD's data is rebuilt on the spare. RAID

1/3/4/5/6/10/60 all rebuild the hard disk drives data, based on parity. RAID 0 cannot rebuild the

HDD's data.

The time it takes to rebuild the HDD's data depends on the hard disk drive's capacity, speed and

RAID type. A 1 TB 7,200 rpm SATA HDD with RAID 5 will take approximately 24 hours to 30

hours to rebuild the data, assuming the process is given a high priority.

If the rebuild process is given a low priority and made a background task to be completed in off

hours, the rebuild can take as long as eight days. The RAID group is subject to a higher risk of a

second disk failure or non-recoverable read error during the rebuild, which would lead to lost


data. This is because the parity rebuild must read every byte on every drive in the RAID group to rebuild the data. (Exceptions are RAID 6 and RAID 60.)

SATA drives typically have a rated non-recoverable read error rate of 1 in 10^14 bits: roughly 1 out of 100,000,000,000,000 bits read will suffer a non-recoverable read error. This means that a seven-drive RAID 5 group with 1 TB SATA drives will have approximately a 50% chance of failing during a rebuild, resulting in the loss of the data in that RAID group.
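A back-of-the-envelope calculation shows where a figure of this magnitude comes from; the result depends on how much surviving data the rebuild must read, so the 6 TB assumption below is illustrative, and slightly different assumptions move the answer between roughly 40% and 50%:

# Back-of-the-envelope: probability of hitting at least one non-recoverable read
# error (URE) while rebuilding a 7-drive RAID 5 group of 1 TB SATA drives.
# Assumes the rebuild must read the six surviving 1 TB drives end to end.

ure_rate = 1e-14                      # rated: one error per 1e14 bits read
bits_read = 6 * 1e12 * 8              # six surviving drives x 1 TB x 8 bits/byte

p_failure = 1 - (1 - ure_rate) ** bits_read
print(f"{p_failure:.0%}")             # ~38% here; reading more data per rebuild
                                      # pushes the figure toward the ~50% quoted above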

Enterprise-class drives (Fiber Channel or SAS) are rated at 1 in 10^15 bits for non-recoverable read errors, which translates into less than a 5% chance of the RAID 5 group having a failure during

a rebuild. RAID 6 eliminates the risk of data loss should a second HDD fail. You pay for that

peace of mind with decreased write performance vs. RAID 5, and an additional parity drive in

the RAID group. Eventually, the hard disk drive is sent back to the factory. Using typical

MTBFs, there will be approximately 40 HDD "service events" per year.

Best Practice – Consider Autonomic Storage solutions utilizing Standards

New Storage systems, including EMC's VMAX, tackle end-to-end autonomic self-healing and

error detection and correction, including silent data corruption (See section titled “Best Practice

– Implement Undetected data corruption technology into environment", starting on page 69).

In addition, sophisticated algorithms that attempt to "heal-in-place" failed HDDs before requiring

a RAID data rebuild are also implemented. A technology that is currently being developed is a

relatively new concept of "fail-in-place" so that in the rare circumstance when a HDD truly fails

(i.e., it is no longer usable), no service event is required to replace the hard disk drive for a

RAID data rebuild. This would add to the sustainability equation.

The T10 DIF is a relatively new standard and only applies to SCSI protocol HDDs (SAS and

Fiber Channel). However, as of this writing, there is no standard spec for end-to-end error

detection and correction for SATA hard disk drives. As a result, EMC and others have devised

proprietary solutions for SATA end-to-end error detection and correction methodologies.

The American National Standards Institute's (ANSI) T10 DIF (Data Integrity Field) specification

calls for data to be written in blocks of 520 bytes instead of the current industry standard 512

bytes. The eight additional bytes or "DIF" provide a super-checksum that is stored on disk with

the data. The DIF is checked on every read and/or write of every sector. This makes it possible


to detect and identify data corruption or errors, including misdirected, lost or torn writes. ANSI

T10 DIF provides three types of data protection:

• Logical block guard for comparing the actual data written to disk

• Logical block application tag to ensure writing to the correct logical unit (virtual LUN)

• Logical block reference tag to ensure writing to the correct virtual block
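A minimal sketch of the 520-byte sector layout described above (512 data bytes plus an 8-byte DIF holding a 2-byte guard, a 2-byte application tag, and a 4-byte reference tag) is shown below; the checksum used here is a stand-in, since the real T10 guard field is a specific CRC-16:

import struct
import zlib

# Sketch of a T10 DIF-style 520-byte sector: 512 data bytes followed by an 8-byte
# Data Integrity Field (2-byte guard, 2-byte application tag, 4-byte reference tag).
# zlib.crc32 truncated to 16 bits is a stand-in for the actual T10 CRC-16 guard.

def make_sector(data: bytes, app_tag: int, ref_tag: int) -> bytes:
    assert len(data) == 512
    guard = zlib.crc32(data) & 0xFFFF            # stand-in for the CRC-16 guard
    dif = struct.pack(">HHI", guard, app_tag, ref_tag)
    return data + dif                            # 520 bytes on disk

def check_sector(sector: bytes, expected_ref_tag: int) -> bool:
    data, dif = sector[:512], sector[512:]
    guard, app_tag, ref_tag = struct.unpack(">HHI", dif)
    return guard == (zlib.crc32(data) & 0xFFFF) and ref_tag == expected_ref_tag

sector = make_sector(b"\x00" * 512, app_tag=1, ref_tag=1000)
print(len(sector), check_sector(sector, expected_ref_tag=1000))   # 520 True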

When errors are detected, they can then be fixed by the storage system's standard correction

mechanisms. Self-healing storage solves tangible operational problems in the data center and

allows a more sustainable and efficient environment. This technology reduces service events,

costs, management, the risk of data loss, and application disruptions.

Best Practice – Implement Undetected data corruption technology into environment

Another problem with HDDs that is rarely mentioned but is quite prevalent is "silent data

corruption." Silent data corruption(s) are storage errors that go unreported and undetected by

most storage systems, resulting in corrupt data being provided to an application with no

warning, logging, error messages, or notification of any kind.

Most storage systems do not detect these errors, which occur on average with 0.6% of SATA

HDDs and 0.06% of enterprise HDDs over 17 months12. Silent data corruption occurs when

RAID does not detect data corruption errors, such as misdirected or lost writes. It can also occur

with a “torn write”, data that is partially written and merges with older data, so the data ends up

part original data and part new data. Because the hard disk drive does not recognize the errors,

the storage system is not aware of it either, so there is no attempt at a fix. See the section titled

“Autonomic self healing systems” starting on page 66, above for additional information.

Storage Media – Flash Disks

A number of techniques can be applied to reduce power consumption in a storage system and

therefore increase efficiency. Disk drive technology vendors are developing low spin disks,

which can be slowed or stopped to reduce power consumption when not in use. Another



technology is FLASH Disks. Caching techniques that reduce disk accesses and the use of 2.5-

inch rather than 3.5-inch formats can reduce voltage requirements from 12 volts to 6 volts. An

industry-wide move toward higher-capacity Serial ATA (SATA) drives and 2.5-inch disks is

under way, which some claim will lead to better energy performance.

Best Practice – Utilize low power flash technologies

A best practice is to utilize FLASH storage as applicable to the Data Center tiered storage

requirements. Low power technologies are starting to enter the data center. With the advent of

Solid State Disks (SSDs), this enabling semiconductor technology can and will have a major

impact on power efficiencies.

For example, an SSD system can be based on double data rate (DDR) DRAM technology and

integrated with battery backup. It requires a Fiber Channel (FC) interface consistent with

conventional hard drives. This SSD technology has been available for years and has

established itself in a niche market that serves large processing-intensive government projects

and companies involved in high-volume, high-speed/low-latency transactions such as stock

trading systems.

For additional information on Flash Disk Technologies, please refer to the 2008 EMC Proven

Professional Knowledge Sharing article titled “Crossing the Great Divide in Going Green:

Challenges and Best Practices in Next Generation IT Equipment”.

Server Virtualization

With the advent of virtualization at the Host level, there are new possibilities to balance

application workloads as per the requirements of self-organized systems. VMware Infrastructure now provides two new capabilities:

1. resource pools, to simplify control over the resources of a host

2. clusters, to aggregate and manage the combined resources of multiple hosts as a single

collection

In addition, VMware now has functionality called Distributed Resource Scheduling (DRS) that dynamically allocates and balances computing capacity across the logical resource pools defined for VMware Infrastructure.


DRS continuously monitors utilization across the resource pools and intelligently allocates

available resources among virtual machines based on resource allocation rules that reflect

business needs and priorities. Virtual machines operating within a resource pool are not tied to

the particular physical server on which they are running at any given point in time. When a

virtual machine experiences increased load, DRS first evaluates its priority against the

established resource allocation rules and then, if justified, allocates additional resources by

redistributing virtual machines among the physical servers. VMotion executes the live migration

of the virtual machine to a different server with complete transparency to end users. The

dynamic resource allocation ensures that capacity is preferentially dedicated to the highest

priority applications, while at the same time maximizing overall resource utilization.
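The decision logic can be pictured with a simplified, generic sketch; this is an illustrative rebalancer, not VMware's actual DRS algorithm or API, and the thresholds and inventory are made up:

# Generic, illustrative sketch of DRS-style rebalancing: when a host is loaded well
# above the cluster average, migrate its smallest VM to the least-loaded host.
# This is not VMware's algorithm or API; thresholds and inventory are assumptions.

hosts = {
    "esx-1": {"web-1": 30, "db-1": 45},      # host -> {vm: load units}
    "esx-2": {"web-2": 20},
    "esx-3": {},
}

IMBALANCE = 15    # rebalance when a host exceeds the average by this much

def host_load(vms):
    return sum(vms.values())

def rebalance_once():
    avg = sum(host_load(v) for v in hosts.values()) / len(hosts)
    busiest = max(hosts, key=lambda h: host_load(hosts[h]))
    idlest = min(hosts, key=lambda h: host_load(hosts[h]))
    if host_load(hosts[busiest]) - avg > IMBALANCE and hosts[busiest]:
        vm = min(hosts[busiest], key=hosts[busiest].get)     # cheapest migration
        hosts[idlest][vm] = hosts[busiest].pop(vm)           # live-migrate the VM
        return f"migrated {vm}: {busiest} -> {idlest}"
    return "cluster balanced"

print(rebalance_once())    # -> migrated web-1: esx-1 -> esx-3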

Best Practice – Implement DRS

Utilizing DRS and VirtualCenter provides a view and management of all resources in the cluster,

emulating a self-organized solution. As shown in Figure 15 – Self Organized VM application

controller, below, a global scheduler within VirtualCenter enables resource allocation and

monitoring for all virtual machines running on ESX Servers that are part of the cluster.

Figure 15 – Self Organized VM application controller

DRS provides automatic initial virtual machine placement on any of the hosts in the cluster, and

also makes automatic resource relocation and optimization decisions as hosts or virtual

machines are added or removed from the cluster. DRS can also be configured for manual

control, in which case it only makes recommendations that you can review and carry out. DRS provides

several additional benefits to IT operations:


• Day-to-day IT operations are simplified as staff members are less affected by localized

events and dynamic changes in their environment. Loads on individual virtual machines

invariably change, but automatic resource optimization and relocation of virtual

machines reduces the need for administrators to respond, allowing them to focus on the

broader, higher-level tasks of managing their infrastructure.

• DRS simplifies the job of handling new applications and adding new virtual machines.

Starting up new virtual machines to run new applications becomes more of a task of

high-level resource planning and determining overall resource requirements, than

needing to reconfigure and adjust virtual machines settings on individual ESX Server

machines.

• DRS simplifies the task of extracting or removing hardware when it is no longer needed,

or replacing older host machines with newer and larger capacity hardware. To remove

hosts from a cluster, you can simply place them in maintenance mode, so that all virtual

machines currently running on those hosts are reallocated to other resources of the

cluster. After monitoring the performance of remaining systems to ensure that adequate

resources remain for currently running virtual machines, you can remove the hosts from

the cluster to allocate them to a different cluster, or remove them from the network if the

hardware resources are no longer needed. Adding new resources to the cluster is also

straightforward, as you can simply drag and drop new ESX Server hosts into a cluster.

Network

Best Practice - Architect Your Network to Be the Orchestration Engine for Automated Service Delivery (The 5 S’s)

With respect to network technologies, we must rethink how, going forward, we will build out data

communication networks. This is especially true with new data center architectures being

developed to address the needs for efficiency and sustainability. The challenge is to develop a

set of best practices with the requirement to address a cloud-ready network to automate service

delivery. Built correctly, it can become the orchestration engine for your cloud and sustainability

strategy. Cloud services need a network that embraces the five architectural goals. Design a

cloud network with these principles[12]:


Scalable: Your cloud network must scale without adding complexity or sacrificing performance. This

means scaling in lockstep with the dynamic consumption of software, storage, and application

resources without “throwing more infrastructure” at the problem.

Simplified: To achieve scale and reduce operational costs, you must simplify your network design. Fewer

moving parts, collapsed network tiers, a single operating system if possible, and fewer

interfaces ensure scalability and pave the way for automation.

Standardized: Cloud computing requires commoditized, standards-based technologies. Likewise, your cloud

network cannot be based on proprietary components that increase the capital and operation

costs associated with delivering cloud services.

Shared: Cloud networks must be built with multi-tenancy in mind. Different customers, departments, and

lines of business will consume various cloud services with their own unique requirements. A

shared network with differentiated service levels is required.

Secure: Cloud networks must embrace security on two levels:

1) controls built into the fabric that prevent wide-scale infrastructure and application

breaches and disruptions

2) overlying identity and data controls to combat regulatory, privacy, and liability concerns.

This security must be coordinated so that the network secures traffic along three key

connections: among virtual machines within the data center, between data centers, and

from clients to data centers.

It is important to note that these five principles are interrelated. Take simplicity, for example. It is

critical to simplify your cloud network infrastructure to ensure scalability as well as reduce the

number of moving parts needed to secure the end-to-end platform. However, to simplify, you

must standardize the components and build on an open network platform as well as use shared

components to get economies of scale.


Best Practice - Select the Right Cloud Network Platform

The cloud network is a platform. However, few companies and even fewer vendors ask the

question “A platform for what?” The answer is automation. Successful cloud networking requires

building a network with automation as part of its core focus. Automation makes troubleshooting,

security, provisioning, and other service delivery components less expensive and more reliable.

To bake automation into your cloud services, select a vendor that embraces automation in four

areas:

Cloud network infrastructure

Routing, switching, and network appliances are the core components of delivering high-

performance cloud services. A best practice is to find a vendor whose infrastructure provides

scalability and standardization, the building blocks for automation. In addition, the network

infrastructure must be application-aware to provide the granularity for quality-of-service and

quality-of-experience delivery requirements.

Cloud network operating system (OS)

Hardware is the core platform, but the OS is the key to automation. Look for a vendor that

provides an OS that is standardized, shared, and simplified across its entire routing, switching,

and appliance infrastructure. A good platform OS has hooks to automate delivery across as well

as an open development ecosystem for third party, cloud-specific applications to run natively on

the network.

Cloud network management systems

The infrastructure and OS are responsible for enabling automation, but the management system

is responsible for orchestrating it. A best practice is to select a vendor with a single

management system for its entire portfolio. Cloud management requires rich policy interfaces

and the ability to define differentiated services in the cloud.

Cloud network security

Security appliances are part of your platform infrastructure, but you will also need end-to-end

security automated across the cloud. A best practice is to select a vendor with baked-in security

that protects the network cloud at the macro (broadly across all infrastructure and the OS) and

micro (granularly, to protect individual sessions) levels.


Best Practice – Consider implementing layer 2 Locator/ID Separation

With advances in the ability to move applications via virtualization, such as VMotion, one of the challenges is the current IETF IP routing and addressing architecture. Using a single numbering space, the "IP address", for both host transport-session identification and network routing creates scaling and interoperability issues. This is particularly true with

new infrastructure architectures outlined in the subsequent section titled “Infrastructure

Architectures”, starting on page 80. We can realize a number of scaling benefits by separating

the current IP address into separate spaces for Endpoint Identifiers (EIDs) and Routing Locators

(RLOCs); among them are:

1. Reduction of the routing table size in the "default-free zone" (DFZ). RLOCs would be

assigned by internet providers at client network attachment points, greatly improving aggregation and reducing the number of globally visible, routable prefixes.

2. Cost-effective multi-homing for sites that connect to different service providers, including

Cloud Service Providers, so that providers can control their own policies for packet flow

into the site without using extra routing table resources of core routers.

3. Easing of renumbering burden when clients change providers. Because host EIDs are

numbered from a separate, non-provider assigned and non-topologically-bound space,

they do not need to be renumbered when a client site changes its attachment points to

the internal or external network.

4. Traffic engineering capabilities that can be performed by network elements and do not

depend on injecting additional state into the routing system.

5. Mobility without address changing. Existing mobility mechanisms will be able to work in a

locator/ID separation scenario. It will be possible for a host (or a cluster of physical or

virtual hosts) to move to a different point in the network topology (Internal, external or

hybrid Clouds) either retaining its initial address or acquiring a new address based on

the new network location. A new network location could be a physically different point in

the network topology or the same physical point of the topology with a different provider.

Currently, the IETF (Internet Engineering Task Force) is working on standards13 that will

implement this type of protocol. Cisco is a major driver in this endeavor. By decoupling end

13 IETF Locator/ID Separation Protocol (LISP)- http://tools.ietf.org/html/draft-ietf-lisp-05


point locators (addresses) from the routing information, the ability to implement dynamic network changes will allow cloud providers and consumers to be more efficient and sustainable.
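To make the EID/RLOC separation concrete, the following minimal Python sketch (a hypothetical illustration, not an implementation of the IETF LISP drafts) models a mapping table in which moving a virtual machine only updates its routing locator, while the endpoint identifier that sessions use stays the same.

# Hypothetical sketch of Locator/ID separation: endpoint identifiers (EIDs)
# stay fixed while routing locators (RLOCs) change as workloads move.

# Mapping table: EID -> RLOC (normally held by a mapping system, not a dict).
eid_to_rloc = {
    "10.1.1.5": "198.51.100.1",   # a VM's EID mapped to provider A's locator
}

def deliver(eid, payload):
    """Encapsulate traffic for an EID toward its current RLOC."""
    rloc = eid_to_rloc[eid]
    print(f"tunnel packet for EID {eid} via RLOC {rloc}: {payload}")

deliver("10.1.1.5", "session data")

# A VMotion-style move to another provider changes only the locator, so
# existing sessions keyed on the EID are unaffected.
eid_to_rloc["10.1.1.5"] = "203.0.113.7"
deliver("10.1.1.5", "session data after the move")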

Best Practice – Build a Case to Maximize Cloud Investments

Regardless of whether the plan is to consume or provide cloud services, a best practice is to

select a vendor that provides the right infrastructure, OS, management, and security. Articulate

a compelling business case by considering these business and technical recommendations.

Best Practice - Service Providers - Maximize and sustain Cloud

Investments

The recommendation is that CEOs and business executives at cloud providers should consider

optimizing revenue with a healthy mix of small, medium-size, and large companies. Smaller firms provide short sales cycles and quick cash, but enterprises provide long-term profitability.

Focus on monetizing your assets by starting with just one or two of the cloud flavors; do not

overstretch and do IaaS, PaaS, and SaaS out of the gate. Allow customers to cut through the

clutter and focus on cost by providing tiered pricing and a self-service portal where users can

immediately pay by plunking down a credit card.

With respect to CTOs and technical executives at cloud providers, demonstrate that your cloud network starts with a low-cost, fixed-price service and quickly scales capacity. Provide a

road map for how you will scale across the IaaS, PaaS, and SaaS flavors with proper network

capacity to consume all services.

Offer cloud network service-level agreements (SLAs) that tackle accessibility, reliability, and

performance. To keep the offering sustainable, consider that cloud services are standardized, but

SLAs are customized. It is important to demonstrate that the offering can tailor SLAs and

provide business-specific granularity.
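One way to support such tailoring is to express SLA tiers as structured definitions that both the provider's reporting and the customer's audits can consume. The Python sketch below is a hypothetical representation with made-up tier names and targets, not an actual provider schema.

# Hypothetical representation of tiered, business-specific SLA definitions.
sla_tiers = {
    "standard": {"availability_pct": 99.9,  "max_latency_ms": 100, "support": "business hours"},
    "premium":  {"availability_pct": 99.99, "max_latency_ms": 50,  "support": "24x7"},
}

def meets_sla(tier, measured_availability_pct, measured_latency_ms):
    """Check measured service levels against a customer's contracted tier."""
    sla = sla_tiers[tier]
    return (measured_availability_pct >= sla["availability_pct"]
            and measured_latency_ms <= sla["max_latency_ms"])

print(meets_sla("premium", 99.995, 42))   # True: within the contracted targets
print(meets_sla("standard", 99.5, 80))    # False: availability target missed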

It is also a best practice to design cloud networks with visibility and quality-of-service reports for

customers to run their own reports and audits, but also dedicate ample resources to

accommodate customers auditing your services to ensure they are SAS 70 Type II- and PCI-compliant.


Best Practice - Enterprises - maximize and sustain Cloud Investments

It is a best practice that CIOs, line-of-business managers, and other enterprise business

executives should consider focusing on cloud services as providing cost savings in the short

term and automation and flexibility as driving competitive advantages in the long term. Identify

business processes that drive revenue or customer interactions, but do not require mission

critical infrastructure. Demonstrate business value by putting them in the cloud first. Restructure

how you position SLAs with business peers. Focus on being a service provider, and drive conversations from technical SLAs to business-outcome SLAs.

A best practice for Network Architects and senior infrastructure leaders is to build an

environment with a hybrid internal/public cloud in mind. Use basic building blocks like virtualized

machines and high-performance networks to ensure that you can scale quickly. Provide

granular, real-time visibility across your cloud network. This allows service-level monitoring, cost

tracking, integration with security operations, and detailed audit logs. Build identity management

hooks into the cloud to automate user provisioning; enforce proper access management of

partners, suppliers, and customers; and appease auditors.

Best Practice – Understand Information Logistics and Energy

transposition tradeoffs

You must also consider network efficiencies in terms of creating a sustainable business or

environmental model. As will be discussed in the business practices section titled “


Economics,” starting on page 162, it will be shown that the most efficient network may not be the most sustainable network at the macro level.

As shown in Figure 16 - Energy in Electronic Integrated Circuits, below on page 78, each network switch has line cards, which transmit data packets over a digital network. Each line card contains many CMOS ASICs, and each ASIC has millions of CMOS gates.

Figure 16 - Energy in Electronic Integrated Circuits

As shown in Equation 7 – Energy Consumed by a CMOS ASIC and Equation 8 – Power

Consumed by a CMOS ASIC, both shown below, the energy consumed per bit is the sum of the per-gate switching energies plus one half of the wire capacitance multiplied by the square of the voltage.

Equation 7 – Energy Consumed by a CMOS ASIC14

Energy = \sum E_{Gate} + \tfrac{1}{2} \sum C_{Wire} V^{2}

The power consumed by the ASIC is the product of energy consumed and the data bit rate. As

you can see, as the bit rate increases, the power increases at a linear rate.

Equation 8 – Power Consumed by a CMOS ASIC15

14 IEEE ASIC Design Journal, Nov 2007
15 IEEE ASIC Design Journal, Nov 2007


Power = \left[ \sum E_{Gate} + \tfrac{1}{2} \sum C_{Wire} V^{2} \right] \times BitRate
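As a rough worked illustration of Equations 7 and 8, the short Python sketch below plugs in purely hypothetical per-gate energy, wire capacitance, voltage, and gate counts to show how ASIC power scales linearly with bit rate; the numbers are placeholders, not measured values.

# Hypothetical illustration of Equations 7 and 8 (all values are placeholders).
E_GATE = 1e-15        # switching energy per gate, joules (assumed)
C_WIRE = 1e-15        # capacitance per wire, farads (assumed)
V = 1.0               # supply voltage, volts (assumed)
NUM_GATES = 1_000_000
NUM_WIRES = 1_000_000

# Equation 7: energy per bit = sum of gate energies + 1/2 * sum of C * V^2
energy_per_bit = NUM_GATES * E_GATE + 0.5 * NUM_WIRES * C_WIRE * V**2

# Equation 8: power = energy per bit * bit rate, so power grows linearly with bit rate.
for bit_rate in (1e9, 10e9, 100e9):   # 1, 10, 100 Gb/s
    print(f"{bit_rate / 1e9:>5.0f} Gb/s -> {energy_per_bit * bit_rate:.2f} W")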

The good news is “Moore’s Law” benefits us in that the switching energy is decreasing over time

as shown in Figure 17 - Moore's Law - Switching Energy, shown on page 79. However, network

use is increasing even faster.

Figure 17 - Moore's Law - Switching Energy16

It is also interesting to note that in some situations, it is more energy efficient to utilize physical transport than the IP network. Based on the equations defining power and energy

utilization, in the example of transporting large amounts of data for backup, replication or

general data movement purposes, a physical move is more efficient. Take for example the case

shown in Figure 18 - Data by physical vs. Internet transfer, on page 80. In this case, transferring

9PB of data by physically moving it would be more efficient than transferring the data over the

internet, given an equivalent time interval. In addition, the amount of CO2 emitted (in kg) is substantially reduced.
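A back-of-the-envelope comparison along the lines of Figure 18 can be sketched in a few lines of Python; the link speed and shipping time below are assumptions chosen only to illustrate the trade-off, not figures taken from the cited study.

# Hypothetical back-of-the-envelope comparison: ship drives vs. send over the network.
DATA_TB = 9_000                   # 9 PB expressed in terabytes
LINK_GBPS = 10                    # assumed sustained network throughput, Gb/s
SHIPPING_DAYS = 2                 # assumed courier transit time for the drives

bits_total = DATA_TB * 1e12 * 8   # total number of bits to move
network_days = bits_total / (LINK_GBPS * 1e9) / 86_400

print(f"Network transfer at {LINK_GBPS} Gb/s: roughly {network_days:,.0f} days")
print(f"Physical shipment: roughly {SHIPPING_DAYS} days")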

16 Intel, 2007


Figure 18 - Data by physical vs. Internet transfer17

So, what are the best practices for a vendor or business in terms of an efficient network?

• Choose a vendor with the lowest power and smallest footprint per unit (lambda, port, and bit).

• Leverage long-haul technologies and ROADMs (reconfigurable optical add-drop multiplexers) to reduce intermediate regeneration.

• Push fiber and (passive) WDM closer to the end user and eliminate local exchanges.

• Aggregate multiple service networks onto a single optical backhaul network.

• Concentrate higher-layer routing into fewer, more efficient data centers and COs (Central Offices).

• Use service demarcation techniques to allow lower-layer switching, aggregation, and backhaul all the way to the core.

Infrastructure Architectures

In order to achieve efficiency and sustainability in the data center or wherever the IT entity is

located, it is important to understand the various architectures that are available. Depending on

17 Rod Tucker, ARC Special Research Centre for Ultra-Broadband Information Networks

(CUBIN)


the use case and business requirements, some architecture types may be a better fit. In some

use cases, implementing all or a subset may make sense. Figure 19 – Sustainability Ontology – Infrastructure Architectures, shown below on page 84, outlines the possible architectures available today.

The first is the legacy data center, which consists of a physical location with centralized IT

equipment, power, cooling and support. The others are utility computing, warehouse scale

machines and cloud computing. Cloud Computing has a few variants that will be discussed in

subsequent sections. Warehouse Scale Machines are unique in the sense that this architecture

supports specific business models. Businesses that support a few specific applications are one

thing; being able to scale to thousands of servers spanning multiple data centers across the

globe, such as Google, is another.

Datacenters are essentially very large devices that consume electrical power and produce heat. The datacenter's cooling system removes that heat, consuming additional energy in the process. It is not surprising, then, that the bulk of the

construction costs of a datacenter are proportional to the amount of power delivered and the

amount of heat to be removed. In other words, most of the money is spent either on power

conditioning and distribution or on cooling systems.

Data Center Tier Classifications

The overall design of a datacenter is often classified as belonging to “Tier I–IV.” Tier I

datacenters have a single path for power and cooling distribution, without redundant

components. Tier II adds redundant components to this design (N + 1), improving availability.

Tier III datacenters have multiple power and cooling distribution paths but only one active path.

They also have redundant components and are concurrently maintainable, that is, they provide

redundancy even during maintenance, usually with an N + 2 setup. Tier IV datacenters have

two active power and cooling distribution paths, redundant components in each path, and are

supposed to tolerate any single equipment failure without impacting the load. These tier

classifications are not 100% precise. Most commercial datacenters fall somewhere between

tiers III and IV, choosing a balance between construction costs and reliability. Real-world

datacenter reliability is also strongly influenced by the quality of the organization running the

datacenter, not just by the datacenter’s design. Typical availability estimates used in the


industry range from 99.7% availability for tier II datacenters to 99.98% and 99.995% for tiers III

and IV, respectively.
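To put these availability figures in perspective, the following short sketch converts the quoted percentages into approximate hours of downtime per year; it is simple arithmetic for illustration, not part of the tier definitions.

# Convert quoted tier availability percentages into approximate yearly downtime.
HOURS_PER_YEAR = 24 * 365

tier_availability = {"Tier II": 99.7, "Tier III": 99.98, "Tier IV": 99.995}

for tier, pct in tier_availability.items():
    downtime_hours = HOURS_PER_YEAR * (1 - pct / 100)
    print(f"{tier}: {pct}% availability is roughly {downtime_hours:.1f} hours of downtime per year")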

Datacenter sizes vary widely. Two thirds of US servers are housed in datacenters smaller than

5,000 sq ft and with less than 1 MW of critical power. Most large datacenters are built to host

servers from multiple companies (often called co-location datacenters, or “colos”) and can

support a critical load of 10–20 MW. Very few datacenters today exceed 30 MW of critical

capacity.

The data center, as we know it, is changing. Not only is the data center changing physically in

terms of power, cooling and other metrics, how a business uses the information infrastructure is

changing. As shown in Figure 19 – Sustainability Ontology – Infrastructure Architectures, on

page 84, the method and architecture of data centers are changing. What this means is that the options for the business or IT department are increasing rapidly. Not only is the standard data center architecture available, but options such as Utility Computing, Warehouse Scale Data Centers, and the various flavors of Cloud offerings are now available as well. So, how does one determine the best options, or the best mix of technologies and architectures, to meet the sustainability and business requirements? Best practices will be discussed.

“Cloud Computing” is rising quickly, with its data centers growing at an unprecedented rate.

However, this is accompanied by concerns about privacy, efficiency at the expense of

resilience, and environmental sustainability, because of the dependence on Cloud vendors such

as Google, Amazon, EMC and Microsoft. There is, however, an alternative model for the Cloud

conceptualization, providing a paradigm for Clouds in the community, utilizing networked

personal computers for liberation from the centralized vendor model.

Community Cloud Computing offers an alternative architecture, created by combining the Cloud

with paradigms from Grid Computing, principles from Digital Ecosystems, and sustainability

from Green Computing, while remaining true to the original vision of the Internet. It is more

technically challenging than Cloud Computing, dealing with distributed computing issues,

including heterogeneous nodes, varying quality of service, and additional security constraints.

However, these challenges are attainable, and with the need to retain control over our digital

lives and the potential environmental consequences, it is a challenge that should be pursued.


The recent development of Cloud Computing provides a compelling value proposition for

organizations to outsource their Information and Communications Technology (ICT)

infrastructure. However, there are growing concerns over the control ceded to large Cloud

vendors, especially the lack of information privacy. In addition, the data centers required for

Cloud Computing are growing exponentially, creating an ever-increasing carbon footprint, and

therefore raising environmental concerns. The distributed resource provision from Grid

Computing, distributed control from Digital Ecosystems, and sustainability from Green

Computing, can remedy these concerns. Therefore, Cloud Computing combined with these

approaches would provide a compelling socio-technical conceptualization for sustainable

distributed computing that utilizes the spare resources of networked personal computers to

collectively provide the facilities of a virtual data center and form a Community Cloud. This

essentially reformulates the Internet to reflect its current uses and scale, while maintaining the

original intentions for sustainability in the face of adversity. Extra capabilities would be embedded into the infrastructure, becoming as fundamental and invisible as moving packets is today.

Cloud Computing is likely to have the same impact on software that foundries have had on the

hardware industry. At one time, leading hardware companies required a captive semiconductor

fabrication facility, and companies had to be large enough to afford to build and operate it

economically. However, processing equipment doubled in price every technology generation. A

semiconductor fabrication line costs over $3B today, so only a handful of major “merchant”

companies with very high chip volumes, such as Intel and Samsung, can still justify owning and

operating their own fabrication lines. This motivated the rise of semiconductor foundries that

build chips for others, such as Taiwan Semiconductor Manufacturing Company (TSMC).

Foundries enable “fab-less” semiconductor chip companies whose value is in innovative chip

design: A company such as nVidia can now be successful in the chip business without the

capital, operational expenses, and risks associated with owning a state-of-the-art fabrication

line. Conversely, companies with fabrication lines can time-multiplex their use among the

products of many fab-less companies, to lower the risk of not having enough successful

products to amortize operational costs. Similarly, the advantages of the economy of scale and

statistical multiplexing may ultimately lead to a handful of Cloud Computing providers who can

amortize the cost of their large datacenters over the products of many “datacenter-less”

companies.


Figure 19 – Sustainability Ontology – Infrastructure Architectures

Cloud Overview

Cloud Computing is the use of Internet-based technologies for the provision of services,

originating from the cloud as a metaphor for the Internet, based on depictions in computer

network diagrams to abstract the complex infrastructure it conceals. It can also be seen as a

commercial evolution of the academic-oriented Grid Computing, succeeding where Utility Computing struggled, while making greater use of the self-management advances of Autonomic Computing, as discussed in the section titled “Autonomic self healing systems”, on page 66.

Cloud Computing offers the illusion of infinite computing resources available on demand, with

the elimination of upfront commitment from users, and payment for the use of computing

resources on a short-term basis as needed.


Furthermore, it does not require the node providing a service to be present once its service is

deployed. It is being promoted as the cutting-edge of scalable web application development, in

which dynamically scalable and often-virtualized resources are provided as a service over the

Internet, with users having no knowledge of, expertise in, or control over the technology

infrastructure of the Cloud supporting them. It currently has significant momentum in two extremes of the web development industry: the consumer web technology incumbents, who have resource surpluses in their vast data centers, and various consumers and start-ups that do not have access to such computational resources. Cloud Computing conceptually incorporates

Software-as-a-Service (SaaS), Web 2.0 and other technologies with reliance on the Internet,

providing common business applications online through web browsers to satisfy the computing

needs of users, while the software and data are stored on the servers.

The cloud has three core attributes. First, clouds are built differently than traditional IT. Rather

than dedicating specific infrastructure elements to specific applications, the cloud uses shared

pools that applications can dynamically use as needed.

This pooling has the multifaceted benefit of saving capital expenditures, since business units share the resources, and of providing better application experiences, since more of the shared resources are available when an application needs them, assuming the right QoS management infrastructure is in place.

Second, clouds are operated differently than traditional IT. Most IT management today is about

managing specific components: devices, applications, network links, etc. Managing a cloud is all

about managing service delivery. One manages outcomes, rather than individual components.

The cloud brings the concept of "automated" to a new level, an entirely different operational model biased toward low-touch and zero-touch IT operations. Please refer to

the section titled “Self Organizing Systems”, starting on page 50, for additional details.

Finally, clouds are consumed differently than traditional IT. You pay for what you use, when you

use it. It is convenient to consume. Compare that with the traditional model of having to pay for

all the physical infrastructure associated with your application, whether you are using it or not: “pay for the power I use, rather than buying a power plant ...”


Figure 20 - Cloud Topology (diagram: a central coordinator handling resource provisioning and resource consumption, surrounded by multiple clients)

Figure 20 - Cloud Topology, shown above, shows the typical configuration of Cloud Computing

at run-time when consumers visit an application served by the central Cloud, which is housed in

one or more data centers. The Cloud resources include consumption and resource provision.

The role of coordinator for resource provisioning is also included and is centrally controlled.

Even if the central node is implemented as a distributed grid, which is typical of a standard data

center, control is still centralized. Providers, who are the controllers, are usually companies with

other web activities that require large computing resources, and in their efforts to scale their

primary businesses, have gained considerable expertise and hardware. For them, Cloud

Computing is a way to resell these as a new product while expanding into a new market.

Consumers include everyday users, Small and Medium sized Enterprises (SMEs), and

ambitious start-ups whose innovation potentially threatens the incumbent providers.


Figure 21 - Cloud Computing Topology (diagram: vendors provide and deliver IaaS, PaaS, and SaaS; developers consume the lower layers and deliver services; clients/end users consume the resulting services)

Cloud Layers of Abstraction

While there is a significant buzz around Cloud Computing, there is little clarity over which

offerings qualify as typical use cases or their interrelation with the other solutions. The key to

resolving this confusion is the realization that the various offerings fall into different levels of

abstraction, as shown in Figure 21 - Cloud Computing Topology, above, and are focused at different market segments.

Infrastructure-as-a-Service (IaaS): At the most basic level of Cloud Computing offerings, there

are providers such as Amazon and Mosso who provide machine instances to developers. These

instances essentially behave like dedicated servers that are controlled by the developers, who

therefore have full responsibility for their operation. As a result, once a machine reaches its

performance limits, the developers have to manually instantiate another machine and scale their

application out to it. This service is intended for developers who can write arbitrary software on

top of the infrastructure with only small compromises in their development methodology.

Platform-as-a-Service (PaaS): One level of abstraction above, services like Google App Engine provide a programming environment that abstracts machine instances and other technical


details from developers. The programs are executed across the provider's data centers, without concerning the developers with matters of allocation. In exchange for this, the developers have to handle some

constraints that the environment imposes on their application design, for example, the use of

key-value stores instead of relational databases.

Note that “key-value stores” is defined as a distributed storage system for structured data that

focuses on scalability, at the expense of the other benefits of relational databases. Examples

include Google’s "BigTable" and Amazon’s SimpleDB.
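A minimal sketch of the key-value access pattern, using a plain Python dictionary as a stand-in for a service of this kind, shows the trade-off: lookups are by key only, and any richer query must be modeled by the application rather than expressed as SQL.

# Stand-in for a key-value store: values are written and read by key only.
orders = {}                                   # key -> opaque record

def put(key, value):
    orders[key] = value

def get(key):
    return orders.get(key)

put("order:1001", {"customer": "cust:42", "total": 99.50})
print(get("order:1001"))

# Unlike a relational database, there is no "SELECT ... WHERE total > 50" here;
# anything beyond a key lookup must be encoded in the keys or done in application code.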

Software-as-a-Service (SaaS): At the consumer-facing level are the most popular examples of

Cloud Computing, with well-defined applications offering users online resources and storage.

This differentiates SaaS from traditional websites or web applications, which do not interface

with user information (e.g. documents) or do so in a limited manner. Popular examples include

Microsoft’s (Windows Live) Hotmail, office suites such as Google Docs and Zoho, and online

business software such as Salesforce.com. We can categorize the roles of the various entities

to better understand Cloud Computing.

The vendor as resource provider has already been discussed. The application developers utilize

the resources provided, building services for the end users. This separation of roles helps define

the stakeholders and their differing interests. However, actors can take on multiple roles, with

vendors also developing services for the end users, or developers utilizing the services of others

to build their own services. Yet, within each Cloud, the role of provider, and therefore the

controller, can only be occupied by the vendor providing the Cloud.

It is also important to consider Cloud interfaces. In order to allow a business to achieve

sustainability utilizing an external or internal cloud resource, it is important to consider

developing a standard interface or API that would allow users to interoperate between various

cloud implementations as well as be able to federate each entity. This topic will be covered in

the section titled “Standards”, starting on page 135.

Cloud Computing Concerns

The Cloud Computing model is not without concerns. They include:


Failure of Monocultures:

The uptime (a measure of the time a computer system has been running) of Cloud Computing-based solutions is an advantage when compared to businesses running their own infrastructure, but we often overlook the co-occurrence of downtime in vendor-driven

monocultures. The use of globally decentralized data centers for vendor Clouds minimizes

failure, aiding its adoption. However, when a cloud fails, there is a cascade effect, crippling all

organizations dependent on that Cloud, and all those dependent upon them.

This was illustrated by the Amazon (S3) Cloud outage, which disabled several other dependent

businesses. Failures are now system-wide, instead of being partial or localized. Therefore, the efficiencies gained from centralizing infrastructure for Cloud Computing are

increasingly at the expense of the Internet’s resilience.

Convenience vs. Control

The growing popularity of Cloud Computing comes from its convenience, but also brings vendor

control, an issue of ever-increasing concern. For example, Google Apps for in-house e-mail

typically provides higher uptime, but its failure highlights the issue of lock-in that comes from

depending on vendor Clouds. The even greater concern is the loss of information privacy, with

vendors having full access to the resources stored on their Clouds. Both the British and US

governments are considering a 'G Cloud' for government business applications. In the particularly

sensitive cases of SMEs and start-ups, the provider-consumer relationship that Cloud

Computing fosters between the owners of resources and their users could potentially be

detrimental, as there is a potential conflict of interest for the providers. They profit by providing

resources to up-and-coming players, but also wish to maintain dominant positions in their

consumer facing industries.

General distrust of external service providers

As soon as you say the word "cloud", people immediately think of an ugly world where big portions of critical IT are being put in the hands of vendors they do not know and do not trust, similar to outsourcing. The difference is that private clouds are about efficiency, control, and choice, not to mention sustainability.


Choice means the business decides whether everything runs internally, externally, or in any mix it chooses. Choice also means that you will have multiple service providers competing for your

business, and it will be easy to switch between them if you need to for some reason. Switching

between providers is an issue and is addressed in the section titled “Standards,” starting on

page 135.

Concern to virtualize the majority of servers and desktop workloads

We can understand that there may be some delay between the actual capabilities of a given

technology, and the general availability of those capabilities. Virtualization is no exception.

Fortunately, a few vendors, such as EMC and Cisco, have collaborated (V-Block solutions) to develop a set of standard designs allowing businesses to have a well-known and proven solution. Seeing is believing, whether through the solution just mentioned or through one of the thousands of enterprise IT environments that are running serious workloads and getting great results from desktop virtualization as well.

Fully virtualized environments are hard to manage

This is true if you try to manage them with tools and processes designed for the physical IT

world. Indeed, virtualization efforts often stall because of IT leadership failing to recognize that

the operational model is very different (and vastly improved!) in the virtual world. To get to a

private cloud or indeed any virtualization at scale, the management and operational model will

have to be completely re-engineered. Please refer to the section titled “Information

Management,” starting on page 42, for additional details regarding management.

The upside of virtualizing is enormous: the business gains the ability to move to a more sustainable operational and business model by responding far more quickly to changing requirements, as well as by providing far higher service levels.

Many environments can't be virtualized onto x86 and hypervisors

That is often true. Many legacy applications are difficult, impractical, or not worth the effort to bring over to an Intel instruction set. The question is: should having 20 years of legacy equipment on the


data center floor stop a business from moving forward? Which part of your environment is

growing faster? We would estimate that applications that are running on x86 instruction sets are

growing faster. In three years, how much of your world will be legacy, and how much on newer

platforms?

A best practice is to cap the investment in legacy, start building new applications on the new

environment, and selectively migrate from old to new when the opportunity presents itself.

Concerns on security

As discussed in the section titled “Security,” starting on page 138, there are issues, but best practices exist to address them. It can be argued that fully virtualized environments can be made far more secure than anything in the physical world, at a lower cost and with less effort. It is

interesting to point out that trillions of dollars flow around the globe every day in the financial

cloud, a dynamic and federated environment of shared computing resources. So far, I have not

lost a dime.

Industry Standards

Unfortunately, usable industry standards usually develop at an abysmally slow pace. Even when we have them, it is often the case that everyone implements them differently, defeating the purpose. When it comes to private clouds, there are a few basic and usable standards in place (e.g., OVF, the Open Virtualization Format), with a few more coming, but it is going to take

time before we as an industry have this sorted out.

A best practice is to keep open standards in mind, but in the short term leverage specific technologies that do the job today, and keep your options open. Please refer to the

section titled “Standards,” starting on page 135 for additional information.

Application support for virtualized environments, or only the ones the vendor sells

Certain software vendors, such as Oracle, have challenges in making their licensing schemes

work in virtualized environments or may claim to have support concerns. Ironically, these same

software vendors often use virtualization to develop their products. It is unfortunate since this


obstacle can be a long-term deterrent to staying with that particular application. A best practice

is to vocalize your business infrastructure and sustainability message to your software vendors,

rather than conforming to theirs.

Environmental Impact Concerns

The ever-increasing carbon footprint from the exponential growth of the data centers required

for Cloud Computing within the IT industry is another concern. IT is expected to exceed the

airline industry by 2020 in terms of carbon footprint, raising sustainability concerns.

The industry is being motivated to address the problem by legislation, the operational limit of

power grids (being unable to power any more servers in their data centers), and the potential

financial benefits of increased efficiency. The primary solution is the use of virtualization to

maximize resource utilization, but the problem remains. While these issues are common to

Cloud Computing, they are not flaws in the Cloud concept, but rather in the vendor provisioning methods and the implementation of Clouds. There are attempts to address some of these concerns, such

as a portability layer between vendor Clouds to avoid lock-in. However, this will not alleviate

issues such as inter-Cloud latency.

An open source implementation of the Amazon (EC2) Cloud, called Eucalyptus, allows data

centers to execute code compatible with Amazon’s Cloud. This allows creation of private

internal Clouds, avoiding vendor lock-in and providing information privacy, but only for those

with their own data center, and so it is not really Cloud Computing (which, by definition, avoids owning data centers). Therefore, vendor Clouds remain synonymous with Cloud Computing.

One solution is a possible alternative model for the Cloud conceptualization, created by

combining the Cloud with paradigms from Grid Computing, principles from Digital Ecosystems,

and sustainability from Green Computing, while remaining true to the original vision of the

Internet. This option will be covered in the section titled “Community Cloud”, starting on page

101. This cloud type is a challenging solution for the enterprise, but it may be the ultimate cloud

architecture in the long term.

One incentive for cloud computing is that it may be more environmentally friendly. First,

reducing the number of hardware components needed to run applications on the company's

internal data center and replacing them with cloud computing systems reduces energy for


running and cooling hardware. By consolidating these systems in remote centers, they can be

handled more efficiently as a group.

Second, cloud computing promotes telecommuting techniques, such as remote

printing and file transfers, potentially reducing the need for office space, buying new furniture,

disposing of old furniture, having your office cleaned with chemicals and trash disposed, and so

on. They also reduce the need to drive to work and the resulting carbon dioxide emissions.

Threshold Policy Concerns

Let us suppose you have a program that does credit card validation in the cloud, and you hit the

crunch for the December buying season. Higher demand would be detected and more instances

would be created to fill that demand. As we move out of the buying crunch, the need

diminishes and the instances of that resource would be de-allocated and put to other use.

A best practice is to test if the program works. Then, develop, or improve and implement, a

threshold policy in a pilot study before moving the program to the production environment.

Check how the policy detects sudden increases in the demand and results in the creation of

additional instances to fill in the demand. Also, check to determine how unused resources are to

be de-allocated and turned over to other work.
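The kind of threshold policy described above can be prototyped before the pilot study; the sketch below is a simplified, hypothetical scaling loop in which the thresholds, instance limits, and simulated load samples are all assumptions, not values from any real policy.

# Simplified, hypothetical threshold policy for scaling instances with demand.
SCALE_UP_THRESHOLD = 0.80     # assumed: add capacity above 80% utilization
SCALE_DOWN_THRESHOLD = 0.30   # assumed: release capacity below 30% utilization
MIN_INSTANCES, MAX_INSTANCES = 1, 20

def adjust_instances(instances, utilization):
    """Return the new instance count for an observed utilization between 0.0 and 1.0."""
    if utilization > SCALE_UP_THRESHOLD and instances < MAX_INSTANCES:
        return instances + 1          # December buying crunch: scale out
    if utilization < SCALE_DOWN_THRESHOLD and instances > MIN_INSTANCES:
        return instances - 1          # demand has subsided: release resources
    return instances

instances = 2
for load in (0.85, 0.92, 0.88, 0.40, 0.20, 0.15):   # simulated demand samples
    instances = adjust_instances(instances, load)
    print(f"utilization {load:.2f} -> {instances} instance(s)")

A pilot study would replace the simulated samples with measured demand and verify that unused instances really are returned to the pool.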

Interoperability concerns

If a company outsources or creates applications with one cloud-computing vendor, the company

may find it is difficult to change to another computing vendor that has proprietary APIs and

different formats for importing and exporting data. This creates problems of achieving

interoperability of applications between these two cloud-computing vendors. You may need to

reformat data or change the logic in applications. Although industry cloud-computing standards

do not exist for APIs or data import and export, IBM and Amazon Web Services have worked

together to make interoperability happen.

Hidden Cost Concerns

Cloud computing does not tell you what the hidden costs are. For instance, companies could incur higher network charges from their service providers for storage and database applications containing terabytes of data in the cloud. These charges can outweigh the costs they would save on new infrastructure, training new personnel, or licensing new software. In another instance of incurring

network costs, companies who are far from the location of cloud providers could experience

latency, particularly when there is heavy traffic.


Best Practice - Assess cloud storage migration costs upfront

Another hidden cost is migration. Today, many cloud storage providers publish a basic cost per gigabyte of capacity. For example, at the time of this publication, the basic cost for Amazon Web Services is approximately $0.15 per GB. Pricing for Zetta starts at ~$0.25 and decreases as more

data is stored to the cloud. For archive cloud storage providers that provide additional features

like WORM (Write Once Read Many) and information lifecycle management, the basic cost is in

the realm of ~$1.00 per GB.

Therefore, a potential customer should be able to calculate how much disk storage they need

and then determine a monthly cost for storing data in the cloud. While this sounds simple, few

providers actually mention that basic storage costs are only part of the picture. The issue is

migration into the storage cloud.

All providers will charge for data transfers in and out of the cloud based on the volume of data

transferred (typical cost is $0.10 per GB). Some will also charge for metadata functions such as

directory or file attribute listings, and copying or deleting files. While these metadata operation

costs are generally minuscule on a per-operation basis (maximum of $0.01 per 1,000 for

Amazon), they can add up based on the amount of users the customer has accessing cloud

storage data.

Another piece of cloud storage pricing is how a customer actually gets to the data stored in the

cloud. Some cloud storage providers, including Autonomy Zantaz and Iron Mountain Inc.,

support private data lines that connect the customer's infrastructure to the cloud storage

infrastructure. Others, such as Zetta, estimate that the telco circuit and cross-connect fees for customers to access their data will add up to as much as 20% of their total cost per month. Whether or

not this will be an issue depends on the type of data storage and the customers’ access

patterns.

Perhaps the least well understood cost of cloud storage is the mass transfer of data in or out of

the cloud. Some providers, like Zetta, do not charge transfer fees for data migration into the

cloud. Others, such as Amazon, include a stated pricing plan for large-scale data transfers using

a portable medium, charging a time-based fee for the data load and a handling fee for the

portable device.
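A simple cost model pulling these elements together can help surface the hidden charges before committing; the Python sketch below uses the per-GB figures quoted in this section purely as illustrative defaults, so actual provider rates and any bulk-load fees must be substituted.

# Rough monthly cloud storage cost estimate; the rates below are illustrative only.
def estimate_monthly_cost(stored_gb, transfer_gb, metadata_ops,
                          storage_rate=0.15,        # $/GB stored (basic tier example)
                          transfer_rate=0.10,       # $/GB transferred in or out
                          metadata_rate=0.01/1000,  # $ per metadata operation
                          network_overhead=0.20):   # circuit/cross-connect share
    storage = stored_gb * storage_rate
    transfer = transfer_gb * transfer_rate
    metadata = metadata_ops * metadata_rate
    subtotal = storage + transfer + metadata
    return subtotal * (1 + network_overhead)

# Example: 10 TB stored, 2 TB moved, and one million metadata operations in a month.
print(f"Estimated monthly cost: ${estimate_monthly_cost(10_000, 2_000, 1_000_000):,.2f}")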

Consumers should make sure that every cloud storage request counts. Therefore, a data

migration plan is crucial, and things like virus scanners, indexing services, and backup software


should be carefully configured so as not to access the cloud storage medium as just another

network drive.

As the cloud continues to evolve, cloud storage providers who can provide the most

sophisticated cost analysis tools will be best suited to help potential customers accurately

determine costs. Yet customers must still look at all potential costs, including transfer, bulk load,

network and on-site appliances as discussed in the section titled “Best Practice – Understand

Information Logistics and Energy transposition tradeoffs”, starting on page 77 .

Unexpected behavior concerns

Let us suppose your credit card validation application works well at your company's internal

data center. It is important to test the application in the cloud with a pilot study to check for

unexpected behavior. Examples of tests include how the application validates credit cards, and

how, in the scenario of the December buying crunch, it allocates resources and releases

unused resources, turning them over to other work. If the tests show unexpected results of

credit card validation or releasing unused resources, you will need to fix the problem before

running the application in the cloud.

Security issue concerns

In February 2008, Amazon's S3 and EC2 suffered a three-hour outage. Even though an SLA

provides data recovery and service credits for this type of outage, consumers missed sales

opportunities and executives were cut off from critical business information they needed.

Instead of waiting for an outage to occur, consumers should do security testing on their own,

checking how well a vendor can recover data. The test is very simple; no tools are needed. All

you have to do is ask for old data you have stored and check how long it takes the vendor to recover it. If it takes too long, ask the vendor why, and how much service credit you would get in different scenarios. Verify that the checksums match the original data.
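Checksum comparison is easy to script; the sketch below hashes a local copy and the recovered copy with SHA-256 and flags any mismatch. The file names are hypothetical placeholders for whatever data set is used in the test.

# Compare checksums of the original data and the copy recovered from the cloud.
import hashlib

def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical file names; substitute the real original and recovered copies.
original = sha256_of("original_backup.dat")
recovered = sha256_of("recovered_from_cloud.dat")
print("checksums match" if original == recovered else "MISMATCH - investigate the recovery")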

Use a trusted algorithm to encrypt the data on your local computer, and then try to access the data on a remote server in the cloud using the decryption keys. If you cannot read the data once you have accessed it, either the decryption keys are corrupted or the vendor is using its own encryption algorithm; you may need to address the algorithm with the vendor.


Another issue is the potential for problems with data in the cloud. You may want to manage

your own private keys to protect the data. Check with the vendor on private key management.

Amazon will give you the certificate if you sign up for it.

Software development in cloud concerns

To develop software using high-end databases, the most likely choice is to use cloud server

pools at the internal corporate data center and extend resources temporarily with Amazon Web Services for testing purposes. This allows project managers to better control costs, manage

security, and allocate resources to the cloud a project is assigned to. Project managers could

also assign individual hardware resources to different cloud types: Web development cloud,

testing cloud, and production cloud. The cost associated with each cloud type may differ. The

cost per hour or usage with the development cloud is most likely lower than the production

cloud, as additional features, such as SLA and security, are allocated to the production cloud.

The managers can limit projects to certain clouds. For instance, services from portions of the

production cloud can be used for the production configuration. Services from the development

cloud can be used for development purposes only. To optimize assets at varying stages of the

project of software development, the managers can get cost-accounting data by tracking usage

by project and user. If the costs are high, managers can use Amazon EC2 to temporarily extend

resources at a very low cost, if security and data recovery issues have been resolved.

Private Cloud

First, what is a private cloud? Since this is a relatively new concept, there are many definitions.

However, the first aspect is that, unlike a public cloud, there are no presumptions as to where applications physically run. Applications can run in a data center the business owns and/or run

at a service provider's location.

Second, there is no requirement to rewrite the applications to get to a private cloud. Many public

clouds require that applications comply with a pre-defined software stack.

Private clouds are enabled by virtualization. For example, with a hypervisor, anything that runs on an Intel instruction set can be


structured into a private cloud. As a result, there is no need to rewrite applications just to get to a private cloud

model.

Finally, the private cloud model assumes that control of the private cloud firmly remains in IT's

hands, and not in the hands of some external service provider. IT controls service delivery, if they choose; and

security and compliance, if they choose (see section titled “Security”, starting on page 138 for

more details). IT controls the mix of internal and external resources, if they choose, or whether

they want an IaaS, PaaS, or SaaS model.

To summarize the definition of a private cloud, it is a fully virtualized computing environment

using a next-generation operational and security model, with a flexible consumption model, both internal and external, with IT fully in control.

Private clouds are a stepping-stone to external clouds, particularly for financial services. Many

believe that future datacenters will look like internal clouds.

Private clouds can also be designed utilizing various existing technologies federating multiple

aspects of the virtualized data center and Cloud Computing, as shown in Figure 22 - Using a

Private Cloud to Federate disparate architectures.

This creates what many would consider an internal or private cloud. With the cloud resources of

the external cloud and the virtualization resources of the internal cloud, information can, if properly designed, move securely across the pool of resources, including legacy resources in the internal cloud and public cloud resources. This architectural advantage is key in

that they are never separate resources; they are all one pool. The resources are aggregated

and federated together so that applications can act on the combined resources as a single pool

of resources, just like the single pool of resources available to us today when one uses

virtualization to join servers from multiple racks in a data center. This forms the private cloud

that enables us to get the best of both worlds. The word “Private” is used because the use and

operation of the cloud resources are completely controlled and only available to the enterprise.

This cloud resource looks and behaves just like the resources purchased in the past.


This architecture offers the advantage of achieving sustainability and efficiency; you get the best

of both worlds. You can achieve trusted, controlled reliability and security while getting the flexible, dynamic, on-demand, and sustainable efficiency of a cloud-type architecture.

Figure 22 - Using a Private Cloud to Federate disparate architectures

Let us take it a step further and examine the core principles, or best practices, that uniquely

define private cloud computing.

Best Practice – Implement a dynamic computing infrastructure

Private cloud computing requires a dynamic computing infrastructure. The foundation for the

dynamic infrastructure is a standardized, scalable, and secure physical infrastructure. There

should be levels of redundancy to ensure high levels of availability, but mostly it must be easy to

extend as usage growth demands it, without requiring architecture rework.

Next, it must be virtualized. Today, virtualized environments leverage server virtualization

(typically from Microsoft or Xen) as the basis for running services. These services need to be

easily provisioned and de-provisioned via software automation. These service workloads need

to be moved from one physical server to another as capacity demands increase or decrease.

Finally, this infrastructure should be highly utilized, whether provided by an external cloud

provider or an internal IT department. The infrastructure must deliver business value over and

above the investment.


A dynamic computing infrastructure is critical to effectively supporting the elastic nature of

service provisioning and de-provisioning as requested by users in the private cloud, while

maintaining high levels of reliability and security. The consolidation provided by virtualization,

coupled with provisioning automation, creates a high level of utilization and reuse, ultimately

yielding a very effective use of capital equipment.
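A toy rebalancing loop illustrates the idea of moving service workloads between physical hosts as capacity demands change; the host names, utilization figures, threshold, and placement rule below are purely hypothetical.

# Toy illustration of dynamic placement: shift load away from overloaded hosts.
hosts = {"host-a": 90, "host-b": 35, "host-c": 20}   # % utilization (assumed)
OVERLOAD = 80                                        # assumed threshold

def rebalance(hosts, shift=20):
    """Shift load from hosts above the threshold to the least-loaded host."""
    for name, load in sorted(hosts.items(), key=lambda kv: -kv[1]):
        if load > OVERLOAD:
            target = min(hosts, key=hosts.get)
            hosts[name] -= shift
            hosts[target] += shift
            print(f"moved roughly {shift}% of load from {name} to {target}")
    return hosts

print(rebalance(hosts))

In a real private cloud the same decision would be driven by the management system's policies rather than a fixed threshold in application code.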

Best Practice – Implement an IT Service-Centric Approach

Cloud computing is IT (or business) service-centric. This is in stark contrast to more traditional

system or “server”-centric models. In most cases, users of the private cloud generally want to

run some business service or application for a specific, timely purpose. IT administrators do not

want to be bogged down in the system and network administration of the environment. They

would prefer to quickly and easily access a dedicated instance of an application or service. By

abstracting away the server-centric view of the infrastructure, system users can easily access

powerful pre-defined computing environments designed specifically around their service.

An IT service-centric approach enables user adoption and business agility. The easier and faster a user can perform an administrative task, the more agile the business becomes, reducing costs, driving revenue, and approaching a sustainable IT model.

Best Practice – Implement a self-service based usage model

Interacting with the private cloud requires some level of user self-service. Best-of-breed self-

service provides users the ability to upload, build, deploy, schedule, manage, and report on their

business services on demand within the enterprise. A self-service private cloud offering must

provide easy-to-use, intuitive user interfaces that equip users to productively manage the

service delivery lifecycle.

The benefit of self-service from the users' perspective is a level of empowerment and

independence that yields significant business agility. One benefit often overlooked from the

internal service provider's or IT team's perspective is that the more self-service that can be

delegated to users, the less administrative involvement is necessary. This saves time and

money and allows administrative staff to focus on more strategic, high-valued responsibilities.


Best Practice – Implement a Minimally or Self-Managed Platform

An IT team or service provider must leverage a technology platform that is self-managed in

order to efficiently provide a cloud for their constituents. Best-of-breed clouds enable self-

management via software automation, leveraging the following capabilities as discussed in the

section titled “Information Management,” starting on page 42:

• A provisioning engine for deploying services and tearing them down, recovering

resources for high levels of reuse

• Mechanisms for scheduling and reserving resource capacity

• Capabilities for configuring, managing, and reporting to ensure resources can be

allocated and reallocated to multiple groups of users

• Tools for controlling access to resources and policies for how resources can be used or

operations can be performed

All of these capabilities enable business agility while simultaneously enacting critical and

necessary administrative control. This balance of control and delegation maintains security and

uptime, minimizes the level of IT administrative effort, and keeps operating expenses low,

freeing up resources to focus on higher value projects.
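As a simple illustration of this kind of automation, the sketch below models a provisioning engine that reserves capacity from a shared pool, deploys a service, and reclaims the capacity on tear-down. It is a minimal Python sketch; the class and method names are illustrative and do not correspond to any particular product.

```python
# Minimal sketch of provisioning automation: reserve capacity, deploy,
# and reclaim it on tear-down so resources can be reused. Names are illustrative.

class ProvisioningEngine:
    def __init__(self, total_vcpus):
        self.free_vcpus = total_vcpus
        self.deployments = {}                   # service name -> vCPUs reserved

    def provision(self, service, vcpus):
        """Reserve capacity and deploy the service if the pool allows it."""
        if vcpus > self.free_vcpus or service in self.deployments:
            return False                        # reject rather than overcommit
        self.free_vcpus -= vcpus
        self.deployments[service] = vcpus
        # ... trigger the actual template/VM deployment here ...
        return True

    def deprovision(self, service):
        """Tear the service down and return its capacity for reuse."""
        self.free_vcpus += self.deployments.pop(service, 0)

    def usage_report(self):
        """Per-service allocation, usable for reporting or chargeback."""
        return dict(self.deployments)

engine = ProvisioningEngine(total_vcpus=128)
engine.provision("web-portal", vcpus=16)
engine.deprovision("web-portal")                # the 16 vCPUs are available again
```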

Best Practice – Implement a Consumption-Based Billing Methodology

Finally, private cloud computing is usage driven. Consumers pay only for the resources they use and therefore are charged or billed on a consumption-based model. Cloud computing platforms must provide mechanisms to capture usage information that enable chargeback reporting and/or integration with billing systems.

The value from a user's perspective is the ability for business units to pay only for the resources they use, ultimately keeping their costs down. From a provider's perspective, it allows them to track usage for chargeback and billing purposes.
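A minimal sketch of how metered usage can be rolled up into a consumption-based bill is shown below. The rates and record layout are assumptions for illustration only, not any vendor's chargeback schema.

```python
# Minimal sketch: roll metered usage samples up into a bill per business unit.
# The rates and the (unit, metric, quantity) record layout are illustrative.

RATES = {"vcpu_hours": 0.05, "gb_storage_days": 0.002, "gb_transferred": 0.01}

def charge(usage_records):
    """usage_records: iterable of (business_unit, metric, quantity) tuples."""
    bills = {}
    for unit, metric, quantity in usage_records:
        bills[unit] = bills.get(unit, 0.0) + RATES[metric] * quantity
    return bills

sample = [
    ("finance", "vcpu_hours", 720),        # one VM for one month
    ("finance", "gb_storage_days", 3000),  # 100 GB kept for 30 days
    ("marketing", "gb_transferred", 250),
]
print(charge(sample))                      # {'finance': 42.0, 'marketing': 2.5}
```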

In summary, these five best practices are necessary to produce an enterprise private cloud,

capable of achieving compelling business value including savings on capital equipment and

operating costs, reduced support costs, and significantly increased business agility. This

enables corporations to improve their profit margins and competitiveness in the markets they

serve.


Public Cloud

Public cloud solutions are the best-known examples of cloud storage. In a public cloud implementation, an organization accesses third-party resources (like Amazon S3™, EMC Atmos, Iron Mountain®, Google™, etc.) on an as-needed basis, without the requirement to invest in

additional internal infrastructure. In this pay-per-use model, public cloud vendors provide

applications, computer platforms and storage to the public, delivering significant economies of

scale. For storage, the difference between the purchase of a dedicated local appliance and the

use of a public cloud is not the functional interface, but merely the fact that the storage is

delivered on demand.

Either the customer or business unit pays for what they actually use or in other cases, what they

have allocated for use. As an extension of the financial benefits, public clouds offer a scalability

that is often beyond what a user would be able to otherwise afford. Publicly accessible clouds

offer storage capacity using multi-tenancy solutions, meaning multiple customers are serviced at

once from the same infrastructure. This results in some common concerns when evaluating

public cloud solutions, including security and privacy, as well as the possibilities of latency and

compliance issues. When considering the use of public cloud options for data storage, pay

attention to the management, both now and in the future, of both the clouds and the data, as

well as the integration of the Cloud service usage with internal IT.

Since there are numerous white papers discussing the public cloud, I will refer the reader to those papers for additional details. I will mention, however, that the five (5) best practices outlined in the section titled “Private Cloud”, starting on page 96, are also applicable to the public cloud.

Community Cloud

Community Clouds are digital ecosystems: distributed, adaptive, open socio-technical systems with properties of self-organization, scalability, and sustainability, inspired by natural ecosystems. This is an interesting approach to the sustainability issue, especially from a

regional perspective.

In a traditional market-based economy, made up of sellers and buyers, the parties exchange

property. In the new network-based economy, made up of servers and clients, the parties share

access to services and experiences. Digital Ecosystems support network-based economies that

rely on next-generation IT to extend the Service-Oriented Architecture (SOA) concept with the


automatic combination of available and applicable services in a scalable architecture, to meet

business user requests for applications that facilitate business processes. Digital Ecosystems

research is yet to consider scalable resource provision, and therefore risks being subsumed into

vendor Clouds at the infrastructure level, while striving for decentralization at the service level.

Therefore, the realization of their vision requires a form of Cloud Computing, but with their

principle of community-based infrastructure where individual users share ownership.

One aspect of the Community Cloud as it relates to other Cloud architectures is that Community

Clouds are less dependent on vendors and can, in the long run, achieve a higher level of

environmental sustainability. The Community Cloud approach is to combine distributed resource

provisioning from Grid Computing, distributed control from Digital Ecosystems and sustainability

from Green Computing with the use cases of Cloud Computing, while making greater use of

self-management advances from Autonomic Computing. Replacing vendor Clouds by shaping

the underutilized resources of user machines forms a Community Cloud, with nodes potentially

fulfilling all roles, consumer, producer, and most importantly coordinator, as shown in Figure 23

- Community Cloud, below.

Figure 23 - Community Cloud

The figure shows an environment that includes nodes of varying functionality (user machines or servers/clients), potentially allowing all nodes to fulfill all roles: consumer, producer, and coordinator. This concept of the Community Cloud draws upon Cloud Computing, Grid


Computing, Digital Ecosystems, Green Computing and Autonomic Computing. This is a model

of Cloud Computing that is part of the community, without dependence on Cloud vendors.

There are a number of advantages:

1) Openness: Removing dependence on vendors makes the Community Cloud the open

equivalent to vendor Clouds, and therefore identifies a new dimension in the open versus

proprietary struggle that has emerged in code, standards and data, but has yet to be expressed

in the realm of hosted services.

2) Community: The Community Cloud is as much a social structure as a technology paradigm.

Community ownership of the infrastructure carries with it a degree of economic scalability,

without which there would be diminished competition and potential stifling of innovation as

risked in vendor Clouds.

3) Individual Autonomy: In the Community Cloud, nodes have their own utility functions in

contrast with data centers, in which dedicated machines execute software as instructed.

Therefore, with nodes expected to act in their own self-interest, centralized control would be

impractical, as with consumer electronics like game consoles. Attempts to control user

machines counter to their self-interest result in cracked systems, from black market hardware

modifications and arms races over hacking and securing the software (routinely lost by the

vendors). In the Community Cloud, where no concrete vendors exist, it is even more important

to avoid antagonizing the users, instead embracing their self-interest and harnessing it for the

benefit of the community with measures such as a community currency.

4) Identity: In the Community Cloud, each user would inherently possess a unique identity,

which combined with the structure of the Community Cloud should lead to an inversion of the

currently predominant membership model. Therefore, instead of users registering for each

website (or service) as a new user, they could simply add the website to their identity and grant

access, allowing users to have multiple services connected to their identity, instead of creating

new identities for each service. This relationship is reminiscent of recent application platforms,

such as Facebook’s f8 and Apple’s App Store, but decentralized in nature and so free from

vendor control. In addition, it allows for the reuse of the connections between users, akin to

Google’s Friend Connect, instead of reestablishing them for each new application.


5) Graceful Failures: The Community Cloud is not owned or controlled by any one organization,

and therefore not dependent on the lifespan or failure of any one organization. It therefore

should be robust and resilient to failure, and immune to the system-wide cascade failures of

vendor Clouds. Due to the diversity of its supporting nodes, their failure is graceful, non-

destructive, and with minimal downtime, as the unaffected nodes mobilize to compensate for the

failure.

6) Convenience and Control: The Community Cloud, unlike vendor Clouds, has no inherent

conflict between convenience and control. This results from its community ownership that

provides distributed control which is more democratic. However, whether the Community Cloud

can provide a technical quality equivalent or one superior to its centralized counterparts requires

further research.

7) Community Currency: The Community Cloud requires its own currency to support the sharing

of resources, a community currency, which in economics is a medium (currency) not backed by

a central authority (e.g. national government), for exchanging goods and services within a

community. It does not need to be restricted geographically, despite sometimes being called a

local currency. An example is the Fureai kippu system in Japan, which issues credits in

exchange for assistance to senior citizens. Family members living far from their parents can

earn credits by assisting the elderly in their local community, which can then be transferred to

their parents and redeemed by them for local assistance.

8) Quality of Service: Ensuring acceptable quality of service (QoS) in a heterogeneous system

will be a challenge, not least because achieving and maintaining the different aspects of QoS

will require reaching critical mass in the participating nodes and available services. Thankfully,

the community currency could support long-term promises by resource providers and allow the

higher quality providers, through market forces, to command a higher price for their service

provision. Interestingly, the Community Cloud could provide a better QoS than vendor Clouds,

utilizing time-based and geographical variations advantageously in the dynamic scaling of

resource provision.

9) Environmental Sustainability: It is anticipated that the Community Cloud will have a smaller

carbon footprint than vendor Clouds, on the assumption that making use of underutilized user

machines requires less energy than the dedicated data centers require for vendor Clouds. The


server farms within data centers are an intensive form of computing resource provision, while

the Community Cloud is more organic, growing and shrinking in a symbiotic relationship to

support the demands of the community, which in turn supports it.

10) Service Composition: The great promise of service oriented computing is that the marginal

cost of creating the nth application will be virtually zero, as all the software required already

exists to satisfy the requirements of other applications. Only their composition and orchestration

are required to produce a new application. Within vendor Clouds it is possible to make services

that expose themselves for composition and compose these services, allowing the hosting of a

complete service-oriented architecture. However, current service composition technologies have

not gained widespread adoption. Digital Ecosystems advocate service cross pollination to avoid

centralized control by large service providers, because easy service composition allows

coalitions of SMEs to compete simply by composing simpler services into more complex

services that only large enterprises would otherwise be able to deliver. So, one could extend

decentralization beyond resource provisioning and up to the service layer, to enable service

composition within the Community Cloud.

Figure 24 - Community Cloud Architecture

As shown in Figure 24 - Community Cloud Architecture, above, the architecture's most fundamental layer deals with distributed coordination. One layer above, resource provision and consumption are arranged on top of the coordination framework. Finally, the service layer is where resources are combined into end-user accessible services, which can then themselves be composed into higher-level services.


The concept is the distribution of server functionality across a plurality of nodes provided by user machines, shaping underutilized resources into a virtual data center. Even though this is a simple and straightforward idea, it poses challenges on many different levels. The approach can be divided into three layers: coordination, resource (provision and consumption), and service.

Distributing coordination is taken for granted in homogeneous data centers where good

connectivity, constant presence and centralized infrastructure can be assumed. One layer

above, resource provisioning and consumption are arranged on top of the coordination

framework. This would also be a challenge in a distributed heterogeneous environment. Finally,

the service layer is where resources are combined into end-user accessible service(s). It is also

possible to federate these services into higher-level services.

Best Practice in Community Cloud – Use VMs

To achieve coordination, the nodes need to be deployed as isolated virtual machines, forming a

fully distributed network that can provide support for distributed identity, trust, and transactions.

Using Virtual Machines (VMs), executing arbitrary code in the machine of a resource-providing

user would require a sandbox for the guest code, and a VM to protect the host. The role of the

VM is to make system resources safely available to the Community Cloud. The Cloud

processes could then run safely without danger to the host machine. In addition to VMs, possible platforms include the Java Virtual Machine and lightweight JavaScript runtimes. The age of multi-core processors has, in many cases, left unused or underutilized cores in modern personal computers, which lend themselves well to the deployment and background execution of Community Cloud-facing VMs.

Best Practice in Community Cloud – Use Peer-to-Peer Networking

The best practice is to implement a P2P network. Newer P2P solutions offer sufficient guarantees of

distribution, immunity to super-peer failure, and resistance to enforced control. For example, in

the Distributed Virtual Super-Peer (DVSP) model, a collection of peers logically combines to

form a virtual super-peer that dynamically changes over time to facilitate fluctuating demands.

Best Practice in Community Cloud – Distributed Transactions

A key element of distributed coordination is the ability of nodes to jointly participate in

transactions that influence their individual state. Appropriately defined business processes can


be executed over a distributed network with a transactional model maintaining the properties on

behalf of the initiator. Newer transaction models maintain these properties while increasing

efficiency and concurrency. Focusing on distributing the coordination of transactions is

fundamental to permitting multi-party service composition without centralized control.

Best Practice in Community Cloud – Distributed Persistence Storage

The best practice is to use storage on the participating nodes, taking advantage of the ever-

increasing surplus on most personal computers. However, the method of information storage in

the Community Cloud is an issue with multiple aspects. First, information can be file-based or

structured. Second, while constant and instant availability can be crucial, there are scenarios in

which recall times can be relaxed. Such varying requirements call for a combination of

approaches, including distributed storage and distributed databases. Information privacy in the

Community Cloud should be provided by the encryption of user information when on remote

nodes, only being unencrypted when accessed by the user. This allows for the secure and

distributed storage of information.
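The sketch below illustrates the storage pattern just described: data is encrypted on the owner's machine before it is replicated to remote community nodes, and decrypted only when the owner reads it back. It assumes the widely used `cryptography` package for the Fernet cipher; the `put`/`get` transport callbacks and node names are hypothetical.

```python
# Minimal sketch, assuming the `cryptography` package and hypothetical
# put(node, name, blob) / get(node, name) transport callbacks.

from cryptography.fernet import Fernet

class EncryptedCommunityStore:
    def __init__(self, nodes, put, get):
        self.nodes = nodes                        # remote community-cloud nodes
        self.put, self.get = put, get
        self.key = Fernet.generate_key()          # stays with the data owner only
        self.cipher = Fernet(self.key)

    def store(self, name, data, replicas=3):
        """Encrypt locally, then push ciphertext copies to several nodes."""
        ciphertext = self.cipher.encrypt(data)
        for node in self.nodes[:replicas]:
            self.put(node, name, ciphertext)

    def load(self, name):
        """Fetch ciphertext from any node that still holds it; decrypt locally."""
        for node in self.nodes:
            ciphertext = self.get(node, name)
            if ciphertext is not None:
                return self.cipher.decrypt(ciphertext)
        raise KeyError(name)
```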

Challenges in the federation of Public and Private Clouds

Cloud computing tops Gartner's “Top 10 Strategic Technologies for 2010.” Gartner defines a strategic technology as “one with the potential for significant impact on the enterprise in the next three years.” The fundamental challenge is that the industry has shoehorned anything that can be loosely defined as cloud, virtual, IT consolidation, or anything on the network under the same term: cloud. There is a trend to use public, private, hybrid, cloud, and other variant services interchangeably.

Gartner predicts that through 2012, “IT organizations will spend more money on private cloud

computing investments than on offerings from public cloud providers.” There are two primary

reasons why the enterprise will not make major strides towards the public cloud in the near

term: lack of visibility, and multi-tenancy issues that cloak the real concern about critical data security. Some consider that security is key and could be a showstopper for public clouds, at least in the short term.

It is interesting to note that recently in the United States, the FBI raided at least two Texas data

centers, serving search-and-seizure warrants for computing equipment, including servers,


routers and storage. The FBI was seeking equipment that may have been involved in fraudulent

business practices by a handful of small VoIP vendors18.

It appears that, in the United States, if the FBI finds out that a threat is coming from a hosted provider (i.e. a cloud provider), and the servers used for the scam or threat are virtualized, the FBI may confiscate everything, possibly including your data. The reason is that it is much harder to figure out where the server and data are located. For additional details, please refer to the 2010 Proven Professional article titled “How to Trust the Cloud – Be Careful up There.”

Lack of visibility

The public cloud is opaque and lacks the level of true accountability needed for any enterprise account to release its prized data assets to a set of unknown entities. Look at the value proposition: no one consuming the service has visibility into the infrastructure. The providers themselves are not looking at the infrastructure. Are SLAs relevant? And if so, who can enforce or even monitor them?

The public cloud has received so much buzz in large part because it professes to offer

significant cost savings over buying, deploying and maintaining an in-house IT infrastructure.

While this is massively appealing, it does not answer any of the fundamentals of Quality of

Service, network and data security, to name a few. Imagine the concern of opening up your

internal systems with a direct pipe into the ‘cloud.’

Multi-tenancy Issues

Multi-tenancy is the second reason why businesses of any real size will not make the leap to the

public cloud. Wikipedia defines multi-tenancy as “a principle in software architecture where a

single instance of the software runs on a server, serving multiple client organizations (tenants).”

In other words, many people using the same IT assets and infrastructure.

18 http://www.wired.com/threatlevel/2009/04/data-centers-ra/#ixzz0fv24gOsn


So here is the concern: EC2, Google, etc., provide true multi-tenancy, but at what cost to compliance and security? What about hot topics such as PCI or forensics? How safe are

the tenants on a system? Who is on the same system as you, a hacker or perhaps your nearest

competition? How secure is the isolation between clients? What data have you trusted to this

cloud? If you buy the argument, it will be your patient records, payroll, client list, etc. It will be

essentially your most important data assets. Please refer to the EMC Proven Professional

article titled “How to Trust the Cloud – Be Careful up There” for more information.

Cloud computing needs to cover its assets

Until the public cloud can provide visibility all the way down to the IT infrastructure's simplest asset, logs, enterprises simply will not risk it. To be deployed properly, a public cloud needs to

understand logs and log management for security, business intelligence, IT optimization, PCI

forensics, parsing out billing info, and the list goes on.

Until then, in the grand scheme of risk mitigation, enterprises may fear the cloud and segment the public cloud from ITaaS in a private cloud. Most have taken all of the Cloud variants and placed them into a single bucket. In fact, there is tremendous value in cloud computing.

Nevertheless, public clouds and enterprise computing are a world apart and should be treated

as such. In addition, there are many risks to consider along the way. Please refer to the EMC

Proven Professional article titled “How to Trust the Cloud – Be Careful up There” for more

information.

Warehouse Scale Machines - Purpose-Built Solution Options

Cloud computing, utility computing, and other cloud paradigms are most certainly on IT managers' and architects' lists. However, there are other architectures that should be considered to achieve sustainability, especially in specific use cases.

The trend toward server-side computing and the exploding popularity of Internet services has

created a new class of computing systems. This architecture has been defined as warehouse-

scale computers, or WSCs. The name calls attention to the most distinguishing feature of these

machines, the massive scale of their software infrastructure, data repositories, and hardware

platform. This perspective is a departure from a view of the computing problem that implicitly

assumes a model where one program runs in a single machine. In addition, this new class deals


with a use case where a limited number of applications need to scale enormously, as with Internet services.

In warehouse-scale computing, the program is an Internet service that may consist of tens or

more individual programs that interact to implement complex end-user services such as email,

search, or maps. These programs might be implemented and maintained by different teams of

engineers, perhaps even across organizational, geographic, and company boundaries as is the

case with mashups. The computing platform required to run such large-scale services bears

little resemblance to a pizza-box server or even the refrigerator-sized high-end multiprocessors

that reigned in the last decade. The hardware for such a platform consists of thousands of

individual computing nodes with their corresponding networking and storage subsystems, power

distribution and conditioning, equipment, and extensive cooling systems. The enclosure for

these systems is in fact a building structure and often indistinguishable from a large warehouse.

Had scale been the only distinguishing feature of these systems, we might simply refer to them

as datacenters. Datacenters are buildings where multiple servers and communication gear are

co-located because of their common environmental requirements and physical security needs,

and for ease of maintenance. In that sense, a WSC could be considered a type of datacenter.

Traditional datacenters, however, typically host a large number of relatively small- or medium-

sized applications, each running on a dedicated hardware infrastructure that is de-coupled and

protected from other systems in the same facility. Those datacenters host hardware and

software for multiple organizational or business units or even different companies. Different

computing systems within such a datacenter often have little in common in terms of hardware,

software, or maintenance infrastructure, and tend not to communicate with each other at all.

WSCs currently power the services offered by companies such as Google, Amazon, Yahoo, and

Microsoft’s online services division. This application requirement differs significantly from

traditional datacenters in that they belong to a single organization, use a relatively

homogeneous hardware and system software platform, and share a common systems

management layer. Often much of the application, middleware, and system software is built in-

house compared to the predominance of third-party software running in conventional

datacenters.


Most importantly, WSCs run a smaller number of very large applications (or Internet services),

and the common resource management infrastructure allows significant deployment

flexibility. The requirements of homogeneity, single-organization control, and enhanced focus on

cost efficiency motivate designers to take new approaches in constructing and operating these

systems.

Best Practice – WSCs must achieve high availability

Internet services must achieve high availability, typically aiming for at least 99.99% uptime

(about an hour of downtime per year). Achieving fault-free operation on a large collection of

hardware and system software is difficult and is made more difficult by the large number of

servers involved. Although it might be theoretically possible to prevent hardware failures in a

collection of 10,000 servers, it would surely be extremely expensive. Consequently, WSC

workloads must be designed to gracefully tolerate large numbers of component faults with little

or no impact to service level performance and availability.
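The arithmetic behind that availability target is simple; the sketch below turns an availability figure into an annual downtime budget (the 99.99% target quoted above works out to roughly 53 minutes per year).

```python
# Minimal sketch: translate an availability target into an annual downtime budget.

HOURS_PER_YEAR = 365 * 24   # 8,760

def downtime_budget_hours(availability):
    return (1.0 - availability) * HOURS_PER_YEAR

for nines in (0.999, 0.9999, 0.99999):
    minutes = downtime_budget_hours(nines) * 60
    print(f"{nines * 100:g}% availability -> {minutes:.0f} minutes of downtime per year")
# 99.99% -> about 53 minutes, i.e. "about an hour" as noted above.
```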

Best Practice - WSCs must achieve cost efficiency

Building and operating a large computing platform is expensive, and the quality of a service may

depend on the aggregate processing and storage capacity available, further driving costs up

and requiring a focus on cost efficiency. For example, in information retrieval systems such as

Web search, the growth of computing needs is driven by three main factors:

1) Increased service popularity translates into higher request loads.

2) The size of the problem keeps growing. The Web is growing by millions of pages per day, which increases the cost of building and serving a Web index.

3) Even if the throughput and data repository could be held constant, the competitive nature of this market continuously drives innovations to improve the quality of results retrieved and the frequency with which the index is updated.

Although smarter algorithms can achieve some quality improvements, most substantial

improvements demand additional computing resources for every request. For example, in a

search system that also considers synonyms of the search terms in a query, retrieving results is

substantially more expensive: either the search needs to retrieve documents that match a more complex query that includes the synonyms, or the synonyms of a term need to be replicated in the index metadata structure for each term. The relentless demand for more computing

capabilities makes cost efficiency a primary metric of interest in the design of WSCs. Cost


efficiency must be defined broadly to account for all the significant components of cost,

including hosting-facility capital and operational expenses (which include power provisioning

and energy costs), hardware, software, management personnel, and repairs.

WSC (Warehouse Scale Computer) Attributes

Today’s successful Internet services are no longer a miscellaneous collection of machines co-

located in a facility and wired up together. The software running on these systems, such as

Gmail or Web search services, execute at a scale far beyond a single machine or a single rack.

They run on no smaller a unit than clusters of hundreds to thousands of individual servers.

Therefore, the machine, the computer, is this large cluster or aggregation of servers itself and

needs to be considered a single computing unit. The technical challenges of designing WSCs

are no less worthy of the expertise of computer systems architects than any other class of

machines. First, they are a new class of large-scale machines driven by a new and rapidly

evolving set of workloads. Their size alone makes them difficult to experiment with or simulate

efficiently; therefore, system designers must develop new techniques to guide design decisions.

Fault behavior, and power and energy considerations have a more significant impact in the

design of WSCs, perhaps more so than in other smaller scale computing platforms. Finally,

WSCs have an additional layer of complexity beyond systems consisting of individual

servers or small groups of servers; WSCs introduce a significant new challenge to programmer

productivity, a challenge perhaps greater than programming multi-core systems. This additional

complexity arises indirectly from the larger scale of the application domain and manifests itself

as a deeper and less homogeneous storage hierarchy, higher fault rates, and possibly higher

performance variability.

One Data Center vs. Several Data Centers

Multiple datacenters are sometimes used as complete replicas of the same service, with

replication being used primarily for reducing user latency and improving server throughput (a

typical example is a Web search service). In these cases, a given user query tends to be fully

processed within one datacenter, and our machine definition seems appropriate.


However, in cases where a user query may involve computation across multiple datacenters,

our single-datacenter focus is a less obvious fit. Typical examples are services that deal with

nonvolatile user data updates requiring multiple copies for disaster tolerance reasons. For such

computations, a set of datacenters might be the more appropriate system. However, think of the

multi-datacenter scenario as more analogous to a network of computers.

In many cases, there is a huge gap in connectivity quality between intra- and inter-datacenter

communications causing developers and production environments to view such systems as

separate computational resources. As the software development environment for this class of

applications evolves, or if the connectivity gap narrows significantly in the future, a need may

arise to adjust the choice of machine boundaries.

Best Practice – Use Warehouse Scale Computer Architecture designs in certain scenarios

It might seem that all but a few large Internet companies would dismiss WSCs because their sheer size and cost render them unaffordable. This may not be true. It can be argued that the problems that today's

large Internet services face will soon be meaningful to a much larger constituency because

many organizations will soon be able to afford similarly sized computers at a much lower cost.

Even today, the attractive economics of low-end server class computing platforms puts clusters

of hundreds of nodes within the reach of a relatively broad range of corporations and research

institutions. When combined with the trends toward large numbers of processor cores on a

single die, a single rack of servers may soon have as many or more hardware threads than

many of today’s datacenters. For example, a rack with 40 servers, each with four 8-core dual-

threaded CPUs, would contain more than two thousand hardware threads. Such systems will

arguably be affordable to a very large number of organizations within just a few years, while

exhibiting some of the scale, architectural organization, and fault behavior of today’s WSCs.
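The hardware-thread arithmetic in that example is worth making explicit; the short sketch below reproduces it.

```python
# 40 servers x 4 sockets x 8 cores x 2 threads per core
servers, sockets, cores, threads_per_core = 40, 4, 8, 2
print(servers * sockets * cores * threads_per_core)   # 2560 hardware threads per rack
```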

Architectural Overview of WSC’s

The hardware implementation of a WSC will differ significantly from one installation to the next.

Even within a single organization such as Google, systems deployed in different years use

different basic elements, reflecting the hardware improvements provided by the industry.

However, the architectural organization of these systems has been relatively stable over the

years.


Best Practice – Connect Storage Directly or via NAS in WSC environments

With Google's implementation, disk drives are connected directly to each individual server and

managed by a globally distributed file system. Alternately, they can be part of Network Attached

Storage (NAS) devices that are directly connected to the cluster-level switching fabric. NAS

tends to be a simpler solution to deploy initially because it pushes the responsibility for data

management and integrity to a NAS appliance vendor. In contrast, using the collection of disks

directly attached to server nodes requires a fault-tolerant file system at the cluster level. This is

difficult to implement, but can reduce hardware costs (the disks leverage the existing server

enclosure), and networking fabric utilization (each server network port is effectively dynamically

shared between the computing tasks and the file system).

Best Practice – WSCs should consider using non-standard Replication Models

The replication model between these approaches is also fundamentally different. A NAS

solution provides extra reliability through replication or error correction capabilities within each

appliance, whereas systems like GFS implement replication across different machines.

However, GFS-like systems are able to keep data available even after the loss of an entire

server enclosure or rack and may allow higher aggregate read bandwidth because the same

data can be sourced from multiple replicas. Trading off higher write overheads for lower cost,

higher availability, and increased read bandwidth was the right solution for many of Google’s

workloads. An additional advantage of having disks co-located with compute servers is that it

enables distributed system software to exploit data locality.
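A minimal sketch of this replication style follows. It is not GFS itself; it simply places replicas of a chunk on servers in different racks, so that losing an entire rack or enclosure does not lose the data and any replica can serve reads. The topology and names are illustrative.

```python
# Minimal sketch, not GFS: rack-aware replica placement for a data chunk.

import random

TOPOLOGY = {                                  # rack -> servers (illustrative)
    "rack-1": ["r1-s1", "r1-s2", "r1-s3"],
    "rack-2": ["r2-s1", "r2-s2"],
    "rack-3": ["r3-s1", "r3-s2"],
}

def place_replicas(chunk_id, replicas=3):
    """Pick one server per rack until the replica count is met."""
    racks = random.sample(list(TOPOLOGY), k=min(replicas, len(TOPOLOGY)))
    return [random.choice(TOPOLOGY[rack]) for rack in racks]

def read_replica(locations):
    """Any replica can serve the read, which also spreads read bandwidth."""
    return random.choice(locations)

locations = place_replicas("chunk-0007")
print("replicas on:", locations, "| read from:", read_replica(locations))
```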

Some WSCs, including Google’s, deploy desktop-class disk drives instead of enterprise-grade

disks because of the substantial cost differential between the two. Because the data is nearly

always replicated in some distributed fashion (as in GFS), this mitigates the possibly higher fault

rates of desktop disks. Moreover, because field reliability of disk drives tends to deviate

significantly from the manufacturer’s specifications, the reliability edge of enterprise drives is not

clearly established.

Networking Fabric

Choosing a networking fabric for WSCs involves a trade-off between speed, scale, and cost.

Typically, 1-Gbps Ethernet switches with up to 48 ports are essentially a commodity component,


costing less than $30/Gbps per server to connect a single rack as of the writing of this article. As

a result, bandwidth within a rack of servers tends to have a homogeneous profile.

However, network switches with high port counts, which are needed to tie together WSC

clusters, have a much different price structure and are more than ten times more expensive (per

1-Gbps port) than commodity switches. In other words, a switch that has 10 times the bi-section

bandwidth costs about 100 times as much.

Best Practice – For WSCs, Create a Two-Level Hierarchy of networked switches

As a result of this cost variation, the networking fabric of WSCs is often organized as a two-level hierarchy. Commodity switches in each rack provide a fraction of their bi-section bandwidth for inter-rack communication through a handful of uplinks to the more costly cluster-level switches.

For example, a rack with 40 servers, each with a 1-Gbps port, might have between four and

eight 1-Gbps uplinks to the cluster-level switch, corresponding to an oversubscription factor

between 5 and 10 for communication across racks. In such a network, programmers must be

aware of the relatively scarce cluster-level bandwidth resources and try to exploit rack-level

networking locality, complicating software development and possibly affecting resource

utilization.
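The oversubscription factor in that example follows directly from the port counts; the sketch below reproduces the calculation.

```python
# Minimal sketch of the rack oversubscription arithmetic from the example above.

def oversubscription(servers_per_rack, server_gbps, uplinks, uplink_gbps=1.0):
    """Aggregate host bandwidth divided by the bandwidth leaving the rack."""
    return (servers_per_rack * server_gbps) / (uplinks * uplink_gbps)

for uplinks in (4, 8):
    factor = oversubscription(40, 1.0, uplinks)
    print(f"{uplinks} uplinks -> oversubscription {factor:.0f}:1")
# 4 uplinks -> 10:1, 8 uplinks -> 5:1, i.e. the factor of 5 to 10 quoted above.
```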

Alternatively, one can remove some of the cluster-level networking bottlenecks by spending

more money on the interconnect fabric. For example, Infiniband interconnects typically scale to

a few thousand ports but can cost $500–$2,000 per port. Alternatively, lower-cost fabrics can be

formed from commodity Ethernet switches by building “fat tree” networks. How much to spend

on networking vs. spending the equivalent amount on buying more servers or storage is an

application-specific question that has no single correct answer. One assumption is that intra-

rack connectivity is often cheaper than inter-rack connectivity.

Handling Failures

The sheer scale of WSCs requires that Internet services software tolerate relatively high

component fault rates. Disk drives, for example, can exhibit annualized failure rates higher than

4%. Between 1.2 and 16 average server-level restarts per year is typical. With such high

component failure rates, an application running across thousands of machines may need to

react to failure conditions on an hourly basis.
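To see why failures become routine at this scale, the rough estimate below combines the rates quoted above for a hypothetical cluster; the cluster size and disks-per-server figures are assumptions for illustration.

```python
# Rough estimate of failure events per day for a large cluster, using the
# annualized rates quoted in the text. Cluster size and disks per server are
# illustrative assumptions.

SERVERS = 10_000
DISKS_PER_SERVER = 4            # assumption
DISK_AFR = 0.04                 # 4% annualized disk failure rate (from the text)
RESTARTS_PER_SERVER_YEAR = 4    # within the 1.2-16 range quoted in the text

disk_failures_per_day = SERVERS * DISKS_PER_SERVER * DISK_AFR / 365
server_restarts_per_day = SERVERS * RESTARTS_PER_SERVER_YEAR / 365

print(f"~{disk_failures_per_day:.1f} disk failures/day, "
      f"~{server_restarts_per_day:.0f} server restarts/day")
# Under these assumptions the cluster sees several disruptive events per hour,
# so the software must treat component failure as a normal operating condition.
```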


The applications that run on warehouse-scale computers (WSCs) dominate many system

design trade-off decisions. Some of the distinguishing characteristics of software that runs in

large Internet services are the system software and tools needed for a complete computing

platform. Here is some terminology to define the different software layers in a typical WSC

deployment:

Platform-level software:

Platform-level software is the common firmware, kernel, operating system distribution, and

libraries expected to be present in all individual servers to abstract the hardware of a single

machine and provide basic server-level services.

Cluster-level infrastructure:

Cluster-level infrastructure is the collection of distributed systems software that manages

resources and provides services at the cluster level; ultimately, we consider these services as

an operating system for a datacenter. Examples are distributed file systems, schedulers, remote

procedure call (RPC) layers, as well as programming models that simplify the usage of

resources at the scale of datacenters, such as MapReduce, Dryad, and Hadoop.

Application-level software

Application-level software is the software that implements a specific service. It is often useful to

further divide application-level software into online services and offline computations, because

those tend to have different requirements. Google search, Gmail, and Google Maps are

examples of online services. Offline computations are typically used in large-scale data analysis

or as part of the pipeline that generates the data used in online services; for example, building

an index of the Web or processing satellite images to create map files for the online service.

Best Practice - Use Sharding and other requirements in WSCs

Sharding is splitting a data set into smaller fragments (shards) and distributing them across a

large number of machines. Operations on the data set are dispatched to some or all of the

machines hosting shards, and results are coalesced by the client. The sharding policy can vary

depending on space constraints and performance considerations. Sharding also helps


availability because recovery of small data fragments can be done more quickly than larger

ones.

In large-scale services, service-level performance often depends on the slowest responder out

of hundreds or thousands of servers. Reducing response-time variance is critical. In a sharded

service, load balancing can be achieved by biasing the sharding policy to equalize the amount

of work per server.

That policy may need to be informed by the expected mix of requests or by the computing

capabilities of different servers. Note that even homogeneous machines can offer different

performance characteristics to a load-balancing client if multiple applications are sharing a

subset of the load-balanced servers.
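A minimal sketch of a weighted sharding policy is shown below: keys are hashed into the unit interval and mapped to servers in proportion to per-server weights, so more capable machines can be biased to receive more of the load. The server names and weights are illustrative, and this is not any specific product's implementation.

```python
# Minimal sketch: hash a key into [0, 1) and map it to a server in proportion
# to per-server weights, biasing the sharding policy toward capable machines.

import hashlib
from bisect import bisect

WEIGHTS = {"node-a": 1.0, "node-b": 1.0, "node-c": 2.0}   # node-c takes ~2x the keys

def _key_to_unit_interval(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) / 16 ** 32          # uniform-ish value in [0, 1)

def shard_for(key):
    """Return the server responsible for this key under the weighted policy."""
    names = list(WEIGHTS)
    cumulative, total = [], 0.0
    for name in names:
        total += WEIGHTS[name]
        cumulative.append(total)
    point = _key_to_unit_interval(key) * total
    return names[bisect(cumulative, point)]

if __name__ == "__main__":
    from collections import Counter
    counts = Counter(shard_for(f"user:{i}") for i in range(10_000))
    print(counts)   # node-c should receive roughly half of the keys
```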

In a replicated service, a load-balancing agent can dynamically adjust the load by selecting to

which servers to dispatch a new request. It may still be difficult to approach perfect load

balancing because the amount of work required by different types of requests is not always

constant or predictable. Health checking and watchdog timers are required.

In a large-scale system, failures are often manifested as slow or unresponsive behavior from a

given server. In this environment, no operation can rely on a given server to respond to make

forward progress. In addition, it is critical to quickly determine that a server is too slow or

unreachable and steer new requests away from it. Remote procedure calls (RPC’s) must set

well-informed time-out values to abort long-running requests, and infrastructure-level software

may need to continually check connection-level responsiveness of communicating servers and

take appropriate action when needed.
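The sketch below illustrates these two mechanisms together: a per-request deadline, and a simple quarantine that steers new requests away from servers that recently timed out. The `call(server, request)` function and all names are assumptions, not a real RPC library.

```python
# Minimal sketch, not any specific RPC library: enforce a deadline on each
# request and avoid servers that were recently slow or unreachable.

import concurrent.futures as cf
import time

POOL = cf.ThreadPoolExecutor(max_workers=16)
UNHEALTHY_UNTIL = {}        # server -> timestamp until which it is avoided
TIMEOUT_S = 0.5             # per-request deadline
QUARANTINE_S = 30.0         # how long a misbehaving server is avoided

def healthy(server):
    return time.time() >= UNHEALTHY_UNTIL.get(server, 0.0)

def call_with_timeout(call, server, request):
    """Issue the RPC with a time-out; quarantine the server if it misses it."""
    future = POOL.submit(call, server, request)
    try:
        return future.result(timeout=TIMEOUT_S)
    except (cf.TimeoutError, OSError):
        UNHEALTHY_UNTIL[server] = time.time() + QUARANTINE_S
        raise

def dispatch(call, servers, request):
    """Prefer healthy replicas; fall back to the full list if none are healthy."""
    candidates = [s for s in servers if healthy(s)] or list(servers)
    for server in candidates:
        try:
            return call_with_timeout(call, server, request)
        except Exception:
            continue
    raise RuntimeError("no replica answered within the deadline")
```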

Integrity checks are required. In some cases, besides unresponsiveness, faults are manifested

as data corruption. Although those may be more rare, they do occur and often in ways that

underlying hardware or software checks do not catch (e.g., there are known issues with the

error coverage of some networking CRC checks). Extra software checks can mitigate these

problems by changing the underlying encoding or adding more powerful redundant integrity

checks. See section titled “Best Practice – Implement Undetected data corruption technology

into environment”, starting on page 69 for additional details.


Best Practice – Implement application-specific compression

Often a large portion of the equipment costs in modern datacenters is in the various storage

layers. For services with very high throughput requirements, it is critical in this environment to fit

as much of the working set as possible in DRAM.

This makes compression techniques very important because the extra CPU overhead of

decompressing is still orders of magnitude lower than the penalties involved in going to disks.

Although generic compression algorithms can do quite well on the average, application-level compression schemes that are aware of the data encoding and distribution of values can achieve significantly superior compression factors or better decompression speeds.
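The contrast can be seen with a small sketch: a generic compressor applied to a sorted posting list of document IDs, versus the same compressor applied after a delta-encoding step that exploits knowledge of the data. The data and sizes are illustrative; real gains depend on the workload.

```python
# Minimal sketch: generic compression vs. a simple application-aware scheme
# (delta-encode a sorted posting list of document IDs before compressing).

import zlib

doc_ids = list(range(1_000_000, 1_200_000, 7))            # sorted document IDs
raw = ",".join(map(str, doc_ids)).encode()

generic = zlib.compress(raw)

deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
delta_aware = zlib.compress(",".join(map(str, deltas)).encode())

print(len(raw), len(generic), len(delta_aware))           # delta-encoded output is smaller
```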

Eventual consistency is another important technique: keeping multiple replicas up to date using the traditional guarantees offered by a database management system significantly increases the complexity of workloads and software infrastructure, and reduces the availability of distributed applications.

Fortunately, large classes of applications have more relaxed requirements and can tolerate

inconsistent views for limited periods, provided that the system eventually returns to a stable

consistent state. Response time of large parallel applications can also be improved by the use

of redundant computation techniques. Several situations may cause a given subtask of a large

parallel job to be much slower than its siblings may, either due to performance interference with

other workloads or software/hardware faults. Redundant computation is not as widely deployed

as other techniques because of the obvious overhead.

However, the completion of a large job is being held up by the execution of a very small

percentage of its subtasks in some situations. One such example is the issue of stragglers, as

described in the paper on MapReduce19. In this case, a single slower worker can determine the

response time of a huge parallel task. MapReduce’s strategy is to identify such situations

toward the end of a job and speculatively start redundant workers only on those slower jobs.

This strategy increases resource usage by a few percentage points while reducing a parallel

computation’s completion time by more than 30%.

19 http://labs.google.com/papers/mapreduce.html
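A minimal sketch of that straggler-mitigation strategy is shown below: subtasks run in parallel, and any subtask still unfinished after a grace period is speculatively re-issued, with the first copy to finish winning. The worker function, pool size, and timings are assumptions; a production system would also cancel the losing copy.

```python
# Minimal sketch of speculative execution for stragglers, not MapReduce itself.

import concurrent.futures as cf

def run_with_backups(tasks, worker, max_workers=8, straggler_grace_s=5.0):
    results = {}
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        primary = {pool.submit(worker, t): t for t in tasks}
        done, stragglers = cf.wait(primary, timeout=straggler_grace_s)
        for fut in done:
            results[primary[fut]] = fut.result()
        # Speculatively re-issue only the slow subtasks on spare workers.
        backups = {pool.submit(worker, primary[f]): primary[f] for f in stragglers}
        for fut in cf.as_completed(list(stragglers) + list(backups)):
            task = primary[fut] if fut in primary else backups[fut]
            if task not in results:
                results[task] = fut.result()   # first copy to finish wins
    return results
```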


Utility Computing

Utility computing can be thought of as a previous incarnation of Cloud Computing with a

business slant.

While utility computing often requires a cloud-like infrastructure, its focus is on the business

model. Simply put, a utility computing service is one in which customers receive computing

resources from a service provider (hardware and/or software) and “pay by the glass,” much as

you do for your water, electric service and other utilities at home.

Amazon Web Services (AWS), despite a recent outage, is the current incumbent for this model

as it provides a variety of services, among them the Elastic Compute Cloud (EC2), in which

customers pay for compute resources by the hour, and Simple Storage Service (S3), for which

customers pay based on storage capacity. Other utility services include Sun’s Network.com,

EMC’s recently launched storage cloud service, and those offered by startups, such as Joyent

and Mosso.

The primary benefit of utility computing is better economics. As mentioned previously, corporate

data centers are notoriously underutilized, with resources such as servers often idle 85 percent

of the time. This is due to over provisioning (buying more hardware than is needed on average

in order to handle peaks).

Any application needs a model of computation, a model of storage and, assuming the

application is even trivially distributed, a model of communication.

The statistical multiplexing necessary to achieve elasticity and the illusion of infinite capacity

requires resources to be virtualized so that the implementation of how they are multiplexed and

shared can be hidden from the programmer.

Different utility computing offerings are distinguished based on the level of abstraction

presented to the programmer and the level of management of the resources. For example,

Amazon EC2 is at one end of the spectrum. An EC2 instance looks much like physical

hardware, and users can control nearly the entire software stack, from the kernel upwards. The

API exposed is “thin”: a few dozen API calls to request and configure the virtualized hardware.

There is no a priori limit on the kinds of applications that can be hosted; the low level of


virtualization of raw CPU cycles, block-device storage, and IP-level connectivity allows developers to

code whatever they want. On the other hand, this makes it inherently difficult for Amazon to

offer automatic scalability and failover, because the semantics associated with replication and

other state management issues are highly application-dependent.

AWS offers a number of higher-level managed services, including several different managed

storage services for use in conjunction with EC2, such as SimpleDB. However, these offerings

have higher latency and non-standard APIs, and our understanding is that they are not as

widely used as other parts of AWS. At the other extreme of the spectrum are application

domain-specific platforms, such as Google AppEngine and Force.com, the SalesForce business

software development platform. AppEngine is targeted exclusively at traditional web

applications, enforcing an application structure of clean separation between a stateless

computation tier and a stateful storage tier. Furthermore, AppEngine applications are expected

to be request-reply based, and as such, they are severely rationed in how much CPU time they

can use in servicing a particular request. AppEngine’s impressive automatic scaling and high-

availability mechanisms, and the proprietary MegaStore (based on “BigTable”) data storage

available to AppEngine applications, all rely on these constraints. Therefore, AppEngine is not

suitable for general-purpose computing. Similarly, SalesForce.com is designed to support

business applications that run against the salesforce.com database, and nothing else.

Microsoft’s Azure is an intermediate point on this spectrum of flexibility vs. programmer

convenience. Azure applications are written using the .NET libraries, and compiled to the

Common Language Runtime, a language-independent managed environment. The system

supports general purpose computing, rather than a single category of application. Users get a

choice of language, but cannot control the underlying operating system or runtime. The libraries

provide a degree of automatic network configuration and failover/scalability, but require the

developer to declaratively specify some application properties in order to do so. Therefore,

Azure is intermediate between complete application frameworks like AppEngine on the one

hand, and hardware virtual machines like EC2 on the other.

In Table 2 - Examples of Cloud Computing vendors and how each provides

virtualized resources (computation, storage), I summarize how these three classes virtualize

computation, storage, and networking. The scattershot offerings of scalable storage suggest

that scalable storage with an API comparable in richness to SQL remains an open challenge.

Amazon has begun offering Oracle databases hosted on AWS, but the economics and licensing


model of this product makes it a less natural fit for Cloud Computing. Will one model beat out

the others in the Cloud Computing space?

We can draw an analogy with programming languages and frameworks. Low-level languages

such as C and assembly language allow fine control and close communication with the bare

metal, but if the developer is writing a Web application, the mechanics of managing sockets,

dispatching requests, and so on are cumbersome and tedious to code, even with good libraries.

On the other hand, high-level frameworks such as Ruby on Rails make these mechanics

invisible to the programmer, but are only useful if the application readily fits the request/reply

structure and the abstractions provided by Rails; any deviation requires diving into the

framework at best, and may be awkward to code. No reasonable Ruby developer would argue

against the superiority of C for certain tasks, and vice versa. Different tasks will result in demand

for different classes of utility computing.

Continuing the language analogy, just as high-level languages can be implemented in lower

level ones, highly managed cloud platforms can be hosted on top of less-managed ones. For

example, AppEngine could be hosted on top of Azure or EC2; Azure could be hosted on top of

EC2. Of course, AppEngine and Azure each offer proprietary features (AppEngine’s scaling,

failover and MegaStore data storage) or large, complex APIs (Azure's .NET libraries) that have no free implementation, so any attempt to “clone” AppEngine or Azure would require re-implementing those features or APIs.

Table 2 - Examples of Cloud Computing vendors and how each provides virtualized resources (computation, storage)

Computation model (VM)

• Google AppEngine: Predefined application structure and framework; programmer-provided “handlers” written in Python, all persistent state stored in MegaStore (outside Python code). Automatic scaling up and down of computation and storage; network and server failover; all consistent with 3-tier Web app structure.

• Microsoft Azure: Microsoft Common Language Runtime (CLR) VM; common intermediate form executed in managed environment. Machines are provisioned based on declarative descriptions (e.g. which “roles” can be replicated); automatic load balancing.

• Amazon Web Services: x86 Instruction Set Architecture (ISA) via Xen VM. Computation elasticity allows scalability, but the developer must build the machinery, or use a third-party VAR such as RightScale, for example.

Storage model

• Google AppEngine: MegaStore/BigTable.

• Microsoft Azure: SQL Data Services (restricted view of SQL Server); Azure storage service.

• Amazon Web Services: Range of models from block store (EBS) to augmented key/blob store (SimpleDB). Automatic scaling varies from no scaling or sharing (EBS) to fully automatic (SimpleDB, S3), depending on which model is used. Consistency guarantees vary widely depending on which model is used. APIs vary from standardized (EBS) to proprietary.

Networking model

• Google AppEngine: Fixed topology to accommodate 3-tier Web app structure; scaling up and down is automatic and programmer invisible.

• Microsoft Azure: Automatic, based on programmer's declarative descriptions of app components (roles).

• Amazon Web Services: Fixed topology to accommodate 3-tier Web app structure; scaling up and down is automatic and programmer invisible.


Grid computing

Grid Computing is a form of distributed computing in which a virtual supercomputer is composed of networked, loosely coupled computers acting in concert to perform very large tasks. A typical configuration is one in which resource provisioning is managed and allocated by a group of distributed nodes. The central virtual supercomputer is where the resource is consumed and provisioned. The role of coordinator for resource provisioning is also centrally controlled.

It has been applied to computationally intensive scientific, mathematical, and academic

problems through volunteer computing, and used in commercial enterprise for such diverse

applications as drug discovery, economic forecasting, seismic analysis, and back-office

processing to support e-commerce and web services. What distinguishes Grid Computing from

cluster computing is being more loosely coupled, heterogeneous, and geographically dispersed.

In addition, grids are often constructed with general purpose grid software libraries and

middleware, dividing and apportioning pieces of a program to potentially thousands of

computers. However, what distinguishes Cloud Computing from Grid Computing is that it is web-centric, even though some of their definitions are conceptually similar (such as computing resources being consumed the way electricity is consumed from power grids).

Cloud Type Architecture Summary

In terms of cloud computing service types and the similarities and differences between cloud,

grid and other cloud types, it may be advantageous to use Amazon Web Services as an

example.

To get cloud computing to work, you need three things: thin clients (or clients with a thick-thin

switch), grid computing, and utility computing. Grid computing links disparate computers to form

one large infrastructure, harnessing unused resources. Utility computing means paying for what you use on shared servers, much as you pay for a public utility (such as electricity, gas, and so on).

With grid computing, you can provision computing resources as a utility that can be turned on or

off. Cloud computing goes one step further with on-demand resource provisioning. This

eliminates over-provisioning when used with utility pricing. It also removes the need to over-

provision in order to meet the demands of millions of users.
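To make the over-provisioning point concrete, the following minimal sketch (in Python, with purely hypothetical demand figures and an assumed per-server-hour rate, not any vendor's actual pricing) compares provisioning for peak demand against paying only for the capacity each period actually uses:

    # Illustrative comparison of provision-for-peak vs. pay-per-use elasticity.
    # All figures are hypothetical assumptions, not vendor pricing.
    hourly_demand = [40, 35, 30, 55, 90, 120, 150, 80]   # servers needed in each period
    hours_per_period = 3                                  # each entry covers 3 hours
    cost_per_server_hour = 0.10                           # assumed utility rate in dollars

    # Static provisioning must carry the peak at all times.
    peak_servers = max(hourly_demand)
    static_cost = peak_servers * cost_per_server_hour * hours_per_period * len(hourly_demand)

    # Elastic provisioning pays only for what each period actually uses.
    elastic_cost = sum(d * cost_per_server_hour * hours_per_period for d in hourly_demand)

    print(f"Provision-for-peak cost: ${static_cost:.2f}")
    print(f"Pay-per-use cost:        ${elastic_cost:.2f}")
    print(f"Over-provisioning waste: ${static_cost - elastic_cost:.2f}")

The same arithmetic scales to thousands of servers; the waste grows with the gap between average and peak demand.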


Infrastructure as a Service and more

A consumer can get service from a full computer infrastructure through the Internet. This type of

service is called Infrastructure as a Service (IaaS). Internet-based services such as storage and

databases are part of the IaaS. Other types of services on the Internet are Platform as a Service

(PaaS) and Software as a Service (SaaS). PaaS offers full or partial application development environments that users can access, while SaaS provides a complete turnkey application, such as Enterprise Resource Management, through the Internet.

To get an idea of how Infrastructure as a Service (IaaS) is/was used in real life, consider The

New York Times, which processed terabytes of archival data using hundreds of Amazon's EC2

instances within 36 hours. If The New York Times had not used EC2, it would have taken days

or months to process the data.

IaaS divides into two types of usage: public and private. Amazon EC2 uses public server

pools in the infrastructure cloud. A more private cloud service uses groups of public or private

server pools from an internal corporate data center. One can use both types to develop software

within the environment of the corporate data center, and, with EC2, temporarily extend

resources at low cost, for example, for testing purposes. The mix may provide a faster way of

developing applications and services with shorter development and testing cycles.

Amazon Web Services

With EC2, customers create their own Amazon Machine Images (AMIs) containing an operating

system, applications, and data, and they control how many instances of each AMI run at any

given time. Customers pay for the instance-hours (and bandwidth) they use, adding computing

resources at peak times and removing them when they are no longer needed. The EC2, Simple

Storage Service (S3), and other Amazon offerings scale up to deliver services over the Internet

in massive capacities to millions of users.

Amazon provides five different types of servers ranging from single-core x86 servers to eight-

core x86_64 servers. You do not have to know which servers are in use to deliver service

instances. You can place the instances in different geographical locations or availability zones.

Amazon allows elastic IP addresses that can be dynamically allocated to instances.


Cloud computing

With cloud computing, companies can scale up to massive capacities in an instant without

having to invest in new infrastructure, train new personnel, or license new software. Cloud

computing is of particular benefit to small and medium-sized businesses who wish to completely

outsource their data-center infrastructure, or large companies who wish to get peak load

capacity without incurring the higher cost of building larger data centers internally. In both

instances, service consumers use what they need on the Internet and pay only for what they

use.

The service consumer no longer has to be at a PC, use an application from the PC, or purchase

a specific version that is configured for smart phones, PDAs, and other devices. The consumer

does not own the infrastructure, software, or platform in the cloud. The consumer has lower

upfront costs, capital expenses, and operating expenses. The consumer does not care about

how servers and networks are maintained in the cloud. The consumer can access multiple

servers anywhere on the globe, without knowing which ones and where they are located.

Grid Computing

Cloud computing evolved from grid computing and provides on-demand resource provisioning.

Grid computing may or may not be in the cloud depending on what type of users are using it. If

the users are systems administrators and integrators, they care how things are maintained in

the cloud. The providers install and virtualize servers and applications. If the users are

consumers, they do not care how things are run in the system.

Grid computing requires the use of software that can divide and farm out pieces of a program as

one large system image to several thousand computers. One concern about grid is that if one

piece of the software on a node fails, other pieces of the software on other nodes may fail. This

is minimized if that component has a failover component on another node, but problems can still

arise if components rely on other pieces of software to accomplish one or more grid computing

tasks. Large system images and associated hardware to operate and maintain them can

contribute to large capital and operating expenses.


Similarities and differences

Cloud computing and grid computing are both scalable. Scalability is accomplished through load balancing of application instances running separately on a variety of operating systems and connected through Web services. CPU and network bandwidth are allocated and de-allocated on

demand. The system's storage capacity goes up and down depending on the number of users,

instances, and the amount of data transferred at a given time.

Both computing types involve multi-tenancy and multitasking, meaning that many customers can perform different tasks while accessing single or multiple application instances. Sharing resources

among a large pool of users assists in reducing infrastructure costs and peak load capacities.

Cloud and grid computing provide service-level agreements (SLAs) for guaranteed uptime availability of, say, 99 percent. If the service slides below the guaranteed uptime level, the consumer will receive a service credit for receiving data late.
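As a simple illustration of what such an uptime guarantee means in practice, the short Python sketch below converts an SLA availability percentage into an allowable monthly downtime budget (the 99 percent figure mirrors the example above; the others are shown only for comparison):

    # Convert an SLA availability percentage into a monthly downtime budget.
    def monthly_downtime_minutes(availability_pct, minutes_per_month=30 * 24 * 60):
        """Minutes of allowable downtime per month for a given availability SLA."""
        return minutes_per_month * (1 - availability_pct / 100.0)

    for sla in (99.0, 99.9, 99.99):
        print(f"{sla}% uptime allows {monthly_downtime_minutes(sla):.1f} minutes of downtime per month")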

Amazon S3 provides a Web services interface for storing and retrieving data in the cloud. There is no maximum limit on the number of objects you can store in S3. You can store an object as

small as 1 byte and as large as 5 GB or even several terabytes. S3 uses the concept of buckets

as containers for each storage location of your objects. The data is stored securely using the

same data storage infrastructure that Amazon uses for its e-commerce Web sites.

While storage computing in the grid is well suited to data-intensive storage, it is not economically suited to storing objects as small as 1 byte. In a data grid, the amounts of

distributed data must be large for maximum benefit.

A computational grid focuses on computationally intensive operations. Amazon Web Services in

cloud computing offers two types of instances: standard and high-CPU.


Business Practices Pillar

It is important to recognize that there are tough challenges that data center managers, industry

operators, and IT businesses face as they all struggle to support their businesses in the face of

budget cuts and uncertainty about the future. It is natural that environmental sustainability is

taking a back seat in many companies at this time. However, the fact is, being “lean and green”

is good for both the business and the environment, and organizations that focus their attentions

accordingly will see clear benefits. Reducing energy use and waste improves a company’s

bottom line, and increasing the use of recycled materials is a proven way to demonstrate good

corporate citizenship to your customers, employees, and the communities in which you do

business.

That said, it is not always easy to know where to begin in moving to greener and more efficient

operations. As shown in Figure 25 – Sustainability Ontology – Business Practices (below, on page 128), many methods and best practices can be implemented. The diagram

outlines the structure of how a company can achieve a high level of efficiency and sustainability

through better process improvement and management, and how conforming to standards and

addressing governance and compliance standards can help achieve this goal. With that in mind,

this section enumerates the many best business practices for environmentally sustainable

business. It is hoped that if companies follow these best practices, it will lead to optimal use of

resources and help teams and management stay aligned with the core strategies and goals of

achieving sustainable IT.


Figure 25 – Sustainability Ontology – Business Practices

Process Management and Improvement

Best Practice - Provide incentives that support your primary goals

Incentives can help you achieve remarkable results in a relatively short period of time if you

apply them properly. Take energy efficiency as an example. A broad range of technology

improvements and best practices are already available that companies can use to improve

efficiency in the data center. However, industry adoption for these advances has been relatively

low. One possible reason is that the wrong incentives are in place. For instance, data center


managers are typically compensated based on uptime and not efficiency. Best practice is to

provide specific incentives to reward managers for improving the efficiency of their operations,

using metrics such as Power Usage Effectiveness (PUE), which determines the energy

efficiency of a data center by dividing the amount of power entering a data center by the power

used to run the computer infrastructure within it. Uptime is still an important metric, but the goal is to balance uptime incentives appropriately against the need to improve energy efficiency [2].
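The PUE calculation itself is straightforward; the minimal sketch below (in Python, with hypothetical meter readings) applies the definition given above, total power entering the data center divided by the power used to run the IT equipment:

    # Power Usage Effectiveness (PUE), as defined above.
    # The readings below are hypothetical examples.
    def pue(total_facility_kw, it_equipment_kw):
        return total_facility_kw / it_equipment_kw

    print(pue(total_facility_kw=1800, it_equipment_kw=1000))  # 1.8, a typical legacy facility
    print(pue(total_facility_kw=1200, it_equipment_kw=1000))  # 1.2, a more efficient facility

A lower PUE means less of every watt entering the facility is lost to cooling, power conversion, and other overhead.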

Another outmoded incentive in the industry involves how data center hosting costs are allocated

back to internal organizations. Most often, these costs are allocated based on the proportion of

floor space used. These incentives drive space efficiency and ultra-robust data centers, but they

come at a high cost, and typically are not energy efficient. Space-based allocation does not

reflect the true cost of building and maintaining a data center. A best practice is to achieve

substantial efficiency gains by moving to a model that allocates costs to internal customers,

based on the proportion of energy their services consume. It is anticipated that business units would then begin evaluating their server utilization data to make sure they do not already have unused capacity before ordering more servers.
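A minimal sketch of such an energy-based chargeback model follows (in Python; the business-unit names, energy figures, and monthly facility cost are hypothetical assumptions used only to show the allocation):

    # Allocate a shared data center bill to internal customers by energy consumed,
    # rather than by floor space. Names and figures are hypothetical.
    monthly_facility_cost = 500_000.0   # assumed total cost to recover, in dollars

    energy_kwh_by_unit = {
        "online-services": 220_000,
        "erp": 90_000,
        "analytics": 140_000,
    }

    total_kwh = sum(energy_kwh_by_unit.values())
    for unit, kwh in energy_kwh_by_unit.items():
        share = kwh / total_kwh
        print(f"{unit}: {share:.1%} of energy -> ${monthly_facility_cost * share:,.2f}")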

Best Practice - Focus on effective resource utilization

Energy efficiency is an important element in any company’s business practices, but equally

important is the effective use of deployed resources. For example, if only 50 percent of a data

center’s power capacity is used, then highly expensive capacity is stranded in the uninterruptible

power supplies (UPSs), generators, chillers, and so on. In a typical 12 Megawatt data center,

this could equate to $4-8 million annually in unused capital expenditure [3]. In addition, there is

embedded energy in the unused capacity, since it takes energy to manufacture the UPSs,

generators, chillers, and so on. Stranding capacity will also force organizations to build

additional data centers sooner than necessary.

Best Practice - Use virtualization to improve server utilization and increase operational efficiency

As noted in the best practice above, underutilized servers are a major problem facing many data center operators; industry analysts have reported that utilization levels are often well below 20 percent. In today’s budgetary climate, IT departments are being asked to improve efficiency, not only from a capital perspective, but also with regard to operational overhead. Migrating applications from physical to virtual machines and consolidating them onto shared physical hardware, using technologies such as Hyper-V to increase virtualization and therefore utilization year after year, helps increase the productivity per watt of operations. Utilizing infrastructure architectures such as Amazon Web Services, Microsoft’s Windows Azure cloud operating system, and EMC’s VCE Cloud virtualization is a best practice.

One immediate benefit of virtual environments is improved operational efficiency. Operations

teams can deploy and manage servers in a fraction of the time it would take to deploy the

equivalent physical hardware or perform a physical configuration change. In a virtual

environment, managing hardware failures without disrupting service is as simple as a click of a

button or automated trigger, which rolls virtual machines from the affected physical host to a

healthy host.

A server running virtualization will often need more memory to support multiple virtual machines, and there is a small software overhead for virtualization. However, the overall value proposition, measured in terms of work done per cost and per watt, is much better than in the dedicated, underutilized physical server case.
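The following rough sketch (in Python, with assumed utilization and power figures rather than measured data) illustrates the work-per-watt argument for consolidating underutilized physical servers onto fewer virtualized hosts:

    # Rough work-per-watt comparison for consolidation onto virtualized hosts.
    # All workload and power figures are assumptions for illustration only.
    servers_before = 20
    utilization_before = 0.15       # 15% average utilization, per the "below 20 percent" observation
    watts_per_server = 300

    hosts_after = 4
    utilization_after = 0.60        # consolidated hosts run at much higher utilization
    watts_per_host = 350            # extra memory and virtualization overhead folded in

    work_before = servers_before * utilization_before      # arbitrary "units of work"
    work_after = hosts_after * utilization_after

    print("Work per watt before:", work_before / (servers_before * watts_per_server))
    print("Work per watt after: ", work_after / (hosts_after * watts_per_host))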

Key benefits of virtualization include:

• Reduction in capital expenditures

• Decrease in real estate, power, and cooling costs

• Faster time to market for new products and services

• Reduction in outage and maintenance windows

Best Practice - Drive quality up through compliance

Many data center processes are influenced by the need to meet regulatory and security

requirements for availability, data integrity, and consistency. See section titled “Compliance” on

page 146 for additional information. Quality and consistency are tightly linked and can be

managed through a common set of processes. Popular approaches to increasing quality are

almost without exception tied to observing standards and reducing variability.


A continuous process helps maintain the effectiveness of controls as your environment

changes. Compliance boils down to developing a policy and then operating consistently as

measured against that policy. The extended value that can be offered by standardized,

consistent processes that address compliance will also help you achieve higher quality benefits.

A best practice is to achieve certification to the international information security standard,

ISO/IEC 27001:2005. For instance, through monitoring one’s data center systems for policy

compliance, many companies have exposed processes that were causing problems, and found

opportunities for improvements that benefitted multiple projects. This continuous approach, outlined in Figure 26, is a best practice.

(Figure elements: Design Control, Perform Control, Test Control, Analyze and Correct, Document Issues, Feedback to Design)

Figure 26 - A continuous process helps maintain the effectiveness of controls as your environment changes

Best Practice - Embrace change management

Poorly planned changes to the production environment can have unexpected and sometimes disastrous results, which can then spill over into the planet’s environment when the impacts involve wasted energy and other inefficient use of resources. Changes may involve

hardware, software, configuration, or process. Standardized procedures for a request, approval,

coordination, and execution of changes can greatly reduce the number and severity of


unplanned outages. Data center organizations should adopt and maintain repeatable, well-

documented processes, where the communication of planned changes enables teams to

identify risks to dependent systems and develop appropriate workarounds in advance.

Figure 27 - Consistent and well-documented processes help ensure smooth changes in the production environment

A best practice is to manage changes to a data center’s hardware and software infrastructure

through a review and planning process that is based on the Information Technology

Infrastructure Library (ITIL) framework. An example of this type of process is shown in Figure 27

- Consistent and well-documented processes help ensure smooth changes in the production

environment. Proposed changes are reviewed prior to approval to ensure that sufficient

diligence has been applied. Additionally, planning for recovery in the case of unexpected results

is crucial. Rollback plans must be scrutinized to ensure that all known contingencies have been

considered. When developing a change management program, it is important to consider the

influences of people, processes, and technology. By employing the correct level of change

management, businesses can increase customer satisfaction and improve service-level performance without placing undue burden on their operations staff.

Other features that your change management process should include:


• Documented policies around communication and timeline requirements

• Standard templates for requesting, communicating, and reviewing changes

• Post-implementation review, including cases where things went well

Best Practice - Invest in understanding your application workload and behavior

The applications in your environment and the particulars of the traffic on your network are

unique. The better you understand them, the better positioned you will be to make

improvements. Moving forward in this regard requires hardware engineering and performance

analysis expertise within your organization, so you should consider staffing accordingly.

Credible and competent in-house expertise is needed to properly evaluate new hardware,

optimize your request for proposal (RFP) process for servers, experiment with new

technologies, and provide meaningful feedback to your vendors. Once you start building this

expertise, the first goal is to focus your team on understanding your environment, and then

working with the vendor community. Make your needs known to them as early as possible. It is

an approach that makes sense for any company in the data center industry that is working to

increase efficiency. If you do not start with efficient servers, you are just going to pass

inefficiencies down the line.

Best Practice - Right-size your server platforms to meet your application requirements

Another best practice in data centers involves “right-sizing the platform.” This can take two forms. One is where you work closely with server and other infrastructure manufacturers to optimize their designs and remove items you don’t use, such as more memory slots and input/output (I/O) slots than you need, and focus on high-efficiency power supplies and advanced power management features. With the volume of servers that many large corporations purchase, most manufacturers are open to meeting these requests, as well as to partnering with customers to drive innovation into the server space to reduce resource consumption even further. Of course, not all companies purchase servers on a scale where it makes sense for

manufacturers to offer customized stock-keeping units (SKUs). That is where the second kind of

right-sizing comes in. It involves being disciplined about developing the exact specifications that

you need servers to meet for your needs, and then not buying machines that exceed your

specifications. It is often tempting to buy the latest and greatest technology, but you should only

do so after you have evaluated and quantified whether the promised gains provide an

acceptable return on investment (ROI).


Consider that you may not need the latest features server vendors are selling. Understand your

workload and then pick the right platform. Conventional wisdom has been to buy something

bigger than your current needs so you can protect your investment. However, with today’s rapid

advances in technology, this can lead to rapid obsolescence. You may find that a better

alternative is to buy for today's needs and then add more capacity as and when you need it.

Also, look for opportunities to use a newer two-socket quad-core platform to replace an older

four-socket dual-core, instead of overreaching with newer, more capable four-socket platforms

with four or six cores per socket. Of course, there is no single answer. Again, analyze your

needs and evaluate your alternatives.

Best Practice - Evaluate and test servers for performance, power, and total cost of ownership

A best practice, and what many large corporations are doing, is to build the procurement process around testing. Hardware teams run power and performance tests on all “short list” candidate servers, and then calculate the total cost of ownership, including power usage effectiveness (PUE) for energy costs. The key is to bring the testing in-house so you can evaluate performance and other criteria in your specific environment and on your workload. It is important not to rely solely on benchmark data, which may not be applicable to your needs and environment.

For smaller organizations that do not have resources to do their own evaluation and testing,

SPECpower_ssj2008 (the industry-standard SPEC benchmark that evaluates the power and

performance characteristics of volume server class computers) can be used in the absence of

anything else to estimate workload power. In addition to doing its own tests, Microsoft requests

this data from vendors in all of its RFPs. For more information, visit the Standard Performance

Evaluation Corp. web site at www.spec.org/specpower.
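As a simple way to fold power into the comparison, the sketch below (in Python; the wattages, PUE, and electricity rate are assumptions, and the average draw could come either from in-house testing or from a SPECpower-style report) estimates the annual energy cost of a candidate server so it can be included in the total cost of ownership:

    # Estimate the annual energy cost of a server for inclusion in TCO.
    # All input figures are assumptions for illustration.
    def annual_energy_cost(avg_watts, pue, dollars_per_kwh, hours_per_year=8760):
        kwh = avg_watts / 1000.0 * hours_per_year * pue   # PUE scales IT power to facility power
        return kwh * dollars_per_kwh

    candidate_a = annual_energy_cost(avg_watts=250, pue=1.6, dollars_per_kwh=0.10)
    candidate_b = annual_energy_cost(avg_watts=180, pue=1.6, dollars_per_kwh=0.10)
    print(f"Candidate A: ${candidate_a:,.2f} per year")
    print(f"Candidate B: ${candidate_b:,.2f} per year")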

Best Practice - Converge on as small a number of stock-keeping units (SKUs) as you can

A best practice, and what leading data center initiatives are doing, is to move to a server “standards” program where internal customers choose from a consolidated catalogue of servers. Narrowing the

number of SKUs can allow IT departments to make larger volume buys, thereby cutting capital

costs. However, perhaps equally important, it helps reduce operational expenditures and

complexities around installing and supporting a variety of models. This increases operational


consistency and results in better pricing, as long-term orders are more attractive to vendors.

Finally, it provides exchangeable or replaceable assets. For example, if the demand for one

online application decreases while another increases, it is easier to reallocate servers as

needed with fewer SKUs.

Best Practice - Take advantage of competitive bids from multiple manufacturers to foster innovation and reduce costs

Competition between manufacturers encourages thorough, ongoing analysis of proposals from multiple companies, with most of the weight placed on price, power, and performance. A best

practice is to develop hardware requirements and then share them with multiple manufacturers.

Then, work actively to develop an optimized solution. Energy efficiency, power consumption,

cost effectiveness and application performance per watt each play key roles in hardware

selection. The competition motivates manufacturers to be price competitive, drive innovation,

and provide the most energy efficient, lowest total cost of ownership (TCO) solutions. In many

cases, online services do not fully use the available performance. Hence, it makes sense to give

more weight to price and power. It is important to remember that power affects not only energy

consumption costs, but also data center capital allocation costs.
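One simple way to operationalize this weighting is a scoring model such as the sketch below (in Python; the weights, vendor names, and normalized scores are hypothetical and would come from your own testing and RFP responses):

    # Weighted scoring of competing server proposals, with most of the weight
    # on price, power, and performance. Weights and scores are hypothetical.
    weights = {"price": 0.4, "power": 0.3, "performance": 0.3}

    # Normalized scores between 0 and 1 (higher is better).
    proposals = {
        "vendor-a": {"price": 0.8, "power": 0.6, "performance": 0.7},
        "vendor-b": {"price": 0.6, "power": 0.9, "performance": 0.8},
    }

    for vendor, scores in proposals.items():
        total = sum(weights[k] * scores[k] for k in weights)
        print(f"{vendor}: weighted score {total:.2f}")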

Standards

To achieve sustainability while utilizing the various architectures described in the section titled “Infrastructure Architectures,” starting on page 80, a best practice is to adopt standards that allow the various technologies not only to interoperate, but also to keep the business from getting “locked in” to a particular vendor or strategy.

Focusing on cloud computing architectures, this technology is an approach to delivering IT services that promises to be highly agile and to lower costs for consumers, especially up-front

costs. This approach impacts not only the way computing is used, but also the technology and

processes that are used to construct and manage IT within enterprises and service providers.

Coupled with the opportunities and promise of cloud computing are elements of risk and

management complexity. Adopters of cloud computing should consider asking questions such

as:

• How do I integrate computer, network, and storage services from one or more cloud

service providers into my business and IT processes?

• How do I manage security and business continuity risk across several cloud providers?


• How do I manage the lifecycle of a service in a distributed multiple-provider environment

in order to satisfy service-level agreements (SLAs) with my customers?

• How do I maintain effective governance and audit processes across integrated

datacenters and cloud providers?

• How do I adopt or switch to a new cloud provider?

The definitions of cloud computing, including private and public clouds, Infrastructure as a

Service (IaaS), and Platform as a Service (PaaS) are taken from work by the National Institute

of Standards and Technology (NIST). In part, NIST defines cloud computing as “a model for

enabling convenient, on-demand network access to a shared pool of configurable computing

resources (for example, networks, servers, storage, applications, and services) that can be

rapidly provisioned and released with minimal management effort or service provider

interaction.”

NIST defines four cloud deployment models:

• Public clouds (cloud infrastructure made available to the general public or a large

industry group)

• Private clouds (cloud infrastructure operated solely for an organization)

• Community clouds (cloud infrastructure shared by several organizations)

• Hybrid clouds (cloud infrastructure that combines two or more clouds)

There is an Open Cloud Standards Incubator project under way that addresses all of the deployment models defined above. The focus of the project is the management aspects of IaaS,

with some work involving PaaS. These aspects include service-level agreements (SLAs), quality

of service (QoS), workload portability, automated provisioning, and accounting and billing.

The fundamental IaaS capability made available to cloud consumers is a cloud service.

Examples of services are computing systems, storage capacity, and networks that meet

specified security and performance constraints. Examples of consumers of cloud services are

enterprise datacenters, small businesses, and other clouds.

Many existing and emerging standards will be important in cloud computing. Some of these,

such as security-related standards, apply generally to distributed computing environments.


Others apply directly to virtualization technologies that are expected to be important building

blocks in cloud implementations.

The dynamic infrastructure enabled by virtualization technologies aligns well with the dynamic

on-demand nature of clouds. Examples of standards include SLA management and compliance,

federated identities and authentication, and cloud interoperability and portability.

Best Practice - Use standard interfaces to Cloud Architectures

There are multiple competing proposals for interfaces to clouds, and given the embryonic stage of the industry, it is important for users to insist that cloud providers use standard interfaces to provide flexibility for future extensions and to avoid becoming locked into a vendor. With the backing of key players in the industry, this aspect of portability is a primary value that standards-based cloud infrastructure offers. Three scenarios show how cloud consumers and providers may interact using interoperable cloud standards; a minimal provider-neutral code sketch follows the list below. These scenarios are examples only; many more possibilities exist.

1. Building flexibility to do business with a new provider without excessive effort or cost

2. Ways that multiple cloud providers may work together to meet the needs of a consumer

of cloud services

3. How different consumers with different needs can enter into different contractual

arrangements with a cloud provider for data storage services
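To illustrate the portability idea in code, the minimal Python sketch below defines a provider-neutral storage interface that application code targets, so a different provider's adapter can be swapped in later; the class and method names are hypothetical and are not part of any published cloud standard:

    # A provider-neutral storage interface that application code depends on,
    # so the provider can be swapped without rewriting the application.
    from abc import ABC, abstractmethod

    class CloudObjectStore(ABC):
        @abstractmethod
        def put(self, container: str, key: str, data: bytes) -> None: ...

        @abstractmethod
        def get(self, container: str, key: str) -> bytes: ...

    class InMemoryStore(CloudObjectStore):
        """Stand-in implementation; a real adapter would wrap a provider's API."""
        def __init__(self):
            self._objects = {}
        def put(self, container, key, data):
            self._objects[(container, key)] = data
        def get(self, container, key):
            return self._objects[(container, key)]

    def archive_report(store: CloudObjectStore, report: bytes) -> None:
        # Application logic sees only the interface, not a vendor SDK.
        store.put("reports", "2010-q1", report)

    archive_report(InMemoryStore(), b"quarterly sustainability report")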

As previously discussed, many standards bodies are rallying to generate a common standard

allowing varying cloud offerings to interoperate and federate. The DMTF, for example, is working

with affiliated industry organizations such as the Open Grid Forum, Cloud Security Alliance,

TeleManagement Forum (TMF), Storage Networking Industry Association (SNIA), and National

Institute of Standards and Technology (NIST). The DMTF has also established formal

synergistic relationships with other standards bodies. The intent of these alliance partnerships is

to provide mutual benefit to the organizations and the related standards bodies.

Alliances play an important role in helping the cloud community to provide a unified view of

management initiatives. For example, SNIA has produced an interface specification for cloud

storage. The Open Cloud Standards Incubator will not only leverage that work but also

collaborate with SNIA to ensure consistent standards. The Incubator expects to leverage


existing DMTF standards including Open Virtualization Format (OVF), Common Information

Model (CIM), CMDB Federation (CMDBf), CIM Simplified Policy Language (CIM-SPL), and the

DMTF's virtualization profiles, as well as standards from affiliated industry groups.

The ultimate goal of the Open Cloud Standards is to enable portability and interoperability

between private clouds within enterprises and hosted or public cloud service providers. A first

step has been initiated in the development of use cases, a service lifecycle, and reference

architecture. This is still a work in progress, but in order for any business to utilize cloud

architectures and leverage these architectures to achieve efficiency and sustainability, an

interface standard is required.

Security

With respect to sustainability and new technologies such as Cloud Computing, moving to a new business model like the cloud offers economies of scale and flexibility that are both

good and bad from a security point of view. The massive concentrations of resources and data

present a more attractive target to attackers, but cloud-based defenses can be more robust,

scalable and cost-effective. For a more detailed discussion focusing on the details on Cloud

security, please refer to the Proven Professional article titled “How to Trust the Cloud – “Be

Careful up There””.

The new cloud economic/sustainability model has also driven technical change in terms of:

Scale: Commoditization and the drive towards economic sustainability and efficiency have led

to massive concentrations of the hardware resources required to provide services. This

encourages economies of scale for all the kinds of resources required to provide computing

services.

Architecture: Optimal resource use demands computing resources that are abstracted from

underlying hardware. Unrelated customers who share hardware and software resources rely on

logical isolation mechanisms to protect their data. Computing, content storage and processing

are massively distributed. Global markets for commodities demand edge distribution networks

where content is delivered and received as close to customers as possible. This tendency

towards global distribution and redundancy means resources are usually managed in bulk, both

physically and logically.


Given the reduced cost and flexibility it brings, a migration to cloud computing is compelling for

many SMEs. However, there can be concerns for SMEs migrating to the cloud (also see “Best Practice - Assess cloud storage migration costs upfront” on page 94 for additional information), including the confidentiality of their information and liability for incidents involving the infrastructure.

Following are some best practices for managing trust in public and private clouds:

Best Practice – Determine if cloud vendors can deliver on their security claims

Because information security is only as strong as its weakest link, it is essential for

organizations to evaluate the quality of their cloud vendors. Having a high-profile “brand name”

vendor and an explicit SLA is not enough. Organizations must aggressively verify whether cloud

vendors can deliver upon and validate their security claims. Enterprises must make a firm

commitment that they will protect the information assets outside their corporate IT environment

to at least the same high standard of security that would apply if those same information assets

were preserved in-house. In fact, because these assets are stored outside the organization, it

could be argued that the standard for protection should be even higher. Security practitioners

must be particularly diligent in assessing the security profiles of those cloud vendors entrusted

with highly sensitive data or mission-critical functions.

Best Practice - Adopt federated identity policies backed by strong authentication practices

A federated identity allows a user to access various web sites, enterprise applications and cloud

services using a single sign-on. Federated identities are made possible when organizations

agree to honor each other’s trust relationships, not only in terms of access but also in terms of

entitlements. Establishing “ties of federation” agreements between parties to share a set of

policies governing user identities, authentication and authorization, provides users with a more

convenient and secure way of accessing, using and moving between services, whether those

services reside in the enterprise or in a cloud. Federated identity policies go hand-in-hand with

strong authentication policies. Whereas federation policies bridge the trust gap between

members of the federation, strong authentication policies bridge the security gap, creating the

secure access infrastructure to bring all members of the community together.


The federation of identity and authentication policies will eventually become standard practice in

the cloud, not just because users will demand it but as a matter of convenience. For

organizations, federation also delivers cost benefits and improved security. Companies can

centralize the access and authentication systems maintained by separate business units. They

can reduce potential points of threat, such as unsafe password management practices, as users

will no longer have to enter credentials and passwords in multiple places. For federated identity

policies to become more widely used, the information technology and security industry will have

to knock down barriers to implementing such policies. So far, it appears the barriers are not

economic or technological, but trust-related. Federated identity models, like the strong

authentication services that enforce them, are only as strong as their weakest link. Each

member of the federation must be trusted to comply with the group’s security policies.

Expanding the circle of trust means expanding the threat surface where problems could arise

and increasing the potential for single points of failure in the community of trust. The best way of

ensuring that trust and security are preserved within communities of federation is to require all

community members to enforce a uniform, acceptable level of strong authentication. Some IT

industry initiatives are attempting to establish security standards that facilitate federated

identities and authentication. For instance, the OASIS Security Services Technical Committee

has developed the Security Assertion Markup Language (SAML), an XML-based standard for

exchanging authentication and authorization data between security domains, to facilitate web

browser single sign-on. SAML appears to be evolving into the definitive standard for enterprises

deploying web single sign-on solutions.
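To show the kind of assertion data exchanged in such a single sign-on flow, the minimal Python sketch below extracts the subject from a simplified, hypothetical SAML assertion using the standard library's XML parser; a real relying party must also verify the XML signature, validity window, and audience restriction before trusting the assertion:

    # Extract the subject NameID from a simplified, hypothetical SAML assertion.
    import xml.etree.ElementTree as ET

    SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

    assertion_xml = """
    <saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
      <saml:Subject>
        <saml:NameID>user@example.com</saml:NameID>
      </saml:Subject>
    </saml:Assertion>
    """

    root = ET.fromstring(assertion_xml)
    name_id = root.find("./saml:Subject/saml:NameID", SAML_NS)
    print("Authenticated subject:", name_id.text)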

Best Practice – Preserve segregation of administrator duties

While data isolation and preventing data leakage are essential, enterprise systems

administrators still need appropriate levels of access to manage and configure their company’s

applications within the shared infrastructure. Furthermore, in addition to systems administrators

and network administrators, private clouds introduce a new function into the circle of trust: the

cloud administrator. Cloud administrators, the IT professionals working for the cloud provider,

need sufficient access to an enterprise’s virtual facilities to optimize cloud performance while

being prevented from tapping into the proprietary information they are hosting on behalf of their

tenants. Enterprises running private clouds on hosted servers should consider requiring that


their data center operator disable all local administration of hypervisors, using a central

management application instead to better monitor and reduce risks of unauthorized

administrator access.

As an added security measure, enterprises should preserve a separation of administrator duties

in the cloud. The temptation may be to consolidate duties, as many functions can be centrally

administered from the cloud using virtualization management software. However, as with

physical IT environments, in which servers, networks and security functions are split among

several administrators or departments, segregating those functions within the cloud can provide

added security by decentralizing control. Furthermore, organizations can use centralized

virtualization management capabilities to limit administrative access, define roles and

appropriately assign privileges to individual administrators. By segregating administrator duties

and employing a centralized virtualization management console, organizations can safeguard

their private clouds from unauthorized administrator access.

Best Practice - Set clear security policies

Set clear policies to define trust and be equipped to enforce them. In a private cloud, trust

relationships are defined and controlled by the organization using the cloud. While every party in

the trust relationship will naturally protect information covered by government privacy and

compliance regulations, employee tax ID numbers, proprietary financial data, etc., organizations

will also need to set policies for how other types of proprietary data are shared in the cloud. For

instance, a corporation may classify information such as purchase orders or customer

transaction histories as highly sensitive, even as trade secrets, and may establish risk-based

policies for how cloud providers and business partners store, handle and access that data

outside the enterprise. For “Trust” relationships to work, there must be clear, agreed-upon

policies for what information is privileged, how that data is managed and how cloud providers

will report and validate their performance in enforcing the standards set by the organization.

These agreed-upon standards must be enforced by binding service level agreements (SLAs)

that clearly stipulate the consequences of security breaches and service agreement violations.

Best Practice - Employ data encryption and tokenization

The cloud provider sometimes stores enterprise data used in cloud applications, for example in online backups. Encrypting data is often the simplest way to protect proprietary information against

unauthorized access, particularly by administrators and other parties within the cloud.


Organizations should encrypt data residing with or accessible to cloud providers. As in

traditional enterprise IT environments, organizations should encrypt data in applications at the

point of ingest. Additionally, they should ensure cloud vendors support data encryption controls

that secure every layer of the IT stack. Segregate sensitive data from the users or identities they

are associated with as an additional precaution to secure data residing in clouds. For instance,

companies storing credit card data often keep credit card numbers in separate databases from

where cardholders’ personal data is stored, reducing the likelihood that security breaches will

result in fraudulent purchases. Companies also can protect sensitive cardholder information in

the cloud through a form of data masking called tokenization. This method of securing data

replaces the original number with a token value that has no explicit relationship to the original

value. The original card number is kept in a separate, secure database called a vault.
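A minimal sketch of this tokenization approach follows (in Python; in practice the vault would be an encrypted, access-controlled database rather than an in-memory dictionary, and the card number shown is a standard test number):

    # Replace a card number with a random token and keep the mapping in a vault.
    import secrets

    vault = {}  # token -> original card number, stored separately from cardholder data

    def tokenize(card_number: str) -> str:
        token = secrets.token_hex(8)   # random value with no relationship to the original
        vault[token] = card_number
        return token

    def detokenize(token: str) -> str:
        return vault[token]            # only the vault can reverse the mapping

    token = tokenize("4111111111111111")
    print("Stored with the order record:", token)
    print("Recovered from the vault:", detokenize(token))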

Best Practice - Manage policies for provisioning virtual machines

A best practice is to secure the virtual infrastructure: companies using private clouds must be

able to oversee how virtual machines are provisioned and managed within their clouds. In

particular, managing virtual machine identities is crucial, as they are used for basic

administrative functions, such as identifying the systems and people with which virtual machines

are physically associated, and moving software to new host servers. Organizations establishing

a security position based on virtual machine identities should know how those identities are

created, validated and verified, and what safety measures their cloud vendors have taken to

safeguard those identities. Additionally, information security leaders should set their identity

access and management policies to grant all users, whether human or machine, the lowest level

of access needed for each to perform their authorized functions within the cloud.
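A minimal sketch of such a least-privilege check follows (in Python; the role names and allowed actions are hypothetical examples, not any product's built-in roles):

    # Least-privilege check for virtual machine provisioning requests.
    ROLE_PERMISSIONS = {
        "vm-operator":    {"start_vm", "stop_vm"},
        "vm-provisioner": {"create_vm", "start_vm", "stop_vm"},
        "auditor":        {"read_logs"},
    }

    def is_allowed(role: str, action: str) -> bool:
        return action in ROLE_PERMISSIONS.get(role, set())

    print(is_allowed("vm-operator", "create_vm"))     # False: not granted to this role
    print(is_allowed("vm-provisioner", "create_vm"))  # True: lowest role that needs it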

Best Practice – Require transparency into cloud operations to ensure multi-tenancy and data isolation

In the virtualized environment of the cloud, many different companies, or “tenants,” may share the same physical computing, storage and network infrastructure. Cloud providers need to ensure isolation of access so that software, data and services can be safely partitioned within the

cloud and that tenants sharing physical facilities cannot tap into their neighbors’ proprietary

information and applications.

The best way to ensure secure data isolation and multi-tenancy is to partition access to

appropriate cloud resources for all tenants. Cloud vendors should furnish log files and reports of


user activities. Some cloud vendors are able to provide an even higher degree of visibility

through applications that allow enterprise IT administrators to monitor the data traversing their

virtual networks and to view events within the cloud in near real time. Specific performance

metrics should be written into managed service agreements and enforced with financial

consequences if those agreed-upon performance conditions are not upheld.

Organizations and businesses with private clouds should work with cloud vendors to ensure

transferability of security controls. In other words, if data or virtual resources are moved to

another server or to a backup data center, the security policies established for the original

server or primary data center should automatically be implemented in the new locations.

Governance

The ability to govern and measure enterprise risk within a company-owned data center is difficult and, surprisingly, still seems to be in the early stages of maturation in most organizations.

Cloud computing brings new unknowns to governance and enterprise risk.

Online agreements and contracts of this type for the most part are still untested in a court of law

and consumers have yet to experience an extended outage of services that they may someday

determine to need on a 24/7 basis. Questions still remain about the ability of user organizations

to assess the risk of the provider through onsite assessments.

The storage and use of information considered sensitive by nature may be allowed, but it could

be unclear as to who is responsible in the event of a breach. If both the code authored by the

user and the service delivered by the provider are flawed, who is responsible? Current statutes

cover the majority of the United States but how are the laws of foreign countries, especially the

European Union, to be interpreted in the event of disputes? Many questions remain with respect

to Cloud Governance and Enterprise Risk.

Best Practices – Do your due diligence on your SLAs

Cloud consumers considering using Cloud Services should perform in-depth due diligence prior

to the execution of any service “Terms of Service,” “Service Level Agreements” (SLAs), or use.

This due diligence should assess the arrangement of risks known at present and abilities of

partners to work within and contribute to the customer’s enterprise risk management program

for the length of the engagement. Some recommendations include:


1. Consider creating a Private (Virtual) Cloud or a Hybrid Cloud that provides the

appropriate level of controls while maintaining risk at an acceptable level.

2. Review what type of provider you prefer, such as software, infrastructure or platform.

Gain clarity on how pricing is performed with respect to bandwidth and CPU utilization in

a shared environment. Compare usage as measured by the cloud service provider with

your own log data, to ensure accuracy.

3. Request clear documentation on how the facility and services are assessed for risk

and audited for control weaknesses, the frequency of assessments and how control

weaknesses are mitigated in a timely manner. Ask the service provider if they make the

results of risk assessments available to their customers.

4. Require the definition of what the provider considers critical success factors, key

performance indicators, and how they measure them relative to IT Service Management

(Service Support and Service Delivery).

5. Require a listing of all provider third party vendors, their third party vendors, their roles

and responsibilities to the provider, and their interfaces to your services.

6. Request divulgence of incident response, recovery, and resiliency procedures for any

and all sites and associated services.

7. Request a review of all documented policies, procedures and processes associated

with the site and associated services assessing the level of risk associated with the

service.

8. Require the provider to deliver a comprehensive list of the regulations and statutes

that govern the site and associated services, and how compliance with these items is

executed.

9. Perform full contract or terms of use due diligence to determine roles, responsibilities,

and accountability. Ensure legal counsel review, including an assessment of the

enforceability of local contract provisions and laws in foreign or out-of-state jurisdictions.

10. Determine whether due diligence requirements encompass all material aspects of

the cloud provider relationship, such as the provider’s financial condition, reputation

(e.g., reference checks), controls, key personnel, disaster recovery plans and tests,

insurance, communications capabilities and use of subcontractors.

Request a scope of services including:

• Performance standards


• Rapid provisioning – de-provisioning

• Methods of multi-tenancy and resource sharing

• Pricing

• Controls

• Financial and control reporting

• Right to audit

• Ownership of data and programs

• Procedures to address a Legal Hold

• Confidentiality and security

• Regulatory compliance

• Indemnification

• Limitation of liability

• Dispute resolution

• Contract duration

• Restrictions on, or prior approval for, subcontractors

• Termination and assignment, including timely return, of data in a machine-

readable format

• Insurance coverage

• Prevailing jurisdiction (where applicable)

• Choice of Law (foreign outsourcing arrangements)

• Regulatory access to data and information necessary for supervision

• Business Continuity Planning.

Consumers, Businesses, Cloud Service Providers, and Information Security and Assurance professionals must collaborate to focus on the potential issues and solutions listed above, and to discover the holes. The Cloud Security Alliance (CSA), one of the standards bodies outlined in the section titled "Standards", starting on page 135, calls for collaboration in setting standard terms and requirements that drive governance and enterprise risk issues to a mature and acceptable state allowing for negotiation. The CSA is working to address these issues so businesses can take full advantage of the nimbleness, expansive service options, flexible pricing and cost savings of Cloud Services to achieve a sustainable IT solution.


Compliance

With cloud computing resources as a viable and cost effective means to outsource entire systems and increase sustainability, maintaining compliance with your security policy and the various regulatory and legislative requirements to which your company has adhered can become even more difficult to demonstrate. The cost of auditing compliance is likely to increase without proper planning. With that in mind, it is imperative to consider all of your requirements and options prior to progressing with cloud computing plans [6].

Best Practice - Know Your Legal Obligations
It is a best practice for your organization to fully understand all of the necessary legal requirements. The regulatory landscape is typically dictated by the industry in which you reside. Depending on where your organization operates, you are likely subject to a lengthy collection of legislation that governs how you treat specific types of data, and it is your obligation to understand it and remain compliant. Without understanding its obligations, an organization cannot formulate its data processing requirements. It is a best practice to engage internal auditors, external auditors, and legal counsel to ensure that nothing is left out.

Best Practice - Classify / Label your Data & Systems
Your company must classify data to adequately protect it. Considering the regulatory and legislative requirements discussed earlier, your organization needs to classify its data to isolate the data that requires the most stringent protection from public or otherwise less sensitive data. The data and systems must also be clearly labeled, and the processes surrounding the handling of the data formalized. At this point, your organization can consider cloud-computing resources for the data and systems not classified at a level that is subject to burdensome regulatory requirements.
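As a minimal illustration of how such a classification can drive placement decisions, the sketch below maps classification labels to permissible hosting targets. The labels, the targets, and the rule that regulated data stays off the public cloud are assumptions of this sketch, not prescriptions from this paper.

# Illustrative sketch: map data classification labels to permissible hosting targets.
# The labels, targets, and policy below are assumptions for demonstration only.
from enum import Enum

class Classification(Enum):
    PUBLIC = 1
    INTERNAL = 2
    REGULATED = 3      # e.g. data subject to privacy or industry regulation

class CloudTarget(Enum):
    PUBLIC_CLOUD = "public cloud"
    PRIVATE_CLOUD = "private cloud"
    ON_PREMISE = "on-premise only"

# Example policy: only data below the regulated threshold may leave the private cloud.
PLACEMENT_POLICY = {
    Classification.PUBLIC: CloudTarget.PUBLIC_CLOUD,
    Classification.INTERNAL: CloudTarget.PRIVATE_CLOUD,
    Classification.REGULATED: CloudTarget.ON_PREMISE,
}

def placement_for(label: Classification) -> CloudTarget:
    """Return the hosting target allowed for a given data classification."""
    return PLACEMENT_POLICY[label]

if __name__ == "__main__":
    for label in Classification:
        print(f"{label.name:>9} -> {placement_for(label).value}")

The point of the sketch is simply that the classification exercise produces a machine-checkable policy, so cloud eligibility is decided by rule rather than case by case.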

Best Practice - External Risk Assessment
A third party risk assessment of the systems and data being considered for cloud resources should be conducted to ensure all risks are identified and accounted for. This includes a Privacy Impact Assessment (PIA) as well as other typical Threat Risk Assessments (TRA). Impacts to other internal systems, such as Data Leakage Protection (DLP) systems, should also be considered. Be prepared to discover extensive risks with costly remediation strategies when considering cloud computing for regulated data.


Best Practice - Do Your Diligence / External Reports
At a minimum, you need to understand the security of the organization hosting your cloud computing resources and what they are prepared to offer. If you have very stringent security requirements, you may want to mandate that your cloud provider be certified to ISO/IEC 27001:2005 annually. It is also likely that your organization will need to improve its own processes and operational security maturity to manage your cloud provider to that level of security. It is important to use the risk assessment and data classification exercises previously mentioned to provide the amount of security required to ensure the appropriate confidentiality, integrity and availability of your data and systems without overspending. If ISO/IEC 27001:2005 certification is too costly or not available within the class of service you seek, the assurance statement most likely to be available is the Statement on Auditing Standards (SAS) 70 Type II. Work these requirements into the contract requirements and ensure that you see a previous certificate of compliance prior to formalizing an agreement.

Similarly, the business should demand the results of external security scans and penetration tests on a regular basis due to the unique attack surfaces associated with cloud computing. The value of certifications such as ISO/IEC 27001:2005 or audit statements like SAS 70 is the source of significant debate among security professionals. Skeptics will point out that, through the scoping process, an organization can exclude critical systems and processes from scrutiny and present an unrealistic picture of organizational security. This is a legitimate issue, and our recommendation is that domain experts develop standards relating to scoping these and other certifications, so that over time, the customer will expect broad scoping. Customers must demand an ISO certification based upon a comprehensive security program. In the end, this will benefit the cloud provider as well, as a certifiably robust security program will pay for itself in reduced requests for audit.

Best Practice - Understand Where the Data Will Be!
If your company is considering using cloud-computing resources for regulated data, it is imperative to understand where the data will be processed and stored under any and all situations. Of course, this task is far from simple for all parties, including cloud-computing providers. However, with respect to legislative compliance surrounding where data can and cannot be transmitted or stored, the cloud computing provider will need to be able to demonstrate assurance that the data will be where they say it is, and only there. This applies to third parties and other outsourcers used by the cloud computing provider. If the provider has reciprocal arrangements or other types of potential outsourcing of the resources, strict attention to how this data is managed, handled, and located must extend to that third party arrangement. If the potential provider you have engaged cannot do this, investigate others. As this requirement becomes more prevalent, the option is likely to become available. Remember, if that assurance cannot be provided, some of your data and processing cannot use public cloud computing resources as defined in Domain 1 without exception. Private clouds may be the appropriate option in this case.
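One small illustration of how such residency assurance might be checked is sketched below, assuming you maintain a list of permitted jurisdictions per data set and the provider can report where each replica resides. The data set names and region codes are hypothetical.

# Illustrative data-residency check. Region codes, data-set names, and the idea
# that the provider reports replica locations are assumptions of this sketch.
ALLOWED_REGIONS = {
    "customer_records": {"EU"},          # e.g. must stay within the EU
    "marketing_assets": {"EU", "US"},    # less sensitive, may move
}

def residency_violations(data_set: str, reported_regions: set[str]) -> set[str]:
    """Return the regions reported by the provider that the policy does not allow."""
    return reported_regions - ALLOWED_REGIONS.get(data_set, set())

if __name__ == "__main__":
    # Suppose the provider's report shows a replica in an unexpected region.
    print(residency_violations("customer_records", {"EU", "APAC"}))  # {'APAC'}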

Best Practice - Track your applications to achieve compliance
To manage an application effectively, you have to know where it is. Establish a "chain of custody" that enables you to see where applications are running and manage them against any legal concerns. The chain of custody includes identifying the machine the application is installed on, what data is associated with that application, who is in control of the machine, and what controls are in place.

With server virtualization, applications move among different machines, and without careful control over the chain of custody, you can expose an application or its data to circumstances where a high-security application is shifted into a low-security environment. Before you change anything in the environment, consider whether the change will create unauthorized access to the application or related data.
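The sketch below suggests one possible shape for such a chain-of-custody record; the field names mirror the items listed above, while the class structure itself is an assumption of this illustration rather than anything defined in this paper.

# Illustrative "chain of custody" record for a virtualized application.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CustodyEvent:
    host: str               # machine the application is installed on
    data_stores: list[str]  # data associated with the application
    controller: str         # who is in control of the machine
    controls: list[str]     # security controls in place (e.g. "encrypted", "VLAN-isolated")
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class ChainOfCustody:
    application: str
    events: list[CustodyEvent] = field(default_factory=list)

    def record_move(self, event: CustodyEvent) -> None:
        """Append a new custody event whenever the VM or its data moves."""
        self.events.append(event)

    def current_controls(self) -> list[str]:
        """Controls in force at the application's current location."""
        return self.events[-1].controls if self.events else []

Comparing current_controls() before and after a proposed move is one simple way to flag a shift from a high-security to a low-security environment before it happens.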

Best Practice - With off-site hosting, keep your assets separate
If a third party controls or hosts one of your servers, keeping your operating assets separate from those of the host's other customers is critical to avoid potential liability for security exposures, including improper access. For hosted applications, you also need to ensure that settings for one application cannot drift or migrate into the control of another, so no other host customers can access your data.

To do this, you need to evaluate how the host distributes and controls applications and data stored in its server array. Depending on the configurations of the hosts and client machines, settings and programmatic adjustments can trickle down and install in an unexpected manner.


This is why you need to make sure that appropriate security controls are in place. You do not want unexpected updates or configuration controls to gain control over your data or application versions. Make sure your contract with the hosting company details the technical specifications that protect your data and users, and that the hosting company provides testing and monitoring reports that show compliance with your controls.

Best Practice - Protect yourself against power disruptions
Any CIO overseeing a data center knows that power outages are a common occurrence. The reason is simple: the power needed to run and cool a data center is increasingly vulnerable. A 2006 AFCOM survey reported that 82.5% of data center outages in a five-year period were power-related.

If your data center has experienced power-related business interruptions, consider drafting contract terms for your own customers that protect you from liability if the power supply to your facilities is disrupted or lost. You may want more than general "acts of God" clauses in your customer-facing agreements.

If you are considering a shift to a hosted extension of your data center, you need to understand your hosted site's power supply and capabilities. Make sure your contract precisely defines those capabilities and allocates the risks for any service disruptions that occur. Account for this in your own customer contracts as well. Draft them carefully to ensure that power disruptions to your suppliers do not expose you to liability that you would avoid if your data center were in-house.

Best Practice - Ensure vendor cooperation in legal matters
What happens when virtualization and compliance collide and the matter ends up in court? When a legal collision between virtualization and e-discovery occurs, such as when a third-party host is unable to produce documents a business needs for a legal action, a service provider can be a significant risk variable.

To avoid this scenario, it is best practice to obtain the provider's commitment to cooperate in legal matters. This must be done contractually with the third-party data custodian.


In conclusion, virtualizing any aspect of your data center changes the game for compliance and e-discovery. It is best practice to make sure you know exactly where your applications are running, that your server controls are intact, and that your service provider contract provisions are "virtualization-friendly." There are real benefits to a virtual data center or cloud; address the issues above so you do not have to worry about whether your compliance controls are falling out of the cloud.

Profitability

The business of sustainability in Information Technology is the catalyst for sustainable and profitable growth. To put it another way, profitability and sustainability go hand in hand.

There is a new definition of profitability that has been the mantra of Sustainability for some time. It is called the "Triple Bottom Line." The three bottom lines cover social, environmental, and economic extents, and you should align these extents to profitability. A more detailed definition of the extents is shown in Table 3 - Extent of Sustainability to achieve profitability, below. In principle, the basic idea is as simple as it is compelling: resources may only be used at a rate at which they can be replenished naturally. It is obvious that the way in which the industrialized world operates today is not sustainable and that change is imperative.

Table 3 - Extent of Sustainability to achieve profitability

Social:
• Labor, health, and safety: Address occupational health, safety, working conditions, and so on
• Human rights and diversity: Ensure compliance with human rights and organizational diversity
• Product safety: Ensure consumer safety
• Retention and qualification: Attract, foster, and retain top talent by fostering a “green” profile

Environmental:
• Energy optimization: Manage energy costs via planning, risk management, and process improvements
• Water optimization: Ensure sustainable and cost-effective water supply
• Raw materials optimization: Control raw material–related costs and manage price volatility
• Air and climate change: Reduce or account for greenhouse gas emissions
• Sewage: Manage sewage emissions and impact on water supply
• Land pollution: Avoid or reduce land pollution
• Waste: Manage waste in a sustainable way
• Sustainable product life cycle: Sustainably develop new products and manage life cycle

Economic:
• Sustainability performance management: Provide key performance indicators to manage sustainability efforts
• Sustainable business opportunity: Enable new goods and services for customers
• Emission trading: Ensure financial optimization (cap and trade)
• Reporting: Comply with external demands for adequate reporting and disclosures

Sustainability is very relevant not only in times of growth, but especially during times of economic challenge. The main drivers of sustainability do not change:

• Regulation will continue to increase. That is specifically true in the case of carbon emissions, but will likely include many other environmental and social aspects in the future.
• Energy prices will continue to fluctuate and, with economic recovery, rise sharply and increase cost pressure.
• Consumer awareness will continue to intensify and force transparency and optimization across entire business networks and supply chains.

Business and Profit objectives to achieve Sustainability
Looking at the big picture, the new sustainability model is all about being environmentally friendly and making money. This sounds great to CEOs thinking about sustainability, but what are companies actually doing to achieve it? The backbone of most programs is built on best Business Practices consisting of Awareness and Transparency, Efficiency Improvements, Innovation, and Mitigation.

Best Practice - Consumer Awareness and Transparency
Consumer Awareness and Transparency communicates the value of your sustainability initiative and is key to building brand equity. Transparency brings accountability to the program and avoids "green-washing." In addition to getting the word out, awareness programs are often promoted as educational, providing a series of sustainability best practices to improve industry at large. Whether you view this through a lens of altruism or self-interest, the net result promotes and advances sustainable practices.


Best Practice – Implement Efficiency Improvement
Efficiency improvement, doing more with less, is a central theme in most sustainability programs. Efficiencies improve products or processes, typically without making major changes to the underlying product or technology. Modifying an engine design to be 20% more efficient is an example of product efficiency, whereas redesigning packaging to reduce waste, or transporting components and finished products more efficiently, are examples of process efficiency. The effects of efficiencies are additive, each contributing to the sustainability goals of the company, driving the bottom line and creating potential for increased brand value.

Best Practice - Product Innovation
Product Innovation is often more challenging than efficiency enhancements because it results in fundamental changes to products and processes. Innovation tends to have a higher barrier to entry than efficiency programs, requiring ideas that challenge the status quo along with significant R&D and marketing investments. The risks of failure for both product development and, ultimately, customer acceptance are higher for innovations, but so too are the potential rewards. The development of thin-film photovoltaic solar cells and algae-based bio-diesel, both with the potential to significantly change the economics of renewable energy, are examples of innovation.

Best Practice - Carbon Mitigation
Carbon Mitigation offsets greenhouse gas (GHG) emissions through projects that remove carbon from the atmosphere. The Kyoto Protocol's cap and trade mechanism created the framework for trading carbon allowances as a way for companies to meet mandatory GHG emissions targets. It also paved the way for a voluntary carbon offset market targeted at companies without mandatory requirements or those seeking to be carbon-neutral. Carbon-neutral status is also becoming popular for individuals, with a number of sites and affinity credit cards catering to this desire.

Information Technology Sector Initiatives
IT has become pervasive across all sectors and, although invisible in many ways, it forms a service backbone for almost all products. People rarely think about it, but computers and communications are invoked for every cell phone call, every online purchase, every item shipped by a courier, every Google search and every invoice processed. In short, everything in the modern economy has an associated IT carbon footprint.


Network and Server Infrastructure
Until recently, computing and communications were all about capacity and speed, with little thought to energy requirements. Following Moore's Law, computing power and speed have grown consistently to the point where, in some markets, it can cost more to power a server than to purchase it. In response, innovation has turned to designing low power chips that deliver high performance without the energy penalty. Network and server infrastructure manufacturers are focusing on reducing energy, space and cooling requirements with a new breed of high-density, high-capacity platforms using state of the art energy-efficient chipsets and components, not to mention the Cloud and the new paradigms that arise.

The proposition is fundamentally ROI based and is especially attractive to businesses that have hit power, size, or cooling barriers in their existing installations. As described in the previous sections, the technologies outlined below will go a long way toward achieving sustainability.

Best Practice – Virtualization
When analyzing efficiency improvements, one option is to eliminate a facility altogether. Virtualization favors consolidating many distributed datacenters into a specially designed, centralized "Cloud" facility. An example of this is Google's advanced data center facility, affectionately known as a Googleplex, which is reputed to be among the most efficient and economical datacenters. While some argue the Googleplex is search specific, the concept of achieving Google economies of scale for applications across the board holds merit.

Best Practice - Recycling e-Waste
Equipment vendors are increasing e-waste collection and recycling in efforts to reduce heavy metal and toxin levels in local landfills. Companies such as Dell and HP have long-standing e-cycling programs as part of their cradle-to-grave sustainability programs.

Cloud Profitability and Economics

Cloud Computing, the long-held dream of computing as a utility, has the potential to transform a large part of the IT industry, making software even more attractive as a service and shaping the way IT hardware is designed and purchased. Developers with innovative ideas for new Internet services no longer require large capital outlays in hardware to deploy their service, or the human expense to operate it. They need not be concerned about over-provisioning for a service whose popularity does not meet their predictions, thereby wasting costly resources, or under-provisioning for one that becomes wildly popular, thereby missing potential customers and revenue. Moreover, companies with large batch-oriented tasks can get results as quickly as their programs can scale, since using 1000 servers for one hour costs no more than using one server for 1000 hours. This elasticity of resources, without paying a premium for large scale, is unprecedented in the history of IT.

Cloud Computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services. The services themselves have long been referred to as Software as a Service (SaaS). The datacenter hardware and software is what we call a Cloud. When a Cloud is made available in a pay-as-you-go manner to the public, we call it a Public Cloud; the service being sold is Utility Computing. We use the term Private Cloud to refer to internal datacenters of a business or other organization, not made available to the public. Therefore, Cloud Computing is the sum of SaaS and Utility Computing, but does not include Private Clouds. People can be users or providers of SaaS, or users or providers of Utility Computing. We focus on SaaS Providers (Cloud Users) and Cloud Providers, who have received less attention than SaaS Users. From a hardware point of view, there are three aspects to Cloud Computing.

1. The illusion of infinite computing resources available on demand, thereby eliminating the need for Cloud Computing users to plan ahead for provisioning

2. The elimination of an up-front commitment by Cloud users, thereby allowing companies to start small and increase hardware resources only when their needs increase

3. The ability to pay for use of computing resources on a short-term basis as needed (e.g., processors by the hour and storage by the day), and release them as needed, thereby rewarding conservation by letting machines and storage go when they are no longer useful

You can argue that the construction and operation of extremely large-scale, commodity-computer datacenters at low-cost locations was the key enabler of Cloud Computing. These datacenters uncovered decreases in the cost of electricity, network bandwidth, operations, software, and hardware by factors of 5 to 7, available at these very large economies of scale. These factors, combined with statistical multiplexing to increase utilization compared to a private cloud, meant that cloud computing could offer services below the costs of a medium-sized datacenter and yet still make a profit.

Any application needs a model of computation, a model of storage, and a model of communication. The statistical multiplexing necessary to achieve elasticity and the illusion of infinite capacity requires each of these resources to be virtualized to hide the implementation of how they are multiplexed and shared.

One view is that different utility-computing offerings will be distinguished based on the level of abstraction presented to the programmer and the level of management of the resources. Amazon EC2 is at one end of the spectrum. An EC2 instance looks much like physical hardware, and users can control nearly the entire software stack, from the kernel upwards. This low level makes it inherently difficult for Amazon to offer automatic scalability and failover, because the semantics associated with replication and other state management issues are highly application-dependent.

At the other extreme of the spectrum are application domain specific platforms, such as Google AppEngine. AppEngine is targeted exclusively at traditional web applications, enforcing an application structure of clean separation between a stateless computation tier and a stateful storage tier. AppEngine's impressive automatic scaling and high-availability mechanisms, and the proprietary MegaStore data storage available to AppEngine applications, all rely on these constraints. Applications for Microsoft's Azure are written using the .NET libraries and compiled to the Common Language Runtime, a language-independent managed environment. Therefore, Azure is intermediate between application frameworks like AppEngine and hardware virtual machines like EC2.

From a business and profitability perspective, when is Utility Computing preferable to running a Private Cloud?

Case 1: Demand for a service varies with time
Provisioning a data center for the peak load it must sustain a few days per month leads to underutilization at other times. Instead, Cloud Computing lets an organization pay by the hour for computing resources, potentially leading to cost savings even if the hourly rate to rent a machine from a cloud provider is higher than the rate to own one.


Case 2: Demand is unknown in advance
For example, a web startup will need to support a spike in demand when it becomes popular, followed potentially by a reduction once some visitors turn away.

Case 3: Batch Processing
Organizations that perform batch analytics can use the "cost associativity" of cloud computing to finish computations faster: using 1000 EC2 machines for 1 hour costs the same as using 1 machine for 1000 hours.

For the first case of a web business with varying demand over time and revenue proportional to user hours, the tradeoff is shown in Equation 9 – Cloud Computing - Cost Advantage, below. Cloud Computing is more profitable when the following is true:

Equation 9 – Cloud Computing - Cost Advantage

Profit From Using Cloud Computing ≥ Profit From Using a Fixed Capacity Data Center

Equation 10 – Cloud Computing - Cost tradeoff for demand that varies over time

UserHours_Cloud × (NetRevenue − Cost_Cloud) ≥ UserHours_DataCenter × (NetRevenue − Cost_DataCenter / Utilization)

In Equation 10 – Cloud Computing - Cost tradeoff for demand that varies over time, above, the left-hand side multiplies the net revenue per user-hour by the number of user-hours, giving the expected profit from using Cloud Computing. The right-hand side performs the same calculation for a fixed-capacity datacenter by factoring in the average utilization, including nonpeak workloads. Whichever side is greater represents the opportunity for higher profit.
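For illustration only, the tradeoff in Equation 10 can be evaluated directly in a few lines; the revenue, cost, and utilization figures below are placeholders, not data from this paper.

# Sketch of the Equation 10 tradeoff. All input values are illustrative placeholders.
def cloud_profit(user_hours: float, net_revenue_per_hour: float, cost_per_hour: float) -> float:
    """Left-hand side: expected profit from running in the cloud."""
    return user_hours * (net_revenue_per_hour - cost_per_hour)

def datacenter_profit(user_hours: float, net_revenue_per_hour: float,
                      cost_per_hour: float, utilization: float) -> float:
    """Right-hand side: profit from a fixed-capacity datacenter, with cost
    inflated by average utilization (idle capacity is still paid for)."""
    return user_hours * (net_revenue_per_hour - cost_per_hour / utilization)

if __name__ == "__main__":
    # Placeholder inputs: $1.00 net revenue per user-hour, cloud at $0.12/hr,
    # owned servers at $0.08/hr effective cost but only 40% average utilization.
    lhs = cloud_profit(10_000, 1.00, 0.12)
    rhs = datacenter_profit(10_000, 1.00, 0.08, 0.40)
    print("Cloud is more profitable" if lhs >= rhs else "Datacenter is more profitable")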

Table 4 - Best Practices in Cloud Architectures, shown below, defines best practices to achieve growth of Cloud Computing Architectures. The first three best practices concern adoption and the next five deal with growth. The last two are marketing related. All best practices should aim at horizontal scalability of virtual machines over the efficiency of a single VM.


Table 4 - Best Practices in Cloud Architectures

Best Practice | Solution
Increase Availability of Service | Use multiple Cloud Providers; use Elasticity to prevent DDoS (Distributed Denial of Service) attacks
Mitigate Data Lock-In | Standardize APIs; compatible software to enable Surge Computing
Data Confidentiality and Auditability | Deploy encryption, VLANs, firewalls; geographical data storage
Reduce Data Transfer Bottlenecks | FedExing disks; data backup/archival; higher-bandwidth switches
Minimize Performance Unpredictability | Improved VM support; flash memory; gang-schedule VMs
Implement a Scalable Storage Solution | Data de-duplication, tiered storage and other scalable storage solutions
Reduce or Minimize Bugs in Large-Scale Distributed Systems | Implement fully functional fault isolation and root cause analysis, as well as distributed VMs
Implement Architecture to Scale Quickly | Implement an auto-scaler that relies on machine learning; snapshots to encourage Cloud Computing conservationism
Reputation Fate Sharing | Offer reputation-guarding services like those for email
Implement Best-of-Breed Tiered Software Licensing | Pay-for-use licenses; bulk use sales, etc.

In addition:

1. Applications need to both scale down rapidly as well as scale up, which is a new requirement. Such software also needs a pay-for-use licensing model to match the needs of Cloud Computing.

2. Infrastructure software needs to be aware that it is no longer running on bare metal, but on VMs. It needs to have billing built in from the beginning.


3. Hardware systems should be designed at the scale of a container (at least a dozen racks), which will be the minimum purchase size. Cost of operation will match performance and cost of purchase in importance, rewarding energy proportionality, such as by putting idle portions of the memory, disk, and network into low power mode. Processors should work well with VMs, flash memory should be added to the memory hierarchy, and LAN switches and WAN routers must improve in bandwidth and cost.

Cloud Computing Economics

When deciding whether hosting a service in the cloud makes sense over the long term, you can argue that the fine-grained economic models enabled by Cloud Computing make tradeoff decisions more fluid, and in particular that the elasticity offered by clouds serves to transfer risk. Although hardware resource costs continue to decline, they do so at variable rates. For example, computing and storage costs are falling faster than WAN costs. Cloud Computing can track these changes and potentially pass them through to the customer more effectively than building your own datacenter, resulting in a closer match of expenditure to actual resource usage.

In making the decision about whether to move an existing service to the cloud, you must examine the expected average and peak resource utilization, especially if the application may have highly variable spikes in resource demand; the practical limits on real-world utilization of purchased equipment; and operational costs that vary depending on the type of cloud environment being considered. See the section titled "Economics Pillar," starting on page 162, for additional details on economic issues. After all, profitability and economics go hand in hand.

Best Practice – Consider Elasticity as part of the business decision metrics
Although the economic appeal of Cloud Computing and its variants is often described as "converting capital expenses to operating expenses" (CapEx to OpEx), the phrase "pay as you go" may more directly capture the economic benefit to the buyer or consumer. Hours purchased via Cloud Computing can be distributed non-uniformly in time (e.g., use 100 server-hours today and no server-hours tomorrow, and still pay only for what you use). In the networking community, this way of selling bandwidth is already known as usage-based pricing. In addition, the absence of up-front capital expense allows capital to be redirected to core business investment. Even though, as an example, Amazon's pay-as-you-go pricing could be more expensive than buying and depreciating a comparable server over the same period, you can argue that the cost is outweighed by the extremely important Cloud Computing economic benefits of elasticity and transference of risk. This is especially true if the risks of over-provisioning (underutilization) and under-provisioning (saturation) are paramount.

Starting with elasticity, the key observation is that Cloud Computing's ability to add or remove resources at a fine granularity (for example, one server at a time with EC2) and with a lead-time of minutes rather than weeks allows us to match resources to workload much more closely. Real-world estimates of server utilization in datacenters range from 5% to 20%. This may sound shockingly low, but it is consistent with the observation that for many services, the peak workload exceeds the average by factors of 2 to 10. Few users deliberately provision for less than the expected peak, and therefore they must provision for the peak and allow the resources to remain idle at nonpeak times. The more pronounced the variation, the more the waste.

A simple example demonstrates how elasticity reduces this waste and can more than compensate for the potentially higher cost per server-hour of paying as you go vs. buying. For example, assume a service has a predictable daily demand where the peak requires 500 servers at noon but the trough requires only 100 servers at midnight, as shown in Figure 28 – Provisioning for peak load, on page 160.


Figure 28 – Provisioning for peak load

As long as the average utilization over a whole day is 300 servers, the actual utilization over the whole day (shaded area under the curve) is 300 x 24 = 7,200 server-hours; but since we must provision to the peak of 500 servers, we pay for 500 x 24 = 12,000 server-hours, a factor of 1.7 more than what is needed. Therefore, as long as the pay-as-you-go cost per server-hour over 3 years is less than 1.7 times the cost of buying the server, you can save money using utility computing.
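A quick arithmetic check of this example, using only the numbers quoted above:

# Worked check of the provisioning example above.
peak_servers, average_servers, hours = 500, 300, 24

used_server_hours = average_servers * hours    # 7,200 server-hours actually needed
paid_server_hours = peak_servers * hours       # 12,000 server-hours provisioned for the peak
overprovision_factor = paid_server_hours / used_server_hours
print(round(overprovision_factor, 2))          # ~1.67, i.e. roughly the factor of 1.7 quoted above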

In fact, the example in Figure 28 – Provisioning for peak load, above, underestimates the benefits of elasticity. In addition to simple diurnal or 24-hour patterns, most nontrivial services also experience seasonal or other periodic demand variations (e.g., e-commerce peaks in December and photo sharing sites peak after holidays) as well as unexpected demand bursts due to external events (e.g., news events). Since it can take weeks to acquire and rack new equipment, the only way to handle such spikes is to provision for them in advance. We already saw that even if service operators predict the spike sizes correctly, capacity is wasted, and if they overestimate the spike they provision for, the waste is even worse. They may also underestimate the spike, as shown in Figure 29 – Under Provisioning Option 1 on page 161, accidentally turning away excess users. While the monetary effects of over-provisioning are easily measured, those of under-provisioning are more difficult to measure, yet equally serious given performance and scalability concerns.


Figure 29 – Under Provisioning Option 1

Not only do rejected users generate zero revenue; they may never come back due to poor service. Figure 30 – Under Provisioning Option 2 on page 162 aims to capture this behavior. Users will abandon an under-provisioned service until the peak user load equals the datacenter's usable capacity, at which point users again receive acceptable service, but with fewer potential users.


Figure 30 – Under Provisioning Option 2

Regarding Figure 28, even if peak load can be correctly anticipated, without elasticity we waste resources (shaded area) during nonpeak times. In the case of Figure 29 – Under Provisioning Option 1, potential revenue from users not served (shaded area) is sacrificed. Lastly, in Figure 30 – Under Provisioning Option 2, some users desert the site permanently after experiencing poor service. This attrition and possible negative press result in a permanent loss of a portion of the revenue stream.


Economics Pillar

It seems to be a foregone conclusion that by increasing energy efficiency, the world as a whole reduces carbon emissions and, overall, aids the goal of sustainability.

Consider the economics of passenger airline flights. A number of years ago, the thinking was that by building wide-body aircraft that could handle more people, fewer planes would be needed and efficiency would therefore increase (i.e., carbon gas emissions would be reduced). Interestingly, the opposite happened. At the micro level, yes, passenger-per-flight efficiency increased, but since the cost of airfare was reduced, more people started to fly and greenhouse emissions therefore increased.

It has become an article of faith among environmentalists seeking to reduce greenhouse gas emissions that improving the efficiency of energy use will lead to a reduction in energy consumption. This proposition has even been adopted by many countries that are promoting energy efficiency as the most cost effective solution to global warming.

However, in the United States, there has been a backlash against energy efficiency as an instrument of energy policy. This has been stimulated partly by disillusionment with the failures of energy conservation programs undertaken by utilities, and partly by the growing influence of the 'contrarians', those hostile to government-mandated environmental programs.

The debate as to whether energy efficiency is effective (i.e., reduces energy consumption) has spread from the pages of obscure energy economics journals in the early 1990s to the pages of the leading US science journal, Science, and the New York Times, in the mid 1990s. It has recently produced such polemics as the US book by Herbert Inhaber entitled "Why Energy Conservation Fails." Inhaber argues, with the aid of an extensive bibliography, that energy efficiency programs are a waste of time and effort.

This debate has also prompted discussion among the climate change community and US energy analysts over the extent of the "rebound" or "take-back" effect; that is, how much of the energy saving produced by an efficiency investment is taken back by consumers in the form of higher consumption, at both the micro and macro levels.


The Khazzoom-Brookes postulate20, first put forward by the US economist Harry Saunders in 1992, says that energy efficiency improvements that are, on the broadest considerations, economically justified at the micro level lead to higher levels of energy consumption at the macro level than in the absence of such improvements.

It argues against the views of conservationists who promote energy efficiency as a means of reducing energy consumption by identifying every little benefit from each individual act of energy efficiency and then aggregating them all to produce a macroeconomic total. In effect, the postulate adopts a macroeconomic (top down) approach rather than the microeconomic (bottom up) approach used by conservationists.

It warns that although it is possible to reduce energy consumption through improved energy efficiency, it would be at the expense of a loss of economic output. It can be argued that overzealous pursuit of energy efficiency per se would damage the economy through misallocation of resources. In other words, reduced energy consumption is possible, but at an economic cost.

The effect of higher energy prices, either through taxes or producer-induced shortages, initially reduces demand, but in the longer term encourages greater energy efficiency. This efficiency response amounts to a partial accommodation of the price rise, and therefore the reduction in demand is blunted. The result is a new balance between supply and demand at a higher level of supply and consumption than if there had been no efficiency response.
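A toy calculation can make the rebound effect concrete; the 30% efficiency gain and the assumed price elasticity of demand are illustrative values, not figures taken from the studies discussed above.

# Toy rebound-effect model. The 30% efficiency gain and the price elasticity of
# demand (-0.6) are illustrative assumptions, not figures from this paper.
efficiency_gain = 0.30          # energy per unit of service falls by 30%
price_elasticity = -0.6         # demand response to the effective price of the service

# A cheaper service (per unit) stimulates demand.
effective_price_change = -efficiency_gain                      # -30%
demand_change = price_elasticity * effective_price_change      # +18%

# Net energy use = (1 - gain) * (1 + demand growth), relative to the baseline.
net_energy = (1 - efficiency_gain) * (1 + demand_change)
print(f"Net energy use: {net_energy:.2f}x baseline")           # ~0.83x: part of the saving is taken back

In this toy case a 30% efficiency improvement delivers only about a 17% reduction in energy use, because the lower effective price stimulates additional consumption.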

For example, under the economic conditions of falling fuel prices and a free market approach that have prevailed in the United Kingdom for most of this century, energy consumption has increased at the same time as energy efficiency has improved. During periods of high energy prices, such as 1973-74 and 1979-80, energy consumption fell. Whether this is due to the adverse consequences of higher fuel prices on economic activity, or to energy efficiency improvements, is a matter of fierce dispute.

20 http://www.zerocarbonnetwork.cc/News/Latest/The-Khazzoom-Brookes-Postulate-Does-Energy-Efficiency-Really-Save-Energy.html


The lower level of energy consumption at times of high energy prices may be at the expense of reduced economic output. This, in turn, is due to the adverse effect on economic productivity as a whole of the high price of an important resource.

Best Practice – Consider Efficiency as only one part of the Economic Sustainability equation

Energy is only one factor of production. Therefore, there are no economic grounds for favoring energy productivity over labor or capital productivity. Governments may have non-economic reasons, such as combating global warming, for singling out energy productivity.

However, climate policies that rely only on energy efficiency technologies may need reinforcement by market instruments such as fuel taxes and other incentive mechanisms. Without such mechanisms, a significant portion of the technologically achievable carbon and energy savings could be lost to the rebound.

Conclusion
Data centers are changing at a rapid, exponential pace. Cloud computing and all of its variants have been discussed. How we align the different data center disciplines to understand how new technologies will work together to solve data center sustainability problems remains a key discussion area. We reviewed Best Practices to achieve business value and sustainability. In summary, this article went above the Cloud, offering Best Practices that align with the most important goal of creating a sustainable computing infrastructure to achieve business value and growth.


Appendix A – Green IT, SaaS, Cloud Computing Solutions

SaaS or Cloud Computing Company Name | Webpage Address | Value Proposition Offered to Green Goods and Services Companies
APX, Inc. | http://www.apx.com/ | Analytics, technology, information and services for the energy and environmental markets. See Environmental Registry and Banking > http://www.apx.com/environmental/environmental-registries.asp and Market Operations > http://www.apx.com/marketoperations/ for examples.
Carbon Fund Offsets | http://www.carbonfund.org/calculators | Carbon footprint calculator and preset options enable customers to easily and affordably offset their carbon footprint by pressing the "Offset Your Footprint Now!" button after adding a list of items to a shopping list. Offers detailed information to learn more about carbon offsets.
Cloud Computing Expo | http://cloudcomputingexpo.com/ | Information about Cloud Computing.
CO2 Stats | http://www.co2stats.com/ | CO2Stats makes your site carbon neutral and shows visitors you are environmentally friendly.
Ecorio | http://www.ecorio.org/ | Mobile phone capability to track carbon footprint.
GaBi Software | http://www.pe-international.com/english/gabi/ | Software tools and databases for product and process sustainability analyses.
Greenhouse Gas Equivalencies Calculator | http://www.epa.gov/cleanenergy/energy-resources/calculator.html | Green company business models must offer a value proposition that reduces carbon dioxide. Designing the green business model begins by identifying how its products, services or solutions can reduce carbon dioxide (CO2) emissions. Market standards trend toward measuring reductions by 1 million metric tons. It can be difficult to visualize what a "metric ton of carbon dioxide" really is. This calculator translates difficult-to-understand statements into more commonplace terms, such as "is equivalent to avoiding the carbon dioxide emissions of X number of cars annually." It also offers an excellent example of the kind of green analytics, metrics and intelligence measures that SaaS / Cloud Computing solutions must address.
Ideal Sports Entertainment | idealsportsent@gmail.com | Promoting a healthy, active lifestyle, and aiding in the preservation of the environment through the demonstration of bike riding to reduce CO2 emissions and exercise.
PE INTERNATIONAL | http://www.pe-international.com/english/ | PE INTERNATIONAL provides conscientious companies with cutting-edge tools, in-depth knowledge and an unparalleled spectrum of experience in making both corporate operations and products more sustainable. Applied methods include implementing management systems, developing sustainability indicators, life cycle assessment (LCA), carbon footprint, design for environment (DfE) and environmental product declarations (EPD), technology benchmarking, and eco-efficiency analysis and emissions management. PE INTERNATIONAL offers two leading software solutions: the GaBi software for product sustainability and the SoFi software for corporate sustainability.
Planet Metrics | http://www.planetmetrics.com/ | Rapid Carbon Modeling (RCM) approach enables organizations to efficiently assess their exposure to commodity, climate, and reputational risks and the implications of these forces on the corporation, its suppliers and customers.
Point Carbon | http://www.pointcarbon.com/trading/ | Point Carbon Trading Analytics provides the market with independent analysis of the power, gas and carbon markets. We offer 24/7 accessible web tools, aimed at continuously providing our clients with the latest market-moving information and forecasts.
SoFi | http://www.pe-international.com/english/sofi/ | SoFi is a leading software system for environmental and sustainability / corporate social responsibility management; it is currently used in 66 countries. The fast information flow and the consistent database in SoFi will help you to improve your environmental and sustainability performance. The main product lines are: SoFi EH&S for Environmental Management and Occupational Safety, SoFi CSM for Sustainable Corporate Management, and SoFi EM for Emissions Management and Benchmarking.
Trucost PLC | http://www.trucost.com/ | Trucost is an environmental research organization working with companies, investors and government agencies to understand the impacts companies have on the environment. Trucost is an independent organization founded in 2000.

Appendix B – Abbreviations

Acronym | Description | Comment
DCE | Data Center Efficiency = IT equipment power / Total facility power | Shows a ratio of how well a data center is consuming power
DCPE | Data Center Performance Efficiency = Effective IT workload / Total facility power | Shows how effectively a data center is consuming power to produce a given level of service or work, such as energy per transaction or energy per business function performed
PUE | Power Usage Effectiveness = Total facility power / IT equipment power | Inverse of DCE
Kilowatts (kW) | Watts / 1,000 | One thousand watts
Annual kWh | kWh x 24 x 365 | kWh used in one year
Megawatts (MW) | kW / 1,000 | One thousand kW
BTU/hour | Watts x 3.413 | Heat generated in an hour from using energy, in British Thermal Units. 12,000 BTU/hour can equate to 1 ton of cooling.
kWh | 1,000 watt-hours | The number of watts used in one hour
Watts | Amps x Volts (e.g. 12 amps x 12 volts = 144 watts) | Unit of electrical power
Watts | BTU/hour x 0.293 | Convert BTU/hr to watts
Volts | Watts / Amps (e.g. 144 watts / 12 amps = 12 volts) | The amount of force on electrons
Amps | Watts / Volts (e.g. 144 watts / 12 volts = 12 amps) | The flow rate of electricity
Volt-Amperes (VA) | Volts x Amps | Power is sometimes expressed in Volt-Amperes
kVA | Volts x Amps / 1,000 | Number of kilovolt-amperes
kW | kVA x power factor | Power factor is the efficiency of a piece of equipment's use of power
kVA | kW / power factor | Kilovolt-amperes
U | 1U = 1.75" | EIA metric describing the height of equipment in racks
Activity / Watt | Amount of work accomplished per unit of energy consumed; this could be IOPS, transactions or bandwidth per watt | Indicator of how much work is done and how efficiently energy is used to accomplish useful work. This metric applies to active workloads or actively used and frequently accessed storage and data. Examples would be IOPS per watt, bandwidth per watt, transactions per watt, users or streams per watt. Activity per watt should also be used in conjunction with another metric, such as how much capacity is supported per watt and total watts consumed, for a representative picture.
IOPS / Watt | Number of I/O operations (or transactions) / energy (watts) | Indicator of how effectively energy is being used to perform a given amount of work. The work could be I/Os, transactions, throughput or another indicator of application activity, for example SPC-1 per watt, SPEC per watt, TPC per watt, transactions per watt, IOPS per watt.
Bandwidth / Watt | GBps, TBps or PBps per watt; the amount of data transferred or moved per second per unit of energy used | Indicates how much data is moved or accessed per second or time interval per unit of energy consumed. This is often confused with capacity per watt, given that both bandwidth and capacity reference GByte, TByte, PByte.
Capacity / Watt | GB, TB or PB (storage capacity space) per watt | Indicator of how much capacity (space) or bandwidth is supported in a given configuration or footprint per watt of energy. For inactive data or off-line and archive data, capacity per watt can be an effective measurement gauge. However, for active workloads and applications, activity per watt also needs to be looked at to get a representative indicator of how energy is being used.
MHz / Watt | Processor performance / energy (watts) | Indicator of how effectively energy is being used by a CPU or processor
Carbon Credit | Carbon offset credit | Offset credits that can be bought and sold to offset your CO2 emissions
CO2 Emission | Average 1.341 lbs per kWh of electricity generated | The amount of average carbon dioxide (CO2) emissions from generating an average kWh of electricity
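For illustration, the conversion factors in the table above can be strung together in a few lines; the facility and IT loads used below are assumed values, not measurements from this paper.

# Illustrative use of the conversion factors listed above. Input wattages are assumptions.
it_equipment_kw = 400.0          # IT equipment load
total_facility_kw = 680.0        # total facility load (IT + cooling + losses)

pue = total_facility_kw / it_equipment_kw          # Power Usage Effectiveness
dce = it_equipment_kw / total_facility_kw          # Data Center Efficiency (inverse of PUE)

annual_kwh = total_facility_kw * 24 * 365          # Annual kWh, per the table
annual_co2_lbs = annual_kwh * 1.341                # average lbs of CO2 per kWh generated
heat_btu_per_hour = total_facility_kw * 1000 * 3.413   # BTU/hour from watts

print(f"PUE={pue:.2f}  DCE={dce:.2f}  annual kWh={annual_kwh:,.0f}  "
      f"CO2={annual_co2_lbs:,.0f} lbs  heat={heat_btu_per_hour:,.0f} BTU/hr")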


Appendix C – References

[1] Gartner's Top Predictions for IT Organizations and Users, 2010 and Beyond: A New Balance Brian Gammage, Daryl C. Plummer, Ed Thompson, Leslie Fiering, Hung LeHong, Frances Karamouzis, Claudio Da Rold, Kimberly Collins, William Clark, Nick Jones, Charles Smulders, Meike Escherich, Martin Reynolds, Monica Basso, Publication Date: 29 December 2009

[2] Cloud Computing Value Chains: Understanding Businesses and Value Creation in the Cloud, Ashraf Bany Mohammed, Jorn Altmann and Junseok Hwang, December 2009

[3] Cloud Data Management Interface Specification, Version 0.80, Jan 2009

[4] HP and the cloud for industry analysts, Rebecca Lawson, Director of worldwide cloud marketing initiatives, Fall 2009

[5] Belady, C., Electronics Cooling, Volume 13, No. 1, February 2007

[6] White Paper - Creating HIPAA-Compliant Medical Data Applications with Amazon Web Services, April 2009

[7] Guidelines for energy efficient data centers, February 16, 2007

[8] Evaluating Data Center High-Availability Service Delivery, A FORTRUST White Paper, June 2008

[9] Probabilistic Latent Semantic Indexing, Thomas Hofmann, International Computer Science Institute, Berkeley, CA, and EECS Department, CS Division, UC Berkeley ([email protected]), Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval

[10] A closer look at Data de-duplication and VTL, Jan Poos, Sun Microsystems

[11] The Green Data Center: Understanding Energy Regulations, Power Consumption and More; Using chargeback to reduce data center power consumption: Five steps, Search Tech Data, IBM, November 2009

[12] Supporting Sustainable Cloud Services Investing In The Network To Deliver Scalable, Reliable, And Secure Cloud Computing, A commissioned study conducted by Forrester Consulting on behalf of Juniper Networks, October 2009

[13] Proxy Proposals for Measuring Data Center Productivity, contributors: Jon Haas (Intel), Mark Monroe (Sun Microsystems), John Pflueger (Dell), Jack Pouchet (Emerson), Peter Snelling (Sun Microsystems), Andy Rawson (AMD), Freeman Rawson (IBM), White Paper #17, © 2009 The Green Grid.

[14] S.V. Garimella, Joshi, Y.K., Bar-Cohen, A., Mahajan, R., Toh, K.C., Carey, V.P., Baelmans, M., Lohan, J., Sammakia, B. and Andros, F., “Thermal Challenges in Next Generation Electronic Systems – Summary of Panel Presentations and Discussions,” IEEE Trans. Components and Packaging Technologies, 2002

[15] Shah, A.J., Carey, V.P., Bash, C.E. and Patel, C.D., “Energy Analysis of Data Center Thermal Management Systems,” Proceedings of the 2003 IMECE, paper IMECE2003-42527, 2003.


[16] Shah, A.J., Carey, V.P., Bash, C.E. and Patel, C.D, “An Energy-Based Control Strategy for Computer Room Air-Conditioning Units in Data Centers,” paper IMECE2004-61384, Proceedings of the 2004 IMECE, Anaheim, CA, 2004.

[17] Data Center Power Efficiency, Technical Committee White Paper, The Green Grid, February 20, 2007

[18] Power Efficiency and Storage Arrays, Technology Concepts and Business Considerations, EMC, July 2007

[19] IDC, Industry trends and market analysis, October 30, 2007

[20] http://www.ashrae.org/

[21] http://www.fema.gov/hazard/map/index.shtm#disaster

[22] US Department of Energy, Energy Information Administration, Annual Energy Review 2006, June 27, 2007


Author’s Biography

Paul Brant is a Senior Technology Consultant at EMC in the Global Technology Solutions Group, located in New York City. He has over 25 years of experience in semiconductor VLSI design, board-level hardware and software design, and IT solutions, in roles spanning engineering, marketing, and technical sales. He also holds a number of patents in the data communication and semiconductor fields. Paul has a Bachelor's and a Master's degree in Electrical Engineering from New York University (NYU) in downtown Manhattan, as well as a Master of Business Administration (MBA) from Dowling College in Suffolk County, Long Island, NY. In his spare time, he enjoys his family of five, bicycling, and various other endurance sports.


Index

Amazon .... 82, 87, 88, 89, 92, 93, 94, 95, 96, 101, 110, 119, 120, 121, 123, 124, 126, 130, 155, 158, 173

ANSI ......................................................... 68

AppEngine ............................. 120, 121, 155

ASIC ................................................... 77, 78

Autonomic Computing .............. 84, 102, 103

Business Practices ........... 19, 127, 128, 151

CCT .................................................... 48, 49

Chillers ..................................................... 42

CHP ......................................................... 20

CIM .................................................. 45, 138

Cloud .... 1, 11, 12, 13, 14, 23, 24, 46, 72, 73, 74, 76, 77, 81, 82, 83, 84, 85, 86, 87, 88, 89, 92, 93, 96, 99, 100, 101, 102, 103, 104, 105, 106, 107, 109, 119, 120, 121, 123, 125, 126, 130, 136, 137, 138, 140, 142, 143, 144, 145, 153, 154, 155, 156, 157, 158, 159, 164, 165, 166

Cloud computing .............. 13, 14, 24, 26, 27

Cloud Computing .... 24, 81, 82, 83, 85, 86, 88, 89, 92, 97, 102, 121, 154, 156, 158, 173

Consolidation ..................................... 21, 51

CRM ......................................................... 14

data center .... 11, 12, 16, 20, 21, 23, 25, 26, 27, 28, 31, 34, 35, 38, 65, 69, 73, 80, 81, 82, 83, 86, 91, 92, 95, 96, 106, 124, 127, 128, 129, 130, 131, 132, 133, 134, 135, 141, 143, 149, 150, 153, 155, 164, 169

Digital Ecosystems .... 82, 83, 92, 101, 102, 103, 105

DMTF ........................................ 45, 137, 138

Downstream Event Suppression .............. 49

DRAM ............................................... 70, 118

Effectiveness ........................ 19, 29, 39, 129

EISM ......................................................... 43

EMC ................................ 1, 10, 19, 174, 175

Environment ................................. 19, 29, 31

EPA .............................................. 33, 41, 42

EPEAT ................................................ 33, 34

ERP .......................................................... 14

ESX .............................................. 43, 71, 72

Executive Order 13423 ............................. 32

FAST ............................................ 64, 67, 69

FBI .................................................. 107, 108

Federated ....................................... 139, 140

FEMA ........................................................ 36

Fifth Light .................................................. 20

Flash ........................................... 17, 64, 157

Flash Memory ................................. 17, 157

Flywheel ................................................... 20

Google .... 13, 81, 82, 87, 88, 89, 101, 103, 109, 110, 113, 114, 116, 120, 121, 152, 153, 155

Governance ............................................ 143

Green .... 15, 23, 24, 31, 82, 83, 92, 102, 103, 165, 166

Grid Computing .... 82, 83, 84, 92, 102, 103, 123, 125

HDD .................................................... 67, 68

Hyper-V ............................................ 43, 130

IaaS .............................. 76, 87, 97, 124, 136


IBM ................................................... 93, 173

ICIM ............................................. 44, 45, 46

IONOX ..................................................... 43

ISO ........................................... 45, 131, 147

IT .... 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 31, 32, 34, 37, 39, 40, 41, 42, 49, 50, 53, 57, 58, 59, 64, 71, 72, 80, 81, 82, 85, 89, 90, 92, 97, 98, 99, 100, 101, 107, 108, 109, 127, 129, 134, 135, 139, 140, 141, 142, 143, 144, 145, 152, 153, 165, 169, 175

Joyent .................................................... 119

Liquid Cooling .......................................... 20

Little’s law ................................................ 55

Moore’s Law ............................... 29, 78, 153

Mosso .............................................. 87, 119

Optimization ............................................. 14

PaaS ............................ 76, 87, 97, 124, 136

People, Planet, Profit ............................... 31

RAID .......................... 63, 64, 65, 67, 68, 69

RAID 6 ..................................................... 68

RFP ........................................................ 133

Risk ................................................ 143, 146

RLS .................................................... 54, 57

ROI ........................................... 48, 133, 153

Ruby on Rails ......................................... 121

SaaS .... 13, 24, 76, 85, 88, 97, 124, 154, 165, 166

SAS ............................................ 68, 76, 147

SATA ................................ 64, 67, 68, 69, 70

Security .... 16, 74, 95, 97, 137, 138, 139, 140, 145

Self Healing .............................................. 66

Self Organizing Systems .................... 50, 52

Self-organizing .......................................... 65

SNMP ................................................. 45, 47

Social Computing ................................... 16

STR .............................................. 54, 55, 56

Sustainability .... 13, 18, 19, 20, 21, 23, 24, 27, 28, 29, 30, 31, 81, 82, 84, 104, 150, 151, 167

T10 DIF ..................................................... 68

TCO ............................................ 22, 58, 135

Telco ......................................................... 94

Texas ........................................................ 36

triple bottom line ....................................... 31

United States ............ 20, 33, 35, 36, 37, 143

VERITAS .................................................. 43

Virtualization .... 14, 17, 42, 51, 70, 90, 138, 153

VM ................ 17, 43, 71, 106, 121, 156, 157

VTL ............................................. 61, 63, 173

Warehouse-Scale ..................................... 11

WDM ......................................................... 80

WORM ...................................................... 94

WSCs ..... 109, 110, 111, 112, 114, 115, 116

Zantaz ....................................................... 94

Zetta ......................................................... 94