Scientific Computing in the Cloud with
Google App Engine
master thesis in computer science
by
Michael Sperk
submitted to the Faculty of Mathematics, Computer
Science and Physics of the University of Innsbruck
in partial fulfillment of the requirements
for the degree of Master of Science
supervisor: Prof. Dr. Radu Prodan, Institute of Computer Science
Innsbruck, 17 January 2011
Certificate of authorship/originality
I certify that the work in this thesis has not previously been submitted for a
degree nor has it been submitted as part of requirements for a degree except as
fully acknowledged within the text.
I also certify that the thesis has been written by me. Any help that I have
received in my research work and the preparation of the thesis itself has been
acknowledged. In addition, I certify that all information sources and literature
used are indicated in the thesis.
Michael Sperk, Innsbruck on the 17 January 2011
Abstract
Cloud Computing has recently emerged as a topic of high research interest.
It has become an attractive alternative to traditional computing
environments, especially for smaller research groups that cannot afford
expensive infrastructure. Most of the research regarding scientific
computing in the cloud, however, has focused on IaaS cloud providers.
Google App Engine is a PaaS cloud framework dedicated to the development
of scalable web applications. The focus of this thesis is to investigate
App Engine's capabilities in terms of scientific computing. Moreover,
algorithm properties that are well suited for execution on Google App
Engine, as well as potential problems and bottlenecks, are identified.
Acknowledgements
....
Contents
0.1 Architecture
    0.1.1 The Runtime Environment
    0.1.2 The Datastore
    0.1.3 Scalable Services
    0.1.4 The App Engine Development Server
    0.1.5 Quotas and Limits
0.2 HTTP Requests
    0.2.1 The HTTP Protocol
    0.2.2 Apache HTTP Components
    0.2.3 Entity Compression
0.3 Slave Types
    0.3.1 App Engine Slaves
    0.3.2 Local Slaves
    0.3.3 Comparison of Slave Types
0.4 The Master Application
    0.4.1 Architecture
    0.4.2 Generating Jobs
    0.4.3 Job Mapping
    0.4.4 Fault Tolerance
0.5 The Slave Application
    0.5.1 WorkJobs
    0.5.2 Results
    0.5.3 Message Headers
0.6 Shared Data Management
    0.6.1 Data Splitting
    0.6.2 Data Transfer Strategy
    0.6.3 Performance Evaluation
0.7 Monte Carlo Routines
    0.7.1 Pi Approximation
    0.7.2 Integration
0.8 Matrix Multiplication
    0.8.1 Algorithm
    0.8.2 Parallelization
    0.8.3 Implementation
0.9 Mandelbrot Set
    0.9.1 Algorithm
    0.9.2 Parallelization
    0.9.3 Implementation
0.10 Rank Sort
    0.10.1 Algorithm
    0.10.2 Parallelization
    0.10.3 Implementation
    0.10.4 Hardware and Experimental Setup
0.11 Analyzing Google App Engine Performance
    0.11.1 Latency Analysis
    0.11.2 Bandwidth
    0.11.3 Java Performance Analysis
    0.11.4 JIT Compilation
    0.11.5 Cache Hierarchy
0.12 Speedup Analysis
    0.12.1 Pi Approximation
    0.12.2 Matrix Multiplication
    0.12.3 Rank Sort
    0.12.4 Mandelbrot Set
0.13 Scalability Analysis
0.14 Resource Consumption and Cost Estimation
List of Figures
List of Tables
Bibliography
Introduction
In the last few years, a new paradigm for handling computing resources,
called Cloud Computing, has emerged. The basic idea is that resources,
software and data are provided as on-demand services over the Internet [20].
The actual technology used in the cloud is abstracted from the user, so all
administrative tasks are shifted to the service provider. Moreover, the
provider deals with problems such as load balancing and scalability,
typically by means of resource virtualization. Cloud Computing provides a
flexible and cost-efficient alternative to local management of compute
resources. Payment is typically done on a per-use basis, so the user only
pays for the resources that were actually consumed.
Cloud services can be classified into three categories by the level of
abstraction of the service [20]:
1. Infrastructure as a Service (IaaS): IaaS provides only basic storage,
network and computing resources. The user does not manage or control
the underlying cloud infrastructure, but can deploy and execute arbitrary
software, including the operating system and applications.
2. Platform as a Service (PaaS): PaaS provides a platform for executing
consumer-created applications, developed using programming languages
and tools supplied by the provider. The user does not manage the un-
derlying cloud infrastructure, storage or the operating system, but has
control over the deployed applications.
3. Software as a Service (SaaS): SaaS provides the use of applications
developed by the provider, running on the cloud infrastructure and accessed
through a thin client, typically a web browser. The user has no control over
the infrastructure, the operating system or the software capabilities.
Cloud computing has recently become an appealing alternative for research
groups to buying and maintaining expensive computing clusters. Most work on
scientific computing in the cloud has focused on IaaS clouds such as Amazon
EC2 [1], because arbitrary software can be executed, which makes porting
existing scientific programs a lot easier. Moreover, there are no
restrictions in terms of operating system or programming language.
Google App Engine is a PaaS cloud service dedicated especially to scalable
web applications [5]. It mainly targets smaller companies that cannot afford
the infrastructure to handle a large number of requests or sudden traffic peaks.
App Engine provides a framework for developing servlet-based web applications
using Python or Java as the programming language. Applications are deployed to
Google's server infrastructure.
Each application consumes resources such as CPU time, number of requests
and used bandwidth. Google grants a certain amount of free daily resources to
each application. If billing is enabled, the user pays for resource usage that
surpasses the free limits; otherwise the web application becomes unavailable
once critical resources are depleted. This makes the service especially
interesting for scientific computing, since an automated program could simply
use up the free daily resources and pause computation until they are
replenished. Moreover, each member of a research group can contribute an
account with separate resources.
The problem, though, is that the framework is very restrictive in terms of
programming language, libraries and many other aspects. This makes its use
for scientific computing more difficult than on common IaaS cloud platforms.
The focus of this thesis is to explore the capabilities of the Google App En-
gine framework in terms of scientific computing. The goal is to build a simple
framework for utilizing App Engine servers for parallel scientific computations.
Subsequently a few exemplary algorithms should be implemented and analyzed
in order to identify algorithm properties that might be well suited for execution
in a PaaS cloud environment. Moreover potential problems and bottlenecks that
arise should be analyzed as well.
The thesis is structured in four main parts. First, the App Engine framework
and the parts of the API that will be used are introduced. This is followed
by a description of the Distribution Framework that was developed in the
course of the thesis and a basic introduction to the algorithms that were
implemented to test the system. Finally, the experimental results obtained by
testing the Distribution Framework under practical circumstances are presented.
Google App Engine
Google App Engine is a Cloud service for hosting Web applications on Google’s
large scale server infrastructure. However, it provides a whole framework for
building scalable Web applications rather than plain access to hardware. As
more people access the application, App Engine automatically allocates and
manages additional resources.
The user never has to set up or maintain a server environment. In addition,
common problems such as load balancing, caching and traffic peaks are handled
by App Engine automatically.
The framework provides a certain amount of free resources, enough for smaller
applications. Google estimates that with the free resources an application
can handle about five million page views per month. If an application needs
resources that exceed the free quota limits, these are billed on a per-use
basis. For example, if an application is very computation heavy, only the
additional CPU hours are billed.
In this chapter, the aspects of the App Engine framework relevant to this
thesis are described. The descriptions are mostly based on [22] and the
official online documentation [2].
0.1 Architecture
Figure 0.1: Request handling architecture of Google App Engine, taken from [22]

Figure 0.1 shows the request handling architecture of Google App Engine. Each
request is first inspected by the App Engine frontend. In fact, there are
multiple frontend machines and a load balancer that manages the proper
distribution of requests to the actual machines. The frontend determines
which application a request is addressed to by inspecting the domain name of
the request.
In the next step, the frontend reads the configuration of the corresponding
application. The configuration of an application determines how the frontend
handles a request, depending on the URL. The URL path can be mapped either to
a static file or to a request handler. Static files are typically images,
JavaScript files or similar resources. A request handler dynamically
generates a response for the request, based on application code. If no
matching mapping is found in the configuration, the frontend responds with an
HTTP 404 "Not Found" error message.
Requests to static files are forwarded to the static file servers. The static file
servers are optimized for fast delivery of resources that do not change often.
Whether a file is static and should be stored on the static file servers is decided
at application deployment.
If the request is linked to a request handler, it is forwarded to the
application servers. One specific application server is chosen and an
instance of the application is started. If an instance of the application is
already running, it can be reused, so servers already running an instance are
typically preferred. The appropriate request handler of the application is
then invoked.
The strategies for load balancing and distributing requests to application
servers are still being optimized. However, the main goal is fast-responding
request handlers, in order to guarantee a high throughput of requests. How
many instances of an application are started at a time and how requests are
distributed depends on the application's traffic and resource usage patterns.
Typically, just enough instances are started at a time to handle the current
traffic.
The application code itself runs in a runtime environment, an abstraction
above the operating system. This mechanism allows servers to manage resources
such as CPU cycles and memory for multiple applications running on the same
server. Besides, applications are prevented from interfering with one another.
The application server waits until the request handler terminates and returns
the response to the frontend, thus completing the request. Request handlers
have to terminate before returning data; therefore, streaming of data is not
possible. The frontend then constructs the final response to the client. If
the client indicates that it supports compression by adding the
"Accept-Encoding" request header, data is automatically compressed using the
gzip format.
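This transparent compression can be reproduced in plain Java, which is useful
when a client outside App Engine wants to handle or verify what the frontend
does. The following sketch shows a gzip round trip using the standard
java.util.zip classes; the class and method names are chosen for this
illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Gzip round trip as performed transparently between the App Engine frontend
// (compression) and a client that sent "Accept-Encoding: gzip" (decompression).
public class EntityCompression {

    public static byte[] compress(byte[] raw) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(raw);  // closing the stream finishes the gzip trailer
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // in-memory streams do not fail
        }
    }

    public static byte[] decompress(byte[] compressed) {
        try {
            GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(compressed));
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            for (int n; (n = gz.read(chunk)) > 0; ) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the text-based responses typical of web applications, the compressed
entity is usually a fraction of the original size, which directly reduces
the outgoing bandwidth counted against the quota.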
App Engine consists of three main parts: the runtime environment, the Data-
store and the scalable services. The runtime environment executes the code of
the application. The Datastore provides a possibility for developers to persist
data beyond requests. Finally App Engine provides a couple of scalable services
typically useful to web applications. In the following, each of these parts
is described briefly.
0.1.1 The Runtime Environment
As already mentioned, the application code runs in a runtime environment,
which is abstracted from the underlying operating system. This isolated envi-
ronment running the applications is called the sandbox. App Engine applications
can be programmed either in Python or in Java. As a consequence each pro-
gramming language has its own runtime environment.
The Python runtime provides an optimized interpreter (at the time of this
writing, Python 2.5.2 was the latest supported version). Besides the standard
library,
a wide variety of useful libraries and frameworks for Python web application de-
velopment, such as Django, can be used.
The Java runtime follows the Java Servlet standards, providing the correspond-
ing APIs to the application. Common web application technologies such as
JavaServer Pages (JSP) are supported as well. The App Engine SDK supports
developing applications in Java version 5 or 6.
Though applications are typically developed in Java, in principle any
language for which a compiler to Java bytecode exists, such as JavaScript,
Ruby or Scala, can be used. This section will focus on the Java runtime.
The sandbox imposes several restrictions on applications:
1. Developers have limited access to the filesystem. Files deployed along
with the application can be read; however, there is no write access to the
filesystem whatsoever.
2. Applications have no direct access to the network, though HTTP requests
can be performed through a service API.
3. In general no access to the underlying hardware or the operating system
is granted.
4. App Engine does not support domains without "www", such as
http://example.com, because canonical name records are used for load
balancing.
5. Usage of threads is not permitted.
6. Java applications can only use a limited set of classes from the standard
Java Runtime Environment, documented in [8].
Sandboxing on the one hand prevents applications from performing malicious
operations that could harm the stability of the server infrastructure or interfere
with other applications running on the same physical machine. On the other
hand, it enables App Engine to perform automatic load balancing, because it
does not matter on what underlying hardware or operating system the appli-
cation is executed. There is no guarantee that two requests will be executed
on the same machine even if the requests arrive one after another and from the
same client. Multiple instances of the same or even of different applications can
run on the same machine without affecting one another.
The sandbox also limits resources such as CPU or memory use and can throttle
applications that consume a particularly high amount of resources, in order
to protect applications executed on the same machine. A single request has a
maximum of 30 seconds to terminate and respond to the client, although App
Engine is optimized for much shorter requests and may slow down an
application that consumes too many CPU cycles.
Since scientific applications are CPU intensive, these limitations imposed by
the runtime environment are problematic for such applications.
0.1.2 The Datastore
Web applications need a way to persist data between the stateless requests
to the application. The traditional approach is a relational database residing
on a single database server. The central database is accessed by a single or
potentially by multiple web servers retrieving the necessary data. The
advantage of such a system is that every web server always has the most
recent version of the data. However, once the limit for handling multiple
parallel database requests is reached, it becomes difficult to scale the
system up to more requests.
Alongside relational database systems there are various other approaches like
XML databases or object databases.
The Datastore is App Engine's own distributed data storage service. The main
idea is to provide a high-level API and to hide the details of how storage is
actually done from the developer. This spares the application developer the
task of keeping data up to date while still maintaining scalability.
The database paradigm of the Datastore most closely resembles an object
database. Data objects are called entities and have a set of properties. Prop-
erty values can be chosen from a set of supported data types. Entities are of a
named kind in order to provide a mechanism for categorizing data.
This concept might seem similar to a relational database. Entities resemble rows
and properties resemble the columns in a table. However, there are some key
differences to a relational database. First of all, the Datastore is schemaless,
which means that entities of the same kind are not required to have the same
properties. Furthermore, two entities are allowed to have a property with the
same name but different value types. Another important difference is that a
single property can have multiple values.
Entities are identified by a key, which can either be generated automatically
by App Engine or manually by the programmer. Unlike the primary key in a
relational database, an entity key is not a field, but a separate aspect of
the entity. App Engine uses the key in combination with the kind of an entity
to determine where to store the entity in the distributed server
infrastructure. As a consequence, neither the key nor the kind of an entity
can be changed once it is created.
Indexes used in the Datastore are defined in a configuration file. While
testing the application locally on the development server, index suggestions
are automatically added to the configuration: the framework recognizes
typical queries performed by the application and generates indexes
accordingly. The index definitions can be manually fine-tuned by modifying
the configuration file before uploading the application.
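For the Java runtime, these index definitions live in the application's
datastore-indexes.xml configuration file. A minimal hand-tuned example might
look as follows; the kind and property names here are purely illustrative,
and autoGenerate="true" lets the development server keep appending its own
suggestions:

```xml
<!-- war/WEB-INF/datastore-indexes.xml (illustrative example) -->
<datastore-indexes autoGenerate="true">
  <!-- A composite index for queries that filter on "owner"
       and sort by "created" in descending order. -->
  <datastore-index kind="WorkJob" ancestor="false">
    <property name="owner" direction="asc"/>
    <property name="created" direction="desc"/>
  </datastore-index>
</datastore-indexes>
```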
App Engine's query concept supports the most common query types. A query
consists of the entity kind along with a set of conditions and a sorting
order. Executing a query returns all entities of the given kind that meet all
of the given conditions, sorted by the given order. Besides letting the query
return the entities themselves, there is also the option to let it return
only the key values of the entities. This helps to minimize the data transfer
from the Datastore to the application if only some of the queried entities
are actually used.
The data of a web application is typically accessed by multiple users
simultaneously, which makes a transaction concept important. App Engine
guarantees atomicity: every update of an entity involving multiple properties
either succeeds entirely or fails entirely, leaving the object in its
original state. Other users will only see the complete update or the original
entity, and never something in between.
App Engine uses an optimistic concurrency control mechanism: it is assumed
that transactions can take place without conflict. Once a conflict occurs
(multiple users try to update the same entity at the same time), the entity
is rolled back to its original state and all users trying to perform an
update receive a concurrency failure exception. Such a concept is most
efficient for systems in which conflicts occur rather rarely, which is
usually the case for a web application. Reads always succeed, and the user
simply sees the most recent version of the entity. There is also the
possibility to read multiple entities in a group in order to guarantee
consistency of the data.
Transactions can also be defined manually, by bundling multiple database
operations into a single transaction. For example, an application can read an
entity, update a property accordingly, write the entity back and commit the
transaction. Again, if the transaction fails, all of its database operations
have to be repeated.
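The "repeat everything on failure" behavior naturally leads to a retry loop
around the whole transaction. The following sketch illustrates the pattern in
plain Java; the conflict is simulated with an ordinary RuntimeException
rather than the actual JDO concurrency exception, and all names are chosen
for this example.

```java
import java.util.function.Supplier;

// Retry pattern for optimistic concurrency control: a transaction is
// attempted, and on a concurrency failure all of its operations (read,
// update, commit) are repeated from the start.
public class OptimisticRetry {

    public static <T> T runWithRetries(Supplier<T> transaction, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be positive");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return transaction.get();  // read, update, write back, commit
            } catch (RuntimeException conflict) {
                lastFailure = conflict;    // rolled back: repeat all operations
            }
        }
        throw lastFailure;  // still conflicting after maxAttempts tries
    }
}
```

Because every retry re-reads the entity, the update is always applied to the
most recent committed version, which is exactly what the optimistic model
requires.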
The Datastore provides two standard Java interfaces for data access: Java Data
Objects (JDO) and Java Persistence API (JPA). The implementation of the two
interfaces uses the DataNucleus Access Platform, which is an open source im-
plementation of the specified APIs. Alongside the high level APIs, App Engine
also provides a low level API, which can be used to program further database
interfaces. The low level API can also be used directly from the application,
which in some cases might be more efficient than the high level APIs.
The Java Data Objects API
In the following, the JDO API is briefly described, along with an example
illustrating its use. JDO uses annotations to describe how entities are
stored and reconstructed. The following JDO data class, called
DataStoreEntity, demonstrates the use of annotations:
Listing 1: Example of a Datastore entity

@PersistenceCapable
public class DataStoreEntity {

    @PrimaryKey
    @Persistent
    private String key;

    @Persistent
    private Blob data;

    public DataStoreEntity(byte[] data, String key) {
        this.data = new Blob(data);
        this.key = key;
    }

    public byte[] getData() {
        return data.getBytes();
    }

    public void setData(byte[] data) {
        this.data = new Blob(data);
    }

    public String getKey() {
        return key;
    }

    public void setKey(String key) {
        this.key = key;
    }
}
The class is marked with the annotation @PersistenceCapable, indicating that
it is a storable data class. The class defines two fields annotated with
@Persistent, telling the Datastore that they should be stored as properties
of the entity. The field key is additionally annotated with @PrimaryKey,
making it the database key of the entity. Besides the standard Java data
types, several additional classes are provided for various purposes.
In order to perform database operations, a PersistenceManager is needed,
which is retrieved through a PersistenceManagerFactory (PMF). The PMF takes
some time to initialize, though only one instance is needed for the
application. Typically the PMF is stored in a static variable, making it
available to the application through a singleton wrapper:
Listing 2: PersistenceManagerFactory singleton

public final class PMF {
    private static final PersistenceManagerFactory pmfInstance =
        JDOHelper.getPersistenceManagerFactory("transactions-optional");

    private PMF() {}

    public static PersistenceManagerFactory get() {
        return pmfInstance;
    }
}
Having defined a JDO data class and the singleton wrapper for the
PersistenceManagerFactory, instances of the entity can be stored in the
Datastore and retrieved using the query API:
Listing 3: Using the query API

PersistenceManager pm = PMF.get().getPersistenceManager();

pm.makePersistent(new DataStoreEntity(data, req.getHeader("id")));

Query query = pm.newQuery(DataStoreEntity.class);
List<DataStoreEntity> objs = (List<DataStoreEntity>) query.execute();
Every database operation is performed through a PersistenceManager instance.
The makePersistent() method simply stores persistence-capable classes in the
Datastore. Datastore entities are retrieved using queries, which are also
generated by the PersistenceManager: the newQuery() method returns a query
for a given class. Executing the query without further constraints returns
all Datastore entities of the given class, in a Java List of the
corresponding class.
0.1.3 Scalable Services
The Datastore provides a high-level API that hides implementation details
from the programmer. In a similar fashion, App Engine provides APIs to
several scalable services. On the one hand, some services compensate for the
restrictions of the sandbox; on the other hand, there are services typically
useful to web applications.
This system enables App Engine to handle scalability and performance of the
services while the developers do not have to worry about implementation
details. In the following, the different services are described in short:
1. URL Fetch: Because of the restrictions of the sandbox, applications are
not allowed to initiate arbitrary network connections. The URL Fetch
service provides an API for accessing HTTP resources on the Internet,
such as web services or other web sites. Since requests to web resources
often take a long time, there is a way to perform asynchronous HTTP
requests, as well as a timeout mechanism to abort requests to resources
that do not respond in time.
2. Mail: An application can send emails through the mail service. Many web
applications use emails for user notification or confirmation of user data.
There is also the possibility for an application to receive emails. When a
mail is sent to the application's address, the mail service performs an HTTP
request to a request handler, forwarding the message to the application.
3. Memcache: The Memcache is a short-lived key-value storage service used
for data that does not need the persistence or transactional features of the
Datastore. It can also be accessed by multiple instances of the application.
The advantage over the Datastore is that it performs much faster, since
the data is stored in memory. As the name indicates, the service is usually
used as a cache for persistent data.
4. Image Manipulation: Web applications often need image transforma-
tions, for example when creating thumbnails. This service allows the ap-
plication to perform simple image manipulations on common file formats.
5. XMPP: An application can send and receive messages from any XMPP-
compatible instant messaging service. Received messages trigger a request
handler, similar to an HTTP request.
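The typical use of Memcache as a cache for persistent data follows a
cache-aside pattern: check the fast in-memory cache first and fall back to
the Datastore only on a miss. The sketch below simulates both stores with
in-memory maps so that it is self-contained; in a real App Engine
application, the cache would be obtained via
MemcacheServiceFactory.getMemcacheService() and the fallback would be a
Datastore read.

```java
import java.util.HashMap;
import java.util.Map;

// Cache-aside lookup: Memcache in front of the Datastore. Both stores are
// simulated with maps here so the sketch runs outside App Engine.
public class CacheAside {
    private final Map<String, byte[]> memcache = new HashMap<>();
    private final Map<String, byte[]> datastore = new HashMap<>();
    private int datastoreReads = 0;

    public void store(String key, byte[] value) {
        datastore.put(key, value);  // the persistent write goes to the Datastore
    }

    public byte[] fetch(String key) {
        byte[] cached = memcache.get(key);
        if (cached != null) {
            return cached;  // cache hit: no slow Datastore round trip
        }
        datastoreReads++;   // cache miss: fall back to the persistent store
        byte[] value = datastore.get(key);
        if (value != null) {
            memcache.put(key, value);  // populate the cache for later requests
        }
        return value;
    }

    public int datastoreReads() {
        return datastoreReads;
    }
}
```

Repeated fetches of the same key then hit only the cache, which is the whole
point of placing Memcache in front of the Datastore.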
0.1.4 The App Engine Development Server
The App Engine SDK includes a development server that simulates the runtime
environment along with all the accessible services, including the Datastore.
As the name states, the development server is intended for development and
debugging purposes; however, it is possible to make the server remotely
accessible. This provides a way to host an App Engine web application on
hardware other than Google's servers. For example, if the free quotas limit
the application and additional hardware is available, one can host the
application on an alternative server.
Though this rarely makes sense for an actual web application, it can be very
useful for scientific computations. The work can be distributed
heterogeneously across several Google App Engine accounts, as well as across
development servers running on additional hardware. Since the development
server behaves the same as the App Engine runtime environment, in principle
it makes no difference where the application is executed.
There are necessarily some differences between the development server and the
App Engine runtime; however, most of them make things easier on the
development server. For example, the quota restrictions do not apply to an
application running on the development server, leaving more freedom in
resource usage. Moreover, the underlying hardware is known, making rough
runtime estimates possible and thus correct scheduling of jobs easier. The
differences between an application running on Google's infrastructure and one
running in the development server are discussed in more detail in section 0.3.
The scalable services are simulated by the development server, in order to pro-
vide the same API to the programmer. For example the Datastore is simulated
using the local filesystem.
0.1.5 Quotas and Limits
App Engine applications can use each resource up to a maximum limit, called
quota. Each type of resource has a quota associated with it. There are two
different types of quotas: billable and fixed [6].
Billable quotas are set by the application administrator in order to prevent
the application from overusing costly resources. A certain amount of each
billable quota is provided to the application for free. In order to use more
than the free resources, billing has to be activated for the application.
With billing activated, the user sets a daily budget for the application,
which is assigned to the desired resources. Application owners are only
charged for the resources the application actually used, and only for the
amount that exceeded the free quotas.
Fixed quotas are set by App Engine in order to ensure stability and integrity
of the server system. These are maximum limits shared by all applications,
preventing any application from consuming too many resources at a time. When
billing is enabled for an application, the fixed quotas increase.
Once the quota for a resource is reached, the resource is considered
depleted. Resources are replenished at the beginning of every day, giving the
application a fresh contingent for the next 24 hours. An exception is the
Datastore quotas, which represent the total amount of storable data and thus
are not replenished. Besides the daily quotas, there are also per-minute
quotas, preventing applications from consuming their resources in a very
short time. Per-minute quotas are again increased for applications with
billing enabled.
Some resources are essential for initiating a request handler; if one of
those is depleted, requests are rejected with an HTTP 403 "Forbidden" status
code. The following resources are necessary for handling a request:
• the number of allowed requests;
• CPU cycles;
• incoming and outgoing bandwidth.
For the rest of the resources, an exception is raised once an application tries to
access a depleted resource. These exceptions can be caught by the application
in order to display appropriate error messages for users.
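This pattern of catching a depleted-resource failure and surfacing a readable error message can be sketched as follows. This is a hypothetical, self-contained illustration: the QuotaDepletedException stand-in and the runGuarded helper are not part of the App Engine API, which signals depletion with its own exception types.

```java
// Hypothetical stand-in: the real App Engine runtime raises its own
// exception type when a resource is depleted; this placeholder keeps
// the example self-contained.
class QuotaDepletedException extends RuntimeException {
    QuotaDepletedException(String resource) { super(resource); }
}

public class QuotaGuard {
    // Runs an action and converts a depleted-quota failure into a
    // user-readable error message instead of an unhandled exception.
    public static String runGuarded(Runnable action) {
        try {
            action.run();
            return "OK";
        } catch (QuotaDepletedException e) {
            return "Resource temporarily unavailable: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runGuarded(() -> {}));
        System.out.println(runGuarded(() -> {
            throw new QuotaDepletedException("Datastore API calls");
        }));
    }
}
```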
In the following, we briefly describe the resources and their corresponding quotas
relevant to this thesis. There are many more quotas besides the ones mentioned
in this section; in particular, every scalable service has its own set of quotas.
The general resources with a quota are:
• Requests: The total number of HTTP requests to the application.
• Outgoing Bandwidth: The total amount of data sent by the application.
This includes data returned by request handlers, data served by the static
file servers, data sent in emails and data sent using the URL Fetch service.
• Incoming Bandwidth: The total amount of data received by the appli-
cation. This includes data sent to the application via HTTP requests as
well as returned data using the URL Fetch service.
• CPU Time: The total time spent processing, including all database opera-
tions. Waiting for other services such as URL Fetch or image processing
does not count. CPU time is reported in seconds, calculated in reference
to a 1.2 GHz Intel x86 processor. This normalization is necessary because
actual CPU cycles may vary greatly due to App Engine internal configu-
rations, such as differing hardware.
Resource              Daily Limit          Maximum Rate
Requests              1,300,000 requests   7,400 requests/minute
Outgoing Bandwidth    1 gigabyte           56 megabytes/minute
Incoming Bandwidth    1 gigabyte           56 megabytes/minute
CPU Time              6.5 CPU-hours        15 CPU-minutes/minute

Table 0.1: Free quotas for general resources (as of 20.09.2010) [6].
Table 0.1 shows the quota limits for the general resources. For scientific compu-
tations the main limitation will be the CPU cycles. In fact, the per-minute quota
limits the application to a maximum computation power of 15 times a 1.2 GHz
Intel processor per minute, and the actual amount of CPU cycles usable for
computation may be even lower. Moreover, a system using App Engine in an
automated way has to implement proper fault tolerance mechanisms, since once
resources are depleted, requests may result in an exception or may even be
rejected in the first place.
The maximum number of requests as well as the corresponding per-minute quota
are not a problem for scientific applications, since splitting a problem into more
than 7,400 requests per minute would create substantial transmission overhead.
Therefore the limiting factor will still be the CPU time long before the number
of requests becomes relevant. Note that these quota limits make sense in the
context of web applications, which are typically optimized for high throughput
and fast response but have no need for large amounts of CPU cycles. An ap-
plication dedicated to scientific computations on the other hand will consume a
lot more CPU time compared to the number of requests.
The bandwidth limits will in most cases not be problematic for the application.
Since data has to be transferred over the Internet, which is a relatively slow
medium, problems that only need small amounts of data transferred are typically
better suited for execution on Google App Engine. Data-intensive problems
would have a high communication overhead and thus are not a preferable class
of problems for execution on Google App Engine.
In the following, the quotas associated with the Datastore are listed:
• Stored Data: The total amount of data stored in the Datastore and its
indexes. There might be considerable overhead when storing entities in
the Datastore: for each entity the id, the ids of its ancestors and its kind
have to be stored. Since the Datastore is schemaless, the name of every
property has to be stored along with its value. Finally, all the index data
has to be stored along with the entities.
• Number of Indexes: The number of different Datastore indexes for an
application, including every index created in the past that has not been
explicitly deleted.
• Datastore API Calls: The total number of calls to the Datastore API,
including retrieving, creating, updating or deleting an entity as well as
posting a query.
• Datastore Queries: The total number of Datastore queries. Some
interface operations, such as "not equal" queries, internally perform
multiple queries; every internal query counts toward this quota.
• Data Sent to API: The amount of data sent to the API. This includes
creating and updating entities as well as data sent with a query.
• Data Received from API: The amount of data received by the Datas-
tore API when querying for entities.
• Datastore CPU time: The CPU time needed for performing database
operations. The Datastore CPU time is calculated in the same way as
for the regular CPU time quota. Note that CPU cycles used for database
operations also count towards the CPU time quota.
Resource                                            Limit
Stored Data                                         1 gigabyte
Maximum entity size                                 1 megabyte
Maximum number of entities in a batch put/delete    500 entities
Maximum size of a datastore API call request        1 megabyte
Maximum size of a datastore API call response       1 megabyte
Number of Indexes                                   100

Table 0.2: General Datastore quotas (as of 20.09.2010) [6].
Tables 0.2 and 0.3 list the general and daily quotas for the Datastore. The
Datastore will be used for data that has to persist between multiple requests
to the application, typically data that is inherent to the algorithm and shared
among all requests. The daily limits will not be problematic
Resource                  Daily Limit          Maximum Rate
Datastore API Calls       10,000,000 calls     57,000 calls/minute
Datastore Queries         10,000,000 queries   57,000 queries/minute
Data Sent to API          12 gigabytes         68 megabytes/minute
Data Received from API    115 gigabytes        659 megabytes/minute
Datastore CPU Time        60 CPU-hours         20 CPU-minutes/minute

Table 0.3: Daily Datastore quotas (as of 20.09.2010) [6].
for a scientific computation, for the same reasons stated for the bandwidth
limitations. Moreover, the Datastore will be cleared after each algorithm run,
thus resetting the general Datastore quotas.
However, the maximum entity size limit of one megabyte is quite a problem.
A normal web application has no need to store large data entities, but rather
stores many small entities optimized for quick retrieval. Scientific applications,
though, often have large amounts of data to operate on. As a consequence, data
beyond one megabyte has to be partitioned in order to fit in the Datastore. A
more detailed discussion of the implications can be found in Section 0.6.
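The partitioning step can be sketched as plain chunking of the serialized payload. The class name, helper method and chunk-size margin below are illustrative assumptions, not code from the actual system:

```java
import java.util.ArrayList;
import java.util.List;

public class DataPartitioner {
    // Illustrative chunk size just under the 1 MB entity limit, leaving
    // some room for entity metadata overhead; the exact margin is an
    // assumption.
    static final int CHUNK_SIZE = 1000 * 1000 - 1024;

    // Splits a serialized payload into chunks that each fit into one
    // Datastore entity; the receiving side concatenates them in order.
    public static List<byte[]> partition(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(data, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[2_500_000];
        List<byte[]> chunks = partition(payload, CHUNK_SIZE);
        System.out.println(chunks.size()); // 3 chunks for a 2.5 MB payload
    }
}
```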
The Distribution Framework
The goal of this thesis is to build a simple framework for utilizing App Engine
servers for parallel scientific computations. The system will mainly be used
to identify properties of parallel algorithms that are well suited for use on the
App Engine environment and subsequently those that are less suited. There-
fore the system should be extensible, to allow easy incorporation of additional
algorithms. In addition, the management of data and the distribution of jobs
should be independent of the actual algorithm used. Finally, the system should provide
an algorithm library that utilizes App Engine for parallelization.
In general the system uses a simple master-slave architecture. The master-
slave model is commonly used in distributed architectures [14]. It consists of
a central master process that controls the distribution of work among several
slave processes. When implementing a system based on the master-slave model,
it should be guaranteed that the master can provide work fast enough to feed
all slaves with sufficient work. When the job size is too small the master might
be too slow to generate enough jobs and can become a bottleneck.
The slave application is implemented as a web application using the Google App
Engine framework. It provides a simple HTTP interface that is invoked pro-
grammatically by the master application. The interface accepts either data
transfer requests or requests containing a computational job. In either case
data is transmitted in the payload of the HTTP request. Parallelism is achieved
by multiple parallel requests to the application. In order to make communica-
tion between master and slaves easier, both applications are written in the Java
programming language.
The master application is a Java program running on the user's machine. It
manages the logic and the distribution of the algorithm. The problem is split
into several work chunks. Each chunk is then submitted as a job to the slave
application, which performs the actual computation. The results of the jobs
are then collected and reassembled to provide the complete result of the algo-
rithm. In addition, the master application has to manage the scheduling of jobs
and data transfers.
In this chapter the architecture of the system and its components are explained
in detail. Furthermore important concepts used in the implementation of the
system will be discussed.
0.2 HTTP Requests
The slave application is in principle an HttpServlet implementing the request
handling method for Hypertext Transfer Protocol (HTTP) POST requests.
Therefore, the communication between the master and its slaves is entirely
based on the HTTP protocol.
In this section the basics of the protocol will be explained, followed by a
description of the HTTP library used by the system.
0.2.1 The HTTP Protocol
HTTP is a stateless application-level networking protocol. The latest version
of the protocol is HTTP/1.1, defined in RFC 2616 [9]. The protocol assumes a
reliable transport layer; therefore, TCP is the most widely used transport
protocol underneath it.
HTTP is mainly used by web browsers to load web pages, but it has numerous
other applications. The protocol follows the request-response message exchange
pattern: a client sends an HTTP request message to a server, which typically
stores content or provides resources. The server replies with an HTTP response
message containing status information and any content requested by the client.
The protocol defines nine request methods indicating the action that should be
performed by the server. The most important methods are:
1. GET: Retrieves whatever information is identified by the request URI.
The information should be returned in the message-body of the HTTP
response message.
2. HEAD: Is identical to the GET method, except that the HTTP response
must not contain a message-body. This method is typically used for re-
trieving metainformation on an entity or testing the validity of hypertext
links.
3. POST: The POST method is used to submit the entity enclosed in
the request message to the server. A POST request might result in the
modification of an existing entity or even in the creation of a new one.
4. OPTIONS: The OPTIONS method is a request for information on the
available communication options for the entity associated with the request
URI.
Servers are required to implement at least the HEAD and GET methods. The
communication between the master and slave application uses the HTTP POST
method.
Every HTTP response message contains a three digit numeric status code, fol-
lowed by a textual description of the status. Status codes are organized in five
general classes indicated by the first digit of the status code:
1. 1xx Informational: Indicates that a request was received and the server
continues to process it. Such a response is only provisional consisting of
the status line and optional headers. One or more such responses might
be returned before a regular response to the request is sent back. Infor-
mational responses are typically used to avoid timeouts.
2. 2xx Successful: Indicates that the server has successfully received, un-
derstood and accepted the corresponding request.
3. 3xx Redirection: Indicates that further action is needed by the client
in order to fulfill the request.
4. 4xx Client Error: Indicates that the client seems to have caused an
error. The server should include an entity containing a closer description
of the error in the response message.
5. 5xx Server Error: Indicates that the server is unable to process a seem-
ingly valid request. Again, an entity containing a closer description of the
error should be included in the response.
A client has to recognize at least these five classes of response status codes and
react accordingly.
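Since the class of a status code is determined by its first digit, a client can dispatch on the integer division of the code by 100. A minimal sketch, with illustrative class and method names:

```java
public class StatusClassifier {
    // Maps a three-digit HTTP status code to its general class,
    // as defined in RFC 2616, section 6.1.1.
    public static String classify(int status) {
        switch (status / 100) {
            case 1: return "Informational";
            case 2: return "Successful";
            case 3: return "Redirection";
            case 4: return "Client Error";
            case 5: return "Server Error";
            default: throw new IllegalArgumentException("Invalid status: " + status);
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(200)); // Successful
        System.out.println(classify(403)); // Client Error
    }
}
```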
0.2.2 Apache HTTP Components
The Apache HTTP Components library [?] provides an easy-to-use API for ap-
plications making use of the HTTP protocol. Moreover, the library is open
source, which makes required adaptations of the code possible. The master
application uses HTTP Components client 4.0.1, the latest stable version as of
this writing, for building the HTTP requests necessary for invoking the slave
application.
Listing 4 shows a code sample performing an HTTP request to "www.test.at"
using the functionality of the Apache HTTP Components library, as used by
the master application:
Listing 4: Code example for performing a simple HTTP request.

HttpResponse response = null;
HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost("http://www.test.at:80");

HttpEntity entity =
    new SerializableEntity(new Integer(5), true);
post.setEntity(entity);

post.setHeader("type", "integer");

response = client.execute(post);
System.out.println(response.getStatusLine());

client.getConnectionManager().shutdown();
The main class necessary for initiating an HTTP communication is the
HttpClient. Its most basic functionality is to execute HTTP methods.
Execution of a HTTP method consists of one or several HTTP request-response
exchanges. The DefaultHttpClient is the standard implementation of the
HttpClient.
The user provides a request object to the HttpClient for execution. The Http-
Client supports all methods defined in the HTTP 1.1 specification. The library
provides a separate class for every method. In the example, an HttpPost is
used for the request. Every method is initiated with a URI defining the target
of the request. A URI consists of the protocol in use, the host name, an optional
port and a resource path. In this case the URI is "http://www.test.at:80": the
protocol used is HTTP, the target host is www.test.at and the port used is 80.
HTTP request and response messages can optionally have content entities as-
sociated with them. Requests carrying an entity are referred to as entity en-
closing requests. The specification defines two entity enclosing methods namely
POST and PUT. The SerializableEntity class allows constructing an entity
containing a serializable Java object. In the example, an entity containing an
Integer object is created and attached to the POST method. The second param-
eter of the SerializableEntity constructor determines whether the object will
be buffered.
A message can contain one or multiple message headers describing properties of
the message, such as content type and content encoding. The example attaches
a header to the POST method with the name ”type” and the value ”integer”.
Message headers help a server to decide how to handle a request.
An HTTP response is the message sent back by the server as reply to a request,
implemented by the HttpResponse class. The first line of an HTTP response is
the status line, containing the protocol version followed by a numeric status
code and its textual description. In the example, the status line is retrieved
by calling the getStatusLine() method of the HttpResponse and printed to
the standard output. A successful request will usually print HTTP/1.1 200 OK,
indicating that protocol version 1.1 was used and the request was successful.
0.2.3 Entity Compression
The HTTP Components library provides a wide variety of functionality, though
there is no convenient way to compress entities attached to an HTTP request.
Therefore, we slightly modified the SerializableEntity into an entity called
CompressedSerializableEntity, providing compression of the contained seri-
alized object. The modified version works basically the same way as the original,
except that a compression filter stream is put between the ObjectOutputStream
and the ByteArrayOutputStream that actually writes the serialized object into
the buffer.
The information whether entity compression is enabled is stored in the header
of the HTTP request, so the slave application can decompress entities prior to
their usage.
Java provides three different compression filter streams in the JRE standard
library: ZIP, GZIP and raw deflate compression [18]. The ZipOutputStream
is an output stream filter for writing files in the ZIP file format; the
GZIPOutputStream is the equivalent for the GZIP file format. The
DeflaterOutputStream generates a stream compressing the data using the
deflate algorithm.
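The filter-stream sandwich described above can be sketched with the standard java.io and java.util.zip classes. The class and method names below are illustrative stand-ins, not the system's actual CompressedSerializableEntity:

```java
import java.io.*;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class EntityCompression {
    // Serializes an object through a deflate filter stream: the
    // compression stream sits between the ObjectOutputStream and the
    // byte buffer that would back the HTTP entity.
    public static byte[] serializeCompressed(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DeflaterOutputStream deflate =
                new DeflaterOutputStream(buffer, new Deflater(Deflater.BEST_SPEED));
        try (ObjectOutputStream out = new ObjectOutputStream(deflate)) {
            out.writeObject(obj);
        } // closing the ObjectOutputStream finishes the deflater
        return buffer.toByteArray();
    }

    // Counterpart the slave side would apply before handling the payload.
    public static Object deserializeCompressed(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new InflaterInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] payload = new int[100_000]; // highly compressible: all zeros
        byte[] compressed = serializeCompressed(payload);
        int[] restored = (int[]) deserializeCompressed(compressed);
        System.out.println(compressed.length < 400_000 && restored.length == payload.length);
    }
}
```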
An alternative compression library is unfortunately not an option, since the App
Engine runtime environment does not allow additional libraries besides the JRE
standard library. The slave application would therefore be incapable of
decompressing the payload if an alternative compression library were used.

[Figure: plot of compression time (milliseconds) versus array size for the zip,
gzip and deflate compression streams]
Figure 0.2: Compression time needed for the different compression streams.
In order to determine the best suited compression, we tested the different com-
pression streams in terms of compression efficiency and required runtime. For
this experiment, a one-dimensional integer array filled with random numbers
was used as raw data. For the compression efficiency, random numbers in the
range from 0 to 1,000 and numbers in the range from 0 to 10,000 were tested.
Using a smaller range of random numbers increases compression efficiency, since
there are fewer possible values that have to be encoded. The tests were performed
on a system with an Intel Core 2 Duo CPU at 2.4 GHz and 4 gigabytes of RAM.
Figure 0.2 shows a comparison of the compression times needed by the three
different streams. The times are almost identical, especially those of ZIP and
GZIP. Deflate, however, performs best throughout all tested data sizes. In
addition, for all three algorithms the runtime grows linearly with increasing
data size.
In terms of compression the streams performed equally well, though deflate-
compressed data was generally slightly smaller. The reason deflate performs
slightly better in both execution time and compression efficiency is the overhead
required for the ZIP and GZIP file format information. The streams most likely
even use the same internal compression routine.

[Figure: plot of compressed data size (bytes) versus integer array size,
comparing deflate with values in the range 0–1,000, deflate with values in the
range 0–10,000, and uncompressed data]
Figure 0.3: Data size for an integer array filled with random numbers.
Figure 0.3 shows the compression efficiency of the deflate algorithm (ZIP and
GZIP were omitted because their graphs overlap). As expected, the test using
random numbers in the range from 0 to 1,000 (green line) performs better than
the second test using numbers from 0 to 10,000 (red line).
0.3 Slave Types
A slave is a server, reachable by a distinct network address, executing an instance
of the slave web application. The slave application is actually intended to be
deployed to and executed on Google's App Engine servers. In order to provide
more flexibility, and a way to compare the performance of the App Engine server
infrastructure to known hardware, we included the possibility to address local
servers executing an instance of the slave application using the App Engine
development server. As already described, the development server is just a slim
web server simulating the App Engine API locally. Consequently, deployed
slaves will be referred to as App Engine slaves, and slaves running on machines
using the development server will be referred to as local slaves. In the following,
some basic considerations for each slave type are discussed, followed by a
concrete comparison of their properties.
0.3.1 App Engine Slaves
An App Engine slave is an instance of our slave application executed in the run-
time environment of the App Engine servers. In principle, one instance of the
slave application would be enough, since parallelism is achieved by sending mul-
tiple parallel HTTP requests. However, there are various restrictions imposed
on an App Engine application in terms of resource usage. In order to circum-
vent these restrictions, it is necessary to enable the system to distribute the work
among multiple instances of the slave application.
One App Engine account allows the user to create and deploy up to ten different
applications, each of them having a separate URI as well as separate resource
quota restrictions. Usually these are meant to be different applications, though
it is possible to deploy a single application multiple times. In terms of a web
application this would not make any sense, since each of the instances would be
reachable through a different address. For a scientific application that needs as
many resources as possible, though, it is a useful way to get additional resources.
As a consequence, the master application has to be able to distribute tasks to
different instances of the slave application, each reachable through a different
network address.
0.3.2 Local Slaves
As already mentioned instances of an App Engine application can also be exe-
cuted by the development server. Besides minor differences the instances behave
the same way as those deployed to Google’s infrastructure. There are even some
restrictions present for deployed applications that do not apply to an application
running on a development server. As a consequence, local computers executing
an instance of the slave application using the development server can be incor-
porated as additional computing resources. For example, an algorithm could be
distributed to a couple of deployed instances as well as to some instances running
on local cluster nodes using the development server.
The local nodes are typically machines with multiple CPU cores. So the concept
of sending multiple parallel HTTP requests to one instance in order to achieve
parallelism applies here as well. The development server handles each request in
a separate thread, thus automatically distributing the load on the available cores.
In principle it does not make a difference to the master application whether it
sends requests to a deployed instance, where the App Engine frontend manages
load balancing of parallel requests, or to an instance running on the development
server, where every request is handled by a separate thread. To the master
application it is only relevant how many requests an instance can handle in
parallel, which will be referred to as its queue. Further discussion on the impact
of the queue size is provided in Section ??.
Besides making the distributed system more flexible, the use of the development
server provides a way to compare the performance of the App Engine framework
to regular hardware. In addition, changes in the distributed system that might
have an impact on the runtime of algorithms can be tested more reliably on local
slaves, since measurements on Google’s App Engine infrastructure are oftentimes
biased due to background load on the servers or the network.
0.3.3 Comparison of Slave Types
In terms of interface and general behavior, both types of slaves are equivalent,
though there are still some differences that have to be considered:
1. Latency/Bandwidth: Generally, the network connection to a local slave
will be better, since it typically resides in the same local network as the
master application. App Engine slaves, on the other hand, are always ac-
cessed over the Internet and thus have a much slower connection. Moreover,
latency and bandwidth may often vary due to background load. A closer
analysis of App Engine's network behavior is provided in Section 5.
2. Hardware: For local slaves the underlying hardware is known. As a
result, rough calculation time estimates can be made and heuristics for
scheduling jobs can be applied. For App Engine slaves, on the contrary,
the underlying hardware is neither known nor are there any guarantees in
that respect. Multiple requests can be executed on completely different
hardware, even if the requests happen one after another within a small
time frame.
3. Reliability/Accessibility: A local slave's reliability depends on the proper
administration of the machine the instance is running on. Granting access
to the application by opening the corresponding ports is an administrative
issue as well. App Engine slaves, on the other hand, need no admin-
istration at all. Applications running on Google App Engine are highly
reliable and are accessible from everywhere over the Internet.
4. Restrictions: App Engine slaves have various restrictions; for example, a
request handler has to terminate within 30 seconds, otherwise the request
fails. Furthermore, the total quotas as well as the per-minute quotas are
limiting for App Engine slaves. For a local slave all these restrictions do
not apply, leaving the programmer more flexibility.
5. Services: The scalable services provided by the runtime environment
are only simulated by the development server and thus may differ in their
behavior. However, the slave application will only use the Datastore which
is provided sufficiently by the development server.
In conclusion, App Engine slaves are in general more difficult to handle program-
matically, because of the various unknown variables and the strict restrictions
of the runtime environment. This also means that incorporating support for
local slaves into the system does not require many adjustments in the code.
Table 0.4 shows a quick overview of the differences between local and App
Engine slaves.
Feature                      Local Slave                          App Engine Slave
Latency/Bandwidth            fast local network                   Internet
Hardware                     known; runtime estimates possible    completely unknown; may vary
Reliability/Accessibility    administration needed                highly reliable and accessible
Restrictions                 most restrictions do not apply       very restrictive (see quotas)
Services                     only simulated                       provided by Google infrastructure

Table 0.4: Properties overview of App Engine and local slaves.
0.4 The Master Application
The master application is a program written in Java that automatically invokes
the web interface provided by the slave application. It is responsible for the
generation and distribution of the parallel tasks, as well as for collecting and
assembling the partial results into a complete solution. Moreover it is responsible
for mapping jobs efficiently to the given slaves. Another important requirement
is a good fault tolerance mechanism, since requests may fail for various reasons.
In the following, the architecture of the master application will be described
starting with a general overview of the architecture, followed by a more detailed
description of the individual components and their responsibilities.
0.4.1 Architecture
Figure 0.4: Master application architecture.
Figure 0.4 shows the main components of the master application and their de-
pendencies. The main entry point to the system is the DistributionEngine
class. A client using the system instantiates the DistributionEngine, hand-
ing it an implementation of the JobFactory representing the parallel problem.
Furthermore, the DistributionEngine needs a list of URIs of reachable slave
instances. For every slave, a HostConnector is instantiated, managing the actual
connection to it and providing high-level control to the DistributionEngine.
The HostConnector associated with a slave instance is responsible for sup-
plying it with data and jobs. Each HostConnector has a reference to the
JobFactory and directly requests jobs from it and posts results of finished jobs.
For the actual HTTP connection, multiple threads have to be used for man-
aging the parallel data and job requests. For that purpose, a HostConnector
uses JobTransferThreads for every WorkJob it submits. The HostConnector
implements the ResultListener interface providing a callback method for the
JobTransferThreads to deliver the Result of finished jobs. These results are
then forwarded to the JobFactory which is responsible for assembling the final
result. The TransferThreads contain the code for building the actual HTTP
request matching the interface of the slave application.
Since the HostConnectors are responsible for supplying the slaves with tasks,
they implicitly determine the mapping of jobs. A closer description of the map-
ping strategy is provided in Section 0.4.3. The HostConnectors are also re-
sponsible for handling failed requests and possibly failed slave instances. Fault
tolerance mechanisms are discussed more closely in Section 0.4.4.
Some algorithms such as matrix multiplication require additional data to
be transferred besides the data associated with WorkJobs. For such algo-
rithms HostConnectors additionally manage a reference to a DataManager
that is responsible for transferring data to the slaves. The DataManager
itself uses DataTransferThreads which are just a slightly modified version of
JobTransferThreads. Data is generally transferred prior to the distribution of
jobs. Data transfers are split into multiple parallel HTTP requests. Section 0.6
provides a detailed description of the shared data management concept and the
underlying data transfer strategies.
0.4.2 Generating Jobs
A substantial part of the master's work is to correctly split the problem into
smaller work items that can be wrapped into WorkJobs. The system uses the concept
of a JobFactory, which is an interface that has to be implemented for a con-
crete parallel algorithm similar to the WorkJob interface. A class implementing
the interface carries the logic for generating appropriate WorkJobs that can be
submitted to the slave applications. The JobFactory is also responsible for
reassembling the partial results of the WorkJobs in order to produce a final
Result. For applications using shared data, the JobFactory class also provides
the serialized data that has to be sent separately.
Listing 5: The JobFactory interface.

public interface JobFactory {

    public WorkJob getWorkJob();
    public int remainingJobs();
    public void submitResult(Result r);
    public Result getEndResult();

    // only for applications using shared data
    public boolean useSharedData();
    public Serializable getSharedData();

}
The JobFactory manages a list of WorkJobs that have to be completed to solve
the algorithm with the given parameters. The getWorkJob() method returns
the next WorkJob ready for submission and null if there are no more jobs left to
execute. The remainingJobs() method returns the number of remaining jobs
that still have to be executed. This information is necessary for load balancing
purposes. For example, if there are three jobs left and five idle slaves available
the slaves with the fastest expected execution should be chosen first.
Results of completed WorkJobs are submitted to the JobFactory via the
submitResult() method. The class is responsible for assembling all the partial
results to a final result. WorkJobs and their corresponding Results must have
the same identifier, in order to allow proper assembly of the end result. Once
all the results are submitted the getEndResult() method provides the result
of the algorithm.
The useSharedData() method indicates whether the algorithm uses shared data
management. In case shared data is used, the serialized data can be requested
through the getSharedData() method. Shared data management is described
in more detail in Section 0.6.
Using the concept of a JobFactory, WorkJobs and Results, the logic of a
specific algorithm is decoupled from the rest of the system. For integrating an
additional algorithm in the system, a programmer simply needs to implement
the interface. In Section 0.7.1 concrete examples for implementing algorithms
in the system will be discussed.
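To make the interplay of these types concrete, the following self-contained sketch drives a hypothetical JobFactory sequentially on one machine. SumFactory, the nested WorkJob/Result types and the summing task are illustrative assumptions, not code from the framework; a real master would hand each WorkJob to a slave instead of calling run() locally.

```java
// Illustrative sketch: a sequential driver draining a JobFactory.
// All names below are stand-ins, not classes from the thesis framework.
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.Queue;

public class FactoryDriver {

    interface Result extends Serializable { int getId(); int getValue(); }
    interface WorkJob extends Serializable { int getId(); Result run(); }

    /** Minimal JobFactory mirroring the interface from Listing 5. */
    interface JobFactory {
        WorkJob getWorkJob();          // null when no jobs are left
        int remainingJobs();
        void submitResult(Result r);
        Result getEndResult();
    }

    /** Splits summing 1..n into fixed-size chunks, adds up partial sums. */
    static class SumFactory implements JobFactory {
        private final Queue<WorkJob> jobs = new ArrayDeque<>();
        private int total = 0;

        SumFactory(int n, int chunk) {
            int id = 0;
            for (int lo = 1; lo <= n; lo += chunk) {
                final int from = lo, to = Math.min(n, lo + chunk - 1);
                final int jobId = id++;
                jobs.add(new WorkJob() {
                    public int getId() { return jobId; }
                    public Result run() {
                        int s = 0;
                        for (int i = from; i <= to; i++) s += i;
                        final int sum = s;
                        return new Result() {
                            public int getId() { return jobId; }
                            public int getValue() { return sum; }
                        };
                    }
                });
            }
        }
        public WorkJob getWorkJob() { return jobs.poll(); }
        public int remainingJobs() { return jobs.size(); }
        public void submitResult(Result r) { total += r.getValue(); }
        public Result getEndResult() {
            return new Result() {
                public int getId() { return -1; }
                public int getValue() { return total; }
            };
        }
    }

    public static void main(String[] args) {
        JobFactory f = new SumFactory(100, 10);
        WorkJob j;
        while ((j = f.getWorkJob()) != null) {
            f.submitResult(j.run());  // a real master would send j to a slave
        }
        System.out.println(f.getEndResult().getValue()); // 5050
    }
}
```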
0.4.3 Job Mapping
Mapping in parallel computing represents the procedure of assigning the tasks of
a parallel computation to the available processors [14]. The main goal when
mapping tasks is to minimize the global execution time of the computation.
This is usually achieved by minimizing the overhead of the parallel execution.
Typical sources of overhead are communication and processors staying idle due
to an insufficient supply of work.
Communication overhead is minimized by avoiding unnecessary communication
between processors or machines. Avoiding idle processors requires good load
balancing, which means that the work should be distributed equally among
processors.
Mapping strategies can be roughly classified into two categories: static and
dynamic.
1. Static Mapping: Static mapping strategies decide the mapping of tasks
to the available processors before executing the program. Providing a
good mapping in advance is a complex and computationally expensive
task; in fact, finding an optimal mapping is NP-complete. Knowledge of
task size, data size, the underlying hardware and even the system
implementation is crucial. However, for most practical applications there
exist heuristics that produce fairly good static mappings.
2. Dynamic Mapping: Dynamic mapping strategies distribute the tasks
dynamically during execution of the algorithm. When there is insufficient
knowledge on the environment, static mappings can cause load imbalances.
In such cases, dynamic mapping techniques often yield better results.
As described earlier, parallelism is achieved by sending multiple tasks in parallel
to the web application, which then handles those requests in parallel. In case of
an App Engine slave, the requests might be handled in parallel on one machine
or on several different machines. In case of a local slave, the requests are handled
in separate threads, thus using multiple available CPU cores. The mapping to
cores therefore depends on when and how many requests are sent in parallel to
each slave managed by the system.
The basic mapping approach of our system is similar to a dynamic work pool.
A distributed system using a work pool approach typically has a central node
managing the parallel tasks and the computation nodes request tasks from the
work pool for computation. More advanced implementations sort the parallel
tasks in the work pool using a priority queue. Such an approach has the advan-
tage that work is given only to nodes that actually have free resources. Moreover,
if there are sufficiently many tasks of roughly the same size, almost no load
imbalances are to be expected even in a heterogeneous environment. Faster nodes
will automatically request more work, since they finish tasks faster, and slower
nodes will not be flooded with work they cannot handle.
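The priority-queue variant mentioned above can be sketched as follows; Task and the size-based ordering are illustrative assumptions rather than types from the framework.

```java
// Illustrative work pool ordered as a priority queue: larger tasks are
// handed out first, so a slow node is less likely to receive a big task
// late in the run. Task is a hypothetical stand-in type.
import java.util.PriorityQueue;

public class WorkPool {
    static class Task {
        final int id, size;
        Task(int id, int size) { this.id = id; this.size = size; }
    }

    // Largest task first.
    private final PriorityQueue<Task> pool =
        new PriorityQueue<>((a, b) -> Integer.compare(b.size, a.size));

    void add(Task t) { pool.add(t); }

    /** Called by (or on behalf of) a node with free resources. */
    Task request() { return pool.poll(); }   // null when the pool is empty

    /** A failed task is simply put back for reassignment. */
    void putBack(Task t) { pool.add(t); }

    public static void main(String[] args) {
        WorkPool p = new WorkPool();
        p.add(new Task(1, 10));
        p.add(new Task(2, 40));
        p.add(new Task(3, 25));
        System.out.println(p.request().id); // 2 (the largest task)
    }
}
```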
The mapping strategy of the system is inspired by the work pool approach.
The work pool is implemented by the JobFactory, which provides WorkJobs
on demand as well as the possibility to put back WorkJobs for reassignment. A
closer description of the JobFactory is provided in Section 0.4.2.
Because tasks are pushed to the nodes by HTTP requests the computation nodes
are not able to request additional work by themselves. Therefore every slave has
a HostConnector associated with it, managing the job retrieval and the posting
of results for the particular slave instance. Every slave has a queue size
indicating the number of parallel requests it can handle. Initially, the
HostConnector retrieves as many jobs as the queue size indicates and sends
them to the corresponding slave. Every time a job is finished, the
HostConnector retrieves a new job and sends it to the slave.
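The refill behavior just described can be sketched as follows. The integer job ids, the in-memory pool and the sent list (standing in for HTTP requests) are assumptions for illustration; the real HostConnector issues asynchronous HTTP requests instead.

```java
// Sketch of the pipelining idea behind a HostConnector: keep at most
// queueSize jobs in flight and refill whenever one completes.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ConnectorSketch {
    private final int queueSize;
    private final Queue<Integer> jobs;            // stand-in for the work pool
    final List<Integer> sent = new ArrayList<>(); // stand-in for HTTP sends
    private int inFlight = 0;

    ConnectorSketch(int queueSize, Queue<Integer> jobs) {
        this.queueSize = queueSize;
        this.jobs = jobs;
    }

    /** Fill the slave's request queue up to its capacity. */
    void fill() {
        while (inFlight < queueSize && !jobs.isEmpty()) {
            sent.add(jobs.poll());  // a real connector issues an HTTP request
            inFlight++;
        }
    }

    /** Called when the result of one job has arrived. */
    void onJobFinished() {
        inFlight--;
        fill();                     // immediately follow up with a new job
    }

    public static void main(String[] args) {
        Queue<Integer> pool = new ArrayDeque<>(List.of(1, 2, 3, 4, 5));
        ConnectorSketch c = new ConnectorSketch(2, pool);
        c.fill();                   // jobs 1 and 2 are now in flight
        c.onJobFinished();          // job 3 is sent as a replacement
        System.out.println(c.sent); // [1, 2, 3]
    }
}
```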
Different slave instances have different optimal queue sizes. For example, a slave
running on a machine with a larger number of CPU cores can handle more
requests in parallel than one running on a machine with only one core. For local
slaves the optimal queue size is usually the number of available processor cores.
For App Engine slaves it is a little more difficult to find an appropriate queue
size, since there is no information on the hardware available. Besides, the ex-
ecution time might be influenced by various factors such as other applications
sharing the same server and thus causing background load. The fact that subse-
quent requests might be handled on completely different hardware is problematic
as well. However, if the problem can be partitioned into jobs of similar size
and only App Engine slaves are used, the best choice is typically to distribute
the whole problem evenly right at the start of the algorithm: if the load is too
high, the excess requests get aborted and can be rescheduled.
0.4.4 Fault Tolerance
Fault tolerance in the context of a distributed system means guaranteeing that
every parallel task is executed and results are collected correctly in order to make
completion of the algorithm possible. Recoverable problems such as single slave
instances going offline should be recognized in a timely manner and handled
accordingly. If there are problems the system cannot recover from, for example
a complete loss of network connectivity, the system should persist its state in
order to make continuation of execution at a later time possible.
Retransmission of Requests
HTTP requests to the slave application may fail at any time for various reasons,
thus a mechanism for correctly handling failed requests forms an important
part of the system. Guaranteeing execution of the task associated with a request
requires either resending it until the request is performed correctly, or detaching
the task and putting it back into the work pool.
The best action for recovering from an error often depends on the cause of the
problem. First of all, requests may get lost due to an unreliable network; here
the best reaction is to resend the request as soon as possible. Another reason
can be a busy slave instance that is temporarily not able to handle additional
requests or a depleted resource. The best reaction in this case is to resend the
request as soon as the slave is able to receive additional requests. In some cases
a task cannot be executed correctly by one slave, while others might be able
to execute it without a problem. For example, a long task assigned to an App
Engine slave that repeatedly exceeds the runtime limitations could be executed
by a local slave without a problem.
Resending of requests is implemented in the DataTransferThread and the
JobTransferThread itself, in order to avoid creating a new thread every time
an HTTP request gets lost. Threads resend failed requests a configurable
number of times using an exponential backoff mechanism: a TransferThread
initially waits a small amount of time before resending a request, doubling the
wait time for every further attempt. This technique avoids flooding the network
with unnecessary HTTP requests that would be discarded anyway. Once the
maximum number of retries is reached, a TransferThread sets its state
to failed. The HostConnector regularly checks for failed JobTransferThreads
and tasks associated to a failed JobTransferThread are detached and put back
into the work pool in order to make reassignment to a different slave possible.
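A minimal sketch of this backoff scheme, with an initial wait and a sendOnce() method standing in for one HTTP attempt (both illustrative assumptions, not the thesis code):

```java
// Retry a request a bounded number of times, doubling the wait between
// attempts; the caller marks the transfer as failed if false is returned.
public class BackoffRetry {
    interface Request { boolean sendOnce(); }  // true on success

    /** Returns true if the request eventually succeeded. */
    static boolean sendWithBackoff(Request r, int maxRetries, long initialWaitMs)
            throws InterruptedException {
        long wait = initialWaitMs;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (r.sendOnce()) return true;
            if (attempt < maxRetries) {
                Thread.sleep(wait);
                wait *= 2;    // double the wait for every further attempt
            }
        }
        return false;         // caller sets the TransferThread state to failed
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        Request flaky = () -> ++calls[0] >= 3;
        System.out.println(sendWithBackoff(flaky, 5, 1)); // true
        System.out.println(calls[0]);                     // 3
    }
}
```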
Unlike single jobs, data has to be transferred to the exact slave it is intended
for, in order to make execution of the algorithm possible. As a consequence,
once a DataTransferThread fails, the corresponding slave has to be removed
from the list of available slaves and its associated HostConnector has to be
deactivated. Therefore, DataTransferThreads typically have a higher retry
count than JobTransferThreads, in order to avoid accidentally removing an
active slave.
Handling Offline Slave Instances
The HostConnector regularly checks for failed TransferThreads and once it
discovers a large number of failed requests, it suspends job transmission in order
to check the slave's availability. Ping requests are sent to check whether the
slave is still online; a ping request should cause the slave application to return
immediately with an empty response. If a ping request succeeds, or the result
of a previous job arrives successfully, job transmission is resumed.
However, if a certain number of ping requests fail the HostConnector assumes
its slave has gone offline, puts back all active tasks into the work pool and
deactivates itself. Optionally, the availability of slaves can be checked prior to
execution in order to avoid starting to send requests to inactive slaves.
Handling Loss of Connectivity
Once an unrecoverable fault is detected, such as a complete loss of connectivity,
the Distribution System tries to persist its state in order to continue execution
at a later point in time. This behavior is especially desirable for long running
algorithms, where an unrecoverable error would mean the complete loss of all
the already finished computation. The state of the problem is implicitly given
by the JobFactory class, which manages the open WorkJobs as well as the al-
ready computed partial Results. The framework provides the possibility to
make a JobFactory Serializable and to additionally implement the interface
Persistable.
Listing 6: The Persistable interface.
public interface Persistable {
    public void saveState();
    public void loadState();
}
Listing 6 shows the Persistable interface providing the methods saveState()
and loadState(). If a JobFactory implementation additionally implements
this interface, the system puts back all unfinished tasks into the work pool
and calls the saveState() method once an unrecoverable error is detected.
This provides the possibility to load the state of the JobFactory at a later
point in time and continue execution of the algorithm.
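One way a Serializable JobFactory might realize the Persistable idea is plain Java object serialization. The sketch below is an assumption, not the thesis implementation; in particular, saveState() here takes a target file, which Listing 6 leaves implicit.

```java
// Illustrative persistence of a factory's open jobs and partial results
// via Java object serialization; field types are simplified stand-ins.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class PersistableFactory implements Serializable {
    private static final long serialVersionUID = 1L;
    final List<Integer> openJobs = new ArrayList<>();
    final List<Integer> partialResults = new ArrayList<>();

    /** Persistable-style saveState, writing the whole factory to a file. */
    public void saveState(File f) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(this);  // open jobs and partial results included
        }
    }

    /** Counterpart of loadState, restoring the factory from a file. */
    public static PersistableFactory loadState(File f)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(f))) {
            return (PersistableFactory) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        PersistableFactory fac = new PersistableFactory();
        fac.openJobs.add(7);
        fac.partialResults.add(42);
        File f = File.createTempFile("factory", ".state");
        fac.saveState(f);
        PersistableFactory restored = loadState(f);
        System.out.println(restored.openJobs + " " + restored.partialResults);
        f.delete();
    }
}
```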
0.5 The Slave Application
The slave application is a web application written in Java using the Google App
Engine framework. As previously discussed, instances of the slave application
can either be App Engine slaves or local slaves. Basically, the slave application
does nothing more than receive small pieces of work, execute them and send
back the results to the master.
Figure 0.5: Activity diagram illustrating the control flow of the slave application.
Figure 0.5 shows a UML activity diagram visualizing the general control flow of
the slave application. The entry point of the slave application is an HTTP request
to the servlet's POST method. First of all, it has to be checked whether the entity
in the message body of the HTTP request is compressed and if so, the payload
has to be decompressed prior to further usage. The next step is to determine the
type of the request. There are different types of requests: job, data, clear and
ping. Each request type has to be treated differently. The corresponding meta
information of a request is stored in the form of message headers in the HTTP
request (see Section 0.5.3).
The most important request type is the job request. Such a request contains a
parallel task intended to be executed by the slave application. Once a job request
is identified, the job itself has to be extracted from the entity by deserializing
the data to a WorkJob object. The next step is to determine whether the job
needs shared data and if so, it has to be retrieved from the Datastore prior to
execution. After that, the WorkJob is executed by calling its run() method. The
result of the computation is then again stored in serialized form in the HTTP
response. If result compression is enabled the serialized object is additionally
compressed.
Data requests are used to transfer shared data that is used by all the jobs and
thus only has to be transferred once to each slave instance. A closer description
of the shared data management concept is provided in Section 0.6. Once a data
request is identified, the raw data is extracted and stored in the Datastore using
a wrapper data entity.
A clear request causes the slave application to delete the entire content of the
Datastore, in order to erase all saved state. A clear request is typically sent
after a successful or failed run of an algorithm in order to prepare the slave for
subsequent runs of the algorithm.
Ping requests are used to determine whether a slave instance is still online.
Once the slave application identifies a ping request, it immediately returns an
empty response. A closer description of the fault tolerance mechanisms is provided in
Section 0.4.4.
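The dispatch logic of Figure 0.5 can be sketched without the servlet API as a plain function over the headers of Section 0.5.3; the handler strings and the decompression stub below are placeholders, not framework code.

```java
// Simplified, framework-free sketch of the request dispatch in Figure 0.5.
// Header names follow Section 0.5.3; the handler bodies are placeholders.
import java.util.Map;

public class SlaveDispatch {
    /** Returns a short description of the action taken, for illustration. */
    static String handle(Map<String, String> headers, byte[] payload) {
        if ("enabled".equals(headers.get("compression"))) {
            payload = decompress(payload);  // placeholder for decompression
        }
        switch (headers.getOrDefault("type", "")) {
            case "job":   return "execute WorkJob, return serialized Result";
            case "data":  return "store shared data chunk in the Datastore";
            case "clear": return "delete all Datastore contents";
            case "ping":  return "return immediately with an empty response";
            default:      return "unknown request type";
        }
    }

    private static byte[] decompress(byte[] data) { return data; } // stub

    public static void main(String[] args) {
        System.out.println(handle(Map.of("type", "ping"), new byte[0]));
    }
}
```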
0.5.1 WorkJobs
A WorkJob is a piece of work that can be received and executed by the slave
application. It contains the algorithmic logic as well as the data needed for
execution. WorkJob itself is an abstract class defining the necessary methods
expected by the system:
Listing 7: The abstract WorkJob class.
public abstract class WorkJob implements Serializable {
    private int id;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public abstract Result run();

    // only needed for algorithms with shared data
    public abstract void fetchSharedData();
}
Every algorithm needs a specific implementation of a WorkJob that extends this
abstract class. WorkJobs always have to be serializable, since they are transferred
in serialized form.
The core of a WorkJob is the run() method which contains the algorithmic
logic of the job. The return value is of the type Result, which is again a generic
abstract class that needs to be extended when implementing a result class for a
specific algorithm.
How data needed for the algorithm is managed is left to the programmer im-
plementing the specific WorkJob. However, the class should only contain data
specific to the job. Transferring data that is used by multiple jobs within the
class would lead to redundant data transfers. Data shared by multiple jobs can
be sent separately and should be retrieved by invoking the fetchSharedData()
method. The concept of shared data management is described in more detail in
Section 0.6.
Every WorkJob has a unique identifier that has to be the same as the identifier
of the corresponding Result. This allows the master to correctly map WorkJobs
to their Results, which is necessary for assembling the solution of the algorithm.
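As a toy illustration of this pattern (not taken from the thesis), a WorkJob subclass that squares a number might look as follows; the nested abstract classes mirror Listings 7 and 8.

```java
// Hypothetical SquareJob/SquareResult pair following the WorkJob/Result
// pattern; only the id handling mirrors the framework's contract.
import java.io.Serializable;

public class SquareExample {
    static abstract class Result implements Serializable {
        private int id;
        int getId() { return id; }
        void setId(int id) { this.id = id; }
    }

    static abstract class WorkJob implements Serializable {
        private int id;
        int getId() { return id; }
        void setId(int id) { this.id = id; }
        abstract Result run();
    }

    static class SquareResult extends Result {
        final int value;
        SquareResult(int value) { this.value = value; }
    }

    static class SquareJob extends WorkJob {
        private final int x;             // job-specific data only
        SquareJob(int x) { this.x = x; }
        Result run() {
            SquareResult r = new SquareResult(x * x);
            r.setId(getId());            // Result must carry the same id
            return r;
        }
    }

    public static void main(String[] args) {
        SquareJob job = new SquareJob(6);
        job.setId(3);
        SquareResult r = (SquareResult) job.run();
        System.out.println(r.getId() + " " + r.value); // 3 36
    }
}
```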
0.5.2 Results
The slave application returns a serialized instance of the class Result wrapped
in the HTTP response. Result is an abstract class that every algorithm-specific
result implementation must extend:
Listing 8: The abstract Result class.
public abstract class Result implements Serializable {
    private int id;
    private long calculationTime;

    public long getCalculationTime() { return calculationTime; }
    public void setCalculationTime(long calculationTime) {
        this.calculationTime = calculationTime;
    }
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
}
The class only defines the id used for relating the result to the corresponding
WorkJob and a field storing the execution time of the run() method. The actual
data types for returning the results must be defined in the algorithm specific
implementation.
In the field calculationTime the execution time needed for the run() method
is stored. This value represents the time spent doing useful computations and
is used to determine the ratio between parallelization overhead and the actual
computation time.
Results can optionally be returned in compressed form, in order to reduce
the amount of data to be transferred. The master application indicates that it
expects a compressed result in the HTTP message header (see Section 0.5.3).
0.5.3 Message Headers
HTTP requests contain header fields used for transferring meta-information,
such as which encoding is accepted by a browser. A header field has a name and a
corresponding value, which is usually a string. Besides the standard header
fields, self-defined custom headers can be used for transferring information. The
slave application uses these header fields to decide how to treat requests.
In the following the parameters used by the slave application are listed:
• type: The type field indicates the kind of request transferred, how it has
to be handled by the application and the kind of data contained in the
payload.
– job: A job request is a computational task to be executed by
the slave. The payload of the request contains the corresponding
WorkJob.
– data: A data request serves as a means to transfer data to the
application. The payload contains shared data to be stored in the
Datastore.
– clear: A clear request causes the slave to clear all stored data. It
contains no data in the payload. Such a request is typically sent after
all jobs have finished to reset the application.
– ping: A ping request is used to determine whether a slave is reachable.
The application should respond immediately with an empty response.
– retrieve: A retrieve request causes the application to read the contents
of the Datastore and send them back in the response. This request type
is only used for debugging purposes.
• compression: The compression field indicates whether the payload of
the request is compressed and therefore has to be decompressed prior to
usage.
– enabled: The enabled flag indicates that request compression is en-
abled.
– disabled: The disabled flag indicates that request compression is
disabled.
• resultCompression: The resultCompression field indicates whether the
result should be compressed before returning it.
– enabled: The enabled flag indicates that result compression is en-
abled.
– disabled: The disabled flag indicates that result compression is dis-
abled.
• sharedData: The sharedData field indicates whether the corresponding
WorkJob uses shared data and therefore whether shared data has to be
retrieved prior to execution of the job.
– true: The true flag causes the slave to invoke the fetchSharedData()
method prior to the run() method.
– false: On the contrary, the false flag causes the slave to invoke the
run() method immediately without retrieving further data.
• benchmark: The benchmark field indicates whether database operations
should be performed when sending data requests. The field is used for
deactivating database operations, in order to more precisely measure
transfer times.
– true: The true flag causes the slave to discard the transferred data
and return immediately.
– false: The false flag causes the slave to store the enclosed data in
the Datastore.
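On the master side, these headers might be set on a POST request roughly as follows; the slave URL is a placeholder, and the sketch stops before a request body is written, so no network traffic actually occurs.

```java
// Sketch of setting the custom headers of Section 0.5.3 on a job request.
// openConnection() does not connect; the connection is only prepared.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderExample {
    static HttpURLConnection prepareJobRequest(URL slave) throws IOException {
        HttpURLConnection con = (HttpURLConnection) slave.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("type", "job");            // a job request
        con.setRequestProperty("compression", "enabled"); // payload compressed
        con.setRequestProperty("resultCompression", "enabled");
        con.setRequestProperty("sharedData", "false");
        // The serialized WorkJob would then be written to
        // con.getOutputStream(), which is the point where the
        // connection would actually be opened.
        return con;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection con =
            prepareJobRequest(new URL("http://example.appspot.com/slave"));
        System.out.println(con.getRequestProperty("type")); // job
    }
}
```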
0.6 Shared Data Management
The naive approach for transferring data from the master to its slaves is to
simply attach all the necessary data to each job. However, various parallel
algorithms have shared data that has to be accessed by all of the jobs. The fact
that a single slave will compute multiple jobs results in unnecessary redundant
data transfer from the master to the slaves. Especially if communication takes
place over relatively slow networks such as the Internet, this can result in a
major bottleneck. A common example would be a parallel implementation of
the matrix multiplication algorithm, where one matrix is shared and has to
be accessed by each job. So in principle the shared matrix only has to be
communicated once to each slave.
For this reason we introduced data requests besides regular job requests in the
system. Using shared data is optional though, since not every algorithm needs
shared data, and in some cases the overhead of sending data multiple times
might be acceptable for other reasons. The parallel computation process is thus
split into two phases: first, the shared data is transferred to each slave and
stored; second, the jobs with the actual computation tasks are distributed among
the slave instances.
As described in Section 2 the runtime environment of the App Engine framework
restricts any access to the file system. Consequently we had to use the Datastore
service in order to store shared data in a way that all the jobs executed on one
slave can access the data. The development server simulates the Datastore by
storing data in a single file in the file system, since it is usually used for testing
purposes only. This may seem inefficient, yet the jobs do not have to query for
data but usually need the whole shared data stored. Therefore, it is still more
efficient to read the data from the file system than to communicate it over a slow
network. For an application running on Google's servers it is not guaranteed
that every request will be executed on the same hardware (though if possible it
is preferred). However, the Datastore service manages proper access to the data
for every request, thus such an application can logically be treated as a single
slave. From a performance viewpoint, the Datastore service will again in most
cases be better than plainly sending the data multiple times.
0.6.1 Data Splitting
We chose to split data into multiple chunks for data transfer for two reasons.
First of all, the Datastore service allows a single storage entity to have a max-
imum size of one megabyte. Besides, HTTP has limitations on how much data
one request is allowed to carry in its payload as well. In order to avoid these
limitations, the shared data has to be split and stored in several parts. Once a
job needs to access the data, it simply reads all the chunks and reassembles them.
The second major advantage is that by splitting data, multiple TCP streams
are used for transmission, which can often improve transfer speed notably. Es-
pecially in our case, where data transfers typically last only a couple of seconds,
multiple streams help hide the effects of the TCP startup phase [13].
Of course, the overhead produced by transferring an additional HTTP header
for each separate data chunk has to be considered as well. So a good ratio
between the total data to be transferred and the data transferred in a single
chunk has to be found. An experimental analysis of the splitting factor is
provided in Section 0.11.2.
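The chunking described above can be sketched as follows; the demo uses a tiny chunk size in place of the one-megabyte Datastore limit, and the method names are illustrative.

```java
// Minimal sketch of splitting shared data into chunks below a size limit
// and reassembling them on the slave side.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DataSplitter {
    static List<byte[]> split(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            chunks.add(Arrays.copyOfRange(data, off,
                    Math.min(data.length, off + chunkSize)));
        }
        return chunks;
    }

    static byte[] reassemble(List<byte[]> chunks) {
        int total = chunks.stream().mapToInt(c -> c.length).sum();
        byte[] data = new byte[total];
        int off = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, data, off, c.length);
            off += c.length;
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] shared = new byte[10];
        for (int i = 0; i < shared.length; i++) shared[i] = (byte) i;
        List<byte[]> chunks = split(shared, 3);  // chunks of 3+3+3+1 bytes
        System.out.println(chunks.size());       // 4
        System.out.println(Arrays.equals(shared, reassemble(chunks))); // true
    }
}
```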
0.6.2 Data Transfer Strategy
An important consideration when transferring data to multiple slaves is whether
to transfer the data to all of them in parallel or sequentially, one after another.
If there are separate independent network links to the slaves, the parallel
approach is clearly the best strategy.
However, this is typically not the case in a practical scenario and outgoing
bandwidth to the hosts is shared. Over the Internet the most likely network
bottleneck is the upload bandwidth of the master. In a local network the nodes
are usually interconnected by a switch or a router. In a typical homogeneous
network topology like in Figure 0.6, the bandwidth bottleneck is already the link
to the switch.
In a heterogeneous network topology, where the master has a considerably faster
connection to the switch like in Figure 0.7, data transfers won't affect each other;
thus sending data in parallel is clearly the best option.
Figure 0.6: Typical homogeneous local network topology.
Assuming however that the network bandwidth to the slaves is shared, like in
Figure 0.6, it is better to send the data to the slaves one after another. A slave
needs the complete shared data in order to start the computation, so partial
data is not useful to a slave. Assuming a topology like in the homogeneous
example, with gigabit links and 1000 Mbit of data to be transferred to each of
the three hosts, it takes three seconds to broadcast the data to all slaves. When
transferring the data in parallel, every slave can start computing only after these
three seconds.
However, if data is sent separately to the slaves it takes one second for each
transfer, since the whole bandwidth can be used. So after one second the data
transfer to the first slave is finished and a job can be assigned, so the first slave
can start its computation. Consequently it takes another second to transfer the
data to the second slave, which can start the computation after a total of two
seconds. After three seconds, the third slave can start computing as well. In
total, we gain three slave-seconds of computation time (two for the first slave
and one for the second) compared to transferring the data in parallel, by
enabling the first two slaves to start their computations earlier. As a result, in
a scenario where the bandwidth to all the slaves is shared, a sequential data
transfer strategy should be preferred over a parallel data transfer.
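The timing argument above can be checked with a small model; the bandwidth figures follow the worked example, and the model ignores protocol overhead.

```java
// Worked check of the example: 1000 Mbit to each of three slaves over a
// shared 1000 Mbit/s uplink, parallel broadcast vs. sequential transfer.
public class TransferModel {
    /** Seconds until any slave can start under a parallel broadcast. */
    static double parallelStart(int slaves, double dataMbit, double linkMbit) {
        // Everyone shares the link, so all wait for the full broadcast.
        return slaves * dataMbit / linkMbit;
    }

    /** Seconds until slave i (0-based) can start, sequential transfer. */
    static double sequentialStart(int i, double dataMbit, double linkMbit) {
        return (i + 1) * dataMbit / linkMbit;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.printf("slave %d: parallel %.0fs, sequential %.0fs%n",
                i, parallelStart(3, 1000, 1000),
                sequentialStart(i, 1000, 1000));
        }
        // slave 0: parallel 3s, sequential 1s
        // slave 1: parallel 3s, sequential 2s
        // slave 2: parallel 3s, sequential 3s
    }
}
```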
Figure 0.7: Heterogeneous local network topology, with superior master node.

When transferring data sequentially, the next consideration is in which order to
transfer the data to the slaves. In general, slaves to which a faster network link
is available and slaves that have more computing power are preferable. The
data is transferred faster to slaves with more bandwidth available thus enabling
them to begin their computation earlier. On the other hand the system benefits
more from faster slaves beginning their computation early. So if in the previous
example one of the three slave nodes is notably faster than the other two, it
is better to transfer the data first to the fast node. In practice most of these
variables are unknown or have to be configured by hand.
0.6.3 Performance Evaluation
In the following a performance evaluation of the described techniques for shared
data management is presented. We tested the matrix multiplication algorithm
in the computer rooms (Intel Core 2 Duo CPU with 3 GHz and 2 GB RAM)
with four slave nodes and on the karwendel cluster (see Section 0.10.4 for
hardware details) with two slave nodes, using no shared data management as
well as shared data management with a parallel and a sequential data transfer
strategy. For the experiments, square matrices with different dimensions were
tested.
Figure 0.8 shows the results for the experiment performed in the computer
rooms, using four slave nodes. As expected there is a huge performance gain
using shared data management, because shared data, in this case the first matrix
of the multiplication, is transferred only once to each slave instead of alongside
with every job. For larger matrix dimensions the performance gain becomes
even more relevant, since more data has to be transferred. The sequential data
transfer strategy is slightly faster than the parallel one throughout all matrix
dimensions. Since the nodes in the computer rooms are organized in a
homogeneous local network, these results were to be expected.
Figure 0.8: Completion time for square matrix multiplication using no shared
data management, shared data management with parallel and sequential data
transfer (x-axis: matrix dimension, 500-1500; y-axis: completion time in
milliseconds).
Figure 0.9 shows the results of the experiment on two karwendel cluster nodes.
For a hardware specification of the karwendel cluster see section 0.10.4. At first
sight, the much lower performance gain when using shared data management is
noticeable. The cluster nodes on karwendel are interconnected by an InfiniBand
network, which is considerably faster than the gigabit network of the computer
rooms, thus diminishing the impact of transferring redundant data. The benefit
of using a sequential data transfer strategy is also less noticeable.
Figure 0.9: Completion time for square matrix multiplication using no shared
data management, shared data management with parallel and sequential data
transfer (x-axis: matrix dimension, 1000-2000; y-axis: completion time in
milliseconds).
Algorithms
In the prior section, the functionality of the Distribution Framework was de-
scribed in detail focusing on the job management aspects of the system.
This section describes some sample parallel algorithms implemented using the
Distribution Framework. The algorithms were used for analyzing the system
in respect to its scalability. Moreover, they were used to identify algorithm
properties that are well suited for distribution on Google App Engine.
The description for each algorithm is structured in a general description of the
algorithm, the idea used for parallelization, and the concrete implementation.
This section therefore also provides a documentation on how to integrate a par-
allel algorithm into the system. The concrete implementation of the algorithms
is documented by code extracts of the WorkJob, Result and JobFactory
implementation. However, the code is reduced to the parts relevant to the
algorithm, omitting for example methods such as getters and setters or security
checks for parameters.
0.7 Monte Carlo Routines
Monte Carlo routines are a class of algorithms heavily based on random number
generation, whose results are obtained by repeated random sampling [17]. Monte
Carlo algorithms are usually used for problems where applying a determinis-
tic algorithm would be computationally unfeasible. Typical applications are
simulations of physical and mathematical systems.
0.7.1 Pi Approximation
Pi is a mathematical constant stating the ratio between a circle's circumference
and its diameter, which can be approximated through a simple Monte Carlo
simulation [17]. The algorithm can be parallelized efficiently without much
effort. The general idea of the algorithm is to inscribe a circle into a square. By
generating uniformly distributed random points within the square and counting
how many of them lie within the circle, one can approximate Pi.
Algorithm
Figure 0.10: Illustration of the Monte Carlo Pi calculation.
Figure 0.10 illustrates the principle of the algorithm. The example uses a square
with a side length of 1, into which a circle with a radius of 0.5 is inscribed.
Random points are generated within the square; points falling within the circle
are marked red, whereas points outside of the circle are black. The expected
fraction of points within the circle is π/4, so by counting how many of the
random points lie within the circle, Pi can be approximated. Assuming P
random points were generated and M points reside within the circle (red), Pi
can be approximated by applying the formula: π ≈ 4 · M/P.
The algorithm relies on the fact that once enough points have been generated,
the points will be equally distributed on the square. Since the algorithm is based
on random numbers, the accuracy of the computation after a fixed number of
iterations can only be stated with a certain error probability, though the law of
large numbers states that the accuracy generally increases with a larger number
of generated points.
Parallelization
Listing 9 shows a pseudo code illustration of the Pi Approximation's main loop.
The loop iterates from zero to P, which is the total number of random points
that are generated. First, two random numbers representing the x and y
components of a point are generated. These are private variables and therefore
do not have any dependencies; in fact, the distance() function could be called
directly with two random values without first storing them in variables. The
distance() function calculates the distance of the generated point to the center
of the circle. The function is reentrant, since it has no global variables, does not
modify any arguments and has no side effects. If the distance is smaller than
the radius of the circle, the point lies within the circle and the counter variable
M is incremented by one. As long as the increment operation is atomic, it does
not matter in which order the variable is incremented, thus making the
iterations of the loop independent.
Listing 9: Pseudo code of the Pi Approximation algorithm.
for i = 0 to P do
    x = random()
    y = random()
    dist = distance(x, y)
    if dist < R do
        M = M + 1
    end
end
In a Master-Slave system with n parallel machines available, each machine is
assigned a number of points Pn to compute. The machines can independently
generate Pn points and count the number of points residing within the circle,
resulting in Mn points within the circle per machine. The master node then
collects the results of every slave, sums up the numbers of generated points Pn
and the numbers of points within the circle Mn, and finally computes Pi
according to the same formula as in the sequential algorithm. The code for the
slave nodes is the same as for the sequential version of the algorithm, with the
slight modification that the number of points within the circle is returned
instead of immediately calculating Pi.
Beyond its simplicity, the algorithm has some properties which make it very
well suited for distribution. First of all, there is almost no data that has to
be transferred between the master and its slaves. The parameters as well as the
results are single integer numbers. This is very beneficial for a system where
data transmission is relatively slow. Moreover, there is no need for managing
shared data since the parameters of each slave are independent.
Furthermore, the individual size of jobs can be chosen freely by adjusting the
number of points to calculate. Jobs also do not need to have the same size
and, as a consequence, good load balancing is easily achievable.
Implementation
Listing 10 shows a simplified implementation of the JobFactory interface. In
lines 3-6 the necessary fields are initialized. P is the total number of points
that have to be generated, M stores the number of generated points that fell
within the circle, numJobs represents the number of parallel jobs that should be
generated and remainingJobs is a counter variable for the remaining jobs.
The algorithm is initialized by a constructor in lines 8-12, with the desired
number of points to generate and the number of parallel jobs. The getWorkJob()
method initializes a PiJob (line 18) and sets its fraction of points to generate
(line 19), as well as its identifier (line 20). The points to generate are equally
distributed among the jobs: for each job, the total number of points is divided by
the number of parallel tasks. For the sake of simplicity, we omitted the distribu-
tion of remaining points. As a consequence, if the number of points in this code
sample is not divisible without remainder by the number of parallel tasks,
slightly fewer points are generated. Finally the counter variable for remaining
jobs is decremented. The submitResult method in lines 29-32 simply adds
the points of each PiResult to the total number of points within the circle.
Listing 10: Implementation of the JobFactory for the Monte Carlo Pi
algorithm.
1 public class PiApproximation implements JobFactory{
2
3 private int P = 0; // number of points
4 private int M = 0; // number of points in circle
5 private int numJobs = 0;
6 private int remainingJobs = 0;
7
8 public PiApproximation(int P, int numJobs){
9 this.P = P;
10 this.numJobs = numJobs;
11 this.remainingJobs = numJobs;
12 }
13
14 public synchronized WorkJob getWorkJob () {
15 if(remainingJobs < 1)
16 return null;
17
18 PiJob j = new PiJob ();
19 j.setP(P/numJobs);
20 j.setId(numJobs -remainingJobs);
21 remainingJobs --;
22 return j;
23 }
24
25 public int remainingJobs () {
26 return remainingJobs;
27 }
28
29 public synchronized void submitResult(Result r) {
30 PiResult pires = (PiResult) r;
31 M += pires.getM();
32 }
33 }
Listing 11 shows the concrete implementation of the algorithm's WorkJob, called
PiJob. The field P (line 3) again defines the number of points to be generated by
the concrete job, and M (line 4) the portion of these points that fell within the
circle. The variables x and y represent the x and y values of the random points.
From line 9 to 12 the circle and its center are defined, assuming a circle with a
radius of 0.5 and consequently a square with side length 1. The center of the
circle is placed at the point (0.5, 0.5). The dist variable (line 13) is used for
storing the Euclidean distance of the current random point from the center of
the circle.
The run() method (lines 15-38) performs the actual computation. First of all
the random number generator is initialized (lines 16-17). The main loop then
computes random points within the square by generating a random x value and
a random y value between 0 and 1 for each point (lines 20-22). In the next step
the point's distance from the center of the circle is calculated, which is given
by the Euclidean distance between the random point (x, y) and the center point
(0.5, 0.5) (lines 24-27). Now it can easily be determined whether the point lies
within the circle by checking if the distance from the center is smaller than
the radius of the circle (lines 29-31). Finally the total number of points
within the circle M is stored in the Result and returned (lines 34-37).
Listing 11: Implementation of the PiJob class.
1 public class PiJob extends WorkJob{
2
3 private int P = 0; // number of points
4 private int M = 0; // number of points in circle
5
6 private double x = 0;
7 private double y = 0;
8
9 // define circle R = 0.5; diameter of square is 1
10 private double R = 0.5;
11 private double center_x = 0.5;
12 private double center_y = 0.5;
13 private double dist = 0;
14
15 public Result run() {
16 Random r =
17 new Random(System.currentTimeMillis ());
18
19 for(int i = 0; i < P; i++){
20 // generate Random Point
21 x = r.nextDouble ();
22 y = r.nextDouble ();
23
24 // Euclidean distance from center
25 dist = Math.sqrt (((x - center_x)*
26 (x - center_x)) + ((y - center_y)*
27 (y - center_y)));
28
29 if(dist < R){
30 M++;
31 }
32 }
33
34 PiResult res = new PiResult ();
35 res.setM(M);
36
37 return res;
38 }
39 }
Listing 12 shows the implementation of the algorithm's result class used for
returning the results of the computation. Each job calculates the number of
points that were generated within the circle, so the result of each job is the
integer value M (line 2).
Listing 12: Implementation of the PiResult class.
1 public class PiResult extends Result {
2 private int M;
3
4 public int getM() {
5 return M;
6 }
7 public void setM(int m) {
8 this.M = m;
9 }
10 }
0.7.2 Integration
Monte Carlo based integration is a form of numerical integration based on ran-
dom numbers. The same idea as for the Pi algorithm is applied: by sampling
random points in the interval of the function and measuring how many of the
random points lie under the function, the integral can be approximated [17].
Algorithm
The simplest form of integration based on random numbers follows the same
idea as the Monte Carlo Pi approximation. Figure 0.11 illustrates the basic idea
of the algorithm. The function f1 has to be integrated within the boundaries a
and b. First of all, a bounding rectangle is defined, having the width of the
integration interval and the height of the largest function value of f1 within
the boundaries. In the figure the bounding rectangle is indicated by the dotted
lines.
The next step is to generate random points within the bounding rectangle and
measure how many of the points reside between the function and the x-axis
(red) and how many lie above the function (black). Assuming a total of N
points are generated, M points are under the function and the bounding rectangle
has the area A, we can approximate the integral by the following formula:
∫_a^b f1(x) dx ≈ M/N ∗A
Figure 0.11: Illustration of the Monte Carlo integration.
However, a more efficient way to approximate an integral is to sample random
function parameters within the boundaries and evaluate the function with the
given parameters. By summing up the evaluated function values, dividing the
sum by the number of samples and multiplying by the width of the integration
interval, the integral of the function can be approximated as well. Assuming we
generate N random function parameters xn within the boundaries, the integral
can be approximated by the following formula:
∫_a^b f1(x) dx ≈ (b− a)/N ∗ ∑_{n=1}^{N} f1(xn)
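This mean-value estimator can be sketched in a few lines of self-contained Java. The class name, test function, interval and seed are illustrative choices, not part of the thesis framework; note that the average of the sampled function values must be scaled by the interval width (b − a) to yield the integral:

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Mean-value Monte Carlo integration: average N random samples of f
// over [a, b] and scale by the interval width (b - a).
public class MonteCarloIntegrate {
    public static double integrate(DoubleUnaryOperator f, double a, double b,
                                   int N, long seed) {
        Random rnd = new Random(seed);
        double sum = 0;
        for (int i = 0; i < N; i++) {
            double x = a + rnd.nextDouble() * (b - a); // uniform sample in [a, b)
            sum += f.applyAsDouble(x);
        }
        return (b - a) * sum / N; // average function value times interval width
    }

    public static void main(String[] args) {
        // The integral of x^2 over [0, 1] is 1/3; the estimate converges to it.
        System.out.println(integrate(x -> x * x, 0, 1, 1_000_000, 7L));
    }
}
```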
Parallelization
Listing 13 shows a short pseudo code illustration of the algorithm's main loop.
The loop iterates from zero to N, which is the total number of random function
parameters that are sampled. First of all a random function parameter x has
to be sampled. x again is a private variable and thus produces no dependencies
among iterations. The function f() evaluates the function value for the given
parameter and is obviously reentrant. Finally the function value is added to the
current sum. Similar to the Pi Approximation algorithm, it does not matter in
which order the values are summed up as long as the add operation is atomic.
Listing 13: Pseudo code of Monte Carlo Integrations main loop.
1 for i = 0 to N do
2 x = random ()
3 value = f(x)
4 sum = sum + value
5 end
The problem can be split by distributing the total number of random function
parameters among several parallel tasks. Every task consists of the same al-
gorithm, but with a smaller portion of points to sample, resulting in lower
precision per task. In the end, the master process collects all the results and
averages them, thus gaining the same precision as a sequential run with the
total number of points.
The algorithm has the same strengths as the Monte Carlo Pi approximation
when it comes to parallelization. There is very little data to transfer for the
parameters and for the results. Jobs can be almost freely sized with no need for
shared data.
Implementation
Listing 14 illustrates the implementation of the JobFactory interface for the
Monte Carlo Integration algorithm. The algorithm is initialized by its pub-
lic constructor (lines 16 to 24). The parameters include the function that is
integrated, the number of function parameters N to generate, the integration
boundaries and again the number of parallel jobs. The parameters are stored in
their corresponding private fields defined at the beginning of the class (lines 3 to
11). In addition, a list called results (line 12) holding the value of each returned
result as well as a field called area (line 14) storing the final result are defined.
The getWorkJob method (lines 26 to 35) is very similar to the Pi approximation's
implementation. The method initializes an IntegrationJob with the needed
parameters and the number of function parameters the job has to sample (line
30). The sampled points are again equally distributed among the jobs
by simply dividing the total number of points by the number of parallel tasks.
The identifier is set, the counter for remainingJobs is decremented and the
generated job is returned (lines 32 to 34).
The submitResult method (lines 37 to 47) stores the result of each partial area
computation in the results list (line 38). Once the last result has arrived, the
distinct areas are averaged in order to obtain the final result (lines 41 to 46).
Listing 14: Implementation of the Monte Carlo Integration JobFactory.
1 public class MonteCarloIntegration implements JobFactory{
2
3 // function to integrate
4 private IntegrationFunction f;
5 // iterations
6 private int N;
7 // integration boundaries
8 private float x1;
9 private float x2;
10 private int numJobs;
11 private int remainingJobs;
12 private ArrayList <Double > results =
13 new ArrayList <Double >();
14 private double area;
15
16 public MonteCarloIntegration(IntegrationFunction f
17 , int N, float x1 , float x2 , int numJobs) {
18 this.f = f;
19 this.N = N;
20 this.x1 = x1;
21 this.x2 = x2;
22 this.numJobs = numJobs;
23 this.remainingJobs = numJobs;
24 }
25
26 public synchronized WorkJob getWorkJob () {
27 if(remainingJobs < 1)
28 return null;
29
30 IntegrationJob j =
31 new IntegrationJob(f, N/numJobs , x1 , x2);
32 j.setId(numJobs -remainingJobs);
33 remainingJobs --;
34 return j;
35 }
36
37 public synchronized void submitResult(Result r) {
38 results.add ((( IntegrationResult)r).
39 getArea ());
40
41 if(remainingJobs < 1){
42 for(double d : results){
43 area += d;
44 }
45 area = area/results.size();
46 }
47 }
48 }
Listing 15 shows the implementation of the IntegrationJob class. The first part
of the class again consists of the fields (lines 2 to 8), initialized by the public
constructor (lines 10 to 16).
The run() method (lines 18 to 31) is responsible for the computation. First of all
the random number generator is initialized (lines 19-20), as well as a variable
x holding the function parameter (line 21) and a variable sum holding the sum
of the evaluated function values (line 22). The core of the function is the main
loop, which generates random function parameters within the integration
boundaries and sums up the evaluated results (lines 24-27). IntegrationFunction
is an interface providing the method f(), which has to be implemented so that it
evaluates the function value for a given parameter. Finally, according to the
formula, the summed up function values are divided by the number of generated
points and scaled by the width of the integration interval in order to obtain the
approximated integral (line 29), which is returned using the IntegrationResult
class (line 30).
Listing 15: Implementation of the IntegrationJob class.
1 public class IntegrationJob implements WorkJob{
2 // function to integrate
3 private IntegrationFunction f;
4 // iterations
5 private int N;
6 // integration boundaries
7 private float x1;
8 private float x2;
9
10 public IntegrationJob(IntegrationFunction f,
11 int N, float x1 , float x2) {
12 this.f = f;
13 this.N = N;
14 this.x1 = x1;
15 this.x2 = x2;
16 }
17
18 public Result run() {
19 Random r =
20 new Random(System.currentTimeMillis ());
21 double x = 0;
22 double sum = 0;
23
24 for(int i = 0; i < N; i++) {
25 x = x1 + r.nextDouble ()*(x2 - x1);
26 sum+= f.f(x);
27 }
28
29 double area = (x2 - x1)*sum/N;
30 return new IntegrationResult(area);
31 }
32 }
Listing 16 shows the result class for the Monte Carlo Integration. Again the
desired data type, in this case a double, is wrapped in a class that is used for
transferring the result back to the master application.
Listing 16: Implementation of the IntegrationResult class.
1 public class IntegrationResult extends Result{
2 double area;
3 public IntegrationResult(double area) {
4 this.area = area;
5 }
6 public double getArea () {
7 return area;
8 }
9 public void setArea(double area) {
10 this.area = area;
11 }
12 }
0.8 Matrix Multiplication
Multiplying two matrices is a common task in mathematics and thus algorithmic
approaches for the problem are well studied. Besides being an important
problem itself, matrix multiplication is equivalent to various other problems, such
as transitive closure and reduction, solving linear systems, and matrix inver-
sion [23]. Moreover it is commonly used in computer graphics for computing
coordinate transformations. As a consequence, a faster matrix multiplication
algorithm makes all algorithms that are based on matrix multiplication faster.
Because of the nature of matrix operations the algorithm is well suited for
parallelization and is probably one of the major textbook examples for parallel
computing. First the sequential matrix multiplication will be described
briefly, followed by an explanation of the parallel approach; finally the
implementation using the Distribution Framework will be presented.
0.8.1 Algorithm
The most common form of matrix multiplication is the ordinary matrix product.
It is defined between two matrices A and B. The matrices can only be multiplied
if the width of A equals the height of B. The resulting matrix C has the height of
matrix A and the width of matrix B. So multiplying matrices with dimensions
m×n and n×p results in an m×p matrix. Moreover, the ordinary matrix product
is not commutative: multiplying two n × n matrices A by B will generally
not yield the same result as multiplying B by A.
The entry Ci,j of the result matrix is defined as follows:
For A ∈ M_{m×n}, B ∈ M_{n×p} and C ∈ M_{m×p}:
C_{i,j} = (AB)_{i,j} = ∑_{k=1}^{n} A_{i,k} ∗B_{k,j}
Calculating all the entries for i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p yields the
result matrix C.
The naive algorithm strictly follows the mathematical definition:
Listing 17: Naive algorithm for matrix multiplication.
1 for i=1 to m do
2 for j=1 to p do
3 C[i,j] = 0
4 for k=1 to n do
5 C[i,j] = C[i,j] + A[i,k]*B[k,j]
6 end
7 end
8 end
The naive approach results in a complexity of O(mnp). There exist
asymptotically faster algorithms, such as Strassen's algorithm, which are based
on reducing the number of multiplications needed when multiplying 2× 2
matrices. Besides being much more complex to implement, such an algorithm
needs relatively large matrices to outperform the naive algorithm.
For all measurements in this thesis the naive matrix multiplication algorithm
was used.
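The naive triple loop translates directly into Java. The following is an illustrative sketch, not the thesis framework code; the class name and the small worked example are made up for demonstration:

```java
// Naive O(m*n*p) matrix multiplication following the mathematical definition.
public class NaiveMatMul {
    public static int[][] multiply(int[][] A, int[][] B) {
        int m = A.length, n = B.length, p = B[0].length;
        int[][] C = new int[m][p];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < p; j++) {
                int sum = 0;
                for (int k = 0; k < n; k++) {
                    sum += A[i][k] * B[k][j]; // C[i,j] = sum over k of A[i,k]*B[k,j]
                }
                C[i][j] = sum;
            }
        }
        return C;
    }

    public static void main(String[] args) {
        int[][] A = {{1, 2}, {3, 4}};
        int[][] B = {{5, 6}, {7, 8}};
        // {{1*5+2*7, 1*6+2*8}, {3*5+4*7, 3*6+4*8}} = {{19, 22}, {43, 50}}
        System.out.println(java.util.Arrays.deepToString(multiply(A, B)));
    }
}
```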
0.8.2 Parallelization
Looking at the main loop of the naive matrix multiplication depicted in listing
17, it is visible that the two outermost loops carry no dependencies among iter-
ations: every iteration writes a different entry of the result matrix C. So matrix
multiplication is fully parallelizable by partitioning the data and calculating
the rows or columns of the result matrix independently.
In order to understand the concept for data partitioning in the parallel algorithm
one has to understand the data dependencies of the matrices. For example when
multiplying two 5× 5 matrices the entry C2,3 is given by:
C2,3 = A2,1B1,3 +A2,2B2,3 +A2,3B3,3 +A2,4B4,3 +A2,5B5,3
! " #
Figure 0.12: Data dependencies for entry C2,3 of the result matrix.
Figure 0.12 shows the data dependencies for calculating entry C2,3. Basically,
for each entry of C a row of matrix A and a column of matrix B is needed. So
for calculating the entries of a whole row of C, the entire matrix B and the
corresponding row of matrix A are needed. The calculation of each entry of the
result matrix is independent and can be performed in parallel.
Typically every slave gets assigned a certain number of rows of the matrix C it
has to compute. The data needed for calculating rows x to y is the whole matrix
B and rows x to y of matrix A.
Figure 0.13 shows an example of how data is partitioned in the parallel matrix
multiplication algorithm. In the example the multiplication of two 3 × 3 matrices
is distributed to 3 processors. Each processor is responsible for calculating one
row of the result matrix C. Matrix B is needed by all processors and is therefore
broadcast. Every processor receives one row of matrix A, which is sufficient to
compute the corresponding row of matrix C.
Figure 0.13: Example of data partitioning in parallel matrix multiplication.
The processors can independently compute their share of the result matrix. The
multiplication algorithm itself does not change in the parallel version. Instead
of multiplying two 3× 3 matrices, each processor multiplies a 1× 3 vector (Ai)
with a 3 × 3 matrix (B), resulting in one row of the result matrix (Ci). Finally
the rows are collected and assembled in order to provide the complete result
matrix C. Of course, in a practical application every processor will have more
than one row assigned to compute.
0.8.3 Implementation
Using row-wise or column-wise array iteration can have a huge impact on caching
and performance in Java, as in most other programming languages. In Java it is
far more efficient to iterate row-wise than column-wise. The main loop of matrix
multiplication can be rearranged in any of 3! = 6 ways to achieve the same
result. Each permutation has different memory access patterns and therefore
might perform differently depending on processor and memory architecture.
Testing the different index orderings showed that the runtime may vary by a
factor of up to ten.
Our implementation uses a pure row oriented main loop similar to the one sug-
gested in [15]. The pure row oriented version of the algorithm generally showed
the best runtime results compared to the other index orderings.
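To illustrate the reordering, the sketch below contrasts the textbook ijk ordering with the row-oriented ikj ordering, which walks B and C strictly row by row; both produce identical results. The class and method names are illustrative, and actual speedups depend on matrix size and cache architecture:

```java
import java.util.Arrays;
import java.util.Random;

// Two index orderings for C = A * B over int matrices.
public class LoopOrder {
    // Textbook ijk ordering: the innermost loop walks a column of B.
    static int[][] ijk(int[][] A, int[][] B) {
        int n = A.length, l = B.length, m = B[0].length;
        int[][] C = new int[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int k = 0; k < l; k++)
                    C[i][j] += A[i][k] * B[k][j]; // column-wise access of B
        return C;
    }

    // Row-oriented ikj ordering: the innermost loop walks rows of B and C.
    static int[][] ikj(int[][] A, int[][] B) {
        int n = A.length, l = B.length, m = B[0].length;
        int[][] C = new int[n][m];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < l; k++) {
                int aik = A[i][k];
                for (int j = 0; j < m; j++)
                    C[i][j] += aik * B[k][j]; // row-wise access of B and C
            }
        return C;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[][] A = new int[50][50], B = new int[50][50];
        for (int[] row : A) for (int j = 0; j < row.length; j++) row[j] = rnd.nextInt(10);
        for (int[] row : B) for (int j = 0; j < row.length; j++) row[j] = rnd.nextInt(10);
        System.out.println(Arrays.deepEquals(ijk(A, B), ikj(A, B))); // both orderings agree
    }
}
```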
Listing 18 shows a simplified version of the matrix multiplication JobFactory.
First of all there are the three matrices A, B, C and their respective dimensions
K, L, M (lines 2-7). Moreover there is a variable current_id storing the identifier
that will be used for the next job and a variable current_row storing the starting
row of the next job. These are followed by the number of rows that will be
assigned to each task, jobSize, and the total number of parallel tasks, numJobs.
startIndexMap stores the first row of each job in order to map partial results
correctly to the result array C.
The public constructor (lines 16-29) takes the dimensions of the matrices as
well as the number of parallel tasks as parameters. It is assumed that the matrix
dimension is divisible by the number of tasks. The fields are initialized
accordingly, the three matrices with the given dimensions are allocated and
matrices A and B are filled with random numbers. The fillRandom() method is
assumed to initialize the matrices randomly.
The getWorkJob() method (lines 31-43) generates WorkJobs responsible for
computing a subset of rows of the result matrix C, which are later reassembled
into the complete result matrix. First of all a MatrixJob is allocated, followed by
setting the portion of matrix A it is responsible for. This is done by generating a
matrix containing a subset of rows of matrix A. The matrix is obtained using the
getRows method of the ArrayUtil class, which returns a new array containing
the desired subset of rows of the parameter array. The first parameter is the
original array, the second parameter defines the first row that is copied into the
new array and the last parameter the first row that is no longer copied. By
using the current_row variable as starting parameter and the same variable
incremented by jobSize as ending parameter, every WorkJob gets assigned
a submatrix of A containing exactly jobSize rows.
The submitResult() method (lines 45-56) copies the partial matrices of the re-
turned results to the correct indices in the final result matrix C. This is done by
adjusting the index of C by the index of the first row the corresponding job was
responsible for.
Listing 18: Implementation of the Matrix Multiplication JobFactory.
1 public class MatrixMultiplication implements JobFactory{
2 private int [][] A;
3 private int [][] B;
4 private int [][]C;
5 private int K;
6 private int L;
7 private int M;
8 private int current_id = 0;
9 private int current_row = 0;
10 private int numJobs = 0;
11 private int jobSize = 100;
12
13 private Map <Integer , Integer > startIndexMap =
14 new HashMap <Integer , Integer >();
15
16 public MatrixMultiplication(int k, int l, int m,
17 int jobs){
18 K = k;
19 L = l;
20 M = m;
21 A = new int[K][L];
22 B = new int[L][M];
23 C = new int[K][M];
24 fillRandom(A);
25 fillRandom(B);
26 numJobs = jobs;
27 jobSize = K / jobs;
28 // K is assumed divisible by jobs
29 }
30
31 public synchronized WorkJob getWorkJob () {
32 MatrixJob j = new MatrixJob ();
33 startIndexMap.put(current_id ,
34 current_row);
35 j.setB(B);
36
37 j.setA(ArrayUtil.getRows(A, current_row ,
38 current_row + jobSize));
39 current_row += jobSize;
40 j.setId(current_id);
41 current_id ++;
42 return j;
43 }
44
45 public synchronized void submitResult(Result r) {
46 MatrixResult mres =
47 (MatrixResult) r;
48 int startindex =
49 startIndexMap.get(mres.getId ());
50 int [][] mat = mres.getResult ();
51
52 for (int j = startindex;
53 j < startindex + jobSize; j++){
54 C[j] = mat[j-startindex ];
55 }
56 }
57 }
Listing 19 shows the implementation of the WorkJob for the matrix multipli-
cation algorithm, called MatrixJob. The only fields (lines 2-3) needed for the
MatrixJob are the matrix B and the matrix A, which in this context is a
submatrix of the multiplication's first matrix.
Again the run() method (lines 5-29) contains the main computation of the paral-
lel algorithm. First of all the dimensions of the matrices are extracted from the
array lengths and a two dimensional array C for the result matrix is initialized.
The following loop implements the pure row oriented matrix multiplication and
is equivalent to the loop in the sequential algorithm (lines 11-23). It is designed
to always extract the currently used rows into one dimensional arrays, which
can then be accessed in a fast row-wise fashion.
Finally the partial result matrix C is wrapped in a MatrixResult with the corre-
sponding identifier and returned (lines 24-28).
Listing 19: Implementation of the MatrixJob.
1 public class MatrixJob implements WorkJob {
2 private int [][] A;
3 private int [][] B;
4
5 public Result run() {
6 int K = A.length;
7 int L = B.length;
8 int M = B[0]. length;
9 int [][] C = new int[K][M];
10
11 for (int i = 0; i < K; i++) {
12 int[] arowi = A[i];
13 int[] crowi = C[i];
14 for (int k = 0; k < L; k++) {
15 int[] browk = B[k];
16 int aik = arowi[k];
17 for (int j = 0; j < M;
18 j++) {
19 crowi[j] +=
20 aik * browk[j];
21 }
22 }
23 }
24 MatrixResult res =
25 new MatrixResult ();
26 res.setResult(C);
27 res.setId(this.id);
28 return res;
29 }
30 }
Listing 20 shows the implementation of the MatrixResult. The data field is a
two-dimensional integer array representing the partial result matrix C.
Listing 20: Implementation of the MatrixResult.
1 public class MatrixResult extends Result {
2 private int [][] result;
3
4 public int [][] getResult () {
5 return result;
6 }
7 public void setResult(int [][] result) {
8 this.result = result;
9 }
10
11 }
0.9 Mandelbrot Set
The Mandelbrot set is a set of points defined in the complex plane that form a
fractal. It was named after the mathematician Benoit Mandelbrot, who is known
for his work in chaos theory and fractal geometry [19]. The Mandelbrot set has
become known outside of mathematics for its computer graphical depictions.
Figure 0.14 shows a colored image of the Mandelbrot set.
Figure 0.14: Colored image of the Mandelbrot set.
0.9.1 Algorithm
Mathematically the Mandelbrot set is defined as the set of all complex numbers
c for which the sequence z_{n+1} = z_n² + c, starting with z_0 = 0, stays
bounded. That means the value of |z_n| never exceeds a certain bound. For
example c = 2 results in the sequence 0, 2, 6, 38, ... which obviously tends toward
infinity, thus is not bounded, so c = 2 is not in the Mandelbrot set. On the contrary
c = −1 results in the sequence 0,−1, 0,−1, ... which is bounded and therefore
belongs to the Mandelbrot set. Another example of a bounded sequence is c = i,
resulting in the sequence 0, i, (−1 + i),−i, (−1 + i),−i, ....
It can be shown that once |z_n| exceeds 2 the sequence is not bounded. So for
testing whether a given point c is an element of the Mandelbrot set the algorithm
iterates over the sequence, testing in every iteration whether |z_n| is larger than 2.
If this is the case the point is not in the Mandelbrot set and the loop terminates.
However, there is no way to decide with certainty whether the sequence is
bounded, and thus whether a point is in the Mandelbrot set, so a maximum
number of iterations has to be defined after which the loop terminates and the
sequence is considered to be bounded.
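The membership test described above can be sketched as a small Java method; the class name, method name and iteration limit are illustrative choices, not part of the thesis framework:

```java
// Escape-time test for a complex number c = cr + ci*i: returns the number
// of iterations until |z| exceeds 2, or maxIter if the sequence stayed
// bounded (c is then considered part of the Mandelbrot set).
public class MandelbrotPoint {
    public static int iterations(double cr, double ci, int maxIter) {
        double zr = 0, zi = 0;                   // z_0 = 0
        for (int n = 0; n < maxIter; n++) {
            if (zr * zr + zi * zi > 4.0) {       // |z_n| > 2: sequence escapes
                return n;
            }
            double tmp = zr * zr - zi * zi + cr; // real part of z_n^2 + c
            zi = 2 * zr * zi + ci;               // imaginary part of z_n^2 + c
            zr = tmp;
        }
        return maxIter;                          // considered bounded
    }

    public static void main(String[] args) {
        System.out.println(iterations(2, 0, 100));  // c = 2 escapes quickly
        System.out.println(iterations(-1, 0, 100)); // c = -1 stays bounded
    }
}
```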
Algorithms generating computer graphics of the Mandelbrot set assign every
point in the considered area of the complex plane a corresponding pixel in the
picture. Typically the x value of a pixel corresponds to the imaginary part of
c and the y value of the pixel corresponds to the real part of c. The algorithm
then iterates over every point and decides whether it belongs to the Mandelbrot
set. Every point that is in the Mandelbrot set has its corresponding pixel
colored black, whereas all other pixels are colored white. The result is a simple
depiction of the considered part of the complex plane showing the points that
belong to the Mandelbrot set.
An extension to the simple black-and-white coloring is the escape time algorithm.
Instead of coloring each point that does not belong to the Mandelbrot set white,
it counts the number of iterations that were needed for the point to meet the
escape condition and colors the pixel according to a predefined color table. The
escape time algorithm therefore often produces bands of color for points that
escaped in the same iteration. However, there are various more sophisticated
algorithms that produce more continuous colorings.
0.9.2 Parallelization
Listing 21 shows a pseudo code illustration of the Mandelbrot set generator's
main loop. The two nested loops iterate over the pixels of the image that is
generated, determining a color value for each. The variables M and N thus de-
termine the pixel dimensions of the image. Each pixel of the image has to be
mapped to a number in the complex plane. In the code sample this is done by the
scale_real() and scale_imaginary() functions. The scaling of complex num-
bers is discussed more closely in the Implementation section of the algorithm.
The variables real and imag hold the real and imaginary parts of the complex
number currently considered. These variables are private to each iteration and
thus do not impose any loop dependencies. The mandelbrot_point() function
does the actual computation of whether a number is part of the Mandelbrot set
and, if not, how many iterations were needed to surpass the upper boundary.
The function has no global data or side effects and thus is reentrant. The color
values of each pixel are stored in the two dimensional array X. Since in every
iteration a different entry of the array is written, there are also no data
dependencies between iterations.
Listing 21: Pseudo code of the Mandelbrot Set generators main loop.
1 for i = 0 to M do
2 for j = 0 to N do
3 real = scale_real(i);
4 imag = scale_imaginary(j);
5 X[i][j] = mandelbrot_point(real , imag)
6 end
7 end
As mentioned, the computation-intensive part of generating Mandelbrot com-
puter graphics is determining for each point whether it is part of the Mandelbrot
set or not. Since the computation is independent for each point, the problem
can easily be parallelized by partitioning the desired set of points into n subsets,
where n is the number of desired parallel tasks.
Typically the master partitions the image into lines and assigns a certain number
of lines to each slave. The slaves decide for each point whether it belongs to
the Mandelbrot set and return the color value for each pixel back to the
master. The master then assembles the whole image accordingly.
Similar to the Monte Carlo algorithms, the parallel Mandelbrot set algorithm has
very beneficial properties for parallelization. There are very few parameters
to communicate to each slave. Moreover, the size of each parallel task can be
freely chosen, thus making load balancing easy. However, the size of the results is
considerably larger, since the color value of every pixel has to be communicated
back to the master.
0.9.3 Implementation
Listing 22 shows the JobFactory implementation of the Mandelbrot set gener-
ator. For simplicity a square image is assumed. The class fields (lines 2-14)
consist of the image dimension N and the boundary values xmin, xmax, ymin and
ymax defining the section of the complex plane that will be considered. Besides,
there is the variable id storing the identifier of the next job, the variable lines
storing the number of pixel lines that are assigned to each job, and current_line
storing the index of the first pixel line that has not been assigned to a job yet.
The two dimensional array result holds the color values of the final image and
numJobs stores the total number of jobs to generate. The map idLine stores for
each generated job its identifier and the corresponding start line, which is needed
for mapping the partial results to their actual position in the final image.
The public constructor (lines 16-26) takes the image dimension, the number of
jobs and the area of the image as parameters and initializes all fields
straightforwardly.
The getWorkJob() method (lines 28-39) is responsible for correctly initializing
jobs. First of all a MandelbrotJob is initialized using its public constructor.
The parameters are the general dimension N of the generated image, the first
line current_line the job is responsible for and the number of pixel lines the
job should generate, lines. In addition the boundary variables determining the
considered area of the complex plane are passed to the constructor. Finally
the correct identifier is set, the first line of the job is stored and all counter
variables are adjusted accordingly.
The submitResult() method (lines 41-53) collects the results and assembles
the complete image from the separate pixel lines. First the two dimensional
byte array containing the pixel values is extracted from the MandelbrotResult.
The values are then copied into the final pixel matrix result. The correct
position is given by offsetting the index of the result matrix by the job's
corresponding start line.
Listing 22: Implementation of the MandelbrotSet JobFactory.
1 public class MandelbrotSet implements JobFactory {
2 private int N = 0;
3 private float xmin = 0;
4 private float xmax = 0;
5 private float ymin = 0;
6 private float ymax = 0;
7
8 private int id = 0;
9 private int lines = 0;
10 private int current_line = 0;
11 private byte [][] result = null;
12 private int numJobs = Integer.MAX_VALUE;
13 private Map <Integer , Integer > idLine =
14 new HashMap <Integer , Integer >();
15
16 public MandelbrotSet(int N, int jobs , float xmin ,
17 float xmax , float ymin , float ymax) {
18 this.N = N;
19 this.lines = N/jobs;
20 this.xmin = xmin;
21 this.xmax = xmax;
22 this.ymin = ymin;
23 this.ymax = ymax;
24 result = new byte[N][N];
25 numJobs = jobs;
26 }
27
28 public synchronized WorkJob getWorkJob () {
29 MandelbrotJob j =
30 new MandelbrotJob(N, current_line , lines ,
31 xmin , xmax , ymin , ymax);
32
33 j.setId(id);
34 idLine.put(id, current_line);
35 current_line += lines;
36 id++;
37 numJobs --;
38 return j;
39 }
40
41 public synchronized void submitResult(Result r) {
42 MandelbrotResult mr = (MandelbrotResult) r;
43 byte [][] data = mr.getResult ();
44
45 for (int i = 0; i < data.length; i++) {
46 for (int j = 0;
47 j < data [0]. length; j++) {
48 result[i][j + idLine.
49 get(mr.getId())] =
50 data[i][j];
51 }
52 }
53 }
54 }
Listing 23 illustrates the implementation of the MandelbrotJob class. The fields
(lines 2-8) are the same as in the JobFactory and are initialized straightfor-
wardly by the public constructor.
The main computation takes place in the mandelbrot_point() method (lines
23-42), which determines for a given c whether it is part of the Mandelbrot set
and, if not, assigns the corresponding pixel a color value. The complex number
c is split into two components cx and cy representing the real and the
imaginary part of the number, respectively. As explained earlier it can be shown
that the sequence is not bounded once the absolute value of zn exceeds two. In
the algorithm the squared absolute value of the current number is compared to
a maximum value in order to avoid computing an additional square root every
iteration. The maximum value is stored in the variable max and its value is
the square of two. The variables x and y store the real and imaginary
part of the current sequence number zn and val stores the square of its absolute
value. x_temp is just a temporary variable for the new x value, i is the loop's
counter variable and max_iteration is the maximum number of iterations after
which c is considered to be in the Mandelbrot set. Typically 256 is chosen as
the maximum number of iterations so that valid RGB color values are generated for
each pixel.
The main loop (lines 33-40) of the method checks on every iteration whether
the absolute value has exceeded the escape value or the maximum number of
iterations has been reached. If not, the next number in the sequence and its
squared absolute value are computed. The computation of the next x and y
value is a version of the already presented sequence formula, split up into the
real and imaginary part of zn+1. Once the main loop terminates the number of
iterations needed is returned, which corresponds to the color value of the pixel
associated with c.
The run() method (lines 44-70) generates a complex number c for each pixel
in the considered area and calls the mandelbrot_point() method for each of them.
As explained before the real value of c corresponds to the horizontal position of
the pixel and the imaginary value to the vertical position of the pixel. The area
of the complex plane that is considered is given by the boundaries xmin, xmax,
ymin and ymax. Since the algorithm can only generate a finite number of pixels
in each dimension, in this case given by N, it has to be decided which numbers are
mapped to a pixel. This is done by calculating the scaling variables xscale and
yscale that determine the distance between pixels in the horizontal and vertical
direction. The points are equally distributed across the considered area of the
complex plane so the scaling values are given by the difference of the maximum
and minimum value in a dimension divided by the number of pixels in that
dimension. With the scaling variables given the pixels can be easily mapped to
the corresponding c values (lines 54-57). Afterwards the mandelbrot_point()
method is called with the current complex number c (lines 59-60). Finally the
resulting pixel array is wrapped in a MandelbrotResult and returned (lines
64-69).
Listing 23: Implementation of MandelbrotJob.
1 public class MandelbrotJob extends WorkJob {
2 private int N;
3 private int ystart;
4 private int lines;
5 private float xmin = 0;
6 private float ymin = 0;
7 private float xmax = 0;
8 private float ymax = 0;
9
10 public MandelbrotJob(int N, int yStart ,
11 int lines , float xmin , float xmax ,
12 float ymin , float ymax) {
13
14 this.ystart = yStart;
15 this.lines = lines;
16 this.xmax = xmax;
17 this.ymax = ymax;
18 this.xmin = xmin;
19 this.ymin = ymin;
20 this.N = N;
21 }
22
23 private static byte mandelbrot_point(float cx ,
24 float cy) {
25 float max = 2 * 2;
26 float val = 0;
27 float x = 0;
28 float y = 0;
29 float x_temp = 0;
30 int max_iteration = 256;
31 int i = 0;
32
33 while ((val < max)
34 && (i < max_iteration)) {
35 x_temp = x * x - y * y + cx;
36 y = 2 * x * y + cy;
37 x = x_temp;
38 val = x * x + y * y;
39 i++;
40 }
41 return (byte) i;
42 }
43
44 public Result run() {
45 byte x[][] = new byte[N][ lines];
46 float xscale = (xmax - xmin)/ N;
47 float yscale = (ymax - ymin)/ N;
48 float cx = 0;
49 float cy = 0;
50
51 for (int i = 0; i < N; i++) {
52 for (int j = ystart;
53 j < ystart + lines; j++) {
54 cx = xmin + ((float) i
55 * xscale);
56 cy = ymin + ((float) j
57 * yscale);
58
59 x[i][j - ystart] =
60 mandelbrot_point(cx , cy);
61 }
62 }
63
64 MandelbrotResult r =
65 new MandelbrotResult ();
66
67 r.setId(this.id);
68 r.setResult(x);
69 return r;
70 }
71 }
Listing 24 shows the Result implementation of the Mandelbrot set generator.
The data type is a two dimensional byte array storing the color values of the
pixels in the computed area of the image.
Listing 24: Implementation of MandelbrotResult.
1 public class MandelbrotResult extends Result {
2 private byte [][] result;
3
4 public byte [][] getResult () {
5 return result;
6 }
7 public void setResult(byte [][] result) {
8 this.result = result;
9 }
10 }
0.10 Rank Sort
Rank Sort is a simple sorting algorithm that can be parallelized. The sequential
version is very simple to implement, though it performs rather badly in
comparison to fast sorting algorithms such as Quick Sort or Heap Sort. The
parallel version, while still simple, has the potential to outperform the faster
sequential sorting algorithms [10].
Unsorted List  Rank        Sorted List  Rank
14             5            2           1
9              4            4           2
4              2            7           3
18             6            9           4
7              3           14           5
2              1           18           6
Table 0.5: Example illustrating the concept of ranks.
0.10.1 Algorithm
The general idea of the algorithm is that every element of the list to be sorted
has a property called rank. The rank represents the position the element would
have if the list were already sorted.
Table 0.5 illustrates the concept of a rank. The left part of the table shows the
elements of an unsorted list and their respective ranks. The right part shows the
same list in sorted order, again with the ranks of the elements. Note that in the
sorted list the rank of each element equals its absolute position in the sorted
list.
The rank of an element e can also be defined as the total number of elements
in the list that are smaller than e. In order to calculate the rank, the element
e has to be compared to all other elements in the list. For every element that
is smaller than e, the rank is increased by one. Once the rank of an element is
known, it can immediately be placed at the rank's position in the sorted array.
This results in a runtime complexity of O(n) to compute the rank of one element
in the list. In order to sort the whole list the rank of every element has to be
computed, which results in a total runtime complexity of O(n^2) for sorting a
list of n elements.
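The sequential procedure just described can be sketched as a short, self-contained Java class. Class and method names are illustrative; ranks here are zero-based array indices, unlike the one-based positions shown in Table 0.5, and distinct elements are assumed so that every rank is unique:

```java
import java.util.Arrays;

public final class SequentialRankSort {
    // Rank of x[pos]: the number of elements in x that are smaller than x[pos].
    public static int rank(int[] x, int pos) {
        int rank = 0;
        for (int v : x) {
            if (v < x[pos]) rank++;
        }
        return rank;
    }

    // O(n^2) rank sort: every element is placed directly at its rank's position.
    public static int[] rankSort(int[] x) {
        int[] sorted = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            sorted[rank(x, i)] = x[i];
        }
        return sorted;
    }

    public static void main(String[] args) {
        // The unsorted list from Table 0.5.
        System.out.println(Arrays.toString(rankSort(new int[]{14, 9, 4, 18, 7, 2})));
    }
}
```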
0.10.2 Parallelization
Listing 25 shows the main loop of the Rank Sort algorithm in pseudo code. The
loop iterates over every element in the unsorted list, calling the compute_rank()
function that determines the rank of the given element. As already described,
the rank of an element corresponds to its index in the sorted list. As a
consequence, if compute_rank() is implemented correctly there are no
overlapping write accesses to the sorted array.
Listing 25: Pseudo code of Rank Sort's main loop.
1 for i = 0 to N do
2 rank = compute_rank(unsorted[i])
3 sorted[rank] = unsorted[i]
4 end
Since the rank computation of the elements has no dependencies, a parallel
version of the algorithm can be easily obtained by computing the rank of the
elements independently on different processors.
In a shared memory architecture the unsorted and sorted lists are copied into
the shared memory. While the unsorted list is only read by the processors and
used to compute the ranks of elements, every processor can directly write its
elements to the sorted result list. Note that there are no conflicts, since only the
original list is used for the computation of the elements' positions.
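A minimal sketch of this shared-memory variant, assuming plain Java threads and distinct elements (class, method and variable names are illustrative, not part of the thesis framework):

```java
public final class ParallelRankSort {
    // Shared-memory parallel rank sort: the unsorted array x is only read,
    // while each thread writes the elements it ranked directly into the
    // shared result array. Since ranks are unique for distinct elements,
    // no two threads ever write to the same position.
    public static int[] rankSort(int[] x, int numThreads) {
        int[] sorted = new int[x.length];
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int tid = t;
            workers[t] = new Thread(() -> {
                // Cyclic distribution: thread tid ranks elements tid, tid+numThreads, ...
                for (int i = tid; i < x.length; i += numThreads) {
                    int rank = 0;
                    for (int v : x) {
                        if (v < x[i]) rank++;
                    }
                    sorted[rank] = x[i]; // conflict-free write
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sorted;
    }
}
```

Joining all worker threads before reading the result establishes the necessary happens-before ordering, so no further synchronization is required.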
In a master-slave architecture, every slave computes the ranks of some elements
of the list, while the master collects the ranks of the elements and puts them
in the correct positions of the sorted result list. The unsorted list has to be
broadcast to every slave in the system.
The parallel time complexity can be easily derived from the sequential
complexity of the algorithm. The inner loop complexity of O(n) stays the same.
Assuming an equal distribution of work, each processor has to perform the
inner loop only n/m times, where m is the number of available processors. As
a result there is a total time complexity of O(n * n/m) or O(n^2/m).
0.10.3 Implementation
Listing 26 shows a simplified version of the Rank Sort JobFactory. The fields
(lines 2-11) consist of the unsorted array x, the total number of parallel jobs
to generate numJobs and the number of elements that are sorted by one job
jobSize. Moreover there is the variable remainder storing the number of
additional elements to sort when the elements cannot be distributed equally to
the parallel jobs and some of them have to sort an additional element.
currentIndex and currentId are counter variables holding the start index of
the elements assigned to the next generated job and its identifier, respectively.
Finally there is a second array sorted holding the sorted result list, which is
continuously filled with elements by their rank. idIndexMap stores for every job identifier
the corresponding starting index, so the elements of the array can be correctly
associated with their rank when collecting results.
The RankSort class is initialized by its public constructor (lines 13-19), taking
as arguments the unsorted array and the number of desired parallel jobs. The
unsorted array x and the number of jobs numJobs are simply set to the passed
arguments. jobSize is given by the length of the unsorted array divided by the
number of jobs and respectively remainder is given by the modulo of the array
length and the number of jobs. The result array is initialized with the same
length as the unsorted array.
The getWorkJob() method (lines 21-36) again initializes a RankSortJob and
assigns it a portion of the array. The start index is given by the currentIndex
variable, the end index by currentIndex + jobSize. However, as long as there
are remaining elements the end index is incremented, assigning the job an
additional element, and the counter for remaining elements is decremented.
Finally the job's identifier is set and the WorkJob is returned.
The submitResult() method (lines 38-46) is responsible for continuously filling the
sorted result array with elements by their rank. The RankSortResult contains
the ranks of the elements the corresponding job had to consider.
Listing 26: Implementation of the RankSort JobFactory.
1 public class RankSort implements JobFactory {
2 private int[]x;
3 private int numJobs;
4 private int jobSize;
5
6 private int remainder;
7 private int currentIndex;
8 private int currentId;
9 private int[] sorted;
10 private Map <Integer , Integer > idIndexMap =
11 new HashMap <Integer , Integer >();
12
13 public RankSort(int []x, int jobs){
14 this.x = x;
15 jobSize = x.length/jobs;
16 remainder = x.length % jobs;
17 numJobs = jobs;
18 sorted = new int[x.length ];
19 }
20
21 public synchronized WorkJob getWorkJob () {
22 RankSortJob j = new RankSortJob ();
23
24 idIndexMap.put(currentId , currentIndex);
25 j.setFrom(currentIndex);
26 j.setTo(currentIndex + jobSize);
27 if(remainder > 0){
28 j.setTo(j.getTo () + 1);
29 remainder --;
30 }
31 currentIndex = j.getTo();
32 j.setId(currentId);
33 currentId ++;
34
35 return j;
36 }
37
38 public synchronized void submitResult(Result r) {
39 RankSortResult res = (RankSortResult) r;
40 int [] ranks = res.getRanks ();
41
42 for(int i = 0; i < ranks.length; i++){
43 sorted[ranks[i]] =
44 x[i+idIndexMap.get(res.getId ())];
45 }
46 }
47 }
Listing 27 shows the WorkJob implementation of the algorithm. x contains the
unsorted array, from and to are the boundary indices of the elements the given
job is responsible for.
The implementation of computeRank() (lines 6 to 14) is the same as in a sequen-
tial version of the algorithm. The method iterates over the original unsorted
list and compares the element at the given position to every other element in
the list. For every element that is smaller than the considered element the rank
is increased by one. Finally the rank of the element is returned.
The run() method (lines 16 to 30) initializes an array for storing the ranks of the
corresponding elements of the unsorted list. Then the computeRank() method is
called repeatedly for every element between the from and the to index. Finally
the result is assembled and returned.
Listing 27: Implementation of the RankSortJob.
1 public class RankSortJob extends WorkJob{
2 private int[] x;
3 private int from;
4 private int to;
5
6 public static int computeRank(int[] x, int pos){
7 int rank = 0;
8 for(int i = 0; i < x.length; i++){
9 if(x[i] < x[pos]){
10 rank +=1;
11 }
12 }
13 return rank;
14 }
15
16 public Result run() {
17 int[] ranks = new int[to -from];
18
19 for(int i = 0; i < ranks.length; i++){
20 ranks[i] =
21 computeRank(x, i + from);
22 }
23
24 RankSortResult r = new RankSortResult ();
25 r.setRanks(ranks);
26 r.setFrom(this.from);
27 r.setId(this.id);
28
29 return r;
30 }
31 }
Listing 28 shows the implementation of the RankSortResult class. A
RankSortResult contains an array with the ranks of the elements that
were assigned to the corresponding job. The master application copies the
corresponding elements to the result array according to their ranks.
Listing 28: Implementation of the RankSortResult.
1 public class RankSortResult extends Result{
2 private int[] ranks;
3
4 public int[] getRanks () {
5 return ranks;
6 }
7 public void setRanks(int[] ranks) {
8 this.ranks = ranks;
9 }
10 }
Experiments
In this chapter all the experiments testing App Engine's computing capabilities
are presented. First a couple of general tests of the App Engine framework
were performed, in order to get a grasp of the general performance of the
framework, to identify possible bottlenecks and to provide a context for the
algorithm experiments. The algorithm experiments consist of a simple speedup
analysis followed by a scalability analysis. Finally the resource consumption of
each algorithm is analyzed and a rough cost estimation is derived. Results are
generally compared to an equivalent experimental setup executed on the
karwendel cluster, in order to provide a comparison to a system with known
hardware.
Note that getting consistent performance measurements on Google App Engine
is quite difficult, since there are no guarantees whatsoever with respect to the
location or the underlying hardware of the servers. For example, two identical
consecutive requests to the same web application could be executed on
completely different hardware in two different geographic locations. For the
duration of the tests, however, we most likely dealt with the same hardware
infrastructure, since there were no significant variations in the results.
Moreover, changes in the load balancing strategies, other internal mechanisms
or the underlying hardware can influence the performance. In addition,
background load on the servers or the network might also influence the outcome
of experiments. In order to minimize this bias all experiments were conducted
in a short period of time and were executed multiple times in order to even out
side effects.
0.10.4 Hardware and Experimental Setup
In the following the hardware and experimental setup used for all the experi-
ments presented in this chapter will be described.
For experiments performed on the karwendel cluster, the Master application
responsible for job distribution was executed on the head node of the cluster,
while the slave application was executed on a regular compute node using the
development server. Table 0.6 lists the hardware specification of the karwendel
cluster.

head node
CPU: 2 x Opteron 848
CPU speed: 2.2 GHz
Cores: 1
Cache: L1: 64 kilobyte, L2: 1 megabyte
Memory: 16 gigabyte

compute node
CPU: 4 x Opteron 880
CPU speed: 2.4 GHz
Cores: 2
Cache: L1: 32 kilobyte, L2: 1 megabyte
Memory: 16 gigabyte

Network: Infiniband network
JVM: Java HotSpot 64-Bit Server VM (build 16.3-b01, mixed mode)

Table 0.6: Hardware specification of the karwendel cluster.
CPU: Intel Xeon 5150
CPU speed: 2.66 GHz
Cores: 2
Cache: L1: 32 kilobyte, L2: 4 megabyte
Memory: 4 gigabyte
Internet Connection: ...

Table 0.7: Hardware specification of the zid-gpl server.
For all experiments on App Engine the Master application was executed on the
zid-gpl server. Table 0.7 lists the hardware specification of the zid-gpl server.
Unless stated otherwise, every iteration of the experiments was repeated ten
times and the results were averaged in order to reduce bias. The JVM was
always "warmed up" before starting the actual measurements in order to keep
effects of JIT compilation from distorting the results. For a more detailed
discussion of JIT compilation see section 0.11.4. The JVM used in the
karwendel experiments is listed in table 0.6. App Engine's exact JVM version
is not known; it is however a Java 1.6 virtual machine.
All App Engine experiments were performed between 25.12.2010 and 30.12.2010.
0.11 Analyzing Google App Engine Performance
In order to get a general grasp for the performance of the App Engine infras-
tructure and to identify possible bottlenecks, we performed a couple of general
tests. First of all the network quality of App Engine was tested including a
latency and bandwidth analysis. Furthermore the general performance of the
Java environment was tested, followed by a test of the Just In Time (JIT)
compilation capabilities of the system. Finally the cache behavior of the
system was tested as well. Generally it would be better to use standardized
benchmarks; App Engine however does not allow arbitrary libraries and
therefore we had to write all tests ourselves. Where necessary we compared
the results with results gathered on karwendel in order to provide a comparison
to a system with known hardware.
0.11.1 Latency Analysis
The latency of requests when calling App Engine applications is of key impor-
tance for a performance analysis, since it often accounts for a big part of the
parallel overhead. Especially when the computation time of parallel tasks is
relatively short in comparison to the latency, it can have a large impact on
the overall performance of algorithms. In this context the latency of an HTTP
request is the time needed from issuing the request until the HTTP response is
completely returned.
There are various factors that influence latency. Foremost there is the network
infrastructure that determines the time needed until the data of the request is
transferred to the servers. However there is also the time needed for the App
Engine load balancer to correctly analyze and assign a request to an application
server. Moreover if there is no instance of the web application running on an
application server, one has to be initialized before assigning the request. As
described in the first chapter the App Engine load balancer starts and caches
instances of an application depending on the number of recent requests. In fact
applications that have no running instances may have a considerably higher
latency for the first couple of requests depending on the size of the application
and thus the time needed to load instances into memory.
In the following an analysis of HTTP request latency to the slave application
is presented. The experiment was set up to test the latency of a job request
depending on the size of the payload. Payload sizes from 0 up to 2700 kilobyte
in 300 kilobyte steps were tested, since this is the typical size range for job
requests. Each request size was tested 50 times. For all measurements, simple
ping requests to the application were issued, measuring the time until the
response was completely returned.
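A minimal sketch of such a ping measurement is given below; the class and URL handling are illustrative assumptions and not part of the thesis framework. The slave application's ping URL has to be supplied from outside:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public final class LatencyProbe {
    // Latency of one request: time from issuing the request until the
    // response body has been read completely, in milliseconds.
    public static long ping(String url) throws Exception {
        long start = System.nanoTime();
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = con.getInputStream()) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // drain the response completely
            }
        } finally {
            con.disconnect();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Pass the slave application's ping URL as the first argument,
        // e.g. http://<app-id>.appspot.com/ping (placeholder).
        if (args.length > 0) {
            for (int i = 0; i < 50; i++) {
                System.out.println("request " + i + ": " + ping(args[0]) + " ms");
            }
        }
    }
}
```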
[Figure: latency in milliseconds versus payload size in kilobyte]
Figure 0.15: Results of the latency analysis.
Figure 0.15 shows the results of the latency analysis. Notable is that the latency
does not increase linearly for a linearly increasing payload size. This effect can be
explained by TCP needing some time to fully utilize the fast Internet connection
and therefore being able to handle the larger payloads relatively faster. Note
that in this experiment single isolated requests were sent, in contrast to a
practical algorithm run, where multiple parallel requests are sent, thus utilizing
the connection more efficiently.
0.11.2 Bandwidth
Besides the latency of requests another very important network measure is the
bandwidth, since the upload and download of data to the App Engine servers
is clearly a major part of the parallel overhead. Looking at the latency
analysis it is obvious that the bandwidth often will not be fully utilized by a
single request. Therefore shared data is transferred using multiple parallel
TCP streams, as described more closely in chapter 0.6. Moreover there are
usually several parallel job requests in flight in order to use the available
bandwidth as efficiently as possible.
In order to test the bandwidth to App Engine we used the parallel data transfer
implemented for shared data management, however without storing any of the
data in the datastore in order to avoid overhead imposed by database
operations. The experiment involved transferring data chunks of one, two and four
megabyte, which is a typical data size used for the algorithms, using between
1 and 60 parallel HTTP requests. Thereby we were able to identify a good
amount of parallel streams for data transfer as well as analyze the achievable
bandwidth to the App Engine servers.
[Figure: transfer time in milliseconds versus number of parallel streams, for 1, 2 and 4 megabyte chunks]
Figure 0.16: Results of the bandwidth test.
Figure 0.16 shows the results of the bandwidth analysis. Up to 30 parallel
streams a steady speedup of the data transfer can be achieved. Between 30 and
60 parallel streams the transfer time stays approximately the same. For the
four megabyte chunks the time still seems to decrease slightly up to around 50
streams. For more than 60 parallel streams the time needed to transfer the data
chunks starts to increase again.
Chunksize   Minimum Speed   Maximum Speed
1 mb        343 kb/sec      1796 kb/sec
2 mb        507 kb/sec      2560 kb/sec
4 mb        871 kb/sec      3938 kb/sec
Table 0.8: Maximum and minimum transfer speeds for different data chunksizes.
Table 0.8 shows the maximum and minimum transfer speeds for the different
chunk sizes. As expected the larger chunks have higher transfer speeds be-
cause TCP has more time to fully utilize the bandwidth and there is relatively
less overhead imposed by the HTTP header. For all chunk sizes a substantial
speedup of the data transfer can be achieved by using multiple parallel streams.
In practice 30 parallel streams is probably the best choice. Even though a
slightly faster transfer could possibly be achieved with more streams, the higher
overhead of the database operations would outweigh the faster transfer.
0.11.3 Java Performance Analysis
In order to provide a rough performance estimation of the App Engine Java
environment we measured various metrics using a simple self-written micro
benchmark and compared the results to the karwendel Java environment. The
performance of the basic arithmetic operations of each data type as well as the
performance of trigonometric functions, random number generation and object
creation is measured. The benchmark repeatedly performs each operation in
a loop and estimates how many operations of each kind can be executed per second.
The purpose of this section is to provide a comparison of the App Engine Java
environment to the Java environment used for executing the algorithms on kar-
wendel.
Microbenchmarks
Listing 29 shows the benchmark routine for double arithmetics. For measuring
the time the more precise method System.nanoTime() is used instead of
System.currentTimeMillis(). The main loop repeatedly performs the basic
arithmetic operations until the maximum number of iterations is reached. It is
important to use the variable containing the computation result, for example
by printing it, as otherwise the compiler would simply optimize the seemingly
useless computation away. Furthermore it is essential to avoid using unchanging
variables in the computations, which again would enable the compiler or the
JVM to optimize the computation away. Since generating random values on
every iteration would not be feasible, the counter variable is used in the
computation. Finally the time spent in the main loop is returned.
Listing 29: Benchmark routine for double add/sub operations.
1 public static long doubleAddArith(long N) {
2 long start = 0;
3 long end = 0;
4 double result = Math.random () * 10.0;
5 double i = 0.0;
6 start = System.nanoTime ();
7 while (i < N) {
8 result -= i++;
9 result += i++;
10 }
11 end = System.nanoTime ();
12 System.out.println(result);
13 return (end - start);
14 }
The other benchmarks are implemented analogously and therefore a closer de-
scription is omitted here.
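As an illustration of the analogous structure, the object creation benchmark might look as follows. This is a sketch following the pattern of Listing 29, not the exact routine used in the thesis:

```java
public final class ObjectCreationBench {
    // Analogous to Listing 29: repeatedly allocates objects and returns the
    // elapsed time in nanoseconds. Printing the last reference keeps the
    // allocations from being optimized away entirely.
    public static long objectCreation(long N) {
        Object o = null;
        long i = 0;
        long start = System.nanoTime();
        while (i < N) {
            o = new Object();
            i++;
        }
        long end = System.nanoTime();
        System.out.println(o);
        return end - start;
    }
}
```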
The benchmark tests were executed as a WorkJob using the distributed system
in the same way as for the algorithms. The tests were run directly without a
warmup phase for the JVM, since the benchmarks are only intended to com-
pare system performance. Between each test the garbage collector is called by
invoking System.gc().
[Figure: runtime in nanoseconds for the benchmarks int a/s, int m/d, long a/s, long m/d, double a/s, double m/d, trig, random and object, on App Engine and karwendel]
Figure 0.17: Microbenchmark results.
Figure 0.17 shows the results of the micro benchmark test. The bars show the
time needed in nanoseconds to execute each benchmark loop. For the arith-
metic tests the number of iterations was 100.000.000, for the trigonometric test
1.000.000 and for the random number generation as well as the object creation
test 10.000.000. For the basic data type operations a/s labels the addition and
subtraction tests, whereas m/d labels the tests for multiplication and division.
First of all, very surprising is the difference between the addition/subtraction
and the multiplication/division operations for the integer data types. While
multiplication/division operations for int as well as long types are faster on App
Engine than on karwendel, the addition/subtraction operations are considerably
slower. This effect can be explained either by the hardware used in the App
Engine infrastructure or by the way the App Engine JVM handles addition and
subtraction operations. The operations for the double data type are more stable
and are faster on karwendel. The trigonometric functions provided by the
standard math library are slightly faster on App Engine. The random number
generation of the standard math library is over twice as fast on karwendel as
on App Engine, which might be due to a different implementation used in App
Engine's runtime environment. Object creation has almost identical
performance on both systems.
[Figure: computation time in milliseconds for Scalar mul/div and Fibonacci on App Engine and karwendel]
Figure 0.18: Computation time results of Scalar mul/div and the Fibonacci number generator.
In order to verify the rather unusual results of the integer operations we tested
the different operations using actual algorithms: first a simple scalar
multiplication/division algorithm that heavily exercises the multiplication and
division operations, and second a Fibonacci number generator that represents
the addition/subtraction operations. The raw computation time of the
algorithms was measured, which is the time spent in the run method of the
algorithms without any overhead. For both algorithms a problem size was
chosen that yielded a computation time of around one second on App Engine,
and the same size was then executed on karwendel.
Figure 0.18 shows the results of the experiment. The Fibonacci algorithm is
around three times faster on karwendel, which matches the results from the
microbenchmarks. Notable is however that the Scalar algorithm is actually
slower on karwendel than one would expect. This can be explained by the fact
that the Scalar algorithm also has a high memory usage and therefore also
tests the speed of the cache hierarchy. As analyzed more closely in section
0.11.5, the cache hierarchy of App Engine is generally faster than on karwendel,
which explains the discrepancy in computation times.
0.11.4 JIT Compilation
Virtual Machines using JIT compilation dynamically compile frequently used
parts of the bytecode to native machine code, which can yield a notable speed
improvement depending on the program. Algorithms typically benefit a lot
from JIT compilation, since usually most computation takes place in a main
loop which can be easily compiled to optimized machine code. For performance
measurements it can be very difficult to deal with JIT compilation, since the
JVM has to be "warmed up" before taking the actual measurements in order
to ensure consistent results.
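The warm-up procedure can be sketched as follows; the loop body and iteration counts are illustrative choices, not the exact values used in the experiments:

```java
public final class WarmupDemo {
    // Hot loop whose bytecode the JIT will eventually compile to native code.
    public static long hotLoop(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i * 31L % 7;
        }
        return sum;
    }

    // Times one invocation of the hot loop in nanoseconds.
    public static long time(int n) {
        long start = System.nanoTime();
        hotLoop(n);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long cold = time(5_000_000);           // first run: interpreted / compiling
        for (int i = 0; i < 20; i++) {
            time(5_000_000);                   // warm-up phase
        }
        long warm = time(5_000_000);           // steady-state measurement
        System.out.println("cold: " + cold + " ns, warm: " + warm + " ns");
    }
}
```

On a typical JIT-enabled JVM the steady-state time is noticeably lower than the first run, which is exactly the effect the warm-up phase is meant to exclude from the measurements.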
In order to test the behavior of App Engine in terms of JIT compilation we used
the Fibonacci number generator, since JIT compilation seemed to have a
big impact on the algorithm's computation time. The algorithm was repeatedly
executed with the same problem size over a period of 50 iterations. Note that
only the effective computation time was measured, which is the time spent in the
run method of a job, in order to avoid any bias caused by overhead. The requests
were sent one after another with a sleep time of one second between requests.
The App Engine slave application initially had no instances running.
Figure 0.19: Fibonacci test requests (computation time in milliseconds vs. request number).
Figure 0.19 shows the requests in their consecutive order as they were sent and
their respective computation time. Notable first of all is that every request has
essentially one of two computation times: the higher one of around three seconds
or the considerably lower one of a little over a second. Also interesting is
that the first requests all have the larger computation time, whereas the later
requests almost all have the lower one.
As described in chapter 2, App Engine spawns instances of an application depending
on its recent load. Therefore, for an application that experiences numerous
requests, additional instances are initialized. Each instance of an App Engine
application has its own JVM, within which all static references are shared. We
were thus able to track instances by introducing a class called InstanceTracker
(see listing 30). The InstanceTracker is simply a static singleton class with a
single field holding an identifier for the instance, which is initialized based
on the current time.
Listing 30: InstanceTracker class for tracking the current application instance.
public class InstanceTracker {

    private static InstanceTracker tracker = new InstanceTracker();

    long UUID;

    private InstanceTracker() {
        UUID = System.nanoTime();
    }

    public static InstanceTracker getTracker() {
        return tracker;
    }
}
Figure 0.20: Fibonacci test requests mapped to their respective instance (instances 1 to 7; computation time in milliseconds vs. request number).
By tracking the instances that handled the requests, the requests can be mapped
to their respective instance as depicted in figure 0.20. The requests were handled
by a total of seven different instances. Looking at the computation times, it is
noticeable that for each instance the first two requests it handled took
considerably longer than every following request. This leads to the conclusion
that, for this problem size, the optimized version of the code is executed after
two requests and is a lot faster.
For the experiments this means that the slave application has to be "warmed up"
with sufficient requests of the corresponding algorithm before actually beginning
the measurements, in order to minimize bias caused by the JIT compiler. The
problem, however, is that the programmer has absolutely no control over the
initialization or the lifetime of instances, which means there is always a chance
that a new "cold" instance is initialized during a measurement. For a slave
application running in the development server this problem is not present, since
every request is handled in the same JVM. The JVM, however, still has to be
"warmed up" after startup.
Moreover, the experiment shows how much impact the JIT compiler of the JVM can
have on the runtime of an algorithm.
0.11.5 Cache Hierarchy
Another important aspect of a computing environment is the cache hierarchy
and its respective speed. Therefore it is interesting to find out what cache hier-
archy is present in the App Engine infrastructure. Since there is no information
what hardware Google uses, we wrote a short cache test program in order to
get an idea of App Engines cache behavior. The cache tests presented in the
following are based on the tests in [21].
Listing 31: Short program to test the cache hierarchy.
// start, end and steps are declared in the surrounding test harness
for (int i = 1; i <= 16; i++) {
    int bytenum = 1024 * (int) Math.pow(2, i); // array sizes from 2 KB up to 64 MB
    byte[] arr = new byte[bytenum];

    start = System.currentTimeMillis();
    for (int k = 0; k < steps; k++) {
        arr[(k * 64) % arr.length]++; // touch one byte per 64-byte cache line
    }
    end = System.currentTimeMillis();
}
Listing 31 shows the code of the cache hierarchy test program. The program in
principle modifies entries of an array for a given number of iterations. By
increasing the size of the array, the point at which the array no longer fits
entirely into a cache level, so that the next slower level has to be used, can
be identified. Since in modern processors an array access loads a whole cache
line of 64 bytes, only every 64th entry of the byte array is modified, which
cheaply touches every cache line.
Figure 0.21: Results of the cache hierarchy test (iteration time in milliseconds vs. array size from 2 to 65536 kilobytes; App Engine and karwendel).
Figure 0.21 shows the result of the cache test program executed on App Engine
as well as on karwendel, in order to show that the program produces meaningful
results for a system with a known cache hierarchy. A karwendel node has an L1
cache of 64 kilobytes and an L2 cache of 1024 kilobytes, which is clearly visible
in the performance drops at array sizes of 64 and 1024 kilobytes, respectively.

The processors utilized in the App Engine infrastructure seem to have three
cache levels. The first performance drop is visible at 32 kilobytes, the second
at 256 kilobytes and the third at 8 megabytes. This would indicate an L1 cache of
32 kilobytes, an L2 cache of 256 kilobytes and an L3 cache of 8 megabytes, which
would for example match the Intel Nehalem processor microarchitecture.
0.12 Speedup Analysis
The experiment consists of a simple speedup analysis of each algorithm using
a single application on App Engine, compared to the same problem executed
on a single karwendel cluster node. The first difficulty, however, is choosing an
appropriate problem size, since App Engine allows single requests to run for at
most 30 seconds. Therefore, for each algorithm a problem size was chosen that
could be solved in slightly under 30 seconds on App Engine using a single job.
The same problem size was then used on both platforms, and the behavior of
several metrics was tracked for an increasing number of parallel tasks.

First of all, there is the total runtime, which is the time needed for the
algorithm to finish with a given problem size. The total runtime is a good
overall metric for how well the algorithm performs.
The next metric is the average computation time, which is the average time
needed for a job to execute the run() method. This metric therefore reflects the
actual time spent doing useful computations. If the problem is split into
multiple parallel jobs, each job has less work to do and thus typically a lower
average computation time. However, when there are more parallel jobs running
than cores available, the average computation time should not decrease further,
since cores have to be shared by multiple threads.
Another metric is the average overhead, which is the average time a job
additionally needed to complete, besides doing useful computations. The overhead
of a job is the difference between the time needed for the job to return a result
and the time spent in the run() method. The overhead typically comprises the
time needed to transfer requests, compress data, perform database operations
or run internal mechanisms of the Google App Engine framework.
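The overhead definition above can be stated as a one-line relation; a minimal sketch with illustrative names (not the thesis code):

```java
// Sketch of the per-job overhead metric described above; names are illustrative.
public class JobMetrics {
    // overhead = time until the job's result arrived minus time spent in run()
    static long overheadMillis(long totalJobMillis, long computationMillis) {
        return totalJobMillis - computationMillis;
    }

    public static void main(String[] args) {
        // e.g. a job whose result arrived after 1900 ms with 1200 ms spent in run()
        System.out.println(overheadMillis(1900, 1200) + " ms of overhead");
    }
}
```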
Finally, for algorithms making use of the shared data management, the data
transfer time is measured, which is simply the average time needed to transfer
the shared data to a slave.
By measuring more than just the total runtime, it can be analyzed in more detail
why certain algorithms perform better than others. A high overhead typically
means that too much time is invested in transferring data to the slaves in
comparison to the computation time. Note that we expect considerably higher
overhead values for the App Engine slave than for the local slave executed on
karwendel, since data has to be transferred over the Internet instead of a fast
local network. Moreover, there is notable latency caused by the load balancing
mechanisms of the App Engine framework.
If every job takes the same time for its computation and has the same overhead,
the completion time of the algorithm is the sum of the average computation
time, the average overhead and the time spent generating jobs and reassembling
results. A large gap between the completion time and the sum of the average
computation time and average overhead is therefore typically an indicator of
poor load balancing.
0.12.1 Pi Approximation
Figure 0.22: Runtime analysis of the Pi Approximation algorithm (total execution time, computation time and overhead in seconds vs. parallel tasks; karwendel and App Engine).
Figure 0.23: Speedup analysis of the Pi Approximation algorithm (speedup vs. parallel tasks, against linear speedup; karwendel and App Engine).
The Pi Approximation algorithm is a pretty good benchmark for the raw
computation power of the App Engine framework, since there is almost no data to
transfer in the requests or in the results. Moreover, the algorithm itself does
not operate on data, and therefore no caching effects are present. Besides
these properties, the algorithm can be almost perfectly load balanced. For the
experiment a problem size of 220.000.000 points to generate was chosen. Note,
however, that the algorithm is based on the standard random number generator
provided by the Java API, which seems to be rather slow on App Engine.
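The per-job work of the Pi Approximation is a plain Monte Carlo sampling loop. The following is a minimal sketch under illustrative names (not the thesis implementation); java.util.Random stands in for the slow standard generator mentioned above:

```java
import java.util.Random;

// Minimal sketch of the Monte Carlo Pi approximation performed by each job;
// class and method names are illustrative, not the thesis code.
public class PiJob {
    public static double approximate(long points, long seed) {
        Random rnd = new Random(seed); // the standard generator noted as slow on App Engine
        long inside = 0;
        for (long i = 0; i < points; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++; // point falls inside the unit quarter circle
            }
        }
        // ratio of quarter circle to unit square is pi/4
        return 4.0 * inside / points;
    }

    public static void main(String[] args) {
        System.out.println(approximate(1_000_000, 42L)); // rough estimate of pi
    }
}
```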
Figure 0.22 shows the runtime results of the Pi Approximation algorithm. On
karwendel there is effectively no overhead for transferring jobs and results
because of the fast Infiniband network. As a result, the average computation time
and the total runtime are almost the same for up to eight parallel tasks. Figure
0.23 shows that up to eight parallel tasks an almost linear speedup can be
achieved. Using more than eight parallel tasks results in a slightly increased
total runtime and therefore a drop in speedup caused by load imbalance, since the
karwendel compute nodes have eight processor cores.
In comparison, the App Engine results show a rather high average overhead of
around 700 milliseconds. The average computation time shows a steady, almost
linear speedup as expected. The total runtime also shows a notable speedup of up
to six. The irregularities in the runtime graph are caused by random background
load either on the application servers or on the network. Even though the
overhead caused by request latency is a limiting factor, the algorithm performs
well on Google App Engine. The raw average computation time for 15 parallel
tasks is almost the same on both platforms, even though for a single task the
computation time is almost twice as high on App Engine as on karwendel.
0.12.2 Matrix Multiplication
Matrix multiplication is an algorithm that mainly utilizes addition and
multiplication and therefore reflects the performance of these operations on a
system to a certain degree. Moreover, there is a relatively large amount of data
to transfer, in terms of parameters as well as results. Besides, the caching
behavior of a system is tested, since the algorithm has to iterate over large
arrays that typically will not fit entirely in the CPU cache. For the experiment
two integer square matrices of size 1500 × 1500 were multiplied.
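A row-partitioned version of the multiplication, where each job receives a stripe of rows of A together with the whole of B, might look roughly as follows (an illustrative sketch, not the thesis code):

```java
// Illustrative sketch of a row-partitioned integer matrix multiplication job;
// names and the partitioning scheme are assumptions, not the thesis implementation.
public class MatrixJob {
    // multiplies rows [rowStart, rowEnd) of a with the full matrix b
    public static int[][] multiplyRows(int[][] a, int[][] b, int rowStart, int rowEnd) {
        int n = b[0].length;
        int[][] result = new int[rowEnd - rowStart][n];
        for (int i = rowStart; i < rowEnd; i++) {
            for (int j = 0; j < n; j++) {
                int sum = 0;
                for (int k = 0; k < b.length; k++) {
                    sum += a[i][k] * b[k][j]; // addition/multiplication dominate
                }
                result[i - rowStart][j] = sum;
            }
        }
        return result;
    }
}
```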
Figure 0.24 shows the runtime results of integer matrix multiplication. In terms
of sequential computation time, both systems show similar performance, though
karwendel is slightly faster than App Engine. On karwendel there is again a
rather small average overhead of around 700 milliseconds. The average data
transfer time is also rather low, at around 600 milliseconds, which is again due
to the fast network connection.

Figure 0.24: Runtime analysis of integer Matrix Multiplication.

Figure 0.25 shows that on karwendel a good speedup up to eight parallel tasks
can again be achieved; the maximum speedup, however, is only around four. For
more than eight tasks the runtime increases slightly due to load imbalance.
On App Engine, the overhead for transferring data clearly dominates the
algorithm. The data transfer time is between two and three seconds, and the
average overhead ranges from almost ten down to four seconds. An interesting
effect is that the overhead tends to decrease with more parallel jobs. This is
caused by the fact that a single job request only has one TCP stream to return
data. If multiple job requests are used, the returning of data is automatically
split across multiple TCP streams, which increases the speed of returning data to
the master. The maximum speedup is only around two, as depicted in figure 0.25.
0.12.3 Rank Sort
Rank Sort essentially tests sequential array traversal and integer incrementing
(for incrementing the rank), which typically should be pretty fast. For the
experiment an integer array with 70.000 elements was sorted.

Figure 0.26 shows the results of the runtime analysis using the Rank Sort
algorithm. First of all, surprising is the huge difference in sequential
computation time. For a single task the computation time is around three times
higher on
App Engine than on karwendel. This is most likely due to the poor
addition/subtraction performance of App Engine and the fact that the if clause
in the main loop of the algorithm prevents any kind of loop-unrolling
optimization.

Figure 0.25: Speedup analysis of integer Matrix Multiplication.
The karwendel results show a steady runtime speedup up to six parallel tasks;
the problem size is too small to sustain the speedup up to eight parallel tasks.
Typical again are the low average overhead and the very low data transfer time
due to the fast network links.
In contrast, on App Engine there is again a higher average overhead of around
700 milliseconds, and the data transfer takes about the same time. For 15
parallel tasks a pretty good speedup of over four can be achieved. The algorithm
seems better suited for execution on App Engine than matrix multiplication,
since the ratio between computation and data transfer is weighted more towards
computation.
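The rank computation underlying the algorithm can be sketched as follows (an illustrative sketch, not the thesis code; the tie-break on the index keeps duplicate values in order):

```java
// Minimal sketch of the Rank Sort algorithm: for every element the rank
// (number of smaller elements) is counted, which directly gives its sorted
// position. Names are illustrative, not taken from the thesis implementation.
public class RankSort {
    public static int[] sort(int[] input) {
        int[] sorted = new int[input.length];
        for (int i = 0; i < input.length; i++) {
            int rank = 0;
            for (int j = 0; j < input.length; j++) {
                // the if clause in this main loop hinders loop unrolling
                if (input[j] < input[i] || (input[j] == input[i] && j < i)) {
                    rank++;
                }
            }
            sorted[rank] = input[i];
        }
        return sorted;
    }
}
```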
0.12.4 Mandelbrot Set
The main properties of the Mandelbrot Set generator are, on the one hand, the
small WorkJobs and, on the other hand, the very large results that have to be
transferred back to the master. Note that for this algorithm a relatively small
problem size of 3200 × 3200 pixels was chosen, because a larger problem size
would exceed the maximum allowed size of ten megabytes for the HTTP response.
This results in a completion time of under eight seconds for the single-job
execution.
Figure 0.26: Runtime analysis of the Rank Sort algorithm.

First of all, noticeable is the huge overhead in comparison to the low effective
computation time on App Engine, which makes the algorithm practically infeasible.
The effect that more parallel tasks result in a lower average overhead is even
more noticeable than for the Matrix Multiplication or Rank Sort.
Taking a closer look at the speedup on karwendel (see Figure 0.29), it seems
that the algorithm gains speedup even beyond eight parallel tasks. Besides, an
uneven number of parallel tasks seems to result in a larger total runtime, even
though the average computation time steadily decreases and the overhead is
constant. After analyzing the problem more closely, we found that the
implementation of the algorithm has an inherent load imbalance, even though each
job is assigned an equal number of pixel values to calculate.
The pixel areas are assigned in stripes from the top to the bottom of the image.
Most of the points in the center belong to the Mandelbrot set, which makes these
areas harder to compute, because for points that are not part of the Mandelbrot
set the algorithm can immediately continue with the next point once the escape
condition is met. As a result, jobs working on the center of the image take
longer than those working on the peripheral areas. For one, this explains the
speedup beyond eight parallel tasks, since with more tasks working on the image
the load imbalance has less effect. Secondly, it explains the effect that an
uneven number of jobs results in a higher completion time, since for an uneven
job number there is always a single job responsible for the very center of the
image, which consumes most of the computation time.
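The escape-time iteration responsible for this imbalance can be sketched as follows (a minimal sketch; the iteration limit and the escape radius of 2 are conventional assumptions, not taken from the thesis):

```java
// Sketch of the per-pixel escape-time iteration discussed above; names and
// constants are conventional assumptions, not the thesis implementation.
public class MandelbrotPixel {
    // returns the number of iterations until |z| > 2, or maxIter if the
    // point is assumed to belong to the Mandelbrot set
    public static int iterations(double cRe, double cIm, int maxIter) {
        double zRe = 0.0, zIm = 0.0;
        for (int i = 0; i < maxIter; i++) {
            double zRe2 = zRe * zRe, zIm2 = zIm * zIm;
            if (zRe2 + zIm2 > 4.0) {
                return i; // escape condition met: work stops early outside the set
            }
            zIm = 2.0 * zRe * zIm + cIm; // z = z^2 + c
            zRe = zRe2 - zIm2 + cRe;
        }
        return maxIter; // interior points always consume the full iteration budget
    }
}
```

This is why center stripes are expensive: interior points never escape, so they always burn the full iteration budget, while peripheral points return after a few iterations.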
In order to resolve the load imbalance, pixel lines would have to be assigned
randomly to each job, which however imposes additional management overhead for
tracking the mapping between pixel lines and jobs.

Figure 0.27: Speedup analysis of the Rank Sort algorithm.
0.13 Scalability Analysis
The simple speedup analysis clearly does not favor App Engine, since the
30-second computation limit necessitated the use of rather short tasks with a
correspondingly high relative overhead. In order to test the potential of App
Engine for computing larger problem sizes, we performed a scalability analysis.
Instead of distributing the jobs to one application, the jobs were distributed to
ten different deployed App Engine applications in order to circumvent the
per-minute quotas. As a comparison, again one karwendel node was used with the
same experimental setup.
The problem size was multiplied by the number of parallel tasks, which means the
problem size increases linearly with the number of jobs. For example, using the
Pi Approximation algorithm the problem size is set to N ∗ P, with N being the
number of parallel tasks and P the initial problem size. Under an ideal speedup
the runtime should therefore stay the same while the problem size increases.
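The weak-scaling setup thus amounts to a single formula (names illustrative):

```java
// Weak-scaling problem size used in the scalability analysis: the total work
// grows with the task count, so under ideal speedup the runtime stays flat.
public class WeakScaling {
    static long totalProblemSize(int parallelTasks, long initialSize) {
        return (long) parallelTasks * initialSize; // N * P
    }
}
```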
We used the Pi Approximation algorithm for the Scalability Analysis, since the
algorithm had the best results in the speedup experiment. Besides, the goal of
this experiment is not to analyze algorithm properties but to identify the
limitations of the App Engine framework in terms of peak performance.

Figure 0.28: Runtime analysis of the Mandelbrot algorithm.

The initial problem size was chosen smaller than in the speedup analysis in
order to avoid exceeding the 30-second request deadline. It was set to
180.000.000 points to generate. The number of parallel requests was chosen
between 1 and 25, because for larger numbers the per-minute quotas start to be
reached and App Engine denies further connections.
For an N larger than ten, App Engine started to sporadically abort requests with
the following message in the system log:
Request was aborted after waiting too long to attempt to service your request.
This may happen sporadically when the App Engine serving cluster is under
unexpectedly high or uneven load. If you see this message frequently, please
contact the App Engine team.
The aborted requests do not harm the integrity of the algorithm, since jobs are
rescheduled. However, they produced substantial overhead, since the requests
were not aborted instantly but only around ten seconds after they were issued.
Aborted requests occurred rather randomly, although they are certainly related
to the high load generated by the algorithm, since for fewer than ten parallel
jobs no requests were aborted.
Figure 0.30 shows the results of the scalability analysis. The top diagram shows
the total runtime of the algorithm executed on one karwendel compute node
compared to the App Engine results. The bottom diagram shows the number of
requests aborted by the App Engine frontend, in order to explain the
irregularities in the App Engine runtime.
Figure 0.29: Speedup analysis of the Mandelbrot algorithm (speedup vs. parallel tasks, against linear speedup; karwendel and App Engine).
The results on karwendel show a constant runtime for one up to eight parallel
tasks, as expected, since there is an almost linear speedup up to eight jobs for
the Pi Approximation algorithm. There is a substantial step in runtime between
eight and nine tasks, since the maximum number of processor cores is reached and
the algorithm becomes load imbalanced. For more than nine parallel tasks the
runtime steadily increases proportionally to the problem size.
On App Engine the algorithm scales pretty well, with generally only a slightly
increasing runtime. Most of the irregularities in the runtime are caused by the
overhead induced by aborted requests. For more than 17 parallel tasks App Engine
generally has a lower runtime than karwendel.
Concluding, the scalability analysis showed that using ten slave applications
deployed to Google App Engine, which is equivalent to one free account, we can
attain a peak performance comparable to one compute node of the karwendel
cluster. Moreover, it has to be considered that the standard random number
library itself shows very poor performance on App Engine. More complex Monte
Carlo simulations would require a more sophisticated random number generator
anyway, since the quality of random numbers generated by the standard Java
random number library is in most cases not sufficient for scientific
applications [12].
Figure 0.30: Scalability analysis of the Pi Approximation algorithm (top: total runtime in seconds; bottom: number of aborted requests; both vs. number of parallel tasks).
0.14 Resource Consumption and Cost Estimation
In order to determine the limiting quota resources for an algorithm, it has to
be analyzed which resources are used extensively and will first reach a
per-minute or daily quota limit. The problem size also has an impact on the
limiting resource, since, for example, the ratio between the amount of data to
transfer and the computation time is not the same for different problem sizes.
Resource quotas can be tracked via the administration web interface of the
application.
Tracking the resources of an algorithm on the one hand provides a means to
estimate the possible work throughput under the given quota limitations; on the
other hand, it allows a rough cost estimation for accounts with billing enabled.
Moreover, a more sophisticated system executing algorithms with different
resource consumption could schedule jobs so that the free resources are used
optimally. Typically, however, CPU hours should be the limiting factor, since
algorithms that have to transfer a lot of data will not achieve good performance
results on Google App Engine anyway.
For testing the resource consumption we used the same problem size as for the
algorithm speedup analysis, executed 100 times in a row. We tracked the three
most limiting resources, namely CPU hours, incoming bandwidth and outgoing
bandwidth. Database quotas could be tracked as well; however, the database CPU
usage is already included in the overall CPU hours, the overall storage capacity
of one gigabyte will never be a problem since the database is cleared after each
algorithm run, and the quotas for data sent to and received from the datastore
API are practically unreachable as well.
Resource              Unit         Unit Cost
Outgoing Bandwidth    gigabytes    $0.12
Incoming Bandwidth    gigabytes    $0.10
CPU Time              CPU hours    $0.10

Table 0.9: Resource costs as of 10.01.2011.
Table 0.9 shows the resource units and costs per unit for the measured quotas.
Problem Size    Algorithm     Out Bandwidth     In Bandwidth      CPU Time     Est. Cost
220.000.000     Pi            0 gigabytes       0 gigabytes       1.7 hours    $0.17
1500 × 1500     Matrix        0.85 gigabytes    0.75 gigabytes    1.15 hours   $0.292
70.000          RankSort      0.02 gigabytes    0.01 gigabytes    1.16 hours   $0.032
3200 × 3200     Mandelbrot    0.95 gigabytes    0 gigabytes       0.15 hours   $0.129

Table 0.10: Resource consumption and cost estimation for 100 iterations of the
given problem size.
Table 0.10 lists the resource consumption and the cost estimation of the
algorithms. As expected, the Pi Approximation algorithm is very
computation-heavy and has almost no data to transfer. Matrix Multiplication
makes heavy use of all the resources; besides being a very computation-intense
algorithm, the parameter and result matrices have to be communicated between
master and slave. Surprisingly, the Rank Sort algorithm consumes very little of
the bandwidth resources in comparison to the CPU time consumed, even though the
unsorted array has to be transferred to the slave and the ranks of each element
back to the master. The Mandelbrot set generator is clearly dominated by the
amount of data that has to be transferred back as a result.
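The estimated costs in table 0.10 follow directly from the unit costs in table 0.9; a small sketch of the arithmetic, using the Matrix Multiplication row as an example:

```java
// Sketch of the cost estimation from tables 0.9 and 0.10; names are illustrative.
public class CostEstimate {
    // unit costs from table 0.9 (as of 10.01.2011)
    static final double OUT_PER_GB = 0.12;
    static final double IN_PER_GB = 0.10;
    static final double CPU_PER_HOUR = 0.10;

    static double cost(double outGb, double inGb, double cpuHours) {
        return outGb * OUT_PER_GB + inGb * IN_PER_GB + cpuHours * CPU_PER_HOUR;
    }

    public static void main(String[] args) {
        // Matrix Multiplication row: 0.85 GB out, 0.75 GB in, 1.15 CPU hours
        System.out.println(cost(0.85, 0.75, 1.15)); // matches the $0.292 estimate
    }
}
```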
For the Pi Approximation we can generally state that for $1 around 129 ∗ 10⁹
random points can be sampled, since the algorithm has a linear computation
effort. For the other algorithms it is more difficult to give a generalized
estimate of resource consumption, since the consumption of the different
resources is not linear and scales differently for increasing problem sizes. The
Mandelbrot set algorithm is especially problematic, since the computation effort
varies vastly depending on the area of the complex plane that is considered.
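The points-per-dollar figure follows from the table values (220.000.000 points per run, 100 runs, $0.17 estimated cost); a sketch of the arithmetic with illustrative names:

```java
// Sketch of the points-per-dollar estimate for the Pi Approximation algorithm;
// the input figures are taken from table 0.10, names are illustrative.
public class PointsPerDollar {
    static double pointsPerDollar(double pointsPerRun, double runs, double estCost) {
        return pointsPerRun * runs / estCost;
    }

    public static void main(String[] args) {
        // figures for the Pi Approximation row of table 0.10
        System.out.println(pointsPerDollar(220_000_000d, 100, 0.17)); // ≈ 129 * 10^9
    }
}
```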
However, the general complexity for each of the resources can be stated. Table
0.11 lists the complexity with which each of the resources is consumed by the
algorithms.

Algorithm     Out Bandwidth    In Bandwidth    CPU Time
Pi            O(1)             O(1)            O(n)
Matrix        O(n²)            O(n²)           O(n³)
RankSort      O(n)             O(n)            O(n²)
Mandelbrot    O(n²)            O(1)            O(n²)

Table 0.11: Resource complexity of each algorithm.
Related Work
Most of the work regarding scientific computing in the cloud covers IaaS systems,
since existing algorithms and benchmarks can be easily ported to the cloud
platform. Moreover, the flexibility in terms of code reuse makes them a
preferable choice.
The paper "Scientific Cloud Computing: Early Definition and Experience" [24]
provides an early general overview of Cloud Computing with regard to scientific
computing. It discusses the distinct properties and enabling technologies of
Cloud Computing and provides a comparative study to regular Grid Computing.
Besides, various early cloud providers and their distinct capabilities are
discussed. The paper mainly provides a collection of key cloud properties and a
distinction from other computing paradigms.
There are various papers that perform performance evaluations of different cloud
systems. Most of them focus on Amazon EC2, the most popular IaaS cloud service.
"Performance Analysis of Cloud Computing Services for Many-Tasks Scientific
Computing" [16] provides a performance analysis of four different commercial
cloud providers in terms of parallel scientific computing. For the stated
reasons, all four selected cloud providers are IaaS providers: Amazon EC2 [1],
GoGrid [4], ElasticHosts [3], and Mosso [7]. The main question of the paper is
whether cloud performance is sufficient for MTC-based scientific computing. The
conclusion is that the compute performance of the tested clouds is rather low,
though they might still be a viable alternative to traditional scientific
computing environments for scientists who need resources instantly and
temporarily.
AppScale [11] is a project dedicated to the execution of Google App Engine
applications on Xen-based clusters, including IaaS cloud systems such as
Amazon's AWS/EC2 and Eucalyptus. Moreover, it provides a framework for
researchers to investigate the interactions between PaaS and IaaS systems and
the internal technologies used by PaaS systems such as Google App Engine. The
system basically emulates the App Engine framework and its API. In this thesis
we used the development server to execute App Engine applications on proprietary
hardware; AppScale provides a more sophisticated means to do so and even allows
App Engine applications to be wrapped onto other cloud services such as Amazon
EC2.
Google recently released an open-source library for performing map-reduce-like
operations on the datastore using Google App Engine and task queues [22].
Instead of a low-level service, Google only provides a library that offers a
simple map-reduce implementation based on App Engine task queues. So in
principle the library has the same limitations as a regular App Engine
application and thus also consumes the same amount of resources. The API is
similar to Hadoop's map-reduce API; a Hadoop transition guide is even provided.
So far only the map phase is supported; support for the reduce phase is however
announced.
Conclusions
Cloud computing as a computing paradigm has recently emerged as a topic of high
research interest. It is especially attractive for smaller companies and
research groups that cannot afford expensive infrastructure. Most of the
research regarding scientific computing in the cloud, however, has focused on
IaaS cloud providers.
This thesis focused on investigating the capabilities of Google App Engine, a
Platform as a Service cloud provider, in terms of scientific computing. PaaS
cloud providers do not offer the convenience to execute arbitrary software, which
means programs have to be written according to the provided framework. As a
consequence it imposes various restrictions to the programmer which make the
use for efficient scientific computations rather difficult. Foremost problematic is
the restriction to Java or Python as programming language, since most scientific
programs are written in C or Fortran, which makes porting of existing code
expensive. Moreover the programmer is restricted to a subset of the standard
libraries and is prevented from using arbitrary libraries. As a result algorithms
usually have to be reprogrammed from scratch and libraries have to be available
in source code in order to be used.
Another problem is that the high-level programming languages and the unknown
hardware make algorithm profiling and performance tuning difficult. Besides,
there are many unknown variables, such as random background load in the network,
the App Engine frontend or the application servers, that may influence the
performance of algorithms. It is noticeable that the App Engine framework is
intended for developing web applications, and the architecture and API were
therefore constructed with the typical requirements of a web application in
mind. Various restrictions, such as the 30-second request deadline, that
certainly make sense in the context of a web application therefore have to be
circumvented, and as a consequence various sources of additional overhead arise.
The measurements showed that algorithms with larger amounts of data to trans-
fer, such as Matrix Multiplication, are rather unsuited for execution on App
Engine: the data has to be transferred over a comparatively slow network, so
the data-transfer overhead often dominates the computation. In contrast,
embarrassingly parallel algorithms with very little data to transfer, such as
Monte Carlo simulations, perform considerably better. The scalability analysis
showed that the Pi Approximation algorithm on a single free App Engine account
yields performance comparable to one compute node of the karwendel cluster,
even though the random number library in use performed rather poorly on App
Engine. Moreover, App Engine accounts with billing enabled have more relaxed
per-minute quotas, so an even better peak performance could be achieved.
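The favourable compute-to-communication ratio of the Monte Carlo approach can be made concrete with a small sketch: each slave only has to return a single integer, its hit count, regardless of how many samples it draws. This is an illustrative reimplementation, not the thesis's actual slave code; `count_hits` and `approximate_pi` are assumed names.

```python
import random

def count_hits(samples, seed=None):
    """One worker's share: count random points inside the unit quarter
    circle. Only this single integer travels back to the master, which
    keeps the data-transfer overhead minimal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def approximate_pi(workers, samples_per_worker):
    """Master side: the quarter circle covers pi/4 of the unit square,
    so pi is estimated as 4 * hits / samples over all workers."""
    total_hits = sum(count_hits(samples_per_worker, seed=w)
                     for w in range(workers))
    return 4.0 * total_hits / (workers * samples_per_worker)
```

The seed per worker is only there to make the sketch reproducible; in a real deployment each slave would use an independently seeded generator.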
In terms of cost management, App Engine provides a more fine-grained,
resource-based payment model than other cloud providers. It is, however,
difficult to give an exact cost comparison to other cloud providers in the
context of scientific computing. In addition, Google grants a sizable amount
of free resources to each user, which makes the platform particularly
interesting. Each free account can deploy up to ten applications, each with
its own resource quotas, so every account provides a total of 65 CPU hours as
well as 10 gigabytes of incoming and outgoing bandwidth each day for free.
Assuming that each member of a research group opens and provides an account, a
large amount of resources can be used at no cost. With a more sophisticated
job scheduler, these free resources could be assigned optimally.
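As a back-of-envelope check of the figures above (a hedged sketch; the per-account totals are those stated here and in the quota discussion, and may change over time):

```python
# Daily free resources per account: ten applications at 6.5 CPU hours
# and 1 GB of traffic per direction each, i.e. 65 CPU hours and 10 GB
# per account. These constants mirror the totals stated in the text.
CPU_HOURS_PER_ACCOUNT = 65
BANDWIDTH_GB_PER_ACCOUNT = 10

def free_resources(members):
    """Aggregate daily free CPU hours and bandwidth (GB, per direction)
    for a research group in which each member provides one account."""
    return (members * CPU_HOURS_PER_ACCOUNT,
            members * BANDWIDTH_GB_PER_ACCOUNT)
```

A group of five members would thus control 325 free CPU hours and 50 GB of daily traffic per direction.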
Furthermore, once the Distribution Framework is set up and the algorithms are
correctly implemented, App Engine slave applications can be used in heteroge-
neous conjunction with slave applications executed on other systems. In the
experiments we used a traditional computer cluster executing the development
server as a comparative system. AppScale [11] is a project porting the App
Engine framework to IaaS infrastructures. The Distribution Framework could
therefore even be used with AppScale images deployed to other IaaS cloud
providers, which provides a lot of flexibility in terms of resource usage and
cost management.
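The heterogeneous setup can be sketched as follows. For self-containedness the slaves are modelled as plain callables; in the real framework each entry would be an HTTP endpoint backed by an App Engine application, a development server on a cluster node, or an AppScale deployment. The name `distribute` is an assumption for illustration, not the framework's actual API.

```python
from itertools import cycle

def distribute(work_units, slaves):
    """Round-robin work units over a heterogeneous pool of slaves.

    `slaves` is a list of callables standing in for slave endpoints of
    any kind; the master does not need to know what system executes
    each unit, only where to send it and how to collect the result.
    """
    pool = cycle(slaves)
    return [next(pool)(unit) for unit in work_units]
```

A more sophisticated scheduler would weight the assignment by measured slave throughput instead of plain round-robin, which is where the heterogeneous speeds of App Engine, cluster, and AppScale slaves would matter.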
In conclusion, even though the App Engine framework is definitely not designed
for scientific applications, it can provide a significant amount of flexible
computing resources for free, if the extra effort of porting the algorithms to
App Engine is accepted.
List of Figures
0.1 Request handling architecture of Google App Engine taken from [22] xii
0.2 Compression time needed for the different compression streams. . xxx
0.3 Data size for an integer array filled with random numbers. . . . . xxxi
0.4 Master application architecture. . . . . . . . . . . . . . . . . . . . xxxv
0.5 Activity diagram illustrating the control flow of the slave application xlii
0.6 Typical homogeneous local network topology. . . . . . . . . . . xlix
0.7 Heterogeneous local network topology, with superior master node. l
0.8 Completion time for square matrix multiplication using no shared
data management, shared data management with parallel and
sequential data transfer. . . . . . . . . . . . . . . . . . . . . . . . li
0.9 Completion time for square matrix multiplication using no shared
data management, shared data management with parallel and
sequential data transfer. . . . . . . . . . . . . . . . . . . . . . . . lii
0.10 Illustration of the Monte Carlo pi calculation . . . . . . . . . . liv
0.11 Illustration of the Monte Carlo integration . . . . . . . . . . . lx
0.12 Data dependencies for entry C2,3 of the result matrix. . . . . . . lxvi
0.13 Example of data partitioning in parallel matrix multiplication. . lxvii
0.14 Coloured image of the Mandelbrot set. . . . . . . . . . . . . . . lxxii
0.15 Results of the latency analysis. . . . . . . . . . . . . . . . . . . . xc
0.16 Results of the bandwidth test. . . . . . . . . . . . . . . . . . . . . xci
0.17 Microbenchmark results. . . . . . . . . . . . . . . . . . . . . . . . xciii
0.18 Computation time results of Scalar mul/div and the Fibonacci
number generator. . . . . . . . . . . . . . . . . . . . . . . . . . . xciv
0.19 Fibonacci test requests. . . . . . . . . . . . . . . . . . . . . . . . xcv
0.20 Fibonacci test requests mapped to their respective instance. . . . xcvi
0.21 Results of the cache hierarchy test. . . . . . . . . . . . . . . . . xcviii
0.22 Runtime analysis of the Pi Approximation algorithm. . . . . . . c
0.23 Speedup analysis of the Pi Approximation algorithm. . . . . . . . c
0.24 Runtime analysis of integer Matrix Multiplication. . . . . . . . . cii
0.25 Speedup analysis of integer Matrix Multiplication. . . . . . . . . ciii
0.26 Runtime analysis of the Rank Sort algorithm. . . . . . . . . . . . civ
0.27 Speedup analysis of the Rank Sort algorithm. . . . . . . . . . . . cv
0.28 Runtime analysis of the Mandelbrot algorithm. . . . . . . . . . . cvi
0.29 Speedup analysis of the Mandelbrot algorithm. . . . . . . . . . . cvii
0.30 Scalability analysis of the Pi Approximation algorithm. . . . . . cviii
List of Tables
0.1 Free quotas for general resources (as of 20.09.2010) [6]. . . . . . . xxii
0.2 General Datastore quotas (as of 20.09.2010) [6]. . . . . . . . . . . xxiii
0.3 Daily Datastore quotas (as of 20.09.2010) [6]. . . . . . . . . . . . xxiv
0.4 Properties overview of App Engine and local slaves. . . . . . . . xxxiv
0.5 Example illustrating the concept of ranks. . . . . . . . . . . . . . lxxx
0.6 Hardware specification of the karwendel cluster. . . . . . . . . . . lxxxviii
0.7 Hardware specification of the zid-gpl server. . . . . . . . . . . . . lxxxviii
0.8 Maximum and minimum transfer speeds for different data chunk
sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xci
0.9 Resource costs as of 10.01.2011. . . . . . . . . . . . . . . . . . cix
0.10 Resource consumption and cost estimation for 100 iterations of
the given problem size. . . . . . . . . . . . . . . . . . . . . . . . . cix
0.11 Resource complexity of each algorithm. . . . . . . . . . . . . . . cx
Bibliography
[1] Amazon Elastic Compute Cloud (Amazon EC2), http://aws.amazon.com/
de/ec2/.
[2] App Engine Developers Guide, http://code.google.com/intl/de-DE/
appengine/docs/.
[3] ElasticHosts cloud service, http://www.elastichosts.com/.
[4] GoGrid cloud-server hosting, http://www.gogrid.com/.
[5] Google App Engine Framework, http://code.google.com/intl/de-DE/
appengine/.
[6] Google App Engine Quotas, http://code.google.com/intl/de-DE/
appengine/docs/quotas.html.
[7] Mosso cloud service, http://www.mosso.com/.
[8] The JRE White List, http://code.google.com/intl/de-DE/appengine/
docs/java/jrewhitelist.html.
[9] RFC 2616 - Hypertext Transfer Protocol – HTTP/1.1, 1999.
[10] Felician Alecu, Parallel Rank Sort, 2005.
[11] Navraj Chohan, Chris Bunch, Sydney Pang, Chandra Krintz, Nagy
Mostafa, Sunil Soman, and Rich Wolski, AppScale Design and Implemen-
tation, 2009.
[12] P.D. Coddington, J.A. Mathew, and K.A. Hawick, Interfaces and Imple-
mentations of Random Number Generators for Java Grande Applications,
1999.
[13] Alberto Gotta, Francesco Potorti, and Raffaello Secchi, An Analysis of
TCP Startup over an Experimental DVB-RCS Platform, 2006.
[14] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Intro-
duction to Parallel Computing, 2003.
[15] Geir Gunderson and Trond Steihaug, Data Structures in Java for Matrix
Computations, 2002.
[16] Alexandru Iosup, Simon Ostermann, Nezih Yigitbasi, Radu Prodan,
Thomas Fahringer, and Dick Epema, Performance Analysis of Cloud Com-
puting Services for Many-Tasks Scientific Computing, 2010.
[17] Malvin H. Kalos and Paula A. Whitlock, Monte Carlo Methods, 2008.
[18] Qusay H. Mahmoud, Compressing and Decompressing Data Using Java
APIs, http://java.sun.com/developer/technicalArticles/Programming/
compression/, 2002.
[19] Benoit B. Mandelbrot, Fractals and Chaos: The Mandelbrot Set and Be-
yond, 2004.
[20] Peter Mell and Tim Grance, Definition of Cloud Computing v15, NIST,
2009.
[21] Igor Ostrovsky, Gallery of Processor Cache Effects, http://igoro.com/
archive/gallery-of-processor-cache-effects/.
[22] Dan Sanderson, Programming Google App Engine, 2009.
[23] Steven S. Skiena, The Algorithm Design Manual, 1997.
[24] Lizhe Wang and Gregor von Laszewski, Scientific Cloud Computing: Early
Definition and Experience, 2008.