Scientific Computing in the Cloud with
Google App Engine
master thesis in computer science
by
Michael Sperk
submitted to the Faculty of Mathematics, Computer
Science and Physics of the University of Innsbruck
in partial fulfillment of the requirements
for the degree of Master of Science
supervisor: Prof. Dr. Radu Prodan, Institute of Computer Science
Innsbruck, 17 January 2011
Certificate of authorship/originality
I certify that the work in this thesis has not previously been submitted for a
degree nor has it been submitted as part of requirements for a degree except as
fully acknowledged within the text.
I also certify that the thesis has been written by me. Any help that I have
received in my research work and the preparation of the thesis itself has been
acknowledged. In addition, I certify that all information sources and literature
used are indicated in the thesis.
Michael Sperk, Innsbruck on the 17 January 2011
Abstract
Cloud Computing has recently emerged as a topic of high research interest.
It has become an attractive alternative to traditional computing
environments, especially for smaller research groups that cannot afford
expensive infrastructure. Most of the research regarding scientific
computing in the cloud, however, has focused on IaaS cloud providers.
Google App Engine is a PaaS cloud framework dedicated to the development
of scalable web applications. The focus of this thesis is to investigate
App Engine's capabilities in terms of scientific computing. Moreover,
algorithm properties that are well suited for execution on Google App
Engine, as well as potential problems and bottlenecks, are identified.
Acknowledgements
....
Contents
0.1 Architecture
    0.1.1 The Runtime Environment
    0.1.2 The Datastore
    0.1.3 Scalable Services
    0.1.4 The App Engine Development Server
    0.1.5 Quotas and Limits
0.2 HTTP Requests
    0.2.1 The HTTP Protocol
    0.2.2 Apache HTTP Components
    0.2.3 Entity Compression
0.3 Slave Types
    0.3.1 App Engine Slaves
    0.3.2 Local Slaves
    0.3.3 Comparison of Slave Types
0.4 The Master Application
    0.4.1 Architecture
    0.4.2 Generating Jobs
    0.4.3 Job Mapping
    0.4.4 Fault Tolerance
0.5 The Slave Application
    0.5.1 WorkJobs
    0.5.2 Results
    0.5.3 Message Headers
0.6 Shared Data Management
    0.6.1 Data Splitting
    0.6.2 Data Transfer Strategy
    0.6.3 Performance Evaluation
0.7 Monte Carlo Routines
    0.7.1 Pi Approximation
    0.7.2 Integration
0.8 Matrix Multiplication
    0.8.1 Algorithm
    0.8.2 Parallelization
    0.8.3 Implementation
0.9 Mandelbrot Set
    0.9.1 Algorithm
    0.9.2 Parallelization
    0.9.3 Implementation
0.10 Rank Sort
    0.10.1 Algorithm
    0.10.2 Parallelization
    0.10.3 Implementation
    0.10.4 Hardware and Experimental Setup
0.11 Analyzing Google App Engine Performance
    0.11.1 Latency Analysis
    0.11.2 Bandwidth
    0.11.3 Java Performance Analysis
    0.11.4 JIT Compilation
    0.11.5 Cache Hierarchy
0.12 Speedup Analysis
    0.12.1 Pi Approximation
    0.12.2 Matrix Multiplication
    0.12.3 Rank Sort
    0.12.4 Mandelbrot Set
0.13 Scalability Analysis
0.14 Resource Consumption and Cost Estimation
List of Figures
List of Tables
Bibliography
Introduction
In the last few years, a new paradigm for handling computing resources,
called Cloud Computing, has emerged. The basic idea is that resources,
software and data are provided as on-demand services over the Internet [20].
The actual technology used in the cloud is abstracted from the user, so all
administrative tasks are shifted to the service provider. Moreover, the
provider deals with problems such as load balancing and scalability,
typically by means of resource virtualization. Cloud Computing provides a
flexible and cost-efficient alternative to local management of compute
resources. Payment is typically done on a per-use basis, so the user only
pays for the resources that were actually consumed.
Cloud services can be classified into three categories by the level of
abstraction of the service [20]:
1. Infrastructure as a Service (IaaS): IaaS provides only basic storage,
network and computing resources. The user does not manage or control
the underlying cloud infrastructure, but can deploy and execute arbitrary
software, including the operating system and applications.
2. Platform as a Service (PaaS): PaaS provides a platform for executing
consumer-created applications, developed using programming languages
and tools supplied by the provider. The user does not manage the un-
derlying cloud infrastructure, storage or the operating system, but has
control over the deployed applications.
3. Software as a Service (SaaS): SaaS provides the use of applications
developed by the provider, running on the cloud infrastructure and accessed
through a thin client, typically a web browser. The user has no control over
the infrastructure, the operating system or the software capabilities.
Cloud computing has recently become an appealing alternative for research
groups to buying and maintaining expensive computing clusters. Most work on
scientific computing in the cloud has focused on IaaS clouds such as Amazon
EC2 [1], because arbitrary software can be executed, which makes porting
existing scientific programs a lot easier. Moreover, there are no
restrictions in terms of operating system or programming language.
Google App Engine is a PaaS cloud service dedicated especially to scalable
web applications [5]. It mainly targets smaller companies that cannot afford
the infrastructure to handle a large number of requests or sudden traffic peaks.
App Engine provides a framework for developing servlet-based web applications
using Python or Java as the programming language. Applications are deployed to
Google's server infrastructure.
Each application consumes resources such as CPU time, number of requests
and used bandwidth. Google grants a certain amount of free daily resources to
each application. If billing is enabled, the user pays for resource usage that
surpasses the free limits; otherwise the web application becomes unavailable
once critical resources are depleted. This makes the service especially
interesting for scientific computing, since an automated program could simply
use up the free daily resources and pause computation until they are
replenished. Moreover, each member of a research group can contribute an
account with separate resources.
The problem, though, is that the framework is very restrictive in terms of
programming language, libraries and many other aspects. This makes its use
for scientific computing more difficult than on common IaaS cloud platforms.
The focus of this thesis is to explore the capabilities of the Google App En-
gine framework in terms of scientific computing. The goal is to build a simple
framework for utilizing App Engine servers for parallel scientific computations.
Subsequently a few exemplary algorithms should be implemented and analyzed
in order to identify algorithm properties that might be well suited for execution
in a PaaS cloud environment. Moreover potential problems and bottlenecks that
arise should be analyzed as well.
The thesis is structured in four main parts. First, the App Engine framework
and the parts of the API that will be used are introduced. This is followed
by a description of the Distribution Framework that was developed in the
course of the thesis and a basic introduction to the algorithms that were
implemented to test the system. Finally, the experimental results obtained by
testing the Distribution Framework under practical circumstances are presented.
Google App Engine
Google App Engine is a Cloud service for hosting Web applications on Google’s
large scale server infrastructure. However, it provides a whole framework for
building scalable Web applications rather than plain access to hardware. As
more people access the application, App Engine automatically allocates and
manages additional resources.
The user never has to set up or maintain a server environment. In addition,
common problems such as load balancing, caching and traffic peaks are handled
by App Engine automatically.
The framework provides a certain amount of free resources, enough for smaller
applications. Google estimates that with the free resources an application
can handle about five million page views per month. If an application needs
resources that exceed the free quota limits, these are billed on a per-use
basis. For example, if an application is very computation heavy, only the
additional CPU hours are billed.
In this chapter, the aspects of the App Engine framework relevant to this
thesis are described. The descriptions are mostly based on [22] and the
official online documentation [2].
0.1 Architecture
Figure 0.1: Request handling architecture of Google App Engine, taken from [22]

Figure 0.1 shows the request handling architecture of Google App Engine. Each
request is first inspected by the App Engine frontend. In fact, there are
multiple frontend machines and a load balancer that manages the proper
distribution of requests to the actual machines. The frontend determines
which application a request is addressed to by inspecting the domain name of
the request.
In the next step, the frontend reads the configuration of the corresponding
application. The configuration of an application determines how the frontend
handles a request, depending on the URL. The URL path can be mapped either to
a static file or to a request handler. Static files are typically images,
JavaScript files or similar resources. A request handler dynamically
generates a response for the request, based on application code. If no
matching mapping is found in the configuration, the frontend responds with an
HTTP 404 "Not Found" error message.
Requests to static files are forwarded to the static file servers. The static file
servers are optimized for fast delivery of resources that do not change often.
Whether a file is static and should be stored on the static file servers is decided
at application deployment.
If the request is linked to a request handler, it is forwarded to the
application servers. One specific application server is chosen and an
instance of the application is started. If an instance of the application is
already running, it can be reused, so servers already running an instance are
typically preferred. The appropriate request handler of the application is
then invoked.
The strategies for load balancing and distributing requests to application
servers are still being optimized. However, the main goal is fast-responding
request handlers, in order to guarantee a high throughput of requests. How
many instances of an application are started at a time and how requests are
distributed depends on the application's traffic and resource usage patterns.
Typically, just enough instances are started at a time to handle the current
traffic.
The application code itself runs in a runtime environment, an abstraction
above the operating system. This mechanism allows servers to manage resources
such as CPU cycles and memory for multiple applications running on the same
server. Besides, applications are prevented from interfering with one another.
The application server waits until the request handler terminates and returns
the response to the frontend, thus completing the request. Request handlers
have to terminate before returning data; therefore, streaming of data is not
possible. The frontend then constructs the final response to the client. If
the client indicates that it supports compression by adding the
"Accept-Encoding" request header, data is automatically compressed using the
gzip format.
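This transparent compression can be reproduced in plain Java, which is useful
when a client outside App Engine wants to handle or verify what the frontend
does. The following sketch shows a gzip round trip using the standard
java.util.zip classes; the class and method names are chosen for this
illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Gzip round trip as performed transparently between the App Engine frontend
// (compression) and a client that sent "Accept-Encoding: gzip" (decompression).
public class EntityCompression {

    public static byte[] compress(byte[] raw) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(raw);  // closing the stream finishes the gzip trailer
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // in-memory streams do not fail
        }
    }

    public static byte[] decompress(byte[] compressed) {
        try {
            GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(compressed));
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            for (int n; (n = gz.read(chunk)) > 0; ) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the text-based responses typical of web applications, the compressed
entity is usually a fraction of the original size, which directly reduces
the outgoing bandwidth counted against the quota.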
App Engine consists of three main parts: the runtime environment, the Data-
store and the scalable services. The runtime environment executes the code of
the application. The Datastore provides a possibility for developers to persist
data beyond requests. Finally App Engine provides a couple of scalable services
typically useful to web applications. In the following, each of these parts
is described briefly.
0.1.1 The Runtime Environment
As already mentioned, the application code runs in a runtime environment,
which is abstracted from the underlying operating system. This isolated envi-
ronment running the applications is called the sandbox. App Engine applications
can be programmed either in Python or in Java. As a consequence each pro-
gramming language has its own runtime environment.
The Python runtime provides an optimized interpreter (at the time of this
writing, Python 2.5.2 was the latest supported version). Besides the standard
library,
a wide variety of useful libraries and frameworks for Python web application de-
velopment, such as Django, can be used.
The Java runtime follows the Java Servlet standards, providing the correspond-
ing APIs to the application. Common web application technologies such as
JavaServer Pages (JSP) are supported as well. The App Engine SDK supports
developing applications in Java version 5 or 6.
Though applications are typically developed in Java, in principle any
language for which a compiler to Java bytecode exists, such as JavaScript,
Ruby or Scala, can be used. This section will focus on the Java runtime.
The sandbox imposes several restrictions on applications:
1. Developers have limited access to the filesystem. Files deployed along
with the application can be read; however, there is no write access to the
filesystem whatsoever.
2. Applications have no direct access to the network, though HTTP requests
can be performed through a service API.
3. In general no access to the underlying hardware or the operating system
is granted.
4. App Engine does not support domains without "www", such as
http://example.com, because canonical name records are used for load
balancing.
5. Usage of threads is not permitted.
6. Java applications can only use a limited set of classes from the standard
Java Runtime Environment, documented in [8].
Sandboxing on the one hand prevents applications from performing malicious
operations that could harm the stability of the server infrastructure or interfere
with other applications running on the same physical machine. On the other
hand, it enables App Engine to perform automatic load balancing, because it
does not matter on what underlying hardware or operating system the appli-
cation is executed. There is no guarantee that two requests will be executed
on the same machine even if the requests arrive one after another and from the
same client. Multiple instances of the same or even of different applications can
run on the same machine without affecting one another.
The sandbox also limits resources such as CPU or memory use and can throttle
applications that consume a particularly high amount of resources, in order
to protect applications executed on the same machine. A single request has a
maximum of 30 seconds to terminate and respond to the client, although App
Engine is optimized for much shorter requests and may slow down an
application that consumes too many CPU cycles.
Since scientific applications are CPU intensive, these limitations imposed by
the runtime environment are problematic for such applications.
0.1.2 The Datastore
Web applications need a way to persist data between the stateless requests
to the application. The traditional approach is a relational database residing
on a single database server. The central database is accessed by a single or
potentially by multiple web servers retrieving the necessary data. The
advantage of such a system is that every web server always has the most
recent version of the data. However, once the limit for handling multiple
parallel database requests is reached, it becomes difficult to scale the
system up to more requests.
Alongside relational database systems there are various other approaches like
XML databases or object databases.
The Datastore is App Engine's own distributed data storage service. The main
idea is to provide a high-level API and to hide the details of how storage is
actually done from the developer. This spares the application developer the
task of keeping data up to date while still maintaining scalability.
The database paradigm of the Datastore most closely resembles an object
database. Data objects are called entities and have a set of properties. Prop-
erty values can be chosen from a set of supported data types. Entities are of a
named kind in order to provide a mechanism for categorizing data.
This concept might seem similar to a relational database. Entities resemble rows
and properties resemble the columns in a table. However, there are some key
differences to a relational database. First of all, the Datastore is schemaless,
which means that entities of the same kind are not required to have the same
properties. Furthermore, two entities are allowed to have a property with the
same name but different value types. Another important difference is that a
single property can have multiple values.
Entities are identified by a key, which can either be generated automatically
by App Engine or manually by the programmer. Unlike the primary key in a
relational database, an entity key is not a field, but a separate aspect of
the entity. App Engine uses the key in combination with the kind of an entity
to determine where to store the entity in the distributed server
infrastructure. As a consequence, neither the key nor the kind of an entity
can be changed once it is created.
Indexes used in the Datastore are defined in a configuration file. While
testing the application locally on the development server, index suggestions
are automatically added to the configuration: the framework recognizes
typical queries performed by the application and generates indexes
accordingly. The index definitions can be manually fine-tuned by modifying
the configuration file before uploading the application.
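For the Java runtime, these index definitions live in the application's
datastore-indexes.xml configuration file. A minimal hand-tuned example might
look as follows; the kind and property names here are purely illustrative,
and autoGenerate="true" lets the development server keep appending its own
suggestions:

```xml
<!-- war/WEB-INF/datastore-indexes.xml (illustrative example) -->
<datastore-indexes autoGenerate="true">
  <!-- A composite index for queries that filter on "owner"
       and sort by "created" in descending order. -->
  <datastore-index kind="WorkJob" ancestor="false">
    <property name="owner" direction="asc"/>
    <property name="created" direction="desc"/>
  </datastore-index>
</datastore-indexes>
```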
App Engine's query concept supports the most common query types. A query
consists of the entity kind along with a set of conditions and a sorting
order. Executing a query returns all entities of the given kind that meet all
of the given conditions, sorted by the given order. Besides letting the query
return the entities themselves, there is also the option to let it return
only the key values of the entities. This helps to minimize the data transfer
from the Datastore to the application if only some of the queried entities
are actually used.
The data of a web application is typically accessed by multiple users
simultaneously, which makes a transaction concept important. App Engine
guarantees atomicity: every update of an entity involving multiple properties
either succeeds entirely or fails entirely, leaving the object in its
original state. Other users will only see the complete update or the original
entity, and never something in between.
App Engine uses an optimistic concurrency control mechanism: it is assumed
that transactions can take place without conflict. Once a conflict occurs
(multiple users try to update the same entity at the same time), the entity
is rolled back to its original state and all users trying to perform an
update receive a concurrency failure exception. Such a concept is most
efficient for systems in which conflicts occur rather rarely, which is
usually the case for a web application. Reads always succeed, and the user
simply sees the most recent version of the entity. There is also the
possibility to read multiple entities in a group in order to guarantee
consistency of the data.
Transactions can also be defined manually, by bundling multiple database
operations into a single transaction. For example, an application can read an
entity, update a property accordingly, write the entity back and commit the
transaction. Again, if the transaction fails, all of its database operations
have to be repeated.
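The "repeat everything on failure" behavior naturally leads to a retry loop
around the whole transaction. The following sketch illustrates the pattern in
plain Java; the conflict is simulated with an ordinary RuntimeException
rather than the actual JDO concurrency exception, and all names are chosen
for this example.

```java
import java.util.function.Supplier;

// Retry pattern for optimistic concurrency control: a transaction is
// attempted, and on a concurrency failure all of its operations (read,
// update, commit) are repeated from the start.
public class OptimisticRetry {

    public static <T> T runWithRetries(Supplier<T> transaction, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be positive");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return transaction.get();  // read, update, write back, commit
            } catch (RuntimeException conflict) {
                lastFailure = conflict;    // rolled back: repeat all operations
            }
        }
        throw lastFailure;  // still conflicting after maxAttempts tries
    }
}
```

Because every retry re-reads the entity, the update is always applied to the
most recent committed version, which is exactly what the optimistic model
requires.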
The Datastore provides two standard Java interfaces for data access: Java Data
Objects (JDO) and Java Persistence API (JPA). The implementation of the two
interfaces uses the DataNucleus Access Platform, which is an open source im-
plementation of the specified APIs. Alongside the high level APIs, App Engine
also provides a low level API, which can be used to program further database
interfaces. The low level API can also be used directly from the application,
which in some cases might be more efficient than the high level APIs.
The Java Data Objects API
In the following, the JDO API is briefly described, along with an example
illustrating its use. JDO uses annotations to describe how entities are
stored and reconstructed. The following JDO data class, called
DataStoreEntity, demonstrates the use of annotations:
Listing 1: Example of a Datastore entity

@PersistenceCapable
public class DataStoreEntity {

    @PrimaryKey
    @Persistent
    private String key;

    @Persistent
    private Blob data;

    public DataStoreEntity(byte[] data, String key) {
        this.data = new Blob(data);
        this.key = key;
    }

    public byte[] getData() {
        return data.getBytes();
    }

    public void setData(byte[] data) {
        this.data = new Blob(data);
    }

    public String getKey() {
        return key;
    }

    public void setKey(String key) {
        this.key = key;
    }
}
The class is marked with the annotation @PersistenceCapable, indicating that
it is a storable data class. The class defines two fields annotated with
@Persistent, telling the Datastore that they should be stored as properties
of the entity. The field key is additionally annotated with @PrimaryKey,
making it the database key of the entity. Besides the standard Java data
types, several additional classes are provided for various purposes.
In order to perform database operations, a PersistenceManager is needed,
which is retrieved through a PersistenceManagerFactory (PMF). The PMF takes
some time to initialize, though only one instance is needed for the
application. Typically the PMF is stored in a static variable, making it
available to the application through a singleton wrapper:
Listing 2: PersistenceManagerFactory singleton

public final class PMF {
    private static final PersistenceManagerFactory pmfInstance =
        JDOHelper.getPersistenceManagerFactory("transactions-optional");

    private PMF() {}

    public static PersistenceManagerFactory get() {
        return pmfInstance;
    }
}
Having defined a JDO data class and the singleton wrapper for the
PersistenceManagerFactory, instances of the entity can be stored in the
Datastore and retrieved using the query API:
Listing 3: Using the query API

PersistenceManager pm = PMF.get().getPersistenceManager();

pm.makePersistent(new DataStoreEntity(data, req.getHeader("id")));

Query query = pm.newQuery(DataStoreEntity.class);
List<DataStoreEntity> objs = (List<DataStoreEntity>) query.execute();
Every database operation is performed through a PersistenceManager instance.
The makePersistent() method simply stores persistence-capable classes in the
Datastore. Datastore entities are retrieved using queries, which are also
generated by the PersistenceManager: the newQuery() method returns a query
for a given class. Executing the query without further constraints returns
all Datastore entities of the given class, in a Java List of the
corresponding class.
0.1.3 Scalable Services
The Datastore provides a high-level API that hides implementation details
from the programmer. In a similar fashion, App Engine provides APIs to
several scalable services. On the one hand, some services compensate for the
restrictions of the sandbox; on the other hand, there are services typically
useful to web applications.
This system enables App Engine to handle scalability and performance of the
services while the developers do not have to worry about implementation
details. In the following, the different services are described in short:
1. URL Fetch: Because of the restrictions of the sandbox, applications are
not allowed to initiate arbitrary network connections. The URL Fetch
service provides an API for accessing HTTP resources on the Internet,
such as web services or other web sites. Since requests to web resources
often take a long time, there is a way to perform asynchronous HTTP
requests, as well as a timeout mechanism to abort requests to resources
that do not respond in time.
2. Mail: An application can send emails through the mail service. Many web
applications use emails for user notification or confirmation of user data.
There is also the possibility for an application to receive emails. When a
mail is sent to the application's address, the mail service performs an HTTP
request to a request handler, forwarding the message to the application.
3. Memcache: The Memcache is a short-lived key-value storage service used
for data that does not need the persistence or transactional features of the
Datastore. It can also be accessed by multiple instances of the application.
The advantage over the Datastore is that it performs much faster, since
the data is stored in memory. As the name indicates, the service is usually
used as a cache for persistent data.
4. Image Manipulation: Web applications often need image transforma-
tions, for example when creating thumbnails. This service allows the ap-
plication to perform simple image manipulations on common file formats.
5. XMPP: An application can send and receive messages from any XMPP-
compatible instant messaging service. Received messages trigger a request
handler, similar to an HTTP request.
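The typical use of Memcache as a cache for persistent data follows a
cache-aside pattern: check the fast in-memory cache first and fall back to
the Datastore only on a miss. The sketch below simulates both stores with
in-memory maps so that it is self-contained; in a real App Engine
application, the cache would be obtained via
MemcacheServiceFactory.getMemcacheService() and the fallback would be a
Datastore read.

```java
import java.util.HashMap;
import java.util.Map;

// Cache-aside lookup: Memcache in front of the Datastore. Both stores are
// simulated with maps here so the sketch runs outside App Engine.
public class CacheAside {
    private final Map<String, byte[]> memcache = new HashMap<>();
    private final Map<String, byte[]> datastore = new HashMap<>();
    private int datastoreReads = 0;

    public void store(String key, byte[] value) {
        datastore.put(key, value);  // the persistent write goes to the Datastore
    }

    public byte[] fetch(String key) {
        byte[] cached = memcache.get(key);
        if (cached != null) {
            return cached;  // cache hit: no slow Datastore round trip
        }
        datastoreReads++;   // cache miss: fall back to the persistent store
        byte[] value = datastore.get(key);
        if (value != null) {
            memcache.put(key, value);  // populate the cache for later requests
        }
        return value;
    }

    public int datastoreReads() {
        return datastoreReads;
    }
}
```

Repeated fetches of the same key then hit only the cache, which is the whole
point of placing Memcache in front of the Datastore.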
0.1.4 The App Engine Development Server
The App Engine SDK includes a development server that simulates the runtime
environment along with all the accessible services, including the Datastore.
As the name states, the development server is intended for development and
debugging purposes; however, it is possible to make the server remotely
accessible. This provides a way to host an App Engine web application on
hardware other than Google's servers. For example, if the free quotas limit
the application and additional hardware is available, one can host the
application on an alternative server.
Though this rarely makes sense for an actual web application, it can be very
useful for scientific computations. The work can be distributed
heterogeneously across several Google App Engine accounts, as well as across
development servers running on additional hardware. Since the development
server behaves the same as the App Engine runtime environment, in principle
it makes no difference where the application is executed.
There are necessarily some differences between the development server and the
App Engine runtime; however, most of them make things easier on the
development server. For example, the quota restrictions do not apply to an
application running on the development server, leaving more freedom in
resource usage. Moreover, the underlying hardware is known, making rough
runtime estimates possible and thus correct scheduling of jobs easier. The
differences between an application running on Google's infrastructure and one
running in the development server are discussed in more detail in section 0.3.
The scalable services are simulated by the development server, in order to pro-
vide the same API to the programmer. For example the Datastore is simulated
using the local filesystem.
0.1.5 Quotas and Limits
App Engine applications can use each resource up to a maximum limit, called
quota. Each type of resource has a quota associated with it. There are two
different types of quotas: billable and fixed [6].
Billable quotas are set by the application administrator in order to prevent
the application from overusing costly resources. A certain amount of each
billable quota is provided to the application for free. In order to use more
than the free resources, billing has to be activated for the application.
With billing activated, the user sets a daily budget for the application,
which is assigned to the desired resources. Application owners are only
charged for the resources the application actually used, and only for the
amount that exceeded the free quotas.
Fixed quotas are set by App Engine in order to ensure stability and integrity
of the server system. These are maximum limits shared by all applications,
preventing any application from consuming too many resources at a time. When
billing is enabled for an application, the fixed quotas increase.
Once the quota for a resource is reached, the resource is considered
depleted. Resources are replenished at the beginning of every day, giving the
application a fresh contingent for the next 24 hours. An exception is the
Datastore quotas, which represent the total amount of storable data and thus
are not replenished. Besides the daily quotas, there are also per-minute
quotas, preventing applications from consuming their resources in a very
short time. Per-minute quotas are again increased for applications with
billing enabled.
Some resources are essential for initiating a request handler; if one of
those is depleted, requests are rejected with an HTTP 403 "Forbidden" status
code. The following resources are necessary for handling a request:
• the number of allowed requests;
• CPU cycles;
• incoming and outgoing bandwidth.
For the rest of the resources, an exception is raised once an application tries to
access a depleted resource. These exceptions can be caught by the application
in order to display appropriate error messages for users.
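This pattern of catching a depleted-resource failure and surfacing a readable error message can be sketched as follows. This is a hypothetical, self-contained illustration: the QuotaDepletedException stand-in and the runGuarded helper are not part of the App Engine API, which signals depletion with its own exception types.

```java
// Hypothetical stand-in: the real App Engine runtime raises its own
// exception type when a resource is depleted; this placeholder keeps
// the example self-contained.
class QuotaDepletedException extends RuntimeException {
    QuotaDepletedException(String resource) { super(resource); }
}

public class QuotaGuard {
    // Runs an action and converts a depleted-quota failure into a
    // user-readable error message instead of an unhandled exception.
    public static String runGuarded(Runnable action) {
        try {
            action.run();
            return "OK";
        } catch (QuotaDepletedException e) {
            return "Resource temporarily unavailable: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runGuarded(() -> {}));
        System.out.println(runGuarded(() -> {
            throw new QuotaDepletedException("Datastore API calls");
        }));
    }
}
```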
In the following, we briefly describe the resources and their corresponding quotas
relevant to this thesis. There are many more quotas besides the ones mentioned
in this section; in particular, every scalable service has its own set of quotas.
The general resources with a quota are:
• Requests: The total number of HTTP requests to the application.
• Outgoing Bandwidth: The total amount of data sent by the application.
This includes data returned by request handlers, data served by the static
file servers, data sent in emails and data sent using the URL Fetch service.
• Incoming Bandwidth: The total amount of data received by the appli-
cation. This includes data sent to the application via HTTP requests as
well as returned data using the URL Fetch service.
• CPU Time: The total time spent processing, including all database opera-
tions. Waiting for other services such as URL Fetch or image processing
does not count. CPU time is reported in seconds, calculated in reference
to a 1.2 GHz Intel x86 processor. This normalization is necessary because
actual CPU cycles may vary greatly due to App Engine internal configu-
rations, such as differing hardware.
Resource              Daily Limit          Maximum Rate
Requests              1,300,000 requests   7,400 requests/minute
Outgoing Bandwidth    1 gigabyte           56 megabytes/minute
Incoming Bandwidth    1 gigabyte           56 megabytes/minute
CPU Time              6.5 CPU-hours        15 CPU-minutes/minute

Table 0.1: Free quotas for general resources (as of 20.09.2010) [6].
Table 0.1 shows the quota limits for the general resources. For scientific compu-
tations the main limitation will be the CPU cycles. In fact, the per-minute quota
limits the application to a maximum computation power of 15 times a 1.2 GHz
Intel processor per minute, and the actual amount of CPU cycles usable for
computation may be even lower. Moreover, a system using App Engine in an
automated way has to implement proper fault tolerance mechanisms, since once
resources are depleted, requests may result in an exception or may even be
rejected in the first place.
The maximum number of requests as well as the corresponding per-minute quota
are not a problem for scientific applications, since splitting a problem into more
than 7,400 requests per minute would create substantial transmission overhead.
Therefore the limiting factor will still be the CPU time long before the number
of requests becomes relevant. Note that these quota limits make sense in the
context of web applications, which are typically optimized for high throughput
and fast response but have no need for large amounts of CPU cycles. An ap-
plication dedicated to scientific computations on the other hand will consume a
lot more CPU time compared to the number of requests.
The bandwidth limits will in most cases not be problematic for the application.
Since data has to be transferred over the Internet, which is a relatively slow
medium, problems that only need small amounts of data transferred are typically
better suited for execution on Google App Engine. Data-intensive problems
would have a high communication overhead and thus are not a preferable class
of problems for execution on Google App Engine.
In the following, the quotas associated with the Datastore are listed:
• Stored Data: The total amount of data stored in the Datastore and its
indexes. There might be considerable overhead when storing entities in
the Datastore: for each entity the id, the ids of its ancestors and its kind
have to be stored. Since the Datastore is schemaless, the name of every
property has to be stored along with its value. Finally, all the index data
has to be stored along with the entities.
• Number of Indexes: The number of different Datastore indexes for an
application, including every index created in the past that has not been
explicitly deleted.
• Datastore API Calls: The total number of calls to the Datastore API,
including retrieving, creating, updating or deleting an entity as well as
posting a query.
• Datastore Queries: The total number of Datastore queries. Some
interface operations, such as "not equal" queries, internally perform
multiple queries; every internal query counts toward this quota.
• Data Sent to API: The amount of data sent to the API. This includes
creating and updating entities as well as data sent with a query.
• Data Received from API: The amount of data received by the Datas-
tore API when querying for entities.
• Datastore CPU time: The CPU time needed for performing database
operations. The Datastore CPU time is calculated in the same way as
for the regular CPU time quota. Note that CPU cycles used for database
operations also count towards the CPU time quota.
Resource                                            Limit
Stored Data                                         1 gigabyte
Maximum entity size                                 1 megabyte
Maximum number of entities in a batch put/delete    500 entities
Maximum size of a datastore API call request        1 megabyte
Maximum size of a datastore API call response       1 megabyte
Number of Indexes                                   100

Table 0.2: General Datastore quotas (as of 20.09.2010) [6].
Tables 0.2 and 0.3 list the general and daily quotas for the Datastore. The
Datastore will be used for data that has to persist between multiple requests
to the application, typically data that is inherent to the algorithm and shared
among all requests. The daily limits will not be problematic
Resource                  Daily Limit          Maximum Rate
Datastore API Calls       10,000,000 calls     57,000 calls/minute
Datastore Queries         10,000,000 queries   57,000 queries/minute
Data Sent to API          12 gigabytes         68 megabytes/minute
Data Received from API    115 gigabytes        659 megabytes/minute
Datastore CPU Time        60 CPU-hours         20 CPU-minutes/minute

Table 0.3: Daily Datastore quotas (as of 20.09.2010) [6].
for a scientific computation, for the same reasons stated for the bandwidth
limitations. Moreover, the Datastore will be cleared after each algorithm run,
thus resetting the general Datastore quotas.
However, the maximum entity size limit of one megabyte is quite a problem.
A normal web application has no need to store large data entities, but rather
stores many small entities optimized for quick retrieval. Scientific applications,
though, often have large amounts of data to operate on. As a consequence, data
beyond one megabyte has to be partitioned in order to fit in the Datastore. A
more detailed discussion of the implications can be found in Section 0.6.
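The partitioning step can be sketched as plain chunking of the serialized payload. The class name, helper method and chunk-size margin below are illustrative assumptions, not code from the actual system:

```java
import java.util.ArrayList;
import java.util.List;

public class DataPartitioner {
    // Illustrative chunk size just under the 1 MB entity limit, leaving
    // some room for entity metadata overhead; the exact margin is an
    // assumption.
    static final int CHUNK_SIZE = 1000 * 1000 - 1024;

    // Splits a serialized payload into chunks that each fit into one
    // Datastore entity; the receiving side concatenates them in order.
    public static List<byte[]> partition(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(data, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[2_500_000];
        List<byte[]> chunks = partition(payload, CHUNK_SIZE);
        System.out.println(chunks.size()); // 3 chunks for a 2.5 MB payload
    }
}
```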
The Distribution Framework
The goal of this thesis is to build a simple framework for utilizing App Engine
servers for parallel scientific computations. The system will mainly be used
to identify properties of parallel algorithms that are well suited for use on the
App Engine environment and subsequently those that are less suited. There-
fore the system should be extensible, to allow easy incorporation of additional
algorithms. In addition, the management of data and the distribution of jobs
should be independent of the actual algorithm used. Finally, the system should provide
an algorithm library that utilizes App Engine for parallelization.
In general the system uses a simple master-slave architecture. The master-
slave model is commonly used in distributed architectures [14]. It consists of
a central master process that controls the distribution of work among several
slave processes. When implementing a system based on the master-slave model,
it should be guaranteed that the master can provide work fast enough to feed
all slaves with sufficient work. When the job size is too small the master might
be too slow to generate enough jobs and can become a bottleneck.
The slave application is implemented as a web application using the Google App
Engine framework. It provides a simple HTTP interface that is invoked pro-
grammatically by the master application. The interface accepts either data
transfer requests or requests containing a computational job. In either case
data is transmitted in the payload of the HTTP request. Parallelism is achieved
by multiple parallel requests to the application. In order to make communica-
tion between master and slaves easier, both applications are written in the Java
programming language.
The master application is a Java program running on the user's machine. It
manages the logic and the distribution of the algorithm. The problem is split
into several work chunks. Each chunk is then submitted as a job to the slave
application, which performs the actual computation. The results of the jobs
are then collected and reassembled to provide the complete result of the algo-
rithm. In addition, the master application has to manage the scheduling of jobs
and data transfers.
In this chapter the architecture of the system and its components are explained
in detail. Furthermore important concepts used in the implementation of the
system will be discussed.
0.2 HTTP Requests
The slave application is in principle an HttpServlet implementing the request
handling method for Hypertext Transfer Protocol (HTTP) POST requests.
Therefore, the communication between the master and its slaves is entirely
based on the HTTP protocol.
In this section the basics of the protocol will be explained, followed by a
description of the HTTP library used by the system.
0.2.1 The HTTP Protocol
HTTP is a stateless application-level networking protocol. The latest version
of the protocol is HTTP/1.1, defined in RFC 2616 [9]. The protocol assumes a
reliable transport layer; therefore, TCP is the most widely used transport
protocol underneath it.
HTTP is mainly used by web browsers to load web pages, but it has numerous
other applications. The protocol follows the request-response message exchange
pattern: a client sends an HTTP request message to a server, which typically
stores content or provides resources. The server replies with an HTTP response
message containing status information and any content requested by the client.
The protocol defines nine request methods indicating the action that should be
performed by the server. The most important methods are:
1. GET: Retrieves whatever information is identified by the request URI.
The information should be returned in the message-body of the HTTP
response message.
2. HEAD: Is identical to the GET method, except that the HTTP response
must not contain a message-body. This method is typically used for re-
trieving metainformation on an entity or testing the validity of hypertext
links.
3. POST: The POST method is used to submit the entity enclosed in
the request message to the server. A POST request might result in the
modification of an existing entity or even in the creation of a new one.
4. OPTIONS: The OPTIONS method is a request for information on the
available communication options for the entity associated with the request
URI.
Servers are required to implement at least the HEAD and GET methods. The
communication between the master and slave application uses the HTTP POST
method.
Every HTTP response message contains a three digit numeric status code, fol-
lowed by a textual description of the status. Status codes are organized in five
general classes indicated by the first digit of the status code:
1. 1xx Informational: Indicates that a request was received and the server
continues to process it. Such a response is only provisional consisting of
the status line and optional headers. One or more such responses might
be returned before a regular response to the request is sent back. Infor-
mational responses are typically used to avoid timeouts.
2. 2xx Successful: Indicates that the server has successfully received, un-
derstood and accepted the corresponding request.
3. 3xx Redirection: Indicates that further action is needed by the client
in order to fulfill the request.
4. 4xx Client Error: Indicates that the client seems to have caused an
error. The server should include an entity containing a closer description
of the error in the response message.
5. 5xx Server Error: Indicates that the server is unable to process a seem-
ingly valid request. Again, an entity containing a closer description of the
error should be included in the response.
A client has to recognize at least these five classes of response status codes and
react accordingly.
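Since the class of a status code is determined by its first digit, a client can dispatch on the integer division of the code by 100. A minimal sketch, with illustrative class and method names:

```java
public class StatusClassifier {
    // Maps a three-digit HTTP status code to its general class,
    // as defined in RFC 2616, section 6.1.1.
    public static String classify(int status) {
        switch (status / 100) {
            case 1: return "Informational";
            case 2: return "Successful";
            case 3: return "Redirection";
            case 4: return "Client Error";
            case 5: return "Server Error";
            default: throw new IllegalArgumentException("Invalid status: " + status);
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(200)); // Successful
        System.out.println(classify(403)); // Client Error
    }
}
```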
0.2.2 Apache HTTP Components
The Apache HTTP Components library [?] provides an easy-to-use API for ap-
plications making use of the HTTP protocol. Moreover, the library is open
source, which makes required adaptations of the code possible. The master
application uses HTTP Components client 4.0.1, the latest stable version as of
this writing, for building the HTTP requests necessary for invoking the slave
application.
Listing 4 shows a code sample performing an HTTP request to "www.test.at"
using the functionality of the Apache HTTP Components library, as used by
the master application:
Listing 4: Code example for performing a simple HTTP request.

HttpResponse response = null;
HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost("http://www.test.at:80");

HttpEntity entity =
    new SerializableEntity(new Integer(5), true);
post.setEntity(entity);

post.setHeader("type", "integer");

response = client.execute(post);
System.out.println(response.getStatusLine());

client.getConnectionManager().shutdown();
The main class necessary for initiating an HTTP communication is the
HttpClient. Its most basic functionality is to execute HTTP methods.
Execution of a HTTP method consists of one or several HTTP request-response
exchanges. The DefaultHttpClient is the standard implementation of the
HttpClient.
The user provides a request object to the HttpClient for execution. The Http-
Client supports all methods defined in the HTTP 1.1 specification. The library
provides a separate class for every method. In the example, an HttpPost is
used for the request. Every method is initiated with a URI defining the target
of the request. A URI consists of the protocol in use, the host name, an optional
port and a resource path. In this case the URI is "http://www.test.at:80": the
protocol used is HTTP, the target host is www.test.at and the port used is 80.
HTTP request and response messages can optionally have content entities as-
sociated with them. Requests carrying an entity are referred to as entity en-
closing requests. The specification defines two entity enclosing methods namely
POST and PUT. The SerializableEntity class allows constructing an entity
containing a serializable Java object. In the example, an entity containing an
Integer object is created and attached to the POST method. The second param-
eter of the SerializableEntity constructor determines whether the object will
be buffered.
A message can contain one or multiple message headers describing properties of
the message, such as content type and content encoding. The example attaches
a header to the POST method with the name ”type” and the value ”integer”.
Message headers help a server to decide how to handle a request.
An HTTP response is the message sent back by the server as reply to a request,
implemented by the HttpResponse class. The first line of an HTTP response is
the status line, containing the protocol version followed by a numeric status
code and its textual description. In the example, the status line is retrieved
by calling the getStatusLine() method of the HttpResponse and printed to
the standard output. A successful request will usually print HTTP/1.1 200 OK,
indicating that protocol version 1.1 was used and the request was successful.
0.2.3 Entity Compression
The HTTP Components library provides a wide variety of functionality, though
there is no convenient way to compress entities attached to an HTTP request.
Therefore, we slightly modified the SerializableEntity into an entity called
CompressedSerializableEntity, providing compression of the contained seri-
alized object. The modified version works basically the same way as the original,
except that a compression filter stream is put between the ObjectOutputStream
and the ByteArrayOutputStream that actually writes the serialized object into
the buffer.
The information whether entity compression is enabled is stored in the header
of the HTTP request, so the slave application can decompress entities prior to
their usage.
Java provides three different compression filter streams in the JRE standard
library: ZIP, GZIP and raw deflate compression [18]. The ZipOutputStream
is an output stream filter for writing files in the ZIP file format; the
GZIPOutputStream is the equivalent for the GZIP file format. The
DeflaterOutputStream generates a stream compressing the data using the
deflate algorithm.
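The filter-stream sandwich described above can be sketched with the standard java.io and java.util.zip classes. The class and method names below are illustrative stand-ins, not the system's actual CompressedSerializableEntity:

```java
import java.io.*;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class EntityCompression {
    // Serializes an object through a deflate filter stream: the
    // compression stream sits between the ObjectOutputStream and the
    // byte buffer that would back the HTTP entity.
    public static byte[] serializeCompressed(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DeflaterOutputStream deflate =
                new DeflaterOutputStream(buffer, new Deflater(Deflater.BEST_SPEED));
        try (ObjectOutputStream out = new ObjectOutputStream(deflate)) {
            out.writeObject(obj);
        } // closing the ObjectOutputStream finishes the deflater
        return buffer.toByteArray();
    }

    // Counterpart the slave side would apply before handling the payload.
    public static Object deserializeCompressed(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new InflaterInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] payload = new int[100_000]; // highly compressible: all zeros
        byte[] compressed = serializeCompressed(payload);
        int[] restored = (int[]) deserializeCompressed(compressed);
        System.out.println(compressed.length < 400_000 && restored.length == payload.length);
    }
}
```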
An alternative compression library is unfortunately not an option, since the App
Engine runtime environment does not allow additional libraries besides the JRE
standard library. The slave application would therefore be incapable of
decompressing the payload if an alternative compression library were used.

[Figure: plot of compression time (milliseconds) versus array size for the zip,
gzip and deflate compression streams]
Figure 0.2: Compression time needed for the different compression streams.
In order to determine the best suited compression, we tested the different com-
pression streams in terms of compression efficiency and required runtime. For
this experiment, a one-dimensional integer array filled with random numbers
was used as raw data. For the compression efficiency, random numbers in the
range from 0 to 1,000 and numbers in the range from 0 to 10,000 were tested.
Using a smaller range of random numbers increases compression efficiency, since
there are fewer possible values that have to be encoded. The tests were performed
on a system with an Intel Core 2 Duo CPU at 2.4 GHz and 4 gigabytes of RAM.
Figure 0.2 shows a comparison of the compression times needed by the three
different streams. The times are almost identical, especially those of ZIP and
GZIP. Deflate, however, performs best throughout all tested data sizes. In
addition, for all three algorithms the runtime grows linearly with increasing
data size.
In terms of compression the streams performed equally well, though deflate-
compressed data was generally slightly smaller. The reason deflate performs
slightly better in both execution time and compression efficiency is the overhead
required for the ZIP and GZIP file format information. The streams most likely
even use the same internal compression routine.

[Figure: plot of compressed data size (bytes) versus integer array size,
comparing deflate with values in the range 0–1,000, deflate with values in the
range 0–10,000, and uncompressed data]
Figure 0.3: Data size for an integer array filled with random numbers.
Figure 0.3 shows the compression efficiency of the deflate algorithm (ZIP and
GZIP were omitted because their graphs overlap). As expected, the test using
random numbers in the range from 0 to 1,000 (green line) performs better than
the second test using numbers from 0 to 10,000 (red line).
0.3 Slave Types
A slave is a server, reachable by a distinct network address, executing an instance
of the slave web application. The slave application is actually intended to be
deployed to and executed on Google's App Engine servers. In order to provide
more flexibility, and a way to compare the performance of the App Engine server
infrastructure to known hardware, we included the possibility to address local
servers executing an instance of the slave application using the App Engine
development server. As already described, the development server is just a slim
web server simulating the App Engine API locally. Consequently, deployed
slaves will be referred to as App Engine slaves, and slaves running on machines
using the development server will be referred to as local slaves. In the following,
some basic considerations for each slave type are discussed, followed by a
concrete comparison of their properties.
0.3.1 App Engine Slaves
An App Engine slave is an instance of our slave application executed in the run-
time environment of the App Engine servers. In principle, one instance of the
slave application would be enough, since parallelism is achieved by sending mul-
tiple parallel HTTP requests. However, there are various restrictions imposed
on an App Engine application in terms of resource usage. In order to circum-
vent these restrictions, it is necessary to enable the system to distribute the work
among multiple instances of the slave application.
One App Engine account allows the user to create and deploy up to ten different
applications, each of them having a separate URI as well as separate resource
quota restrictions. Usually these are meant to be different applications, though
it is possible to deploy a single application multiple times. In terms of a web
application this would not make any sense, since each of the instances would be
reachable through a different address. For a scientific application that needs as
many resources as possible, though, it is a useful way to get additional resources.
As a consequence, the master application has to be able to distribute tasks to
different instances of the slave application, each reachable through a different
network address.
0.3.2 Local Slaves
As already mentioned instances of an App Engine application can also be exe-
cuted by the development server. Besides minor differences the instances behave
the same way as those deployed to Google’s infrastructure. There are even some
restrictions present for deployed applications that do not apply to an application
running on a development server. As a consequence, local computers executing
an instance of the slave application using the development server can be incor-
porated as additional computing resources. For example, an algorithm could be
distributed to a couple of deployed instances as well as to some instances running
on local cluster nodes using the development server.
The local nodes are typically machines with multiple CPU cores. So the concept
of sending multiple parallel HTTP requests to one instance in order to achieve
parallelism applies here as well. The development server handles each request in
a separate thread, thus automatically distributing the load on the available cores.
In principle it does not make a difference to the master application whether it
sends requests to a deployed instance, where the App Engine frontend manages
load balancing of parallel requests, or to an instance running on the development
server, where every request is handled by a separate thread. To the master
application it is only relevant how many requests an instance can handle in
parallel, which will be referred to as its queue. Further discussion on the impact
of the queue size is provided in Section ??.
Besides making the distributed system more flexible, the use of the development
server provides a way to compare the performance of the App Engine framework
to regular hardware. In addition, changes in the distributed system that might
have an impact on the runtime of algorithms can be tested more reliably on local
slaves, since measurements on Google’s App Engine infrastructure are oftentimes
biased due to background load on the servers or the network.
0.3.3 Comparison of Slave Types
In terms of interface and general behavior, both types of slaves are equivalent,
though there are still some differences that have to be considered:
1. Latency/Bandwidth: Generally, the network connection to a local slave
will be better, since it typically resides in the same local network as the
master application. App Engine slaves, on the other hand, are always ac-
cessed over the Internet and thus have a much slower connection. Moreover,
latency and bandwidth may often vary due to background load. A closer
analysis of App Engine's network behavior is provided in Section 5.
2. Hardware: For local slaves the underlying hardware is known. As a
result, rough calculation time estimates can be made and heuristics for
scheduling jobs can be applied. For App Engine slaves, on the contrary,
the underlying hardware is neither known nor are there any guarantees in
that respect. Multiple requests can be executed on completely different
hardware, even if the requests happen one after another within a small
time frame.
3. Reliability/Accessibility: A local slave's reliability depends on the proper
administration of the machine the instance is running on. Granting access
to the application by opening the corresponding ports is an administrative
issue as well. App Engine slaves, on the other hand, need no admin-
istration at all. Applications running on Google App Engine are highly
reliable and are accessible from everywhere over the Internet.
4. Restrictions: App Engine slaves have various restrictions; for example, a
request handler has to terminate within 30 seconds, otherwise the request
fails. Furthermore, the total quotas as well as the per-minute quotas are
limiting for App Engine slaves. For a local slave all these restrictions do
not apply, leaving the programmer more flexibility.
5. Services: The scalable services provided by the runtime environment
are only simulated by the development server and thus may differ in their
behavior. However, the slave application will only use the Datastore which
is provided sufficiently by the development server.
In conclusion, App Engine slaves are in general more difficult to handle program-
matically, because of the various unknown variables and the strict restrictions
of the runtime environment. This also means that incorporating support for
local slaves into the system does not require many adjustments in the code.
Table 0.4 shows a quick overview of the differences between local and App
Engine slaves.
Feature                      Local Slave                          App Engine Slave
Latency/Bandwidth            fast local network                   Internet
Hardware                     known; runtime estimates possible    completely unknown; may vary
Reliability/Accessibility    administration needed                highly reliable and accessible
Restrictions                 most restrictions do not apply       very restrictive (see quotas)
Services                     only simulated                       provided by Google infrastructure

Table 0.4: Properties overview of App Engine and local slaves.
0.4 The Master Application
The master application is a program written in Java that automatically invokes
the web interface provided by the slave application. It is responsible for the
generation and distribution of the parallel tasks, as well as for collecting and
assembling the partial results into a complete solution. Moreover it is responsible
for mapping jobs efficiently to the given slaves. Another important requirement
is a good fault tolerance mechanism, since requests may fail for various reasons.
In the following, the architecture of the master application will be described
starting with a general overview of the architecture, followed by a more detailed
description of the individual components and their responsibilities.
0.4.1 Architecture
Figure 0.4: Master application architecture.
Figure 0.4 shows the main components of the master application and their de-
pendencies. The main entry point to the system is the DistributionEngine
class. A client using the system instantiates the DistributionEngine, hand-
ing it an implementation of the JobFactory representing the parallel problem.
Furthermore, the DistributionEngine needs a list of URIs of reachable slave
instances. For every slave, a HostConnector is instantiated, managing the actual
connection to it and providing high-level control to the DistributionEngine.
The HostConnector associated with a slave instance is responsible for sup-
plying it with data and jobs. Each HostConnector has a reference to the
JobFactory and directly requests jobs from it and posts results of finished jobs.
For the actual HTTP connection, multiple threads have to be used for man-
aging the parallel data and job requests. For that purpose, a HostConnector
uses JobTransferThreads for every WorkJob it submits. The HostConnector
implements the ResultListener interface providing a callback method for the
JobTransferThreads to deliver the Result of finished jobs. These results are
then forwarded to the JobFactory which is responsible for assembling the final
result. The TransferThreads contain the code for building the actual HTTP
request matching the interface of the slave application.
Since the HostConnectors are responsible for supplying the slaves with tasks,
they implicitly determine the mapping of jobs. A closer description of the map-
ping strategy is provided in Section 0.4.3. The HostConnectors are also re-
sponsible for handling failed requests and possibly failed slave instances. Fault
tolerance mechanisms are discussed more closely in Section 0.4.4.
Some algorithms such as matrix multiplication require additional data to
be transferred besides the data associated with WorkJobs. For such algo-
rithms HostConnectors additionally manage a reference to a DataManager
that is responsible for transferring data to the slaves. The DataManager
itself uses DataTransferThreads which are just a slightly modified version of
JobTransferThreads. Data is generally transferred prior to the distribution of
jobs. Data transfers are split into multiple parallel HTTP requests. Section 0.6
provides a detailed description of the shared data management concept and the
underlying data transfer strategies.
0.4.2 Generating Jobs
A substantial part of the master's work is to correctly split the problem into
smaller work items that can be wrapped into WorkJobs. The system uses the concept
of a JobFactory, which is an interface that has to be implemented for a con-
crete parallel algorithm similar to the WorkJob interface. A class implementing
the interface carries the logic for generating appropriate WorkJobs that can be
submitted to the slave applications. The JobFactory is also responsible for
reassembling the partial results of the WorkJobs in order to produce a final
Result. For applications using shared data, the JobFactory class also provides
the serialized data that has to be sent separately.
Listing 5: The JobFactory interface.

public interface JobFactory {

    public WorkJob getWorkJob();
    public int remainingJobs();
    public void submitResult(Result r);
    public Result getEndResult();

    // only for applications using shared data
    public boolean useSharedData();
    public Serializable getSharedData();

}
The JobFactory manages a list of WorkJobs that have to be completed to solve
the algorithm with the given parameters. The getWorkJob() method returns
the next WorkJob ready for submission and null if there are no more jobs left to
execute. The remainingJobs() method returns the number of remaining jobs
that still have to be executed. This information is necessary for load balancing
purposes. For example, if there are three jobs left and five idle slaves available
the slaves with the fastest expected execution should be chosen first.
Results of completed WorkJobs are submitted to the JobFactory via the
submitResult() method. The class is responsible for assembling all the partial
results to a final result. WorkJobs and their corresponding Results must have
the same identifier, in order to allow proper assembly of the end result. Once
all the results are submitted the getEndResult() method provides the result
of the algorithm.
The useSharedData() method indicates whether the algorithm uses shared data
management. In case shared data is used, the serialized data can be requested
through the getSharedData() method. Shared data management is described
in more detail in Section 0.6.
Using the concept of a JobFactory, WorkJobs and Results, the logic of a
specific algorithm is decoupled from the rest of the system. For integrating an
additional algorithm in the system, a programmer simply needs to implement
the interface. In Section 0.7.1 concrete examples for implementing algorithms
in the system will be discussed.
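To make the interplay of these types concrete, the following self-contained sketch drives a hypothetical JobFactory sequentially on one machine. SumFactory, the nested WorkJob/Result types and the summing task are illustrative assumptions, not code from the framework; a real master would hand each WorkJob to a slave instead of calling run() locally.

```java
// Illustrative sketch: a sequential driver draining a JobFactory.
// All names below are stand-ins, not classes from the thesis framework.
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.Queue;

public class FactoryDriver {

    interface Result extends Serializable { int getId(); int getValue(); }
    interface WorkJob extends Serializable { int getId(); Result run(); }

    /** Minimal JobFactory mirroring the interface from Listing 5. */
    interface JobFactory {
        WorkJob getWorkJob();          // null when no jobs are left
        int remainingJobs();
        void submitResult(Result r);
        Result getEndResult();
    }

    /** Splits summing 1..n into fixed-size chunks, adds up partial sums. */
    static class SumFactory implements JobFactory {
        private final Queue<WorkJob> jobs = new ArrayDeque<>();
        private int total = 0;

        SumFactory(int n, int chunk) {
            int id = 0;
            for (int lo = 1; lo <= n; lo += chunk) {
                final int from = lo, to = Math.min(n, lo + chunk - 1);
                final int jobId = id++;
                jobs.add(new WorkJob() {
                    public int getId() { return jobId; }
                    public Result run() {
                        int s = 0;
                        for (int i = from; i <= to; i++) s += i;
                        final int sum = s;
                        return new Result() {
                            public int getId() { return jobId; }
                            public int getValue() { return sum; }
                        };
                    }
                });
            }
        }
        public WorkJob getWorkJob() { return jobs.poll(); }
        public int remainingJobs() { return jobs.size(); }
        public void submitResult(Result r) { total += r.getValue(); }
        public Result getEndResult() {
            return new Result() {
                public int getId() { return -1; }
                public int getValue() { return total; }
            };
        }
    }

    public static void main(String[] args) {
        JobFactory f = new SumFactory(100, 10);
        WorkJob j;
        while ((j = f.getWorkJob()) != null) {
            f.submitResult(j.run());  // a real master would send j to a slave
        }
        System.out.println(f.getEndResult().getValue()); // 5050
    }
}
```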
0.4.3 Job Mapping
Mapping in parallel computing represents the procedure of assigning the tasks of
a parallel computation to the available processors [14]. The main goal when
mapping tasks is to minimize the global execution time of the computation.
This is usually achieved by minimizing the overhead of the parallel execution.
Typical sources of overhead are communication and processors staying idle due
to an insufficient supply of work.
Communication overhead is minimized by avoiding unnecessary communication
between processors or machines. Avoiding idle processors requires good load
balancing, which means that the work should be distributed equally among
processors.
Mapping strategies can be roughly classified into two categories: static and
dynamic.
1. Static Mapping: Static mapping strategies decide the mapping of tasks
to the available processors before executing the program. Providing a
good mapping in advance is a complex and computationally expensive
task; in fact, finding an optimal mapping is NP-complete. Knowledge of
task size, data size, the underlying hardware and even the system
implementation is crucial. However, for most practical applications there
exist heuristics that produce fairly good static mappings.
2. Dynamic Mapping: Dynamic mapping strategies distribute the tasks
dynamically during execution of the algorithm. When there is insufficient
knowledge on the environment, static mappings can cause load imbalances.
In such cases, dynamic mapping techniques often yield better results.
As described earlier, parallelism is achieved by sending multiple tasks in parallel
to the web application, which then handles those requests in parallel. In case of
an App Engine slave, the requests might be handled in parallel on one machine
or on several different machines. In case of a local slave, the requests are handled
in separate threads, thus using multiple available CPU cores. The mapping to
cores therefore depends on when and how many requests are sent in parallel to
each slave managed by the system.
The basic mapping approach of our system is similar to a dynamic work pool.
A distributed system using a work pool approach typically has a central node
managing the parallel tasks and the computation nodes request tasks from the
work pool for computation. More advanced implementations sort the parallel
tasks in the work pool using a priority queue. Such an approach has the advan-
tage that work is given only to nodes that actually have free resources. Moreover,
if there are sufficiently many tasks of roughly the same size, almost no load
imbalances are to be expected even in a heterogeneous environment. Faster nodes
will automatically request more work, since they finish tasks faster, and slower
nodes will not be flooded with work they cannot handle.
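The priority-queue variant mentioned above can be sketched as follows; Task and the size-based ordering are illustrative assumptions rather than types from the framework.

```java
// Illustrative work pool ordered as a priority queue: larger tasks are
// handed out first, so a slow node is less likely to receive a big task
// late in the run. Task is a hypothetical stand-in type.
import java.util.PriorityQueue;

public class WorkPool {
    static class Task {
        final int id, size;
        Task(int id, int size) { this.id = id; this.size = size; }
    }

    // Largest task first.
    private final PriorityQueue<Task> pool =
        new PriorityQueue<>((a, b) -> Integer.compare(b.size, a.size));

    void add(Task t) { pool.add(t); }

    /** Called by (or on behalf of) a node with free resources. */
    Task request() { return pool.poll(); }   // null when the pool is empty

    /** A failed task is simply put back for reassignment. */
    void putBack(Task t) { pool.add(t); }

    public static void main(String[] args) {
        WorkPool p = new WorkPool();
        p.add(new Task(1, 10));
        p.add(new Task(2, 40));
        p.add(new Task(3, 25));
        System.out.println(p.request().id); // 2 (the largest task)
    }
}
```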
The mapping strategy of the system is inspired by the work pool approach.
The work pool is implemented by the JobFactory, which provides WorkJobs
on demand as well as the possibility to put back WorkJobs for reassignment. A
closer description of the JobFactory is provided in Section 0.4.2.
Because tasks are pushed to the nodes by HTTP requests the computation nodes
are not able to request additional work by themselves. Therefore every slave has
a HostConnector associated with it, managing the job retrieval and the posting
of results for the particular slave instance. Every slave has a queue size
indicating the number of parallel requests it can handle. Initially, the
HostConnector retrieves as many jobs as the queue size indicates and sends
them to the corresponding slave. Every time a job is finished, the
HostConnector retrieves a new job and sends it to the slave.
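The refill behavior just described can be sketched as follows. The integer job ids, the in-memory pool and the sent list (standing in for HTTP requests) are assumptions for illustration; the real HostConnector issues asynchronous HTTP requests instead.

```java
// Sketch of the pipelining idea behind a HostConnector: keep at most
// queueSize jobs in flight and refill whenever one completes.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ConnectorSketch {
    private final int queueSize;
    private final Queue<Integer> jobs;            // stand-in for the work pool
    final List<Integer> sent = new ArrayList<>(); // stand-in for HTTP sends
    private int inFlight = 0;

    ConnectorSketch(int queueSize, Queue<Integer> jobs) {
        this.queueSize = queueSize;
        this.jobs = jobs;
    }

    /** Fill the slave's request queue up to its capacity. */
    void fill() {
        while (inFlight < queueSize && !jobs.isEmpty()) {
            sent.add(jobs.poll());  // a real connector issues an HTTP request
            inFlight++;
        }
    }

    /** Called when the result of one job has arrived. */
    void onJobFinished() {
        inFlight--;
        fill();                     // immediately follow up with a new job
    }

    public static void main(String[] args) {
        Queue<Integer> pool = new ArrayDeque<>(List.of(1, 2, 3, 4, 5));
        ConnectorSketch c = new ConnectorSketch(2, pool);
        c.fill();                   // jobs 1 and 2 are now in flight
        c.onJobFinished();          // job 3 is sent as a replacement
        System.out.println(c.sent); // [1, 2, 3]
    }
}
```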
Different slave instances have different optimal queue sizes. For example, a slave
running on a machine with a larger number of CPU cores can handle more
requests in parallel than one running on a machine with only one core. For local
slaves the optimal queue size is usually the number of available processor cores.
For App Engine slaves it is a little more difficult to find an appropriate queue
size, since there is no information on the hardware available. Besides, the ex-
ecution time might be influenced by various factors such as other applications
sharing the same server and thus causing background load. The fact that subse-
quent requests might be handled on completely different hardware is problematic
as well. However, if the problem can be partitioned into jobs of similar size
and only App Engine slaves are used, the best choice is typically to distribute
the whole problem evenly right at the start of the algorithm: if the load is too
high, the excess requests get aborted and can be rescheduled.
0.4.4 Fault Tolerance
Fault tolerance in the context of a distributed system means guaranteeing that
every parallel task is executed and results are collected correctly in order to make
completion of the algorithm possible. Recoverable problems such as single slave
instances going offline should be recognized in a timely manner and handled
accordingly. If there are problems the system cannot recover from, for example
a complete loss of network connectivity, the system should persist its state in
order to make continuation of execution at a later time possible.
Retransmission of Requests
HTTP requests to the slave application may fail at any time for various reasons,
thus a mechanism for correctly handling failed requests forms an important
part of the system. Guaranteeing execution of the task associated with a request
requires either resending it until the request is performed correctly, or detaching
the task and putting it back into the work pool.
The best action for recovering from an error often depends on the cause of the
problem. First of all, requests may get lost due to an unreliable network; here
the best reaction is to resend the request as soon as possible. Another reason
can be a busy slave instance that is temporarily not able to handle additional
requests or a depleted resource. The best reaction in this case is to resend the
request as soon as the slave is able to receive additional requests. In some cases
a task cannot be executed correctly by one slave, while others might be able
to execute it without a problem. For example, a long task assigned to an App
Engine slave that repeatedly exceeds the runtime limitations could be executed
by a local slave without a problem.
Resending of requests is implemented in the DataTransferThread and the
JobTransferThread itself, in order to avoid creating a new thread every time
an HTTP request gets lost. Threads resend failed requests a configurable
number of times using an exponential backoff mechanism: a TransferThread
initially waits a small amount of time before resending a request, doubling the
wait time for every further attempt. This technique avoids flooding the network
with unnecessary HTTP requests that would be discarded anyway. Once the
maximum number of retries is reached, a TransferThread sets its state
to failed. The HostConnector regularly checks for failed JobTransferThreads
and tasks associated to a failed JobTransferThread are detached and put back
into the work pool in order to make reassignment to a different slave possible.
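A minimal sketch of this backoff scheme, with an initial wait and a sendOnce() method standing in for one HTTP attempt (both illustrative assumptions, not the thesis code):

```java
// Retry a request a bounded number of times, doubling the wait between
// attempts; the caller marks the transfer as failed if false is returned.
public class BackoffRetry {
    interface Request { boolean sendOnce(); }  // true on success

    /** Returns true if the request eventually succeeded. */
    static boolean sendWithBackoff(Request r, int maxRetries, long initialWaitMs)
            throws InterruptedException {
        long wait = initialWaitMs;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (r.sendOnce()) return true;
            if (attempt < maxRetries) {
                Thread.sleep(wait);
                wait *= 2;    // double the wait for every further attempt
            }
        }
        return false;         // caller sets the TransferThread state to failed
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        Request flaky = () -> ++calls[0] >= 3;
        System.out.println(sendWithBackoff(flaky, 5, 1)); // true
        System.out.println(calls[0]);                     // 3
    }
}
```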
Unlike single jobs, data has to be transferred to the exact slave it is intended
for, in order to make execution of the algorithm possible. As a consequence,
once a DataTransferThread fails, the corresponding slave has to be removed
from the list of available slaves and its associated HostConnector has to be
deactivated. Therefore, DataTransferThreads typically have a higher retry
count than JobTransferThreads, in order to avoid accidentally removing an
active slave.
Handling Offline Slave Instances
The HostConnector regularly checks for failed TransferThreads and once it
discovers a large number of failed requests, it suspends job transmission in order
to check the slave's availability. Ping requests are sent to check whether the
slave is still online; a ping request should cause the slave application to return
immediately with an empty response. If a ping request succeeds, or the result
of a previous job arrives successfully, job transmission is resumed.
However, if a certain number of ping requests fail the HostConnector assumes
its slave has gone offline, puts back all active tasks into the work pool and
deactivates itself. Optionally, the availability of slaves can be checked prior to
execution in order to avoid starting to send requests to inactive slaves.
Handling Loss of Connectivity
Once an unrecoverable fault is detected, such as a complete loss of connectivity,
the Distribution System tries to persist its state in order to continue execution
at a later point in time. This behavior is especially desirable for long running
algorithms, where an unrecoverable error would mean the complete loss of all
the already finished computation. The state of the problem is implicitly given
by the JobFactory class, which manages the open WorkJobs as well as the al-
ready computed partial Results. The framework provides the possibility to
make a JobFactory Serializable and to additionally implement the interface
Persistable.
Listing 6: The Persistable interface.
public interface Persistable {
    public void saveState();
    public void loadState();
}
Listing 6 shows the Persistable interface providing the methods saveState()
and loadState(). If a JobFactory implementation additionally implements
this interface, the system puts back all unfinished tasks into the work pool
and calls the saveState() method once an unrecoverable error is detected.
This provides the possibility to load the state of the JobFactory at a later
point in time and continue execution of the algorithm.
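One way a Serializable JobFactory might realize the Persistable idea is plain Java object serialization. The sketch below is an assumption, not the thesis implementation; in particular, saveState() here takes a target file, which Listing 6 leaves implicit.

```java
// Illustrative persistence of a factory's open jobs and partial results
// via Java object serialization; field types are simplified stand-ins.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class PersistableFactory implements Serializable {
    private static final long serialVersionUID = 1L;
    final List<Integer> openJobs = new ArrayList<>();
    final List<Integer> partialResults = new ArrayList<>();

    /** Persistable-style saveState, writing the whole factory to a file. */
    public void saveState(File f) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(this);  // open jobs and partial results included
        }
    }

    /** Counterpart of loadState, restoring the factory from a file. */
    public static PersistableFactory loadState(File f)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(f))) {
            return (PersistableFactory) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        PersistableFactory fac = new PersistableFactory();
        fac.openJobs.add(7);
        fac.partialResults.add(42);
        File f = File.createTempFile("factory", ".state");
        fac.saveState(f);
        PersistableFactory restored = loadState(f);
        System.out.println(restored.openJobs + " " + restored.partialResults);
        f.delete();
    }
}
```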
0.5 The Slave Application
The slave application is a web application written in Java using the Google App
Engine framework. As previously discussed, instances of the slave application
can either be App Engine slaves or local slaves. Basically, the slave application
does nothing more than receive small pieces of work, execute them and send
back the results to the master.
Figure 0.5: Activity diagram illustrating the control flow of the slave application.
Figure 0.5 shows a UML activity diagram visualizing the general control flow of
the slave application. The entry point of the slave application is an HTTP request
to the servlet's POST method. First of all, it has to be checked whether the entity
in the message body of the HTTP request is compressed and if so, the payload
has to be decompressed prior to further usage. The next step is to determine the
type of the request. There are different types of requests: job, data, clear and
ping. Each request type has to be treated differently. The corresponding meta
information of a request is stored in the form of message headers in the HTTP
request (see Section 0.5.3).
The most important request type is the job request. Such a request contains a
parallel task intended to be executed by the slave application. Once a job request
is identified, the job itself has to be extracted from the entity by deserializing
the data to a WorkJob object. The next step is to determine whether the job
needs shared data and if so, it has to be retrieved from the Datastore prior to
execution. After that, the WorkJob is executed by calling its run() method. The
result of the computation is then again stored in serialized form in the HTTP
response. If result compression is enabled the serialized object is additionally
compressed.
Data requests are used to transfer shared data that is used by all the jobs and
thus only has to be transferred once to each slave instance. A closer description
of the shared data management concept is provided in Section 0.6. Once a data
request is identified, the raw data is extracted and stored in the Datastore using
a wrapper data entity.
A clear request causes the slave application to delete the entire content of the
Datastore, in order to erase all saved state. A clear request is typically sent
after a successful or failed run of an algorithm in order to prepare the slave for
subsequent runs of the algorithm.
Ping requests are used to determine whether a slave instance is still online.
Once the slave application identifies a ping request, it immediately returns an
empty response. A closer description of the fault tolerance mechanisms is provided in
Section 0.4.4.
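The dispatch logic of Figure 0.5 can be sketched without the servlet API as a plain function over the headers of Section 0.5.3; the handler strings and the decompression stub below are placeholders, not framework code.

```java
// Simplified, framework-free sketch of the request dispatch in Figure 0.5.
// Header names follow Section 0.5.3; the handler bodies are placeholders.
import java.util.Map;

public class SlaveDispatch {
    /** Returns a short description of the action taken, for illustration. */
    static String handle(Map<String, String> headers, byte[] payload) {
        if ("enabled".equals(headers.get("compression"))) {
            payload = decompress(payload);  // placeholder for decompression
        }
        switch (headers.getOrDefault("type", "")) {
            case "job":   return "execute WorkJob, return serialized Result";
            case "data":  return "store shared data chunk in the Datastore";
            case "clear": return "delete all Datastore contents";
            case "ping":  return "return immediately with an empty response";
            default:      return "unknown request type";
        }
    }

    private static byte[] decompress(byte[] data) { return data; } // stub

    public static void main(String[] args) {
        System.out.println(handle(Map.of("type", "ping"), new byte[0]));
    }
}
```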
0.5.1 WorkJobs
A WorkJob is a piece of work that can be received and executed by the slave
application. It contains the algorithmic logic as well as the data needed for
execution. WorkJob itself is an abstract class defining the necessary methods
expected by the system:
Listing 7: The abstract WorkJob class.
public abstract class WorkJob implements Serializable {
    private int id;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public abstract Result run();

    // only needed for algorithms with shared data
    public abstract void fetchSharedData();
}
Every algorithm needs a specific implementation of a WorkJob that extends this
abstract class. WorkJobs always have to be serializable, since they are transferred
in serialized form.
The core of a WorkJob is the run() method which contains the algorithmic
logic of the job. The return value is of the type Result, which is again a generic
abstract class that needs to be extended when implementing a result class for a
specific algorithm.
How data needed for the algorithm is managed is left to the programmer im-
plementing the specific WorkJob. However, the class should only contain data
specific to the job. Transferring data that is used by multiple jobs within the
class would lead to redundant data transfers. Data shared by multiple jobs can
be sent separately and should be retrieved by invoking the fetchSharedData()
method. The concept of shared data management is described in more detail in
Section 0.6.
Every WorkJob has a unique identifier that has to be the same as the identifier
of the corresponding Result. This allows the master to correctly map WorkJobs
to their Results, which is necessary for assembling the solution of the algorithm.
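As a toy illustration of this pattern (not taken from the thesis), a WorkJob subclass that squares a number might look as follows; the nested abstract classes mirror Listings 7 and 8.

```java
// Hypothetical SquareJob/SquareResult pair following the WorkJob/Result
// pattern; only the id handling mirrors the framework's contract.
import java.io.Serializable;

public class SquareExample {
    static abstract class Result implements Serializable {
        private int id;
        int getId() { return id; }
        void setId(int id) { this.id = id; }
    }

    static abstract class WorkJob implements Serializable {
        private int id;
        int getId() { return id; }
        void setId(int id) { this.id = id; }
        abstract Result run();
    }

    static class SquareResult extends Result {
        final int value;
        SquareResult(int value) { this.value = value; }
    }

    static class SquareJob extends WorkJob {
        private final int x;             // job-specific data only
        SquareJob(int x) { this.x = x; }
        Result run() {
            SquareResult r = new SquareResult(x * x);
            r.setId(getId());            // Result must carry the same id
            return r;
        }
    }

    public static void main(String[] args) {
        SquareJob job = new SquareJob(6);
        job.setId(3);
        SquareResult r = (SquareResult) job.run();
        System.out.println(r.getId() + " " + r.value); // 3 36
    }
}
```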
0.5.2 Results
The slave application returns a serialized instance of the class Result wrapped
in the HTTP response. Result is an abstract class that every algorithm-specific
result implementation must extend:
Listing 8: The abstract Result class.
public abstract class Result implements Serializable {
    private int id;
    private long calculationTime;

    public long getCalculationTime() { return calculationTime; }
    public void setCalculationTime(long calculationTime) {
        this.calculationTime = calculationTime;
    }
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
}
The class only defines the id used for relating the result to the corresponding
WorkJob and a field storing the execution time of the run() method. The actual
data types for returning the results must be defined in the algorithm specific
implementation.
In the field calculationTime the execution time needed for the run() method
is stored. This value represents the time spent doing useful computations and
is used to determine the ratio between parallelization overhead and the actual
computation time.
Results can optionally be returned in compressed form, in order to reduce
the amount of data to be transferred. The master application indicates that it
expects a compressed result in the HTTP message header (see Section 0.5.3).
0.5.3 Message Headers
HTTP requests contain header fields used for transferring meta-information,
such as which encoding is accepted by a browser. A header field has a name and a
corresponding value, which is usually a string. Besides the standard header
fields, self-defined custom headers can be used for transferring information. The
slave application uses these header fields to decide how to treat requests.
In the following the parameters used by the slave application are listed:
• type: The type field indicates the kind of request transferred, how it has
to be handled by the application and the kind of data contained in the
payload.
– job: A job request is a computational task to be executed by
the slave. The payload of the request contains the corresponding
WorkJob.
– data: A data request serves as a means to transfer data to the
application. The payload contains shared data to be stored in the
Datastore.
– clear: A clear request causes the slave to clear all stored data. It
contains no data in the payload. Such a request is typically sent after
all jobs have finished to reset the application.
– ping: A ping request is used to determine whether a slave is reachable.
The application should respond immediately with an empty response.
– retrieve: A retrieve request causes the application to read the contents
of the Datastore and send them back in the response. This request type
is only used for debugging purposes.
• compression: The compression field indicates whether the payload of
the request is compressed and therefore has to be decompressed prior to
usage.
– enabled: The enabled flag indicates that request compression is en-
abled.
– disabled: The disabled flag indicates that request compression is
disabled.
• resultCompression: The resultCompression field indicates whether the
result should be compressed before returning it.
– enabled: The enabled flag indicates that result compression is en-
abled.
– disabled: The disabled flag indicates that result compression is dis-
abled.
• sharedData: The sharedData field indicates whether the corresponding
WorkJob uses shared data and therefore whether shared data has to be
retrieved prior to execution of the job.
– true: The true flag causes the slave to invoke the fetchSharedData()
method prior to the run() method.
– false: On the contrary, the false flag causes the slave to invoke the
run() method immediately without retrieving further data.
• benchmark: The benchmark field indicates whether database operations
should be performed when sending data requests. The field is used for
deactivating database operations, in order to more precisely measure
transfer times.
– true: The true flag causes the slave to discard the transferred data
and return immediately.
– false: The false flag causes the slave to store the enclosed data in
the Datastore.
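On the master side, these headers might be set on a POST request roughly as follows; the slave URL is a placeholder, and the sketch stops before a request body is written, so no network traffic actually occurs.

```java
// Sketch of setting the custom headers of Section 0.5.3 on a job request.
// openConnection() does not connect; the connection is only prepared.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderExample {
    static HttpURLConnection prepareJobRequest(URL slave) throws IOException {
        HttpURLConnection con = (HttpURLConnection) slave.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("type", "job");            // a job request
        con.setRequestProperty("compression", "enabled"); // payload compressed
        con.setRequestProperty("resultCompression", "enabled");
        con.setRequestProperty("sharedData", "false");
        // The serialized WorkJob would then be written to
        // con.getOutputStream(), which is the point where the
        // connection would actually be opened.
        return con;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection con =
            prepareJobRequest(new URL("http://example.appspot.com/slave"));
        System.out.println(con.getRequestProperty("type")); // job
    }
}
```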
0.6 Shared Data Management
The naive approach for transferring data from the master to its slaves is to
simply attach all the necessary data to each job. However, various parallel
algorithms have shared data that has to be accessed by all of the jobs. The fact
that a single slave will compute multiple jobs results in unnecessary redundant
data transfer from the master to the slaves. Especially if communication takes
place over relatively slow networks such as the Internet, this can result in a
major bottleneck. A common example would be a parallel implementation of
the matrix multiplication algorithm, where one matrix is shared and has to
be accessed by each job. So in principle the shared matrix only has to be
communicated once to each slave.
For this reason we introduced data requests besides regular job requests in the
system. Using shared data is optional though, since not every algorithm needs
shared data, and in some cases the overhead of sending data multiple times
might be acceptable for other reasons. The parallel computation process is thus
split into two phases: first, the shared data is transferred to each slave and
stored; second, the jobs with the actual computation tasks are distributed among
the slave instances.
As described in Section 2 the runtime environment of the App Engine framework
restricts any access to the file system. Consequently we had to use the Datastore
service in order to store shared data in a way that all the jobs executed on one
slave can access the data. The development server simulates the Datastore by
storing data in a single file in the file system, since it is usually used for testing
purposes only. This may seem inefficient, yet the jobs do not have to query for
data but usually need the whole shared data stored. Therefore, it is still more
efficient to read the data from the file system than to communicate it over a slow
network. For an application running on Google's servers it is not guaranteed
that every request will be executed on the same hardware (though if possible it
is preferred). However, the Datastore service manages proper access to the data
for every request, thus such an application can logically be treated as a single
slave. From a performance viewpoint, the Datastore service will again in most
cases be better than plainly sending the data multiple times.
0.6.1 Data Splitting
We chose to split data into multiple chunks for data transfer for two reasons.
First of all, the Datastore service allows a single storage entity to have a max-
imum size of one megabyte. Besides, HTTP has limitations on how much data
one request is allowed to carry in its payload as well. In order to avoid these
limitations, the shared data has to be split and stored in several parts. Once a
job needs to access the data, it simply reads all the chunks and reassembles them.
The second major advantage is that by splitting data, multiple TCP streams
are used for transmission, which can often improve transfer speed notably. Es-
pecially in our case, where data transfers typically last only a couple of seconds,
multiple streams help hide the effects of the TCP startup phase [13].
Of course, the overhead produced by transferring an additional HTTP header
for each separate data chunk has to be considered as well. So a good ratio
between the total data to be transferred and the data transferred in a single
chunk has to be found. An experimental analysis of the splitting factor is
provided in Section 0.11.2.
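The chunking described above can be sketched as follows; the demo uses a tiny chunk size in place of the one-megabyte Datastore limit, and the method names are illustrative.

```java
// Minimal sketch of splitting shared data into chunks below a size limit
// and reassembling them on the slave side.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DataSplitter {
    static List<byte[]> split(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            chunks.add(Arrays.copyOfRange(data, off,
                    Math.min(data.length, off + chunkSize)));
        }
        return chunks;
    }

    static byte[] reassemble(List<byte[]> chunks) {
        int total = chunks.stream().mapToInt(c -> c.length).sum();
        byte[] data = new byte[total];
        int off = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, data, off, c.length);
            off += c.length;
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] shared = new byte[10];
        for (int i = 0; i < shared.length; i++) shared[i] = (byte) i;
        List<byte[]> chunks = split(shared, 3);  // chunks of 3+3+3+1 bytes
        System.out.println(chunks.size());       // 4
        System.out.println(Arrays.equals(shared, reassemble(chunks))); // true
    }
}
```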
0.6.2 Data Transfer Strategy
An important consideration when transferring data to multiple slaves is whether
to transfer the data to all of them in parallel or sequentially, one after another.
If there are separate independent network links to the slaves, the parallel
approach is clearly the best strategy.
However, this is typically not the case in a practical scenario and outgoing
bandwidth to the hosts is shared. Over the Internet the most likely network
bottleneck is the upload bandwidth of the master. In a local network the nodes
are usually interconnected by a switch or a router. In a typical homogeneous
network topology like in Figure 0.6, the bandwidth bottleneck is already the link
to the switch.
In a heterogeneous network topology, where the master has a considerably faster
connection to the switch like in Figure 0.7, data transfers won't affect each other;
thus sending data in parallel is clearly the best option.
Figure 0.6: Typical homogeneous local network topology.
Assuming however that the network bandwidth to the slaves is shared, like in
Figure 0.6, it is better to send the data to the slaves one after another. A slave
needs the complete shared data in order to start the computation, so partial
data is not useful to a slave. Assuming a topology like in the homogeneous
example, with gigabit links and 1000 Mbit of data to be transferred to each of
the three hosts, it takes three seconds to broadcast the data to all slaves. When
transferring the data in parallel, every slave can start computing only after these
three seconds.
However, if data is sent separately to the slaves it takes one second for each
transfer, since the whole bandwidth can be used. So after one second the data
transfer to the first slave is finished and a job can be assigned, so the first slave
can start its computation. Consequently it takes another second to transfer the
data to the second slave, which can start the computation after a total of two
seconds. After three seconds, the third slave can start computing as well. In
total, we gain three slave-seconds of computation time (two for the first slave
and one for the second) compared to transferring the data in parallel, by
enabling the first two slaves to start their computations earlier. As a result, in
a scenario where the bandwidth to all the slaves is shared, a sequential data
transfer strategy should be preferred over a parallel data transfer.
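The timing argument above can be checked with a small model; the bandwidth figures follow the worked example, and the model ignores protocol overhead.

```java
// Worked check of the example: 1000 Mbit to each of three slaves over a
// shared 1000 Mbit/s uplink, parallel broadcast vs. sequential transfer.
public class TransferModel {
    /** Seconds until any slave can start under a parallel broadcast. */
    static double parallelStart(int slaves, double dataMbit, double linkMbit) {
        // Everyone shares the link, so all wait for the full broadcast.
        return slaves * dataMbit / linkMbit;
    }

    /** Seconds until slave i (0-based) can start, sequential transfer. */
    static double sequentialStart(int i, double dataMbit, double linkMbit) {
        return (i + 1) * dataMbit / linkMbit;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.printf("slave %d: parallel %.0fs, sequential %.0fs%n",
                i, parallelStart(3, 1000, 1000),
                sequentialStart(i, 1000, 1000));
        }
        // slave 0: parallel 3s, sequential 1s
        // slave 1: parallel 3s, sequential 2s
        // slave 2: parallel 3s, sequential 3s
    }
}
```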
Figure 0.7: Heterogeneous local network topology, with superior master node.

When transferring data sequentially, the next consideration is in which order to
transfer the data to the slaves. In general, slaves to which a faster network link
is available and slaves that have more computing power are preferable. The
data is transferred faster to slaves with more bandwidth available thus enabling
them to begin their computation earlier. On the other hand the system benefits
more from faster slaves beginning their computation early. So if in the previous
example one of the three slave nodes is notably faster than the other two, it
is better to transfer the data first to the fast node. In practice most of these
variables are unknown or have to be configured by hand.
0.6.3 Performance Evaluation
In the following a performance evaluation of the described techniques for shared
data management is presented. We tested the matrix multiplication algorithm
in the computer rooms (Intel Core 2 Duo CPU with 3 GHz and 2 GB RAM)
with four slave nodes and on the karwendel cluster (see Section 0.10.4 for
hardware details) with two slave nodes, using no shared data management as
well as shared data management with a parallel and a sequential data transfer
strategy. For the experiments, square matrices with different dimensions were
tested.
Figure 0.8 shows the results for the experiment performed in the computer
rooms, using four slave nodes. As expected there is a huge performance gain
using shared data management, because shared data, in this case the first matrix
of the multiplication, is transferred only once to each slave instead of alongside
with every job. For larger matrix dimensions the performance gain becomes
even more relevant, since more data has to be transferred. The sequential data
transfer strategy is slightly faster than the parallel one throughout all matrix
dimensions. Since the nodes in the computer rooms are organized in a
homogeneous local network, these results were to be expected.
Figure 0.8: Completion time for square matrix multiplication using no shared
data management, shared data management with parallel and sequential data
transfer (x-axis: matrix dimension, 500-1500; y-axis: completion time in
milliseconds).
Figure 0.9 shows the results of the experiment on two karwendel cluster nodes.
For a hardware specification of the karwendel cluster see section 0.10.4. At first
sight, the much lower performance gain when using shared data management is
noticeable. The cluster nodes on karwendel are interconnected by an InfiniBand
network, which is considerably faster than the gigabit network of the computer
rooms, thus diminishing the impact of transferring redundant data. The benefit
of using a sequential data transfer strategy is also less noticeable.
Figure 0.9: Completion time for square matrix multiplication using no shared
data management, shared data management with parallel and sequential data
transfer (x-axis: matrix dimension, 1000-2000; y-axis: completion time in
milliseconds).
Algorithms
In the prior section, the functionality of the Distribution Framework was de-
scribed in detail focusing on the job management aspects of the system.
This section describes some sample parallel algorithms implemented using the
Distribution Framework. The algorithms were used for analyzing the system
in respect to its scalability. Moreover, they were used to identify algorithm
properties that are well suited for distribution on Google App Engine.
The description for each algorithm is structured in a general description of the
algorithm, the idea used for parallelization, and the concrete implementation.
This section therefore also provides a documentation on how to integrate a par-
allel algorithm into the system. The concrete implementation of the algorithms
is documented by code extracts of the WorkJob, Result and JobFactory
implementation. However, the code is reduced to the parts relevant to the
algorithm, omitting for example methods such as getters and setters or security
checks for parameters.
0.7 Monte Carlo Routines
Monte Carlo routines are a class of algorithms heavily based on random number
generation, whose results are obtained by repeated random sampling [17]. Monte
Carlo algorithms are usually used for problems where applying a determinis-
tic algorithm would be computationally unfeasible. Typical applications are
simulations of physical and mathematical systems.
0.7.1 Pi Approximation
Pi is a mathematical constant stating the ratio between a circle's circumference
and its diameter, which can be approximated through a simple Monte Carlo
simulation [17]. The algorithm can be parallelized efficiently without much
effort. The general idea of the algorithm is to inscribe a circle into a square. By
generating uniformly distributed random points within the square and counting
how many of them lie within the circle, one can approximate Pi.
Algorithm
Figure 0.10: Illustration of the Monte Carlo Pi calculation.
Figure 0.10 illustrates the principle of the algorithm. The example uses a square
with a side length of 1, into which a circle with a radius of 0.5 is inscribed.
Random points are generated within the square; points falling within the circle
are marked red, whereas points outside of the circle are black. The expected
fraction of points within the circle is π/4, so by counting how many of the
random points lie within the circle, Pi can be approximated. Assuming P
random points were generated and M points reside within the circle (red), Pi
can be approximated by applying the formula: π ≈ 4 · M/P.
The algorithm relies on the fact that once enough points have been generated,
the points will be equally distributed on the square. Since the algorithm is based
on random numbers, the accuracy of the computation after a fixed number of
iterations can only be stated with a certain error probability, though the law of
large numbers states that the accuracy generally increases with a larger number
of generated points.
Parallelization
Listing 9 shows a pseudo code illustration of the Pi Approximation's main loop.
The loop iterates from zero to P, which is the total number of random points
that are generated. First, two random numbers representing the x and y
components of a point are generated. These are private variables and therefore
do not have any dependencies; in fact, the distance() function could be called
directly with two random values without first storing them in variables. The
distance() function calculates the distance of the generated point to the center
of the circle. The function is reentrant, since it has no global variables, does not
modify any arguments and has no side effects. If the distance is smaller than
the radius of the circle, the point lies within the circle and the counter variable
M is incremented by one. As long as the increment operation is atomic, it does
not matter in which order the variable is incremented, thus making the
iterations of the loop independent.
Listing 9: Pseudo code of the Pi Approximation algorithm.
for i = 0 to P do
    x = random()
    y = random()
    dist = distance(x, y)
    if dist < R do
        M = M + 1
    end
end
In a Master-Slave system with n parallel machines available, each machine is
assigned a number of points Pn to compute. The machines can independently
generate Pn points and count the number of points residing within the circle,
resulting in Mn points within the circle per machine. The master node then
collects the results of every slave, sums up the numbers of generated points Pn
and the numbers of points within the circle Mn, and finally computes Pi
according to the same formula as in the sequential algorithm. The code for the
slave nodes is the same as for the sequential version of the algorithm, with the
slight modification that the number of points within the circle is returned
instead of immediately calculating Pi.
Beyond its simplicity, the algorithm has some properties which make it very
well suited for distribution. First of all, there is almost no data that has to
be transferred between the master and its slaves. The parameters as well as the
results are single integer numbers. This is very beneficial for a system where
data transmission is relatively slow. Moreover, there is no need for managing
shared data since the parameters of each slave are independent.
Furthermore, the individual size of jobs can be chosen freely by adjusting the
number of points to calculate. Jobs also do not need to have the same size
and, as a consequence, good load balancing is easily achievable.
Implementation
Listing 10 shows a simplified implementation of the JobFactory interface. In
lines 3-6 the necessary fields are initialized. P is the total number of points
that have to be generated, M stores the number of generated points that fell
within the circle, numJobs represents the number of parallel jobs that should be
generated and remainingJobs is a counter variable for the remaining jobs.
The algorithm is initialized by a constructor in lines 8-12, with the desired
number of points to generate and the number of parallel jobs. The getWorkJob()
method initializes a PiJob (line 18) and sets its fraction of points to generate
(line 19), as well as its identifier (line 20). The points to generate are equally
distributed among the jobs: for each job, the total number of points is divided by
the number of parallel tasks. For the sake of simplicity, we omitted the distribu-
tion of remaining points. As a consequence, if the number of points in this code
sample is not divisible without remainder by the number of parallel tasks,
slightly fewer points are generated. Finally the counter variable for remaining
jobs is decremented. The submitResult method in lines 29-32 simply adds
the points of each PiResult to the total number of points within the circle.
Listing 10: Implementation of the JobFactory for the Monte Carlo Pi
algorithm.
1 public class PiApproximation implements JobFactory{
2
3 private int P = 0; // number of points
4 private int M = 0; // number of points in circle
5 private int numJobs = 0;
6 private int remainingJobs = 0;
7
8 public PiApproximation(int P, int numJobs){
9 this.P = P;
10 this.numJobs = numJobs;
11 this.remainingJobs = numJobs;
12 }
13
14 public synchronized WorkJob getWorkJob () {
15 if(remainingJobs < 1)
16 return null;
17
18 PiJob j = new PiJob ();
19 j.setP(P/numJobs);
20 j.setId(numJobs -remainingJobs);
21 remainingJobs --;
22 return j;
23 }
24
25 public int remainingJobs () {
26 return remainingJobs;
27 }
28
29 public synchronized void submitResult(Result r) {
30 PiResult pires = (PiResult) r;
31 M += pires.getM();
32 }
33 }
Listing 11 shows the concrete implementation of the algorithm's WorkJob, called
PiJob. The field P (line 3) again defines the number of points to be generated by
the concrete job, and M (line 4) the portion of these points that fell within the
circle. The variables x and y represent the x and y values of the random points.
From line 9 to 12 the circle and its center are defined, assuming a circle with a
radius of 0.5 and consequently a square with side length 1. The center of the
circle is placed at the point (0.5, 0.5). The dist variable (line 13) is used for
storing the Euclidean distance of the current random point from the center of
the circle.
The run() method (lines 15-38) performs the actual computation. First of all
the random number generator is initialized (lines 16-17). The main loop then
computes random points within the square by generating a random x value and
a random y value between 0 and 1 for each point (lines 20-22). In the next step
the point's distance from the center of the circle is calculated, which is given
by the Euclidean distance between the random point (x, y) and the center point
(0.5, 0.5) (lines 24-27). Now it can easily be determined whether the point lies
within the circle by checking if the distance from the center is smaller than
the radius of the circle (lines 29-31). Finally the total number of points
within the circle M is stored in the Result and returned (lines 34-37).
Listing 11: Implementation of the PiJob class.
1 public class PiJob extends WorkJob{
2
3 private int P = 0; // number of points
4 private int M = 0; // number of points in circle
5
6 private double x = 0;
7 private double y = 0;
8
9 // define circle R = 0.5; diameter of square is 1
10 private double R = 0.5;
11 private double center_x = 0.5;
12 private double center_y = 0.5;
13 private double dist = 0;
14
15 public Result run() {
16 Random r =
17 new Random(System.currentTimeMillis ());
18
19 for(int i = 0; i < P; i++){
20 // generate Random Point
21 x = r.nextDouble ();
22 y = r.nextDouble ();
23
24 // Euclidean distance from center
25 dist = Math.sqrt (((x - center_x)*
26 (x - center_x)) + ((y - center_y)*
27 (y - center_y)));
28
29 if(dist < R){
30 M++;
31 }
32 }
33
34 PiResult res = new PiResult ();
35 res.setM(M);
36
37 return res;
38 }
39 }
Listing 12 shows the implementation of the algorithm's result class used for
returning the results of the computation. Each job calculates the number of
points that were generated within the circle, so the result of each job is the
integer value M (line 2).
Listing 12: Implementation of the PiResult class.
1 public class PiResult extends Result {
2 private int M;
3
4 public int getM() {
5 return M;
6 }
7 public void setM(int m) {
8 this.M = m;
9 }
10 }
0.7.2 Integration
Monte Carlo based integration is a form of numerical integration based on ran-
dom numbers. The same idea as for the Pi algorithm is applied: by sampling
random points in the interval of the function and measuring how many of the
random points lie under the function, the integral can be approximated [17].
Algorithm
The simplest form of integration based on random numbers follows the same
idea as the Monte Carlo Pi approximation. Figure 0.11 illustrates the basic idea
of the algorithm. The function f1 has to be integrated within the boundaries a
and b. First of all, a bounding rectangle is defined, having the width of the
integration interval and the height of the largest function value of f1 within
the boundaries. In the figure the bounding rectangle is indicated by the dotted
lines.
The next step is to generate random points within the bounding rectangle and
measure how many of the points reside between the function and the x-axis
(red) and how many lie above the function (black). Assuming a total of N
points are generated, M points are under the function and the bounding rectangle
has the area A, we can approximate the integral by the following formula:
∫_a^b f1(x) dx ≈ M/N ∗A
Figure 0.11: Illustration of the Monte Carlo integration.
However, a more efficient way to approximate an integral is to sample random
function parameters within the boundaries and evaluate the function with the
given parameters. By summing up the evaluated function values, dividing the
sum by the number of samples and multiplying by the width of the integration
interval, the integral of the function can be approximated as well. Assuming we
generate N random function parameters xn within the boundaries, the integral
can be approximated by the following formula:
∫_a^b f1(x) dx ≈ (b− a)/N ∗ ∑_{n=1}^{N} f1(xn)
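This mean-value estimator can be sketched in a few lines of self-contained Java. The class name, test function, interval and seed are illustrative choices, not part of the thesis framework; note that the average of the sampled function values must be scaled by the interval width (b − a) to yield the integral:

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Mean-value Monte Carlo integration: average N random samples of f
// over [a, b] and scale by the interval width (b - a).
public class MonteCarloIntegrate {
    public static double integrate(DoubleUnaryOperator f, double a, double b,
                                   int N, long seed) {
        Random rnd = new Random(seed);
        double sum = 0;
        for (int i = 0; i < N; i++) {
            double x = a + rnd.nextDouble() * (b - a); // uniform sample in [a, b)
            sum += f.applyAsDouble(x);
        }
        return (b - a) * sum / N; // average function value times interval width
    }

    public static void main(String[] args) {
        // The integral of x^2 over [0, 1] is 1/3; the estimate converges to it.
        System.out.println(integrate(x -> x * x, 0, 1, 1_000_000, 7L));
    }
}
```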
Parallelization
Listing 13 shows a short pseudo code illustration of the algorithm's main loop.
The loop iterates from zero to N, which is the total number of random function
parameters that are sampled. First of all a random function parameter x has
to be sampled. x again is a private variable and thus produces no dependencies
among iterations. The function f() evaluates the function value for the given
parameter and is obviously reentrant. Finally the function value is added to the
current sum. Similar to the Pi Approximation algorithm, it does not matter in
which order the values are summed up as long as the add operation is atomic.
Listing 13: Pseudo code of Monte Carlo Integrations main loop.
1 for i = 0 to N do
2 x = random ()
3 value = f(x)
4 sum = sum + value
5 end
The problem can be split by distributing the total number of random function
parameters among several parallel tasks. Every task consists of the same al-
gorithm, but with a smaller portion of points to sample, resulting in lower
precision per task. In the end, the master process collects all the results and
averages them, thus gaining the same precision as a sequential run with the
total number of points.
The algorithm has the same strengths as the Monte Carlo Pi approximation
when it comes to parallelization. There is very little data to transfer for the
parameters and for the results. Jobs can be almost freely sized with no need for
shared data.
Implementation
Listing 14 illustrates the implementation of the JobFactory interface for the
Monte Carlo Integration algorithm. The algorithm is initialized by its pub-
lic constructor (lines 16 to 24). The parameters include the function that is
integrated, the number of function parameters N to generate, the integration
boundaries and again the number of parallel jobs. The parameters are stored in
their corresponding private fields defined at the beginning of the class (lines 3 to
11). In addition, a list called results (line 12) holding the value of each returned
result as well as a field called area (line 14) storing the final result are defined.
The getWorkJob method (lines 26 to 35) is very similar to the Pi approximation's
implementation. The method initializes an IntegrationJob with the needed
parameters and the number of function parameters the job has to sample (line
30). The sampled points are again equally distributed among the jobs
by simply dividing the total number of points by the number of parallel tasks.
The identifier is set, the counter for remainingJobs is decremented and the
generated job is returned (lines 32 to 34).
The submitResult method (lines 37 to 47) stores the result of each partial area
computation in the results list (line 38). Once the last result has arrived, the
distinct areas are averaged in order to obtain the final result (lines 41 to 46).
Listing 14: Implementation of the Monte Carlo Integration JobFactory.
1 public class MonteCarloIntegration implements JobFactory{
2
3 // function to integrate
4 private IntegrationFunction f;
5 // iterations
6 private int N;
7 // integration boundaries
8 private float x1;
9 private float x2;
10 private int numJobs;
11 private int remainingJobs;
12 private ArrayList <Double > results =
13 new ArrayList <Double >();
14 private double area;
15
16 public MonteCarloIntegration(IntegrationFunction f
17 , int N, float x1 , float x2 , int numJobs) {
18 this.f = f;
19 this.N = N;
20 this.x1 = x1;
21 this.x2 = x2;
22 this.numJobs = numJobs;
23 this.remainingJobs = numJobs;
24 }
25
26 public synchronized WorkJob getWorkJob () {
27 if(remainingJobs < 1)
28 return null;
29
30 IntegrationJob j =
31 new IntegrationJob(f, N/numJobs , x1 , x2);
32 j.setId(numJobs -remainingJobs);
33 remainingJobs --;
34 return j;
35 }
36
37 public synchronized void submitResult(Result r) {
38 results.add ((( IntegrationResult)r).
39 getArea ());
40
41 if(remainingJobs < 1){
42 for(double d : results){
43 area += d;
44 }
45 area = area/results.size();
46 }
47 }
48 }
Listing 15 shows the implementation of the IntegrationJob class. The first part
of the class again consists of the fields (lines 2 to 8), initialized by the public
constructor (lines 10 to 16).
The run() method (lines 18 to 31) is responsible for the computation. First of all
the random number generator is initialized (lines 19-20), as well as a variable
x holding the function parameter (line 21) and a variable sum holding the sum
of the evaluated function values (line 22). The core of the function is the main
loop, which generates random function parameters within the integration
boundaries and sums up the evaluated results (lines 24-27). IntegrationFunction
is an interface providing the method f(), which has to be implemented so that it
evaluates the function value for a given parameter. Finally, according to the
formula, the summed up function values are divided by the number of generated
points and scaled by the width of the integration interval in order to obtain the
approximated integral (line 29), which is returned using the IntegrationResult
class (line 30).
Listing 15: Implementation of the IntegrationJob class.
1 public class IntegrationJob implements WorkJob{
2 // function to integrate
3 private IntegrationFunction f;
4 // iterations
5 private int N;
6 // integration boundaries
7 private float x1;
8 private float x2;
9
10 public IntegrationJob(IntegrationFunction f,
11 int N, float x1 , float x2) {
12 this.f = f;
13 this.N = N;
14 this.x1 = x1;
15 this.x2 = x2;
16 }
17
18 public Result run() {
19 Random r =
20 new Random(System.currentTimeMillis ());
21 double x = 0;
22 double sum = 0;
23
24 for(int i = 0; i < N; i++) {
25 x = x1 + r.nextDouble ()*(x2 - x1);
26 sum+= f.f(x);
27 }
28
29 double area = (x2 - x1)*sum/N;
30 return new IntegrationResult(area);
31 }
32 }
Listing 16 shows the result class for the Monte Carlo Integration. Again the
desired data type, in this case a double, is wrapped in a class that is used for
transferring the result back to the master application.
Listing 16: Implementation of the IntegrationResult class.
1 public class IntegrationResult extends Result{
2 double area;
3 public IntegrationResult(double area) {
4 this.area = area;
5 }
6 public double getArea () {
7 return area;
8 }
9 public void setArea(double area) {
10 this.area = area;
11 }
12 }
0.8 Matrix Multiplication
Multiplying two matrices is a common task in mathematics and thus algorithmic
approaches for the problem are well studied. Besides being an important
problem itself, matrix multiplication is equivalent to various other problems, such
as transitive closure and reduction, solving linear systems, and matrix inver-
sion [23]. Moreover it is commonly used in computer graphics for computing
coordinate transformations. As a consequence, a faster matrix multiplication
algorithm makes all algorithms that are based on matrix multiplication faster.
Because of the nature of matrix operations the algorithm is well suited for
parallelization and is probably one of the major textbook examples for parallel
computing. First the sequential matrix multiplication will be described
briefly, followed by an explanation of the parallel approach; finally the
implementation using the Distribution Framework will be presented.
0.8.1 Algorithm
The most common form of matrix multiplication is the ordinary matrix product.
It is defined between two matrices A and B. The matrices can only be multiplied
if the width of A equals the height of B. The resulting matrix C has the height of
matrix A and the width of matrix B. So multiplying matrices with dimensions
m×n and n×p results in an m×p matrix. Moreover, the ordinary matrix product
is not commutative: multiplying two n × n matrices A by B will generally
not yield the same result as multiplying B by A.
The entry Ci,j of the result matrix is defined as follows:
For A ∈ M_{m×n}, B ∈ M_{n×p} and C ∈ M_{m×p}:
C_{i,j} = (AB)_{i,j} = ∑_{k=1}^{n} A_{i,k} ∗B_{k,j}
Calculating all the entries for i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p yields the
result matrix C.
The naive algorithm strictly follows the mathematical definition:
Listing 17: Naive algorithm for matrix multiplication.
1 for i=1 to m do
2 for j=1 to p do
3 C[i,j] = 0
4 for k=1 to n do
5 C[i,j] = C[i,j] + A[i,k]*B[k,j]
6 end
7 end
8 end
The naive approach results in a complexity of O(mnp). There exist
asymptotically faster algorithms, such as Strassen's algorithm, which are based
on reducing the number of multiplications needed when multiplying 2× 2
matrices. Besides being much more complex to implement, such an algorithm
needs relatively large matrices to outperform the naive algorithm.
For all measurements in this thesis the naive matrix multiplication algorithm
was used.
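The naive triple loop translates directly into Java. The following is an illustrative sketch, not the thesis framework code; the class name and the small worked example are made up for demonstration:

```java
// Naive O(m*n*p) matrix multiplication following the mathematical definition.
public class NaiveMatMul {
    public static int[][] multiply(int[][] A, int[][] B) {
        int m = A.length, n = B.length, p = B[0].length;
        int[][] C = new int[m][p];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < p; j++) {
                int sum = 0;
                for (int k = 0; k < n; k++) {
                    sum += A[i][k] * B[k][j]; // C[i,j] = sum over k of A[i,k]*B[k,j]
                }
                C[i][j] = sum;
            }
        }
        return C;
    }

    public static void main(String[] args) {
        int[][] A = {{1, 2}, {3, 4}};
        int[][] B = {{5, 6}, {7, 8}};
        // {{1*5+2*7, 1*6+2*8}, {3*5+4*7, 3*6+4*8}} = {{19, 22}, {43, 50}}
        System.out.println(java.util.Arrays.deepToString(multiply(A, B)));
    }
}
```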
0.8.2 Parallelization
Looking at the main loop of the naive matrix multiplication depicted in listing
17, it is visible that the two outermost loops carry no dependencies among iter-
ations: every iteration writes a different entry of the result matrix C. So matrix
multiplication is fully parallelizable by partitioning the data and calculating
the rows or columns of the result matrix independently.
In order to understand the concept for data partitioning in the parallel algorithm
one has to understand the data dependencies of the matrices. For example when
multiplying two 5× 5 matrices the entry C2,3 is given by:
C2,3 = A2,1B1,3 +A2,2B2,3 +A2,3B3,3 +A2,4B4,3 +A2,5B5,3
! " #
Figure 0.12: Data dependencies for entry C2,3 of the result matrix.
Figure 0.12 shows the data dependencies for calculating entry C2,3. Basically,
for each entry of C a row of matrix A and a column of matrix B is needed. So
for calculating the entries of a whole row of C, the entire matrix B and the
corresponding row of matrix A are needed. The calculation of each entry of the
result matrix is independent and can be performed in parallel.
Typically every slave gets assigned a certain number of rows of the matrix C it
has to compute. The data needed for calculating rows x to y is the whole matrix
B and rows x to y of matrix A.
Figure 0.13 shows an example of how data is partitioned in the parallel matrix
multiplication algorithm. In the example the multiplication of two 3 × 3 matrices
is distributed to 3 processors. Each processor is responsible for calculating one
row of the result matrix C. Matrix B is needed by all processors and is therefore
broadcast. Every processor receives one row of matrix A, which is sufficient to
compute the corresponding row of matrix C.
Figure 0.13: Example of data partitioning in parallel matrix multiplication.
The processors can independently compute their share of the result matrix. The
multiplication algorithm itself does not change in the parallel version. Instead
of multiplying two 3× 3 matrices, each processor multiplies a 1× 3 vector (Ai)
with a 3 × 3 matrix (B), resulting in one row of the result matrix (Ci). Finally
the rows are collected and assembled in order to provide the complete result
matrix C. Of course, in a practical application every processor will have more
than one row assigned to compute.
0.8.3 Implementation
Using row-wise or column-wise array iteration can have a huge impact on caching
and performance in Java, as in most other programming languages. In Java it is
far more efficient to iterate row-wise than column-wise. The main loop of matrix
multiplication can be rearranged in any of 3! = 6 ways to achieve the same
result. Each permutation has different memory access patterns and therefore
might perform differently depending on processor and memory architecture.
Testing the different index orderings showed that the runtime may vary by a
factor of up to ten.
Our implementation uses a pure row oriented main loop similar to the one sug-
gested in [15]. The pure row oriented version of the algorithm generally showed
the best runtime results compared to the other index orderings.
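To illustrate the reordering, the sketch below contrasts the textbook ijk ordering with the row-oriented ikj ordering, which walks B and C strictly row by row; both produce identical results. The class and method names are illustrative, and actual speedups depend on matrix size and cache architecture:

```java
import java.util.Arrays;
import java.util.Random;

// Two index orderings for C = A * B over int matrices.
public class LoopOrder {
    // Textbook ijk ordering: the innermost loop walks a column of B.
    static int[][] ijk(int[][] A, int[][] B) {
        int n = A.length, l = B.length, m = B[0].length;
        int[][] C = new int[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int k = 0; k < l; k++)
                    C[i][j] += A[i][k] * B[k][j]; // column-wise access of B
        return C;
    }

    // Row-oriented ikj ordering: the innermost loop walks rows of B and C.
    static int[][] ikj(int[][] A, int[][] B) {
        int n = A.length, l = B.length, m = B[0].length;
        int[][] C = new int[n][m];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < l; k++) {
                int aik = A[i][k];
                for (int j = 0; j < m; j++)
                    C[i][j] += aik * B[k][j]; // row-wise access of B and C
            }
        return C;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[][] A = new int[50][50], B = new int[50][50];
        for (int[] row : A) for (int j = 0; j < row.length; j++) row[j] = rnd.nextInt(10);
        for (int[] row : B) for (int j = 0; j < row.length; j++) row[j] = rnd.nextInt(10);
        System.out.println(Arrays.deepEquals(ijk(A, B), ikj(A, B))); // both orderings agree
    }
}
```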
Listing 18 shows a simplified version of the matrix multiplication JobFactory.
First of all there are the three matrices A, B, C and their respective dimensions
K, L, M (lines 2-7). Moreover there is a variable current_id storing the identifier
that will be used for the next job and a variable current_row storing the starting
row of the next job. These are followed by the number of rows that will be
assigned to each task, jobSize, and the total number of parallel tasks, numJobs.
startIndexMap stores the first row of each job in order to map partial results
correctly to the result array C.
The public constructor (lines 16-29) takes the dimensions of the matrices as
well as the number of parallel tasks as parameters. It is assumed that the matrix
dimension is divisible by the number of tasks. The fields are initialized
accordingly, the three matrices with the given dimensions are allocated and
matrices A and B are filled with random numbers. The fillRandom() method is
assumed to initialize the matrices randomly.
The getWorkJob() method (lines 31-43) generates WorkJobs responsible for
computing a subset of rows of the result matrix C, which are later reassembled
into the complete result matrix. First of all a MatrixJob is allocated, followed by
setting the portion of matrix A it is responsible for. This is done by generating a
matrix containing a subset of rows of matrix A. The matrix is obtained using the
getRows method of the ArrayUtil class, which returns a new array containing
the desired subset of rows of the parameter array. The first parameter is the
original array, the second parameter defines the first row that is copied into the
new array and the last parameter the first row that is no longer copied. By
using the current_row variable as starting parameter and the same variable
incremented by jobSize as ending parameter, every WorkJob gets assigned
a submatrix of A containing exactly jobSize rows.
The submitResult() method (lines 45-56) copies the partial matrices of the re-
turned results to the correct indices in the final result matrix C. This is done by
adjusting the index of C by the index of the first row the corresponding job was
responsible for.
Listing 18: Implementation of the Matrix Multiplication JobFactory.
1 public class MatrixMultiplication implements JobFactory{
2 private int [][] A;
3 private int [][] B;
4 private int [][]C;
5 private int K;
6 private int L;
7 private int M;
8 private int current_id = 0;
9 private int current_row = 0;
10 private int numJobs = 0;
11 private int jobSize = 100;
12
13 private Map <Integer , Integer > startIndexMap =
14 new HashMap <Integer , Integer >();
15
16 public MatrixMultiplication(int k, int l, int m,
17 int jobs){
18 K = k;
19 L = l;
20 M = m;
21 A = new int[K][L];
22 B = new int[L][M];
23 C = new int[K][M];
24 fillRandom(A);
25 fillRandom(B);
26 numJobs = jobs;
27 jobSize = K / jobs;
28 // K is assumed divisible by jobs
29 }
30
31 public synchronized WorkJob getWorkJob () {
32 MatrixJob j = new MatrixJob ();
33 startIndexMap.put(current_id ,
34 current_row);
35 j.setB(B);
36
37 j.setA(ArrayUtil.getRows(A, current_row ,
38 current_row + jobSize));
39 current_row += jobSize;
40 j.setId(current_id);
41 current_id ++;
42 return j;
43 }
44
45 public synchronized void submitResult(Result r) {
46 MatrixResult mres =
47 (MatrixResult) r;
48 int startindex =
49 startIndexMap.get(mres.getId ());
50 int [][] mat = mres.getResult ();
51
52 for (int j = startindex;
53 j < startindex + jobSize; j++){
54 C[j] = mat[j-startindex ];
55 }
56 }
57 }
Listing 19 shows the implementation of the WorkJob for the matrix multipli-
cation algorithm, called MatrixJob. The only fields (lines 2-3) needed for the
MatrixJob are the matrix B and the matrix A, which in this context is a
submatrix of the multiplication's first matrix.
Again the run() method (lines 5-29) contains the main computation of the paral-
lel algorithm. First of all the dimensions of the matrices are extracted from the
array lengths and a two dimensional array C for the result matrix is initialized.
The following loop implements the pure row oriented matrix multiplication and
is equivalent to the loop in the sequential algorithm (lines 11-23). It is designed
to always extract the currently used rows into one dimensional arrays, which
can then be accessed in a fast row-wise fashion.
Finally the partial result matrix C is wrapped in a MatrixResult with the corre-
sponding identifier and returned (lines 24-28).
Listing 19: Implementation of the MatrixJob.
1 public class MatrixJob implements WorkJob {
2 private int [][] A;
3 private int [][] B;
4
5 public Result run() {
6 int K = A.length;
7 int L = B.length;
8 int M = B[0]. length;
9 int [][] C = new int[K][M];
10
11 for (int i = 0; i < K; i++) {
12 int[] arowi = A[i];
13 int[] crowi = C[i];
14 for (int k = 0; k < L; k++) {
15 int[] browk = B[k];
16 int aik = arowi[k];
17 for (int j = 0; j < M;
18 j++) {
19 crowi[j] +=
20 aik * browk[j];
21 }
22 }
23 }
24 MatrixResult res =
25 new MatrixResult ();
26 res.setResult(C);
27 res.setId(this.id);
28 return res;
29 }
30 }
Listing 20 shows the implementation of the MatrixResult. The data field is a
two-dimensional integer array representing the partial result matrix C.
Listing 20: Implementation of the MatrixResult.
1 public class MatrixResult extends Result {
2 private int [][] result;
3
4 public int [][] getResult () {
5 return result;
6 }
7 public void setResult(int [][] result) {
8 this.result = result;
9 }
10
11 }
0.9 Mandelbrot Set
The Mandelbrot set is a set of points defined in the complex plane that form a
fractal. It was named after the mathematician Benoit Mandelbrot, who is known
for his work in chaos theory and fractal geometry [19]. The Mandelbrot set has
become known outside of mathematics for its computer graphical depictions.
Figure 0.14 shows a colored image of the Mandelbrot set.
Figure 0.14: Colored image of the Mandelbrot set.
0.9.1 Algorithm
Mathematically the Mandelbrot set is defined as the set of all complex numbers
c for which the sequence z_{n+1} = z_n² + c, starting with z_0 = 0, stays
bounded. That means the value of |z_n| never exceeds a certain bound. For
example c = 2 results in the sequence 0, 2, 6, 38, ... which obviously tends toward
infinity, thus is not bounded, so c = 2 is not in the Mandelbrot set. On the contrary
c = −1 results in the sequence 0,−1, 0,−1, ... which is bounded and therefore
belongs to the Mandelbrot set. Another example of a bounded sequence is c = i,
resulting in the sequence 0, i, (−1 + i),−i, (−1 + i),−i, ....
It can be shown that once |z_n| exceeds 2 the sequence is not bounded. So for
testing whether a given point c is an element of the Mandelbrot set the algorithm
iterates over the sequence, testing in every iteration whether |z_n| is larger than 2.
If this is the case the point is not in the Mandelbrot set and the loop terminates.
However, there is no way to decide with certainty whether the sequence is
bounded, and thus whether a point is in the Mandelbrot set, so a maximum
number of iterations has to be defined after which the loop terminates and the
sequence is considered to be bounded.
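The membership test described above can be sketched as a small Java method; the class name, method name and iteration limit are illustrative choices, not part of the thesis framework:

```java
// Escape-time test for a complex number c = cr + ci*i: returns the number
// of iterations until |z| exceeds 2, or maxIter if the sequence stayed
// bounded (c is then considered part of the Mandelbrot set).
public class MandelbrotPoint {
    public static int iterations(double cr, double ci, int maxIter) {
        double zr = 0, zi = 0;                   // z_0 = 0
        for (int n = 0; n < maxIter; n++) {
            if (zr * zr + zi * zi > 4.0) {       // |z_n| > 2: sequence escapes
                return n;
            }
            double tmp = zr * zr - zi * zi + cr; // real part of z_n^2 + c
            zi = 2 * zr * zi + ci;               // imaginary part of z_n^2 + c
            zr = tmp;
        }
        return maxIter;                          // considered bounded
    }

    public static void main(String[] args) {
        System.out.println(iterations(2, 0, 100));  // c = 2 escapes quickly
        System.out.println(iterations(-1, 0, 100)); // c = -1 stays bounded
    }
}
```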
Algorithms generating computer graphics of the Mandelbrot set assign every
point in the considered area of the complex plane a corresponding pixel in the
picture. Typically the x value of a pixel corresponds to the imaginary part of
c and the y value of the pixel corresponds to the real part of c. The algorithm
then iterates over every point and decides whether it belongs to the Mandelbrot
set. Every point that is in the Mandelbrot set has its corresponding pixel
colored black, whereas all other pixels are colored white. The result is a simple
depiction of the considered part of the complex plane showing the points that
belong to the Mandelbrot set.
An extension to the simple black-and-white coloring is the escape time algorithm.
Instead of coloring each point that does not belong to the Mandelbrot set white,
it counts the number of iterations that were needed for the point to meet the
escape condition and colors the pixel according to a predefined color table. The
escape time algorithm therefore often produces bands of color for points that
escaped in the same iteration. However, there are various more sophisticated
algorithms that produce more continuous colorings.
0.9.2 Parallelization
Listing 21 shows a pseudo code illustration of the Mandelbrot set generator's
main loop. The two nested loops iterate over the pixels of the image that is
generated, determining a color value for each. The variables M and N thus de-
termine the pixel dimensions of the image. Each pixel of the image has to be
mapped to a number in the complex plane. In the code sample this is done by the
scale_real() and scale_imaginary() functions. The scaling of complex num-
bers is discussed more closely in the Implementation section of the algorithm.
The variables real and imag hold the real and imaginary parts of the complex
number currently considered. These variables are private to each iteration and
thus do not impose any loop dependencies. The mandelbrot_point() function
does the actual computation of whether a number is part of the Mandelbrot set
and, if not, how many iterations were needed to surpass the upper boundary.
The function has no global data or side effects and thus is reentrant. The color
values of each pixel are stored in the two dimensional array X. Since in every
iteration a different entry of the array is written, there are also no data
dependencies between iterations.
Listing 21: Pseudo code of the Mandelbrot Set generators main loop.
1 for i = 0 to M do
2 for j = 0 to N do
3 real = scale_real(i);
4 imag = scale_imaginary(j);
5 X[i][j] = mandelbrot_point(real , imag)
6 end
7 end
As mentioned, the computation-intensive part of generating Mandelbrot com-
puter graphics is determining for each point whether it is part of the Mandelbrot
set or not. Since the computation is independent for each point, the problem
can easily be parallelized by partitioning the desired set of points into n subsets,
where n is the number of desired parallel tasks.
Typically the master partitions the image into lines and assigns a certain number
of lines to each slave. The slaves decide for each point whether it belongs to
the Mandelbrot set and return the color value for each pixel back to the
master. The master then assembles the whole image accordingly.
Similar to the Monte Carlo algorithms, the parallel Mandelbrot set algorithm has
very beneficial properties for parallelization. There are very few parameters
to communicate to each slave. Moreover, the size of each parallel task can be
freely chosen, thus making load balancing easy. However, the size of the results is
considerably larger, since the color value of every pixel has to be communicated
back to the master.
0.9.3 Implementation
Listing 22 shows the JobFactory implementation of the Mandelbrot set gener-
ator. For simplicity a square image is assumed. The class fields (lines 2-14)
consist of the image dimension N and the boundary values xmin, xmax, ymin and
ymax defining the section of the complex plane that will be considered. Besides,
there is the variable id storing the identifier of the next job, the variable lines
storing the number of pixel lines that are assigned to each job, and current_line
storing the index of the first pixel line that has not been assigned to a job yet.
The two dimensional array result holds the color values of the final image and
numJobs stores the total number of jobs to generate. The map idLine stores for
each generated job its identifier and the corresponding start line, which is needed
for mapping the partial results to their actual position in the final image.
The public constructor (lines 16-26) takes the image dimension, the number of
jobs and the area of the image as parameters and initializes all fields
straightforwardly.
The getWorkJob() method (lines 28-39) is responsible for correctly initializing
jobs. First of all a MandelbrotJob is initialized using its public constructor.
The parameters are the general dimension N of the generated image, the first
line current_line the job is responsible for and the number of pixel lines the
job should generate, lines. In addition the boundary variables determining the
considered area of the complex plane are passed to the constructor. Finally
the correct identifier is set, the first line of the job is stored and all counter
variables are adjusted accordingly.
The submitResult() method (lines 41-53) collects the results and assembles
the complete image from the separate pixel lines. First the two dimensional
byte array containing the pixel values is extracted from the MandelbrotResult.
The values are then copied into the final pixel matrix result. The correct
position is given by offsetting the index of the result matrix by the job's
corresponding start line.
Listing 22: Implementation of the MandelbrotSet JobFactory.
1 public class MandelbrotSet implements JobFactory {
2 private int N = 0;
3 private float xmin = 0;
4 private float xmax = 0;
5 private float ymin = 0;
6 private float ymax = 0;
7
8 private int id = 0;
9 private int lines = 0;
10 private int current_line = 0;
11 private byte [][] result = null;
12 private int numJobs = Integer.MAX_VALUE;
13 private Map <Integer , Integer > idLine =
14 new HashMap <Integer , Integer >();
15
16 public MandelbrotSet(int N, int jobs , float xmin ,
17 float xmax , float ymin , float ymax) {
18 this.N = N;
19 this.lines = N/jobs;
20 this.xmin = xmin;
21 this.xmax = xmax;
22 this.ymin = ymin;
23 this.ymax = ymax;
24 result = new byte[N][N];
25 numJobs = jobs;
26 }
27
28 public synchronized WorkJob getWorkJob () {
29 MandelbrotJob j =
30 new MandelbrotJob(N, current_line , lines ,
31 xmin , xmax , ymin , ymax);
32
33 j.setId(id);
34 idLine.put(id, current_line);
35 current_line += lines;
36 id++;
37 numJobs --;
38 return j;
39 }
40
41 public synchronized void submitResult(Result r) {
42 MandelbrotResult mr = (MandelbrotResult) r;
43 byte [][] data = mr.getResult ();
44
45 for (int i = 0; i < data.length; i++) {
46 for (int j = 0;
47 j < data [0]. length; j++) {
48 result[i][j + idLine.
49 get(mr.getId())] =
50 data[i][j];
51 }
52 }
53 }
54 }
Listing 23 illustrates the implementation of the MandelbrotJob class. The fields
(lines 2-8) are the same as in the JobFactory and are initialized straightfor-
wardly by the public constructor.
The main computation takes place in the mandelbrot_point() method (lines
23-42), which determines for a given c whether it is part of the Mandelbrot set
and, if not, assigns the corresponding pixel a color value. The complex number
c is split into two components cx and cy representing the real and the
imaginary part of the number, respectively. As explained earlier it can be shown
that the sequence is not bounded once the absolute value of zn exceeds two. In
the algorithm the squared absolute value of the current number is compared to
a maximum value in order to avoid computing an additional square root every
iteration. The maximum value is stored in the variable max and its value is
the square of two. The variables x and y store the real and imaginary
part of the current sequence number zn and val stores the square of its absolute
value. x_temp is just a temporary variable for the new x value, i is the loop's
counter variable and max_iteration is the maximum number of iterations after
which c is considered to be in the Mandelbrot set. Typically 256 is chosen as
the maximum number of iterations so that valid RGB color values are generated for
each pixel.
The main loop (lines 33-40) of the method checks on every iteration whether
the absolute value has exceeded the escape value or the maximum number of
iterations has been reached. If not, the next number in the sequence and its
squared absolute value are computed. The computation of the next x and y
value is a version of the already presented sequence formula, split up into the
real and imaginary part of zn+1. Once the main loop terminates the number of
iterations needed is returned, which corresponds to the color value of the pixel
associated with c.
The run() method (lines 44-70) generates a complex number c for each pixel
in the considered area and calls the mandelbrot_point() method for each of them.
As explained before the real value of c corresponds to the horizontal position of
the pixel and the imaginary value to the vertical position of the pixel. The area
of the complex plane that is considered is given by the boundaries xmin, xmax,
ymin and ymax. Since the algorithm can only generate a finite number of pixels
in each dimension, in this case given by N, it has to be decided which numbers are
mapped to a pixel. This is done by calculating the scaling variables xscale and
yscale that determine the distance between pixels in the horizontal and vertical
direction. The points are equally distributed across the considered area of the
complex plane so the scaling values are given by the difference of the maximum
and minimum value in a dimension divided by the number of pixels in that
dimension. With the scaling variables given the pixels can be easily mapped to
the corresponding c values (lines 54-57). Afterwards the mandelbrot_point()
method is called with the current complex number c (lines 59-60). Finally the
resulting pixel array is wrapped in a MandelbrotResult and returned (lines
64-69).
Listing 23: Implementation of MandelbrotJob.
1 public class MandelbrotJob extends WorkJob {
2 private int N;
3 private int ystart;
4 private int lines;
5 private float xmin = 0;
6 private float ymin = 0;
7 private float xmax = 0;
8 private float ymax = 0;
9
10 public MandelbrotJob(int N, int yStart ,
11 int lines , float xmin , float xmax ,
12 float ymin , float ymax) {
13
14 this.ystart = yStart;
15 this.lines = lines;
16 this.xmax = xmax;
17 this.ymax = ymax;
18 this.xmin = xmin;
19 this.ymin = ymin;
20 this.N = N;
21 }
22
23 private static byte mandelbrot_point(float cx ,
24 float cy) {
25 float max = 2 * 2;
26 float val = 0;
27 float x = 0;
28 float y = 0;
29 float x_temp = 0;
30 int max_iteration = 256;
31 int i = 0;
32
33 while ((val < max)
34 && (i < max_iteration)) {
35 x_temp = x * x - y * y + cx;
36 y = 2 * x * y + cy;
37 x = x_temp;
38 val = x * x + y * y;
39 i++;
40 }
41 return (byte) i;
42 }
43
44 public Result run() {
45 byte x[][] = new byte[N][ lines];
46 float xscale = (xmax - xmin)/ N;
47 float yscale = (ymax - ymin)/ N;
48 float cx = 0;
49 float cy = 0;
50
51 for (int i = 0; i < N; i++) {
52 for (int j = ystart;
53 j < ystart + lines; j++) {
54 cx = xmin + ((float) i
55 * xscale);
56 cy = ymin + ((float) j
57 * yscale);
58
59 x[i][j - ystart] =
60 mandelbrot_point(cx , cy);
61 }
62 }
63
64 MandelbrotResult r =
65 new MandelbrotResult ();
66
67 r.setId(this.id);
68 r.setResult(x);
69 return r;
70 }
71 }
Listing 24 shows the Result implementation of the Mandelbrot set generator.
The data type is a two dimensional byte array storing the color values of the
pixels in the computed area of the image.
Listing 24: Implementation of MandelbrotResult.
1 public class MandelbrotResult extends Result {
2 private byte [][] result;
3
4 public byte [][] getResult () {
5 return result;
6 }
7 public void setResult(byte [][] result) {
8 this.result = result;
9 }
10 }
0.10 Rank Sort
Rank Sort is a simple sorting algorithm that can be parallelized. The sequential
version is very simple to implement, though it performs rather badly in
comparison to fast sorting algorithms such as Quick Sort or Heap Sort. The
parallel version, while still simple, has the potential to outperform the faster
sequential sorting algorithms [10].
Unsorted List  Rank        Sorted List  Rank
14             5            2           1
9              4            4           2
4              2            7           3
18             6            9           4
7              3           14           5
2              1           18           6
Table 0.5: Example illustrating the concept of ranks.
0.10.1 Algorithm
The general idea of the algorithm is that every element of the list to be sorted
has a property called rank. The rank represents the position the element would
have if the list were already sorted.
Table 0.5 illustrates the concept of a rank. The left part of the table shows the
elements of an unsorted list and their respective ranks. The right part shows the
same list in sorted order, again with the ranks of the elements. Note that in the
sorted list the rank of each element equals its absolute position in the sorted
list.
The rank of an element e can also be defined as the total number of elements
in the list that are smaller than e. In order to calculate the rank, the element
e has to be compared to all other elements in the list. For every element that
is smaller than e, the rank is increased by one. Once the rank of an element is
known, it can immediately be placed at the rank's position in the sorted array.
This results in a runtime complexity of O(n) to compute the rank of one element
in the list. In order to sort the whole list the rank of every element has to be
computed, which results in a total runtime complexity of O(n^2) for sorting a
list of n elements.
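The sequential procedure just described can be sketched as a short, self-contained Java class. Class and method names are illustrative; ranks here are zero-based array indices, unlike the one-based positions shown in Table 0.5, and distinct elements are assumed so that every rank is unique:

```java
import java.util.Arrays;

public final class SequentialRankSort {
    // Rank of x[pos]: the number of elements in x that are smaller than x[pos].
    public static int rank(int[] x, int pos) {
        int rank = 0;
        for (int v : x) {
            if (v < x[pos]) rank++;
        }
        return rank;
    }

    // O(n^2) rank sort: every element is placed directly at its rank's position.
    public static int[] rankSort(int[] x) {
        int[] sorted = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            sorted[rank(x, i)] = x[i];
        }
        return sorted;
    }

    public static void main(String[] args) {
        // The unsorted list from Table 0.5.
        System.out.println(Arrays.toString(rankSort(new int[]{14, 9, 4, 18, 7, 2})));
    }
}
```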
0.10.2 Parallelization
Listing 25 shows the main loop of the Rank Sort algorithm in pseudo code. The
loop iterates over every element in the unsorted list, calling the compute_rank()
function that determines the rank of the given element. As already described,
the rank of an element corresponds to its index in the sorted list. As a
consequence, if compute_rank() is implemented correctly there are no
overlapping write accesses to the sorted array.
Listing 25: Pseudo code of Rank Sort's main loop.
1 for i = 0 to N do
2 rank = compute_rank(unsorted[i])
3 sorted[rank] = unsorted[i]
4 end
Since the rank computation of the elements has no dependencies, a parallel
version of the algorithm can be easily obtained by computing the rank of the
elements independently on different processors.
In a shared memory architecture the unsorted and sorted lists are copied into
the shared memory. While the unsorted list is only read by the processors and
used to compute the ranks of elements, every processor can directly write its
elements to the sorted result list. Note that there are no conflicts, since only the
original list is used for the computation of the elements' positions.
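A minimal sketch of this shared-memory variant, assuming plain Java threads and distinct elements (class, method and variable names are illustrative, not part of the thesis framework):

```java
public final class ParallelRankSort {
    // Shared-memory parallel rank sort: the unsorted array x is only read,
    // while each thread writes the elements it ranked directly into the
    // shared result array. Since ranks are unique for distinct elements,
    // no two threads ever write to the same position.
    public static int[] rankSort(int[] x, int numThreads) {
        int[] sorted = new int[x.length];
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int tid = t;
            workers[t] = new Thread(() -> {
                // Cyclic distribution: thread tid ranks elements tid, tid+numThreads, ...
                for (int i = tid; i < x.length; i += numThreads) {
                    int rank = 0;
                    for (int v : x) {
                        if (v < x[i]) rank++;
                    }
                    sorted[rank] = x[i]; // conflict-free write
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sorted;
    }
}
```

Joining all worker threads before reading the result establishes the necessary happens-before ordering, so no further synchronization is required.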
In a master-slave architecture, every slave computes the ranks of some elements
of the list, while the master collects the ranks of the elements and puts them
in the correct positions of the sorted result list. The unsorted list has to be
broadcast to every slave in the system.
The parallel time complexity can be easily derived from the sequential
complexity of the algorithm. The inner loop complexity of O(n) stays the same.
Assuming an equal distribution of work, each processor has to perform the
inner loop only n/m times, where m is the number of available processors. As
a result there is a total time complexity of O(n * n/m) or O(n^2/m).
0.10.3 Implementation
Listing 26 shows a simplified version of the Rank Sort JobFactory. The fields
(lines 2-11) consist of the unsorted array x, the total number of parallel jobs
to generate numJobs and the number of elements that are sorted by one job
jobSize. Moreover there is the variable remainder storing the number of
additional elements to sort when the elements cannot be distributed equally to
the parallel jobs and some of them have to sort an additional element.
currentIndex and currentId are counter variables holding the start index of
the elements assigned to the next generated job and its identifier, respectively.
Finally there is a second array sorted holding the sorted result list, which is
continuously filled with elements by their rank. idIndexMap stores for every job identifier
the corresponding starting index, so the elements of the array can be correctly
associated with their rank when collecting results.
The RankSort class is initialized by its public constructor (lines 13-19), taking
as arguments the unsorted array and the number of desired parallel jobs. The
unsorted array x and the number of jobs numJobs are simply set to the passed
arguments. jobSize is given by the length of the unsorted array divided by the
number of jobs and respectively remainder is given by the modulo of the array
length and the number of jobs. The result array is initialized with the same
length as the unsorted array.
The getWorkJob() method (lines 21-36) again initializes a RankSortJob and
assigns it a portion of the array. The start index is given by the currentIndex
variable, the end index by currentIndex + jobSize. However, as long as there
are remaining elements the end index is incremented, assigning the job an
additional element, and the counter for remaining elements is decremented.
Finally the job's identifier is set and the WorkJob is returned.
The submitResult() method (lines 38-46) is responsible for continuously filling the
sorted result array with elements by their rank. The RankSortResult contains
the ranks of the elements the corresponding job had to consider.
Listing 26: Implementation of the RankSort JobFactory.
1 public class RankSort implements JobFactory {
2 private int[]x;
3 private int numJobs;
4 private int jobSize;
5
6 private int remainder;
7 private int currentIndex;
8 private int currentId;
9 private int[] sorted;
10 private Map <Integer , Integer > idIndexMap =
11 new HashMap <Integer , Integer >();
12
13 public RankSort(int []x, int jobs){
14 this.x = x;
15 jobSize = x.length/jobs;
16 remainder = x.length % jobs;
17 numJobs = jobs;
18 sorted = new int[x.length ];
19 }
20
21 public synchronized WorkJob getWorkJob () {
22 RankSortJob j = new RankSortJob ();
23
24 idIndexMap.put(currentId , currentIndex);
25 j.setFrom(currentIndex);
26 j.setTo(currentIndex + jobSize);
27 if(remainder > 0){
28 j.setTo(j.getTo () + 1);
29 remainder --;
30 }
31 currentIndex = j.getTo();
32 j.setId(currentId);
33 currentId ++;
34
35 return j;
36 }
37
38 public synchronized void submitResult(Result r) {
39 RankSortResult res = (RankSortResult) r;
40 int [] ranks = res.getRanks ();
41
42 for(int i = 0; i < ranks.length; i++){
43 sorted[ranks[i]] =
44 x[i+idIndexMap.get(res.getId ())];
45 }
46 }
47 }
Listing 27 shows the WorkJob implementation of the algorithm. x contains the
unsorted array, from and to are the boundary indices of the elements the given
job is responsible for.
The implementation of computeRank() (lines 6 to 14) is the same as in a sequen-
tial version of the algorithm. The method iterates over the original unsorted
list and compares the element at the given position to every other element in
the list. For every element that is smaller than the considered element the rank
is increased by one. Finally the rank of the element is returned.
The run() method (lines 16 to 30) initializes an array for storing the ranks of the
corresponding elements of the unsorted list. Then the computeRank() method is
called repeatedly for every element between the from and the to index. Finally
the result is assembled and returned.
Listing 27: Implementation of the RankSortJob.
1 public class RankSortJob extends WorkJob{
2 private int[] x;
3 private int from;
4 private int to;
5
6 public static int computeRank(int[] x, int pos){
7 int rank = 0;
8 for(int i = 0; i < x.length; i++){
9 if(x[i] < x[pos]){
10 rank +=1;
11 }
12 }
13 return rank;
14 }
15
16 public Result run() {
17 int[] ranks = new int[to -from];
18
19 for(int i = 0; i < ranks.length; i++){
20 ranks[i] =
21 computeRank(x, i + from);
22 }
23
24 RankSortResult r = new RankSortResult ();
25 r.setRanks(ranks);
26 r.setFrom(this.from);
27 r.setId(this.id);
28
29 return r;
30 }
31 }
Listing 28 shows the implementation of the RankSortResult class. A
RankSortResult contains an array with the ranks of the elements that
were assigned to the corresponding job. The master application copies the
corresponding elements to the result array according to their ranks.
Listing 28: Implementation of the RankSortResult.
1 public class RankSortResult extends Result{
2 private int[] ranks;
3
4 public int[] getRanks () {
5 return ranks;
6 }
7 public void setRanks(int[] ranks) {
8 this.ranks = ranks;
9 }
10 }
Experiments
In this chapter all the experiments testing App Engine's computing capabilities
are presented. First a couple of general tests of the App Engine framework
were performed, in order to get a grasp of the general performance of the
framework, to identify possible bottlenecks and to provide a context for the
algorithm experiments. The algorithm experiments consist of a simple speedup
analysis followed by a scalability analysis. Finally the resource consumption of
each algorithm is analyzed and a rough cost estimation is derived. Results are
generally compared to an equivalent experimental setup executed on the
karwendel cluster, in order to provide a comparison to a system with known
hardware.
Note that getting consistent performance measurements on Google App Engine
is quite difficult, since there are no guarantees whatsoever with respect to the
location or the underlying hardware of the servers. For example, two identical
consecutive requests to the same web application could be executed on
completely different hardware in two different geographic locations. For the
duration of the tests, however, we most likely dealt with the same hardware
infrastructure, since there were no significant variations in the results.
Moreover, changes in the load balancing strategies, other internal mechanisms
or the underlying hardware can influence the performance. In addition,
background load on the servers or the network might also influence the outcome
of experiments. In order to minimize this bias all experiments were conducted
in a short period of time and were executed multiple times in order to even out
side effects.
0.10.4 Hardware and Experimental Setup
In the following the hardware and experimental setup used for all the experi-
ments presented in this chapter will be described.
For experiments performed on the karwendel cluster, the Master application
responsible for job distribution was executed on the head node of the cluster,
while the slave application was executed on a regular compute node using the
development server. Table 0.6 lists the hardware specification of the karwendel
cluster.

head node
CPU: 2 x Opteron 848
CPU speed: 2.2 GHz
Cores: 1
Cache: L1: 64 kilobyte, L2: 1 megabyte
Memory: 16 gigabyte

compute node
CPU: 4 x Opteron 880
CPU speed: 2.4 GHz
Cores: 2
Cache: L1: 32 kilobyte, L2: 1 megabyte
Memory: 16 gigabyte

Network: Infiniband network
JVM: Java HotSpot 64-Bit Server VM (build 16.3-b01, mixed mode)

Table 0.6: Hardware specification of the karwendel cluster.
CPU: Intel Xeon 5150
CPU speed: 2.66 GHz
Cores: 2
Cache: L1: 32 kilobyte, L2: 4 megabyte
Memory: 4 gigabyte
Internet Connection: ...

Table 0.7: Hardware specification of the zid-gpl server.
For all experiments on App Engine the Master application was executed on the
zid-gpl server. Table 0.7 lists the hardware specification of the zid-gpl server.
Unless stated otherwise, every iteration of the experiments was repeated ten
times and the results were averaged in order to reduce bias. The JVM was
always "warmed up" before starting the actual measurements in order to keep
effects of JIT compilation from distorting the results. For a more detailed
discussion of JIT compilation see section 0.11.4. The JVM used in the
karwendel experiments is listed in table 0.6. App Engine's exact JVM version
is not known; it is however a Java 1.6 virtual machine.
All App Engine experiments were performed between 25.12.2010 and 30.12.2010.
0.11 Analyzing Google App Engine Performance
In order to get a general grasp for the performance of the App Engine infras-
tructure and to identify possible bottlenecks, we performed a couple of general
tests. First of all the network quality of App Engine was tested including a
latency and bandwidth analysis. Furthermore the general performance of the
Java environment was tested, followed by a test of the Just In Time (JIT)
compilation capabilities of the system. Finally the cache behavior of the
system was tested as well. Generally it would be better to use standardized
benchmarks; App Engine however does not allow arbitrary libraries and
therefore we had to write all tests ourselves. Where necessary we compared
the results with results gathered on karwendel in order to provide a comparison
to a system with known hardware.
0.11.1 Latency Analysis
The latency of requests when calling App Engine applications is of key impor-
tance for a performance analysis, since it often accounts for a big part of the
parallel overhead. Especially when the computation time of parallel tasks is
relatively short in comparison to the latency, it can have a large impact on
the overall performance of algorithms. In this context the latency of an HTTP
request is the time needed from issuing the request until the HTTP response is
completely returned.
There are various factors that influence latency. Foremost there is the network
infrastructure that determines the time needed until the data of the request is
transferred to the servers. However there is also the time needed for the App
Engine load balancer to correctly analyze and assign a request to an application
server. Moreover if there is no instance of the web application running on an
application server, one has to be initialized before assigning the request. As
described in the first chapter the App Engine load balancer starts and caches
instances of an application depending on the number of recent requests. In fact
applications that have no running instances may have a considerably higher
latency for the first couple of requests depending on the size of the application
and thus the time needed to load instances into memory.
In the following an analysis of HTTP request latency to the slave application
is presented. The experiment was set up to test the latency of a job request
depending on the size of the payload. Payload sizes from 0 up to 2700 kilobyte
in 300 kilobyte steps were tested, since this is the typical size range for job
requests. Each request size was tested 50 times. For all measurements, simple
ping requests to the application were issued, measuring the time until the
response was completely returned.
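A minimal sketch of such a ping measurement is given below; the class and URL handling are illustrative assumptions and not part of the thesis framework. The slave application's ping URL has to be supplied from outside:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public final class LatencyProbe {
    // Latency of one request: time from issuing the request until the
    // response body has been read completely, in milliseconds.
    public static long ping(String url) throws Exception {
        long start = System.nanoTime();
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = con.getInputStream()) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // drain the response completely
            }
        } finally {
            con.disconnect();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Pass the slave application's ping URL as the first argument,
        // e.g. http://<app-id>.appspot.com/ping (placeholder).
        if (args.length > 0) {
            for (int i = 0; i < 50; i++) {
                System.out.println("request " + i + ": " + ping(args[0]) + " ms");
            }
        }
    }
}
```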
[Figure: latency in milliseconds versus payload size in kilobyte]
Figure 0.15: Results of the latency analysis.
Figure 0.15 shows the results of the latency analysis. Notable is that the latency
does not increase linearly for a linearly increasing payload size. This effect can be
explained by TCP needing some time to fully utilize the fast Internet connection
and therefore being able to handle the larger payloads relatively faster. Note
that in this experiment single isolated requests were sent, in contrast to a
practical algorithm run, where multiple parallel requests are sent, thus utilizing
the connection more efficiently.
0.11.2 Bandwidth
Besides the latency of requests another very important network measure is the
bandwidth, since the upload and download of data to the App Engine servers
is clearly a major part of the parallel overhead. Looking at the latency
analysis it is obvious that the bandwidth often will not be fully utilized by a
single request. Therefore shared data is transferred using multiple parallel
TCP streams, as described more closely in chapter 0.6. Moreover there are
usually several parallel job requests in flight in order to use the available
bandwidth as efficiently as possible.
In order to test the bandwidth to App Engine we used the parallel data transfer
implemented for shared data management, however without storing any of the
data in the datastore in order to avoid overhead imposed by database
operations. The experiment involved transferring data chunks of one, two and four
megabyte, which is a typical data size used for the algorithms, using between
1 and 60 parallel HTTP requests. Thereby we were able to identify a good
amount of parallel streams for data transfer as well as analyze the achievable
bandwidth to the App Engine servers.
[Figure: transfer time in milliseconds versus number of parallel streams, for 1, 2 and 4 megabyte chunks]
Figure 0.16: Results of the bandwidth test.
Figure 0.16 shows the results of the bandwidth analysis. Up to 30 parallel
streams a steady speedup of the data transfer can be achieved. Between 30 and
60 parallel streams the transfer time stays approximately the same. For the
four megabyte chunks the time still seems to decrease slightly up to around 50
streams. For more than 60 parallel streams the time needed to transfer the data
chunks starts to increase again.
Chunksize   Minimum Speed   Maximum Speed
1 mb        343 kb/sec      1796 kb/sec
2 mb        507 kb/sec      2560 kb/sec
4 mb        871 kb/sec      3938 kb/sec
Table 0.8: Maximum and minimum transfer speeds for different data chunksizes.
Table 0.8 shows the maximum and minimum transfer speeds for the different
chunk sizes. As expected the larger chunks have higher transfer speeds be-
cause TCP has more time to fully utilize the bandwidth and there is relatively
less overhead imposed by the HTTP header. For all chunk sizes a substantial
speedup of the data transfer can be achieved by using multiple parallel streams.
In practice 30 parallel streams is probably the best choice. Even though a
slightly faster transfer could possibly be achieved with more streams, the higher
overhead of the database operations would outweigh the faster transfer.
0.11.3 Java Performance Analysis
In order to provide a rough performance estimation of the App Engine Java
environment we measured various metrics using a simple self-written micro
benchmark and compared the results to the karwendel Java environment. The
performance of the basic arithmetic operations of each data type as well as the
performance of trigonometric functions, random number generation and object
creation is measured. The benchmark repeatedly performs each operation in
a loop and estimates how many operations of each kind can be executed per second.
The purpose of this section is to provide a comparison of the App Engine Java
environment to the Java environment used for executing the algorithms on kar-
wendel.
Microbenchmarks
Listing 29 shows the benchmark routine for double arithmetics. For measuring
the time the more precise method System.nanoTime() is used instead of
System.currentTimeMillis(). The main loop repeatedly performs the basic
arithmetic operations until the maximum number of iterations is reached. It is
important to use the variable containing the computation result, for example
by printing it, as otherwise the compiler would simply optimize the seemingly
useless computation away. Furthermore it is essential to avoid using unchanging
variables in the computations, which again would enable the compiler or the
JVM to optimize the computation away. Since generating random values on
every iteration would not be feasible, the counter variable is used in the
computation. Finally the time spent in the main loop is returned.
Listing 29: Benchmark routine for double add/sub operations.
1 public static long doubleAddArith(long N) {
2 long start = 0;
3 long end = 0;
4 double result = Math.random () * 10.0;
5 double i = 0.0;
6 start = System.nanoTime ();
7 while (i < N) {
8 result -= i++;
9 result += i++;
10 }
11 end = System.nanoTime ();
12 System.out.println(result);
13 return (end - start);
14 }
The other benchmarks are implemented analogously and therefore a closer de-
scription is omitted here.
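As an illustration of the analogous structure, the object creation benchmark might look as follows. This is a sketch following the pattern of Listing 29, not the exact routine used in the thesis:

```java
public final class ObjectCreationBench {
    // Analogous to Listing 29: repeatedly allocates objects and returns the
    // elapsed time in nanoseconds. Printing the last reference keeps the
    // allocations from being optimized away entirely.
    public static long objectCreation(long N) {
        Object o = null;
        long i = 0;
        long start = System.nanoTime();
        while (i < N) {
            o = new Object();
            i++;
        }
        long end = System.nanoTime();
        System.out.println(o);
        return end - start;
    }
}
```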
The benchmark tests were executed as a WorkJob using the distributed system
in the same way as for the algorithms. The tests were run directly without a
warmup phase for the JVM, since the benchmarks are only intended to com-
pare system performance. Between each test the garbage collector is called by
invoking System.gc().
[Figure: runtime in nanoseconds for the benchmarks int a/s, int m/d, long a/s, long m/d, double a/s, double m/d, trig, random and object, on App Engine and karwendel]
Figure 0.17: Microbenchmark results.
Figure 0.17 shows the results of the micro benchmark test. The bars show the
time needed in nanoseconds to execute each benchmark loop. For the arith-
metic tests the number of iterations was 100.000.000, for the trigonometric test
1.000.000 and for the random number generation as well as the object creation
test 10.000.000. For the basic data type operations a/s labels the addition and
subtraction tests, whereas m/d labels the tests for multiplication and division.
First of all, very surprising is the difference between the addition/subtraction
and the multiplication/division operations for the integer data types. While
multiplication/division operations for int as well as long types are faster on App
Engine than on karwendel, the addition/subtraction operations are considerably
slower. This effect can be explained either by the hardware used in the App
Engine infrastructure or by the way the App Engine JVM handles addition and
subtraction operations. The operations for the double data type are more stable
and are faster on karwendel. The trigonometric functions provided by the
standard math library are slightly faster on App Engine. The random number
generation of the standard math library is over twice as fast on karwendel as
on App Engine, which might be due to a different implementation used in App
Engine's runtime environment. Object creation has almost identical
performance on both systems.
[Figure: computation time in milliseconds for Scalar mul/div and Fibonacci on App Engine and karwendel]
Figure 0.18: Computation time results of Scalar mul/div and the Fibonacci number generator.
In order to verify the rather unusual results of the integer operations we tested
the different operations using actual algorithms: first a simple scalar
multiplication/division algorithm that heavily exercises the multiplication and
division operations, and second a Fibonacci number generator that represents
the addition/subtraction operations. The raw computation time of the
algorithms was measured, which is the time spent in the run method of the
algorithms without any overhead. For both algorithms a problem size was
chosen that yielded a computation time of around one second on App Engine,
and the same size was then executed on karwendel.
Figure 0.18 shows the results of the experiment. The Fibonacci algorithm is
around three times faster on karwendel, which matches the results from the
microbenchmarks. Notable is however that the Scalar algorithm is actually
slower on karwendel than one would expect. This can be explained by the fact
that the Scalar algorithm also has a high memory usage and therefore also
tests the speed of the cache hierarchy. As analyzed more closely in section
0.11.5, the cache hierarchy of App Engine is generally faster than on karwendel,
which explains the discrepancy in computation times.
0.11.4 JIT Compilation
Virtual Machines using JIT compilation dynamically compile frequently used
parts of the bytecode to native machine code, which can yield a notable speed
improvement depending on the program. Algorithms typically benefit a lot
from JIT compilation, since usually most computation takes place in a main
loop which can be easily compiled to optimized machine code. For performance
measurements it can be very difficult to deal with JIT compilation, since the
JVM has to be "warmed up" before taking the actual measurements in order
to ensure consistent results.
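The warm-up procedure can be sketched as follows; the loop body and iteration counts are illustrative choices, not the exact values used in the experiments:

```java
public final class WarmupDemo {
    // Hot loop whose bytecode the JIT will eventually compile to native code.
    public static long hotLoop(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i * 31L % 7;
        }
        return sum;
    }

    // Times one invocation of the hot loop in nanoseconds.
    public static long time(int n) {
        long start = System.nanoTime();
        hotLoop(n);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long cold = time(5_000_000);           // first run: interpreted / compiling
        for (int i = 0; i < 20; i++) {
            time(5_000_000);                   // warm-up phase
        }
        long warm = time(5_000_000);           // steady-state measurement
        System.out.println("cold: " + cold + " ns, warm: " + warm + " ns");
    }
}
```

On a typical JIT-enabled JVM the steady-state time is noticeably lower than the first run, which is exactly the effect the warm-up phase is meant to exclude from the measurements.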
In order to test the behavior of App Engine in terms of JIT compilation we used
the Fibonacci number generator, since JIT compilation seemed to have a
big impact on the algorithm's computation time. The algorithm was repeatedly
executed with the same problem size over a period of 50 iterations. Note that
only the effective computation time was measured, which is the time spent in the
run method of a job, in order to avoid any bias caused by overhead. The requests
were sent one after another with a sleep time of one second between requests.
The App Engine slave application initially had no instances running.
Figure 0.19: Fibonacci test requests (computation time in milliseconds vs. request number).
Figure 0.19 shows the requests in their consecutive order as they were sent and
their respective computation time. Notable first of all is that every request has
essentially one of two computation times: the higher one of around three seconds
or the considerably lower one of a little over a second. Also interesting is
that the first requests all have the larger computation time, whereas the later
requests almost all have the lower one.
As described in chapter 2, App Engine spawns instances of an application depending
on its recent load. Therefore, for an application that experiences numerous
requests, additional instances are initialized. Each instance of an App Engine
application has its own JVM, within which all static references are shared. We
were thus able to track instances by introducing a class called InstanceTracker
(see listing 30). The InstanceTracker is simply a static singleton class with a
single field holding an identifier for the instance, which is initialized based
on the current time.
Listing 30: InstanceTracker class for tracking the current application instance.
public class InstanceTracker {

    private static InstanceTracker tracker = new InstanceTracker();

    long UUID;

    private InstanceTracker() {
        UUID = System.nanoTime();
    }

    public static InstanceTracker getTracker() {
        return tracker;
    }
}
Figure 0.20: Fibonacci test requests mapped to their respective instance (instances 1 to 7; computation time in milliseconds vs. request number).
By tracking the instances that handled the requests, the requests can be mapped
to their respective instance as depicted in figure 0.20. The requests were handled
by a total of seven different instances. Looking at the computation times, it is
noticeable that for each instance the first two requests it handled took
considerably longer than every following request. This leads to the conclusion
that, for this problem size, the optimized version of the code is executed after
two requests and is a lot faster.
For the experiments this means that the slave application has to be "warmed up"
with sufficient requests of the corresponding algorithm before actually beginning
the measurements, in order to minimize bias caused by the JIT compiler. The
problem, however, is that the programmer has absolutely no control over the
initialization or the lifetime of instances, which means there is always a chance
that a new "cold" instance is initialized during a measurement. For a slave
application running in the development server this problem is not present, since
every request is handled in the same JVM. The JVM, however, still has to be
"warmed up" after startup.
Moreover, the experiment shows how much impact the JIT compiler of the JVM can
have on the runtime of an algorithm.
0.11.5 Cache Hierarchy
Another important aspect of a computing environment is the cache hierarchy
and its respective speed. Therefore it is interesting to find out what cache hier-
archy is present in the App Engine infrastructure. Since there is no information
what hardware Google uses, we wrote a short cache test program in order to
get an idea of App Engines cache behavior. The cache tests presented in the
following are based on the tests in [21].
Listing 31: Short program to test the cache hierarchy.
// start, end and steps are declared in the surrounding test harness
for (int i = 1; i <= 16; i++) {
    int bytenum = 1024 * (int) Math.pow(2, i); // array sizes from 2 KB up to 64 MB
    byte[] arr = new byte[bytenum];

    start = System.currentTimeMillis();
    for (int k = 0; k < steps; k++) {
        arr[(k * 64) % arr.length]++; // touch one byte per 64-byte cache line
    }
    end = System.currentTimeMillis();
}
Listing 31 shows the code of the cache hierarchy test program. The program in
principle modifies entries of an array for a given number of iterations. By
increasing the size of the array, the point at which the array no longer fits
entirely into a cache level, so that the next slower level has to be used, can
be identified. Since in modern processors an array access loads a whole cache
line of 64 bytes, only every 64th entry of the byte array is modified, which
cheaply touches every cache line.
Figure 0.21: Results of the cache hierarchy test (iteration time in milliseconds vs. array size from 2 to 65536 kilobytes; App Engine and karwendel).
Figure 0.21 shows the result of the cache test program executed on App Engine
as well as on karwendel, in order to show that the program produces meaningful
results for a system with a known cache hierarchy. A karwendel node has an L1
cache of 64 kilobytes and an L2 cache of 1024 kilobytes, which is clearly visible
in the performance drops at array sizes of 64 and 1024 kilobytes, respectively.

The processors utilized in the App Engine infrastructure seem to have three
cache levels. The first performance drop is visible at 32 kilobytes, the second
at 256 kilobytes and the third at 8 megabytes. This would indicate an L1 cache of
32 kilobytes, an L2 cache of 256 kilobytes and an L3 cache of 8 megabytes, which
would for example match the Intel Nehalem processor microarchitecture.
0.12 Speedup Analysis
The experiment consists of a simple speedup analysis of each algorithm using
a single application on App Engine, compared to the same problem executed
on a single karwendel cluster node. The first difficulty, however, is choosing an
appropriate problem size, since App Engine allows single requests to run for at
most 30 seconds. Therefore, for each algorithm a problem size was chosen that
could be solved in slightly under 30 seconds on App Engine using a single job.
The same problem size was then used on both platforms, and the behavior of
several metrics was tracked for an increasing number of parallel tasks.

First of all, there is the total runtime, which is the time needed for the
algorithm to finish with a given problem size. The total runtime is a good
overall metric for how well the algorithm performs.
The next metric is the average computation time, which is the average time
needed for a job to execute the run() method. This metric therefore reflects the
actual time spent doing useful computations. If the problem is split into
multiple parallel jobs, each job has less work to do and thus typically a lower
average computation time. However, when there are more parallel jobs running
than cores available, the average computation time should not decrease further,
since cores have to be shared by multiple threads.
Another metric is the average overhead, which is the average time a job
additionally needed to complete, besides doing useful computations. The overhead
of a job is the difference between the time needed for the job to return a result
and the time spent in the run() method. The overhead typically comprises the
time needed to transfer requests, compress data, perform database operations
or run internal mechanisms of the Google App Engine framework.
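The overhead definition above can be stated as a one-line relation; a minimal sketch with illustrative names (not the thesis code):

```java
// Sketch of the per-job overhead metric described above; names are illustrative.
public class JobMetrics {
    // overhead = time until the job's result arrived minus time spent in run()
    static long overheadMillis(long totalJobMillis, long computationMillis) {
        return totalJobMillis - computationMillis;
    }

    public static void main(String[] args) {
        // e.g. a job whose result arrived after 1900 ms with 1200 ms spent in run()
        System.out.println(overheadMillis(1900, 1200) + " ms of overhead");
    }
}
```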
Finally, for algorithms making use of the shared data management, the data
transfer time is measured, which is simply the average time needed to transfer
the shared data to a slave.
By measuring more than just the total runtime, it can be analyzed in more detail
why certain algorithms perform better than others. A high overhead typically
means that too much time is invested in transferring data to the slaves in
comparison to the computation time. Note that we expect considerably higher
overhead values for the App Engine slave than for the local slave executed on
karwendel, since data has to be transferred over the Internet instead of a fast
local network. Moreover, there is notable latency caused by the load balancing
mechanisms of the App Engine framework.
If every job takes the same time for its computation and has the same overhead,
the completion time of the algorithm is the sum of the average computation
time, the average overhead and the time spent generating jobs and reassembling
results. A large gap between the completion time and the sum of the average
computation time and average overhead is therefore typically an indicator of
poor load balancing.
0.12.1 Pi Approximation
Figure 0.22: Runtime analysis of the Pi Approximation algorithm (total execution time, computation time and overhead in seconds vs. parallel tasks; karwendel and App Engine).
Figure 0.23: Speedup analysis of the Pi Approximation algorithm (speedup vs. parallel tasks, against linear speedup; karwendel and App Engine).
The Pi Approximation algorithm is a pretty good benchmark for the raw
computation power of the App Engine framework, since there is almost no data to
transfer in the requests or in the results. Moreover, the algorithm itself does
not operate on data, and therefore no caching effects are present. Besides
these properties, the algorithm can be almost perfectly load balanced. For the
experiment a problem size of 220.000.000 points to generate was chosen. Note,
however, that the algorithm is based on the standard random number generator
provided by the Java API, which seems to be rather slow on App Engine.
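The per-job work of the Pi Approximation is a plain Monte Carlo sampling loop. The following is a minimal sketch under illustrative names (not the thesis implementation); java.util.Random stands in for the slow standard generator mentioned above:

```java
import java.util.Random;

// Minimal sketch of the Monte Carlo Pi approximation performed by each job;
// class and method names are illustrative, not the thesis code.
public class PiJob {
    public static double approximate(long points, long seed) {
        Random rnd = new Random(seed); // the standard generator noted as slow on App Engine
        long inside = 0;
        for (long i = 0; i < points; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++; // point falls inside the unit quarter circle
            }
        }
        // ratio of quarter circle to unit square is pi/4
        return 4.0 * inside / points;
    }

    public static void main(String[] args) {
        System.out.println(approximate(1_000_000, 42L)); // rough estimate of pi
    }
}
```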
Figure 0.22 shows the runtime results of the Pi Approximation algorithm. On
karwendel there is effectively no overhead for transferring jobs and results
because of the fast Infiniband network. As a result, the average computation time
and the total runtime are almost the same for up to eight parallel tasks. Figure
0.23 shows that up to eight parallel tasks an almost linear speedup can be
achieved. Using more than eight parallel tasks results in a slightly increased
total runtime and therefore a drop in speedup caused by load imbalance, since the
karwendel compute nodes have eight processor cores.
In comparison, the App Engine results show a rather high average overhead of
around 700 milliseconds. The average computation time shows a steady, almost
linear speedup as expected. The total runtime also shows a notable speedup of up
to six. The irregularities in the runtime graph are caused by random background
load either on the application servers or on the network. Even though the
overhead caused by request latency is a limiting factor, the algorithm performs
well on Google App Engine. The raw average computation time for 15 parallel
tasks is almost the same on both platforms, even though for a single task the
computation time is almost twice as high on App Engine as on karwendel.
0.12.2 Matrix Multiplication
Matrix multiplication is an algorithm that mainly utilizes addition and
multiplication and therefore reflects the performance of these operations on a
system to a certain degree. Moreover, there is a relatively large amount of data
to transfer, in terms of parameters as well as results. Besides, the caching
behavior of a system is tested, since the algorithm has to iterate over large
arrays that typically will not fit entirely in the CPU cache. For the experiment
two integer square matrices of size 1500 × 1500 were multiplied.
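A row-partitioned version of the multiplication, where each job receives a stripe of rows of A together with the whole of B, might look roughly as follows (an illustrative sketch, not the thesis code):

```java
// Illustrative sketch of a row-partitioned integer matrix multiplication job;
// names and the partitioning scheme are assumptions, not the thesis implementation.
public class MatrixJob {
    // multiplies rows [rowStart, rowEnd) of a with the full matrix b
    public static int[][] multiplyRows(int[][] a, int[][] b, int rowStart, int rowEnd) {
        int n = b[0].length;
        int[][] result = new int[rowEnd - rowStart][n];
        for (int i = rowStart; i < rowEnd; i++) {
            for (int j = 0; j < n; j++) {
                int sum = 0;
                for (int k = 0; k < b.length; k++) {
                    sum += a[i][k] * b[k][j]; // addition/multiplication dominate
                }
                result[i - rowStart][j] = sum;
            }
        }
        return result;
    }
}
```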
Figure 0.24 shows the runtime results of integer matrix multiplication. In terms
of sequential computation time, both systems show similar performance, though
karwendel is slightly faster than App Engine. On karwendel there is again a
rather small average overhead of around 700 milliseconds. The average data
transfer time is also rather low, at around 600 milliseconds, which is again due
to the fast network connection.

Figure 0.24: Runtime analysis of integer Matrix Multiplication.

Figure 0.25 shows that on karwendel a good speedup up to eight parallel tasks
can again be achieved; the maximum speedup, however, is only around four. For
more than eight tasks the runtime increases slightly due to load imbalance.
On App Engine, the overhead for transferring data clearly dominates the
algorithm. The data transfer time is between two and three seconds, and the
average overhead ranges from almost ten down to four seconds. An interesting
effect is that the overhead tends to decrease with more parallel jobs. This is
caused by the fact that a single job request only has one TCP stream to return
data. If multiple job requests are used, the returning of data is automatically
split across multiple TCP streams, which increases the speed of returning data to
the master. The maximum speedup is only around two, as depicted in figure 0.25.
0.12.3 Rank Sort
Rank Sort essentially tests sequential array traversal and integer incrementing
(for incrementing the rank), which typically should be pretty fast. For the
experiment an integer array with 70.000 elements was sorted.

Figure 0.26 shows the results of the runtime analysis using the Rank Sort
algorithm. First of all, surprising is the huge difference in sequential
computation time. For a single task the computation time is around three times
higher on
App Engine than on karwendel. This is most likely due to the poor
addition/subtraction performance of App Engine and the fact that the if clause
in the main loop of the algorithm prevents any kind of loop-unrolling
optimization.

Figure 0.25: Speedup analysis of integer Matrix Multiplication.
The karwendel results show a steady runtime speedup up to six parallel tasks;
the problem size is too small to sustain the speedup up to eight parallel tasks.
Typical again are the low average overhead and the very low data transfer time
due to the fast network links.
In contrast, on App Engine there is again a higher average overhead of around
700 milliseconds, and the data transfer takes about the same time. For 15
parallel tasks a pretty good speedup of over four can be achieved. The algorithm
seems better suited for execution on App Engine than matrix multiplication,
since the ratio between computation and data transfer is weighted more towards
computation.
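The rank computation underlying the algorithm can be sketched as follows (an illustrative sketch, not the thesis code; the tie-break on the index keeps duplicate values in order):

```java
// Minimal sketch of the Rank Sort algorithm: for every element the rank
// (number of smaller elements) is counted, which directly gives its sorted
// position. Names are illustrative, not taken from the thesis implementation.
public class RankSort {
    public static int[] sort(int[] input) {
        int[] sorted = new int[input.length];
        for (int i = 0; i < input.length; i++) {
            int rank = 0;
            for (int j = 0; j < input.length; j++) {
                // the if clause in this main loop hinders loop unrolling
                if (input[j] < input[i] || (input[j] == input[i] && j < i)) {
                    rank++;
                }
            }
            sorted[rank] = input[i];
        }
        return sorted;
    }
}
```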
0.12.4 Mandelbrot Set
The main properties of the Mandelbrot Set generator are, on the one hand, the
small WorkJobs and, on the other hand, the very large results that have to be
transferred back to the master. Note that for this algorithm a relatively small
problem size of 3200 × 3200 pixels was chosen, because a larger problem size
would exceed the maximum allowed size of ten megabytes for the HTTP response.
This results in a completion time of under eight seconds for the single-job
execution.
Figure 0.26: Runtime analysis of the Rank Sort algorithm.

First of all, noticeable is the huge overhead in comparison to the low effective
computation time on App Engine, which makes the algorithm practically infeasible.
The effect that more parallel tasks result in a lower average overhead is even
more noticeable than for the Matrix Multiplication or Rank Sort.
Taking a closer look at the speedup on karwendel (see Figure 0.29), it seems
that the algorithm gains speedup even beyond eight parallel tasks. Besides, an
uneven number of parallel tasks seems to result in a larger total runtime, even
though the average computation time steadily decreases and the overhead is
constant. After analyzing the problem more closely, we found that the
implementation of the algorithm has an inherent load imbalance, even though each
job is assigned an equal number of pixel values to calculate.
The pixel areas are assigned in stripes from the top to the bottom of the image.
Most of the points in the center belong to the Mandelbrot set, which makes these
areas harder to compute, because for points that are not part of the Mandelbrot
set the algorithm can immediately continue with the next point once the escape
condition is met. As a result, jobs working on the center of the image take
longer than those working on the peripheral areas. For one, this explains the
speedup beyond eight parallel tasks, since with more tasks working on the image
the load imbalance has less effect. Secondly, it explains the effect that an
uneven number of jobs results in a higher completion time, since for an uneven
job number there is always a single job responsible for the very center of the
image, which consumes most of the computation time.
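The escape-time iteration responsible for this imbalance can be sketched as follows (a minimal sketch; the iteration limit and the escape radius of 2 are conventional assumptions, not taken from the thesis):

```java
// Sketch of the per-pixel escape-time iteration discussed above; names and
// constants are conventional assumptions, not the thesis implementation.
public class MandelbrotPixel {
    // returns the number of iterations until |z| > 2, or maxIter if the
    // point is assumed to belong to the Mandelbrot set
    public static int iterations(double cRe, double cIm, int maxIter) {
        double zRe = 0.0, zIm = 0.0;
        for (int i = 0; i < maxIter; i++) {
            double zRe2 = zRe * zRe, zIm2 = zIm * zIm;
            if (zRe2 + zIm2 > 4.0) {
                return i; // escape condition met: work stops early outside the set
            }
            zIm = 2.0 * zRe * zIm + cIm; // z = z^2 + c
            zRe = zRe2 - zIm2 + cRe;
        }
        return maxIter; // interior points always consume the full iteration budget
    }
}
```

This is why center stripes are expensive: interior points never escape, so they always burn the full iteration budget, while peripheral points return after a few iterations.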
In order to resolve the load imbalance, pixel lines would have to be assigned
randomly to each job, which however imposes additional management overhead for
tracking the mapping between pixel lines and jobs.

Figure 0.27: Speedup analysis of the Rank Sort algorithm.
0.13 Scalability Analysis
The simple speedup analysis clearly does not favor App Engine, since the
30-second computation limit necessitated the use of rather short tasks with a
correspondingly high relative overhead. In order to test the potential of App
Engine for computing larger problem sizes, we performed a scalability analysis.
Instead of distributing the jobs to one application, the jobs were distributed to
ten different deployed App Engine applications in order to circumvent the
per-minute quotas. As a comparison, again one karwendel node was used with the
same experimental setup.
The problem size was multiplied by the number of parallel tasks, which means the
problem size increases linearly with the number of jobs. For example, using the
Pi Approximation algorithm the problem size is set to N ∗ P, with N being the
number of parallel tasks and P the initial problem size. Under an ideal speedup
the runtime should therefore stay the same while the problem size increases.
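The weak-scaling setup thus amounts to a single formula (names illustrative):

```java
// Weak-scaling problem size used in the scalability analysis: the total work
// grows with the task count, so under ideal speedup the runtime stays flat.
public class WeakScaling {
    static long totalProblemSize(int parallelTasks, long initialSize) {
        return (long) parallelTasks * initialSize; // N * P
    }
}
```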
We used the Pi Approximation algorithm for the Scalability Analysis, since the
algorithm had the best results in the speedup experiment. Besides, the goal of
this experiment is not to analyze algorithm properties but to identify the
limitations of the App Engine framework in terms of peak performance.

Figure 0.28: Runtime analysis of the Mandelbrot algorithm.

The initial problem size was chosen smaller than in the speedup analysis in
order to avoid exceeding the 30-second request deadline. It was set to
180.000.000 points to generate. The number of parallel requests was chosen
between 1 and 25, because for larger numbers the per-minute quotas start to be
reached and App Engine denies further connections.
For an N larger than ten, App Engine started to sporadically abort requests with
the following message in the system log:
Request was aborted after waiting too long to attempt to service your request.
This may happen sporadically when the App Engine serving cluster is under
unexpectedly high or uneven load. If you see this message frequently, please
contact the App Engine team.
The aborted requests do not harm the integrity of the algorithm, since jobs are
rescheduled. However, they produced substantial overhead, since the requests
were not aborted instantly but only around ten seconds after they were issued.
Aborted requests occurred rather randomly, although they are certainly related
to the high load generated by the algorithm, since for fewer than ten parallel
jobs no requests were aborted.
Figure 0.30 shows the results of the scalability analysis. The top diagram shows
the total runtime of the algorithm executed on one karwendel compute node
compared to the App Engine results. The bottom diagram shows the number of
requests aborted by the App Engine frontend, in order to explain the
irregularities in the App Engine runtime.
Figure 0.29: Speedup analysis of the Mandelbrot algorithm (speedup vs. parallel tasks, against linear speedup; karwendel and App Engine).
The results on karwendel show a constant runtime for one up to eight parallel
tasks, as expected, since there is an almost linear speedup up to eight jobs for
the Pi Approximation algorithm. There is a substantial step in runtime between
eight and nine tasks, since the maximum number of processor cores is reached and
the algorithm becomes load imbalanced. For more than nine parallel tasks the
runtime steadily increases proportionally to the problem size.
On App Engine the algorithm scales pretty well, with generally only a slightly
increasing runtime. Most of the irregularities in the runtime are caused by the
overhead induced by aborted requests. For more than 17 parallel tasks App Engine
generally has a lower runtime than karwendel.
Concluding, the scalability analysis showed that using ten slave applications
deployed to Google App Engine, which is equivalent to one free account, we can
attain a peak performance comparable to one compute node of the karwendel
cluster. Moreover, it has to be considered that the standard random number
library itself shows very poor performance on App Engine. More complex Monte
Carlo simulations would require a more sophisticated random number generator
anyway, since the quality of random numbers generated by the standard Java
random number library is in most cases not sufficient for scientific
applications [12].
Figure 0.30: Scalability analysis of the Pi Approximation algorithm (top: total runtime in seconds; bottom: number of aborted requests; both vs. number of parallel tasks).
0.14 Resource Consumption and Cost Estimation
In order to determine the limiting quota resources for an algorithm, it has to
be analyzed which resources are used extensively and will first reach a
per-minute or daily quota limit. The problem size also has an impact on the
limiting resource, since, for example, the ratio between the amount of data to
transfer and the computation time is not the same for different problem sizes.
Resource quotas can be tracked via the administration web interface of the
application.
Tracking the resources of an algorithm on the one hand provides a means to
estimate the possible work throughput under the given quota limitations; on the
other hand, it allows a rough cost estimation for accounts with billing enabled.
Moreover, a more sophisticated system executing algorithms with different
resource consumption could schedule jobs so that the free resources are used
optimally. Typically, however, CPU hours should be the limiting factor, since
algorithms that have to transfer a lot of data will not achieve good performance
results on Google App Engine anyway.
For testing the resource consumption we used the same problem size as for the
algorithm speedup analysis, executed 100 times in a row. We tracked the three
most limiting resources, namely CPU hours, incoming bandwidth and outgoing
bandwidth. Database quotas could be tracked as well; however, the database CPU
usage is already included in the overall CPU hours, the overall storage capacity
of one gigabyte will never be a problem since the database is cleared after each
algorithm run, and the quotas for data sent to and received from the datastore
API are practically unreachable as well.
Resource              Unit         Unit Cost
Outgoing Bandwidth    gigabytes    $0.12
Incoming Bandwidth    gigabytes    $0.10
CPU Time              CPU hours    $0.10

Table 0.9: Resource costs as of 10.01.2011.
Table 0.9 shows the resource units and costs per unit for the measured quotas.
Problem Size    Algorithm     Out Bandwidth     In Bandwidth      CPU Time     Est. Cost
220.000.000     Pi            0 gigabytes       0 gigabytes       1.7 hours    $0.17
1500 × 1500     Matrix        0.85 gigabytes    0.75 gigabytes    1.15 hours   $0.292
70.000          RankSort      0.02 gigabytes    0.01 gigabytes    1.16 hours   $0.032
3200 × 3200     Mandelbrot    0.95 gigabytes    0 gigabytes       0.15 hours   $0.129

Table 0.10: Resource consumption and cost estimation for 100 iterations of the
given problem size.
Table 0.10 lists the resource consumption and the cost estimation of the
algorithms. As expected, the Pi Approximation algorithm is very
computation-heavy and has almost no data to transfer. Matrix Multiplication
makes heavy use of all the resources; besides being a very computation-intense
algorithm, the parameter and result matrices have to be communicated between
master and slave. Surprisingly, the Rank Sort algorithm consumes very little of
the bandwidth resources in comparison to the CPU time consumed, even though the
unsorted array has to be transferred to the slave and the ranks of each element
back to the master. The Mandelbrot set generator is clearly dominated by the
amount of data that has to be transferred back as a result.
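The estimated costs in table 0.10 follow directly from the unit costs in table 0.9; a small sketch of the arithmetic, using the Matrix Multiplication row as an example:

```java
// Sketch of the cost estimation from tables 0.9 and 0.10; names are illustrative.
public class CostEstimate {
    // unit costs from table 0.9 (as of 10.01.2011)
    static final double OUT_PER_GB = 0.12;
    static final double IN_PER_GB = 0.10;
    static final double CPU_PER_HOUR = 0.10;

    static double cost(double outGb, double inGb, double cpuHours) {
        return outGb * OUT_PER_GB + inGb * IN_PER_GB + cpuHours * CPU_PER_HOUR;
    }

    public static void main(String[] args) {
        // Matrix Multiplication row: 0.85 GB out, 0.75 GB in, 1.15 CPU hours
        System.out.println(cost(0.85, 0.75, 1.15)); // matches the $0.292 estimate
    }
}
```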
For the Pi Approximation we can generally state that for $1 around 129 ∗ 10⁹
random points can be sampled, since the algorithm has a linear computation
effort. For the other algorithms it is more difficult to give a generalized
estimate of resource consumption, since the consumption of the different
resources is not linear and scales differently for increasing problem sizes. The
Mandelbrot set algorithm is especially problematic, since the computation effort
varies vastly depending on the area of the complex plane that is considered.
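The points-per-dollar figure follows from the table values (220.000.000 points per run, 100 runs, $0.17 estimated cost); a sketch of the arithmetic with illustrative names:

```java
// Sketch of the points-per-dollar estimate for the Pi Approximation algorithm;
// the input figures are taken from table 0.10, names are illustrative.
public class PointsPerDollar {
    static double pointsPerDollar(double pointsPerRun, double runs, double estCost) {
        return pointsPerRun * runs / estCost;
    }

    public static void main(String[] args) {
        // figures for the Pi Approximation row of table 0.10
        System.out.println(pointsPerDollar(220_000_000d, 100, 0.17)); // ≈ 129 * 10^9
    }
}
```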
However, the general complexity for each of the resources can be stated. Table
0.11 lists the complexity with which each of the resources is consumed by the
algorithms.

Algorithm     Out Bandwidth    In Bandwidth    CPU Time
Pi            O(1)             O(1)            O(n)
Matrix        O(n²)            O(n²)           O(n³)
RankSort      O(n)             O(n)            O(n²)
Mandelbrot    O(n²)            O(1)            O(n²)

Table 0.11: Resource complexity of each algorithm.
Related Work
Most of the work regarding scientific computing in the cloud covers IaaS systems,
since existing algorithms and benchmarks can be easily ported to the cloud
platform. Moreover, the flexibility in terms of code reuse makes them a
preferable choice.
The paper "Scientific Cloud Computing: Early Definition and Experience" [24]
provides an early general overview of Cloud Computing with regard to scientific
computing. It discusses the distinct properties and enabling technologies of
Cloud Computing and provides a comparative study to regular Grid Computing.
Besides, various early cloud providers and their distinct capabilities are
discussed. The paper mainly provides a collection of key cloud properties and a
distinction from other computing paradigms.
There are various papers that perform performance evaluations of different cloud
systems. Most of them focus on Amazon EC2, the most popular IaaS cloud service.
"Performance Analysis of Cloud Computing Services for Many-Tasks Scientific
Computing" [16] provides a performance analysis of four different commercial
cloud providers in terms of parallel scientific computing. For the stated
reasons, all four selected cloud providers are IaaS providers: Amazon EC2 [1],
GoGrid [4], ElasticHosts [3], and Mosso [7]. The main question of the paper is
whether cloud performance is sufficient for MTC-based scientific computing. The
conclusion is that the compute performance of the tested clouds is rather low,
though they might still be a viable alternative to traditional scientific
computing environments for scientists who need resources instantly and
temporarily.
AppScale [11] is a project dedicated to the execution of Google App Engine
applications on Xen-based clusters, including IaaS cloud systems such as
Amazon's AWS/EC2 and Eucalyptus. Moreover, it provides a framework for
researchers to investigate the interactions between PaaS and IaaS systems and
the internal technologies used by PaaS systems such as Google App Engine. The
system basically emulates the App Engine framework and its API. In this thesis
we used the development server to execute App Engine applications on proprietary
hardware; AppScale provides a more sophisticated means to do so and even allows
App Engine applications to be wrapped onto other cloud services such as Amazon
EC2.
Google recently released an open-source library for performing map-reduce-like
operations on the datastore using Google App Engine and task queues [22].
Instead of a low-level service, Google only provides a library that offers a
simple map-reduce implementation based on App Engine task queues. So in
principle the library has the same limitations as a regular App Engine
application and thus also consumes the same amount of resources. The API is
similar to Hadoop's map-reduce API; a Hadoop transition guide is even provided.
So far only the map phase is supported; support for the reduce phase is however
announced.
Conclusions
Cloud computing as a computing paradigm has recently emerged as a topic of high
research interest. It is especially attractive for smaller companies and
research groups that cannot afford expensive infrastructure. Most of the
research regarding scientific computing in the cloud, however, has focused on
IaaS cloud providers.
This thesis focused on investigating the capabilities of Google App Engine, a
Platform as a Service cloud provider, in terms of scientific computing. PaaS
cloud providers do not offer the convenience to execute arbitrary software, which
means programs have to be written according to the provided framework. As a
consequence it imposes various restrictions to the programmer which make the
use for efficient scientific computations rather difficult. Foremost problematic is
the restriction to Java or Python as programming language, since most scientific
programs are written in C or Fortran, which makes porting of existing code
expensive. Moreover the programmer is restricted to a subset of the standard
libraries and is prevented from using arbitrary libraries. As a result algorithms
usually have to be reprogrammed from scratch and libraries have to be available
in source code in order to be used.
Another problem is that the high-level programming languages and the unknown
hardware make algorithm profiling and performance tuning difficult. Besides,
there are many unknown variables, such as random background load in the network,
the App Engine frontend or the application servers, that may influence the
performance of algorithms. It is noticeable that the App Engine framework is
intended for developing web applications, and the architecture and API were
therefore constructed with the typical requirements of a web application in
mind. Various restrictions, such as the 30-second request deadline, that
certainly make sense in the context of a web application therefore have to be
circumvented, and as a consequence various sources of additional overhead arise.
The measurements showed that algorithms with larger amounts of data to trans-
fer, such as Matrix Multiplication, are rather unsuited for execution on App
Engine: the data has to be transferred over a comparatively slow network, so
the data-transfer overhead often dominates the computation. In contrast,
embarrassingly parallel algorithms with very little data to transfer, such as
Monte Carlo simulations, perform considerably better. The scalability analysis
showed that the Pi Approximation algorithm on a single free App Engine account
yields performance comparable to one compute node of the karwendel cluster,
even though the random number library in use performed rather poorly on App
Engine. Moreover, App Engine accounts with billing enabled have more relaxed
per-minute quotas, so an even better peak performance could be achieved.
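The favourable compute-to-communication ratio of the Monte Carlo approach can be made concrete with a small sketch: each slave only has to return a single integer, its hit count, regardless of how many samples it draws. This is an illustrative reimplementation, not the thesis's actual slave code; `count_hits` and `approximate_pi` are assumed names.

```python
import random

def count_hits(samples, seed=None):
    """One worker's share: count random points inside the unit quarter
    circle. Only this single integer travels back to the master, which
    keeps the data-transfer overhead minimal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def approximate_pi(workers, samples_per_worker):
    """Master side: the quarter circle covers pi/4 of the unit square,
    so pi is estimated as 4 * hits / samples over all workers."""
    total_hits = sum(count_hits(samples_per_worker, seed=w)
                     for w in range(workers))
    return 4.0 * total_hits / (workers * samples_per_worker)
```

The seed per worker is only there to make the sketch reproducible; in a real deployment each slave would use an independently seeded generator.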
In terms of cost management, App Engine provides a more fine-grained,
resource-based payment model than other cloud providers. It is, however,
difficult to give an exact cost comparison to other cloud providers in the
context of scientific computing. In addition, Google grants a sizable amount
of free resources to each user, which makes the platform particularly
interesting. Each free account can deploy up to ten applications, each with
its own resource quotas, so every account provides a total of 65 CPU hours as
well as 10 gigabytes of incoming and outgoing bandwidth each day for free.
Assuming that each member of a research group opens and provides an account, a
large amount of resources can be used at no cost. With a more sophisticated
job scheduler, these free resources could be assigned optimally.
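As a back-of-envelope check of the figures above (a hedged sketch; the per-account totals are those stated here and in the quota discussion, and may change over time):

```python
# Daily free resources per account: ten applications at 6.5 CPU hours
# and 1 GB of traffic per direction each, i.e. 65 CPU hours and 10 GB
# per account. These constants mirror the totals stated in the text.
CPU_HOURS_PER_ACCOUNT = 65
BANDWIDTH_GB_PER_ACCOUNT = 10

def free_resources(members):
    """Aggregate daily free CPU hours and bandwidth (GB, per direction)
    for a research group in which each member provides one account."""
    return (members * CPU_HOURS_PER_ACCOUNT,
            members * BANDWIDTH_GB_PER_ACCOUNT)
```

A group of five members would thus control 325 free CPU hours and 50 GB of daily traffic per direction.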
Furthermore, once the Distribution Framework is set up and the algorithms are
correctly implemented, App Engine slave applications can be used in heteroge-
neous conjunction with slave applications executed on other systems. In the
experiments we used a traditional computer cluster executing the development
server as a comparative system. AppScale [11] is a project porting the App
Engine framework to IaaS infrastructures. The Distribution Framework could
therefore even be used with AppScale images deployed to other IaaS cloud
providers, which provides a lot of flexibility in terms of resource usage and
cost management.
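The heterogeneous setup can be sketched as follows. For self-containedness the slaves are modelled as plain callables; in the real framework each entry would be an HTTP endpoint backed by an App Engine application, a development server on a cluster node, or an AppScale deployment. The name `distribute` is an assumption for illustration, not the framework's actual API.

```python
from itertools import cycle

def distribute(work_units, slaves):
    """Round-robin work units over a heterogeneous pool of slaves.

    `slaves` is a list of callables standing in for slave endpoints of
    any kind; the master does not need to know what system executes
    each unit, only where to send it and how to collect the result.
    """
    pool = cycle(slaves)
    return [next(pool)(unit) for unit in work_units]
```

A more sophisticated scheduler would weight the assignment by measured slave throughput instead of plain round-robin, which is where the heterogeneous speeds of App Engine, cluster, and AppScale slaves would matter.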
In conclusion, even though the App Engine framework is definitely not designed
for scientific applications, it can provide a significant amount of flexible
computing resources for free, if the extra effort of porting the algorithms to
App Engine is accepted.
List of Figures
0.1 Request handling architecture of Google App Engine taken from [22] xii
0.2 Compression time needed for the different compression streams. . xxx
0.3 Data size for an integer array filled with random numbers. . . . . xxxi
0.4 Master application architecture. . . . . . . . . . . . . . . . . . . . xxxv
0.5 Activity diagram illustrating the control flow of the slave application xlii
0.6 Typical homogeneous local network topology. . . . . . . . . . . xlix
0.7 Heterogeneous local network topology, with superior master node. l
0.8 Completion time for square matrix multiplication using no shared
data management, shared data management with parallel and
sequential data transfer. . . . . . . . . . . . . . . . . . . . . . . . li
0.9 Completion time for square matrix multiplication using no shared
data management, shared data management with parallel and
sequential data transfer. . . . . . . . . . . . . . . . . . . . . . . . lii
0.10 Illustration of the Monte Carlo pi calculation . . . . . . . . . . liv
0.11 Illustration of the Monte Carlo integration . . . . . . . . . . . lx
0.12 Data dependencies for entry C2,3 of the result matrix. . . . . . . lxvi
0.13 Example of data partitioning in parallel matrix multiplication. . lxvii
0.14 Coloured image of the Mandelbrot set. . . . . . . . . . . . . . . lxxii
0.15 Results of the latency analysis. . . . . . . . . . . . . . . . . . . . xc
0.16 Results of the bandwidth test. . . . . . . . . . . . . . . . . . . . . xci
0.17 Microbenchmark results. . . . . . . . . . . . . . . . . . . . . . . . xciii
0.18 Computation time results of Scalar mul/div and the Fibonacci
number generator. . . . . . . . . . . . . . . . . . . . . . . . . . . xciv
0.19 Fibonacci test requests. . . . . . . . . . . . . . . . . . . . . . . . xcv
0.20 Fibonacci test requests mapped to their respective instance. . . . xcvi
0.21 Results of the cache hierarchy test. . . . . . . . . . . . . . . . . xcviii
0.22 Runtime analysis of the Pi Approximation algorithm. . . . . . . c
0.23 Speedup analysis of the Pi Approximation algorithm. . . . . . . . c
0.24 Runtime analysis of integer Matrix Multiplication. . . . . . . . . cii
0.25 Speedup analysis of integer Matrix Multiplication. . . . . . . . . ciii
0.26 Runtime analysis of the Rank Sort algorithm. . . . . . . . . . . . civ
0.27 Speedup analysis of the Rank Sort algorithm. . . . . . . . . . . . cv
0.28 Runtime analysis of the Mandelbrot algorithm. . . . . . . . . . . cvi
0.29 Speedup analysis of the Mandelbrot algorithm. . . . . . . . . . . cvii
0.30 Scalability analysis of the Pi Approximation algorithm. . . . . . cviii
List of Tables
0.1 Free quotas for general resources (as of 20.09.2010) [6]. . . . . . . xxii
0.2 General Datastore quotas (as of 20.09.2010) [6]. . . . . . . . . . . xxiii
0.3 Daily Datastore quotas (as of 20.09.2010) [6]. . . . . . . . . . . . xxiv
0.4 Properties overview of App Engine and local slaves. . . . . . . . xxxiv
0.5 Example illustrating the concept of ranks. . . . . . . . . . . . . . lxxx
0.6 Hardware specification of the karwendel cluster. . . . . . . . . . . lxxxviii
0.7 Hardware specification of the zid-gpl server. . . . . . . . . . . . . lxxxviii
0.8 Maximum and minimum transfer speeds for different data chunk
sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xci
0.9 Resource costs as of 10.01.2011. . . . . . . . . . . . . . . . . . cix
0.10 Resource consumption and cost estimation for 100 iterations of
the given problem size. . . . . . . . . . . . . . . . . . . . . . . . . cix
0.11 Resource complexity of each algorithm. . . . . . . . . . . . . . . cx
Bibliography
[1] Amazon Elastic Compute Cloud (Amazon EC2), http://aws.amazon.com/
de/ec2/.
[2] App Engine Developers Guide, http://code.google.com/intl/de-DE/
appengine/docs/.
[3] ElasticHosts cloud service, http://www.elastichosts.com/.
[4] GoGrid cloud-server hosting, http://www.gogrid.com/.
[5] Google App Engine Framework, http://code.google.com/intl/de-DE/
appengine/.
[6] Google App Engine Quotas, http://code.google.com/intl/de-DE/
appengine/docs/quotas.html.
[7] Mosso cloud service, http://www.mosso.com/.
[8] The JRE White List, http://code.google.com/intl/de-DE/appengine/
docs/java/jrewhitelist.html.
[9] RFC 2616 - Hypertext Transfer Protocol – HTTP/1.1, 1999.
[10] Felician Alecu, Parallel Rank Sort, 2005.
[11] Navraj Chohan, Chris Bunch, Sydney Pang, Chandra Krintz, Nagy
Mostafa, Sunil Soman, and Rich Wolski, AppScale Design and Implemen-
tation, 2009.
[12] P.D. Coddington, J.A. Mathew, and K.A. Hawick, Interfaces and Imple-
mentations of Random Number Generators for Java Grande Applications,
1999.
[13] Alberto Gotta, Francesco Potorti, and Raffaello Secchi, An Analysis of
TCP Startup over an Experimental DVB-RCS Platform, 2006.
[14] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Intro-
duction to Parallel Computing, 2003.
[15] Geir Gunderson and Trond Steihaug, Data Structures in Java for Matrix
Computations, 2002.
[16] Alexandru Iosup, Simon Ostermann, Nezih Yigitbasi, Radu Prodan,
Thomas Fahringer, and Dick Epema, Performance Analysis of Cloud Com-
puting Services for Many-Tasks Scientific Computing, 2010.
[17] Malvin H. Kalos and Paula A. Whitlock, Monte Carlo Methods, 2008.
[18] Qusay H. Mahmoud, Compressing and Decompressing Data Using Java
APIs, http://java.sun.com/developer/technicalArticles/Programming/
compression/, 2002.
[19] Benoit B. Mandelbrot, Fractals and Chaos: The Mandelbrot Set and Be-
yond, 2004.
[20] Peter Mell and Tim Grance, Definition of Cloud Computing v15, NIST,
2009.
[21] Igor Ostrovsky, Gallery of Processor Cache Effects, http://igoro.com/
archive/gallery-of-processor-cache-effects/.
[22] Dan Sanderson, Programming Google App Engine, 2009.
[23] Steven S. Skiena, The Algorithm Design Manual, 1997.
[24] Lizhe Wang and Gregor von Laszewski, Scientific Cloud Computing: Early
Definition and Experience, 2008.