
openModeller

A framework for species modeling

Fapesp process: 04/11012-0

Partial Report #3 (April 2007 – March 2008)

Index

INTRODUCTION
OBJECTIVES
SUMMARY OF MAIN ACTIVITIES AND RESULTS
  Locality data component
    Locality data server
    Clients (drivers) that read locality data
    Data cleaning
  Environmental data component
  Pre-analysis component
  Modeling component
  Post-analysis component
  Desktop interface
  Web interface
  Integration with TerraLib
  Evaluation of parallelization techniques and analysis of parallel algorithms for biodiversity data analysis
  Initial implementation using services architecture
  Other developments
    Model repository
  Study Cases
  Seminars
  Training
  Publications
    Journal papers
    Conference papers
    Other papers
    In preparation
GENERAL COMMENTS
ANNEX 1. CHANGELOG
ANNEX 2. ACTIVITIES RELATED TO THE SPECIESLINK NETWORK
  Centralized data repository
  Centralized query
  Sensitive data
  Future developments
ANNEX 3. REPORT ON A TOOL TO ANALYZE DATA ON SPECIES DISTRIBUTION USING PARALLEL PROCESSING
ANNEX 4. PARALLEL VERSIONS OF THE PROJECTION FOR COMPUTER CLUSTERS
ANNEX 5. MANUAL FOR THE INSTALLATION OF A SERVICES PLATFORM


Introduction

The goal of this project is to develop a framework to facilitate the work of scientists in predictive modeling of species distribution. The four-year project, funded by Fapesp, involves three institutions: CRIA (Centro de Referência em Informação Ambiental), Poli (Escola Politécnica da USP), and INPE (Instituto Nacional de Pesquisas Espaciais). This report summarizes the activities carried out during the project’s third year, from April 2007 to March 2008.

Objectives

This project’s original aim is to develop a flexible and robust modeling framework to predict species distribution. The main objectives are:

• Develop a component-based modeling framework with reusable modules compliant with web services technology

• Enable use and comparison of different modeling algorithms through the same framework

• Enable modeling pre-analysis and post-analysis using specific components

• Develop multiple interfaces (web, desktop, command line, web service)

• Facilitate access to distributed biological and environmental data networks

• Allow the use of high-performance computing for species distribution modeling

• Carry out use cases to test and validate the framework

Figure 1 illustrates the original proposal. The project addressed the following components: locality data, environmental data, pre-analysis, modeling, post-analysis, and interfaces. Study cases that test the framework, and training, are other important activities of this project.

[Figure 1 diagram. Labels: "Original architecture"; data sources GBIF, speciesLink and GEOSS; Locality component with specimen data (occurrence points), data cleaning and georeferencing; Environmental component with environmental data (layers); Pre-analysis; Modeling; Post-analysis; Algorithms; Interface (desktop, web, console, SOAP server, SWIG wrapper, others); USER.]

Figure 1. Proposed modeling framework architecture


Summary of main activities and results

During the third year of the project, 8 doctoral students, 3 master’s students, 6 undergraduate students and 6 technical training fellows were directly involved with the project, in addition to staff members from each institution (CRIA, Poli, and INPE).

There have been five new releases of the openModeller framework since March 2007 (the complete changelog can be found in annex 1). Each release included bug fixes and new features developed during the project. New features included new algorithms (Support Vector Machines and Envelope Score), a new infrastructure for unit testing, ROC curves to measure algorithm performance, jackknifing to help choose environment layers in modeling experiments, and a new unified compiling strategy for all platforms. Both the library and the desktop interface now have packages for all three major platforms: GNU/Linux, Mac OS X and Windows.

The latest version of openModeller Desktop (1.0.6) averages 600 downloads per month, which confirms the expectation at the beginning of the project that it would become the most popular interface to the library.

The computer cluster was purchased, installed and is available to researchers. Different user interfaces are being developed to allow access in a transparent and friendly way. Research on parallelizing algorithms such as GARP has been conducted and a first version of this algorithm is available for testing. Soon it will be available to end users.

Publications since the last report include 7 journal papers and 13 conference papers, among others that are in preparation.

The speciesLink network [1], which constitutes the main source of species occurrence data for the Brazilian territory and is the project from which openModeller originated, almost doubled the number of records freely and openly available on-line, reaching approximately 2.3 million records of voucher specimens in February 2008.

There were also two new releases of TerraLib and TerraView that are already compatible with the latest version of the openModeller framework.

The publicity gained from our regular software releases and interactions with other individuals and institutions contributed to the use of openModeller by other interfaces and tools, such as the GBIF portal. The number of international publications that mention openModeller is continuously increasing [2]. The number of potential areas for future collaboration with the wider scientific community now includes institutions such as NASA.

Locality data component

Locality data server

The fact that CRIA is also responsible for the development of the speciesLink network [3], which today represents the main source of on-line species occurrence data in Brazil, is an opportunity to develop a locality data service for openModeller. The development of the network is also crucial for study cases.

[1] http://splink.cria.org.br/
[2] http://openmodeller.sourceforge.net/index.php?option=com_content&task=blogcategory&id=10&Itemid=6
[3] http://splink.cria.org.br/


For this purpose, over the last year speciesLink changed its architecture from an online dynamic distributed search system to a centralized search approach (see annex 2). A system was designed and developed that periodically searches the network, harvests all new and modified records, and stores them in a central database. This has improved performance and makes it easier to serve all speciesLink records to other applications, as there is now a single endpoint. The newest DarwinCore [4] version was adopted to map the speciesLink central database fields. The next step will be to install TAPIR [5] compatible data provider software to serve this data to openModeller. This development was co-funded by the JRS Biodiversity Foundation.

Clients (drivers) that read locality data

The openModeller locality component involves a set of generic clients (drivers) that can read locality data from different sources using different formats. Currently there are drivers for TAB-delimited text files and TerraLib databases. To access the speciesLink data service, a new driver is being developed to provide direct access to any TAPIR/DarwinCore provider.

Data cleaning

Ecological niche modeling requires high quality data, so within the openModeller framework the idea is to develop an application where users may test data quality. During the first two years of the project, work was concentrated on developing a more universal application (data tester). The idea originated in a project that was carried out in collaboration with GBIF to design and develop an extensible Java framework for performing tests against XML data sets and reporting on data quality.

At the same time, the development of data-cleaning applications is a continuous study for the speciesLink network. For the last 5 years, CRIA’s staff has been developing tools to find data inconsistencies and typing errors and to produce on-line reports for each data provider (biological collection). This field of research has proved to be very dynamic, and through the constant interaction with collections (the speciesLink network today has more than 130 collections) new features are frequently added.

In practice, the team was developing applications for the speciesLink network, which is the main source of species occurrence data on the Internet for Brazil, and then rewriting the tests to be used by data tester.

This did not prove to be efficient, so we changed our strategy. Last year, work was concentrated on making the speciesLink network more efficient through a centralized data repository and on adding new features to the data cleaning applications. Plans are to develop a Web interface where users can submit data sets and retrieve reports with data quality evaluations.

The following data fields are currently being checked:

• Names

• Dates

• Locality

• Consistency of collector, collector number and species name between herbaria

[4] http://www.tdwg.org/activities/darwincore/
[5] http://www.tdwg.org/activities/tapir


Brazil does not have a validated list of species names. The data cleaning tools check names against the Catalogue of Life [6], but in a recent study we estimated that less than 5% of the approximately 204 thousand unique names that are part of the speciesLink network are included in the Catalogue of Life. For this reason a phonetic database of the names in the network was created to help find typing errors. When a name is phonetically equal to another but written differently, it is presented as a “suspect record”. The data cleaning report returns the following information:

• Name (family, genus, species, subspecies),

• Number of records written this way in the collection,

• Number of records written this way in the network,

• Status of this name in the Catalogue of Life, and

• Catalogue number.
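The phonetic check described above can be sketched as follows. The report does not say which phonetic algorithm the speciesLink tools use, so this minimal sketch assumes classic Soundex as a stand-in; the function names `soundex`, `phonetic_key` and `suspect_names` are hypothetical and not part of any speciesLink code.

```python
from collections import defaultdict

# Soundex digit classes; vowels and h, w, y carry no digit.
_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word):
    """Classic four-character Soundex key (an assumed stand-in for the phonetic key)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (key + "000")[:4]


def phonetic_key(name):
    """Key a multi-word scientific name by the Soundex code of each word."""
    return " ".join(soundex(part) for part in name.split())


def suspect_names(names):
    """Group names that sound alike but are spelled differently."""
    groups = defaultdict(set)
    for name in names:
        groups[phonetic_key(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = ["Tabebuia ochracea", "Tabebuia ocracea", "Tabebuia alba"]
suspects = suspect_names(names)
```

Here the two spellings of the first name share a phonetic key and are returned together as one suspect group, while the third name stands alone and is not reported.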

Geographic error detection includes checking whether the given coordinates are consistent with the administrative regions provided by the original records, for example whether the latitude/longitude values fall within the registered municipality. The system also checks whether the data on country, state and municipality are coherent. Records with latitude or longitude equal to zero, and records whose coordinates fall in the sea, are also reported. If the collection is of terrestrial species, then all coordinates that fall in the sea are errors. Another tool to detect possible errors checks for outliers in latitude and longitude using a statistical method based on a reverse-jackknifing procedure. Additionally, the system assigns latitude/longitude values to non-georeferenced records based on the official coordinates of municipalities in Brazil.
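Two of the checks above (zero coordinates and consistency with the registered municipality) can be illustrated with a short sketch. It assumes a rectangular bounding box per municipality, which is a simplification of whatever boundary data the real system uses; the function name and the box values are hypothetical. The sea test and the reverse-jackknife outlier test are not shown.

```python
def coordinate_flags(lat, lon, municipality_bbox):
    """Return the reasons why a latitude/longitude pair looks suspect.

    municipality_bbox is (min_lat, min_lon, max_lat, max_lon); a bounding box
    is an assumed simplification of the municipality boundary.
    """
    flags = []
    if lat == 0 or lon == 0:
        flags.append("zero coordinate")
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("out of range")
    min_lat, min_lon, max_lat, max_lon = municipality_bbox
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        flags.append("outside municipality")
    return flags


# Illustrative box only, not official boundary data.
example_box = (-23.1, -47.3, -22.7, -46.8)
ok = coordinate_flags(-22.9, -47.06, example_box)
bad = coordinate_flags(0.0, -47.06, example_box)
```

A record with no flags passes these two checks; any non-empty list would be reported as a suspect record.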

The system also checks the date of the record. Each collection reports the date of its oldest record, so the system compares this date with the date given for each record and also compares the date of the record with the date it was last updated. Record dates prior to the oldest date reported by the collection, and record dates that are more recent than the date the record was updated, are presented as “suspect records”.
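The two date rules can be written down directly. This is a minimal sketch; the function name and the flag texts are hypothetical, and real records would need parsing of partial or free-text dates, which is not shown.

```python
from datetime import date


def date_flags(record_date, last_updated, oldest_in_collection):
    """Apply the two date rules described in the text to one record."""
    flags = []
    if record_date < oldest_in_collection:
        flags.append("earlier than the oldest record reported by the collection")
    if record_date > last_updated:
        flags.append("later than the last update of the record")
    return flags


oldest = date(1890, 5, 1)
clean = date_flags(date(1975, 3, 2), date(2007, 10, 1), oldest)
too_old = date_flags(date(1790, 3, 2), date(2007, 10, 1), oldest)
in_future = date_flags(date(2008, 1, 15), date(2007, 10, 1), oldest)
```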

The last feature is of particular interest to herbaria, which commonly exchange duplicate exsiccates. Common fields for duplicates are the collector’s name and number. The system assumes that if records from different herbaria have the same collector’s name and number, they are duplicates and therefore should have the same species name. If they do not, they are presented as “suspect records”. The results are presented as a table where duplicate records are shown with the differing names, the name of the specialist responsible for each identification, and the identification date. This way, herbaria can benefit from specialists’ visits to other herbaria.
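The duplicate rule amounts to grouping records by collector name and number and reporting groups that span several herbaria but disagree on the species name. The sketch below assumes records are plain dictionaries with the hypothetical keys `herbarium`, `collector`, `number` and `species`, and that matching collector names only needs case and whitespace normalization, which is likely optimistic for real data.

```python
from collections import defaultdict


def suspect_duplicates(records):
    """Return groups of presumed duplicates whose species names disagree."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["collector"].strip().lower(), rec["number"].strip())
        groups[key].append(rec)
    suspects = {}
    for key, recs in groups.items():
        herbaria = {r["herbarium"] for r in recs}
        names = {r["species"] for r in recs}
        # Duplicates are records from different herbaria; flag them only
        # when they carry more than one species name.
        if len(herbaria) > 1 and len(names) > 1:
            suspects[key] = recs
    return suspects


records = [
    {"herbarium": "A", "collector": "Silva, J.", "number": "1023", "species": "Tabebuia alba"},
    {"herbarium": "B", "collector": "silva, j.", "number": "1023", "species": "Tabebuia aurea"},
    {"herbarium": "B", "collector": "Souza, M.", "number": "17", "species": "Inga vera"},
    {"herbarium": "C", "collector": "Souza, M.", "number": "17", "species": "Inga vera"},
]
flagged = suspect_duplicates(records)
```

Only the first pair is flagged: the second pair is a duplicate with a consistent name.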

Environmental data component

In the future, we envision a system that will serve environmental layers as a service, just as GBIF and the speciesLink network serve species occurrence points. Such a system is under discussion [7], and a protocol called Web Coverage Service (WCS) is already available for this purpose. New versions of GDAL can optionally access remote rasters through WCS. The corresponding driver was compiled and tested, and it appeared to work properly. Since GDAL can be used by openModeller, this increases the number of environmental data sources that can be accessed in a distributed data scenario. Initial tests were limited by the small number of WCS servers available. To overcome this limitation and gain more knowledge and experience with WCS, a new WCS server is being developed on top of TerraLib.

[6] http://www.catalogueoflife.org/search.php
[7] see GEO and GEOSS at http://www.earthobservations.org/

CRIA is also investigating ways to manage the large number of environmental layers accumulated over the years. Each layer represents a different variable and is stored as an individual file (or directory) in its own raster format. The first attempt was to create a metadata file containing keywords for each layer. A data harvester program was then developed to read all metadata files and store this information in a relational database so that queries could be performed. A prototype web interface was also developed to display all keywords as tag clouds, with incremental filters being applied after each keyword selection. This approach is still being evaluated and may eventually be used in the openModeller Web interface.

Pre-analysis component

A jackknife-based procedure is the first technique implemented in openModeller to help identify the most relevant environment layers to use in each modeling experiment. Jackknife is currently implemented as a separate class that expects an algorithm as a parameter. The specified algorithm generates a model for different subsets of layers (each time excluding one of the layers from the complete set), and the results are tested with an independent set of points. Jackknife calculates bias and variance estimates of statistical parameters (we are currently using model accuracy). In addition, it measures the contribution of each variable (environment layer) to the chosen parameter estimate (model accuracy). This method is still being tested in order to find the best openModeller algorithm(s) for this purpose.
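The leave-one-layer-out procedure can be sketched with the textbook jackknife formulas for bias and variance. This is not the openModeller class: `accuracy_of` is a hypothetical callable standing in for "build a model with these layers and test it on independent points", and measuring a layer's contribution as the drop in accuracy when it is excluded is an assumption made for the example.

```python
def jackknife_layers(layers, accuracy_of):
    """Leave-one-layer-out jackknife of a model accuracy estimate."""
    n = len(layers)
    if n < 2:
        raise ValueError("at least two layers are needed")
    full = accuracy_of(layers)
    # One estimate per excluded layer.
    leave_one_out = [accuracy_of(layers[:i] + layers[i + 1:]) for i in range(n)]
    mean = sum(leave_one_out) / n
    bias = (n - 1) * (mean - full)
    variance = (n - 1) / n * sum((a - mean) ** 2 for a in leave_one_out)
    # Assumed contribution measure: accuracy lost when the layer is removed.
    contribution = {layer: full - a for layer, a in zip(layers, leave_one_out)}
    return full, bias, variance, contribution


# Toy stand-in for model building and testing: each layer adds a fixed gain.
GAIN = {"temperature": 0.3, "rainfall": 0.2, "altitude": 0.1}


def toy_accuracy(layers):
    return 0.3 + sum(GAIN[name] for name in layers)


full, bias, variance, contribution = jackknife_layers(list(GAIN), toy_accuracy)
```

With the toy function, the layer with the largest gain also shows the largest contribution, which is the ranking a user would consult when choosing layers.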

A second technique implemented in openModeller is the Chi-square method described in Li, L., et al. (2006) "An Integrated Bayesian Modelling Approach for Predicting Mosquito Larval Habitats". This method also expects a set of layers as input. For each pair of layers a contingency table is built. This table records the number of occurrences in both layers, with the layer values categorized into classes. The Chi-square test is applied at a significance level of 0.05. The algorithm returns, for each variable, the number of layers with which it is correlated at the 0.05 significance level.
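The pairwise test can be illustrated with Pearson's chi-square statistic on a contingency table of class labels. This is a minimal sketch, not the openModeller implementation: it assumes the layer values at the occurrence points have already been categorized, uses a small table of 5% critical values instead of a statistics library, and all function names are hypothetical.

```python
# Upper 5% critical values of the chi-square distribution, by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def contingency(classes_a, classes_b):
    """Count occurrences for every pair of classes from two layers."""
    rows = sorted(set(classes_a))
    cols = sorted(set(classes_b))
    table = [[0] * len(cols) for _ in rows]
    for a, b in zip(classes_a, classes_b):
        table[rows.index(a)][cols.index(b)] += 1
    return table


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def correlated(classes_a, classes_b):
    """True when independence is rejected at the 0.05 level."""
    stat, dof = chi_square(contingency(classes_a, classes_b))
    if dof == 0:  # a layer with a single class cannot be tested
        return False
    return stat > CRITICAL_05[dof]  # only 1 to 4 degrees of freedom are tabulated


def correlation_counts(layers):
    """For each layer, count the other layers it is correlated with."""
    names = list(layers)
    counts = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if correlated(layers[a], layers[b]):
                counts[a] += 1
                counts[b] += 1
    return counts


layers = {
    "temperature": ["low"] * 10 + ["high"] * 10,
    "rainfall": ["dry"] * 10 + ["wet"] * 10,
    "soil": ["a", "b"] * 10,
}
counts = correlation_counts(layers)
```

In this toy data temperature and rainfall change together, so each is counted as correlated with one other layer, while soil is independent of both.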

A generic API to interact in the same way with different pre-analysis techniques involving input layers is still being studied and discussed. The implementation is planned for the final year of the project.

Modeling component

openModeller changed its compilation system to use CMake [8] so that the same build system could be used across all platforms. This change allows openModeller to be more easily deployed on the three major platforms: GNU/Linux, Mac OS X and Windows. The adoption of a CMake build system for all platforms significantly reduces maintenance, the complexity of the code base, and duplication of effort. Additionally, for GNU/Linux two packages are now generated in each release using the two main formats: Red Hat Package Manager (rpm) and Debian (deb).

Considering how frequently different developers make changes to the library, one of the efforts during the last year was to develop a new infrastructure for unit testing. Unit tests provide a way to check that specific parts of the source code are working as expected. Several C++ frameworks for unit tests were investigated before deciding to use CxxTest [9] for the openModeller library. This process was documented in the openModeller Wiki [10]. A set of unit tests was developed with the chosen framework, although they do not yet cover the entire library. Now that we have documentation on how to write, compile and run unit tests, the plan is to gradually increase the number of tests and to follow a test-driven development strategy for future work on the library.

[8] Cross Platform Make: http://www.cmake.org/

Two new algorithms are available in openModeller: Support Vector Machines (SVM) and Envelope Score. SVM is a machine learning technique based on concepts from statistical learning theory. SVMs have been used in many different applications that involve pattern recognition. Recently, SVM has also been applied to the problem of creating species’ distribution models, but there is still considerable scope for further research. SVMs are known for their good generalization ability and robustness to high dimensional datasets. The openModeller implementation of SVM was carried out in a partnership with the “Instituto de Ciências Matemáticas e de Computação” [11] of the University of São Paulo.

Envelope Score is a lax bioclimatic envelope algorithm where the probability of presence at a specific point is proportional to the number of environment layers whose envelope (the min-max range calculated during model creation) contains the corresponding environment value for the point. The primary motivation for implementing Envelope Score was to generate more meaningful models in paleo-climate scenarios, where the traditional Bioclim algorithm produces overly constrained models. This work was done in collaboration with Chris Yesson from the University of Reading.
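The description above translates almost directly into code. The following is a minimal sketch rather than the openModeller source: the function names are hypothetical, and dividing the count by the number of layers (so that the score lies between 0 and 1) is an assumed normalization of "proportional to".

```python
def fit_envelopes(training_samples):
    """Compute the min-max range of each layer over the presence points.

    training_samples is a list of tuples, one per presence point,
    holding one environment value per layer.
    """
    return [(min(values), max(values)) for values in zip(*training_samples)]


def envelope_score(envelopes, point):
    """Fraction of layers whose envelope contains the value at the point."""
    inside = sum(1 for (low, high), value in zip(envelopes, point) if low <= value <= high)
    return inside / len(envelopes)


# Two layers (for instance temperature and rainfall) at three presence points.
presences = [(20, 1000), (25, 1500), (22, 1200)]
envelopes = fit_envelopes(presences)
score_inside = envelope_score(envelopes, (23, 1100))
score_partial = envelope_score(envelopes, (23, 1600))
score_outside = envelope_score(envelopes, (30, 500))
```

Unlike a strict envelope, a point that falls outside one layer's range still receives a non-zero score, which is what makes the algorithm lax.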

The Aquamaps algorithm already mentioned in the last report (implemented in collaboration with the Incofish [12] project) is currently being updated to incorporate recent changes made by the authors. After validation, this new version should be officially released as part of openModeller to become an alternative for modeling marine organisms. Another change in existing algorithms was the inclusion of the Chebyshev metric in Environmental Distance.
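For reference, the Chebyshev metric is the largest absolute difference over all dimensions. The sketch below shows only the metric itself on two vectors of environment values; how Environmental Distance applies it internally is not covered by this report, and the function name is hypothetical.

```python
def chebyshev(a, b):
    """Chebyshev distance: the largest per-dimension absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


distance = chebyshev((1, 5, 3), (4, 1, 3))
```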

Another two algorithms are planned for the final year of this project. The first will be based on Maximum Entropy (MaxEnt) techniques. Maximum entropy has been successfully applied in many different areas including species’ distribution models. The second new algorithm will be based on Neural Networks using Multilayer Perceptron with Backpropagation. When available, both algorithms will be compared with the existing ones.

Post-analysis component

The Receiver Operating Characteristic (ROC) curve and its associated measure (AUC - area under the curve) were implemented as part of the openModeller library. ROC curve data and AUC are now automatically calculated after model creation using command-line tools, Web Service calls, and the Desktop interface (which includes a graphical display of the curve). When no absence points are used, openModeller generates background points to calculate the curve. Besides automatic calculation for training data, the same class can be used for any extrinsic tests.
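As an illustration of the two quantities, the sketch below computes ROC points by sweeping a threshold over the model scores, and the AUC as the probability that a randomly chosen presence point scores higher than a randomly chosen absence (or background) point, counting ties as one half. This is a standard formulation, not necessarily the one used in the openModeller class, and the function names are hypothetical.

```python
def roc_points(presence_scores, absence_scores):
    """(false positive rate, true positive rate) pairs for decreasing thresholds."""
    thresholds = sorted(set(presence_scores) | set(absence_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in presence_scores) / len(presence_scores)
        fpr = sum(s >= t for s in absence_scores) / len(absence_scores)
        points.append((fpr, tpr))
    return points


def auc(presence_scores, absence_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence_scores) * len(absence_scores))


presence = [0.9, 0.8, 0.4]
absence = [0.5, 0.3, 0.1]
curve = roc_points(presence, absence)
area = auc(presence, absence)
```

In this example eight of the nine presence/absence pairs are ranked correctly, so the area is 8/9.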

[9] http://cxxtest.sourceforge.net/
[10] http://openmodeller.cria.org.br/wikis/om/UnitTesting
[11] http://www.icmc.usp.br/
[12] for more about Incofish see http://www.incofish.org/


A new method for model validation should be incorporated into the Web Services interface during the last year of the project. This method will allow any number of external model validations using any of the available methods (AUC, accuracy or others) to be performed remotely.

Desktop interface

Two additional releases (1.0.5 & 1.0.6) were made and a third is planned. The current generation has numerous improvements, including:

• The running of experiments is now threaded so the application remains responsive and interactive during the running of an experiment. This includes the ability to browse individual model outputs while the experiment run is still underway.

• For model preparation, the data fetcher tools have been updated: fetching data from the new GBIF REST interface is now possible. In addition, users can specify a minimum size for search results. Lastly, the data fetcher tools have been re-implemented as a more user-friendly wizard interface.

• Version 1.0.6 now displays a ROC graph and AUC score for each model. When preparing a model, the use of absence data is now supported. In addition, the visualization of model outputs in the mapping tab shows absence and presence data with separate symbology (by default red for absence and green for presence). The environmental values at each presence or absence point are now presented in a new table tab which has been optimized for large datasets.

• The model visualization now supports the overlay of a user defined 'context layer' (vector geographic boundaries) to aid users in placing the model in an appropriate geographic context.

• Once an experiment is completed, users can visualize the models in a new thumbnail view with options to sort by algorithm or by taxon. This, for example, facilitates easy visual comparison of different algorithm outputs for the same taxon.

• The addition of a new post processing tool allows users to create 'hotspot' maps and 'consensus' maps. Hotspot maps aggregate the outputs from two or more models using a specified probability threshold. The result is a raster showing the number of taxa predicted present per raster cell. The consensus map tool allows the outputs of multiple algorithms for the same taxon to be combined to create a single unified raster where the value in each cell represents the number of algorithms that predict presence for that cell.

• Regular updates were made to ensure that the openModeller Desktop application remains current and compatible with the most recent version of the openModeller library.

• The creation of a detailed unit testing framework using QTest and CTest with similar goals in mind to the test framework being developed for the openModeller library.
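The hotspot and consensus maps mentioned in the list above rest on the same per-cell operation: count how many input rasters predict presence in each cell. A minimal sketch follows, assuming rasters are equally sized nested lists of probabilities and that "predicts presence" means a value at or above the threshold; the function name is hypothetical and this is not the Desktop tool's code.

```python
def count_presence(rasters, threshold):
    """Per cell, count the rasters whose value reaches the threshold."""
    n_rows, n_cols = len(rasters[0]), len(rasters[0][0])
    return [
        [sum(1 for raster in rasters if raster[i][j] >= threshold) for j in range(n_cols)]
        for i in range(n_rows)
    ]


model_a = [[0.9, 0.2], [0.6, 0.1]]
model_b = [[0.7, 0.8], [0.3, 0.1]]
counts = count_presence([model_a, model_b], threshold=0.5)
```

If the inputs are models of different taxa, the result is a hotspot map (number of taxa predicted present per cell); if they are outputs of different algorithms for one taxon, it is a consensus map (number of algorithms predicting presence per cell).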

Web interface

Three new prototype Web interfaces involving openModeller were developed during the last year. Two of them (one in PHP and the other in Flex) were outcomes of the Biogeo Interoperability Workshop [13]. This workshop was organized by a specific task group from the TDWG Geospatial Interest Group [14] and hosted by CRIA. The aim of this workshop was to test biodiversity and geospatial protocols and data standards and to investigate how they could interoperate. This was done by developing two demo applications, which consisted of Web interfaces making use of specific Web services. One of the services to be tested was based on the openModeller Web Services interface (OMWS) developed by the openModeller team to facilitate remote modeling. The final report of the workshop [15] includes comments about OMWS. Both applications developed during the workshop are Web interfaces that can generate niche models.

The other prototype was produced in a partnership between CRIA, University of Colorado and GBIF. The result was also a Web interface making use of specific services to produce niche models [16]. However, one of the objectives was to develop a Java library to access the modeling service so that GBIF could use it. GBIF is incorporating this functionality into their data portal with a scheduled release in April 2008. Integration into the GBIF portal will give a very wide audience access to openModeller.

These prototypes, as well as the first Web interface developed as part of this project, offered a good opportunity to learn about this kind of interface and to test ideas. However, we feel that such an interface should also be able to interact with openModeller as a local library, which would require C++ or Python as the programming language. A final, full-featured Web interface will therefore be developed as part of this project during its final year.

Integration with TerraLib

Two new versions of TerraLib and TerraView were released during the last year: 3.1.4 and 3.2.0. The openModeller driver to access TerraLib databases and the TerraView plugin are currently being updated for compatibility with version 3.2.0 of TerraLib and TerraView. In this version TerraLib is now built as a dynamic library, which will help solve previous integration issues with the openModeller library. TerraView and TerraLib version 3.2.0 allow the integration of new data sources that are external to the TerraLib database: shapefiles, data collected from WMS servers and third party TerraLib databases. This will expand the capabilities of using distributed occurrence and environmental data for TerraView/openModeller users.

TerraView 3.2.0 now includes sample point generation functionality. It allows users to generate points over the layer of occurrence and/or environmental data in a random or stratified way. These points can be used to get values from modeling results and also for post-analysis operations.

The integration between R and TerraLib is also being updated to the latest version of TerraLib.

[13] http://www.tdwg.org/homepage-news-item/article/geointeroperability-workshop-outcomes/
[14] http://www.tdwg.org/activities/geospatial/
[15] http://wiki.tdwg.org/twiki/bin/viewfile/Geospatial/InteroperabilityWorkshop1?rev=1;filename=BioGeoSDIreport.pdf
[16] http://dbmuseblade.colorado.edu/gbiftestbed/


Evaluation of parallelization techniques and analysis of parallel algorithms for biodiversity data analysis

The following tasks were carried out to improve the performance of openModeller by applying parallelization techniques:

• Software installation and configuration in the new cluster environment.

• Development of a portal for job submission on the cluster.

• Performance analysis of openModeller.

• Parallelization of one of the existing algorithms (P-GARP).

• Parallelization of the model projection step.

The following tasks are currently in progress:

• Adapt the existing openModeller Web Services scheduler script to submit jobs to the cluster.

• Merge P-GARP and the parallelized projection into the openModeller code repository.

• Deploy P-GARP and the parallelized projection in the cluster.

• Develop new strategies to improve the parallel versions.

Four different versions of the original openModeller projection code were produced and tested. The first version used the OpenMP library [17] to parallelize model projection on multiprocessor computers. Tests were performed on one node of the cluster (each node has dual processor cores). The initial results were poor (see annex 3, only in Portuguese) because of the application's intensive hard disk access. The other versions used the Message Passing Interface (MPI) [18] on the cluster. In version 2, each node of the cluster generates a file in TIFF format for each part of the distribution map, and at the end one of the nodes merges all files. In version 3, each node generates an intermediate custom file for each part of the map. This file contains the associated probability for each pixel. At the end, one of the nodes receives these intermediate files from the other nodes to generate the final map. Version 4 was based on version 3, but without generating intermediate files: the associated probability for each pixel is stored in a buffer which is transferred to node 0. More details can be found in annexes 3 and 4 (only in Portuguese). There was a 5.7-fold reduction in execution time using version 4 with ten nodes, compared to the sequential execution time (the sequential version of openModeller running on one node), when the map extent was limited to Brazil and the Bioclim algorithm was used. A 3.1-fold improvement was obtained when the map extent was South America. These versions are still being studied and improved.
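The buffer-based scheme of version 4 can be illustrated without MPI. The sketch below is a sequential stand-in, not the project code: it assumes the map is divided into contiguous blocks of rows (the report only says "parts of the map"), lets each loop iteration play the role of one node filling its buffer, and concatenates the buffers in rank order as node 0 would. All names are hypothetical. The last function expresses the reported speedups as parallel efficiency.

```python
def split_rows(n_rows, n_nodes):
    """Contiguous (start, stop) row blocks, one per node, sizes differing by at most one."""
    base, extra = divmod(n_rows, n_nodes)
    blocks, start = [], 0
    for rank in range(n_nodes):
        size = base + (1 if rank < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks


def project_block(model, layer_rows, block):
    """Work of one node: a buffer of probabilities for its rows."""
    start, stop = block
    return [[model(cell) for cell in row] for row in layer_rows[start:stop]]


def project(model, layer_rows, n_nodes):
    """Sequential stand-in for the scatter/gather; a real run would use MPI."""
    buffers = [
        project_block(model, layer_rows, block)
        for block in split_rows(len(layer_rows), n_nodes)
    ]
    # Node 0 assembles the final map from the buffers in rank order.
    return [row for buffer in buffers for row in buffer]


def efficiency(speedup, n_nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_nodes


blocks = split_rows(10, 3)
result = project(lambda value: value * 2, [[1], [2], [3], [4], [5]], n_nodes=2)
eff_brazil = efficiency(5.7, 10)
eff_south_america = efficiency(3.1, 10)
```

Read this way, the reported figures correspond to roughly 57% and 31% of ideal linear speedup on ten nodes.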

P-GARP is a parallel version of the GARP algorithm designed as part of this project for efficient use on clusters. The current implementation was tested for correctness and performance. P-GARP models were successfully compared with GARP models generated from the same input. P-GARP is currently available for testing, but only to identified cluster users.

The ideas that were used to develop P-GARP can be applied in the parallelization of other algorithms to improve performance. P-BestSubsets and HighP-BestSubsets are parallel versions of the openModeller GARP Best Subsets algorithm, using GARP and P-GARP, respectively. The design of both new algorithms has been completed. P-BestSubsets is partially implemented, but is still not available. HighP-BestSubsets implementation has not been initiated.

17 http://www.openmp.org/drupal/mp-documents/spec25.pdf

18 http://www.netlib.org/mpi/

A Website for job submission to the cluster was built. It provides an interface for end users to request either a sequential openModeller execution running on one node or a parallel execution running on several nodes.

Initial implementation using services architecture

To help identify openModeller components that can be processed in alternative parallel and distributed ways using a high-performance infrastructure, a detailed performance analysis was carried out during the second year of the project. In this study the openModeller framework was split into several components and submitted to a typical workload defined with the help of end users. Results can be found in the performance analysis online report19 (only available in Portuguese). A Web system was also developed to collect performance metrics of the openModeller components.

Building on the above-mentioned work, studies to establish a new model for openModeller based on a Service-Oriented Architecture (SOA) were initiated. These studies also drew on previous studies about architecture and reference models for openModeller, the SOA reference model, and a comparative study between precision agriculture and biodiversity modeling information systems (all published as part of this project). First results include the definition of a reference architecture for ecological niche modeling and the application of SOA for service identification in ecological niche modeling and GIS systems.

A new version of openModeller using the new SOA-based architecture is being developed. Phase 1 consists of providing a service-oriented platform to deliver and integrate services that meet established SOA requirements. A preliminary version is available using JBoss, Apache Tomcat, Apache Axis and Apache Ant (all open source tools), but it is not yet integrated with openModeller (see annex 5, only available in Portuguese). The next step consists of breaking openModeller into several separate services. These will reflect the pre-analysis, modeling and post-analysis stages of the modeling process.

Other developments

Model repository

Karla Donato Fook's doctoral thesis proposes a Web Services architecture that will support collaboration in a species distribution modeling network. By using this new architecture, users will be able to share model instances (definition of parameters, data used, chosen algorithms) and potential distribution maps. This will allow them to access other models and compare results. The proposed architecture was partially implemented. In 2007 two prototypes were released and presented to CRIA and INPE users for feedback. A first final release is expected to be delivered in April, 2008.

Study Cases

The following studies were carried out in collaboration with CRIA during the period between March 2007 and February 2008 with the objective of testing the framework, training and involving users from other institutions, and producing papers to disseminate openModeller and its applications:

19 http://openmodeller.incubadora.fapesp.br/portal/bolsas/Relatorio-Fapesp.rar

• Comparison of two different algorithms (GARP and Maxent) in modeling the potential habitat of the maned wolf (Chrysocyon brachyurus). The main objective was to assess the consequences of current habitat fragmentation for this species. Renata Kawashima & Eduardo Mantovani (Instituto Nacional de Pesquisas Espaciais – INPE) and Marinez F. Siqueira (CRIA). Status: published.

• Performance test of two different algorithms (GARP and SVM – Support Vector Machine) in modeling a Cerrado tree species (Stryphnodendron obovatum). The main objective was to compare the accuracy of these algorithms and test the effect of using a high number of environmental layers in the process. Ana Carolina Lorena (Universidade Federal do ABC), Renato De Giovanni (CRIA), André C.P.L.F. Carvalho & Ricardo C. Prati (Universidade de São Paulo - USP, Campus de São Carlos). Status: accepted.

• Application of potential species modeling to the known geographic distribution of Hennecartia omphalandra (Monimiaceae). Marcus Gonzales & Ariane L. Peixoto (Escola Nacional de Botânica Tropical – Jardim Botânico do Rio de Janeiro), and Marinez F. Siqueira (CRIA). Status: in preparation.

• Potential distribution modeling of species threatened with extinction in the State of Minas Gerais, Brazil. Luciana H. Yoshino Kamino (Universidade Federal de Minas Gerais – UFMG) and Marinez F. Siqueira (CRIA). Status: in preparation.

• Potential distribution modeling and model validation using openModeller. Francisco C. Barreto (Universidade Federal de Viçosa – UFV) and Marinez F. Siqueira (CRIA). Status: in preparation.

• Ecological niche modeling in the Brazilian Atlantic Forest: a comparative evaluation of presence-only methods for modeling the geographic distribution of anurans. João Gabriel Giovanelli & João Alexandrino (Universidade do Estado de São Paulo - UNESP, Campus de Rio Claro) and Marinez F. Siqueira (CRIA). Status: in preparation.

The following study cases were carried out at INPE:

• Fabio Iwashita’s master's thesis assessed the sensitivity of species distribution models to the precision of locality data. Two algorithms implemented in openModeller (GARP and BIOCLIM) and Maxent were evaluated. The paper from this thesis will be submitted in March 2008 (Iwashita, in preparation).

• For the state of São Paulo, species of the genus Croton (Euphorbiaceae) and species from the tribe Cynodonteae (Poaceae – Chloridoideae) were studied in collaboration with the Instituto Botânico de São Paulo (IBt). The first analysis of the spatial distribution of the tribe Cynodonteae indicated that the sampling effort has to be intensified to enable a better understanding of the biogeography and conservation status of the group (Santos et al., 2007). For the genus Croton, the importance of the soil variable in species distribution modeling for this group was analyzed (Caruzo et al., 2007). For both taxa, further experiments with species distribution models using different algorithms and variables can help to better discuss the biological processes related to the species distribution.

• A database for studying Arecaceae (Palmae) distribution in Brazil is under development. Occurrence data from the most important Brazilian and international herbaria were incorporated into the database. At the moment, a data cleaning process to correct the geographical coordinates and taxonomic nomenclature is being performed. The database with Arecaceae occurrence data will enable a study about the spatial distribution of Brazilian palms and also help testing openModeller algorithm implementations.

• Different environmental data and species distribution algorithms were tested to achieve better results and optimize modeling procedures. In particular, the Normalized Difference Vegetation Index (NDVI) was used to model the genus Coccocypselum (Rubiaceae) using the GARP-OM Best Subsets algorithm and Maxent (Costa et al., in preparation).

• A study case is under preparation to analyze the impact of climate change on the life forms in the boundaries between savanna and tropical rain forest in the Brazilian Amazon. A database containing occurrence data from savanna and forest vegetation is under construction and will be used to model the current and the predicted distribution of dominant life forms (Box model algorithm) as well as of a number of selected species using algorithms implemented in openModeller.

Seminars

A number of seminars and meetings are held in order to integrate the teams and give everyone a clearer view of the project as a whole. During the last year of the project three seminars and two general meetings were held.

Date: 14th June, 2007

Location: Poli

Presentations: Using the new interface to submit Condor jobs in the cluster (Nilton Cézar de Paula, Poli), Plans for Unit Tests in openModeller (Albert Massayuki Kuniyoshi, Poli), News about openModeller Desktop (Tim Sutton, CRIA), Model Repository (Karla Fook, INPE).

Date: 16th August, 2007

Location: INPE

Presentations: Support Vector Machines (Ana Carolina Lorena, UFABC), Status of the cluster environment (César Bravo, Poli), Discussions about the project.

Date: 18th October, 2007

Location: Poli

Presentations: Adaptive Systems (Prof. João José Neto, Poli), AdaptGarp (César Bravo, Poli), Using openModeller Desktop with the cluster (Tim Sutton, CRIA).

The study group on Biodiversity Modeling that was created at INPE, called “Referata Biodiversa”20, continued its monthly meetings. Over the last year, the group discussed issues such as climate change and potential vegetation models, spatial dependence in predictive vegetation modeling, the status of and problems with digital databases of biological collections, and statistical methods for model evaluation, among others. Besides contributing towards the integration of INPE’s team, it also offers the opportunity of interaction with other groups such as the Instituto de Biociências (USP) and the Instituto de Botânica de São Paulo (IBt-SP).

Training

Dr. Marinez F. Siqueira from CRIA offered training courses and lectures, and acted as advisor to graduate students working with openModeller. Last year's activities include:

20 http://www.dpi.inpe.br/referata/index.html

• Instructor: III course “Potential distribution of species”. Instituto de Pesquisas Ecológicas – IPE. Nazaré Paulista, SP, Brazil. Apr/2008.

• Instructor: Advanced course “Potential distribution of species”. CENARGEM/EMBRAPA. Feb/2008.

• Co-adviser of Luciana H. Yoshino Kamino. PhD degree: Modelagem de espécies de plantas ameaçadas de extinção de Minas Gerais. Pós-graduação em Biologia Vegetal. Laboratório de Sistemática Vegetal. Depto de Botânica /ICB /UFMG. Jan/2008.

• Co-adviser of Francisco Candido Cardoso Barreto. PhD Degree. Potential distribution modeling and models validation in openModeller. Programa de Pós-Graduação em Entomologia. Universidade Federal de Viçosa - UFV, Brazil. Feb/2008.

• Instructor: II course “Potential distribution of species”. Instituto de Pesquisas Ecológicas – IPE. Nazaré Paulista, SP, Brazil. Sep/2007.

• Instructor: I course “Potential distribution of species”. Instituto de Pesquisas Ecológicas – IPE. Nazaré Paulista, SP, Brazil. Mar/2007.

• Lecture: Modelagem de distribuição geográfica de espécies. In: XVIII Semana de Estudos da Ecologia. Instituto de Biociências, UNESP, Campus Rio Claro. 10 – 14 September, 2007.

• Lecture: Modelagem de distribuição potencial de espécies. In: Faculdades Integradas Metropolitanas de Campinas – METROCAMP. 6th October, 2007.

• Lecture: Acesso a dados de coleções biológicas. In: Faculdades Integradas Metropolitanas de Campinas – METROCAMP. 6th October, 2007.

• Lecture: Mudanças ambientais: possíveis impactos na biodiversidade. In: Programa de Extensão da Escola Nacional de Botânica Tropical. Seminários em Ciência e Tecnologia. Jardim Botânico do Rio de Janeiro. 20th April, 2007.

• Lecture: Environmental satellite data: applications in studies of biodiversity. “Strategies for Open and Permanent Access to Scientific Information in Latin America: Focus on Health and Environmental Information for Sustainable Development”. Atibaia, SP. 8-10 May 2007.

Dr. Silvana Amaral presented the openModeller project as part of INPE's activities in the following events:

• Visit of the Ministry of Forestry of Indonesia to INPE, June/2007, presentation entitled “Species Distribution Modeling in the Amazônia”;

• Lecture in the graduate program in Remote Sensing at INPE (24/10/2007), in the course “Tópicos Especiais em Floresta” (SER 455-3), presentation entitled “Modelos de Distribuição de Espécies”;

• Rede GEOMA Symposium, Petrópolis-RJ (29-31/10/2007), presenting the paper “Estudos de Modelagem de Distribuição de Espécies no Componente Biodiversidade na Rede GEOMA”.

During this year the following people were involved with openModeller through scholarships and training:

Doctoral students:

• Cristina Giannini, Instituto de Biociências da Universidade de São Paulo (IB/USP), since 07/2007.

• Elisângela Silva da Cunha Rodrigues, EPUSP (CAPES scholarship), since 07/2007;

• Fabiana Soares Santana, EPUSP, since 02/2007;

• Fabrício Rodrigues, EPUSP (CAPES scholarship), since 07/2007;

• Francisco Candido Cardoso Barreto, UFV;

• Karla Donato Fook, INPE, since 03/2004;

• Luciana H. Yoshino Kamino, UFMG;

• Nilton Cézar de Paula, EPUSP, since 06/2006.

Master students:

• Fabio Iwashita, INPE Remote Sensing Program, finished in 03/2007 under the supervision of Dr. Silvana Amaral;

• João Gabriel R. Giovanelli, UNESP, Rio Claro;

• Marcos Gonzales, ENBT/JBRJ.

Undergraduate students during the period of this report:

• Albert Massayuki Kuniyoshi – EPUSP;

• Alex Oshika, student of Computer Engineering – EPUSP;

• Danilo de Jesus da Silva Bellini, student of Electrical Engineering – EPUSP (CNPq scholarship);

• Luciano Bergantini Lippi, student of Computer Engineering – EPUSP;

• Marcos Cabral Santos, student of Computer Engineering – EPUSP (Fapesp scholarship);

• Mariana Ramos Franco, student of Computer Engineering – EPUSP (Fapesp scholarship).

FAPESP Technical Training scholarships:

• Alexandre Copertino Jardim, scholarship type TT4, since 10/2007;

• Dr. César Alberto Bravo Pariente, EPUSP, scholarship type TT5, 12/2006 – 12/2007;

• Luciana Satiko Arasato, scholarship type TT3, since 10/2007;

• Missae Yamamoto, scholarship type TT5, since 10/2007;

• Renata Luiza Stange, EPUSP, scholarship type TT4a, since 01/2008;

• Tim Sutton, CRIA, scholarship type TT5, since 05/2006.

Publications

Journal papers

Canhos, V.P., Siqueira, M.F., Marino, A. & Canhos, D.A.L. “Análise da vulnerabilidade da biodiversidade brasileira frente às mudanças climáticas globais”. Parcerias Estratégicas. Centro de Gestão e Estudos Estratégicos. Accepted in March, 2008.

De Marco Jr, P. & Siqueira, M.F. “Como determinar a distribuição potencial de espécies sob uma abordagem conservacionista?” Megadiversidade, Belo Horizonte. (Submitted in December 2007).

Muñoz, M.E.S., Giovanni, R., Siqueira, M.F., Sutton, T., Brewer, P., Scachetti-Pereira, R., Canhos, V.P. & Canhos, D.A.L. “openModeller: A Generic Approach to Potential Distribution Modelling of Species”. Geoinformatica. (Submitted in December 2007).

Pereira, R. S. & Siqueira, M. F. “Algoritmo Genético para Produção de Conjunto de Regras (GARP)”. Megadiversidade, Belo Horizonte. (Article in press)

Santana, F.S., Bravo, C., Saraiva, A.M. & Correa, P.L.P. “Parallel Genetic Algorithm for Rule-set Production”. Environmental Modelling and Software. (Submitted in March 2008)

Santana, F. S., Siqueira, M. F., Saraiva, A. M. & Correa, P. L. P. 2008. “A reference business process for ecological niche modelling”. Ecological Informatics Journal, v. 3 p. 75-86.

Siqueira, M.F. & Durigan, G. 2007. “Modelagem da distribuição geográfica de espécies lenhosas de cerrado no Estado de São Paulo”. Revista Brasileira de Botânica. v.30. p239-249.

Siqueira, M.F., Durigan, G., De Marco Jr., P. & Peterson, A.T. “Something from Nothing: Using Landscape Similarity and Ecological Niche Modeling to Find Rare Plant Species”. Journal for Nature Conservancy. (Submitted in December 2007)

Conference papers

Amaral, S., Costa, C.B., Iwashita, F., Ximenes, A. & Valeriano, D.M. (2007). “Estudos de Modelagem de Distribuição de Espécies no Componente Biodiversidade na Rede GEOMA”. I Simpósio da Rede Geoma, Petrópolis, RJ.

Amaral, S., Costa, C.B. & Rennó, C.D. (2007). “Normalized Difference Vegetation Index (NDVI) improving species distribution models: an example with the neotropical genus Coccocypselum (Rubiaceae)”. Anais do XIII Simpósio Brasileiro de Sensoriamento Remoto, Florianópolis, Brasil, INPE, p. 2275-2282 (marte.dpi.inpe.br/col/dpi.inpe.br/sbsr@80/2006/11.15.14.30/doc/2275-2282.pdf).

Araujo, J. M., Correa, P.L.P. & Saraiva, A. M. “A Framework for Species Distribution Modeling: a performance evaluation approach”, I2TS'2007 Proceedings of the 6th International Information and Telecommunication Technologies Symposium, Brasília: IEEE R9, 2007. Editors: Fundação Bardall de Educação e Cultura; Boukerche, A, Loureiro, A.A.F., Melo, A.C.M.A. and Gondim, P.R.L. p. 111-118. Oral presentation.

Bravo, C., Neto, J.J. & Santana, F.S. “Unifying Genetic Representation and Operators in an Adaptive Framework”. Analysis of Genetic Representations and Operators, AGRO 2007.

Bravo, C., Neto, J.J, Santana, F.S. & Saraiva, A.M. “Towards an adaptive implementation of genetic algorithms”. Anais da XXXIII Conferência Latinoamericana de Informática – CLEI 2007; Taller Latinoamericano de Informática para la Biodiversidad – INBI 2007, San José, Costa Rica. Proceedings of the CLEI – Centro Latinoamericano para Estudios en Informatica, 2007. v.1 p. 1-5.

Caruzo, M.B., Costa, C. B., Amaral, S. & Cordeiro, I. (2007). “Aplicação de classes de solo em modelos de distribuição de espécies: um exemplo com Croton L. (Euphorbiaceae)”. Paper presented at Congresso Nacional de Botânica, São Paulo.

Kawashita, R.S., Siqueira, M.F. & Mantovani, E. (2007). “Dados do monitoramento da cobertura vegetal por NDVI na modelagem da distribuição geográfica potencial do lobo-guará (Chrysocyon brachyurus)”. XIII Simpósio Brasileiro de Sensoriamento Remoto. Florianópolis, SC. v.13. p.3983 – 3990.

Kuniyoshi, M. A. & Correa, P. L. P. “Aplicação de Testes Unitários no openModeller”, Anais do 15º Simpósio Internacional de Iniciação Científica da USP, São Carlos, 2007. Abstract. Poster presentation.

Lorena, A. C., Siqueira, M. F., Giovanni, R., Carvalho, A. C. P. L. F. & Prati, R. C. “Potential Distribution Modelling Using Machine Learning”. In: The Twenty First International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE), Wroclaw, Poland. Lecture Notes in Artificial Intelligence, v. 5027, Springer-Verlag, 2008. (Accepted)

Santana, F. S., Murakami, E., Saraiva, A. M. & Correa, P. L. P. “A comparative study between precision agriculture and biodiversity modelling information systems”. 6th Biennial Conference of the European Federation of IT in Agriculture, Glasgow: C.Parker, S.Skerratt, C.Park, J.Shields, 2007. v. 1. p. 1-6. Oral presentation.

Santana, F. S., Murakami, E., Saraiva, A. M., Bravo, C. & Correa, P. L. P. “Uma arquitetura de referência para sistemas de informação para modelagem de nicho ecológico”, Anais do 6º Congresso Brasileiro de Agroinformática – SBIAgro 2007, Campinas: Embrapa Informática Agropecuária, 2007. Editors: S.Tiernes, L.H.A. Rodrigues. p. 101-105. Oral presentation.

Santana, F. S., Pinaya, J.L.D., Saraiva, A. M., Correa, P. L. P., Becerra, J.L.R. & Bravo, C. “Aplicação de SOA para identificação de serviços em sistemas de modelagem de nicho ecológico e GIS”, I2TS'2007 Proceedings of the 6th International Information and Telecommunication Technologies Symposium, Brasília: IEEE R9, 2007. Editors: Fundação Bardall de Educação e Cultura; Boukerche, A, Loureiro, A.A.F., Melo, A.C.M.A. and Gondim, P.R.L.

Santos, A.L. dos, Wanderley, M.G.L., Bestetti, C.B. & Amaral, S. (2007). “Diversidade da tribo Cynodonteae (Poaceae: Chloridoideae) no Estado de São Paulo”. Paper presented at Congresso Nacional de Botânica, São Paulo, SP.

Other papers

Sutton, T., Giovanni, R. & Siqueira, M.F. “Introducing openModeller - A fundamental niche modelling framework”. OSGeo Journal Volume 1. ISSN 1994-1897. Available at http://www.osgeo.org/files/journal/final_pdfs/OSGeo_vol1_openModeller.pdf

Franco, M.F. “Arcabouço para distribuição e modelagem de espécies – uma análise de desempenho”. Scientific Report – FAPESP. FAPESP Process: 2006/03616-9. July. 2007. 39pp.

Stange, R.L. “Manual de Instalação da Plataforma de Serviços”. Internal Technical Report, February, 2008. 11pp.

In preparation

Costa, C.B. & Amaral, S. “Presence-only modeling method for predicting species distribution: an example with the neotropical Rubiaceae genus Coccocypselum P. Br”. Biota Neotropica.

Giovanelli, J.G.R., Siqueira, M.F., Haddad, C.F.B. & Alexandrino, J. “Ecological niche modeling in the Brazilian Atlantic Forest: a comparative evaluation of presence-only methods for modelling the geographic distribution of anurans”.

Gonzalez, M., Peixoto, A.L. & Siqueira, M.F. “Chorology of Hennecartia omphalandra Poisson (Monimiaceae), a Miocene species from the South American Atlantic Forest”.

Iwashita, F., Amaral, S., Monteiro, A.M.V. “Species distribution models sensibility to geographical positioning data”. Journal of Biogeography.

Santana, F.S., Sato, L., Bravo, C. & Saraiva, A.M. “Performance improvement strategies for ecological niche modelling”.

General Comments

During the project we found that certain aspects of the architecture would be better approached in a different way. Although a fully componentized architecture is still being researched, we are concentrating on providing a cohesive and simplified Web Service modeling infrastructure. Instead of creating individual Web Services for each openModeller component as originally proposed, we aim to consolidate the main functional areas into a single modeling Web Service. We realized that some of the components, such as the locality and environmental components, would be better implemented with a focus on supplying data to the modeling environment rather than to end users. For example, TAPIR, WFS and DiGIR are all robust and well-established protocols for obtaining occurrence data, so implementing another service with a similar goal would not make much sense. Similar logic applies to the environmental component, where protocols such as WCS are already well established. The idea is to be able to retrieve data from these kinds of services. An additional motivation is to keep the public Web Services API simple in order to ease integration with third parties and facilitate maintenance. For these reasons, the core pre- and post-analysis functionality is being incorporated into the main modeling Web Service.

[Diagram labels: GBIF, speciesLink, GEOSS; specimen data (occurrence points); environmental data (layers); data cleaning, georeferencing; locality component; environmental component; openModeller (modelling, pre-analysis, post-analysis) with Algorithm 1, Algorithm 2, ..., Algorithm n; interface (desktop, web, console, SOAP server, SWIG wrapper, others); user.]

Figure 2. New architecture for openModeller
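The idea of retrieving occurrence data from existing services instead of reimplementing them can be sketched as a small driver interface. This is an illustrative Python outline only: `OccurrenceSource`, `InMemorySource` and `ModelingService` are hypothetical names, not part of the openModeller API, and the in-memory source merely stands in for a real TAPIR, DiGIR or WFS client.

```python
from typing import Iterable, Protocol, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


class OccurrenceSource(Protocol):
    """Anything that can return occurrence points for a species name."""

    def fetch(self, species: str) -> Iterable[Point]: ...


class InMemorySource:
    """Stand-in for a remote client; a real driver would query a service."""

    def __init__(self, records):
        self._records = records

    def fetch(self, species):
        return list(self._records.get(species, []))


class ModelingService:
    """Single entry point that pulls data from sources rather than serving it."""

    def __init__(self, sources):
        self._sources = list(sources)

    def gather_points(self, species):
        """Collect points from every source, dropping exact duplicates."""
        seen, points = set(), []
        for source in self._sources:
            for p in source.fetch(species):
                if p not in seen:
                    seen.add(p)
                    points.append(p)
        return points
```

With this shape, supporting a new data provider means adding one more source class, while the public interface of the modeling service stays unchanged.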

Another part of the architecture that we are likely to change relates to the data cleaning component. Originally we conceived a Web Service for data cleaning, but we now feel that it is better to build on the existing data cleaning infrastructure of the speciesLink network. Improvements made there will benefit end users who retrieve data from speciesLink using the new openModeller TAPIR driver.

The publicity gained from our regular software releases and interactions with other individuals and institutions has resulted in a number of potential areas for future collaboration with the wider scientific community. These include:

• An informal offer from Dr. Neil Caithness at the University of Oxford (UK) to host openModeller services at the OxGrid Campus Grid Computing Centre and the National Grid Service for the UK.

• Informal discussions with various people from the American Museum of Natural History and NatureServe on how we can help them to integrate openModeller into their current niche modeling processes.

• Informal discussions with Brian Hamlin (UC Berkeley, USA) towards including openModeller in future large scale modeling experiments they are planning.

openModeller has also been selected for the second generation of the LifeMapper21 project being developed by the University of Kansas.

Another initiative using openModeller is a GEOSS22 demonstration project being developed by GBIF and the Italian National Research Council. The result of this demonstration project should be used by another project called Ecological Model Web23 being developed by the Ecological Forecasting Program at NASA.

In addition we have been able to engage with users of our software from various countries around the world through our users' mailing list and IRC presence. This has enabled the introduction of a Taiwanese translation of openModeller Desktop which was contributed by one of our users.

A major concern refers to the continuity of these developments. We consider openModeller too important an initiative to depend on a project based grant or to be left solely as an open source initiative without substantial funding. Therefore, this last year will also be decisive as to planning its continuity and sustainability.

21 http://www.lifemapper.org/

22 http://www.earthobservations.org/

23 http://www.ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=4422708&arnumber=4423343&index=634

Annex 1. Changelog

Release 0.5.3 (2008-03-26)

SVN revision: 4209

• Fixed bug when GDAL failed to read a raster row (in which case the row used to have zero values). Now the row is filled with nodata values.

• XML request for model creation now supports additional options to filter occurrences (using spatiallyUnique or environmentallyUnique sampler functions).

• Changed nodata value of the default raster type (ByteHFA) to 101.

• Masks must now use nodata to indicate masked areas (zeros will not work anymore).

• om_testmodel now generates pseudo-absence points when there are no absences to be tested.

• Display confusion matrix cell values in om_console and om_testmodel.

• Renamed getLayerFilename to getLayerPath and getMaskFilename to getMaskPath in EnvironmentPtr class.

• Refactored om_pseudo, om_create and om_project to use the getopts command line library.

• Created man pages for om_pseudo, om_create and om_project.

• New parameter in om_pseudo to specify the proportion of absence points to be generated.

• New algorithm AquaMaps (for marine organisms).

• Removed algorithms minimum distance and distance to average since EnvironmentalDistance provides the same functionality.

• "type" property of AlgParamMetadata changed from char * to a new enumeration called AlgParamDatatype. Values can be Integer, Real and String.

• Fixed bug in the environmental distance algorithm: probabilities were negative for points whose distance to the nearest point was beyond the maximum allowed distance.

• New classes to perform jackknife and chi-square in the environmental layers.

• Updated TerraLib drivers for compatibility with TerraLib version 3.2.0.
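The occurrence filter mentioned in this release can be illustrated with a short Python sketch. It assumes that "spatially unique" means keeping at most one occurrence per raster cell; the function name `spatially_unique` and its parameters are hypothetical and do not reproduce the openModeller sampler implementation.

```python
import math


def spatially_unique(points, x_min, y_max, cell_size):
    """Keep the first occurrence that falls in each raster cell.

    points    -- iterable of (x, y) coordinates
    x_min     -- x coordinate of the raster's left edge
    y_max     -- y coordinate of the raster's top edge
    cell_size -- width and height of a square cell, in coordinate units
    """
    seen, kept = set(), []
    for x, y in points:
        col = math.floor((x - x_min) / cell_size)
        row = math.floor((y_max - y) / cell_size)
        if (row, col) not in seen:
            seen.add((row, col))
            kept.append((x, y))
    return kept
```

An environmentally unique filter could follow the same pattern, using the tuple of environmental values sampled at each point as the key instead of the cell index.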

Release 0.5.2 (2007-10-24)

SVN revision: 3806

• Fixed compilation issues under Windows.

• Included new command line program to generate pseudo occurrences.

• Minor improvements in console tools (absences are now displayed in om_viewer and om_niche).

• Code clean up.

Release 0.5.1 (2007-08-30)

SVN revision: 3661

• Fixed MSVC compilation problems.

• Fixed bug in deserialization of the new GARP algorithm (OM GARP).

• Fixed crash in one-class SVM when input points contained absences.

• Fixed bug in deserialization of environmental distance algorithm with Mahalanobis distance.

• Implemented serialization/deserialization for the new GARP Best Subsets (OM GARP).

• New algorithm "Envelope Score".

• Fixed bug in the pseudo-absence generation of the SVM algorithm when no absences were passed as a parameter.

Release 0.5 (2007-08-15)

SVN revision: 3527

• New algorithm "Support Vector Machines" (C-SVC, nu-SVC and one-class SVM).

• Added support for multiple normalization techniques (two implementations are available: ScaleNormalizer and MeanVarianceNormalizer).

• New method to cancel jobs (model creation or model projection).

• Sample serialization is now based on the original (unnormalized) environment values.

• New infrastructure for unit tests using cxxtest.
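The two normalization techniques named in this release correspond to well-known transformations, sketched below in Python for a single environmental variable. This is an assumption about what the class names imply (linear rescaling to a range, and centering with unit variance); the functions `scale_normalize` and `mean_variance_normalize` are hypothetical and are not the openModeller classes.

```python
from statistics import mean, pstdev


def scale_normalize(values, lo=0.0, hi=1.0):
    """Linearly rescale values so that min maps to lo and max maps to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]  # constant layer: no spread to rescale
    k = (hi - lo) / (v_max - v_min)
    return [lo + (v - v_min) * k for v in values]


def mean_variance_normalize(values):
    """Center values on their mean and divide by the population std. deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]  # constant layer: all values equal the mean
    return [(v - m) / s for v in values]
```

Either transformation puts layers measured in different units (for example temperature and precipitation) on a comparable scale before a distance-based or kernel-based algorithm is applied.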

Release 0.4.2 (2007-05-08)

• Included ROC curve as part of model statistics.

• Added metric Chebyshev in the Environmental Distance algorithm.

• Log object is now a singleton.

• Minor bugfixes.
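As a reminder of what a ROC curve summarizes, the following Python sketch computes ROC points and the area under the curve (AUC) from model scores at presence and absence points. It is a generic textbook calculation, not the openModeller implementation, and the names `roc_points` and `auc` are hypothetical.

```python
def roc_points(presence_scores, absence_scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    thresholds = sorted(set(presence_scores) | set(absence_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # A point is predicted present when its score reaches the threshold.
        tpr = sum(s >= t for s in presence_scores) / len(presence_scores)
        fpr = sum(s >= t for s in absence_scores) / len(absence_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An AUC of 1.0 means every presence point scores higher than every absence point, while 0.5 corresponds to a model that ranks them no better than chance.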

Annex 2. Activities related to the speciesLink network

Centralized data repository

During this year, a new architecture was designed to include a centralized data repository. Figure 3 shows the diagram prior to this activity. All cache nodes include a DiGIR provider that is also present in collections that are serving data directly to the DiGIR portal. All queries were distributed and the network was beginning to have performance problems. This development was co-financed through a project with the JRS Biodiversity Foundation.

The former architecture already had a database with a subset of DarwinCore fields in order to analyze the data and apply data cleaning tools. A Web service application called mapcria had also been developed, to be used by different applications to dynamically produce maps through a web interface.

[Diagram labels: DiGIR Portal; cache nodes; Network Manager; Data cleaning; Web site; mapcria Web service; Data analysis; Reports; Maps (PostGIS); collections with a DiGIR provider; collections with spLinker; database subset. Connection labels: DiGIR, SOAP, WMS, geographic data, distributed query.]

Figure 3. Diagram of network architecture

With funding from the JRS Biodiversity Foundation and Fapesp, the architecture was altered and now includes a central repository of all data served by the collections. The repository was installed on a Dell PowerEdge 1900 server running the Linux Fedora Core 6 operating system. The server has two 3 GHz Intel Xeon CPUs and 8 GB of RAM. The whole system is being developed using open-source software: the web server is Apache, the database management system is PostgreSQL and the programming language is Perl.

[Diagram labels: DiGIR Portal; cache nodes; Indicators; Network Manager; Query interface; Data cleaning; Web site; mapcria Web service; Data analysis; Reports; Maps (PostGIS); Central Repository; Data Harvester; collections with a DiGIR provider; collections with spLinker. Connection labels: DiGIR, SOAP, WMS, geographic data, centralized query, distributed query.]

Figure 4. speciesLink’s new architecture

Introducing a Central Repository to the architecture also meant developing a data harvester and a new query interface. The first idea was to store a subset of the DarwinCore fields, selecting only those of interest for data cleaning and ecological niche modeling applications. As the analysis of fields for different purposes evolved, the team decided to store all data provided by the collections. This way the system is already designed to use and provide all fields.

The Central Repository uses PostgreSQL24, an open source relational database system with more than 15 years of active development and a proven architecture, running on Linux. CRIA’s team has about 10 years of experience with the software, which has proven to be robust and to perform very well. It offers a number of features such as transaction control, maintenance of referential integrity and automatic triggers. It also has native programming interfaces for C/C++, Java, .Net, Perl, Python, Ruby, Tcl and ODBC, among others, and good documentation.

Its SQL implementation strongly conforms to the ANSI-SQL 92/99 standards and supports compound functional indexes which can use any of its B-tree, R-tree, hash, or GiST storage methods. GiST (Generalized Search Tree) serves as a foundation for many public projects that use PostgreSQL, such as PostGIS. PostGIS is a project which adds support for geographic objects in PostgreSQL, allowing it to be used as a spatial database for geographic information systems (GIS). Other advanced features include table inheritance, a rules system, and database events. Table inheritance puts an object-oriented slant on table creation, allowing database designers to derive new tables from other tables, treating them as base classes.

24 http://www.postgresql.org/


All these features were used in the development of the centralized database. A general table (splink) was created to store textual data following the DarwinCore data model, and inheritance was used to create a table for each collection of the network (Figure 5).

Figure 5. Diagram of the main table (splink) with secondary tables for each collection

PostGIS was used to facilitate the creation of maps, making geographic queries more efficient. A table was created for this specific purpose, using the geographic object “point” to store all georeferenced data of the network.

The data harvester was developed using Perl and recognizes any changes in the network through the DiGIR portal that accesses all DiGIR providers. Once a day, the system checks to see if the database has been updated. If there is any change, data analysis processes (data cleaning, network manager, and indicators) are triggered.

Centralized query [25]

A classification system was added to the metadata so that users can select the type of collection they wish to search. XML files were created for each collection and are used by different applications such as the network manager, the indicators, and the centralized query. When querying, users choose which fields to present (small subset, locality, all), the output format (XML, HTML, Excel), and whether to plot the georeferenced data on a map. Below is a screenshot of a query where the collections selected were those classified as “plants, herbarium, and voucher”. Possible options for different outputs are also shown. This interface was written in Perl and is available in English and Portuguese.

25 available at http://splink.cria.org.br/centralized_search


The classification system is also used by the management and monitoring system and by the indicators. The next figure shows a screenshot of the map that is produced. Points are plotted in a different color for each collection, and layers such as roadways, rivers, and protected areas, among many others, can be added.


Through the map interface it is now possible to retrieve information about the layers that were activated.

Figure 6. Occurrence points of a species plotted on a map


Figure 7. Information on activated layers where the point is located

Sensitive data

There is a big debate as to what data can be made freely and openly available on the Internet and what should be filtered by the collection. Some believe that if a species is endangered, locality data must be omitted. Others believe that including locality data enables others to carry out conservation programs involving that species. Some collections filter the data or use an application that reduces its precision. Simply filtering the data can be a problem, as users cannot distinguish “no data” from “sensitive data”. Altering its precision is also a problem, as managing, monitoring, or modeling tools may produce misleading reports or analyses.

For the speciesLink network, when a collection marks a record, or a specific field of a record, as not available, spLinker sends special values that the system interprets as “restricted” data, and the information is displayed as such. A researcher who needs that data for an analysis therefore knows that it exists and can contact the curator to try to obtain it.

Another feature added to the query is an indication of whether the species is on the IUCN red list or on national and state red lists. The sp icon, which runs an application that integrates all databases at CRIA with other systems, appears in red if the species is on one of the red lists.

As a last feature, we have also added a gateway to GBIF data, represented by an icon. All these features can be seen in Figure 8.


Figure 8. Records that have their coordinates filtered by the collection and sp indicating the species is in one of the redlists held at CRIA

Future developments

The next developments to the speciesLink network include:

• Migrate from the current DiGIR protocol to the TAPIR protocol, which facilitates an efficient implementation of DarwinCore extensions and makes it possible to implement thematic networks (microbiological, herbaria, etc.)

• Adapt the current centralized database, the mirror database (used by the cache-nodes), the spLinker software, the harvesting procedures and the search interface to accommodate DarwinCore extensions such as the one already defined for microbiological collections

• Adapt the current software, interface and databases to allow different views of the speciesLink content as thematic networks, such as a microbiological or a herbaria speciesLink

• Help new collections to join the network

• Continue giving support to all participant collections


Annex 3. Report on a tool to analyze data on species distribution using parallel processing

UNDERGRADUATE RESEARCH REPORT

TITLE: TOOL FOR ANALYZING SPECIES DISTRIBUTION DATA USING PARALLEL PROCESSING

Student: Marcos Paulo Pereira Cabral dos Santos

Advisor: Profa. Dra. Liria Matsumoto Sato

Date: 22/06/2007

Institution: Universidade de São Paulo

Escola Politécnica

Departamento de Engenharia de Computação e Sistemas Digitais


1- Introduction

This document is the final report on the activities carried out in the undergraduate research project between 1 September 2006 and 22 June 2007. During this period the following activities were carried out: study of parallelism and the ways to implement it, study of object-oriented programming, familiarization with the openModeller tool, analysis and parallelization of the openModeller source code using OpenMP and MPI, and a battery of tests and debugging of the resulting code.

The delivery deadline had to be brought forward by two months, without compromising what was proposed in the initial plan. The reason was my acceptance into the Double Degree Program between the Escola Politécnica da Universidade de São Paulo and the Ecole Centrale de Paris, where I will study from July 2007 to June 2009.

2- Initial plan

The most widely used methods for predicting species distribution are based on the concept of ecological niches [Peterson, 2001; Anderson et al., 2002a,b], which combine species occurrence data with the environmental conditions at each point. Niche models can be projected onto geographic dimensions to predict where the species under analysis is present. However, such methods can require excessive execution time. To reduce execution time, and thereby make feasible analyses whose processing time is currently prohibitive, as well as to allow more complex analyses that apply several algorithms and different treatments of the environmental data, this undergraduate research project intends to apply parallelism to the projection module of the current version of an existing tool called openModeller. Three parallel versions of the openModeller projection module will be presented:

• A parallel version using OpenMP [http://www.openmp.org/drupal/mp-documents/spec25.pdf] for a multiprocessor computer, which can run on one of the multiprocessor nodes of the cluster. OpenMP is a standard interface of compiler directives for the C language that provides the features needed to parallelize a program for multiprocessor computers, which contain several processors and shared memory. Implementations of this standard are freely available.

• Parallel versions using the MPI library [Snir, 2006], to be run on several nodes of the cluster. Two versions were implemented. MPI (Message Passing Interface) is a standard interface for communication between distributed processes by message passing.

The project will be carried out in three stages:

Stage 1: study of and familiarization with the openModeller system

• Study and use of the current system

• Study and analysis of the code of the current openModeller system

Stage 2: development of the parallel version using the OpenMP interface

• Definition of the parallelization strategy

• Implementation in the openModeller code

• Debugging and tests

Stage 3: development of the parallel versions using MPI

• Study of programming with MPI and familiarization with the system

• Implementations using the MPI interface for computer clusters: one version based on the parallelization strategy defined in the second stage and using the openModeller source code, and a second adopting another strategy

• Debugging and tests

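Schedule

Start: 1 September 2006

Duration: 12 months

Months: 1 2 3 4 5 6 7 8 9 10 11 12

Stage | Task | Months marked
Stage 1 | A | X
Stage 1 | B | X
Stage 2 | A | X X
Stage 2 | B | X X
Stage 2 | C | X
Stage 3 | A | X X
Stage 3 | B | X X
Stage 3 | C | X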

What is due up to the present moment is the first parallel version of openModeller.

3- Summary of the activities carried out

At the start of the project, the basic concepts of parallel programming and of object-oriented programming in C++ [Stroustrup, 2000] were studied. Once the fundamentals of parallelism had been learned, a process of familiarization began with the OpenMP library and with the species distribution analysis tool, openModeller.

The first strategy for parallelizing the code was defined, followed by its implementation and debugging.

The basic concepts of MPI were then introduced and new parallelization strategies were devised, one of which was used in another OpenMP implementation.

Finally, the tests that made it possible to draw conclusions about the performance of the versions were carried out.

3.1- Stages completed

The stages were completed as proposed in the initial plan, except for the third stage, where one more step had to be added: redefining the parallelization strategies for openModeller.

The project deadline had to be shortened to 10 months, as justified in the introduction and in the cover form of this report.

Stage 1: study of and familiarization with the openModeller system

• Study of the principles of parallel programming and of the fundamentals of object-oriented programming in C++, familiarization with OpenMP, and use of the current system

• Study and analysis of the code of the current openModeller system

Stage 2: development of the parallel version using the OpenMP interface

• Definition of the parallelization strategy

• Implementation in the openModeller code

• Debugging and tests

Stage 3: development of the parallel versions using the MPI interface

• Optimization, debugging and tests of the OpenMP implementation

• Study of programming with MPI and familiarization with the system

• Redefinition of the parallelization strategies for openModeller

• Implementation using the MPI interface, for computer clusters, based on the parallelization strategy redefined in this stage, and a new OpenMP implementation

• Definition and implementation of a second strategy, producing a new version

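Schedule

Start: 1 September 2006. Duration: 10 months

Months: 1 2 3 4 5 6 7 8 9 10

Stage | Task | Months marked
Stage 1 | A | x
Stage 1 | B | x
Stage 2 | A | x x
Stage 2 | B | x
Stage 2 | C | x
Stage 3 | A | x
Stage 3 | B | x
Stage 3 | C | x
Stage 3 | D | x x
Stage 3 | E | x x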

4- Details of the activities carried out

The activities consisted of study and of the parallel implementation of openModeller using OpenMP and MPI.


4.1- Study of parallel programming

The first parallel programming concepts studied were macrotasking, microtasking and parallel loops. As a learning method, small problems were solved using parallel processing, such as multiplying matrices and summing their elements. The solutions were implemented in C using the CPAR library [Sato, 1995], developed by earlier generations of the Laboratório de Arquitetura e Programação de Alto Desempenho, LAHPC-USP. Macrotasks, microtasks, parallel loops and semaphores were implemented in CPAR.

After gaining familiarity with the concepts of parallelism, the study of OpenMP, a library for shared-memory multiprocessors, began. The same problems that had been solved with CPAR were solved with OpenMP, always noting which implementation mechanisms changed from one library to the other.

The relevant topics in learning OpenMP were: parallel regions, parallel loops, local and shared variables, critical sections and barriers.

To be able to modify the openModeller code, it was necessary to learn object-oriented programming in C++ and how to combine it with parallel programming.

Finally, the concepts and applications of MPI were studied, as detailed further on.

4.2- openModeller

openModeller is an implementation of methods for predicting species distribution based on the concept of ecological niches [SOURCEFORGE, 2006]. These methods combine occurrence data for a given species with the environmental conditions at each point. Using existing algorithms, they try to identify which points in environmental space have conditions similar to one another. Grouped together, these points represent an ecological niche model, given the environmental dimensions considered. By projecting these niches onto geographic dimensions, one can then predict where the species will or will not be able to maintain populations.

This modeling tool, written in ANSI C++, receives as parameters a set of occurrence points (latitude and longitude) of a given species and a set of maps of environmental variables. The algorithms used in the current version of openModeller are: Bioclim, Climate Space Model, GARP, Environmental Distance and Minimum Distance.

The software works in two stages: modeling and projection. In the first stage, the species occurrence data are combined with the environmental conditions at each point to obtain, through the algorithms mentioned above, a model that represents the viability of the species under given environmental conditions. In the second, the model is projected onto geographic dimensions to predict where the species will or will not be able to maintain populations.
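The two stages can be illustrated with a deliberately simplified sketch. It assumes a rectangular environmental envelope (the minimum and maximum of each variable over the occurrence points), loosely in the spirit of envelope algorithms such as Bioclim. It is not openModeller's implementation, and the names Envelope, build_model, is_suitable and project are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One value per environmental variable (e.g. temperature, rainfall).
typedef std::vector<double> Sample;

// Hypothetical model: per-variable lower and upper bounds.
struct Envelope {
    Sample lower;
    Sample upper;
};

// Modeling stage: build the envelope from the environmental samples
// taken at the occurrence points (all samples must have the same size).
Envelope build_model(const std::vector<Sample>& occurrences) {
    assert(!occurrences.empty());
    Envelope model{occurrences[0], occurrences[0]};
    for (const Sample& s : occurrences) {
        assert(s.size() == model.lower.size());
        for (std::size_t i = 0; i < s.size(); ++i) {
            if (s[i] < model.lower[i]) model.lower[i] = s[i];
            if (s[i] > model.upper[i]) model.upper[i] = s[i];
        }
    }
    return model;
}

// A cell is suitable when every variable lies inside the envelope.
bool is_suitable(const Envelope& model, const Sample& cell) {
    assert(cell.size() == model.lower.size());
    for (std::size_t i = 0; i < cell.size(); ++i) {
        if (cell[i] < model.lower[i] || cell[i] > model.upper[i]) return false;
    }
    return true;
}

// Projection stage: evaluate the model on every cell of the map.
std::vector<bool> project(const Envelope& model, const std::vector<Sample>& cells) {
    std::vector<bool> suitable;
    suitable.reserve(cells.size());
    for (const Sample& cell : cells) suitable.push_back(is_suitable(model, cell));
    return suitable;
}
```

The projection loop visits every cell of the map independently of the others, which is why the projection stage is the natural candidate for parallelization.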

The institutions involved in the development of openModeller are the Escola Politécnica da Universidade de São Paulo (EPUSP), the Instituto Nacional de Pesquisas Espaciais (INPE) and the Centro de Referência em Informação Ambiental (CRIA).


4.3- OpenMP

The OpenMP directives used in the parallelization of openModeller are briefly described below.

OpenMP is an interface that supports shared-memory parallel programming on all architectures [www.openmp.org]. It is available for C/C++ and Fortran, on Unix and Windows NT platforms. Parallelism is implemented in OpenMP as follows:

A parallel region is implemented with the directive #pragma omp parallel { parallel region }.

Variables declared and objects created inside the parallel region are local. Variables declared before the parallel region are specified as private or shared on the directive that opens the region: #pragma omp parallel private(local variables separated by commas) shared(shared variables separated by commas). Objects created before the parallel region are shared, while those created inside it are local.

Parallel loops are implemented with the directive #pragma omp for followed by the for loop to be parallelized. The iterations are then divided among the threads.

A critical section is implemented with #pragma omp critical { critical region }, and a barrier is imposed with the directive #pragma omp barrier.

To set the number of threads that will be present in a given parallel region, call the function omp_set_num_threads(number of threads).

If the number of threads is not set in the program, it is taken from an environment variable (OMP_NUM_THREADS). If that variable is not set either, the number of threads is normally the number of processors of the machine. To find out which thread is executing the code, use the function omp_get_thread_num.
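As a minimal sketch of how these directives fit together, the function below sums a vector using a parallel region, a parallel loop, a critical section and a barrier. The function name parallel_sum is hypothetical, and the `_OPENMP` guard is an addition so that the same file also builds, and runs sequentially, when OpenMP is not enabled in the compiler.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sums the elements of `values` with the directives described above.
long parallel_sum(const std::vector<int>& values, int num_threads) {
    assert(num_threads > 0);
#ifdef _OPENMP
    omp_set_num_threads(num_threads);  // threads for the next parallel region
#endif
    long total = 0;  // declared before the region: shared
    const int n = static_cast<int>(values.size());
    #pragma omp parallel shared(total)
    {
        long partial = 0;  // declared inside the region: local to each thread
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            partial += values[i];  // iterations are divided among the threads
        }
        #pragma omp critical
        {
            total += partial;  // one thread at a time updates the shared sum
        }
        #pragma omp barrier  // every thread waits here before leaving
    }
    return total;
}
```

Each thread accumulates into its own partial sum, so the critical section is entered only once per thread rather than once per element.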

4.4- MPI (Message Passing Interface)

4.4.1- Definitions

Message passing: a set of processes, each with access to a local memory. Information is exchanged by sending it from the local memory of one process to the local memory of a remote process.

MPI: a message-passing library developed for distributed-memory environments that provides basic functions for communication between processes. In MPI applications the parallelism is explicit, that is, the programmer is responsible for distributing the tasks.

4.4.2- Basic MPI concepts

Rank: each process is assigned by the system an integer from 0 to N-1 (N is the number of processes); this number is called the rank.

Group: an ordered set of N processes associated with a communicator.

Communicator: a local object that represents the context of a communication, that is, the processes that can be contacted. Each group can have its own communicator, and the communicator associated with all processes is MPI_COMM_WORLD.

4.4.3- MPI implementations

MPI applications can be described as follows:

A problem is divided into parts that are distributed among the processes so that each one performs its share of the computation. When the computation ends, a master process receives the results that each slave process computed and continues the execution of the program based on them.

4.4.4- MPI functions used in the parallelization of openModeller

MPI_Comm_size: returns the number of processes. Its arguments are the communicator and the address of the integer variable that receives the number of processes.

MPI_Comm_rank: returns the rank of the calling process, a number from 0 to n-1 (where n is the total number of processes). Its arguments are the communicator and the address of the integer variable that receives the rank.

MPI_Send and MPI_Recv: basic routines for sending and receiving messages, respectively. The parameters of MPI_Send are, in this order: address of the data to be sent, number of items to be sent, data type, destination, message tag, communicator. Analogously, the arguments of MPI_Recv are: address of the buffer that receives the data, maximum number of items to be received, data type, source, message tag, communicator, message status.

MPI_Barrier: blocks each process that reaches a given point in the program until all the other processes reach it.

5- Parallelization strategies for openModeller

The projection module of openModeller was parallelized, as foreseen in the initial plan. The parallelization process took place in stages: locating the code to be parallelized, parallelization, compilation and debugging.

5.1- First strategy

In the first strategy, implemented in OpenMP, the Projection module of openModeller was parallelized by dividing it into blocks that were executed simultaneously, with the threads writing the result to a single *.tif file (the output of the program), a map indicating whether or not a given species is present.

5.1.1- Locating the code to be parallelized

First, an overview was made of the code for the projection stage, which is structured in modules and compiled in parts automatically through a makefile. Most attention was given to regions containing long loops, which were expected to take the most computing time because of their complexity and which could be parallelized.

No long loop with a predetermined number of iterations (a for statement) was found. However, a while loop was found whose stopping criterion is a comparison between objects through the overloaded binary inequality operator (!=), something characteristic of the polymorphism available in C++. This loop is in the file Projector.cpp.

The excerpt is shown below:

    MapIterator it = map->begin();
    MapIterator fin;
    while( it != fin )
    {
        (…)
        ++it;
    }

Note that, besides the overloaded inequality operator, this excerpt also uses the overloaded prefix increment operator.

5.1.2- Parallelization

Parallelizing the while loop presented in section 5.1.1 required a non-trivial solution, since the loop condition involves a comparison between two objects, one of them advanced with the prefix “++” operator and the other an end object. In addition, it was necessary to guarantee that each thread operated on a distinct object.

To solve these problems, the whole loop was first enclosed in a parallel region. Then auxiliary variables and objects, shared among the threads, were created. The variable and the object being iterated remained local. Inside a critical section, the shared variable and object are advanced and their updated values are then assigned to the local ones. The local ones are still used in the while comparison as the stopping criterion. In this way, each thread performs a different task and stops when its local variable and object meet the stopping criterion. This procedure also means that a thread that happens to be slower than the others performs fewer operations and, conversely, a faster thread performs more.

To guarantee that each thread processes distinct objects from the start, the variable and the object were also initialized inside a critical section. This required two auxiliary variables and one auxiliary object, all shared. The initialization is done as follows:


1- auxiliary variable 1 is initialized to zero before the parallel region;

2- inside the parallel region, the threads execute the following sequence:

Start of the critical section:

if auxiliary variable 1 is zero, auxiliary variable 2 receives zero, auxiliary variable 1 receives 1, and the attributes of the auxiliary object are set to their initial values;

otherwise, the value of auxiliary variable 2 is incremented, as are the attributes of the auxiliary object;

the local variable and the local object receive, respectively, the value of auxiliary variable 2 and the attributes of the auxiliary object.

End of the critical section.

A barrier is imposed.

The effect of this logic is that only one thread, the one that “arrives” first, initializes the shared variable; the others only obtain the value through the shared variables. Both initialization and iteration are done in critical sections, and there is some increase in the number of operations, variables and objects. Even so, time is gained in the overall computation of the loop, since almost all of the rest of it is done in parallel.

The most relevant parts of this code are shown below. Note that at (*) the whole body of the loop is executed in parallel:

    int temp_it = 0;
    MapIterator controle_it;
    int controle_pixels;
    int temp_pixels = 0;
    (...)
    /* Set the number of threads and start the parallel region */
    omp_set_num_threads(2);
    #pragma omp parallel shared(temp_contador,temp_pixels,controle_contador, controle_pixels,pixelcount,pixelstep)
    {
        #pragma omp critical
        {
            if( temp_it == 0 )
            {
                controle_pixels = 0;
                controle_it = map->begin();
                temp_it = 1; /* way of changing temp_it */
            }
            else
            {
                ++controle_it;
                controle_pixels++;
            }
            it = controle_it;
        }
        #pragma omp barrier
        while( it != fin )
        {
            (*)
            #pragma omp critical
            {
                controle_pixels++;
                pixels = controle_pixels;
                ++controle_it;
                it = controle_it;
            }
            /* End of the critical section and of the loop */
        }

A portion of the code in the body of the while loop (*), the one that writes the result of the model projection, was enclosed in a critical section.

    Sample amb;
    #pragma omp critical
    {
        Sample const &amb1 = env->get( lg, lt );
        amb = amb1;
    }
    (…)
    #pragma omp critical
    {
        if( amb.size() == 0 )
            map->put(lg,lt);
        else
            map->put(lg,lt,val);
    }
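The pattern described in this section, where each thread takes the next item from a shared position inside a critical section and then works on it outside, can be sketched in a self-contained form. This is only an illustration under simplifying assumptions: a plain index stands in for the MapIterator, and project_cell and project_all are hypothetical names, not openModeller functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the projection of one map cell.
inline int project_cell(int cell) { return 2 * cell; }

// Each thread repeatedly takes the next unprocessed index from a shared
// cursor inside a critical section, then processes that cell outside it.
std::vector<int> project_all(const std::vector<int>& cells) {
    std::vector<int> result(cells.size(), 0);
    std::size_t cursor = 0;  // shared: next index to hand out
    #pragma omp parallel shared(cursor, result)
    {
        while (true) {
            std::size_t mine = 0;  // local to each thread
            #pragma omp critical
            {
                mine = cursor;
                if (cursor < cells.size()) ++cursor;
            }
            if (mine >= cells.size()) break;  // nothing left for this thread
            result[mine] = project_cell(cells[mine]);
        }
    }
    return result;
}
```

Because the cursor is never advanced past the end, a thread that arrives late simply finds nothing to do, and a faster thread naturally takes more cells, as described above.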

5.1.3- Compilation and debugging

The code was compiled with the Intel icc compiler, which supports C++ and OpenMP. During debugging some problems were detected and solved, in particular those related to the need to place part of the code in a critical section.

5.1.4- Tests, performance analysis and a new OpenMP version

The implementation and the tests were carried out on a computer with a dual-core processor, in order to verify that the implementation worked and to make a preliminary performance analysis.

A first test, with a small data set, verified that the parallel implementation worked.

However, a second test with a larger data set, intended to give a preliminary view of performance, took longer to run than the version without parallelization. The bottleneck was found to be the code that writes the result of the model projection to the object.

A first attempt, applying the parallelization presented in section 5.1.2, was unsuccessful. The problem, detected in the test phase, was that writing to the map is done through local pointers with shared offset variables. When a thread wrote a point to the map, the previous write position was taken as the reference, but the offset applied was the one determined by the increments made by all the threads, not only by the local one. This flaw produced very large files that were entirely different from those generated by the sequential version.

Studying the openModeller code, we noted that the output of the projection step comes down to three fields: the latitude, the longitude and the pixel value. The map is generated by writing the pixel value at the position determined, to scale, by the latitude and longitude. The strategy adopted was therefore to divide the whole task among the threads, with each one storing these three fields in a file local to it; once all the threads had finished their tasks, the stored content would be sent to thread 0 and inserted into the map.

The fields are stored in an array of structs with a predetermined maximum size. Each thread has a local array, whose content is flushed to a file when its size is about to exceed the limit. When the Projection loop ends, whatever content remains in the array is written to the corresponding file and a barrier is imposed, so that no thread advances to the last stage until all the others have finished their parts. In the last stage, thread 0 reads the files of the other threads and writes to the map the pixels corresponding to each latitude and longitude pair.
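The buffering scheme can be sketched as follows. This is an illustration only: PixelRecord, PixelBuffer and buffer_demo are hypothetical names, the text format of the records is an assumption, and a generic output stream stands in for the per-thread file so that the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// The three fields produced by the projection of one pixel.
struct PixelRecord {
    double longitude;
    double latitude;
    int value;
};

// Fixed-capacity array of records that is flushed to a sink when full.
class PixelBuffer {
public:
    PixelBuffer(std::ostream& sink, std::size_t capacity)
        : sink_(sink), capacity_(capacity) {
        assert(capacity > 0);
        records_.reserve(capacity);
    }

    // Stores one record, flushing first if the array is already full.
    void add(const PixelRecord& record) {
        if (records_.size() == capacity_) flush();
        records_.push_back(record);
    }

    // Writes the buffered records to the sink and empties the array.
    void flush() {
        for (const PixelRecord& r : records_) {
            sink_ << r.longitude << ' ' << r.latitude << ' ' << r.value << '\n';
        }
        records_.clear();
    }

    std::size_t buffered() const { return records_.size(); }

private:
    std::ostream& sink_;
    std::size_t capacity_;
    std::vector<PixelRecord> records_;
};

// Example: three records through a buffer that holds two at a time.
std::string buffer_demo() {
    std::ostringstream sink;  // stands in for the per-thread file
    PixelBuffer buffer(sink, 2);
    buffer.add({-46.5, -23.25, 120});
    buffer.add({-46.25, -23.25, 130});
    buffer.add({-46.0, -23.25, 255});  // triggers a flush of the first two
    buffer.flush();                    // final flush after the loop ends
    return sink.str();
}
```

In a threaded version each thread would own one such buffer and one file, so no critical section is needed while records are being stored.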

In terms of functionality, this new version was successful. However, as the results table below shows for tests run on a dual-core computer, the time obtained with parallelization is longer than the sequential time. The loss of performance was found to be due to simultaneous access to the computer's disk, seeking positions in different disk blocks, combined with the fact that OpenMP with C++ shares all objects created by the process that creates the threads (here, the main process of the program). This made it necessary to include several critical sections along the code, which serialize the execution of the program at those points. It was therefore concluded that good performance would not be obtained regardless of the number of processors, and that the parallelization of the Projection stage of openModeller is not suited to shared-memory parallel computers. Since the purpose of the OpenMP version was to make openModeller available for shared-memory parallel computers, no effort was put into running it on a cluster, which would have required investigating the availability of a platform with OpenMP and C++ for clusters. It was decided instead to continue with the schedule and move on to developing a version using MPI with this same strategy.

The results obtained in the tests are presented below.

Platform:

1 dual-core Intel processor, 3.4 GHz

2 GB of RAM

160 GB IDE-SATA hard disk

Red Hat Linux, version 2.6


Implementation | Processors/Machine | Number of machines | Time (s) | Gain (%) | Speed-up
Sequential | # | # | 185 | # | #
Parallel | 2 | 1 | 228.5 | -23.5135 | 0.809628

5.2- Second strategy, implemented with MPI

A second strategy, equivalent to the one presented in section 5.1, was implemented, here using the MPI interface.

5.2.1- Tests and performance analysis

The program has two parts. In the first, each process projects its own block of pixels and generates a file, local to the process, in which the projection data for each pixel are stored. In the second, the data in each process's file are transferred by message passing to process 0, which inserts them into the map, generating the final file nome_arquivo.tif.

To analyze performance, tests were run on two platforms. The first has more recent (dual-core) processors, more memory, and faster, larger disks than the second, which however has more processing nodes. The tests on platform 1 made it possible to analyze the behavior of the parallel application in a more current environment. The tests on platform 2, with more nodes, were intended to check for the saturation of performance at a given number of nodes.

Platform 1:

1 dual-core Intel processor, 3.4 GHz
2 GB of RAM
160 GB IDE-SATA hard disk
Red Hat Linux, version 2.6

Platform 2:

1 node with:
2 Intel Xeon processors, 2.66 GHz
256 MB of RAM
Hard disk
Debian Linux 2.6.18

4 nodes with:
2 Intel Xeon processors, 2 GHz
512 MB of RAM
Hard disk
Debian Linux 2.6.18


Platform 1

Implementation | Processes/Node | Number of nodes | Time (s) | Gain (%) | Speed-up
Sequential | # | # | 185 | # | #
Parallel | 1 | 2 | 182 | 1.6 | 1.0
Parallel | 2 | 1 | 158 | 14.5 | 1.2
Parallel | 2 | 2 | 123 | 33.5 | 1.5
Parallel | 2 on one and 1 on the other | 2 | 133 | 28.1 | 1.4

A speed-up of 1.5 was observed, that is, a 33.5% reduction in execution time relative to the sequential time, running 4 processes, two on each node. Larger reductions could not be obtained because of the inherently sequential generation of the map (nome_arquivo.tif).
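The two derived columns of the results tables are consistent with the usual definitions of speed-up and percentage gain. Writing T_seq for the sequential time and T_par for the parallel time (symbols introduced here for illustration only), the best case on platform 1 works out as:

```latex
S = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}} = \frac{185}{123} \approx 1.5,
\qquad
G = 100 \cdot \frac{T_{\mathrm{seq}} - T_{\mathrm{par}}}{T_{\mathrm{seq}}}
  = 100 \cdot \frac{185 - 123}{185} \approx 33.5\%
```

A speed-up below 1, or equivalently a negative gain, means the parallel run took longer than the sequential one.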

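Platform 2

Implementation | Number of processes | Time (s) | Gain (%) | Speed-up
Sequential | # | 275 | # | #
Parallel | 2 | 384 | -39.6 | 0.45
Parallel | 3 | 354 | -22.3 | 0.78
Parallel | 4 | 309 | -12.3 | 0.89
Parallel | 5 | 271 | 1.45 | 1.01
Parallel | 6 | 283 | -2.91 | 0.97
Parallel | 7 | 281 | -2.18 | 0.98
Parallel | 8 | 255 | 7.27 | 1.08
Parallel | 9 | 256 | 6.91 | 1.07
Parallel | 10 | 247 | 10.2 | 1.1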

The execution time reported is the longest time spent by any of the processes. The sequential time was obtained by running the application on the 2.66 GHz node.

With larger numbers of processes (8, 9 and 10), the processing time of stage 1 decreased while the time of stage 2 increased, resulting in a small performance gain. With the number of nodes available it was not possible to observe the saturation expected at some number of nodes, which would result from the growing number of messages and from the minimum time needed to generate the map even without any message exchange.


5.3- Terceira Estratégia com implementação utilizando MPI

The third strategy divides the projection of the map points within the Projection module: each process computes the projection for one block of points and writes it to a separate file, and process 0 gathers the files.

The Projection module generates a *.tif file consisting of a header followed by the projection result. In this strategy the header was kept and the rest of the file was divided into N blocks, where N is the number of processes. Each block is computed in parallel and written to an auxiliary file.

In a first stage, the first process writes the header and the last block, generating the file nome_arquivo.tif (the file that will hold the final result). The second process writes the header and the second-to-last block, generating the file nome_arquivo_1.tif. The n-th process writes the header and the first block, generating the file nome_arquivo_n-1.tif. Because of the .tif format and the way the file is generated, a process cannot write its block directly at the block's position in its file without first handling the preceding positions. We therefore proceeded as follows: each process writes the header and then the value 255, which corresponds to no color value, until it reaches the position of the first pixel of its block. Since writing 255 involves no complex computation, this filler portion is processed quickly and each process soon reaches its assigned position in the file.
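The first stage can be pictured with a minimal in-memory sketch. It is not the project's code: it assumes one byte per pixel, treats the header as an opaque byte string instead of a real TIFF header, and uses the hypothetical names NODATA and partial_image.

```python
NODATA = 255  # filler value; the text describes 255 as "no color value"


def partial_image(header: bytes, block: bytes, start: int) -> bytes:
    """Build the content one process writes to its auxiliary file.

    The header comes first, then `start` filler pixels for the positions
    that belong to other processes, then the process's own block.
    """
    filler = bytes([NODATA]) * start
    return header + filler + block


if __name__ == "__main__":
    # A process whose block starts at pixel 3 of the image.
    data = partial_image(b"HDR", bytes([10, 20]), start=3)
    print(list(data))
```

Writing the filler costs no projection work, which is why each process reaches the start of its own block quickly.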

In a second stage, each process transfers the data of its block to process 0, which writes them to the file nome_arquivo.tif. The transfer, carried out by sending and receiving messages, starts with process 1, whose second-to-last block is written to nome_arquivo.tif, and ends with the n-th process (n being the number of processes), at which point process 0 writes the first block of pixels and produces the final file nome_arquivo.tif. Although the blocks are inserted sequentially, this operation takes little time compared with the total computation time of the program.

To speed this up, the transfer of data from each process's file to process 0, and its insertion into the map, starts from the position calculated as follows:

position = total_pixels / N * k, where k is the identifier of the MPI process (k = 0 to N-1).
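A small sketch of this position calculation and of the sequential insertion done by process 0 follows. It makes several assumptions that the text does not state: integer division, a pixel count divisible by N, one byte per pixel, and an index k that follows the order of the blocks in the file. The names block_position and merge_blocks are hypothetical, and plain byte strings stand in for the MPI messages.

```python
def block_position(total_pixels: int, n: int, k: int) -> int:
    """Position of the first pixel of block k: total_pixels / N * k."""
    return total_pixels // n * k


def merge_blocks(total_pixels: int, blocks: list[bytes]) -> bytes:
    """Insert each received block at its computed position (role of process 0)."""
    n = len(blocks)
    image = bytearray([255]) * total_pixels  # start from the filler value
    for k, block in enumerate(blocks):
        pos = block_position(total_pixels, n, k)
        image[pos:pos + len(block)] = block
    return bytes(image)


if __name__ == "__main__":
    blocks = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6])]
    print(list(merge_blocks(6, blocks)))
```

Because each transfer starts at the computed position rather than at the beginning of the file, the filler written in the first stage never has to be sent.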

5.3.1- Tests and performance analysis

To analyze performance, tests were run on platform 1 and platform 2, described in section 5.2.1.

Platform 1

Implementation  Processors/machine      Number of machines  Time (s)  Gain (%)  Speedup
Sequential      -                       -                   185       -         -
Parallel        1                       2                   152       17.8      1.2
Parallel        2                       1                   152       17.7      1.2
Parallel        2                       2                   116       37.3      1.6
Parallel        2 on one, 1 on another  2                   127       31.3      1.5


The tests on platform 1 showed a speedup of about 1.6, that is, a 37.3% reduction, with 4 processors, 2 on each node. These tests show a performance gain from parallel processing in part 1 of the program. They do not, however, show the impact of inter-node communication, present in the transfer of data from each process to process 0, since all processes are located on a single node.

Platform 2

Number of processes  Part 1 (s)  Part 2 (s)  Time (s)  Gain (%)  Speedup
Sequential           -           -           275       -         -
2                    323         2           325       -18.18    0.846
3                    283         2           285       -3.64     0.965
4                    242         2           244       11.27     1.127
5                    228         2           230       16.36     1.196
6                    219         11          230       16.36     1.196
7                    217         12          229       16.73     1.201
8                    210         11          221       19.64     1.244
9                    213         12          225       18.18     1.222
10                   213         12          225       18.18     1.222

The strategy of speeding up part 2 of the program, by starting each process's transfer to process 0 from a file position shortly before the start of its data block, as given by the calculation presented, proved effective: the time of part 2 stopped growing and remained stable.

The performance gain saturates at 8 processes; beyond that number there is no further reduction in the processing time of part 1. This saturation is caused by the sequential nature of the map generation, in which each process must write data to the file at the positions preceding its own block, and also by the high cost of disk access on this platform.

6- Conclusion

The studies carried out on C++ and parallel programming provided the foundation needed to develop the implementation. A still preliminary version was developed using OpenMP; it worked correctly but showed no performance gain over the sequential version.

The third strategy proved the most suitable. However, with the platforms available for testing, its behavior could not be verified on a cluster with better processing, disk and inter-node communication resources. This evaluation will be carried out by the OpenModeller project team once the cluster that has been purchased, and is still being installed, becomes available. A better performance gain is expected, especially because of this cluster's high-speed interconnect, as well as its processing and storage resources.


References

Anderson, R. P., D. Lew, & A. T. Peterson. 2003. Using intermodel variation in error components to select best subsets of ecological niche models. Ecological Modelling 162:211-232.

Anderson, R. P., M. Laverde, & A. T. Peterson. 2002a. Geographical distributions of spiny pocket mice in South America: Insights from predictive models. Global Ecology and Biogeography 11:131-141.

Anderson, R. P., M. Laverde, & A. T. Peterson. 2002b. Using niche-based GIS modeling to test geographic predictions of competitive exclusion and competitive release in South American pocket mice. Oikos 93:3-16

Ben-Ari, M. 1948. Principles of concurrent and distributed programming.

Peterson, A. T. 2001. Predicting species' geographic distributions based on ecological niche modeling. Condor 103:599-605.

Peterson, A. T., & K. C. Cohoon. 1999. Sensitivity of distributional prediction algorithms to geographic data completeness. Ecological Modelling 117:159-164.

Peterson, A. T., & D. A. Vieglais. 2001. Predicting species invasions using ecological niche modeling. BioScience 51:363-371.

Sato, Líria M., Midorikawa, Edson T., Senger, Hermes. 1996. Introdução a Programação Paralela e Distribuída. <www.unisantos.br/mestrado/informatica/hermes/File/apost.pdf>. Accessed 28/09/2006.

Sato, Líria M. 1995. Ambientes de programação para sistemas paralelos e distribuidos. Livre-docência thesis, Escola Politécnica da Universidade de São Paulo (EPUSP).

SOURCE FORGE. OpenModeller project portal. Available at: <http://openmodeller.sourceforge.net>. Accessed 30/10/2006.

Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J. MPI: the complete reference. <http://www.netlib.org/utk/papers/mpi-book/mpi-book.html>. Accessed 02/06/2006.

Stroustrup, Bjarne. 2000. The C++ Programming Language (Special Edition).

<http://www.openmp.org/drupal/mp-documents/spec25.pdf>. Tutorial. Retrieved 20/09/2006.


Annex 4. Parallel versions of the projection for computer clusters

PARALLEL VERSIONS OF THE PROJECTION FOR COMPUTER CLUSTERS

Author: Liria Matsumoto Sato

1) Introduction

Four parallel versions of OpenModeller were implemented. Three of them use the MPI interface: version 2, version 3 and version 4.

Version 3 and version 4 are presented in this annex. Both versions were run on the project cluster using varying numbers of processes.

The cluster has one front-end node and 10 nodes for processing applications. Each node has its own disk, 2 quad-core Xeon processors and 8 GB of RAM. In the tests with up to 10 processes, each process was allocated to a separate node. In the tests with more than 10 processes, more than one processor was used on some nodes. The execution-time tables for each version, presented in section 2 and section 3 of this annex, show the superiority of version 2.

Two experiments were used in the tests:

Experiment 1: Environmental Distance algorithm

map and output map: bio, prec, tmax, tmin

Output mask: Brasil

Total pixels: 24834568

Experiment 2: Environmental Distance algorithm

map and output map: bio, prec, tmax, tmin

Output mask: Brasil

Total pixels: 777600000

The original implementation contains a stage, the generation of the result file in .tif format, that must run sequentially. In experiment 2, out of a total sequential execution time of 3422 seconds, about 1638 seconds are spent in this stage. In version 3 and version 4, an execution time below 1638 seconds was obtained with 10 nodes. This reduction probably resulted not only from parallel processing but possibly also from better use of memory.
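To see why a time below 1638 seconds cannot come from parallel processing alone, the figures can be put into Amdahl's law. The report does not name this law; it is applied here only as an illustration, under the assumption that the 1638-second stage stays strictly sequential and that the rest of the work divides perfectly among the processes. The function name amdahl_time is hypothetical.

```python
def amdahl_time(t_total: float, t_serial: float, p: int) -> float:
    """Ideal execution time with p processes when t_serial cannot be parallelized."""
    return t_serial + (t_total - t_serial) / p


if __name__ == "__main__":
    T_TOTAL = 3422.0   # sequential execution time of experiment 2, in seconds
    T_SERIAL = 1638.0  # time spent generating the .tif file, in seconds
    for p in (4, 8, 10):
        print(f"{p:2d} processes: at least {amdahl_time(T_TOTAL, T_SERIAL, p):.1f} s")
```

Under these assumptions the time with 10 processes could not drop below the serial 1638 seconds, so a measured time under that value implies that the file-generation stage itself became faster, in line with the remark about memory use.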

2) Version 3

A version in which each node generates an intermediate file containing the probability associated with each pixel was proposed and implemented. In this version, the total number of pixels was partitioned into blocks of equal size, with node 0 also processing the remainder left over when the pixels are divided among the nodes. Node 0 receives the intermediate files from the other nodes and generates the final file in .tif format.
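The static partitioning can be sketched as follows. This is an illustration only: the function name static_partition is hypothetical, and the sketch computes block sizes without saying where in the image node 0's extra pixels lie, since the text does not specify that.

```python
def static_partition(total_pixels: int, n_nodes: int) -> list[int]:
    """Number of pixels assigned to each node; node 0 also takes the remainder."""
    base, rest = divmod(total_pixels, n_nodes)
    sizes = [base] * n_nodes
    sizes[0] += rest
    return sizes


if __name__ == "__main__":
    # Pixel count of experiment 1 split over the 10 processing nodes.
    sizes = static_partition(24834568, 10)
    print(sizes[0], sizes[1], sum(sizes))
```

With equal blocks, the total time is set by the slowest node, which is the limitation that an on-demand distribution of blocks is meant to address.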


The application was run on the project cluster using varying numbers of processes.

Table 1 and table 2 show the execution times obtained with varying sets of nodes for experiments 1 and 2, respectively.

Sequential  4 nodes  8 nodes  10 nodes  10 nodes    10 nodes    10 nodes    10 nodes
                                        (11 proc.)  (12 proc.)  (13 proc.)  (14 proc.)
231         162      89       83        80          72          67          65

Table 1: execution times (in seconds) of experiment 1 (version 3)

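Sequential  4 nodes  8 nodes  10 nodes  10 nodes    10 nodes    10 nodes    10 nodes
                                        (11 proc.)  (12 proc.)  (13 proc.)  (14 proc.)
3422  1653  1504  1458  1449  1410

Table 2: execution times (in seconds) of experiment 2 (version 3). Only six values are given for the eight columns.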


3) Version 2

This version differs from version 3 in that it applies a strategy that does not need intermediate files and that dynamically distributes blocks of pixels on demand, whenever a node finishes processing a block. As a result, the execution time of this version is lower than that of version 1, since it eliminates the time spent writing and reading the intermediate files and balances the processing load better. In this version, each node requests a block of pixels from the process that manages the distribution, computes the projection for the pixels of the block, stores the results in a buffer and, after processing the whole block, transfers the buffer contents to node 0. It then requests a new block. Node 0 hosts the process that receives the per-pixel probability values from the other nodes and generates the final .tif file.
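The on-demand distribution described here can be simulated in a few lines. The sketch below is not the MPI implementation: threads stand in for the nodes, a queue stands in for the process that hands out blocks, and project is a hypothetical placeholder for the per-pixel projection. All names are assumptions made for illustration.

```python
import queue
import threading


def project(pixel: int) -> int:
    """Hypothetical stand-in for the projection of one pixel."""
    return pixel % 256


def run_on_demand(total_pixels: int, block_size: int, n_workers: int) -> list[int]:
    # The distribution manager: a queue of (start, end) pixel blocks.
    tasks: queue.Queue = queue.Queue()
    for start in range(0, total_pixels, block_size):
        tasks.put((start, min(start + block_size, total_pixels)))
    results: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            try:
                start, end = tasks.get_nowait()  # request a new block
            except queue.Empty:
                return                           # no blocks left
            buffer = [project(p) for p in range(start, end)]
            results.put((start, buffer))         # send the whole buffer at once

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Role of node 0: place each received buffer at its position in the image.
    image = [0] * total_pixels
    while not results.empty():
        start, buffer = results.get()
        image[start:start + len(buffer)] = buffer
    return image


if __name__ == "__main__":
    image = run_on_demand(total_pixels=1000, block_size=64, n_workers=4)
    print(len(image), image[:5])
```

A worker that finishes early simply takes the next block, so faster nodes process more blocks and no intermediate files are needed.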

Table 3 and table 4 show the execution times obtained with varying sets of nodes for experiments 1 and 2, respectively.

Sequential  4 nodes  8 nodes  10 nodes  10 nodes    10 nodes    10 nodes    10 nodes
                                        (11 proc.)  (12 proc.)  (13 proc.)  (14 proc.)
231         146      61       48        44          41          40          40

Table 3: execution times (in seconds) of experiment 1 (version 2)


Sequential  4 nodes  8 nodes  10 nodes  10 nodes    10 nodes    10 nodes    10 nodes
                                        (11 proc.)  (12 proc.)  (13 proc.)  (14 proc.)
3422        1988     1299     1201      1163        1151        1120        1117

Table 4: execution times (in seconds) of experiment 2 (version 2)



Annex 5. Manual for the installation of a services platform
