fiware fwd big_data_all_in_1_v1
TRANSCRIPT
FIWARE Developer's Week: Big Data GE
2
10,000 feet view: Big Data in the FIWARE architecture
[Architecture diagram: IoT agents feed the Orion CB, which serves real-time, last-value context data to apps; Cygnus subscribes to Orion and persists historical context data into HDFS and other storages, where it is processed with MapReduce, Hive and Tidoop.]
4
Big Data: What is it and how much data is there
What is big data?
5
> small data
What is big data?
6
> big data
http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg
How much data is there?
7
Data growing forecast
8
                                      2012    2017
Global users (billions)                2.3     3.6
Global networked devices (billions)     12      19
Global broadband speed (Mbps)         11.3      39
Global traffic (zettabytes)            0.5     1.4

http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast

1 zettabyte = 10^21 bytes = 1,000,000,000,000,000,000,000 bytes
9
How to deal with it: Distributed storage and computing
What happens if one shelf is not enough?
10
You buy more shelves…
… then you create an index
11
"The Avengers", 1-100, shelf 1
"The Avengers", 101-125, shelf 2
"Superman", 1-50, shelf 2
"X-Men", 1-100, shelf 3
"X-Men", 101-200, shelf 4
"X-Men", 201-225, shelf 5

Distributed storage!
… and you call your friends to read everything!
12
Distributed
computing!
13
Distributed storage: The Hadoop reference
Hadoop Distributed File System (HDFS)
14
• Based on the Google File System
• Large files are stored across multiple machines (Datanodes) by splitting them into blocks that are distributed
• Metadata is managed by the Namenode
• Scalable by simply adding more Datanodes
• Fault-tolerant, since HDFS replicates each block (3 replicas by default)
• Security based on authentication (Kerberos) and authorization (permissions, HDFS ACLs)
• It is managed like a Unix-like file system
HDFS architecture
15
[Diagram: a Namenode holding the metadata table below, plus Datanode0, Datanode1, …, DatanodeN spread across Rack 1 and Rack 2, each holding replicated blocks.]

Path                    Replicas  Block IDs
/user/user1/data/file0  2         1,3
/user/user1/data/file1  3         2,4,5
…                       …         …
16
Managing HDFS: File System Shell and HTTP REST API
File System Shell
17
• The File System Shell includes various shell-like commands that directly interact with HDFS
• The FS shell is invoked by any of these scripts:
  – bin/hadoop fs
  – bin/hdfs dfs
• All FS Shell commands take URI paths as arguments:
  – scheme://authority/path
  – Available schemes: file (local FS), hdfs (HDFS)
  – If no scheme is specified, hdfs is assumed
• It is necessary to connect to the cluster via SSH
• Full commands reference:
  – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
File System Shell examples
18
$ hadoop fs -cat webinar/abriefhistoryoftime_page1
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ hadoop fs -mkdir webinar/afolder
$ hadoop fs -ls webinar
Found 4 items
-rw-r--r--   3 frb cosmos 3431 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page1
-rw-r--r--   3 frb cosmos 1604 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page2
-rw-r--r--   3 frb cosmos 5257 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page3
drwxr-xr-x   - frb cosmos    0 2015-03-10 11:09 /user/frb/webinar/afolder
$ hadoop fs -rmr webinar/afolder
Deleted hdfs://cosmosmaster-gi/user/frb/webinar/afolder
HTTP REST API
19
• The HTTP REST API supports the complete File System interface for HDFS
• It relies on the webhdfs scheme for URIs:
  – webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URLs are built as:
  – http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=…
• Full API specification:
  – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
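The URL pattern above is mechanical enough to capture in a tiny helper. This is an illustrative sketch, not part of any Hadoop client library; host, port and user names are placeholders taken from the examples that follow:

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, user, **params):
    """Build a WebHDFS/HttpFS URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..."""
    query = {"op": op, "user.name": user, **params}
    return "http://%s:%d/webhdfs/v1/%s?%s" % (host, port, path.lstrip("/"), urlencode(query))

print(webhdfs_url("cosmos.lab.fi-ware.org", 14000, "user/frb/webinar", "liststatus", "frb"))
```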
HTTP REST API examples
20
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/abriefhistoryoftime_page1?op=open&user.name=frb"
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ curl -X PUT "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb"
{"boolean":true}
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar?op=liststatus&user.name=frb"
{"FileStatuses":{"FileStatus":[{"pathSuffix":"abriefhistoryoftime_page1","type":"FILE","length":3431,"owner":"frb","group":"cosmos","permission":"644","accessTime":1425995831489,"modificationTime":1418216412441,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page2","type":"FILE","length":1604,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412460,"modificationTime":1418216412500,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page3","type":"FILE","length":5257,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412515,"modificationTime":1418216412551,"blockSize":67108864,"replication":3},{"pathSuffix":"afolder","type":"DIRECTORY","length":0,"owner":"frb","group":"cosmos","permission":"755","accessTime":0,"modificationTime":1425995941361,"blockSize":0,"replication":0}]}}
$ curl -X DELETE "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=delete&user.name=frb"
{"boolean":true}
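Since every response is plain JSON, a client can post-process the liststatus output with any JSON library. A sketch over a trimmed version of the response shown above (only a few fields kept for brevity):

```python
import json

# Trimmed liststatus response (same shape as the full one above)
response = """{"FileStatuses":{"FileStatus":[
  {"pathSuffix":"abriefhistoryoftime_page1","type":"FILE","length":3431},
  {"pathSuffix":"abriefhistoryoftime_page2","type":"FILE","length":1604},
  {"pathSuffix":"afolder","type":"DIRECTORY","length":0}]}}"""

statuses = json.loads(response)["FileStatuses"]["FileStatus"]
# Keep only the regular files, mapping name -> size in bytes
files = {s["pathSuffix"]: s["length"] for s in statuses if s["type"] == "FILE"}
print(files)
print(sum(files.values()))  # total bytes of the listed files
```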
21
Feeding HDFS: Cygnus
Cygnus FAQ
22
• What is it for?
  – Cygnus is a connector in charge of persisting Orion context data in certain configured third-party storages, creating a historical view of such data. In other words, Orion only stores the last value of an entity's attribute; if older values are required, they must be persisted in another storage, value by value, using Cygnus.
• How does it receive context data from the Orion Context Broker?
  – Cygnus uses the subscription/notification feature of Orion. A subscription is made in Orion on behalf of Cygnus, detailing which entities we want to be notified about when an update occurs on any of their attributes.
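The kind of subscription made on behalf of Cygnus can be sketched as an NGSI10 subscribeContext payload. The entity, attribute and host names below are illustrative, not taken from a real deployment:

```python
import json

# Illustrative NGSI10 subscription: ask Orion to notify Cygnus (listening on
# /notify, port 5050 as configured later in this deck) whenever Room1's
# temperature changes. Entity and host names are examples only.
subscription = {
    "entities": [{"type": "Room", "isPattern": "false", "id": "Room1"}],
    "attributes": ["temperature"],
    "reference": "http://cygnus-host:5050/notify",
    "duration": "P1M",
    "notifyConditions": [{"type": "ONCHANGE", "condValues": ["temperature"]}],
}
print(json.dumps(subscription, indent=2))
```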
Cygnus FAQ
23
• Which storages is it able to integrate?
  – The current stable release is able to persist Orion context data in:
    • HDFS, the Hadoop distributed file system
    • MySQL, the well-known relational database manager
    • CKAN, an Open Data platform
    • STH, the Short-Term Historic
• Which is its architecture?
  – Internally, Cygnus is based on Apache Flume. In fact, Cygnus is a Flume agent, which is basically composed of a source in charge of receiving the data, a channel where the source puts the data once it has been transformed into a Flume event, and a sink, which takes Flume events from the channel in order to persist the data within its body into a third-party storage.
Basic Cygnus agent
24
Configure a basic Cygnus agent
25
• Edit /usr/cygnus/conf/agent_<id>.conf
• List of sources, channels and sinks:

cygnusagent.sources = http-source
cygnusagent.sinks = hdfs-sink
cygnusagent.channels = hdfs-channel

• Channels configuration:

cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 1000
cygnusagent.channels.hdfs-channel.transactionCapacity = 100
Configure a basic Cygnus agent
26
• Sources configuration:

cygnusagent.sources.http-source.channels = hdfs-channel
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
cygnusagent.sources.http-source.port = 5050
cygnusagent.sources.http-source.handler = es.tid.fiware.fiwareconnectors.cygnus.handlers.OrionRestHandler
cygnusagent.sources.http-source.handler.notification_target = /notify
cygnusagent.sources.http-source.handler.default_service = def_serv
cygnusagent.sources.http-source.handler.default_service_path = def_servpath
cygnusagent.sources.http-source.handler.events_ttl = 10
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
Configure a basic Cygnus agent
27
• Sinks configuration:

cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink.type = es.tid.fiware.fiwareconnectors.cygnus.sinks.OrionHDFSSink
cygnusagent.sinks.hdfs-sink.cosmos_host = cosmos.lab.fi-ware.org
cygnusagent.sinks.hdfs-sink.cosmos_port = 14000
cygnusagent.sinks.hdfs-sink.cosmos_default_username = cosmos_username
cygnusagent.sinks.hdfs-sink.cosmos_default_password = xxxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.hdfs_api = httpfs
cygnusagent.sinks.hdfs-sink.attr_persistence = column
cygnusagent.sinks.hdfs-sink.hive_host = cosmos.lab.fi-ware.org
cygnusagent.sinks.hdfs-sink.hive_port = 10000
cygnusagent.sinks.hdfs-sink.krb5_auth = false
HDFS details regarding Cygnus persistence
28
• By default, for each entity Cygnus stores the data at:
  – /user/<your_user>/<service>/<service-path>/<entity-id>-<entity-type>/<entity-id>-<entity-type>.txt
• Within each HDFS file, the data format may be json-row or json-column:
  – json-row
    {
      "recvTimeTs": "13453464536",
      "recvTime": "2014-02-27T14:46:21",
      "entityId": "Room1",
      "entityType": "Room",
      "attrName": "temperature",
      "attrType": "centigrade",
      "attrValue": "26.5",
      "attrMd": [ … ]
    }
  – json-column
    {
      "recvTime": "2014-02-27T14:46:21",
      "temperature": "26.5",
      "temperature_md": [ … ],
      "pressure": "90",
      "pressure_md": [ … ]
    }
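The difference between the two formats can be sketched by flattening one notified attribute list both ways. Attribute names and values follow the Room example above; the code is illustrative, not Cygnus itself:

```python
# Flatten one notification of Room1's attributes in both Cygnus formats.
recv_time = "2014-02-27T14:46:21"
attrs = [("temperature", "centigrade", "26.5"), ("pressure", "mmHg", "90")]

# json-row: one record per attribute value
rows = [{"recvTime": recv_time, "entityId": "Room1", "entityType": "Room",
         "attrName": name, "attrType": atype, "attrValue": value}
        for (name, atype, value) in attrs]

# json-column: one record per notification, one column per attribute
column = {"recvTime": recv_time}
for (name, _atype, value) in attrs:
    column[name] = value

print(len(rows))          # one row per attribute
print(column["pressure"])
```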
Advanced features (version 0.7.1)
29
• Round-Robin channel selection
• Pattern-based context data grouping
• Kerberos authentication
• Management Interface (roadmap)
• Multi-tenancy support (roadmap)
• Entity model-based persistence (roadmap)
Round Robin channel selection
30
• It is possible to configure more than one channel-sink pair for each storage, in order to increase the performance
• A custom ChannelSelector is needed
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/performance_tuning_tips.md
RoundRobinChannelSelector configuration
31
cygnusagent.sources = mysource
cygnusagent.sinks = mysink1 mysink2 mysink3
cygnusagent.channels = mychannel1 mychannel2 mychannel3

cygnusagent.sources.mysource.type = ...
cygnusagent.sources.mysource.channels = mychannel1 mychannel2 mychannel3
cygnusagent.sources.mysource.selector.type = es.tid.fiware.fiwareconnectors.cygnus.channelselectors.RoundRobinChannelSelector
cygnusagent.sources.mysource.selector.storages = N
cygnusagent.sources.mysource.selector.storages.storage1 = <subset_of_cygnusagent.sources.mysource.channels>
...
cygnusagent.sources.mysource.selector.storages.storageN = <subset_of_cygnusagent.sources.mysource.channels>
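The observable effect of the selector is plain round-robin dispatch across the channels configured for a storage. A behavioural sketch in Python (not the actual Flume class):

```python
from itertools import cycle

# Behavioural sketch: events for one storage are dealt in turn
# across its configured channels.
channels = ["mychannel1", "mychannel2", "mychannel3"]
pick = cycle(channels)

# Dispatch five consecutive events
assignment = [next(pick) for _ in range(5)]
print(assignment)
```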
Pattern-based Context Data Grouping
32
• The default destination (HDFS file, MySQL table or CKAN resource) is obtained as a concatenation:
  – destination=<entity_id>-<entity_type>
• It is possible to group different context data thanks to this regex-based feature, implemented as a Flume interceptor:

cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
Matching table for pattern-based grouping
33
• CSV file ('|' field separator) containing rules:
  – <id>|<comma-separated_fields>|<regex>|<destination>|<destination_dataset>
• For instance:

1|entityId,entityType|Room\.(\d*)Room|numeric_rooms|rooms
2|entityId,entityType|Room\.(\D*)Room|character_rooms|rooms
3|entityType,entityId|RoomRoom\.(\D*)|character_rooms|rooms
4|entityType|Room|other_rooms|rooms
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/design/interceptors.md#destinationextractor-interceptor
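Each rule concatenates the listed fields in order, tests the regex against the result, and rewrites the destination on a match. A hypothetical re-implementation of that matching logic, using the first two rules above (this helper is illustrative, not the DestinationExtractor code):

```python
import re

# Hypothetical re-implementation of the matching logic: the listed fields are
# concatenated in order and tested against the rule's regex.
rules = [
    # (fields, regex, destination, dataset) -- rules 1 and 2 above
    (["entityId", "entityType"], r"Room\.(\d*)Room", "numeric_rooms", "rooms"),
    (["entityId", "entityType"], r"Room\.(\D*)Room", "character_rooms", "rooms"),
]

def destination(event, rules):
    for fields, regex, dest, dataset in rules:
        if re.match(regex, "".join(event[f] for f in fields)):
            return dest, dataset
    # no rule matched: default <entity_id>-<entity_type> destination
    return "%s-%s" % (event["entityId"], event["entityType"]), None

print(destination({"entityId": "Room.12", "entityType": "Room"}, rules))
print(destination({"entityId": "Room.left", "entityType": "Room"}, rules))
```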
Kerberos authentication
34
• HDFS may be secured with Kerberos for authentication purposes
• Cygnus is able to persist on a kerberized HDFS if the configured HDFS user has a registered Kerberos principal and this configuration is added:

cygnusagent.sinks.hdfs-sink.krb5_auth = true
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_user = krb5_username
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_password = xxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_login_file = /usr/cygnus/conf/krb5_login.conf
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_conf_file = /usr/cygnus/conf/krb5.conf
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/hdfs_kerberos_authentication.md
35
Distributed computing:
The Hadoop reference
Hadoop was created by Doug Cutting at Yahoo!...
36
… based on the MapReduce patent by Google
Well, MapReduce was really invented by Julius Caesar
37
Divide et impera*
* Divide and conquer
An example
38
How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Animated diagram, spanning slides 38-42: several mappers read the books in parallel; each mapper emits the page count of every Latin book it finds (LATIN, ref 1, 45 pages; LATIN, ref 4, 73 pages; LATIN, ref 5, 34 pages) and skips the Greek and Egyptian ones (refs 2, 3, 6, 7, 8). The reducer accumulates the emitted values as they arrive: 45 (ref 1) + 73 (ref 4) + 34 (ref 5) = 152 pages TOTAL.]
MapReduce applications
43
• MapReduce applications are commonly written in Java
  – They can be written in other languages through Hadoop Streaming
• They are executed from the command line:
  $ hadoop jar <jar-file> <main-class> <input-dir> <output-dir>
• A MapReduce job consists of:
  – A driver, a piece of software where inputs, outputs, formats, etc. are defined, and the entry point for launching the job
  – A set of Mappers, given by a piece of software defining their behaviour
  – A set of Reducers, given by a piece of software defining their behaviour
• There are two APIs:
  – org.apache.hadoop.mapred (the old one)
  – org.apache.hadoop.mapreduce (the new one)
• Hadoop is distributed with MapReduce examples:
  – [HADOOP_HOME]/hadoop-examples.jar
Implementing the example
44
• The input will be a single big file containing:

symbolae botanicae,latin,230
mathematica,greek,95
physica,greek,109
ptolomaics,egyptian,120
terra,latin,541
iustitia est vincit,latin,134

• The mappers will receive pieces of the above file, which will be read line by line
  – Each line is represented by a (key,value) pair, i.e. the offset in the file and the real data within the line, respectively
  – For each input pair, a (key,value) pair is output, i.e. a common "num_pages" key and the third field in the line
• The reducers will receive arrays of pairs produced by the mappers, all having the same key ("num_pages")
  – For each array of pairs, the sum of the values is output as a (key,value) pair, in this case a "total_pages" key and the sum as value
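Before packaging the Java job, the logic above can be prototyped in a few lines by simulating the map and reduce phases over the sample input (a sketch of the algorithm, not Hadoop code):

```python
# Pure-Python simulation of the Latin page-count job described above.
input_lines = [
    "symbolae botanicae,latin,230",
    "mathematica,greek,95",
    "physica,greek,109",
    "ptolomaics,egyptian,120",
    "terra,latin,541",
    "iustitia est vincit,latin,134",
]

def mapper(line):
    """Emit ('num_pages', pages) for Latin books only."""
    _title, language, pages = line.split(",")
    if language == "latin":
        yield ("num_pages", int(pages))

def reducer(key, values):
    """Sum all the values sharing the same key."""
    return ("total_pages", sum(values))

# Shuffle phase collapsed: all pairs share the "num_pages" key
pairs = [kv for line in input_lines for kv in mapper(line)]
key, total = reducer("num_pages", [v for _k, v in pairs])
print(key, total)  # 230 + 541 + 134 Latin pages
```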
Implementing the example: JCMapper.class
45
public static class JCMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final Text globalKey = new Text("num_pages");
    private final IntWritable bookPages = new IntWritable();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        System.out.println("Processing " + fields[0]);

        if (fields[1].equals("latin")) {
            bookPages.set(Integer.parseInt(fields[2]));
            context.write(globalKey, bookPages);
        } // if
    } // map
} // JCMapper
Implementing the example: JCReducer.class
46
public static class JCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable totalPages = new IntWritable();

    @Override
    public void reduce(Text globalKey, Iterable<IntWritable> bookPages, Context context)
            throws IOException, InterruptedException {
        int sum = 0;

        for (IntWritable val : bookPages) {
            sum += val.get();
        } // for

        totalPages.set(sum);
        context.write(globalKey, totalPages);
    } // reduce
} // JCReducer
Implementing the example: JC.class
47
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new JC(), args);
    System.exit(res);
} // main

@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "julius caesar");
    job.setJarByClass(JC.class);
    job.setMapperClass(JCMapper.class);
    job.setCombinerClass(JCReducer.class);
    job.setReducerClass(JCReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
} // run
48
Querying tools: Hive
Querying tools
49
• The MapReduce paradigm may be hard to understand and, worse, to use
• Indeed, many data analysts just need to query the data
  – If possible, by using already well-known languages
• For that reason, some querying tools appeared in the Hadoop ecosystem:
  – Hive, with its HiveQL language, quite similar to SQL
  – Pig, with its Pig Latin, a new language
Hive and HiveQL
50
• HiveQL reference:
  – https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• All the data is loaded into Hive tables
  – Not real tables (they don't contain the real data) but metadata pointing to the real data in HDFS
• The best thing is that Hive uses pre-defined MapReduce jobs behind the scenes for:
  • Column selection
  • Fields grouping
  • Table joining
  • Values filtering
  • …
• Important remark: since Hive uses MapReduce, queries may take some time to produce a result
Hive CLI
51
$ hive
hive history file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive> select column1,column2,otherColumns from mytable where column1='whatever' and columns2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=cosmosmaster-gi:8021 -kill job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%
Hive Java API
52
• The Hive CLI is OK for human-driven testing purposes
• But it is not usable by remote applications
  – Hive has no REST API
  – Hive has several drivers and libraries:
    • JDBC for Java
    • Python
    • PHP
    • ODBC for C/C++
    • Thrift for Java and C++
  – https://cwiki.apache.org/confluence/display/Hive/HiveClient
• A remote Hive client usually performs:
  • A connection to the Hive server (TCP/10000)
  • The query execution
Hive Java API: get a connection
53
private static Connection getConnection(String ip, String port, String user, String password) {
    try {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    } catch (ClassNotFoundException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch

    try {
        return DriverManager.getConnection("jdbc:hive://" + ip + ":" + port
                + "/default?user=" + user + "&password=" + password);
    } catch (SQLException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch
} // getConnection
Hive Java API: do the query
54
private static void doQuery() {
    try {
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select column1,column2,"
                + "otherColumns from mytable where "
                + "column1='whatever' and "
                + "columns2 like '%whatever%'");

        while (res.next()) {
            String column1 = res.getString(1);
            int column2 = res.getInt(2);
        } // while

        res.close();
        stmt.close();
        con.close();
    } catch (SQLException e) {
        System.exit(0);
    } // try catch
} // doQuery
Hive tables creation
55
• Both locally using the CLI and remotely using the Java API, use this command:
  hive> create external table...
• CSV-like HDFS files:
  hive> create external table <table_name> (<field1_name> <field1_type>, ..., <fieldN_name> <fieldN_type>) row format delimited fields terminated by '<separator>' location '/user/<username>/<path>/<to>/<the>/<data>';
• JSON-like HDFS files:
  hive> create external table <table_name> (<field1_name> <field1_type>, ..., <fieldN_name> <fieldN_type>) row format serde 'org.openx.data.jsonserde.JsonSerDe' location '/user/<username>/<path>/<to>/<the>/<data>';
• Cygnus automatically creates the Hive tables for the entities it is notified about!
56
Tidoop: Extensions for Hadoop
Non HDFS inputs for MapReduce
57
• TIDOOP comprises all the developments related to Hadoop made by Telefónica Investigación y Desarrollo
  – https://github.com/telefonicaid/fiware-tidoop
• Part of this repo are the extensions that allow using generic non-HDFS data:
  – https://github.com/telefonicaid/fiware-tidoop/tree/develop/tidoop-hadoop-ext
• Non-HDFS sources:
  – CKAN (beta available)
  – MongoDB-based Short-Term Historic (roadmap)
• Other specific public repos UNDER STUDY… suggestions accepted
Non HDFS inputs for MapReduce: driver
58
@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "ckan mr");
    job.setJarByClass(CKANMapReduceExample.class);
    job.setMapperClass(CKANMapper.class);
    job.setCombinerClass(CKANReducer.class);
    job.setReducerClass(CKANReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // start of the relevant part
    job.setInputFormatClass(CKANInputFormat.class);
    CKANInputFormat.addCKANInput(job, ckanInputs);
    CKANInputFormat.setCKANEnvironment(job, ckanHost, ckanPort, sslEnabled, ckanAPIKey);
    CKANInputFormat.setCKANSplitsLength(job, splitsLength);
    // end of the relevant part
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
} // run
59
Tidoop: MapReduce Library
MapReduce Library
60
• Under Tidoop, there is a library of predefined general-purpose MapReduce jobs:
  – https://github.com/telefonicaid/fiware-tidoop/tree/develop/tidoop-mr-lib
• Jobs within the library may be used as any other MapReduce job, but they can be used remotely through a REST API as well:
  – https://github.com/telefonicaid/fiware-tidoop/tree/develop/tidoop-mr-lib-api
• Available jobs:
  – Filter by regex
  – MapOnly (mapping function as parameter)
• Currently working on adding more and more jobs
61
Big Data services at FIWARE LAB
Cosmos
Cosmos storage services
62
• Currently:
  – Hadoop cluster combining HDFS storage and MapReduce computing in the same nodes
  – 10 virtual nodes
  – 1 TB capacity (333 GB real capacity, replicas=3)
  – OAuth2-based secure access to WebHDFS/HttpFS
• Upcoming:
  – Specific HDFS cluster for storage
  – 6 physical nodes (+ 6 more planned)
  – 20 TB capacity (6.7 TB real capacity, replicas=3)
  – OAuth2-based secure access to WebHDFS/HttpFS
  – Support for CKAN for Open Big Data
Cosmos computing services
63
• Currently:
  – Hadoop cluster combining HDFS storage and MapReduce computing in the same nodes
  – 10 virtual nodes
  – 100 GB capacity
  – MapReduce and Hive available
  – No queues, no resource reservation mechanisms
• Upcoming:
  – Specific HDFS cluster for computing
  – 8 physical nodes (+ 8 more planned)
  – 256 GB capacity
  – MapReduce, Hive and Tidoop available
  – Reservation mechanisms through YARN queues
Cosmos portal
64
• Currently:
  – http://cosmos.lab.fi-ware.org/cosmos-gui/
  – FIWARE Lab credentials are directly requested
  – An HDFS userspace is created in the shared Hadoop cluster
• Upcoming:
  – Real delegated authentication
  – An HDFS userspace is created in the storage cluster
  – Features to deploy and run MapReduce jobs
Thanks!