
Page 1: FIWARE FWD big_data_all_in_1_v1

FIWARE Developer’s Week
Big Data GE

[email protected]

Page 2: FIWARE FWD big_data_all_in_1_v1

2

10,000 feet view:
Big Data in the FIWARE architecture

Page 3: FIWARE FWD big_data_all_in_1_v1

[Architecture figure: IoT agents feed real-time, last-value context data into Orion CB, where apps read it; Cygnus persists historical context data into HDFS and other storages, on top of which MapReduce, Hive and Tidoop operate.]

Page 4: FIWARE FWD big_data_all_in_1_v1

4

Big Data:
What is it and how much data is there

Page 5: FIWARE FWD big_data_all_in_1_v1

What is big data?

5

> small data

Page 6: FIWARE FWD big_data_all_in_1_v1

What is big data?

6

> big data

http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg

Page 7: FIWARE FWD big_data_all_in_1_v1

How much data is there?

7

Page 8: FIWARE FWD big_data_all_in_1_v1

Data growth forecast

8

Forecast 2012 → 2017 (Cisco):

Global users (billions):             2.3  → 3.6
Global networked devices (billions): 12   → 19
Global broadband speed (Mbps):       11.3 → 39
Global traffic (zettabytes):         0.5  → 1.4

http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast

1 zettabyte = 10^21 bytes = 1,000,000,000,000,000,000,000 bytes

Page 9: FIWARE FWD big_data_all_in_1_v1

9

How to deal with it:
Distributed storage and computing

Page 10: FIWARE FWD big_data_all_in_1_v1

What happens if one shelf is not enough?

10

You buy more shelves…

Page 11: FIWARE FWD big_data_all_in_1_v1

… then you create an index

11

“The Avengers”, 1-100, shelf 1
“The Avengers”, 101-125, shelf 2
“Superman”, 1-50, shelf 2
“X-Men”, 1-100, shelf 3
“X-Men”, 101-200, shelf 4
“X-Men”, 201-225, shelf 5

[Figure: the indexed volumes of “The Avengers”, “Superman” and “X-Men” spread across five shelves: distributed storage!]

Page 12: FIWARE FWD big_data_all_in_1_v1

… and you call your friends to read everything!

12

Distributed computing!

Page 13: FIWARE FWD big_data_all_in_1_v1

13

Distributed storage:
The Hadoop reference

Page 14: FIWARE FWD big_data_all_in_1_v1

Hadoop Distributed File System (HDFS)

14

• Based on the Google File System
• Large files are stored across multiple machines (Datanodes) by splitting them into blocks that are distributed
• Metadata is managed by the Namenode
• Scalable by simply adding more Datanodes
• Fault-tolerant, since HDFS replicates each block (3 replicas by default; see the fsck sketch below)
• Security based on authentication (Kerberos) and authorization (permissions, HDFS ACLs)
• It is managed like a Unix-like file system
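A quick way to see the blocks and replicas just described is the fsck tool; a minimal sketch, assuming the webinar data used later in this deck (paths are illustrative):

$ hadoop fsck /user/frb/webinar -files -blocks -locations   # report each file's blocks and the Datanodes holding them
$ hadoop fs -setrep 2 /user/frb/webinar/abriefhistoryoftime_page1   # change the replication factor of a single file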

Page 15: FIWARE FWD big_data_all_in_1_v1

HDFS architecture

15

[Figure: the Namenode keeps the metadata table below, while the blocks themselves are spread over Datanode0 … DatanodeN across Rack 1 and Rack 2.]

Path                     Replicas   Block IDs
/user/user1/data/file0   2          1,3
/user/user1/data/file1   3          2,4,5
…                        …          …

Page 16: FIWARE FWD big_data_all_in_1_v1

16

Managing HDFS:
File System Shell
HTTP REST API

Page 17: FIWARE FWD big_data_all_in_1_v1

File System Shell

17

• The File System Shell includes various shell-like commands that directly interact with HDFS
• The FS shell is invoked by any of these scripts:
  – bin/hadoop fs
  – bin/hdfs dfs
• All FS Shell commands take URI paths as arguments:
  – scheme://authority/path
  – Available schemes: file (local FS), hdfs (HDFS)
  – If nothing is specified, hdfs is assumed
• It is necessary to connect to the cluster via SSH
• Full commands reference:
  – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

Page 18: FIWARE FWD big_data_all_in_1_v1

File System Shell examples

18

$ hadoop fs -cat webinar/abriefhistoryoftime_page1
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ hadoop fs -mkdir webinar/afolder
$ hadoop fs -ls webinar
Found 4 items
-rw-r--r--   3 frb cosmos 3431 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page1
-rw-r--r--   3 frb cosmos 1604 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page2
-rw-r--r--   3 frb cosmos 5257 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page3
drwxr-xr-x   - frb cosmos    0 2015-03-10 11:09 /user/frb/webinar/afolder
$ hadoop fs -rmr webinar/afolder
Deleted hdfs://cosmosmaster-gi/user/frb/webinar/afolder
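The session above assumes the files are already in HDFS; a hedged sketch of how they could have been uploaded from the local file system (local file names are illustrative):

$ hadoop fs -put abriefhistoryoftime_page1 webinar/   # copy a local file into the webinar HDFS directory
$ hadoop fs -ls webinar                               # check the upload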

Page 19: FIWARE FWD big_data_all_in_1_v1

HTTP REST API

19

• The HTTP REST API supports the complete File System interface for HDFS
• It relies on the webhdfs scheme for URIs:
  – webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URLs are built as:
  – http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=…
• Full API specification:
  – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

Page 20: FIWARE FWD big_data_all_in_1_v1

HTTP REST API examples

20

$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/abriefhistoryoftime_page1?op=open&user.name=frb"
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ curl -X PUT "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb"
{"boolean":true}
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar?op=liststatus&user.name=frb"
{"FileStatuses":{"FileStatus":[{"pathSuffix":"abriefhistoryoftime_page1","type":"FILE","length":3431,"owner":"frb","group":"cosmos","permission":"644","accessTime":1425995831489,"modificationTime":1418216412441,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page2","type":"FILE","length":1604,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412460,"modificationTime":1418216412500,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page3","type":"FILE","length":5257,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412515,"modificationTime":1418216412551,"blockSize":67108864,"replication":3},{"pathSuffix":"afolder","type":"DIRECTORY","length":0,"owner":"frb","group":"cosmos","permission":"755","accessTime":0,"modificationTime":1425995941361,"blockSize":0,"replication":0}]}}
$ curl -X DELETE "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=delete&user.name=frb"
{"boolean":true}
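The API also supports writing files; a hedged sketch of the documented two-step creation, where the first request answers with a redirect and the data is then uploaded to the returned location (file names are illustrative):

$ curl -i -X PUT "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/newfile?op=create&user.name=frb"
# take the Location header from the response above, then upload the data there:
$ curl -i -X PUT -T localfile -H "Content-Type: application/octet-stream" "<Location_header_from_previous_response>"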

Page 21: FIWARE FWD big_data_all_in_1_v1

21

Feeding HDFS:
Cygnus

Page 22: FIWARE FWD big_data_all_in_1_v1

Cygnus FAQ

22

• What is it for?
  – Cygnus is a connector in charge of persisting Orion context data in certain configured third-party storages, creating a historical view of such data. In other words, Orion only stores the last value of an entity's attribute; if older values are required, they have to be persisted in another storage, value by value, using Cygnus.
• How does it receive context data from Orion Context Broker?
  – Cygnus uses the subscription/notification feature of Orion. A subscription is made in Orion on behalf of Cygnus, detailing which entities we want to be notified about when an update occurs on any of their attributes (see the sketch below).
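A hedged sketch of such a subscription, assuming Orion listens on port 1026, Cygnus on port 5050, and a Room1 entity of type Room with a temperature attribute (all of them illustrative):

$ curl http://<orion_host>:1026/v1/subscribeContext -s -S \
    -H 'Content-Type: application/json' -H 'Accept: application/json' -d '
{
    "entities": [{"type": "Room", "isPattern": "false", "id": "Room1"}],
    "attributes": ["temperature"],
    "reference": "http://<cygnus_host>:5050/notify",
    "duration": "P1M",
    "notifyConditions": [{"type": "ONCHANGE", "condValues": ["temperature"]}],
    "throttling": "PT5S"
}'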

Page 23: FIWARE FWD big_data_all_in_1_v1

Cygnus FAQ

23

• Which storages is it able to integrate?
  – The current stable release is able to persist Orion context data in:
    • HDFS, the Hadoop distributed file system.
    • MySQL, the well-known relational database manager.
    • CKAN, an Open Data platform.
    • STH, the Short-Term Historic.
• Which is its architecture?
  – Internally, Cygnus is based on Apache Flume. In fact, Cygnus is a Flume agent, which is basically composed of a source in charge of receiving the data, a channel where the source puts the data once it has been transformed into a Flume event, and a sink, which takes Flume events from the channel in order to persist the data within its body into a third-party storage.

Page 24: FIWARE FWD big_data_all_in_1_v1

Basic Cygnus agent

24

Page 25: FIWARE FWD big_data_all_in_1_v1

Configure a basic Cygnus agent

25

• Edit /usr/cygnus/conf/agent_<id>.conf
• List of sources, channels and sinks:

cygnusagent.sources = http-source
cygnusagent.sinks = hdfs-sink
cygnusagent.channels = hdfs-channel

• Channels configuration:

cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 1000
cygnusagent.channels.hdfs-channel.transactionCapacity = 100

Page 26: FIWARE FWD big_data_all_in_1_v1

Configure a basic Cygnus agent

26

• Sources configuration:

cygnusagent.sources.http-source.channels = hdfs-channel
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
cygnusagent.sources.http-source.port = 5050
cygnusagent.sources.http-source.handler = es.tid.fiware.fiwareconnectors.cygnus.handlers.OrionRestHandler
cygnusagent.sources.http-source.handler.notification_target = /notify
cygnusagent.sources.http-source.handler.default_service = def_serv
cygnusagent.sources.http-source.handler.default_service_path = def_servpath
cygnusagent.sources.http-source.handler.events_ttl = 10
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf

Page 27: FIWARE FWD big_data_all_in_1_v1

Configure a basic Cygnus agent

27

• Sinks configuration:

cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink.type = es.tid.fiware.fiwareconnectors.cygnus.sinks.OrionHDFSSink
cygnusagent.sinks.hdfs-sink.cosmos_host = cosmos.lab.fi-ware.org
cygnusagent.sinks.hdfs-sink.cosmos_port = 14000
cygnusagent.sinks.hdfs-sink.cosmos_default_username = cosmos_username
cygnusagent.sinks.hdfs-sink.cosmos_default_password = xxxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.hdfs_api = httpfs
cygnusagent.sinks.hdfs-sink.attr_persistence = column
cygnusagent.sinks.hdfs-sink.hive_host = cosmos.lab.fi-ware.org
cygnusagent.sinks.hdfs-sink.hive_port = 10000
cygnusagent.sinks.hdfs-sink.krb5_auth = false
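Once the three blocks are configured, the agent can be started and smoke-tested by posting a notification by hand; a hedged sketch, assuming the launch script shipped with Cygnus and an NGSI v1 notification body (all values are illustrative):

$ /usr/cygnus/bin/cygnus-flume-ng agent --conf /usr/cygnus/conf -f /usr/cygnus/conf/agent_<id>.conf -n cygnusagent -Dflume.root.logger=INFO,console

$ curl http://localhost:5050/notify -s -S \
    -H 'Content-Type: application/json' \
    -H 'Fiware-Service: def_serv' -H 'Fiware-ServicePath: def_servpath' -d '
{
    "subscriptionId": "51c0ac9ed714fb3b37d7d5a8",
    "originator": "localhost",
    "contextResponses": [{
        "contextElement": {
            "type": "Room", "isPattern": "false", "id": "Room1",
            "attributes": [{"name": "temperature", "type": "centigrade", "value": "26.5"}]
        },
        "statusCode": {"code": "200", "reasonPhrase": "OK"}
    }]
}'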

Page 28: FIWARE FWD big_data_all_in_1_v1

HDFS details regarding Cygnus persistence

28

• By default, for each entity Cygnus stores the data at:
  – /user/<your_user>/<service>/<service-path>/<entity-id>-<entity-type>/<entity-id>-<entity-type>.txt
• Within each HDFS file, the data format may be json-row or json-column:
  – json-row
    {
      "recvTimeTs":"13453464536",
      "recvTime":"2014-02-27T14:46:21",
      "entityId":"Room1",
      "entityType":"Room",
      "attrName":"temperature",
      "attrType":"centigrade",
      "attrValue":"26.5",
      "attrMd":[ … ]
    }
  – json-column
    {
      "recvTime":"2014-02-27T14:46:21",
      "temperature":"26.5",
      "temperature_md":[ … ],
      "pressure":"90",
      "pressure_md":[ … ]
    }

Page 29: FIWARE FWD big_data_all_in_1_v1

Advanced features (version 0.7.1)

29

• Round-Robin channel selection
• Pattern-based context data grouping
• Kerberos authentication
• Management Interface (roadmap)
• Multi-tenancy support (roadmap)
• Entity model-based persistence (roadmap)

Page 30: FIWARE FWD big_data_all_in_1_v1

Round Robin channel selection

30

• It is possible to configure more than one channel-sink pair for each storage, in order to increase the performance
• A custom ChannelSelector is needed
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/performance_tuning_tips.md

Page 31: FIWARE FWD big_data_all_in_1_v1

RoundRobinChannelSelector configuration

31

cygnusagent.sources = mysource
cygnusagent.sinks = mysink1 mysink2 mysink3
cygnusagent.channels = mychannel1 mychannel2 mychannel3

cygnusagent.sources.mysource.type = ...
cygnusagent.sources.mysource.channels = mychannel1 mychannel2 mychannel3
cygnusagent.sources.mysource.selector.type = es.tid.fiware.fiwareconnectors.cygnus.channelselectors.RoundRobinChannelSelector
cygnusagent.sources.mysource.selector.storages = N
cygnusagent.sources.mysource.selector.storages.storage1 = <subset_of_cygnusagent.sources.mysource.channels>
...
cygnusagent.sources.mysource.selector.storages.storageN = <subset_of_cygnusagent.sources.mysource.channels>

Page 32: FIWARE FWD big_data_all_in_1_v1

Pattern-based Context Data Grouping

32

• The default destination (HDFS file, MySQL table or CKAN resource) is obtained as a concatenation:
  – destination=<entity_id>-<entity_type>
• It is possible to group different context data thanks to this regex-based feature, implemented as a Flume interceptor:

cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf

Page 33: FIWARE FWD big_data_all_in_1_v1

Matching table for pattern-based grouping

33

• CSV file (‘|’ field separator) containing rules:
  – <id>|<comma-separated_fields>|<regex>|<destination>|<destination_dataset>
• For instance:
  1|entityId,entityType|Room\.(\d*)Room|numeric_rooms|rooms
  2|entityId,entityType|Room\.(\D*)Room|character_rooms|rooms
  3|entityType,entityId|RoomRoom\.(\D*)|character_rooms|rooms
  4|entityType|Room|other_rooms|rooms
• E.g. an entity with id Room.12 and type Room concatenates to "Room.12Room", which matches rule 1, so its data goes to the numeric_rooms destination within the rooms dataset

• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/design/interceptors.md#destinationextractor-interceptor

Page 34: FIWARE FWD big_data_all_in_1_v1

Kerberos authentication

34

• HDFS may be secured with Kerberos for authentication purposes
• Cygnus is able to persist on a kerberized HDFS if the configured HDFS user has a registered Kerberos principal and this configuration is added:

cygnusagent.sinks.hdfs-sink.krb5_auth = true
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_user = krb5_username
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_password = xxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_login_file = /usr/cygnus/conf/krb5_login.conf
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_conf_file = /usr/cygnus/conf/krb5.conf

• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/hdfs_kerberos_authentication.md
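Before enabling this, it may help to verify that the principal works at all; a hedged sketch using the standard Kerberos client tools (the user name is illustrative):

$ kinit krb5_username   # obtain a ticket-granting ticket for the configured principal
$ klist                 # list the tickets currently held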

Page 35: FIWARE FWD big_data_all_in_1_v1

35

Distributed computing:
The Hadoop reference

Page 36: FIWARE FWD big_data_all_in_1_v1

Hadoop was created by Doug Cutting at Yahoo!...

36

… based on Google’s MapReduce paper

Page 37: FIWARE FWD big_data_all_in_1_v1

Well, MapReduce was really invented by Julius Caesar

37

Divide et impera*

* Divide and conquer

Page 38: FIWARE FWD big_data_all_in_1_v1

An example

38

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: a set of mappers and a reducer. The mappers read book records (LATIN REF1 P45, GREEK REF2 P128, EGYPT REF3 P12, LATIN REF4 P73, LATIN REF5 P34, EGYPT REF6 P10, GREEK REF7 P20, GREEK REF8 P230); the first one emits "LATIN, pages 45", so the reducer holds 45 (ref 1) while the rest are still reading.]

Page 39: FIWARE FWD big_data_all_in_1_v1

An example

39

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: the mappers continue reading; the GREEK and EGYPTIAN books produce no output, so the reducer still holds 45 (ref 1).]

Page 40: FIWARE FWD big_data_all_in_1_v1

An example

40

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: the mappers emit "LATIN, pages 73" and "LATIN, pages 34"; the reducer accumulates 45 (ref 1) + 73 (ref 4) + 34 (ref 5).]

Page 41: FIWARE FWD big_data_all_in_1_v1

An example

41

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: the remaining GREEK books produce no output and some mappers go idle; the reducer keeps 45 (ref 1) + 73 (ref 4) + 34 (ref 5).]

Page 42: FIWARE FWD big_data_all_in_1_v1

An example

42

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: all the mappers are idle; the reducer emits the final sum: 45 (ref 1) + 73 (ref 4) + 34 (ref 5) = 152 pages in total.]

Page 43: FIWARE FWD big_data_all_in_1_v1

MapReduce applications

43

• MapReduce applications are commonly written in Java
  – They can be written in other languages through Hadoop Streaming
• They are executed in the command line:
  $ hadoop jar <jar-file> <main-class> <input-dir> <output-dir>
• A MapReduce job consists of:
  – A driver, a piece of software where to define inputs, outputs, formats, etc., and the entry point for launching the job
  – A set of Mappers, given by a piece of software defining their behaviour
  – A set of Reducers, given by a piece of software defining their behaviour
• There are 2 APIs:
  – org.apache.hadoop.mapred (the old one)
  – org.apache.hadoop.mapreduce (the new one)
• Hadoop is distributed with MapReduce examples (see the sketch below):
  – [HADOOP_HOME]/hadoop-examples.jar
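For instance, the bundled word count example can be run against the webinar data already uploaded to HDFS; a minimal sketch (output path and result file name are illustrative):

$ hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount webinar/abriefhistoryoftime_page1 wordcount_out
$ hadoop fs -cat wordcount_out/part-r-00000   # each line holds a word and its count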

Page 44: FIWARE FWD big_data_all_in_1_v1

Implementing the example

44

• The input will be a single big file containing:
  symbolae botanicae,latin,230
  mathematica,greek,95
  physica,greek,109
  ptolomaics,egyptian,120
  terra,latin,541
  iustitia est vincit,latin,134
• The mappers will receive pieces of the above file, which will be read line by line
  – Each line is represented by a (key,value) pair, i.e. the offset within the file and the real data within the line, respectively
  – For each input pair, a (key,value) pair will be output, i.e. a common "num_pages" key and the third field in the line
• The reducers will receive arrays of pairs produced by the mappers, all having the same key ("num_pages")
  – For each array of pairs, the sum of the values will be output as a (key,value) pair, in this case a "total_pages" key and the sum as value

Page 45: FIWARE FWD big_data_all_in_1_v1

Implementing the example: JCMapper.class

45

public static class JCMapper extends
        Mapper<Object, Text, Text, IntWritable> {

    private final Text globalKey = new Text("num_pages");
    private final IntWritable bookPages = new IntWritable();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        System.out.println("Processing " + fields[0]);

        if (fields[1].equals("latin")) {
            // the third CSV field holds the number of pages
            bookPages.set(Integer.parseInt(fields[2]));
            context.write(globalKey, bookPages);
        } // if
    } // map

} // JCMapper

Page 46: FIWARE FWD big_data_all_in_1_v1

Implementing the example: JCReducer.class

46

public static class JCReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable totalPages = new IntWritable();

    @Override
    public void reduce(Text globalKey, Iterable<IntWritable> bookPages,
            Context context) throws IOException, InterruptedException {
        int sum = 0;

        // add up the page counts of all the Latin books
        for (IntWritable val : bookPages) {
            sum += val.get();
        } // for

        totalPages.set(sum);
        context.write(globalKey, totalPages);
    } // reduce

} // JCReducer

Page 47: FIWARE FWD big_data_all_in_1_v1

Implementing the example: JC.class

47

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new JC(), args);
    System.exit(res);
} // main

@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "julius caesar");
    job.setJarByClass(JC.class);
    job.setMapperClass(JCMapper.class);
    job.setCombinerClass(JCReducer.class);
    job.setReducerClass(JCReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
} // run
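A hedged sketch of how the job could be compiled, packaged and launched (jar and path names are illustrative):

$ javac -classpath $(hadoop classpath) JC.java   # compile against the Hadoop libraries
$ jar cf jc.jar JC*.class                        # package the driver, the mapper and the reducer
$ hadoop jar jc.jar JC books_input books_output  # args[0]=input dir, args[1]=output dir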

Page 48: FIWARE FWD big_data_all_in_1_v1

48

Querying tools
Hive

Page 49: FIWARE FWD big_data_all_in_1_v1

Querying tools

49

• The MapReduce paradigm may be hard to understand and, even worse, to use
• Indeed, many data analysts just need to query the data
  – If possible, by using already well-known languages
• For that reason, some querying tools appeared in the Hadoop ecosystem:
  – Hive, and its HiveQL language, quite similar to SQL
  – Pig, and its Pig Latin language, a new language

Page 50: FIWARE FWD big_data_all_in_1_v1

Hive and HiveQL

50

• HiveQL reference:
  – https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• All the data is loaded into Hive tables
  – Not real tables (they don't contain the real data) but metadata pointing to the real data at HDFS
• The best thing is that Hive uses pre-defined MapReduce jobs behind the scenes!
  • Column selection
  • Fields grouping
  • Table joining
  • Values filtering
  • …
• Important remark: since MapReduce is used by Hive, the queries may take some time to produce a result

Page 51: FIWARE FWD big_data_all_in_1_v1

Hive CLI

51

$ hive
hive history file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive> select column1,column2,otherColumns from mytable where column1='whatever' and columns2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=cosmosmaster-gi:8021 -kill job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%

Page 52: FIWARE FWD big_data_all_in_1_v1

Hive Java API

52

• The Hive CLI is OK for human-driven testing purposes
  – But it is not usable by remote applications
  – Hive has no REST API
• Hive has several drivers and libraries:
  • JDBC for Java
  • Python
  • PHP
  • ODBC for C/C++
  • Thrift for Java and C++
  • https://cwiki.apache.org/confluence/display/Hive/HiveClient
• A remote Hive client usually performs:
  • A connection to the Hive server (TCP/10000)
  • The query execution

Page 53: FIWARE FWD big_data_all_in_1_v1

Hive Java API: get a connection

53

private static Connection getConnection(String ip, String port,
        String user, String password) {
    try {
        // load the Hive JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    } catch (ClassNotFoundException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch

    try {
        // connect to the Hive server
        return DriverManager.getConnection("jdbc:hive://" + ip + ":"
                + port + "/default?user=" + user + "&password=" + password);
    } catch (SQLException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch
} // getConnection

Page 54: FIWARE FWD big_data_all_in_1_v1

Hive Java API: do the query

54

private static void doQuery() {
    try {
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select column1,column2,"
                + "otherColumns from mytable where "
                + "column1='whatever' and "
                + "columns2 like '%whatever%'");

        while (res.next()) {
            String column1 = res.getString(1);
            int column2 = res.getInt(2);
        } // while

        res.close();
        stmt.close();
        con.close();
    } catch (SQLException e) {
        System.exit(0);
    } // try catch
} // doQuery

Page 55: FIWARE FWD big_data_all_in_1_v1

Hive tables creation

55

• Both locally using the CLI, and remotely using the Java API, tables are created with this command:
  hive> create external table...
• CSV-like HDFS files (see the sketch below):
  hive> create external table <table_name> (<field1_name> <field1_type>, ..., <fieldN_name> <fieldN_type>) row format delimited fields terminated by '<separator>' location '/user/<username>/<path>/<to>/<the>/<data>';
• Json-like HDFS files:
  hive> create external table <table_name> (<field1_name> <field1_type>, ..., <fieldN_name> <fieldN_type>) row format serde 'org.openx.data.jsonserde.JsonSerDe' location '/user/<username>/<path>/<to>/<the>/<data>';
• Cygnus automatically creates the Hive tables for the entities it is notified about!
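As a concrete, hedged sketch, the Library of Alexandria CSV file used in the MapReduce example could be mapped and queried this way (table name and location are illustrative):

hive> create external table books (title string, language string, pages int) row format delimited fields terminated by ',' location '/user/frb/webinar/books';
hive> select language, sum(pages) from books group by language;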

Page 56: FIWARE FWD big_data_all_in_1_v1

56

Tidoop
Extensions for Hadoop

Page 57: FIWARE FWD big_data_all_in_1_v1

Non-HDFS inputs for MapReduce

57

• TIDOOP comprises all the developments related to Hadoop made by Telefónica Investigación y Desarrollo
  – https://github.com/telefonicaid/fiware-tidoop
• Part of this repo are the extensions that allow using generic non-HDFS data:
  – https://github.com/telefonicaid/fiware-tidoop/tree/develop/tidoop-hadoop-ext
• Non-HDFS sources:
  – CKAN (beta available)
  – MongoDB-based Short-Term Historic (roadmap)
• Other specific public repos are UNDER STUDY… suggestions accepted

Page 58: FIWARE FWD big_data_all_in_1_v1

Non-HDFS inputs for MapReduce: driver

58

@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "ckan mr");
    job.setJarByClass(CKANMapReduceExample.class);
    job.setMapperClass(CKANMapper.class);
    job.setCombinerClass(CKANReducer.class);
    job.setReducerClass(CKANReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // start of the relevant part
    job.setInputFormatClass(CKANInputFormat.class);
    CKANInputFormat.addCKANInput(job, ckanInputs);
    CKANInputFormat.setCKANEnvironmnet(job, ckanHost, ckanPort,
            sslEnabled, ckanAPIKey);
    CKANInputFormat.setCKANSplitsLength(job, splitsLength);
    // end of the relevant part
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
} // run

Page 59: FIWARE FWD big_data_all_in_1_v1

59

Tidoop
MapReduce Library

Page 60: FIWARE FWD big_data_all_in_1_v1

MapReduce Library

60

• Under Tidoop there is a library of predefined, general-purpose MapReduce jobs:
  – https://github.com/telefonicaid/fiware-tidoop/tree/develop/tidoop-mr-lib
• Jobs within the library may be used as any other MapReduce job, but they can be used remotely through a REST API as well:
  – https://github.com/telefonicaid/fiware-tidoop/tree/develop/tidoop-mr-lib-api
• Available jobs:
  – Filter by regex
  – MapOnly (mapping function as parameter)
• Currently working on adding more and more jobs

Page 61: FIWARE FWD big_data_all_in_1_v1

61

Big Data services at FIWARE LAB

Cosmos

Page 62: FIWARE FWD big_data_all_in_1_v1

Cosmos storage services

62

• Currently:
  – Hadoop cluster combining HDFS storage and MapReduce computing in the same nodes
  – 10 virtual nodes
  – 1 TB capacity (333 GB real capacity, replicas=3)
  – OAuth2-based secure access to WebHDFS/HttpFS
• Upcoming:
  – Specific HDFS cluster for storage
  – 6 physical nodes (+ another 6 planned)
  – 20 TB capacity (6.7 TB real capacity, replicas=3)
  – OAuth2-based secure access to WebHDFS/HttpFS
  – Support for CKAN for Open Big Data

Page 63: FIWARE FWD big_data_all_in_1_v1

Cosmos computing services

63

• Currently:
  – Hadoop cluster combining HDFS storage and MapReduce computing in the same nodes
  – 10 virtual nodes
  – 100 GB capacity
  – MapReduce and Hive available
  – No queues, no resource reservation mechanisms
• Upcoming:
  – Specific Hadoop cluster for computing
  – 8 physical nodes (+ another 8 planned)
  – 256 GB capacity
  – MapReduce, Hive and Tidoop available
  – Reservation mechanisms through YARN queues

Page 64: FIWARE FWD big_data_all_in_1_v1

Cosmos portal

64

• Currently:
  – http://cosmos.lab.fi-ware.org/cosmos-gui/
  – FIWARE Lab credentials are directly requested
  – An HDFS userspace is created in the shared Hadoop cluster
• Upcoming:
  – Real delegated authentication
  – An HDFS userspace is created in the storage cluster
  – Features to deploy and run MapReduce jobs

Page 65: FIWARE FWD big_data_all_in_1_v1

Thanks!