Hadoop Connector Guide

© 2013 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

This Hadoop user guide provides a brief introduction to cloud connectors and their features. The guide provides detailed information on setting up the connector and running data synchronization tasks (DSS), along with an overview of the supported features and task operations that can be performed using the Hadoop connector.

Table of Contents

Overview
Hadoop
Hadoop Plugin
Supported Objects and Task Operations
Enabling Hadoop Connector
  Instructions while installing the Secure Agent
Creating a Hadoop Connection as a Source
  JDBC URL
  JDBC Driver class
  Installation Paths
  Setting Hadoop Classpath for various Hadoop Distributions
Creating Hadoop Data Synchronization Task (Source)
Enabling a Hadoop Connection as a Target
Creating Hadoop Data Synchronization Task (Target)
Data Filters
Troubleshooting
  Increasing Secure Agent Memory
  Additional Troubleshooting Tips
Known Issues

Overview

Informatica Cloud connector SDKs are off-cycle, off-release add-ins that provide data integration with SaaS and on-premise applications that are not supported natively by Informatica Cloud.
The cloud connectors are specifically designed to address the most common use cases, such as moving data into the cloud and retrieving data from the cloud, for each individual application.

Figure 1: Informatica Cloud Architecture

Once the Hadoop cloud connector is enabled for your ORG Id, you need to create a connection in Informatica Cloud to access the connector.

Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects include:

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.
Cloudera Impala: The industry's leading massively parallel processing (MPP) SQL query engine that runs natively in Apache Hadoop. The Apache-licensed, open source Impala project combines modern, scalable parallel database technology with the power of Hadoop, enabling users to directly query data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is designed from the ground up as part of the Hadoop ecosystem and shares the same flexible file and data formats, metadata, security, and resource management frameworks used by MapReduce, Apache Hive, Apache Pig, and other components of the Hadoop stack.

Hadoop Plugin

The Informatica Hadoop connector allows you to perform Query and Insert operations on Hadoop. The plug-in supports Cloudera 5.0, MapR 3.1, Pivotal HD 2.0, Amazon EMR, and Hortonworks 2.1, and has been certified to work on CDH 4.2 and HDP 1.1. The Informatica Cloud Secure Agent must be installed on one of the nodes of the Hadoop cluster where HiveServer or HiveServer2 is running.
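A quick, illustrative way to confirm that a node is running HiveServer2 before installing the Secure Agent on it is to check for the process and its listening port. This is only a sketch; it assumes the default HiveServer2 Thrift port 10000, which may differ in your cluster.

    #!/bin/sh
    # Check for a running HiveServer2 process on this node
    # (the [h] keeps the grep command itself from matching).
    ps -ef | grep -i "[h]iveserver2"

    # Check that something is listening on the HiveServer2 Thrift port
    # (10000 is the default and is an assumption here).
    netstat -an | grep ":10000"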
The plug-in is used as a target to insert data into Hadoop. The plug-in connects to Hive and Cloudera Impala to perform the relevant data operations, and can easily be integrated with Informatica Cloud. The plug-in supports all operators supported in HiveQL.

The plug-in supports the AND conjunction between filters, and both AND and OR conjunctions in advanced filters. It supports filtering on all filterable columns in Hive and Impala tables.

Supported Objects and Task Operations

The objects supported by the Hadoop connector are all tables in Hive and all tables in Impala. Of the task operations (Query, Insert, Update, Upsert, and Delete), the connector supports Query and Insert; the other operations are not applicable. Tables in both Hive and Impala can be queried as the source of a data synchronization task.

Enabling Hadoop Connector

To enable the Hadoop connector, get in touch with Informatica support or your Informatica representative. After the connector is enabled, it usually takes about 15 minutes for it to download to the Secure Agent.

Instructions while installing the Secure Agent

Follow these instructions when installing the Secure Agent:

You must install the Secure Agent on the Hadoop cluster. If you install it outside the Hadoop cluster, you can only read from Hadoop; you cannot write into Hadoop.
You must install the Secure Agent on the node where HiveServer2 is running.

Creating a Hadoop Connection as a Source

To use the Hadoop connector in a data synchronization task, you must create a connection in Informatica Cloud. The following steps help you create a Hadoop connection in Informatica Cloud.

1. In the Informatica Cloud home page, click Configure.
2. From the drop-down menu that appears, select Connections.
3. The Connections page appears.
4. Click New to create a connection.
5. The New Connection page appears.

Figure 2: Connection Parameter

6. Specify values for the connection parameters:

   Connection Name: Enter a unique name for the connection.
   Description: Provide a relevant description for the connection.
   Type: Select Hadoop from the list.
   Secure Agent: Select the appropriate Secure Agent from the list.
   Username: The username of the schema of the Hadoop component.
   Password: The password of the schema of the Hadoop component.
   JDBC Connection URL: The JDBC URL used to connect to the Hadoop component. Refer to JDBC URL.
   Driver: The JDBC driver class used to connect to the Hadoop component. Refer to JDBC Driver class.
   Commit Interval: The commit interval, that is, the batch size (in rows) of data loaded into Hive.
   Hadoop Installation Path: The installation path of the Hadoop component used to connect to Hadoop.
   Hive Installation Path: The Hive installation path.
   HDFS Installation Path: The HDFS installation path.
   HBase Installation Path: The HBase installation path.
   Impala Installation Path: The Impala installation path.
   Miscellaneous Library Path: An additional library path that can be used to communicate with Hadoop.
   Enable Logging: Check the Enable Logging box to enable verbose log messages.

   Note: The installation paths are the paths where the Hadoop jars are located. The connector loads the libraries from one or more of these paths before sending any instructions to Hadoop. If you do not want to specify the installation paths, you can instead generate the setHadoopConnectorClasspath.sh file for Amazon, Hortonworks, and MapR. Refer to Setting Hadoop Classpath for various Hadoop Distributions.

7. Click Test to evaluate the connection.
8. Click Ok to save the connection.

JDBC URL

The connector connects to the different components of Hadoop using JDBC. The URL format and parameters vary among components.

Hive uses the JDBC URL format shown below:

    jdbc:<hive or hive2>://<server>:<port>/<schema>

The URL parameters are:

hive/hive2: protocol information, depending on the version of the Thrift server used; hive for HiveServer and hive2 for HiveServer2.
server, port: the server and port where the Thrift server is running.
schema: the Hive schema that the connector needs to access.

For example, jdbc:hive2://invrlx63iso7:10000/default connects to the default schema of Hive, using a HiveServer2 Thrift server running on the server invrlx63iso7 on port 10000.

The Hive Thrift server must be running for the connector to communicate with Hive. The command to start the Thrift server is:

    hive --service hiveserver2

Cloudera Impala uses the JDBC URL format shown below:

    jdbc:hive2://<server>:<port>/<schema>;auth=<authentication mechanism>

In this case, the auth parameter must be set to the security mechanism used by the Impala server (for example, Kerberos). For example, jdbc:hive2://invrlx63iso7:21050/;auth=noSasl connects to the default schema of Impala.
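If you want to sanity-check a JDBC URL before saving the connection, you can try it from a cluster node with Beeline, the JDBC command-line client that ships with Hive. This is only an illustrative sketch; the host, ports, and schema below reuse the examples above and will differ in your environment.

    # Test the HiveServer2 URL used in the connection (example host/port from above):
    beeline -u "jdbc:hive2://invrlx63iso7:10000/default" -e "show tables;"

    # Test the Impala URL (example host/port from above):
    beeline -u "jdbc:hive2://invrlx63iso7:21050/;auth=noSasl" -e "show tables;"

If the connection requires credentials, Beeline also accepts -n and -p for the username and password.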
JDBC Driver class

The JDBC driver class tends to vary among Hadoop components. For example, org.apache.hive.jdbc.HiveDriver is used for both Hive and Impala.

Installation Paths

The following are sample default Hadoop and Hive installation paths for different Hadoop distributions:

Cloudera 5 VM
  Hadoop: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop
  Hive: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hive
HortonWorks 2.1 Sandbox
  Hadoop: /usr/lib/hadoop
  Hive: /usr/lib/hive
Amazon EMR
  Hadoop: /home/hadoop
  Hive: /home/hadoop/hive/hive-0.11.0
MapR 3.1 demo
  Hadoop: /opt/mapr/hadoop/hadoop-0.20.2
  Hive: /opt/mapr/hive/hive-0.12
Pivotal HD 2.0
  Hadoop: /usr/lib/gphd/hadoop
  Hive: /usr/lib/gphd/hive

Note: If you do not specify the installation paths, you can simply set the classpath and then proceed with the connection configuration and the creation of DSS tasks.

Setting Hadoop Classpath for various Hadoop Distributions

If you do not specify the installation paths in the connection parameters, you can perform the connection operations by generating the setHadoopConnectorClasspath.sh file. This section describes how to set the classpath for the different Hadoop distributions.

Follow this procedure to generate setHadoopConnectorClasspath.sh for Amazon, Hortonworks, and Pivotal:

1. Make the changes in the Infa_Agent_DIR/main/tomcat/saas-infaagentapp.sh file as shown in the figure.
2. Start the Agent from the command prompt.
3. Create the Hadoop connection using the connector.
4. Test the connection. This generates the setHadoopConnectorClasspath.sh file in the Infa_Agent_DIR/main/tomcat path.
5. Stop the Agent using Ctrl+C.
6. From Infa_Agent_DIR, execute the command . ./main/tomcat/setHadoopConnectorClasspath.sh
7. Restart the Agent and execute the DSS tasks.

Note: If you want to generate the setHadoopConnectorClasspath.sh file again, delete the existing one and regenerate it.

After performing the above steps, if the Hadoop classpath still does not point to the correct classpath, execute the following steps to undo the changes made above:

1. Enter vi saas-infaagentapp.sh
2. Enter insert mode.
3. Press Delete or Backspace to delete the entries added earlier.
4. Press the Escape key.
5. Type :wq and press Enter.

Once you follow the above procedure, the added entries are removed, and you can use the section below to direct Hadoop to the correct classpath.

Directing the Hadoop classpath to the correct classpath

In certain cases, Hadoop may point to an incorrect classpath. Follow the procedure below to direct it to the correct classpath (a scripted sketch of the same procedure follows this list).

1. Enter the command hadoop classpath from the terminal. This displays the stream of jars.
2. Copy and paste the output into a text file.
3. Delete the following entries from the file:
   a. :/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar
   b. :/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar (retain the latest version and delete the older one)
4. Copy the remaining content and export it to a variable called HADOOP_CLASSPATH.
5. Add this to the file Infa_Agent_DIR/main/tomcat/saas-infaagentapp.sh.
6. Now follow the steps for generating setHadoopConnectorClasspath.sh described above. Refer to Setting Hadoop Classpath for various Hadoop Distributions.
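The following is a minimal, illustrative shell sketch of the classpath-correction procedure above. It assumes a MapR 3.1 layout (the same jar paths used in the example entries above); adjust the filtered patterns to match your own distribution before adding the export to saas-infaagentapp.sh.

    #!/bin/sh
    # Capture the classpath reported by Hadoop (step 1).
    RAW_CLASSPATH=$(hadoop classpath)

    # Remove the problematic entries (step 3); the jar paths below are the
    # MapR 3.1 examples from this guide and are assumptions for illustration.
    CLEAN_CLASSPATH=$(echo "$RAW_CLASSPATH" \
      | sed 's|:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop\*core\*.jar||g' \
      | sed 's|:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar||g')

    # Export the cleaned value (step 4). To make it permanent, add this export
    # line to Infa_Agent_DIR/main/tomcat/saas-infaagentapp.sh (step 5).
    export HADOOP_CLASSPATH="$CLEAN_CLASSPATH"
    echo "$HADOOP_CLASSPATH"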
Creating Hadoop Data Synchronization Task (Source)

Note: You need to create a connection before getting started with a data synchronization task.

The following steps help you set up a data synchronization task in Informatica Cloud. Let us consider the task operation Insert (Fetch/Read) to perform the data synchronization task.

1. In the Informatica Cloud home page, click Applications.
2. From the drop-down menu that appears, select Data Synchronization.
3. The Data Synchronization page appears.
4. Click New to create a data synchronization task.
5. The Definition tab appears.

Figure 3: Definition Tab

6. Specify the Task Name, provide a Description, and select the Task Operation Insert.
7. Click Next.
8. The Source tab appears.

Figure 4: Source Tab

9. Select the source Connection, Source Type, and Source Object to be used for the task.
10. Click Next.
11. The Target tab appears. Select the target Connection and Target Object required for the task.

Figure 5: Target Tab

12. Click Next.
13. In the Data Filters tab, Process all rows is chosen by default.
14. Click Next.
15. In the Field Mapping tab, map source fields to target fields accordingly.

Figure 6: Field Mapping

16. Click Next.
17. The Schedule tab appears.
18. In the Schedule tab, you can schedule the task as per the requirement and save it.
19. If you do not want to schedule the task, click Save and Run the task.

Figure 7: Save and Run the Task

After you Save and Run the task, you are redirected to the monitor log page. In the monitor log page, you can monitor the status of data synchronization tasks.

Enabling a Hadoop Connection as a Target

To use the Hadoop connector in a data synchronization task, you must create a connection in Informatica Cloud. The following steps help you create a Hadoop connection in Informatica Cloud.

1. In the Informatica Cloud home page, click Configure.
2. From the drop-down menu that appears, select Connections.
3. The Connections page appears.
4. Click New to create a connection.
5. The New Connection page appears.

Figure 8: Connection Parameter

6. Specify values for the connection parameters. Refer to Creating a Hadoop Connection as a Source.
7. Click Test to evaluate the connection.
8. Click Ok to save the connection.

Creating Hadoop Data Synchronization Task (Target)

Note: You need to create a connection before getting started with a data synchronization task.

The following steps help you set up a data synchronization task in Informatica Cloud. Let us consider the task operation Insert to perform the data synchronization task.

1. In the Informatica Cloud home page, click Applications.
2. From the drop-down menu that appears, select Data Synchronization.
3. The Data Synchronization page appears.
4. Click New to create a data synchronization task.
5. The Definition tab appears.

Figure 9: Definition Tab

6. Specify the Task Name, provide a Description, and select the Task Operation Insert.
7. Click Next.
8. The Source tab appears.

Figure 10: Source Tab

9. Select the source Connection, Source Type, and Source Object to be used for the task.
10. Click Next.
11. The Target tab appears. Select the target Connection and Target Object required for the task.

Figure 11: Target Tab

12. Click Next.
13. In the Data Filters tab, Process all rows is chosen by default. See also Data Filters.
14. Click Next.
15. In the Field Mapping tab, map source fields to target fields accordingly.

Figure 12: Field Mapping

16. Click Next.
17. The Schedule tab appears.
18. In the Schedule tab, you can schedule the task as per the requirement and save it.
19. If you do not want to schedule the task, click Save and Run the task.

Figure 13: Save and Run the Task

After you Save and Run the task, you are redirected to the monitor log page.
In the monitor log page, you can monitor the status of data synchronization tasks.

Data Filters

Data filters help you fetch specific data based on the APIs configured in the Config.csv file. The data synchronization task processes the data based on the filter field assigned.

Note: Advanced data filters are not supported by the Hadoop connector.

The following steps help you use data filters.

1. In the data synchronization task, select the Data Filters tab.
2. The Data Filters tab appears.
3. Click New as shown in the figure below.

Figure 14: Data Filters

4. The Data Filter dialog box appears.

Figure 15: Data Filters-2

5. Specify the following details:

   Object: Select the object for which you want to assign filter fields.
   Filter By: Select the filter field.
   Operator: Select the Equals operator. Only the Equals operator is supported with this release.
   Filter Value: Enter the filter value.

6. Click Ok.

Troubleshooting

Increasing Secure Agent Memory

To overcome memory issues faced by the Secure Agent, follow the steps given below.

1. In the Informatica Cloud home page, click Configuration.
2. Select Secure Agents.
3. The Secure Agents page appears.
4. From the list of available Secure Agents, select the Secure Agent for which you want to increase memory.
5. Click the pencil icon corresponding to the Secure Agent. The pencil icon is used to edit the Secure Agent.
6. The Edit Agent page appears.
7. In the System Configuration section, select the Type as DTM.
8. Edit JVMOption1 to -Xmx512m as shown in the figure below.

Figure 16: Increasing Secure Agent Memory-1

9. Again in the System Configuration section, select the Type as TomCatJRE.
10. Edit INFA_memory to -Xms256m -Xmx512m as shown in the figure below.

Figure 17: Increasing Secure Agent Memory-2

11. Restart the Secure Agent. The Secure Agent memory has now been increased.

Additional Troubleshooting Tips

When the connection is used as a target, the last batch of the insert load is not reflected in the record count. Refer to the session logs for the record count of the last batch inserted. For example, if the commit interval is set to 1 million and the actual rows inserted are 1.1 million, the record count in the UI shows 1 million and the session logs reveal the row count of the remaining 100k records.

Set the commit interval to the highest value possible before java.lang.OutOfMemoryError is encountered.

When the connection is used as a target to load data into Hadoop, ensure that all the fields are mapped.

After a data load in Hive, Impala needs to be refreshed manually for the latest changes to the table to be reflected in Impala. In the current version, the connector does not automatically refresh Impala upon a Hive dataset insert.

Known Issues

The connector is currently certified to work with Cloudera CDH 4.2 and HortonWorks HDP 1.1.

The connector may encounter a java.lang.OutOfMemory exception while fetching large data sets for tables with a large number of columns (for example, 5 million rows for a 15-column table). In such scenarios, restrict the result set by adding appropriate filters or by decreasing the number of field mappings.

The Enable Logging connection parameter is a placeholder for a future release, and its state has no impact on connector functionality.

The connector has been certified and tested in Hadoop's pseudo-distributed mode. Performance is a factor of the Hadoop cluster setup.

Ignore log4j initialization warnings in the session logs.
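As noted in the Additional Troubleshooting Tips, Impala must be refreshed manually after a data load through Hive. A minimal, illustrative way to do this from a cluster node is with the impala-shell client; the host, port, and table name below are placeholders, not values from this guide.

    # Refresh Impala's metadata for one table after inserting into it through Hive.
    # impalad_host, 21000, and my_table are placeholder values for illustration.
    impala-shell -i impalad_host:21000 -q "REFRESH my_table"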