
Oracle® Cloud
Using Oracle Big Data Cloud

E70336-24
September 2019

Copyright © 2017, 2019, Oracle and/or its affiliates. All rights reserved.

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable:

U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs. No other rights are granted to the U.S. Government.

This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.

This software or hardware and documentation may provide access to or information about content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services unless otherwise set forth in an applicable agreement between you and Oracle. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services, except as set forth in an applicable agreement between you and Oracle.

Contents

Preface
  Audience
  Related Resources
  Conventions

1 Get Started with Big Data Cloud
  About Big Data Cloud
  Before You Begin with Big Data Cloud
  How to Begin with Big Data Cloud Subscriptions
  About Big Data Cloud Roles and Users
  Typical Workflow for Big Data Cloud
  About Big Data Cloud Clusters on Oracle Cloud Infrastructure
  About Installing Additional Software

2 Access Big Data Cloud
  Access the Service Console for Big Data Cloud
  Access the Big Data Cloud Console
  Access Big Data Cloud Using the REST API
  Access Big Data Cloud Using the CLI
  Access Big Data Cloud Using Ambari
  About Accessing Thrift
  Connect to a Cluster Node Through Secure Shell (SSH)
    Connect to a Node by Using SSH on UNIX
    Connect to a Node by Using PuTTY on Windows

3 Manage the Life Cycle of Big Data Cloud
  About Cluster Topology
  Cluster Components
  Cluster Extensions
  Create a Cluster
  Create a Cluster with Oracle Cloud Stack
  View All Clusters
  View Details for a Cluster
  View Activities for Clusters
  View Cluster Component Status
  Monitor the Health of a Cluster
  Scale a Cluster Out
  Scale a Cluster In
  Stop, Start, and Restart a Cluster
  Delete a Cluster
  Stop, Start, and Restart a Node
  Manage Tags
    Create, Assign, and Unassign Tags
    Find Tags and Instances Using Search Expressions

4 Use Identity Cloud Service for Cluster Authentication
  About Cluster Authentication
  Connect to Identity Cloud Service from the Service Console
  Add Identity Cloud Service Users for Clusters
  Make REST API Calls to Clusters That Use Identity Cloud Service
  Update the Identity Cloud Service Password for Big Data Cloud

5 Manage Network Access
  About Network Access
  Enable Access Rules
  Create Access Rules
  Generate a Secure Shell (SSH) Public/Private Key Pair
    Generate an SSH Key Pair on UNIX and UNIX-Like Platforms Using the ssh-keygen Utility
    Generate an SSH Key Pair on Windows Using the PuTTYgen Program
  System Properties of Big Data Cloud

6 Patch Big Data Cloud
  About Operating System Patching
  View Available Patches
  Check Patch Prerequisites
  Apply a Patch
  Roll Back a Patch or Failed Patch

7 Manage Credentials
  Change the Cluster Password
  Replace the SSH Keys for a Cluster
  Update Cloud Storage Credentials
  Use the Cluster Credential Store
  Manage Certificates Used for the Cluster Console
  Update the Security Key for Big Data Cloud on Oracle Cloud Infrastructure

8 Manage Data
  Load Data Into Cloud Storage
  Upload Files Into HDFS
  Browse Data
  About the Big Data File System (BDFS)

9 Connect to Oracle Database
  Use the Oracle Shell for Hadoop Loaders Interface (OHSH)
    About Oracle Shell for Hadoop Loaders
    Configure Big Data Cloud for Oracle Shell for Hadoop Loaders
    Get Started with Oracle Shell for Hadoop Loaders
  Use Oracle Loader for Hadoop
    About Oracle Loader for Hadoop
    Get Started With Oracle Loader for Hadoop
  Use Copy to Hadoop
    About Copy to Hadoop
    First Look: Loading an Oracle Table Into Hive and Storing the Data in Hadoop

10 Work with Jobs
  Create a Job
  Run a Job
    About MapReduce Jobs
  Stop a Job
  View Jobs and Job Details
  View Job Logs
  Monitor and Troubleshoot Jobs
  Manage Work Queue Capacity
  Create Work Queues

11 Work with Notebook
  Create a Note in a Notebook
  Run a Note
  View and Edit a Note
  Import a Note
  Export a Note
  Delete a Note
  Organize Notes
  Manage Notebook Settings
  Interpreters Available for Big Data Cloud

12 Work with Oracle R Advanced Analytics for Hadoop (ORAAH)
  About ORAAH in Big Data Cloud
  Use ORAAH in Big Data Cloud

13 Troubleshoot Big Data Cloud
  Problems with Administering Clusters
    I get a warning that the object store credentials are out of sync
    I need to view the status of running services
    Services aren't being restarted properly after life cycle operations
    I need to modify the Ambari Web inactivity timeout
    I need to control the Ambari-agent service
    I need to control the Ambari-server service
  Problems with Patching and Rollback
    I can't apply a patch
    Patching fails due to disk space

A Oracle Cloud Pages for Big Data Cloud
  Service Console: Instances Page
  Service Console Create Instance: Instance Page
  Service Console Create Instance: Service Details Page
  Service Console Create Instance: Confirmation Page
  Service Console: Activity Page
  Service Console: SSH Access Page
  Service Console: Instance Overview Page
  Service Console: Access Rules Page
  Big Data Cloud Console: Overview Page
  Big Data Cloud Console: Jobs Page
  Big Data Cloud Console New Job: Details Page
  Big Data Cloud Console New Job: Configuration Page
  Big Data Cloud Console New Job: Driver File Page
  Big Data Cloud Console New Job: Confirmation Page
  Big Data Cloud Console: Notebook Page
  Big Data Cloud Console: Data Stores Page
  Big Data Cloud Console: Status Page
  Big Data Cloud Console: Settings Page

B Customize Clusters
  About the Cluster Bootstrap Script
  Bootstrap Script Execution and Logging
  Sample Bootstrap Script
  Big Data Cloud Convenience Functions


Preface

This document describes how to administer and use Oracle Big Data Cloud and provides references to related documentation.

Topics:

• Audience

• Related Resources

• Conventions

Audience

This document is intended for users who want to quickly spin up elastic Apache Spark or Apache Hadoop clusters and use the clusters to analyze data.

Related Resources

For related information, see these Oracle resources:

• Getting Started with Oracle Cloud

• Getting Started with Oracle Platform Services in the Oracle Cloud Infrastructure documentation

• Getting Started with Object Storage Classic in Using Oracle Cloud Infrastructure Object Storage Classic

• REST API for Oracle Big Data Cloud

• REST API to Manage Oracle Big Data Cloud

• Using the Command Line Interface in PaaS Service Manager Command Line Interface Reference

• Big Data Cloud on the Oracle Cloud website

https://cloud.oracle.com/big-data-cloud

Conventions

The following text conventions are used in this document:

boldface: Boldface type indicates graphical user interface elements associated with an action, or terms defined in text or the glossary.

italic: Italic type indicates book titles, emphasis, or placeholder variables for which you supply particular values.

monospace: Monospace type indicates commands within a paragraph, URLs, code in examples, text that appears on the screen, or text that you enter.

1 Get Started with Big Data Cloud

This section describes how to get started with Oracle Big Data Cloud.

Topics

• About Big Data Cloud

• Before You Begin with Big Data Cloud

• How to Begin with Big Data Cloud Subscriptions

• About Big Data Cloud Roles and Users

• Typical Workflow for Big Data Cloud

• About Big Data Cloud Clusters on Oracle Cloud Infrastructure

• About Installing Additional Software

About Big Data Cloud

Big Data Cloud leverages Oracle's Infrastructure Cloud Services to deliver a secure, elastic, integrated platform for all Big Data workloads. You can:

• Spin up multiple Hadoop or Spark clusters in minutes

• Use built-in tools such as Apache Zeppelin to understand your data, or use the jobs API to run non-interactive jobs

• Use open interfaces to integrate third-party tools to analyze your data

• Launch multiple clusters against a centralized data lake to achieve data sharing without compromising on job isolation

• Create small clusters or extremely large ones based on workload and use-cases

• Elastically scale the compute and storage tiers independently of one another, either manually or in an automated fashion

• Pause a cluster when not in use

• Use REST APIs to monitor, manage, and utilize the service

For information about the open source components used in Big Data Cloud, see Cluster Components.

Before You Begin with Big Data Cloud

Before you start using Oracle Big Data Cloud, you should be familiar with the following technologies:

• The Apache Hadoop ecosystem

• Apache Spark

• OpenStack Swift Object Storage


Before you create a cluster:

• Subscribe to Oracle Cloud Infrastructure Object Storage Classic, the persistent data lake for Big Data Cloud

• Subscribe to Oracle Big Data Cloud

• (Optional) Create an Oracle Cloud Infrastructure Object Storage Classic container for your data

• (Optional) Create a Secure Shell (SSH) public/private key pair to provide when you create a cluster

How to Begin with Big Data Cloud Subscriptions

To get started with Oracle Big Data Cloud subscriptions:

1. Sign up for a free credit promotion or purchase a subscription.

See Request and Manage Free Oracle Cloud Promotions and Buy an Oracle Cloud Subscription in Getting Started with Oracle Cloud.

2. Access Oracle Big Data Cloud.

See Access Big Data Cloud.

Note:

Be sure to review Before You Begin with Big Data Cloud before you create your first cluster.

If you want to grant others access to Big Data Cloud, start by reviewing About Big Data Cloud Roles and Users. Then, create accounts for users and assign them appropriate privileges and roles. For instructions, see Add Users and Assign Roles in Getting Started with Oracle Cloud.

About Big Data Cloud Roles and Users

Oracle Big Data Cloud uses roles to control access to tasks and resources. A role assigned to a user gives certain privileges to that user.

In addition to the roles and privileges described in Learn About Cloud Account Roles in Getting Started with Oracle Cloud, the following role is created for Big Data Cloud: BDCSCE_Administrator.

When the Big Data Cloud account is first set up, the service administrator is given the BDCSCE_Administrator role. User accounts with this role must be added before anyone else can access and use the service.

A user with the BDCSCE_Administrator role has complete administrative control over the service. This user can create and terminate clusters, add and delete nodes, monitor cluster health, stop and start clusters, and manage other life cycle events. In a typical workflow, the administrator spins up a cluster that users can use to do their work. When the cluster is no longer needed, the administrator terminates it.

The identity domain administrator can create more Big Data Cloud administrators by creating user accounts and assigning the role to the user. Only the identity domain administrator is allowed to create user accounts and assign roles. See Add Users and Assign Roles in Getting Started with Oracle Cloud.

Typical Workflow for Big Data Cloud

To start using Oracle Big Data Cloud, refer to the following tasks as a guide. Some of these tasks are performed only by administrators.

• Sign up for a free credit promotion or purchase a subscription: Provide your information, and sign up for a free credit promotion or purchase a subscription to Oracle Big Data Cloud. See How to Begin with Big Data Cloud Subscriptions.

• Add and manage users and roles: Create accounts for your users and assign them appropriate privileges. Assign the necessary Oracle Big Data Cloud roles. See Add Users and Assign Roles in Getting Started with Oracle Cloud, and About Big Data Cloud Roles and Users.

• Create an SSH key pair: Create SSH public/private key pairs to facilitate secure access to all virtual machines in your service. See Generate a Secure Shell (SSH) Public/Private Key Pair.

• Create a cluster: Use a wizard to create a cluster. See Create a Cluster.

• Enable network access: Permit access to network services associated with your clusters. See About Network Access.

• Load data: Load the data you'll be using for your analysis. See Manage Data.

• Create and manage jobs: Use jobs to analyze data. See Work with Jobs.

• Create and manage notes: Use notes to analyze data. See Work with Notebook.

• Monitor clusters: Check on the health and performance of individual clusters. See Monitor the Health of a Cluster.

• Monitor the service: Check on the day-to-day operation of your service, monitor performance, and review important notifications. See Performing Service-Specific Tasks in Managing and Monitoring Oracle Cloud.

About Big Data Cloud Clusters on Oracle Cloud Infrastructure

You can create Oracle Big Data Cloud clusters on Oracle Cloud Infrastructure and on Oracle Cloud Infrastructure Classic.

The infrastructure a cluster gets created on depends on the region you select when you create the cluster. If you see the Availability Domain and Subnet fields when you select a region for the cluster you're creating, that means the cluster will be created on Oracle Cloud Infrastructure. Otherwise, the cluster is created on Oracle Cloud Infrastructure Classic.

To determine which infrastructure your cluster is running on after the cluster has been created, click the Instance Details icon for the cluster, and then locate the Region information. If the value is us-phoenix-1, us-ashburn-1, eu-frankfurt-1, or uk-london-1, then the instance is running on Oracle Cloud Infrastructure.

Prerequisites on Oracle Cloud Infrastructure

Oracle Big Data Cloud clusters on Oracle Cloud Infrastructure require certain networking and storage resources that you must create on Oracle Cloud Infrastructure before you create your first cluster.

To learn about these resources, see Prerequisites for Oracle Platform Services in the Oracle Cloud Infrastructure documentation.

For step-by-step instructions to create these resources, see Creating the Infrastructure Resources Required for Oracle Platform Services.

Note:

Oracle Big Data Cloud uses the native Oracle Cloud Infrastructure object storage API rather than the Swift API. As such, an API signing key is required for authentication to Oracle Cloud Infrastructure Object Storage, not a Swift user name and password as described in the Prerequisites documentation above.
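If you need to create an API signing key, you can generate one with OpenSSL, following standard Oracle Cloud Infrastructure practice. The following commands are a minimal sketch; the file locations are assumptions, not requirements:

# Generate a 2048-bit RSA API signing key (path is illustrative).
mkdir -p ~/.oci
openssl genrsa -out ~/.oci/oci_api_key.pem 2048
chmod 600 ~/.oci/oci_api_key.pem

# Derive the matching public key, which you upload for your user in the
# Oracle Cloud Infrastructure console.
openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem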

Differences Between Clusters on Oracle Cloud Infrastructure and Oracle Cloud Infrastructure Classic

The cluster environment on either type of infrastructure is substantially the same. A few differences exist in the underlying infrastructure components and in the supported capabilities. Awareness of these differences will help you choose an appropriate infrastructure when creating a cluster.

The following list describes the differences between Big Data Cloud clusters on Oracle Cloud Infrastructure and on Oracle Cloud Infrastructure Classic.

Availability domains
• Oracle Cloud Infrastructure Classic: Not applicable.
• Oracle Cloud Infrastructure: Each region has multiple isolated availability domains, with separate power and cooling. The availability domains within a region are interconnected using a low-latency network. When creating a cluster, you can select the availability domain that the cluster should be placed in.

Subnets and IP networks
• Oracle Cloud Infrastructure Classic: You can attach clusters to IP networks defined on Oracle Cloud Infrastructure Compute Classic.
• Oracle Cloud Infrastructure: You must attach each cluster to a subnet, which is a part of a virtual cloud network that you create on Oracle Cloud Infrastructure.

Compute shapes
• Oracle Cloud Infrastructure Classic: Standard and high memory shapes. The list of available shapes may vary by region. For information about shapes, see About Shapes in Using Oracle Cloud Infrastructure Compute Classic.
• Oracle Cloud Infrastructure: VM.Standard and BM.Standard shapes. The list of available shapes may vary by region. For information about shapes, see Overview of the Compute Service in the Oracle Cloud Infrastructure documentation.

IP reservations
• Oracle Cloud Infrastructure Classic: Not supported.
• Oracle Cloud Infrastructure: Not supported.

Network access to clusters
• Oracle Cloud Infrastructure Classic: Use the Oracle Big Data Cloud interfaces to configure access rules. Note that these access rules prohibit access by default (with the exception of SSH access on port 22), and you must enable them to provide access to other ports.
• Oracle Cloud Infrastructure: Use Oracle Cloud Infrastructure interfaces to configure security rules.

Scaling clusters
• Oracle Cloud Infrastructure Classic: Supported.
• Oracle Cloud Infrastructure: Not supported. You cannot scale the shape of a cluster's compute nodes; you can scale only the storage. The minimum size of a new storage volume on Oracle Cloud Infrastructure is 50 GB.

Using Oracle Identity Cloud Service to control access to applications deployed on the cluster
• Oracle Cloud Infrastructure Classic: In accounts that use Oracle Identity Cloud Service, while creating a cluster, you can enable Oracle Identity Cloud Service as the identity provider for applications deployed on the cluster.
• Oracle Cloud Infrastructure: Not supported.

Load balancer options
• Oracle Cloud Infrastructure Classic: While creating a cluster, if you enable Oracle Identity Cloud Service as the identity provider, an Oracle-managed load balancer is created and configured automatically for the cluster. If you don't enable Oracle Identity Cloud Service, then you can use Oracle Traffic Director.
• Oracle Cloud Infrastructure: Uses a custom load balancer.

Object storage
• Oracle Cloud Infrastructure Classic: You can create the object storage container either before or during cluster creation.
• Oracle Cloud Infrastructure: You must create the object storage bucket on Oracle Cloud Infrastructure before creating the cluster.

About Installing Additional Software

You can install additional software on Oracle Big Data Cloud, but do so at your own risk. Certain software installations can affect the proper functioning of the service.

Note the following:

• Using Ambari to install and manage additional services will cause patching to fail.


• Changing the default Python version can have adverse effects on lifecycle operations such as start, stop, restart, scale-in, scale-out, and patching.

• If you choose to install third-party products, they should be installed on edge nodes and not directly in a Big Data Cloud cluster.

• You are responsible for the maintenance, operation, and support of any additional software you install on Big Data Cloud.


2 Access Big Data Cloud

This section describes how to access the consoles and interfaces available for Oracle Big Data Cloud.

Topics

• Access the Service Console for Big Data Cloud

• Access the Big Data Cloud Console

• Access Big Data Cloud Using the REST API

• Access Big Data Cloud Using the CLI

• Access Big Data Cloud Using Ambari

• About Accessing Thrift

• Connect to a Cluster Node Through Secure Shell (SSH)

Access the Service Console for Big Data Cloud

Oracle Big Data Cloud can be accessed through a web console. Access to this console is limited to administrators.

To access the service console for Oracle Big Data Cloud:

1. Sign in to Oracle Cloud.

If you received a welcome email, use it to identify the URL, your user name, and your temporary password. After signing in, you'll be prompted to change your password.

2. From the Infrastructure Console, click the navigation menu in the top left corner, expand Classic Data Management Services, and then click Big Data - Compute Edition.

The service console opens on the Instances page. For information about the details on the page, see Service Console: Instances Page. If this is the first time Oracle Big Data Cloud has been accessed for the account, a Welcome page is displayed.

Access the Big Data Cloud Console

Clusters in Oracle Big Data Cloud can be accessed through a web-based console.

The Big Data Cloud Console (also referred to as the cluster console in this document) is used to create, terminate, monitor, and manage Apache Spark jobs; create and manage notes and notebooks; browse Hadoop Distributed File System (HDFS) and Cloud storage; and manage work queue configurations.

After administrators create a cluster, they give users the information they need to connect to the cluster console. Administrators also provide information about the Oracle Cloud Infrastructure Object Storage Classic container associated with the cluster when the cluster was created. Oracle Cloud Infrastructure Object Storage Classic is the persistent data lake for Big Data Cloud and is typically where the data used for analysis is stored. Job logs are also stored there.

The cluster console can be accessed in several different ways, depending on whether you have administrator privileges, and whether the cluster uses Basic authentication or uses Oracle Identity Cloud Service (IDCS) for authentication.

Access the Big Data Cloud Console — Administrators

To access the cluster as an administrator:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. From the menu for the cluster you want to access, select Big Data Cloud Console and log in with the appropriate credentials:

• For clusters that use HTTP Basic authentication, log in with the administrative user name and password specified for the cluster when the cluster was created.

• For clusters that use IDCS for authentication, log in with your existing IDCS user name and password. When IDCS is enabled as the authentication mechanism for a cluster, anyone who can authenticate to IDCS can log in and access all cluster services.

After the cluster console opens, make note of the URL. This is the URL you'll provide to users who need to access the cluster. The URL and connection information differ depending on whether the cluster uses Basic authentication or IDCS for authentication.

For clusters that use Basic authentication, provide users with the cluster URL and with the credentials specified for the cluster when the cluster was created. The URL is in the form of https://address:1080/, where address is the public IP address of the MASTER-1 node on the cluster. Note that the console can be accessed on port 1080 on all master nodes in a cluster. If you can't access the console on the MASTER-1 node, try accessing it on another master node.

For clusters that use IDCS for authentication, provide users with just the cluster URL. The URL is in the form of https://cluster_name-load_balancing_server_URI, where cluster_name is the name of the cluster, and load_balancing_server_URI is the URI assigned to the cluster by the load balancing service. Because authentication is managed by IDCS, you won't send a user name and password. Users will log in to the cluster using their own IDCS credentials, which should have already been provisioned before you send the cluster URL. For information about adding users, see Add Identity Cloud Service Users for Clusters.

For both cluster types, also provide users with the URL and credentials for the Oracle Cloud Infrastructure Object Storage Classic container that was associated with the cluster when the cluster was created.

Access the Big Data Cloud Console — Cluster Users

To access the cluster if you are not an administrator:

1. Obtain the information you need from your administrator:

• For clusters that use Basic authentication, the administrator will give you the cluster URL and the user name and password for the cluster.


• For clusters that use IDCS for authentication, the administrator will give you just the cluster URL. An administrator should have already added you as a user in IDCS, and you should have received an email with your IDCS login information. You'll log in with your IDCS user name and password.

2. Access the cluster URL in your browser and log in when prompted:

• For clusters that use Basic authentication, you're presented with a basic login dialog. Log in with the user name and password provided by your administrator.

• For clusters that use IDCS for authentication, you're presented with the Identity Cloud Service login screen. Log in with your IDCS user name and password. All that's required to access an IDCS-enabled cluster is a valid IDCS account.

The Big Data Cloud Console opens.

Administrators should also give you the URL and credentials for the Oracle Cloud Infrastructure Object Storage Classic container associated with the cluster when the cluster was created.

Access Big Data Cloud Using the REST API

You can use the REST API to create and manage Oracle Big Data Cloud clusters and perform many other tasks you can perform using the web-based consoles. See:

• REST API for Oracle Big Data Cloud

• REST API to Manage Oracle Big Data Cloud

You can also access the API Catalog for Big Data Cloud from the user name menu in the Big Data Cloud Console. See Access the Big Data Cloud Console.
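As an illustration, a request to list clusters might look like the following curl sketch. The endpoint host, API version, and BDCSCE path segment are assumptions based on the general PaaS Service Manager URL pattern; see the REST API references above for the exact paths:

# List Big Data Cloud clusters in an identity domain (illustrative endpoint).
curl -s -u username:password \
  -H "X-ID-TENANT-NAME: identity_domain" \
  "https://psm.us.oraclecloud.com/paas/api/v1.1/instancemgmt/identity_domain/services/BDCSCE/instances"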

Access Big Data Cloud Using the CLI

You can use a command line interface (CLI) to create and manage Oracle Big Data Cloud clusters and perform many other tasks you can perform using the web-based consoles.

The Oracle PaaS Service Manager (PSM) CLI enables you to manage the lifecycle of various services in Oracle Public Cloud, including Big Data Cloud. See Using the Command Line Interface in PaaS Service Manager Command Line Interface Reference.
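A typical session might look like the following sketch. The bdcsce service abbreviation and subcommand names are assumptions; see the CLI reference above for the exact syntax:

# One-time setup: prompts for user name, password, identity domain,
# region, and output format.
psm setup

# List Big Data Cloud clusters in the identity domain.
psm bdcsce services

# Show details for a single cluster.
psm bdcsce service -s cluster_name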

Access Big Data Cloud Using Ambari

This topic does not apply to Oracle Cloud Infrastructure. On Oracle Cloud Infrastructure, the Ambari port is already accessible and nothing else needs to be done.

You can use Apache Ambari to access and manage Oracle Big Data Cloud clusters. While Ambari isn't needed for normal operations with the cluster, it's useful to open Ambari access to help with troubleshooting and certain administrative actions.


To access a cluster using Ambari, you enable an access rule to open the port for Ambari, and then use the Ambari URL:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. From the menu for the cluster you want to access using Ambari, select Access Rules. Access rules control which ports can be accessed on the VMs that are part of a cluster.

3. In the list of access rules, find the Ambari REST rule, which is associated with port 8080, the port that needs to be open.

4. From the menu for the Ambari REST rule, select Enable.

The Enable Access Rule window is displayed.

5. Select Enable.

The Enable Access Rule window closes and the rule is displayed as enabled in the list of rules. The given port on the cluster is opened to the public internet.

6. After the rule is enabled, click the link for the cluster at the top of the page to return to the cluster overview page.

7. On the cluster overview page, under Resources, find the MASTER-1 host, copy the Public IP address, and paste it into your browser address bar, adding port 8080 if necessary. For example, https://Public_IP_address:8080/. You must use https or you won't be able to connect.

8. If you're prompted for credentials, enter the user name and password specified for the cluster when the cluster was created.

You should now be connected to the Ambari management console on the cluster.

For information about using Ambari to upload files into HDFS, see Upload Files Into HDFS. For general information about using Ambari, see the Ambari 2.4 documentation.
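Once the Ambari port is open, you can also verify connectivity from a command line by calling the standard Ambari REST API, as in the following sketch. The -k flag skips certificate validation for the cluster's self-signed certificate and is acceptable only for troubleshooting; the credentials are the ones specified for the cluster:

# List the clusters managed by this Ambari server.
curl -k -u user_name:password "https://Public_IP_address:8080/api/v1/clusters"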

About Accessing Thrift

Oracle Big Data Cloud deploys two Thrift servers to provide JDBC connectivity to Hive and Spark: Spark Thrift Server and Hive Thrift Server.

JDBC clients can connect to Hive or Spark servers and execute SQL. Spark Thrift Server provides a way to submit Spark jobs via SQL, and Hive Thrift Server provides a way to submit Hadoop jobs via SQL. A common use for this capability is to allow business intelligence (BI) tools to leverage the power of Apache Spark and Apache Hive.

Thrift servers are automatically started when a cluster is provisioned in Big Data Cloud and are made available by default for the Full deployment profile. Thrift servers are not available with the Basic deployment profile.


Create a Keystore and Certificate

Note:

This section about creating a keystore and certificate does not apply to clusters that use Oracle Identity Cloud Service (IDCS) for authentication. Certificates associated with the load balancing service are typically signed by a certificate authority (are not self-signed), which means the following steps generally aren't necessary for IDCS-enabled clusters.

Before you can access a Thrift server, a keystore must be created with the appropriate certificate:

1. Download the certificate locally (on *nix environments):

echo | \
  openssl s_client -connect ip_address:1080 2>/dev/null | \
  openssl x509 > nginx.crt

where ip_address is the IP address of the Big Data Cloud Console (cluster console) or any of the master nodes in the cluster.

2. Create a TrustStore:

/usr/java/default/bin/keytool -import -trustcacerts \
  -keystore /tmp/bdcsce.jks \
  -storepass truststore_password -noprompt \
  -alias bdcsce-certs \
  -file nginx.crt

where truststore_password is a password of your choosing.

3. (Optional) Verify the certificate is properly added:

/usr/java/default/bin/keytool \
  -keystore /tmp/bdcsce.jks \
  -storepass truststore_password \
  -list -v

Access Spark or Hive Thrift Servers

Most JDBC clients can access the Spark and Hive Thrift Servers. The examples in this section use the Beeline client to show how to connect. The Spark Thrift Server can be accessed using the Beeline client both inside and outside of the cluster, as well as programmatically.

About the JDBC URL

If inside the cluster:

Spark and MapReduce jobs can read the Hive URL as a system property. Applications can access the URL and the user name from the /etc/bdcsce/datasources.properties file inside the cluster.
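For example, from a shell on a cluster node you could inspect the preconfigured connection entries. This is a sketch; the exact key names in the file aren't documented here, so the search pattern is an assumption:

# Show Thrift-related entries in the preconfigured datasources file.
grep -i thrift /etc/bdcsce/datasources.properties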

For external clients (external to the cluster), the URL must be manually constructed and use one of the following formats. Note that the URLs are almost identical and vary only by the value of the hive.server2.thrift.http.path attribute.


Note:

Thrift URLs are listed on the JDBC URLs tab on the Settings page in the Big Data Cloud Console and can be copied from there. See Access the Big Data Cloud Console.

The URLs differ depending on whether a cluster uses Basic authentication or uses IDCS for authentication. For IDCS-enabled clusters, interactions are routed through the load balancing server instead of going directly to the cluster, and that difference is reflected in the URL.

Basic authentication cluster

URL for Spark Thrift Server:

jdbc:hive2://ip_address:1080/default;ssl=true;sslTrustStore=path_to_truststore;trustStorePassword=truststore_password?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice

URL for Hive Thrift Server:

jdbc:hive2://ip_address:1080/default;ssl=true;sslTrustStore=path_to_truststore;trustStorePassword=truststore_password?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hs2service

where:

• ip_address is the IP address of the desired endpoint

• path_to_truststore is the absolute path to the Java Trust Store that holds the certificate

• truststore_password is the password used with the trust store

IDCS-enabled cluster

URL for Spark Thrift Server:

jdbc:hive2://cluster_name-load_balancing_server_URI/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice

URL for Hive Thrift Server:

jdbc:hive2://cluster_name-load_balancing_server_URI/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hs2service

where:

• cluster_name is the name of the cluster

• load_balancing_server_URI is the URI assigned to the cluster by the load balancing service

Access Using the Beeline CLI

The following examples show how to access the Thrift servers using Beeline.


Note:

The URLs shown in the examples are for clusters that use Basic authentication. For IDCS-enabled clusters, substitute the URLs listed above, and use IDCS credentials (user name and password).

Access Spark Thrift Server (example):

beeline -u \
'jdbc:hive2://ip_address:1080/default;ssl=true;sslTrustStore=/tmp/bdcsce.jks;trustStorePassword=truststore_password?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice' \
-n user_name \
-p password

Access Hive Thrift Server (example):

beeline -u \
'jdbc:hive2://ip_address:1080/default;ssl=true;sslTrustStore=/tmp/bdcsce.jks;trustStorePassword=truststore_password?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hs2service' \
-n user_name \
-p password

where:

• ip_address is the IP address of the desired endpoint

• truststore_password is the password used with the trust store

• user_name is the name of the user that was specified when the cluster was created

• password is the password specified for the cluster when the cluster was created

Access Thrift Programmatically

Thrift can easily be accessed programmatically. The following code snippet illustrates how Thrift can be accessed within the cluster using the available system properties:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

// Read the preconfigured Hive Thrift Server URL and user name from the
// system properties available inside the cluster.
String url = System.getProperty("bdcsce.hivethrift.default_connect");
Properties prop = new Properties();
prop.put("user", System.getProperty("bdcsce.hivethrift.default_user"));
prop.put("password", password); // see the note on password below
System.out.println("Connecting to url: " + url);
Connection conn = DriverManager.getConnection(url, prop);
System.out.println("connected");

Note that the Hive Thrift Server system properties used in the snippet:

bdcsce.hivethrift.default_connect
bdcsce.hivethrift.default_user

can be replaced with the Spark Thrift Server equivalents to connect to the Spark Thrift Server instead of the Hive Thrift Server:

bdcsce.sparkthrift.default_connect
bdcsce.sparkthrift.default_user


password can be an empty string if the client is run within the cluster. If the client is run outside of the cluster over https, password should be the Big Data Cloud Console password.

CLASSPATH is used to specify any additional jars required by the job. When running an application outside the cluster, CLASSPATH should include all libraries under ${spark_home}/jars, where spark_home points to the Spark2 install directory.
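For example, the environment for a client run outside the cluster might be prepared as follows; the local install path is an assumption:

# Point spark_home at a local copy of the Spark2 install directory,
# then put all of its jars on the classpath.
export spark_home=/opt/spark2
export CLASSPATH="${spark_home}/jars/*"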

System Properties Related to Thrift

The following system properties can be used by applications or in Zeppelin to facilitate simpler connection to Thrift.

bdcsce.hivethrift.default_user
• Example value: bdcsce_admin
• Description: User name that should be used for connecting to the Hive Thrift Server.

bdcsce.hivethrift.default_connect
• Example value: jdbc:hive2://host:10002/default;transportMode=http;httpPath=hs2service
• Description: URL to connect to the Hive Thrift Server within the cluster. This can be used by jobs to execute queries against the Hive Thrift Server.

bdcsce.sparkthrift.default_user
• Example value: bdcsce_admin
• Description: User name that should be used for connecting to the Spark Thrift Server.

bdcsce.sparkthrift.default_connect
• Example value: jdbc:hive2://host:10001/default;transportMode=http;httpPath=cliservice
• Description: URL to connect to the Spark Thrift Server within the cluster. This can be used by jobs to execute queries against the Spark Thrift Server.

oscs.default.container
• Example value: https://storage.oraclecorp.com/v1/Storage-tenant/container
• Description: REST URL to connect to the object store. Applications running inside the cluster can query this system property to get the URL.

Connect to a Cluster Node Through Secure Shell (SSH)

To gain local access to the tools, utilities, and other resources on a cluster node associated with Oracle Big Data Cloud, you use Secure Shell (SSH) client software to establish a secure connection and log in as the user opc.

By default, network access to cluster nodes associated with Oracle Big Data Cloud is provided by Secure Shell (SSH) connections on port 22. Port 22 is the standard TCP/IP port that is assigned to SSH servers.

Several SSH clients are freely available. The following sections describe how to use SSH clients on UNIX, UNIX-like, and Windows platforms to connect to a cluster node associated with Oracle Big Data Cloud.


Note:

The ora_p2bdcsce_ssh access rule controls SSH access to a cluster. The rule is created automatically when a cluster is created and is disabled by default. Before you can connect to a cluster node through SSH, you must enable the ora_p2bdcsce_ssh access rule. See Enable Access Rules.

Connect to a Node by Using SSH on UNIX

UNIX and UNIX-like platforms (including Solaris and Linux) include the ssh utility, an SSH client.

Before You Begin

Before you use the ssh utility to connect to a node, you need the following:

• The IP address of the node

The IP address of the node is listed on the details page of the cluster that contains the node. To display this page, see View Details for a Cluster.

• The SSH private key file that pairs with the public key associated with the cluster

The public key was associated with your cluster when it was created. If you don't have the private key that's paired with the public key, contact your administrator.

Procedure

To connect to a node using the ssh utility on UNIX and UNIX-like platforms:

1. In a command shell, set the file permissions of the private key file so that only you have access to it:

$ chmod 600 private-key-file

private-key-file is the path to the SSH private key file that matches the public key that is associated with the cluster.

2. Run the ssh utility:

$ ssh -i private-key-file opc@node-ip-address

where:

• private-key-file is the path to the SSH private key file.

• opc is the opc operating system user. As opc, you can use the sudo command to gain root access to the node, as described in the next step.

• node-ip-address is the IP address of the node in x.x.x.x format.

If this is the first time you are connecting to the node, the ssh utility prompts you to confirm the public key. In response to the prompt, enter yes.


3. To perform operations that require root access to the node—such as issuing ambari-server commands—open a root command shell. Enter sudo -s at the command prompt:

$ sudo -s
# whoami
root

Connect to a Node by Using PuTTY on Windows

PuTTY is a freely available SSH client program for Windows.

Before You Begin

Before you use the PuTTY program to connect to a node, you need the following:

• The IP address of the node

The IP address of the node is listed on the details page of the cluster that contains the node. To display this page, see View Details for a Cluster.

• The SSH private key file that pairs with the public key associated with the cluster

The public key was associated with your cluster when it was created. If you don't have the private key that's paired with the public key, contact your administrator.

The private key file must be of the PuTTY .ppk format. If the private key file was originally created on the Linux platform, you can use the PuTTYgen program to convert it to the .ppk format. See 8.2.12 Dealing with private keys in other formats in the PuTTY User Manual.

Procedure

1. Download and install PuTTY.

To download PuTTY, go to http://www.putty.org/ and click the You can download PuTTY here link.

2. Run the PuTTY program.

The PuTTY Configuration window is displayed, showing the Session panel.

3. In Host Name (or IP address) box, enter the IP address of the node.

4. Confirm that the Connection type option is set to SSH.

5. In the Category tree, expand Connection if necessary and then click Data.

The Data panel is displayed.

6. In the Auto-login username box, enter opc. As the opc user, you can use the sudo command to gain root access to the node, as described in the last step, below.

7. Confirm that the When username is not specified option is set to Prompt.

8. In the Category tree, expand SSH and then click Auth.

The Auth panel is displayed.

9. Click the Browse button next to the Private key file for authentication box. Then, in the Select private key file window, navigate to and open the private key file that matches the public key that is associated with the cluster.


10. In the Category tree, click Session.

The Session panel is displayed.

11. In the Saved Sessions box, enter a name for this connection configuration. Then, click Save.

12. Click Open to open the connection.

The PuTTY Configuration window is closed and the PuTTY window is displayed.

If this is the first time you are connecting to the VM, the PuTTY Security Alert window is displayed, prompting you to confirm the public key. Click Yes to continue connecting.

13. To perform operations that require root access to the node—such as issuing ambari-server commands—open a root command shell. Enter sudo -s at the command prompt:

$ sudo -s
# whoami
root


3 Manage the Life Cycle of Big Data Cloud

This section describes tasks to manage the life cycle of Oracle Big Data Cloud.

Topics

• About Cluster Topology

• Cluster Components

• Cluster Extensions

• Create a Cluster

• Create a Cluster with Oracle Cloud Stack

• View All Clusters

• View Details for a Cluster

• View Activities for Clusters

• View Cluster Component Status

• Monitor the Health of a Cluster

• Scale a Cluster Out

• Scale a Cluster In

• Stop, Start, and Restart a Cluster

• Delete a Cluster

• Stop, Start, and Restart a Node

• Manage Tags

About Cluster Topology

The cluster topology in Oracle Big Data Cloud is based on the initial size of the cluster when it was first created. While a cluster can be scaled up or down later, the underlying cluster topology that defines master services remains unchanged. Therefore, when you're creating a cluster, it's important to consider the maximum anticipated size of the cluster and start with a master topology that can scale to meet the expected demands.

Big Data Cloud provides three different cluster topologies based on the initial size of the cluster when it was created. These topologies are described below.


1 or 2 nodes: A cluster initially created with 1 or 2 nodes has a single master node that hosts all master services. This topology is well suited for smaller clusters of less than 5 nodes. A cluster initially created with 1 or 2 nodes is not expected to scale well beyond several nodes. This type of cluster is not highly available. All services run on the same node in non-HA mode. This cluster has:
• 1 master node that hosts all master services
• N+ compute nodes
• N+ compute and storage nodes

3 nodes: A cluster initially created with 3 nodes has 3 master nodes that host all master services. These nodes also act as storage and compute nodes. This type of cluster is expected to scale to 10 nodes and is highly available. This cluster has:
• 3 master nodes that host all master services
• N+ compute nodes
• N+ compute and storage nodes

4+ nodes: A cluster initially created with 4 nodes has 4 master nodes, 2 of which provide NameNode services and 2 others that host the other master services. Larger clusters should initially be created with 4 nodes. This type of cluster is highly available. This cluster has:
• 2 master nodes that host redundant NameNodes. The NameNodes are of shape OC2m regardless of the shape of the other nodes. DataNode storage is not mounted on the NameNodes.
• 2 master nodes that host other master services
• N+ compute nodes
• N+ compute and storage nodes

Cluster Components

The following list shows the open source components deployed in Oracle Big Data Cloud clusters, with versions and documentation links.

Note:

Administrators can view component status. See View Cluster Component Status.

• Alluxio: 1.3.0 (Alluxio documentation)
• Apache Ambari: 2.4.2 (Ambari documentation)
• Apache Hadoop: 2.7.1 (Hadoop documentation)
• Apache Hive: 1.2.1 (Hive documentation)
• Apache Pig: 0.15.0 (Pig documentation)
• Apache Spark: 1.6 (Spark 1.6 documentation)
• Apache Spark: 2.1 (Spark 2.1 documentation)
• Sparkline SNAP: Spark 2.2.1 (SNAP documentation)
• Apache Tez: 0.7.0 (Tez documentation)
• Apache Zeppelin: 0.7 (Zeppelin documentation)
• Oracle R: 3.2.0 (Oracle R documentation)

Cluster Extensions

Oracle Big Data Cloud bundles Oracle R with all newly provisioned clusters. For information about Oracle R, see details about the Oracle R Distribution.

Create a Cluster

To create a cluster, use the Oracle Big Data Cloud wizard as described in the following procedure.

Before You Begin

When you create a cluster, you may need to provide information about other resources, such as the following:

• An SSH public/private key pair

An SSH public key is used for authentication when you use an SSH client to connect to a node associated with the cluster. When you connect, you must provide the private key that matches the public key.

You can have the wizard create a public/private key pair for you, or you can create one beforehand and upload or paste its private key value. If you want to create a key pair beforehand, you can use a standard SSH key generation tool. See Generate a Secure Shell (SSH) Public/Private Key Pair for instructions.

• A cloud storage location (Optional on Oracle Cloud Infrastructure Classic)

The type of location you specify depends on the infrastructure the cluster is built on:

– Oracle Cloud Infrastructure: Data consumed and generated by Big Data Cloud is stored in an Oracle Cloud Infrastructure Object Storage bucket. You must create a storage bucket before you create a cluster. See Prerequisites for Oracle Platform Services in the Oracle Cloud Infrastructure documentation.

– Oracle Cloud Infrastructure Classic: Data consumed and generated by Big Data Cloud is stored in the Oracle Cloud Infrastructure Object Storage Classic container associated with a cluster when the cluster is created. Job logs are also stored in Oracle Cloud Infrastructure Object Storage Classic. You can create the container beforehand and provide the wizard with information about it, or you can have the wizard create the container for you. If you want to create the container beforehand, see Creating Containers in Using Oracle Cloud Infrastructure Object Storage Classic for instructions.

Also, before you create a cluster, review the information in About Cluster Topology. The size of a cluster when it's first created determines the cluster's topology, and even though the cluster can be scaled up or down later, the underlying cluster topology that defines master services remains unchanged.

Tutorial (Oracle Cloud Infrastructure)

Tutorial (Oracle Cloud Infrastructure Classic)

Procedure

To create a cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click Create Instance.

The Create Instance wizard starts and the Instance page is displayed. For information about the details on this page, see Service Console Create Instance: Instance Page.

3. On the Instance page, provide cluster information, then click Next to advance to the Service Details page.

Instance Name: Name for the new cluster. The name:

• Must not exceed 30 characters.
• Must start with a letter.
• For IDCS-enabled clusters: Must contain only letters and numbers.
• For non-IDCS-enabled clusters: Can contain hyphens. Hyphens are the only special characters you can use.
• Must be unique within the identity domain.

Description: (Optional) Description for the new cluster.

Notification Email: (Optional) Email address that provisioning status updates should be sent to.

Region: (Displayed only if your account has multiple regions)

The region for the cluster. If you choose a region that supports Oracle Cloud Infrastructure, the Availability Domain and Subnet fields are displayed and populated, and the cluster will be created on Oracle Cloud Infrastructure. Otherwise, those fields are not displayed and the cluster will be created on Oracle Cloud Infrastructure Classic.

To create your cluster on Oracle Cloud Infrastructure, select us-phoenix-1, us-ashburn-1, eu-frankfurt-1, or uk-london-1 if those regions are available to you (which regions are displayed depends on which default data region was selected during the subscription process). If you select any other region, the cluster will be created on Oracle Cloud Infrastructure Classic.

Select No Preference to let Big Data Cloud choose an Oracle Cloud Infrastructure Classic region for you.

Availability Domain: (Displayed only on Oracle Cloud Infrastructure)

The availability domain (within the region) where the cluster will be placed.


Subnet: (Displayed only on Oracle Cloud Infrastructure)

The subnet (within the availability domain) that will determine network access to the cluster.

Select a subnet from a virtual cloud network (VCN) that you created previously on Oracle Cloud Infrastructure. Select No Preference to let Big Data Cloud choose a subnet for you.

IP Network: (Not available on Oracle Cloud Infrastructure)

(Available only if you have selected a region and you have defined one or more IP networks created in that region using Oracle Cloud Infrastructure Compute Classic.)

Select the IP network where you want the cluster placed. Choose No Preference to use the default shared network provided by Oracle Cloud Infrastructure Compute Classic.

For more information about IP networks, see About IP Networks and Creating an IP Network in Using Oracle Cloud Infrastructure Compute Classic.

Metering Frequency: (Displayed only if you have a traditional metered subscription)

Metering frequency used to determine the billing for resources used by the cluster.

Tags: (Not available on Oracle Cloud at Customer)

(Optional) Select existing tags or add tags to associate with the cluster.

To select existing tags, select one or more check boxes from the list of tags that are displayed on the drop-down menu. If no tags are displayed, then no tags have been created.

To create tags, click Click to create a tag (plus sign) to display the Create Tags dialog box. In the New Tags field, enter one or more comma-separated tags that can be a key or a key:value pair.

If you do not assign tags during provisioning, you can create and manage tags after the cluster is created. See Create, Assign, and Unassign Tags.

4. On the Service Details page, complete the Cluster Configuration section. For information about the details on this page, see Service Console Create Instance: Service Details Page.


Deployment Profile: Type of cluster you want to create based on its intended use. Deployment profiles are predefined sets of services optimized for specific uses. The deployment profile can’t be changed after the cluster is created.

Choices are:

• Full: (default) Provisions the cluster with Spark, Spark Thrift, Zeppelin, MapReduce, Hive, Alluxio, and Ambari Metrics. Use this profile if you want all of the features of Big Data Cloud.

• Basic: Subset of the Full profile. Provisions the cluster with Spark, Zeppelin, MapReduce, and Ambari Metrics. Use this profile if you don’t need all of the features of Big Data Cloud and just want to run Spark or MapReduce jobs and use Notebooks. This profile does not include Alluxio (the in-memory cache), or Hive or JDBC connectivity for BI tools.

• Snap: Provisions the cluster with SNAP, Spark, and Zeppelin. Once the SNAP cluster is provisioned, the SNAP service is started and can be viewed in the Ambari user interface. The SNAP service is started only on the master node. All lifecycle operations (start/stop/restart) can be performed on the SNAP service using Ambari. See Access Big Data Cloud Using Ambari. SNAP clusters can only be used for the SNAP application and cannot be used for general-purpose Spark processing. Use the Full or Basic profile for general-purpose Spark processing. For information about SNAP, see the SNAP documentation.

Number of Nodes: Number of nodes to be allocated to the cluster. Specify three or more nodes to provide high availability (HA), with multiple master nodes. If fewer than three nodes are specified, one node will be the master node with all critical services running on the same node in non-HA mode.

Compute Shape: Number of Oracle Compute Units (OCPUs) and amount of memory (RAM) for each node of the new cluster. Big Data Cloud offers many OCPU/RAM combinations.

Queue Profile: YARN capacity scheduler queue profile. Defines how queues and workloads are managed. Also determines which queues are created and available by default when the cluster is created. See Manage Work Queue Capacity.

Note: The preemption setting can’t be changed after the cluster is created.

• Preemption Off: Jobs can't consume more resources than a specific queue allows.

• Preemption On: Jobs can consume more resources than a queue allows, but could lose those resources when another job comes in that has priority for those resources. If preemption is on, higher-priority applications don’t have to wait because lower-priority applications have taken up the available capacity.

Spark Version: Spark version to be deployed on the cluster, Spark 1.6 or 2.1.

Note: Oracle R Advanced Analytics for Hadoop (ORAAH) is installed for Spark 1.6 clusters only.

5. On the Service Details page, complete the Credentials section. The user name and password credentials are used to log in to the cluster and run jobs.


Use Identity Cloud Service to login to the console: (Not available on Oracle Cloud Infrastructure) (Not displayed for all user accounts)

Select this to use IDCS as the client authentication mechanism for the cluster. Users will access the cluster with their own IDCS identity and credentials.

When this option is selected, cluster users and cluster access are managed through IDCS. If this option is not selected, HTTP Basic authentication is used and users access the cluster with the shared administrative user name and password specified below. For more information about cluster authentication, see Use Identity Cloud Service for Cluster Authentication.

SSH Public Key: The SSH public key to be used for authentication when using an SSH client to connect to a node associated with your cluster.

Click Edit to specify the public key. You can upload a file containing the public key value, paste in the value of a public key, or have the wizard generate a key pair for you.

If you paste in the value, make sure the value does not contain line breaks or end with a line break.

If you have the wizard generate a key pair for you, make sure you download the zip file containing the keys that the wizard generated.

User Name: Administrative user name. The user name cannot be admin.

For clusters that use Basic authentication, the administrative user name and password are used to access the cluster console, REST APIs, and Apache Ambari.

For clusters that use IDCS for authentication, the administrative user name and password are used only to access Ambari. Cluster access is managed through IDCS.

Password / Confirm Password: Password of the user specified in User Name.

6. On the Service Details page, complete the Associations section by selecting the services you’d like to associate with this cluster.

You can associate a Big Data Cloud cluster with other Oracle Cloud services you've already provisioned. When you associate a cluster with another service, networking between the service instances is reconfigured so the instances can communicate with one another. This is helpful if you have Apache Spark jobs that require interaction between services or have some dependency. To associate a cluster with a service, you must already have an active subscription for that service.

7. On the Service Details page, complete the Cloud Storage Credentials section. The fields in this section are different depending on whether the cluster is being created on Oracle Cloud Infrastructure or on Oracle Cloud Infrastructure Classic.

On Oracle Cloud Infrastructure, provide the following information. Oracle Cloud Infrastructure Object Storage is used for object storage.


Note:

Oracle Big Data Cloud uses the native Oracle Cloud Infrastructure object storage API rather than the Swift API. As such, an API signing key is required for authentication to Oracle Cloud Infrastructure Object Storage, not a Swift user name and password.

OCI Cloud Storage URL: The Oracle Cloud Infrastructure Object Storage URL. For example:

https://objectstorage.us-phoenix-1.oraclecloud.com

For information about the object storage URL, see REST APIs in the Oracle Cloud Infrastructure documentation.

OCI Cloud Storage Bucket URL: The URL of an existing bucket in Oracle Cloud Infrastructure Object Storage.

Format:

oci://bucket@namespace/, where bucket is the default bucket where application binaries and application logs are stored, and namespace is your namespace.

Note: The bucket URL must have a trailing slash. If it doesn’t, provisioning will fail.
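For example, a hypothetical bucket named app-logs in a tenancy namespace acme would be entered as:

oci://app-logs@acme/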

OCI Cloud Storage User OCID: The Oracle Cloud Infrastructure Object Storage User OCID. See Where to Get the Tenancy's OCID and User's OCID in the Oracle Cloud Infrastructure documentation.

OCI Cloud Storage PEM Key: The Oracle Cloud Infrastructure Object Storage PEM key. This must be generated. See How to Generate an API Signing Key in the Oracle Cloud Infrastructure documentation.

Note: In Big Data Cloud, the PEM key must be created without a password.

OCI Cloud Storage PEM Key Fingerprint: The Oracle Cloud Infrastructure Object Storage PEM key fingerprint. This must be generated. See How to Generate an API Signing Key in the Oracle Cloud Infrastructure documentation.

On Oracle Cloud Infrastructure Classic, provide the following information. Oracle Cloud Infrastructure Object Storage Classic is used for object storage.


Cloud Storage Container: Name of an existing Oracle Cloud Infrastructure Object Storage Classic container to be associated with the cluster, or a new one to be created. The container is used for writing application logs and reading application JARs and other supporting files.

You must enter the complete (fully qualified) REST URL for Oracle Cloud Infrastructure Object Storage Classic, appended by the container name.

Format:

rest_endpoint_url/containerName

You can find the REST endpoint URL of the Oracle Cloud Infrastructure Object Storage Classic service instance in the Infrastructure Classic Console. See Finding the REST Endpoint URL for Your Cloud Account in Using Oracle Cloud Infrastructure Object Storage Classic.

Example:

https://acme.storage.oraclecloud.com/v1/MyService-acme/MyContainer

The same formatting requirement applies to the cloudStorageContainer attribute in the REST API.

User Name: User name of the user who has access to the specified Oracle Cloud Infrastructure Object Storage Classic container.

Password: Password of the user specified in User Name.

Create Cloud Storage Container: Select this to create a new Oracle Cloud Infrastructure Object Storage Classic container as part of cluster creation. Specify the container name and the user name and password in the preceding fields.

The user specified in User Name and Password must have the privileges needed to create storage containers.

If you select this option, the new storage container is created when you click Next on the Service Details page, and the storage container remains even if you cancel out of the wizard without creating a new cluster. If this happens, you can use the container in the future or manually delete it. See Deleting Containers in Using Oracle Cloud Infrastructure Object Storage Classic.

8. On the Service Details page, complete the Block Storage Settings section, then click Next to advance to the Confirmation page.


Use High Performance Storage: (Not available on Oracle Cloud at Customer or Oracle Cloud Infrastructure)

Select this to use high performance storage for HDFS. With this option the storage attached to nodes uses SSDs (solid state drives) instead of HDDs (hard disk drives). Use this option for performance-critical workloads. An additional cost is associated with this type of storage.

Usable HDFS Storage (GB): Amount of HDFS storage to be allocated to the cluster.


Usable BDFS Cache (GB): Amount of storage the Big Data File System (BDFS) will use as a cache to accelerate workloads. The total amount of cache provided by BDFS is the sum of RAM allocated to BDFS plus the total block storage allocated for spillover.

The amount of memory allocated to BDFS is based on the compute shape selected for the cluster. For details about BDFS and memory allocation, see the information about BDFS Tiered Storage in About the Big Data File System (BDFS).

Total Allocated Storage (GB): Total allocated storage for the cluster. You’re billed for this amount.

9. On the Confirmation page, review the information listed. If you're satisfied with what you see, click Create to create the cluster.

If you need to change something, click Previous at the top of the wizard to step back through the pages, or click Cancel to cancel out of the wizard without creating a new cluster.

Create a Cluster with Oracle Cloud Stack

Use Oracle Cloud Stack to provision instances of both Oracle Big Data Cloud and Oracle Event Hub Cloud Service as a single operation.

Oracle Cloud Stack is a component of Oracle Cloud that enables you to create multiple cloud resources as a single unit called a stack. You create, delete, and manage these resources together as a unit, but you can also access, configure, and manage them through their service-specific interfaces. Stacks also define the dependencies between your stack resources, so that Oracle Cloud Stack creates and destroys the resources in a logical sequence.

Stacks are created from templates. Oracle Cloud Stack includes a certified Oracle stack template named Oracle-OEHCS-BDCSCE-StackTemplate. This template creates a stack that’s comprised of these resources:

• A cluster in Oracle Big Data Cloud.

• A cluster in Oracle Event Hub Cloud Service that is connected to the Oracle Big Data Cloud cluster.

• A storage container in Oracle Cloud Infrastructure Object Storage Classic to support cloud backups for both the Oracle Big Data Cloud and Oracle Event Hub Cloud Service clusters.

Get Started

Create a stack using the Oracle-OEHCS-BDCSCE-StackTemplate template. Refer to these topics in Using Oracle Cloud Stack Manager:

• Accessing Oracle Cloud Stack

• Creating a Cloud Stack

A video and a tutorial are also available.

Video

Tutorial


Template Parameters

In the Oracle-OEHCS-BDCSCE-StackTemplate template, the values of these input parameters can be customized for each stack creation:

• Big Data Cloud compute shape, number of nodes, usable HDFS storage, and cluster user name

• Event Hub compute shape, number of Kafka brokers, and usable topic storage

• SSH public key for VM administration

• Name of the Oracle Cloud Infrastructure Object Storage Classic container to create

• Oracle Cloud Infrastructure Object Storage Classic user name and password

Customize the Template

Export and update the Oracle-OEHCS-BDCSCE-StackTemplate template in order to customize your stack’s behavior. Modify the template’s name and contents, such as adding a template parameter or changing the parameters used to create the Oracle Big Data Cloud instance. See:

• Exporting a Template in Using Oracle Cloud Stack Manager

• Creating a Template in Using Oracle Cloud Stack Manager

• REST API to Manage Oracle Big Data Cloud

• REST API for Oracle Event Hub Cloud Service - Platform

View All Clusters

Administrators can view all clusters associated with Oracle Big Data Cloud.

To view all clusters:

• Open the service console. See Access the Service Console for Big Data Cloud.

The console opens on the Instances page, showing a list of all clusters. For information about the details on the page, see Service Console: Instances Page.

View Details for a Cluster

Administrators can view detailed information for a cluster.

To view details for a cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the cluster for which you want to view more information.

An overview page with cluster details is displayed. For information about the details on this page, see Service Console: Instance Overview Page.


View Activities for Clusters

Administrators can view activities for clusters.

To view activities for clusters:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click Activity.

The Activity page is displayed, showing the list of all activities started within the past 24 hours. You can use the Start Time Range field to specify a start time range other than the default of the previous 24 hours. For information about the details on this page, see Service Console: Activity Page.

3. Use the options in the Search Activity Log section to filter the results to meet your needs. You can search on start time range, full or partial service name, activity status, and operation type. Click Search. View the results in the table that follows.

View Cluster Component Status

Administrators can view information about the components and services running on a cluster and their associated state.

To view component status:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Status.

The Status page is displayed, listing all services running on a cluster and their status. For information about the details on this page, see Big Data Cloud Console: Status Page.

3. Click the Services tab to see a list of all components on the cluster, or the Hosts tab to list the components by each host (node) on the cluster. Use the Filter box to filter as desired.

There are two possible states for components and services: INSTALLED and STARTED.

INSTALLED means a service is stopped and not running, and STARTED means a service is running. Note that some components on the cluster are never started and are only installed, for example client libraries such as HDFS_CLIENT.

Monitor the Health of a Cluster

Administrators can view metrics and monitor the health of a cluster.

To view metrics for a cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. From the menu for the cluster, select Big Data Cloud Console.

The Big Data Cloud Console opens on the Overview page and displays metrics and other information for the cluster. For information about the details on this page, see Big Data Cloud Console: Overview Page.


Scale a Cluster Out

Administrators can scale a cluster out by adding compute-only nodes or compute nodes with HDFS storage.

To scale a cluster out:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the cluster you want to scale out.

An overview page with cluster details is displayed.

3. From the menu for the cluster at the top of the page, select Scale Out.

The Scale Out window is displayed.

4. Scale out the cluster as desired:

• Compute only nodes: Number of compute-only nodes you want to add, between 0 and 5.

• Compute nodes with HDFS storage: Number of compute and storage nodes you want to add, between 0 and 5.

• Rebalance HDFS: Select this to rebalance data blocks across storage nodes.

5. Click Scale Out.

Nodes are added and the HDFS rebalance operation is performed. During this process the cluster goes into Maintenance mode and, once the scale-out operation is complete, becomes operational again.

Scale a Cluster In

Administrators can scale a cluster in by removing compute-only nodes or compute nodes with HDFS storage. Master nodes cannot be removed.

To scale a cluster in:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the cluster you want to scale in.

An overview page with cluster details is displayed. Nodes are listed under Resources.

3. From the menu for the compute node, select Remove Node, and then confirm the action.

The cluster goes into Maintenance mode and, once the scale-in operation is complete, becomes operational again.

Stop, Start, and Restart a Cluster

Administrators can stop, start, and restart a cluster.

To stop, start, or restart a cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.


2. From the menu for the cluster, select and confirm the desired action.

• Stop: When you stop a cluster, you can’t access the cluster and you can’t perform management operations on it except to start the cluster or delete it. Stopping a cluster is like pausing it. You won’t be billed for compute resources, but you will be billed for storage.

• Start: When you start a cluster, you can access it again and perform management operations. Starting a cluster is like taking it off pause. Billing for compute resources resumes.

• Restart: When you restart a cluster, the cluster is stopped and then immediately started again. The information about stopping and starting a cluster applies to restarting a cluster as well, just in immediate succession.

Delete a Cluster

Administrators can delete (terminate) a cluster when it’s no longer needed.

Note:

When a cluster is deleted, all data stored on HDFS is deleted as well. Copy the data in HDFS over to Cloud Storage before deleting the cluster.
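For example, a minimal sketch of such a copy using hadoop distcp; the source path, container name, and swift scheme shown here are assumptions for illustration only, so substitute the object storage actually configured for your cluster:

hadoop distcp /user/myuser/data swift://myContainer.default/data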

To delete a cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. From the menu for the cluster, select Delete, and then confirm the action. The entry is removed from the list of clusters displayed in the console.

Stop, Start, and Restart a Node

Administrators can restart master nodes, and stop, start, and restart other nodes.

To stop, start, or restart a node:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the cluster with the node you want to stop, start, or restart.

An overview page with cluster details is displayed. Nodes are listed under Resources.

3. From the menu for the node, select and confirm the desired action.

Manage Tags

This topic does not apply to Oracle Cloud at Customer.


A tag is a key or a key-value pair that you can assign to your Oracle Big Data Cloud clusters. You can use tags to organize and categorize your clusters, and to search for them.

Topics:

• Create, Assign, and Unassign Tags

• Find Tags and Instances Using Search Expressions

Create, Assign, and Unassign Tags

You can create and assign tags to Oracle Big Data Cloud clusters while creating the cluster or later. When you no longer need certain tags for a cluster, you can unassign them.

To assign tags to a cluster or to unassign tags:

1. Navigate to the Overview page for the cluster for which you want to assign or unassign tags.

2. Click Manage this service in the cluster name bar at the top.

3. Select Manage Tags or Add Tags.

If any tags are already assigned, then the menu shows Manage Tags; otherwise, it shows Add Tags.

4. In the Manage Tags dialog box, create and assign the required tags, or unassign tags:

• In the Assign section, in the Tags field, select the tags that you want to assign to the cluster.

• If the tags that you want to assign don't exist, then select Create and Assign in the Tags field, and click just above the field. Enter the required new tags in the Enter New Tags field.

• To unassign a tag, in the Unassign section, look for the tag that you want to unassign, and click the X button next to the tag.

Note:

You might see one or more tags with the key starting with ora_. Such tags are auto-assigned and used internally. You can’t assign or unassign them.

• To exit without changing any tag assignments for the cluster, click Cancel.

5. After assigning and unassigning tags, click OK for the tag assignments to take effect.

Find Tags and Instances Using Search Expressions

A tag is an arbitrary key or a key-value pair that you can create and assign to your Oracle Big Data Cloud clusters. You can use tags to organize and categorize your clusters, and to search for them. Over time, you might create dozens of tags, and you might assign one or more tags to several of your clusters. To search for specific tags and to find clusters that are assigned specific tags, you can use filtering expressions.

For example, on the home page of the web console, you can search for the clusters that are assigned a tag with the key env and any value starting with dev (example: env:dev1, env:dev2), by entering the search expression 'env':'dev%' in the Search field.

Similarly, when you use the REST API to find tags or to find instances that are assigned specific tags, you can filter the results by appending the optional tagFilter=expression query parameter to the REST endpoint URL.

• To find specific tags: GET paas/api/v1.1/tags/{identity_domain}/tags?tagFilter={expression}

• To get a list of instances that are assigned specific tags: GET paas/api/v1.1/instancemgmt/{identity_domain}/instances?tagFilter={expression}

Syntax and Rules for Building Tag-Search Expressions

• When using cURL to send tag-search API requests, enclose the URL in double quotation marks.

Example:

curl -s -u username:password -H "X-ID-TENANT-NAME:acme" "restEndpointURL/paas/api/v1.1/instancemgmt/acme/instances?tagFilter='env'"

This request returns all the instances that are assigned tags with the key env.

• Enclose each key and each value in single quotation marks, and use a colon (:) to indicate a key:value pair.

Examples:

'env'
'env':'dev'

• You can include keys or key:value pairs in a tag-filtering expression.

Expression: 'env'
Description: Finds the tags with the key env, or the instances that are assigned the tags with that key.
Sample result: env:dev, env:qa (or the instances that are assigned any of these tags)


Expression: 'env':'dev'
Description: Finds the tag with the key env and the value dev, or the instances that are assigned that tag.
Sample result: env:dev (or the instances that are assigned this tag)

• You can build a tag-search expression by using actual keys and key values, or by using the following wildcard characters:

% (percent sign): Matches any number of characters.

_ (underscore): Matches one character.

Expression: 'env':'dev%'
Description: Finds the tags with the key env and a value starting with dev, or the instances that are assigned such tags. Note: When you use curl or any command-line tool to send tag-search REST API requests, encode the percent sign as %25.
Sample result: env:dev, env:dev1 (or the instances that are assigned any of these tags)

Expression: 'env':'dev_'
Description: Finds the tags with the key env and the value devX, where X can be any one character, or finds the instances that are assigned such tags.
Sample result: env:dev1, env:dev2 (or the instances that are assigned any of these tags)
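For example, a hypothetical request for instances whose env tag value starts with dev, with the percent sign encoded as %25 (username, password, and restEndpointURL are placeholders, as in the earlier example):

curl -s -u username:password -H "X-ID-TENANT-NAME:acme" "restEndpointURL/paas/api/v1.1/instancemgmt/acme/instances?tagFilter='env':'dev%25'"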

• To use a single quotation mark ('), the percent sign (%), or the underscore (_) as a literal character in a search expression, escape the character by prefixing a backslash (\).

Expression: 'env':'dev\_%'
Description: Finds the tags with the key env and a value starting with dev_, or the instances that are assigned such tags.
Sample result: env:dev_1, env:dev_admin (or the instances that are assigned any of these tags)

• You can use the Boolean operators AND, OR, and NOT in your search expressions:


Expression: 'env' OR 'owner'
Description: Finds the tags with the key env or the key owner, or the instances that are assigned either of those keys.
Sample result: env:dev, owner:admin (or the instances that are assigned any of these tags)

Expression: 'env' AND 'owner'
Description: Finds the instances that are assigned the tags env and owner. Note: This expression won’t return any results when used to search for tags, because a tag can have only one key.
Sample result: The instances that are assigned all of the following tags: env:dev, owner:admin

Expression: NOT 'env'
Description: Finds the tags that have a key other than env, or the instances that are assigned such tags. Note: Untagged instances as well will satisfy this search expression.
Sample result: owner:admin, department (or the instances that are assigned any of these tags or no tags)

Expression: ('env' OR 'owner') AND NOT 'department'
Description: Finds the tags that have the key env or the key owner but not the key department, or the instances that are assigned such tags.
Sample result: env:dev, owner:admin (or the instances that are assigned any of these tags)
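Compound expressions must also be URL-encoded when sent over REST. A hypothetical request using the same placeholders, with spaces encoded as %20:

curl -s -u username:password -H "X-ID-TENANT-NAME:acme" "restEndpointURL/paas/api/v1.1/instancemgmt/acme/instances?tagFilter=('env'%20OR%20'owner')%20AND%20NOT%20'department'"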


4 Use Identity Cloud Service for Cluster Authentication

The topics in this chapter do not apply to Oracle Cloud Infrastructure.

This section describes how to enable Oracle Identity Cloud Service (IDCS) for cluster authentication. The tasks in this section are performed by users with administrator privileges.

Topics

• About Cluster Authentication

• Connect to Identity Cloud Service from the Service Console

• Add Identity Cloud Service Users for Clusters

• Make REST API Calls to Clusters That Use Identity Cloud Service

• Update the Identity Cloud Service Password for Big Data Cloud

About Cluster Authentication

Oracle Big Data Cloud provides two authentication mechanisms for clusters. Users can authenticate using HTTP Basic authentication and shared credentials, or they can authenticate using their own identity through Oracle Identity Cloud Service (IDCS). The authentication method a cluster uses is selected when the cluster is created and cannot be changed after cluster creation.

With HTTP Basic authentication, the administrative user name and password for the cluster are specified when the cluster is created. These credentials are then shared with any user who wants to access the cluster. This method is simple but requires the sharing of cluster credentials.

When Oracle Identity Cloud Service is used for cluster authentication, users can access the cluster with their own IDCS identity and credentials, so credentials don’t need to be shared among cluster users. In this case, IDCS is used to manage user accounts and access for the cluster, and all authorization and authentication for the cluster is handled through IDCS.

Connect to Identity Cloud Service from the Service Console

When you create a cluster that uses Oracle Identity Cloud Service (IDCS) for authentication, an IDCS management application is created for the cluster. You can connect to the UI for this IDCS application from the service console for Big Data Cloud.

To connect to the IDCS application for the cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.


2. Click the name of the IDCS-enabled cluster.

An overview page with cluster details is displayed. For information about the details on this page, see Service Console: Instance Overview Page.

3. Expand Show more.

4. Click the link next to IDCS Application and log in with your IDCS credentials.

An instance of IDCS opens on the Application tab and lists cluster details.

The IDCS console has the following tabs for the cluster application:

• Details - Displays information about the cluster application, including the application ID.

• Configuration - Displays configuration information about the cluster application, including the client ID and client secret, primary audience, and scope. This information will be needed to make REST API calls to the cluster. See Make REST API Calls to Clusters That Use Identity Cloud Service.

• Application Roles - Displays roles. There is currently just one role: BDCSCE-Administrators.

• Groups - Displays groups.

• Users - Displays users.

Note:

Oracle Identity Cloud Service is used just for cluster authentication. Defining roles for a cluster and assigning users and groups is not yet supported.

Add Identity Cloud Service Users for Clusters

To access a cluster that uses Oracle Identity Cloud Service (IDCS) for authentication, cluster users must first have valid IDCS credentials. Administrators manage the provisioning of users in IDCS and perform the task of adding users.

Note:

Oracle Identity Cloud Service is used just for cluster authentication in Oracle Big Data Cloud. Defining roles for a cluster and assigning users and groups is not yet supported.

To add users:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the IDCS-enabled cluster.

An overview page with cluster details is displayed. For information about the details on this page, see Service Console: Instance Overview Page.

3. Expand Show more.

4. Click the link next to IDCS Application and log in with your IDCS credentials.


An instance of IDCS opens on the Application tab and lists cluster details.

5. Click the Identity Cloud Service Users tab at the top of the page (not the Users tab for the cluster).

6. Click Add and provide user details, then click Finish.

The Details page is displayed for the user. An email will be sent to the user with login information.

Make REST API Calls to Clusters That Use Identity Cloud Service

When Oracle Identity Cloud Service (IDCS) is selected for cluster authentication, the OAuth 2.0 authentication mechanism is enabled. The OAuth 2.0 token service provided by IDCS enables secure access to REST endpoints. This topic describes how to interact with OAuth-enabled clusters using REST.

To make REST API calls to an IDCS-enabled cluster, you’ll need to gather some information about the cluster, get an access token, and then use a REST client application such as cURL to perform REST API calls. Those steps are described in the following procedure.

To make REST API calls to an IDCS-enabled cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the IDCS-enabled cluster.

An overview page with cluster details is displayed. For information about the details on this page, see Service Console: Instance Overview Page.

3. Expand Show more.

4. Make note of the ID next to IDCS Application. This is the application ID for the cluster.

5. Click the link next to IDCS Application and log in with your IDCS credentials.

An instance of IDCS opens on the Application tab and lists cluster details.

6. Get the client ID and client secret for the cluster application:

a. In the IDCS console, click the Configuration tab for the cluster application and expand the General Information section.

b. Make note of the client ID, then click Show Secret and make note of the client secret. The client secret is essentially the client password and should not be shared.

7. Get the primary audience and scope:

a. On the Configuration tab, expand the Resources section.

b. Make note of the primary audience and scope.

The primary audience identifies the cluster host and consists of the application ID and compute domain.

There is currently just one scope (/). With this scope, all cluster resources are accessible to everyone who logs in with valid IDCS credentials.


The Resources section also shows the expiration period for the access token. The access token provides a session (with scope and expiration) that your client application can use to perform tasks in IDCS using REST APIs. The expiration period for the token is one hour (3600 seconds). After one hour, you’ll need to get another access token to continue to make REST API calls to the cluster.

8. Use the information you’ve gathered to create the REST request for the access token. The following steps use cURL to get the token:

a. In a text editor, prepare the cURL command as follows:

curl -k -X POST -u "CLIENT_ID:CLIENT_SECRET" -d "grant_type=client_credentials&scope=PRIMARY_AUDIENCE/" "IDCS_URL/oauth2/v1/token" -o access_token.json

Where:

• CLIENT_ID is the client ID.

• CLIENT_SECRET is the client secret.

• PRIMARY_AUDIENCE is the primary audience.

• / after PRIMARY_AUDIENCE is the scope (this is currently the only scope available).

• IDCS_URL is the Oracle Identity Cloud Service URL for the IDCS instance that’s associated with the cluster.

For example:

curl -k -X POST -u "123456789ABCDEFGHIJK_APPID:b9008819-0f0b-44c3-b266-b07746f9d9f9" -d "grant_type=client_credentials&scope=https://primary-audience-url.com:443/" "https://IDCS-server.com/oauth2/v1/token" -o access_token.json

b. At the command prompt, enter the cURL command you created in the previous step.

c. Open the access token file (access_token.json) in a text editor and copy the access_token value.
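If you prefer to capture the token directly in a shell variable, a hypothetical one-liner, assuming Python is available on the client machine:

ACCESS_TOKEN=$(python -c "import json; print(json.load(open('access_token.json'))['access_token'])")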

9. Use the access token to access the cluster. For IDCS authentication, the token type is Bearer.

The following example demonstrates a REST API call to perform a lookup of available user directories in HDFS. The example is intended to illustrate how REST API calls to IDCS-enabled clusters are executed.

a. In a text editor, prepare the cURL command as follows:

curl -X GET -k "https://CSM_IDCS_URL/fs/v1/user?op=LISTSTATUS" -H 'cache-control: no-cache' -H 'x-user-identity-domain-name: IDENTITY_DOMAIN' -H 'authorization: Bearer ACCESS_TOKEN'

Where:

• CSM_IDCS_URL is the URL advertised for the Big Data Cloud Console (cluster console). For information about this URL, see Access the Big Data Cloud Console.


• IDENTITY_DOMAIN is the identity domain configured for your service account.

• ACCESS_TOKEN is the text of the access token you obtained in Step 8.

b. At the command prompt, enter the cURL command you created in the previous step.

For the lookup example above, the response should have a form similar to the following:

{"FileStatuses":{"FileStatus":[
{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":16388,"group":"hdfs","length":0,"modificationTime":1508957357539,"owner":"ambari-qa","pathSuffix":"ambari-qa","permission":"770","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":16537,"group":"bdcsce_admin","length":0,"modificationTime":1508957412913,"owner":"bdcsce_admin","pathSuffix":"bdcsce_admin","permission":"750","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":16582,"group":"hdfs","length":0,"modificationTime":1508957486812,"owner":"hcat","pathSuffix":"hcat","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":2,"fileId":16536,"group":"hive","length":0,"modificationTime":1508957507850,"owner":"hive","pathSuffix":"hive","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":1,"fileId":16405,"group":"oracle","length":0,"modificationTime":1508957384444,"owner":"oracle","pathSuffix":"oracle","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":16389,"group":"hdfs","length":0,"modificationTime":1508957360272,"owner":"spark","pathSuffix":"spark","permission":"775","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":3,"fileId":16545,"group":"hdfs","length":0,"modificationTime":1508957446717,"owner":"zeppelin","pathSuffix":"zeppelin","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"}
]}}

Note that if the access token you’re using has expired, you’ll see a response such as the following:

<html>
<head><title>401 Authorization Required</title></head>
<body bgcolor="white">
<center><h1>401 Authorization Required</h1></center>
<hr><center>nginx</center>
</body>
</html>


Update the Identity Cloud Service Password for Big Data Cloud

Use the following procedure to update the Oracle Identity Cloud Service (IDCS) password for Oracle Big Data Cloud whenever the password changes. Failing to do so could result in the IDCS account being locked.

Important:

Changing the IDCS password while running jobs could lock the IDCS account. Big Data applications are executed on multiple mappers/reducers (multiple threads, cores, and nodes) for parallelism. If the IDCS password is changed while these mappers/reducers are running, the IDCS account could be locked.

To update the IDCS password for Big Data Cloud:

1. Ensure that no jobs are running on the Big Data Cloud cluster and stop any running jobs. To do so:

a. Log in to the Ambari user interface at https://Ambari_server_IP_address:8080 using the user name and password specified for the cluster when the cluster was created.

Ambari_server_IP_address is the IP address for the Ambari server host. This address is listed on the Instance Overview page for a cluster in the service console for Oracle Big Data Cloud.

b. Click Spocs Fabric Service on the left.

c. From the Service Actions drop-down menu at the top, select Stop.

2. Change the IDCS password in IDCS.

3. Update the IDCS password in Big Data Cloud. To do so, SSH to the cluster by using the private key and update the IDCS password:

ssh -i private_key_file -l opc Ambari_server_IP_address
sudo -u spoccs-fabric-server -s
hadoop credential delete fs.swift.service.default.password -provider jceks://hdfs/system/oracle/bdcsce/associations/jceks
hadoop credential create fs.swift.service.default.password -provider jceks://hdfs/system/oracle/bdcsce/associations/jceks -value new_IDCS_password
hadoop credential delete fs.swift2d.service.default.password -provider jceks://hdfs/system/oracle/bdcsce/associations/jceks
hadoop credential create fs.swift2d.service.default.password -provider jceks://hdfs/system/oracle/bdcsce/associations/jceks -value new_IDCS_password

where:

• private_key_file is the path to the SSH private key file that matches the public key associated with the cluster.

• Ambari_server_IP_address is the IP address for the Ambari server host.

• new_IDCS_password is the new IDCS password.


4. Log in to the Ambari user interface as described in step 1, only this time select Start from the Service Actions drop-down menu.

5. Resubmit any stopped jobs.


5 Manage Network Access

The information about access rules in this section does not apply to Oracle Cloud Infrastructure. For information about network configuration on Oracle Cloud Infrastructure, see Prerequisites for Oracle Platform Services in the Oracle Cloud Infrastructure documentation.

This section describes how to manage network access to Oracle Big Data Cloud.

Regardless of the infrastructure that you create your cluster on (Oracle Cloud Infrastructure or Oracle Cloud Infrastructure Classic), the rules to provide network access to the cluster are preconfigured for you. The interfaces that you use to manage these rules depend on the infrastructure that the cluster is created in.

For clusters on Oracle Cloud Infrastructure, you configure the rules, called security rules, in the Oracle Cloud Infrastructure interfaces.

For clusters on Oracle Cloud Infrastructure Classic, you configure the rules, called access rules, in the Big Data Cloud interfaces. Note that these access rules prohibit access by default (with the exception of SSH access on port 22), and you must enable them to provide access to other ports.

Topics

• About Network Access

• Enable Access Rules

• Create Access Rules

• Generate a Secure Shell (SSH) Public/Private Key Pair

• System Properties of Big Data Cloud

About Network Access

This topic does not apply to Oracle Cloud Infrastructure.

Access rules are used to provide secure network access to service components. Access rules control which ports can be accessed on the VMs that are part of a cluster.

By default, network access to Oracle Big Data Cloud is provided by using SSH. The SSH connection uses the SSH key specified when the cluster was created. By default, port 22 is used for SSH connections.

When a cluster is created, the following access rules are created by default:

• ora_p2bdcsce_ssh: Controls SSH access to a cluster. Disabled by default.

• ora_p2bdcsce_nginx: Enables access to the web-based cluster console and REST APIs. Enabled by default.


• ora_p2bdcsce_ambari: Enables access to the Ambari console and REST APIs. Disabled by default.

To enable access to a port, you enable the appropriate rule. System rules cannot be modified.

When you enable one of the predefined rules, the given port on the cluster is opened to the public internet. To enable access to a different port, or to restrict access to a port, you must create an access rule. See Enable Access Rules and Create Access Rules.

Enable Access Rules

This topic does not apply to Oracle Cloud Infrastructure.

To enable access to a port, administrators must enable the appropriate access rule.

To enable an access rule:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. From the menu for the cluster for which you want to manage access, select Access Rules.

The Access Rules page is displayed. For information about the details on this page, see Service Console: Access Rules Page.

3. Locate the rule you want to enable.

4. From the menu for the rule, select Enable. This menu is also used to disable or delete a rule.

The Enable Access Rule window is displayed.

5. Select Enable.

The Enable Access Rule window closes and the rule is displayed as enabled in the list of rules. The given port on the cluster is opened to the public internet.

Create Access Rules

This topic does not apply to Oracle Cloud Infrastructure.

Administrators can create access rules to enable ports not associated with predefined access rules, or to restrict access to ports to only permit connections from specific IP addresses.

To create an access rule:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. From the menu for the cluster for which you want to manage access, select Access Rules.

The Access Rules page is displayed. For information about the details on this page, see Service Console: Access Rules Page.


3. Click Create Rule. In the Create Access Rule dialog, enter the following information:

• Rule Name: Any name to identify this rule. Must start with a letter, followed by letters, numbers, hyphens, or underscores. Cannot start with ora_ or sys_.

• Description: (Optional) Any description of your choice.

• Source: The hosts from which traffic should be allowed. Choices are:

– PAAS_INFRA: Internal for platform services. Used for various life cycle operations including provisioning, patching, and scaling.

– PUBLIC-INTERNET: The public-internet Security IP List.

– bdcsce_ADMIN_HOST: The security list consisting of master nodes, which are designated as ADMIN hosts.

– bdcsce_COMPUTE_SLAVE: Hosts where the YARN NodeManager is running (no DataNode).

– bdcsce_MASTER: Hosts where the Big Data Cloud console and REST servers are running.

– bdcsce_NN_MASTER: Hosts where the NameNode is running (no Big Data Cloud console or REST server).

– bdcsce_SLAVE: Hosts where both the YARN NodeManager and DataNode are running.

– custom: A custom list of addresses from which traffic should be allowed. In the field that displays below when you select this option, enter a comma-separated list of the subnets (in CIDR format) or IPv4 addresses for which you want to permit access.

• Destination: The service component to which traffic should be allowed. Choices are as follows (see the previous descriptions):

– bdcsce_ADMIN_HOST

– bdcsce_COMPUTE_SLAVE

– bdcsce_MASTER

– bdcsce_NN_MASTER

– bdcsce_SLAVE

• Destination Port(s): The port or range of ports you want to open. Specify a single port, such as 5001, or a range of ports separated by a hyphen, such as 5001-5010.

• Protocol: The protocol for the access rule.

4. Click Create.

The Create Access Rule dialog closes and the rule is displayed in the list of rules. New rules are enabled by default. If necessary, adjust the number of results displayed on the Access Rules page so you can see the newly created rule.


Generate a Secure Shell (SSH) Public/Private Key Pair

An SSH public key is used for authentication when you use an SSH client to connect to a node associated with a cluster. When you connect, you must provide the private key that matches the public key.

Several tools exist to generate SSH public/private key pairs. The following sections show how to generate an SSH key pair on UNIX, UNIX-like, and Windows platforms.

Generate an SSH Key Pair on UNIX and UNIX-Like Platforms Using the ssh-keygen Utility

UNIX and UNIX-like platforms (including Solaris and Linux) include the ssh-keygen utility to generate SSH key pairs.

To generate an SSH key pair on UNIX and UNIX-like platforms using the ssh-keygen utility:

1. Navigate to your home directory:

$ cd $HOME

2. Run the ssh-keygen utility, providing as filename your choice of file name for the private key:

$ ssh-keygen -b 2048 -t rsa -f filename

The ssh-keygen utility prompts you for a passphrase for the private key.

3. Enter a passphrase for the private key, or press Enter to create a private key without a passphrase:

Enter passphrase (empty for no passphrase): passphrase

Note:

While a passphrase is not required, you should specify one as a security measure to protect the private key from unauthorized use. When you specify a passphrase, a user must enter the passphrase every time the private key is used.

The ssh-keygen utility prompts you to enter the passphrase again.

4. Enter the passphrase again, or press Enter again to continue creating a private key without a passphrase:

Enter the same passphrase again: passphrase

5. The ssh-keygen utility displays a message indicating that the private key has been saved as filename and the public key has been saved as filename.pub. It also displays information about the key fingerprint and randomart image.
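After the keys are generated, you can connect to a cluster node with the private key. The following pattern mirrors the SSH command used elsewhere in this guide; node_IP_address is a placeholder for a node's public IP address:

$ ssh -i filename -l opc node_IP_address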


Generate an SSH Key Pair on Windows Using the PuTTYgen Program

The PuTTYgen program is part of PuTTY, an open source networking client for the Windows platform.

To generate an SSH key pair on Windows using the PuTTYgen program:

1. Download and install PuTTY or PuTTYgen.

To download PuTTY or PuTTYgen, go to http://www.putty.org/ and click the You can download PuTTY here link.

2. Run the PuTTYgen program.

The PuTTY Key Generator window is displayed.

3. Set the Type of key to generate option to SSH-2 RSA.

4. In the Number of bits in a generated key box, enter 2048.

5. Click Generate to generate a public/private key pair.

As the key is being generated, move the mouse around the blank area as directed.

6. (Optional) Enter a passphrase for the private key in the Key passphrase box andreenter it in the Confirm passphrase box.

Note:

While a passphrase is not required, you should specify one as a security measure to protect the private key from unauthorized use. When you specify a passphrase, a user must enter the passphrase every time the private key is used.

7. Click Save private key to save the private key to a file. To adhere to file-naming conventions, you should give the private key file an extension of .ppk (PuTTY private key).

Note:

The .ppk file extension indicates that the private key is in PuTTY's proprietary format. You must use a key of this format when using PuTTY as your SSH client. It cannot be used with other SSH client tools. Refer to the PuTTY documentation to convert a private key in this format to a different format.

8. Select all of the characters in the Public key for pasting into OpenSSH authorized_keys file box.

Make sure you select all the characters, not just the ones you can see in the narrow window. If a scroll bar is next to the characters, you aren't seeing all the characters.

9. Right-click somewhere in the selected text and select Copy from the menu.


10. Open a text editor and paste the characters, just as you copied them. Start at the first character in the text editor, and do not insert any line breaks.

11. Save the text file in the same folder where you saved the private key, using the .pub extension to indicate that the file contains a public key.

12. If you or others are going to use an SSH client that requires the OpenSSH format for private keys (such as the ssh utility on Linux), export the private key:

a. On the Conversions menu, choose Export OpenSSH key.

b. Save the private key in OpenSSH format in the same folder where you saved the private key in .ppk format, using an extension such as .openssh to indicate the file's content.

System Properties of Big Data Cloud

System properties of Oracle Big Data Cloud are made available to executing jobs through YARN.

There are many system properties, but the most important property to know about is bdcsce.sparkthrift.default.connect. This is the connect string for the Spark Thrift Server. The format for the connect string is as follows:

jdbc:hive2://host:port/default;transportMode=http;httpPath=cliservice

where host and port are the host and port for the Spark Thrift Server.
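For example, assuming a hypothetical Spark Thrift Server running on host bdcs-master-1.example.com and port 10001, the connect string would be:

jdbc:hive2://bdcs-master-1.example.com:10001/default;transportMode=http;httpPath=cliservice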

For information about additional system properties related to Thrift, see About Accessing Thrift.


6 Patch Big Data Cloud

This section describes how to apply a patch to Oracle Big Data Cloud and roll back a patch if necessary.

Big Data Cloud provides ongoing monthly updates that contain patches and enhancements. The patches are backward-compatible and can be applied without adverse effects to existing Big Data Cloud clusters. Depending on the nature of the patch, binaries and configurations can be updated. Patches are not mandatory and don’t need to be applied to existing clusters unless desired.

Topics

• About Operating System Patching

• View Available Patches

• Check Patch Prerequisites

• Apply a Patch

• Roll Back a Patch or Failed Patch

About Operating System Patching

Oracle Big Data Cloud does not provide cloud tooling for operating system (OS) patching. You are responsible for installing OS patches on existing service instances.

You can obtain Oracle Linux OS patches from Oracle’s Unbreakable Linux Network if you have an Oracle Linux support subscription. You can also obtain Linux OS patches from the Oracle Linux public yum server:

http://public-yum.oracle.com

Big Data Cloud VMs are preconfigured to enable you to install and update packages from the repositories on the Oracle Linux public yum server. The repository configuration file is in the /etc/yum.repos.d directory on the VMs. You can install, update, and remove packages by using the yum utility.

Note:

You are responsible for applying the required security updates published through the Oracle Linux public yum server.
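As a brief illustration, a typical session with the yum utility might look like this (the package name httpd is illustrative only):

$ sudo yum update
$ sudo yum install httpd
$ sudo yum remove httpd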

View Available Patches

You can view a list of patches you can apply to an Oracle Big Data Cloud cluster by using the service console. Applicable patches are automatically available to the clusters to which they can be applied.


To view available patches:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the cluster for which you want to check patching.

The Big Data Cloud Overview page is displayed.

3. Click the Administration tile and then click the Patching tab.

The Big Data Cloud Patching page is displayed. A list of patches you can apply appears in the Available Patches section.

Note:

If patches are available for a cluster, you'll also see a notification on the Instances page in the service console for Big Data Cloud.

Check Patch Prerequisites

You can use the Oracle Big Data Cloud Patching page to check the prerequisites of a patch before you apply it and make sure the patch can be successfully applied.

The checking operation:

• Confirms the patch is available for download.

• Verifies there’s enough space in the /u01 directory to apply the patch.

• Ensures the cluster is healthy and able to have the patch applied.

To check patch prerequisites:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the cluster to which you want to apply a patch.

The Big Data Cloud Overview page is displayed.

3. Click the Administration tile and then click the Patching tab.

The Big Data Cloud Patching page is displayed. A list of patches you can apply appears in the Available Patches section.

4. In the entry for the patch whose prerequisites you want to check, click the menu and then select Precheck.

If you’ve previously checked prerequisites on the selected patch, the Patch Precheck Results window displays, showing the results of the previous check and asking you to perform another set of prerequisite checks. In this case, click Precheck to continue.

The Patching page redisplays, showing a status message indicating prerequisite checks are in progress.

5. Click the refresh icon on the Patching page occasionally to update the status message. The checking operation can take several minutes to complete.

6. When the checking operation completes, click the Precheck summary link to display the results of the prerequisite checks.


Apply a Patch

You can apply a patch to a cluster by using the Patching page in the service console for Oracle Big Data Cloud.

To apply a patch to a cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the cluster to which you want to apply a patch.

The Big Data Cloud Overview page is displayed.

3. Click the Administration tile and then click the Patching tab.

The Big Data Cloud Patching page is displayed. A list of patches you can apply appears in the Available Patches section.

4. In the entry for the patch you want to apply, click the menu and then select Patch.

The Patch Service window displays.

5. Click Patch.

The Patch Service window closes and the patching operation begins. If patch conflicts or errors were discovered during the precheck stage of the patching operation, the patch will not be applied.

The Administration tile shows the starting time of the patching operation and a Patching... message replaces the Patch button.

When the patching operation completes, the Patching page shows the completion time of the patching operation, and a log of the operation’s activities appears in the Details of Last Patching Activity section. If the operation was successful, the patch is removed from the list in the Available Patches section. If the operation failed, the patch remains in the list. In this case, check the Details of Last Patching Activity section for information about the failure.

Note:

Once the patch has been applied, the Big Data Cloud instance is restarted. The time it takes to patch a cluster depends on the size of the cluster and can take many minutes.

Roll Back a Patch or Failed Patch

You can roll back the last patch or failed patch attempt on a cluster by using the Patching page in the service console for Oracle Big Data Cloud.

To roll back the last patch or failed patch attempt:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click the name of the cluster on which you want to roll back a patch.

The Big Data Cloud Overview page is displayed.


3. Click the Administration tile and then click the Patching tab.

The Big Data Cloud Patching page is displayed.

4. Click Rollback.

The Patching page redisplays, showing a status message that your request has been submitted, the Administration tile shows the starting time of the rollback operation, and a Rolling back... message replaces the Rollback button.

Note:

Rollback operations are performed with a minimum of impact on the functioning of the cluster. However, during part of the operation the cluster is shut down for a period of time, thus making it inaccessible.

5. Refresh the Patching page occasionally to update the status message.

Note that the rollback operation can take several minutes to complete.

When the rollback operation completes, the Administration tile shows the completion time of the operation, and a log of the operation’s activities appears in the Details of Last Patching Activity section.


7 Manage Credentials

Administrators specify credentials for a cluster and for the Oracle Cloud Infrastructure Object Storage Classic container associated with the cluster when the cluster is created. These credentials might need to be reset or updated, or the SSH keys for a cluster might need to be replaced.

Topics

• Change the Cluster Password

• Replace the SSH Keys for a Cluster

• Update Cloud Storage Credentials

• Use the Cluster Credential Store

• Manage Certificates Used for the Cluster Console

• Update the Security Key for Big Data Cloud on Oracle Cloud Infrastructure

Change the Cluster Password

Note:

This topic does not apply to clusters that use Oracle Identity Cloud Service (IDCS) for authentication. The IDCS credentials are managed through IDCS. The steps in this topic are only for clusters that use Basic authentication. For information about cluster authentication, see Use Identity Cloud Service for Cluster Authentication.

Credentials used to access a cluster are set when the cluster is created. Administrators can change the cluster password later if necessary. The Ambari server and NGINX must be restarted for the password change to take effect.

To change the cluster password:

1. SSH to the Ambari server host and change to the otools user (sudo su - otools).

The IP address for the Ambari server host is listed on the Instance Overview page for a cluster in the service console for Oracle Big Data Cloud. Use this IP address to SSH to the Ambari server host.

2. Execute the following command, providing the user name and new password when prompted:

cd /u01/bdcsce/tools/ambari/java; java -cp "bdcsce-java-tools.jar:/usr/hdp/current/hadoop-client/client/*" com.oracle.bdcsce.UpdateBDCSCEAmbariCred


The password must be 8 or more characters and contain at least 1 uppercase character, 1 lowercase character, and 1 numeric character (0-9).

3. As root user, restart the Ambari server:

sudo ambari-server restart

4. Restart NGINX:

a. Log in to the Ambari user interface at https://Ambari_server_IP_address:8080 with the new credentials.

b. Click Nginx Reverse Proxy on the left.

c. From the Service Actions drop-down menu at the top, select Restart All and Confirm Restart All when prompted.

Replace the SSH Keys for a Cluster

You can add a new SSH public key to an Oracle Big Data Cloud cluster. This is helpful if the SSH private key used to access the cluster is lost or corrupted.

To add a new public key for a cluster:

1. Open the service console. See Access the Service Console for Big Data Cloud.

2. Click SSH Access.

A page listing all clusters in your identity domain is displayed. For information about the details on this page, see Service Console: SSH Access Page.

3. Locate the cluster for which you want to add a new public key and click Add New Key.

The Add New Key window is displayed, showing the value of the most recent public key.

4. Specify the new public key using one of the following methods:

• Select Upload a new SSH Public Key value. Browse for and select the file that contains the public key.

• Select Key value. Delete the current key value and paste the new public key into the text area. The value cannot contain line breaks or end with a line break.

5. Click Add New Key.

Update Cloud Storage Credentials

An Oracle Cloud Infrastructure Object Storage Classic container is associated with a cluster when the cluster is created. A storage password is also specified when the cluster is created. This password is used by Oracle Big Data Cloud to authenticate to Oracle Cloud Infrastructure Object Storage Classic for all access, including the propagation of log files and general data access. If you change this password outside of Big Data Cloud, you must also update the password as described in the following procedure.

Note:

Do not change the storage password as described here unless the password has been changed outside of Big Data Cloud.

To update the Oracle Cloud Infrastructure Object Storage Classic password associated with a cluster:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Settings, then click the Credentials tab.

For information about the details on the page, see Big Data Cloud Console: Settings Page.

3. In the Cloud Storage section, enter and confirm the new password, then click Save.

Use the Cluster Credential Store

You can store credentials in the credential store for a cluster, so they're not passed in clear text in command line parameters or job code. After you create a credential, you can reference it from your job or notebook. When the cluster is deleted, the credential store is deleted as well.

To store credentials in the credential store for a cluster:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Settings, then click the Credentials tab.

For information about the details on the page, see Big Data Cloud Console: Settings Page.

3. In the User Credentials section, click New Credential to create a new credential.

4. In the Key field, enter the desired name or identifier for the credential. For example: database_password.

5. In the Value field, enter the value for the credential, then click Save.

Manage Certificates Used for the Cluster Console

You can change the certificate used for the Big Data Cloud Console (also known as the cluster console).

To change the certificate associated with a cluster:

1. Enable SSH access. See About Network Access.

2. SSH to all master nodes and replace /etc/nginx/nginx.crt and /etc/nginx/nginx.key with your certificate (a command sketch follows these steps).

3. Use Ambari to restart the NGINX reverse proxy on all master nodes.
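A minimal sketch of steps 2 and 3, assuming an opc login user, a hypothetical master node IP address, and certificate files mycert.crt and mycert.key on your local machine (restart NGINX through Ambari afterward):

$ scp mycert.crt mycert.key opc@master_node_IP:/tmp
$ ssh opc@master_node_IP
$ sudo cp /tmp/mycert.crt /etc/nginx/nginx.crt
$ sudo cp /tmp/mycert.key /etc/nginx/nginx.key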


Update the Security Key for Big Data Cloud on Oracle Cloud Infrastructure

Note:

In Big Data Cloud, the PEM key must be created without a password.

To update the security key:

1. Copy the PEM file to /etc/bdcsce/conf/oci_api_key.pem on all nodes of the system.

2. Obtain the new fingerprint and update the fingerprint using the Ambari > Yarn > Config > fs.oci.client.auth.fingerprint property. (A command sketch follows these steps.)

3. Restart the required components from the Ambari UI.
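A minimal sketch of steps 1 and 2, assuming an opc login user and a hypothetical node IP address; the openssl command shown is the standard way to compute the fingerprint of an OCI API signing key:

$ scp oci_api_key.pem opc@node_IP:/tmp
$ ssh opc@node_IP sudo mv /tmp/oci_api_key.pem /etc/bdcsce/conf/oci_api_key.pem
$ openssl rsa -pubout -outform DER -in oci_api_key.pem | openssl md5 -c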


8 Manage Data

Data consumed and generated by Oracle Big Data Cloud is stored in Oracle Cloud Infrastructure Object Storage, the persistent data lake for the service.

Persistent data is stored in Oracle Cloud Infrastructure Object Storage containers that were associated with clusters when the clusters were created. You can also upload data into HDFS, but that data is lost once a cluster is terminated. Log files and output data are also stored in Oracle Cloud Infrastructure Object Storage containers.

Big Data Cloud also includes the Oracle Big Data File System (BDFS), an in-memory caching layer that enables Spark jobs to run much faster.

Note:

For information about connecting to Oracle Database from Big Data Cloud using database connectors, see Connect to Oracle Database.

Topics

• Load Data Into Cloud Storage

• Upload Files Into HDFS

• Browse Data

• About the Big Data File System (BDFS)

Load Data Into Cloud Storage

This topic does not apply to Oracle Cloud Infrastructure.

To load data into Oracle Cloud Infrastructure Object Storage Classic, you must have the URL and credentials for the Oracle Cloud Infrastructure Object Storage Classic container associated with the cluster when the cluster was created. Clusters are created by administrators, who then provide the storage container URL and credentials to cluster users. A cluster cannot be associated with more than one container, and cannot be associated with a container after the cluster has been created.

To load data into the container associated with the cluster, use the tools and methods documented in the Oracle Cloud Infrastructure Object Storage Classic documentation. The following resources will help get you started (a hedged curl example follows the list):

• Managing Containers in Object Storage Classic in Using Oracle Cloud Infrastructure Object Storage Classic

• Managing Objects in Object Storage Classic in Using Oracle Cloud Infrastructure Object Storage Classic


• Uploading Files in Command-Line Reference for Oracle Cloud Infrastructure Object Storage Classic

• Creating a Single Object in Using Oracle Cloud Infrastructure Object Storage Classic
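As a hedged illustration of the REST approach (all names below are hypothetical), you can request an authentication token and then upload an object with curl:

$ curl -i -X GET -H "X-Storage-User: Storage-myIdentityDomain:myUser" -H "X-Storage-Pass: myPassword" https://myIdentityDomain.storage.oraclecloud.com/auth/v1.0
$ curl -X PUT -H "X-Auth-Token: <token returned by the first call>" -T localfile.csv <X-Storage-Url returned by the first call>/myContainer/localfile.csv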

You can also upload up to 5 GB of data to Cloud Storage by using the Big Data Cloud Console. To upload data using the console:

1. Open the console for a cluster. See Access the Big Data Cloud Console.

2. Click Data Stores.

The Data Stores page is displayed. For information about the details on this page, see Big Data Cloud Console: Data Stores Page.

3. Click Cloud Storage.

From here you can browse files and directories in the Oracle Cloud Infrastructure Object Storage Classic container associated with the cluster. To make browsing easier, filter by prefix.

4. Click Upload.

5. In the Upload File window, specify the directory to which you want to upload the file in the Path field, browse to the file or directory you want to upload in the File field, and click OK. The upload limit is 5 GB.

Upload Files Into HDFS

You can upload files into HDFS using the Big Data Cloud Console (cluster console) or Apache Ambari. Data stored in HDFS is lost once a cluster is terminated.

Upload Files Into HDFS Using the Cluster Console

1. Open the console for a cluster. See Access the Big Data Cloud Console.

2. Click Data Stores.

The Data Stores page is displayed. For information about the details on this page, see Big Data Cloud Console: Data Stores Page.

3. Click HDFS.

4. Navigate among directories and use the HDFS browser as desired:

• Click New Directory to add a new directory.

• Click Upload to browse for and upload a file. The upload limit is 100 MB.

• Use the menu for a directory or file to view details, delete, or download.

Upload Files Into HDFS Using Ambari

To upload files using Ambari:

1. Access the cluster by using Ambari. See Access Big Data Cloud Using Ambari.

2. In the Ambari management console, browse HDFS and upload files using Ambari Files View.

For detailed information, see documentation that describes how to browse HDFS using Ambari Files View. Those steps are not provided here.


Browse Data

You can browse data in HDFS and in the storage container associated with a cluster by using the built-in browser.

To browse data for a cluster:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Data Stores.

The Data Stores page is displayed. From here you can browse files and directories in HDFS and in the storage container associated with the cluster. For information about the details on this page, see Big Data Cloud Console: Data Stores Page.

For Oracle Cloud Infrastructure Object Storage Classic, filter by prefix to make browsing easier.

About the Big Data File System (BDFS)

Not supported on Oracle Cloud Infrastructure.

Oracle Big Data Cloud includes the Oracle Big Data File System (BDFS), an in-memory file system that accelerates access to data stored in multiple locations.

BDFS is compatible with the Hadoop file system and thus can be used with computational technologies such as Hive, MapReduce, and Spark. BDFS (currently based on Alluxio) is designed to accelerate data access for data pipelines and has several features that significantly improve the runtime performance of Spark applications. The focus of BDFS is to accelerate data access to and from Oracle Cloud Infrastructure Object Storage Classic by providing an active caching layer.

Note:

BDFS is available in the Full deployment profile only, not the Basic profile.

Using BDFS

Making use of BDFS doesn’t require any special integration. The mechanism to access data involves modifying the URI used by the application to access the underlying data. Typically, files are accessed by leveraging Oracle Cloud Infrastructure Object Storage Classic using a swift:// URL. Oracle Cloud Infrastructure Object Storage Classic can be used for both reading and writing temporal and persistent data. In the case of temporal data, leveraging BDFS results in a significant performance improvement because data doesn’t need to be transferred outside of the Big Data Cloud cluster. BDFS can also be used to read data stored on Oracle Cloud Infrastructure Object Storage Classic; the data is then cached in BDFS, so subsequent reads perform better.


BDFS Topology

The BDFS architecture is composed of master and slave processes. BDFS master processes are responsible for general coordination and bookkeeping, while the slave processes are responsible for the actual caching of data. The amount of memory made available to Alluxio by default is 1 GB per Alluxio worker node in the cluster.

The out-of-box configuration can be modified through Apache Ambari by changing the alluxio.worker.memory setting in alluxio-env (Ambari Advanced tab). All workers must be restarted for changes to take effect.
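For example, to allocate 4 GB to each worker (an illustrative value only), the alluxio-env setting would be changed to:

alluxio.worker.memory=4GB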

Highly Available

BDFS is made highly available for clusters that have an initial size of at least three nodes. Multiple master components are deployed, and one is elected leader, while the others are put in standby mode. If the leader goes down, one of the standby masters is promoted to leader. Leader election is managed by Apache ZooKeeper.

Off-Heap Storage (Cache)

The canonical use-case for BDFS is to share in-memory data across different applications, including Spark applications. The benefit of leveraging BDFS for this purpose is to reduce the memory usage of the Spark process and accelerate data access for downstream Spark applications. It’s common for Spark applications to store data using the RDD cache() or persist() API. This enables the RDD to be stored in the Spark executors, which makes the fetching of data efficient because the data is retained in memory. The drawback to this technique is that, depending on the amount of memory consumed in the executor, it may not leave enough memory for successful execution. An alternative is to leverage BDFS to store the RDD off-heap. This frees up valuable executor memory for processing while off-loading the caching of the data to BDFS.

Any Spark API used to save RDDs can be used with BDFS. Examples include:

• rdd.saveAsTextFile(BDFS_FILE_URI) – Suitable for storing the RDD as a text file; each element in the RDD will be written as a separate line

• rdd.saveAsObjectFile(BDFS_FILE_URI) – Suitable for storing the RDD as a file using Java serialization and deserialization

• rdd.saveAsSequenceFile(BDFS_FILE_URI) – Suitable for storing key-value pairs as a Hadoop SequenceFile

An example BDFS_FILE_URI per the above is: bdfs://localhost:19998/bdcsce/myrdd.txt

Once the RDD is saved to BDFS, the RDD can be accessed by any other Spark application in the job pipeline.
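A minimal PySpark sketch of this pattern, assuming the hypothetical input file and the example BDFS URI shown in this section:

from pyspark import SparkContext

sc = SparkContext(appName="BDFSOffHeapCache")

# Read source data from Cloud Storage (hypothetical path)
rdd = sc.textFile("swift://bdcsce.default/somepath/file.txt")

# Persist the RDD off-heap in BDFS instead of caching it in the executors
rdd.saveAsTextFile("bdfs://localhost:19998/bdcsce/myrdd.txt")

# A downstream application in the pipeline can read the cached copy directly
cached = sc.textFile("bdfs://localhost:19998/bdcsce/myrdd.txt")
print(cached.count())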

Oracle Cloud Infrastructure Object Storage Classic Read Access

BDFS mounts Oracle Cloud Infrastructure Object Storage Classic as a read-only file system, which allows for direct reads from Oracle Cloud Infrastructure Object Storage Classic. This enables files that reside on Oracle Cloud Infrastructure Object Storage Classic to be accessed through BDFS. The benefit of accessing files through BDFS is that the file is cached in BDFS. As such, subsequent reads of the same file are more performant, since the underlying data doesn’t need to be fetched from Oracle Cloud Infrastructure Object Storage Classic a second time. Instead, the file is read directly from BDFS.

The following are example Swift and BDFS URLs for the same file. In this example, it’s assumed that BDFS is mounted to the bdcsce container in Oracle Cloud Infrastructure Object Storage Classic.

• Swift URL: swift://bdcsce.default/somepath/file.txt

• BDFS URL: bdfs://localhost:19998/somepath/file.txt

The host and port in the BDFS URL are required but aren’t used because the cluster is configured for an HA environment. The requirement to provide the host and port will likely be eliminated in a future release.

BDFS Tiered Storage

BDFS Tiered Storage provides the ability to store more objects in the caching layer beyond what can be kept in memory. This is accomplished by evicting objects held in memory to block storage for retrieval later. As memory is exhausted, BDFS automatically moves objects from memory into block storage. This feature allows the caching of objects beyond the capacity of the total memory cache available across the BDFS cluster (collection of workers). BDFS consumes memory lazily, so if BDFS isn't used, it incurs little overhead.

You can specify the size of the BDFS cache block storage when you create a cluster. The total amount of cache provided by BDFS is the sum of the RAM allocated to BDFS plus the total block storage allocated for spillover. The amount of memory allocated to BDFS is based on the compute shape selected when the cluster was created. The following table summarizes the amount of memory allocated to BDFS.

Compute Shape        Total Memory Available    Total Memory Allocated to BDFS
OC2m                 30 GB                     1 GB
All other shapes     (varies by shape)         16% of Total Memory Available


9 Connect to Oracle Database

This section describes how to connect to Oracle Database from Big Data Cloud using the Oracle Loader for Hadoop and Copy to Hadoop database connectors. These connectors are preinstalled and preconfigured on all cluster nodes in Oracle Big Data Cloud.

Oracle Loader for Hadoop and Copy to Hadoop are high-speed connectors used to load data into and copy data from Oracle Database. The interface for these connectors is the Oracle Shell for Hadoop Loaders (OHSH) command line interface.

Topics

• Use the Oracle Shell for Hadoop Loaders Interface (OHSH)

• Use Oracle Loader for Hadoop

• Use Copy to Hadoop

Use the Oracle Shell for Hadoop Loaders Interface (OHSH)

The following sections describe how to use the Oracle Shell for Hadoop Loaders (OHSH) interface.

Oracle Shell for Hadoop Loaders is the preferred way to use the Oracle Loader for Hadoop and Copy to Hadoop database connectors. It includes a command line interface (whose simple command syntax can also be scripted) for moving data between Hadoop and Oracle Database using the database connectors.

About Oracle Shell for Hadoop Loaders

Oracle Shell for Hadoop Loaders is a helper shell that provides an easy-to-use command line interface to Oracle Loader for Hadoop and Copy to Hadoop. It has basic shell features such as command line recall, history, inheriting environment variables from the parent process, setting new or existing environment variables, and performing environmental substitution in the command line.

The core functionality of OHSH includes the following:

• Defining named external resources with which OHSH interacts to perform loading tasks.

• Setting default values for load operations.

• Running load commands.

• Delegating simple pre- and post-load tasks to the operating system, HDFS, Hive, and Oracle. These tasks include viewing the data to be loaded and viewing the data in the target table after loading.


Configure Big Data Cloud for Oracle Shell for Hadoop Loaders

To get started with OHSH in Oracle Big Data Cloud:

1. SSH to a node on Big Data Cloud and log in, then execute the following:

sudo su oracle

2. Add /opt/oracle/dbconnector/ohsh/bin to your PATH variable. The OHSH executable is at this location.

3. Start OHSH with the following command:

ohsh

You’re now ready to run OHSH commands to move data between Big Data Cloud and Oracle Database.

Get Started with Oracle Shell for Hadoop Loaders

Starting an OHSH Interactive Session

To start an interactive session, enter ohsh on the command line. This brings you to the OHSH shell (if you have ohsh in your path):

$ ohsh
ohsh>

You can execute OHSH commands in this shell (using the OHSH syntax). You can also execute commands for Beeline/Hive, Hadoop, Bash, and SQL*Plus. For non-OHSH commands, you add a delegation operator prefix ("%") to the name of the resource used to execute the command. For example:

ohsh> %bash0 ls -l

Scripting OHSH

You can also script the same commands that work in the CLI. The ohsh command provides three parameters for working with scripts.

• ohsh -i <filename>.ohsh

The -i parameter tells OHSH to initialize an interactive session with the commands in the script before the prompt appears. This is a useful way to set up the required session resources and automate other preliminary tasks before you start working within the shell.

$ ohsh -i initresources.ohsh

• ohsh -f <filename>.ohsh

The ohsh command with the -f parameter starts a non-interactive session and runs the commands in the script.

$ ohsh -f myunattendedjobs.ohsh


• ohsh -i -f <filename>.ohsh

You can use -i and -f together to initialize a non-interactive session and then run another script in the session.

$ ohsh -i mysetup.ohsh -f myunattendedjobs.ohsh

• ohsh -c

This command dumps all Hadoop configuration properties that an OHSH session inherits at startup.

Working With OHSH Resources

A resource is some named entity that OHSH interacts with. For example, a Hadoop cluster is a resource, as is a JDBC connection to an Oracle database, a Hive database, a SQL*Plus session with an Oracle database, and a Bash shell on the local OS.

OHSH provides two default resources at startup: hive0 (to connect to the default Hive database) and hadoop0.

• Using the hive0 resource to execute a Hive command:

ohsh> %hive0 show tables;

You can create additional Hive resources to connect to other Hive databases.

• Using the hadoop0 resource to execute a Hadoop command:

ohsh> %hadoop0 fs -ls

Within an interactive or scripted session, you can create instances of additional resources, such as SQL*Plus and JDBC. You need to create these two resources in order to connect to Oracle Database through OHSH.

• Creating an SQL*Plus resource:

ohsh> create sqlplus resource sql0 connectid="bigdatalite.localdomain:1521/orcl"

• Creating a JDBC resource:

ohsh> create jdbc resource jdbc0 connectid=<database connection URL>

• Showing resources:

ohsh> show resources

This command lists default resources and any additional resources created within the session.

Getting Help

The OHSH shell provides online help for all commands.


To get a list of all OHSH commands:

ohsh> help

To get help on a specific command, enter help, followed by the command:

ohsh> help show

The table below describes the help categories available.

Help Command     Description
help load        Describes load commands for Oracle and Hadoop tables.
help set         Shows help for setting defaults for load operations. It also describes which load methods are impacted by a particular setting.
help show        Shows help for inspecting default settings.
help shell       Shows shell-like commands.
help resource    Shows commands for creating and dropping named resources.

Use Oracle Loader for Hadoop

The following sections describe how to use Oracle Loader for Hadoop to load data from Hadoop into tables in Oracle Database.

About Oracle Loader for Hadoop

Oracle Loader for Hadoop (OLH) is an efficient and high-performance loader for fast loading of data from a Hadoop cluster into a table in an Oracle database.

Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into a database-ready format. It can also sort records by primary key or user-specified columns before loading the data or creating output files. Oracle Loader for Hadoop uses the parallel processing framework of Hadoop to perform these preprocessing operations, which other loaders typically perform on the database server as part of the load process. Off-loading these operations to Hadoop reduces the CPU requirements on the database server, thereby lessening the performance impact on other database tasks.

Oracle Shell for Hadoop Loaders (OHSH) is the preferred way to use Oracle Loader for Hadoop. It includes a command line interface (whose simple command syntax can also be scripted) for moving data between Hadoop and Oracle Database using various resources, including Oracle Loader for Hadoop. See Use the Oracle Shell for Hadoop Loaders Interface (OHSH).


Get Started With Oracle Loader for Hadoop

These instructions show how to use Oracle Loader for Hadoop through OHSH.

Before You Start

This is what you need to know before using OLH to load an Oracle Database table with data stored in Hadoop:

• The password of the database schema you are connecting to (which is implied by the database connection URL).

• The name of the Oracle Database table.

• The source of the data living in Hadoop (either a path to an HDFS directory or the name of a Hive table).

• The preferred method for loading. Choose either JDBC or direct path. Direct path load is faster, but requires partitioning of the target table. JDBC does not.

About Resources

In OHSH, the term resources refers to the interfaces that OHSH presents for defining the data source, destination, and command language. Four types of resources are available:

• Hadoop resources – for executing HDFS commands to navigate HDFS and use HDFS as a source or destination.

• Hive resources – for executing Hive commands and specifying Hive as a source or destination.

• JDBC resources – for making JDBC connections to a database.

• SQL*Plus resources – for executing SQL commands in a database schema.

Two resources are created upon OHSH startup:

• hive0 – enables access to the Hive database default.

• hadoop0 – enables access to HDFS.

You can create SQL*Plus and JDBC resources within a session, as well as additional Hive resources (for example, to connect to other Hive databases). Assign a resource any name that is meaningful to you. In the examples below, we use the names ora_mydatabase and sql0.

Where resources are invoked in the commands below, the percent sign (%) prefix identifies a resource name.

Loading an Oracle Database Table

1. Start an OHSH session.

$ ohsh
ohsh>

2. Create the following resources:


• SQL*Plus resource

ohsh> create sqlplus resource sql0 connectid="<database connection url>"

At the prompt, enter the database password.

• JDBC resource.

You can provide any name. A name that indicates the target schema is recommended.

ohsh> create jdbc resource ora_mydatabase connectid="<database connection url>"

At the prompt, enter the database password.

• Additional Hive resources (if required). The default Hive resource hive0 connects to the default database in Hive. If you want to connect to another Hive database, create another resource:

ohsh> create hive resource hive_mydatabase connectionurl="jdbc:hive2:///<Hive database name>"

3. Use the load command to load files from HDFS into a target table in the Oracle database.

The following command loads data from a delimited text file in HDFS <HDFS path> into the target table in Oracle Database using the direct path option.

ohsh> load oracle table ora_mydatabase:<target table in the Oracle database> from path hadoop0:/user/<HDFS path> using directpath

Note:

The default direct path method is the fastest way to load a table. However, it requires a partitioned target table. Direct path is always recommended for use with partitioned tables. Use the JDBC option to load into a non-partitioned target table.

If the command does not explicitly state the load method, then OHSH automatically uses the appropriate method. If the target Oracle table is partitioned, then by default OHSH uses direct path (that is, Oracle OCI). If the Oracle table is not partitioned, it uses JDBC.

4. After loading, check the number of rows.

You can do this conveniently from the OHSH command line:

ohsh> %sql0 select count(*) from <target table in Oracle Database>


Loading a Hive Table Into an Oracle Database Table

You can use OHSH to load a Hive table into a target table in an Oracle database. The command below shows how to do this using the direct path method.

ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name>

Note that if the target table is partitioned, then OHSH uses direct path automatically. You do not need to enter using directpath explicitly in the command.

If the target table is non-partitioned, then specify the JDBC method instead:

ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using jdbc

Note:

The load command assumes that the column names in the Hive table and in the Oracle Database table are identically matched. If they do not match, then use an OHSH loadermap.

Using OHSH Loadermaps

The simple load examples in this section assume the following:

• Where we load data from a text file in Hadoop into an Oracle Database table, the declared order of columns in the target table maps correctly to the physical ordering of the delimited text fields in the file.

• Where we load Hive tables into Oracle Database tables, the Hive and Oracle Database column names are identically matched.

However, in less straightforward cases where the column names (or the order of column names and delimited text fields) do not match, use the OHSH loadermap construct to correct these mismatches.

You can also use a loadermap to specify a subset of target columns to load into a table or, in the case of a load from a text file, to specify the format of a field in the load.

Loadermaps are not covered in this introduction.

Performance Tuning Oracle Loader for Hadoop in OHSH

Aside from network bandwidth, two factors can have a significant impact on Oracle Loader for Hadoop performance. You can tune both in OHSH.

• Degree of parallelism

The degree of parallelism affects performance when Oracle Loader for Hadoop runs in Hadoop. For the default method (direct path), parallelism is determined by the number of reducer tasks. The higher the number of reducer tasks, the faster the performance. The default value is 4. To set the number of tasks:

ohsh> set reducetasks 18

For the JDBC option, parallelism is determined by the number of map tasks and the optimal number is determined automatically. However, remember that if the target table is partitioned, direct path is faster than JDBC.

• Load balancing

Performance is best when the load is balanced evenly across reduce tasks. The load is detected by sampling. Sampling is always enabled by default for loads using the JDBC and the default copy method.

Debugging in OHSH

Several OHSH settings control the availability of debugging information:

• outputlevel

The outputlevel is set to minimal by default. Set it to verbose in order to return a stack trace when a command fails:

ohsh> set outputlevel verbose

• logbadrecords

ohsh> set logbadrecords true

This is set to true by default.

These log files are informative for debugging:

• Oracle Loader for Hadoop log files.

/user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>/<OHSH job ID>/_olh

• Log files generated by the map and reduce tasks.

Other OHSH Properties That Are Useful for Oracle Loader for Hadoop

You can set these properties on the OHSH command line or in a script; a combined example follows the list.

• dateformat

ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"

The syntax for this command is dictated by the Java date format.

• rejectlimit

The number of rows that can be rejected before the load of a delimited text file fails.

• fieldterminator


The field terminator in loads of delimited text files.

• hadooptnsadmin

Location of an Oracle TNS admin directory in the Hadoop cluster.

• hadoopwalletlocation

Location of the Oracle Wallet directory in the Hadoop cluster.
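A hedged example of setting several of these properties in one session (the values are illustrative only, and rejectlimit and fieldterminator are assumed to follow the same set syntax as dateformat):

ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"
ohsh> set rejectlimit 10
ohsh> set fieldterminator ","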

Use Copy to Hadoop

The following sections describe how to use Copy to Hadoop to copy Oracle Database tables to Hadoop.

About Copy to Hadoop

Copy to Hadoop makes it simple to identify and copy Oracle data to the Hadoop Distributed File System (HDFS).

Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. The Oracle Data Pump files can be queried by Hive. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.

Oracle Shell for Hadoop Loaders (OHSH) is the preferred way to use Copy to Hadoop. It includes a command line interface (whose simple command syntax can also be scripted) for moving data between Hadoop and Oracle Database using various resources, including Copy to Hadoop. See Use the Oracle Shell for Hadoop Loaders Interface (OHSH).

First Look: Loading an Oracle Table Into Hive and Storing the Data in Hadoop

This set of examples shows how to use Copy to Hadoop to load data from an Oracle table, store the data in Hadoop, and perform related operations within the OHSH shell. It assumes that OHSH and Copy to Hadoop are already installed and configured.

What’s Demonstrated in the Examples

These examples demonstrate the following tasks:

• Starting an OHSH session and creating the resources you’ll need for Copy to Hadoop.

• Using Copy to Hadoop to copy the data from the selected Oracle Database table to a new Hive table in Hadoop (using the resources that you created).

• Using the load operation to add more data to the Hive table created in the first example.

• Using the create or replace operation to drop the Hive table and replace it with a new one that has a different record set.

• Querying the data in the Hive table and in the Oracle Database table.

• Converting the data into other formats.


Tip:

You may want to select or create a small table in Oracle Database and work through these steps.

Starting OHSH, Creating Resources, and Running Copy to Hadoop

1. Start OHSH. (The startup command below assumes that you’ve added the OHSH path to your PATH variable as recommended.)

$ ohsh
ohsh>

2. Create the following resources.

• SQL*Plus resource.

ohsh> create sqlplus resource sql0 connectid="<database_connection_url>"

• JDBC resource.

ohsh> create jdbc resource jdbc0 connectid="<database_connection_url>"

Note:

For the Hive access shown in this example, only the default hive0 resource is needed. This resource is already configured to connect to the default Hive database. If additional Hive resources were required, you would create them as follows:

ohsh> create hive resource hive_mydatabase connectionurl="jdbc:hive2:///<Hive_database_name>"

3. Include the Oracle Database table name in the create hive table command below and run the command. This command uses the Copy to Hadoop directcopy method. Note that directcopy is the default mode and you do not actually need to name it explicitly.

ohsh> create hive table hive0:<new_Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> using directcopy

The Oracle table data is now stored in Hadoop as a Hive table.

Adding More Data to the Hive Table

Use the OHSH load method to add data to an existing Hive table.


Let’s assume that the original Oracle table includes a time field in the format DD-MM-YY and that a number of daily records were added after the Copy to Hadoop operation that created the corresponding Hive table.

Use load to add these new records to the existing Hive table:

ohsh> load hive table hive0:<Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> where "(time >= '01-FEB-18')"

Using OHSH create or replace

The OHSH create or replace operation does the following:

1. Drops the named Hive table (and the associated Data Pump files) if a table by this name already exists.

Note:

Unlike create or replace, a create operation fails and returns an error if the Hive table and the related Data Pump files already exist.

2. Creates a new Hive table using the name provided.

Suppose some records were deleted from the original Oracle Database table and you want to realign the Hive table with the new state of the Oracle Database table. Hive does not support update or delete operations on records, but the create or replace operation in OHSH can achieve the same end result:

ohsh> create or replace hive table hive0:<new_hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name>

Note:

Data copied to Hadoop by Copy to Hadoop can be queried through Hive, but the data itself is actually stored as Oracle Data Pump files. Hive only points to the Data Pump files.

Querying the Hive Table

You can invoke a Hive resource in OHSH in order to run HiveQL commands. Likewise, you can invoke an SQL*Plus resource to run SQL commands. For example, these two queries compare the original Oracle Database table with the derivative Hive table:

ohsh> %sql0 select count(*) from <Oracle_Database_table_name>
ohsh> %hive0 select count(*) from <Hive_table_name>


Storing Data in Other Formats, Such as Parquet or ORC

By default, Copy to Hadoop outputs Data Pump files. In a create operation, you can use the "stored as" syntax to change the destination format to Parquet or ORC:

ohsh> %hive0 create table <Hive_table_name_parquet> stored as parquet as select * from <Hive_table_name>

This example creates the Data Pump files, but then immediately copies them to Parquet format. (The original Data Pump files are not deleted.)


10 Work with Jobs

This section describes how to create, run, and manage Apache Spark jobs in Oracle Big Data Cloud.

Topics

• Create a Job

• Run a Job

• About MapReduce Jobs

• Stop a Job

• View Jobs and Job Details

• View Job Logs

• Monitor and Troubleshoot Jobs

• Manage Work Queue Capacity

• Create Work Queues

Create a Job

Use the following procedure to create and run a job. When you’re done creating the job, the job is automatically submitted for execution. Once the cluster has enough capacity, the job is executed.

To create a job:

1. Open the cluster console for the desired cluster. See Access the Big Data Cloud Console.

2. Click Jobs.

The Spark Jobs page is displayed, listing any jobs associated with the cluster. For information about the details on this page, see Big Data Cloud Console: Jobs Page.

The Zeppelin entry represents a running Apache Spark job used for notebooks. Apache Zeppelin is the notebook interface and coding environment for Big Data Cloud.

3. Click New Job.

The New Job wizard starts and the Details page is displayed.

4. On the Details page, specify the following and then click Next to advance to the Configuration page.

• Name: Name for the job.

• Description: (Optional) Description for the job.

10-1

Page 84: Using Oracle Big Data Cloud · Get Started with Oracle Shell for Hadoop Loaders 9-2 Use Oracle Loader for Hadoop 9-4 About Oracle Loader for Hadoop 9-4 Get Started With Oracle Loader

• Type: Type of job: Spark, Python Spark, or MapReduce. For Spark job submissions, the application can be written in any language as long as the application can be executed on the Java Virtual Machine. For more information about submitting MapReduce jobs, see About MapReduce Jobs.

5. On the Configuration page, configure the driver, executor, and queue settings for the job, then click Next to advance to the Driver File page.

Note: For MapReduce jobs, you’ll just specify the queue on the Configuration page.

• Driver Cores: Number of CPU cores assigned to the Spark driver process.

• Driver Memory: Amount of memory assigned to the Spark driver process, in GB or MB. This value cannot exceed the memory available on the driver host, which is dependent on cluster shape. Some memory is reserved for supporting processes.

• Executor Cores: Number of CPU cores made available for each Spark executor.

• Executor Memory: Amount of memory made available for each Spark executor, in GB or MB.

• No. of Executors: Number of Spark executor processes used to execute the job.

• Queue: Name of the resource queue to which the job will be targeted. When a cluster is created, a set of queues is also created and configured by default. Which queues get created is determined by the queue profile specified when the cluster was created and whether preemption was set to Off or On. The preemption setting can’t be changed after a cluster is created.

If preemption was set to Off (disabled), the following queues are available by default:

– dedicated: Queue used for all REST API and Zeppelin job submissions. Default capacity is 80, with a maximum capacity of 80.

– default: Queue used for all Hive and Spark Thrift job submissions. Default capacity is 20, with a maximum capacity of 20.

If preemption was set to On (enabled), the following queues are available by default:

– api: Queue used for all REST API job submissions. Default capacity is 50, with a maximum capacity of 100.

– interactive: Queue used for all Zeppelin job submissions. Default capacity is 40, with a maximum capacity of 100. To allocate more of the cluster's resources to Notebook, increase this queue's capacity.

– default: Queue used for all Hive and Spark Thrift job submissions. Default capacity is 10, with a maximum capacity of 100.

In addition to the queues created by default, you can also create and use custom queues. See Create Work Queues.

6. On the Driver File page, specify the job driver file and its main class (for Spark jobs), command line arguments, and any additional JARs or supporting files needed for executing the job. Then click Next to advance to the Confirmation page.


• File Path: Path to the executable for the job. Click Browse to select a file inHDFS or Cloud Storage, or to upload a file from your local file system. The filemust have a .jar or .zip extension. In the Browse HDFS window, you canalso browse to and try some examples.

• Main Class: (Spark and MapReduce jobs only) Main class to run the job.

• Arguments: (Optional) Any arguments used to invoke the main class. Specify one argument per line.

• Additional Py Modules: (Python Spark jobs only) Any Python dependencies required for the application. You can specify more than one file. Click Browse to select a file in HDFS or Cloud Storage, or to upload a file from your local file system (.py file only).

• Additional Jars: (Optional) Any JAR dependencies required for the application, such as Spark libraries. Multiple files can be specified. Use Browse to select a file (.jar or .zip file only).

• Additional Support Files: (Optional) Any additional support files required for the application. Multiple files can be specified. Use Browse to select a file (.jar or .zip file only).

7. On the Confirmation page, review the information listed. If you're satisfied with what you see, click Create to create the job and submit the job for execution.

If you need to change something before creating and submitting the job, click Prev at the top of the wizard to step back through the pages, or click Cancel to cancel out of the wizard.

When you’re done creating the job, the job is automatically submitted for execution. It typically sits in an Accepted state for a short period and then execution begins. If a job sits in the Accepted state for a long time, this usually means there aren't enough resources available on the cluster to satisfy the job requirements as defined by the job submission. You can address this by either reducing the resource requirements of the job or by terminating existing jobs that aren't required (such as Zeppelin), as sketched below.
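If you have SSH access to a cluster node, the YARN command line is a quick way to confirm this diagnosis. A minimal sketch (the application ID shown is hypothetical):

# List applications stuck in the ACCEPTED state, along with the queue each was submitted to
yarn application -list -appStates ACCEPTED

# Free resources by terminating a job that isn't required (hypothetical application ID)
yarn application -kill application_1487791745000_0001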

Run a Job

When you create a job, the job is automatically submitted for execution. Once the cluster has enough capacity, the job is executed.

To rerun a job after the job has finished executing, you must create the job again. See Create a Job.

About MapReduce Jobs

You can submit MapReduce jobs using the cluster console, the REST API, or the command line interface.


Note:

• The MapReduce API is based on org.apache.hadoop.mapreduce.Job and creates its own YARN application, and thus requires its own slots in the cluster. You must customize the job according to https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/mapred/jobcontrol/Job.html.

• The Stocator Swift driver (swift2d://) is not supported using the MapReduce API. All MapReduce applications that require access to Oracle Cloud Infrastructure Object Storage Classic should make use of the Hadoop OpenStack Swift driver (swift://).

Use the Cluster Console

You can submit MapReduce jobs from the Jobs tab in the Big Data Cloud Console. See Create a Job.

Use the REST API

You can use the REST API to submit MapReduce jobs. Example job submission:

{ "job": { "applicationClass": "org.apache.hadoop.examples.ExampleDriver", "applicationFile": "hdfs:///mapred/examples/hadoop-mapreduce-examples.jar", "applicationArguments": [ "wordcount", "hdfs:///mapred/data/one_word.txt", "hdfs:///tmp/one_word-1487791745.out" ], "hadoopConf": { "a.hadoop.conf.key": "a.hadoop.conf.value" }, "queue": "api", "applicationName":"MapReduceWordCount" }}

Assuming the above content is contained within payload_mr_job.json, the corresponding REST API request would look as follows:

curl -k -s -X POST "https://big_data_cluster_host:1080/bdcsce/api/v1.1/clustermgmt/identity_domain/instances/cluster_name/jobs/mapred" \
  -H "X-ID-TENANT-NAME: identity_domain" \
  -H "Content-Type: application/json; charset=utf-8" \
  --user "bdcsce_admin:csm_password" \
  -d @payload_mr_job.json

For information about using the REST API, see REST API for Oracle Big Data Cloud.

Use the Command Line Interface

MapReduce jobs can be executed from the shell command line. To do so, SSH to any node in the cluster, and then submit the job. The following example shows how to submit a MapReduce job using the command line:


opc@host/> sudo su oracle
oracle@host/> hadoop fs -mkdir /user/oracle/mapredsmokeinput
oracle@host/> hadoop fs -put /tmp/ambari.properties.1 /user/oracle/mapredsmokeinput
oracle@host/> yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-2.*.jar wordcount /user/oracle/mapredsmokeinput /user/oracle/mapredsmokeoutput
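When the job completes, you can inspect its output from the same session. A minimal sketch; part-r-00000 is the typical reducer output file name and may differ in your run:

oracle@host/> hadoop fs -ls /user/oracle/mapredsmokeoutput
oracle@host/> hadoop fs -cat /user/oracle/mapredsmokeoutput/part-r-00000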

Stop a Job

To stop a job:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Jobs.

The Spark Jobs page is displayed.

3. From the menu for the job you want to stop, select Abort Job.

The Abort Job window is displayed.

4. Select Yes.

The Abort Job window closes, the job is terminated, and the job logs are moved to Cloud Storage.

View Jobs and Job Details

You can view jobs and job details. Details include queue, progress, and status.

To view jobs and job details:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Jobs.

The Spark Jobs page is displayed, listing any jobs associated with the cluster. You can choose from list view or table view. You may need to refresh the page to see the jobs.

3. From the menu for the job for which you want to view details, select Details.

The Details page is displayed with details for the job.

View Job Logs

Job logs provide basic details about jobs that are running or completed.

You can also drill into the details of a job by using the Spark UI. See Monitor and Troubleshoot Jobs.

To view job logs:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Jobs.

The Spark Jobs page is displayed.

3. From the menu for the job for which you want to view log files, select Logs.

The Logs page is displayed.


4. View the desired log:

• Container Logs: All log files for a running job. These logs are available only when the job is running, with one log file per container.

• Aggregated Logs: All log files aggregated by YARN and available in HDFS. These logs are available and updated periodically only while the job is running, and are useful for long-running jobs.

• Archived Logs: All log files for a completed job, archived in Oracle Cloud Infrastructure Object Storage Classic. Once a job is done (whether failed, terminated, or successful), the aggregated logs are removed from HDFS and stored in Oracle Cloud Infrastructure Object Storage Classic.

Note:

When a Spark job completes, aggregated job logs can take 5 minutes or more to become available in the Big Data Cloud Console.
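While the aggregated logs are still in HDFS, you can also retrieve them with the standard YARN log tool, assuming SSH access to a cluster node. The application ID below is hypothetical:

yarn logs -applicationId application_1487791745000_0001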

Monitor and Troubleshoot Jobs

Apache Spark includes a web UI that provides detailed information about jobs that are running or completed.

Use the Spark UI when you need more information about a job than what's available in job logs (see View Job Logs).

To access the Spark UI:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Jobs.

The Spark Jobs page is displayed.

3. From the menu for the job you want to explore, select Spark UI.

The Spark UI window is displayed, from which you can drill down and access details about the job.

For specific information about the Spark UI, see the Apache Spark documentation.

Manage Work Queue Capacity

Work queues are used to allocate cluster resources among users. When you create a job, you specify the work queue to be used when the job is executed.

Oracle Big Data Cloud uses the YARN capacity scheduler for work queues. A set of work queues is created by default when a cluster is created. Which queues get created is determined by the queue profile specified when the cluster was created and whether preemption was set to Off or On. The preemption setting can't be changed after a cluster is created.

If preemption was set to Off (disabled), the following queues are available by default:

• dedicated: Queue used for all REST API and Zeppelin job submissions. Default capacity is 80, with a maximum capacity of 80.


• default: Queue used for all Hive and Spark Thrift job submissions. Default capacity is 20, with a maximum capacity of 20.

If preemption was set to On (enabled), the following queues are available by default:

• api: Queue used for all REST API job submissions. Default capacity is 50, with a maximum capacity of 100.

• interactive: Queue used for all Zeppelin job submissions. Default capacity is 40, with a maximum capacity of 100. To allocate more of the cluster's resources to Notebook, increase this queue's capacity.

• default: Queue used for all Hive and Spark Thrift job submissions. Default capacity is 10, with a maximum capacity of 100.

You can modify existing queues, or add new ones. To create a new queue, see Create Work Queues. Total capacity of all queues cannot exceed 100%.
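Because Big Data Cloud uses the standard YARN capacity scheduler, you can also inspect a queue's configured and current capacity from the command line, assuming SSH access to a cluster node:

# Show the state, configured capacity, and current usage of the default queue
yarn queue -status default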

To manage work queue capacity:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Settings.

The Queues tab on the Settings page is displayed, listing the current queues and their configurations. For information about the details on the page, see Big Data Cloud Console: Settings Page.

3. Modify queue capacity as desired, making note of the explanatory information on the page. Preemption was set when the cluster was created and cannot be changed.

Create Work Queues

Work queues are configured and available by default when a cluster is created. You can also create new work queues.

Note:

You can’t delete a queue after it’s been created. This has implications for the capacity distribution between queues, because queue capacity also can’t be set to zero (0).

To create a work queue:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Settings.

The Queues page is displayed, listing the current queues and their configurations. Preemption is either enabled or disabled (on or off). This value was set when the cluster was created and cannot be changed. If preemption is disabled, jobs can't consume more resources than a specific queue allows. If preemption is enabled, jobs can consume more resources than a queue allows, but could lose those resources when another job comes in that has priority for those resources.

3. Click New Queue.

A row for the new queue is added to the list of queues.


4. In the Queue Name field, enter the name for the queue.

5. In the Capacity field, specify queue capacity, making note of the explanatory information on the page and changing capacity as necessary. Total capacity of all queues cannot exceed 100%. You won't be able to create the new queue if total capacity exceeds 100.

6. Click Save. Once the queue is configured, new jobs can be assigned to it.


11 Work with Notebook

Notebooks are used to explore and visualize data in an iterative fashion. Oracle Big Data Cloud uses Apache Zeppelin as its notebook interface and coding environment.

To access notes created and shared by other Zeppelin users, see details at https://www.zepl.com.

In addition to using the console, you can also use the Zeppelin Notebook REST API to perform many of the tasks in this section. See https://zeppelin.apache.org/docs/0.7.0/rest-api/rest-notebook.html.
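For example, the following sketch lists all notes and then runs every paragraph of one note through the Zeppelin REST API. The host, port, credentials, and note ID are placeholders, and the exact endpoint and authentication mechanism depend on how your cluster exposes the Zeppelin server:

# List all notes (returns JSON with note IDs and names)
curl -s -u "user:password" "http://zeppelin_host:9995/api/notebook"

# Run all paragraphs of a note by its ID (hypothetical note ID)
curl -s -X POST -u "user:password" "http://zeppelin_host:9995/api/notebook/job/2A94M5J1Z"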

Topics

• Create a Note in a Notebook

• Run a Note

• View and Edit a Note

• Import a Note

• Export a Note

• Delete a Note

• Organize Notes

• Manage Notebook Settings

• Interpreters Available for Big Data Cloud

Create a Note in a Notebook

To create a note:

1. Open the cluster console for the desired cluster. See Access the Big Data Cloud Console.

2. Click Notebook.

The Notebook page is displayed, listing any notes created for this notebook. Sample notes are also provided. For information about the details on this page, see Big Data Cloud Console: Notebook Page.

3. Click New Note.

The New Note window is displayed.

4. Enter the name for the note, then click OK.

The note is created.

5. In the note interface, code the note.

The note is saved automatically.


Run a Note

You can run an entire note, or just specific paragraphs.

To run all or part of a note:

1. Open the cluster console for the desired cluster. See Access the Big Data Cloud Console.

2. Click Notebook.

The Notebook page is displayed, listing any notes created for this notebook.

3. Click the link for the note or, from the menu for the note, click View Note.

The note is displayed.

4. Click the icon to run all paragraphs in the note, or to run specific paragraphs.

The note or paragraph runs and the output is displayed.

If your code produces visualizations, there are many ways you can explore and analyze your results.

Also explore the actions you can perform using the note and paragraph icons. You can:

• Show and hide the code editor

• Show and hide results

• Clear results

• Export the note

• Use keyboard shortcuts

• Bind interpreters and change the default interpreter

View and Edit a Note

To view and edit a note:

1. Open the cluster console for the desired cluster. See Access the Big Data Cloud Console.

2. Click Notebook.

The Notebook page is displayed, listing any notes created for this notebook. You can choose from list view or table view.

3. Click the link for the note or, from the menu for the note, click View Note.

The note is displayed.

4. Edit the note if desired, modifying existing paragraphs or adding new ones.

The note is saved automatically.

Import a Note

To import a note:


1. Open the cluster console for the desired cluster. See Access the Big Data Cloud Console.

2. Click Notebook.

The Notebook page is displayed, listing any notes created for this notebook.

3. Click Import Note.

The Import Note window is displayed.

4. Browse for or specify the file you want to import, then click OK. The file must have a .json extension.

The note is imported and listed in the list of notes.

Export a Note

To export a note:

1. Open the cluster console for the desired cluster. See Access the Big Data Cloud Console.

2. Click Notebook.

The Notebook page is displayed, listing any notes created for this notebook.

3. From the menu for the note, click Export Note, then specify the export location on your local file system.

Delete a Note

To delete a note:

1. Open the cluster console for the desired cluster. See Access the Big Data Cloud Console.

2. Click Notebook.

The Notebook page is displayed, listing any notes created for this notebook.

3. From the menu for the note, click Delete Note and confirm the action.

Organize Notes

You can organize notes into folders by specifying paths.

To organize notes:

1. Open the cluster console for the desired cluster. See Access the Big Data Cloud Console.

2. Click Notebook.

The Notebook page is displayed, listing any notes created for this notebook.

3. Click the icon to get into table view. You have the same features and capabilities for notes in table view as you do in the flat list view. Table view just gives you a way to organize your notes.

4. Perform the desired action:


• To create a new folder and note at the same time: Click New Note and enter /NoteDirectory/NoteName in the Note Name field. For example, entering /Demo/Note1 creates a Demo folder that contains the new Note1.

• To move a note: Click the note name in the list on the Notebook page to get into edit mode, then click the note title and change it to the desired path. For example, if you want to move Note1 from the Demo folder to the Test folder, change the title from Demo/Note1 to Test/Note1. If you want to create a new folder and move Note1 into it, change the title to /NewDirectory/Note1.

Manage Notebook Settings

You can restart the notebook server (Zeppelin) if you notice data is cached or you encounter other issues with notes. You can also configure interpreters for notebooks. Interpreters are bindings that determine how code should be interpreted and where it should be submitted for execution.

To restart notebooks and configure interpreters:

1. Open the cluster console for the cluster. See Access the Big Data Cloud Console.

2. Click Settings, then click the Notebook tab.

For information about the details on the page, see Big Data Cloud Console: Settings Page.

3. To restart the notebook server, click Restart.

The current Zeppelin job is terminated and a new Zeppelin job is launched in the cluster. Any parameter or configuration changes also take effect.

4. To configure interpreters:

• Click Edit to change interpreter settings. Modify the values, then click Save.

• Click Restart to restart an interpreter, then click OK to confirm the action.

Interpreters Available for Big Data Cloud

The following interpreters are available for Oracle Big Data Cloud.

Hadoop Interpreters

• org.apache.zeppelin.hive.HiveInterpreter: Hive

• org.apache.zeppelin.file.HDFSFileInterpreter: HDFS

• org.apache.zeppelin.alluxio.AlluxioInterpreter: Alluxio

Spark Interpreter Group

• org.apache.zeppelin.spark.SparkInterpreter: SparkContext and Scala


• org.apache.zeppelin.spark.PySparkInterpreter: PySpark

• org.apache.zeppelin.spark.SparkSqlInterpreter: SparkSQL

• org.apache.zeppelin.spark.DepInterpreter: Dependency loader

Other Interpreters

• org.apache.zeppelin.markdown.Markdown: Markdown language

• org.apache.zeppelin.angular.AngularInterpreter: Angular

• org.apache.zeppelin.shell.ShellInterpreter: Unix shell

• org.apache.zeppelin.jdbc.JDBCInterpreter: JDBC


12 Work with Oracle R Advanced Analytics for Hadoop (ORAAH)

Oracle R Advanced Analytics for Hadoop (ORAAH) is a collection of R packages that enable Big Data analytics from an R environment.

This section describes how to get started with ORAAH in Oracle Big Data Cloud. For detailed information about ORAAH, see Using Oracle R Advanced Analytics for Hadoop in Oracle Big Data Connectors User's Guide. The information in that chapter also pertains to Big Data Cloud.

Topics

• About ORAAH in Big Data Cloud

• Use ORAAH in Big Data Cloud

About ORAAH in Big Data Cloud

Oracle R Advanced Analytics for Hadoop (ORAAH) packages are preinstalled on all cluster nodes in Oracle Big Data Cloud if Spark 1.6 is selected when you create a cluster.

Big Data Cloud uses Apache Zeppelin as its notebook interface and coding environment. Once you have access to the notebook environment, you can start using the R language and the scalability that ORAAH brings to machine learning in the Big Data environment.

Use ORAAH in Big Data Cloud

These instructions describe how to use Oracle R Advanced Analytics for Hadoop (ORAAH) in Big Data Cloud.

Get Started with ORAAH

In a new notebook in Big Data Cloud, type %r and then type your R code.

To use the ORAAH libraries, you first need to load the libraries in a paragraph that is executed before other ORAAH functions. For example, you can add the following lines of code to load the ORAAH libraries inside your R session and check the files in your HDFS home:

%r
# Load the ORAAH library:
library(ORCH)
# List the datasets available in the cluster under the user's home HDFS folder:
hdfs.ls()


Connect to Hive

When connecting to Hive from a notebook paragraph, use zeppelin as the user. This is the specific user that has the read/write permissions on HDFS that ORAAH uses for storing temporary files. You can use localhost as the host.

%r
# Load the ORAAH library:
library(ORCH)
# After loading the libraries, connect to Hive with the following command:
ore.connect(user="zeppelin", host="localhost", port="10002",
            schema="default", type="HIVE", transportMode="http",
            httpPath="hs2service")
# List all tables available in Hive:
ore.ls()

Use the Spark-based Machine Learning Interfaces

To use the Spark-based machine learning interfaces to ORAAH's ML algorithms and the Spark MLlib algorithms, use ORAAH's spark.connect() command to start an exclusive Spark session. The algorithms can then be executed against data stored in HDFS or Hive.

For example, the following lines of code establish a connection to an exclusive Spark session and run a few of the built-in ML examples from ORAAH. Note that IP_address in the following example is specific to your Big Data Cloud environment.

%r
# Load the ORAAH library:
library(ORCH)
# Ensure no other connection exists from ORAAH into Spark:
spark.disconnect()
# Connect to Spark via YARN, asking for 2 GB of RAM, using as dfs.namenode the
# IP address indicated, usually the public IP address of the service:
spark.connect(master='yarn-client', dfs.namenode='IP_address', memory='2G')

# Run the example of Logistic Regression by ORAAH:
example(orch.glm2)
# Run the example of the Multi-layer Neural Networks by ORAAH:
example(orch.neural2)
# Run the example of the Linear Regression by ORAAH:
example(orch.lm2)
# Run the example of the Spark MLlib Random Forest via ORAAH:
example(orch.ml.random.forest)
# Run the example of the Spark MLlib Decision Trees via ORAAH:
example(orch.ml.dt)
# Run the example of the Spark MLlib Support Vector Machines via ORAAH:
example(orch.ml.svm)
# Run the example of the Spark MLlib Logistic Regression via ORAAH:
example(orch.ml.logistic)
# Run the example of the Spark MLlib k-Means Clustering via ORAAH:
example(orch.ml.kmeans)
# Run the example of the Spark MLlib Gaussian Mixture Model Clustering via ORAAH:
example(orch.ml.gmm)


13 Troubleshoot Big Data Cloud

This section describes common problems you might encounter when using Oracle Big Data Cloud and explains how to solve them.

Topics

• Problems with Administering Clusters

• Problems with Patching and Rollback

Problems with Administering Clusters

The following information applies to problems with administering clusters on Oracle Big Data Cloud.

• I get a warning that the object store credentials are out of sync

• I need to view the status of running services

• Services aren’t being restarted properly after life cycle operations

• I need to modify the Ambari Web inactivity timeout

• I need to control the Ambari-agent service

• I need to control the Ambari-server service

I get a warning that the object store credentials are out of sync

You might get a warning notification in the Big Data Cloud Console, indicating that object store credentials are out of sync. This occurs if the Cloud Storage password specified when a cluster was created is later changed outside of Oracle Big Data Cloud. The password needs to be updated. See Update Cloud Storage Credentials.

I need to view the status of running services

There are two ways to view the status of a cluster and the associated services:

• Big Data Cloud Console: You can view the status of services on the Status page in the cluster console. Two views are available on the Status page: an overall summary on the Services tab, and the same information broken down by host on the Hosts tab. See View Cluster Component Status.

• Ambari user interface: You can view detailed information about each service through the Ambari user interface. Open port 8080 and navigate to the Ambari server host (for example, through an SSH tunnel, as sketched below). See related information in Access Big Data Cloud Using Ambari and Enable Access Rules.
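If port 8080 isn't open to your network, one common approach is an SSH tunnel to the Ambari server host. A hypothetical example (the key path and host name are placeholders):

ssh -i /path/to/private_key -L 8080:localhost:8080 opc@ambari_server_host
# Then browse to http://localhost:8080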


Services aren’t being restarted properly after life cycle operations

If services aren’t being restarted properly after life cycle operations such as cluster start/stop, scale-out, patch-apply, and patch-rollback, the most likely cause is that ambari-agent has either terminated or has not been able to start. See I need to control the Ambari-agent service.

I need to modify the Ambari Web inactivity timeout

Ambari Web automatically logs users out after a period of inactivity and redirects to the login page. This inactivity timeout is configurable for Operators and Read-Only users.

The value of the inactivity timeout is specified in seconds. By default, the inactivity timeout is set to 3600 seconds (one hour) for all users. To disable the inactivity timeout feature, set the value to 0.

To change the inactivity timeout:

1. Ensure the Ambari server is completely stopped before making any changes to the inactivity timeout.

2. Open the /u01/bdcsce/etc/ambari-server/conf/ambari.properties file on the Ambari server host with a text editor.

3. Find the following two properties for the inactivity timeout setting. Both are initially set to 3600 seconds.

• user.inactivity.timeout.default: Sets the inactivity timeout (in seconds) for all non-Read-Only users.

• user.inactivity.timeout.role.readonly.default: Sets the inactivity timeout (in seconds) for all Read-Only users.

4. Set these properties to a desired timeout value in seconds.

5. Save the changes and restart the Ambari server. A scripted version of these steps is sketched below.
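Taken together, the steps might look like the following when scripted. This is a sketch only: it assumes the two properties already exist in the file (they do by default) and uses 7200 seconds as an example value. Run it as root on the Ambari server host:

ambari-server stop
sed -i 's/^user.inactivity.timeout.default=.*/user.inactivity.timeout.default=7200/' /u01/bdcsce/etc/ambari-server/conf/ambari.properties
sed -i 's/^user.inactivity.timeout.role.readonly.default=.*/user.inactivity.timeout.role.readonly.default=7200/' /u01/bdcsce/etc/ambari-server/conf/ambari.properties
ambari-server start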

I need to control the Ambari-agent service

To control the Ambari-agent service, SSH into each node and execute the following commands as root:

• To get the status of the Ambari agent:

ambari-agent status

• To start the Ambari agent:

ambari-agent start

• To stop the Ambari agent:

ambari-agent stop

• To restart the Ambari agent:

ambari-agent restart


Also see Access Big Data Cloud Using Ambari.

I need to control the Ambari-server service

To control the Ambari-server service, SSH into node 1 and execute the following commands as root:

• To get the status of the Ambari server:

ambari-server status

• To start the Ambari server:

ambari-server start

• To stop the Ambari server:

ambari-server stop

• To restart the Ambari server:

ambari-server restart

Also see Access Big Data Cloud Using Ambari.

Problems with Patching and Rollback

The following information applies to problems with patching and rollback operations on Oracle Big Data Cloud.

• I can’t apply a patch

• Patching fails due to disk space

I can’t apply a patch

The most common reason a patch can't be applied is a failure in the precheck step. To apply a patch, the entire cluster must be in a healthy state, with all services up and running. Log in to Ambari to verify that all services are running and start any services that aren't. If you can't start all services, verify that ambari-agent is running on all cluster nodes (a quick scripted check is sketched below). See I need to control the Ambari-agent service.
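A quick scripted check of the agents, assuming SSH access and sudo rights on each node (the host names below are hypothetical):

for host in node1 node2 node3; do
  ssh opc@"$host" 'sudo ambari-agent status'
done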

Patching fails due to disk space

Patching will fail if there isn't enough disk space on specific volumes of each node of the instance. Disk space requirements (a quick check is sketched after this list):

• Root volume (/): 5 GB

• Tools root (/u01/app/oracle/tools): 4 GB

• Install root (/u01/bdcsce): 25 GB

• Data volume (/data): 10% free space
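To check the free space on these volumes before applying a patch, you can run df on each node. This sketch simply lists the volumes named above:

df -h / /u01/app/oracle/tools /u01/bdcsce /data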


A Oracle Cloud Pages for Big Data Cloud

This section provides information about what you can do and what you see on the Oracle Cloud pages used to administer and use Oracle Big Data Cloud.

There are two web-based consoles:

• Service console for Big Data Cloud, used by administrators to create and manage clusters, monitor cluster health, view service activities, manage SSH keys, and perform other administrative tasks. See Access the Service Console for Big Data Cloud.

• Big Data Cloud Console (also referred to previously in this document as the cluster console), used to run Apache Spark jobs, manage notebooks and notes, browse HDFS and Cloud Storage, and manage work queue configurations. See Access the Big Data Cloud Console.

Topics

• Service Console: Instances Page

• Service Console Create Instance: Instance Page

• Service Console Create Instance: Service Details Page

• Service Console Create Instance: Confirmation Page

• Service Console: Activity Page

• Service Console: SSH Access Page

• Service Console: Instance Overview Page

• Service Console: Access Rules Page

• Big Data Cloud Console: Overview Page

• Big Data Cloud Console: Jobs Page

• Big Data Cloud Console New Job: Details Page

• Big Data Cloud Console New Job: Configuration Page

• Big Data Cloud Console New Job: Driver File Page

• Big Data Cloud Console New Job: Confirmation Page

• Big Data Cloud Console: Notebook Page

• Big Data Cloud Console: Data Stores Page

• Big Data Cloud Console: Status Page

• Big Data Cloud Console: Settings Page


Service Console: Instances Page

The Instances page displays all clusters on Oracle Big Data Cloud, and enables you to create a cluster if a cluster does not yet exist.

Topics

• What You Can Do from the Instances Page

• What You See on the Instances Page

What You Can Do from the Instances Page

Use the Instances page to perform tasks described in the following topics:

• View All Clusters

• Create a Cluster

• View Details for a Cluster

• Stop, Start, and Restart a Cluster

• Delete a Cluster

• Access the Big Data Cloud Console

• Replace the SSH Keys for a Cluster

• Create Access Rules

What You See on the Instances Page

Element Description

(Identity domain field) Identity domain for the account.

(Help icon) Click for Help resources.

(User icon) Click to access the User menu with help, accessibility options, about, and sign-out.

Click to go to the Welcome page.

Navigation menu providing access to various services the user has purchased.

Activity: Click to go to the Activity page.

SSH Access: (Displayed after a cluster has been created.) Click to go to the SSH Access page.

Click to refresh the page.


Instances, OCPUs, Memory, Storage, and Public IPs: Summary of resources being used:

• Instances — Total number of configured clusters.

• OCPUs — Total number of Oracle CPUs allocated across all clusters.

• Memory — Total amount of compute node memory allocated across all clusters.

• Storage — Total amount of raw block storage allocated across all clusters.

• Public IPs — Number of public IP addresses allocated across all clusters.

Enter a full or partial cluster name to filter the list of clusters to include only those that contain the string in their name.

Click to create a new Oracle Big Data Cloud cluster. See Create a Cluster.

Status: Status of the cluster if it is not running.

Version: Version of Oracle Big Data Cloud configured on the cluster.

Nodes: Number of nodes assigned to the cluster.

Created On or Submitted On: Date when the cluster was created. During the creation process, the date when the creation request was submitted.

One or more patch(es) available: Notification indicating patches are available for a cluster. Click to see patch information. (Only displayed if patches are available.)

OCPUs: Number of Oracle CPUs associated with the cluster.

Memory: Amount of compute node memory in GB associated with the cluster.

Storage: Amount of raw block storage in GB associated with the cluster.


menu icon (for each cluster)

Menu that provides the following options:

• Big Data Cloud Console — Click to open the Big Data Cloud Console (cluster console).

• Start — Click to start all the virtual machines (VMs) hosting the nodes of the cluster.

• Stop — Click to stop all the virtual machines (VMs) hosting the nodes of the cluster.

• Restart — Click to restart all the virtual machines (VMs) hosting the nodes of the cluster.

• SSH Access — Add another SSH public key or replace the existing SSH public key associated with the cluster. See Replace the SSH Keys for a Cluster.

• Access Rules — Click to navigate to the Access Rules page. (Does not apply to Oracle Cloud Infrastructure)

• Delete — Click to delete the cluster from the list of clusters displayed in the console.

Instance Create and Delete History: Listing of attempts to create or delete a cluster. Click the triangle icon next to the title to view the history listing.

Service Console Create Instance: Instance Page

You use the Create Instance: Instance page to create a new Oracle Big Data Cloud cluster.

Topics

• What You See in the Navigation Area

• What You See in the Instance Section

What You See in the Navigation Area

Element Description

Cancel: Click to cancel creating an Oracle Big Data Cloud cluster.

Next >: Click to navigate to the Create Instance: Service Details page.


What You See in the Instance Section

Element Description

Instance Name: The name for the new Oracle Big Data Cloud cluster. The name:

• Must not exceed 30 characters.

• Must start with a letter.

• For IDCS-enabled clusters: Must contain only letters and numbers.

• For non-IDCS-enabled clusters: Can contain hyphens. Hyphens are the only special characters you can use.

• Must be unique within the identity domain.

Description: (Optional) A description that can be used to help identify the cluster.

The description is only used during cluster list display and is not used internally by Service Manager.

Notification Email: (Optional) Email address that provisioning status updates should be sent to.

Region: (Displayed only if your account has multiple regions)

The region for the cluster. If you choose a region that supports Oracle Cloud Infrastructure, the Availability Domain and Subnet fields are displayed and populated, and the cluster will be created on Oracle Cloud Infrastructure. Otherwise, those fields are not displayed and the cluster will be created on Oracle Cloud Infrastructure Classic.

To create your cluster on Oracle Cloud Infrastructure, select us-phoenix-1, us-ashburn-1, eu-frankfurt-1, or uk-london-1 if those regions are available to you (which regions are displayed depends on which default data region was selected during the subscription process). If you select any other region, the cluster will be created on Oracle Cloud Infrastructure Classic.

Select No Preference to let Big Data Cloud choose an Oracle Cloud Infrastructure Classic region for you.

Availability Domain: (Displayed only on Oracle Cloud Infrastructure)

The availability domain (within the region) where the cluster will be placed.


Subnet: (Displayed only on Oracle Cloud Infrastructure)

The subnet (within the availability domain) that will determine network access to the cluster.

Select a subnet from a virtual cloud network (VCN) that you created previously on Oracle Cloud Infrastructure. Select No Preference to let Big Data Cloud choose a subnet for you.

IP Network: (Not available on Oracle Cloud Infrastructure)

(Available only if you have selected a region and you have defined one or more IP networks created in that region using Oracle Cloud Infrastructure Compute Classic.)

Select the IP network where you want the cluster placed. Choose No Preference to use the default shared network provided by Oracle Cloud Infrastructure Compute Classic.

For more information about IP networks, see About IP Networks and Creating an IP Network in Using Oracle Cloud Infrastructure Compute Classic.

Metering Frequency: (Displayed only if you have a traditional metered subscription)

The metering frequency used to determine the billing for resources used by the cluster.

Tags: (Not available on Oracle Cloud at Customer)

Tags to be associated with the cluster. See Manage Tags.

Service Console Create Instance: Service Details Page

You can use the Create Instance: Service Details page to provide more details about the new Oracle Big Data Cloud cluster that you are about to create.

Topics

• What You See in the Navigation Area

• What You See in the Cluster Configuration Section

• What You See in the Credentials Section

• What You See in the Associations Section

• What You See in the Cloud Storage Credentials Section

• What You See in the Block Storage Settings Section


What You See in the Navigation Area

Element Description

< Previous: Click to navigate to the Create Instance: Instance page.

Cancel: Click to cancel creating an Oracle Big Data Cloud cluster.

Next >: Click to navigate to the Create Instance: Confirmation page.

Selection Summary: Click to see service details.

What You See in the Cluster Configuration Section

Element Description

Deployment Profile: Deployment profile for the cluster, based on its intended use. Deployment profiles are predefined sets of services optimized for specific uses. The deployment profile can't be changed after the cluster is created. Choices are:

• Full: (default) Provisions the cluster with Spark, Spark Thrift, Zeppelin, MapReduce, Hive, Alluxio, and Ambari Metrics. Use this profile if you want all of the features of Big Data Cloud.

• Basic: Subset of the Full profile. Provisions the cluster with Spark, Zeppelin, MapReduce, and Ambari Metrics. Use this profile if you don't need all of the features of Big Data Cloud and just want to run Spark or MapReduce jobs and use Notebooks. This profile does not include Alluxio (the in-memory cache), or Hive or JDBC connectivity for BI tools.

• Snap: Provisions the cluster with SNAP, Spark, and Zeppelin. Once the SNAP cluster is provisioned, the SNAP service is started and can be viewed in the Ambari user interface. The SNAP service is started only on the master node. All lifecycle operations (start/stop/restart) can be performed on the SNAP service using Ambari. See Access Big Data Cloud Using Ambari. SNAP clusters can only be used for the SNAP application and cannot be used for general-purpose Spark processing. Use the Full or Basic profile for general-purpose Spark processing. For information about SNAP, see the SNAP documentation.


Number of Nodes: Total number of nodes to be allocated to the cluster.

Choosing 3 or more nodes provides high availability (HA) with multiple master nodes. If you choose fewer than 3 nodes, only one node will be a master node, with all critical services running on that node in non-HA mode.

Any node in excess of the first 4 nodes that is not designated as a compute-only slave will run as a compute + storage node.

Compute Shape: Number of Oracle Compute Units (OCPUs) and amount of memory (RAM) for each node of the new cluster. Big Data Cloud offers many OCPU/RAM combinations.

Queue Profile: YARN capacity scheduler queue profile. Defines how queues and workloads are managed. Also determines which queues are created and available by default when the cluster is created. See Manage Work Queue Capacity.

The queue profile defines job queues appropriate for different types of workloads. Each queue has a minimum guaranteed capacity and a maximum allowed capacity. The preemption setting is explained below; it can't be changed after the cluster is created.

• Preemption Off: Jobs can't consume more resources than a specific queue allows. This can lead to lower cluster utilization.

• Preemption On: Jobs can consume more resources than a queue allows, but could lose those resources when another job comes in that has priority for those resources.

If preemption is on, jobs submitted to a particular queue do not have to wait because jobs of some other queue have taken up the available cluster capacity. If preemption is on and the cluster is unused, jobs from any queue can utilize 100% of the cluster capacity. This leads to better cluster utilization.

Spark Version: Spark version to be deployed on the cluster, Spark 1.6 or 2.1.

Note: Oracle R Advanced Analytics for Hadoop (ORAAH) is installed for Spark 1.6 clusters only.


What You See in the Credentials Section

Element Description

Use Identity Cloud Service to log in to the console: (Not available on Oracle Cloud Infrastructure) (Not displayed for all user accounts)

Select this to use IDCS as the client authentication mechanism for the cluster. Users will access the cluster with their own IDCS identity and credentials.

When this option is selected, cluster users and cluster access are managed through IDCS. If this option is not selected, HTTP Basic authentication is used and users access the cluster with the shared administrative user name and password specified below. For more information about cluster authentication, see Use Identity Cloud Service for Cluster Authentication.

SSH Public Key: The SSH public key to be used for authentication when using an SSH client to connect to a compute node of the new cluster.

Click Edit to specify the public key. You can upload a file containing the public key value, paste in the value of a public key, or create a system-generated key pair.

If you paste in the value, make sure the value does not contain line breaks or end with a line break.

User Name: Administrative user name. The user name cannot be admin.

For clusters that use Basic authentication, the administrative user name and password are used to access the cluster console, REST APIs, and Apache Ambari.

For clusters that use IDCS for authentication, the administrative user name and password are used only to access Ambari. Cluster access is managed through IDCS.

Password / Confirm Password: Password of the user specified in User Name. The password:

• Must be between 8 and 30 characters.

• Must contain at least one lowercase letter.

• Must contain at least one uppercase letter.

• Must contain at least one number.

• Must contain at least one special character.

What You See in the Associations Section

This section allows you to associate your new Oracle Big Data Cloud cluster with other cloud services, such as Oracle Event Hub Cloud Service, MySQL Cloud Service, and Oracle Database Cloud Service.


Select the Cloud Service that you want to associate with your Oracle Big Data Cloud cluster.

What You See in the Cloud Storage Credentials Section

The fields in this section are different depending on whether the cluster is being created on Oracle Cloud Infrastructure or on Oracle Cloud Infrastructure Classic.

Element Description

(Oracle Cloud Infrastructure) OCI Cloud Storage URL: The Oracle Cloud Infrastructure Object Storage URL. For example:

https://objectstorage.us-phoenix-1.oraclecloud.com

For information about the object storage URL, see REST APIs in the Oracle Cloud Infrastructure documentation.

(Oracle Cloud Infrastructure) OCI Cloud Storage Bucket URL: The URL of an existing bucket in Oracle Cloud Infrastructure Object Storage.

Format: oci://bucket@namespace/, where bucket is the default bucket where application binaries and application logs are stored, and namespace is your namespace.

Note: The bucket URL must have a trailing slash. If it doesn't, provisioning will fail.

(Oracle Cloud Infrastructure) OCI Cloud Storage User OCID: The Oracle Cloud Infrastructure Object Storage User OCID. See Where to Get the Tenancy's OCID and User's OCID in the Oracle Cloud Infrastructure documentation.

(Oracle Cloud Infrastructure) OCI Cloud Storage PEM Key: The Oracle Cloud Infrastructure Object Storage PEM key. This must be generated. See How to Generate an API Signing Key in the Oracle Cloud Infrastructure documentation.

Note: In Big Data Cloud, the PEM key must be created without a password.

(Oracle Cloud Infrastructure) OCI Cloud Storage PEM Key Fingerprint: The Oracle Cloud Infrastructure Object Storage PEM key fingerprint. This must be generated. See How to Generate an API Signing Key in the Oracle Cloud Infrastructure documentation.


(Oracle Cloud Infrastructure Classic) Cloud Storage Container: The name of the Oracle Cloud Infrastructure Object Storage Classic container that is associated with this cluster. The Oracle Cloud Infrastructure Object Storage Classic container is where the job logs are pushed upon completion.

You must enter the complete (fully qualified) REST URL for Oracle Cloud Infrastructure Object Storage Classic, appended by the container name.

Format: rest_endpoint_url/containerName

You can find the REST endpoint URL of the Oracle Cloud Infrastructure Object Storage Classic service instance in the Infrastructure Classic Console. See Finding the REST Endpoint URL for Your Cloud Account in Using Oracle Cloud Infrastructure Object Storage Classic.

Example: https://acme.storage.oraclecloud.com/v1/MyService-acme/MyContainer

The same formatting requirement applies to the cloudStorageContainer attribute in the REST API.

(Oracle Cloud Infrastructure Classic) User Name: The user name of an Oracle Cloud user who has access to the container specified in Cloud Storage Container.

(Oracle Cloud Infrastructure Classic) Password: The password of the user specified in Cloud Storage user name.

(Oracle Cloud Infrastructure Classic) Create Cloud Storage Container: Select this option if you do not have an Oracle Cloud Infrastructure Object Storage Classic container or if you do not want to reuse your existing Oracle Cloud Infrastructure Object Storage Classic containers.

Specify the above credentials and then select the Create Cloud Storage Container check box to create a new Oracle Cloud Infrastructure Object Storage Classic container with those credentials.


What You See in the Block Storage Settings Section

Element Description

Use High Performance Storage: (Not available on Oracle Cloud at Customer or Oracle Cloud Infrastructure)

Select this to use high performance storage for HDFS. With this option the storage attached to nodes uses SSDs (solid state drives) instead of HDDs (hard disk drives). Use this option for performance-critical workloads. An additional cost is associated with this type of storage.

Usable HDFS Storage (GB): The amount of storage in GB for HDFS.

Oracle Big Data Cloud uses a replication factor of 2 for HDFS, so the usable HDFS storage will be roughly half of the total allocated storage.

Usable BDFS Cache (GB): The amount of storage in GB the Big Data File System (BDFS) will use as a cache to accelerate workloads. The total amount of cache provided by BDFS is the sum of RAM allocated to BDFS plus the total block storage allocated for spillover.

The amount of memory allocated to BDFS is based on the compute shape selected when the cluster was created. For details about BDFS and memory allocation, see the information about BDFS Tiered Storage in About the Big Data File System (BDFS).

Total Allocated Storage (GB): The amount of raw block storage in GB that will be allocated to the new cluster.

Service Console Create Instance: Confirmation Page

The Create Instance: Confirmation page is the final page you use to create a new Oracle Big Data Cloud cluster.

What You See in the Create Instance: Confirmation page

The Create Instance: Confirmation page presents a summary list of all the choices you made on the preceding pages of the Create Instance wizard. In addition, it provides the controls described in the following table.

Element Description

< Previous: Click to navigate to the Create Instance: Service Details page.

Cancel: Click to cancel creating an Oracle Big Data Cloud cluster.


Create >: Click to begin the process of creating an Oracle Big Data Cloud cluster.

The Create Instance wizard closes and the Instances page is displayed, showing the new cluster with a status of Creating ....

Service Console: Activity Page

The Activity page displays activities for all cloud services in your identity domain. You can restrict the list of activities displayed using search filters.

Topics

• What You Can Do from the Activity Page

• What You See on the Oracle Big Data Cloud: Activity Page

What You Can Do from the Activity Page

Use the Activity page to view operations for all Oracle Big Data Cloud clusters in your identity domain.

You can use the page’s Search Activity Log section to filter the list of displayed operations based on:

• The time the operation was started

• The status of the operation

• The name of the cluster on which the operation was performed

• The service type of the cluster on which the operation was performed

• The type of the operation

In the table of results, you can:

• Click any column heading to sort the table by that column.

• Click the triangle at the start of an operation’s row to see more details about that operation.

What You See on the Oracle Big Data Cloud: Activity Page

Element Description

Displays Search Activity Log to search and review activities of cloud services in your identity domain.

Start Time Range: Filters activity results to include only operations started within a specified time range.


Operation Status: Filters operations by status of the operation:

• All

• Scheduled

• Running

• Succeeded

• Failed

You can select any subset of status types. The default value is All.

Service Name: Filters the activity results to include operations only for the specified cluster. You can enter a full or partial cluster name.

Service Type: Filters the activity results to include operations only for instances of the specified service type. The default value is the current cloud service.

Operation: Filters the activity results to include selected types of operations. You can select any subset of the given operations. The default value is All.

Search: Searches for activities by applying the filters specified by the Start Time Range, Status, Service Name, Service Type, and Operation fields, and displays activity results in the table.

Reset: Clears the Start Time Range and Service Name fields, and returns the Status and Operation fields to their default values.

Results per page: Specifies the number of results you want to view per page. The default value is 10.

Refreshes the page.

Displays status messages for the given operation. Clicking the resulting downward arrow hides the status messages.

Operation: Shows the type of operation performed on the cluster.

Service Name: Shows the name of the cluster. You can sort the column in ascending or descending order.

Service Type: Shows the type of cloud service for this cluster. You can sort the column in ascending or descending order.

Operation Status: Shows the status of the cluster. You can sort the column in ascending or descending order.

Start Time: Shows the time the operation started. You can sort the column in ascending or descending order.


End Time: Shows the time the operation ended, if the operation is complete. You can sort the column in ascending or descending order.

Initiated By: Shows the user that initiated the operation. The user can be any user in the identity domain who initiated the operation or, for certain operations such as automated backup, System. You can sort the column in ascending or descending order.

Service Console: SSH Access Page

The SSH Access page enables you to view and add SSH public keys to Oracle Big Data Cloud clusters in your identity domain. You can restrict the list of clusters displayed using search filters.

Topics

• What You Can Do from the SSH Access Page

• What You See on the SSH Access Page

What You Can Do from the SSH Access Page

Use the SSH Access page to view and add SSH public keys to Oracle Big Data Cloud clusters in your identity domain.

You can use the page’s search section to filter the list of displayed clusters based on the cluster name. In the table of results, you can:

• Click any column heading to sort the table by that column.

• Click the triangle at the start of a cluster’s row to see more details.

What You See on the SSH Access Page

Element Description

Service Name Filters the results to include SSH keys only for the specified clusters. You can enter a full or partial cluster name.

Service Type Filters the results to include SSH keys only for clusters of the specified service type.

Search Searches for SSH keys by applying the filters specified by the Service Name and Service Type fields, and displays the results in the table.

Results per page Specifies the number of results you want to view per page. The default value is 10.

Refreshes the page.


Displays a description of an item in the results table. Clicking on the resulting downward arrow hides the description.

Service Name Shows the name of the cluster.

Service Type Shows the type of service for this cluster.

Last Update Shows the most recent time the SSH keys for this cluster were updated.

Actions Click the Add New Key button to associate another SSH public key to this cluster, or to replace the existing SSH public key for the cluster.

The Add New Key overlay is displayed with its Key value field displaying the cluster’s most recent SSH public key.

Specify the public key using one of the following methods:

• Select Upload a new SSH Public Key value from file and click Choose File to select a file that contains the public key.

• Select Key value. Delete the current key value and paste the new public key into the text area. Make sure the value does not contain line breaks or end with a line break.

The Add New Key button is enabled only if the service is running.
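If you do not already have a key pair, you could generate one locally with standard OpenSSH tooling and print the public key as a single line suitable for pasting into the Key value field. This is a generic sketch; the file name and comment are illustrative.

# Generate an RSA key pair (file name and comment are illustrative).
ssh-keygen -t rsa -b 2048 -f ~/.ssh/bdc_cluster_key -C "bdc-cluster"

# Print the public key as one line with no trailing newline, as the
# Key value text area requires.
tr -d '\n' < ~/.ssh/bdc_cluster_key.pub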

Service Console: Instance Overview Page

The Instance Overview page displays overview information of an Oracle Big Data Cloud cluster.

Topics

• What You Can Do from the Instance Overview Page

• What You See in the Overview Section

• What You See in the Administration Section

What You Can Do from the Instance Overview Page

Use the Instance Overview page to perform tasks described in the following topics:

• View Details for a Cluster

• Stop, Start, and Restart a Cluster

• Scale a Cluster Out

• Scale a Cluster In

• Stop, Start, and Restart a Node

• Patch Big Data Cloud


What You See in the Overview Section

Element Description

Menu that provides the following options to manage the cluster:

• Big Data Cloud Console — Open the Big Data Cloud Console.

• Start — Start all the virtual machines (VMs) hosting the nodes of the cluster.

• Stop — Stop all the virtual machines (VMs) hosting the nodes of the cluster.

• Restart — Restart all the virtual machines (VMs) hosting the nodes of the cluster.

• Scale Out — Add one or more nodes to this cluster.

• Access Rules — Manage access rules that control network access to service components. (Does not apply to Oracle Cloud Infrastructure)

• SSH Access — Associate an SSH public key to the cluster.

• Add Tags or Manage Tags — Associate tags with the cluster. If any tags are already assigned, then the menu shows Manage Tags; otherwise, it shows Add Tags.

• View Activity — View activities for all cloud services in your identity domain.

Displays instance details.

Click to start all the virtual machines (VMs) hosting the nodes of this cluster.

Click to stop all the virtual machines (VMs) hosting the nodes of this cluster.

Click to restart all the virtual machines (VMs) hosting the nodes of this cluster.

Click to add one or more nodes to this cluster.

Click to display the health of this cluster.

Instance Overview Displays overview information for the cluster.

Click to refresh the page.

Nodes, OCPUs, Memory, and Storage Summary of resources being used:

• Nodes — Total number of nodes for this cluster.

• OCPUs — Total number of Oracle CPUs allocated across all clusters.

• Memory — Total amount of compute node memory allocated across all clusters.

• Storage — Total amount of raw block storage allocated across all clusters.


Status Status of the Oracle Big Data Cloud cluster.

User Name Administrative user name.

For clusters that use Basic authentication, the administrative user name and password are used to access the cluster console, REST APIs, and Apache Ambari.

For clusters that use IDCS for authentication, the administrative user name and password are used only to access Ambari. Cluster access is managed through IDCS.

Compute Shape The shape of all the nodes in the Oracle Big Data Cloud cluster.


Spark Thrift Server The URL for JDBC connection to Thrift. There are two relevant service URLs, one for Spark Thrift Server and one for Hive Thrift Server. The URLs differ depending on whether you have a Basic authentication cluster or an IDCS-enabled cluster.

For IDCS-enabled clusters, interactions are routed through the load balancing server instead of going directly to the cluster, and that difference is reflected in the URL.

Basic authentication cluster

• Spark Thrift Server:

jdbc:hive2://ip_address:1080/default;ssl=true;sslTrustStore=path_to_truststore;trustStorePassword=truststore_password?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice

• Hive Thrift Server:

jdbc:hive2://ip_address:1080/default;ssl=true;sslTrustStore=path_to_truststore;trustStorePassword=truststore_password?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hs2service

Where:

• ip_address is the IP address of the desired endpoint

• path_to_truststore is the absolute path to the Java Trust Store that holds the certificate

• truststore_password is the password used with the trust store

IDCS-enabled cluster

• Spark Thrift Server:

jdbc:hive2://cluster_name-load_balancing_server_URI/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice

• Hive Thrift Server:

jdbc:hive2://cluster_name-load_balancing_server_URI/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hs2service

Where:

• cluster_name is the name of the cluster

• load_balancing_server_URI is the URI assigned to the cluster by the load balancing service


For more information about Thrift access, see About Accessing Thrift.
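As an illustration, you could test a connection from a client host with the Beeline CLI that ships with Apache Hive. This is a minimal sketch for a Basic authentication cluster; the IP address, truststore path, credentials, and query are hypothetical placeholders, not values from this document.

# Connect to the Spark Thrift Server with Beeline and run a test query.
# All values below (address, truststore, credentials) are placeholders.
TRUSTSTORE=/opt/ssl/cluster-truststore.jks
beeline -u "jdbc:hive2://203.0.113.10:1080/default;ssl=true;sslTrustStore=${TRUSTSTORE};trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice" \
    -n admin_user -p admin_password \
    -e "SHOW TABLES;"

For the Hive Thrift Server, the same command applies with hive.server2.thrift.http.path=hs2service.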

Version Version of Oracle Big Data Cloud configured on the cluster.

Ambari Server Host The IP address of the server hosting Ambari.

Deployment Profile The deployment profile for the cluster.

Spark Version The Spark version, 1.6 or 2.1.

Show more / Show less (For IDCS-enabled clusters only) Click to display or hide the IDCS Application field. The IDCS Application field is listed only for IDCS-enabled clusters.

IDCS Application (For IDCS-enabled clusters only) The Oracle Identity Cloud Service (IDCS) application ID for the cluster. Click the link to access the IDCS application. See Use Identity Cloud Service for Cluster Authentication.

(for Resources)

Displays resource information for this cluster.

Host Name Name of an individual node in the Hadoop cluster.

Public IP IP address of an individual node in the Hadoop cluster.

Instance The kind of node within the Oracle Big Data Cloud cluster.

A node within a cluster can be a:

• Master node
• Compute only slave node
• Compute and storage node

OCPUs Total number of Oracle CPUs allocated for this machine.

Memory Total amount of compute node memory allocated for this machine.

Storage Total amount of storage allocated for this machine.


(for the resource)

Menu that provides the following options:

For Compute Slave nodes:

• Remove Node — Remove the virtual machine (VM) hosting the node from the cluster.

• Start — Start the virtual machine (VM) hosting the node of the cluster.

• Stop — Stop the virtual machine (VM) hosting the node of the cluster.

• Restart — Restart the virtual machine (VM) hosting the node of the cluster.

For Master nodes:

• Restart — Restart the virtual machine (VM) hosting the MASTER node of the cluster.

(for Associated Services)

Click to display information about services associated with the Oracle Big Data Cloud cluster.

(for Load Balancer)

(For IDCS-enabled clusters only) Click to display the cluster URL and other details.

What You See in the Administration Section

Element Description

Available Patches A list of patches you can apply to the cluster.

Click to refresh the page.

(for each listed patch) Menu that provides the following options for the patch:

• Precheck — Check whether the patch can be successfully applied to the cluster. See Check Patch Prerequisites.

• Patch — Apply the patch to the cluster. See Apply a Patch.

(for Patch and Rollback History)

Displays the patch and rollback history for the cluster.

Details of Last Patching Activity Expand to see a description of the actions taken during the last patching operation.

Rollback Click to roll back the last patching operation. See Roll Back a Patch or Failed Patch.

Service Console: Access Rules Page

This topic does not apply to Oracle Cloud Infrastructure.

The Access Rules page displays rules used to control network access to Oracle Big Data Cloud clusters. You use the page to view, manage, and create security rules.


Topics

• What You Can Do from the Access Rules Page

• What You See on the Access Rules Page

What You Can Do from the Access Rules Page

Use the Access Rules page to perform the following tasks:

• Create Access Rules

• Enable Access Rules

What You See on the Access Rules Page

Element Description

Menu that provides the following options to manage the cluster:

• Big Data Cloud Console — Open the Big Data Cloud Console.

• Start — Start all the virtual machines (VMs) hosting the nodes of the cluster.

• Stop — Stop all the virtual machines (VMs) hosting the nodes of the cluster.

• Restart — Restart all the virtual machines (VMs) hosting the nodes of the cluster.

• Scale Out — Add a single node or multiple nodes to this cluster.

• Access Rules — Manage access rules that control network access to service components.

• SSH Access — Replace an existing SSH public key or associate another SSH public key to this cluster.

• View Activity — View activities for all cloud services in your identity domain.

Create Rule Click to create a new access rule. See Create Access Rules.

Results per page Specifies the number of results you want to view per page. The default value is 10.

Refreshes the page.

Status Displays an icon that indicates whether the access rule is enabled or disabled.

Indicates the access rule is enabled.

Indicates the access rule is disabled.

Rule Name Name of the access rule. When creating an access rule, this must start with a letter, followed by letters, numbers, hyphens, or underscores.

Source Hosts from which traffic is allowed. Possible value is PUBLIC-INTERNET, or a custom value in the form of an IP address.


Destination Security list to which traffic is allowed.

Ports Port or range of ports for the access rule.

Protocol Protocol for the access rule.

Description Description of the access rule (optional).

Rule Type Type of access rule. Access rule types are:

• DEFAULT — Access rules created automatically when the Oracle Big Data Cloud cluster was created. Can be enabled or disabled, but not deleted.

• SYSTEM — Access rules created by the system. Cannot be enabled, disabled, or deleted.

• USER — Access rules created by you or another user. Can be enabled, disabled, or deleted.

(for each access rule)

Menu that provides the following options:

• Enable — Enables the access rule.
• Disable — Disables the access rule.
• Delete — Deletes the access rule (USER rules only).

Big Data Cloud Console: Overview Page

The Big Data Cloud Console: Overview page displays overview information for a Big Data Cloud cluster.

Topics

• What You Can Do from the Big Data Cloud Console: Overview Page

• What You See on the Big Data Cloud Console: Overview Page

What You Can Do from the Big Data Cloud Console: Overview Page

Use the Big Data Cloud Console: Overview page to perform tasks described in the following topics:

• View Details for a Cluster

• Create a Job

• View Jobs and Job Details

• Stop a Job

• View Job Logs


What You See on the Big Data Cloud Console: Overview Page

Element Description

User menu providing access to the API Catalog for Big Data Cloud, Help for this page, and About information with details about the console.

Jobs Click to go to the Jobs page.

Notebook Click to go to the Notebook page.

Data Stores Click to go to the Data Stores page.

Status Click to see information about components and services running on Oracle Big Data Cloud clusters and nodes and their associated state.

Settings Click to manage resource queue configurations.

Click to refresh the page.

Summary Display showing:

• Status — The status of the cluster.

• Uptime — The time since the cluster became operational.

• Healthy Nodes — Total number of operational nodes, out of the total number of nodes allocated for the cluster.

• Total OCPUs — Total number of Oracle CPUs allocated for the cluster.

• Total Memory — Total amount of compute node memory allocated for the cluster.

HDFS Capacity A pie chart indicating the total HDFS capacity, the percentage of space used for storing HDFS files and directories, and the percentage of space available for storing HDFS files and directories.

CPU Usage A graph indicating the percentage of CPU usage by the user and the system for a specified time interval. By default, the CPU usage is displayed for the last 24 hours.

Memory Usage A graph indicating the memory usage in GB for a specified time interval. By default, the memory usage is displayed for the last 24 hours.

Job Summary A pie chart indicating cumulative statistics of all the jobs submitted to a Big Data Cloud cluster from the time of cluster creation.

Job History A graph displaying the breakdown of all the jobs by status over a specified time interval. By default, the job history is displayed for the last 24 hours.

Recent Jobs A list of recent jobs performed in this cluster.


Click to create a new job.

Status The status of the job.

Started At The date and time when the job started.

Elapsed Time The time elapsed since the job started.

Type The type of job. It can be a Spark, Python Spark, or MapReduce job.

Finished At The date and time when the job finished.

Queue The queue to which the job is assigned.

menu icon (for each job)

Menu that provides the following options:

• Abort Job — Click to abort the job execution.

• Details... — Click to view details about the job.

• Logs... — Click to view a list of log files associated with the job.

• Spark UI... — Click to view a list of all the Spark Application UIs for each attempt of the job execution.

View All Jobs... Click to navigate to the Big Data Cloud Console: Jobs page to view all jobs in the cluster.

Big Data Cloud Console: Jobs Page

The Big Data Cloud Console: Jobs page displays all the jobs on Big Data Cloud.

Topics

• What You Can Do from the Big Data Cloud Console: Jobs Page

• What You See on the Big Data Cloud Console: Jobs Page

What You Can Do from the Big Data Cloud Console: Jobs Page

Use the Jobs page to perform tasks described in the following topics:

• View Jobs and Job Details

• Create a Job

• Stop a Job

• View Job Logs


What You See on the Big Data Cloud Console: Jobs Page

Element Description

User menu providing access to the API Catalog for Big Data Cloud, Help for this page, and About information with details about the console.

Overview Click to go to the Overview page.

Notebook Click to go to the Notebook page.

Data Stores Click to go to the Data Stores page.

Status Click to see information about components and services running on Oracle Big Data Cloud clusters and nodes and their associated state.

Settings Click to manage resource queue configurations.

Click to refresh the page.

Click to view jobs in a list mode.

Click to view jobs in a table mode.

Click to create a new Spark job.

Status The status of the job.

Started At The date and time when the job started.

Elapsed Time The time elapsed since the job started.

Type The type of job: Spark, Python Spark, or MapReduce.

Finished At The date and time when the job finished.

Queue The queue to which the job is assigned.

menu icon (for each job)

Menu that provides the following options:

• Abort Job — Click to abort the job execution.

• Details... — Click to view details about the job.

• Logs... — Click to view a list of all log files associated with the job.

• Spark UI... — Click to view a list of all the Spark Application UIs for each attempt of the job execution.


Big Data Cloud Console New Job: Details Page

You use the Big Data Cloud Console New Job: Details page to create a new Spark or a Python job in the Oracle Big Data Cloud cluster.

What You See in the Navigation Area

Element Description

Cancel Click to cancel creating a new job.

Next > Click to navigate to the Big Data Cloud Console New Job: Configuration page.

What You See in the Details Section

Element Description

Name Name of the new job that you want to create.

• Must not exceed 50 characters.
• Must start with a letter.
• Must contain only letters, numbers, or hyphens.
• Must not contain any other special characters.

Description (Optional) A description that can be used to help identify the job.

Type Select the type of job you want to create: Spark, Python Spark, or MapReduce.

For Spark job submissions, the application can be written in any language as long as the application can be executed on the Java Virtual Machine.

Big Data Cloud Console New Job: Configuration Page

You use the Big Data Cloud Console New Job: Configuration page to provide more details about the new job that you are about to create.

What You See in the Navigation Area

Element Description

< Previous Click to navigate to the Big Data Cloud Console New Job: Details page.

Cancel Click to cancel creating a new job.

Next > Click to navigate to the Big Data Cloud Console New Job: Driver File page.


What You See in the Configuration Section

Element Description

Driver Cores The total number of CPU cores that are assigned to a Spark driver process.

Driver Memory The amount of memory assigned to a Spark driver process, in MB or GB.

This value cannot exceed the memory available on the driver host, which is dependent on the compute shape used for the cluster. Also, some amount of memory is reserved for supporting processes.

Executor Core The number of CPU cores made available for each Spark executor.

Executor Memory The amount of memory made available for each Spark executor.

No. of Executors The number of Spark executor processes that will be used to execute the job.

Queue The name of the resource queue to which the job will be targeted. When a cluster is created, a set of queues is also created and configured by default. Which queues get created is determined by the queue profile specified when the cluster was created and whether preemption was set to Off or On (the preemption setting can't be changed after a cluster is created).

If preemption was set to Off (disabled), the following queues are available by default:

• dedicated: Queue used for all REST API and Zeppelin job submissions. Default capacity is 80, with a maximum capacity of 80.

• default: Queue used for all Hive and Spark Thrift job submissions. Default capacity is 20, with a maximum capacity of 20.

If preemption was set to On (enabled), the following queues are available by default:

• api: Queue used for all REST API job submissions. Default capacity is 50, with a maximum capacity of 100.

• interactive: Queue used for all Zeppelin job submissions. Default capacity is 40, with a maximum capacity of 100. To allocate more of the cluster's resources to Notebook, increase this queue's capacity.

• default: Queue used for all Hive and Spark Thrift job submissions. Default capacity is 10, with a maximum capacity of 100.


Available Cores

Available Memory

This information is displayed at the right-hand corner of the screen. It shows the available cores and memory for the cluster that can be allocated to the new job.

Big Data Cloud Console New Job: Driver File Page

You use the Big Data Cloud Console New Job: Driver File page to provide details about the job driver file and its main class, command line arguments, and any additional jars or supporting files for executing the job.

What You See in the Navigation Area

Element Description

< Previous Click to navigate to the Big Data Cloud Console New Job: Configuration page.

Cancel Click to cancel creating the new job.

Next > Click to navigate to the Big Data Cloud Console New Job: Confirmation page.

What You See in the Driver File Section

Element Description

File Path The path to the executable for the job.

Click Browse to select a file in HDFS or Cloud Storage, or to upload a file from your local file system. The file must have a .jar or .zip extension.

In the Browse HDFS window, you can also browse to and try some examples.

Main Class (for Spark and MapReduce jobs only)

The main class to run the job.

Arguments Any argument(s) that invoke the main class. You specify one argument per line.

Additional Py Modules (for Python Spark jobs only)

Any Python dependencies required for the application. You can specify more than one file.

Click Browse to select a file in HDFS or Cloud Storage, or to upload a file from your local file system (.py file only).

Additional Jars (optional) Any JAR dependencies required for the application, such as Spark libraries. You can specify more than one file.

Click Browse to select a file in HDFS or Cloud Storage, or to upload a file from your local file system (.jar or .zip file only).


Additional Support Files (optional) Any additional support files required for the application. You can specify more than one file.

Click Browse to select a file (.jar or .zip file only).

Big Data Cloud Console New Job: Confirmation Page

You use the Big Data Cloud Console New Job: Confirmation page to review all the details provided in the previous sections and create the job.

What You See in the Navigation Area

Element Description

< Previous Click to navigate to the Big Data Cloud Console New Job: Driver File page.

Cancel Click to cancel creating the new job.

Create > Click to create the job.

What You See in the Confirmation Page

The Big Data Cloud Console New Job: Confirmation page presents a summary list of all the choices you made on the preceding pages of the Big Data Cloud Console New Job wizard.

Click Create > to create the job and submit it for execution. Click < Previous to step back through the pages and change the value of any parameter before creating the job.

Big Data Cloud Console: Notebook Page

The Big Data Cloud Console: Notebook page displays all the example Notebooks in Big Data Cloud.

Topics

• What You Can Do from the Big Data Cloud Console: Notebook Page

• What You See on the Big Data Cloud Console: Notebook page

What You Can Do from the Big Data Cloud Console: Notebook Page

Use the Notebook page to perform tasks described in the following topics:

• View and Edit a Note

• Create a Note in a Notebook

• Import a Note

• Export a Note

• Delete a Note


• Organize Notes

What You See on the Big Data Cloud Console: Notebook page

Element Description

User menu providing access to the API Catalog for Big Data Cloud, Help for this page, and About information with details about the console.

Overview Click to go to the Overview page.

Jobs Click to go to the Jobs page.

Data Stores Click to go to the Data Stores page.

Status Click to see information about components and services running on Oracle Big Data Cloud clusters and nodes and their associated state.

Settings Click to manage resource queue configurations.

Click to refresh the page.

Click to view notes in a list mode.

Click to view notes in a table mode. You can use this view to organize notes into folders. See Organize Notes.

Click to create a new note.

Click to import a note from your local machine.

Status The status of the note.

Created On The date and time when the note was created.

Last Modified On The date and time when the note was last modified.

menu icon (for each note)

Menu that provides the following options:

• View Note — Click to view details about the note.

• Export Note — Click to download the note to your local machine.

• Delete Note — Click to delete the note.

Big Data Cloud Console: Data Stores Page

The Big Data Cloud Console: Data Stores page displays files and directories in the Hadoop Distributed File System and the Oracle Cloud Storage Container.

Topics

• What You Can Do from the Big Data Cloud Console: Data Stores Page


• What You See in the Navigation Area

• What You See in the HDFS Tab

• What You See in the Cloud Storage Tab

What You Can Do from the Big Data Cloud Console: Data Stores Page

Use the Big Data Cloud Console: Data Stores page to perform tasks described in the following topics:

• Browse Data

• Load Data Into Cloud Storage

• Upload Files Into HDFS

What You See in the Navigation Area

Element Description

User menu providing access to the API Catalog for Big Data Cloud, Help for this page, and About information with details about the console.

Overview Click to go to the Overview page.

Jobs Click to go to the Jobs page.

Notebook Click to go to the Notebook page.

Status Click to see information about components and services running on Oracle Big Data Cloud clusters and nodes and their associated state.

Settings Click to manage resource queue configurations.

HDFS Click to browse files and directories in the Hadoop Distributed File System (HDFS).

Cloud Storage Click to browse files and directories in the cloud storage container.

Load More... Click to show more files and directories.

What You See in the HDFS Tab

Element Description

Cloud Storage Click to browse files and directories in the Oracle Cloud Storage Container.

New Directory Click to add a new directory.

Upload Click to upload a file.

(HDFS utilization) How much space is being used and how much is available.

Click to refresh the page.

Name Name of the file or directory.

Size The size of the file or directory.


Owner The owner of the file or directory.

Group The group to which the owner belongs.

Permissions The permissions to act on the file or directory.

The basic rights are read, write, and execute. The first character indicates the type of file (d for directory, s for special file, and - for a regular file). The next three characters define the owner’s permission to the file. The following three characters define the permissions for the members of the same group as the file owner, and the last three characters define the permissions for all other users.
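For example, the permissions string for a directory readable by its group might be read as follows (a generic illustration of the notation, not output from any particular cluster):

# d rwx r-x ---
# | |   |   '-- all other users: no access
# | |   '------ group members:   read and execute
# | '---------- owner:           read, write, and execute
# '------------ type:            d = directory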

Last Modified The date and time at which the file or directory was last modified.

Block Size The default block size of the file or directory.

menu icon (for each file or directory)

Menu that provides the following options:

• Details — Click to see details for a file or directory.

• Delete — Click to delete a file or directory.

• Download — Click to download a file.

What You See in the Cloud Storage Tab

Element Description

HDFS Click to browse files and directories in the Hadoop Distributed File System.

Filter By Prefix Enter a prefix to filter the list of files and directories to include only those that contain the string in their name.

Container Name Oracle Cloud Storage Container.

Click to refresh the page.

Upload Click to browse for and upload files or directories to the Oracle Cloud Storage Container. The upload limit is 5 GB.

The file or directory is added to the list on the page after it’s uploaded.

Name Name of the file or directory.

Size The size of the file or directory.

Last Modified The date and time at which the file or directory was last modified.

menu icon (for each item)

Menu that provides the following options:

• Details — Click to see details for a file or directory, including the Swift URL.

• Delete — Click to delete a file.
• Download — Click to download a file.


Big Data Cloud Console: Status Page

The Big Data Cloud Console: Status page displays information about components and services running on Big Data Cloud clusters and nodes and their associated state.

Topics

• What You Can Do from the Big Data Cloud Console: Status Page

• What You See in the Navigation Area

• What You See in the Services Tab

• What You See in the Hosts Tab

What You Can Do from the Big Data Cloud Console: Status Page

Use the Status page to perform tasks described in the following topics:

• View Cluster Component Status

What You See in the Navigation Area

Element Description

User menu providing access to the API Catalog for Big Data Cloud, Help for this page, and About information with details about the console.

Overview Click to go to the Overview page.

Jobs Click to go to the Jobs page.

Notebook Click to go to the Notebook page.

Data Stores Click to go to the Data Stores page.

Settings Click to manage resource queue configurations.

Services Click to see a list of all components on the cluster.

Hosts Click to list the components by each host (node) in the cluster.

What You See in the Services Tab

Element Description

Hosts Click to list the components by each host (node) in the cluster. This view is useful to understand the topology of the cluster.

Filter Use the Filter box to filter as desired.

Click to refresh the page.


(Components on the cluster) List of components on the cluster and their status.

There are two possible states: INSTALLED and STARTED.

INSTALLED means a service is stopped and not running, and STARTED means a service is running. Some components on the cluster are never started and are only installed, for example client libraries such as HDFS_CLIENT.

Metrics list the number of components in each state and the total of the two.

What You See in the Hosts Tab

Element Description

Services Click to see a list of all components on the cluster.

Filter Use the Filter box to filter as desired.

Click to refresh the page.

(Components on each node in the cluster) List of components on each host (node) in the cluster and their status.

There are two possible states: INSTALLED and STARTED.

INSTALLED means a service is stopped and not running, and STARTED means a service is running. Some components on the cluster are never started and are only installed, for example client libraries such as HDFS_CLIENT.

Metrics show whether the service is in maintenance mode (false indicates it is not).

Big Data Cloud Console: Settings Page

The Big Data Cloud Console: Settings page displays information for managing resource queue configurations, restarting the notebook server, and managing interpreter settings for the Big Data Cloud cluster.

Topics

• What You Can Do from the Big Data Cloud Console: Settings Page

• What You See in the Navigation Area

• What You See in the Queues Region

• What You See in the Credentials Region

• What You See in the JDBC URLs Region

• What You See in the Notebook Region


What You Can Do from the Big Data Cloud Console: Settings Page

Use the Settings page to perform tasks described in the following topics:

• Create Work Queues

• Manage Work Queue Capacity

• Manage Notebook Settings

• Update Cloud Storage Credentials

• Use the Cluster Credential Store

What You See in the Navigation Area

Element Description

User menu providing access to the API Catalog for Big Data Cloud, Help for this page, and About information with details about the console.

Overview Click to go to the Overview page.

Jobs Click to go to the Jobs page.

Notebook Click to go to the Notebook page.

Data Stores Click to go to the Data Stores page.

Status Click to see information about components and services running on Oracle Big Data Cloud clusters and nodes and their associated state.

What You See in the Queues Region

Element Description

Click to create a new Queue.

Preemption Displays whether Preemption is enabled or disabled for queues.

• Preemption Off: Indicates that jobs can't consume more resources than a specific queue allows. This will lead to potentially lower cluster utilization.

• Preemption On: Indicates that jobs can consume more resources than a queue allows, but could lose those resources when another job comes in that has priority for those resources.

If preemption is on, jobs submitted to a particular queue do not have to wait because jobs of some other queue have taken up the available cluster capacity. If preemption is on, and if the cluster is unused, then jobs from any queue can utilize 100% of the cluster capacity. This will lead to better cluster utilization.


Queue Name The name used to identify the resource queue.

Capacity (%) The guaranteed capacity of each queue, which specifies the percentage of cluster resources that are available for applications submitted to the queue.

Max Capacity (%) The maximum capacity of the queue. Max Capacity is automatically set to 100% when preemption is enabled. Otherwise it is the same as Capacity.

Total Queues Capacity The total capacity of all queues put together.

What You See in the Credentials Region

Element Description

Cloud Storage Allows you to reset the Oracle Cloud Infrastructure Object Storage Classic password associated with the cluster.

This password is associated with the account used during provisioning of the cluster. It should only be changed if the Oracle Cloud Infrastructure Object Storage Classic password has been changed outside of the Big Data Cloud service.

• Password: Enter the new Oracle Cloud Infrastructure Object Storage Classic password that you want to save for this cluster.

• Confirm Password: Re-enter the new Oracle Cloud Infrastructure Object Storage Classic password that you want to save for this cluster.

• Save: Click to save the new password.

System Credentials System Credentials are secrets required by the platform (including the Oracle Cloud Infrastructure Object Storage Classic password) which are updated through the system credential User Interface and/or the REST API.

Any credentials held within the system credential store can be updated, but they cannot be deleted.

Credentials stored in the System and User Credential store are accessible to applications during runtime through the Apache Hadoop org.apache.hadoop.conf.Configuration.getPassword() API.


User Credentials For a cluster, User Credentials allows you to store user- or application-specific credentials in the credential store.

• (refresh icon): Click to reload the user credentials.
• (add icon): Click to add a new credential.
• Key: A unique identifier for the credential.
• Value: The new user or application password that you want to save for the cluster.
• Save: Click to save the new password.
• (delete icon): Click to delete the credential.


What You See in the JDBC URLs Region

Element Description

Spark Thrift URL URL for the Spark Thrift Server. The URLs differ depending on whether a cluster uses Basic authentication or uses IDCS for authentication. For information about cluster authentication, see Use Identity Cloud Service for Cluster Authentication.

Basic authentication cluster

jdbc:hive2://ip_address:1080/default;ssl=true;sslTrustStore=path_to_truststore;trustStorePassword=truststore_password?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice

Where:

• ip_address is the IP address of the desired endpoint

• path_to_truststore is the absolute path to the Java Trust Store that holds the certificate

• truststore_password is the password used with the trust store

IDCS-enabled cluster

jdbc:hive2://cluster_name-load_balancing_server_URI/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice

Where:

• cluster_name is the name of the cluster

• load_balancing_server_URI is the URI assigned to the cluster by the load balancing service

For more information about Thrift access, see About Accessing Thrift.


Hive Thrift URL URL for the Hive Thrift Server. The URL is different depending on whether you have a Basic authentication cluster or an IDCS-enabled cluster.

Basic authentication cluster

jdbc:hive2://ip_address:1080/default;ssl=true;sslTrustStore=path_to_truststore;trustStorePassword=truststore_password?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hs2service

Where:

• ip_address is the IP address of the desired endpoint

• path_to_truststore is the absolute path to the Java Trust Store that holds the certificate

• truststore_password is the password used with the trust store

IDCS-enabled cluster

jdbc:hive2://cluster_name-load_balancing_server_URI/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hs2service

Where:

• cluster_name is the name of the cluster

• load_balancing_server_URI is the URI assigned to the cluster by the load balancing service

For more information about Thrift access, see About Accessing Thrift.

What You See in the Notebook Region

Element Description

Click to restart the notebook server when the data is cached.

Apart from restarting Notebooks, you can also manage interpreter settings for Notebooks from this region.


Customize Clusters

Advanced users can use the cluster bootstrap script to customize Oracle Big Data Cloud clusters. This capability enables you to install binaries, load data into HDFS, and perform many other actions that can be performed through script execution.

Note:

Bootstrap is a feature that enables you to install any third-party library on Big Data Cloud. Some scripts are included by default (such as RStudio), but these scripts serve only to illustrate the capabilities of the bootstrap script. It is your responsibility as the user of the service to ensure you have the proper licensing for any third-party software you might install on Big Data Cloud instances, either through the bootstrap script or through other means.

Topics

• About the Cluster Bootstrap Script

• Bootstrap Script Execution and Logging

• Sample Bootstrap Script

• Big Data Cloud Convenience Functions

About the Cluster Bootstrap Script

To use the cluster bootstrap script, you must first write a shell script that contains your customizations and then upload the script to a location in Oracle Cloud Infrastructure Object Storage Classic. The script is executed as the last step in cluster provisioning.

The Big Data Cloud cluster looks for the script in two well-known locations in Oracle Cloud Infrastructure Object Storage Classic:

• Location 1: swift://default_container.default/bdcsce/bootstrap/cluster_name/bootstrap.sh

• Location 2: swift://default_container.default/bdcsce/bootstrap/bootstrap.sh

where default_container is the container associated as the default container for the cluster when the cluster was created, and cluster_name is the name of the cluster that was provided when the cluster was created.

The logic used to determine which script gets executed is as follows:

• If a script is found in Location 1, the script gets executed.

• If a script is not found in Location 1, Big Data Cloud attempts to load and execute a script from Location 2.

• If a script is present in both locations, only the script in Location 1 gets executed.


The execution logic enables you to have a cluster-specific bootstrap script as well as a generic bootstrap script that applies to all provisioned clusters. The only requirement is that all of the clusters use the same storage container, which is specified during cluster provisioning.
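As an illustration, a cluster-specific script could be uploaded to Location 1 with the Swift REST API. The endpoint, account, container, cluster name, and token below are hypothetical placeholders, not values defined by this document.

# Upload a bootstrap script to Location 1 via a Swift object PUT.
# Endpoint, account, container, cluster name, and token are placeholders.
SWIFT_ENDPOINT="https://example.storage.oraclecloud.com/v1/Storage-myaccount"
curl -X PUT \
    -H "X-Auth-Token: ${AUTH_TOKEN}" \
    -T bootstrap.sh \
    "${SWIFT_ENDPOINT}/mycontainer/bdcsce/bootstrap/mycluster/bootstrap.sh"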

Bootstrap Script Execution and Logging

This section describes cluster bootstrap script execution and logging.

Script Execution

The following list describes the semantics under which the cluster bootstrap script is executed:

• The bootstrap script is executed as the root user.

• The script is executed as the last step in the cluster provisioning process. Because the bootstrap script is executed asynchronously, script execution may continue after the cluster has been made available to accept processing requests.

• If a bootstrap script is present upon a scale-out life cycle event, the bootstrap script will be executed on the scaled-out node. This is an important consideration to be aware of since script actions may need to be made idempotent or controlled through semaphores or other script logic. It is the bootstrap script author’s responsibility to control which cluster node types the script (or portions of the script) is executed on. See Sample Bootstrap Script for examples of how script logic can be controlled; a minimal idempotency guard is also sketched after this list.

• The bootstrap script does not execute if the overall cluster provisioning process fails.

• You can determine if the bootstrap script has completed by looking at logs in the object store or by manually peeking into the cluster.
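For example, work that must run only once per node could be guarded with a marker file. This is a minimal sketch under the assumption that the marker path below is writable; it is not a pattern mandated by the service.

# Skip one-time customizations that were already applied on this node
# (relevant on scale-out, when the script runs again on new nodes).
MARKER=/u01/bdcsce/data/.bootstrap_customizations_done   # hypothetical path
if [ -f "${MARKER}" ]; then
    echo "Customizations already applied on $(hostname -f); nothing to do."
else
    # ... perform one-time customizations here ...
    touch "${MARKER}"
fi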

Script Logging

The cluster bootstrap script logs standard error and standard out to the following location:

/u01/bdcsce/data/var/log/bdcsce_bootstrap.host_name.timestamp

where host_name is the name of the host the script was executed on and timestamp is the timestamp of when the script was executed.

Bootstrap script execution logs are also copied to the following location in Oracle Cloud Infrastructure Object Storage Classic after execution and can be retrieved for debugging and analysis using the Swift REST API:

swift://default_container/bdcsce/bootstrap/cluster_name/bdcsce_bootstrap.host_name.timestamp

where default_container is the container associated as the default container for the cluster when the cluster was created, and cluster_name is the name of the cluster that was provided when the cluster was created.
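For example, a log could be fetched for analysis with a Swift object GET. As above, the endpoint, account, and object names are hypothetical placeholders.

# Download a bootstrap execution log from the object store.
# Endpoint, account, container, cluster, host, and timestamp are placeholders.
SWIFT_ENDPOINT="https://example.storage.oraclecloud.com/v1/Storage-myaccount"
curl -H "X-Auth-Token: ${AUTH_TOKEN}" \
    -o bootstrap.log \
    "${SWIFT_ENDPOINT}/mycontainer/bdcsce/bootstrap/mycluster/bdcsce_bootstrap.node1.example.com.2019-09-01-1200"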


Sample Bootstrap Script

The following sample cluster bootstrap script shows how to control execution on various cluster node types. A common use is to execute specific actions on specific node types.

#!/bin/sh
#
# Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
#
# Example BDCS-CE bootstrap script.

executeOnAmbariNodes() {
    echo "executeOnAmbariNodes ...";
}

executeOnSparkThriftServerNodes() {
    echo "executeOnSparkThriftServerNodes ...";
}

executeOnHive2ServerNodes() {
    echo "executeOnHive2ServerNodes ...";
}

executeOnComputeAndStorageSlaveNodes() {
    echo "executeOnComputeAndStorageSlaveNodes ...";
}

executeOnComputeOnlySlaveNodes() {
    echo "executeOnComputeOnlySlaveNodes ...";
}

executeOnAllNodes() {
    echo "executeOnAllNodes ...";
}

executeOnMasters() {
    echo "executeOnMasters ...";
}

echo 'Hello Bootstrap'
echo 'Object-store-url:=' $(getBaseObjectStoreUrl);
echo 'Cluster-name:=' $(getClusterName);
echo 'Masters:=' $(getMasterNodes);
echo 'ComputeOnlySlaveNodes:=' $(getComputeOnlySlaveNodes);
echo 'ComputeAndStorageSlaveNodes:=' $(getComputeAndStorageSlaveNodes);
echo 'getAllNodes:=' $(getAllNodes);
echo 'getAmbariServerNodes:=' $(getAmbariServerNodes);
echo 'getSparkThriftServerNodes:=' $(getSparkThriftServerNodes);
echo 'getHive2ServerNodes:=' $(getHive2ServerNodes);

_HOSTNAME=$(hostname -f)

for i in $(getAmbariServerNodes); do
    if [ ${_HOSTNAME} = $i ]; then
        executeOnAmbariNodes;
    fi
done

for i in $(getSparkThriftServerNodes); do
    if [ ${_HOSTNAME} = $i ]; then
        executeOnSparkThriftServerNodes;
    fi
done

for i in $(getHive2ServerNodes); do
    if [ ${_HOSTNAME} = $i ]; then
        executeOnHive2ServerNodes;
    fi
done

for i in $(getMasterNodes); do
    if [ ${_HOSTNAME} = $i ]; then
        executeOnMasters;
    fi
done

for i in $(getComputeAndStorageSlaveNodes); do
    if [ ${_HOSTNAME} = $i ]; then
        executeOnComputeAndStorageSlaveNodes;
    fi
done

for i in $(getComputeOnlySlaveNodes); do
    if [ ${_HOSTNAME} = $i ]; then
        executeOnComputeOnlySlaveNodes;
    fi
done

for i in $(getAllNodes); do
    if [ ${_HOSTNAME} = $i ]; then
        executeOnAllNodes;
    fi
done

### No exits please!!

Big Data Cloud Convenience Functions

Big Data Cloud scripts come with helper functions for convenience. The following functions are available.

#
# hdfsCopy
# Copies data from and to Object store to HDFS or local file system
# Usage:
#   hdfsCopy swift://container1.default/logs/one.log hdfs:///tmp/
#   OR
#   hdfsCopy hdfs:///tmp/one.log swift://container1.default/logs/
#   OR
#   hdfsCopy swift://container1.default/logs/one.log file:///tmp/
#
# hdfsStat
# Checks if the file is present
# eg:
#   hdfsStat hdfs:///user
#   OR
#   hdfsStat swift://container1.default/bdscsce/
#
# hdfsMkdir
# Creates a directory on hdfs or object store
# eg:
#   hdfsMkdir hdfs:///tmp/trial
#   OR
#   hdfsMkdir swift://container1.default/bdscsce/trial
#
# getDefaultContainer
# returns the default container that is registered with the cluster
# Usage:
#   default_container=$(getDefaultContainer)
#
# getBaseObjectStoreUrl
# returns a URL that is pointing to the default object store container
# does not have a trailing '/' to make it easy to append
# Usage:
#   objectStoreURL=$(getBaseObjectStoreUrl)
#   OR
#   objectStoreURL=$(getBaseObjectStoreUrl)/bdcsce/logs/one.log
#
# getClusterName
# returns the cluster name
# Usage:
#   cluster_name=$(getClusterName)
#
# getMasterNodes
# returns a space-separated list of master nodes
# The following for loop can be used to print all master nodes
#   for i in $(getMasterNodes); do echo $i; done
# Usage:
#   master_nodes=$(getMasterNodes)
#
# getComputeOnlySlaveNodes
# returns a space-separated list of compute-only slave nodes
# Usage:
#   compute_only_slaves=$(getComputeOnlySlaveNodes)
#
# getComputeAndStorageSlaveNodes
# returns a space-separated list of compute and storage slave nodes
#
# getAllNodes
# returns a space-separated list of all nodes
#
# getAmbariServerNodes
# returns a space-separated list of all Ambari server nodes
#
# getSparkThriftServerNodes
# returns a space-separated list of all Spark Thrift Server nodes
#
# getHive2ServerNodes
# returns a space-separated list of all Hive2 server nodes
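Putting a few of these together, a bootstrap fragment could stage seed data into HDFS from the default container on the first master node only. This is a sketch that assumes a bdcsce/seed-data/input.csv object actually exists in your container; the object and HDFS paths are illustrative.

# Copy seed data from the default object store container into HDFS,
# on the first master node only. Object and HDFS paths are illustrative.
_HOSTNAME=$(hostname -f)
_FIRST_MASTER=$(getMasterNodes | awk '{print $1}')

if [ "${_HOSTNAME}" = "${_FIRST_MASTER}" ]; then
    hdfsMkdir hdfs:///data/input
    hdfsCopy "$(getBaseObjectStoreUrl)/bdcsce/seed-data/input.csv" hdfs:///data/input/
fi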
