
REALIZING THE FULL POTENTIAL OF YOUR CLOUD INVESTMENT

How a dataware approach makes your cloud strategy more effective

EXECUTIVE SUMMARY

In today’s competitive marketplace, businesses are increasingly relying on data to provide competitive advantage, reduce expenses, increase productivity, and deliver high-quality services. With recent advances in cloud computing, many businesses are embarking on a journey to the cloud to support their big data analytics and artificial intelligence (AI) strategies. The cloud journey is seen as part of this overall strategy.

MapR believes that customers can, and should, exploit the economics of cloud infrastructure where it makes sense but also should understand how a dataware platform can significantly help with four key issues:

• Radically speed up application development compared to using only cloud vendors’ tools and services, which must be integrated continually

• Provide seamless on-premises to in-cloud data and app portability

• Enable multi-cloud deployment globally that includes edge deployments

• Provide uniform security and governance

Cloud vendors provide a lot of flexibility and options for organizations, but with this flexibility come challenges for data administration, protection, and control. A data platform that gives an organization a uniform, consistent layer for controlling data and driving innovation becomes a strategic layer: it supports innovation and flexibility without losing control of costs and security. Customers increasingly see such a data platform as a new layer in the enterprise IT stack and refer to it as dataware.

WHITE PAPER

A dataware layer delivers the following key benefits while providing uniform security, protection, and control of the data:

• Improved Integration — A given cloud provider has a myriad of data-specific services with proprietary APIs and interfaces. This means the customer must manage, among other things, different security models, different ways to implement HA, or different methods to move data between services or hot and cold storage. A data platform eliminates the exercise of integrating separate services, leading to less expense and increased speed in application development and ongoing administration.

• Greater Portability and Flexibility — One of the first hurdles enterprises face as they move applications to the cloud is the lack of standard APIs to read and write data, along with a general lack of standardization in how data is stored, secured, and moved between cloud providers or between a cloud provider and an on-premises data center. This means that legacy applications must be rewritten to work in the cloud, which can be a costly and time-consuming effort. It also means that the enterprise must provide the software that manages any data that moves between a private and public cloud. A uniform dataware layer supports a host of standard APIs that can all update the same underlying data, providing portability and making it easy and fast to move applications from on-premises to cloud, thereby avoiding lengthy data duplication and complex data dependencies that increase costs. The support of multiple industry-standard APIs also eliminates vendor lock-in.

• Support for Multi-Cloud — If an enterprise wants to take advantage of a multi-cloud strategy to protect against service outages, optimize costs, or better meet government compliance, the underlying data presents a huge obstacle. A dataware layer handles the underlying data replication, security, and consistency to support interoperability across cloud providers. This layer addresses what would otherwise be an ongoing effort and expense.

• Improved Cost Controls — There is widespread belief that moving to the public cloud lowers your IT costs. This might be true when you’re getting started and experimenting with cloud services, but persistent data introduces additional cost dimensions, particularly because cost is driven not just by the size of the data but also by how frequently the data is moved. A dataware layer provides greater visibility and control, and it also reduces data movement. Instead of wholesale file copying and movement, only changed data needs to be synchronized across locations, providing a far more economical method to optimize bandwidth and storage costs.
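The idea of synchronizing only changed data, rather than copying whole files, can be illustrated with a minimal sketch. It is not how the MapR Data Platform implements synchronization; it simply compares fixed-size blocks by hash, and the function names and the tiny block size are assumptions chosen for readability.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indexes of blocks in `new` that differ from, or are missing in, `old`."""
    old_digests = block_digests(old, block_size)
    new_digests = block_digests(new, block_size)
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or digest != old_digests[i]
    ]


# Hypothetical data: the remote copy is stale, the local copy has one edit and one append.
old_copy = b"AAAABBBBCCCCDDDD"
new_copy = b"AAAAXXXXCCCCDDDDEEEE"

to_send = changed_blocks(old_copy, new_copy)
bytes_sent = sum(len(new_copy[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]) for i in to_send)
print(f"blocks to send: {to_send}, bytes sent: {bytes_sent} of {len(new_copy)}")
```

In this toy case only two of the five blocks cross the network, which is the effect on bandwidth that the cost argument above relies on.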

MAPR DATA PLATFORM: CLOUD, CONTAINER, AND EDGE NATIVE

MapR provides greater control and flexibility with its advanced dataware solution. Utilizing the MapR Data Platform within a cloud deployment can reduce total costs, eliminate future rework, and improve agility and time-to-market. Additionally, organizations see substantial improvements in SLAs as well as data scientist and end-user productivity when leveraging MapR with the public cloud. Finally, hybrid and multi-cloud strategies are enabled with MapR, ensuring no cloud provider lock-in and optimizing agility across the evolving mix of on-premises and available cloud options.

The MapR Data Platform delivers dataware[1] for AI and analytics, effectively handling the diversity of data types, data access, and ecosystem tools needed to manage data as an enterprise resource, regardless of the underlying infrastructure or location.


With the MapR Data Platform, users can store, manage, process, and analyze all data – including files, tables, and streams from operational, historical, and real-time data sources – with mission-critical reliability to meet production SLAs. MapR offers a core set of data services to ensure exabyte scale and high performance while providing unmatched data protection, disaster recovery, security, and management services. Open APIs and support for containerization ensure broad, distributed application access and seamless application portability. The MapR Data Platform runs on commodity hardware across on-premises, cloud, and edge deployments.

MapR includes a wide variety of analytics and open source tools, such as Apache Hadoop, Apache Spark, Apache Drill, and Apache Hive. Because MapR supports POSIX, cutting-edge AI and ML tools such as newer Python ML libraries can run natively on the same cluster as other analytics and leverage the power of the MapR Data Platform.

With the MapR Data Platform, you can run your AI and analytics workloads seamlessly across on-premises, cloud, and edge deployments.

MAPR DATA PLATFORM: INTEGRATED SERVICES

One of the attributes that make the big cloud platform providers so attractive is the extent of the services they provide and the flexibility in using those services. With that flexibility comes complexity; the consumer must take on the responsibility and cost of integrating the various services. This becomes even more burdensome if the consumer wants to split their workloads between on-premises and cloud, or inter-cloud, since the interfaces and APIs between these platforms are seldom consistent.

With public cloud providers, you have to stitch together different services. The end result is:

• Administrative overhead

• Security pitfalls

• Learning curve for developers to develop against new APIs

The MapR Data Platform includes a distributed file and object store, a distributed document database, and a global pub-sub messaging and event store for Apache Kafka. These data services are integrated into the core platform and run on the IaaS offerings of all popular public cloud vendors.

MapR is available in marketplaces for Azure and AWS, including a pay-by-hour option for AWS.

Cloud-Native Operations — MapR also boasts cloud-native operations: a cloud-aware provisioning and management tool that deploys underlying cloud infrastructure along with MapR software. It supports the ability to scale up the cluster with a single click and turn off a cluster that isn’t currently in use. And it uniquely offers full customization of cloud deployments, allowing organizations to use any existing or new cloud-specific features and the cloud deployment tooling of their choice. This means that MapR provides tools to quickly deploy and scale up your cloud deployment, taking advantage of the cloud provider platform in a uniform and consistent way, no matter which cloud provider or providers you use.

[1] For more on dataware, download the executive perspective at: https://mapr.com/datasheets/dataware-mapr-perspective/

Security — Because data must be shared between nodes on the cluster, data transmissions between nodes and from the cluster to the client are vulnerable to interception. Networked computers are also vulnerable to attack when an intruder successfully impersonates an authorized user and then acts improperly as that user. Additionally, networked machines could share the security vulnerabilities of a single node.

The MapR approach is to build security directly into the platform and to enable security by default. Designed with security out-of-the-box, the platform enables the ability to automatically apply security protection directly as data comes into and out of the platform without requiring an external security management server or a special security plugin for each compute engine.

Authentication — With MapR, user-to-service and service-to-service communications must be authenticated. MapR supports both Kerberos and MapR native security; the latter allows organizations to tie into the username/password registry of their choice.

Authorization — All access to data stored in MapR has access control checks. MapR supports both POSIX mode bits and more advanced MapR Access Control Expressions (ACEs) to protect data. ACEs are Boolean logic expressions that use the standard Boolean operators – AND, OR, NOT – to express an access constraint. In this way, you have an unparalleled level of expressiveness when assigning permissions to data.
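To make the idea of a Boolean access expression concrete, here is a minimal sketch of an evaluator. It is an illustration only, not MapR's parser: the `u:` and `g:` prefixes for users and groups, the `&`, `|`, and `!` operator spellings, and all user and group names are assumptions made for this example.

```python
import re

# Tokens: parentheses, the three Boolean operators, and u:<user> / g:<group> terms.
TOKEN = re.compile(r"\s*([()&|!]|[ug]:\w+)")


def tokenize(expr: str) -> list[str]:
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}: {expr[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expr: str, user: str, groups: set[str]) -> bool:
    """Evaluate an access expression for one user. Precedence: ! over & over |."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of expression")
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side so tokens are consumed
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        kind, _, name = token.partition(":")
        if kind == "u":
            return name == user
        if kind == "g":
            return name in groups
        raise ValueError(f"unexpected token {token!r}")

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


# Hypothetical policy: analysts or carol may read, but mallory never may.
ace = "(g:analysts | u:carol) & !u:mallory"
print(evaluate(ace, "alice", {"analysts"}))    # member of an allowed group
print(evaluate(ace, "mallory", {"analysts"}))  # explicitly excluded
```

A single expression like this can combine allow and deny conditions that would otherwise need several POSIX-style permission settings.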

For proper multi-tenant isolation, MapR controls access at the volume level using ACEs as well. This makes it easy for a system administrator to ensure that all data in a volume (files, tables, and streams) is accessed or modified only by a specific set of users, regardless of what the individual file and directory permissions may say. When building a multi-tenant environment, where tenant access isolation is crucial, this capability is essential.

Encryption — By default, all network traffic in MapR is encrypted via AES256/GCM and via the Secure Sockets Layer/Transport Layer Security (SSL/TLS) protocol that secures HTTPS traffic, with support for TLS 1.2. In addition, MapR offers an option to encrypt data at rest.

Auditing — MapR includes robust, high-performance auditing built directly into the product without any complex add-ons. When auditing is enabled, all data access (file, directory, table, stream) generates audit records to a Kafka API-based pub/sub system (MapR Event Store for Apache Kafka), supporting real-time processing of audit data. Auditing introduces low overhead because records are coalesced in memory, with duplicates automatically suppressed within a configurable interval before writing to disk. Auditing is also highly configurable and can be enabled on a per-volume or per-file basis. Finally, all administrative operations against the storage system generate audit records, ensuring that administrative operations can be monitored appropriately.
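The coalescing behavior described above, where duplicate records inside a configurable interval are suppressed, might look roughly like the following sketch. It is not the product's implementation; the `AuditEvent` fields, the definition of a duplicate as the same user, operation, and path, and the 60-second interval are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    timestamp: float  # seconds since some reference point
    user: str
    operation: str
    path: str


def coalesce(events, interval_seconds: float):
    """Keep the first event per (user, operation, path) in each interval; drop repeats."""
    last_emitted = {}
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.user, event.operation, event.path)
        previous = last_emitted.get(key)
        if previous is None or event.timestamp - previous >= interval_seconds:
            kept.append(event)
            last_emitted[key] = event.timestamp
    return kept


# Hypothetical burst: alice reads the same file repeatedly, bob reads it once.
events = [
    AuditEvent(0, "alice", "read", "/data/a.csv"),
    AuditEvent(1, "alice", "read", "/data/a.csv"),
    AuditEvent(1, "bob", "read", "/data/a.csv"),
    AuditEvent(2, "alice", "read", "/data/a.csv"),
    AuditEvent(61, "alice", "read", "/data/a.csv"),
]
records = coalesce(events, interval_seconds=60)
print(f"{len(events)} events coalesced into {len(records)} audit records")
```

A tight read loop then costs one audit record per interval instead of one per call, which is where the low overhead comes from.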

MAPR DATA PLATFORM: POWERING DATA AND APPLICATION PORTABILITY

One of the most significant advantages of the MapR Data Platform is its support for multiple open APIs. MapR supports NFS, POSIX, S3, HDFS, REST, HBase, JSON, Kafka, SQL, and more. The average enterprise has dozens or more existing applications running on-premises. These applications are typically written against open APIs, so making them work with cloud-specific services and APIs is not a simple matter of moving existing code to your cloud of choice. With MapR, you would not be forced to spend time and money rewriting your existing applications – assuming they were written against open APIs in the first place – to work against cloud-specific services. Applications written against these supported APIs can continue to work “as is” with MapR in the cloud. Along with S3 support, you have the most flexible option in terms of providing multiple access protocols to your data. For example, a cloud-centric application can collect and write data via S3, while a legacy batch program can access that same data via NFS.

MapR Direct Access NFS offers usability and interoperability advantages and makes data and all of its related tools radically easier and less expensive to use. MapR allows files to be modified and overwritten at high speeds in real time from remote servers via an NFS mount and enables multiple concurrent reads and writes on any file. From an application standpoint, data accessed in MapR over NFS works identically to accessing data on a local drive.
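Because NFS-mounted data behaves like a local drive, application code needs nothing beyond ordinary file I/O. The sketch below uses only the Python standard library; a temporary directory stands in for the mount point so that it runs anywhere. On a real deployment `data_root` would point at wherever the cluster is mounted, and the directory layout and sensor name are hypothetical.

```python
import tempfile
from pathlib import Path


def append_reading(data_root: Path, sensor: str, value: float) -> Path:
    """Append one reading using plain file I/O; no cluster-specific client library."""
    target = data_root / "sensors" / f"{sensor}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as handle:
        handle.write(f"{value}\n")
    return target


def read_readings(data_root: Path, sensor: str) -> list[float]:
    """Read the readings back through the same file path."""
    target = data_root / "sensors" / f"{sensor}.csv"
    with target.open(encoding="utf-8") as handle:
        return [float(line) for line in handle if line.strip()]


# The temporary directory is a stand-in for the NFS mount point (an assumption).
with tempfile.TemporaryDirectory() as mount_point:
    root = Path(mount_point)
    append_reading(root, "pump-7", 20.5)
    append_reading(root, "pump-7", 21.0)
    readings = read_readings(root, "pump-7")

print(readings)
```

The same functions work unchanged whether `data_root` is a local directory or a mounted cluster path, which is the portability point being made.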

MapR POSIX Client is an add-on product that provides seamless data access to MapR from remote nodes, just like with NFS. It gives you the added benefits of authentication, encrypted transmission, compressed transmission, and parallelized communications. Application servers, web servers, and other applications and systems can read and write directly and securely to a MapR cluster with significantly faster throughput.

SUPPORT FOR MICROSERVICES AND STATEFUL CONTAINER APPLICATIONS ACROSS HYBRID ENVIRONMENTS

Microservices represent an important application architecture in big data analytics and AI applications today because they offer tremendous agility. They are relatively simple, single-purpose applications that work in unison via lightweight communications, such as pub-sub messaging. Therefore, they are much easier to build, integrate, and coordinate, relative to traditionally large monolithic applications, and can be reused as different use cases and solutions require.


Figure: MapR Trust Model

Flexible Authentication

• Ticket-based authentication for all services in the cluster

• Integration with LDAP, Active Directory, and other third-party directory services

• Kerberos or username/password authentication

Granular Authorization

• Access Control Expressions (ACEs)

• Protect files, tables, and streams

• Volume-level protection available

Robust Auditing

• MapR Event Store for Apache Kafka

• Logs include data access and administrative actions

• Ad hoc queries and custom reports on audit logs via SQL and standard BI tools

Ubiquitous Data Protection

• Encryption for data in motion: within a cluster, between clusters, and between client and cluster

• Encryption for data at rest: Data at Rest Encryption, LUKS

• NSA-level cryptographic algorithms


Containers are an ideal deployment mechanism for microservices, facilitating the seamless transfer of development from one environment or platform to another. Containers have quickly become a developer’s preferred platform because of their new and vastly improved application development environment. Kubernetes has become the de facto standard to orchestrate large numbers of containers in production, ensuring that each has the resources it needs and providing for things like health monitoring, restarting, and load balancing.

While the majority of current enterprise container applications are still stateless in nature, stateful container-based applications dominate the needs of complex applications. As enterprises adopt containers and Kubernetes, they need reliable storage to support these applications. The problem is that storing data inside containers themselves makes them heavy and prone to data loss, should the container fail or go away.

MapR solves the stateful container challenge by integrating with the Kubernetes storage plugin, providing persistent storage volumes for access to data located across hybrid cloud deployments – on-premises, across one or more clouds, and at the edge. With MapR and Kubernetes technologies, stateful applications can easily be deployed in containers for production use cases and machine learning pipelines.
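In Kubernetes, a stateful application typically asks for storage that outlives its container through a PersistentVolumeClaim. The sketch below builds such a claim as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The claim name, size, and especially the storage class name are hypothetical placeholders; the actual class name depends on how the storage driver is installed in a given cluster.

```python
import json


def persistent_volume_claim(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim manifest for storage shared by several pods."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets multiple pods mount the same volume read-write.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# "example-dataware-class" is a made-up storage class name, not a real default.
claim = persistent_volume_claim("training-data", "example-dataware-class", "100Gi")
manifest = json.dumps(claim, indent=2)
print(manifest)
```

Because the data lives in the claimed volume rather than in the container's writable layer, a pod can fail or be rescheduled without losing state.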

[Figure: Kubernetes handles scheduling and scaling for application pods (for example, classic ETL and image classification using TensorFlow in a Docker container). The pods reach data through the MapR Kubernetes Volume Driver, with global data management spanning edge, on-premises, private cloud, and public clouds. A second panel compares Microservices A, B, and C connected by streams, with MapR and without MapR.]


SUPPORT FOR ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING ACROSS HYBRID ENVIRONMENTS

Containers are also emerging as effective environments for running data science workloads. There are several reasons.

First, for data scientists, it’s really important that their work is reproducible. Reproducibility facilitates peer review, ensuring the model or analysis you build can run without friction and withstand the test of time. When you wrap everything – all dependencies such as operating systems, compilers, drivers, configuration files, or other data required for your code to run successfully – in a container, it reduces the burden on others of recreating your environment and makes your work more accessible.

Portability is another reason for doing data science work in container environments. In machine learning, being able to rapidly change your compute environment can significantly improve your productivity. Leveraging resources such as GPUs or cloud-compute resources quickly can be a huge competitive advantage when it comes to training machine learning models. And the best models, once found, might need to be deployed to a mix of locations such as on-premises or at the edge – not just cloud.

MapR Data Science Refinery

The MapR Data Science Refinery provides data scientists an easy way to access and analyze all data in-place and to collaborate, build, and deploy machine learning models on the MapR Data Platform. Using a developer-friendly notebook and a wide range of open source data science tools that integrate directly with the MapR Data Platform, the MapR Data Science Refinery is easy to deploy, using a secure, persistent, and extensible container that can be distributed to many data science teams across multi-tenant environments.


The integrated Apache Zeppelin data science notebook provides a broad range of open source tooling, compute and query engines, and libraries for exploration, collaboration, and visualization. Available as a Docker image on Docker Hub, it includes all the necessary bits required to leverage the MapR Data Platform as a persistent data store for your containerized applications.

The MapR FUSE-based POSIX Client is used to allow app servers, web servers, and other client nodes and apps to read and write data directly and securely to the MapR Data Platform. In addition, Apache Spark connectors can be used to interact with MapR Database and MapR Event Store for Apache Kafka.

MapR Data Science Refinery also provides the ability to work across many engines in one visual space:

• Distributed compute and ML programming with Apache Spark and Python

• Batch and interactive SQL with Apache Hive and Drill

• Scripting support for Apache Pig

• Shell access to MapR Distributed File and Object Store

• Programmatic access to MapR’s built-in NoSQL document database management system (MapR Database) and event streaming system (MapR Event Store for Apache Kafka)

MAPR DATA PLATFORM: INTER-CLOUD MADE SIMPLE

MapR makes it easy to manage and process data across hybrid and multi-cloud environments with capabilities including mirroring, replication, data tiering, and a global namespace.

Mirroring — Provides secure, network-optimized, point-in-time consistent data replication of files between MapR clusters, whether intra-cloud, inter-cloud, or between on-premises and cloud.

Replication — Provides real-time synchronization of tables and streaming messages between MapR clusters. Replication enables active/active data assets with application failover.

[Figure: Applications connect through an API connector and open APIs to a global data management layer spanning on-premises, private cloud, public clouds, and edge.]


Data Tiering

An important aspect of a modern data platform is its ability to use a combination of storage solutions to help organizations lower costs, increase performance, and manage ever-increasing data growth. It is possible to design effective data tiering without locking into expensive appliances.

With MapR, you can organize data into the data tier of your choice by specifying simple rules and schedules. MapR Automated Storage Tiering (MAST) handles all of the data movement between the tiers. It ingests file data into the “hot” (performance) tier, typically dedicated for mission-critical applications that have very high I/O performance needs. Then, depending on customer-defined rules and schedules, the file data is offloaded to the “warm” (capacity) or “cold” (archive) tier. Data in the MapR performance tier is highly available and resilient for faster, reliable access. MapR Erasure Coding for the warm tier offers cost-optimized capacity with data protection and the ability to handle multiple failures.
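Rule-driven placement of this kind can be sketched in a few lines. The example below is not the MAST rule syntax; it assumes a hypothetical policy based only on how many days a file has been idle, with made-up thresholds and file names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TieringRule:
    min_idle_days: int  # offload once a file has been idle at least this long
    tier: str           # destination tier name


# Hypothetical policy: idle 30+ days goes to warm, idle 365+ days goes to cold.
RULES = [
    TieringRule(min_idle_days=365, tier="cold"),
    TieringRule(min_idle_days=30, tier="warm"),
]


def choose_tier(idle_days: int, rules=RULES, default: str = "hot") -> str:
    """Return the tier of the strictest rule the file satisfies, else the hot tier."""
    for rule in sorted(rules, key=lambda r: r.min_idle_days, reverse=True):
        if idle_days >= rule.min_idle_days:
            return rule.tier
    return default


files = {"orders_today.parquet": 0, "q1_report.csv": 45, "logs_2017.tar": 700}
placement = {name: choose_tier(days) for name, days in files.items()}
print(placement)
```

Running such a policy on a schedule is what removes the need for anyone to move data between tiers by hand.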

Data tiering is an intelligent solution for cost-effectively managing ever-increasing data growth. MAST eliminates the need to move data manually, while intelligently scaling and translating it for the cloud. Users and applications accessing files do not have to take any special action to take advantage of object tiering. Erasure coding for the capacity tier makes a compelling choice, due to its resilience to failures and storage efficiency. The way that the MapR Data Platform separates the decisions about detailed storage for data from the use of the data by applications is an example of automating policy that can drive down costs without sacrificing performance.

Once set, data tiering is transparent to users and applications. Administrators have the flexibility to integrate with their choice of public or private cloud object stores that expose an S3-compatible API (including Amazon S3). Use the MapR integration with S3-compatible APIs to store data long-term. Typically used for archival purposes, a MapR archive tier allows you to move data to the cloud or any low-cost S3-compatible store and to bring the data back into an active, operational mode quickly.

[Figure: Storage tiers. Fast and frequently active data is replicated, active data is erasure coded, and inactive data is tiered to an object store (S3).]

The integration with the cloud through object tiering also provides the option of cloud bursting. Cloud bursting is the ability to spin up a cluster in the cloud when you don’t have enough on-premises compute capacity to meet transient demand. This flexibility helps you control cost by growing your cloud footprint only when demand requires it and shrinking it quickly when the need passes.

With MapR, capacity and performance can be achieved simultaneously and cost-effectively.

Global Namespace

The MapR global namespace allows customers to view geo-distributed clusters (edge, core, cloud) as one “logical cluster” to enable global data access for traditional applications and big data applications using open APIs and industry standard protocols.

The global namespace can also enable cross-country analysis in scenarios where data may not be stored outside its home geography. In such cases, the data can be physically stored exclusively in one cluster, for example us.mapr.com or asia.mapr.com, while analysis is still carried out on the data in a holistic, location-independent manner. MapR makes this possible by treating the different geo-locations as a single entity.

A global namespace can span on-premises, hybrid, and public clouds. This way customers can become more cloud independent, because applications can store and process the data using industry-standard open APIs with the global data access.
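From an application's point of view, a global namespace means each cluster shows up as a directory under one common root, so cross-cluster analysis becomes ordinary path traversal. The following sketch simulates that layout in a temporary directory; the cluster names echo the examples in this paper, while the file layout and record counts are invented for illustration.

```python
import tempfile
from pathlib import Path


def count_records(namespace_root: Path, relative_path: str) -> dict[str, int]:
    """Count lines in the same relative file under every cluster directory."""
    counts = {}
    for cluster_dir in sorted(p for p in namespace_root.iterdir() if p.is_dir()):
        data_file = cluster_dir / relative_path
        if data_file.is_file():
            with data_file.open(encoding="utf-8") as handle:
                counts[cluster_dir.name] = sum(1 for _ in handle)
    return counts


# A temporary directory stands in for the namespace root (an assumption).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for cluster, rows in {"us.mapr.com": 3, "asia.mapr.com": 2}.items():
        target = root / cluster / "sales" / "orders.csv"
        target.parent.mkdir(parents=True)
        target.write_text("order\n" * rows, encoding="utf-8")

    per_cluster = count_records(root, "sales/orders.csv")
    total = sum(per_cluster.values())

print(per_cluster, total)
```

The data never leaves its home cluster directory, yet the aggregate is computed in one pass over a single logical tree.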

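[Figure: Global namespace. Under the /mapr root, the clusters /us.mapr.com, /asia.mapr.com, /us_cloud.mapr.com, and /eu_cloud.mapr.com appear as one logical cluster that is globally protected, globally accessible, globally replicated, and globally managed.]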


MAPR DATA PLATFORM: KEEP CLOUD COSTS IN CHECK

The MapR Data Platform helps control your cloud spend more intelligently, resulting in lower total annual billings and incurred costs than managing your data and data usage directly through a cloud provider’s interfaces. In analyzing customer spend, MapR has shown savings averaging 30% when using our platform for data management in a cloud environment over using the cloud provider capabilities directly. Furthermore, these cost savings grow over time as the number, size, and complexity of cloud use cases increase. These savings come from much of the functionality we have discussed earlier in this paper, including:

• Decreasing the size of your storage footprint in the cloud with data compression

• Using object tiering for optimal use of lower-cost object storage for infrequently used data

• Using open APIs to ease migration of existing applications to the cloud and avoid costly application rewrites

After modeling many actual customer scenarios, MapR has found that using the MapR Data Platform with a public cloud platform can result in:

• Lowering total annual cloud billings by 15% to 31%

• Lowering the cumulative 3-year cloud costs for an average-sized analytics use case by about 23% or an estimated $550K

• Lowering the cumulative 3-year cloud costs for an average-sized complex use case by about 23% or an estimated $2.7M
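The percentages and dollar figures above can be related with simple arithmetic. The baselines computed below are back-calculated from the rounded figures in the bullets; they are not stated in this paper and should be read as rough implied values only.

```python
def implied_baseline(savings: float, savings_rate: float) -> float:
    """Return the baseline cost implied by an absolute saving and a fractional rate."""
    if not 0 < savings_rate < 1:
        raise ValueError("savings_rate must be between 0 and 1")
    return savings / savings_rate


# Inputs are the rounded figures quoted in the bullets above.
analytics_baseline = implied_baseline(550_000, 0.23)
complex_baseline = implied_baseline(2_700_000, 0.23)

for label, baseline, savings in [
    ("analytics use case", analytics_baseline, 550_000),
    ("complex use case", complex_baseline, 2_700_000),
]:
    print(f"{label}: implied 3-year baseline ${baseline:,.0f}, "
          f"with savings ${baseline - savings:,.0f}")
```

On these inputs the implied three-year baselines are roughly $2.4M and $11.7M respectively.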

Read our blog post, “Supercharge Your Cloud ROI with MapR,” to learn more.

Storage Cost and Performance Optimization

MapR XD Distributed File and Object Store is a very large-scale, high-performance file system. As covered earlier in this paper, it supports object tiering, compression, and a global namespace. These capabilities allow our customers to optimize the use of both low-cost cloud storage and the high-performance file and object store essential to meeting business requirements and SLAs, such as those for real-time analytics.

MapR compression decreases the size of data storage used in the cloud, which is a primary driver of the billings generated by public cloud providers. With object tiering and compression, MapR can reduce your total billings for storage services while also providing a high-performance data tier for greater business value.


MapR and the MapR logo are registered trademarks of MapR and its subsidiaries in the United States and other countries. Other marks and brands may be claimed as the property of others. The product plans, specifications, and descriptions herein are provided for information only and subject to change without notice, and are provided without warranty of any kind, express or implied. Copyright © 2019 MapR Technologies, Inc.


High Value, Lower-Cost Database and Streaming Services

The MapR Data Platform includes both MapR Database and MapR Event Store for Apache Kafka. In our use case analysis, both of these services can be used as alternatives to similar services offered by major public cloud providers to hold down costs but still deliver superior results. Based on our analysis of pricing data and use case scenarios, we found that using MapR Database can save up to two-thirds over using public cloud database offerings.

Migrating Applications to the Cloud

For customers migrating on-premises applications to the cloud, the MapR Data Platform can make recoding much cheaper and redeploying much faster. Moving applications to the cloud is generally a very costly proposition, both in labor months and elapsed time needed to rewrite and test the applications on the cloud APIs. Since MapR includes both NFS and POSIX APIs, many of the targeted applications can be rapidly migrated to the cloud at a much lower cost.

Increased Development and Data Science Productivity

By using the MapR Data Platform, customers realize substantial improvements in the productivity of their developers, data scientists, and other business users due to the faster, less labor-intensive cycle times associated with building, enhancing, and evolving their growing number of cloud-based use cases.

SUMMARY

In today’s competitive marketplace, businesses are increasingly relying on data in the cloud to drive digital transformations and improve customer engagement, competitive position, efficiency, and quality.

Utilizing the MapR Data Platform within a cloud deployment, regardless of whether it is a single cloud provider, hybrid, or multi-cloud, can reduce total costs, improve agility, speed time-to-market, and ensure security – with no cloud lock-in – enabling optimization/agility across the evolving mix of on-premises and available cloud options.

Additionally, MapR benefits include substantial improvements in performance/SLA compliance as well as data scientist/user productivity, measured by use case volume, time-to-market, and business value.

The result is a powerful enabler for data-driven transformation initiatives.
