OpenStack Sahara Essentials

OpenStack Sahara Essentials: Integrate, deploy, rapidly configure, and successfully manage your own big data-intensive clusters in the cloud using OpenStack Sahara

By Omar Khedher
Book Apr 2016 178 pages 1st Edition



Chapter 1. The Essence of Big Data in the Cloud

How do we quantify data into business value? It is a serious question we might ask when we look around and notice users' increasing appetite for rich media and data-driven content across the web. This raises several challenges: how do we manage the exponential growth of data? In particular, how do we extract the most valuable insights from these immense waves of data? This is the era of big data! To meet the growing demands of big data and facilitate its analysis, solutions such as Hadoop and Spark have appeared and become essential tools for taking a first successful step into the big data world. However, the first question is still not fully answered! A new architecture and cost approach may be needed to handle the intensive resource consumption of data analysis at scale. Although Hadoop, for example, is a great solution for running data analysis and processing, it is difficult to configure and maintain, and its complex architecture may require significant expertise. In this book, you will learn how to use OpenStack to manage and rapidly configure a Hadoop/Spark cluster. Sahara, a recently integrated OpenStack project, offers an elegant self-service way to deploy and manage big data clusters. It began as an Apache 2.0 project and has since joined the OpenStack ecosystem to provide a fast way of provisioning Hadoop clusters in the cloud. In this chapter, we will explore the following points:

  • A brief introduction to the big data groove

  • Understanding the success of big data processing when combined with the cloud computing paradigm

  • Learning how OpenStack can offer a unique big data management solution

  • Discovering Sahara in OpenStack and briefly covering its overall architecture

It is all about data


A world of information, sitting everywhere in different formats and locations, generates a crucial question: where is my data?

During the last decade, most companies and organizations have come to realize the increasing rate at which data is generated and have begun to adopt more sophisticated ways of handling the growing amount of information. Maintaining a customer-business relationship in any organization depends heavily on the answers found in the documents and files sitting on its hard drives. The picture is even wider, with data generating more data, and with it comes the need to extract particular data elements. The filtered elements are then stored separately for a better information management process, and join the data space. We are talking about terabytes and petabytes of structured and unstructured data: that is the essence of big data.

The dimensions of big data

Big data refers to data that exceeds the scope of traditional data tools to manage and manipulate it.

Gartner analyst Doug Laney described big data in a 2001 research publication in terms of what are now known as the 3Vs:

  • Volume: The overall amount of data

  • Velocity: The processing speed of data and the rate at which data arrives

  • Variety: The different types of structured and unstructured data

Note

To read more about the 3Vs concept introduced by Doug Laney, check the following link: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

The big challenge of big data

Another important question is how the data will be manipulated and managed in such a big space. For sure, traditional tools need to be revisited to meet the large volume of data. In fact, loading and analyzing it in a traditional database means the database might become overwhelmed by the unstoppable, massive surge of data.

Additionally, it is not only the volume of data that presents a challenge, but also time and cost. Merging big data using traditional tools might be too expensive, and the time taken to access the data can become prohibitive. From a latency perspective, users need to run a query and get a response in a reasonable time. A different approach exists to meet those challenges: Hadoop.

The revolution of big data

Hadoop tools come to the rescue, answering a few challenging questions raised by big data. How can you store and manage a mixture of structured and unstructured data sitting across a vast storage network? How can given information be accessed quickly? How can you control the big data system in a scalable and flexible fashion?

The Hadoop framework lets data volumes increase while keeping processing time under control. Without diving into the Hadoop technology stack, which is out of the scope of this book, it is worth examining a few tools available under the umbrella of the Hadoop project and within its ecosystem:

  • Ambari: Hadoop cluster management and monitoring

  • HDFS: The Hadoop distributed storage platform (Hadoop Distributed File System)

  • HBase: The Hadoop NoSQL non-relational database

  • Hive: The Hadoop data warehouse

  • Hue: A Hadoop web interface for analyzing data

  • MapReduce: The distributed processing model and engine used by Hadoop

  • Pig: A high-level language for data analysis

  • Storm: A distributed real-time computation system

  • YARN: The resource management layer introduced in Hadoop version 2

  • ZooKeeper: The Hadoop centralized configuration and coordination service

  • Flume: A service for data collection, aggregation, and streaming

  • Mahout: A scalable machine learning platform

  • Avro: A data serialization platform

Apache Spark is another amazing alternative for processing large amounts of data, at speeds a typical MapReduce job cannot provide. Typically, Spark can run on top of Hadoop or standalone. Hadoop uses HDFS as its default file system; it is designed as a distributed file system that provides high-throughput access to application data.
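To make the MapReduce model concrete, the classic word-count job can be sketched in plain Python. This is a toy, single-process imitation of the map, shuffle, and reduce phases that Hadoop distributes across a cluster; no Hadoop API is used here:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key (word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data in the cloud", "big data on OpenStack"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # → 2
```

In a real Hadoop cluster, the map and reduce functions run on many nodes in parallel and the shuffle moves data over the network; the logic, however, is exactly this.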

The big data tools (Hadoop/Spark) sound very promising. On the other hand, a project launched at terabyte scale might quickly grow to petabyte scale. The traditional answer is to add more clusters. However, operational teams then face more difficulties with manual deployment, change management and, most importantly, performance scaling. Ideally, users actively working on a live production setup should not experience any sort of service disruption. Adding an elasticity flavor to the Hadoop infrastructure in a scalable way is therefore imperative. How can you achieve this? An innovative idea is to use the cloud.

Note

Scala and R are two languages widely used alongside big data tools. Scala can be used to develop applications that interact with Hadoop and Spark. R has become very popular for data analysis, data processing, and descriptive statistics. Integration of Hadoop with R is ongoing; RHadoop is one of the open source R projects that exposes a rich collection of packages to help analyze data with Hadoop. To read more about RHadoop, visit the official GitHub project page found at https://github.com/RevolutionAnalytics/RHadoop/wiki

A key to big data success

Cloud computing technology might be a satisfactory solution, eliminating large upfront IT investments. A scalable approach is essential to let businesses easily scale out their infrastructure. It can be as simple as putting the application in the cloud and letting the provider support and resolve the big data management scalability problem.

Use case: Elastic MapReduce

One shining example is the popular Amazon service named Elastic MapReduce (EMR), which can be found at https://aws.amazon.com/elasticmapreduce/. Amazon EMR, in a nutshell, is Hadoop in the cloud. Before taking a step further and briefly seeing how such technology works, it is essential to check where EMR sits in Amazon at an architectural level.

Basically, Amazon offers the famous EC2 service (which stands for Elastic Compute Cloud), found at https://aws.amazon.com/ec2/. It is a way to demand a certain amount of compute resources: servers, load balancers, and many more. Moreover, Amazon exposes a simple key/value storage model named Simple Storage Service (S3), found at https://aws.amazon.com/s3/.

Using S3, storing any type of data is very simple and straightforward using web or command-line interfaces. It is the responsibility of Amazon to take care of the scaling, data availability, and the reliability of the storage service.

We have used a few acronyms: EC2, S3, and EMR. From a high-level architecture perspective, EMR sits on top of EC2 and S3, using EC2 for processing and S3 for storage. The main purpose of EMR is to process data in the cloud without managing your own infrastructure. As described briefly in the following diagram, data is pulled from S3, an EC2 cluster of a certain size is spun up automatically, and the results are piped back to S3. The hallmark of Hadoop in the cloud is zero-touch infrastructure: what you need to do is just specify what kind of job you intend to run, the location of the data, and from where to pick up the results.
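Those three essentials — the job, the data location, and the results location — are really all a request to such a service carries. As an illustration only, the dictionary below has the general shape of an EMR-style job request; the field names and values are hypothetical placeholders, not a verified API call:

```python
# Hypothetical EMR-style job request: the user supplies only the job,
# the data location, and where to put results; the provider does the rest.
emr_job_request = {
    "Name": "word-count-demo",
    "Instances": {
        "InstanceCount": 4,                          # cluster size, provisioned on EC2
        "MasterInstanceType": "m1.large",
        "SlaveInstanceType": "m1.large",
    },
    "Steps": [{
        "Name": "word-count",
        "Jar": "s3://my-bucket/jobs/wordcount.jar",  # the job to run
        "Args": [
            "s3://my-bucket/input/",                 # where the data lives
            "s3://my-bucket/output/",                # where results are piped back
        ],
    }],
}

# The three essentials named in the text: job, input location, output location.
step = emr_job_request["Steps"][0]
print(step["Args"][1])  # → s3://my-bucket/output/
```

Everything else — spinning up the EC2 nodes, installing Hadoop, tearing the cluster down — is the provider's responsibility, which is exactly the "zero touch" promise.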

OpenStack crossing big data


OpenStack is a very promising open source cloud computing solution that never stops embracing and joining together different projects related to the cloud environment. OpenStack keeps growing its ecosystem thanks to the conglomeration of many projects, which make it a very rich cloud platform. OpenStack exposes several infrastructure management services that work in tandem to provide a complete suite of infrastructure management software. Most of its modules were refined and became more mature with the Havana release. It is essential to first itemize the most basic ones briefly:

  • Keystone: The identity management service. Connecting to and using any OpenStack service requires authentication first.

  • Glance: The image management service. Instances are launched from disk images that Glance stores in its image catalogue.

  • Nova: The instance management service. Once authenticated, a user can create an instance by defining basic resources such as an image and a network.

  • Cinder: The block storage management service. It allows creating and attaching volumes to instances. It also handles snapshots, which can be used as a boot source.

  • Neutron: The network management service. It allows creating and managing an isolated virtual network for each tenant in an OpenStack deployment.

  • Swift: The object storage management service. Any form of data in Swift is stored in redundant, scalable, distributed object storage using a cluster of servers.

  • Heat: The orchestration service. It provides a fast-paced way to launch a complete stack from a single template file.

  • Ceilometer: The telemetry service. It monitors the cluster resources used in an OpenStack deployment.

  • Horizon: The OpenStack dashboard. It provides a web-based interface to the different OpenStack services such as Keystone, Glance, Nova, Cinder, Neutron, Swift, Heat, and so on.

  • Trove: The Database as a Service (DBaaS) component of OpenStack. It enables users to consume relational and non-relational database engines on top of OpenStack.

Note

At the time of writing, more incubated projects are being integrated into the OpenStack ecosystem with the Liberty release, such as Ironic, Zaqar, Manila, Designate, Barbican, Murano, Magnum, Kolla, and Congress. To read more about those projects, refer to the official OpenStack website at: https://www.openstack.org/software/project-navigator/

The awesomeness of OpenStack comes not only from its modular architecture but also from the contributions of its large community, which develops and integrates new projects in nearly every OpenStack release. With the Icehouse release, OpenStack contributors turned their attention to the big data world: the Elastic Data Processing service. It is even more amazing to see a cloud service similar to Amazon's EMR run by OpenStack.

Well, it is time to open the curtains and explore the marriage of one of the most popular big data frameworks, Hadoop, with one of the most successful cloud operating systems, OpenStack: Sahara. As shown in the next diagram of the OpenStack IaaS (short for Infrastructure as a Service) layering schema, Sahara is an optional service that sits on top of the base components of OpenStack. It can be enabled or activated when running a private cloud based on OpenStack.

Note

More details on Sahara integration in a running OpenStack environment will be discussed in Chapter 2, Integrating OpenStack Sahara.

Sahara: bringing big data to the cloud

Sahara is a big data processing project that was incubated during the OpenStack Icehouse release and has been integrated since the OpenStack Juno release. The Sahara project was a joint effort between Mirantis, a major OpenStack integration company, Red Hat, and Hortonworks. Sahara enables users to run Hadoop/Spark big data applications on top of OpenStack.

Note

The Sahara project was originally named Savanna and was renamed due to trademark issues.

Sahara in OpenStack

The main reason the Sahara project was born is the need for agile access to big data. Moving big data to the cloud brings many benefits to the user, in this case:

  • Unlimited scalability: Sahara sits on top of the OpenStack cloud management platform. By nature, OpenStack services scale very well, and as we will see, Sahara lets Hadoop clusters scale on OpenStack.

  • Elasticity: Growing or shrinking a Hadoop cluster as required is obviously a major advantage of using Sahara.

  • Data availability: Sahara is tightly integrated with the core OpenStack services, as we will see later. Swift presents a real cloud storage solution and can be used by Hadoop clusters for data source storage. It is a highly durable and available option for the input/output of a data processing workflow.

Note

Swift can be used for input and output data source access in a Hadoop cluster for all job types except Hive.

For an intimate understanding of the benefits cited previously, it is worth going through a concise architectural overview of Sahara in OpenStack. As depicted in the next diagram, a user can access and manage big data resources from the Horizon web UI or the OpenStack command-line interface. To use any service in OpenStack, it is required to authenticate against the Keystone service. This also applies to Sahara, which needs to be registered in the Keystone service catalogue.

To be able to create a Hadoop cluster, Sahara retrieves and registers virtual machine images in its own image registry by contacting Glance. Nova is another essential OpenStack core component, used to provision and launch the virtual machines for the Hadoop cluster. Additionally, Heat can be used by Sahara to automate the deployment of a Hadoop cluster, which will be covered in a later chapter.
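The order of those interactions can be sketched as a small Python simulation. The service names match the text, but the helper functions are hypothetical stand-ins, not the real python-saharaclient API:

```python
# Toy simulation of Sahara's provisioning path. Each step records which
# OpenStack service Sahara talks to; the helpers are illustrative only.

calls = []

def authenticate():
    calls.append("keystone")        # every request is authenticated first
    return "token"

def register_image(token, image_name):
    calls.append("glance")          # image registered in Sahara's image registry
    return {"image": image_name, "registered": True}

def launch_cluster(token, image, node_count):
    calls.append("nova")            # VMs provisioned for the Hadoop nodes
    return {"nodes": node_count, "image": image["image"]}

def orchestrate(token, cluster):
    calls.append("heat")            # optional: Heat automates the deployment
    cluster["status"] = "Active"
    return cluster

token = authenticate()
image = register_image(token, "ubuntu-hadoop-2.7")
cluster = orchestrate(token, launch_cluster(token, image, node_count=3))
print(calls)  # → ['keystone', 'glance', 'nova', 'heat']
```

The takeaway is the dependency chain: nothing happens before Keystone authentication, the image must exist in the registry before Nova can boot nodes from it, and orchestration comes last.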

Note

In OpenStack, since the Juno release, it is possible to instruct Sahara to use block storage as the backend for cluster nodes.

The Sahara OpenStack mission

In addition to the generic big data benefits in OpenStack mentioned earlier, OpenStack Sahara has some unique characteristics, which can be itemized as follows:

  • Fast provisioning: Deploying a Hadoop/Spark cluster becomes an easy task, performed with a few push-button clicks or via the command-line interface.

  • Centralized management: A Hadoop/Spark cluster is controlled and monitored efficiently from one single management interface.

  • Cluster management: Sahara offers an amazing templating mechanism. Starting, stopping, scaling, shaping, and resizing actions form the life cycle of a Hadoop/Spark cluster. Performing this life cycle in a repeatable way is simplified by using a template in which the Hadoop configuration is defined. All the cluster node setup details just get out of the user's way.

  • Workload management: This is another key feature of Sahara. It essentially defines Elastic Data Processing: the running and queuing of jobs and how they should work in the cluster. Several types of data processing jobs, such as a MapReduce job, a Pig script, an Oozie workflow, or a JAR file, can run across a defined cluster. Sahara can provision a new ephemeral cluster and terminate it on demand, for example running a job for some specific analysis and shutting down the cluster when the job is finished. Workload management also encloses data sources, which define where a job reads its data from and writes its results to.

    Note

    Data source URLs into Swift and into HDFS will be covered in more detail in Chapter 5, Discovering Advanced Features with Sahara.

  • No deep expertise required: Administrators and operators no longer need to worry about managing the infrastructure running underneath the Hadoop/Spark cluster. With Sahara, managing the infrastructure does not require real big data operational expertise.

  • Multi-framework support: Sahara can integrate diverse data processing frameworks using provisioning plugins. A user can choose to deploy a specific Hadoop/Spark distribution, such as the Hortonworks Data Platform (HDP) plugin via Ambari, Spark, Vanilla, MapR Distribution, or Cloudera plugins.

  • Analytics as a Service: Bursty analytics workloads can utilize free computing infrastructure capacity for a limited period of time.
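The ephemeral-cluster pattern described under workload management can be summarized in a short Python sketch. This is again a simulation with hypothetical helpers, not Sahara's real API: provision a throwaway cluster, run the job, and tear the cluster down as soon as the job finishes.

```python
import contextlib

@contextlib.contextmanager
def ephemeral_cluster(size):
    """Provision a throwaway cluster and guarantee it is terminated afterwards."""
    cluster = {"size": size, "status": "Active"}
    try:
        yield cluster
    finally:
        cluster["status"] = "Terminated"   # cluster lives only as long as the job

def run_job(cluster, job, input_url, output_url):
    # A real EDP job would read from Swift or HDFS; here we just record the flow.
    return {"job": job, "read": input_url, "wrote": output_url,
            "ran_on": cluster["size"]}

with ephemeral_cluster(size=5) as c:
    result = run_job(c, "pig-script", "swift://data/in", "swift://data/out")

print(c["status"])  # → Terminated
```

The context-manager shape captures the economics of the feature: compute is paid for (or occupied) only while the job runs, which is precisely what makes Analytics as a Service attractive for bursty workloads.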

Sahara's architecture

We saw in the previous diagram how Sahara is integrated into the OpenStack ecosystem from a high-level perspective. As a new OpenStack service, Sahara exposes different components that act as clients of other OpenStack services such as Keystone, Swift, Nova, Neutron, Glance, and Cinder. Every request initiated from the Sahara endpoint is performed against the public APIs of these OpenStack services. For this reason, it is essential to examine the Sahara architecture, shown in the following diagram:

The OpenStack Sahara architecture consists essentially of the following components:

  • REST API: Every client request initiated from the dashboard is translated into a REST API call.

  • Auth: Like any other OpenStack service, Sahara must authenticate against the authentication service, Keystone. This also covers client and user authorization to use the Sahara service.

  • Vendor plugins: The vendor plugins sit in the middle of the Sahara architecture and expose the type of cluster to be launched. Vendors such as Cloudera and Apache Ambari provide their distributions as Sahara plugins so that users can configure and launch a Hadoop cluster based on the plugin mechanism.

  • Elastic Data Processing (EDP): Runs jobs on an existing, launched Hadoop or Spark cluster in Sahara. EDP makes sure that jobs are scheduled to the clusters and maintains the status of jobs, their sources, where input data should be extracted from, and where the output of the processed data should be written.

  • Orchestration Manager/Provisioning Engine: The core component of Sahara cluster provisioning and management. It instructs the Heat engine (the OpenStack orchestration service) to provision a cluster by communicating with the rest of the OpenStack services, including the compute, network, block storage, and image services.

  • Data Access Layer (DAL): The persistent internal Sahara data store.

Note

It is important to note that Sahara could be configured to use a direct engine to create the cluster instances, initiating calls to the required OpenStack services to provision them. Note also that the direct engine is deprecated as of the OpenStack Liberty release, where Heat becomes the default Sahara provisioning engine.

Summary


In this chapter, you explored the factors behind the success of the emerging technology of data processing and analysis using cloud computing. You learned how OpenStack can be a great opportunity to offer the scalable and elastic big data on-demand infrastructure that is needed. It can also be useful for executing on-demand Elastic Data Processing tasks.

This first chapter introduced the new OpenStack incubated project called Sahara: a rapid, auto-deployed, and scalable solution for Hadoop and Spark clusters. An overall view of the Sahara architecture was discussed for a fast-paced understanding of the platform and how it works in an OpenStack private cloud environment.

Now it is time to get things running and discover how such an amazing big data management solution can be used, by installing OpenStack and integrating Sahara, which will be the topic of the next chapter.


Key benefits

  • A fast-paced guide to help you utilize the benefits of Sahara in OpenStack to meet the big data world of Hadoop
  • A step-by-step approach to simplify the complexity of Hadoop configuration, deployment, and maintenance

Description

The Sahara project is a module that aims to simplify the building of data processing capabilities on OpenStack. The goal of this book is to provide a focused, fast-paced guide to installing, configuring, and getting started with integrating Hadoop with OpenStack using Sahara. The book explains how to deploy data-intensive Hadoop and Spark clusters on top of OpenStack. It also covers how to use the Sahara REST API, how to develop applications for Elastic Data Processing on OpenStack, and how to set up Hadoop or Spark clusters on OpenStack.

What you will learn

  • Integrate and install Sahara with an OpenStack environment
  • Learn the Sahara architecture under the hood
  • Rapidly configure and scale Hadoop clusters on top of OpenStack
  • Explore the Sahara REST API to create, deploy, and manage a Hadoop cluster
  • Learn the Elastic Data Processing (EDP) facility to execute jobs in clusters from Sahara
  • Cover the other stable Hadoop plugins supported by Sahara
  • Discover different features provided by Sahara for Hadoop provisioning and deployment
  • Learn how to troubleshoot OpenStack Sahara issues

Product Details

Publication date : Apr 25, 2016
Length : 178 pages
Edition : 1st
Language : English
ISBN-13 : 9781785885969



Table of Contents

14 Chapters
OpenStack Sahara Essentials
Credits
About the Author
About the Reviewer
www.PacktPub.com
Preface
1. The Essence of Big Data in the Cloud
2. Integrating OpenStack Sahara
3. Using OpenStack Sahara
4. Executing Jobs with Sahara
5. Discovering Advanced Features with Sahara
6. Hadoop High Availability Using Sahara
7. Troubleshooting
Index

