Google Cloud Data Catalog — Live Sync Your On-Prem Hive Server Metadata Changes

Code samples and a practical approach for incrementally ingesting metadata changes from an on-premises Hive server into Google Cloud Data Catalog

Marcelo Costa
Google Cloud - Community
5 min read · May 19, 2020



Disclaimer: All opinions expressed are my own and represent no one but myself. They come from my experience participating in the development of fully operational sample connectors, available on GitHub.

The Challenge

Entering the big data world is no easy task; the amount of data can quickly get out of hand. Look at Uber's story of how they deal with 100 petabytes of data using the Hadoop ecosystem. Imagine if, every time they synced their on-premises metadata into a data catalog, a full run were executed; that would be impractical.

We need a way to monitor the changes executed on the Hive server, so that whenever a table or database is modified we capture just that change and incrementally persist it in our Data Catalog.

If you missed the last post, it showcased ingesting on-premises Hive metadata into Data Catalog; in that case, we didn't use an incremental solution.

To grasp the situation: a full run with ~1,000 tables took almost 20 minutes, even if only one table had changed. At Uber's scale, that would be no fun, right?

Sidenote: This article assumes that you have some understanding of what Data Catalog and Hive are. If you want to know more about Data Catalog, please read the official docs.

Live Sync Architecture


There are multiple ways of listening to changes executed on a Hive server; this article compares two approaches: Hive hooks vs. Hive Metastore listeners.

The architecture presented here uses a Hive Metastore listener, for the simplicity of having the metadata already parsed.

On-prem Hadoop environment side

The main component here is an agent written in Java, which listens to five Metastore events: onAlterTable, onCreateTable, onCreateDatabase, onDropTable, and onDropDatabase.

onCreateTable — other events were suppressed for better readability

It's really simple code that gets the event and sends it to a Pub/Sub topic. For details on how to set it up, and on the other events, please take a look at the GitHub repo.
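The agent itself is written in Java, but to give a feel for what it does, here is a minimal Python sketch of the equivalent publish step, using the google-cloud-pubsub client library. The topic name and the payload shape are assumptions for illustration, not the actual agent's format.

```python
# Illustrative sketch only: the real agent is Java; this shows the
# equivalent publish call. Topic name and payload shape are hypothetical.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "hive-metastore-events")

# Hypothetical onCreateTable payload: the listener serializes the event
# metadata and publishes it as the message body.
event = {"event": "CREATE_TABLE", "database": "medium", "table": "post"}

future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print(f"Published message: {future.result()}")  # blocks until publish succeeds
```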

The agent runs inside the Hive Metastore process (it is registered through the hive.metastore.event.listeners property in hive-site.xml), which must be on a network that can reach the Google Cloud project; the service account set up within it also needs the Pub/Sub Publisher role on the topic.

Google Cloud Platform side

The main components here are Pub/Sub and the Hive-to-Data Catalog connector.

  • Pub/Sub: works as a durable event ingestion and delivery layer.
  • Connector (Scrape/Prepare/Ingest): this layer transforms the Hive Metastore message into a Data Catalog asset and persists it; for details on how it works, please take a look at this post. A minimal sketch of the ingest step follows this list.
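To make the ingest step more concrete, here is a minimal sketch that persists a Hive table as a custom Data Catalog entry. This is not the actual hive2datacatalog implementation; the project, location, entry group, and entry names are all assumptions.

```python
# Minimal sketch of the ingest step: persisting a Hive table as a custom
# Data Catalog entry. All names are hypothetical; the real connector differs.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Entry groups organize custom entries; this assumes one already exists.
entry_group_name = client.entry_group_path("my-project", "us-central1", "hive")

entry = datacatalog_v1.Entry()
entry.display_name = "post"
entry.user_specified_system = "hive"  # marks this as a custom (non-GCP) asset
entry.user_specified_type = "table"

created = client.create_entry(
    parent=entry_group_name, entry_id="medium_post", entry=entry
)
print(f"Created entry: {created.name}")
```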

We also have Cloud Run, which works as a sidecar web server, receiving the message from Pub/Sub and triggering the connector.

It's really simple code that calls the Synchronizer class from the hive2datacatalog Python module, which triggers the Scrape/Prepare/Ingest steps.

Cloud Run sidecar
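Since the original gist is not reproduced here, below is a minimal sketch of what such a sidecar looks like: a small Flask app that receives the Pub/Sub push request, decodes the event, and hands it to the connector. The Synchronizer call is an assumption about the hive2datacatalog API, so it is left commented out.

```python
# Minimal sketch of the Cloud Run sidecar for a Pub/Sub push subscription.
import base64
import json
import os

from flask import Flask, request

app = Flask(__name__)


@app.route("/", methods=["POST"])
def index():
    envelope = request.get_json()
    # Pub/Sub push requests wrap the payload: envelope["message"]["data"]
    # holds the base64-encoded message body published by the agent.
    message = envelope["message"]
    event = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    print(f"Received event: {event}")

    # Hypothetical call into the connector, triggering Scrape/Prepare/Ingest:
    # Synchronizer().run(event)

    return ("", 204)  # ack the message so Pub/Sub does not redeliver it


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```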

For details on how to set up the Cloud Run sidecar, please take a look at the connector's GitHub repo.

Triggering the connector

Let's create a new database (medium) and a new table (post) to see it working:

Hive Server terminal

Checking the Hive Metastore logs, we can see the two messages sent to Pub/Sub:

Hive Metastore Logs — some data was suppressed for better readability
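You can also peek at those messages on the Pub/Sub side. Here is a minimal sketch, assuming a separate pull subscription was created for debugging (the push subscription used by Cloud Run cannot be pulled from):

```python
# Debugging sketch: pull the agent's messages from a separate, hypothetical
# pull subscription to inspect what was published.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "my-project", "hive-metastore-events-debug"
)

response = subscriber.pull(subscription=subscription_path, max_messages=2)
for received in response.received_messages:
    print(received.message.data.decode("utf-8"))

# Acknowledge so the messages are not redelivered to this subscription.
if response.received_messages:
    subscriber.acknowledge(
        subscription=subscription_path,
        ack_ids=[m.ack_id for m in response.received_messages],
    )
```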

Going to Cloud Run, we can look at the execution log:

Cloud Run Log — some lines were suppressed for better readability

Results

Finally, let's open the new entries in the Data Catalog UI:

medium database entry
post table entry

In a matter of seconds, we are able to search for the newly created entries.
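The same check can be done programmatically through the Data Catalog search API. A minimal sketch, with the project ID and query as assumptions:

```python
# Minimal sketch: search Data Catalog for the newly created "post" entry.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-project")  # hypothetical project ID

results = client.search_catalog(scope=scope, query="name:post")
for result in results:
    print(result.relative_resource_name)
```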

The sample connector

All topics discussed in this article are covered in a sample connector, available on GitHub: hive-connectors. Feel free to get it and run it according to the instructions. Contributions are welcome, by the way!

It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Closing thoughts

In this article, we covered how to incrementally ingest metadata from Hive into Google Cloud Data Catalog in a scalable and efficient way, enabling users to centralize their metadata management. Stay tuned for new posts showing how to do the same with other source systems! Cheers!
