Google Cloud Data Catalog — Keep Up With Your On-Prem Hive Server
Code samples with a practical approach on how to ingest metadata from an on-premise Hive server into Google Cloud Data Catalog
Disclaimer: All opinions expressed are my own, and represent no one but myself…. They come from the experience of participating in the development of fully operational sample connectors, available at: github.
The Challenge
We hardly find large organizations storing all their data at the same place, sometimes it’s because of compliance or even a strategic reason. That leads to many customers having their data assets spread across multiple silos, dealing with data that resides on hybrid clouds and/or on-premise environments.
By taking a metadata management perspective, we need to enable data discovery in a centralized place, no matter where the data is.
Data Catalog
Last year Google Cloud announced their metadata management tool
But how do they deal with the given challenge?
Google Cloud take on this
Recently Google Cloud released the Data Catalog custom entries API, allowing users to ingest virtually anything into Data Catalog. It’s a flexible API with open string fields, that allow users to specify the type of their assets, names, and schemas with their data types.
There are code samples for NodeJS, Java and Python. Please take a look at the official docs, if you want to know more about it.
The Hive Server
There are many ways of having a Hive server running on production, and connecting to it. In this blog post, we will work with the usually recommended production environment, where we have separate workloads for the hive server, hive metastore, and the underlying RDBMS (stores the metadata on behalf of hive).
If you want to quickly set up a Hive environment for testing/dev purposes check this post.
Test data
It’s also important to have a large amount of test data to simulate a more realistic scenario, so we will be using a pre-populated Hive server with ~1000
assets. To generate the data, this script was used, which creates random tables, with diverse column types.
Let's take a look at it
Then connecting to the on_prem_warehouse90281
database.
Now let’s see the table school_infoa3c9b12a
columns:
Hive to Data Catalog connector
To ingest all that metadata from Hive to Data Catalog, we will use a Python script (referenced as connector
in this blog post), divided into 3 Steps: Scrape, Prepare and Ingest.
Scrape
At this stage, the script connects to the source system — a PostgreSQL database in this case and retrieves the metadata using a SQL query, following the Hive metastore schema definition.
It’s important to point out that this script does not access the actual data when it connects to the RDBMS metastore, it retrieves only the metadata, which are the databases, tables and columns definitions.
Prepare
The transform logic happens here, where the source system assets are converted into Data Catalog assets.
Ingest
Finally, the in-memory Data Catalog assets, are synced and persisted into the specified Google Cloud Data Catalog instance, updated
/created
/deleted
as needed according to the source system state. The Data Catalog custom types API is called for that.
Executing the connector
After setting up the connector
environment, by following the instructions at the Github repo, let’s execute it using its command line args:
# Environment variables
export DATACATALOG_PROJECT_ID=hive2dc-gcp-project
export DATACATALOG_LOCATION_ID=us-central1
export HIVE_METASTORE_DB_HOST=localhost
export HIVE_METASTORE_DB_USER=hive
export HIVE_METASTORE_DB_PASS=hive
export HIVE_METASTORE_DB_NAME=metastore
export HIVE_METASTORE_DB_TYPE=postgresql
export GOOGLE_APPLICATION_CREDENTIALS=<CREDENTIALS_FILES_FOLDER>/hive2dc-datacatalog-credentials.json# Command line execution
google-datacatalog-hive-connector \
--datacatalog-project-id=$HIVE2DC_DATACATALOG_PROJECT_ID \
--datacatalog-location-id=$HIVE2DC_DATACATALOG_LOCATION_ID \
--hive-metastore-db-host=$HIVE2DC_HIVE_METASTORE_DB_HOST \
--hive-metastore-db-user=$HIVE2DC_HIVE_METASTORE_DB_USER \
--hive-metastore-db-pass=$HIVE2DC_HIVE_METASTORE_DB_PASS \
--hive-metastore-db-name=$HIVE2DC_HIVE_METASTORE_DB_NAME \
--hive-metastore-db-type=$HIVE2DC_HIVE_METASTORE_DB_TYPE
Results
Once the connector finishes we can go into Data Catalog search UI and look for the ingested assets
Same 8 databases shown on the Hive Server. And if we search for tables
Now let’s see the table we saw before school_infoa3c9b12a
Filtering the column cpf61825
We can then use Data Catalog Tags to mark it as PII
Execution Metrics
Finally, let’s look at some metrics generated from the execution. Metrics were collected by running the connector
on a Hive Metastore version 2.3.0
, backed by a PostgreSQL database instance populated with 993
tables distributed into 8
databases.
The following metrics are not a guarantee, they are approximations that may change depending on the environment, network and execution.
The sample connector
All topics discussed in this article are covered in a sample connector, available on GitHub: hive-connectors. Feel free to get it and run according to the instructions. Contributions are welcome, by the way!
It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
Closing thoughts
In this article, we have covered how to ingest metadata from Hive into Google Cloud Data Catalog, enabling users to centralize their Metadata management, even when it resides within an on-premise environment. Stay tuned for new posts showing how to do the same with other source systems! Cheers!
References
- Connector Github Repo: https://github.com/GoogleCloudPlatform/datacatalog-connectors-hive
- Data Catalog GA blog post: https://cloud.google.com/blog/products/data-analytics/data-catalog-metadata-management-now-generally-available
- Data Catalog official docs: https://cloud.google.com/data-catalog/
- Code Samples: https://cloud.google.com/data-catalog/docs/how-to/custom-entries#data-catalog-custom-entry-python
- Create Hive test/dev env: https://dev.to/mesmacosta/quickly-set-up-a-hive-environment-on-gcp-38j8
- Create Hive test data: https://dev.to/mesmacosta/using-python-to-create-hive-tables-with-random-schema-2e5o