Replace Hive CLI with Beeline on a cluster with Sentry
Fine grained authorization
In this blog I will explain how to use beeline in a secured cluster. The CDH 5.1.0 cluster is secured with Kerberos (authentication) and Sentry (authorization). If you want to setup a secured cluster checkout the related blog kerberos-cloudera-setup. Cloudera is using Sentry for fine grained authorization of data and metadata stored on a Hadoop cluster.
This blog is related to the hive command-line tool, using Hive through HUE is fine!
Why change from Hive CLI to Beeline?
The primary difference between the two involves how the clients connect to Hive. The Hive CLI connects directly to the Hive Driver and requires that Hive be installed on the same machine as the client. However, Beeline connects to HiveServer2 and does not require the installation of Hive libraries on the same machine as the client. Beeline is a thin client that also uses the Hive JDBC driver but instead executes queries through HiveServer2, which allows multiple concurrent client connections and supports authentication.
Cloudera's Sentry security is working through HiveServer2 and not HiveServer1 which is used by Hive CLI. So hive though the command-line will not follow the policy from Setry. According to the cloudera docs you should not use Hive CLI and WebHCat. Use beeline or impala-sell instead.
hive command-line will bypass sentry security!
Connect with Beeline
For a non secured cluster it is easy to connect. You can use beeline as described in this blog cloudera-migrating-hive-to-beeline.
# beeline with params beeline -u url -n username -p password # url is a jdbc connection string, pointing to the hiveServer2 host. # or use the !connect action beeline beeline> !connect jdbc:hive2://HiveServer2Host:Port
When using a kerberized cluster you can connect using your principle:
# initialize your kerberos ticket kinit # Connect with your ticket, no username / password required. # master01 is the node where HiveServer2 is running # In the url add the parameter principle with the hive principle beeline -u "jdbc:hive2://master01:10000/default;principal=hive/master01@MYREALM.COM"
You can find the full principle name in Cloudera Manager
- Administration -> Kerberos
- Credentials -> search hive
- Use the principle where HiveServer2 is running
Export a query to file with beeline:
HIVESERVER2_URL = "jdbc:hive2://master01:10000/default;principal=hive/master01@MYREALM.COM" # Just simple export (as a table). beeline -u $HIVESERVER2_URL -f quey.sql > result.txt # Result firstline: query, then: pretty table, lastline: '0: jdbc:hive2://master02:10000/default>' # Remove first and last line of result for a proper csv-file with a header: beeline -u $HIVESERVER2_URL -f quey.sql --outputformat=csv --showHeader=true | tail -n +2 -f | head -n -1 > result.csv
All the errors look the same. Error: Invalid URL ... (state=08S01,code=0). The --verbose=true options does not help much unfortunately. When you run into problems, check the hiveserver2 logs for hints.
[alexanderbij@tools01 ~]$ beeline -u jdbc:hive2://master01:10000/default;principal=hive/master01@MYREALM.COM scan complete in 3ms Connecting to jdbc:hive2://master01:10000/default Error: Invalid URL: jdbc:hive2://master01:10000/default (state=08S01,code=0) ...
Note that the Invalid URL message does not contain the principle part! Use "quotes around the url", otherwise the hive principle argument is not used
# beeline shell 14/08/08 09:44:23 ERROR transport.TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))] ... Caused by: KrbException: Server not found in Kerberos database (7) ... Caused by: KrbException: Identifier doesn't match expected value (906)
There is a keytab-file on the HiveServer2-node initialized with the principle. The connection string is using the wrong Kerberos principle for the keytab-file. Make sure you provide the correct hive principle in the connection url.
Follow us for more of this
Testing and debugging Apache Airflow
February 22, 2019
The Zen of Python and Apache Airflow
February 18, 2019
AWS Machine Learning Competency Status for GoDataDriven
February 14, 2019
GoDataDriven Open Source Contribution for January 2019, the Apache Edition
February 13, 2019
Our social responsibility as a company
February 08, 2019
Keras: multi-label classification with ImageDataGenerator
January 31, 2019