RBHive is a simple Ruby gem to communicate with the Apache Hive Thrift server.
It supports:
- Hiveserver (the original Thrift service shipped with Hive since early releases)
- Hiveserver2 (the new, concurrent Thrift service shipped with Hive releases since 0.10)
- Any other 100% Hive-compatible Thrift service (e.g. Sharkserver)
It is capable of using the following Thrift transports:
- BufferedTransport (the default)
- SaslClientTransport (SASL-enabled transport)
- HTTPClientTransport (tunnels Thrift over HTTP)
Hiveserver (the original Thrift interface) only supports a single client at a time. RBHive implements this with the RBHive::Connection class. It only supports a single transport, BufferedTransport.
Hiveserver2 (the new Thrift interface) can support many concurrent client connections. It is shipped with Hive 0.10 and later. In Hive 0.10, only BufferedTransport and SaslClientTransport are supported; starting with Hive 0.12, HTTPClientTransport is also supported.
Each of the versions after Hive 0.10 has a slightly different Thrift interface; when connecting, you must specify the Hive version or you may get an exception.
RBHive implements this client with the RBHive::TCLIConnection class.
We had to set the following in hive-site.xml to get the BufferedTransport Thrift service to work with RBHive:
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>
Otherwise you'll get this nasty-looking exception in the logs:
ERROR server.TThreadPoolServer: Error occurred during processing of message.
java.lang.ClassCastException: org.apache.thrift.transport.TSocket cannot be cast to org.apache.thrift.transport.TSaslServerTransport
at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:35)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Consult the documentation for the service, as this will vary depending on the service you're using.
Since Hiveserver has no options, connection code is very simple:
RBHive.connect('hive.server.address', 10_000) do |connection|
  connection.fetch 'SELECT city, country FROM cities'
end
➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]
Hiveserver2 can be run in several configurations. The connection code takes a hash with these possible parameters:
- :transport - one of :buffered (BufferedTransport), :http (HTTPClientTransport), or :sasl (SaslClientTransport)
- :hive_version - the number after the period in the Hive version (e.g. 10, 11, 12, 13) or one of a set of symbols; see Hiveserver2 protocol versions below for details
- :timeout - if using BufferedTransport or SaslClientTransport, the timeout on the socket, in seconds
- :sasl_params - if using SaslClientTransport, a hash of parameters to set up the SASL connection
If you pass either an empty hash or nil in place of the options (or do not supply them), the connection is attempted with the Hive version set to 0.10, using :buffered as the transport, and a timeout of 1800 seconds.
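The defaults described above can be pictured as a plain hash merge. The following is a minimal sketch for illustration only, not RBHive's own code: tcli_defaults and effective_tcli_options are hypothetical helper names, and the values are taken from the paragraph above.

```ruby
# Hypothetical helper (not part of RBHive): the documented defaults
# used when no options, nil, or an empty hash is supplied.
def tcli_defaults
  { :transport => :buffered, :hive_version => 10, :timeout => 1800 }
end

# Hypothetical helper: overlay the caller's options on the defaults.
# nil is treated the same as an empty hash.
def effective_tcli_options(options = nil)
  tcli_defaults.merge(options || {})
end

effective_tcli_options(nil)
# all three defaults apply

effective_tcli_options({ :hive_version => 12, :transport => :http })
# :hive_version and :transport are overridden, :timeout stays at 1800
```

Any key you leave out of the options hash keeps its default, so passing only :hive_version is enough when the buffered transport and default timeout suit you.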
Connecting with the defaults:
RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
  connection.fetch('SHOW TABLES')
end
Connecting with a specific Hive version (0.12 in this case):
RBHive.tcli_connect('hive.server.address', 10_000, {:hive_version => 12}) do |connection|
  connection.fetch('SHOW TABLES')
end
Connecting with a specific Hive version (0.12) and using the :http transport:
RBHive.tcli_connect('hive.server.address', 10_000, {:hive_version => 12, :transport => :http}) do |connection|
  connection.fetch('SHOW TABLES')
end
We have not tested the SASL connection, as we don't run SASL; pull requests and testing are welcomed.
Since the introduction of Hiveserver2 in Hive 0.10, there have been a number of revisions to the Thrift protocol it uses.
The following table lists the available values you can supply to the :hive_version parameter when making a connection to Hiveserver2.
value | Thrift protocol version | notes |
---|---|---|
10 | V1 | First version of the Thrift protocol, used only by Hive 0.10 |
11 | V2 | Used by the Hive 0.11 release (but not CDH5, which ships with Hive 0.11!); adds asynchronous execution |
12 | V3 | Used by the Hive 0.12 release; adds varchar type and primitive type qualifiers |
13 | V7 | Used by the Hive 0.13 release; adds features from V4, V5 and V6, plus token-based delegation connections |
:cdh4 | V1 | CDH4 uses the V1 protocol, as it ships with the upstream Hive 0.10 |
:cdh5 | V5 | CDH5 ships with upstream Hive 0.11, but adds patches to bring the Thrift protocol up to V5 |
In addition, you can explicitly set the Thrift protocol version according to this table:
value | Thrift protocol version | notes |
---|---|---|
:PROTOCOL_V1 | V1 | Used by Hive 0.10 release |
:PROTOCOL_V2 | V2 | Used by Hive 0.11 release |
:PROTOCOL_V3 | V3 | Used by Hive 0.12 release |
:PROTOCOL_V4 | V4 | Updated during Hive 0.13 development; adds decimal precision/scale, char type |
:PROTOCOL_V5 | V5 | Updated during Hive 0.13 development; adds error details when GetOperationStatus returns in error state |
:PROTOCOL_V6 | V6 | Updated during Hive 0.13 development; adds binary type for binary payload, uses columnar result set |
:PROTOCOL_V7 | V7 | Used by Hive 0.13 release; support for token-based delegation connections |
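Read together, the two tables amount to a lookup from a :hive_version value to a Thrift protocol version. The sketch below restates them in plain Ruby so the mapping can be checked at a glance. hive_protocol_table and protocol_for are hypothetical names used for illustration; this is not necessarily how RBHive resolves the value internally.

```ruby
# Hypothetical helper (not part of RBHive): the first table above,
# expressed as :hive_version value => protocol symbol.
def hive_protocol_table
  {
    10    => :PROTOCOL_V1,
    11    => :PROTOCOL_V2,
    12    => :PROTOCOL_V3,
    13    => :PROTOCOL_V7,
    :cdh4 => :PROTOCOL_V1,
    :cdh5 => :PROTOCOL_V5
  }
end

# Hypothetical helper: resolve a :hive_version value to a protocol symbol.
# Explicit :PROTOCOL_V1 .. :PROTOCOL_V7 symbols pass through unchanged;
# anything not listed in either table raises ArgumentError.
def protocol_for(hive_version)
  explicit = (1..7).map { |n| :"PROTOCOL_V#{n}" }
  return hive_version if explicit.include?(hive_version)

  hive_protocol_table.fetch(hive_version) do
    raise ArgumentError, "unknown :hive_version #{hive_version.inspect}"
  end
end

protocol_for(12)            # :PROTOCOL_V3
protocol_for(:cdh5)         # :PROTOCOL_V5
protocol_for(:PROTOCOL_V6)  # :PROTOCOL_V6
```

Note that 11 and :cdh5 resolve to different protocol versions even though both correspond to Hive 0.11, which is why CDH5 users should pass :cdh5 rather than 11.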
To fetch results with Hiveserver:
RBHive.connect('hive.server.address', 10_000) do |connection|
  connection.fetch 'SELECT city, country FROM cities'
end
➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]
Or with Hiveserver2:
RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
  connection.fetch 'SELECT city, country FROM cities'
end
➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]
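Judging by the output shown above, fetch returns rows that behave like hashes keyed by column-name symbols, so ordinary Enumerable methods should apply. The following is a small sketch under that assumption; it uses literal sample rows copied from the output above so it runs without a Hive server, and cities_by_country is a hypothetical helper name.

```ruby
# Hypothetical helper (not part of RBHive): group city names by country.
# Assumes each row responds to [] with :city and :country keys, as in
# the sample output above.
def cities_by_country(rows)
  rows.group_by { |row| row[:country] }
      .map { |country, group| [country, group.map { |row| row[:city] }] }
      .to_h
end

# Sample rows copied from the example output; a real run would use the
# value returned by connection.fetch instead.
rows = [
  { :city => "London",   :country => "UK" },
  { :city => "Mumbai",   :country => "India" },
  { :city => "New York", :country => "USA" }
]

cities_by_country(rows)
# returns {"UK" => ["London"], "India" => ["Mumbai"], "USA" => ["New York"]}
```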
To execute a statement that returns no results with Hiveserver:
RBHive.connect('hive.server.address') do |connection|
  connection.execute 'DROP TABLE cities'
end
➔ nil
Or with Hiveserver2:
RBHive.tcli_connect('hive.server.address') do |connection|
  connection.execute 'DROP TABLE cities'
end
➔ nil
To create a table, first define its schema:
table = TableSchema.new('person', 'List of people that owe me money') do
  column 'name', :string, 'Full name of debtor'
  column 'address', :string, 'Address of debtor'
  column 'amount', :float, 'The amount of money borrowed'
  partition 'dated', :string, 'The date money was given'
  partition 'country', :string, 'The country the person resides in'
end
Then for Hiveserver:
RBHive.connect('hive.server.address', 10_000) do |connection|
  connection.create_table(table)
end
Or Hiveserver2:
RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
  connection.create_table(table)
end
To change the columns of an existing table, define the updated schema:
table = TableSchema.new('person', 'List of people that owe me money') do
  column 'name', :string, 'Full name of debtor'
  column 'address', :string, 'Address of debtor'
  column 'amount', :float, 'The amount of money borrowed'
  column 'new_amount', :float, 'The new amount this person somehow convinced me to give them'
  partition 'dated', :string, 'The date money was given'
  partition 'country', :string, 'The country the person resides in'
end
Then for Hiveserver:
RBHive.connect('hive.server.address') do |connection|
  connection.replace_columns(table)
end
Or Hiveserver2:
RBHive.tcli_connect('hive.server.address') do |connection|
  connection.replace_columns(table)
end
You can set various properties for Hive tasks, some of which change how they run. Consult the Apache Hive documentation and Hadoop's documentation for the various properties that can be set. For example, you can set the map-reduce job's priority with the following:
connection.set("mapred.job.priority", "VERY_HIGH")
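A property is normally set on the same connection that then runs the query. The sketch below wraps the two calls in a helper that takes the connection yielded by RBHive.connect or RBHive.tcli_connect; fetch_with_priority is a hypothetical name, and the only connection methods it relies on are set and fetch, both shown elsewhere in this document.

```ruby
# Hypothetical helper (not part of RBHive): set the map-reduce job
# priority on a connection, then run a query on that same connection.
# `connection` is assumed to be the object yielded by RBHive.connect
# or RBHive.tcli_connect.
def fetch_with_priority(connection, query, priority = "VERY_HIGH")
  connection.set("mapred.job.priority", priority)
  connection.fetch(query)
end

# Usage against a live server (requires the rbhive gem and a reachable
# Hive Thrift service, so it is left commented out here):
#
#   RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
#     fetch_with_priority(connection, 'SELECT city, country FROM cities')
#   end
```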
To inspect the columns of a result set with Hiveserver:
RBHive.connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
  result = connection.fetch("describe some_table")
  puts result.column_names.inspect
  puts result.first.inspect
}
Or with Hiveserver2:
RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
  result = connection.fetch("describe some_table")
  puts result.column_names.inspect
  puts result.first.inspect
}
We use RBHive against Hive 0.10, 0.11 and 0.12, and have tested the BufferedTransport and HTTPClientTransport. We use it against both Hiveserver and Hiveserver2 with success.
We have not tested the SaslClientTransport, and would welcome reports on whether it works correctly.
We welcome contributions, issues and pull requests. If there's a feature missing in RBHive that you need, or you think you've found a bug, please do not hesitate to create an issue.
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request