Tuesday 24 October 2017

Apache Hive Deep dive

Apache Hive


Problem : get complete information about table in Hive
describe extended <table-name>


Example of a hive Table


 CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>);


Problem : Hive default record and field delimiters

\n For text files, each line is a record, so the line feed character separates records.

^A ("control" A) Separates all fields (columns). 
Written using the octal code \001 when explicitly specified in CREATE TABLE statements.

^B Separate the elements in an ARRAY or STRUCT, or the key-value pairs in a MAP. 
Written using the octal code \002 when explicitly specified in CREATE TABLE statements.

^C Separate the key from the corresponding value in MAP key-value pairs. 

Written using the octal code \003 when explicitly specified in CREATE TABLE statements.

Data example : John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600

Example of a hive table on how to override a delimiter
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Example Avro


CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='http://schema_provider/kst.avsc')
STORED AS
INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';



Example to Add new partition : note this does not move the old data


ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'

Example of table generating functions


Return typeSignatureDescription
N rows
explode(array)
Return 0 to many rows, one row for each element from the input array.
N rows
explode(map)
(v0.8.0 and later) Return 0 to many rows, one row for each map key-value pair, with a field for each map key and a field for the map value.
tuple
json_tuple(jsonStr, p1, p2, …, pn)
Like get_json_object, but it takes multiple names and returns a tuple. All the input parameters and output column types are STRING.
tuple
parse_url_tuple(url, partname1, partname2, …, partnameN) where N >= 1
Extract N parts from a URL. It takes a URL and the partnames to extract, returning a tuple. All the input parameters and output column types are STRING. The valid partnames are case-sensitive and should only contain a minimum of white space: HOSTPATHQUERYREFPROTOCOLAUTHORITYFILEUSERINFOQUERY:<KEY_NAME>.
N rows
stack(n, col1, …, colM)
Convert M columns into N rows of size M/N each.