Wednesday, April 3, 2013

How to use a Hive Avro Serde in distributed mode

Lately, our team tried to optimize our data by switching to an Avro SerDe with binary encoding.
Unfortunately, on CDH4 we would run into errors after creating the table.
Here is the Hive DDL that we used:

CREATE EXTERNAL TABLE avro_test
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  '/kafka/avro/topic_avro2/hourly/<date>/'
TBLPROPERTIES (
  'avro.schema.url'='file:///home/ubuntu/Message.avsc');
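Creating the table itself went through fine, since that is a metadata-only operation. A quick way to check that Hive picked up the columns from the Avro schema is a describe (table name from the DDL above; the exact output depends on your schema):

DESCRIBE avro_test;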

But querying the data didn't work:

hive>
    > select count(*) from avro_test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201303291126_0045, Tracking URL = ..
Kill Command = /home/ubuntu/cdh4/hadoop-2.0.0-mr1-cdh4.2.0/bin/hadoop job  -kill job_201303291126_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-03-29 20:50:47,871 Stage-1 map = 0%,  reduce = 0%
2013-03-29 20:51:22,151 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201303291126_0045 with errors
Error during job, obtaining debugging information...
Job Tracking URL: ..
Examining task ID: task_201303291126_0045_m_000002 (and more) from job job_201303291126_0045
Task with the most failures(4):
-----
Task ID:
  task_201303291126_0045_m_000000
URL:..
-----
Diagnostic Messages for this Task:
java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:369)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:316)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:430)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:540)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:395)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
The likely culprit: 'avro.schema.url' points at a file:// path that only exists on the machine where the table was created, so the map tasks running on other nodes cannot read the schema. Apparently the fix is to provide the schema inline instead:
CREATE EXTERNAL TABLE avro_topic
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  '/kafka/avro/topic_avro/hourly/2013/04/01/06'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "namespace": "com.test.beans",
    "type": "record",
    "name": "Message",
    "doc": "Logs for not so important stuff.",
    "fields": [
      {"name": "id", "type": "long", "default": 0},
      {"name": "logTime", "type": "long", "default": 0},
      {"name": "muchoStuff", "type": {"type": "map", "values": "string"}, "default": null}
    ]
  }');
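With the inline schema, the schema travels with the table metadata instead of living on one machine's disk, so the same kind of query should now run to completion. A minimal sanity check (the count will of course depend on your data):

select count(*) from avro_topic;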

2 comments:

  1. You can also specify a schema file using the following syntax:

    create external table avro_test
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    WITH SERDEPROPERTIES (
    'avro.schema.url'='file:///home/ubuntu/Message.avsc')
    STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION
    '/kafka/avro/topic_avro2/hourly//';
