Sunday, May 5, 2013
Friday, April 19, 2013
My Note on Solutions.: Cassandra DSE, Testing Solr integration
My Note on Solutions.: Cassandra DSE, Testing Solr integration
everything worked for me also, except i got
an error on the Solr search UI when querying saying"HTTP Status 500 - Unavailible shards"
i had to do a
to make things work ..
Anybody knows how to index data directly in Solr that is *not* first entered/created in Cassandra ?
It seems like it is possible : " If you HTTP post the files to a non-existing column keyspace or column family, DSE Search creates the keyspace and column family, and then starts indexing the data. F"
from http://www.datastax.com/docs/datastax_enterprise2.0/search/dse_search_about
everything worked for me also, except i got
an error on the Solr search UI when querying saying"HTTP Status 500 - Unavailible shards"
i had to do a
'update keyspace WITH placement_strategy = 'NetworkTopologyStrategy' and strategy_options=[{Solr:1}]; '
to make things work ..
Anybody knows how to index data directly in Solr that is *not* first entered/created in Cassandra ?
It seems like it is possible : " If you HTTP post the files to a non-existing column keyspace or column family, DSE Search creates the keyspace and column family, and then starts indexing the data. F"
from http://www.datastax.com/docs/datastax_enterprise2.0/search/dse_search_about
Monday, April 15, 2013
Using secondary indexes in Cassandra CQL: need to use an indexed column in your query
Say, i create this table:
Bad Request: No indexed columns present in by-columns clause with Equal operator
Perhaps you meant to use CQL 2? Try using the -2 option when starting cqlsh.
cqlsh:Keyspace2> select * from users where birth_year > 1960 and user_name 'jsmith';
Bad Request: line 1:48 no viable alternative at input 'user_name'
cqlsh:Keyspace2> select * from users where birth_year > 1960 and user_name = 'jsmith';
Bad Request: No indexed columns present in by-columns clause with Equal operator
Perhaps you meant to use CQL 2? Try using the -2 option when starting cqlsh.
Unless i use an indexed column in the query (not necessarily the primary indexed one):
I can create an index and query:cqlsh:Keyspace2> CREATE TABLE users ( user_name varchar PRIMARY KEY, password varchar, gender varchar, session_token varchar, state varchar, birth_year bigint );cqlsh:Keyspace2> INSERT INTO users... (user_name, password)... VALUES ('jsmith', 'ch@ngem3a');cqlsh:Keyspace2> INSERT INTO users (user_name, password) VALUES ('jsmith2', 'ch@ngem3a2');cqlsh:Keyspace2> create index on users (password);
cqlsh:Keyspace2> select * from users where password = 'ch@ngem3a2' and user_name = 'jsmith2';
user_name | birth_year | gender | password | session_token | state
-----------+------------+--------+------------+---------------+-------
jsmith2 | 1963 | null | ch@ngem3a2 | null | null
but this doesnt work:cqlsh:Keyspace2> select * from users where birth_year > 1960;
Bad Request: No indexed columns present in by-columns clause with Equal operator
Perhaps you meant to use CQL 2? Try using the -2 option when starting cqlsh.
cqlsh:Keyspace2> select * from users where birth_year > 1960 and user_name 'jsmith';
Bad Request: line 1:48 no viable alternative at input 'user_name'
cqlsh:Keyspace2> select * from users where birth_year > 1960 and user_name = 'jsmith';
Bad Request: No indexed columns present in by-columns clause with Equal operator
Perhaps you meant to use CQL 2? Try using the -2 option when starting cqlsh.
Unless i use an indexed column in the query (not necessarily the primary indexed one):
cqlsh:Keyspace2> select * from users where birth_year > 1960 and password = 'ch@ngem3a2';
user_name | birth_year | gender | password | session_token | state
-----------+------------+--------+------------+---------------+-------
jsmith2 | 1963 | null | ch@ngem3a2 | null | null
Trying Cassandra CQL
Here is how to create a new column on the fly in Cassandra's CQL2:
cqlsh:Keyspace2> select * from users;
user_name | birth_year | gender | password | session_token | state-----------+------------+--------+------------+---------------+-------jsmith3 | null | null | 200 | null | nulljsmith | 1968 | null | ch@ngem3a | null | nulljsmith2 | 1963 | null | ch@ngem3a2 | null | null
cqlsh:Keyspace1> insert into users (KEY, x) values ('jsmith',100);
cqlsh:Keyspace1> select * from users;
KEY,TEST | birth_year,1968 | gender,m
KEY,TEST1 | birth_year,1968 | gender,f
KEY,jsmith | x,100
Thursday, April 4, 2013
How to test Cassandra
Datastax's Cassandra comes with a stress tester, that can generate data for you.
It is all built in with different kind of parameters.
Here i am generating 1 M rows with 10 columns, readable, random values (up to 1000 values) that are indexed:
It is all built in with different kind of parameters.
Here i am generating 1 M rows with 10 columns, readable, random values (up to 1000 values) that are indexed:
cassandra-stress -o INSERT -n 1000000 -c 10 -U UTF8Type -C 1000 --create-index=KEYS -r
Wednesday, April 3, 2013
How to use a Hive Avro Serde in distributed mode
Lately with our team, we tried to optimize our data by using an Avro Serde (with a binary encoding).
Unfortunately on CDH4, we would run into errors after creating the table;
Here is the Hive schema that we used:
But this didn't work :
Unfortunately on CDH4, we would run into errors after creating the table;
Here is the Hive schema that we used:
Create external table avro_testROW FORMAT SERDE'org.apache.hadoop.hive.serde2.avro.AvroSerDe'STORED AS INPUTFORMAT'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'OUTPUTFORMAT'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'LOCATION'/kafka/avro/topic_avro2/hourly/<date>/' TBLPROPERTIES ('avro.schema.url'='file:///home/ubuntu/Message.avsc') ;
But this didn't work :
hive>Well apparently the fix is to provide an inline schema instead:
> select count(*) from avro_test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201303291126_0045, Tracking URL = ..Kill Command = /home/ubuntu/cdh4/hadoop-2.0.0-mr1-cdh4.2.0/bin/hadoop job -kill job_201303291126_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-03-29 20:50:47,871 Stage-1 map = 0%, reduce = 0%
2013-03-29 20:51:22,151 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303291126_0045 with errors
Error during job, obtaining debugging information...
Job Tracking URL: ..Examining task ID: task_201303291126_0045_m_000002 (and more) from job job_201303291126_0045
Task with the most failures(4):
-----
Task ID:
task_201303291126_0045_m_000000
URL:..
-----
Diagnostic Messages for this Task:
java.io.IOException: java.lang.reflect.InvocationTargetExceptionat org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:369)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:316)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:430)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:540)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:395)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native
CREATE EXTERNAL TABLE avro_topicROW FORMAT SERDE'org.apache.hadoop.hive.serde2.avro.AvroSerDe'STORED AS INPUTFORMAT'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'OUTPUTFORMAT'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'LOCATION'/kafka/avro/topic_avro/hourly/2013/04/01/06'TBLPROPERTIES ('avro.schema.literal'='{"namespace": "com.test.beans","type": "record","name": "Message","doc": "Logs for not so important stuff.","fields": [{"name": "id","type": "long","default":0},{"name": "logTime","type": "long","default":0},{"name": "muchoStuff","type": {"type": "map", "values": "string"},"default":null}]}');
How to recover from errors in Hive interactive mode
Lately, working with a team offshore on the same instance of Hive, I would get errors like:
hive> show tables;
FAILED: Error in metadata: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> select count(*) from json_topic1; FAILED: SemanticException Unable to fetch table json_topic1
hive>
I was told that generally we get this exception if we the hive console not terminated properly.
The fix:
Run the jps command, look for "RunJar" process and kill it using
kill -9 <RunJarprocessId> command