My Notes: April 2016

Saturday, April 23, 2016

Apache Drill - Hive

Drill can be used to query Hive tables. Note that Drill does not use MapReduce, because it is built for interactive queries rather than batch processing.

Drill currently does not support writing Hive tables. The approach I would use is to have Drill write files and read the files via Hive External tables.
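
A minimal sketch of that approach, assuming the default writable dfs.tmp workspace (which points at /tmp in a stock install) and Drill's bundled cp.`employee.json` sample. The table name employee_names is hypothetical, and the Hive LOCATION has to match wherever that workspace actually lives on your file system.

```sql
-- In Drill: write the query result as Parquet files under the dfs.tmp workspace.
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`employee_names` AS
SELECT employee_id, full_name
FROM cp.`employee.json`;

-- In Hive: expose the directory Drill wrote as an external table.
-- The column types here are assumed to match what Drill wrote.
CREATE EXTERNAL TABLE employee_names (
  employee_id BIGINT,
  full_name   STRING
)
STORED AS PARQUET
LOCATION '/tmp/employee_names';
```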

Drill supports ANSI SQL and the basic data types. Keep in mind that Drill does not support the following Hive types:

LIST
MAP
STRUCT
TIMESTAMP (Unix Epoch format)
UNION

If the DECIMAL data type is disabled in Drill, you can enable it by setting the planner.enable_decimal_data_type option to true.
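
For example, planner.enable_decimal_data_type can be set from the Drill shell either for the whole system or only for the current session (a sketch; use whichever scope fits):

```sql
-- Enable DECIMAL support for all sessions on the cluster.
ALTER SYSTEM SET `planner.enable_decimal_data_type` = true;

-- Or enable it only for the current session.
ALTER SESSION SET `planner.enable_decimal_data_type` = true;
```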

How to access Hive tables in Drill: an example

1. Create an external table in Hive, stored as a text file

hive> CREATE EXTERNAL TABLE types_demo ( 
      a bigint, 
      b boolean, 
      c DECIMAL(3, 2), 
      d double, 
      e float, 
      f INT, 
      g VARCHAR(64), 
      h date,
      i timestamp
      ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
      LINES TERMINATED BY '\n' 
      STORED AS TEXTFILE LOCATION '/mapr/demo.mapr.com/data/mytypes.csv';

2. Configure the Hive storage plugin (it connects Drill to the Hive metastore that describes the data)

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=../sample-data/drill_hive_db;create=true",
    "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}

3. Access the Hive table in Drill as below

0: jdbc:drill:> use hive;

0: jdbc:drill:> SELECT * FROM hive.`types_demo`;
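
Before running a full SELECT *, it can help to check what Drill actually sees through the plugin. A short sketch, assuming the plugin is registered under the name hive and the types_demo table from step 1 exists:

```sql
-- List the tables Drill can see through the Hive plugin.
USE hive;
SHOW TABLES;

-- Inspect column names and types as Drill maps them.
DESCRIBE types_demo;

-- Select specific columns instead of *.
SELECT a, c, g, h FROM hive.`types_demo` LIMIT 10;
```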

Saturday, April 9, 2016

Apache Drill

What is it : Drill is designed to be a distributed SQL query engine. You can compare it with Spark SQL: Spark contains many subprojects, and the piece that directly compares with Drill is Spark SQL. If you need to perform complex math, statistics, or machine learning, Apache Spark is a good place to start.

SQL vs SQL like : Drill supports ANSI SQL:2003.

Data Formats : Apache Drill can discover the schema on the fly as you query the data. It can integrate with several data sources such as Hive, HBase, MongoDB, the file system, and RDBMSs. Input formats such as Avro, CSV, TSV, PSV, Parquet, Hadoop sequence files, and many others can be used in Drill with ease. The best format for Drill is Parquet.
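
As a small illustration of schema discovery, the query below reads a JSON file with no table definition at all. It uses the cp.`employee.json` sample that ships on Drill's classpath; the Parquet path in the second query is a hypothetical placeholder.

```sql
-- No CREATE TABLE is needed: Drill reads the JSON file and infers the columns.
SELECT full_name, position_title, salary
FROM cp.`employee.json`
LIMIT 3;

-- The same pattern works for a file on the file system (hypothetical path).
SELECT * FROM dfs.`/path/to/data.parquet` LIMIT 5;
```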

Access : There are multiple ways to access Drill: the Drill shell, the web interface, the REST interface, or the JDBC/ODBC drivers.

Security : Views can aggregate data from several sources and hide the underlying complexity of those sources. Security through impersonation builds on views: together, impersonation and views offer fine-grained security at the file level. A data owner can create a view that selects a limited set of data from the raw data source, then grant file system privileges that let other users run the view and query the underlying data without being able to read the underlying file directly.
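
A rough sketch of that pattern, assuming impersonation is enabled and dfs.tmp is a writable workspace. The file path, view name, and column names are hypothetical. Drill stores the view as a .view.drill file in the workspace, so the file system permissions on that file decide who may use it.

```sql
-- Run as the data owner: the raw file stays readable only by its owner.
CREATE VIEW dfs.tmp.`orders_summary` AS
SELECT order_id, order_total
FROM dfs.`/data/raw/orders.parquet`;

-- Run as another user: query the view, not the file.
SELECT * FROM dfs.tmp.`orders_summary` LIMIT 10;
```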

The table below describes, at a high level, some of the key considerations for picking the right "SQL-on-Hadoop" technology:

		Drill	Hive	Impala	Spark SQL
Key Use Cases		Self-service data exploration, interactive BI / ad-hoc queries	Batch / ETL / long-running jobs	Interactive BI / ad-hoc queries	SQL as part of Spark pipelines / advanced analytic workflows
Data Sources	Files Support	Parquet, JSON, Text, all Hive file formats	Yes (all Hive file formats)	Yes (Parquet, Sequence, RC, Text, AVRO ...)	Parquet, JSON, Text, all Hive file formats
Data Sources	HBase/MapR-DB	Yes	Yes	Yes	Yes
	Beyond Hadoop	Yes	No	No	Yes
Data Types	Relational	Yes	Yes	Yes	Yes
Data Types	Complex/Nested	Yes	Limited	No	Limited
Metadata	Schema-less/Dynamic schema	Yes	No	No	Limited
Metadata	Hive Meta store	Yes	Yes	Yes	Yes
SQL / BI tools	SQL support	ANSI SQL	HiveQL	HiveQL	ANSI SQL (limited) & HiveQL
	Client support	ODBC/JDBC	ODBC/JDBC	ODBC/JDBC	ODBC/JDBC
	Beyond Memory	Yes	Yes	Yes	Yes
	Optimizer	Limited	Limited	Limited	Limited
Platform	Latency	Low	Medium	Low	Low (in-memory) / Medium
Platform	Concurrency	High	Medium	High	Medium
Decentralized Granular Security		Yes	No	No	No

Functioning :

Drillbit is Apache Drill’s daemon that runs on each node in the cluster. It uses ZooKeeper to coordinate the cluster and maintain cluster membership. It is responsible for accepting requests from the client, processing the queries, and returning results to the client. The drillbit that receives a request from the client is called the ‘Foreman’. It generates the execution plan and sends the execution fragments to the other drillbits running in the cluster.

You may want to edit conf/drill-env.sh for the memory settings and conf/drill-override.conf for the cluster ID and the ZooKeeper host and port.
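
For reference, a conf/drill-override.conf along these lines sets both values. The cluster ID and the ZooKeeper address shown are placeholders for a single-node setup and should be replaced with your own.

```
drill.exec: {
  # Every drillbit in the same cluster must use the same cluster-id.
  cluster-id: "drillbits1",
  # Comma-separated host:port list of the ZooKeeper quorum.
  zk.connect: "localhost:2181"
}
```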
