Saturday, April 9, 2016

Apache Drill


What is it : Drill is a distributed SQL query engine. Its closest counterpart in the Spark ecosystem is Spark SQL: Spark contains many sub-projects, and Spark SQL is the piece that compares directly with Drill. If you need to perform complex math, statistics, or machine learning, Apache Spark is a good place to start.

SQL vs SQL like : Drill supports ANSI SQL:2003.

Data Formats : A key feature of Apache Drill is that it can discover the schema on the fly as you query the data. It can integrate with several data sources such as Hive, HBase, MongoDB, the file system, and RDBMSs. Input formats such as Avro, CSV, TSV, PSV, Parquet, Hadoop sequence files, and many others can be used in Drill with ease. The best format for Drill is Parquet.
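As a minimal sketch of what "schema on the fly" means in practice, the helper below builds a Drill query that reads a file in place, with no table definition beforehand. The helper name file_query, the path /data/people.json, and the column names are hypothetical; dfs is the name Drill commonly gives its file system storage plugin.

```python
def file_query(path, columns=("*",), storage="dfs", limit=None):
    """Build a Drill SELECT that reads a file directly (no CREATE TABLE step)."""
    # Drill quotes identifiers, including file paths, with backticks.
    table = f"{storage}.`{path}`"
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


# Hypothetical file and columns, for illustration only.
print(file_query("/data/people.json", columns=("name", "age"), limit=10))
```

The same statement shape works for a CSV or Parquet file; only the path changes.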

Access : There are multiple options for accessing Drill: the Drill shell, the web interface, the REST interface, or JDBC/ODBC drivers.
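For the REST route, a query is an HTTP POST with a small JSON body. The sketch below only prepares the request and does not send it. It assumes a drillbit on localhost with the web port at its usual default of 8047; the helper name build_drill_request is hypothetical, and cp.`employee.json` is the sample data set bundled on Drill's classpath.

```python
import json
import urllib.request


def build_drill_request(sql, host="localhost", port=8047):
    """Prepare (but do not send) a POST to Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_drill_request("SELECT * FROM cp.`employee.json` LIMIT 5")
print(req.get_method(), req.full_url)

# Against a live drillbit (assumed to be running), the request could be sent with:
# with urllib.request.urlopen(req) as resp:
#     rows = json.load(resp)["rows"]
```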

Security : Views can aggregate data from several sources and hide the underlying complexity of those sources. Security through impersonation builds on views: together, impersonation and views offer fine-grained security at the file level. A data owner can create a view that selects a limited subset of the raw data source, then grant file system privileges that let other users execute the view. Those users can query the underlying data through the view without being able to read the underlying file directly.
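A rough sketch of the data owner's side of this: generate the CREATE VIEW statement that exposes only selected columns and rows of a raw file. The helper name create_view_sql, the view name, the file path, the columns, and the filter are all hypothetical; dfs.tmp is assumed to be a writable workspace. Granting access then happens in the file system, on the view file Drill writes into that workspace, with impersonation enabled in the Drill configuration.

```python
def create_view_sql(view, source_path, columns, where=None,
                    storage="dfs", workspace="tmp"):
    """Build a CREATE OR REPLACE VIEW statement over a raw file."""
    source = f"{storage}.`{source_path}`"
    sql = (
        f"CREATE OR REPLACE VIEW {storage}.{workspace}.`{view}` AS "
        f"SELECT {', '.join(columns)} FROM {source}"
    )
    if where:
        sql += f" WHERE {where}"
    return sql


# Hypothetical example: expose two columns of one region's orders.
print(create_view_sql(
    "orders_emea",
    "/raw/orders.parquet",
    columns=("order_id", "amount"),
    where="region = 'EMEA'",
))
```

Users who can execute the view see only order_id and amount for the filtered rows, never the raw file.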

The table below describes, at a high level, some of the key considerations for picking the right "SQL-on-Hadoop" technology:



| Category | Feature | Drill | Hive | Impala | Spark SQL |
|---|---|---|---|---|---|
| Key Use Cases | | Self-service Data Exploration, Interactive BI / Ad-hoc queries | Batch / ETL / Long-running jobs | Interactive BI / Ad-hoc queries | SQL as part of Spark pipelines / Advanced analytic workflows |
| Data Sources | Files Support | Parquet, JSON, Text, all Hive file formats | Yes (all Hive file formats) | Yes (Parquet, Sequence, RC, Text, AVRO ...) | Parquet, JSON, Text, all Hive file formats |
| Data Sources | HBase/MapR-DB | Yes | Yes | Yes | Yes |
| Data Sources | Beyond Hadoop | Yes | No | No | Yes |
| Data Types | Relational | Yes | Yes | Yes | Yes |
| Data Types | Complex/Nested | Yes | Limited | No | Limited |
| Metadata | Schema-less/Dynamic schema | Yes | No | No | Limited |
| Metadata | Hive Metastore | Yes | Yes | Yes | Yes |
| SQL / BI tools | SQL support | ANSI SQL | HiveQL | HiveQL | ANSI SQL (limited) & HiveQL |
| SQL / BI tools | Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC |
| SQL / BI tools | Beyond Memory | Yes | Yes | Yes | Yes |
| SQL / BI tools | Optimizer | Limited | Limited | Limited | Limited |
| Platform | Latency | Low | Medium | Low | Low (in-memory) / Medium |
| Platform | Concurrency | High | Medium | High | Medium |
| Platform | Decentralized Granular Security | Yes | No | No | No |



Functioning :

The drillbit is Apache Drill's daemon, and it runs on each node in the cluster. It uses ZooKeeper to coordinate with the other drillbits and to maintain cluster membership. A drillbit is responsible for accepting requests from the client, processing the queries, and returning results to the client. The drillbit that receives a request from the client is called the "foreman" for that query. The foreman generates the execution plan and sends the execution fragments to the other drillbits running in the cluster.
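The foreman's role can be pictured with a toy model. This is not Drill's actual planner or scheduler, which weighs factors such as data locality; it is only a round-robin illustration of one drillbit handing plan fragments to its peers. The node names, fragment names, and the function assign_fragments are all made up for the example.

```python
from itertools import cycle


def assign_fragments(fragments, drillbits):
    """Toy round-robin assignment of plan fragments to drillbits."""
    assignment = {bit: [] for bit in drillbits}
    for fragment, bit in zip(fragments, cycle(drillbits)):
        assignment[bit].append(fragment)
    return assignment


# Hypothetical cluster members, as they might be registered in ZooKeeper.
drillbits = ["node1", "node2", "node3"]
# The foreman is simply whichever drillbit the client connected to.
foreman = drillbits[0]
# Hypothetical fragments of one execution plan.
plan = ["scan-0", "scan-1", "scan-2", "aggregate-0"]

print("foreman:", foreman)
print(assign_fragments(plan, drillbits))
```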

[Figure: Drillbits in an Apache Drill cluster]


To configure a node, edit conf/drill-env.sh for the memory settings and conf/drill-override.conf for the cluster id and the ZooKeeper host and port.
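As a rough illustration, conf/drill-override.conf for a small cluster might look like the fragment below. The cluster id and the ZooKeeper host names are placeholder values (assumptions), and 2181 is ZooKeeper's default client port. Every drillbit that should join the same cluster needs the same cluster-id and zk.connect values.

```
# conf/drill-override.conf (placeholder values)
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "zkhost1:2181,zkhost2:2181,zkhost3:2181"
}
```

The memory limits live in conf/drill-env.sh. The variable names there have changed between Drill releases, so check the comments in the file shipped with your version before editing.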



