Thursday, January 28, 2016

Data Lake

The data lake is gaining a lot of momentum these days. It is a powerful data architecture that leverages the economics of big data.

Visualize the data lake and its analytics as a hub-and-spoke architecture.


The hub of the “Hub and Spoke” architecture is the data lake.  The data lake has the following characteristics:
  • Centralized, singular, schema-less data store with raw (as-is) data as well as massaged data
  • Mechanism for rapid ingestion of data with appropriate latency
  • Ability to map data across sources and provide visibility and security to users
  • Catalog to find and retrieve data
  • Costing model of centralized service
  • Ability to manage security, permissions and data masking
  • Supports self-provisioning of compute nodes, data, and analytic tools without IT intervention
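
A few of the hub characteristics above (raw as-is storage, rapid ingestion, and a catalog to find data) can be illustrated with a minimal sketch. This is only an illustration that uses a local directory in place of a real store such as HDFS; the function names `ingest` and `find`, the `raw/<source>/` layout, and the `catalog.jsonl` file are assumptions made for this example, not part of any product.

```python
import hashlib
import json
import time
from pathlib import Path


def ingest(lake_root: Path, source: str, name: str, payload: bytes) -> dict:
    """Store the payload unchanged under raw/<source>/ and record it in the catalog."""
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # as-is: no parsing, no schema enforced

    entry = {
        "source": source,
        "path": target.relative_to(lake_root).as_posix(),
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": time.time(),
    }
    # Hypothetical append-only catalog: one JSON document per line.
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as catalog:
        catalog.write(json.dumps(entry) + "\n")
    return entry


def find(lake_root: Path, source: str) -> list[dict]:
    """Return the catalog entries recorded for one source system."""
    catalog_path = lake_root / "catalog.jsonl"
    if not catalog_path.exists():
        return []
    with catalog_path.open(encoding="utf-8") as catalog:
        entries = [json.loads(line) for line in catalog if line.strip()]
    return [entry for entry in entries if entry["source"] == source]
```

Calling `ingest(Path("/tmp/lake"), "crm", "customers.csv", data)` would write the bytes untouched and add one catalog line, and `find(Path("/tmp/lake"), "crm")` would list what has arrived from that source. Security, masking, and costing are left out of this sketch.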
 The spokes of the “Hub and Spoke” architecture are the resulting analytic use cases that have the following characteristics:
  • Ability to perform analytics
  • Analytics sandbox (HDFS, Hadoop, Spark, Hive, HBase)
  • Data engineering tools (MapReduce, YARN)
  • Analytical tools (SAS, R, Mahout, MADlib, H2O)
  • Visualization tools (Tableau)
  • Ability to exploit analytics (application development)
  • 3rd platform application (mobile app development, web site app development)
  • Analytics exposed as services to applications (APIs)
  • Integrate in-memory and/or in-database scoring and recommendations into business process and operational systems
Dos and Don’ts when designing a data lake
1. Don’t feed the data lake from the data warehouse

Loading data into a data warehouse means that someone has already made assumptions about what data, what level of granularity, and how much history are important. Valuable data can be filtered out during this process. Use the data lake as a repository to store any and all data (structured AND unstructured; internal AND external).

Benefits: fast data ingestion, and availability of data to data scientists for analysis.
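
The point about filtering can be shown with a small schema-on-read sketch. The events, field names, and the helper `read_with_schema` below are all made up for illustration; the idea is only that the raw records are kept whole and a schema is applied when the data is read, so a question nobody anticipated at load time can still be answered.

```python
import json

# Raw events stored exactly as received, one JSON document per line.
# The records and field names are hypothetical sample data.
RAW_EVENTS = """\
{"user": "u1", "action": "view", "sku": "A-1", "ms_on_page": 5400}
{"user": "u2", "action": "buy", "sku": "A-1", "amount": 19.5}
{"user": "u1", "action": "buy", "sku": "B-7", "amount": 5.0, "coupon": "X"}
"""


def read_with_schema(raw_text, fields):
    """Project each raw record onto the requested fields at read time."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Fields missing from a record come back as None instead of failing.
        rows.append({field: record.get(field) for field in fields})
    return rows


# A warehouse-style load might have kept only purchase amounts.
sales = [
    row
    for row in read_with_schema(RAW_EVENTS, ["user", "amount"])
    if row["amount"] is not None
]

# The raw copy still answers a later question, such as time spent on a page.
engagement = read_with_schema(RAW_EVENTS, ["user", "ms_on_page"])
```

If only the `sales` view had been loaded, the `ms_on_page` and `coupon` fields would be gone; keeping the raw events means they remain available.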

2. Try to create a single data lake for all your data

3. Key differences between a data warehouse and a data lake.


