Tuesday, September 8, 2015

Spark Introduction


Apache Spark is a tool for analyzing big data quickly and efficiently. It is a computing framework for distributed processing; big data computation is fundamentally about memory and computation (speed and time).

MapReduce started a decade ago. It is ideal for batch processing, but it does not fit many use cases, so a sprawl of specialized systems (interactive tools, streaming tools, etc.) grew up around it. Running multiple systems is not only difficult to manage, it also imposes many restrictions on tool support.

Spark combines batch processing, interactive queries, and a streaming API, and also supports machine learning algorithms.

Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.


The Spark project consists of multiple components.
Spark Core and Resilient Distributed Datasets
Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality. The fundamental programming abstraction is called Resilient Distributed Datasets (RDDs), a logical collection of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying transformations to existing RDDs.
The RDD abstraction is exposed through a language-integrated API in Java, Python, Scala, and R similar to local, in-process collections. This simplifies programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.
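The parallel with local collections can be made concrete with a minimal plain-Python sketch. This is an illustrative analogy only, not Spark's actual API or implementation (`MiniRDD` is a hypothetical name): transformations such as `map` and `filter` are merely recorded, and nothing is computed until an action such as `collect` is called, mirroring how Spark evaluates RDD pipelines lazily.

```python
# Conceptual sketch of an RDD-like abstraction in plain Python.
# Hypothetical names; this is NOT the real Spark API.

class MiniRDD:
    def __init__(self, data, transforms=None):
        self._data = data                      # the source data ("partitions")
        self._transforms = transforms or []    # lazily recorded transformations

    def map(self, f):
        # Transformations return a new dataset; nothing runs yet (laziness).
        return MiniRDD(self._data, self._transforms + [("map", f)])

    def filter(self, pred):
        return MiniRDD(self._data, self._transforms + [("filter", pred)])

    def collect(self):
        # An action forces evaluation of the whole recorded pipeline.
        out = list(self._data)
        for kind, f in self._transforms:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = MiniRDD(range(1, 6))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 1).collect()
print(result)  # [1, 9, 25]
```

In real Spark the same chain (`rdd.map(...).filter(...).collect()`) runs distributed across a cluster, but the program reads just like code over a local collection, which is the point of the abstraction.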
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language to manipulate DataFrames in Scala, Java, or Python. It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server. Prior to version 1.3 of Spark, DataFrames were referred to as SchemaRDDs.
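The DataFrame idea, rows of structured data queried through a small domain-specific language, can be sketched in plain Python over a list of dicts. The `where` and `select` helpers here are hypothetical stand-ins for the real Spark SQL operations (`df.filter`, `df.select`), shown only to convey the flavor of the API:

```python
# Illustrative sketch of DataFrame-style select/filter over structured rows.
# Plain-Python stand-in; the real Spark SQL API differs.

rows = [
    {"name": "alice", "age": 34},
    {"name": "bob",   "age": 19},
    {"name": "carol", "age": 45},
]

def where(rows, pred):
    # Analogous to df.filter(...) / SQL WHERE
    return [r for r in rows if pred(r)]

def select(rows, *cols):
    # Analogous to df.select(...) / SQL SELECT
    return [{c: r[c] for c in cols} for r in rows]

adults = select(where(rows, lambda r: r["age"] >= 21), "name")
print(adults)  # [{'name': 'alice'}, {'name': 'carol'}]
```

Because the data carries a schema (known column names and types), Spark SQL can run the equivalent query either through this kind of method chain or as a literal SQL string.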

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, on a single engine.
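The mini-batch design can be sketched in plain Python (a conceptual analogy, not the Spark Streaming API): a stream is chopped into small batches, and the very same function written for batch analytics is applied to each one.

```python
# Sketch of mini-batch streaming: the same "batch" function is applied
# to each small chunk of an incoming stream. Plain Python, not Spark.

from collections import Counter

def word_count(lines):
    # Ordinary batch logic: count words in a collection of lines.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def mini_batches(stream, batch_size):
    # Group the incoming stream into fixed-size mini-batches.
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

incoming = ["spark streaming", "hello spark", "hello world"]
totals = Counter()
for batch in mini_batches(incoming, batch_size=2):
    totals.update(word_count(batch))   # identical code to the batch path
print(totals["spark"])  # 2
```

In Spark Streaming the chunking is driven by a batch interval (e.g. every second) and each mini-batch becomes an RDD, but the key property illustrated here is the same: one set of application code serves both batch and streaming analytics.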

MLlib Machine Learning Library
Spark MLlib is a distributed machine learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, and before Mahout itself gained a Spark interface). Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including:
  1. summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
  2. classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
  3. collaborative filtering techniques including alternating least squares (ALS)
  4. cluster analysis methods including k-means, and Latent Dirichlet Allocation (LDA)
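To make one item on the list concrete, here is a single iteration of k-means clustering (assign points to the nearest centroid, then recompute centroids) on one-dimensional data in plain Python. This is a conceptual sketch only; MLlib's implementation distributes these steps across a cluster and iterates until convergence.

```python
# One k-means iteration (assign, then recompute centroids) in plain Python.
# Conceptual only; MLlib parallelizes this across a cluster.

def assign(points, centroids):
    # Assign each point to the index of its nearest centroid.
    return [min(range(len(centroids)),
                key=lambda i: (p - centroids[i]) ** 2)
            for p in points]

def update(points, labels, k):
    # Recompute each centroid as the mean of its assigned points.
    centroids = []
    for i in range(k):
        members = [p for p, l in zip(points, labels) if l == i]
        centroids.append(sum(members) / len(members))
    return centroids

points = [1.0, 1.5, 0.5, 9.0, 9.5, 10.0]
centroids = [0.0, 5.0]
labels = assign(points, centroids)
centroids = update(points, labels, k=2)
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # [1.0, 9.5]
```

Repeating the assign/update pair until the labels stop changing yields the final clustering; because each pass re-scans the full dataset, keeping it in cluster memory (as Spark does) is what gives MLlib its speed advantage over disk-based engines.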
GraphX
GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computation that can model the Pregel abstraction. 
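The Pregel model, "think like a vertex", runs a graph computation as a series of supersteps in which each vertex updates its value from messages sent by its neighbors, stopping when nothing changes. A minimal plain-Python sketch of the idea (not the GraphX API; the function name is hypothetical) propagates the maximum value to every vertex:

```python
# Pregel-style computation in plain Python: in each superstep every vertex
# takes the max of its own value and its neighbors' values, repeating until
# no value changes, so all vertices converge to the graph-wide maximum.
# Conceptual sketch; GraphX's Pregel API operates on distributed graphs.

def pregel_max(values, edges):
    # values: {vertex: initial value}; edges: list of undirected (a, b) pairs
    changed = True
    while changed:
        changed = False
        new_values = dict(values)
        for a, b in edges:               # exchange "messages" along each edge
            if values[a] > new_values[b]:
                new_values[b] = values[a]
                changed = True
            if values[b] > new_values[a]:
                new_values[a] = values[b]
                changed = True
        values = new_values
    return values

values = {1: 3, 2: 7, 3: 1, 4: 5}
edges = [(1, 2), (2, 3), (3, 4)]
print(pregel_max(values, edges))  # {1: 7, 2: 7, 3: 7, 4: 7}
```

Algorithms such as PageRank and connected components fit the same superstep pattern, which is why GraphX exposes Pregel as its core iteration primitive.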

Like Spark, GraphX initially started as a research project at UC Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.


