Apache Spark is a tool for analyzing big data quickly and efficiently. It is a computing framework for distributed processing, and big data computation ultimately comes down to memory and compute, i.e. speed and time.
MapReduce
MapReduce started more than a decade ago. It is ideal for batch processing, but it did not fit many other use cases, which spawned many specialized systems (interactive tools, streaming tools, etc.) to handle them. Running multiple systems is not only difficult to manage, it also restricts tool support.
Spark combines batch processing, interactive queries, and streaming APIs in one engine, and also supports machine learning algorithms.
Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.
The Spark project
consists of multiple components.
Spark Core and Resilient Distributed Datasets
Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality. The fundamental programming abstraction is the Resilient Distributed Dataset (RDD), a logical collection of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying transformations to existing RDDs.
The RDD abstraction is exposed through a language-integrated API in Java, Python, Scala, and R, similar to local, in-process collections. This reduces programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.
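A minimal sketch of this in Scala: one RDD is created by referencing a dataset in external storage, and another is derived from it by a transformation. The application name and the HDFS path are illustrative placeholders, and a local master is assumed for testing.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD by referencing a dataset in external storage
    // ("hdfs:///data/events.log" is a hypothetical path).
    val lines = sc.textFile("hdfs:///data/events.log")

    // Derive a new RDD by applying a transformation to an existing one.
    val errors = lines.filter(_.contains("ERROR"))

    // cache() keeps the partitioned data in cluster memory,
    // so repeated queries over it avoid re-reading from disk.
    errors.cache()
    println(s"error lines: ${errors.count()}")

    sc.stop()
  }
}
```

Note how the filter call reads just like filtering a local Scala collection, which is the point of the language-integrated API.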
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language to manipulate DataFrames in Scala, Java, or Python. It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server. Prior to Spark 1.3, DataFrames were referred to as SchemaRDDs.
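A small sketch of both query styles, assuming the SparkSession entry point introduced in Spark 2.x; "people.json" is a hypothetical input file of semi-structured records.

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-example")
      .master("local[*]")
      .getOrCreate()

    // Load semi-structured JSON into a DataFrame; the schema is inferred.
    val people = spark.read.json("people.json")

    // Query through the domain-specific language...
    people.filter(people("age") > 21).show()

    // ...or through plain SQL against a temporary view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```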
Spark Streaming
Spark
Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD transformations on
those mini-batches of data. This design enables the same set of application
code written for batch analytics to be used in streaming analytics, on a single
engine.
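The classic illustration of this is a streaming word count, sketched below: each batch interval of socket input becomes an RDD, and the transformations applied to it are the same ones a batch job would use. The host, port, and batch interval are illustrative choices (for testing, `nc -lk 9999` can feed the socket).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing.
    val conf = new SparkConf().setAppName("streaming-example").setMaster("local[2]")

    // Every 5-second batch of input becomes an RDD (a mini-batch).
    val ssc = new StreamingContext(conf, Seconds(5))

    // Ingest lines of text from a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same RDD-style transformations used in batch code apply here.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```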
MLlib Machine Learning Library
Spark MLlib is a distributed machine learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, and before Mahout itself gained a Spark interface). Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines (see the sketch after this list), including:
- summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
- classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
- collaborative filtering techniques including alternating least squares (ALS)
- cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)
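As one example from this list, here is a minimal k-means sketch using the RDD-based MLlib API; "features.txt" is a hypothetical file of whitespace-separated numeric features, and the cluster count and iteration cap are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("kmeans-example").setMaster("local[*]"))

    // Parse each line of numbers into an MLlib dense vector.
    val data = sc.textFile("features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache() // iterative algorithms re-read the data, so keep it in memory

    // Cluster into 3 groups, with at most 20 iterations.
    val model = KMeans.train(data, 3, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```

The cache() call is where the memory-based speedup described above comes from: each k-means iteration scans the same dataset, which would otherwise hit disk every time.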
GraphX
GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computation that can model the Pregel abstraction. Like Spark, GraphX initially started as a research project at UC Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.
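A small sketch of the GraphX API: a property graph is built from vertex and edge RDDs, and PageRank (which GraphX implements on top of its Pregel-style machinery) is run over it. The vertex names, edge labels, and tolerance value are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-example").setMaster("local[*]"))

    // Vertices are (id, attribute) pairs; edges carry their own attribute.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph = Graph(vertices, edges)

    // Run PageRank until ranks change by less than the given tolerance.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach(println)

    sc.stop()
  }
}
```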