MLlib Machine Learning Library
Spark MLlib is a distributed machine learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, and before Mahout itself gained a Spark interface). Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including:
- summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
- classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
- collaborative filtering techniques including alternating least squares (ALS)
- cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)
Machine Learning (definition) - constructing and studying methods that learn from and make predictions on data.
Terminology
Observations (data points) - items or entities used for learning or evaluation.
Features - attributes used to represent an observation.
Labels - values assigned to an observation.
Training and test data - observations used to train or evaluate a learning algorithm.
So if we take an email as our observation, the features would be the date, the importance, and key words in the subject or body of the email; the labels would be spam or not-spam; and the training and test data would be sets of emails.
Supervised learning - learning from labeled observations; examples are classification and regression.
Unsupervised learning - learning from unlabeled observations; examples are clustering and dimensionality reduction.
Flow: raw data -> feature extraction -> supervised learning -> evaluation -> (if satisfied) prediction
MLlib consists of two packages.
- spark.mllib
- spark.ml
When using pyspark, you'll find them in the pyspark.mllib and pyspark.ml packages, respectively.
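For example, the same algorithm family lives under two different namespaces (logistic regression is just a representative pick here):

```python
# RDD-based API (the original MLlib)
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# DataFrame-based API (the newer spark.ml)
from pyspark.ml.classification import LogisticRegression
```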
spark.ml is the newer package and works with DataFrames. Algorithm coverage is similar between the two packages, although spark.ml contains more tools for feature extraction and transformation. The ml package contains two main types of classes: transformers and estimators.
A Transformer is a class that takes a DataFrame as input and transforms it into another DataFrame.
A transformer implements a transform() function, which is called on the input DataFrame.
Examples:
- Hashing term frequency (HashingTF) - calculates how often words occur, after hashing the words to reduce the number of features that need to be tracked.
- LogisticRegressionModel - the model that results from fitting logistic regression on a data set; this model can be used to transform features into predictions.
- Binarizer - changes a numeric feature into 1 or 0 given a threshold value.
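As a minimal sketch of the transform() pattern (the column names and threshold here are invented for illustration), here is Binarizer in action:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

spark = SparkSession.builder.appName("binarizer-demo").getOrCreate()

# toy single-column DataFrame
df = spark.createDataFrame([(0.1,), (0.8,), (0.5,)], ["feature"])

binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized")
binarizer.transform(df).show()  # values above 0.5 become 1.0, the rest 0.0
```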
An Estimator is a class that takes a DataFrame as input and returns a Transformer.
It does this when its fit() method is called on the input DataFrame.
Note that estimators use the data in the input DataFrame to build a model, which can then be used to transform that DataFrame or another DataFrame.
Examples:
- LogisticRegression processes the DataFrame to determine the weights for the resulting logistic regression model (see the sketch after this list).
- StandardScaler needs to calculate the standard deviations, and possibly the means, of a column of vectors so that it can create a StandardScalerModel. That model can then be used to transform a DataFrame by subtracting the means and dividing by the standard deviations.
- Pipeline - calling fit() on a pipeline produces a PipelineModel. The pipeline model contains only transformers; there are no estimators.
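A minimal sketch of the estimator-to-transformer handoff (the toy training data below is invented):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("estimator-demo").getOrCreate()

# tiny invented training set: (features, label)
train = spark.createDataFrame([
    (Vectors.dense([0.0, 1.1]), 0.0),
    (Vectors.dense([2.0, 1.0]), 1.0),
    (Vectors.dense([2.0, 1.3]), 1.0),
], ["features", "label"])

lr = LogisticRegression(maxIter=10)  # the estimator
model = lr.fit(train)                # fit() returns a LogisticRegressionModel, a transformer
model.transform(train).select("features", "prediction").show()
```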
ML Pipeline - an estimator that consists of one or more stages representing a reusable workflow. Pipeline stages can be transformers, estimators, or other pipelines.
transformer1 -> transformer2 -> estimator1
|------------------- pipeline -------------------|
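A sketch of that transformer1 -> transformer2 -> estimator1 layout in pyspark.ml (the toy text corpus is invented):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

training = spark.createDataFrame([
    ("spark is great", 1.0),
    ("hello world", 0.0),
], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")       # transformer1
hashing_tf = HashingTF(inputCol="words", outputCol="features")  # transformer2
lr = LogisticRegression(maxIter=10)                             # estimator1

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)  # a PipelineModel: every stage is now a transformer
```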
Loss functions define how to penalize incorrect predictions.
The logistic function asymptotically approaches 0 as the input approaches negative infinity and 1 as the input approaches positive infinity. Since the results are bounded by 0 and 1, it can be directly interpreted as a probability.
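To make both ideas concrete, here is a plain-Python sketch (no Spark needed) of the logistic function and the logistic (log) loss it is commonly paired with:

```python
import math

def logistic(x):
    # bounded by 0 and 1, so the output reads as a probability
    return 1.0 / (1.0 + math.exp(-x))

def log_loss(label, prob):
    # penalizes confident wrong predictions heavily
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

print(logistic(-10))         # ~0.000045: approaches 0 toward negative infinity
print(logistic(10))          # ~0.999955: approaches 1 toward positive infinity
print(log_loss(1.0, 0.999))  # ~0.001: small loss for a confident correct prediction
print(log_loss(1.0, 0.001))  # ~6.9: large loss for a confident wrong prediction
```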
Feature engineering is an important part of the workflow, and we will discuss it next. Until then, enjoy learning.