Tuesday, February 20, 2018

Spark MLlib Basics

MLlib Machine Learning Library


Spark MLlib is a distributed machine learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times faster than the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, before Mahout itself gained a Spark interface). Many common machine learning and statistical algorithms have been implemented and shipped with MLlib, which simplifies large-scale machine learning pipelines. They include:


  1. summary statistics, correlations, stratified sampling, hypothesis testing, random data generation (see the sketch after this list)
  2. classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
  3. collaborative filtering techniques including alternating least squares (ALS)
  4. cluster analysis methods including k-means, and Latent Dirichlet Allocation (LDA)
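
For a quick taste of the DataFrame-based API, here is a minimal sketch computing a Pearson correlation matrix from item 1 above (the toy vectors and app name are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.stat import Correlation

    spark = SparkSession.builder.appName("mllib-basics").getOrCreate()

    # Toy data: each row holds one feature vector.
    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]),),
         (Vectors.dense([2.0, 4.1]),),
         (Vectors.dense([3.0, 5.9]),)],
        ["features"])

    # Pearson correlation matrix across the vector components.
    print(Correlation.corr(df, "features").head()[0])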

Machine Learning (definition) - constructing and studying methods that learn from and make predictions on data.

Terminology
  Observations - (data points) items or entities used for learning or evaluation.
  Features - attributes used to represent an observation.
  Labels - values assigned to observations.
  Training and test data - observations used to train or evaluate a learning algorithm.

So, if we consider an observation to be an email, then the features would be the date, the importance, and key words in the subject or body of the email, and the label would be spam or not-spam. The training and test data would be sets of such emails.

Supervised learning - learning from labeled observations; examples: classification and regression.
Unsupervised learning - learning from unlabeled observations; examples: clustering and dimensionality reduction.

Flow - raw data -> feature extraction -> supervised learning -> evaluation -(satisfied)-> prediction

MLlib consists of two packages.

  • spark.mllib  
  • spark.ml

When using pyspark, you'll find them in the pyspark.mllib and pyspark.ml packages respectively.
spark.ml is the newer package and works with DataFrames. The algorithm coverage is similar between the two packages, although spark.ml contains more tools for feature extraction and transformation. The ML package contains two main types of classes: Transformers and Estimators.
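
For instance (the specific classes here are just illustrative picks from each namespace):

    # RDD-based API (in maintenance mode since Spark 2.0)
    from pyspark.mllib.regression import LabeledPoint

    # Newer DataFrame-based API
    from pyspark.ml.classification import LogisticRegression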

A Transformer is a class that takes a DataFrame as input and transforms it into another DataFrame.

A Transformer implements a transform() function, which is called on the input DataFrame.

Examples:


  • Hashing Term Frequency (HashingTF) - calculates how often words occur. It does this after hashing the words to reduce the number of features that need to be tracked.
  • LogisticRegressionModel - the model that results from fitting logistic regression to a data set; it can be used to transform features into predictions.
  • Binarizer - changes a numeric feature into 1 or 0 given a threshold value (see the sketch below).
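
A minimal sketch of Binarizer acting as a transformer (the toy data, app name, and column names are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Binarizer

    spark = SparkSession.builder.appName("transformer-demo").getOrCreate()

    # Toy DataFrame with a single numeric feature column.
    df = spark.createDataFrame([(0.1,), (0.4,), (0.9,)], ["score"])

    # A Transformer needs no fitting: transform() maps the input
    # DataFrame directly to a new DataFrame with an extra column.
    binarizer = Binarizer(threshold=0.5, inputCol="score", outputCol="label")
    binarizer.transform(df).show()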


An Estimator is a class that takes a DataFrame as input and returns a Transformer.

It does this when its fit() method is called on the input DataFrame.



Note that Estimators need to use the data in the input DataFrame to build a model that can then be used to transform that DataFrame or another DataFrame.



Examples:

  • LogisticRegression processes the DataFrame to determine the weights for the resulting logistic regression model.
  • StandardScaler needs to calculate the standard deviations, and possibly the means, of a column of vectors so that it can create a StandardScalerModel. That model can then be used to transform a DataFrame by subtracting the means and dividing by the standard deviations (see the sketch after this list).
  • Pipeline - calling fit() on a Pipeline produces a PipelineModel, which contains only transformers; there are no estimators.
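
A minimal sketch of the estimator pattern with StandardScaler (the toy vectors, app name, and column names are assumptions for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("estimator-demo").getOrCreate()

    # Toy DataFrame with a column of feature vectors.
    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 10.0]),),
         (Vectors.dense([2.0, 20.0]),),
         (Vectors.dense([3.0, 30.0]),)],
        ["features"])

    # An Estimator: fit() computes the column statistics and returns
    # a StandardScalerModel, which is a Transformer.
    scaler = StandardScaler(inputCol="features", outputCol="scaled",
                            withMean=True, withStd=True)
    model = scaler.fit(df)
    model.transform(df).show(truncate=False)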

ML Pipeline - an estimator that consists of one or more stages representing a reusable workflow. Pipeline stages can be transformers, estimators, or another pipeline.

             
transformer1 -> transformer2 -> estimator1
|---------------------- pipeline ----------------------|
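
A minimal sketch of such a pipeline, echoing the email example above (the toy spam data, app name, column names, and stage settings are assumptions for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

    # Toy spam-detection data: text plus a spam/not-spam label.
    train = spark.createDataFrame(
        [("win money now", 1.0), ("meeting at noon", 0.0)],
        ["text", "label"])

    # Two transformers followed by an estimator, wired into one pipeline.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)
    pipeline = Pipeline(stages=[tokenizer, tf, lr])

    # fit() runs the stages in order and returns a PipelineModel,
    # which contains only transformers.
    model = pipeline.fit(train)
    model.transform(train).select("text", "prediction").show()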





Loss functions define how to penalize incorrect predictions.

The logistic function asymptotically approaches 0 as the input approaches negative infinity and 1 as the input approaches positive infinity. Since the results are bounded by 0 and 1, it can be directly interpreted as a probability.
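
A tiny plain-Python sketch of both ideas (the function names are mine, just for illustration):

    import math

    def sigmoid(x):
        # Logistic function: approaches 0 as x -> -infinity and 1 as
        # x -> +infinity, so the output can be read as a probability.
        return 1.0 / (1.0 + math.exp(-x))

    def log_loss(prediction, label):
        # Log loss: the penalty grows sharply when a confident
        # prediction turns out to be wrong.
        return -(label * math.log(prediction)
                 + (1 - label) * math.log(1 - prediction))

    print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.0, 0.5, ~1.0
    print(log_loss(0.9, 1), log_loss(0.9, 0))     # small vs. large penalty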

Feature engineering is an important part of the process, and we will discuss it next. Until then, enjoy learning!



