
Friday, May 11, 2018

Feature Engineering


Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Features are how you represent the world to the classifier, and feature selection has a multiplicative effect on the overall modeling process.

Features are either numeric or categorical. Feature engineering techniques are used to define features more accurately for your model. Common techniques include:

  • Bucketing
  • Crossing
  • Hashing
  • Embedding
                               
Feature Bucketing - transforms a numeric feature into a categorical feature.

Problem - does income increase linearly with age?
Age does not have a linear relationship with income: children under 17 earn very little, and earnings drop again after retirement.

Solution - bucket age (a numeric feature) into age groups (categorical features) and let the model learn a different weight for each age group. This is how we create an age bucket.
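
Here is a minimal sketch of age bucketing with Spark ML's Bucketizer; the DataFrame people, its age column, and the split points are illustrative assumptions.

    import org.apache.spark.ml.feature.Bucketizer

    // Hypothetical DataFrame `people` with a numeric "age" column.
    val splits = Array(Double.NegativeInfinity, 18.0, 35.0, 50.0, 65.0, Double.PositiveInfinity)

    val bucketizer = new Bucketizer()
      .setInputCol("age")          // numeric feature
      .setOutputCol("ageBucket")   // bucket index: 0.0, 1.0, ... per age group
      .setSplits(splits)

    val bucketed = bucketizer.transform(people)

Each resulting bucket index can then be treated as a categorical value with its own weight.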


Feature Crossing - a way to create new features that are combinations of existing features.

Problem - can a linear classifier model an interaction between multiple features, say age and education, against income?

No. This is where feature crossing is useful: for each cross(age bucket, education) combination we create a new true/false feature, so the model can learn a separate weight for every age-bucket/education pair.
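
One way to build such a cross in Spark is to concatenate the two columns and index the result; the DataFrame people and its ageBucket and education columns are illustrative assumptions, and the indexed cross could then be one-hot encoded into true/false features.

    import org.apache.spark.sql.functions.{col, concat_ws}
    import org.apache.spark.ml.feature.StringIndexer

    // Hypothetical DataFrame `people` with "ageBucket" and "education" columns.
    val crossed = people.withColumn(
      "ageBucket_x_education",
      concat_ws("_", col("ageBucket"), col("education")))

    // Give each distinct (ageBucket, education) pair its own index;
    // one-hot encoding this index yields one true/false feature per pair.
    val indexer = new StringIndexer()
      .setInputCol("ageBucket_x_education")
      .setOutputCol("crossIndex")

    val indexed = indexer.fit(crossed).transform(crossed)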


Feature Hashing (hash buckets)
One way to represent a categorical feature with a large vocabulary.

This representation can save memory and is faster to execute.

A categorical feature with a large number of values can be represented even when the vocabulary is not specified in advance.

To reduce collisions, set the number of hash buckets higher than the number of unique values (for example, the number of unique occupations).

It can also be used to limit the number of possibilities.
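
A minimal sketch using Spark ML's FeatureHasher (available from Spark 2.3); the DataFrame people, its occupation column, and the bucket count are illustrative assumptions.

    import org.apache.spark.ml.feature.FeatureHasher

    // Hypothetical DataFrame `people` with a high-cardinality "occupation" column.
    // Each value is hashed into one of numFeatures buckets, so no vocabulary
    // has to be built in advance.
    val hasher = new FeatureHasher()
      .setInputCols("occupation")
      .setOutputCol("occupationHashed")
      .setNumFeatures(1000)   // more buckets than unique occupations to limit collisions

    val hashed = hasher.transform(people)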

Embedding - represents the meaning of words as vectors.

   Used for large vocabularies.

   Embeddings are dense.
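
A minimal sketch of learning embeddings with Spark ML's Word2Vec; the DataFrame docs with a tokenized words column (Seq[String]) and the vector size are illustrative assumptions.

    import org.apache.spark.ml.feature.Word2Vec

    // Hypothetical DataFrame `docs` with a "words" column of tokenized text.
    val word2Vec = new Word2Vec()
      .setInputCol("words")
      .setOutputCol("embedding")
      .setVectorSize(50)   // dense 50-dimensional vector per document
      .setMinCount(0)

    val model = word2Vec.fit(docs)
    val embedded = model.transform(docs)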




Dimensionality Reduction 

Dimensionality Reduction is the process of reducing the number of variables/features.

Dimensionality Reduction can be divided into two subcategories 

  • Feature Selection, which includes Wrappers, Filters, and Embedded methods.
  • Feature Extraction, which includes Principal Component Analysis.



Suppose a model uses three variables a, b, and c. Now consider if c was equal to 0, or an arbitrarily small number: it wouldn't really be relevant, so it could be taken out of the equation. Here you are using Feature Selection, because you'd be selecting only the relevant variables and leaving out the irrelevant ones.

If you can equate ab = a + b, making a representation of two variables into one, you're using Feature Extraction to reduce the number of variables.


Feature Selection is the process of selecting a subset of relevant features or variables.
There are 3 main subset types: 
  •  Wrappers,
  •  Filters, and 
  •  Embedded.

Wrappers use a predictive model that scores feature subsets based on the error rate of the model. While they're computationally intensive, they usually produce the best selection of features.

A popular technique is called stepwise regression. It's an algorithm that adds the best feature, or deletes the worst feature at each iteration.

Filters use a proxy measure, which is less computationally intensive but slightly less accurate. Filters do capture the practicality of the dataset but, in comparison to error measurement, the feature set that's selected will be more general than if a Wrapper was used.

An interesting fact about filters is that they produce a feature set that doesn't contain assumptions based on the predictive model, making them a useful tool for exposing relationships between features, such as which variables are 'bad' together (and, as a result, drop the accuracy) or 'good' together (and therefore raise the accuracy).

Embedded algorithms learn which features best contribute to an accurate model during the model-building process. The most common type is called a regularization model.


Feature Extraction is the process of transforming or projecting a space composed of many dimensions into a space of fewer dimensions.

The main linear technique is called Principal Component Analysis.

Principal Component Analysis is the reduction of higher vector spaces to lower orders through projection.

An easy representation of this would be the projection from a 3-dimensional plane to a
2-dimensional one.

A plane is first found which captures most (if not all) of the information. Then the data is projected onto new axes and a reduction in dimensions occurs. When the projection of components happens, new axes are created to describe the relationship. These are called the principal axes, and the new data is called the principal components.
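
A minimal sketch of PCA in Spark ML; the DataFrame data with a vector-valued features column and the choice of k = 2 are illustrative assumptions.

    import org.apache.spark.ml.feature.PCA

    // Hypothetical DataFrame `data` with a "features" column of Vectors.
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(2)   // project onto the first 2 principal components

    // PCA is an estimator: fit() learns the principal axes, transform() projects the data.
    val pcaModel = pca.fit(data)
    val projected = pcaModel.transform(data)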


Thanks for reading!!!

Sunday, March 25, 2018

MapR Spark Certification tips


I recently cleared the MapR Spark certification and would like to share some tips, as I was asked to do so (here you go, my friends).

I divided this blog into 3 sections. 


  • prerequisite for exam
  • exam topics and must cover material
  • tips (don't ignore the topics at the end of this blog please)



Prerequisite 

First and foremost - work on Spark and Scala for at least a year before attempting the exam. The points below summarize what you need.

  • You should have basic knowledge of distributed functional programming
  • Hands-on experience with Spark
  • Good exposure to Scala programming (you don't need to be an expert, but you should be able to read code and answer sensibly)

Exam topics and must cover Material

There are lots of programming questions in the exam: a code snippet is provided and you are asked to work out the answer. If I remember correctly, only about 10% of the questions were theoretical (true/false, or which-algorithm-to-use kind).


I referred to a lot of material (online books, videos, and edX courses over the last 2 years) for my preparation, but if I had to zero in on what is mandatory for the MapR certification, here is the list. Don't miss any bit of it, and I suggest going over it 4-5 times before taking the exam.
  • Instructor and Virtual Instructor-led Training(Training ppt and Lab guide)
    • DEV 360 – Developing Spark Applications
    • DEV 361 - Build and Monitor Apache Spark Applications
    • DEV 362 - Spark Streaming, Spark MLLib - Machine Learning, Graphx
  • Book - Learning Spark
  • Spark official documentation
    • pay more attention to RDD, Closure, Accumulator, Broadcast variables.  
      • http://spark.apache.org/docs/latest/quick-start.html
      • http://spark.apache.org/docs/latest/rdd-programming-guide.html
    • MlLib - http://spark.apache.org/docs/latest/ml-guide.html



Topics covered in the exam 

Topic Name                                                     Your Score
Load and Inspect Data in Apache Spark                          xx%
Advanced Spark Programming and Spark Machine Learning MLLib    xxx%
Monitoring Spark Applications                                  xx%
Work with Pair RDD                                             xx.x%
Spark Streaming                                                xx%
Work with DataFrames                                           xx%
Build an Apache Spark Application                              xxx%

Tips  - 

Normally, when anyone starts preparing for the exam, a good start is to go through the link below:

https://mapr.com/blog/how-get-started-using-apache-spark-graphx-scala/assets/spark-certification-study-guide.pdf

The questions in this guide are way too basic compared to the real exam; the exam was much, much harder.
  • Lots of questions on core concepts of RDDs and pair RDDs
  • DataFrames are the next most important topic
  • About 25% of the questions are on Spark Streaming and Spark MLlib, so prepare those well

You don't want to ignore any of the below topics at any cost.

Silent topics which you don't want to be surprised by in the exam:


  • Accumulator and Broadcast variables
  • Scala Closures
  • Narrow and Wide Dependencies
  • Partitioning  
  • Formatting questions – saveAsTextFile() – you need to save without brackets/parentheses (see the sketch after this list)
  • Prepare well for mkString(",") and formatting
  • flatMap functions
  • MapPartitions
  • There was a question on a byKey transformation and also one on Hadoop streaming, which I am not sure about.
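
For the formatting questions, here is a minimal sketch of dropping the tuple parentheses before saving; the sample pairs and the output path are illustrative assumptions.

    // A pair RDD; calling saveAsTextFile() on it directly would write lines like "(spark,3)".
    val pairs = sc.parallelize(Seq(("spark", 3), ("scala", 2)))

    // Format each pair as a comma-separated string first, so "spark,3" is written instead.
    pairs
      .map { case (k, v) => Seq(k, v).mkString(",") }
      .saveAsTextFile("/tmp/pairs-output")   // hypothetical output path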
Hope this blog helps in your preparation. Please let me know or email me if you have any other questions. Happy studying!

At the end  - Here is my certification

Tuesday, February 20, 2018

Spark MLlib Basics

MLlib Machine Learning Library


Spark MLlib is a distributed machine learning framework on top of Spark Core. Due in large part to the distributed memory-based Spark architecture, it is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, and before Mahout itself gained a Spark interface). Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including:


  1. summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
  2. classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
  3. collaborative filtering techniques including alternating least squares (ALS)
  4. cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)

Machine Learning(definition) - Constructing and studying methods that learn from and make predictions on data. 

Terminologies
  Observations - (data points) items or entities used for learning or evaluation.
  Features - attributes used to represent an observation.
  Labels - values assigned to observations.
  Training and test data - observations used to train or evaluate a learning algorithm.

So if we consider an email as the observation, then the features would be the date, importance, and keywords in the subject or body of the email, and the label would be spam or not-spam. The training and test data would be sets of emails.

Supervised learning - learning from labeled observations; examples: classification and regression.
Unsupervised learning - learning from unlabeled observations; examples: clustering and dimensionality reduction.

Flow: raw data -> feature extraction -> supervised learning -> evaluation -(satisfied)-> prediction

MLlib consists of two packages.

  • spark.mllib  
  • spark.ml

When using PySpark, you'll find them in the pyspark.mllib and pyspark.ml packages respectively.
spark.ml is the newer package and works with DataFrames. The algorithm coverage is similar between the two packages, although spark.ml contains more tools for feature extraction and transformation. The ml package contains two types of classes: transformers and estimators.

A Transformer is a class which takes a DataFrame as input and transforms it into another DataFrame.

A transformer implements a transform() method which is called on the input DataFrame.

Examples :


  • Hashing Term Frequency - which calculates how often words occur. It does this after hashing the words to reduce the number of features that need to be tracked.
  • LogisticRegressionModel - The model that results from trying logistic regression on a data set, this model can be used to transform features into predictions.
  • Binarizer - which changes a numeric feature into 1 or 0 given a threshold value (see the sketch after this list).
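
A minimal sketch of the transformer pattern using Binarizer; the DataFrame scores, its score column, and the threshold are illustrative assumptions.

    import org.apache.spark.ml.feature.Binarizer

    // Hypothetical DataFrame `scores` with a numeric "score" column.
    // Binarizer is a transformer: transform() maps the input column directly to 0.0/1.0.
    val binarizer = new Binarizer()
      .setInputCol("score")
      .setOutputCol("label")
      .setThreshold(0.5)   // values above 0.5 become 1.0, the rest 0.0

    val labeled = binarizer.transform(scores)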


An Estimator is a class that takes a DataFrame as input and returns a Transformer.

It does this when its fit() method is called on the input DataFrame.



Note that Estimators need to use the data in the input DataFrame to build a model that can then be used to transform that DataFrame or another DataFrame.



Examples :

  • LogisticRegression processes the DataFrame to determine the weights for the resulting logistic regression model.
  • StandardScaler needs to calculate the standard deviations, and possibly means, of a column of vectors so that it can create a StandardScalerModel. That model can then be used to transform a DataFrame by subtracting the means and dividing by the standard deviations (see the sketch after this list).
  • Pipeline - calling fit on a pipeline produces a PipelineModel. The pipeline model contains only transformers; there are no estimators.
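
A minimal sketch of the estimator pattern using StandardScaler; the DataFrame train and its features column are illustrative assumptions.

    import org.apache.spark.ml.feature.StandardScaler

    // Hypothetical DataFrame `train` with a "features" column of Vectors.
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(true)
      .setWithStd(true)

    // fit() is the estimator step: it computes the column means and standard deviations
    // and returns a StandardScalerModel, which is a transformer.
    val scalerModel = scaler.fit(train)
    val scaled = scalerModel.transform(train)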

ML Pipeline - it is an estimator that consists of one or more stages representing a reusable workflow. Pipeline stages can be transformers, estimators, or another pipeline.

             
transformer1->transformer2->estimator1
-----------------------pipeline-----------------------------
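
A minimal sketch of the transformer -> transformer -> estimator pipeline shown above, using a Tokenizer, HashingTF, and LogisticRegression; the DataFrame training with text and label columns is an illustrative assumption.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{Tokenizer, HashingTF}
    import org.apache.spark.ml.classification.LogisticRegression

    // Hypothetical DataFrame `training` with "text" and "label" columns.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // transformer1 -> transformer2 -> estimator1, wrapped as a single reusable estimator.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // fit() runs the stages in order and returns a PipelineModel containing only transformers.
    val model = pipeline.fit(training)
    val predictions = model.transform(training)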





Loss functions define how to penalize incorrect predictions.

The logistic function asymptotically approaches 0 as the input approaches negative infinity and 1 as the input approaches positive infinity. Since the results are bounded by 0 and 1, it can be directly interpreted as a probability.
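
A quick sketch of the logistic (sigmoid) function itself, just to illustrate the bounds described above:

    // sigmoid(x) = 1 / (1 + e^(-x)); bounded by 0 and 1, so it reads as a probability.
    def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

    sigmoid(-10.0)  // ~0.000045, approaches 0 for large negative inputs
    sigmoid(0.0)    // 0.5
    sigmoid(10.0)   // ~0.999955, approaches 1 for large positive inputs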

Feature engineering is an important part, and we will discuss that next. Until then, enjoy learning!



