
Friday, May 11, 2018

Feature Engineering


Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Features are how you represent the world to the classifier, and feature selection has a multiplicative effect on the overall modeling process.

Features are either numeric or categorical. Feature engineering techniques are used to define features more accurately for your model. Common techniques include:

  • Bucketing
  • Crossing
  • Hashing
  • Embedding
                               
Feature Bucketing - transforms a numeric feature into a categorical feature.

Problem - does income increase linearly with age?
Age does not have a linear relationship with income: children under 17 earn very little, and earnings drop again after retirement.

Solution - bucket age (a numeric feature) into age groups (categorical features) and let the model learn a different weight for each age group. This is how we create an age bucket.
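
Here is a minimal sketch of age bucketing with Spark ML's Bucketizer; the DataFrame people, its age column, and the split points are illustrative assumptions.

    import org.apache.spark.ml.feature.Bucketizer

    // Hypothetical DataFrame `people` with a numeric "age" column.
    val splits = Array(Double.NegativeInfinity, 18.0, 35.0, 50.0, 65.0, Double.PositiveInfinity)

    val bucketizer = new Bucketizer()
      .setInputCol("age")          // numeric feature
      .setOutputCol("ageBucket")   // bucket index: 0.0, 1.0, ... per age group
      .setSplits(splits)

    val bucketed = bucketizer.transform(people)

Each resulting bucket index can then be treated as a categorical value with its own weight.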


Feature Crossing - a way to create new features that are combinations of existing features.

Problem - can a linear classifier model an interaction between multiple features, say age and education, against income?

No. This is where feature crossing is useful: for each cross(age bucket, education) combination we create a new true/false feature, so the model can learn a separate weight for every age-bucket/education pair.
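
One way to build such a cross in Spark is to concatenate the two columns and index the result; the DataFrame people and its ageBucket and education columns are illustrative assumptions, and the indexed cross could then be one-hot encoded into true/false features.

    import org.apache.spark.sql.functions.{col, concat_ws}
    import org.apache.spark.ml.feature.StringIndexer

    // Hypothetical DataFrame `people` with "ageBucket" and "education" columns.
    val crossed = people.withColumn(
      "ageBucket_x_education",
      concat_ws("_", col("ageBucket"), col("education")))

    // Give each distinct (ageBucket, education) pair its own index;
    // one-hot encoding this index yields one true/false feature per pair.
    val indexer = new StringIndexer()
      .setInputCol("ageBucket_x_education")
      .setOutputCol("crossIndex")

    val indexed = indexer.fit(crossed).transform(crossed)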


Feature Hashing (hash buckets)
One way to represent a categorical feature with a large vocabulary.

This representation can save memory and is faster to execute.

A categorical feature with a large number of values can be represented even when the vocabulary is not specified in advance.

To reduce collisions, set the number of hash buckets higher than the number of unique values (for example, the number of unique occupations).

It can also be used to limit the number of possibilities.
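
A minimal sketch using Spark ML's FeatureHasher (available from Spark 2.3); the DataFrame people, its occupation column, and the bucket count are illustrative assumptions.

    import org.apache.spark.ml.feature.FeatureHasher

    // Hypothetical DataFrame `people` with a high-cardinality "occupation" column.
    // Each value is hashed into one of numFeatures buckets, so no vocabulary
    // has to be built in advance.
    val hasher = new FeatureHasher()
      .setInputCols("occupation")
      .setOutputCol("occupationHashed")
      .setNumFeatures(1000)   // more buckets than unique occupations to limit collisions

    val hashed = hasher.transform(people)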

Embedding - represents the meaning of words as vectors.

   Used for large vocabularies.

   Embeddings are dense.
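
A minimal sketch of learning embeddings with Spark ML's Word2Vec; the DataFrame docs with a tokenized words column (Seq[String]) and the vector size are illustrative assumptions.

    import org.apache.spark.ml.feature.Word2Vec

    // Hypothetical DataFrame `docs` with a "words" column of tokenized text.
    val word2Vec = new Word2Vec()
      .setInputCol("words")
      .setOutputCol("embedding")
      .setVectorSize(50)   // dense 50-dimensional vector per document
      .setMinCount(0)

    val model = word2Vec.fit(docs)
    val embedded = model.transform(docs)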




Dimensionality Reduction 

Dimensionality Reduction is the process of reducing the number of variables/features.

Dimensionality Reduction can be divided into two subcategories 

  • Feature Selection, which includes Wrappers, Filters, and Embedded methods.
  • Feature Extraction, which includes Principal Component Analysis.



Suppose a model uses three variables a, b, and c. Now consider if c was equal to 0, or an arbitrarily small number: it wouldn't really be relevant, so it could be taken out of the equation. Here you are using Feature Selection, because you'd be selecting only the relevant variables and leaving out the irrelevant ones.

If you can equate ab = a + b, making a representation of two variables into one, you're using Feature Extraction to reduce the number of variables.


Feature Selection is the process of selecting a subset of relevant features or variables.
There are 3 main subset types: 
  •  Wrappers,
  •  Filters, and 
  •  Embedded.

Wrappers use a predictive model that scores feature subsets based on the error rate of the model. While they're computationally intensive, they usually produce the best selection of features.

A popular technique is called stepwise regression. It's an algorithm that adds the best feature, or deletes the worst feature at each iteration.

Filters use a proxy measure, which is less computationally intensive but slightly less accurate. Filters do capture the practicality of the dataset but, in comparison to error measurement, the feature set that's selected will be more general than if a Wrapper was used.

An interesting fact about filters is that they produce a feature set that doesn't contain assumptions based on the predictive model, making them a useful tool for exposing relationships between features, such as which variables are 'bad' together (and, as a result, drop the accuracy) or 'good' together (and therefore raise the accuracy).

Embedded algorithms learn which features best contribute to an accurate model during the model-building process. The most common type is called a regularization model.


Feature Extraction is the process of transforming or projecting a space composed of many dimensions into a space of fewer dimensions.

The main linear technique is called Principal Component Analysis.

Principal Component Analysis is the reduction of higher vector spaces to lower orders through projection.

An easy representation of this would be the projection from a 3-dimensional plane to a
2-dimensional one.

A plane is first found which captures most (if not all) of the information. Then the data is projected onto new axes and a reduction in dimensions occurs. When the projection of components happens, new axes are created to describe the relationship. These are called the principal axes, and the new data is called the principal components.
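
A minimal sketch of PCA in Spark ML; the DataFrame data with a vector-valued features column and the choice of k = 2 are illustrative assumptions.

    import org.apache.spark.ml.feature.PCA

    // Hypothetical DataFrame `data` with a "features" column of Vectors.
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(2)   // project onto the first 2 principal components

    // PCA is an estimator: fit() learns the principal axes, transform() projects the data.
    val pcaModel = pca.fit(data)
    val projected = pcaModel.transform(data)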


Thanks for reading!!!

Sunday, March 25, 2018

MapR Spark Certification tips


I recently cleared the MapR Spark certification and would like to share some tips, as I was asked to do so (here you go, my friends).

I divided this blog into 3 sections. 


  • prerequisite for exam
  • exam topics and must cover material
  • tips (don't ignore the topics at the end of this blog please)



Prerequisite 

First and foremost - work on Spark and Scala for at least a year before attempting the exam. The points below summarize what you need.

  • You should have basic knowledge of distributed functional programming
  • Hands-on experience with Spark
  • Good exposure to Scala programming (you don't need to be an expert, but you should be able to read code and answer sensibly)

Exam topics and must cover Material

There are lots of programming questions in the exam: a code snippet is provided and you are asked to work out the answer. If I remember correctly, only about 10% of the questions were theoretical (true/false, or which-algorithm-to-use kind).


I referred to a lot of material (online books, videos, and edX courses over the last 2 years) for my preparation, but if I had to zero in on what is mandatory for the MapR certification, here is the list. Don't miss any bit of it, and I suggest going over it 4-5 times before taking the exam.
  • Instructor and Virtual Instructor-led Training(Training ppt and Lab guide)
    • DEV 360 – Developing Spark Applications
    • DEV 361 - Build and Monitor Apache Spark Applications
    • DEV 362 - Spark Streaming, Spark MLLib - Machine Learning, Graphx
  • Book - Learning Spark
  • Spark official documentation
    • pay more attention to RDD, Closure, Accumulator, Broadcast variables.  
      • http://spark.apache.org/docs/latest/quick-start.html
      • http://spark.apache.org/docs/latest/rdd-programming-guide.html
    • MlLib - http://spark.apache.org/docs/latest/ml-guide.html



Topics covered in the exam 

Topic Name                                                     Your Score
Load and Inspect Data in Apache Spark                          xx%
Advanced Spark Programming and Spark Machine Learning MLLib    xxx%
Monitoring Spark Applications                                  xx%
Work with Pair RDD                                             xx.x%
Spark Streaming                                                xx%
Work with DataFrames                                           xx%
Build an Apache Spark Application                              xxx%

Tips  - 

Normally, when anyone starts preparing for the exam, a good start is to go through the link below:

https://mapr.com/blog/how-get-started-using-apache-spark-graphx-scala/assets/spark-certification-study-guide.pdf

The questions in this guide are way too basic compared to the real exam; the exam was much, much harder.
  • Lots of questions on core concepts of RDDs and pair RDDs
  • DataFrames are the next most important topic
  • About 25% of the questions are on Spark Streaming and Spark MLlib, so prepare those well

You don't want to ignore any of the below topics at any cost.

Silent topics which you don't want to be surprised by in the exam:


  • Accumulator and Broadcast variables
  • Scala Closures
  • Narrow and Wide Dependencies
  • Partitioning  
  • Formatting questions – saveAsTextFile() – you need to save without brackets/parentheses (see the sketch after this list)
  • Prepare well for mkString(",") and formatting
  • flatMap functions
  • MapPartitions
  • There was a question on a byKey transformation and also one on Hadoop streaming, which I am not sure about.
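
For the formatting questions, here is a minimal sketch of dropping the tuple parentheses before saving; the sample pairs and the output path are illustrative assumptions.

    // A pair RDD; calling saveAsTextFile() on it directly would write lines like "(spark,3)".
    val pairs = sc.parallelize(Seq(("spark", 3), ("scala", 2)))

    // Format each pair as a comma-separated string first, so "spark,3" is written instead.
    pairs
      .map { case (k, v) => Seq(k, v).mkString(",") }
      .saveAsTextFile("/tmp/pairs-output")   // hypothetical output path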
Hope this blog helps in your preparation. Please let me know or email me if you have any other questions. Happy studying!

At the end  - Here is my certification

Tuesday, February 20, 2018

Spark MLlib Basics

MLlib Machine Learning Library


Spark MLlib is a distributed machine learning framework on top of Spark Core. Due in large part to the distributed memory-based Spark architecture, it is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, and before Mahout itself gained a Spark interface). Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including:


  1. summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
  2. classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
  3. collaborative filtering techniques including alternating least squares (ALS)
  4. cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)

Machine Learning(definition) - Constructing and studying methods that learn from and make predictions on data. 

Terminologies
  Observations - (data points) items or entities used for learning or evaluation.
  Features - attributes used to represent an observation.
  Labels - values assigned to observations.
  Training and test data - observations used to train or evaluate a learning algorithm.

So if we consider an email as the observation, then the features would be the date, importance, and keywords in the subject or body of the email, and the label would be spam or not-spam. The training and test data would be sets of emails.

Supervised learning - learning from labeled observations; examples: classification and regression.
Unsupervised learning - learning from unlabeled observations; examples: clustering and dimensionality reduction.

Flow: raw data -> feature extraction -> supervised learning -> evaluation -(satisfied)-> prediction

MLlib consists of two packages.

  • spark.mllib  
  • spark.ml

When using PySpark, you'll find them in the pyspark.mllib and pyspark.ml packages respectively.
spark.ml is the newer package and works with DataFrames. The algorithm coverage is similar between the two packages, although spark.ml contains more tools for feature extraction and transformation. The ml package contains two types of classes: transformers and estimators.

A Transformer is a class which takes a DataFrame as input and transforms it into another DataFrame.

A transformer implements a transform() method which is called on the input DataFrame.

Examples :


  • Hashing Term Frequency - which calculates how often words occur. It does this after hashing the words to reduce the number of features that need to be tracked.
  • LogisticRegressionModel - The model that results from trying logistic regression on a data set, this model can be used to transform features into predictions.
  • Binarizer - which changes a numeric feature into 1 or 0 given a threshold value (see the sketch after this list).
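
A minimal sketch of the transformer pattern using Binarizer; the DataFrame scores, its score column, and the threshold are illustrative assumptions.

    import org.apache.spark.ml.feature.Binarizer

    // Hypothetical DataFrame `scores` with a numeric "score" column.
    // Binarizer is a transformer: transform() maps the input column directly to 0.0/1.0.
    val binarizer = new Binarizer()
      .setInputCol("score")
      .setOutputCol("label")
      .setThreshold(0.5)   // values above 0.5 become 1.0, the rest 0.0

    val labeled = binarizer.transform(scores)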


An Estimator is a class that takes a DataFrame as input and returns a Transformer.

It does this when its fit() method is called on the input DataFrame.



Note that Estimators need to use the data in the input DataFrame to build a model that can then be used to transform that DataFrame or another DataFrame.



Examples :

  • LogisticRegression processes the DataFrame to determine the weights for the resulting logistic regression model.
  • StandardScaler needs to calculate the standard deviations, and possibly means, of a column of vectors so that it can create a StandardScalerModel. That model can then be used to transform a DataFrame by subtracting the means and dividing by the standard deviations (see the sketch after this list).
  • Pipeline - calling fit on a pipeline produces a PipelineModel. The pipeline model contains only transformers; there are no estimators.
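
A minimal sketch of the estimator pattern using StandardScaler; the DataFrame train and its features column are illustrative assumptions.

    import org.apache.spark.ml.feature.StandardScaler

    // Hypothetical DataFrame `train` with a "features" column of Vectors.
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(true)
      .setWithStd(true)

    // fit() is the estimator step: it computes the column means and standard deviations
    // and returns a StandardScalerModel, which is a transformer.
    val scalerModel = scaler.fit(train)
    val scaled = scalerModel.transform(train)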

ML Pipeline - it is an estimator that consists of one or more stages representing a reusable workflow. Pipeline stages can be transformers, estimators, or another pipeline.

             
transformer1->transformer2->estimator1
-----------------------pipeline-----------------------------
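
A minimal sketch of the transformer -> transformer -> estimator pipeline shown above, using a Tokenizer, HashingTF, and LogisticRegression; the DataFrame training with text and label columns is an illustrative assumption.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{Tokenizer, HashingTF}
    import org.apache.spark.ml.classification.LogisticRegression

    // Hypothetical DataFrame `training` with "text" and "label" columns.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // transformer1 -> transformer2 -> estimator1, wrapped as a single reusable estimator.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // fit() runs the stages in order and returns a PipelineModel containing only transformers.
    val model = pipeline.fit(training)
    val predictions = model.transform(training)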





Loss functions define how to penalize incorrect predictions.

The logistic function asymptotically approaches 0 as the input approaches negative infinity and 1 as the input approaches positive infinity. Since the results are bounded by 0 and 1, it can be directly interpreted as a probability.
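
A quick sketch of the logistic (sigmoid) function itself, just to illustrate the bounds described above:

    // sigmoid(x) = 1 / (1 + e^(-x)); bounded by 0 and 1, so it reads as a probability.
    def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

    sigmoid(-10.0)  // ~0.000045, approaches 0 for large negative inputs
    sigmoid(0.0)    // 0.5
    sigmoid(10.0)   // ~0.999955, approaches 1 for large positive inputs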

Feature engineering is an important part, and we will discuss that next. Until then, enjoy learning!



