Friday, March 29, 2019

Recommendation engine using PredictionIO - Basics, Challenges


Recommendation - Collaborative Filtering

Collaborative Filtering techniques explore the idea that relationships exist between products and people's interests. As the Netflix Prize competition demonstrated, matrix factorization models are superior to classic nearest-neighbor techniques for producing product recommendations, allowing the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels.

Netflix vs Amazon recommendation -

One popular example of Collaborative Filtering is Netflix. Everything on their site is driven by their customers' selections, which, if made frequently enough, get turned into recommendations. Netflix orders these recommendations in such a way that the highest-ranking items are more visible to users, in hopes of getting them to select those recommendations as well.

Another popular example is Amazon.com. Amazon's item recommendation system is based on what you've previously purchased, as well as the frequency with which you've looked at certain books or other items during previous visits to their website. The advantage of using Collaborative Filtering is that users get broader exposure to many different products they might be interested in. This exposure encourages continued usage or purchases.


I built a recommendation engine using PredictionIO. If you are interested in learning more about the implementation, you can send me an email and I will respond with details on how to design the events, etc.


I will just give pointers here - you can find the code in my GitHub repo - https://github.com/pawan-agnihotri

PredictionIO - Overview

What: Apache PredictionIO® is a framework for machine learning - a machine learning server built on top of Apache Spark, Spark MLlib, and HBase.
  • Apache License, Version 2.0
  • Written in Scala, based on Spark, and implements the Lambda Architecture
  • Supports Spark MLlib and OpenNLP
  • Supports batch and real-time event ingestion and predictions
  • Responds to dynamic queries in real time via a REST API

 Who/When:
  The company was founded in 2013 and is based in Walnut, California.

  Acquired by Salesforce in Feb 2016 and currently used in Salesforce Einstein (Salesforce's AI initiative)



Product Recommender - built using PredictionIO
  • Goal: build a model that produces individualized recommendations and serves them in real time.
  • User inputs: like/buy/view events, plus a prediction query
  • Output: prediction result

Transaction Classifier
  • Goal: build a model that classifies a user transaction (att0, att1, att2) into multiple categories (0-low, 1-medium, 2-high, 3-very high).
  • User inputs: events, plus a prediction query
  • Output: prediction result





Goal: build a machine learning model that can serve predictions in real time

Step 1:  Create the model using Spark MLlib
Step 2:  Build the model
Step 3:  Create test/training data
Step 4:  Train and deploy the model
Step 5:  Use the REST API (see the sketch below this list)
  • Post event data to the Event Server (in real time)
  • Make predictions (in real time)

Step 6:  Incorporate the predictions into your application
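As a rough illustration of step 5, here is a minimal sketch of hitting the two REST endpoints from Scala. It assumes the default PredictionIO ports (7070 for the event server, 8000 for a deployed engine), a placeholder access key, and the scalaj-http client on the classpath; the event and query field names follow the common recommendation template and may differ in your engine.

import scalaj.http.Http

object RestApiSketch {
  def main(args: Array[String]): Unit = {
    // Post a "buy" event to the event server (port and access key are assumptions)
    val eventJson =
      """{"event":"buy","entityType":"user","entityId":"u1",
        |"targetEntityType":"item","targetEntityId":"i42"}""".stripMargin
    val eventResp = Http("http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY")
      .postData(eventJson)
      .header("Content-Type", "application/json")
      .asString
    println(s"event server responded: ${eventResp.code}")

    // Query the deployed engine for 4 recommendations for user u1
    val queryResp = Http("http://localhost:8000/queries.json")
      .postData("""{"user":"u1","num":4}""")
      .header("Content-Type", "application/json")
      .asString
    println(s"prediction result: ${queryResp.body}")
  }
}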

Challenges - 

1. One of them is Data Sparsity. Having a large dataset will most likely result in a user-item matrix that is large and sparse, which may provide a good level of accuracy but also poses a risk to speed. In comparison, having a small dataset would result in faster speeds but lower accuracy.

2.  Cold Start
Another issue to keep in mind is something called 'cold start'. This is where new users do not yet have a sufficient number of ratings to receive accurate recommendations.

3. Scalability - as the volume of data grows, computation slows down and recommendations are delayed.

4. Synonyms
The term 'Synonyms' refers to items that are similar but labeled differently, and are thus treated as different items by the recommendation system. An example of this would be 'Backpack' vs 'Knapsack'.

5. Gray Sheep
The term 'Gray Sheep' refers to users whose opinions don't necessarily 'fit' or align with any specific grouping. These users do not consistently agree or disagree with others on products or items, making recommendations of little benefit to them.

6. Shilling Attacks
Shilling attacks are the abuse of this system: rating certain products high and other products low regardless of personal opinion, thereby causing the favored product to be recommended more often.

7. Long Tail effect - popular items are rated/viewed frequently. This creates a cycle in which new items remain in the shadow of the popular items.


It is common in many real-world use cases to only have access to implicit feedback (e.g. views, clicks, purchases, likes, shares etc.). The approach used in spark.mllib to deal with such data is taken from Collaborative Filtering for Implicit Feedback Datasets. Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data as numbers representing the strength in observations of user actions (such as the number of clicks, or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of confidence in observed user preferences, rather than explicit ratings given to items. The model then tries to find latent factors that can be used to predict the expected preference of a user for an item.
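To make this concrete, here is a minimal sketch of the implicit-feedback variant in spark.mllib; the viewCounts RDD name and its (user, item, count) shape are assumptions about how you aggregate your event data, and the hyperparameter values are illustrative.

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// viewCounts: (userId, itemId, number of views) aggregated from the event data (assumed)
def trainOnImplicitFeedback(viewCounts: RDD[(Int, Int, Double)]) = {
  val observations = viewCounts.map { case (user, item, count) =>
    Rating(user, item, count) // the "rating" is a strength of observation, not an explicit score
  }
  // rank = 10 latent factors, 20 iterations, lambda = 0.01, alpha = 1.0 (baseline confidence)
  ALS.trainImplicit(observations, 10, 20, 0.01, 1.0)
}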
Rank - it is purely a characteristic of the data: the rank is the number of presumed latent or hidden factors. For example, if you were measuring how much different people liked movies and tried to cross-predict them, then you might have three fields: person, movie, number of stars. Now, let's say that you were omniscient, knew the absolute truth, and knew that in fact all the movie ratings could be perfectly predicted by just 3 hidden factors: sex, age and income. In that case the "rank" of your run should be 3.
Of course, you don't know how many underlying factors, if any, drive your data, so you have to guess. The more you use, the better the results up to a point, but the more memory and computation time you will need.
One way to work it out is to start with a rank of 5-10, then increase it, say 5 at a time, until your results stop improving. That way you determine the best rank for your dataset by experimentation (see the sketch below).
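A minimal sketch of that experiment using spark.mllib's explicit ALS, assuming you already hold training and validation RDD[Rating] splits; the candidate rank list and the MSE metric are illustrative choices.

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def bestRank(training: RDD[Rating], validation: RDD[Rating]): Int = {
  val candidateRanks = Seq(5, 10, 15, 20, 25)
  val rankAndError = candidateRanks.map { rank =>
    val model = ALS.train(training, rank, 20, 0.01)
    val predictions = model
      .predict(validation.map(r => (r.user, r.product)))
      .map(p => ((p.user, p.product), p.rating))
    val actuals = validation.map(r => ((r.user, r.product), r.rating))
    val mse = actuals.join(predictions)
      .map { case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }
      .mean()
    (rank, mse)
  }
  // pick the rank where the error stops improving
  rankAndError.minBy(_._2)._1
}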


spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.mllib has the following parameters:
  • numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
  • rank is the number of latent factors in the model.
  • iterations is the number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less.
  • lambda specifies the regularization parameter in ALS.
  • implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
  • alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.


The trained model is a MatrixFactorizationModel(rank, userFeatures, productFeatures).

Algorithm parameters (typically set in engine.json):

{
  "name": "als",
  "params": {
    "rank": 10,
    "numIterations": 20,
    "lambda": 0.01,
    "seed": 3
  }
}

Training code (ALS from spark.mllib); ap holds the parameters above, and mllibRatings is the RDD of ratings built from the event data:

import org.apache.spark.mllib.recommendation.ALS

val implicitPrefs = false
val als = new ALS()
als.setUserBlocks(-1)
als.setProductBlocks(-1)
als.setRank(ap.rank)
als.setIterations(ap.numIterations)
als.setLambda(ap.lambda)
als.setImplicitPrefs(implicitPrefs)
als.setAlpha(1.0)
als.setSeed(seed)
als.setCheckpointInterval(10)
val m = als.run(mllibRatings)

Advantages:

1. Hierarchical matrix co-clustering / factorization
2. Preference versus intention - distinguishes between liking something and being interested in seeing/purchasing it (it is worthless to recommend an item a user has already bought)
3. Scalability
4. Relevant objectives - predicting the actual rating may be useless (missing-at-random assumption)

Drawbacks of our model

1. Multiple individuals using the same account - individual preferences get mixed
2. Cold start (new users) 

Friday, November 9, 2018

Tips to consider before writing spark programs





1. Size your executors right



Ideal executor size
  • keep 4-5 cores per executor
  • 2-4 GB RAM per core
  • allow for the YARN memory overhead (~7%)
  • the Application Master takes up 1 core and 1 executor slot

     
Sometimes you may need to run a large data load and the above configuration may not help; here are a few points to keep in mind while sizing the executors (see the sketch below).
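For illustration only, here is a sketch of what that sizing might look like in a SparkConf, assuming hypothetical worker nodes with 16 cores and 64 GB of RAM in a 4-node cluster; every number here is an assumption you would adjust to your own hardware, and the memoryOverhead key name varies across Spark versions.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sizing: 16-core / 64 GB nodes, 3 executors per node x 5 cores each,
// ~19g heap per executor so the ~7-10% YARN overhead still fits on the node.
val conf = new SparkConf()
  .setAppName("right-sized-job")
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "19g")
  .set("spark.yarn.executor.memoryOverhead", "2048") // in MB, roughly 7-10% of executor memory
  .set("spark.executor.instances", "11")             // (4 nodes x 3 executors) - 1 slot for the AM
val sc = new SparkContext(conf)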


2. The driver program runs on a single machine.



This sometimes becomes a limitation, and for heavy data-load applications you may choose not to use Spark and opt for other solutions like Hive, which is better suited to a single, heavy data-load aggregation. If you stick with Spark for such workloads, there are workarounds, but you need to take some additional steps.



3. Manage your DAG



  • avoid shuffles (do map-side reduction; only send what you have to)
  • use reduceByKey over groupByKey (groupByKey has skew issues since it is data dependent) - see the sketch after this list
  • use treeReduce (reduces at the executors) over reduce (reduces at the driver)
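As referenced in the list above, a small sketch of the reduceByKey-over-groupByKey point; the wordPairs RDD is an assumed (word, 1) pairing.

import org.apache.spark.rdd.RDD

// wordPairs: an RDD of (word, 1) pairs (assumed)
def countWords(wordPairs: RDD[(String, Int)]): RDD[(String, Int)] = {
  // groupByKey ships every single value across the network before summing:
  //   wordPairs.groupByKey().mapValues(_.sum)   // avoid
  // reduceByKey combines values on the map side first, so far less data is shuffled:
  wordPairs.reduceByKey(_ + _)                   // prefer
}

// Similarly, numbers.treeReduce(_ + _) aggregates partial results on the executors,
// whereas numbers.reduce(_ + _) pulls every partition's result back to the driver.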


4. 2 GB limit on a Spark shuffle block - a hard Spark limit; shuffles still hurt



  • aim for about 128 MB per partition
  • if your number of partitions is close to 2,000, bump it to just over 2,000 so Spark uses a compressed format for the shuffle metadata

5. Skew with join

Most of the data goes to one partition; use salting to distribute your keys (see the sketch below).
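A minimal sketch of key salting for a skewed join; the RDD names, the number of salt buckets, and the underscore key encoding are all illustrative assumptions.

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import scala.util.Random

// skewed: a large RDD where a few keys hold most of the rows; small: the other side of the join
def saltedJoin[V: ClassTag, W: ClassTag](skewed: RDD[(String, V)],
                                         small: RDD[(String, W)],
                                         saltBuckets: Int = 10): RDD[(String, (V, W))] = {
  // spread each key of the skewed side across saltBuckets artificial sub-keys
  val saltedLeft = skewed.map { case (k, v) => (s"${k}_${Random.nextInt(saltBuckets)}", v) }
  // replicate the smaller side once per salt value so every sub-key still finds its match
  val saltedRight = small.flatMap { case (k, w) =>
    (0 until saltBuckets).map(salt => (s"${k}_$salt", w))
  }
  saltedLeft.join(saltedRight).map { case (saltedKey, vw) =>
    (saltedKey.substring(0, saltedKey.lastIndexOf('_')), vw) // strip the salt back off
  }
}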






Tuesday, June 19, 2018

Deep Learning & Neural Network


Deep learning is a subset of machine learning and functions in a similar way, but its capabilities are different. Deep learning algorithms are capable of determining on their own whether their predictions are accurate or not. This is where deep learning gets tricky :-)

A deep learning model is designed to continually analyze data with a logic structure similar to how a human would draw conclusions. 

To achieve this, deep learning uses a layered structure of algorithms called an artificial neural network (ANN). The design of an ANN is inspired by the biological neural network of the human brain. This makes for machine intelligence that’s far more capable than that of standard machine learning models.

few terms-

MLP - Multi-Layer Perceptron
RBM - Restricted Boltzmann Machine
CNN - Convolutional Neural Net
RNN - Recurrent Net
DBN - Deep Belief Net
Autoencoders


RNTN - Recursive Neural Tensor Network


Why neural nets - because of their ability to model complex patterns.


Why Neural Nets now  -

Earlier, NNs were very hard to train (using backpropagation; see the vanishing gradient problem) and required a lot of CPU power. This is no longer the blocker it once was, thanks to the major work done in the deep learning field by Hinton, LeCun, and Bengio.


What to use When?

If you're interested in unsupervised learning – that is, you want to extract patterns from a set of unlabeled data – then your best bet is to use either a Restricted Boltzmann Machine or an autoencoder.

For supervised learning - if you have labeled data and you want to build a classifier, the right net depends on the task:

For text processing tasks like sentiment analysis, parsing, and named entity recognition – use a Recurrent Net or a Recursive Neural Tensor Network, which we’ll refer to as an RNTN. 

For any language model that operates on the character level, use a Recurrent Net. 

For image recognition, use a Deep Belief Network or a Convolutional Net. 

For object recognition, use a Convolutional Net or an RNTN. 

For speech recognition, use a Recurrent Net.

In general, Deep Belief Networks and Multilayer Perceptrons with rectified linear units – also known as RELU – are both good choices for classification. For time series analysis, it’s best to use a Recurrent Net.



CNN - the goal of the CNN is to form the best possible representation of the visual world in order to support recognition tasks.

It works by filtering through the image for specific patterns, and it is used in supervised learning methods.

why CNN
  • detect and classify objects into categories
  • robust against changes in pose, scale, brightness, etc.

Working

input image -> extract feature -> create part of the objects -> combine them to form object

CNNs are good at finding features and combining them.

A typical deep CNN has three sets of layers – a convolutional layer, RELU, and pooling layers – all of which are repeated several times. These layers are followed by a few fully connected layers in order to support classification.

A CNN layer has a flashlight structure: each neuron is only connected to the input neurons it "shines" upon. The neurons in a given filter share the same weight and bias parameters. This means that, anywhere on the filter, a given neuron is connected to the same number of input neurons and has the same weights and biases. This is what allows the filter to look for the same pattern in different sections of the image.

The next two layers that follow are RELU and pooling, both of which help to build up the simple patterns discovered by the convolutional layer. Each node in the convolutional layer is connected to a node that fires like in other nets. The activation used is called RELU, or rectified linear unit. 

CNNs are trained using backpropagation, so the vanishing gradient is once again a potential issue. The gradient is held more or less constant at every layer of the net. So the RELU activation allows the net to be properly trained, without harmful slowdowns in the crucial early layers. 

The pooling layer is used for dimensionality reduction.

Together, these three layers can discover a host of complex patterns, but the net will have no understanding of what these patterns mean. 

So a fully connected layer is attached to the end of the net in order to equip the net with the ability to classify data samples.


Since CNNs are such deep nets, they most likely need to be trained using server resources with GPUs. Despite the power of CNNs, these nets have one drawback. Since they are a supervised learning method, they require a large set of labelled data for training, which can be challenging to obtain in a real-world application. 




RNN - when patterns in your data change over time, use an RNN.


This deep learning model has a simple structure with a built-in feedback loop, allowing it to act as a forecasting engine

All the nets we’ve seen up to this point have been feedforward neural networks. In a feedforward neural network, signals flow in only one direction from input to output, one layer at a time. In a recurrent net, the output of a layer is added to the next input and fed back into the same layer, which is typically the only layer in the entire network.

Unlike feedforward nets, a recurrent net can receive a sequence of values as input, and it can also produce a sequence of values as output.

RNNs can be stacked to form capable networks for complex output, but they are a bit difficult to train.


RNTN - Recursive Neural Tensor Network, designed for sentiment analysis and NLP.


The purpose of these nets was to analyze data that had a hierarchical structure.

Structure of an RNTN - an RNTN has three basic components arranged as a binary tree (a root and two child groups):

- the parent group, which we'll call the root, uses a classifier to fire out a class and a score, and

- the child groups, which we'll call the leaves, receive the input and pass it up to the root group.

Each group is simply a collection of neurons, where the number of neurons depends on the complexity of the input data. The root is connected to both leaves, but the leaves are not connected to each other.

Technically speaking, the three components form what’s called a binary tree. In general, the leaf groups receive input, and the root group uses a classifier to fire out a class and a score.

The score represents the quality of the current parse, and the class represents an encoding of a structure in the current parse. 

This recursion continues until all inputs are used up and the net has a parse tree covering all the input words.

Use cases -
image classification, object recognition, video recognition (driverless cars), speech recognition.

In digital advertising, deep nets are used to segment users by purchase history in order to offer relevant and personalized ads in real time. Based on historical ad price data and other factors, deep nets can learn to optimally bid for ad space on a given web page.

Unfortunately, the vanishing gradient is exponentially worse for an RNN. The reason for this is that each time step is the equivalent of an entire layer in a feedforward network. So training an RNN for 100 time steps is like training a 100-layer feedforward net – this leads to exponentially small gradients and a decay of information through time. 

There are several ways to address this problem - the most popular of which is gating. Gating is a technique that helps the net decide when to forget the current input, and when to remember it for future time steps. The most popular gating types today are GRU and LSTM. Besides gating, there are also a few other techniques like gradient clipping, steeper gates, and better optimizers.


Training and  vanishing gradient 
When you're training a neural net, you're constantly calculating a cost value. The cost is typically the difference between the net's predicted output and the actual output from a set of labelled training data. The cost is then lowered by making slight adjustments to the weights and biases over and over throughout the training process, until the lowest possible value is obtained.

The training process utilizes something called a gradient, which measures the rate at which the cost will change with respect to a change in a weight or a bias.

When the gradient is large, the net will train quickly. When the gradient is small, the net will train slowly.

The process used for training a neural net is called back-propagation or back-prop. We saw before that forward prop starts with the inputs and works forward; back-prop does the reverse, calculating the gradient from right to left.

A gradient at any point is the product of the previous gradients up to that point, and the product of two numbers between 0 and 1 gives you a smaller number.
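A tiny numeric illustration of that point (the 0.25 per-layer gradient is just an assumed value):

// Multiplying gradients that each lie between 0 and 1 shrinks the product very quickly
val layerGradients = Seq.fill(10)(0.25)          // pretend each of 10 layers contributes ~0.25
val earlyLayerGradient = layerGradients.product  // 0.25^10 ≈ 9.5e-7
println(f"gradient reaching the earliest layer: $earlyLayerGradient%.2e")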



RBM - Restricted Boltzmann Machines, and how they overcame the vanishing gradient problem.


An RBM is a shallow, two-layer net; the first layer is known as the visible layer and the second is called the hidden layer. Each node in the visible layer is connected to every node in the hidden layer. An RBM is considered “restricted” because no two nodes in the same layer share a connection. 

An RBM is the mathematical equivalent of a two-way translator – in the forward pass, an RBM takes the inputs and translates them into a set of numbers that encode the inputs. In the backward pass, it takes this set of numbers and translates them back to form the re-constructed inputs. A well-trained net will be able to perform the backwards translation with a high degree of accuracy. In both steps, the weights and biases have a very important role. They allow the RBM to decipher the interrelationships among the input features, and they also help the RBM decide which input features are the most important when detecting patterns. 

Through several forward and backward passes, an RBM is trained to reconstruct the input data. Three steps are repeated over and over through the training process: 

a) With a forward pass, every input is combined with an individual weight and one overall bias, and the result is passed to the hidden layer which may or may not activate. 

b) Next, in a backward pass, each activation is combined with an individual weight and an overall bias, and the result is passed to the visible layer for reconstruction. 

c) At the visible layer, the reconstruction is compared against the original input to determine the quality of the result. 

RBMs use a measure called KL Divergence for step c); 


steps a) thru c) are repeated with varying weights and biases until the input and the re-construction are as close as possible.



DBN -  
A deep belief network can be viewed as a stack of RBMs, where the hidden layer of one RBM is the visible layer of the one "above" it.

Training DBN - 
a) The first RBM is trained to re-construct its input as accurately as possible 

b) The hidden layer of the first RBM is treated as the visible layer for the second and the second RBM is trained using the outputs from the first RBM 

c) This process is repeated until every layer in the network is trained


An important note about a DBN is that each RBM layer learns the entire input. In other kinds of models – like convolutional nets – early layers detect simple patterns and later layers recombine them

AutoEncoder - understands the features in the data and acts as a feature extraction engine.


an autoencoder is a neural net that takes a set of typically unlabelled inputs, and after encoding them, tries to reconstruct them as accurately as possible. As a result of this, the net must decide which of the data features are the most important, essentially acting as a feature extraction engine.

Autoencoders are typically very shallow, and are usually comprised of an input layer, an output layer and a hidden layer. An RBM is an example of an autoencoder with only two layers. Here is a forward pass that ends with a reconstruction of the input. There are two steps - the encoding and the decoding. Typically, the same weights that are used to encode a feature in the hidden layer are used to reconstruct an image in the output layer.

Autoencoders are trained with backpropagation, using a metric called “loss”.

loss measures the amount of information that was lost when the net tried to reconstruct the input. A net with a small loss value will produce reconstructions that look very similar to the originals.


Autoencoders can be deep. Deep autoencoders perform better at dimensionality reduction than 
their predecessor, principal component analysis, or PCA


Platform - no coding required, but you are bound by the offering; helps with quick deployment, but there is more cost associated with it.

Example - H2O.ai, GraphLab

Library - no limitation on the offering, but requires coding; less cost.

A library is a premade set of functions and modules that you can call from your own programs. You'll need to code every aspect of a net, like the model, the layers, the activation, the training method, and any special methods for preventing overfitting.

Commercial-grade libraries include deeplearning4j, Torch, and Caffe; scientific projects include Theano and deepmat.

Theano - python library - I am not sure if Hadoop support is present at the time of writing this.
Caffe - C++, with interfaces for Python and MATLAB; good for machine vision or forecasting applications.
TensorFlow - Python; based on a computational graph (same as Theano); Hadoop support, model parallelism, OpenCL (GPU) support, TensorBoard.


Friday, May 11, 2018

Feature Engineering



Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Features are the way you represent the world to the classifier. Feature selection has a multiplicative effect on the overall modeling process.

Features are numeric or categorical. Feature engineering techniques are used to define features more accurately for your model:

  • Bucketing
  • Crossing
  • Hashing
  • Embedding
                               
Feature Bucketing - transform a numeric feature into a categorical feature.

Problem - does income increase linearly with age?
Age is not in a linear relationship with income: children under 17 don't earn much, and neither do most people after retirement.

Solution - bucket the age (a numeric feature) into age groups (categorical features) and let the model put a different weight on each age group. This is how we create age buckets (see the sketch below).
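A minimal sketch of age bucketing as a plain Scala function; the bucket boundaries are illustrative assumptions.

// Map a raw age (numeric feature) onto a coarse age-group bucket (categorical feature)
def ageBucket(age: Int): String = age match {
  case a if a < 18 => "under-18"
  case a if a < 30 => "18-29"
  case a if a < 45 => "30-44"
  case a if a < 65 => "45-64"
  case _           => "65-plus"
}

// ageBucket(16) == "under-18", ageBucket(40) == "30-44";
// the model can now learn a separate weight per bucket instead of a single linear slope.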


Feature Crossing - a way to create a new feature that is a combination of existing features.

Problem - can a linear classifier model the interaction between multiple features, say age and education, against income?

No. This is where feature crossing is useful. For each cross (age bucket, education) we create a new true/false feature, so income can be modeled against every combination of age bucket and education (see the sketch below).
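Continuing the sketch above, a hypothetical crossed feature built from the age bucket and education level:

// A crossed feature is just the concatenation of two existing categorical features,
// here the age-bucket string from the bucketing sketch above and the education level
def cross(ageBucket: String, education: String): String =
  s"${ageBucket}_x_$education"

// cross("30-44", "masters") == "30-44_x_masters";
// one-hot encoding this crossed value gives the linear model one true/false feature
// (and therefore one weight) per (age bucket, education) combination.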


Feature Hashing or hash buckets
One way to represent a categorical feature with a large vocabulary.

This representation can save memory and is faster to execute.

A categorical feature with a large number of values can be represented even when the vocabulary is not specified in advance.

To avoid collisions, make the number of hash buckets larger than the number of unique occupations (or whatever the category is).

It can also be used to limit the number of possibilities (see the sketch below).
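A minimal sketch of the hashing trick; the bucket count of 1000 is an assumption you would size above the expected number of distinct values.

// Hash a high-cardinality categorical value (e.g. an occupation string) into a fixed
// number of buckets; the vocabulary never needs to be known in advance.
def hashBucket(value: String, numBuckets: Int = 1000): Int =
  (value.hashCode & Int.MaxValue) % numBuckets   // the mask keeps the hash non-negative

// hashBucket("data engineer") and hashBucket("astronaut") each land in one of 1000 slots;
// keeping numBuckets well above the number of unique values makes collisions rare.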

Embedding - represents the meaning of words as a vector.

   Used for large vocabularies.

   Embeddings are dense.




Dimensionality Reduction 

Dimensionality Reduction is the process of reducing the number of variables/features.

Dimensionality Reduction can be divided into two subcategories 

  • Feature Selection which includes Wrappers, Filters, and Embedded.
  • Feature Extraction which includes Principal Component Analysis.



Now consider an equation with variables a, b, and c. If c were equal to 0 or an arbitrarily small number, it wouldn't really be relevant, so it could be taken out of the equation. Here you are using Feature Selection, because you'd be selecting only the relevant variables and leaving out the irrelevant ones.

If you can equate ab = a + b, turning a representation of two variables into one, you're using Feature Extraction to reduce the number of variables.


Feature Selection is the process of selecting a subset of relevant features or variables.
There are 3 main subset types: 
  •  Wrappers,
  •  Filters, and 
  •  Embedded.

Wrappers use a predictive model that scores feature subsets based on the error rate of the model. While they're computationally intensive, they usually produce the best selection of features.

A popular technique is called stepwise regression. It's an algorithm that adds the best feature, or deletes the worst feature at each iteration.

Filters use a proxy measure which is less computationally intensive but slightly less accurate. Filters do capture the practicality of the dataset but, in comparison to error measurement, the feature set that's selected will be more general than if a Wrapper was used.

An interesting fact about filters is that they produce a feature set that doesn't contain assumptions based on the predictive model, making them a useful tool for exposing relationships between features, such as which variables are 'bad' together and, as a result, drop the accuracy, or 'good' together and therefore raise the accuracy.

Embedded algorithms learn about which features best contribute to an accurate model during
the model building process. The most common type of is called a regularization model.


Feature Extraction is the process of transforming or projecting a space composed of many dimensions into a space of fewer dimensions.

The main linear technique is called Principal Component Analysis.

Principal Component Analysis is the reduction of higher vector spaces to lower orders through projection.

An easy representation of this would be the projection from a 3-dimensional space onto a 2-dimensional one.

A plane is first found which captures most (if not all) of the information. Then the data is projected onto new axes and a reduction in dimensions occurs. When the projection of components happens, new axes are created to describe the relationship; these are called the principal axes, and the new data are called the principal components (see the sketch below).
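A minimal sketch of that 3D-to-2D projection with spark.mllib's RowMatrix, assuming you already have an RDD of 3-dimensional vectors named rows.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// rows: RDD[Vector] of 3-dimensional points (assumed to exist)
def projectTo2D(rows: RDD[Vector]): RowMatrix = {
  val mat = new RowMatrix(rows)
  // the principal axes: the 2 directions that capture the most variance
  val principalAxes = mat.computePrincipalComponents(2)
  // the principal components: the original data projected onto those new axes
  mat.multiply(principalAxes)
}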


Thanks for reading!!!

Sunday, March 25, 2018

MapR Spark Certification tips


I recently cleared the MapR Spark certification and would like to share some tips, as I was asked to do so (here you go, my friends).

I divided this blog into 3 sections. 


  • prerequisite for exam
  • exam topics and must cover material
  • tips (don't ignore the topics at the end of this blog please)



Prerequisite 

First and foremost - work on Spark and Scala for at least a year before attempting the exam. The points below summarize the need.

  • You should have basic knowledge of distributed functional programming
  • Hands-on experience with Spark
  • Good exposure to Scala programming (you are not expected to be an expert, but you should be able to read the code and answer sensibly)

Exam topics and must cover Material

There are lots of programming questions in the exam: a code snippet is provided and you are asked to solve it and answer. If I remember correctly, only about 10% of the questions were theoretical (true/false, or which algorithm to use).


I referred to a lot of material (online books/videos/edX courses over the last 2 years) for my preparation, but if I had to zero in on what is mandatory for the MapR certification, here is the list. Don't miss any bit of it, and I suggest going over it 4-5 times before taking the exam.
  • Instructor and Virtual Instructor-led Training(Training ppt and Lab guide)
    • DEV 360 – Developing Spark Applications
    • DEV 361 - Build and Monitor Apache Spark Applications
    • DEV 362 - Spark Streaming, Spark MLlib - Machine Learning, GraphX
  • Book - Learning Spark
  • Spark official documentation
    • pay more attention to RDD, Closure, Accumulator, Broadcast variables.  
      • http://spark.apache.org/docs/latest/quick-start.html
      • http://spark.apache.org/docs/latest/rdd-programming-guide.html
    • MLlib - http://spark.apache.org/docs/latest/ml-guide.html



Topics covered in the exam 

Topic Name - Your Score
  • Load and Inspect Data in Apache Spark - xx%
  • Advanced Spark Programming and Spark Machine Learning MLLib - xxx%
  • Monitoring Spark Applications - xx%
  • Work with Pair RDD - xx.x%
  • Spark Streaming - xx%
  • Work with DataFrames - xx%
  • Build an Apache Spark Application - xxx%

Tips  - 

Normally, when anyone starts preparing for the exam, a good starting point is to go through the link below:

https://mapr.com/blog/how-get-started-using-apache-spark-graphx-scala/assets/spark-certification-study-guide.pdf

The questions in this guide are far more basic than the real exam; the exam was much, much harder.
  • Lots of questions on core concepts of RDDs and pair RDDs
  • DataFrames are the next most important topic
  • About 25% of the questions are on Spark Streaming and Spark MLlib, so prepare well on those

You don't want to ignore any of the below topics at any cost.

Silent topics that you don't want to be surprised by in the exam:


  • Accumulators and Broadcast variables
  • Scala closures
  • Narrow and wide dependencies
  • Partitioning
  • Formatting questions - saveAsTextFile() - you need to save without brackets/parentheses (see the sketch after this list)
  • Prepare well for mkString(",") and formatting
  • flatMap functions
  • mapPartitions
  • There was a question on byKey transformations and also on Hadoop streaming, which I am not sure about.
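As referenced in the list above, a small sketch of the saveAsTextFile()/mkString formatting point, assuming an existing SparkContext named sc and a writable output path:

val pairs = sc.parallelize(Seq(("apple", 3), ("banana", 5)))

// Saving a pair RDD directly writes Scala tuple syntax:
// pairs.saveAsTextFile("out-raw")            // lines look like (apple,3)

// Map each record through mkString first to drop the parentheses:
pairs.map { case (k, v) => Seq(k, v).mkString(",") }
     .saveAsTextFile("out-clean")             // lines look like apple,3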
Hope this blog helps in your preparation. Please let me know or email me if you have any other questions. Happy studying!

At the end  - Here is my certification
