Recommendation - Collaborative Filtering
Collaborative Filtering techniques explore the idea that relationships exist between products and people's interests. As the Netflix Prize competition demonstrated, matrix factorization models are superior to classic nearest-neighbor techniques for producing product recommendations, as they allow the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels.
Netflix vs. Amazon recommendation -
One popular example of Collaborative Filtering is Netflix. Everything on their site is driven by their customers' selections, which, if made frequently enough, get turned into recommendations. Netflix orders these recommendations so that the highest-ranking items are the most visible to users, in the hope of getting them to select those recommendations as well.
Another popular example is Amazon.com. Amazon's item recommendation system is based on what you've previously purchased, as well as the frequency with which you've looked at certain books or other items during previous visits to their website. The advantage of using Collaborative Filtering is that users get broader exposure to many different products they might be interested in. This exposure encourages continued usage and purchases.
I built a recommendation engine using PredictionIO. If you are interested in learning more about the implementation, you can send me an email and I can respond with details on how to design events, etc.
I will just give pointers here - you can find the code in my GitHub repo - https://github.com/pawan-agnihotri
PredictionIO - Overview
What: Apache PredictionIO® is a framework for machine learning - a machine learning server built on top of Apache Spark, Spark MLlib, and HBase.
- Apache License, Version 2.0
- Written in Scala, based on Spark, and implements the Lambda Architecture
- Supports Spark MLlib and OpenNLP
- Supports batch and real-time event ingestion and predictions
- Responds to dynamic queries in real time via a REST API
Who/When: The company was founded in 2013 and is based in Walnut, California. It was acquired by Salesforce in February 2016 and is currently used in Salesforce Einstein (Salesforce's AI initiative).
Product Recommender - Built using PredictionIO
Build a model that produces individualized recommendations and serves them in real time.
User inputs:
- Like/buy/view events
- Prediction query
Output:
- Prediction result
Transaction Classifier
Build a model that classifies a user transaction (att0, att1, att2) into multiple categories (0-low, 1-medium, 2-high, 3-very high); see the sketch after this list.
User inputs:
- Events
- Prediction query
Output:
- Prediction result
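As an illustration, here is a minimal sketch of such a classifier using Spark MLlib's NaiveBayes (the actual PredictionIO classification template may differ; the sample feature values are made up, and `sc` is assumed to be an existing SparkContext):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Each transaction: label in {0, 1, 2, 3} and features (att0, att1, att2).
// The sample values here are hypothetical, purely for illustration.
val transactions = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.1, 0.2, 0.3)),   // low
  LabeledPoint(1.0, Vectors.dense(1.4, 0.9, 1.1)),   // medium
  LabeledPoint(2.0, Vectors.dense(3.8, 4.0, 3.5)),   // high
  LabeledPoint(3.0, Vectors.dense(9.5, 8.1, 7.7))    // very high
))

val model = NaiveBayes.train(transactions, lambda = 1.0)

// Classify a new transaction into one of the four categories.
println(model.predict(Vectors.dense(0.2, 0.1, 0.4)))  // expected: 0.0 (low)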
Goal: Build a machine learning model that can serve predictions in real time.
Step 1: Create the model using Spark MLlib
Step 2: Build the model
Step 3: Create test/training data
Step 4: Train and deploy the model
Step 5: Use the REST API (sketched below)
- Post event data to the Event Server (in real time)
- Make predictions (in real time)
Step 6: Incorporate the prediction into your application
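As a rough sketch of Step 5, assuming the default ports (Event Server on 7070, deployed engine on 8000) and the standard recommendation template's query format - the access key, user/item IDs, and event values are placeholders:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object PioRestSketch {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()

    // 1) Post a "buy" event to the Event Server (default port 7070).
    //    YOUR_ACCESS_KEY is the key generated by `pio app new`.
    val event =
      """{"event": "buy",
        |  "entityType": "user", "entityId": "u1",
        |  "targetEntityType": "item", "targetEntityId": "i42"}""".stripMargin
    val postEvent = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(event))
      .build()
    println(client.send(postEvent, HttpResponse.BodyHandlers.ofString()).body())

    // 2) Query the deployed engine (default port 8000) for 4 recommendations.
    val query = """{"user": "u1", "num": 4}"""
    val askEngine = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8000/queries.json"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(query))
      .build()
    println(client.send(askEngine, HttpResponse.BodyHandlers.ofString()).body())
  }
}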
Challenges -
1. Data Sparsity - A large dataset will most likely result in a user-item matrix that is large and sparse, which may provide a good level of accuracy but also poses a risk to speed. In comparison, a small dataset results in faster speeds but lower accuracy.
2. Cold Start - New users do not yet have a sufficient number of ratings for the system to give them accurate recommendations.
3. Scalability - As the volume of users and items increases, the computation grows and causes delays.
4. Synonyms - 'Synonyms' refers to items that are similar but labeled differently, and are thus treated differently by the recommendation system. An example of this would be 'Backpack' vs. 'Knapsack'.
5. Gray Sheep - 'Gray Sheep' refers to users whose opinions don't 'fit' any specific grouping. These users do not consistently agree or disagree with others on products or items, so recommendations are of little benefit to them.
6. Shilling Attacks - Shilling attacks abuse the system by rating certain products high and others low regardless of personal opinion, causing the favored products to be recommended more often.
7. Long Tail Effect - Popular items are rated and viewed frequently, creating a cycle in which new items remain in the shadow of the popular items and are rarely rated or recommended.
It is common in many real-world use cases to only have access to implicit feedback (e.g. views, clicks, purchases, likes, shares, etc.). The approach used in spark.mllib to deal with such data is taken from Collaborative Filtering for Implicit Feedback Datasets. Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data as numbers representing the strength of observations of user actions (such as the number of clicks, or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of confidence in observed user preferences, rather than explicit ratings given to items. The model then tries to find latent factors that can be used to predict the expected preference of a user for an item.
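A minimal training sketch on such data, using spark.mllib's ALS.trainImplicit (the click counts are made up for illustration, and `sc` is assumed to be an existing SparkContext):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// "Ratings" here are implicit strengths (e.g. click counts), not explicit scores.
// These sample (user, item, clicks) triples are hypothetical.
val clicks = sc.parallelize(Seq(
  Rating(1, 10, 5.0),   // user 1 clicked item 10 five times
  Rating(1, 20, 1.0),
  Rating(2, 10, 3.0),
  Rating(2, 30, 8.0)
))

// Confidence-weighted variant: rank, iterations, lambda, alpha.
val model = ALS.trainImplicit(clicks, 10, 20, 0.01, 1.0)

// The prediction is a preference score, not a predicted click count.
println(model.predict(1, 30))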
Of course, you don't know how many underlying factors, if any, drive your data, so you have to guess. The more factors you use, the better the results, up to a point, but the more memory and computation time you will need.
One way to approach it is to start with a rank of 5-10, then increase it, say 5 at a time, until your results stop improving. That way you determine the best rank for your dataset by experimentation, as in the sketch below.
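A minimal sketch of that search, assuming `training` and `validation` are RDD[Rating] splits you have already prepared from your data:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val numIterations = 20
val lambda = 0.01

// Try ranks 5, 10, ..., 30 and keep the one with the lowest validation RMSE.
val bestRank = (5 to 30 by 5).minBy { rank =>
  val model = ALS.train(training, rank, numIterations, lambda)
  // Predict ratings for the validation (user, product) pairs.
  val predictions = model
    .predict(validation.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val rmse = math.sqrt(
    validation.map(r => ((r.user, r.product), r.rating))
      .join(predictions)
      .values
      .map { case (actual, predicted) => math.pow(actual - predicted, 2) }
      .mean())
  println(s"rank=$rank RMSE=$rmse")
  rmse
}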
spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.mllib has the following parameters:
- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in the model.
- iterations is the number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less.
- lambda specifies the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
The trained model is a MatrixFactorizationModel(rank, userFeatures, productFeatures). The algorithm parameters are configured in JSON (e.g. in the engine's engine.json):
{
  "name": "als",
  "params": {
    "rank": 10,
    "numIterations": 20,
    "lambda": 0.01,
    "seed": 3
  }
}
import org.apache.spark.mllib.recommendation.ALS

val implicitPrefs = false               // set to true for implicit feedback (views, clicks)
val als = new ALS()
als.setUserBlocks(-1)                   // -1 auto-configures parallelism
als.setProductBlocks(-1)
als.setRank(ap.rank)                    // number of latent factors, from the "als" params above
als.setIterations(ap.numIterations)
als.setLambda(ap.lambda)                // regularization
als.setImplicitPrefs(implicitPrefs)
als.setAlpha(1.0)                       // only used when implicitPrefs is true
als.setSeed(seed)
als.setCheckpointInterval(10)           // checkpoint periodically to truncate RDD lineage
val m = als.run(mllibRatings)           // mllibRatings: RDD[Rating] built from the events
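Once trained, the MatrixFactorizationModel can serve recommendations directly; a small usage sketch (the user and item IDs here are hypothetical):

// Top-10 recommendations for user 42.
val top10 = m.recommendProducts(42, 10)
top10.foreach(r => println(s"item=${r.product} score=${r.rating}"))

// Predicted preference of user 42 for item 7.
println(m.predict(42, 7))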
Possible improvements -
1. Hierarchical matrix co-clustering / factorization
2. Preference versus intention - distinguish between liking an item and being interested in seeing/purchasing it; it is worthless to recommend an item a user has already bought
3. Scalability
4. Relevant objectives - predicting the actual rating may be useless! (it relies on the missing-at-random assumption)
Drawbacks of our model -
1. Multiple individuals using the same account - individual preferences get blended
2. Cold start (new users)