Features are numeric or categorical. Feature engineering techniques are used to define features more accurately for your model.
Income is not in a linear relationship with age: children under 17 earn little, and earnings drop again after retirement.
Solution - bucket the numeric age feature into age groups (categorical features) and learn a different weight for each age group. This is how an age bucket is created.
Problem - can a linear classifier model an interaction between multiple features, say age and education, against income?
No. This is where feature crossing is useful: for each cross (age bucket, education) we create a new true/false feature, so the model can learn a separate weight for every (age bucket, education) combination.
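A minimal sketch of bucketing and crossing in plain Scala; the bucket boundaries and the string encoding of the cross are illustrative assumptions, not values from these notes:
// Illustrative age bucketing and (age bucket x education) feature cross.
def ageBucket(age: Int): String = age match {
  case a if a < 18 => "under_18"
  case a if a < 65 => "18_to_64"
  case _           => "65_plus"
}

// A feature cross: one sparse true/false feature per (bucket, education) pair.
def crossFeature(age: Int, education: String): String =
  s"${ageBucket(age)}_x_$education"

// e.g. crossFeature(40, "bachelors") == "18_to_64_x_bachelors"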
Hashed buckets are one way to represent a categorical feature with a large vocabulary.
This representation can save memory and is faster to execute.
A categorical feature with a large number of values can be represented even when the vocabulary is not specified in advance.
To reduce collisions, set the number of hash buckets higher than the number of unique values (e.g. unique occupations).
Hashing can also be used deliberately to limit the number of possible values.
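A minimal sketch of the hashed-bucket idea in plain Scala; the bucket count and example value are arbitrary choices for illustration:
// Map an arbitrary categorical string into one of numBuckets hash buckets.
// No vocabulary needs to be declared in advance; distinct values may collide.
def hashBucket(value: String, numBuckets: Int): Int =
  Math.floorMod(value.hashCode, numBuckets)

// e.g. with 1000 buckets, each occupation string lands in some bucket 0..999
val bucket = hashBucket("data_scientist", 1000)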
Embeddings, by contrast, are dense representations.
1. The K-Nearest Neighbors algorithm (K-NN or KNN) is a supervised learning method used for classification and regression.
2. For classification, the output of the K-NN algorithm is the classification of an unknown data point based on the k 'nearest' neighbors in the training data.
3. For regression, the output is an average of the values of a target variable based on the k 'nearest' neighbors in the training data.
4. A very high value of k (e.g. k = 100) produces an overly generalised model, while a very low value of k (e.g. k = 1) produces a highly complex model.
5. A difficulty that arises when classifying out-of-sample data is that the actual classification may not be known, making it hard to judge the accuracy of the result.
In weighted K-NN, the neighbor weights must sum to 1.
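A minimal K-NN classifier sketch in plain Scala; the data layout, Euclidean distance metric, and unweighted majority vote are standard choices assumed for illustration:
// Labeled training point: feature vector plus class label.
case class Labeled(features: Array[Double], label: String)

def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Classify a query point by majority vote among its k nearest training points.
def knnClassify(train: Seq[Labeled], query: Array[Double], k: Int): String =
  train
    .sortBy(p => euclidean(p.features, query))
    .take(k)
    .groupBy(_.label)
    .maxBy(_._2.size)
    ._1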
- Model Evaluation: Overfitting & Underfitting
- Bias
- Bias is the error that results from incorrect assumptions and relations that the model makes.
- High bias causes an overly generalized model, which leads to underfitting.
- Variance
- Variance is the inconsistency of a model due to small changes in the dataset.
- Formally, variance is the expected value of the squared deviation of a random variable from its mean: Var(X) = E[(X - E[X])^2].
- High variance means the model changes drastically due to minor modifications of the data; this is overfitting - too much dependence on the training data.
- A good balance keeps the model general enough for out-of-sample data but specific enough to fit the pattern of the data.
Metrics -
Error is the difference between a data point and the trend line generated by the algorithm.
There are three main model evaluation metrics we'll look at:
1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. Root Mean Squared Error (RMSE) - the most commonly used
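A minimal sketch of the three metrics in plain Scala, assuming parallel sequences of actual and predicted values of equal length:
// actual(i) and predicted(i) refer to the same observation.
def mae(actual: Seq[Double], predicted: Seq[Double]): Double =
  actual.zip(predicted).map { case (a, p) => math.abs(a - p) }.sum / actual.size

def mse(actual: Seq[Double], predicted: Seq[Double]): Double =
  actual.zip(predicted).map { case (a, p) => (a - p) * (a - p) }.sum / actual.size

// RMSE is the square root of MSE, so the error keeps the units of the target.
def rmse(actual: Seq[Double], predicted: Seq[Double]): Double =
  math.sqrt(mse(actual, predicted))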
Unsupervised Learning -
- K-Means Clustering plus Advantages & Disadvantages
K-Means can group unlabeled data: data points are grouped together by using a centroid per cluster and the distances from the centroid to the other points.
Euclidean distance is used to measure the distance from each object to its centroid.
Advantages - easy to understand and fast.
Disadvantages - high variation between clustering runs; a centroid may end up with no assigned data points and therefore never be updated.
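A minimal sketch of one K-Means iteration (assignment plus centroid update) in plain Scala; the two-dimensional point type is an illustrative assumption:
type Point = (Double, Double)

def dist(a: Point, b: Point): Double =
  math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

// One Lloyd's-algorithm step: assign each point to its nearest centroid,
// then move each centroid to the mean of its assigned points.
def kMeansStep(points: Seq[Point], centroids: Seq[Point]): Seq[Point] = {
  val assigned = points.groupBy(p => centroids.minBy(c => dist(p, c)))
  centroids.map { c =>
    assigned.get(c) match {
      case Some(ps) => (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
      case None     => c // a centroid with no points stays put (the disadvantage noted above)
    }
  }
}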
- Hierarchical Clustering plus Advantages & Disadvantages
Dendrogram and proximity matrix (the distance from each point to every other point).
- Measuring the Distances Between Clusters - Single Linkage Clustering
- Measuring the Distances Between Clusters - Algorithms for Hierarchical Clustering
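A minimal sketch of single-linkage cluster distance in plain Scala, reusing Point and dist from the K-Means sketch above; single linkage takes the minimum pairwise distance between the two clusters (complete linkage would take the maximum, average linkage the mean):
// Single-linkage distance: the smallest distance between any point in
// cluster a and any point in cluster b.
def singleLinkage(a: Seq[Point], b: Seq[Point]): Double =
  (for (p <- a; q <- b) yield dist(p, q)).min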
- Density-Based Clustering - DBSCAN
Two parameters are taken into account: epsilon and minimum points. Epsilon is the maximum radius of the neighborhood, and minimum points is the minimum number of points in the epsilon-neighborhood required to define a cluster. Points are classified into three types: Core, Border, and Outlier.
DBSCAN can be used to remove outliers and still identify the clusters accurately.
K-Means cannot distinguish between noise and clusters.
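A minimal sketch of classifying points as Core, Border, or Outlier in plain Scala, reusing Point and dist from the K-Means sketch; the full DBSCAN cluster-expansion step is omitted:
sealed trait PointType
case object Core    extends PointType
case object Border  extends PointType
case object Outlier extends PointType

// A point is Core if its epsilon-neighborhood holds at least minPts points
// (counting itself), Border if it lies within epsilon of a core point,
// and Outlier otherwise.
def classify(points: Seq[Point], eps: Double, minPts: Int): Map[Point, PointType] = {
  def neighbors(p: Point) = points.filter(q => dist(p, q) <= eps)
  val core = points.filter(p => neighbors(p).size >= minPts).toSet
  points.map { p =>
    val t =
      if (core(p)) Core
      else if (neighbors(p).exists(core)) Border
      else Outlier
    p -> t
  }.toMap
}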
Dimensionality Reduction can be divided into two subcategories:
- Feature Selection, which includes Wrappers, Filters, and Embedded methods.
- Feature Extraction, which includes Principal Component Analysis.
Feature Selection is the process of selecting a subset of relevant features or variables.
Wrappers - Wrappers use a predictive model that scores feature subsets based on the error rate of the model. Wrappers are computationally expensive but provide the best selection. A popular technique is called stepwise regression.
Filters - The resulting feature set is more general than with wrappers. Filters use a proxy measure, which is less computationally intensive but slightly less accurate.
An interesting fact about filters is that they produce a feature set that doesn't contain assumptions based on the predictive model, making them a useful tool for exposing relationships between features, such as which variables are 'bad' together (and, as a result, drop the accuracy) or 'good' together (and therefore raise the accuracy).
Embedded - Embedded methods perform feature selection as part of the model training process itself (for example, regularization that shrinks some feature weights to zero).
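A minimal sketch of a wrapper-style forward stepwise selection loop in plain Scala; scoreSubset is a hypothetical stand-in for training a model on a candidate subset and returning its score (higher is better):
// Greedy forward selection: repeatedly add the single feature that most
// improves the score until no remaining feature helps.
def forwardSelect(allFeatures: Set[String],
                  scoreSubset: Set[String] => Double): Set[String] = {
  @annotation.tailrec
  def loop(selected: Set[String], bestScore: Double): Set[String] = {
    val candidates = (allFeatures -- selected).map(f => (f, scoreSubset(selected + f)))
    if (candidates.isEmpty) selected
    else {
      val (bestF, score) = candidates.maxBy(_._2)
      if (score > bestScore) loop(selected + bestF, score)
      else selected
    }
  }
  loop(Set.empty, Double.NegativeInfinity)
}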
Principal Component Analysis is the reduction of higher-dimensional vector spaces to lower-dimensional ones through projection. It can be used to visualize a dataset through a compact representation and to compress the number of dimensions.
An easy way to picture this is the projection from a 3-dimensional space onto a 2-dimensional plane. First, a plane is found that captures most (if not all) of the information. Then the data is projected onto the new axes and a reduction in dimensions occurs. The new axes created to describe the relationship are called the principal axes, and the projected data are called the principal components.
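A hedged sketch of this projection using spark.mllib's RowMatrix, assuming an existing SparkContext sc; the toy 3-dimensional vectors are made up for illustration:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy 3-dimensional data; in practice these would be your feature vectors.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(2.0, 4.1, 6.2),
  Vectors.dense(3.0, 6.2, 9.1)
))

val mat = new RowMatrix(rows)

// Principal axes: the top-2 principal components as a 3x2 matrix.
val principalAxes = mat.computePrincipalComponents(2)

// Principal components: the original rows projected onto the new 2-D axes.
val projected = mat.multiply(principalAxes)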
Recommendation - Collaborative Filtering
Collaborative Filtering techniques explore the idea that relationships exist between products and people's interests.
As the Netflix Prize competition demonstrated, matrix factorization models are superior to classic nearest-neighbor techniques for producing product recommendations, and they allow the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels.
One popular example of Collaborative Filtering is Netflix. Everything on their site is driven by their customers' selections, which, if made frequently enough, get turned into recommendations. Netflix orders these recommendations so that the highest-ranking items are most visible to users, in the hope of getting them to select those recommendations as well.
Another popular example is Amazon.com. Amazon's item recommendation system is based on what you've previously purchased, as well as how frequently you've looked at certain books or other items during previous visits to the website. The advantage of Collaborative Filtering is that users get broader exposure to many different products they might be interested in, which encourages continued usage or purchases.
Challenges -
1. Data Sparsity. A large dataset will most likely result in a user-item matrix that is large and sparse, which may provide a good level of accuracy but also poses a risk to speed. In comparison, a small dataset gives faster speeds but lower accuracy.
2. Cold Start. New users do not yet have a sufficient number of ratings to give an accurate recommendation.
3. Scalability. As volume increases, computation time grows and recommendations are delayed.
4. Synonyms. 'Synonyms' refers to items that are essentially the same but are labeled differently, and are therefore treated differently by the recommendation system, for example 'Backpack' vs. 'Knapsack'.
5. Gray Sheep. 'Gray Sheep' are users whose opinions don't fit any specific grouping. They do not consistently agree or disagree with other users, so recommendations are of little benefit to them.
6. Shilling Attacks. Shilling attacks abuse the system by rating certain products high and others low regardless of personal opinion, causing the boosted products to be recommended more often.
7. Long Tail effect. Popular items are rated and viewed frequently, which creates a cycle where new items remain in the shadow of the popular ones.
It is common in many real-world use cases to only have access to implicit feedback (e.g. views, clicks, purchases, likes, shares, etc.). The approach used in spark.mllib to deal with such data is taken from Collaborative Filtering for Implicit Feedback Datasets. Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data as numbers representing the strength in observations of user actions (such as the number of clicks, or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of confidence in observed user preferences, rather than explicit ratings given to items. The model then tries to find latent factors that can be used to predict the expected preference of a user for an item.
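A hedged sketch of training the implicit-feedback variant with spark.mllib, assuming an RDD[Rating] named implicitRatings already built from implicit strengths such as click counts; the rank, iteration, lambda, and alpha values are illustrative:
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// "Ratings" here are implicit strengths (e.g. click counts), not star ratings.
val model = ALS.trainImplicit(
  implicitRatings,
  /* rank       = */ 10,
  /* iterations = */ 20,
  /* lambda     = */ 0.01,
  /* alpha      = */ 1.0
)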
Rank - it is purely a characteristic of the data; the rank refers to the presumed latent or hidden factors. For example, if you were measuring how much different people liked movies and tried to cross-predict them, you might have three fields: person, movie, number of stars. Now, let's say you were omniscient and knew the absolute truth: all the movie ratings could be perfectly predicted by just 3 hidden factors - sex, age, and income. In that case the "rank" of your run should be 3.
Of course, you don't know how many underlying factors, if any, drive your data, so you have to guess. The more you use, the better the results, up to a point, but the more memory and computation time you will need.
One way to work it out is to start with a rank of 5-10, then increase it, say 5 at a time, until your results stop improving. That way you determine the best rank for your dataset by experimentation.
spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.mllib has the following parameters:
- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in the model.
- iterations is the number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less.
- lambda specifies the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
The trained model is a MatrixFactorizationModel(rank, userFeatures, productFeatures).
Example ALS parameter configuration:
{
  "name": "als",
  "params": {
    "rank": 10,
    "numIterations": 20,
    "lambda": 0.01,
    "seed": 3
  }
}
import org.apache.spark.mllib.recommendation.ALS

// ap holds the algorithm parameters above (rank, numIterations, lambda),
// mllibRatings is an RDD[Rating] of user/product ratings, and seed fixes
// the random initialization for reproducibility.
val implicitPrefs = false

val als = new ALS()
als.setUserBlocks(-1)               // -1 auto-configures the number of user blocks
als.setProductBlocks(-1)            // -1 auto-configures the number of product blocks
als.setRank(ap.rank)                // number of latent factors
als.setIterations(ap.numIterations)
als.setLambda(ap.lambda)            // regularization parameter
als.setImplicitPrefs(implicitPrefs) // explicit-feedback variant in this example
als.setAlpha(1.0)                   // only used by the implicit-feedback variant
als.setSeed(seed)
als.setCheckpointInterval(10)       // checkpoint periodically to avoid long lineage

val m = als.run(mllibRatings)       // returns a MatrixFactorizationModel
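A short follow-up on using the trained model; the user ID 1 and product ID 42 below are placeholders:
// Predict the rating user 1 would give product 42, then fetch that user's
// top-5 recommended products.
val predictedRating = m.predict(1, 42)
val top5 = m.recommendProducts(1, 5)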