Semi-Supervised Learning
There are so many algorithms available that it can feel overwhelming when algorithm names are thrown around and you are expected to just know what they are and where they fit.
I explained supervised and unsupervised learning in my previous blog post, so I won't go into detail on them here. Instead, I will focus on semi-supervised learning and feature engineering.
Semi-supervised learning trains a model on a mix of labeled and unlabeled data. Semi-supervised methods are used in areas such as image classification, where datasets are large but only a few examples are labeled.
Learning style algorithms -
Supervised algorithms
Example problems are classification and regression.
Example algorithms include Logistic Regression and the Back Propagation Neural Network.
Unsupervised algorithms
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include the Apriori algorithm and k-Means.
Feature Engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
Features are how you represent the world to the classifier. Feature selection has a multiplicative effect on the overall modeling process.
Features are numeric or categorical. The following feature engineering techniques are used to define features more accurately for your model:
- Bucketing
- Crossing
- Hashing
- Embedding
Feature Bucketing - transforms a numeric feature into a categorical feature.
Problem - As age increases, does income increase with it in a linear relationship?
Income is not in a linear relationship with age: children under 17 don't earn much, and neither do people after retirement.
Solution - Bucket the age (a numeric feature) into age groups (a categorical feature) and learn a different weight for each age group. This is how we create age buckets.
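As a rough illustration of age bucketing, here is a minimal Python sketch using NumPy. The bucket boundaries and the helper name bucketize_age are assumptions made up for this example; the right boundaries depend on your data.

```python
import numpy as np

# Hypothetical bucket boundaries in years; pick these from your own data.
AGE_BOUNDARIES = [18, 25, 35, 50, 65]

def bucketize_age(ages, boundaries=AGE_BOUNDARIES):
    """Map each numeric age to a bucket index, then one-hot encode it."""
    ages = np.asarray(ages)
    # np.digitize returns i such that boundaries[i-1] <= age < boundaries[i].
    bucket_ids = np.digitize(ages, boundaries)
    # One column per bucket, so a linear model can learn one weight per bucket.
    one_hot = np.eye(len(boundaries) + 1, dtype=int)[bucket_ids]
    return bucket_ids, one_hot

bucket_ids, one_hot = bucketize_age([10, 17, 18, 40, 70])
print(bucket_ids)
print(one_hot)
```

Each row of one_hot has exactly one 1, so the model is free to give children, working-age adults and retirees very different weights.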
Feature Crossing - a way to create a new feature that is a combination of existing features.
Problem - Can a linear classifier model the interaction between multiple features, say age and education, against income?
No. This is where feature crossing is useful. For each cross (age bucket, education) we create a new true/false feature, so the model can learn a separate weight for every combination of age bucket and education.
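A feature cross can be sketched in a few lines of plain Python. The bucket names, education levels and the cross_features helper below are hypothetical, chosen only to show the idea of one true/false feature per combination.

```python
from itertools import product

AGE_BUCKETS = ["under_18", "18_to_64", "65_plus"]      # hypothetical buckets
EDUCATION = ["high_school", "bachelors", "masters"]    # hypothetical levels

# One true/false feature per (age bucket, education) combination.
CROSSED_VOCAB = [f"{a}_x_{e}" for a, e in product(AGE_BUCKETS, EDUCATION)]

def cross_features(age_bucket, education):
    """Return a 0/1 vector with a single 1 at the matching combination."""
    key = f"{age_bucket}_x_{education}"
    return [1 if name == key else 0 for name in CROSSED_VOCAB]

row = cross_features("18_to_64", "masters")
print(dict(zip(CROSSED_VOCAB, row)))
```

With three age buckets and three education levels the cross has nine features, which is why crossing large vocabularies is usually combined with hashing.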
Feature Hashing (hash buckets)
One way to represent a categorical feature with a large vocabulary.
This representation can save memory and is faster to execute.
A categorical feature with a large number of values can be represented without specifying the vocabulary in advance.
To reduce collisions, set the number of hash buckets higher than the number of unique values (for example, unique occupations).
It can also be used to limit the number of possibilities.
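Here is a small sketch of hash bucketing in Python. The helper name hash_bucket and the choice of md5 are assumptions for illustration; md5 is used only because it gives the same result on every run, unlike Python's built-in hash() for strings.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map a string category to a bucket id in the range [0, num_buckets)."""
    # Any stable hash would do here; md5 is not used for security.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# No vocabulary is declared up front: unseen occupations still get a bucket.
occupations = ["engineer", "teacher", "nurse", "pilot"]
bucket_ids = {job: hash_bucket(job, num_buckets=1000) for job in occupations}
print(bucket_ids)
```

Two different occupations can still land in the same bucket; a larger num_buckets makes that less likely at the cost of more memory.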
Embedding - represents the meaning of a word as a vector.
Used for large vocabularies.
Embeddings are dense.
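An embedding is essentially a lookup table from a word id to a dense vector. The sketch below assumes a tiny made-up vocabulary and fills the table with random numbers; in a real model these values would be learned during training.

```python
import numpy as np

vocab = {"cat": 0, "dog": 1, "car": 2}   # hypothetical vocabulary
embedding_dim = 4

rng = np.random.default_rng(0)
# Random placeholders standing in for learned embedding values.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(words):
    """Look up the dense vector for each word."""
    ids = [vocab[w] for w in words]
    return embedding_table[ids]

vectors = embed(["dog", "cat"])
print(vectors.shape)
```

Compared with a one-hot vector of vocabulary size, each word is stored in only embedding_dim numbers, which is what makes embeddings practical for large vocabularies.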
The K-Nearest Neighbors algorithm (K-NN or KNN) is a supervised learning method used for classification and regression.
* For classification, the output of the K-NN algorithm is the classification of an unknown data point based on the k 'nearest' neighbors in the training data.
* For regression, the output is an average of the values of a target variable based on the k 'nearest' neighbors in the training data.
A very high value of k (e.g. k = 100) produces an overly generalised model, while a very low value of k (e.g. k = 1) produces a highly complex model.
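Both uses of K-NN can be sketched with one small function. This is a minimal illustration with made-up data, assuming Euclidean distance and a simple majority vote; the name knn_predict is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, task="classification"):
    """Predict for one point x from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    distances = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    neighbors = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the k nearest labels.
        return Counter(neighbors).most_common(1)[0][0]
    # Regression: average of the k nearest target values.
    return float(np.mean(neighbors))

X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
labels = ["a", "a", "a", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0]

label = knn_predict(X, labels, [1.1, 1.0], k=3)
value = knn_predict(X, values, [5.1, 5.0], k=2, task="regression")
print(label, value)
```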
A difficulty that arises from trying to classify out-of-sample data is that the actual classification may not be known, therefore making it hard to produce an accurate result.
The sum of the weights must be equal to 1.
- Model Evaluation:
Overfitting & Underfitting
- Bias
- Bias is the error that results from incorrect assumptions and relations that the model makes.
- High bias produces an overly generalized model, which causes underfitting.
- Variance
- Variance is the inconsistency of a model due to small changes in the dataset.
- In statistics, variance is the expected value of the squared deviation of a random variable from its mean.
- High variance means the model changes drastically after minor modifications to the data. This is overfitting: the model depends too heavily on the training data.
- A good balance is keeping the model general enough for out-of-sample data but specific enough to fit the pattern of the data.
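One way to see the trade-off is to fit polynomials of increasing degree to noisy data and compare training error with error on held-out points. The data, noise level and degrees below are assumptions picked for the sketch, not a recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0.03, 0.97, 15)

def noisy_sine(x):
    """Synthetic target: a sine wave plus Gaussian noise."""
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

y_train, y_test = noisy_sine(x_train), noisy_sine(x_test)

def fit_and_score(degree):
    """Fit a polynomial on the training set, return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 1 is likely too simple (high bias); degree 9 is likely too flexible
# (high variance); degree 3 sits in between.
scores = {d: fit_and_score(d) for d in (1, 3, 9)}
for degree, (train_mse, test_mse) in scores.items():
    print(degree, round(train_mse, 4), round(test_mse, 4))
```

The training error can only go down as the degree grows, so it is the held-out error that tells you whether the extra flexibility is fitting the pattern or just the noise.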
Metrics –
Error is the difference between a data point and the trend line generated by the algorithm.
There are three main model evaluation metrics we'll look at:
* Mean Absolute Error (MAE),
* Mean Squared Error (MSE), and
* Root Mean Squared Error (RMSE), the most commonly used.
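The three metrics are short enough to write out directly. A minimal sketch in plain Python, with made-up values for y_true and y_pred:

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average squared error, punishes large misses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: MSE brought back to the target's units."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Here the errors are 1, 0, 2 and 0, so MAE is 0.75 and MSE is 1.25; RMSE is the square root of that.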
Unsupervised Learning -
- K-Means Clustering
Euclidean distance is used to measure the distance from an object to the centroid.
Advantage - easy to understand and fast.
Disadvantage - the resulting clusters can vary a lot from run to run, and a centroid may end up with no data points assigned, so it is never updated.
- Hierarchical Clustering plus Advantages & Disadvantages
- Measuring the Distances Between Clusters
- Density-Based Clustering - DBSCAN
DBSCAN can identify outliers as noise and still find the clusters accurately.
K-means can't distinguish between noise and clusters.
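To make the K-Means description concrete, here is a bare-bones version in NumPy. It is a sketch under simple assumptions (random initial centroids taken from the data, a fixed number of iterations), and the function name k_means is made up for the example.

```python
import numpy as np

def k_means(points, k, iterations=20, seed=0):
    """Cluster points into k groups using Euclidean distance."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every point to every centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        assignments = distances.argmin(axis=1)
        for j in range(k):
            members = points[assignments == j]
            # A centroid with no points assigned is simply left where it is.
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments

data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, assignments = k_means(data, k=2)
print(assignments)
print(centroids)
```

Note that every point gets a cluster, including outliers, which is the limitation that density-based methods address.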
Dimensionality Reduction can be divided into two subcategories:
- Feature Selection, which includes Wrappers, Filters, and Embedded methods.
- Feature Extraction, which includes Principal Component Analysis.
Feature Selection is the process of selecting a subset of relevant features or variables.
Wrapper - Wrappers use a predictive model that scores feature subsets based on the error rate of the model. Wrappers are computationally expensive but usually provide the best-performing selection for that model.
A popular technique is called stepwise regression.
Filters - The feature set is more general than a wrapper's. Filters use a proxy measure which is less computationally intensive but slightly less accurate.
An interesting fact about filters is that they produce a feature set that doesn't contain assumptions based on the predictive model, making them a useful tool for exposing relationships between features, such as which variables are 'bad' together and therefore lower the accuracy, or 'good' together and therefore raise it.
Embedded-
Principal Component Analysis (PCA) reduces a higher-dimensional vector space to a lower-dimensional one through projection. It can be used to visualize a dataset through a compact representation and compression of dimensions.
An easy example is the projection from a 3-dimensional space onto a 2-dimensional plane. A plane is first found which captures most (if not all) of the information. Then the data is projected onto new axes and a reduction in dimensions occurs. When the projection happens, new axes are created to describe the relationship. These are called the principal axes, and the new data is called the principal components.
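The 3-D to 2-D projection can be sketched with NumPy's SVD. The synthetic data (points lying close to a plane) and the helper name pca are assumptions for illustration only.

```python
import numpy as np

def pca(data, n_components):
    """Project data onto its top principal axes."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components]
    components = centered @ axes.T
    explained = (singular_values ** 2) / np.sum(singular_values ** 2)
    return components, axes, explained[:n_components]

rng = np.random.default_rng(0)
# Hypothetical 3-D points that lie close to a 2-D plane.
basis = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]])
points_3d = rng.normal(size=(200, 2)) @ basis
points_3d = points_3d + rng.normal(scale=0.01, size=(200, 3))

points_2d, axes, explained = pca(points_3d, n_components=2)
print(points_2d.shape, explained.sum())
```

The explained values show how much of the variance each principal axis captures, which is how you judge whether the chosen plane keeps "most of the information".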
Recommendation - Collaborative Filtering
Collaborative Filtering techniques explore the idea that relationships exist between products and people's interests.
As the Netflix Prize competition has demonstrated, matrix factorization models are superior to classic nearest-neighbor techniques for producing product recommendations, allowing the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels.
One popular example of Collaborative Filtering is Netflix. Everything on their site is driven by their customers' selections, which, if made frequently enough, get turned into recommendations. Netflix orders these recommendations so that the highest-ranking items are more visible to users, in the hope of getting them to select those recommendations as well.
Another popular example is amazon.com. Amazon's item recommendation system is based on what you've previously purchased, as well as how often you've looked at certain books or other items during previous visits to their website. The advantage of using Collaborative Filtering is that users get broader exposure to many different products they might be interested in. This exposure encourages users towards continual usage or purchase of their product.
Challenges -
1. Data Sparsity
A large dataset will most likely result in a large, sparse user-item matrix, which may provide a good level of accuracy but also poses a risk to speed. In comparison, a small dataset would result in faster speeds but lower accuracy.
2. Cold Start
This is where new users do not have enough ratings for the system to give them accurate recommendations.
3. Scalability
As the volume of users and items increases, computing recommendations takes longer.
4. Synonyms
'Synonyms' refers to items that are similar but labeled differently, and thus treated differently by the recommendation system. An example of this would be 'Backpack' vs 'Knapsack'.
5. Gray Sheep
'Gray Sheep' refers to users whose opinions don't fit, or aren't similar to, any specific grouping. These users do not consistently agree or disagree with others on products or items, so recommendations are of little benefit to them.
6. Shilling Attacks
Shilling Attacks are the abuse of the system by rating certain products high and other products low regardless of personal opinion, so that the favored product is recommended more often.
7. Long Tail effect
Popular items are rated and viewed frequently. This creates a cycle in which new items stay in the shadow of the popular items.
It is common in many real-world use cases to only have access to implicit feedback (e.g. views, clicks, purchases, likes, shares etc.). The approach used in spark.mllib to deal with such data is taken from Collaborative Filtering for Implicit Feedback Datasets. Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data as numbers representing the strength in observations of user actions (such as the number of clicks, or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of confidence in observed user preferences, rather than explicit ratings given to items. The model then tries to find latent factors that can be used to predict the expected preference of a user for an item.
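To illustrate the idea of turning raw counts into a preference plus a confidence, here is a small sketch. It assumes the formulation from the Collaborative Filtering for Implicit Feedback Datasets paper (preference is 1 if there was any interaction, confidence is 1 + alpha times the count); this is not Spark's code, and the function name is made up.

```python
import numpy as np

def implicit_to_preference_confidence(counts, alpha=1.0):
    """Split implicit feedback counts into preference and confidence."""
    counts = np.asarray(counts, dtype=float)
    # Preference: did the user interact with the item at all?
    preference = (counts > 0).astype(float)
    # Confidence: grows with the strength of the observation.
    confidence = 1.0 + alpha * counts
    return preference, confidence

# Rows are users, columns are items, values are click counts (made up).
clicks = [[0, 3, 1],
          [5, 0, 0]]
preference, confidence = implicit_to_preference_confidence(clicks, alpha=1.0)
print(preference)
print(confidence)
```

A zero count does not mean the user dislikes the item, only that we have low confidence about it, which is why unobserved entries keep a baseline confidence of 1.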
Of course, you don't know how many underlying factors, if any, drive your data so you have to guess. The more you use, the better the results up to a point, but the more memory and computation time you will need.
One way to work it is to start with a rank of 5-10, then increase it, say 5 at a time until your results stop improving. That way you determine the best rank for your dataset by experimentation.
spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.mllib has the following parameters:
- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in the model.
- iterations is the number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less.
- lambda specifies the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
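To get a feel for what rank, iterations and lambda control, here is a toy explicit-feedback ALS in NumPy. It is a single-machine sketch written for this post, not how spark.mllib implements it; the function name als, the parameter name lam and the tiny ratings matrix are all assumptions.

```python
import numpy as np

def als(ratings, mask, rank=2, iterations=20, lam=0.01, seed=3):
    """Factor ratings into user and item factors using observed entries only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    users = rng.normal(scale=0.1, size=(n_users, rank))
    items = rng.normal(scale=0.1, size=(n_items, rank))
    reg = lam * np.eye(rank)
    for _ in range(iterations):
        # Hold item factors fixed and solve a small ridge regression per user.
        for u in range(n_users):
            seen = mask[u]
            a = items[seen].T @ items[seen] + reg
            b = items[seen].T @ ratings[u, seen]
            users[u] = np.linalg.solve(a, b)
        # Then hold user factors fixed and solve per item.
        for i in range(n_items):
            seen = mask[:, i]
            a = users[seen].T @ users[seen] + reg
            b = users[seen].T @ ratings[seen, i]
            items[i] = np.linalg.solve(a, b)
    return users, items

ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)
mask = ratings > 0   # in this toy matrix, 0 means "not rated"
users, items = als(ratings, mask)
predicted = users @ items.T   # includes predictions for the unrated cells
print(np.round(predicted, 1))
```

The "alternating" part is the two inner loops: each one is an ordinary least squares problem once the other side is held fixed.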
MatrixFactorizationModel(rank, userFeatures, productFeatures)
{
"name": "als",
"params": {
"rank": 10,
"numIterations": 20,
"lambda": 0.01,
"seed": 3
}
}
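// ALS from Spark MLlib. `ap` (the algorithm parameters), `seed` and
// `mllibRatings` are assumed to be defined elsewhere in the surrounding code.
import org.apache.spark.mllib.recommendation.ALS

val implicitPrefs = false  // use the explicit feedback variant
val als = new ALS()
als.setUserBlocks(-1)      // -1 lets Spark auto-configure the number of blocks
als.setProductBlocks(-1)
als.setRank(ap.rank)
als.setIterations(ap.numIterations)
als.setLambda(ap.lambda)
als.setImplicitPrefs(implicitPrefs)
als.setAlpha(1.0)          // only relevant when implicitPrefs is true
als.setSeed(seed)
als.setCheckpointInterval(10)
val m = als.run(mllibRatings)  // returns a MatrixFactorizationModel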
1. Hierarchical matrix co-clustering / factorization(yes)
2. Preference versus intention
Distinguish between liking an item and being interested in seeing or purchasing it.
It is worthless to recommend an item a user has already bought.
3. Scalability
4. Relevant objectives
Predicting the actual rating may be useless! Missing-at-random assumption.
Drawbacks of our model
1. Multiple individuals using the same account, each with individual preferences
2. Cold start (new users)
--------------Deep Learning-------------
Deep learning is a subset of machine learning and functions in a similar way, but its capabilities are different. Deep learning algorithms are capable of determining on their own whether their predictions are accurate or not. This is where deep learning gets tricky :-)
A deep learning model is designed to continually analyze data with a logic structure similar to how a human would draw conclusions.
To achieve this, deep learning uses a layered structure of algorithms called an artificial neural network (ANN). The design of an ANN is inspired by the biological neural network of the human brain. This makes for machine intelligence that’s far more capable than that of standard machine learning models.
Why neural nets - they can recognize complex patterns.
Why neural nets now - deep nets are very hard to train with backpropagation (see the vanishing gradient problem below) and require a lot of computing power. Up until 2006, deep nets were still underperforming shallow nets and other machine learning algorithms. But everything changed after three breakthrough papers published by Hinton, LeCun, and Bengio in 2006 and 2007.
Training and vanishing gradient
When you’re training a neural net, you’re constantly calculating a cost value. The cost is typically the difference between the net’s predicted output and the actual output from a set of labelled training data. The cost is then lowered by making slight adjustments to the weights and biases over and over throughout the training process, until the lowest possible value is obtained. The training process utilizes something called a gradient, which measures the rate at which the cost will change with respect to a change in a weight or a bias.
When the gradient is large, the net will train quickly. When the gradient is small, the net will train slowly.
The process used for training a neural net is called back-propagation or back-prop. We saw before that forward prop starts with the inputs and works forward; back-prop does the reverse, calculating the gradient from right to left.
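A gradient at any point is the product of the previous gradients up to that point, and the product of two numbers between 0 and 1 gives you a smaller number. So when those gradients are small, the gradient that reaches the early layers shrinks towards zero and those layers train very slowly.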
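The shrinking product is easy to see with a few lines of Python. The sketch assumes every layer contributes a local gradient of 0.25, which is the maximum slope of the sigmoid function, so this is a best case for sigmoid layers; the helper name is made up.

```python
def gradient_through_layers(local_gradients):
    """Multiply the local gradients met on the way back through the net."""
    product = 1.0
    for g in local_gradients:
        product *= g
    return product

# 0.25 is the largest slope a sigmoid can have.
shallow = gradient_through_layers([0.25] * 3)    # 3 layers
deep = gradient_through_layers([0.25] * 10)      # 10 layers
print(shallow, deep)
```

After three layers the gradient is already down to about 0.016, and after ten layers it is below one millionth, which is why the earliest layers of a deep net barely move.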
What to use when?
If you have labeled data for supervised learning and you want to build a classifier:
For text processing tasks like sentiment analysis, parsing, and named entity recognition, use a Recurrent Net or a Recursive Neural Tensor Network, which we’ll refer to as an RNTN.
For any language model that operates on the character level, use a Recurrent Net.
For image recognition, use a Deep Belief Network or a Convolutional Net.
For object recognition, use a Convolutional Net or an RNTN.
For speech recognition, use a Recurrent Net.
In general, Deep Belief Networks and Multilayer Perceptrons with rectified linear units (also known as RELU) are both good choices for classification. For time series analysis, it’s best to use a Recurrent Net.
RBM - Restricted Boltzmann Machines and how they overcame the vanishing gradient problem.
An RBM is a shallow, two-layer net; the first layer is known as the visible layer and the second is called the hidden layer. Each node in the visible layer is connected to every node in the hidden layer. An RBM is considered “restricted” because no two nodes in the same layer share a connection.
An RBM is the mathematical equivalent of a two-way translator – in the forward pass, an RBM takes the inputs and translates them into a set of numbers that encode the inputs. In the backward pass, it takes this set of numbers and translates them back to form the re-constructed inputs. A well-trained net will be able to perform the backwards translation with a high degree of accuracy. In both steps, the weights and biases have a very important role. They allow the RBM to decipher the interrelationships among the input features, and they also help the RBM decide which input features are the most important when detecting patterns.
Through several forward and backward passes, an RBM is trained to reconstruct the input data. Three steps are repeated over and over through the training process:
a) With a forward pass, every input is combined with an individual weight and one overall bias, and the result is passed to the hidden layer which may or may not activate.
b) Next, in a backward pass, each activation is combined with an individual weight and an overall bias, and the result is passed to the visible layer for reconstruction.
c) At the visible layer, the reconstruction is compared against the original input to determine the quality of the result.
RBMs use a measure called KL Divergence for step c).
Steps a) through c) are repeated with varying weights and biases until the input and the reconstruction are as close as possible.
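Steps a) to c) can be sketched as a single forward and backward pass in NumPy. This is not a full RBM trainer: the weights are random, there is no weight update, and plain squared error is used as a simple stand-in for KL divergence. The layer sizes and names are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
weights = rng.normal(scale=0.1, size=(n_visible, n_hidden))
hidden_bias = np.zeros(n_hidden)
visible_bias = np.zeros(n_visible)

def reconstruct(visible):
    # a) Forward pass: inputs times weights plus the hidden bias, and each
    #    hidden node may or may not activate.
    hidden_prob = sigmoid(visible @ weights + hidden_bias)
    hidden = (rng.random(n_hidden) < hidden_prob).astype(float)
    # b) Backward pass: activations go back through the same weights,
    #    plus the visible bias, to rebuild the input.
    reconstruction = sigmoid(hidden @ weights.T + visible_bias)
    return hidden, reconstruction

visible = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
hidden, reconstruction = reconstruct(visible)
# c) Compare the reconstruction with the original input.
error = float(np.mean((visible - reconstruction) ** 2))
print(hidden, np.round(reconstruction, 2), round(error, 3))
```

Training would repeat this loop and adjust the weights and biases so that the error in step c) shrinks.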
DBN -
A deep belief network can be viewed as a stack of RBMs, where the hidden layer of one RBM is the visible layer of the one "above" it.
a) The first RBM is trained to re-construct its input as accurately as possible
b) The hidden layer of the first RBM is treated as the visible layer for the second and the second RBM is trained using the outputs from the first RBM
c) This process is repeated until every layer in the network is trained
CNN - The process of filtering through an image for a specific pattern.
Uses supervised learning methods.
A CNN layer has a flashlight structure: each neuron is only connected to the input neurons it "shines" upon.
The neurons in a given filter share the same weight and bias parameters. This means that, anywhere on the filter, a given neuron is connected to the same number of input neurons and has the same weights and biases.
This is what allows the filter to look for the same pattern in different sections of the image.
The next two layers that follow are RELU and pooling, both of which help to build up the simple patterns discovered by the convolutional layer. Each node in the convolutional layer is connected to a node that fires like in other nets. The activation used is called RELU, or rectified linear unit. CNNs are trained using backpropagation, so the vanishing gradient is once again a potential issue.
With RELU, the gradient is held more or less constant at every layer of the net. So the RELU activation allows the net to be properly trained, without harmful slowdowns in the crucial early layers.
The pooling layer is used for dimensionality reduction.
Together, these three layers can discover a host of complex patterns, but the net will have no understanding of what these patterns mean.
So a fully connected layer is attached to the end of the net in order to equip the net with the ability to classify data samples.
A typical deep CNN has three sets of layers – a convolutional layer, RELU, and pooling layers – all of which are repeated several times. These layers are followed by a few fully connected layers in order to support classification.
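One set of convolution, RELU and pooling layers can be sketched in NumPy on a tiny image. The 6x6 image and the vertical-edge filter are made up for the example, and the filter values are fixed by hand here, whereas a real CNN learns them.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Slide one filter over the image; every position shares its weights."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output neuron only sees the patch the filter "shines" on.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Keep the largest value in each size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0   # left half dark, right half bright
vertical_edge = np.array([[-1.0, 1.0],
                          [-1.0, 1.0]])

features = max_pool(relu(conv2d(image, vertical_edge)))
print(features)
```

The filter responds only where dark meets bright, and pooling shrinks the 5x5 response map to 2x2 while keeping that response.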
Since CNNs are such deep nets, they most likely need to be trained using server resources with GPUs. Despite the power of CNNs, these nets have one drawback. Since they are a supervised learning method, they require a large set of labelled data for training, which can be challenging to obtain in a real-world application.
RNN - when the patterns in the data change over time, use an RNN.
All the nets we’ve seen up to this point have been feedforward neural networks. In a feedforward neural network, signals flow in only one direction from input to output, one layer at a time. In a recurrent net, the output of a layer is added to the next input and fed back into the same layer, which is typically the only layer in the entire network.
Unlike feedforward nets, a recurrent net can receive a sequence of values as input, and it can also produce a sequence of values as output.
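The feedback loop can be sketched as a plain loop over time steps. The layer sizes, random weights and tanh activation below are assumptions for illustration; nothing is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
w_input = rng.normal(scale=0.5, size=(input_size, hidden_size))
w_hidden = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
bias = np.zeros(hidden_size)

def run_rnn(sequence):
    """Feed a sequence through one recurrent layer, one step at a time."""
    hidden = np.zeros(hidden_size)
    outputs = []
    for x in sequence:
        # The previous output is fed back in together with the new input.
        hidden = np.tanh(x @ w_input + hidden @ w_hidden + bias)
        outputs.append(hidden)
    return np.stack(outputs)

sequence = rng.normal(size=(5, input_size))   # 5 time steps
outputs = run_rnn(sequence)
print(outputs.shape)
```

The same weights are reused at every time step, so a sequence of any length can go in and a sequence of the same length comes out.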
RNNs can be stacked to form a more capable network for complex output.
An RNN is an extremely difficult net to train. Since these nets use backpropagation, we once again run into the problem of the vanishing gradient.
Unfortunately, the vanishing gradient is exponentially worse for an RNN. The reason for this is that each time step is the equivalent of an entire layer in a feedforward network. So training an RNN for 100 time steps is like training a 100-layer feedforward net – this leads to exponentially small gradients and a decay of information through time.
There are several ways to address this problem - the most popular of which is gating. Gating is a technique that helps the net decide when to forget the current input, and when to remember it for future time steps. The most popular gating types today are GRU and LSTM. Besides gating, there are also a few other techniques like gradient clipping, steeper gates, and better optimizers.
AutoEncoder - understands the features in data and acts as a feature extractor.
An autoencoder is a neural net that takes a set of typically unlabelled inputs, and after encoding them, tries to reconstruct them as accurately as possible. As a result of this, the net must decide which of the data features are the most important, essentially acting as a feature extraction engine.
Autoencoders are typically very shallow, and are usually comprised of an input layer, an output layer and a hidden layer. An RBM is an example of an autoencoder with only two layers. A forward pass ends with a reconstruction of the input, in two steps: the encoding and the decoding. Typically, the same weights that are used to encode a feature in the hidden layer are used to reconstruct an image in the output layer.
Autoencoders are trained with backpropagation, using a metric called “loss”.
Loss measures the amount of information that was lost when the net tried to reconstruct the input. A net with a small loss value will produce reconstructions that look very similar to the originals.
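The encode, decode and loss steps can be sketched with tied weights in NumPy. The sizes, the tanh activation and the use of mean squared error as the loss are assumptions for this example, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 8, 3
weights = rng.normal(scale=0.1, size=(n_inputs, n_hidden))

def encode(x):
    """Compress the input into a smaller hidden code."""
    return np.tanh(x @ weights)

def decode(code):
    """Rebuild the input from the code, reusing the encoder weights."""
    return code @ weights.T

def reconstruction_loss(x):
    """Mean squared difference between the input and its reconstruction."""
    return float(np.mean((x - decode(encode(x))) ** 2))

batch = rng.normal(size=(5, n_inputs))
codes = encode(batch)
loss = reconstruction_loss(batch)
print(codes.shape, round(loss, 3))
```

Training with backpropagation would adjust weights to push this loss down, and the codes would then serve as the extracted features.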
Autoencoders can be deep. Deep autoencoders perform better at dimensionality reduction than their predecessor, principal component analysis, or PCA.
RNTN - Recursive Neural Tensor Network, designed for sentiment analysis and NLP.
The purpose of these nets was to analyze data that had a hierarchical structure.
Structure of an RNTN - An RNTN has three basic components: a parent group, which we’ll call the root, and two child groups, which we’ll call the leaves.
Each group is simply a collection of neurons, where the number of neurons depends on the complexity of the input data. The root is connected to both leaves, but the leaves are not connected to each other.
Technically speaking, the three components form what’s called a binary tree. In general, the leaf groups receive input, and the root group uses a classifier to fire out a class and a score.
The score represents the quality of the current parse, and the class represents an encoding of a structure in the current parse.
This process recurses until all inputs are used up and the net has a parse tree with all the input words.
Use cases -
Image classification, object recognition, video recognition (driverless cars), speech recognition.
In digital advertising, deep nets are used to segment users by purchase history in order to offer relevant and personalized ads in real time. Based on historical ad price data and other factors, deep nets can learn to optimally bid for ad space on a given web page.
Platform - no coding required, but you are bound by what the platform offers. Platforms help with quick deployment, but there is more cost associated with them.
Examples - H2O.ai, GraphLab
Library - not bound by a platform's offering, but requires coding. Lower cost.
A library is a premade set of functions and modules that you can call through your own programs. You’ll need to code every aspect of a net, like the model, the layers, the activation, the training method, and any special methods for preventing overfitting.
Commercial-grade libraries include deeplearning4j, Torch, and Caffe; libraries for scientific projects include Theano and deepmat.
Theano - Python library. I am not sure if Hadoop support is present at the time of writing this.
Caffe - C++, with Python and Matlab interfaces. Good for machine vision and forecasting applications.
TensorFlow - Python. Based on a computational graph (same as Theano). Hadoop support, model parallelism, OpenCL (GPU) support, TensorBoard.
Glossary
MLP - Multi Layer perceptron
RBM - restricted Boltzmann machine
CNN - Convolutional Neural Net
RNN - Recurrent Net
DBN - Deep Belief Net
Encoders
RNTN - Recursive Neural Tensor Network