Tuesday, June 19, 2018

Deep Learning & Neural Network


Deep learning is a subset of machine learning and works in a similar way, but its capabilities are different. Deep learning algorithms can determine on their own whether their predictions are accurate or not. This is where deep learning gets tricky :-)

A deep learning model is designed to continually analyze data with a logic structure similar to how a human would draw conclusions. 

To achieve this, deep learning uses a layered structure of algorithms called an artificial neural network (ANN). The design of an ANN is inspired by the biological neural network of the human brain. This makes for machine intelligence that’s far more capable than that of standard machine learning models.

A few terms:

MLP - Multilayer Perceptron
RBM - Restricted Boltzmann Machine
CNN - Convolutional Neural Net
RNN - Recurrent Neural Net
DBN - Deep Belief Net
Autoencoders
RNTN - Recursive Neural Tensor Network


Why neural nets - they cope well with complex patterns.


Why neural nets now -

Earlier, neural nets were very hard to train with backpropagation (see the vanishing gradient problem below) and required a lot of computing power. Neither is as much of a problem today, thanks to the major work done in the deep learning field by Hinton, LeCun, and Bengio.


What to use When?

If you’re interested in unsupervised learning – that is, you want to extract patterns from a set of unlabeled data – then your best bet is to use either a Restricted Boltzmann Machine, or an auto encoder.

For supervised learning, where you have labeled data and want to build a classifier, the choice depends on the application:

For text processing tasks like sentiment analysis, parsing, and named entity recognition – use a Recurrent Net or a Recursive Neural Tensor Network, which we’ll refer to as an RNTN. 

For any language model that operates on the character level, use a Recurrent Net. 

For image recognition, use a Deep Belief Network or a Convolutional Net. 

For object recognition, use a Convolutional Net or an RNTN. 

For speech recognition, use a Recurrent Net.

In general, Deep Belief Networks and Multilayer Perceptrons with rectified linear units – also known as RELU – are both good choices for classification. For time series analysis, it’s best to use a Recurrent Net.



CNN - The goal of the CNN is to form the best possible representation of the visual world in order to support recognition tasks.

It works by filtering through the image for specific patterns, and it is used in supervised learning.

why CNN
  • detect and classify the objects into categories
  • robust against pose, scale, brightness  etc

Working

input image -> extract feature -> create part of the objects -> combine them to form object

CNNs are good at finding features and combining them.

A typical deep CNN has three sets of layers – a convolutional layer, RELU, and pooling layers – all of which are repeated several times. These layers are followed by a few fully connected layers in order to support classification.

A convolutional layer has a flashlight structure. Each neuron is only connected to the input neurons it "shines" upon. The neurons in a given filter share the same weight and bias parameters. This means that, anywhere on the filter, a given neuron is connected to the same number of input neurons and has the same weights and biases. This is what allows the filter to look for the same pattern in different sections of the image.

The next two layers that follow are RELU and pooling, both of which help to build up the simple patterns discovered by the convolutional layer. Each node in the convolutional layer is connected to a node that fires like in other nets. The activation used is called RELU, or rectified linear unit. 

CNNs are trained using backpropagation, so the vanishing gradient is once again a potential issue. With the RELU activation, though, the gradient is held more or less constant at every layer of the net. This allows the net to be properly trained, without harmful slowdowns in the crucial early layers.

The pooling layer is used for dimensionality reduction.
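
The three layer types described above can be sketched in plain Python. This is only an illustration under a few assumptions of mine: the function names, the 5x5 image, and the hand-picked edge filter are all made up for the example, and a real CNN would learn its filter weights during training. Like most deep learning libraries, the "convolution" here slides the filter without flipping it.

```python
def convolve2d(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Rectified linear unit: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# A 5x5 image with a bright vertical stripe (hypothetical data).
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
# Hand-picked filter that responds to a dark-to-bright edge, left to right.
vertical_edge = [[-1, 1], [-1, 1]]

features = convolve2d(image, vertical_edge)  # 4x4 map of filter responses
activated = relu(features)                   # drop the negative responses
pooled = max_pool(activated)                 # 2x2 summary of the 4x4 map
```

Because the same filter weights are reused at every position, the edge is detected wherever the stripe appears in the image, and pooling shrinks the 4x4 map to 2x2 while keeping the strongest response in each window.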

Together, these three layers can discover a host of complex patterns, but the net will have no understanding of what these patterns mean. 

So a fully connected layer is attached to the end of the net in order to equip the net with the ability to classify data samples.


Since CNNs are such deep nets, they most likely need to be trained using server resources with GPUs. Despite the power of CNNs, these nets have one drawback. Since they are a supervised learning method, they require a large set of labelled data for training, which can be challenging to obtain in a real-world application. 




RNN - when the patterns in your data change over time, use an RNN.


This deep learning model has a simple structure with a built-in feedback loop, allowing it to act as a forecasting engine.

All the nets we’ve seen up to this point have been feedforward neural networks. In a feedforward neural network, signals flow in only one direction from input to output, one layer at a time. In a recurrent net, the output of a layer is added to the next input and fed back into the same layer, which is typically the only layer in the entire network.

Unlike feedforward nets, a recurrent net can receive a sequence of values as input, and it can also produce a sequence of values as output.
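
To make the feedback loop concrete, here is a minimal sketch of a recurrent layer with a single unit. The names `rnn_step` and `run_rnn`, the tanh activation, and the weight values are assumptions chosen for illustration; a trained net would learn its weights.

```python
import math


def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One time step: combine the current input with the previous output."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)


def run_rnn(sequence, w_in=0.5, w_rec=0.8, bias=0.0):
    """Feed a sequence through the layer, returning one output per input."""
    h = 0.0  # the state starts empty
    outputs = []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec, bias)  # output is fed back as state
        outputs.append(h)
    return outputs


# Only the first input is non-zero, yet later outputs are non-zero too,
# because the layer's own output keeps flowing back in.
outputs = run_rnn([1.0, 0.0, 0.0, 0.0])
```

The sequence goes in one value at a time and a sequence comes out, which is the property the feedforward nets above do not have.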

RNNs can be stacked to form a more capable network for complex output, but they are a bit difficult to train.


RNTN - Recursive Neural Tensor Network, designed for sentiment analysis and NLP.


The purpose of these nets was to analyze data that had a hierarchical structure.

Structure of an RNTN - an RNTN has three basic components, arranged as a binary tree: a parent group, which we’ll call the root, and two child groups, which we’ll call the leaves.

The leaves receive the input and pass it to the root group, and the root group uses a classifier to fire out a class and a score.

Each group is simply a collection of neurons, where the number of neurons depends on the complexity of the input data. The root is connected to both leaves, but the leaves are not connected to each other.

Technically speaking, the three components form what’s called a binary tree. In general, the leaf groups receive input, and the root group uses a classifier to fire out a class and a score.

The score represents the quality of the current parse, and the class represents an encoding of a structure in the current parse. 

This goes into recursion until all inputs are used up and the net has a parse tree with all the input words. 

Use cases -
Image classification, object recognition, video recognition (driverless cars), speech recognition.

In digital advertising, deep nets are used to segment users by purchase history in order to offer relevant and personalized ads in real time. Based on historical ad price data and other factors, deep nets can learn to optimally bid for ad space on a given web page.

Unfortunately, the vanishing gradient is exponentially worse for an RNN. The reason for this is that each time step is the equivalent of an entire layer in a feedforward network. So training an RNN for 100 time steps is like training a 100-layer feedforward net – this leads to exponentially small gradients and a decay of information through time. 

There are several ways to address this problem - the most popular of which is gating. Gating is a technique that helps the net decide when to forget the current input, and when to remember it for future time steps. The most popular gating types today are GRU and LSTM. Besides gating, there are also a few other techniques like gradient clipping, steeper gates, and better optimizers.
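
Of the techniques listed, gradient clipping is the easiest to show in a few lines. The sketch below is one common variant, clipping by norm; the function name and the threshold are my own choices for illustration. Note that clipping mainly guards against the opposite failure, gradients that grow too large, which is why it is used alongside gating rather than instead of it.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so its length never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # small enough already, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]  # same direction, shorter length


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # length 5 is cut to 1
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # length 0.5 passes through
```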


Training and  vanishing gradient 
When you’re training a neural net, you’re constantly calculating a cost value. The cost is typically the difference between the net’s predicted output and the actual output from a set of labelled training data. The cost is then lowered by making slight adjustments to the weights and biases over and over throughout the training process, until the lowest possible value is obtained.

The training process utilizes something called a gradient, which measures the rate at which the cost will change with respect to a change in a weight or a bias.

When the gradient is large, the net will train quickly. When the gradient is small, the net will train slowly.
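
A tiny example may help here: a "net" with a single weight `w`, trained to fit y = 2x. The data, the squared-error cost, and the learning rate are assumptions picked for the sketch, not anything specific to a real library.

```python
def cost(w, data):
    """Mean squared difference between predicted and actual outputs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def gradient(w, data):
    """Rate at which the cost changes with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


# Labelled training data generated by y = 2x (hypothetical).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0
learning_rate = 0.05
history = []
for _ in range(200):
    history.append(cost(w, data))
    # Slight adjustment in the direction that lowers the cost.
    w -= learning_rate * gradient(w, data)
```

Far from the answer the gradient is large and `w` moves quickly; close to the answer the gradient shrinks and each step changes `w` only a little.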

The process used for training a neural net is called back-propagation or back-prop. We saw before that forward prop starts with the inputs and works forward; back-prop does the reverse, calculating the gradient from right to left.

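A gradient at any point is the product of the previous gradients up to that point, and the product of two numbers between 0 and 1 gives you a smaller number. So the deeper the net, the smaller the gradient that reaches the early layers, and the more slowly those layers train.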
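
The effect of multiplying many small numbers is easy to check. The sketch below assumes every layer contributes the same local gradient of 0.25, which is the largest value the slope of a sigmoid activation can take; real nets have different values at each layer, so treat this as a best-case illustration.

```python
def gradient_at_first_layer(local_gradient, num_layers):
    """Multiply the per-layer gradients together, as back-prop does."""
    product = 1.0
    for _ in range(num_layers):
        product *= local_gradient
    return product


# The deeper the net, the smaller the gradient that reaches the first layer.
for depth in (5, 20, 100):
    print(depth, gradient_at_first_layer(0.25, depth))
```

At a depth of 100 the product is far too small to move the early weights in any useful way, which is the same situation an RNN faces after 100 time steps.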



RBM - Restricted Boltzmann Machines, and how they overcame the vanishing gradient problem.


An RBM is a shallow, two-layer net; the first layer is known as the visible layer and the second is called the hidden layer. Each node in the visible layer is connected to every node in the hidden layer. An RBM is considered “restricted” because no two nodes in the same layer share a connection. 

An RBM is the mathematical equivalent of a two-way translator – in the forward pass, an RBM takes the inputs and translates them into a set of numbers that encode the inputs. In the backward pass, it takes this set of numbers and translates them back to form the re-constructed inputs. A well-trained net will be able to perform the backwards translation with a high degree of accuracy. In both steps, the weights and biases have a very important role. They allow the RBM to decipher the interrelationships among the input features, and they also help the RBM decide which input features are the most important when detecting patterns. 

Through several forward and backward passes, an RBM is trained to reconstruct the input data. Three steps are repeated over and over through the training process: 

a) With a forward pass, every input is combined with an individual weight and one overall bias, and the result is passed to the hidden layer which may or may not activate. 

b) Next, in a backward pass, each activation is combined with an individual weight and an overall bias, and the result is passed to the visible layer for reconstruction. 

c) At the visible layer, the reconstruction is compared against the original input to determine the quality of the result. 

RBMs use a measure called KL divergence for step c).


Steps a) through c) are repeated with varying weights and biases until the input and the reconstruction are as close as possible.
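
One round of steps a) through c) can be sketched as follows. This shows a single forward pass, backward pass, and comparison only; it does not include the weight update that actual training performs between rounds. The function names, the sigmoid activation, the tiny random weights, and the choice to normalise both vectors before taking the KL divergence are all assumptions made for the illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward_pass(visible, weights, hidden_bias, rng):
    """Step a): each hidden node switches on with a probability that
    depends on the weighted inputs plus its bias."""
    hidden = []
    for j in range(len(hidden_bias)):
        z = hidden_bias[j] + sum(v * weights[i][j]
                                 for i, v in enumerate(visible))
        hidden.append(1.0 if rng.random() < sigmoid(z) else 0.0)
    return hidden


def backward_pass(hidden, weights, visible_bias):
    """Step b): the same weights, used in reverse, rebuild the input."""
    return [
        sigmoid(visible_bias[i] + sum(h * weights[i][j]
                                      for j, h in enumerate(hidden)))
        for i in range(len(visible_bias))
    ]


def kl_divergence(p, q, eps=1e-12):
    """Step c): KL(p || q) after scaling both vectors to sum to 1."""
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for a, b in zip(p, q):
        a, b = a / p_sum, b / q_sum
        if a > 0:
            total += a * math.log(a / max(b, eps))
    return total


rng = random.Random(0)  # fixed seed so the sketch is repeatable
visible = [1.0, 0.0, 1.0, 1.0]  # 4 visible nodes (hypothetical input)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(4)]

hidden = forward_pass(visible, weights, [0.0, 0.0], rng)
reconstruction = backward_pass(hidden, weights, [0.0, 0.0, 0.0, 0.0])
score = kl_divergence(visible, reconstruction)  # lower means a closer match
```

With untrained weights the reconstruction is close to 0.5 everywhere, so the score is well above zero; training would adjust the weights and biases to push it down.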



DBN -  
A deep belief network can be viewed as a stack of RBMs, where the hidden layer of one RBM is the visible layer of the one "above" it.

Training DBN - 
a) The first RBM is trained to re-construct its input as accurately as possible 

b) The hidden layer of the first RBM is treated as the visible layer for the second and the second RBM is trained using the outputs from the first RBM 

c) This process is repeated until every layer in the network is trained


An important note about a DBN is that each RBM layer learns the entire input. In other kinds of models – like convolutional nets – early layers detect simple patterns and later layers recombine them.

Autoencoder - learns which features in the data matter, acting as a feature extractor.


An autoencoder is a neural net that takes a set of typically unlabelled inputs, and after encoding them, tries to reconstruct them as accurately as possible. As a result of this, the net must decide which of the data features are the most important, essentially acting as a feature extraction engine.

Autoencoders are typically very shallow, usually made up of an input layer, a hidden layer, and an output layer. An RBM is an example of an autoencoder with only two layers. A forward pass has two steps, the encoding and the decoding, and ends with a reconstruction of the input. Typically, the same weights that are used to encode a feature in the hidden layer are used to reconstruct an image in the output layer.

Autoencoders are trained with backpropagation, using a metric called “loss”.

Loss measures the amount of information that was lost when the net tried to reconstruct the input. A net with a small loss value will produce reconstructions that look very similar to the originals.
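
Here is a minimal sketch of the encode and decode steps with shared ("tied") weights, followed by a loss calculation. The mean squared error used as the loss, the sigmoid activation, and the hand-picked weights are assumptions for illustration; in practice the weights are learned with backpropagation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(inputs, weights, hidden_bias):
    """Input layer -> hidden layer: compress the inputs into a short code."""
    return [
        sigmoid(hidden_bias[j] + sum(x * weights[i][j]
                                     for i, x in enumerate(inputs)))
        for j in range(len(hidden_bias))
    ]


def decode(code, weights, output_bias):
    """Hidden layer -> output layer, reusing the encoder's weights."""
    return [
        sigmoid(output_bias[i] + sum(c * weights[i][j]
                                     for j, c in enumerate(code)))
        for i in range(len(output_bias))
    ]


def reconstruction_loss(inputs, outputs):
    """Mean squared error between the input and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(inputs, outputs)) / len(inputs)


inputs = [1.0, 0.0, 1.0, 0.0]  # 4 input features (hypothetical)
weights = [[0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, 0.5]]  # 4 x 2

code = encode(inputs, weights, [0.0, 0.0])            # 4 values squeezed to 2
outputs = decode(code, weights, [0.0, 0.0, 0.0, 0.0])  # expanded back to 4
loss = reconstruction_loss(inputs, outputs)
```

The hidden layer is smaller than the input, so the net cannot simply copy its input through; whatever survives the squeeze is what it treats as the important features.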


Autoencoders can be deep. Deep autoencoders perform better at dimensionality reduction than their predecessor, principal component analysis, or PCA.


Platform - no coding required, but you are bound by what the platform offers. It helps with quick deployment, but there is more cost associated with it.

Examples - H2O.ai, GraphLab

Library - not limited to a vendor's offering, but requires coding. Lower cost.

A library is a premade set of functions and modules that you can call from your own programs. You’ll need to code every aspect of a net, like the model, the layers, the activation, the training method, and any special methods for preventing overfitting.

Examples - commercial-grade libraries like deeplearning4j, Torch, or Caffe; for scientific projects, Theano or deepmat.

Theano - Python library. I am not sure if Hadoop support is present at the time of writing this.
Caffe - C++, with Python and Matlab interfaces. Good for machine vision and forecasting applications.
TensorFlow - Python. Based on a computational graph (same as Theano). Hadoop support, model parallelism, OpenCL (GPU) support, TensorBoard.

