Friday, November 9, 2018

Tips to consider before writing spark programs





1. Size your executors right 



Ideal executor size 
  • keep 4-5 cores per executor
  • allocate 2-4 GB RAM per core
  • leave room for the YARN memory overhead (about 7% of executor memory)
  • the Application Master uses one core and one executor to run

     
Sometimes you need to run a large data load and the above configuration may not help; here are a few more points to keep in mind while sizing the executors.
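As a rough illustration of the sizing above, here is a minimal PySpark sketch (my numbers, not from the original post): 5 cores per executor, 4 GB per core giving a 20 GB heap, plus an explicit overhead allowance. The app name and instance count are placeholders, and on older Spark/YARN versions the overhead key is spark.yarn.executor.memoryOverhead.

```python
from pyspark.sql import SparkSession

# Hypothetical sizing: 5 cores/executor, 4 GB RAM per core => 20g heap,
# plus ~7-10% set aside as YARN memory overhead.
spark = (SparkSession.builder
         .appName("executor-sizing-sketch")              # placeholder name
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "20g")
         .config("spark.executor.memoryOverhead", "2g")  # YARN overhead allowance
         .config("spark.executor.instances", "10")       # depends on cluster capacity
         .getOrCreate())
```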


2. The driver program runs on a single machine 



This can become a limitation for heavy data loads, and you may choose not to use Spark for such applications and opt for other solutions like Hive; Hive is better for single, heavy aggregation jobs. If you stay with Spark, there are workarounds, but you need to take some additional steps. 



3. Manage your DAG



  • avoid shuffles where possible (do map-side reduction; only send what you have to)
  • use reduceByKey over groupByKey, which suffers from skew since it is data dependent (see the sketch after this list)
  • use treeReduce (reduces at the executors) over reduce (reduces at the driver)
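A quick PySpark illustration of the reduceByKey point (toy data, just to show the shape of the API):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-vs-group").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# groupByKey ships every value across the network before summing.
sums_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values on the map side first, so far less data is shuffled.
sums_fast = pairs.reduceByKey(lambda x, y: x + y)

print(sums_fast.collect())   # [('a', 4), ('b', 6)]
```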


4. 2 GB limit on a Spark shuffle block - a hard Spark limit (shuffle still sucks)



  • aim for roughly 128 MB per partition
  • if your number of partitions is close to 2000, bump it to just over 2000 so Spark uses a compressed map status for the shuffle (see the sketch below)
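A hedged sketch of those two knobs in PySpark; the 2200 figure and the DataFrame are just illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-sketch").getOrCreate()

# Past 2000 shuffle partitions Spark switches to a compressed map-status format.
spark.conf.set("spark.sql.shuffle.partitions", "2200")

# Repartition explicitly so each partition lands near the ~128 MB target.
df = spark.range(0, 100_000_000).repartition(2200)
```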

5. Skew with join

Most of the data goes to one partition; use salting to distribute your keys (see the sketch below).
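A minimal salting sketch in PySpark, assuming a skewed fact table joined to a small dimension table (the table contents, the salt count of 8, and the column names are all made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

# Hypothetical skewed fact table and a small dimension table.
facts = spark.createDataFrame(
    [("hot_key", i) for i in range(1000)] + [("rare_key", 1)], ["key", "value"])
dims = spark.createDataFrame([("hot_key", "A"), ("rare_key", "B")], ["key", "attr"])

SALT_BUCKETS = 8

# Append a random salt to the skewed side so "hot_key" spreads over several partitions ...
salted_facts = facts.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"),
                (F.rand(seed=42) * SALT_BUCKETS).cast("int").cast("string")))

# ... and replicate the small side once per salt value so every salted key finds a match.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
salted_dims = dims.crossJoin(salts).withColumn(
    "salted_key", F.concat_ws("_", F.col("key"), F.col("salt").cast("string")))

joined = salted_facts.join(salted_dims.drop("key"), on="salted_key")
```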






Tuesday, June 19, 2018

Deep Learning & Neural Network


Deep learning is a subset of machine learning and functions in a similar way, but its capabilities are different. Deep learning algorithms are capable of determining on their own whether their predictions are accurate or not. This is where deep learning gets tricky:-) 

A deep learning model is designed to continually analyze data with a logic structure similar to how a human would draw conclusions. 

To achieve this, deep learning uses a layered structure of algorithms called an artificial neural network (ANN). The design of an ANN is inspired by the biological neural network of the human brain. This makes for machine intelligence that’s far more capable than that of standard machine learning models.

A few terms -

MLP - Multilayer Perceptron
RBM - Restricted Boltzmann Machine
CNN - Convolutional Neural Network
RNN - Recurrent Neural Network
DBN - Deep Belief Network
Autoencoders
RNTN - Recursive Neural Tensor Network


Why neural nets - because they can model complex patterns.


Why neural nets now -

Earlier, neural nets were very hard to train (with backpropagation; see the vanishing gradient problem below) and required a lot of compute power. This is no longer the issue, thanks to the major work done in the deep learning field by Hinton, LeCun, and Bengio.


What to use When?

If you’re interested in unsupervised learning – that is, you want to extract patterns from a set of unlabeled data – then your best bet is to use either a Restricted Boltzmann Machine, or an auto encoder.

For supervised learning - if you have labeled data and you want to build a classifier, the choice of net depends on the task:

For text processing tasks like sentiment analysis, parsing, and named entity recognition – use a Recurrent Net or a Recursive Neural Tensor Network, which we’ll refer to as an RNTN. 

For any language model that operates on the character level, use a Recurrent Net. 

For image recognition, use a Deep Belief Network or a Convolutional Net. 

For object recognition, use a Convolutional Net or an RNTN. 

For speech recognition, use a Recurrent Net.

In general, Deep Belief Networks and Multilayer Perceptrons with rectified linear units – also known as RELU – are both good choices for classification. For time series analysis, it’s best to use a Recurrent Net.



CNN - the goal of a CNN is to form the best possible representation of the visual world in order to support recognition tasks.

Convolution is the process of filtering through the image for a specific pattern. CNNs are used in supervised learning methods. 

Why CNN
  • they detect and classify objects into categories
  • they are robust against changes in pose, scale, brightness, etc.

Working

input image -> extract features -> create parts of the objects -> combine them to form the object

CNNs are good at finding features and combining them.

A typical deep CNN has three sets of layers – a convolutional layer, RELU, and pooling layers – all of which are repeated several times. These layers are followed by a few fully connected layers in order to support classification
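To make the convolution -> RELU -> pooling -> fully connected structure concrete, here is a minimal Keras sketch (Keras is my choice here, not something from the original post); it assumes 28x28 grayscale images and 10 output classes:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # convolution + RELU: filters scan small patches and share weights
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # pooling: dimensionality reduction of the feature maps
    layers.MaxPooling2D((2, 2)),
    # the convolution/RELU/pooling set is typically repeated several times
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # fully connected layers at the end equip the net to classify
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```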

A CNN layer has a flashlight structure: each neuron is connected only to the input neurons it "shines" upon. The neurons in a given filter share the same weight and bias parameters. This means that, anywhere on the filter, a given neuron is connected to the same number of input neurons and has the same weights and biases. This is what allows the filter to look for the same pattern in different sections of the image.

The next two layers that follow are RELU and pooling, both of which help build up the simple patterns discovered by the convolutional layer. Each node in the convolutional layer is connected to a node that fires, just like in other nets; the activation used is called RELU, or rectified linear unit. 

CNNs are trained using backpropagation, so the vanishing gradient is once again a potential issue. With RELU, however, the gradient is held more or less constant at every layer of the net, so the net can be properly trained without harmful slowdowns in the crucial early layers. 

The pooling layer is used for dimensionality reduction.

Together, these three layers can discover a host of complex patterns, but the net will have no understanding of what these patterns mean. 

So a fully connected layer is attached to the end of the net in order to equip the net with the ability to classify data samples.


Since CNNs are such deep nets, they most likely need to be trained using server resources with GPUs. Despite the power of CNNs, these nets have one drawback. Since they are a supervised learning method, they require a large set of labelled data for training, which can be challenging to obtain in a real-world application. 




RNN - when the patterns in your data change over time, use an RNN. 


This deep learning model has a simple structure with a built-in feedback loop, allowing it to act as a forecasting engine

All the nets we’ve seen up to this point have been feedforward neural networks. In a feedforward neural network, signals flow in only one direction from input to output, one layer at a time. In a recurrent net, the output of a layer is added to the next input and fed back into the same layer, which is typically the only layer in the entire network.

Unlike feedforward nets, a recurrent net can receive a sequence of values as input, and it can also produce a sequence of values as output.

RNNs can be stacked to form more capable networks for complex outputs, but they are difficult nets to train. 


RNTN - Recursive Neural Tensor Network, designed for sentiment analysis and NLP.  


The purpose of these nets was to analyze data that had a hierarchical structure.

Structure of an RNTN - an RNTN has three basic components: a parent group, which we'll call the root, and two child groups, which we'll call the leaves. The leaves receive the input and pass it up to the root.

Each group is simply a collection of neurons, where the number of neurons depends on the complexity of the input data.  the root is connected to both leaves, but the leaves are not connected to each other. 

Technically speaking, the three components form what’s called a binary tree. In general, the leaf groups receive input, and the root group uses a classifier to fire out a class and a score.

The score represents the quality of the current parse, and the class represents an encoding of a structure in the current parse. 

This goes into recursion until all inputs are used up and the net has a parse tree with all the input words. 

Use cases - 
image classification, object recognition, video recognition (driverless cars), speech recognition.  

In digital advertising, deep nets are used to segment users by purchase history in order to offer relevant and personalized ads in real time. Based on historical ad price data and other factors, deep nets can learn to optimally bid for ad space on a given web page.

Unfortunately, the vanishing gradient is exponentially worse for an RNN. The reason for this is that each time step is the equivalent of an entire layer in a feedforward network. So training an RNN for 100 time steps is like training a 100-layer feedforward net – this leads to exponentially small gradients and a decay of information through time. 

There are several ways to address this problem - the most popular of which is gating. Gating is a technique that helps the net decide when to forget the current input, and when to remember it for future time steps. The most popular gating types today are GRU and LSTM. Besides gating, there are also a few other techniques like gradient clipping, steeper gates, and better optimizers.
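As a small illustration of gating, here is a hedged Keras sketch of a character-level model with an LSTM layer (the vocabulary size is made up; swapping layers.LSTM for layers.GRU gives the other popular gate type):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 100                      # made-up character vocabulary size

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 32),
    # The LSTM cell contains the gates that decide what to remember
    # and what to forget across time steps.
    layers.LSTM(128),
    layers.Dense(VOCAB_SIZE, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```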


Training and  vanishing gradient 
When you're training a neural net, you're constantly calculating a cost value. The cost is typically the difference between the net's predicted output and the actual output from a set of labelled training data. The cost is then lowered by making slight adjustments to the weights and biases over and over throughout the training process, until the lowest possible value is obtained.

The training process utilizes something called a gradient, which measures the rate at which the cost will change with respect to a change in a weight or a bias.

When the gradient is large, the net will train quickly. When the gradient is small, the net will train slowly.

The process used for training a neural net is called back-propagation or back-prop. We saw before that forward prop starts with the inputs and works forward; back-prop does the reverse, calculating the gradient from right to left.

A gradient at any point is the product of the previous gradients up to that point, and the product of two numbers between 0 and 1 gives you a smaller number. So the further back you go through the net, the smaller the gradients become and the slower the early layers learn - this is the vanishing gradient problem.
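A tiny numeric illustration of that product effect (the 0.25 figure is the maximum slope of the sigmoid, a standard fact rather than something from this post):

```python
# Each backprop step multiplies by a local gradient; for a sigmoid the
# derivative is at most 0.25, so the product shrinks fast with depth.
grad = 1.0
for _ in range(10):        # pretend the net is 10 layers deep
    grad *= 0.25
print(grad)                # 0.25**10 is about 9.5e-07 -- the early layers barely learn
```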



RBM - Restricted Boltzmann Machines, and how they overcame the vanishing gradient problem.


An RBM is a shallow, two-layer net; the first layer is known as the visible layer and the second is called the hidden layer. Each node in the visible layer is connected to every node in the hidden layer. An RBM is considered “restricted” because no two nodes in the same layer share a connection. 

An RBM is the mathematical equivalent of a two-way translator – in the forward pass, an RBM takes the inputs and translates them into a set of numbers that encode the inputs. In the backward pass, it takes this set of numbers and translates them back to form the re-constructed inputs. A well-trained net will be able to perform the backwards translation with a high degree of accuracy. In both steps, the weights and biases have a very important role. They allow the RBM to decipher the interrelationships among the input features, and they also help the RBM decide which input features are the most important when detecting patterns. 

Through several forward and backward passes, an RBM is trained to reconstruct the input data. Three steps are repeated over and over through the training process: 

a) With a forward pass, every input is combined with an individual weight and one overall bias, and the result is passed to the hidden layer which may or may not activate. 

b) Next, in a backward pass, each activation is combined with an individual weight and an overall bias, and the result is passed to the visible layer for reconstruction. 

c) At the visible layer, the reconstruction is compared against the original input to determine the quality of the result. 

RBMs use a measure called KL Divergence for step c); 


steps a) thru c) are repeated with varying weights and biases until the input and the re-construction are as close as possible.
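scikit-learn ships a Bernoulli RBM, which is enough for a hedged sketch of the idea; note that sklearn trains it with contrastive divergence rather than the KL-based description above, and the data below is random noise just to make it runnable:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Toy binary "visible layer" data: 200 samples, 64 input features.
X = (np.random.rand(200, 64) > 0.5).astype(float)

rbm = BernoulliRBM(n_components=32,   # size of the hidden layer
                   learning_rate=0.05,
                   n_iter=20,         # repeated forward/backward passes
                   random_state=0)
rbm.fit(X)

hidden = rbm.transform(X)             # forward pass: encode the inputs
```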



DBN -  
A deep belief network can be viewed as a stack of RBMs, where the hidden layer of one RBM is the visible layer of the one "above" it.

Training DBN - 
a) The first RBM is trained to re-construct its input as accurately as possible 

b) The hidden layer of the first RBM is treated as the visible layer for the second and the second RBM is trained using the outputs from the first RBM 

c) This process is repeated until every layer in the network is trained


An important note about a DBN is that each RBM layer learns the entire input. In other kinds of models – like convolutional nets – early layers detect simple patterns and later layers recombine them

Autoencoder - understands the features in the data and acts as a feature extraction engine.


an autoencoder is a neural net that takes a set of typically unlabelled inputs, and after encoding them, tries to reconstruct them as accurately as possible. As a result of this, the net must decide which of the data features are the most important, essentially acting as a feature extraction engine.

Autoencoders are typically very shallow, and are usually comprised of an input layer, an output layer and a hidden layer. An RBM is an example of an autoencoder with only two layers. Here is a forward pass that ends with a reconstruction of the input. There are two steps - the encoding and the decoding. Typically, the same weights that are used to encode a feature in the hidden layer are used to reconstruct an image in the output layer.

Autoencoders are trained with backpropagation, using a metric called “loss”.

loss measures the amount of information that was lost when the net tried to reconstruct the input. A net with a small loss value will produce reconstructions that look very similar to the originals.


Autoencoders can be deep. Deep autoencoders perform better at dimensionality reduction than their predecessor, principal component analysis (PCA).
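A minimal dense autoencoder sketch in Keras (layer sizes are arbitrary; the 784 assumes flattened 28x28 images):

```python
from tensorflow.keras import layers, models

INPUT_DIM = 784                      # e.g. a flattened 28x28 image

autoencoder = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(INPUT_DIM,)),  # encoder
    layers.Dense(INPUT_DIM, activation="sigmoid"),                  # decoder
])

# "Loss" here measures how much information is lost in the reconstruction.
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(X, X, epochs=10)   # note: the input is also the target
```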


Platforms - no coding required, but you are bound by the offering; they help with quick deployment, but there is more cost associated with them. 

Examples - H2O.ai, GraphLab.

Libraries - not bound to a particular offering, but they require coding; lower cost.

A library is a premade set of functions and modules that you can call from your own programs. You'll need to code every aspect of a net, like the model, the layers, the activation, the training method, and any special methods for preventing overfitting.

Commercial-grade libraries include Deeplearning4j, Torch, and Caffe; scientific projects include Theano and deepmat.

Theano - Python library. I am not sure if Hadoop support is present at the time of writing this.
Caffe - C++, with interfaces to Python and MATLAB; good for machine vision and forecasting applications.
TensorFlow - Python, based on a computational graph (same idea as Theano); Hadoop support, model parallelism, OpenCL (GPU) support, TensorBoard.


Friday, May 11, 2018

Feature Engineering

Feature Engineering


Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Features are the way you represent the world to the classifier. Feature selection has a multiplicative effect on the overall modeling process.

Features are numeric or categorical. Feature engineering techniques are used to define features more accurately for your model:

  • Bucketing
  • Crossing
  • Hashing
  • Embedding
                               
Feature Bucketing - transforms a numeric feature into a categorical feature.

problem - does income increase linearly with age?
Age does not have a linear relationship with income: children under 17 don't earn much, and neither do people after retirement.

Solution - bucket the age (a numeric feature) into age groups (categorical features) and give each age group its own weight. This is how we create age buckets (see the sketch below).
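A small pandas sketch of age bucketing (the bucket edges and labels are arbitrary examples, not from the original post):

```python
import pandas as pd

ages = pd.DataFrame({"age": [5, 16, 23, 40, 67, 80]})

# Turn the numeric age into a categorical age-group feature.
ages["age_bucket"] = pd.cut(ages["age"],
                            bins=[0, 17, 35, 65, 120],
                            labels=["child", "young", "middle", "senior"])
print(ages)
```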


Feature Crossing - a way to create a new feature that is a combination of existing features.

problem - can a linear classifier model the interaction between multiple features, say age and education, against income?

No. This is where feature crossing is useful: for each cross (age bucket, education) we create a new true/false feature, so the model can learn a separate weight for every age-bucket and education combination (see the sketch below).
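Continuing the sketch above, a crossed feature is just the concatenation of the two categories, which can then be expanded into true/false columns (hypothetical column values):

```python
import pandas as pd

df = pd.DataFrame({"age_bucket": ["young", "middle", "young"],
                   "education": ["bachelors", "masters", "hs"]})

# Cross (age_bucket, education) into a single categorical feature ...
df["age_x_education"] = df["age_bucket"] + "_x_" + df["education"]

# ... and expand it into the true/false columns a linear model can weight.
crossed = pd.get_dummies(df["age_x_education"])
print(crossed)
```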


Feature Hashing or hash buckets -
one way to represent a categorical feature with a large vocabulary.

This representation can save memory and is faster to execute.

A categorical feature with a large number of values can be represented even when the vocabulary is not specified in advance.

To avoid collisions, make the number of hash buckets larger than the number of unique values (e.g., occupations).

It can also be used to limit the number of possibilities.
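A hedged sketch of feature hashing in plain Python; the bucket count and the occupation strings are illustrative:

```python
import hashlib

NUM_BUCKETS = 1000   # keep this larger than the number of unique occupations

def hash_bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a category string to a stable bucket id, no vocabulary needed."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(hash_bucket("software engineer"), hash_bucket("farmer"))
```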

Embedding - represents the meaning of words as a vector.

   Used for large vocabularies.

   Embeddings are dense.
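In Keras terms, an embedding is just a trainable lookup table from word ids to dense vectors; the vocabulary size and dimension below are made up:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000   # large vocabulary of word ids
EMBED_DIM = 50        # each word becomes a dense 50-dimensional vector

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.GlobalAveragePooling1D(),         # average the word vectors
    layers.Dense(1, activation="sigmoid"),   # e.g. a simple sentiment head
])
```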




Dimensionality Reduction 

Dimensionality Reduction is the process of reducing the number of variables/features.

Dimensionality Reduction can be divided into two subcategories 

  • Feature Selection which includes Wrappers, Filters, and Embedded.
  • Feature Extraction which includes Principal Component Analysis.



Suppose a feature is a combination of a few variables, say a + b + c. Now consider if c were equal to 0 or an arbitrarily small number: it wouldn't really be relevant, and it could be taken out of the equation. Here you are using Feature Selection, because you'd be selecting only the relevant variables and leaving out the irrelevant ones.

If you can equate ab = a + b, turning a representation of two variables into one, you're using Feature Extraction to reduce the number of variables.


Feature Selection is the process of selecting a subset of relevant features or variables.
There are 3 main subset types: 
  •  Wrappers,
  •  Filters, and 
  •  Embedded.

Wrappers use a predictive model that scores feature subsets based on the error rate of the model. While they're computationally intensive, they usually produce the best selection of features.

A popular technique is called stepwise regression. It's an algorithm that adds the best feature, or deletes the worst feature at each iteration.

Filters use a proxy measure which is less computationally intensive but slightly less accurate. Filters do capture the practicality of the dataset but, in comparison to error measurement, the feature set that's selected will be more general than if a Wrapper was used.

An interesting fact about filters is that they produce a feature set that doesn't contain assumptions based on the predictive model, making them a useful tool for exposing relationships between features, such as which variables are 'bad' together (and, as a result, drop the accuracy) or 'good' together (and therefore raise the accuracy).

Embedded algorithms learn which features best contribute to an accurate model during the model building process. The most common type is called a regularization model.


Feature Extraction is the process of transforming or projecting a space composing of many dimensions into a space of fewer dimensions.

The main linear technique is called Principal Component Analysis.

Principal Component Analysis is the reduction of higher vector spaces to lower orders through projection.

An easy representation of this would be the projection from a 3-dimensional space to a 2-dimensional one.

A plane is first found which captures most (if not all) of the information. Then the data is projected onto new axes and a reduction in dimensions occurs. When the projection of components happens, new axes are created to describe the relationship. These are called the principal axes, and the new data are called the principal components.
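A short scikit-learn sketch of that 3-D to 2-D projection (random data, purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)            # 100 samples in a 3-dimensional space

pca = PCA(n_components=2)             # project onto the 2 principal axes
X_2d = pca.fit_transform(X)           # the new data: the principal components

print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # how much information each axis keeps
```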


Thanks for reading!!!

Sunday, March 25, 2018

MapR Spark Certification tips


I recently cleared the MapR Spark certification and, since I was asked to, would like to share some tips (here you go, my friends).

I divided this blog into 3 sections. 


  • prerequisite for exam
  • exam topics and must cover material
  • tips (don't ignore the topics at the end of this blog please)



Prerequisite 

First and foremost - work on Spark and Scala for at least a year before attempting the exam. The points below summarize the need.

  • You should have basic knowledge of distributed functional programming
  • Hands-on experience on spark 
  • Have good exposure to Scala programming (you are not expected to be an expert, but you should be able to read code and answer sensibly).

Exam topics and must cover Material

There are lots of programming questions in the exam: a code snippet is provided and you are asked to work out the answer. If I remember correctly, only 10% of the questions were theoretical (true/false, or which-algorithm-to-use kind).


I referred to a lot of material (online books, videos, and edX courses over the last 2 years) for my preparation, but if I had to zero in on what is mandatory for the MapR certification, here is the list. Don't miss any bit of it, and I suggest going over it 4-5 times before taking the exam. 
  • Instructor and Virtual Instructor-led Training(Training ppt and Lab guide)
    • DEV 360 – Developing Spark Applications
    • DEV 361 - Build and Monitor Apache Spark Applications
    • DEV 362 - Spark Streaming, Spark MLLib - Machine Learning, Graphx
  • Book - Learning Spark
  • Spark official documentation
    • pay more attention to RDD, Closure, Accumulator, Broadcast variables.  
      • http://spark.apache.org/docs/latest/quick-start.html
      • http://spark.apache.org/docs/latest/rdd-programming-guide.html
    • MlLib - http://spark.apache.org/docs/latest/ml-guide.html



Topics covered in the exam 

Topic Name - Your Score
Load and Inspect Data in Apache Spark - xx%
Advanced Spark Programming and Spark Machine Learning MLLib - xxx%
Monitoring Spark Applications - xx%
Work with Pair RDD - xx.x%
Spark Streaming - xx%
Work with DataFrames - xx%
Build an Apache Spark Application - xxx%

Tips  - 

Normally, when anyone starts preparing for the exam, a good starting point is to go through the link below:

https://mapr.com/blog/how-get-started-using-apache-spark-graphx-scala/assets/spark-certification-study-guide.pdf

The questions in this guide are far too basic compared to the real exam; the exam was much, much harder.
  • Lots of question on core concepts of RDD and pair RDD
  • Dataframes are the next important
  • About 25% of the questions are on Spark Streaming and Spark MLlib, so prepare well on those

You don't want to ignore any of the below topics at any cost

Silent topics which you don't want to come across as a surprise in the exam


  • Accumulator and Broadcast variables
  • Scala Closures
  • Narrow and Wide Dependencies
  • Partitioning  
  • Formatting questions – saveAsTextFile() – you need to save the output without brackets/parentheses
  • Prepare well for mkString(“,”) and formatting (see the sketch after this list)
  • flatMap functions
  • MapPartitions
  • There was a question on byKey transformation and also on hadoop streaming which I am not sure about. 
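The exam leans on Scala's mkString for this, but the underlying trick (format each record yourself before saveAsTextFile so the output carries no tuple parentheses) looks like this in PySpark; a sketch only, with a hypothetical output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-sketch").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2)])

# The default save would write "('a', 1)"; map to a string first to drop the brackets.
pairs.map(lambda kv: "{},{}".format(kv[0], kv[1])) \
     .saveAsTextFile("/tmp/formatted_output")      # hypothetical path
```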
Hope this blog helps in your preparation. Please let me know or email me if you have any other questions. Happy studying!

At the end  - Here is my certification

Sunday, March 18, 2018

Blockchain basics




Blockchain is a continuously growing list of records which are linked and secured using cryptography. For use as a distributed ledger, a blockchain is typically managed by a peer-to-peer network collectively adhering to a protocol for validating new blocks. Once recorded, the data in any given block cannot be altered retroactively without the alteration of all subsequent blocks, which requires collusion of the network majority.

Why Block chain

Companies and parties today keep records of all transactions between all the parties the business interacts with, and update them as and when needed. This process is inefficient because:

  • duplication of information, since every party has to update its own ledger
  • less transparency
  • less trusted transactions, as the data is owned by one party and can't be guaranteed to be true
  • error-prone
The solution is to use a distributed, secure, transparent and shared ledger - the blockchain:
  • A shared ledger technology
  • transparent transaction as all parties are involved/informed
  • immutable chain - only appended

point to note  - Blockchain is still an emerging technology. Business owners need to start small and then look for more ways to grow and expand the use of blockchain networks.


How blockchain applies to a business network

Assets are classified as tangible (land, properties) or intangible (cash, loans).

  • The transaction records of assets are kept in the form of distributed ledgers (the blockchain).
  • The flow of assets/transactions is governed by contracts.

These assets can be tracked through the distributed ledger for transparency, and only a single copy is maintained, shared and endorsed by all parties involved, hence trusted.

You can see the life cycle of an asset




Blockchain in business 

Blockchain for business provides a secure, shared ledger with one single record that is accessible to all parties involved, hence transparent. Business networks prioritize identity over anonymity. Assets are more diverse and important in a business network. A business network gets to choose who validates a transaction.


  • All the members of the business network share a common ledger on the blockchain.
  • Ledgers are replicated.
  • All members involved can view the transactions, but only authorized members can update them.



The requirements for a blockchain for business are a shared ledger, smart contract, privacy, and trust.

1. Shared ledger
2. Privacy services (who can see what and who can update the information)
3. Trust - transactions are endorsed by the relevant participants
4. Contract - a common/shared business process



For example, for a financial services network, a business network that runs on a blockchain can speed up transaction processes and audits. That in turn reduces costs and can lead to greater customer satisfaction. A business that runs a supply chain network can benefit from blockchain by reducing errors in shipments, getting better tracking of materials, and reducing the risk of illicit tampering with records.


Blockchain for business has several advantages:
  • Saves time
  • Removes cost
  • Reduces risk
  • Increases trust

Use cases of block chain
1. Reference data
2. Supply Chain
3. Trades(Diamond life cycle)


blockchain and bitcoin

Bitcoin is an unregulated shadow-currency and was the first popular blockchain application. The Bitcoin application works in an anonymous network, so no one knows who the participants are.

Bitcoin blockchain is protected by the massive group mining effort. It's unlikely that any private blockchain will try to protect records using gigawatts of computing power — it's time consuming and expensive




Tuesday, February 20, 2018

Spark Mlib Basics

MLlib Machine Learning Library


Spark MLlib is a distributed machine learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, and before Mahout itself gained a Spark interface). Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including:


  1. summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
  2. classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
  3. collaborative filtering techniques including alternating least squares (ALS)
  4. cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)

Machine Learning(definition) - Constructing and studying methods that learn from and make predictions on data. 

Terminologies
  Observations - (data points) items or entities used for learning or evaluation.
  Features - attributes used to represent an observation.
  Labels - values assigned to an observation.
  Training and test data - observations used to train or evaluate a learning algorithm.

So if we consider an observation to be an email, then the features would be the date, the importance, and the key words in the subject or body of the email, and the label would be spam or not-spam. The training and test data would be sets of emails.

Supervised learning - learning from labeled observations; examples: classification and regression.
Unsupervised learning - learning from unlabeled observations; examples: clustering and dimensionality reduction.

Flow - raw data -> feature extraction -> supervised learning -> evaluation -(satisfied)-> prediction

MLlib consists of two packages.

  • spark.mllib  
  • spark.ml

When using PySpark, you'll find them in the pyspark.mllib and pyspark.ml packages respectively.
spark.ml is the newer package and works with DataFrames. The algorithm coverage is similar between the two packages, although spark.ml contains more tools for feature extraction and transformation. The ML package contains two types of classes: transformers and estimators.

A Transformer is a class which takes a DataFrame as input and transforms it into another DataFrame.

A transformer implements a transform() function, which is called on the input DataFrame.

Examples :


  • Hashing Term Frequency - which calculates how often words occur. It does this after hashing the words to reduce the number of features that need to be tracked.
  • LogisticRegressionModel - the model that results from fitting logistic regression on a data set; this model can be used to transform features into predictions.
  • Binarizer - which changes a numeric feature into 1 or 0 given a threshold value.
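A minimal sketch of the transformer pattern using Binarizer; the column names and the threshold are made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

spark = SparkSession.builder.appName("transformer-sketch").getOrCreate()

df = spark.createDataFrame([(0.1,), (0.8,), (0.5,)], ["feature"])

binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="feature_flag")
binarizer.transform(df).show()   # transform(): DataFrame in, DataFrame out
```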


An Estimator is a class that takes a DataFrame as input and returns a Transformer.

It does this when its fit() method is called on the input DataFrame.



Note that Estimators need to use the data in the input DataFrame to build a model that can then be used to transform that DataFrame or another DataFrame.



Examples :

  • LogisticRegression processes the DataFrame to determine the weights for the resulting logistic regression model.
  • StandardScaler needs to calculate the standard deviations, and possibly, means of a column of vectors so that it can create a standard scalar model. That model can then be used to transform a DataFrame by subtracting the means and dividing by the standard deviations.
  • Pipeline - calling fit on a pipeline produces a PipelineModel. The pipeline model contains only transformers; there are no estimators.

ML Pipeline - it is an estimator that consists of one or more stages representing a reusable workflow. Pipeline stages can be transformers, estimators or another pipeline.

             
transformer1->transformer2->estimator1
-----------------------pipeline-----------------------------
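A hedged PySpark sketch of such a pipeline, close to the one in the Spark docs: Tokenizer and HashingTF are transformers, LogisticRegression is the estimator, and the toy training data is made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

train = spark.createDataFrame([
    ("spark is great", 1.0),
    ("hadoop map reduce", 0.0),
], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])   # the estimator
model = pipeline.fit(train)        # returns a PipelineModel (transformers only)
predictions = model.transform(train)
```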





Loss functions define how to penalize incorrect predictions.

The logistic function asymptotically approaches 0 as the input approaches negative infinity and 1 as the input approaches positive infinity. Since the results are bounded by 0 and 1, it can be directly interpreted as a probability.

Feature engineering is the important part, and we will discuss that next. Until then, enjoy learning.




If you faced issue with ibm provided dummy certificate expired just like us and looking for the solution.  This blog is for you.  You can re...