Apache Spark Tutorial (Part 2 – RDD)

Resilient Distributed Datasets (RDD): An RDD is an abstraction, the fundamental unit of data and computation in Spark. As the name indicates, RDDs have, among others, two key features. They are resilient: if the data in memory is lost, an RDD can be recreated. They are distributed: an RDD can hold Java or Python objects that are spread across the nodes of a cluster. More details about RDDs are discussed later in this post. Sample Scala program: You can monitor the jobs that are running on this cluster from
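
The "resilient" property described above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: `ToyRDD`, `lose_memory`, and the sample numbers are hypothetical names invented here, not part of Spark's API, and the sketch does not model distribution across a cluster at all. It only shows the idea that a dataset which remembers how it was derived can be rebuilt after its in-memory copy is lost.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API).

    Each instance remembers its lineage: either the base data it was
    created from, or the parent dataset plus the function applied to it.
    """

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # base data, set only on a root dataset
        self._parent = parent  # dataset this one was derived from
        self._fn = fn          # function applied to each parent element
        self._cache = None     # materialized result held "in memory"

    def map(self, fn):
        # Record the transformation instead of computing it right away.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        # Recompute from lineage whenever the in-memory copy is missing.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._fn(x) for x in self._parent.collect()]
        return list(self._cache)

    def lose_memory(self):
        # Simulate losing the in-memory data for this dataset.
        self._cache = None


numbers = ToyRDD(source=[1, 2, 3, 4])
squares = numbers.map(lambda x: x * x)

first = squares.collect()
squares.lose_memory()       # the cached result is gone
second = squares.collect()  # rebuilt from the recorded lineage
```

After the simulated loss, `second` holds the same values as `first`, because the squares are recomputed from the parent dataset rather than read back from memory.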

Apache Spark Tutorial (Part 1 – Introduction & Architecture)

INTRODUCTION: Apache Spark is an open source distributed data processing engine for clusters. It provides a unified programming model across different types of data processing workloads and platforms. Spark Stack: Spark can run on a variety of cluster managers such as YARN and Mesos, in addition to the Standalone cluster manager that comes bundled with Spark for standalone installation. Spark-core: Spark-core provides services such as managing the memory pool, scheduling of

Decision Tree Algorithms Simplified 2

One advantage of using a decision tree is that it efficiently identifies the most significant variable and splits the population on it. In the previous article, we developed a high-level understanding of decision trees. In this article, we focus on the science behind splitting nodes and choosing the most significant split. Decision trees can use various algorithms to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In
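
The idea of choosing the split that makes sub-nodes most homogeneous can be sketched in a few lines of Python. The excerpt above does not say which splitting algorithm the article uses, so Gini impurity is an assumption made here for illustration. The function names (`gini_impurity`, `weighted_gini`, `best_split`) and the four-row sample are hypothetical and do not come from the article.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (0.0 means fully homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(groups):
    """Impurity of a split: each sub-node's impurity weighted by its size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini_impurity(g) for g in groups)


def best_split(rows, target, features):
    """Return the categorical feature whose split gives the lowest impurity."""
    best_feature, best_score = None, float("inf")
    for feature in features:
        # Group the target labels by the value of this feature.
        groups = {}
        for row in rows:
            groups.setdefault(row[feature], []).append(row[target])
        score = weighted_gini(list(groups.values()))
        if score < best_score:
            best_feature, best_score = feature, score
    return best_feature, best_score


# Hypothetical sample data, invented for this sketch.
rows = [
    {"weather": "sunny", "weekend": "yes", "plays": "yes"},
    {"weather": "sunny", "weekend": "no", "plays": "yes"},
    {"weather": "rainy", "weekend": "yes", "plays": "no"},
    {"weather": "rainy", "weekend": "no", "plays": "no"},
]

feature, score = best_split(rows, "plays", ["weather", "weekend"])
```

In this sample, splitting on `weather` yields two pure sub-nodes, while splitting on `weekend` leaves both sub-nodes evenly mixed, so `weather` is selected as the more significant variable.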

Decision Tree Simplified!

What is a Decision Tree? A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables. Example: Let's say we have a sample of 30 students with three variables Gender

Comparing a Random Forest to a CART model part 2

Random forest is one of the most commonly used algorithms in Kaggle competitions. Along with good predictive power, random forest models are pretty simple to build. We have previously explained the random forest algorithm (Introduction to Random Forest). This article is the second part of the series comparing a random forest with a CART model. In the first article, we took an example of an inbuilt R dataset to predict the classification of a species. In this article we will build a
