You can monitor the jobs that are running on this cluster from Spark UI, which is running by default at port 4040. If you navigate your browser to http://localhost:4040, you should see the following Spark driver program UI:
The UI gives you an overview of the type of job, its submission date/time, the amount of time it took, and the number of stages that it had to pass through. If you want to look at the details of the job, simply click the description of the job, which will take you to another web page that details all the completed stages. You might want to look at individual stages of the job. If you click through the individual stage, you can get detailed metrics about your job.
RDD (Transformations & Actions)
RDD is an abstraction, a fundamental unit of data and computation in Spark. As the name indicates, among others, they have two key features:
For example, an Airbus A350 has roughly 6000 sensors across the entire plane and generates 2.5 TB data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst and a data scientist point of view, the major concern is to analyze the data irrespective of the size and number of nodes across which it has been stored. This is where the neatness of the RDD concept comes into play, where the sensor data can be encapsulated as an RDD concept, and any transformation/action that you perform on the RDD applies across the entire dataset. Six month’s worth of dataset for an A350 would be approximately 450 TBs of data, and would need to sit across multiple machines.
For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows
The figure basically explains that an RDD is a distributed collection of the data, and the framework distributes the data across the cluster. Data distribution across a set of machines brings its own set of nuisances including recovering from node failures. RDD’s are resilient as they can be recomputed from the RDD lineage graph, which is basically a graph of e entire parent RDDs of the RDD. In addition to resilience, distribution, and representing a data set, an RDD has various other distinguishing qualities:
A typical Spark program flow with an RDD includes:
There are two major ways of creating an RDD:
Parallelizing collections are created by calling the parallelize() method on SparkContext within your driver program. The parallelize() method asks Spark to create a new RDD based on the dataset provided. Once the local collection/dataset is converted into an RDD, it can be operated on in parallel. Parallelize() is often used for prototyping and not often used in production environments due to the need of the data set being available on a single machine
Val namesList = sc.parallelize(List(“rob”,”james”,”ardian”,”greg”,”paul”,”jochen”))
Creating an RDD by referencing an external data source, for example, Filesystem, HDFS, HBase, or any data source capable of providing a Hadoop Input format.
For production use, Spark can load data from any storage source supported by Hadoop ranging from a text file on your local file system, to data available on HDFS, HBase, or Amazon S3.
Operations on RDD:
Two major operations types can be performed on RDD. They are
Transformations: Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return any value back to the driver program, and hence are lazily evaluated, which is one of the main benefits of Spark
Actions: Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD, and applies them in the most optimal fashion when an action is called.
val fruits = sc.textFile(“file:///home/ubuntu/spark-data/fruits.txt”)
val yellowThings = sc.textFile(“file:///home/ubuntu/spark-data/yellowthings.txt”)
What are RDD operations?
RDDs support two types of operations: transformations and actions. Transformations create a new dataset from an existing one. Transformations are lazy, meaning that no transformation is executed until you execute an action. Actions return a value to the driver program after running a computation on the dataset.
RDD transformations, following are examples of some of the common transformations available.
/* map */
val fruitsReversed = fruits.map((fruit) => fruit.reverse)
/* filter */
val shortFruits = fruits.filter((fruit) => fruit.length <= 5)
/* flatMap */
val characters = fruits.flatMap((fruit) => fruit.toList)
/* union */
val fruitsAndYellowThings = fruits.union(yellowThings)
/* intersection */
val yellowFruits = fruits.intersection(yellowThings)
/* distinct */
val distinctFruitsAndYellowThings = fruitsAndYellowThings.distinct()
/* groupByKey */
val yellowThingsByFirstLetter = yellowThings.map((thing) => (thing(0), thing)).groupByKey()
/* reduceByKey */
val numFruitsByLength = fruits.map((fruit) => (fruit.length, 1)).reduceByKey((x, y) => x + y)
RDD actions, following are examples of some of the common actions available.
/* collect */
val fruitsArray = fruits.collect()
val yellowThingsArray = yellowThings.collect()
/* count */
val numFruits = fruits.count()
/* take */
val first3Fruits = fruits.take(3)
/* reduce */
val letterSet = fruits.map((fruit) => fruit.toSet).reduce((x, y) => x ++ y)