Apache Spark Tutorial (Part 2 – RDD)

Resilient Distributed Datasets (RDD): An RDD is an abstraction, the fundamental unit of data and computation in Spark. As the name indicates, RDDs have, among others, two key features:

They are resilient: if the data in memory is lost, an RDD can be recreated.
They are distributed: you can have Java objects or Python objects that are distributed across clusters.

More details about RDDs are discussed later in this post.

Sample Scala program:

You can monitor the jobs that are running on this cluster from
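
To make the "resilient" property more concrete, here is a minimal sketch in plain Python. It does not use Spark at all: the class ToyRDD and its methods (parallelize, map, partition, collect, lose_partition) are hypothetical names invented for this illustration, and the striped partitioning is a simplification. The only idea it tries to show is that a dataset which remembers how each partition was derived can rebuild a lost partition on demand.

```python
class ToyRDD:
    """Toy stand-in for an RDD (not the Spark API): it remembers how to
    rebuild each partition instead of relying on the data staying in memory."""

    def __init__(self, compute_partition, num_partitions):
        # "Lineage": a function that maps a partition index to its elements.
        self._compute = compute_partition
        self.num_partitions = num_partitions
        # Partitions that are currently materialised "in memory".
        self._cache = {}

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Split a local collection into partitions (simple striping)."""
        data = list(data)
        return cls(lambda i: data[i::num_partitions], num_partitions)

    def map(self, fn):
        """Return a new ToyRDD whose partitions are derived from this one."""
        parent = self
        return ToyRDD(
            lambda i: [fn(x) for x in parent.partition(i)],
            self.num_partitions,
        )

    def partition(self, i):
        """Return partition i, recomputing it from lineage if it is missing."""
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def collect(self):
        """Gather all partitions into one local list."""
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i):
        """Simulate losing the in-memory copy of partition i."""
        self._cache.pop(i, None)


numbers = ToyRDD.parallelize(range(6), num_partitions=2)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(0)      # the cached data for partition 0 is gone
after = squares.collect()      # partition 0 is rebuilt from its lineage

print(before == after)
```

In this sketch, losing a partition only discards cached results; the recipe for producing them is still there, so collect returns the same data both times.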

Apache Spark Tutorial (Part 1 – Introduction & Architecture)

INTRODUCTION

Apache Spark is an open source distributed data processing engine for clusters. It provides a unified programming model across different types of data processing workloads and platforms.

Spark Stack: Spark can run on a variety of cluster managers such as YARN and Mesos, in addition to the Standalone cluster manager that comes bundled with Spark.

Spark-core: Spark-core provides services such as managing the memory pool, scheduling of

PIG Functions

Eval Functions
Load/Store Functions
Math Functions
String Functions
Tuple, Bag, Map Functions
User Defined Functions (UDFs)

Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in four languages: Java, Python, JavaScript and Ruby.

Registering UDFs

Registering Java UDFs:

-- register_java_udf.pig
register 'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,

PIG Operators

Basic Operators
Relational Operators

PIG Data Types

Basic Operators
Simple Types
Complex Types
