Feeding the Pig with XML
Much of the business data available today is stored as XML, and XML is always tough to parse, especially in Pig. There are two approaches to parsing an XML file in Pig:
1. Using regular expressions
2. Using XPath
For simplicity, let's work on the XML shown below (store it as sample.xml). The file is placed in HDFS for processing (the path used here is /tmp/sample.xml).
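To make the difference between the two approaches concrete, here is a minimal sketch in plain Python rather than Pig. The contents of sample.xml are not reproduced in this excerpt, so the document, its tag names, and both function names below are made up for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; NOT the sample.xml from the original post.
SAMPLE_XML = """<catalog>
  <book><title>Pig Basics</title><price>10</price></book>
  <book><title>Hadoop Notes</title><price>25</price></book>
</catalog>"""


def titles_by_regex(xml_text):
    # Approach 1: treat the XML as text and capture what sits between the tags.
    return re.findall(r"<title>(.*?)</title>", xml_text)


def titles_by_xpath(xml_text):
    # Approach 2: parse the document, then select nodes with an XPath-style path.
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./book/title")]


if __name__ == "__main__":
    print(titles_by_regex(SAMPLE_XML))
    print(titles_by_xpath(SAMPLE_XML))
```

The regular-expression version is quick to write but only sees text, so it can break on attributes or nested tags of the same name. The XPath version follows the document structure, at the cost of parsing the whole record first.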
Practice Pig
To learn Pig Latin, let's question the data, using a sample movie dataset. The file has a total of 49590 records. Before we start asking questions, we need the data to be accessible in Pig. Start Pig in local mode from the command line, which opens the grunt shell, and then load the data:
$ pig -x local
grunt> movies = LOAD '/usr/lib/pig/movies.csv' USING PigStorage(',') AS (id,name,year,rating,duration);
The above statement is made up of two parts. The part to the left of ...
Introduction to Pig
Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. With the Pig Latin scripting language, ETL-style operations (Extract, Transform and Load, although here the order is ELT: Extract, Load and then Transform), ad hoc data analysis, and iterative processing can be easily achieved.
Pig is an abstraction over MapReduce. In other words, all Pig scripts are internally converted into Map and Reduce tasks to get the work done. Pig was built to make ...
MapReduce is a programming paradigm where the computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, ...
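The flow described above (Map emits intermediate pairs, the library groups them by key, Reduce consumes each group) can be sketched in a few lines of single-process Python. This is a toy word count under stated assumptions, not how Hadoop or Pig implement it: the names map_fn, reduce_fn and run_mapreduce are hypothetical, and the reducer here simply sums the values it receives for each key.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name (unused here); value: the document's text.
    # Emit one intermediate (word, 1) pair per word.
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    # Receives one intermediate key and all values grouped under it.
    yield key, sum(values)


def run_mapreduce(inputs, mapper, reducer):
    # Stand-in for the library: run the mapper, group by intermediate key,
    # then hand each group to the reducer.
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for mid_key, mid_value in mapper(in_key, in_value):
            groups[mid_key].append(mid_value)

    output = []
    for mid_key in sorted(groups):  # sorted only to make the output deterministic
        output.extend(reducer(mid_key, groups[mid_key]))
    return output


if __name__ == "__main__":
    docs = [("a.txt", "pig eats pig"), ("b.txt", "Pig sleeps")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
```

In a real cluster the grouping step is a distributed shuffle across machines. The dictionary here only mimics its effect on a single machine.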