Data Lake Security and Governance Best Practices

Data lakes are the foundation of the new data platform, enabling companies to represent their data in a uniform and consumable way. The flexibility, agility, and security of having structured, unstructured, and historical data readily available in segregated logical zones bring new possibilities and extra transformational capabilities to businesses. It is key to understand what …

GCP – Cloud Providers – Part 2

There are a good number of cloud providers available in the market. Some are public cloud providers, that is, service providers with software-, platform-, and infrastructure-as-a-service offerings. Many more cloud providers specialize in some part of the enterprise software stack. This article provides a quick view of the cloud provider timeline and the top cloud providers.

Cloud providers timeline:

Top Cloud Vendors (ranking based on annual revenue)

Microsoft Azure: Microsoft

Apache Spark Tutorial (Part 2 – RDD)

Resilient Distributed Datasets (RDD): An RDD is an abstraction, a fundamental unit of data and computation in Spark. As the name indicates, RDDs have, among others, two key features:

- They are resilient: if the data in memory is lost, an RDD can be recreated.
- They are distributed: you can have Java objects or Python objects that are distributed across clusters.

More details about RDD will be discussed later in this post.

Sample scala program:

You can monitor the jobs that are running on this cluster from
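
The two RDD properties named in this excerpt (resilient and distributed) can be illustrated with a small toy model. This is only a sketch in plain Python, not the Spark API and not the post's Scala sample: the class `ToyRDD` and its methods `parallelize`, `lose_partition`, and `partition` are hypothetical names invented for illustration. It assumes that "recreating" lost data means re-running the recorded computation for the affected partition.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark): partitions plus a recipe to rebuild them."""

    def __init__(self, compute: Callable[[int], List[int]], num_partitions: int):
        self._compute = compute  # the recipe (lineage) for rebuilding partition i
        self.num_partitions = num_partitions
        self._cache: List[Optional[List[int]]] = [None] * num_partitions

    @classmethod
    def parallelize(cls, data, num_partitions: int) -> "ToyRDD":
        data = list(data)
        size = -(-len(data) // num_partitions)  # ceiling division
        # Distributed: the data is split into contiguous chunks, one per partition.
        return cls(lambda i: data[i * size:(i + 1) * size], num_partitions)

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        parent = self
        # The child stores no data up front, only how to derive it from its parent.
        return ToyRDD(lambda i: [fn(x) for x in parent.partition(i)], self.num_partitions)

    def partition(self, i: int) -> List[int]:
        # Resilient: a missing partition is recomputed from the recipe on demand.
        if self._cache[i] is None:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        """Simulate losing the in-memory copy of one partition."""
        self._cache[i] = None

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=3)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)   # the in-memory data for partition 1 is gone
after = squares.collect()   # partition 1 is rebuilt from the recipe
print(before == after)
```

Real Spark tracks this recipe across a cluster and handles scheduling and failures itself; the sketch keeps everything in one process purely to show the idea.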

Apache Spark Tutorial (Part 1 – Introduction & Architecture)

INTRODUCTION

Apache Spark is an open source distributed data processing engine for clusters, which provides a unified programming model across different types of data processing workloads and platforms.

Spark Stack: Spark has the ability to run on a variety of cluster managers like YARN and Mesos, in addition to the Standalone cluster manager which comes bundled with Spark for standalone installation.

Spark-core: Spark-core provides services such as managing the memory pool, scheduling of

PIG Functions

Eval Functions
Load/Store Functions
Math Functions
String Functions
Tuple, Bag, Map Functions
User Defined Functions (UDFs)

Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in four languages: Java, Python, JavaScript, and Ruby.

Registering UDFs

Registering Java UDFs:

--register_java_udf.pig
register 'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
