An implementation of ROLLUP H2IRG on APACHE PIG

View the Project on GitHub bigfootproject/pig/tree/pig-rollup

Introduction

This patch addresses a current limitation of the ROLLUP operator in PIG: most of the work is done in the Map phase of the underlying MapReduce job, which generates all the intermediate keys that the reducers use to aggregate and produce the ROLLUP output.
Based on our previous work, we show that the design space for a ROLLUP implementation allows for a different approach (in-reducer grouping, IRG), in which less work is done in the Map phase and the grouping is done in the Reduce phase.
This patch provides the most efficient implementation we designed (Hybrid IRG), which exposes a parameter to balance parallelism (in the reducers) against communication cost.
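
The difference between the two approaches can be illustrated with a small, self-contained Python sketch. It is not the code of this patch: the function names `vanilla_rollup` and `irg_rollup` and the sample rows are hypothetical, combiners are ignored, SUM is assumed as the aggregate, and the IRG side is simulated as a single reducer. It only shows that map-side key expansion and in-reducer grouping produce the same ROLLUP output while shuffling a different number of records.

```python
from collections import defaultdict


def vanilla_rollup(rows):
    """Map-side expansion: every row emits one key per ROLLUP level."""
    shuffled = []  # (key, value) pairs sent from mappers to reducers
    for row in rows:
        dims, value = tuple(row[:-1]), row[-1]
        n = len(dims)
        for level in range(n, -1, -1):
            # Keep the first `level` dimensions, null out the rest.
            key = dims[:level] + (None,) * (n - level)
            shuffled.append((key, value))
    totals = defaultdict(int)
    for key, value in shuffled:  # reduce side: a plain SUM per key
        totals[key] += value
    return dict(totals), len(shuffled)


def irg_rollup(rows):
    """In-reducer grouping: one record per row is shuffled, and a single
    pass over the sorted keys produces every ROLLUP level."""
    shuffled = sorted((tuple(row[:-1]), row[-1]) for row in rows)
    totals = {}
    if not shuffled:
        return totals, 0
    n = len(shuffled[0][0])
    sums = [0] * (n + 1)  # sums[level]: running SUM of the open group at that level
    prev = None

    def flush(down_to):
        # Close the open groups at levels n, n-1, ..., down_to and
        # carry each closed sum into its parent level.
        for level in range(n, down_to - 1, -1):
            totals[prev[:level] + (None,) * (n - level)] = sums[level]
            if level > 0:
                sums[level - 1] += sums[level]
            sums[level] = 0

    for key, value in shuffled:
        if prev is not None and key != prev:
            common = 0  # length of the prefix shared with the previous key
            while key[common] == prev[common]:
                common += 1
            flush(common + 1)
        sums[n] += value
        prev = key
    flush(0)
    return totals, len(shuffled)


# Hypothetical sample data: (year, month, amount)
rows = [(2023, 1, 10), (2023, 1, 5), (2023, 2, 7), (2024, 1, 3)]
vanilla_totals, vanilla_shuffled = vanilla_rollup(rows)
irg_totals, irg_shuffled = irg_rollup(rows)
assert vanilla_totals == irg_totals
print(vanilla_shuffled, irg_shuffled)  # prints "12 4"
```

On these four rows with two dimensions, the map-side variant shuffles 12 records and the in-reducer variant shuffles 4. The price of the latter in this sketch is that one reducer does all the grouping, which is the parallelism versus communication trade-off that the Hybrid IRG parameter is meant to tune.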


Features

Our implementation contains the following features:
1. The new ROLLUP approaches: IRG and Hybrid IRG+IRG.
2. The PIVOT clause in the CUBE operator.
3. Test cases.


The new syntax to use our ROLLUP approach:
alias = CUBE rel BY { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]}
         [, { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} ...]

We have also run experiments to demonstrate the improvement that ROLLUP H2IRG brings.