RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities

Document Type


Publication Date


Subject: LCSH

Big data; Input design, Computer; Software frameworks; Computer algorithms; Machine learning


Computer Engineering | Computer Sciences | Electrical and Computer Engineering


In many domains, such as Telecom, various scenarios necessitate processing large amounts of data using statistical and machine learning algorithms. Considerable effort has been made to move data management systems into MapReduce parallel processing environments such as Hadoop and Pig. Nevertheless, these systems lack advanced machine learning and statistical analysis features. Frameworks such as Mahout, which runs on top of Hadoop, support machine learning, but their implementations are at a preliminary stage. For example, Mahout does not provide Support Vector Machine (SVM) algorithms, and it is difficult to use. On the other hand, traditional statistical software tools such as R, which contain comprehensive statistical algorithms for advanced analysis, are widely used. But such software can only run on a single computer and is therefore not scalable. In this paper, we propose an integrated solution, RPig, which combines the advantages of R (machine learning and statistical analysis capabilities) with the parallel data processing capabilities of Pig. The RPig framework offers a scalable, advanced data analysis solution for machine learning and statistical analysis. Analysis jobs can be easily developed with RPig scripts in high-level languages. We describe the design, implementation, and an Eclipse-based RPigEditor for the RPig framework. Using application scenarios from the Telecom domain, we show the usage of RPig and how the framework can significantly reduce development effort. The results demonstrate the scalability of our framework and the simplicity of deployment for analysis jobs.


Article originally published in the 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings.




Publisher Citation

M. Wang, S. B. Handurukande and M. Nassar, "RPig: A scalable framework for machine learning and advanced statistical functionalities," 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, 2012, pp. 293-300, doi: 10.1109/CloudCom.2012.6427480.
