Using PL/R in HAWQ

PL/R is a procedural language. With the HAWQ PL/R extension, you can write database functions in the R programming language and use R packages that contain R functions and data sets.

Note: To use PL/R in HAWQ, R must be installed on each node in your HAWQ cluster. Additionally, you must install the PL/R package on an existing HAWQ deployment or have specified PL/R as a build option when compiling HAWQ.

PL/R Examples

This section contains simple PL/R examples.

Example 1: Using PL/R for Single Row Operators

This function generates an array of numbers with a normal distribution using the R function rnorm().

  1. CREATE OR REPLACE FUNCTION r_norm(n integer, mean float8,
  2. std_dev float8) RETURNS float8[ ] AS
  3. $$
  4. x<-rnorm(n,mean,std_dev)
  5. return(x)
  6. $$
  7. LANGUAGE 'plr';

The following CREATE TABLE command uses the r_norm function to populate the table. The r_norm function creates an array of 10 numbers.

  1. CREATE TABLE test_norm_var
  2. AS SELECT id, r_norm(10,0,1) AS x
  3. FROM (SELECT generate_series(1,30:: bigint) AS ID) foo
  4. DISTRIBUTED BY (id);

Example 2: Returning PL/R data.frames in Tabular Form

Assuming your PL/R function returns an R data.frame as its output (unless you want to use arrays of arrays), some work is required in order for HAWQ to see your PL/R data.frame as a simple SQL table:

Create a TYPE in HAWQ with the same dimensions as your R data.frame:

  1. CREATE TYPE t1 AS ...

Use this TYPE when defining your PL/R function:

  1. ... RETURNS SET OF t1 AS ...

Sample SQL for this situation is provided in the next example.

Example 3: Process Employee Information Using PL/R

The SQL below defines a TYPE and a function to process employee information with data.frame using PL/R:

  1. -- Create type to store employee information
  2. DROP TYPE IF EXISTS emp_type CASCADE;
  3. CREATE TYPE emp_type AS (name text, age int, salary numeric(10,2));
  4. -- Create function to process employee information and return data.frame
  5. DROP FUNCTION IF EXISTS get_emps();
  6. CREATE OR REPLACE FUNCTION get_emps() RETURNS SETOF emp_type AS '
  7. names <- c("Joe","Jim","Jon")
  8. ages <- c(41,25,35)
  9. salaries <- c(250000,120000,50000)
  10. df <- data.frame(name = names, age = ages, salary = salaries)
  11. return(df)
  12. ' LANGUAGE 'plr';
  13. -- Call the function
  14. SELECT * FROM get_emps();

Downloading and Installing R Packages

R packages are modules that contain R functions and data sets. You can install R packages to extend R and PL/R functionality in HAWQ.

Note: If you expand HAWQ and add segment hosts, you must install the R packages in the R installation of each of the new hosts.

  1. For an R package, identify all dependent R packages and each package web URL. The information can be found by selecting the given package from the following navigation page:

    http://cran.r-project.org/web/packages/available_packages_by_name.html

    As an example, the page for the R package arm indicates that the package requires the following R libraries: Matrix, lattice, lme4, R2WinBUGS, coda, abind, foreign, and MASS.

    You can also try installing the package with R CMD INSTALL command to determine the dependent packages.

    For the R installation included with the HAWQ PL/R extension, the required R packages are installed with the PL/R extension. However, the Matrix package requires a newer version.

  2. From the command line, use the wget utility to download the tar.gz files for the arm package to the HAWQ master host:

    1. $ wget http://cran.r-project.org/src/contrib/Archive/arm/arm_1.5-03.tar.gz
    2. $ wget http://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_0.9996875-1.tar.gz
  3. Use the hawq scp utility and the hawq_hosts file to copy the tar.gz files to the same directory on all nodes of the HAWQ cluster. The hawq_hosts file contains a list of all of the HAWQ segment hosts. You might require root access to do this.

    1. $ hawq scp -f hosts_all Matrix_0.9996875-1.tar.gz =:/home/gpadmin
    2. $ hawq scp -f hawq_hosts arm_1.5-03.tar.gz =:/home/gpadmin
  4. Use the hawq ssh utility in interactive mode to log into each HAWQ segment host (hawq ssh -f hawq_hosts). Install the packages from the command prompt using the R CMD INSTALL command. Note that this may require root access. For example, this R install command installs the packages for the arm package.

    1. $ R CMD INSTALL Matrix_0.9996875-1.tar.gz arm_1.5-03.tar.gz

    Note: Some packages require compilation. Refer to the package documentation for possible build requirements.

  5. Ensure that the R package was installed in the /usr/lib64/R/library directory on all the segments (hawq ssh can be used to install the package). For example, this hawq ssh command lists the contents of the R library directory.

    1. $ hawq ssh -f hawq_hosts "ls /usr/lib64/R/library"
  6. Verify the R package can be loaded.

    This function performs a simple test to determine if an R package can be loaded:

    1. CREATE OR REPLACE FUNCTION R_test_require(fname text)
    2. RETURNS boolean AS
    3. $BODY$
    4. return(require(fname,character.only=T))
    5. $BODY$
    6. LANGUAGE 'plr';

    This SQL command calls the previous function to determine if the R package arm can be loaded:

    1. SELECT R_test_require('arm');

Displaying R Library Information

You can use the R command line to display information about the installed libraries and functions on the HAWQ host. You can also add and remove libraries from the R installation. To start the R command line on the host, log in to the host as the gpadmin user and run the script R.

  1. $ R

This R function lists the available R packages from the R command line:

  1. > library()

Display the documentation for a particular R package

  1. > library(help="package_name")
  2. > help(package="package_name")

Display the help file for an R function:

  1. > help("function_name")
  2. > ?function_name

To see what packages are installed, use the R command installed.packages(). This will return a matrix with a row for each package that has been installed. Below, we look at the first 5 rows of this matrix.

  1. > installed.packages()

Any package that does not appear in the installed packages matrix must be installed and loaded before its functions can be used.

An R package can be installed with install.packages():

  1. > install.packages("package_name")
  2. > install.packages("mypkg", dependencies = TRUE, type="source")

Load a package from the R command line.

  1. > library(" package_name ")

An R package can be removed with remove.packages

  1. > remove.packages("package_name")

You can use the R command -e option to run functions from the command line. For example, this command displays help on the R package named MASS.

  1. $ R -e 'help("MASS")'

References

http://www.r-project.org/ - The R Project home page

https://github.com/pivotalsoftware/gp-r - GitHub repository that contains information about using R.

https://github.com/pivotalsoftware/PivotalR - GitHub repository for PivotalR, a package that provides an R interface to operate on HAWQ tables and views that is similar to the R data.frame. PivotalR also supports using the machine learning package MADlib directly from R.

R documentation is installed with the R package:

  1. /usr/share/doc/R-N.N.N

where N.N.N corresponds to the version of R installed.

R Functions and Arguments

See http://www.joeconway.com/plr/doc/plr-funcs.html.

Passing Data Values in R

See http://www.joeconway.com/plr/doc/plr-data.html.

Aggregate Functions in R

See http://www.joeconway.com/plr/doc/plr-aggregate-funcs.html.