Using PL/Python in HAWQ

This section provides an overview of the HAWQ PL/Python procedural language extension.

About HAWQ PL/Python

PL/Python is embedded in your HAWQ product distribution or within your HAWQ build if you chose to enable it as a build option.

With the HAWQ PL/Python extension, you can write user-defined functions in Python that take advantage of Python features and modules, enabling you to quickly build robust HAWQ database applications.

HAWQ uses the system Python installation.

HAWQ PL/Python Limitations

  • HAWQ does not support PL/Python trigger functions.
  • PL/Python is available only as a HAWQ untrusted language.

Enabling and Removing PL/Python Support

To use PL/Python in HAWQ, you must either install a binary version of HAWQ that includes PL/Python or specify PL/Python as a build option when you compile HAWQ from source.

You must register the PL/Python language with a database before you can create and execute a PL/Python UDF on that database. You must be a database superuser to register and remove new languages in HAWQ databases.

On every database to which you want to install and enable PL/Python:

  1. Connect to the database using the psql client:

    gpadmin@hawq-node$ psql -d <dbname>

    Replace <dbname> with the name of the target database.

  2. Run the following SQL command to register the PL/Python procedural language:

    dbname=# CREATE LANGUAGE plpythonu;

    Note: plpythonu is installed as an untrusted language; it offers no way of restricting what you can program in UDFs created with the language. Creating and executing PL/Python UDFs is permitted only by database superusers and other database users explicitly granted the permission.

To remove support for plpythonu from a database, run the following SQL command; you must be a database superuser to remove a registered procedural language:

  dbname=# DROP LANGUAGE plpythonu;

Developing Functions with PL/Python

PL/Python functions are defined using the standard SQL CREATE FUNCTION syntax.

The body of a PL/Python user-defined function is a Python script. When the function is called, its arguments are passed as elements of the list args. Named arguments are also available to the Python script as ordinary variables.

PL/Python function results are returned with a return statement, or with a yield statement in the case of a set-returning function.
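For example, a set-returning PL/Python function uses yield to return each row of its result set. The following sketch (the function name mypysequence is hypothetical, used here only for illustration) returns the integers 0 through n-1:

```sql
=# CREATE FUNCTION mypysequence (n integer)
RETURNS SETOF integer
AS $$
for i in range(n):
    yield i
$$ LANGUAGE plpythonu;
=# SELECT mypysequence(3);
 mypysequence
--------------
            0
            1
            2
(3 rows)
```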

The following PL/Python function computes and returns the maximum of two integers:

  =# CREATE FUNCTION mypymax (a integer, b integer)
  RETURNS integer
  AS $$
  if (a is None) or (b is None):
      return None
  if a > b:
      return a
  return b
  $$ LANGUAGE plpythonu;

To execute the mypymax function:

  =# SELECT mypymax(5, 7);
  mypymax
  ---------
  7
  (1 row)

Adding the STRICT keyword to the CREATE FUNCTION statement instructs HAWQ to return null when any of the input arguments is null. When created as STRICT, the function itself need not perform null checks.

The following example uses an unnamed argument, the built-in Python max() function, and the STRICT keyword to create a UDF named mypymax2:

  =# CREATE FUNCTION mypymax2 (a integer, integer)
  RETURNS integer AS $$
  return max(a, args[1])
  $$ LANGUAGE plpythonu STRICT;
  =# SELECT mypymax2(5, 3);
  mypymax2
  ----------
  5
  (1 row)
  =# SELECT mypymax2(5, null);
  mypymax2
  ----------
  
  (1 row)

Creating the Sample Data

Perform the following steps to create, and insert data into, a simple table. This table will be used in later exercises.

  1. Create a database named testdb:

    gpadmin@hawq-node$ createdb testdb
  2. Create a table named sales:

    gpadmin@hawq-node$ psql -d testdb
    testdb=> CREATE TABLE sales (id int, year int, qtr int, day int, region text)
    DISTRIBUTED BY (id);
  3. Insert data into the table:

    testdb=> INSERT INTO sales VALUES
    (1, 2014, 1, 1, 'usa'),
    (2, 2002, 2, 2, 'europe'),
    (3, 2014, 3, 3, 'asia'),
    (4, 2014, 4, 4, 'usa'),
    (5, 2014, 1, 5, 'europe'),
    (6, 2014, 2, 6, 'asia'),
    (7, 2002, 3, 7, 'usa');

Python Modules

A Python module is a text file containing Python statements and definitions. The file name of a module follows the <python-module-name>.py naming convention.
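For example, a minimal module might live in a file named mymodule.py (a hypothetical name used here only for illustration):

```python
# Contents of mymodule.py: a minimal Python module.
# The module name "mymodule" is derived from the file name mymodule.py.

def mysum(a, b):
    """Return the sum of two numbers."""
    return a + b
```

Any Python script that can locate mymodule.py on its module search path can then run import mymodule and call mymodule.mysum(1, 2).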

Should you need to build a Python module, ensure that the appropriate software is installed on the build system. Also be sure that you are building for the correct deployment architecture, for example, 64-bit.

HAWQ Considerations

When installing a Python module in HAWQ, you must add the module to all segment nodes in the cluster. You must also add all Python modules to any new segment hosts when you expand your HAWQ cluster.

PL/Python supports the built-in HAWQ Python module named plpy. You can also install 3rd party Python modules.

plpy Module

The HAWQ PL/Python procedural language extension automatically imports the Python module plpy. plpy implements functions to execute SQL queries and prepare execution plans for queries. The plpy module also includes functions to manage errors and messages.

Executing and Preparing SQL Queries

Use the PL/Python plpy module plpy.execute() function to execute a SQL query. Use the plpy.prepare() function to prepare an execution plan for a query. Preparing the execution plan for a query is useful if you want to run the query from multiple Python functions.

plpy.execute()

Invoking plpy.execute() with a query string and an optional limit argument runs the query, returning the result in a Python result object. This result object:

  • emulates a list or dictionary object
  • returns rows that can be accessed by row number and column name; row numbering starts with 0 (zero)
  • can be modified
  • includes an nrows() method that returns the number of rows returned by the query
  • includes a status() method that returns the SPI_execute() return value

For example, when the following Python statement is present in a PL/Python user-defined function, it executes a SELECT * FROM my_table query:

  rv = plpy.execute("SELECT * FROM my_table", 3)

As instructed by the limit argument 3, the plpy.execute function will return up to 3 rows from my_table. The result set is stored in the rv object.

Access specific columns in the table by name. For example, if my_table has a column named my_column:

  my_col_data = rv[i]["my_column"]

You specified that the function return a maximum of 3 rows in the plpy.execute() command above. As such, the index i used to access the result value rv must specify an integer between 0 and 2, inclusive.
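These access patterns can be sketched in plain Python. The following example emulates the shape of a plpy.execute() result object with a hypothetical list of row dictionaries; it is illustrative only and does not call plpy:

```python
# A plpy.execute() result behaves like a list of rows, where each row is
# a dictionary keyed by column name. Emulate that shape with sample data.
rv = [
    {"id": 1, "my_column": "usa"},
    {"id": 2, "my_column": "europe"},
    {"id": 3, "my_column": "asia"},
]

my_col_data = rv[0]["my_column"]  # row 0, column "my_column"
row_count = len(rv)               # analogous to rv.nrows()
```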

Example: plpy.execute()

Example: Use plpy.execute() to run a similar query on the sales table you created in an earlier section:

  1. Define a PL/Python UDF that executes a query to return at most 5 rows from the sales table:

    =# CREATE OR REPLACE FUNCTION mypytest(a integer)
    RETURNS text
    AS $$
    rv = plpy.execute("SELECT * FROM sales ORDER BY id", 5)
    region = rv[a-1]["region"]
    return region
    $$ LANGUAGE plpythonu;

    When executed, this UDF returns the region value from the id identified by the input value a. Since row numbering of the result set starts at 0, you must access the result set with index a - 1.

    Specifying the ORDER BY id clause in the SELECT statement ensures that subsequent invocations of mypytest with the same input argument will return identical result sets.

  2. Run mypytest with an argument identifying id 3:

    =# SELECT mypytest(3);
    mypytest
    ----------
    asia
    (1 row)

    Recall that row numbering starts at 0 in a Python result set. Because mypytest accesses the result set with index a - 1, the valid input argument for the function is an integer between 1 and 5, inclusive.

    The query returns the region from the row with id = 3, asia.

Note: This example demonstrates some of the concepts discussed previously. It may not be the ideal way to return a specific column value.

plpy.prepare()

The function plpy.prepare() prepares the execution plan for a query. Preparing the execution plan for a query is useful if you plan to run the query from multiple Python functions.

You invoke plpy.prepare() with a query string. Also include a list of parameter types if you are using parameter references in the query. For example, the following statement in a PL/Python user-defined function returns the execution plan for a query:

  plan = plpy.prepare("SELECT * FROM sales WHERE
  region = $1 ORDER BY id", [ "text" ])

The string text identifies the data type of the variable $1.

After preparing an execution plan, you use the function plpy.execute() to run it. For example:

  rv = plpy.execute(plan, [ "usa" ])

When executed, rv includes all rows in the sales table where region = 'usa'.

The following sections describe how to pass data between PL/Python function calls.

Saving Execution Plans

When you prepare an execution plan using the PL/Python module, the plan is automatically saved. See the Postgres Server Programming Interface (SPI) documentation for information about execution plans.

To make effective use of saved plans across function calls, you use one of the Python persistent storage dictionaries, SD or GD.

The dictionary SD is available to store data between repeated calls to the same function; this data is private to the function. The global dictionary GD is public data, available to all Python functions within a session. Use GD with care.

Each function gets its own execution environment in the Python interpreter, so that global data and function arguments from myfunc1 are not available to myfunc2. The exception is the data in the GD dictionary, as mentioned previously.
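For example, the following pair of hypothetical UDFs (named myfunc1 and myfunc2 to match the discussion above) demonstrates that data stored in GD by one function is visible to another function in the same session:

```sql
=# CREATE FUNCTION myfunc1() RETURNS text AS $$
GD['shared'] = 'set by myfunc1'
return GD['shared']
$$ LANGUAGE plpythonu;
=# CREATE FUNCTION myfunc2() RETURNS text AS $$
# GD is shared across functions; a local variable from myfunc1 would not be visible here
return GD.get('shared', 'not set')
$$ LANGUAGE plpythonu;
```

Calling myfunc1() and then myfunc2() in the same session returns 'set by myfunc1' from both; in a new session, calling myfunc2() alone returns 'not set'.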

This example saves an execution plan to the SD dictionary and then executes the plan:

  =# CREATE FUNCTION usesavedplan() RETURNS text AS $$
  # prepare and save the plan on the first call only
  if "s1plan" not in SD:
      SD["s1plan"] = plpy.prepare("SELECT region FROM sales WHERE id=1")
  # other function processing
  # execute the saved plan
  rv = plpy.execute(SD["s1plan"])
  return rv[0]["region"]
  $$ LANGUAGE plpythonu;
  =# SELECT usesavedplan();

Example: plpy.prepare()

Example: Use plpy.prepare() and plpy.execute() to prepare and run an execution plan using the GD dictionary:

  1. Define a PL/Python UDF to prepare and save an execution plan to the GD. Also return the name of the plan:

    =# CREATE OR REPLACE FUNCTION mypy_prepplan()
    RETURNS text
    AS $$
    plan = plpy.prepare("SELECT * FROM sales WHERE region = $1 ORDER BY id", [ "text" ])
    GD["getregionplan"] = plan
    return "getregionplan"
    $$ LANGUAGE plpythonu;

    This UDF, when run, will return the name (key) of the execution plan generated from the plpy.prepare() call.

  2. Define a PL/Python UDF to run the execution plan; this function will take the plan name and region name as an input:

    =# CREATE OR REPLACE FUNCTION mypy_execplan(planname text, regionname text)
    RETURNS integer
    AS $$
    rv = plpy.execute(GD[planname], [ regionname ], 5)
    year = rv[0]["year"]
    return year
    $$ LANGUAGE plpythonu STRICT;

    This UDF executes the planname plan that was previously saved to the GD. You will call mypy_execplan() with the planname returned from the plpy.prepare() call.

  3. Execute the mypy_prepplan() and mypy_execplan() UDFs, passing region usa:

    =# SELECT mypy_execplan( mypy_prepplan(), 'usa' );
    mypy_execplan
    ---------------
    2014
    (1 row)

Handling Python Errors and Messages

The plpy module implements the following message- and error-related functions, each of which takes a message string as an argument:

  • plpy.debug(msg)
  • plpy.log(msg)
  • plpy.info(msg)
  • plpy.notice(msg)
  • plpy.warning(msg)
  • plpy.error(msg)
  • plpy.fatal(msg)

plpy.error() and plpy.fatal() raise a Python exception which, if uncaught, propagates out to the calling query, possibly aborting the current transaction or subtransaction. raise plpy.ERROR(msg) and raise plpy.FATAL(msg) are equivalent to calling plpy.error() and plpy.fatal(), respectively. Use the other message functions to generate messages of different priority levels.
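For example, a hypothetical UDF might call plpy.error() to reject invalid input; the uncaught exception aborts the calling query:

```sql
=# CREATE FUNCTION mypy_sqrt(a float8) RETURNS float8 AS $$
import math
if a < 0:
    plpy.error('mypy_sqrt: input must be non-negative')
return math.sqrt(a)
$$ LANGUAGE plpythonu;
```

SELECT mypy_sqrt(4); returns 2, while SELECT mypy_sqrt(-1); fails with an ERROR containing the message text.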

Messages may be reported to the client and/or written to the HAWQ server log file. The HAWQ server configuration parameters log_min_messages and client_min_messages control where messages are reported.

Example: Generating Messages

In this example, you will create a PL/Python UDF that includes some debug log messages. You will also configure your psql session to enable debug-level client logging.

  1. Define a PL/Python UDF that executes a query that will return at most 5 rows from the sales table. Invoke the plpy.debug() method to display some additional information:

    =# CREATE OR REPLACE FUNCTION mypytest_debug(a integer)
    RETURNS text
    AS $$
    plpy.debug('mypytest_debug executing query: SELECT * FROM sales ORDER BY id')
    rv = plpy.execute("SELECT * FROM sales ORDER BY id", 5)
    plpy.debug('mypytest_debug: query returned ' + str(rv.nrows()) + ' rows')
    region = rv[a]["region"]
    return region
    $$ LANGUAGE plpythonu;
  2. Execute the mypytest_debug() UDF, passing the integer 2 as an argument:

    =# SELECT mypytest_debug(2);
    mypytest_debug
    ----------------
    asia
    (1 row)
  3. Enable DEBUG2 level client logging:

    =# SET client_min_messages=DEBUG2;
  4. Execute the mypytest_debug() UDF again:

    =# SELECT mypytest_debug(2);
    ...
    DEBUG2: mypytest_debug executing query: SELECT * FROM sales ORDER BY id
    ...
    DEBUG2: mypytest_debug: query returned 5 rows
    ...

    Debug output is very verbose. You will need to scan a lot of output to find the mypytest_debug messages. Hint: look near both the start and the end of the output.

  5. Turn off client-level debug logging:

    =# SET client_min_messages=NOTICE;

3rd-Party Python Modules

PL/Python supports installation and use of 3rd-party Python Modules. This section includes examples for installing the setuptools and NumPy Python modules.

Note: You must have superuser privileges to install Python modules to the system Python directories.

Example: Installing setuptools

In this example, you will manually install the Python setuptools module from the Python Package Index repository. setuptools enables you to easily download, build, install, upgrade, and uninstall Python packages.

You will first build the module from the downloaded package, installing it on a single host. You will then build and install the module on all segment nodes in your HAWQ cluster.

  1. Download the setuptools module package from the Python Package Index site. For example, run this wget command on a HAWQ node as the gpadmin user:

    $ ssh gpadmin@<hawq-node>
    gpadmin@hawq-node$ . /usr/local/hawq/greenplum_path.sh
    gpadmin@hawq-node$ mkdir plpython_pkgs
    gpadmin@hawq-node$ cd plpython_pkgs
    gpadmin@hawq-node$ export PLPYPKGDIR=`pwd`
    gpadmin@hawq-node$ wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-18.4.tar.gz
  2. Extract the files from the tar.gz package:

    gpadmin@hawq-node$ tar -xzvf setuptools-18.4.tar.gz
  3. Run the Python scripts to build and install the Python package; you must have superuser privileges to install Python modules to the system Python installation:

    gpadmin@hawq-node$ cd setuptools-18.4
    gpadmin@hawq-node$ python setup.py build
    gpadmin@hawq-node$ sudo python setup.py install
  4. Run the following command to verify the module is available to Python:

    gpadmin@hawq-node$ python -c "import setuptools"

    If no error is returned, the setuptools module was successfully imported.

  5. The setuptools package installs the easy_install utility. This utility enables you to install Python packages from the Python Package Index repository. For example, this command installs the Python pip utility from the Python Package Index site:

    gpadmin@hawq-node$ sudo easy_install pip
  6. Copy the setuptools package to all HAWQ nodes in your cluster. For example, this command copies the tar.gz file from the current host to the host systems listed in the file hawq-hosts:

    gpadmin@hawq-node$ cd $PLPYPKGDIR
    gpadmin@hawq-node$ hawq scp -f hawq-hosts setuptools-18.4.tar.gz =:/home/gpadmin
  7. Run the commands to build, install, and test the setuptools package you just copied to all hosts in your HAWQ cluster. For example:

    gpadmin@hawq-node$ hawq ssh -f hawq-hosts
    >>> mkdir plpython_pkgs
    >>> cd plpython_pkgs
    >>> tar -xzvf ../setuptools-18.4.tar.gz
    >>> cd setuptools-18.4
    >>> python setup.py build
    >>> sudo python setup.py install
    >>> python -c "import setuptools"
    >>> exit

Example: Installing NumPy

In this example, you will build and install the Python module NumPy. NumPy is a module for scientific computing with Python. For additional information about NumPy, refer to http://www.numpy.org/.

This example assumes yum is installed on all HAWQ segment nodes and that the gpadmin user is a member of sudoers with root privileges on the nodes.

Prerequisites

Building the NumPy package requires the following software:

  • OpenBLAS libraries - an open source implementation of BLAS (Basic Linear Algebra Subprograms)
  • Python development packages - python-devel
  • gcc compilers - gcc, gcc-gfortran, and gcc-c++

Perform the following steps to set up the OpenBLAS compilation environment on each HAWQ node:

  1. Use yum to install gcc compilers from system repositories. The compilers are required on all hosts where you compile OpenBLAS. For example:

    root@hawq-node$ yum -y install gcc gcc-gfortran gcc-c++ python-devel
  2. (Optional) If you cannot install the correct compiler versions with yum, you can download the gcc compiler source, including gfortran, and build and install the compilers manually. Refer to Building gfortran from Source for gfortran build and install information.

  3. Create a symbolic link to g++, naming it gxx:

    root@hawq-node$ ln -s /usr/bin/g++ /usr/bin/gxx
  4. You may also need to create symbolic links to any libraries that have different versions available; for example, linking libppl_c.so.4 to libppl_c.so.2.

  5. You can use the hawq scp utility to copy files to HAWQ hosts and the hawq ssh utility to run commands on those hosts.

Obtaining Packages

Perform the following steps to download and distribute the OpenBLAS and NumPy source packages:

  1. Download the OpenBLAS and NumPy source files. For example, these wget commands download tar.gz files into a packages directory in the current working directory:

    $ ssh gpadmin@<hawq-node>
    gpadmin@hawq-node$ wget --directory-prefix=packages http://github.com/xianyi/OpenBLAS/tarball/v0.2.8
    gpadmin@hawq-node$ wget --directory-prefix=packages http://sourceforge.net/projects/numpy/files/NumPy/1.8.0/numpy-1.8.0.tar.gz/download
  2. Distribute the software to all nodes in your HAWQ cluster. For example, if you downloaded the software to /home/gpadmin/packages, these commands create the packages directory on all nodes and copy the software to the nodes listed in the hawq-hosts file:

    gpadmin@hawq-node$ hawq ssh -f hawq-hosts mkdir packages
    gpadmin@hawq-node$ hawq scp -f hawq-hosts packages/* =:/home/gpadmin/packages

Build and Install OpenBLAS Libraries

Before building and installing the NumPy module, you must first build and install the OpenBLAS libraries. This section describes how to build and install the libraries on a single HAWQ node.

  1. Extract the OpenBLAS files from the downloaded tarball:

    $ ssh gpadmin@<hawq-node>
    gpadmin@hawq-node$ cd packages
    gpadmin@hawq-node$ tar xzf v0.2.8 -C /home/gpadmin/packages
    gpadmin@hawq-node$ mv /home/gpadmin/packages/xianyi-OpenBLAS-9c51cdf /home/gpadmin/packages/OpenBLAS

    These commands extract the OpenBLAS tar file and simplify the unpacked directory name.

  2. Compile OpenBLAS. You must set the LIBRARY_PATH environment variable to the current $LD_LIBRARY_PATH. For example:

    gpadmin@hawq-node$ cd OpenBLAS
    gpadmin@hawq-node$ export LIBRARY_PATH=$LD_LIBRARY_PATH
    gpadmin@hawq-node$ make FC=gfortran USE_THREAD=0 TARGET=SANDYBRIDGE

    Replace the TARGET argument with the target appropriate for your hardware. The TargetList.txt file identifies the list of supported OpenBLAS targets.

    Compiling OpenBLAS may take some time.

  3. Install the OpenBLAS libraries in /usr/local and then change the owner of the files to gpadmin. You must have root privileges. For example:

    gpadmin@hawq-node$ sudo make PREFIX=/usr/local install
    gpadmin@hawq-node$ sudo ldconfig
    gpadmin@hawq-node$ sudo chown -R gpadmin /usr/local/lib

    The following libraries are installed to /usr/local/lib, along with symbolic links:

    gpadmin@hawq-node$ ls -l /usr/local/lib
    ...
    libopenblas.a -> libopenblas_sandybridge-r0.2.8.a
    libopenblas_sandybridge-r0.2.8.a
    libopenblas_sandybridge-r0.2.8.so
    libopenblas.so -> libopenblas_sandybridge-r0.2.8.so
    libopenblas.so.0 -> libopenblas_sandybridge-r0.2.8.so
    ...
  4. Install the OpenBLAS libraries on all nodes in your HAWQ cluster. You can use the hawq ssh utility to similarly build and install the OpenBLAS libraries on each of the nodes.

    Or, you may choose to copy the OpenBLAS libraries you just built to all of the HAWQ cluster nodes. For example, these hawq ssh and hawq scp commands install prerequisite packages, and copy and install the OpenBLAS libraries on the hosts listed in the hawq-hosts file.

    $ hawq ssh -f hawq-hosts -e 'sudo yum -y install gcc gcc-gfortran gcc-c++ python-devel'
    $ hawq ssh -f hawq-hosts -e 'ln -s /usr/bin/g++ /usr/bin/gxx'
    $ hawq ssh -f hawq-hosts -e 'sudo chown gpadmin /usr/local/lib'
    $ hawq scp -f hawq-hosts /usr/local/lib/libopen*sandy* =:/usr/local/lib
    $ hawq ssh -f hawq-hosts
    >>> cd /usr/local/lib
    >>> ln -s libopenblas_sandybridge-r0.2.8.a libopenblas.a
    >>> ln -s libopenblas_sandybridge-r0.2.8.so libopenblas.so
    >>> ln -s libopenblas_sandybridge-r0.2.8.so libopenblas.so.0
    >>> sudo ldconfig

Build and Install NumPy

After you have installed the OpenBLAS libraries, you can build and install the NumPy module. These steps install the NumPy module on a single host. You can use the hawq ssh utility to build and install the NumPy module on multiple hosts.

  1. Extract the NumPy module source files:

    gpadmin@hawq-node$ cd /home/gpadmin/packages
    gpadmin@hawq-node$ tar xzf numpy-1.8.0.tar.gz

    Unpacking the numpy-1.8.0.tar.gz file creates a directory named numpy-1.8.0 in the current directory.

  2. Set up the environment for building and installing NumPy:

    gpadmin@hawq-node$ export BLAS=/usr/local/lib/libopenblas.a
    gpadmin@hawq-node$ export LAPACK=/usr/local/lib/libopenblas.a
    gpadmin@hawq-node$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/
    gpadmin@hawq-node$ export LIBRARY_PATH=$LD_LIBRARY_PATH
  3. Build and install NumPy. (Building the NumPy package might take some time.)

    gpadmin@hawq-node$ cd numpy-1.8.0
    gpadmin@hawq-node$ python setup.py build
    gpadmin@hawq-node$ sudo python setup.py install

    Note: If the NumPy module did not successfully build, the NumPy build process might need a site.cfg file that specifies the location of the OpenBLAS libraries. Create the site.cfg file in the NumPy package directory:

    gpadmin@hawq-node$ touch site.cfg

    Add the following to the site.cfg file and run the NumPy build command again:

    [default]
    library_dirs = /usr/local/lib
    [atlas]
    atlas_libs = openblas
    library_dirs = /usr/local/lib
    [lapack]
    lapack_libs = openblas
    library_dirs = /usr/local/lib
    # added for scikit-learn
    [openblas]
    libraries = openblas
    library_dirs = /usr/local/lib
    include_dirs = /usr/local/include
  4. Verify that the NumPy module is available for import by Python:

    gpadmin@hawq-node$ cd $HOME
    gpadmin@hawq-node$ python -c "import numpy"

    If no error is returned, the NumPy module was successfully imported.

  5. As performed in the setuptools Python module installation, use the hawq ssh utility to build, install, and test the NumPy module on all HAWQ nodes.

  6. The environment variables that were required to build the NumPy module are also required in the gpadmin runtime environment to run Python NumPy functions. You can use the echo command to add the environment variables to gpadmin’s .bashrc file. For example, the following echo commands add the environment variables to the .bashrc file in gpadmin’s home directory:

    $ echo -e '\n#Needed for NumPy' >> ~/.bashrc
    $ echo -e 'export BLAS=/usr/local/lib/libopenblas.a' >> ~/.bashrc
    $ echo -e 'export LAPACK=/usr/local/lib/libopenblas.a' >> ~/.bashrc
    $ echo -e 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib' >> ~/.bashrc
    $ echo -e 'export LIBRARY_PATH=$LD_LIBRARY_PATH' >> ~/.bashrc

    You can use the hawq ssh utility with these echo commands to add the environment variables to the .bashrc file on all nodes in your HAWQ cluster.

Testing Installed Python Modules

You can create a simple PL/Python user-defined function (UDF) to validate that a Python module is available in HAWQ. This example tests the NumPy module.

  1. Create a PL/Python UDF that imports the NumPy module:

    gpadmin@hawq_node$ psql -d testdb
    =# CREATE OR REPLACE FUNCTION test_importnumpy(x int)
    RETURNS text
    AS $$
    try:
        from numpy import *
        return 'SUCCESS'
    except ImportError, e:
        return 'FAILURE'
    $$ LANGUAGE plpythonu;

    The function returns SUCCESS if the module is imported, and FAILURE if an import error occurs.

  2. Create a table that loads data on each HAWQ segment instance:

    => CREATE TABLE disttbl AS (SELECT x FROM generate_series(1,50) x ) DISTRIBUTED BY (x);

    Depending upon the size of your HAWQ installation, you may need to generate a larger series to ensure data is distributed to all segment instances.

  3. Run the UDF; the query executes on the segment instances where the table data resides:

    =# SELECT gp_segment_id, test_importnumpy(1) AS status
    FROM disttbl
    GROUP BY gp_segment_id, status
    ORDER BY gp_segment_id, status;

    The SELECT command returns SUCCESS if the UDF imported the Python module on the HAWQ segment instance. FAILURE is returned if the Python module could not be imported.

Troubleshooting Python Module Import Failures

Possible causes of a Python module import failure include:

  • A problem accessing required libraries. For the NumPy example, HAWQ might have a problem accessing the OpenBLAS libraries or the Python libraries on a segment host.

    Try: Test importing the module on the segment host. This hawq ssh command tests importing the NumPy module on the segment host named mdw1.

    gpadmin@hawq-node$ hawq ssh -h mdw1 python -c "import numpy"
  • Environment variables may not be configured in the HAWQ environment. The Python import command may not return an error in this case.

    Try: Ensure that the environment variables are properly set. For the NumPy example, ensure that the environment variables listed at the end of the section Build and Install NumPy are defined in the .bashrc file for the gpadmin user on the master and all segment nodes.

    Note: The .bashrc file for the gpadmin user on the HAWQ master and all segment nodes must source the greenplum_path.sh file.

  • HAWQ might not have been restarted after adding environment variable settings to the .bashrc file. Again, the Python import command may not return an error in this case.

    Try: Ensure that you have restarted HAWQ.

    gpadmin@master$ hawq restart cluster

Using the GD Dictionary to Improve PL/Python Performance

Importing a Python module is an expensive operation that can adversely affect performance. If you are importing the same module frequently, you can use Python global variables to load the module on the first invocation and forgo the import on subsequent calls.

The following PL/Python function uses the GD persistent storage dictionary to avoid importing the module NumPy if it has already been imported in the GD. The UDF includes a call to plpy.notice() to display a message when importing the module.

  =# CREATE FUNCTION mypy_import2gd() RETURNS text AS $$
  if 'numpy' not in GD:
      plpy.notice('mypy_import2gd: importing module numpy')
      import numpy
      GD['numpy'] = numpy
  return 'numpy'
  $$ LANGUAGE plpythonu;
  =# SELECT mypy_import2gd();
  NOTICE: mypy_import2gd: importing module numpy
  CONTEXT: PL/Python function "mypy_import2gd"
  mypy_import2gd
  ----------------
  numpy
  (1 row)
  =# SELECT mypy_import2gd();
  mypy_import2gd
  ----------------
  numpy
  (1 row)

The second SELECT call does not include the NOTICE message, indicating that the module was obtained from the GD.

References

This section lists references for using PL/Python.

Technical References

For information about PL/Python in HAWQ, see the PL/Python - Python Procedural Language PostgreSQL documentation.

For information about Python Package Index (PyPI), refer to PyPI - the Python Package Index.

The following Python modules may be of interest:

  • SciPy library provides user-friendly and efficient numerical routines including those for numerical integration and optimization. To download the SciPy package tar file:

    hawq-node$ wget http://sourceforge.net/projects/scipy/files/scipy/0.10.1/scipy-0.10.1.tar.gz
  • Natural Language Toolkit (nltk) is a platform for building Python programs to work with human language data.

    The Python distribute package is required for nltk. The distribute package should be installed before installing nltk. To download the distribute package tar file:

    hawq-node$ wget http://pypi.python.org/packages/source/d/distribute/distribute-0.6.21.tar.gz

    To download the nltk package tar file:

    hawq-node$ wget http://pypi.python.org/packages/source/n/nltk/nltk-2.0.2.tar.gz#md5=6e714ff74c3398e88be084748df4e657

Useful Reading

For information about the Python language, see http://www.python.org/.

A set of slides from a talk about how the Pivotal Data Science team uses the PyData stack in the Pivotal MPP databases and on Pivotal Cloud Foundry is available at http://www.slideshare.net/SrivatsanRamanujam/all-thingspythonpivotal.