You may be using the gphdfs external table protocol in a Greenplum Database version 4 or 5 cluster to access data stored in Hadoop. Greenplum Database version 6 removes support for the gphdfs protocol. To maintain access to Hadoop data in Greenplum 6, you must migrate your gphdfs external tables to use the Greenplum Platform Extension Framework (PXF). This involves setting up PXF and creating new external tables that use the pxf external table protocol.

To migrate your gphdfs external tables to use the pxf external table protocol, you:

  1. Prepare for the migration.
  2. Map configuration properties, and then [set up PXF] (#id_cfg_prop).
  3. Create a new pxf external table to replace each gphdfs external table.
  4. Verify access to Hadoop files with the pxf external tables.
  5. Remove the gphdfs external tables.
  6. Revoke privileges to the gphdfs protocol.
  7. Migrate data to Greenplum 6.

Note: If you are migrating gphdfs from a Greenplum Database 5 installation, you perform the migration in the order above in your Greenplum 5 cluster before you migrate data to Greenplum 6.

Note: If you are migrating gphdfs from a Greenplum Database 4 installation, you perform the migration in a similar order. However, since PXF is not available in Greenplum Database 4, you must perform certain actions in the Greenplum 6 installation before you migrate the data:

  1. Greenplum 4: Prepare for the migration.
  2. Greenplum 6:
    1. Install and configure the Greenplum 6 software.
    2. Map configuration properties, and then install and set up PXF.
    3. Create a new pxf external table to replace each gphdfs external table.
    4. Verify access to Hadoop files with the pxf external tables.
  3. Greenplum 4:
    1. Remove the gphdfs external tables.
    2. Revoke privileges to the gphdfs protocol.
  4. Migrate Greenplum 4 data to Greenplum 6.

Preparing for the Migration

As you prepare for migrating from gphdfs to PXF:

  1. Determine which gphdfs tables you want to migrate.

    You can run the following query to list the gphdfs external tables in a database:

    1. SELECT n.nspname, d.objid::regclass as tablename
    2. FROM pg_depend d
    3. JOIN pg_exttable x ON ( d.objid = x.reloid )
    4. JOIN pg_extprotocol p ON ( p.oid = d.refobjid )
    5. JOIN pg_class c ON ( c.oid = d.objid )
    6. JOIN pg_namespace n ON ( c.relnamespace = n.oid )
    7. WHERE d.refclassid = 'pg_extprotocol'::regclass AND p.ptcname = 'gphdfs';
  2. For each table that you choose to migrate, identify the format of the external data and the column definitions. Also identify the options with which the gphdfs table was created. You can use the \dt+ SQL meta-command to obtain this information. For example:

    1. \d+ public.gphdfs_writable_parquet
    2. External table "public.gphdfs_writable_parquet"
    3. Column | Type | Modifiers | Storage | Description
    4. --------+---------+-----------+----------+-------------
    5. id | integer | | plain |
    6. msg | text | | extended |
    7. Type: writable
    8. Encoding: UTF8
    9. Format type: parquet
    10. Format options: formatter 'gphdfs_export'
    11. External options: {}
    12. External location: gphdfs://hdfshost:8020/data/dir1/gphdfs_writepq?codec=GZIP
    13. Execute on: all segments
  3. Save the information that you gathered above.

Setting Up PXF

PXF does not use the following gphdfs configuration options:

gphdfs Configuration OptionDescriptionpxf Consideration
HADOOP_HOMEEnvironment variable that identifies the Hadoop installation directoryNot applicable; PXF is bundled with the required dependent Hadoop libraries and JARs
CLASSPATHEnvironment variable that identifies the locations of Hadoop JAR and configuration filesNot applicable, PXF automatically includes the Hadoop libraries, JARs, and configuration files that it bundles in the CLASSPATH. PXF also automatically includes user-registered dependencies found in the $PXF_CONF/lib directory in the CLASSPATH.
gp_hadoop_target_versionServer configuration parameter that identifies the Hadoop distributionNot applicable, PXF works out-of-the-box with the different Hadoop distributions
gp_hadoop_homeServer configuration parameter that identifies the Hadoop installation directoryNot applicable, PXF works out-of-the-box with the different Hadoop distributions

Configuration properties required by PXF, and the gphdfs equivalent, if applicable, include:

Configuration ItemDescriptiongphdfs Configpxf Config
JAVA_HOMEEnvironment variable that identifies the Java installation directorySet JAVA_HOME on each segment hostSet JAVA_HOME on each segment host
JVM option settingsOptions with which to start the JVMSet options in the GP_JAVA_OPT environment variable in hadoop_env.shSet options in the PXF_JVM_OPTS environment variable in $PXF_CONF/conf/pxf-env.sh
PXF ServerPXF server configuration for HadoopNot applicableConfigure a PXF server for Hadoop
PrivilegesThe Greenplum Database privileges required to create external tables in the given protocolGrant SELECT and INSERT privileges on the gphdfs protocol to appropriate usersGrant SELECT and INSERT privileges on the pxf protocol to appropriate users

After you determine the equivalent PXF configuration properties, you will:

  1. Update the Java version installed on each Greenplum Database host, if necessary. PXF supports Java version 8 and 11. If your Greenplum Database cluster hosts are running Java 7, upgrade to Java version 8 or 11 as described in Installing Java for PXF. Note the $JAVA_HOME setting.

  2. If you are migrating from Greenplum Database 4, or you have not previously used PXF in your Greenplum 5 installation:

    1. Install the newest version of the independent PXF distribution on your Greenplum Database hosts.

    2. Inform PXF of the $JAVA_HOME setting by specifying its value in the pxf-env.sh configuration file.

      • Edit the pxf-env.sh file on the Greenplum master node.

        1. gpadmin@gpmaster$ vi /usr/local/pxf-gp6/conf/pxf-env.sh
      • Locate the JAVA_HOME setting in the pxf-env.sh file, uncomment if necessary, and set it to your $JAVA_HOME value. For example:

        1. export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk/jre/
    3. Register the PXF extension with Greenplum Database:

    You must initialize PXF before you can access Hadoop.

    1. Enable the PXF extension and grant users access to PXF.
  3. Configure the PXF Hadoop Connectors. This procedure creates a PXF server configuration that provides PXF the information that it requires to access Hadoop. This procedure also synchronizes the configuration changes to all hosts in your Greenplum cluster.

  4. Start or restart PXF in your Greenplum Database cluster:

    1. gpadmin@gpmaster$ pxf cluster start

Creating a PXF External Table

gphdfs and pxf are both external table protocols. Creating an external table using these protocols is similar. You specify the external table name and its column definitions. You also specify LOCATION and FORMAT clauses. gphdfs and pxf use information in these clauses to determine the location and type of the external data.

Mapping the LOCATION Clause

The LOCATION clause of an external table identifies the external table protocol, location of the external data, and protocol- and operation-specific custom options.

The format of gphdfs‘s LOCATION clause is as follows:

  1. LOCATION('gphdfs://<hdfs_host>:<hdfs_port>/<path-to-data>?[&<custom-option>=<value>[...]]')

PXF’s LOCATION clause takes the following format when you access data stored on Hadoop:

  1. LOCATION('pxf://<path-to-data>?PROFILE=<profile_name>[&SERVER=<server_name>][&<custom-option>=<value>[...]]')

You are not required to specify the HDFS host and port number when you create a PXF external table. PXF obtains this information from the default server configuration, or from the server configuration name that you specify in <server_name>.

Refer to Creating an External Table in the PXF documentation for more information about the PXF CREATE EXTERNAL TABLE syntax and keywords.

When you create an external table specifying the gphdfs protocol, you identify the format of the external data in the FORMAT clause (discussed in the next section). PXF uses a PROFILE option in the LOCATION clause to identify the source and type of the external data.

Data Formatpxf PROFILE
Avrohdfs:avro
Parquethdfs:parquet
Texthdfs:text

Refer to Connectors, Data Formats, and Profiles in the PXF documentation for more information about the PXF profiles supported for Hadoop.

Both gphdfs and pxf utilize custom options in the LOCATION clause to identify data-format-, operation-, or profile-specific options supported by the protocol. For example, both gphdfs and pxf support parquet and compression options on INSERT operations.

Should you need to migrate a gphdfs writable external table that references an HDFS file to PXF, map gphdfs to PXF writable external table compression options as follows:

Descriptiongphdfs LOCATION Optionpxf LOCATION Option
Use of CompressioncompressNot applicable; depends on the profile - may be uncompressed by default or specified via COMPRESSION_CODEC
Type of compressioncompression_typeCOMPRESSION_TYPE
Compression codeccodecCOMPRESSION_CODEC
Level of Compression1codec_levelCODEC_LEVEL (supported in PXF 5.14.0 and newer versions)

1 Avro format deflate codec only.

If the HDFS file is a Parquet-format file, map these additional parquet options as follows:

Descriptiongphdfs LOCATION Optionpxf LOCATION Option
Parquet schemaschemaSCHEMA
Page sizepagesizePAGE_SIZE
Row group sizerowgroupsizeROWGROUP_SIZE
Parquet versionparquetversion or pqversionPARQUET_VERSION
Enable a dictionarydictionaryenableThe dictionary is always enabled when writing Parquet data with PXF
Dictionary page sizedictionarypagesizeDICTIONARY_PAGE_SIZE

Mapping the FORMAT Options

The gphdfs protocol uses the FORMAT clause to determine the format of the external data. For Avro- and Parquet-format data, the PXF FORMAT clause must identify the name of a custom formatter.

Data Formatgphdfs FORMAT Optionpxf FORMAT Option
AvroFORMAT ‘AVRO’FORMAT ‘CUSTOM’ (FORMATTER=’pxfwritable_import’) (read)
FORMAT ‘CUSTOM’ (FORMATTER=’pxfwritable_export’) (write)
ParquetFORMAT ‘PARQUET’FORMAT ‘CUSTOM’ (FORMATTER=’pxfwritable_import’) (read)
FORMAT ‘CUSTOM’ (FORMATTER=’pxfwritable_export’) (write)
TextFORMAT ‘TEXT’ (DELIMITER ‘,’)FORMAT ‘TEXT’ (DELIMITER ‘,’)

For text data, the FORMAT clause may identify a delimiter or other formatting option as described on the CREATE EXTERNAL TABLE command reference page.

Example gphdfs to pxf External Table Mapping for an HDFS Text File

Example gphdfs CREATE EXTERNAL TABLE command to read a text file on HDFS:

  1. CREATE EXTERNAL TABLE ext_expenses (
  2. name text,
  3. date date,
  4. amount float4,
  5. category text,
  6. desc1 text )
  7. LOCATION ('gphdfs://hdfshost-1:8081/dir/filename.txt')
  8. FORMAT 'TEXT' (DELIMITER ',');

Equivalent pxf CREATE EXTERNAL TABLE command, providing that the default PXF server contains the Hadoop configuration:

  1. CREATE EXTERNAL TABLE ext_expenses_pxf (
  2. name text,
  3. date date,
  4. amount float4,
  5. category text,
  6. desc1 text )
  7. LOCATION ('pxf://dir/filename.txt?PROFILE=hdfs:text')
  8. FORMAT 'TEXT' (DELIMITER ',');

Verifying Access with PXF

Ensure that you can read from, or write to, each pxf external table that you have created.

Removing the gphdfs External Tables

You must remove all gphdfs external tables before you can successfully migrate a Greenplum Database 4 or 5 database to Greenplum 6.

Drop an external table as follows:

  1. DROP EXTERNAL TABLE <schema_name>.<external_table_name>;

Revoking Privileges to the gphdfs Protocol

Before you migrate, you must revoke privileges to the gphdfs protocol from each Greenplum Database role to which you assigned the privileges.

Revoke the privilege as follows:

  1. ALTER ROLE <role_name> NOCREATEEXTTABLE(protocol='gphdfs',type='readable');
  2. ALTER ROLE <role_name> NOCREATEEXTTABLE(protocol='gphdfs',type='writable');