One-time gphdfs Protocol Installation (Deprecated)
Install and configure Hadoop for use with gphdfs as follows:
1. Install Java 1.7 or later on all Greenplum Database hosts: master, segment, and standby master.
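To confirm that a suitable Java version is installed on every host, you can run a quick check with the gpssh utility. The host file name hostfile_all below is an assumption; substitute a file that lists your master, standby master, and segment hosts.

gpssh -f hostfile_all -e 'java -version'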
2. Install a compatible Hadoop distribution on all hosts. The distribution must be the same on all hosts. For Hadoop installation information, see the Hadoop distribution documentation. See the Greenplum Database Release Notes for information about compatible Hadoop distributions.
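A similar check can confirm that the same Hadoop distribution and version are present on every host, assuming the hadoop command is on each host's path and the same hostfile_all host file:

gpssh -f hostfile_all -e 'hadoop version'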
3. After installation, ensure that the Greenplum system user (gpadmin) has read and execute access to the Hadoop libraries or to the Greenplum MR client.
4. Set the following environment variables on all segments:
JAVA_HOME – the Java home directory
HADOOP_HOME – the Hadoop home directory

For example, add lines such as the following to the gpadmin user's .bashrc profile:
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/lib/gphd
The variables must be set in the ~gpadmin/.bashrc or the ~gpadmin/.bash_profile file so that the gpadmin user shell environment can locate the Java home and Hadoop home.
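After updating the profile, a check such as the following can confirm that the variables resolve on each host; hostfile_seg is an assumed file listing the segment hosts:

gpssh -f hostfile_seg -e 'source ~/.bashrc; echo JAVA_HOME=$JAVA_HOME HADOOP_HOME=$HADOOP_HOME'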
5. Set the following Greenplum Database server configuration parameters and reload the Greenplum Database configuration.
gp_hadoop_target_version
Description: The Hadoop target. Choose one of the following: cdh, hadoop, hdp, mpr.
Default value: hadoop
Set classifications: master, session, reload

gp_hadoop_home
Description: The installation directory for Hadoop.
Default value: NULL
Set classifications: master, session, reload

For example, the following commands use the Greenplum Database utilities gpconfig and gpstop to set the server configuration parameter and reload the configuration:

gpconfig -c gp_hadoop_target_version -v 'hdp'
gpstop -u
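You can confirm the new setting with the gpconfig -s option, which displays the parameter value on the master and segments:

gpconfig -s gp_hadoop_target_version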
For information about the Greenplum Database utilities gpconfig and gpstop, see the Greenplum Database Utility Guide.

6. If needed, ensure that the CLASSPATH environment variable generated by the $GPHOME/lib/hadoop/hadoop_env.sh file on every Greenplum Database host contains the path to the JAR files that contain the Java classes required for gphdfs.

For example, if gphdfs returns a class not found exception, ensure that the JAR file containing the class is present on every Greenplum Database host, and update the $GPHOME/lib/hadoop/hadoop_env.sh file so that the CLASSPATH environment variable created by the file includes the JAR file.
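As an illustration only, an addition such as the following could be appended to $GPHOME/lib/hadoop/hadoop_env.sh on every host. The JAR name and location here are hypothetical placeholders, not real gphdfs dependencies:

# Hypothetical example: append a custom JAR to the generated CLASSPATH.
# The JAR path below is a placeholder; the file must exist at this
# path on every Greenplum Database host.
CLASSPATH=${CLASSPATH}:/usr/local/lib/custom-serde.jar
export CLASSPATH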
About gphdfs JVM Memory
When Greenplum Database accesses external table data from an HDFS location with the gphdfs protocol, each Greenplum Database segment on a host system starts a JVM for use by the protocol. The default JVM heap size is 1GB and should be enough for most workloads.
If the gphdfs JVM runs out of memory, the issue might be related to the density of tuples inside the Hadoop HDFS block assigned to the gphdfs segment worker. A higher density of tuples per block requires more gphdfs memory. HDFS block size is usually 128MB, 256MB, or 512MB, depending on the Hadoop cluster configuration.
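If you are unsure of your cluster's block size, on recent Hadoop versions a command such as the following reports the default block size in bytes (the configuration key name can differ on older distributions):

hdfs getconf -confKey dfs.blocksize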
You can increase the JVM heap size by changing the GP_JAVA_OPT variable in the file $GPHOME/lib/hadoop/hadoop_env.sh. In the following example line, the option -Xmx1000m specifies that the JVM can consume up to 1GB of virtual memory:
export GP_JAVA_OPT='-Xmx1000m -XX:+DisplayVMOutputToStderr'
The $GPHOME/lib/hadoop/hadoop_env.sh file must be updated for every segment instance in the Greenplum Database system.
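One way to propagate the edited file, assuming a hostfile_seg file that lists the segment hosts, is the gpscp utility:

gpscp -f hostfile_seg $GPHOME/lib/hadoop/hadoop_env.sh =:$GPHOME/lib/hadoop/hadoop_env.sh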
Important: Before increasing the gphdfs JVM memory, ensure that you have sufficient memory on the host. For example, 8 primary segments consume 8GB of virtual memory for the gphdfs JVM when using the default heap size. Increasing the Java -Xmx value to 2GB results in 16GB allocated in an environment of 8 segments per host.
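Before raising -Xmx, a check such as the following (again assuming a hostfile_seg host file) can confirm that each segment host has memory to spare; free -g reports totals in gigabytes:

gpssh -f hostfile_seg -e 'free -g'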