<table_name>
The name of the new external table.
<column_name>
The name of a column to create in the external table definition. Unlike regular tables, external tables do not have column constraints or default values, so do not specify those.
LIKE <other_table>
The LIKE clause specifies a table from which the new external table automatically copies all column names, data types, and HAWQ distribution policy. If the original table specifies any column constraints or default column values, those will not be copied over to the new external table definition.
<data_type>
The data type of the column.
LOCATION ('<protocol>://<host>[:<port>]/<path>/<file>' [, ...])
For readable external tables, specifies the URI of the external data source(s) to be used to populate the external table or web table. Regular readable external tables allow the file, gpfdist, and pxf protocols. Web external tables allow the http protocol. If <port> is omitted, the http and gpfdist protocols assume port 8080 and the pxf protocol assumes the <host> specifies a high availability Nameservice ID. If using the gpfdist protocol, the <path> is relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program). Also, the <path> can use wildcards (or other C-style pattern matching) in the <file> name part of the location to denote multiple files in a directory. For example:
'gpfdist://filehost:8081/*'
'gpfdist://masterhost/my_load_file'
'http://intranet.example.com/finance/expenses.csv'
'pxf://mdw:41200/sales/*.csv?Profile=HDFS'
For writable external tables, specifies the URI location of the gpfdist process that will collect data output from the HAWQ segments and write it to the named file. The <path> is relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program). If multiple gpfdist locations are listed, the segments sending data will be evenly divided across the available output locations. For example:
'gpfdist://outputhost:8081/data1.out',
'gpfdist://outputhost:8081/data2.out'
With two gpfdist locations listed as in the above example, half of the segments would send their output data to the data1.out file and the other half to the data2.out file.
For the pxf protocol, the LOCATION string specifies the HDFS NameNode <host> and the <port> of the PXF service, the location of the data, and the PXF profile or Java classes used to convert the data between storage format and HAWQ format. If the <port> is omitted, the <host> is taken to be the logical name for the high availability Nameservice, and the <port> is the value of the pxf_service_port configuration parameter, 51200 by default. The URL parameters FRAGMENTER, ACCESSOR, and RESOLVER are the names of PXF plug-ins (Java classes) that convert between the external data format and HAWQ data format. The FRAGMENTER parameter is only used with readable external tables. PXF allows combinations of these parameters to be configured as profiles so that a single PROFILE parameter can be specified to access external data, for example ?PROFILE=Hive. Additional <custom_option>s can be added to the LOCATION URI to further describe the external data format or storage options. For details about the plug-ins and profiles provided with PXF, and for information about creating custom plug-ins for other data sources, see Using PXF with Unmanaged Data.
EXECUTE '<command>' ON ...
Allowed for readable web external tables or writable external tables only. For readable web external tables, specifies the OS command to be executed by the segment instances. The <command> can be a single OS command or a script. If <command> executes a script, that script must reside in the same location on all of the segment hosts and be executable by the HAWQ superuser (gpadmin).
For writable external tables, the <command> specified in the EXECUTE clause must be prepared to have data piped into it, as segments having data to send write their output to the specified program. HAWQ uses virtual elastic segments to run its queries.
The ON clause is used to specify which segment instances will execute the given command. For writable external tables, only ON <number> is supported.
Note: ON ALL/HOST is deprecated when creating a readable external table, as HAWQ cannot guarantee scheduling executors on a specific host. Instead, use ON MASTER, ON <number>, or ON SEGMENT <virtual_segment> to specify which segment instances will execute the command.
ON MASTER runs the command on the master host only.
ON <number> means the command will be executed by the specified number of virtual segments. The particular segments are chosen by the HAWQ system's Resource Manager at runtime.
ON SEGMENT <virtual_segment> means the command will be executed only once by the specified segment.
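For example, a readable web external table along these lines (the script path and table name here are illustrative) runs a status script once on the master host only:
CREATE EXTERNAL WEB TABLE master_status (result text)
EXECUTE '/var/load_scripts/get_master_status.sh' ON MASTER
FORMAT 'TEXT';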
FORMAT 'TEXT | CSV' (<options>)
Specifies the format of the external or web table data, either plain text (TEXT) or comma-separated values (CSV) format.
DELIMITER
Specifies a single ASCII character that separates columns within each row (line) of data. The default is a tab character in TEXT mode, a comma in CSV mode. In TEXT mode for readable external tables, the delimiter can be set to OFF for special use cases in which unstructured data is loaded into a single-column table.
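For example, a definition along these lines (the host, file, and table names are illustrative) reads pipe-delimited text files:
CREATE EXTERNAL TABLE ext_items (id int, descr text)
LOCATION ('gpfdist://filehost:8081/items.txt')
FORMAT 'TEXT' (DELIMITER '|');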
NULL
Specifies the string that represents a NULL value. The default is \N (backslash-N) in TEXT mode, and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish NULL values from empty strings. When using external and web tables, any data item that matches this string will be considered a NULL value.
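For example, to treat empty fields in pipe-delimited text files as NULL values (the names here are illustrative):
CREATE EXTERNAL TABLE ext_labels (id int, label text)
LOCATION ('gpfdist://filehost:8081/labels.txt')
FORMAT 'TEXT' (DELIMITER '|' NULL '');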
ESCAPE
Specifies the single character that is used for C escape sequences (such as \n, \t, \100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. The default escape character is \ (backslash) for text-formatted files and " (double quote) for csv-formatted files; however, it is possible to specify another character to represent an escape. It is also possible to disable escaping in text-formatted files by specifying the value 'OFF' as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escapes.
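For example, to disable escaping when loading text-formatted web log lines that contain literal backslashes into a single-column table (the names here are illustrative):
CREATE EXTERNAL TABLE ext_weblog (logline text)
LOCATION ('gpfdist://filehost:8081/access_log.txt')
FORMAT 'TEXT' (DELIMITER 'OFF' ESCAPE 'OFF');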
NEWLINE
Specifies the newline used in your data files: LF (line feed, 0x0A), CR (carriage return, 0x0D), or CRLF (carriage return plus line feed, 0x0D 0x0A). If not specified, a HAWQ segment will detect the newline type by looking at the first row of data it receives and using the first newline type encountered.
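For example, to load files with Windows-style line endings (the names here are illustrative):
CREATE EXTERNAL TABLE ext_win_data (id int, note text)
LOCATION ('gpfdist://filehost:8081/win_data.txt')
FORMAT 'TEXT' (DELIMITER '|' NEWLINE 'CRLF');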
HEADER
For readable external tables, specifies that the first line in the data file(s) is a header row (contains the names of the table columns) and should not be included as data for the table. If using multiple data source files, all files must have a header row.
Note: The HEADER formatting option is not allowed with PXF. For CSV files or other files that include a header line, use an error table instead of the HEADER formatting option.
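For example, to skip the header row of CSV files served by gpfdist (the names here are illustrative):
CREATE EXTERNAL TABLE ext_expenses (name text, amount float4)
LOCATION ('gpfdist://filehost:8081/expenses.csv')
FORMAT 'CSV' (HEADER);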
QUOTE
Specifies the quotation character for CSV mode. The default is double-quote (").
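For example, to read CSV files whose values are enclosed in single quotes instead of the default double quotes (the names here are illustrative):
CREATE EXTERNAL TABLE ext_products (sku text, price numeric)
LOCATION ('gpfdist://filehost:8081/products.csv')
FORMAT 'CSV' (QUOTE '''');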
FORCE NOT NULL
In CSV mode, processes each specified column as though it were quoted and hence not a NULL value. For the default null string in CSV mode (nothing between two delimiters), this causes missing values to be evaluated as zero-length strings.
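For example, to load empty dept and role fields as zero-length strings rather than NULL values (the names here are illustrative):
CREATE EXTERNAL TABLE ext_staff (name text, dept text, role text)
LOCATION ('gpfdist://filehost:8081/staff.csv')
FORMAT 'CSV' (FORCE NOT NULL dept, role);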
FORCE QUOTE
In CSV mode for writable external tables, forces quoting to be used for all non-NULL values in each specified column. NULL output is never quoted.
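For example, a writable external table that quotes all non-NULL values in the name column on output (the names here are illustrative):
CREATE WRITABLE EXTERNAL TABLE staff_out (name text, dept text)
LOCATION ('gpfdist://etl1:8081/staff.out')
FORMAT 'CSV' (FORCE QUOTE name);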
FILL MISSING FIELDS
In both TEXT and CSV mode for readable external tables, specifying FILL MISSING FIELDS will set missing trailing field values to NULL (instead of reporting an error) when a row of data has missing data fields at the end of a line or row. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still report an error.
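For example, to load rows whose trailing sponsor field is sometimes absent (the names here are illustrative):
CREATE EXTERNAL TABLE ext_short_rows (id int, name text, sponsor text)
LOCATION ('gpfdist://filehost:8081/short_rows.txt')
FORMAT 'TEXT' (DELIMITER '|' FILL MISSING FIELDS);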
ENCODING '<encoding>'
Character set encoding to use for the external table. Specify a string constant (such as 'SQL_ASCII'), an integer encoding number, or DEFAULT to use the default client encoding.
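For example, to read files encoded in Latin-1 (the names here are illustrative):
CREATE EXTERNAL TABLE ext_legacy (id int, descr text)
LOCATION ('gpfdist://filehost:8081/legacy.txt')
FORMAT 'TEXT' (DELIMITER '|')
ENCODING 'LATIN1';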
LOG ERRORS INTO <error_table>
This is an optional clause that can precede a SEGMENT REJECT LIMIT clause to log information about rows with formatting errors. It specifies an error table where rows with formatting errors will be logged when running in single row error isolation mode. You can then examine this <error_table> to see any error rows that were not loaded. If the specified <error_table> already exists, it will be used. If it does not exist, it will be automatically generated.
SEGMENT REJECT LIMIT <count> [ROWS | PERCENT]
Runs a COPY FROM operation in single row error isolation mode. If the input rows have format errors they will be discarded, provided that the reject limit is not reached on any HAWQ segment instance during the load operation. The reject limit can be specified as a number of rows (the default) or a percentage of total rows (1-100). If PERCENT is used, each segment starts calculating the bad row percentage only after the number of rows specified by the parameter gp_reject_percent_threshold has been processed. The default for gp_reject_percent_threshold is 300 rows. Constraint errors such as violation of a NOT NULL or CHECK constraint will still be handled in “all-or-nothing” input mode. If the limit is not reached, all good rows will be loaded and any error rows discarded.
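For example, to log malformed rows to an error table and abort the load only if more than 10 percent of the rows processed on any segment are bad (the names here are illustrative):
CREATE EXTERNAL TABLE ext_orders (id int, amount float4)
LOCATION ('gpfdist://filehost:8081/orders.txt')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS INTO err_orders SEGMENT REJECT LIMIT 10 PERCENT;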
DISTRIBUTED RANDOMLY
Used to declare the HAWQ distribution policy for a writable external table. By default, writable external tables are distributed randomly. If the source table you are exporting data from has a hash distribution policy, defining the same distribution key column(s) for the writable external table will improve unload performance by eliminating the need to move rows over the interconnect. When you issue an unload command such as INSERT INTO wex_table SELECT * FROM source_table, the rows that are unloaded can be sent directly from the segments to the output location if the two tables have the same hash distribution policy.
Examples
Start the gpfdist file server program in the background on port 8081, serving files from directory /var/data/staging:
gpfdist -p 8081 -d /var/data/staging -l /home/gpadmin/log &
Create a readable external table named ext_customer using the gpfdist protocol and any text-formatted files (*.txt) found in the gpfdist directory. The files are formatted with a pipe (|) as the column delimiter and an empty space as NULL. Also access the external table in single row error isolation mode:
CREATE EXTERNAL TABLE ext_customer
(id int, name text, sponsor text)
LOCATION ( 'gpfdist://filehost:8081/*.txt' )
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5;
Create the same readable external table definition as above, but with CSV formatted files:
CREATE EXTERNAL TABLE ext_customer
(id int, name text, sponsor text)
LOCATION ( 'gpfdist://filehost:8081/*.csv' )
FORMAT 'CSV' ( DELIMITER ',' );
Create a readable external table using the pxf protocol to read data in HDFS files:
CREATE EXTERNAL TABLE ext_customer
(id int, name text, sponsor text)
LOCATION ('pxf://mdw:51200/sales/customers/customers.tsv.gz'
'?Fragmenter=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
'&Accessor=org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor'
'&Resolver=org.apache.hawq.pxf.plugins.hdfs.StringPassResolver')
FORMAT 'TEXT' (DELIMITER = E'\t');
The LOCATION string in this command is equivalent to the previous example, but using a PXF Profile:
CREATE EXTERNAL TABLE ext_customer
(id int, name text, sponsor text)
LOCATION ('pxf://mdw:51200/sales/customers/customers.tsv.gz?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER = E'\t');
Create a readable web external table that executes a script on five virtual segments. (The script must reside at the same location on all segment hosts.)
CREATE EXTERNAL WEB TABLE log_output (linenum int, message text)
EXECUTE '/var/load_scripts/get_log_data.sh' ON 5
FORMAT 'TEXT' (DELIMITER '|');
Create a writable external table named sales_out that uses gpfdist to write output data to a file named sales.out. The files are formatted with a pipe (|) as the column delimiter and an empty space as NULL.
CREATE WRITABLE EXTERNAL TABLE sales_out (LIKE sales)
LOCATION ('gpfdist://etl1:8081/sales.out')
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
DISTRIBUTED BY (txn_id);
The following command sequence shows how to create a writable external web table using a specified number of elastic virtual segments to run the query:
postgres=# CREATE TABLE a (i int);
CREATE TABLE
postgres=# INSERT INTO a VALUES(1);
INSERT 0 1
postgres=# INSERT INTO a VALUES(2);
INSERT 0 1
postgres=# INSERT INTO a VALUES(10);
INSERT 0 1
postgres=# CREATE WRITABLE EXTERNAL WEB TABLE externala (output text)
postgres-# EXECUTE 'cat > /tmp/externala' ON 3
postgres-# FORMAT 'TEXT' DISTRIBUTED RANDOMLY;
CREATE EXTERNAL TABLE
postgres=# INSERT INTO externala SELECT * FROM a;
INSERT 0 3
Create a writable external web table that pipes output data received by the segments to an executable script named to_adreport_etl.sh:
CREATE WRITABLE EXTERNAL WEB TABLE campaign_out (LIKE campaign)
EXECUTE '/var/unload_scripts/to_adreport_etl.sh'
FORMAT 'TEXT' (DELIMITER '|');
Use the writable external table defined above to unload selected data:
INSERT INTO campaign_out
SELECT * FROM campaign WHERE customer_id=123;
Compatibility
CREATE EXTERNAL TABLE is a HAWQ extension. The SQL standard makes no provisions for external tables.
See Also
CREATE TABLE, CREATE TABLE AS, COPY, INSERT, SELECT INTO