Configuration
Depending on the requirements of a Python Table API program, it might be necessary to adjust certain parameters for optimization. All the config options available for Java/Scala Table API program could also be used in the Python Table API program. You could refer to the Table API Configuration for more details on all the available config options for Java/Scala Table API programs. It has also provided examples on how to set the config options in a Table API program.
Python Options
Key | Default | Type | Description |
---|---|---|---|
python.archives | (none) | String | Add python archive files for job. The archive files will be extracted to the working directory of python UDF worker. Currently only zip-format is supported. For each archive file, a target directory is specified. If the target directory name is specified, the archive file will be extracted to a name can directory with the specified name. Otherwise, the archive file will be extracted to a directory with the same name of the archive file. The files uploaded via this option are accessible via relative path. ‘#’ could be used as the separator of the archive file path and the target directory name. Comma (‘,’) could be used as the separator to specify multiple archive files. This option can be used to upload the virtual environment, the data files used in Python UDF. The data files could be accessed in Python UDF, e.g.: f = open(‘data/data.txt’, ‘r’). The option is equivalent to the command line option “-pyarch”. |
python.client.executable | “python” | String | The python interpreter used to launch the python process when compiling the jobs containing Python UDFs. Equivalent to the environment variable PYFLINK_EXECUTABLE. The priority is as following: 1. the configuration ‘python.client.executable’ defined in the source code; 2. the environment variable PYFLINK_EXECUTABLE; 3. the configuration ‘python.client.executable’ defined in flink-conf.yaml |
python.executable | “python” | String | Specify the path of the python interpreter used to execute the python UDF worker. The python UDF worker depends on Python 3.5+, Apache Beam (version == 2.19.0), Pip (version >= 7.1.0) and SetupTools (version >= 37.0.0). Please ensure that the specified environment meets the above requirements. The option is equivalent to the command line option “-pyexec”. |
python.files | (none) | String | Attach custom python files for job. These files will be added to the PYTHONPATH of both the local client and the remote python UDF worker. The standard python resource file suffixes such as .py/.egg/.zip or directory are all supported. Comma (‘,’) could be used as the separator to specify multiple files. The option is equivalent to the command line option “-pyfs”. |
python.fn-execution.arrow.batch.size | 10000 | Integer | The maximum number of elements to include in an arrow batch for Python user-defined function execution. The arrow batch size should not exceed the bundle size. Otherwise, the bundle size will be used as the arrow batch size. |
python.fn-execution.buffer.memory.size | “15mb” | String | The amount of memory to be allocated by the input buffer and output buffer of a Python worker. The memory will be accounted as managed memory if the actual memory allocated to an operator is no less than the total memory of a Python worker. Otherwise, this configuration takes no effect. |
python.fn-execution.bundle.size | 100000 | Integer | The maximum number of elements to include in a bundle for Python user-defined function execution. The elements are processed asynchronously. One bundle of elements are processed before processing the next bundle of elements. A larger value can improve the throughput, but at the cost of more memory usage and higher latency. |
python.fn-execution.bundle.time | 1000 | Long | Sets the waiting timeout(in milliseconds) before processing a bundle for Python user-defined function execution. The timeout defines how long the elements of a bundle will be buffered before being processed. Lower timeouts lead to lower tail latencies, but may affect throughput. |
python.fn-execution.framework.memory.size | “64mb” | String | The amount of memory to be allocated by the Python framework. The sum of the value of this configuration and “python.fn-execution.buffer.memory.size” represents the total memory of a Python worker. The memory will be accounted as managed memory if the actual memory allocated to an operator is no less than the total memory of a Python worker. Otherwise, this configuration takes no effect. |
python.fn-execution.memory.managed | false | Boolean | If set, the Python worker will configure itself to use the managed memory budget of the task slot. Otherwise, it will use the Off-Heap Memory of the task slot. In this case, users should set the Task Off-Heap Memory using the configuration key taskmanager.memory.task.off-heap.size. For each Python worker, the required Task Off-Heap Memory is the sum of the value of python.fn-execution.framework.memory.size and python.fn-execution.buffer.memory.size. |
python.metric.enabled | true | Boolean | When it is false, metric for Python will be disabled. You can disable the metric to achieve better performance at some circumstance. |
python.requirements | (none) | String | Specify a requirements.txt file which defines the third-party dependencies. These dependencies will be installed and added to the PYTHONPATH of the python UDF worker. A directory which contains the installation packages of these dependencies could be specified optionally. Use ‘#’ as the separator if the optional parameter exists. The option is equivalent to the command line option “-pyreq”. |