Apache Beam Google DataFlow Pipeline Engine
Beam DataFlow
Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. As a managed Google Cloud service, it provisions worker nodes and provides out-of-the-box optimization.
The Cloud Dataflow Runner and service are suitable for large-scale, continuous jobs and provide:

* A fully managed service
* Autoscaling of the number of workers throughout the lifetime of the job
* Dynamic work rebalancing
Check the Google Cloud Dataflow documentation and the Apache Beam Dataflow runner documentation for more information.
Google Dataflow Configuration
INFO: this configuration checklist was copied from the Apache Beam documentation.
To use the Google Cloud Dataflow runtime configuration, you must complete the setup in the Before you begin section of the Cloud Dataflow quickstart for your chosen language.
1. Select or create a Google Cloud Platform Console project.
2. Enable billing for your project.
3. Enable the required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, and Cloud Resource Manager. You may need to enable additional APIs (such as BigQuery, Cloud Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.
4. Authenticate with Google Cloud Platform.
5. Install the Google Cloud SDK.
6. Create a Cloud Storage bucket.
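Once these steps are complete, you can sanity-check your credentials and staging bucket from Java with the google-cloud-storage client library. This is only an illustrative check, not part of Hop, and the bucket name below is a placeholder:

```java
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class VerifyDataflowSetup {
  public static void main(String[] args) {
    // Uses Application Default Credentials (see Environment Settings below).
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // "my-hop-staging-bucket" is a placeholder for the bucket created in the last step.
    Bucket bucket = storage.get("my-hop-staging-bucket");
    System.out.println(bucket != null ? "Staging bucket is reachable" : "Bucket not found");
  }
}
```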
Options
Option | Description |
---|---|
Project ID | The project ID for your Google Cloud Project. This is required if you want to run your pipeline using the Dataflow managed service. |
Service Account | Run the Dataflow job with a specific service account, instead of the default GCE robot. The default is to leave this blank. |
Application name | The name of the Dataflow job being executed as it appears in Dataflow’s jobs list and job details. Also used when updating an existing pipeline. |
Staging location | Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://. |
Initial number of workers | The initial number of Google Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts up when your job begins. |
Maximum number of workers | The maximum number of Compute Engine instances to be made available to your pipeline during execution. Note that this can be higher than the initial number of workers (specified by numWorkers) to allow your job to scale up, automatically or otherwise. |
Auto-scaling algorithm | The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service. |
Worker machine type | The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. Check the list of machine types for reference. |
Worker disk type | The type of persistent disk to use, specified by a full URL of the disk type resource. For example, use compute.googleapis.com/projects/&lt;project&gt;/zones/&lt;zone&gt;/diskTypes/pd-ssd to specify an SSD persistent disk. |
Disk size in GB | The disk size, in gigabytes, to use on each remote Compute Engine worker instance. If set, specify at least 30 GB to account for the worker boot image and local logs. |
Region | Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for workerRegion is automatically assigned. Note: This option cannot be combined with workerZone or zone. Check the list of available regions for reference. |
Zone | Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. Note: This option cannot be combined with workerRegion or zone. |
Network | This is the GCE network for launching workers. For more information, see the reference documentation https://cloud.google.com/compute/docs/networking. The default is up to the Dataflow service. |
Subnetwork | This is the GCE subnetwork for launching workers. For more information, see the reference documentation https://cloud.google.com/compute/docs/networking. The default is up to the Dataflow service. |
Use public IPs? | Specifies whether worker pools should be started with public IP addresses. WARNING: This feature is experimental. You must be allowlisted to use it. |
Dataflow Service Options | This is a comma separated list of Dataflow service options. |
User agent | A user agent string as per RFC2616, describing the pipeline to external services. |
Temp location | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://. |
Plugins to stage (, delimited) | Comma separated list of plugins. |
Transform plugin classes | List of transform plugin classes. |
XP plugin classes | List of extension point plugins. |
Streaming Hop transforms flush interval (ms) | The amount of time after which the internal buffer is sent completely over the network and emptied. |
Hop streaming transforms buffer size | The internal buffer size to use. |
Fat jar file location | Fat jar location. Generate a fat jar using |
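Under the hood, these settings end up as Apache Beam DataflowPipelineOptions. The plain Beam sketch below is only meant to show roughly how the table maps onto Beam's options; it is not Hop code, and all values are placeholders:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowOptionsSketch {
  public static void main(String[] args) {
    // Hop builds the equivalent of these Beam options from the table above.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);

    options.setRunner(DataflowRunner.class);
    options.setProject("my-gcp-project");                  // Project ID
    options.setJobName("my-hop-pipeline");                 // Application name
    options.setStagingLocation("gs://my-bucket/staging");  // Staging location
    options.setTempLocation("gs://my-bucket/tmp");         // Temp location
    options.setRegion("europe-west1");                     // Region
    options.setNumWorkers(1);                              // Initial number of workers
    options.setMaxNumWorkers(4);                           // Maximum number of workers
    options.setWorkerMachineType("n1-standard-4");         // Worker machine type
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED); // Auto-scaling algorithm
  }
}
```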
Environment Settings
The following environment variable needs to be set locally:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/google-key.json
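The Google Cloud client libraries used by Beam resolve this key file through Application Default Credentials. As a quick, illustrative check that the variable is picked up (nothing here is Hop-specific):

```java
import java.io.FileInputStream;

import com.google.auth.oauth2.GoogleCredentials;

public class CheckCredentials {
  public static void main(String[] args) throws Exception {
    // Resolved from GOOGLE_APPLICATION_CREDENTIALS (or a gcloud login).
    GoogleCredentials adc = GoogleCredentials.getApplicationDefault();
    System.out.println("Application Default Credentials: " + adc.getClass().getSimpleName());

    // Alternatively, parse the key file the environment variable points to.
    String keyPath = System.getenv("GOOGLE_APPLICATION_CREDENTIALS");
    try (FileInputStream in = new FileInputStream(keyPath)) {
      GoogleCredentials fromFile = GoogleCredentials.fromStream(in);
      System.out.println("Key file parsed as: " + fromFile.getClass().getSimpleName());
    }
  }
}
```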
Security considerations
To allow encrypted (TLS) network connections to services such as Kafka and Neo4j Aura, certain older security algorithms are disabled on Dataflow. This is done by setting the security property jdk.tls.disabledAlgorithms to the value: SSLv3, RC4, DES, MD5withRSA, DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL.
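For reference, the effect amounts to the following standard Java call, executed before any TLS connection is opened (a sketch of the behavior, not the actual Hop source):

```java
import java.security.Security;

public class TlsAlgorithmSettings {
  public static void main(String[] args) {
    // Mirrors the hardcoded property value described above.
    Security.setProperty("jdk.tls.disabledAlgorithms",
        "SSLv3, RC4, DES, MD5withRSA, DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL");

    System.out.println(Security.getProperty("jdk.tls.disabledAlgorithms"));
  }
}
```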
If you need this to be configurable, please create a JIRA case and we'll look for a way to avoid hardcoding it.