Description

Maps a continuous variable into buckets. The operator supports both single-column and multi-column input. For a single column, set selectedColName, outputColName and splits. For multiple columns, set selectedColNames, outputColNames and splitsArray; all three must have the same length, and each column uses its corresponding splits.
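
The mapping described above can be pictured with a small plain-Python sketch. It is an illustration only, not Alink's implementation: bucket_index and bucketize_columns are hypothetical helper names, and the left_open flag assumes that a left-open bucket means an interval of the form (a, b].

```python
from bisect import bisect_left, bisect_right


def bucket_index(value, cuts, left_open=True):
    """Return the index of the bucket that `value` falls into.

    `cuts` must be sorted ascending; n cut points define n + 1 buckets.
    """
    if left_open:
        # Buckets are (a, b]: a value equal to a cut point stays in the lower bucket.
        return bisect_left(cuts, value)
    # Buckets are [a, b): a value equal to a cut point moves to the upper bucket.
    return bisect_right(cuts, value)


def bucketize_columns(columns, cuts_by_column, left_open=True):
    """Bucketize several columns, each with its own cut points."""
    if columns.keys() != cuts_by_column.keys():
        raise ValueError("every selected column needs its own cut points")
    return {
        name: [bucket_index(v, cuts_by_column[name], left_open) for v in values]
        for name, values in columns.items()
    }


# Single column with one cut point at 2.0.
single = [bucket_index(v, [2.0]) for v in [1.1, 1.1, 1.1, 2.2]]
print(single)  # [0, 0, 0, 1]

# Two columns, each bucketized with its own cut points.
multi = bucketize_columns(
    {"a": [1.1, 2.0, 2.2], "b": [5.0, 15.0, 25.0]},
    {"a": [2.0], "b": [10.0, 20.0]},
)
print(multi)  # {'a': [0, 0, 1], 'b': [0, 1, 2]}
```

Binary search keeps each lookup logarithmic in the number of cut points, and the key check mirrors the requirement that every selected column has its own splits.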

Parameters

Name | Description | Type | Required? | Default Value
handleInvalid | Strategy for handling unseen tokens during prediction: one of "keep", "skip" or "error" | String | | "keep"
encode | Encoding method: one of "INDEX", "VECTOR" or "ASSEMBLED_VECTOR" | String | | "INDEX"
dropLast | Whether to drop the last category | Boolean | | true
leftOpen | Whether each bucket interval is open on the left | Boolean | | true
cutsArray | Array of split points; each entry is used for the corresponding selected column | double[][] | |
selectedCols | Names of the columns to process | String[] | |
outputCols | Names of the output columns | String[] | | null
reservedCols | Names of the columns to retain in the output table | String[] | | null
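
The relationship between a bucket index and a vector encoding can be sketched in plain Python. This follows the common one-hot convention and is an assumption about what "INDEX", "VECTOR" and dropLast mean here; Alink's actual vector output may use a different layout (for example a sparse vector), and "ASSEMBLED_VECTOR" is not covered. one_hot is a hypothetical helper.

```python
def one_hot(index, num_buckets, drop_last=True):
    """One-hot encode a bucket index.

    With drop_last=True the vector has num_buckets - 1 slots and the
    last bucket is represented by an all-zero vector.
    """
    size = num_buckets - 1 if drop_last else num_buckets
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


num_buckets = 3  # two cut points give three buckets
for idx in range(num_buckets):
    # Columns: bucket index, vector with drop_last, vector without drop_last.
    print(idx, one_hot(idx, num_buckets), one_hot(idx, num_buckets, drop_last=False))
```

Under this convention, dropping the last slot loses no information, because the last bucket is implied when every other slot is zero.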

Script Example

Code

    from pyalink.alink import *
    import numpy as np
    import pandas as pd

    useLocalEnv(1)  # start a local Alink environment

    # Sample data: one continuous column plus three unrelated columns.
    data = np.array([
        [1.1, True, "2", "A"],
        [1.1, False, "2", "B"],
        [1.1, True, "1", "B"],
        [2.2, True, "1", "A"]
    ])
    df = pd.DataFrame({"double": data[:, 0], "bool": data[:, 1], "number": data[:, 2], "str": data[:, 3]})
    schema = 'double double, bool boolean, number int, str string'
    inOp1 = BatchOperator.fromDataframe(df, schemaStr=schema)
    inOp2 = StreamOperator.fromDataframe(df, schemaStr=schema)

    # Batch: bucketize the "double" column with a single split point at 2.
    bucketizer = BucketizerBatchOp().setSelectedCols(["double"]).setSplitsArray(["-Infinity:2:Infinity"])
    bucketizer.linkFrom(inOp1).print()

    # Stream: same configuration; the job runs only when execute() is called.
    bucketizer = BucketizerStreamOp().setSelectedCols(["double"]).setSplitsArray(["-Infinity:2:Infinity"])
    bucketizer.linkFrom(inOp2).print()
    StreamOperator.execute()

Results

Output Data
    rowID  double  bool   number  str
    0      0       True   2       A
    1      0       False  2       B
    2      0       True   1       B
    3      1       True   1       A