Description
Quantile discretizer calculate the q-quantile as the interval, output the interval as model, and can transform a new data using the model. The output is the index of the interval.
Parameters
Name | Description | Type | Required? | Default Value |
---|---|---|---|---|
selectedCols | Names of the columns used for processing | String[] | ✓ | |
numBuckets | number of buckets | Integer | 2 | |
numBucketsArray | Array of num bucket | Integer[] | null | |
selectedCols | Names of the columns used for processing | String[] | ✓ | |
reservedCols | Names of the columns to be retained in the output table | String[] | null | |
outputCols | Names of the output columns | String[] | null |
Script Example
Code
import numpy as np
import pandas as pd
from pyalink.alink import *
def exampleData():
return np.array([
["a", 1, 1, 2.0, True],
["c", 1, 2, -3.0, True],
["a", 2, 2, 2.0, False],
["c", 0, 0, 0.0, False]
])
def sourceFrame():
data = exampleData()
return pd.DataFrame({
"f_string": data[:, 0],
"f_long": data[:, 1],
"f_int": data[:, 2],
"f_double": data[:, 3],
"f_boolean": data[:, 4]
})
def batchSource():
return dataframeToOperator(
sourceFrame(),
schemaStr='''
f_string string,
f_long long,
f_int int,
f_double double,
f_boolean boolean
''',
op_type='batch'
)
def streamSource():
return dataframeToOperator(
sourceFrame(),
schemaStr='''
f_string string,
f_long long,
f_int int,
f_double double,
f_boolean boolean
''',
op_type='stream'
)
(
QuantileDiscretizer()
.setSelectedCols(['f_double'])
.setNumBuckets(8)
.fit(batchSource())
.transform(batchSource())
.print()
)
(
QuantileDiscretizer()
.setSelectedCols(['f_double'])
.setNumBuckets(8)
.fit(batchSource())
.transform(streamSource())
.print()
)
StreamOperator.execute()
Result
Batch prediction
f_string f_long f_int f_double f_boolean
0 a 1 1 2 True
1 c 1 2 0 True
2 a 2 2 2 False
3 c 0 0 1 False
Stream Prediction
f_string f_long f_int f_double f_boolean
0 c 1 2 0 True
1 c 0 0 1 False
2 a 1 1 2 True
3 a 2 2 2 False