Feature engineering - QuantileDiscretizer - 《Alink v1.0.1 Document》

Description

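The quantile discretizer computes the q-quantiles of each selected column and uses them as the boundaries of the intervals (buckets). The boundaries are output as a model, which can then be used to transform new data. The transformed value of each selected column is the index of the interval that the original value falls into.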
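The idea can be illustrated outside Alink with a minimal NumPy sketch. The helper names `fit_quantile_cuts` and `transform_with_cuts` are hypothetical, and the rules used here (take the cut points from the sorted data without interpolation, drop duplicate cut points, treat each interval as closed on the right) are assumptions made for illustration, not a description of Alink's internal implementation.

```python
import numpy as np


def fit_quantile_cuts(values, num_buckets):
    """Return the distinct interior quantile cut points of `values`.

    Assumption: each quantile is taken from the sorted data itself
    (no interpolation), and duplicate cut points are merged.
    """
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    n = len(sorted_vals)
    # Interior quantile levels 1/k, 2/k, ..., (k-1)/k.
    levels = np.arange(1, num_buckets) / num_buckets
    # Position of each level in the sorted data, rounded down.
    positions = np.floor(levels * (n - 1)).astype(int)
    return np.unique(sorted_vals[positions])


def transform_with_cuts(values, cuts):
    """Map each value to the index of the interval it falls into.

    Interval i covers (cuts[i-1], cuts[i]]; values above the last
    cut point get index len(cuts).
    """
    return np.searchsorted(cuts, np.asarray(values, dtype=float), side="left")


f_double = [2.0, -3.0, 2.0, 0.0]
cuts = fit_quantile_cuts(f_double, num_buckets=8)
print(cuts)                                  # [-3.  0.  2.]
print(transform_with_cuts(f_double, cuts))   # [2 0 2 1]
```

With only four rows, most of the eight requested quantiles coincide, so three distinct cut points remain. Under these assumptions the sketch yields the indices 2, 0, 2, 1 for the f_double column used in the script example; that agreement should not be read as confirmation of how Alink computes its quantiles.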

Parameters

Name	Description	Type	Required?	Default Value
selectedCols	Names of the columns used for processing	String[]	✓
numBuckets	Number of buckets	Integer		2
numBucketsArray	Array of bucket counts	Integer[]		null
reservedCols	Names of the columns to be retained in the output table	String[]		null
outputCols	Names of the output columns	String[]		null

Script Example

Code

import numpy as np
import pandas as pd
from pyalink.alink import *

# Column schema shared by the batch and stream sources.
SCHEMA_STR = '''
    f_string string,
    f_long long,
    f_int int,
    f_double double,
    f_boolean boolean
    '''


def exampleData():
    # NumPy stores this mixed-type array as strings; SCHEMA_STR declares
    # the intended type of each column.
    return np.array([
        ["a", 1, 1, 2.0, True],
        ["c", 1, 2, -3.0, True],
        ["a", 2, 2, 2.0, False],
        ["c", 0, 0, 0.0, False]
    ])


def sourceFrame():
    # One DataFrame column per field of the example data.
    data = exampleData()
    return pd.DataFrame({
        "f_string": data[:, 0],
        "f_long": data[:, 1],
        "f_int": data[:, 2],
        "f_double": data[:, 3],
        "f_boolean": data[:, 4]
    })


def batchSource():
    return dataframeToOperator(sourceFrame(), schemaStr=SCHEMA_STR, op_type='batch')


def streamSource():
    return dataframeToOperator(sourceFrame(), schemaStr=SCHEMA_STR, op_type='stream')


# Batch prediction: fit on the batch source, then transform the same data.
(
    QuantileDiscretizer()
    .setSelectedCols(['f_double'])
    .setNumBuckets(8)
    .fit(batchSource())
    .transform(batchSource())
    .print()
)

# Stream prediction: fit on the batch source, then transform the stream source.
(
    QuantileDiscretizer()
    .setSelectedCols(['f_double'])
    .setNumBuckets(8)
    .fit(batchSource())
    .transform(streamSource())
    .print()
)

# Start the stream job so that the stream output is printed.
StreamOperator.execute()

Result

Batch prediction

  f_string  f_long  f_int  f_double  f_boolean
0        a       1      1         2       True
1        c       1      2         0       True
2        a       2      2         2      False
3        c       0      0         1      False

Stream prediction

  f_string  f_long  f_int  f_double  f_boolean
0        c       1      2         0       True
1        c       0      0         1      False
2        a       1      1         2       True
3        a       2      2         2      False