Description

Random forest uses bagging (training each tree on a bootstrap sample of the training data) to reduce overfitting.
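
As an illustration of the bagging idea, here is a minimal pure-Python sketch. It is not Alink's implementation: the helper names `bootstrap_sample` and `majority_vote` are hypothetical, and the `ratio` argument only loosely mirrors the idea of a subsampling ratio.

```python
import random
from collections import Counter


def bootstrap_sample(rows, ratio=1.0, rng=None):
    """Draw int(len(rows) * ratio) rows with replacement (at least one row)."""
    rng = rng or random.Random()
    n = max(1, int(len(rows) * ratio))
    return [rng.choice(rows) for _ in range(n)]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]
```

Each tree would be trained on its own `bootstrap_sample(...)`, and the forest's prediction for a row would be the `majority_vote` over the per-tree predictions.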

This operator implements three types of decision tree to increase the diversity of the forest:

    id3, cart, c4.5

and the split criteria are:

    information gain, gini, information gain ratio, mse
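
The criteria named above can be sketched in plain Python as follows. This is an illustrative sketch, not Alink's code: it takes the information criterion to be entropy-based information gain and the ratio to be the C4.5-style gain ratio, and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def mse(values):
    """Mean squared error of numeric targets around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def info_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)


def info_gain_ratio(parent, children):
    """Information gain divided by the entropy of the split proportions."""
    n = len(parent)
    split_info = -sum(
        len(ch) / n * math.log2(len(ch) / n) for ch in children if ch
    )
    return info_gain(parent, children) / split_info if split_info > 0 else 0.0
```

For the labels `[0, 0, 1, 1]` split into `[0, 0]` and `[1, 1]`, the parent has entropy 1.0 and Gini impurity 0.5, and the split removes all impurity.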

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| featureSubsamplingRatio | Ratio of the features used in each tree, in range (0, 1]. | Double | | 0.2 |
| numSubsetFeatures | Number of features to consider for splits at each tree node. | Integer | | 2147483647 |
| numTrees | Number of decision trees. | Integer | | 10 |
| subsamplingRatio | Ratio of the training samples used for learning each decision tree. | Double | | 100000.0 |
| treeType | Tree type. | String | | "avg" |
| predictionCol | Name of the prediction column. | String | | |
| predictionDetailCol | Name of the prediction detail column, which holds detailed prediction information. | String | | |
| reservedCols | Names of the columns to be retained in the output table. | String[] | | null |
| maxDepth | Maximum depth of the tree. | Integer | | 2147483647 |
| minSamplesPerLeaf | Minimum number of samples in one leaf. | Integer | | 2 |
| createTreeMode | Tree creation mode: series or parallel. | String | | "series" |
| maxBins | Maximum number of bins for continuous features. | Integer | | 128 |
| maxMemoryInMB | Maximum memory usage, in MB, for tree histogram aggregation. | Integer | | 64 |
| featureCols | Names of the feature columns used for training in the input table. | String[] | | |
| labelCol | Name of the label column in the input table. | String | | |
| categoricalCols | Names of the categorical columns used for training in the input table. | String[] | | |
| weightCol | Name of the column indicating weight. | String | | null |
| maxLeaves | Maximum number of leaves per tree. | Integer | | 2147483647 |
| minSampleRatioPerChild | Minimum value of (number of samples in child) / (number of samples in its parent). | Double | | 0.0 |
| minInfoGain | Minimum information gain required to perform a split. | Double | | 0.0 |
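
To make the split-related parameters concrete, the sketch below shows one plausible reading of how minSamplesPerLeaf, minSampleRatioPerChild and minInfoGain could jointly gate a candidate split. The function `split_allowed` is hypothetical, and the exact comparisons Alink uses internally (for example strict versus non-strict inequalities) are assumptions here.

```python
def split_allowed(parent_count, child_counts, gain,
                  min_samples_per_leaf=2,
                  min_sample_ratio_per_child=0.0,
                  min_info_gain=0.0):
    """Return True if a candidate split passes all three thresholds.

    Assumed semantics (not confirmed by the Alink documentation):
    - the gain must be at least min_info_gain;
    - every child must hold at least min_samples_per_leaf samples;
    - every child must hold at least min_sample_ratio_per_child of the
      parent's samples.
    """
    if gain < min_info_gain:
        return False
    for count in child_counts:
        if count < min_samples_per_leaf:
            return False
        if count / parent_count < min_sample_ratio_per_child:
            return False
    return True
```

With the default values from the table, only the minSamplesPerLeaf check can reject a split, because both ratio and gain thresholds are 0.0.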

Script Example

Code

    import numpy as np
    import pandas as pd
    from pyalink.alink import *


    def exampleData():
        # Four rows: three numeric features, one string feature, and a label.
        # np.array stores mixed-type rows as strings; the schemaStr used
        # below declares the intended column types.
        return np.array([
            [1.0, "A", 0, 0, 0],
            [2.0, "B", 1, 1, 0],
            [3.0, "C", 2, 2, 1],
            [4.0, "D", 3, 3, 1]
        ])


    def sourceFrame():
        # Wrap the example rows in a DataFrame with named columns.
        data = exampleData()
        return pd.DataFrame({
            "f0": data[:, 0],
            "f1": data[:, 1],
            "f2": data[:, 2],
            "f3": data[:, 3],
            "label": data[:, 4]
        })


    def batchSource():
        # Batch operator over the example data.
        return dataframeToOperator(
            sourceFrame(),
            schemaStr='''
            f0 double,
            f1 string,
            f2 int,
            f3 int,
            label int
            ''',
            op_type='batch'
        )


    def streamSource():
        # Stream operator over the same example data.
        return dataframeToOperator(
            sourceFrame(),
            schemaStr='''
            f0 double,
            f1 string,
            f2 int,
            f3 int,
            label int
            ''',
            op_type='stream'
        )


    # Train on the batch source and predict on the batch source.
    (
        RandomForestClassifier()
        .setPredictionDetailCol('pred_detail')
        .setPredictionCol('pred')
        .setLabelCol('label')
        .setFeatureCols(['f0', 'f1', 'f2', 'f3'])
        .fit(batchSource())
        .transform(batchSource())
        .print()
    )

    # Train on the batch source and predict on the stream source.
    (
        RandomForestClassifier()
        .setPredictionDetailCol('pred_detail')
        .setPredictionCol('pred')
        .setLabelCol('label')
        .setFeatureCols(['f0', 'f1', 'f2', 'f3'])
        .fit(batchSource())
        .transform(streamSource())
        .print()
    )

    # Start the stream job so the stream prediction is actually executed.
    StreamOperator.execute()

Result

Batch prediction

        f0 f1  f2  f3  label  pred        pred_detail
    0  1.0  A   0   0      0     0  {"0":1.0,"1":0.0}
    1  2.0  B   1   1      0     0  {"0":1.0,"1":0.0}
    2  3.0  C   2   2      1     1  {"0":0.0,"1":1.0}
    3  4.0  D   3   3      1     1  {"0":0.0,"1":1.0}

Stream prediction

        f0 f1  f2  f3  label  pred        pred_detail
    0  1.0  A   0   0      0     0  {"0":1.0,"1":0.0}
    1  3.0  C   2   2      1     1  {"0":0.0,"1":1.0}
    2  2.0  B   1   1      0     0  {"0":1.0,"1":0.0}
    3  4.0  D   3   3      1     1  {"0":0.0,"1":1.0}
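
The pred_detail values shown above are JSON objects mapping each label to a score. A small sketch of reading one back in plain Python follows. The helper name `top_label` is hypothetical, and it assumes the column has already been collected as strings in the format shown in the results.

```python
import json


def top_label(pred_detail):
    """Parse a pred_detail JSON string and return (label, score) of the
    highest-scoring label. Labels come back as strings, as in the JSON."""
    probs = json.loads(pred_detail)
    label = max(probs, key=probs.get)
    return label, probs[label]
```

For example, `top_label('{"0":1.0,"1":0.0}')` picks the label `"0"`, which matches the pred column of the first row.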