SQLTransformer

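  SQLTransformer implements transformations that are defined by a SQL statement. Currently, the only supported SQL syntax is of the form SELECT ... FROM __THIS__ ...,
where __THIS__ stands for the underlying table of the input dataset. For example, SQLTransformer supports statements such as: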

  • SELECT a, a + b AS a_b FROM __THIS__
  • SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ WHERE a > 5
  • SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b
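
To make the role of __THIS__ concrete, the sketch below runs the first statement by hand: it registers the input DataFrame as a temporary view and substitutes the view name for the placeholder. This only illustrates the idea and is not claimed to be the library's exact implementation. The helper runOnThis, the view name tmp_this_view, the sample data, and the local SparkSession are all assumptions made for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("this-placeholder-sketch")
  .getOrCreate()

// Hypothetical helper (not part of Spark's API): run a statement containing
// __THIS__ against `df` by registering a temporary view and substituting its name.
def runOnThis(df: DataFrame, statement: String, viewName: String = "tmp_this_view"): DataFrame = {
  df.createOrReplaceTempView(viewName)
  val resolved = statement.replace("__THIS__", viewName)
  df.sparkSession.sql(resolved)
}

// Hypothetical sample data with integer columns a and b.
val input = spark.createDataFrame(Seq((1, 2), (3, 4))).toDF("a", "b")
val out = runOnThis(input, "SELECT a, a + b AS a_b FROM __THIS__")
out.show()
```

The result keeps column a and adds a_b, the row-wise sum of a and b.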

Example

  Suppose we have the following DataFrame with columns id, v1, and v2:

   id |  v1 |  v2
  ----|-----|-----
   0  | 1.0 | 3.0
   2  | 2.0 | 5.0

  The statement SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__ produces the following output:

   id |  v1 |  v2 |  v3 |  v4
  ----|-----|-----|-----|-----
   0  | 1.0 | 3.0 | 4.0 | 3.0
   2  | 2.0 | 5.0 | 7.0 |10.0

  The following code shows how to call it from a program.

  import org.apache.spark.ml.feature.SQLTransformer

  // `spark` is an existing SparkSession (predefined in spark-shell).
  val df = spark.createDataFrame(
    Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

  // Apply the statement to df and print the resulting DataFrame.
  val sqlTrans = new SQLTransformer().setStatement(
    "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
  sqlTrans.transform(df).show()
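
The same call pattern should carry over to the other supported statement shapes. As a minimal sketch, the block below applies a statement with a function call and a WHERE clause to the same example data. The statement itself, the column name v2_sqrt, and the local SparkSession are assumptions chosen for illustration.

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sqltransformer-where-sketch")
  .getOrCreate()

// Same example data as above: columns id, v1, v2.
val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

// Keep only rows with id > 0 and project id plus the square root of v2.
val filterTrans = new SQLTransformer().setStatement(
  "SELECT id, SQRT(v2) AS v2_sqrt FROM __THIS__ WHERE id > 0")

val filtered = filterTrans.transform(df)
filtered.show()
```

With this data, only the row with id = 2 satisfies the filter, so the result has a single row with columns id and v2_sqrt.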