卡方选择器

卡方选择器

特征选择试图识别相关的特征用于模型构建。它改变特征空间的大小，它可以提高速度以及统计学习行为。ChiSqSelector实现卡方特征选择，它操作于带有类别特征的标注数据。
ChiSqSelector根据独立的卡方测试对特征进行排序，然后选择排序最高的特征。下面是一个使用的例子。

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.feature.ChiSqSelector
// 加载数据
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// 卡方分布需要类别特征，所以对特征除一个整数。虽然特征是double类型，
//但是ChiSqSelector将每个唯一的值当做一个类别
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor } ) )
}
// Create ChiSqSelector that will select top 50 of 692 features
val selector = new ChiSqSelector(50)
// Create ChiSqSelector model (selecting features)
val transformer = selector.fit(discretizedData)
// Filter the top 50 features from each feature vector
val filteredData = discretizedData.map { lp => 
  LabeledPoint(lp.label, transformer.transform(lp.features)) 
}

下面看看选择特征的实现，入口函数是fit。

 def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
    //计算数据卡方值
    val indices = Statistics.chiSqTest(data)
      .zipWithIndex.sortBy { case (res, _) => -res.statistic }
      .take(numTopFeatures)
      .map { case (_, indices) => indices }
      .sorted
    new ChiSqSelectorModel(indices)
  }

这里通过Statistics.chiSqTest计算卡方检测的值。下面需要了解卡方检测的理论基础。

1 卡方检测

1.1 什么是卡方检测

卡方检验是一种用途很广的计数资料的假设检验方法。它属于非参数检验的范畴，主要是比较两个及两个以上样本率( 构成比）以及两个分类变量的关联性分析。
其根本思想就是在于比较理论频数和实际频数的吻合程度或拟合优度问题。

1.2 卡方检测的基本思想

卡方检验是以${X}^{2}$分布为基础的一种常用假设检验方法，它的无效假设H0是：观察频数与期望频数没有差别。

该检验的基本思想是：首先假设H0成立，基于此前提计算出${X}^{2}$值，它表示观察值与理论值之间的偏离程度。根据${X}^{2}$分布及自由度可以确定在H0假设成立的情况下获得当前统计量及更极端情况的概率P。
如果P值很小，说明观察值与理论值偏离程度太大，应当拒绝无效假设，表示比较资料之间有显著差异；否则就不能拒绝无效假设，尚不能认为样本所代表的实际情况和理论假设有差别。

1.3 卡方值的计算与意义

卡方值表示观察值与理论值之问的偏离程度。计算这种偏离程度的基本思路如下。

设A代表某个类别的观察频数，E代表基于H0计算出的期望频数，A与E之差称为残差。
残差可以表示某一个类别观察值和理论值的偏离程度，但如果将残差简单相加以表示各类别观察频数与期望频数的差别，则有一定的不足之处。
因为残差有正有负，相加后会彼此抵消，总和仍然为0，为此可以将残差平方后求和。
另一方面，残差大小是一个相对的概念，相对于期望频数为10时，期望频数为20的残差非常大，但相对于期望频数为1000时20的残差就很小了。
考虑到这一点，人们又将残差平方除以期望频数再求和，以估计观察频数与期望频数的差别。

进行上述操作之后，就得到了常用的${X}^{2}$统计量。其计算公式是：

当n比较大时，卡方统计量近似服从k-1(计算E_i时用到的参数个数)个自由度的卡方分布。由卡方的计算公式可知，当观察频数与期望频数完全一致时，卡方值为0；观察频数与期望频数越接近，两者之间的差异越小，卡方值越小；
反之，观察频数与期望频数差别越大，两者之间的差异越大，卡方值越大。

2 卡方检测的源码实现

在MLlib中，使用chiSquaredFeatures方法实现卡方检测。它为每个特征进行皮尔森独立检测。下面看它的代码实现。

def chiSquaredFeatures(data: RDD[LabeledPoint],
      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    val maxCategories = 10000
    val numCols = data.first().features.size
    val results = new Array[ChiSqTestResult](numCols)
    var labels: Map[Double, Int] = null
    // 某个时刻至少1000列
    val batchSize = 1000
    var batch = 0
    while (batch * batchSize < numCols) {
      val startCol = batch * batchSize
      val endCol = startCol + math.min(batchSize, numCols - startCol)
      val pairCounts = data.mapPartitions { iter =>
        val distinctLabels = mutable.HashSet.empty[Double]
        val allDistinctFeatures: Map[Int, mutable.HashSet[Double]] =
          Map((startCol until endCol).map(col => (col, mutable.HashSet.empty[Double])): _*)
        var i = 1
        iter.flatMap { case LabeledPoint(label, features) =>
          if (i % 1000 == 0) {
            if (distinctLabels.size > maxCategories) {
              throw new SparkException
            }
            allDistinctFeatures.foreach { case (col, distinctFeatures) =>
              if (distinctFeatures.size > maxCategories) {
                throw new SparkException
              }
            }
          }
          i += 1
          distinctLabels += label
          features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
            allDistinctFeatures(col) += feature
            (col, feature, label)
          }
        }
      }.countByValue()
      if (labels == null) {
        // Do this only once for the first column since labels are invariant across features.
        labels =
          pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct.zipWithIndex.toMap
      }
      val numLabels = labels.size
      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
        val features = keys.map(_._2).toArray.distinct.zipWithIndex.toMap
        val numRows = features.size
        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
        keys.foreach { case (_, feature, label) =>
          val i = features(feature)
          val j = labels(label)
          //带有标签的特征的出现次数
          contingency(i, j) += pairCounts((col, feature, label))
        }
        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
      }
      batch += 1
    }
    results
  }

上述代码主要对数据进行处理，获取带有标签的特征的出现次数，并用这个次数计算卡方值。真正获取卡方值的函数是chiSquaredMatrix。

 def chiSquaredMatrix(counts: Matrix, methodName: String = PEARSON.name): ChiSqTestResult = {
    val method = methodFromString(methodName)
    val numRows = counts.numRows
    val numCols = counts.numCols
    // get row and column sums
    val colSums = new Array[Double](numCols)
    val rowSums = new Array[Double](numRows)
    val colMajorArr = counts.toArray
    val colMajorArrLen = colMajorArr.length
    var i = 0
    while (i < colMajorArrLen) {
      val elem = colMajorArr(i)
      if (elem < 0.0) {
        throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
      }
      //每列的总数
      colSums(i / numRows) += elem
      //每行的总数
      rowSums(i % numRows) += elem
      i += 1
    }
    //所有元素的总和
    val total = colSums.sum
    // second pass to collect statistic
    var statistic = 0.0
    var j = 0
    while (j < colMajorArrLen) {
      val col = j / numRows
      val colSum = colSums(col)
      if (colSum == 0.0) {
        throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
          + s"0 sum in column [$col].")
      }
      val row = j % numRows
      val rowSum = rowSums(row)
      if (rowSum == 0.0) {
        throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
          + s"0 sum in row [$row].")
      }
      //期望值
      val expected = colSum * rowSum / total
      //PEARSON
      statistic += method.chiSqFunc(colMajorArr(j), expected)
      j += 1
    }
    //自由度
    val df = (numCols - 1) * (numRows - 1)
    if (df == 0) {
      // 1 column or 1 row. Constant distribution is independent of anything.
      // pValue = 1.0 and statistic = 0.0 in this case.
      new ChiSqTestResult(1.0, 0, 0.0, methodName, NullHypothesis.independence.toString)
    } else {
      //计算累积概率
      val pValue = 1.0 - new ChiSquaredDistribution(df).cumulativeProbability(statistic)
      new ChiSqTestResult(pValue, df, statistic, methodName, NullHypothesis.independence.toString)
    }
  }
  //上述代码中的method.chiSqFunc(colMajorArr(j), expected)，调用下面的代码
  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
      val dev = observed - expected
      dev * dev / expected
    })

上述代码的实现和参考文献【2】中Test of independence的描述一致。

参考文献

【1】卡方检验

【2】Pearson’s chi-squared test