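Feature selection is the process of picking the most informative features out of a feature vector to form a new, more compact feature vector. It is widely used in high-dimensional data analysis, where removing redundant and irrelevant features can improve a learner's performance.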
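Like learning algorithms in general, feature selection methods fall mainly into two groups: supervised and unsupervised. Chi-squared selection is a supervised feature selection method commonly used in statistics. It runs a chi-squared test between each feature and the true label to measure how strongly the two are associated, and uses that result to decide whether to keep the feature.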
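To make the test concrete, here is a minimal pure-Scala sketch of the Pearson chi-squared statistic, the sum of (observed - expected)^2 / expected over all cells of the feature-by-label contingency table. The names `ChiSquareSketch` and `chiSquare` are illustrative and not part of Spark. The sketch assumes every distinct feature value is treated as its own category, and it uses the same five rows as the Spark example below.

```scala
object ChiSquareSketch {
  /** Pearson chi-squared statistic and degrees of freedom for one
    * categorical feature against a categorical label. */
  def chiSquare(feature: Seq[Double], label: Seq[Int]): (Double, Int) = {
    require(feature.length == label.length, "feature and label must align")
    val n = feature.length.toDouble
    // Observed counts for every (feature value, label) pair that occurs.
    val observed: Map[(Double, Int), Int] =
      feature.zip(label).groupBy(identity).map { case (k, v) => k -> v.size }
    // Marginal totals per feature value (rows) and per label value (columns).
    val rowTotals = feature.groupBy(identity).map { case (k, v) => k -> v.size }
    val colTotals = label.groupBy(identity).map { case (k, v) => k -> v.size }
    // Sum (O - E)^2 / E over every cell, including cells with zero observations.
    val cells = for (r <- rowTotals.keys.toSeq; c <- colTotals.keys.toSeq) yield {
      val expected = rowTotals(r) * colTotals(c) / n
      val diff = observed.getOrElse((r, c), 0) - expected
      diff * diff / expected
    }
    (cells.sum, (rowTotals.size - 1) * (colTotals.size - 1))
  }

  def main(args: Array[String]): Unit = {
    val labels = Seq(1, 0, 0, 1, 0)
    // One Seq per feature column of the sample data.
    val columns = Seq(
      Seq(0.0, 0.0, 1.0, 0.0, 1.0),
      Seq(0.0, 1.0, 0.0, 1.0, 0.0),
      Seq(30.0, 20.0, 15.0, 28.0, 27.0),
      Seq(1.0, 0.0, 2.0, 0.0, 0.0)
    )
    columns.zipWithIndex.foreach { case (col, i) =>
      val (stat, dof) = chiSquare(col, labels)
      println(f"feature $i: statistic = $stat%.4f, degrees of freedom = $dof")
    }
  }
}
```

On this data the statistics come out to roughly 2.2222, 0.1389, 5.0 and 2.2222, with 1, 1, 4 and 2 degrees of freedom respectively. The statistic alone does not settle the ranking: the p-value also depends on the degrees of freedom, so a feature with many distinct values (such as the third column) needs a larger statistic to count as equally significant.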
package Spark_MLlib

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object 特征选择_卡方选择器 {
  val spark = SparkSession.builder().master("local").appName("ChiSqSelectorExample").getOrCreate()
  import spark.implicits._

  def main(args: Array[String]): Unit = {
    // Sample data: (id, feature vector, label)
    val df = spark.createDataFrame(Seq(
      (1, Vectors.dense(0, 0, 30, 1), 1),
      (2, Vectors.dense(0, 1, 20, 0), 0),
      (3, Vectors.dense(1, 0, 15, 2), 0),
      (4, Vectors.dense(0, 1, 28, 0), 1), // changing the first 0 here to 1 changes the output when selecting 2 features
      (5, Vectors.dense(1, 0, 27, 0), 0)
    )).toDF("id", "features", "label")
    df.show()

    // setNumTopFeatures(2): keep only the 2 features most strongly associated with the label
    val selector = new ChiSqSelector()
      .setNumTopFeatures(2)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selectedFeatures")

    val selector_model = selector.fit(df)
    val result = selector_model.transform(df)
    result.show(false)
  }
}
Output:
+---+------------------+-----+
| id|          features|label|
+---+------------------+-----+
|  1|[0.0,0.0,30.0,1.0]|    1|
|  2|[0.0,1.0,20.0,0.0]|    0|
|  3|[1.0,0.0,15.0,2.0]|    0|
|  4|[0.0,1.0,28.0,0.0]|    1|
|  5|[1.0,0.0,27.0,0.0]|    0|
+---+------------------+-----+

+---+------------------+-----+----------------+
|id |features          |label|selectedFeatures|
+---+------------------+-----+----------------+
|1  |[0.0,0.0,30.0,1.0]|1    |[0.0,30.0]      |
|2  |[0.0,1.0,20.0,0.0]|0    |[0.0,20.0]      |
|3  |[1.0,0.0,15.0,2.0]|0    |[1.0,15.0]      |
|4  |[0.0,1.0,28.0,0.0]|1    |[0.0,28.0]      |
|5  |[1.0,0.0,27.0,0.0]|0    |[1.0,27.0]      |
+---+------------------+-----+----------------+
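The selected columns above correspond to feature indices 0 and 2. To see why, one option is to print the per-feature test results next to the selector's choice. The sketch below is one way to do that, assuming a Spark version that ships `org.apache.spark.ml.stat.ChiSquareTest` (2.2 or later); the object name `ChiSquareInspection` and the helper `inspect` are illustrative only.

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.SparkSession

object ChiSquareInspection {
  /** Returns (indices chosen by the selector, per-feature p-values). */
  def inspect(spark: SparkSession): (Array[Int], Array[Double]) = {
    // Same sample data as the example above.
    val df = spark.createDataFrame(Seq(
      (1, Vectors.dense(0.0, 0.0, 30.0, 1.0), 1),
      (2, Vectors.dense(0.0, 1.0, 20.0, 0.0), 0),
      (3, Vectors.dense(1.0, 0.0, 15.0, 2.0), 0),
      (4, Vectors.dense(0.0, 1.0, 28.0, 0.0), 1),
      (5, Vectors.dense(1.0, 0.0, 27.0, 0.0), 0)
    )).toDF("id", "features", "label")

    // ChiSquareTest returns a single row with one entry per feature.
    val row = ChiSquareTest.test(df, "features", "label").head()
    val pValues = row.getAs[Vector]("pValues")
    val degreesOfFreedom = row.getAs[Seq[Int]]("degreesOfFreedom")
    val statistics = row.getAs[Vector]("statistics")
    println("pValues          = " + pValues)
    println("degreesOfFreedom = " + degreesOfFreedom.mkString("[", ",", "]"))
    println("statistics       = " + statistics)

    // Fit the same selector and read back which indices it kept.
    val model = new ChiSqSelector()
      .setNumTopFeatures(2)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selectedFeatures")
      .fit(df)
    println("selected indices = " + model.selectedFeatures.mkString("[", ",", "]"))

    (model.selectedFeatures, pValues.toArray)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("ChiSquareInspection").getOrCreate()
    try inspect(spark) finally spark.stop()
  }
}
```

With `setNumTopFeatures`, the features kept should be the ones with the smallest p-values, so the printed p-values for indices 0 and 2 are expected to be the two lowest of the four.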