Machine Learning Feature Engineering: From Raw Data to Model Input
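1. Introduction

"Data and features determine the upper bound of machine learning." Good feature engineering can let a simple model outperform a complex one.

Feature engineering workflow:

Raw data → Data cleaning → Feature extraction → Feature transformation → Feature selection → Model input

2. Numerical Feature Processing

2.1 Standardization and Normalization

    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    # Standardization: zero mean, unit variance; suits roughly normal features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Normalization: rescale to [0, 1]
    scaler = MinMaxScaler()
    X_normalized = scaler.fit_transform(X)

    # Robust scaling: uses the median and interquartile range; suits data with outliers
    scaler = RobustScaler()
    X_robust = scaler.fit_transform(X)

2.2 Log Transform

    import numpy as np

    # For right-skewed distributions (e.g. income, house prices)
    X_log = np.log1p(X)  # log(1 + x), avoids log(0)

    # Box-Cox transform: expects a strictly positive 1-D array,
    # so X here is a single non-negative column shifted by 1
    from scipy.stats import boxcox
    X_boxcox, lambda_opt = boxcox(X + 1)

    # Yeo-Johnson transform: also handles negative values
    from sklearn.preprocessing import PowerTransformer
    pt = PowerTransformer(method="yeo-johnson")
    X_yeojohnson = pt.fit_transform(X)

3. Categorical Feature Encoding

3.1 Common Encodings

    from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

    # Ordinal encoding for ordered categories. The order is passed explicitly:
    # LabelEncoder sorts labels lexicographically and is meant for targets,
    # so it would not preserve low < medium < high.
    oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
    encoded = oe.fit_transform([["low"], ["medium"], ["high"], ["medium"]])
    # [[0.], [1.], [2.], [1.]]

    # One-hot encoding for unordered categories
    # (sparse_output needs scikit-learn >= 1.2; older versions use sparse=False)
    ohe = OneHotEncoder(sparse_output=False)
    encoded = ohe.fit_transform([["red"], ["blue"], ["red"], ["green"]])
    # Columns follow sorted category order (blue, green, red):
    # [[0,0,1], [1,0,0], [0,0,1], [0,1,0]]

    # Target encoding for high-cardinality categories
    from category_encoders import TargetEncoder
    te = TargetEncoder()
    encoded = te.fit_transform(X_categorical, y)

3.2 Handling High-Cardinality Categories

    # Frequency encoding
    freq = X["city"].value_counts(normalize=True)
    X["city_freq"] = X["city"].map(freq)

    # Target encoding with smoothing to limit overfitting.
    # Smoothing shrinks rare categories toward the global mean; it is not
    # cross-validation by itself, so fit the encoder on training data only.
    from category_encoders import TargetEncoder
    te = TargetEncoder(smoothing=10)
    X["city_target"] = te.fit_transform(X[["city"]], y)["city"]

    # Embedding encoding (deep learning)
    import torch.nn as nn
    embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)  # 100 categories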
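Section 2.1 says that robust scaling suits data with outliers. The sketch below makes that concrete on a toy column; the array `x` and the `inlier_spread_*` names are made up for illustration and do not come from the article.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical toy column: nine ordinary values plus one extreme outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

standard = StandardScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# How much of the output range the nine inliers still occupy after scaling.
inlier_spread_standard = standard[:9].max() - standard[:9].min()
inlier_spread_robust = robust[:9].max() - robust[:9].min()

print(f"StandardScaler inlier spread: {inlier_spread_standard:.3f}")
print(f"RobustScaler inlier spread:   {inlier_spread_robust:.3f}")
```

With the mean and standard deviation dragged up by the single outlier, StandardScaler squeezes the nine ordinary values into a very narrow band, while the median and interquartile range used by RobustScaler leave them well separated.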
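Section 3.2 mentions target encoding with cross-validation to prevent overfitting. One way to get that effect is out-of-fold encoding, where each row is encoded using target statistics from the other folds only. The following is a minimal sketch, not the `category_encoders` implementation: the helper name `oof_target_encode`, the toy DataFrame, and the simple additive smoothing formula are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(cat, y, n_splits=5, smoothing=10.0, seed=0):
    """Hypothetical helper: encode each row from the other folds' targets."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        cat_fit, y_fit = cat.iloc[fit_idx], y.iloc[fit_idx]
        prior = y_fit.mean()
        stats = y_fit.groupby(cat_fit.to_numpy()).agg(["sum", "count"])
        # Additive smoothing: small categories are pulled toward the fold prior.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        held_out = cat.iloc[enc_idx].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = held_out.to_numpy()
    return encoded


# Toy data, invented for this example.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "A", "B", "C", "A", "B"],
    "y": [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
})
df["city_target"] = oof_target_encode(df["city"], df["y"], n_splits=5)
print(df)
```

Because no row ever sees its own label, the encoded column cannot simply memorize the target. At prediction time the mapping would normally be refit on the full training set and applied to new data.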
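4. Time Features

    import numpy as np
    import pandas as pd

    def extract_time_features(df, time_col):
        """Extract calendar, cyclical, and boolean features from a datetime column."""
        df[time_col] = pd.to_datetime(df[time_col])

        # Basic features
        df["year"] = df[time_col].dt.year
        df["month"] = df[time_col].dt.month
        df["day"] = df[time_col].dt.day
        df["hour"] = df[time_col].dt.hour
        df["minute"] = df[time_col].dt.minute
        df["dayofweek"] = df[time_col].dt.dayofweek  # 0 = Monday
        df["dayofyear"] = df[time_col].dt.dayofyear
        df["week"] = df[time_col].dt.isocalendar().week

        # Cyclical encoding (sin/cos)
        df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
        df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
        df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
        df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
        df["dayofweek_sin"] = np.sin(2 * np.pi * df["dayofweek"] / 7)
        df["dayofweek_cos"] = np.cos(2 * np.pi * df["dayofweek"] / 7)

        # Boolean features
        df["is_weekend"] = df["dayofweek"].isin([5, 6]).astype(int)
        df["is_month_start"] = df[time_col].dt.is_month_start.astype(int)
        df["is_month_end"] = df[time_col].dt.is_month_end.astype(int)

        return df

5. Text Features

    import torch
    from sklearn.feature_extraction.text import TfidfVectorizer
    from transformers import AutoTokenizer, AutoModel

    # TF-IDF
    tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
    X_tfidf = tfidf.fit_transform(texts)

    # Sentence-BERT embeddings (the Hugging Face Hub ID carries the
    # sentence-transformers/ prefix)
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    def get_embedding(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            output = model(**inputs)
        # Mean-pool token embeddings into one vector. This is fine for a single
        # unpadded text; batches should weight by the attention mask.
        return output.last_hidden_state.mean(dim=1).squeeze().numpy()

6. Interaction Features

    # Arithmetic combinations
    X["ratio"] = X["feature_a"] / (X["feature_b"] + 1e-8)  # epsilon avoids division by zero
    X["product"] = X["feature_a"] * X["feature_b"]
    X["difference"] = X["feature_a"] - X["feature_b"]

    # Polynomial features
    from sklearn.preprocessing import PolynomialFeatures
    poly = PolynomialFeatures(degree=2, interaction_only=True)
    X_poly = poly.fit_transform(X)

    # Binning
    X["age_bin"] = pd.cut(
        X["age"],
        bins=[0, 18, 35, 50, 65, 100],
        labels=["minor", "young adult", "middle-aged", "senior", "elderly"],
    )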
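The cyclical sin/cos encoding in section 4 is easier to motivate with numbers. As a small sketch (the helper `cyclical_encode` and the variable names are illustrative, not part of the article's code), compare how far apart 23:00 and 00:00 are as raw integers versus on the unit circle:

```python
import numpy as np


def cyclical_encode(values, period):
    """Map a periodic feature onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


hours = np.array([0, 12, 23])
encoded = cyclical_encode(hours, period=24)

# Raw integers put 23:00 and 00:00 at opposite ends of the scale.
raw_gap_23_0 = abs(int(hours[2]) - int(hours[0]))

# On the circle, 23:00 sits next to 00:00, and 12:00 is the farthest point.
circ_gap_23_0 = np.linalg.norm(encoded[2] - encoded[0])
circ_gap_12_0 = np.linalg.norm(encoded[1] - encoded[0])

print(raw_gap_23_0, round(circ_gap_23_0, 3), round(circ_gap_12_0, 3))
```

The raw gap is 23 hours even though the two times are one hour apart. After encoding, the distance between 23:00 and 00:00 is much smaller than the distance between 12:00 and 00:00, which is the ordering a linear or distance-based model needs.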
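7. Feature Selection

7.1 Filter Methods

    from sklearn.feature_selection import (
        SelectKBest, VarianceThreshold, f_classif, mutual_info_classif,
    )

    # Variance filter: drop near-constant features (variance at or below 0.01)
    selector = VarianceThreshold(threshold=0.01)
    X_filtered = selector.fit_transform(X)

    # Correlation filter: keep features whose absolute correlation with y exceeds 0.05
    correlations = X.corrwith(y).abs()
    selected = correlations[correlations > 0.05].index

    # SelectKBest (mutual_info_classif can replace f_classif as the score function)
    selector = SelectKBest(f_classif, k=50)
    X_selected = selector.fit_transform(X, y)

7.2 Wrapper Methods

    from sklearn.feature_selection import RFE
    from sklearn.ensemble import RandomForestClassifier

    # Recursive feature elimination: drop 5 features per round until 20 remain
    model = RandomForestClassifier(n_estimators=100)
    rfe = RFE(model, n_features_to_select=20, step=5)
    X_selected = rfe.fit_transform(X, y)
    print(f"Selected features: {X.columns[rfe.support_].tolist()}")

7.3 Embedded Methods

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    # Feature importance from a tree-based model
    model = GradientBoostingClassifier()
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    top_features = importances.nlargest(20)
    print(top_features)

8. Feature Engineering Pipeline

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Define how each group of columns is processed
    numeric_features = ["age", "income", "score"]
    categorical_features = ["city", "gender", "education"]

    preprocessor = ColumnTransformer([
        ("num", Pipeline([
            ("scaler", StandardScaler()),
        ]), numeric_features),
        ("cat", Pipeline([
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_features),
    ])

    # Full pipeline
    # k=50 assumes the encoded matrix has at least 50 columns; lower k otherwise
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("feature_selection", SelectKBest(f_classif, k=50)),
        ("classifier", GradientBoostingClassifier()),
    ])

    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_test, y_test)

9. Summary

The core of feature engineering:

- Numerical features: standardization or normalization is the baseline; log transforms handle skewed distributions.
- Categorical features: use one-hot encoding for low cardinality and target encoding for high cardinality.
- Time features: cyclical sin/cos encoding usually works better than raw integers for periodic fields, since it keeps 23:00 next to 00:00.
- Feature selection: start with filter methods (fast), then wrapper methods (more precise), and finally embedded methods (model-driven).
- Pipeline: wrap every step in a single reproducible pipeline.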
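The pipeline from section 8 can be exercised end to end on synthetic data. The sketch below is only an illustration under stated assumptions: the data is randomly generated, the column set is trimmed to five features, and `k=5` replaces the article's `k=50` so that it does not exceed the eight columns produced by preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, invented for this example.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),
    "score": rng.normal(size=n),
    "city": rng.choice(["A", "B", "C"], size=n),
    "gender": rng.choice(["F", "M"], size=n),
})
# The label depends on "score" and on living in city "A", plus noise.
signal = X["score"] + 0.5 * (X["city"] == "A") + rng.normal(scale=0.5, size=n)
y = (signal > 0).astype(int)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income", "score"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("feature_selection", SelectKBest(f_classif, k=5)),
    ("classifier", GradientBoostingClassifier(random_state=0)),
])

# Each fold refits the scaler, encoder, and selector on its own training split.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Cross-validating the whole pipeline, rather than preprocessing once and cross-validating only the classifier, keeps scaling statistics and feature-selection scores from leaking out of the validation folds.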