[Quant Academy] Factor Preprocessing

Preprocessing
Data Transformation
Feature Transformation
Quant Academy

(iQuant) #1
Author: bigquant
Reading time: 12 minutes
Published by BigQuant Quant Academy. Difficulty: ☆☆☆

Introduction: In machine learning, many factor datasets must be preprocessed before they can be used for model training. This article gives a brief introduction to factor preprocessing with sklearn.


In [39]:
# Import packages (numpy is used by np.nanmean / np.percentile below)
import numpy as np
from sklearn import preprocessing
In [40]:
# Basic settings
class conf:
    start_date = '2015-01-01'
    end_date = '2017-08-10'
    split_date = '2016-01-01'
    instruments = D.instruments(start_date, split_date)
    features = ['fs_current_assets_0', 'market_cap_0', 'close_0']
In [41]:
# Compute feature data
m2 = M.general_feature_extractor.v5(
    instruments=conf.instruments, start_date=conf.start_date, end_date=conf.split_date,
    features=conf.features)
# Preprocessing: handle missing data and normalize;
# T.get_stock_ranker_default_transforms prepares the data for the StockRanker model
m3 = M.transform.v2(
    data=m2.data, transforms=T.get_stock_ranker_default_transforms(),
    drop_null=True, astype='int32', except_columns=['date', 'instrument'])
[2017-10-16 16:05:47.926344] INFO: bigquant: general_feature_extractor.v5 started..
[2017-10-16 16:05:47.936684] INFO: bigquant: cache hit
[2017-10-16 16:05:47.938360] INFO: bigquant: general_feature_extractor.v5 finished [0.012008s].
[2017-10-16 16:05:47.972345] INFO: bigquant: transform.v2 started..
[2017-10-16 16:05:47.975012] INFO: bigquant: cache hit
[2017-10-16 16:05:47.976093] INFO: bigquant: transform.v2 finished [0.003792s].
In [42]:
# Full dataset
all_data = m2.data.read_df()
# Data for a single day
df = all_data[all_data['date']=='2015-01-05']
# A histogram of this day's data shows rough distributions with extreme values,
# so some cleaning and transformation is needed
# df[df.dtypes[df.dtypes == 'float32'].index.values].hist(bins=50, figsize=[15,12])
In [43]:
## 1. Missing-value handling: fill NaN with the column mean
for factor in ['close_0', 'market_cap_0', 'fs_current_assets_0']:
    df[factor] = df[factor].fillna(df[factor].mean())
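The fill step above can be sketched on a toy column (hypothetical data, not from the dataset used in this article); assigning the result back avoids pandas chained-assignment pitfalls:

```python
import numpy as np
import pandas as pd

# Toy factor column with one missing value
df = pd.DataFrame({"close_0": [10.0, np.nan, 14.0, 12.0]})

# Fill NaN with the column mean; .mean() skips NaN by default,
# so this is equivalent to np.nanmean on the raw values
df["close_0"] = df["close_0"].fillna(df["close_0"].mean())
```

After this, the NaN is replaced with (10 + 14 + 12) / 3 = 12 and the column has no missing values.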
In [44]:
## 2. Extreme-value handling (winsorization to the 5th/95th percentiles)
for factor in ['close_0', 'market_cap_0', 'fs_current_assets_0']:
    p_95 = np.percentile(df[factor], 95)
    p_5 = np.percentile(df[factor], 5)
    df.loc[df[factor] > p_95, factor] = p_95
    df.loc[df[factor] < p_5, factor] = p_5
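The same clipping can be written more compactly with `Series.clip`; a minimal sketch on hypothetical values with one outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical factor values with one extreme outlier
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Winsorize to the 5th/95th percentiles in one call
p_5, p_95 = np.percentile(s, [5, 95])
clipped = s.clip(lower=p_5, upper=p_95)
```

Values inside the percentile band are untouched; only the tails are pulled in, which tames outliers without dropping rows.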
In [45]:
## 3. Standardization (z-score)
# Re-read the day's raw data so the earlier in-place steps don't affect this result
df = all_data[all_data['date']=='2015-01-05'].dropna()
for factor in ['close_0', 'market_cap_0', 'fs_current_assets_0']:
    df[factor] = (df[factor] - df[factor].mean()) / df[factor].std()
    
df[['close_0', 'market_cap_0', 'fs_current_assets_0']].values
Out[45]:
array([[  1.01638317,   2.82943988,  14.47532082],
       [ -0.05427583,  -0.21515357,  -0.21295929],
       [  0.04300562,  -0.06115698,   0.07691551],
       ..., 
       [ -0.08408548,  -0.03219347,  -0.11243575],
       [ -0.06934822,   0.05156187,  -0.11314245],
       [ -0.08663286,   0.67376161,   0.04899734]], dtype=float32)
In [46]:
# The standardization above roughly matches sklearn's scale; the small differences
# come from the std estimator (pandas .std() uses ddof=1, scale divides by the
# population std, ddof=0)
preprocessing.scale(df[['close_0', 'market_cap_0', 'fs_current_assets_0']])
Out[46]:
array([[  1.01660847,   2.83006714,  14.47853436],
       [ -0.05428803,  -0.21520142,  -0.2130065 ],
       [  0.04301501,  -0.06117069,   0.07693265],
       ..., 
       [ -0.08410428,  -0.03220075,  -0.11246065],
       [ -0.06936376,   0.05157316,  -0.11316751],
       [ -0.08665223,   0.67391087,   0.04900828]])
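The ddof difference can be verified on a toy series (hypothetical values): the two z-scores differ by exactly the constant factor sqrt(n/(n-1)).

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

x = pd.Series([1.0, 2.0, 3.0, 4.0])

manual = (x - x.mean()) / x.std()        # pandas std: ddof=1 (sample std)
scaled = preprocessing.scale(x.values)   # sklearn scale: ddof=0 (population std)

# Elementwise ratio is the constant sqrt(n / (n - 1))
ratio = scaled / manual.values
```

With n = 4 the ratio is sqrt(4/3) ≈ 1.1547 for every element, which is why the two outputs above agree only approximately.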
In [47]:
## 4. Normalization (scales each sample, i.e. each row, to unit norm)
preprocessing.normalize(df[['close_0', 'market_cap_0', 'fs_current_assets_0']])
Out[47]:
array([[ 0.06874776,  0.1913822 ,  0.979105  ],
       [-0.1764766 , -0.69956682, -0.69243215],
       [ 0.40093092, -0.57015163,  0.71706451],
       ..., 
       [-0.58374982, -0.22349793, -0.78056699],
       [-0.48710097,  0.36216984, -0.79471105],
       [-0.12720051,  0.98926461,  0.07194138]])
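Note that, unlike `scale` (which works per column), `normalize` works per row: each sample vector is divided by its L2 norm. A minimal sketch on a hypothetical 2-sample matrix:

```python
import numpy as np
from sklearn import preprocessing

# Two samples; normalize rescales each ROW to unit L2 norm
X = np.array([[3.0, 4.0],
              [1.0, 0.0]])
X_norm = preprocessing.normalize(X)  # default norm='l2'
```

The first row becomes [0.6, 0.8] (divided by its norm 5), and every row of the output has norm 1, which explains why the Out[47] values look so different from the z-scores.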
In [48]:
# Factor preprocessing function: all three steps for each factor
def preprocess(df):
    for factor in ['close_0', 'market_cap_0', 'fs_current_assets_0']:
        # Missing-value handling
        df[factor] = df[factor].fillna(df[factor].mean())
        # Extreme-value handling
        p_95 = np.percentile(df[factor], 95)
        p_5 = np.percentile(df[factor], 5)
        df.loc[df[factor] > p_95, factor] = p_95
        df.loc[df[factor] < p_5, factor] = p_5
        # Standardization
        df[factor] = (df[factor] - df[factor].mean()) / df[factor].std()
    return df

# Preprocess factors per trading day; the preprocessed result can then feed a
# wider range of machine-learning algorithms
all_data.groupby('date').apply(preprocess)
Out[48]:
close_0 date fs_current_assets_0 instrument market_cap_0
0 3.052037 2015-01-05 0.556457 000001.SZA 3.297253
1 3.048802 2015-01-06 0.556164 000001.SZA 3.257948
2 3.049037 2015-01-07 0.557937 000001.SZA 3.270279
3 3.031636 2015-01-08 0.557385 000001.SZA 3.231909
4 3.007655 2015-01-09 0.560743 000001.SZA 3.260432
5 2.998902 2015-01-12 0.570150 000001.SZA 3.236919
6 3.000345 2015-01-13 0.566009 000001.SZA 3.222543
7 3.004525 2015-01-14 0.565694 000001.SZA 3.216527
8 3.012266 2015-01-15 0.562824 000001.SZA 3.243954
9 3.033375 2015-01-16 0.566757 000001.SZA 3.231477
10 2.985070 2015-01-19 0.571762 000001.SZA 3.156042
11 2.996415 2015-01-20 0.571735 000001.SZA 3.171534
12 2.993371 2015-01-21 0.571391 000001.SZA 3.191233
13 3.017199 2015-01-22 0.571763 000001.SZA 3.176629
14 3.004288 2015-01-23 0.571596 000001.SZA 3.205649
15 3.022293 2015-01-26 0.571354 000001.SZA 3.194622
16 3.000776 2015-01-27 0.570767 000001.SZA 3.177091
17 3.015631 2015-01-28 0.570511 000001.SZA 3.173037
18 3.001202 2015-01-29 0.569341 000001.SZA 3.138286
19 2.976087 2015-01-30 0.569285 000001.SZA 3.153763
20 2.972936 2015-02-02 0.533049 000001.SZA 3.147219
21 2.978402 2015-02-03 0.528683 000001.SZA 3.137114
22 2.966676 2015-02-04 0.529062 000001.SZA 3.147889
23 2.957267 2015-02-05 0.529196 000001.SZA 3.174509
24 2.993339 2015-02-06 0.529368 000001.SZA 3.144080
25 2.971802 2015-02-09 0.528575 000001.SZA 3.154410
26 2.980456 2015-02-10 0.528297 000001.SZA 3.160760
27 2.991138 2015-02-11 0.565610 000001.SZA 3.148416
28 2.985317 2015-02-12 0.565225 000001.SZA 3.159806
29 2.968544 2015-02-13 0.560827 000001.SZA 3.134854
... ... ... ... ... ...
569668 -0.364151 2015-12-14 -0.612207 603998.SHA -0.582843
569669 -0.393706 2015-12-15 -0.613224 603998.SHA -0.602110
569670 -0.378904 2015-12-16 -0.613241 603998.SHA -0.593836
569671 -0.388544 2015-12-17 -0.613026 603998.SHA -0.600101
569672 -0.402664 2015-12-18 -0.610031 603998.SHA -0.612462
569673 -0.383377 2015-12-21 -0.610888 603998.SHA -0.596951
569674 -0.388683 2015-12-22 -0.612257 603998.SHA -0.601776
569675 -0.381756 2015-12-23 -0.611544 603998.SHA -0.595081
569676 -0.378545 2015-12-24 -0.612179 603998.SHA -0.599207
569677 -0.358746 2015-12-25 -0.612344 603998.SHA -0.582991
569678 -0.358682 2015-12-28 -0.612574 603998.SHA -0.584567
569679 -0.332569 2015-12-29 -0.611711 603998.SHA -0.566934
569680 -0.332485 2015-12-30 -0.613864 603998.SHA -0.570732
569681 -0.343865 2015-12-31 -0.616675 603998.SHA -0.575315
569682 -0.982238 2015-12-10 0.590867 603999.SHA -0.826569
569683 -0.978856 2015-12-11 0.586158 603999.SHA -0.821680
569684 -0.965441 2015-12-14 0.585241 603999.SHA -0.796086
569685 -0.945971 2015-12-15 0.585689 603999.SHA -0.769995
569686 -0.926357 2015-12-16 0.585801 603999.SHA -0.742622
569687 -0.914591 2015-12-17 0.585244 603999.SHA -0.717395
569688 -0.885378 2015-12-18 0.581989 603999.SHA -0.678198
569689 -0.858536 2015-12-21 0.552980 603999.SHA -0.634116
569690 -0.827869 2015-12-22 0.553608 603999.SHA -0.590174
569691 -0.784113 2015-12-23 0.549469 603999.SHA -0.524914
569692 -0.742582 2015-12-24 0.549022 603999.SHA -0.467260
569693 -0.698834 2015-12-25 0.547223 603999.SHA -0.405524
569694 -0.632080 2015-12-28 0.547461 603999.SHA -0.314254
569695 -0.583340 2015-12-29 0.547183 603999.SHA -0.244122
569696 -0.530886 2015-12-30 0.556995 603999.SHA -0.166064
569697 -0.444703 2015-12-31 0.557870 603999.SHA -0.053904

569698 rows × 5 columns
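The per-day pattern above can be sketched end to end on a tiny hypothetical panel (two dates, two stocks); `groupby('date').apply` runs the cross-sectional z-score within each day:

```python
import numpy as np
import pandas as pd

# Toy panel: hypothetical prices for two stocks on two dates
data = pd.DataFrame({
    "date": ["2015-01-05", "2015-01-05", "2015-01-06", "2015-01-06"],
    "instrument": ["A", "B", "A", "B"],
    "close_0": [10.0, 20.0, 11.0, 19.0],
})

def zscore(g):
    g = g.copy()  # work on a copy so the group view is not mutated
    g["close_0"] = (g["close_0"] - g["close_0"].mean()) / g["close_0"].std()
    return g

# Standardize within each trading day (cross-sectionally)
out = data.groupby("date", group_keys=False).apply(zscore)
```

Each date's two values become ±1/sqrt(2), i.e. the z-score is computed relative to that day's cross-section, not the whole history.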

Summary: This article used sklearn for factor preprocessing, as a reference for your own factor-preprocessing work.

Reference: Data preprocessing with sklearn

Reference: Standardization, normalization, binarization, and other machine-learning data preprocessing methods


Published by BigQuant Quant Academy. Copyright belongs to BigQuant; please credit the source when reposting.


(qci133) #2

These results are completely different from the manual preprocessing results earlier?


(iQuant) #3

Which manual preprocessing results are you referring to?


(qci133) #4

The passage in the original post. The two pieces of code produce completely different numbers, yet the post says "the standardization result above is the same as sklearn's scale result", which I found confusing.


(iQuant) #5

I've updated the code.


Does that resolve your question?


(qci133) #6

The numbers match now. But I couldn't tell which part of the code changed; was the manual preprocessing adjusted?


(小Q) #7

The change is in the standardization step, in the data assigned to df.
The earlier mismatch happened because df had already gone through missing-value and extreme-value handling, so df itself had been modified before the z-score was taken.
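The effect described above can be demonstrated with a minimal sketch (hypothetical toy data): standardizing after an in-place clipping step gives different numbers than standardizing the untouched source.

```python
import numpy as np
import pandas as pd

df_src = pd.DataFrame({"x": [1.0, 2.0, 100.0]})

# Path 1: clip first (as the extreme-value step does), then z-score
a = df_src.copy()
a["x"] = a["x"].clip(upper=10.0)  # mutates the data the z-score will see
z_after_clip = (a["x"] - a["x"].mean()) / a["x"].std()

# Path 2: re-read the untouched source, then z-score
b = df_src.copy()
z_fresh = (b["x"] - b["x"].mean()) / b["x"].std()
```

The two z-score series disagree, which is exactly why re-reading the raw day's data before the standardization cell made the manual result line up with `preprocessing.scale`.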