How do I implement a custom factor for my own needs?

Custom factors

(bluexxxx) #1

I can see many factors in the platform's factor library, but I would like to compute some custom factors, for example the return relative to the market index. How do I construct such a factor? Thanks!


(brantyz) #3

This requires a newly added API; see https://bigquant.com/docs/module_user_feature_extractor.html for how to use it.

The implementation code is below:

In [116]:
start_date = '2017-01-01'
end_date = '2017-06-01'
instruments = D.instruments(start_date=start_date, end_date=end_date)

# Fetch CSI 300 index data and rename the close column to hs300_close
df_hs300 = D.history_data(
    '000300.SHA',
    start_date=(pd.to_datetime(start_date) - datetime.timedelta(days=10)).strftime('%Y-%m-%d'),  # fetch a few extra days of data
    end_date=end_date)[['date', 'close']].rename(columns={'close': 'hs300_close'})


def calculate_relative_return(base_df):
    # merge the CSI 300 data with a single stock's data
    df = base_df[['date', 'close']].reset_index().merge(df_hs300, on='date', how='left').set_index('index')
    # relative return over the past 5 days
    return df['close'] / df['close'].shift(5) - df['hs300_close'] / df['hs300_close'].shift(5)

# Compute the relative return and name it u_relative_return_5
m1 = M.user_feature_extractor.v1(
    instruments=instruments, start_date=start_date, end_date=end_date,
    history_data_fields=['close'], look_back_days=5, features_by_instrument={
    'u_relative_return_5': calculate_relative_return 
    }, m_deps = [df_hs300])
[2017-07-03 14:13:36.193843] INFO: bigquant: user_feature_extractor.v1 start ..
[2017-07-03 14:13:36.642317] INFO: user_feature_extractor: compute feature:u_relative_return_5
[2017-07-03 14:13:58.720242] INFO: user_feature_extractor: year 2017, featurerows=286124
[2017-07-03 14:13:58.725564] INFO: user_feature_extractor: total feature rows: 286124
[2017-07-03 14:13:58.735148] INFO: bigquant: user_feature_extractor.v1 end [22.541298s].
In [128]:
m1.data.read_df().head()
Out[128]:
amount date instrument close u_relative_return_5
713418 420595168.0 2017-01-03 000001.SZA 959.585571 -0.001582
713419 449757472.0 2017-01-03 000002.SZA 2752.580566 -0.002094
713420 31737708.0 2017-01-03 000004.SZA 180.638672 0.010265
713421 25240360.0 2017-01-03 000005.SZA 63.297707 -0.011790
713422 243074080.0 2017-01-03 000006.SZA 314.764771 0.000276

If you need to merge these features with the ones generated by M.general_feature_extractor, see the example below:

In [120]:
features = ['close_5/close_0']
m2 = M.general_feature_extractor.v5(instruments=instruments, start_date=start_date, end_date=end_date, features=features)
m3 = M.join.v2(data1=m1.data, data2=m2.data, on=['date', 'instrument'], sort=True)
[2017-07-03 14:40:22.498186] INFO: bigquant: general_feature_extractor.v5 start ..
[2017-07-03 14:40:27.046378] INFO: general_feature_extractor: year 2017, featurerows=286124
[2017-07-03 14:40:27.051680] INFO: general_feature_extractor: total feature rows: 286124
[2017-07-03 14:40:27.060410] INFO: bigquant: general_feature_extractor.v5 end [4.562273s].
In [127]:
m3.data.read_df().head()
Out[127]:
amount date instrument close u_relative_return_5 close_0 close_5
0 420595168.0 2017-01-03 000001.SZA 959.585571 -0.001582 959.585571 955.395264
1 449757472.0 2017-01-03 000002.SZA 2752.580566 -0.002094 2752.580566 2741.958008
2 31737708.0 2017-01-03 000004.SZA 180.638672 0.010265 180.638672 177.753326
3 25240360.0 2017-01-03 000005.SZA 63.297707 -0.011790 63.297707 63.668411
4 243074080.0 2017-01-03 000006.SZA 314.764771 0.000276 314.764771 312.811737

(神龙斗士) #4

Another simple example: computing the ROE of the past three years:


Compute the ROE of the past three years

Read the data from history data and take the ROE at the 250th/500th/750th trading day in the past as the ROE of 1/2/3 years ago. This assumes roughly 250 trading days per year; D.trading_days can be used for a more precise calculation.
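As a minimal hedged sketch (not part of the original notebook), the fixed 250-day assumption could be replaced by counting actual trading days with D.trading_days; the variable names below are only illustrative:

# Hedged sketch: count the actual trading days in the year before start_date
# instead of assuming a fixed 250 days per year.
start_date = '2017-01-01'
one_year_before = (pd.to_datetime(start_date) - datetime.timedelta(days=365)).strftime('%Y-%m-%d')
days_per_year = len(D.trading_days(start_date=one_year_before, end_date=start_date))
# The shifts in the cell below could then use days_per_year, 2 * days_per_year and
# 3 * days_per_year instead of the hard-coded 250/500/750.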

In [23]:
start_date = '2017-01-01'
end_date = '2017-06-01'
instruments = D.instruments(start_date=start_date, end_date=end_date)

m1 = M.user_feature_extractor.v1(
    instruments=instruments, start_date=start_date, end_date=end_date,
    history_data_fields=['fs_roe'], look_back_days=4 * 366,
    features_by_instrument={
        'u_fs_roe_y1': lambda instrument_df: instrument_df['fs_roe'].shift(250),
        'u_fs_roe_y2': lambda instrument_df: instrument_df['fs_roe'].shift(500),
        'u_fs_roe_y3': lambda instrument_df: instrument_df['fs_roe'].shift(750),
    })
[2017-07-05 17:51:21.176922] INFO: user_feature_extractor: compute feature:u_fs_roe_y3
[2017-07-05 17:51:42.530716] INFO: user_feature_extractor: compute feature:u_fs_roe_y2
[2017-07-05 17:51:58.343198] INFO: user_feature_extractor: compute feature:u_fs_roe_y1
[2017-07-05 17:52:17.973067] INFO: user_feature_extractor: year 2017, featurerows=286124
[2017-07-05 17:52:17.987011] INFO: user_feature_extractor: total feature rows: 286124
[2017-07-05 17:52:18.062263] INFO: bigquant: user_feature_extractor.v1 end [131.93348s].
In [24]:
m1.data.read_df().head()
Out[24]:
instrument amount date fs_roe u_fs_roe_y3 u_fs_roe_y2 u_fs_roe_y1
4528853 000001.SZA 420595168.0 2017-01-03 10.4073 13.0164 13.1431 12.3158
4528854 000002.SZA 449757472.0 2017-01-03 8.2382 2.4944 2.0333 0.7346
4528855 000004.SZA 31737708.0 2017-01-03 16.5389 1.0370 -3.7478 -2.3361
4528856 000005.SZA 25240360.0 2017-01-03 11.0044 -1.3633 -1.4006 -4.5565
4528857 000006.SZA 243074080.0 2017-01-03 2.6203 9.9579 5.9817 6.6016

If you need to merge these features with the ones generated by M.general_feature_extractor, see the example below:

In [25]:
features = ['close_5/close_0']
m2 = M.general_feature_extractor.v5(instruments=instruments, start_date=start_date, end_date=end_date, features=features)
m3 = M.join.v2(data1=m1.data, data2=m2.data, on=['date', 'instrument'], sort=True)
[2017-07-05 17:52:18.295358] INFO: bigquant: general_feature_extractor.v5 start ..
[2017-07-05 17:52:18.304168] INFO: bigquant: hit cache
[2017-07-05 17:52:18.305775] INFO: bigquant: general_feature_extractor.v5 end [0.010418s].
[2017-07-05 17:52:18.311260] INFO: bigquant: join.v2 start ..
[2017-07-05 17:52:18.948812] INFO: join: /y_2017, rows=286124/286124, timetaken=0.366294s
[2017-07-05 17:52:18.968007] INFO: join: total result rows: 286124
[2017-07-05 17:52:18.970218] INFO: bigquant: join.v2 end [0.659036s].
In [26]:
m3.data.read_df().head()
Out[26]:
instrument amount date fs_roe u_fs_roe_y3 u_fs_roe_y2 u_fs_roe_y1 close_0 close_5
0 000001.SZA 420595168.0 2017-01-03 10.4073 13.0164 13.1431 12.3158 959.585571 955.395264
1 000002.SZA 449757472.0 2017-01-03 8.2382 2.4944 2.0333 0.7346 2752.580566 2741.958008
2 000004.SZA 31737708.0 2017-01-03 16.5389 1.0370 -3.7478 -2.3361 180.638672 177.753326
3 000005.SZA 25240360.0 2017-01-03 11.0044 -1.3633 -1.4006 -4.5565 63.297707 63.668411
4 000006.SZA 243074080.0 2017-01-03 2.6203 9.9579 5.9817 6.6016 314.764771 312.811737

(神龙斗士) #5

Use D.financial_statements to compute a custom ROE factor:


Compute the ROE of the past three years

Read the data from D.financial_statements.

In [46]:
start_date = '2017-01-01'
end_date = '2017-06-01'
instruments = D.instruments(start_date=start_date, end_date=end_date)

def extract_feature_roe(instruments, start_date, end_date):
    origin_start_date = start_date
    # extend the start date backwards so the 3-year lookback has data
    start_date = (pd.to_datetime(start_date) - datetime.timedelta(days=4 * 366)).strftime('%Y-%m-%d')
    df = D.financial_statements(
        instruments=instruments,
        start_date=start_date, end_date=end_date,
        fields=['date', 'instrument', 'fs_roe'])

    trading_days_df = D.trading_days(start_date=start_date, end_date=end_date)

    def _roe(instrument_df):
        instrument = instrument_df['instrument'].iloc[0]
        # financial statements are quarterly, so shifting by 4/8/12 reports gives the ROE of roughly 1/2/3 years ago
        instrument_df['fs_roe_y1'] = instrument_df['fs_roe'].shift(4)
        instrument_df['fs_roe_y2'] = instrument_df['fs_roe'].shift(8)
        instrument_df['fs_roe_y3'] = instrument_df['fs_roe'].shift(12)
        # align report dates to trading days and forward-fill the values in between
        instrument_df = trading_days_df.merge(instrument_df, on='date', how='left')
        instrument_df['instrument'] = instrument
        instrument_df = instrument_df.fillna(method='ffill')
        return instrument_df

    df = df.groupby('instrument').apply(_roe)
    df = df.sort_values('date')
    # keep only rows inside the originally requested date range
    df = df[(origin_start_date <= df['date']) & (df['date'] <= end_date)]
    df.reset_index(drop=True, inplace=True)
    return Outputs(data=DataSource.write_df(df))

m1 = M.cached.v2(run=extract_feature_roe, kwargs={'instruments': instruments, 'start_date': start_date, 'end_date': end_date})
[2017-07-05 18:26:23.975245] INFO: bigquant: cached.v2 start ..
[2017-07-05 18:26:51.412187] INFO: bigquant: cached.v2 end [27.436948s].
In [47]:
m1.data.read_df().head()
Out[47]:
date instrument fs_roe fs_roe_y1 fs_roe_y2 fs_roe_y3
0 2017-01-03 600652.SHA 1.114500 1.1443 -5.6274 0.3902
1 2017-01-03 000633.SZA -26.859699 -10.3953 -10.0292 0.6394
2 2017-01-03 300535.SZA 8.451600 NaN NaN NaN
3 2017-01-03 002578.SZA 0.903400 2.0293 2.9336 4.6380
4 2017-01-03 300606.SZA NaN NaN NaN NaN

If you need to merge these features with the ones generated by M.general_feature_extractor, see the example below:

In [48]:
features = ['close_5/close_0']
m2 = M.general_feature_extractor.v5(instruments=instruments, start_date=start_date, end_date=end_date, features=features)
m3 = M.join.v2(data1=m1.data, data2=m2.data, on=['date', 'instrument'], sort=True)
[2017-07-05 18:26:51.559576] INFO: bigquant: general_feature_extractor.v5 start ..
[2017-07-05 18:26:51.587971] INFO: bigquant: hit cache
[2017-07-05 18:26:51.589077] INFO: bigquant: general_feature_extractor.v5 end [0.029538s].
[2017-07-05 18:26:51.593168] INFO: bigquant: join.v2 start ..
[2017-07-05 18:26:52.351613] INFO: join: /y_2017, rows=285062/286124, timetaken=0.659125s
[2017-07-05 18:26:52.372535] INFO: join: total result rows: 285062
[2017-07-05 18:26:52.374898] INFO: bigquant: join.v2 end [0.781704s].
In [49]:
m3.data.read_df().head()
Out[49]:
close_0 close_5 date instrument fs_roe fs_roe_y1 fs_roe_y2 fs_roe_y3
0 959.585571 955.395264 2017-01-03 000001.SZA 10.4073 12.3158 13.1431 13.0164
1 2752.580566 2741.958008 2017-01-03 000002.SZA 8.2382 7.7412 8.3103 9.3272
2 180.638672 177.753326 2017-01-03 000004.SZA 3.5966 1.4988 -6.3542 0.0698
3 63.297707 63.668411 2017-01-03 000005.SZA 11.0044 -4.5565 -5.2495 -4.3439
4 314.764771 312.811737 2017-01-03 000006.SZA 2.6203 6.6016 5.9817 9.9579

(hugo) #6

Hi, a question: if

m2 = M.user_feature_extractor.v1(
    instruments=conf.instrument, start_date=conf.startdate, end_date=conf.enddate,
    history_data_fields=['close', 'open'], look_back_days=120,
    features_by_instrument={
        'u_am20': lambda x: x.amount.rolling(20).mean() / x.shift(20).amount.rolling(20).mean()
    }
)

then when calling M.stock_ranker_train.v5(training_ds=m2.data, features=...), how should the features part be written?


(iQuant) #7

Sometimes we need a factor that is not directly tied to an individual stock but instead reflects a market-wide condition. In that case we can still use the M.user_feature_extractor module.


With the M.user_feature_extractor module it is easy to implement such special factors, for example a market 5-day return factor whose value is the same for every stock on a given trading day.

In [37]:
start_date = '2017-01-01'
end_date = '2017-06-01'
instruments = D.instruments(start_date=start_date, end_date=end_date)[:10]  # stock list; use 10 stocks as an example

# Fetch CSI 300 index data and rename the close column to hs300_close
df_hs300 = D.history_data(
    '000300.SHA',
    start_date=(pd.to_datetime(start_date) - datetime.timedelta(days=10)).strftime('%Y-%m-%d'),  # fetch a few extra days of data
    end_date=end_date)[['date', 'close']].rename(columns={'close': 'hs300_close'})

# Compute the benchmark (market) 5-day return factor
def calculate_benchmark_5days_return(base_df):
    # merge the CSI 300 data with a single stock's data
    df = base_df[['date', 'close','instrument']].reset_index().merge(df_hs300, on='date', how='left').set_index('index')
    # return the market's 5-day return
    return df['hs300_close'] / df['hs300_close'].shift(5) - 1

# Compute the benchmark (market) 5-day return factor and name it u_benchmark_5days_return; on the same day every stock has the same value
m1 = M.user_feature_extractor.v1(
    instruments=instruments, start_date=start_date, end_date=end_date,
    history_data_fields=['close'], look_back_days=5, features_by_instrument={
    'u_benchmark_5days_return': calculate_benchmark_5days_return
    }, m_deps = [df_hs300])
[2018-01-18 18:18:53.323898] INFO: bigquant: user_feature_extractor.v1 start ..
[2018-01-18 18:18:54.205280] INFO: user_feature_extractor: compute feature:u_benchmark_5days_return
[2018-01-18 18:18:54.421976] INFO: user_feature_extractor: year 2017, featurerows=851
[2018-01-18 18:18:54.428171] INFO: user_feature_extractor: total feature rows: 851
[2018-01-18 18:18:54.431497] INFO: bigquant: user_feature_extractor.v1 end [1.107597s].
In [39]:
m1.data.read_df().head(10)
Out[39]:
instrument amount date close u_benchmark_5days_return
70 000001.SZA 420595168.0 2017-01-03 959.585571 0.005968
71 000002.SZA 449757472.0 2017-01-03 2752.580566 0.005968
72 000004.SZA 31737708.0 2017-01-03 180.638672 0.005968
73 000005.SZA 25240360.0 2017-01-03 63.297707 0.005968
74 000006.SZA 243074080.0 2017-01-03 314.764771 0.005968
75 000007.SZA 44872752.0 2017-01-03 144.103104 0.005968
76 000008.SZA 44600660.0 2017-01-03 205.675446 0.005968
77 000009.SZA 276159680.0 2017-01-03 76.551094 0.005968
78 000010.SZA 127736256.0 2017-01-03 83.933731 0.005968
79 000011.SZA 240469744.0 2017-01-03 65.416298 0.005968

Merging with other factors

In [40]:
features = ['close_20/close_0'] # take this factor as an example
m2 = M.general_feature_extractor.v5(instruments=instruments, start_date=start_date, end_date=end_date, features=features)
m3 = M.join.v2(data1=m1.data, data2=m2.data, on=['date', 'instrument'], sort=True)
[2018-01-18 18:20:56.978757] INFO: bigquant: general_feature_extractor.v5 start ..
[2018-01-18 18:21:20.716795] INFO: general_feature_extractor: year 2017, featurerows=851
[2018-01-18 18:21:20.722902] INFO: general_feature_extractor: total feature rows: 851
[2018-01-18 18:21:20.726086] INFO: bigquant: general_feature_extractor.v5 end [23.747324s].
[2018-01-18 18:21:20.745735] INFO: bigquant: join.v2 start ..
[2018-01-18 18:21:20.999925] INFO: join: /y_2017, rows=851/851, timetaken=0.032091s
[2018-01-18 18:21:21.009892] INFO: join: total result rows: 851
[2018-01-18 18:21:21.012294] INFO: bigquant: join.v2 end [0.266916s].
In [42]:
m3.data.read_df().head()
Out[42]:
instrument amount date close u_benchmark_5days_return close_0 close_20
0 000001.SZA 420595168.0 2017-01-03 959.585571 0.005968 959.585571 991.013062
1 000002.SZA 449757472.0 2017-01-03 2752.580566 0.005968 2752.580566 3385.953125
2 000004.SZA 31737708.0 2017-01-03 180.638672 0.005968 180.638672 169.503677
3 000005.SZA 25240360.0 2017-01-03 63.297707 0.005968 63.297707 66.726723
4 000006.SZA 243074080.0 2017-01-03 314.764771 0.005968 314.764771 290.677307

(jove) #8

Regarding the question hugo raised in post #6, I still have a few follow-up questions:

1. M.user_feature_extractor.v1 and M.general_feature_extractor each produce feature data; after merging them with M.join.v2 we get the training data m2, which is then passed to M.stock_ranker_train.v5(training_ds=m2.data, features=???) for training. Is it correct to understand that the training result has nothing to do with the feature values?

2. If that understanding is correct, then when feature factors are obtained with both methods at the same time:

M.general_feature_extractor(features=['rank_pb_lf_0']),
M.user_feature_extractor.v1(features_by_instrument={'gtja_100': lambda x: x.volume.rolling(20).std()}),

after training with M.stock_ranker_train.v5(training_ds=m2.data, features=['rank_pb_lf_0']),
when I inspect feature_gains.read_df() I can only see the gain value of 'rank_pb_lf_0'; 'gtja_100' also took part in the training, but its gain value is not shown.

Only by writing M.stock_ranker_train.v5(training_ds=m2.data, features=['rank_pb_lf_0', 'gtja_100'])
can both values be seen at the same time.

When there are only a few factors they can be typed in by hand, but how should this be handled when there are many?

Also, this merging of custom factors has to be repeated in several places (training data, test data, backtesting, and so on), which makes it cumbersome to modify and maintain. Is there a simpler solution?

3. When extracting feature data with M.user_feature_extractor.v1(features_by_instrument='gtja_100'), besides the wanted 'gtja_100' column there are also extra columns such as 'volume', 'amount', 'm:low_price' and 'm:high_price'. Can these unneeded columns be removed?

The data produced by M.fast_auto_labeler.v8 has the same issue. These extra columns presumably end up in the training module as feature factors and will inevitably contribute to the training result.

When doing single-factor research, these uncontrolled extra columns become a major source of interference. How can this be handled?
Does M.stock_ranker_train.v5 automatically screen out these extra columns internally when building the training data?

4. In the example iQuant gave in post #7:

features = ['close_20/close_0']  # take this factor as an example
m2 = M.general_feature_extractor.v5(instruments=instruments, start_date=start_date, end_date=end_date, features=features)
m3 = M.join.v2(data1=m1.data, data2=m2.data, on=['date', 'instrument'], sort=True)

training on the data generated this way with M.stock_ranker_train.v5(training_ds=m2.data, features='close_20/close_0') raises an error:
KeyError: "['close_20/close_0'] not in index"
because the feature table generated earlier has no column named 'close_20/close_0', so the system cannot find it. How can this be solved?

5. When training with M.stock_ranker_train.v5, the predictions obtained after adding several feature factors are often far worse than with a single factor; in other words, every newly added feature factor has a major impact on the model's predictions. Is this a characteristic of the ranking algorithm itself?
With machine-learning algorithms such as random forests, adding one feature factor does not change the predictions this much; even adding a completely irrelevant factor only changes the result slightly, because the algorithm itself judges that the irrelevant feature contributes very little to the prediction.

Thanks!


(小Q) #9

Hello.

  1. Training with M.stock_ranker_train.v5(training_ds=m2.data, features=???) does depend on the samples' feature values: training sets built from different feature values produce different models.

  2. After training with M.stock_ranker_train.v5(training_ds=m2.data, features=['rank_pb_lf_0']),
    inspecting feature_gains.read_df() only shows the gain value of 'rank_pb_lf_0'. That is because the features list does not include the gtja_100 factor, so its gain value is not available. In the stock_ranker_train module, not every feature column in m2.data takes part in model training; only the features in the list you pass in do. So for every feature you want in the training, make sure the features list is correct and complete.

    When there are many factors, only manual entry is currently supported; the visual modules are recommended to make this easier (see the hedged sketch after this list).

  3. After M.user_feature_extractor finishes, there will be columns such as 'volume', 'amount', 'm:low_price' and 'm:high_price'. These columns do not need to be removed manually: they do not take part in model training, so they have no effect.

  4. iQuant's example only demonstrates how to define a custom factor; the error you report is unrelated to it. The data used to train a model must contain both the label data and the feature data.

  5. As for random forests being insensitive to a single factor, was that conclusion drawn from tests on our platform? With the ranking algorithm, adding one factor can indeed have a large impact on the model's predictions; this is a characteristic of the ranking algorithm.
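As an illustration of points 2 and 4 above, here is a minimal hedged sketch (not from the original posts). It reuses the factor names from post #8, and merged_training_data is a placeholder for a DataSource that already contains both the label columns and all of the feature columns:

# Hedged sketch: keep the complete feature list in one place so that training,
# prediction and backtesting all reference the same column names.
general_features = ['rank_pb_lf_0']   # produced by M.general_feature_extractor
user_features = ['gtja_100']          # produced by M.user_feature_extractor.v1
all_features = general_features + user_features

# Only the columns named in `features` take part in training, so pass the full list.
# merged_training_data is a hypothetical DataSource holding both labels and features.
m_train = M.stock_ranker_train.v5(training_ds=merged_training_data, features=all_features)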

In addition, we strongly recommend developing AI strategies in the visual editing mode; it is faster and more efficient!