[Quant Academy - Python Programming] Pandas: Data Processing and Wrangling

pandas

(iQuant) #1

Introduction: this article continues our walkthrough of the Pandas library for data analysis and processing.


In [1]:
# First, import the libraries
import pandas as pd
import numpy as np
In [2]:
df = D.history_data(instruments=['000300.SHA'], start_date='2016-01-01',
                    end_date='2016-02-01', fields=['open', 'high', 'low', 'close', 'volume', 'amount'])
df.head()
Out[2]:
high low volume open close amount instrument date
0 3726.244629 3468.948486 11537067400 3725.856201 3469.066162 1.459682e+11 000300.SHA 2016-01-04
1 3518.217041 3377.279785 16211698400 3382.177002 3478.779785 1.960171e+11 000300.SHA 2016-01-05
2 3543.739502 3468.466553 14596614400 3482.406494 3539.808105 1.609472e+11 000300.SHA 2016-01-06
3 3481.149902 3284.737305 4410264100 3481.149902 3294.383789 4.713080e+10 000300.SHA 2016-01-07
4 3418.850830 3237.930664 18595945100 3371.871094 3361.563232 2.034989e+11 000300.SHA 2016-01-08

1. Handling Missing Data

1.1 Drop rows that contain missing values:

In [3]:
df_drop = df.dropna()
df_drop.head()
Out[3]:
high low volume open close amount instrument date
0 3726.244629 3468.948486 11537067400 3725.856201 3469.066162 1.459682e+11 000300.SHA 2016-01-04
1 3518.217041 3377.279785 16211698400 3382.177002 3478.779785 1.960171e+11 000300.SHA 2016-01-05
2 3543.739502 3468.466553 14596614400 3482.406494 3539.808105 1.609472e+11 000300.SHA 2016-01-06
3 3481.149902 3284.737305 4410264100 3481.149902 3294.383789 4.713080e+10 000300.SHA 2016-01-07
4 3418.850830 3237.930664 18595945100 3371.871094 3361.563232 2.034989e+11 000300.SHA 2016-01-08
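The market data above happens to contain no missing values, so `dropna` returns the frame unchanged. A minimal sketch of its effect, using synthetic data that does contain NaN:

```python
import pandas as pd
import numpy as np

# Synthetic frame with deliberately missing values
df = pd.DataFrame({'open': [10.0, np.nan, 10.2],
                   'close': [10.1, 10.3, np.nan]})

dropped = df.dropna()    # drop every row that contains at least one NaN
print(dropped)           # only row 0 survives, since it is fully populated
```

`dropna` also accepts `how='all'` (drop only rows where every value is missing) and `subset=` to restrict the check to particular columns.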

1.2 Fill in missing values:

In [4]:
df_fillna = df.fillna(value=0)
df_fillna.head()
Out[4]:
high low volume open close amount instrument date
0 3726.244629 3468.948486 11537067400 3725.856201 3469.066162 1.459682e+11 000300.SHA 2016-01-04
1 3518.217041 3377.279785 16211698400 3382.177002 3478.779785 1.960171e+11 000300.SHA 2016-01-05
2 3543.739502 3468.466553 14596614400 3482.406494 3539.808105 1.609472e+11 000300.SHA 2016-01-06
3 3481.149902 3284.737305 4410264100 3481.149902 3294.383789 4.713080e+10 000300.SHA 2016-01-07
4 3418.850830 3237.930664 18595945100 3371.871094 3361.563232 2.034989e+11 000300.SHA 2016-01-08
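`fillna(value=0)` replaces every NaN with a constant; for price series, forward-filling the last valid observation is often more appropriate. A small sketch on synthetic data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'close': [10.0, np.nan, 10.4]})

zero_filled = df.fillna(value=0)   # constant fill, as in the notebook cell above
ffilled = df.ffill()               # carry the last valid value forward instead
print(zero_filled['close'].iloc[1])  # 0.0
print(ffilled['close'].iloc[1])      # 10.0
```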

1.3 Test whether values are NaN, producing a boolean mask:

In [5]:
df_isnull = pd.isnull(df)
df_isnull.head()
Out[5]:
high low volume open close amount instrument date
0 False False False False False False False False
1 False False False False False False False False
2 False False False False False False False False
3 False False False False False False False False
4 False False False False False False False False
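The boolean mask from `pd.isnull` is handy for counting missing values per column, since `True` counts as 1 when summed. A small sketch on synthetic data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})

mask = pd.isnull(df)          # boolean frame, True where the value is NaN
missing_per_col = mask.sum()  # column-wise count of missing values
print(missing_per_col)        # a: 1, b: 2
```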

2. Applying Functions and Mapping

In [6]:
df.mean()  # column-wise mean
Out[6]:
high      3.238047e+03
low       3.116849e+03
volume    1.207993e+10
open      3.191507e+03
close     3.165315e+03
amount    1.300057e+11
dtype: float32
In [7]:
df.mean(1)  # row-wise mean
Out[7]:
0     2.625088e+10
1     3.537146e+10
2     2.925730e+10
3     8.590179e+09
4     3.701580e+10
5     3.365465e+10
6     2.581192e+10
7     2.382546e+10
8     2.637089e+10
9     2.405715e+10
10    1.992852e+10
11    2.594551e+10
12    2.606307e+10
13    2.216253e+10
14    1.764680e+10
15    1.565532e+10
16    2.332401e+10
17    2.337010e+10
18    1.759769e+10
19    2.013213e+10
20    1.526850e+10
dtype: float32
In [8]:
df.mean(axis=1, skipna=False)  # skipna defaults to True, meaning missing values are excluded
Out[8]:
0     2.625088e+10
1     3.537146e+10
2     2.925730e+10
3     8.590179e+09
4     3.701580e+10
5     3.365465e+10
6     2.581192e+10
7     2.382546e+10
8     2.637089e+10
9     2.405715e+10
10    1.992852e+10
11    2.594551e+10
12    2.606307e+10
13    2.216253e+10
14    1.764680e+10
15    1.565532e+10
16    2.332401e+10
17    2.337010e+10
18    1.759769e+10
19    2.013213e+10
20    1.526850e+10
dtype: float32
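The data above has no NaNs, so `skipna` makes no difference there. A tiny sketch of its effect on a Series that actually contains a missing value:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

print(s.mean())              # 2.0: skipna=True (the default) ignores the NaN
print(s.mean(skipna=False))  # nan: with skipna=False the NaN propagates
```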
In [9]:
sorted_row_df = df.sort_index()  # sort by row labels
sorted_row_df.head()
Out[9]:
high low volume open close amount instrument date
0 3726.244629 3468.948486 11537067400 3725.856201 3469.066162 1.459682e+11 000300.SHA 2016-01-04
1 3518.217041 3377.279785 16211698400 3382.177002 3478.779785 1.960171e+11 000300.SHA 2016-01-05
2 3543.739502 3468.466553 14596614400 3482.406494 3539.808105 1.609472e+11 000300.SHA 2016-01-06
3 3481.149902 3284.737305 4410264100 3481.149902 3294.383789 4.713080e+10 000300.SHA 2016-01-07
4 3418.850830 3237.930664 18595945100 3371.871094 3361.563232 2.034989e+11 000300.SHA 2016-01-08
In [10]:
sorted_col_df = df.sort_index(axis=1)  # sort by column labels
sorted_col_df.head()
Out[10]:
amount close date high instrument low open volume
0 1.459682e+11 3469.066162 2016-01-04 3726.244629 000300.SHA 3468.948486 3725.856201 11537067400
1 1.960171e+11 3478.779785 2016-01-05 3518.217041 000300.SHA 3377.279785 3382.177002 16211698400
2 1.609472e+11 3539.808105 2016-01-06 3543.739502 000300.SHA 3468.466553 3482.406494 14596614400
3 4.713080e+10 3294.383789 2016-01-07 3481.149902 000300.SHA 3284.737305 3481.149902 4410264100
4 2.034989e+11 3361.563232 2016-01-08 3418.850830 000300.SHA 3237.930664 3371.871094 18595945100
In [11]:
# Sorting is ascending by default; you can also sort in descending order
df.sort_index(axis=1, ascending=False)
Out[11]:
volume open low instrument high date close amount
0 11537067400 3725.856201 3468.948486 000300.SHA 3726.244629 2016-01-04 3469.066162 1.459682e+11
1 16211698400 3382.177002 3377.279785 000300.SHA 3518.217041 2016-01-05 3478.779785 1.960171e+11
2 14596614400 3482.406494 3468.466553 000300.SHA 3543.739502 2016-01-06 3539.808105 1.609472e+11
3 4410264100 3481.149902 3284.737305 000300.SHA 3481.149902 2016-01-07 3294.383789 4.713080e+10
4 18595945100 3371.871094 3237.930664 000300.SHA 3418.850830 2016-01-08 3361.563232 2.034989e+11
5 17463838700 3303.124512 3192.449951 000300.SHA 3342.480713 2016-01-11 3192.449951 1.844640e+11
6 12822579600 3214.823486 3174.549805 000300.SHA 3242.253174 2016-01-12 3215.709961 1.420489e+11
7 12066649400 3240.483887 3155.878662 000300.SHA 3257.295410 2016-01-13 3155.878662 1.308861e+11
8 13453767100 3076.644531 3072.038818 000300.SHA 3226.657715 2016-01-14 3221.571289 1.447716e+11
9 12460177900 3200.887939 3101.052246 000300.SHA 3216.797607 2016-01-15 3118.729980 1.318827e+11
10 10112838600 3068.234863 3066.727051 000300.SHA 3165.616455 2016-01-18 3130.729004 1.094583e+11
11 12765389300 3132.704590 3119.216309 000300.SHA 3227.907959 2016-01-19 3223.125000 1.429077e+11
12 13065050400 3204.218262 3154.425781 000300.SHA 3225.949707 2016-01-20 3174.378174 1.433133e+11
13 11118973300 3136.384033 3081.345459 000300.SHA 3202.538086 2016-01-21 3081.345459 1.218562e+11
14 8995108700 3114.895020 3053.958008 000300.SHA 3132.170898 2016-01-22 3113.462646 9.688567e+10
15 8154006100 3129.027832 3105.100342 000300.SHA 3149.332275 2016-01-25 3128.885986 8.577789e+10
16 12693392100 3099.595703 2934.646973 000300.SHA 3106.530273 2016-01-26 2940.508545 1.272507e+11
17 13222936600 2953.182617 2839.289795 000300.SHA 2963.543701 2016-01-27 2930.351807 1.269977e+11
18 9922275600 2909.333984 2841.844238 000300.SHA 2941.886719 2016-01-28 2853.756104 9.566387e+10
19 11330425900 2855.598389 2854.371094 000300.SHA 2965.309326 2016-01-29 2946.090088 1.094624e+11
20 8679534200 2939.040039 2869.561279 000300.SHA 2944.517090 2016-02-01 2901.047607 8.293145e+10
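The three sorting calls above can be verified on a small synthetic frame whose labels start out of order:

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2], 'a': [3, 4]}, index=[1, 0])

by_rows = df.sort_index()                      # sort by row labels
by_cols = df.sort_index(axis=1)                # sort by column labels
desc = df.sort_index(axis=1, ascending=False)  # descending column order
print(list(by_rows.index))    # [0, 1]
print(list(by_cols.columns))  # ['a', 'b']
print(list(desc.columns))     # ['b', 'a']
```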

The methods above are among the most common, but there are many more you can explore on your own; the following list is a reference:

count: number of non-NA values

describe: summary statistics for a Series or for each DataFrame column

min, max: minimum and maximum values

argmin, argmax: integer index positions of the minimum and maximum values

idxmin, idxmax: index labels of the minimum and maximum values

quantile: sample quantile (from 0 to 1)

sum: sum of values

mean: mean of values

median: arithmetic median (the 50% quantile)

mad: mean absolute deviation from the mean

var: sample variance

std: sample standard deviation

skew: sample skewness (third moment)

kurt: sample kurtosis (fourth moment)

cumsum: cumulative sum

cummin, cummax: cumulative minimum and maximum

cumprod: cumulative product

diff: first differences (useful for time series)

pct_change: percentage change
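A few of these methods exercised on a small synthetic Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0])

print(s.sum())               # 7.0
print(s.cumsum().tolist())   # [1.0, 3.0, 7.0]: running total
print(s.idxmax())            # 2: index label where the maximum occurs
print(s.pct_change().iloc[1])  # 1.0, i.e. (2 - 1) / 1
```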

3. Data Wrangling

Pandas provides many methods for easily combining Series, DataFrame, and Panel objects according to various logical relationships:

concat: stack multiple objects together along an axis

append: attach a row (or another frame) to a DataFrame

duplicated, drop_duplicates: flag and remove duplicate rows

3.1 Stacking data with concat

In [12]:
df1 = D.history_data(instruments=['000300.SHA'], start_date='2017-01-01', end_date='2017-01-05',
               fields=['open', 'high', 'low', 'close'])
df1
Out[12]:
high low open close instrument date
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
1 3369.565918 3338.152588 3340.557617 3368.311768 000300.SHA 2017-01-04
2 3373.411865 3361.461914 3368.340088 3367.789307 000300.SHA 2017-01-05
In [13]:
df2 = D.history_data(instruments=['000300.SHA'], start_date='2017-01-06', end_date='2017-01-10',
               fields=['open', 'high', 'low', 'close'])
df2
Out[13]:
high low open close instrument date
0 3368.870605 3346.891357 3366.852295 3347.666504 000300.SHA 2017-01-06
1 3366.221924 3344.347168 3345.748291 3363.901367 000300.SHA 2017-01-09
2 3370.505127 3354.087891 3361.639404 3358.271484 000300.SHA 2017-01-10

Vertical concatenation (the default):

In [14]:
pd.concat([df1, df2], axis=0)
Out[14]:
high low open close instrument date
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
1 3369.565918 3338.152588 3340.557617 3368.311768 000300.SHA 2017-01-04
2 3373.411865 3361.461914 3368.340088 3367.789307 000300.SHA 2017-01-05
0 3368.870605 3346.891357 3366.852295 3347.666504 000300.SHA 2017-01-06
1 3366.221924 3344.347168 3345.748291 3363.901367 000300.SHA 2017-01-09
2 3370.505127 3354.087891 3361.639404 3358.271484 000300.SHA 2017-01-10

Horizontal concatenation; positions whose index labels do not align are filled with NaN:

In [15]:
pd.concat([df1, df2], axis=1)
Out[15]:
high low open close instrument date high low open close instrument date
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03 3368.870605 3346.891357 3366.852295 3347.666504 000300.SHA 2017-01-06
1 3369.565918 3338.152588 3340.557617 3368.311768 000300.SHA 2017-01-04 3366.221924 3344.347168 3345.748291 3363.901367 000300.SHA 2017-01-09
2 3373.411865 3361.461914 3368.340088 3367.789307 000300.SHA 2017-01-05 3370.505127 3354.087891 3361.639404 3358.271484 000300.SHA 2017-01-10
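In the cell above both frames share the index 0, 1, 2, so no NaN appears. The fill behaviour is easier to see on synthetic frames whose indexes genuinely differ:

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'y': [3, 4]}, index=[1, 2])

wide = pd.concat([left, right], axis=1)  # outer join on the index
print(wide)
# index 0 only exists in `left`, so its 'y' is NaN;
# index 2 only exists in `right`, so its 'x' is NaN
```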

3.2 Appending rows with append

In [16]:
df1
Out[16]:
high low open close instrument date
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
1 3369.565918 3338.152588 3340.557617 3368.311768 000300.SHA 2017-01-04
2 3373.411865 3361.461914 3368.340088 3367.789307 000300.SHA 2017-01-05
In [17]:
s = df1.iloc[0]
s
Out[17]:
high                      3345.26
low                       3313.95
open                      3313.95
close                     3342.23
instrument             000300.SHA
date          2017-01-03 00:00:00
Name: 0, dtype: object
In [18]:
df1.append(s, ignore_index=False) # ignore_index=False keeps the original index labels
Out[18]:
high low open close instrument date
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
1 3369.565918 3338.152588 3340.557617 3368.311768 000300.SHA 2017-01-04
2 3373.411865 3361.461914 3368.340088 3367.789307 000300.SHA 2017-01-05
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
In [19]:
df1.append(s, ignore_index=True) # ignore_index=True resets the index
Out[19]:
high low open close instrument date
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
1 3369.565918 3338.152588 3340.557617 3368.311768 000300.SHA 2017-01-04
2 3373.411865 3361.461914 3368.340088 3367.789307 000300.SHA 2017-01-05
3 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
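Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; on modern pandas the same result is obtained with `pd.concat`. A sketch on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
row = df.iloc[0]  # a Series, as in the notebook cell above

# Equivalent of df.append(row, ignore_index=True) on modern pandas:
out = pd.concat([df, row.to_frame().T], ignore_index=True)
print(out['a'].tolist())  # [1, 2, 1]
```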

3.3 Removing duplicates with duplicated and drop_duplicates

In [20]:
z = df1.append(s, ignore_index=False)
z
Out[20]:
high low open close instrument date
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
1 3369.565918 3338.152588 3340.557617 3368.311768 000300.SHA 2017-01-04
2 3373.411865 3361.461914 3368.340088 3367.789307 000300.SHA 2017-01-05
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03

Flag duplicate rows:

In [21]:
z.duplicated()
Out[21]:
0    False
1    False
2    False
0     True
dtype: bool

Remove duplicate rows:

In [22]:
z.drop_duplicates()
Out[22]:
high low open close instrument date
0 3345.262939 3313.953369 3313.953369 3342.227295 000300.SHA 2017-01-03
1 3369.565918 3338.152588 3340.557617 3368.311768 000300.SHA 2017-01-04
2 3373.411865 3361.461914 3368.340088 3367.789307 000300.SHA 2017-01-05
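Both methods also accept `subset` (check only some columns) and `keep` (which occurrence to retain). A small sketch on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'y', 'x']})

print(df.duplicated().tolist())             # [False, False, False]: full rows all differ
print(df.duplicated(subset=['a']).tolist()) # [False, True, False]: second a=1 is a repeat

# Keep the *last* occurrence of each value in column 'a'
deduped = df.drop_duplicates(subset=['a'], keep='last')
print(deduped['b'].tolist())  # ['y', 'x']
```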

4. Grouping

In [23]:
z.groupby('open').sum()
Out[23]:
high low close
open
3313.953369 6690.525879 6627.906738 6684.454590
3340.557617 3369.565918 3338.152588 3368.311768
3368.340088 3373.411865 3361.461914 3367.789307
In [24]:
z.groupby(['open','close']).sum()
Out[24]:
high low
open close
3313.953369 3342.227295 6690.525879 6627.906738
3340.557617 3368.311768 3369.565918 3338.152588
3368.340088 3367.789307 3373.411865 3361.461914
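A self-contained groupby sketch on synthetic data, including several aggregations at once via `agg`:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

summed = df.groupby('key').sum()   # one row per group, values summed
print(summed)                      # a -> 3, b -> 3

# groupby can also apply several aggregations in one call
stats = df.groupby('key')['val'].agg(['mean', 'max'])
print(stats.loc['a', 'mean'])      # 1.5
```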

This article is published by the BigQuant Quant Academy. Copyright belongs to BigQuant; please credit the source when reposting.


(disc) #2

Financial big-data analysis with pandas often relies on the groupby technique, i.e. data aggregation and grouped computation; I think groupby deserves more coverage.