|
楼主 |
发表于 2024-4-7 11:09
|
显示全部楼层
本帖最后由 bala 于 2024-9-18 16:58 编辑
(...接上一节)
3.Pandas教程
我们建议学习基础教程:https://www.runoob.com/pandas/pandas-tutorial.html
后面不管是数据处理还是绘图,pandas将会是一个非常非常重要的库,广泛应用在学术、金融、统计学等各个数据分析领域。Pandas 的主要数据结构是 Series (一维数据)与 DataFrame(二维数据)。
首先我们打开cmd,输入,安装完成后即可在开发环境中使用。
首先我们了解一下Series和Dataframe:
Series: 类似于一维数组或列表,是由一组数据以及与之相关的数据标签(索引)构成。Series 可以看作是 DataFrame 中的一列,也可以是单独存在的一维数据结构。
我们可以打个比方理解为是只有两列的excel单元格,一列索引,一列数据。
DataFrame: 类似于一个二维表格,它是 Pandas 中最重要的数据结构。DataFrame 可以看作是由多个 Series 按列排列构成的表格,它既有行索引也有列索引,因此可以方便地进行行列选择、过滤、合并等操作。
我们也可以理解为有多行多列的excel表格的形式。
(当然,我们常用的excel表格内的数据与dataframe可以快速简便的转换,后面的小节会提到)
(1)pandas读取和写入数据
①直接读取
我们有这么一坨数据:
- [{"type":"A.01","Datetime":"2024/02/17 12","TEM":12.7},{"type":"A.01","Datetime":"2024/02/17 13","TEM":10.6},{"type":"A.01","Datetime":"2024/02/17 14","TEM":10.6},{"type":"A.01","Datetime":"2024/02/17 15","TEM":8.7},{"type":"A.01","Datetime":"2024/02/17 16","TEM":7.6},{"type":"A.01","Datetime":"2024/02/17 17","TEM":9.8},{"type":"A.01","Datetime":"2024/02/17 18","TEM":11.0},{"type":"A.01","Datetime":"2024/02/17 19","TEM":10.7},{"type":"A.01","Datetime":"2024/02/17 20","TEM":10.7},{"type":"A.01","Datetime":"2024/02/17 21","TEM":10.7},{"type":"A.01","Datetime":"2024/02/17 22","TEM":10.7},{"type":"A.01","Datetime":"2024/02/17 23","TEM":10.5},{"type":"A.01","Datetime":"2024/02/18 00","TEM":10.4},{"type":"A.01","Datetime":"2024/02/18 01","TEM":11.3},{"type":"A.01","Datetime":"2024/02/18 02","TEM":14.5},{"type":"A.01","Datetime":"2024/02/18 03","TEM":16.5},{"type":"A.01","Datetime":"2024/02/18 04","TEM":18.0},{"type":"A.01","Datetime":"2024/02/18 05","TEM":18.8},{"type":"A.01","Datetime":"2024/02/18 06","TEM":19.4},{"type":"A.01","Datetime":"2024/02/18 07","TEM":20.0},{"type":"A.01","Datetime":"2024/02/18 08","TEM":19.5},{"type":"A.01","Datetime":"2024/02/18 09","TEM":18.7},{"type":"A.01","Datetime":"2024/02/18 10","TEM":17.9},{"type":"A.01","Datetime":"2024/02/18 11","TEM":16.5},{"type":"A.01","Datetime":"2024/02/18 12","TEM":12.9}]
复制代码
这是一坨列表嵌套字典的数据,每一个{字典}是最外层[列表]的一个值,将其转为dataframe数据非常简单,只需要下面的代码:
- import pandas as pd
- a = [{"type":"A.01","Datetime":"2024/02/17 12","TEM":12.7},{"type":"A.01","Datetime":"2024/02/17 13","TEM":10.6},{"type":"A.01","Datetime":"2024/02/17 14","TEM":10.6},{"type":"A.01","Datetime":"2024/02/17 15","TEM":8.7},{"type":"A.01","Datetime":"2024/02/17 16","TEM":7.6},{"type":"A.01","Datetime":"2024/02/17 17","TEM":9.8},{"type":"A.01","Datetime":"2024/02/17 18","TEM":11.0},{"type":"A.01","Datetime":"2024/02/17 19","TEM":10.7},{"type":"A.01","Datetime":"2024/02/17 20","TEM":10.7},{"type":"A.01","Datetime":"2024/02/17 21","TEM":10.7},{"type":"A.01","Datetime":"2024/02/17 22","TEM":10.7},{"type":"A.01","Datetime":"2024/02/17 23","TEM":10.5},{"type":"A.01","Datetime":"2024/02/18 00","TEM":10.4},{"type":"A.01","Datetime":"2024/02/18 01","TEM":11.3},{"type":"A.01","Datetime":"2024/02/18 02","TEM":14.5},{"type":"A.01","Datetime":"2024/02/18 03","TEM":16.5},{"type":"A.01","Datetime":"2024/02/18 04","TEM":18.0},{"type":"A.01","Datetime":"2024/02/18 05","TEM":18.8},{"type":"A.01","Datetime":"2024/02/18 06","TEM":19.4},{"type":"A.01","Datetime":"2024/02/18 07","TEM":20.0},{"type":"A.01","Datetime":"2024/02/18 08","TEM":19.5},{"type":"A.01","Datetime":"2024/02/18 09","TEM":18.7},{"type":"A.01","Datetime":"2024/02/18 10","TEM":17.9},{"type":"A.01","Datetime":"2024/02/18 11","TEM":16.5},{"type":"A.01","Datetime":"2024/02/18 12","TEM":12.9}]
- df = pd.DataFrame(a)
- print(df)
复制代码
结果:
type Datetime TEM
0 A.01 2024/02/17 12 12.7
1 A.01 2024/02/17 13 10.6
2 A.01 2024/02/17 14 10.6
3 A.01 2024/02/17 15 8.7
4 A.01 2024/02/17 16 7.6
5 A.01 2024/02/17 17 9.8
6 A.01 2024/02/17 18 11.0
7 A.01 2024/02/17 19 10.7
8 A.01 2024/02/17 20 10.7
9 A.01 2024/02/17 21 10.7
10 A.01 2024/02/17 22 10.7
11 A.01 2024/02/17 23 10.5
12 A.01 2024/02/18 00 10.4
13 A.01 2024/02/18 01 11.3
14 A.01 2024/02/18 02 14.5
15 A.01 2024/02/18 03 16.5
16 A.01 2024/02/18 04 18.0
17 A.01 2024/02/18 05 18.8
18 A.01 2024/02/18 06 19.4
19 A.01 2024/02/18 07 20.0
20 A.01 2024/02/18 08 19.5
21 A.01 2024/02/18 09 18.7
22 A.01 2024/02/18 10 17.9
23 A.01 2024/02/18 11 16.5
24 A.01 2024/02/18 12 12.9
这个形式有点像excel,大胆点,跟excel基本上一样!
②读取csv
我地有一坨温江的探空数据(见附件),格式为CSV,打开方式有很多,我们可以直接用wps或者excel打开,当然你也可以用编程软件打开,例如vscode等等。
(注意:打开时可能会遇到编码的问题,例如:UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 0: invalid start byte,意思是无法使用UTF-8解码,我们可以加上encoding参数,具体的错误代码可复制自行百度)
用pandas打开方式如下:
- import pandas as pd
- df = pd.read_csv("2023030700-56187.csv")
- print(df)
复制代码
输出结果:
time longitude latitude pressure geopotential height temperature dew point temperature ice point temperature relative humidity humidity wrt ice mixing ratio wind direction wind speed
0 2023-03-06 23:15:12 103.870 30.750 954.5 549 13.1 12.0 12.0 93 93 9.27 0.0 0.0
1 2023-03-06 23:15:20 103.870 30.750 949.2 595 13.1 11.3 11.3 89 89 8.89 0.0 0.0
2 2023-03-06 23:15:42 103.870 30.750 934.9 725 12.3 10.9 10.9 91 91 8.79 81.0 1.6
3 2023-03-06 23:15:57 103.870 30.750 925.0 813 12.1 10.2 10.2 88 88 8.48 87.0 1.5
4 2023-03-06 23:16:14 103.869 30.750 913.8 914 11.6 9.8 9.8 89 89 8.35 94.0 1.4
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
94 2023-03-06 23:55:12 104.354 30.770 109.1 15877 -69.9 0 266.0 44.1
95 2023-03-06 23:55:47 104.371 30.770 104.9 16116 -71.5 0 269.0 44.2
96 2023-03-06 23:56:12 104.383 30.771 101.9 16287 -71.2 0 271.0 44.3
97 2023-03-06 23:56:21 104.388 30.771 101.0 16348 -71.1 0 271.0 44.4
98 2023-03-06 23:56:29 104.391 30.771 100.0 16404 -71.6 0 271.0 44.4
[99 rows x 13 columns]
③读取excel表格
方法基本上跟上面一样
- import pandas as pd
- df = pd.read_excel("hainan-2024040316.xlsx")
- print(df)
复制代码
输出结果:
type Station_Id_C Station_Name Lat Lon Alti Datetime PRS PRS_Sea PRS_Max PRS_Min TEM TEM_Max TEM_Min RHU PRE_1h PRE_24h WIN_D_INST_Max WIN_S_Inst_Max VIS VIS_Min
0 A.01 N2253 三江良口和里 25.74 109.49 200.3 2024/04/03 16 986.5 1009.6 986.5 985.9 18.1 18.4 18.0 95.0 0.0 0.3 240.0 1.6 NaN NaN
1 A.01 N4033 岑溪马路昙容 22.93 110.75 91.0 2024/04/03 16 NaN NaN NaN NaN 25.1 26.2 25.1 NaN 0.0 NaN NaN NaN NaN NaN
2 A.01 N6252 平果马头 23.32 107.59 108.8 2024/04/03 16 992.4 NaN 992.4 992.0 28.4 29.1 28.4 75.0 0.0 NaN 151.0 4.6 NaN NaN
3 A.01 N5853 平南平南街道 23.55 110.39 32.5 2024/04/03 16 1003.0 1006.7 1003.6 1003.0 28.3 28.8 28.3 78.0 0.0 0.0 122.0 3.0 NaN NaN
4 A.01 N2394 融水镇 25.08 109.25 122.7 2024/04/03 16 995.2 1009.4 995.2 994.5 19.8 19.8 19.7 84.0 0.0 6.7 38.0 5.6 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3102 A.01 N7018 浦寨 21.99 106.69 250.0 2024/04/03 16 NaN NaN NaN NaN 26.7 27.1 26.7 NaN 0.0 NaN NaN NaN NaN NaN
3103 A.01 N6127 凌云逻楼安水 24.36 106.72 875.0 2024/04/03 16 NaN NaN NaN NaN 24.6 25.2 24.5 NaN 0.0 NaN NaN NaN NaN NaN
3104 A.01 N1136 中和贵德 22.64 108.75 117.0 2024/04/03 16 992.2 1005.6 992.2 992.0 24.5 25.7 24.5 93.0 0.0 0.0 125.0 2.8 NaN NaN
3105 A.01 N2025 大坡 25.15 109.52 180.0 2024/04/03 16 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
3106 A.01 N8258 百乐 24.61 107.08 851.0 2024/04/03 16 911.9 NaN 911.9 911.3 22.5 23.5 22.5 78.0 0.0 NaN 38.0 8.6 NaN NaN
④导出为csv
我们使用①内a的数据:
- import pandas as pd
- df = pd.DataFrame(a)
- df.to_csv('114514.csv')
复制代码
此时a的数据就被写入了csv文件内
⑤导出为excel
- import pandas as pd
- df = pd.DataFrame(a)
- df.to_excel('114514.xlsx',sheet_name='Sheet1')
复制代码
当然,也可以不加sheet_name=...,会自动新建工作簿并导入数据
(2)pandas处理数据
①拼接和追加
Concat:将 dataframe 拼接在一起(可按 rows 或 columns)。
Merge:类似于 SQL 中 Join 的用法。
Append:将数据按 rows 拼接到 dataframe 中。
Concat:
- import pandas as pd
- import numpy as np
- df=pd.DataFrame(np.random.randn(10,4))
- df1 = df[:3]
- print(df1)
- df2 = df[3:7]
- print(df2)
- df3 = df[7:]
- print(df3)
- df4 = pd.concat([df1,df2,df3],axis=0) #拼接
- print(df4)
复制代码
结果:
Merge:
- import pandas as pd
- left=pd.DataFrame({'key':['city','temp'],'value':['济南',26]})
- right=pd.DataFrame({'key':['city','temp'],'value':['兰州',17]})
- df = pd.merge(left,right,on='key')#根据 key 进行连接
- print(df)
复制代码
结果:
key value_x value_y
0 city 济南 兰州
1 temp 26 17
Append:
- df=pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'] ,index=range(1,9))
- s=df.iloc[0] #提取第一行数据
- df2 = df._append(s,ignore_index=False) #ignore_index 若为 Ture 则插入数据后索引将更新,否则保持原有索引值
- print(df2)
复制代码
结果:
A B C D
1 -0.545607 -1.674156 0.407157 -0.197993
2 -2.879381 -0.379466 -1.114840 -0.458665
3 -0.475737 -0.587428 -1.622563 0.435147
4 0.688105 -0.634619 0.274038 0.271619
5 -0.999742 -1.958650 -0.446780 0.858811
6 -0.504541 -1.107137 -0.834836 -0.569136
7 0.964024 -1.140281 0.711705 0.716114
8 -0.481429 0.421924 1.125234 -0.099209
1 -0.545607 -1.674156 0.407157 -0.197993
第一行复制出来之后被拼接到了最后一行
csv、excel等文件读取后也可按照上面的方法进行拼接。
②检索数据
- import pandas as pd
- df = pd.read_excel("hainan-2024040316.xlsx")
- #直接切片获取数据(行根据位置,列根据列名)
- df1 = df['Station_Id_C'] #根据属性名,获取列
- df2 = df[0:3] #切片获取位置(0:2)的数据,相当于 df.iloc[0:3]
- print(df1)
- print(df2)
- #利用 loc 根据标签值 Label 获取数据:可以交叉取值
- df3 = df.loc[0:3,['Station_Id_C','PRE_1h']] #获取索引值为(0:3)的中'Station_Id_C'的数据
- df4 = df.loc[1] #获取索引值为 2 的所有数据
- print(df3)
- print(df4)
- #利用 iloc 根据位置获取 Position 数据
- df5 = df.iloc[1] #获取第二行的所有数据
- df6 = df.iloc[1:3,[0,3]] #获取第二、三行,一、四列的数据
- print(df5)
- print(df6)
- #利用布尔值判断取数
- df7 = df[df['PRE_24h']>50] #提取出现24雨量大于50的站点数据
- print(df7)
- #isin()方法,类似于 SQL 中的 in 方法
- stidlist = ['57859','57927','57941','57942','57947','57948']
- df8 = df[df['Station_Id_C'].isin(stidlist)] #与 stidlist 中的所需站号匹配选取 df 中记录
- print(df8)
复制代码
结果太长了就不详细列出来了。
注意:大量数据情况下请使用csv,毕竟csv文件是一种专门的数据结构,读取同样的数据excel类型文件会比csv文件慢很多很多!
|
本帖子中包含更多资源
您需要 登录 才可以下载或查看,没有账号?立即注册
×
|