Last updated: the morning of April 10, 2023

1. Data I/O

Reading data

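# CSV
pd.read_csv()
pd.read_table() # like read_csv, but tab-delimited by default
# Excel
pd.read_excel(<path>, sheet_name=0, header=0)
"""
1. path: file path
2. sheet_name: which sheet(s) of the Excel file to read; defaults to the first sheet
	- int: sheet position, e.g. sheet_name=1 reads the second sheet
	- str: sheet name, e.g. sheet_name='总结表'
	- list: returns a dict of DataFrames, e.g. sheet_name=[0, 1, "Sheet5"] reads the first sheet, the second sheet, and the sheet named Sheet5
3. header: row(s) to use as the column labels; defaults to 0
	- int: position of the header row
	- list: positions of several rows, producing multi-level column labels
4. index_col: column(s) to use as the row labels; defaults to None
"""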

Writing data

df.to_excel(<path>)
df.to_csv(<path>)
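The read and write calls above can be tried without touching the file system. The following is a minimal sketch that round-trips a small frame through an in-memory text buffer; the sample data and the variable names are hypothetical and only serve as illustration.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

# Write CSV text to an in-memory buffer instead of a real file path.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row labels out of the file

# Rewind the buffer and read the same text back.
buffer.seek(0)
restored = pd.read_csv(buffer, header=0)  # header=0: first line holds the column names

print(restored)
```

Without `index=False`, `to_csv` also writes the row labels as an extra first column, which then has to be picked up again with `index_col=0` when reading.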

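2. Data structures

Series: a one-dimensional labeled array, pandas.Series(data, index, dtype, name, copy)
DataFrame: a two-dimensional labeled table, pandas.DataFrame(data, index, columns, dtype, copy)

<index>: row labels

<columns>: column labels

df.values returns the DataFrame's data as an ndarray.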

Indexing and slicing

Setting the index

# Series
pd.Series(<data>, index=<index>)
pd.Series({<key, used as the index label>: <value>})

# Use time as the index: pd.DatetimeIndex()
'''
pandas time objects:
1. Timestamp: a single point in time
pd.Timestamp(2018, 10, 1)
pd.Timestamp("2018-10-1 10:00:1")
2. Period: a span of time
'''
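As a small illustration of a time-based index, the sketch below builds a Series on a DatetimeIndex and then uses a Timestamp and a Period against it. The dates and values are made up for the example.

```python
import pandas as pd

# Build a DatetimeIndex from date strings (hypothetical dates).
dates = pd.DatetimeIndex(["2018-10-01", "2018-10-02", "2018-10-03"])
sr = pd.Series([1.0, 2.0, 3.0], index=dates)

# A Timestamp marks a single instant and can be used as a row label.
ts = pd.Timestamp("2018-10-02")
value_on_ts = sr.loc[ts]

# A Period covers a span of time, here the whole of October 2018.
month = pd.Period("2018-10", freq="M")
ts_in_month = month.start_time <= ts <= month.end_time

print(value_on_ts, ts_in_month)
```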

Inspecting row/column labels

sr.index # RangeIndex(start=0, stop=5, step=1)
df.index # RangeIndex(start=0, stop=5, step=1)
df.index.values # ndarray
df.columns # Index(['姓名', '班级'], dtype='object')
df.columns.values # ndarray
df.axes # list containing the row index and the column index

Indexing/slicing by row

Indexing: df.iloc

df.iloc[0] # Series
df.iloc[[0]] # DataFrame

Slicing: df[] / df.iloc

# Consecutive rows; the start is included and the end is excluded
df[2:3] # DataFrame
df.iloc[2:3] # DataFrame
# Non-consecutive rows
df.iloc[[1,3]] # DataFrame

Indexing/slicing by column

df[] / df.loc[] / df.iloc[]

# Series
testdf3['A'] # single column by name
testdf3.loc[:,'A'] # single column by name
testdf3.iloc[:,0] # single column by position

# DataFrame
testdf3[['A','B']] # one or more columns by name
testdf3.loc[:,['A','B']] # one or more columns by name
testdf3.iloc[:,[0, 1]] # one or more columns by position

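Indexing/slicing rows and columns together

df.loc[] / df.iloc[]

df.iloc[<rows>,<columns>]
df.iloc[[1,3],0] # 1. Series
df.iloc[[1,3],[0]] # 2. DataFrame
df.iloc[[1,3],[1,3]] # 2. DataFrame
df.iloc[[1,3],1:3] # 3. DataFrame

df.loc[1,["A","D"]] # Series, like case 1 above
df.loc[[1],["A","D"]] # DataFrame, like case 2 above
df.loc[[1,3],["A","D"]] # DataFrame, like case 2 above
df.loc[[1,3],"A":"D"] # DataFrame, like case 3 above (label slices include both endpoints)

df.loc selects by label; df.iloc selects by integer position.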
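The difference between label-based and position-based selection is easiest to see when the row labels are not integers. The sketch below assumes a small hypothetical frame with string row labels; it also shows that a scalar selector yields a Series while a list selector yields a DataFrame.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

row_as_series = df.iloc[0]    # scalar position -> Series
row_as_frame = df.iloc[[0]]   # list of positions -> DataFrame

# Position slices exclude the end; label slices include it.
by_position = df.iloc[1:3, 0:2]
by_label = df.loc["x":"y", "A":"B"]

print(by_position.equals(by_label))
```

Both slices select rows `x` and `y` and columns `A` and `B`, even though the position slice stops before index 3 and the label slice names its last element explicitly.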

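Basic information

Inspecting basic information about the data

First/last rows

df.head(n) # first n rows
df.tail(n) # last n rows

Dimensions (number of rows and columns)
df.shape # (3453, 27)
Data types / index
df.dtypes # data type of each column
Row/column labels
df.index
df.columns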

Overall summary

# Non-null counts and data types
df.info()

# Summary statistics
df.describe()

3. Data types

Time types

Timestamp: a single point in time

pd.Timestamp(2018, 10, 1)
pd.Timestamp("2018-10-1 10:00:1")

Period: a span of time

4. Data operations

Deleting, modifying, merging

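# Drop rows that satisfy a condition on a column
df = df.drop(df[<some boolean condition>].index) # single condition
df = df.drop(df[<some boolean condition> | <the other condition>].index) # multiple conditions
df_clear = df.drop(df[df['x']<0.01].index)

# Drop missing values
df.dropna()
'''
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
df.dropna() drops every row that contains at least one missing value
df.dropna(axis='columns') drops columns instead of rows
df.dropna(thresh=2) keeps only the rows that have at least 2 non-missing values
df.dropna(subset=['name', 'born']) drops rows with a missing value in any of the listed columns
'''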

# Rename columns (returns a new DataFrame)
df.rename(columns={'a':'A'})
# Change the data type
df[<col>].astype(<dtype>)
# Apply an expression to each element
series.apply(lambda x:x+1)

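# Combine with concat
con = pd.concat([df1, df2, ...], axis=<axis>, join=<join>, ...)
'''
- axis: 0 stacks rows / 1 places columns side by side
- join: how the other axis is combined, inner (intersection) / outer (union)
- ignore_index: False keeps the original index (default) / True discards it and renumbers
- keys: builds a multi-level index
'''
# Combine with merge
match = pd.merge(data07,data17,on=["标准地号","树号"])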
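To make the drop, concat, and merge calls above concrete, here is a minimal sketch on two tiny frames. The column names `plot`, `tree`, `dbh_2007`, and `dbh_2017` and all values are hypothetical stand-ins and are not taken from the data used in this post.

```python
import pandas as pd

# Two hypothetical survey tables that share the key columns "plot" and "tree".
df1 = pd.DataFrame({"plot": [1, 1, 2], "tree": [1, 2, 1], "dbh_2007": [10.2, 0.0, 8.5]})
df2 = pd.DataFrame({"plot": [1, 2, 2], "tree": [1, 1, 2], "dbh_2017": [14.1, 11.3, 6.0]})

# Drop the rows that match a condition (here: non-positive measurements).
df1_clean = df1.drop(df1[df1["dbh_2007"] <= 0].index)

# Stack frames row-wise; ignore_index=True renumbers the rows from 0.
stacked = pd.concat([df1_clean, df1_clean], axis=0, ignore_index=True)

# Merge on both key columns; the default is an inner join, so only
# (plot, tree) pairs present in both tables remain.
matched = pd.merge(df1_clean, df2, on=["plot", "tree"])

print(matched)
```

Selecting with the inverted mask, `df1[df1["dbh_2007"] > 0]`, gives the same rows as the `drop` call and is often easier to read.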

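Counting, sorting, unique values

series.value_counts() # count each distinct value in the Series, sorted by count (descending)
df.dtypes.value_counts() # how many columns have each data type
df.sort_values(by=<col>, ascending=True) # sort by a column, ascending by default

df[<col>].unique() # all unique values in the column
df[<col>].nunique() # number of unique values in the column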
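A short sketch of these four calls on a hypothetical frame; the column names `cls` and `score` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data: a class label and a score per row.
df = pd.DataFrame({
    "cls": ["b", "a", "b", "c", "b", "a"],
    "score": [70, 95, 88, 60, 91, 82],
})

counts = df["cls"].value_counts()                      # most frequent value first
ranked = df.sort_values(by="score", ascending=False)   # highest score first
distinct = df["cls"].unique()                          # in order of first appearance
n_distinct = df["cls"].nunique()

print(counts)
print(n_distinct)
```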

Computation

df.corr() # correlation matrix

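Selecting by condition

# Return the first n rows after sorting by the given column(s) in descending order
df.nlargest(n, columns, keep)
'''
n: number of rows
columns: column name(s) to sort by
keep: how to handle ties ('first', 'last', or 'all')
'''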
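The effect of `keep` only shows up when values tie at the cutoff. The sketch below uses a hypothetical frame in which two rows share the second-highest score.

```python
import pandas as pd

# Hypothetical data: rows "a" and "c" tie on score 90.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [90, 95, 90, 80]})

# keep="first": among tied rows, the one that appears first wins.
top2_first = df.nlargest(2, "score", keep="first")

# keep="all": every row tied with the last selected value is returned,
# so the result can contain more than n rows.
top2_all = df.nlargest(2, "score", keep="all")

print(top2_first)
print(top2_all)
```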

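Grouping and aggregation

Similar to GROUP BY in a database: groupby returns a DataFrameGroupBy object on which further operations, such as database-style aggregate functions, can be run.

# Create the group object
grouped = df.groupby('Gender')
grouped.size() # number of rows in each group
# Gender
# Female    3
# Male      5
# dtype: int64

grouped = df.groupby(['Gender', 'Age'])
grouped.size()
# Gender  Age
# Female  17     1
#         18     1
#         22     1
# Male    18     1
#         19     1
#         20     2
#         21     1
# dtype: int64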
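Beyond counting group sizes, a group object can be aggregated much like a SQL GROUP BY query. The following is a minimal sketch; the `Score` column and all values are hypothetical additions for illustration.

```python
import pandas as pd

# Hypothetical data with one numeric column to aggregate.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Age": [18, 17, 20, 22, 20],
    "Score": [80, 90, 70, 85, 75],
})

grouped = df.groupby("Gender")

sizes = grouped.size()                          # rows per group
mean_score = grouped["Score"].mean()            # one aggregate per group
summary = grouped["Score"].agg(["min", "max"])  # several aggregates at once

print(summary)
```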

Conversion

ndarray

import numpy as np
import pandas as pd

## ndarray -> DataFrame
my_array = np.array([[11,22,33],[44,55,66]])
df = pd.DataFrame(my_array, columns = ['Column_A','Column_B','Column_C'], index = ['Item_1', 'Item_2'])
print(df)
print(type(df))

#         Column_A  Column_B  Column_C
# Item_1        11        22        33
# Item_2        44        55        66
# <class 'pandas.core.frame.DataFrame'>

## DataFrame -> ndarray
df.values

## ndarray -> list
<ndarray>.tolist()

5. Data cleaning

Missing values

When a local CSV file is read, empty cells are also represented as NaN in the DataFrame.

# Total missing values per column, in descending order
df.isnull().sum().sort_values(ascending=False)
df.isna() # same as isnull(): element-wise mask of missing values

Filling missing values

df.fillna

# df.fillna(<value>, <method>, <inplace>)
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3} # dict: keys are column names, values are the fill values
df.fillna(value=values)
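Counting, filling, and dropping missing values can be tried together on a small frame. This is a sketch on made-up data; the column names and fill values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in columns A and B.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7.0, 8.0, 9.0],
})

# Missing values per column, largest count first.
missing_per_column = df.isnull().sum().sort_values(ascending=False)

# Fill each column with its own value; columns not in the dict are left alone.
filled = df.fillna(value={"A": 0, "B": 1})

# Keep only the rows that have at least 2 non-missing values.
at_least_two = df.dropna(thresh=2)

print(missing_per_column)
```

Both `fillna` and `dropna` return a new DataFrame here; the original `df` still contains its NaN values.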

Duplicates

# Count duplicated rows
df.duplicated().sum()
# Remove duplicates
df.drop_duplicates(inplace = True)
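A brief sketch of how duplicates are counted and removed, on a hypothetical frame. It also shows the `subset` and `keep` parameters, which restrict the comparison to certain columns and choose which occurrence survives.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical.
df = pd.DataFrame({"name": ["a", "b", "a", "a"], "cls": [1, 2, 1, 3]})

# duplicated() marks a row as True when an identical row appeared earlier.
n_dupes = df.duplicated().sum()

# Without inplace=True, drop_duplicates returns a new frame
# that keeps the first occurrence of each row.
deduped = df.drop_duplicates()

# Compare on "name" only and keep the last occurrence of each name.
by_name = df.drop_duplicates(subset=["name"], keep="last")

print(n_dupes)
print(by_name)
```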

One-hot encoding

pd.get_dummies(<df>)
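As an illustration of one-hot encoding, the sketch below encodes a single categorical column of a hypothetical frame. The column names are made up, and the dtype of the indicator columns depends on the pandas version, so the example casts them to int before printing.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Encode only the listed column; each category becomes its own indicator
# column named <column>_<category>, and other columns are kept unchanged.
encoded = pd.get_dummies(df, columns=["color"])

# Cast the indicators to int for a version-independent view.
indicators = encoded[["color_blue", "color_red"]].astype(int)

print(indicators)
```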

Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!

Data Containers (1): Pandas Cheat Sheet