A Summary of the Eight Life-Cycle Stages of Python pandas

Updated: 2022-10-21 08:22:01  Author: Python集中營

This article walks through eight stages in the life cycle of data processing with pandas and summarizes how the framework handles data at each stage.

In other words, it summarizes how pandas is used across eight processing steps, starting from the pandas data table object and continuing through to data aggregation and data statistics.

First, import the third-party Python libraries. Besides pandas, data analysis work usually also relies on NumPy, the scientific computing library.

# Importing the pandas library and giving it the alias pd.
import pandas as pd

# Importing the numpy library and giving it the alias np.
import numpy as np

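1. The data table object (DataFrame)

Data analysis in pandas relies mainly on operations on the DataFrame object to extract, aggregate, and summarize data.

There are two ways to initialize a DataFrame. The first is to read an Excel or CSV file directly, which returns the data as a DataFrame.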

# Read the CSV file; read_csv already returns a DataFrame.
dataframe_csv = pd.read_csv('./data.csv')

# Read the Excel file; read_excel already returns a DataFrame.
dataframe_xlsx = pd.read_excel('./data.xlsx')
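The file `./data.csv` is not included with the article, so the following is only a minimal sketch that feeds `pd.read_csv` an in-memory string instead of a file. The sample rows and the names `csv_text` and `df_from_csv` are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for ./data.csv.
csv_text = "language,age\nJava,23\nPython,20\nC++,28\n"

# read_csv accepts any file-like object and already returns a DataFrame,
# so no extra pd.DataFrame(...) wrapper is needed.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(type(df_from_csv).__name__)
print(df_from_csv.shape)
```

Reading from a string this way is handy for trying out parsing options before pointing the call at a real file.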

The second is to build the data yourself, initializing a DataFrame directly from a Python object such as a dictionary.

# Creating a dataframe with two columns: `編程語言` (programming language) and `已誕生多少年` (years since creation).
dataframe = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                          "已誕生多少年": [23, 20, 28]},
                         columns=['編程語言', '已誕生多少年'])
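A dictionary of columns is not the only Python object that can be turned into a DataFrame. As a small sketch, a list of row dictionaries works as well; the English column names `language` and `age` and the variable names are assumptions chosen for this example.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
records = [
    {"language": "Java", "age": 23},
    {"language": "Python", "age": 20},
    {"language": "C++", "age": 28},
]

# The columns argument fixes the column order explicitly.
df_from_records = pd.DataFrame(records, columns=["language", "age"])

print(df_from_records)
```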

2. DataFrame structure information

The DataFrame object has built-in methods and attributes for inspecting the dimensions, column names, data types, and other information about the data.

# Creating a dataframe with two columns: `編程語言` (programming language) and `已誕生多少年` (years since creation).
dataframe = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                          "已誕生多少年": [23, 20, 28]},
                         columns=['編程語言', '已誕生多少年'])

dataframe.info()

Shows basic information about the data table, including the number of columns, the data types, the column names, and the memory usage.

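dataframe.info()

# Output for the three-row dataframe created above (pandas 1.x; the memory
# figure depends on the pandas version, so it is elided here):
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 2 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   編程語言    3 non-null      object
#  1   已誕生多少年  3 non-null      int64
# dtypes: int64(1), object(1)
# memory usage: ...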

dataframe.columns

Returns the names of all the columns of the DataFrame as an Index object.

print('The column names are: {0}'.format(dataframe.columns))

# The column names are: Index(['編程語言', '已誕生多少年'], dtype='object')

dataframe['column name'].dtype

Returns the data type (dtype) of a single column of the DataFrame.

print('The dtype of column 編程語言 is: {0}'.format(dataframe[u'編程語言'].dtype))

# The dtype of column 編程語言 is: object

dataframe.shape

The shape attribute of the DataFrame reports how many rows and columns the data has, as a (rows, columns) tuple.

print('The shape of the dataframe is: {0}'.format(dataframe.shape))

# The shape of the dataframe is: (3, 2)

dataframe.values

The values attribute of the DataFrame returns all of the data as a NumPy array.

# Print the underlying NumPy array on its own lines.
print('The values of the dataframe are:\n{0}'.format(dataframe.values))

# The values of the dataframe are:
# [['Java' 23]
#  ['Python' 20]
#  ['C++' 28]]
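The structure attributes can also be collected in one place. The following is a small sketch under assumed names (`df`, `n_rows`, `n_cols`, `dtype_by_column`, and English column names); it uses only `shape` and `dtypes`.

```python
import pandas as pd

df = pd.DataFrame({"language": ["Java", "Python", "C++"],
                   "age": [23, 20, 28]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

# dtypes is a Series indexed by column name; convert it to a plain dict of strings.
dtype_by_column = {column: str(dtype) for column, dtype in df.dtypes.items()}

print(n_rows, n_cols)
print(dtype_by_column)
```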

3. Data cleaning

Data cleaning normalizes the data in the DataFrame, for example by filling in missing values, removing duplicate rows, and converting columns to a consistent data type.

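dataframe.fillna()

# Fill every missing value with 0. fillna returns a new object and leaves the original unchanged.
filled_with_zero = dataframe.fillna(value=0)

# Fill the missing values of one column with that column's mean and assign the result back.
dataframe[u'已誕生多少年'] = dataframe[u'已誕生多少年'].fillna(dataframe[u'已誕生多少年'].mean())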

map(str.strip)

# Strip leading and trailing whitespace from the column, then assign the result back to it.

dataframe[u'編程語言'] = dataframe[u'編程語言'].map(str.strip)

dataframe.astype

# Change the data type of a column. astype returns a new Series, so assign it back.

dataframe[u'已誕生多少年'] = dataframe[u'已誕生多少年'].astype('int')

dataframe.rename

# Rename a column (here to `語言年齡`, "language age"). rename returns a new DataFrame
# and leaves the original unchanged.

renamed = dataframe.rename(columns={u'已誕生多少年': u'語言年齡'})

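dataframe.drop_duplicates

# Drop the duplicate values of a single column; the result is a Series.

dataframe[u'編程語言'].drop_duplicates()

# To drop whole rows that repeat a value in that column, pass the column as subset.
dataframe.drop_duplicates(subset=[u'編程語言'])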

dataframe.replace

# Replace a given value in a column ('Java' becomes 'C#'); the result is a new Series.

dataframe[u'編程語言'].replace('Java', 'C#')
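The cleaning calls are easier to follow on data that actually needs cleaning. The sketch below chains them on a small made-up frame; the dirty sample values, the English column names, and the names `raw` and `cleaned` are all assumptions for illustration. It uses the vectorized `.str.strip()` accessor, which has the same effect as `map(str.strip)` on string data.

```python
import numpy as np
import pandas as pd

# Hypothetical dirty data: stray spaces, one missing age, one duplicated row.
raw = pd.DataFrame({
    "language": [" Java", "Python ", "C++", "C++"],
    "age": [23.0, np.nan, 28.0, 28.0],
})

cleaned = raw.copy()

# Strip leading and trailing whitespace from the text column.
cleaned["language"] = cleaned["language"].str.strip()

# Fill the missing age with the column mean, then convert to integers
# (astype(int) truncates the fractional part).
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean()).astype(int)

# Keep the first row for each language and renumber the index.
cleaned = cleaned.drop_duplicates(subset=["language"]).reset_index(drop=True)

# Give the column a more descriptive name.
cleaned = cleaned.rename(columns={"age": "language_age"})

print(cleaned)
```

Every step returns a new object, which is why each result is assigned back to `cleaned`.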

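4. Data preprocessing

Data preprocessing refers to the processing applied to data before the main processing step.

For example, before most geophysical areal survey data is transformed or enhanced, the irregularly distributed survey grid is first interpolated onto a regular grid so that it is easier for a computer to work with.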

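Merging data

DataFrame data can be merged in four ways: merge, append, join, and concat. Each produces a different result. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.

The following uses three of the more common ones, append, concat, and join, to show the effect of merging DataFrame objects.

First, append merges the contents of two DataFrame objects.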

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeA = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeB = pd.DataFrame({"編程語言": ['Scala', 'C#', 'Go'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# Appending dataframeB to dataframeA.
# DataFrame.append only exists in pandas versions before 2.0.
res = dataframeA.append(dataframeB)

# Printing the result of the append operation.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28
# 0   Scala      23
# 1      C#      20
# 2      Go      28

Next, concat merges the contents of the two DataFrame objects.

# Concatenating the two dataframes together.
res = pd.concat([dataframeA, dataframeB])

# Printing the result of the concat operation.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28
# 0   Scala      23
# 1      C#      20
# 2      Go      28

concat produces the same result as append here: both stack the data vertically, row by row.
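In the vertical merge the original row labels are kept, so 0, 1, and 2 each appear twice. A minimal sketch of one way to get a fresh index is to pass `ignore_index=True` to `pd.concat`; the frames `df_a` and `df_b` and their English column names are assumptions for this example.

```python
import pandas as pd

df_a = pd.DataFrame({"language": ["Java", "Python", "C++"], "age": [23, 20, 28]})
df_b = pd.DataFrame({"language": ["Scala", "C#", "Go"], "age": [23, 20, 28]})

# ignore_index=True discards the original labels and numbers the rows 0..n-1.
stacked = pd.concat([df_a, df_b], ignore_index=True)

print(stacked)
```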

Finally, join merges the structure and contents of two DataFrame objects horizontally, column by column.

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeC = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# Creating a dataframe with one column called `歷史表現` (historical performance) and three rows.
dataframeD = pd.DataFrame({"歷史表現": ['A', 'A', 'A']})

# Joining the two dataframes on their index (the default when `on` is None).
res = dataframeC.join(dataframeD, on=None)

# Printing the result of the join operation.
print(res)

#      編程語言  已誕生多少年 歷史表現
# 0    Java      23    A
# 1  Python      20    A
# 2     C++      28    A

After the join, the column from dataframeD is added to dataframeC as an extra column, and each row is filled with the matching value A, aligned on the index.
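Of the four merge methods named earlier, merge is the only one the article does not demonstrate. The following is a rough sketch of a key-based merge; the frames `left` and `right`, the `rating` column, and its values are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"language": ["Java", "Python", "C++"], "age": [23, 20, 28]})
right = pd.DataFrame({"language": ["Python", "Java", "Go"], "rating": ["A", "B", "C"]})

# how="left" keeps every row of `left`; rows are matched on the value of the
# `language` column rather than on the index, so row order in `right` is irrelevant.
merged = pd.merge(left, right, on="language", how="left")

print(merged)
```

Unlike the index-based join, the row for C++ has no partner in `right`, so its `rating` is left as a missing value.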

Setting the index

Setting an index on a DataFrame is straightforward: pass the name of the column to the set_index method of the DataFrame. set_index returns a new DataFrame and leaves the original unchanged.

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeE = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# Setting the index to the column `編程語言`. The result is assigned to res so that
# dataframeE keeps its default integer index for the later examples.
res = dataframeE.set_index(u'編程語言')

# Printing the re-indexed result.
print(res)

#         已誕生多少年
# 編程語言
# Java        23
# Python      20
# C++         28
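Once a column has become the index, rows can be looked up by its values. This is a small sketch under assumed English column names; `indexed`, `python_age`, and `restored` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"language": ["Java", "Python", "C++"], "age": [23, 20, 28]})

# set_index returns a new frame; df itself keeps its integer index.
indexed = df.set_index("language")

# With the language as the index, loc can address a cell by row label and column name.
python_age = indexed.loc["Python", "age"]

# reset_index turns the index back into an ordinary column.
restored = indexed.reset_index()

print(python_age)
print(restored.columns.tolist())
```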

Sorting data

A DataFrame is sorted either by its index or by the values of one or more chosen columns. In both cases whole rows are reordered.

# Sorting the dataframeE by the index.
res = dataframeE.sort_index()

# Printing the res.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28

# Sorting the dataframeE by the column `已誕生多少年`.
res = dataframeE.sort_values(by=['已誕生多少年'], ascending=False)

# Printing the res.
print(res)

#      編程語言  已誕生多少年
# 2     C++      28
# 0    Java      23
# 1  Python      20

sort_index sorts by the index of the DataFrame, while sort_values sorts by the values of one or more specified columns, in ascending or descending order.
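Sorting by more than one column can be sketched as follows. The extra Scala row, the English column names, and the name `ordered` are assumptions, added so that a tie in the first column exists.

```python
import pandas as pd

df = pd.DataFrame({"language": ["Java", "Scala", "Python", "C++"],
                   "age": [23, 23, 20, 28]})

# Sort by age descending; break ties by language ascending.
# `ascending` takes one flag per column listed in `by`.
ordered = df.sort_values(by=["age", "language"], ascending=[False, True])

print(ordered)
```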

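Grouping data

In preprocessing, grouping mainly means tagging the rows with a group label so that they are easier to categorize later.

Simple grouping can be done with functions from NumPy. Here the where function of NumPy assigns a label according to a condition.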

# Creating a new column called `分組標記（高齡/低齡）` and setting the value to `高` if the value in the column `已誕生多少年` is greater
# than or equal to 23, otherwise it is setting the value to `低`.
dataframeE['分組標記（高齡/低齡）'] = np.where(dataframeE[u'已誕生多少年'] >= 23, '高', '低')

# Printing the dataframeE.
print(dataframeE)

#      編程語言  已誕生多少年 分組標記（高齡/低齡）
# 0    Java      23           高
# 1  Python      20           低
# 2     C++      28           高

For a slightly more complex rule, combine several conditions to find the matching rows and label them.

# Creating a new column called `分組標記（高齡/低齡,是否是Java）` and setting the value to `高/是` if the value in the column `已誕生多少年` is
# greater than or equal to 23 and the value in the column `編程語言` is equal to `Java`, otherwise it is setting the value to
# `低/否`.
dataframeE['分組標記（高齡/低齡,是否是Java）'] = np.where((dataframeE[u'已誕生多少年'] >= 23) & (dataframeE[u'編程語言'] == 'Java'), '高/是',
                                             '低/否')

# Printing the dataframeE.
print(dataframeE)

#      編程語言  已誕生多少年 分組標記（高齡/低齡） 分組標記（高齡/低齡,是否是Java）
# 0    Java      23           高                 高/是
# 1  Python      20           低                 低/否
# 2     C++      28           高                 低/否
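`np.where` gives two labels per call. When more than two bands are needed, one alternative (not used in the article) is `pd.cut`, sketched here with made-up band edges and the assumed column name `age_band`.

```python
import pandas as pd

df = pd.DataFrame({"language": ["Java", "Python", "C++"], "age": [23, 20, 28]})

# Bin edges are right-inclusive by default: (0, 21], (21, 25], (25, 100].
df["age_band"] = pd.cut(df["age"], bins=[0, 21, 25, 100],
                        labels=["low", "mid", "high"])

print(df)
```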

5. Extracting data

Data extraction pulls out the data that meets a requirement. A DataFrame extracts data by label, by position, or by a combination of the two.

Labels here are the values of the DataFrame index, while positions are zero-based row and column offsets.

Almost all of these operations can be done with the loc and iloc indexers of the DataFrame.

Extract the row whose index label is 2.

# Selecting the row with the index of 2.
res = dataframeE.loc[2]

# Printing the result of the operation.
print(res)

# 編程語言                   C++
# 已誕生多少年                  28
# 分組標記（高齡/低齡）              高
# 分組標記（高齡/低齡,是否是Java）    低/否
# Name: 2, dtype: object

Extract all the rows with index labels 0 through 1. A loc slice includes both endpoints.

# Selecting the rows with the index of 0 and 1.
res = dataframeE.loc[0:1]

# Printing the result of the operation.
print(res)

#      編程語言  已誕生多少年 分組標記（高齡/低齡） 分組標記（高齡/低齡,是否是Java）
# 0    Java      23           高                 高/是
# 1  Python      20           低                 低/否

Extract the region made up of the first two rows and the first two columns.

# Note: iloc slices by integer position and excludes the end position, unlike the loc slice above.

# Selecting the first two rows and the first two columns.
res = dataframeE.iloc[:2, :2]

# Printing the result of the operation.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20
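With the default 0, 1, 2 index, labels and positions coincide, which hides the difference between the two indexers. The sketch below uses a deliberately non-default index (10, 20, 30, an assumption for illustration) to make it visible.

```python
import pandas as pd

df = pd.DataFrame({"language": ["Java", "Python", "C++"], "age": [23, 20, 28]},
                  index=[10, 20, 30])

# loc slices by label and includes both endpoints: labels 10 and 20.
by_label = df.loc[10:20]

# iloc slices by position and excludes the end: only position 0.
by_position = df.iloc[0:1]

print(by_label)
print(by_position)
```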

Extract the rows that match a condition, here the rows where a column holds one of the given values.

# Extract the rows whose 編程語言 column is Java or C++.

# Selecting the rows where the value in the column `編程語言` is either `Java` or `C++`.
res = dataframeE.loc[dataframeE[u'編程語言'].isin(['Java', 'C++'])]

# Printing the result of the operation.
print(res)

#    編程語言  已誕生多少年 分組標記（高齡/低齡） 分組標記（高齡/低齡,是否是Java）
# 0  Java      23           高                 高/是
# 2   C++      28           高                 低/否

6. Filtering data

Filtering is the last extraction step applied to the original data in the processing life cycle. It selects rows using logical conditions.

The examples below filter the DataFrame with the three common logical operations: and, or, and not.

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeF = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

res = dataframeF.loc[(dataframeF[u'已誕生多少年'] > 25) & (dataframeF[u'編程語言'] == 'C++'), [u'編程語言', u'已誕生多少年']]

# Printing the result of the operation.
print(res)

#   編程語言  已誕生多少年
# 2  C++      28

res = dataframeF.loc[(dataframeF[u'已誕生多少年'] > 23) | (dataframeF[u'編程語言'] == 'Java'), [u'編程語言', u'已誕生多少年']]

# Printing the result of the operation.
print(res)

#    編程語言  已誕生多少年
# 0  Java      23
# 2   C++      28

res = dataframeF.loc[(dataframeF[u'編程語言'] != 'Java'), [u'編程語言', u'已誕生多少年']]

# Printing the result of the operation.
print(res)

#      編程語言  已誕生多少年
# 1  Python      20
# 2     C++      28
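The same filters can be written in two other common styles. This is a sketch with assumed English column names: the `~` operator negates a boolean mask, and `DataFrame.query` takes the condition as a string.

```python
import pandas as pd

df = pd.DataFrame({"language": ["Java", "Python", "C++"], "age": [23, 20, 28]})

# "not": ~ inverts the boolean mask produced by the comparison.
not_java = df[~(df["language"] == "Java")]

# "and": query evaluates the expression against the column names.
old_cpp = df.query("age > 25 and language == 'C++'")

print(not_java)
print(old_cpp)
```

Each comparison must be wrapped in parentheses when combined with `&`, `|`, or `~`, because those operators bind more tightly than `==` and `>` in Python.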

7. Aggregating data

Aggregation usually uses the groupby function to group by one or more columns and then the count function to count the rows in each group.

res = dataframeF.groupby(u'編程語言').count()

# Printing the result of the operation.
print(res)

#         已誕生多少年
# 編程語言
# C++          1
# Java         1
# Python       1

res = dataframeF.groupby(u'編程語言')[u'已誕生多少年'].count()

# Printing the result of the operation.
print(res)

# 編程語言
# C++       1
# Java      1
# Python    1
# Name: 已誕生多少年, dtype: int64

res = dataframeF.groupby([u'編程語言',u'已誕生多少年'])[u'已誕生多少年'].count()

# Printing the result of the operation.
print(res)

# 編程語言    已誕生多少年
# C++     28        1
# Java    23        1
# Python  20        1
# Name: 已誕生多少年, dtype: int64
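Because every language appears once in the sample data, all the counts are 1. The sketch below adds a hypothetical `paradigm` column and a made-up Ruby row so that groups contain more than one row, and uses `agg` to compute several summaries at once.

```python
import pandas as pd

df = pd.DataFrame({
    "paradigm": ["oop", "scripting", "oop", "scripting"],  # hypothetical grouping key
    "language": ["Java", "Python", "C++", "Ruby"],
    "age": [23, 20, 28, 27],
})

# agg accepts a list of function names and returns one column per function.
summary = df.groupby("paradigm")["age"].agg(["count", "mean", "max"])

print(summary)
```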

8. Data statistics

Data statistics follows much the same approach as in mathematics: first sample the data, then compute indicators such as the standard deviation and covariance.

# Randomly draw two rows from the DataFrame, sampling without replacement.
res = dataframeF.sample(n=2, replace=False)

# Printing the result of the operation.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20

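Each run draws two rows from the DataFrame at random, so the output varies between runs.

To sample with replacement instead, set the replace parameter to True.

# Compute the covariance of the numeric column. The text column 編程語言 is left out,
# because recent pandas versions raise an error on non-numeric columns.
res = dataframeF[[u'已誕生多少年']].cov()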

# Printing the result of the operation.
print(res)

#            已誕生多少年
# 已誕生多少年  16.333333

# Compute the correlation of the numeric column.
res = dataframeF[[u'已誕生多少年']].corr()

# Printing the result of the operation.
print(res)

#         已誕生多少年
# 已誕生多少年     1.0
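The introduction to this section mentions the standard deviation, which the examples do not compute. A minimal sketch, with assumed English column names, of the standard deviation alongside a reproducible sample:

```python
import pandas as pd

df = pd.DataFrame({"language": ["Java", "Python", "C++"], "age": [23, 20, 28]})

# random_state fixes the seed so the same rows are drawn on every run.
picked = df.sample(n=2, replace=False, random_state=0)

# var and std use the sample (n - 1) denominator by default,
# which matches the diagonal of the covariance matrix.
age_var = df["age"].var()
age_std = df["age"].std()

print(picked)
print(age_var, age_std)
```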
