Extracting and Removing Rows That Contain Keywords in Python

Updated: 2023-08-01 16:48:24  Author: 上景

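This article shows how to use Python to extract the rows of a dataset that contain given keywords, and how to remove the rows that contain given keywords. The sample code is explained step by step for reference.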

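Code I wrote while helping my partner process some data, part six. Feature 1: filter the rows of the original data by keywords found in a given column, then save the matching rows to a new CSV file. Feature 2: delete from the original data every row that contains any keyword listed in the "刪除的關鍵詞" (keywords to delete) column, then save the remaining data to a new CSV file.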

Feature 1: Keep the rows that contain a keyword

Section 1: Reading the data and setup

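In this part, the code reads data from two different sources:

It reads "Table 1" from an Excel file (需要保留的關鍵詞.xlsx, the keywords to keep) into a DataFrame called keywords_df.
It reads "Table 2" from a CSV file (原始數據.csv, the original data) into another DataFrame called data_df. Passing dtype=str loads every column as text.

It then creates an empty DataFrame called result_df ("Table 3") with the same columns as data_df.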

import pandas as pd
from tqdm import tqdm
# Read Table 1 (keywords to keep)
keywords_df = pd.read_excel(r"C:\Users\Desktop\需要保留的關鍵詞.xlsx")
# Read Table 2 (data table)
data_df = pd.read_csv(r"C:\Users\Desktop\原始數據.csv", dtype=str)
# Create an empty Table 3
result_df = pd.DataFrame(columns=data_df.columns)
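The dtype=str argument to read_csv deserves a closer look: it keeps every column as text, so values with leading zeros survive and the later .str.contains() calls work on any column. A minimal sketch, using a made-up in-memory CSV (the 編號 column and its values are hypothetical and not from the original data):

```python
import io
import pandas as pd

# Hypothetical sample standing in for 原始數據.csv
csv_text = "編號,地址\n007,山西省太原市\n012,山西省陽泉市\n"

# Default parsing infers integers, so the leading zeros are lost
inferred_df = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every value exactly as written in the file
text_df = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred_df["編號"].tolist())  # [7, 12]
print(text_df["編號"].tolist())      # ['007', '012']
```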

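Section 2: Iterating over the keywords and filtering the data

In this part, the code loops over every keyword in the "關鍵詞" (keyword) column, using the tqdm library to display a progress bar labeled "Processing".

For each keyword, it performs the following steps:

It searches "Table 2" (data_df) for rows whose "地址" (address) column contains the current keyword. The str.contains() method checks for partial matches, and na=False treats missing values as non-matches. Note that str.contains() interprets the keyword as a regular expression by default.

The matching rows are stored in a DataFrame called matched_rows.

matched_rows is appended to the previously created result_df using pd.concat(), with ignore_index=True to reset the index of the concatenated DataFrame.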

for keyword in tqdm(keywords_df['關鍵詞'], desc="Processing"):
    # Find rows in Table 2 where the "地址" column matches the keyword
    matched_rows = data_df[data_df['地址'].str.contains(keyword, na=False)]
    # Append the matched rows to Table 3
    result_df = pd.concat([result_df, matched_rows], ignore_index=True)
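The loop scans the column once per keyword and grows result_df with repeated pd.concat calls. As an alternative sketch (not the author's method), the keywords can be joined into one escaped pattern so that a single str.contains call builds the mask. The sample frames below are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical stand-ins for keywords_df and data_df
keywords_df = pd.DataFrame({"關鍵詞": ["太原市", "陽泉市"]})
data_df = pd.DataFrame({
    "地址": ["山西省太原市小店區", "山西省大同市平城區", "山西省陽泉市城區", None],
    "名稱": ["A", "B", "C", "D"],
})

# Escape each keyword so regex metacharacters are matched literally,
# then join the keywords with "|" (regex alternation)
pattern = "|".join(re.escape(str(k)) for k in keywords_df["關鍵詞"].dropna())

# One pass over the column; na=False treats missing addresses as non-matches
mask = data_df["地址"].str.contains(pattern, na=False)
result_df = data_df[mask].reset_index(drop=True)

print(result_df)
```

A row that matches several keywords is selected only once here, although duplicates already present in the source data are left in place. If the keyword list is empty the pattern becomes an empty string, which matches every row, so real use would need a guard for that case.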

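Section 3: Removing duplicate rows and saving the result

In this part, the code performs the following steps:

It removes duplicate rows from "Table 3" (result_df) with the drop_duplicates() method, comparing all columns. result_df is updated to contain only unique rows.

It saves the deduplicated DataFrame to a new CSV file named "篩選出包含關鍵詞的行.csv" with the to_csv() method. Setting index to False avoids writing the DataFrame index as a separate column in the CSV file.

Finally, it prints "Query Complete" to signal that the keyword search, filtering, and CSV export have finished.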

# Remove duplicate rows from Table 3 based on all columns
result_df = result_df.drop_duplicates()
# Save Table 3 to a CSV file
result_df.to_csv(r"C:\Users\Desktop\篩選出包含關鍵詞的行.csv", index=False)
# Print "Query Complete"
print("Query Complete")
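Deduplication matters in this approach because a row whose address contains two keywords is appended once per keyword. A minimal sketch with made-up data (the keywords 太原 and 小店 are hypothetical) shows the duplicate appearing and then being removed:

```python
import pandas as pd

# Hypothetical data: the first address contains both keywords
data_df = pd.DataFrame({"地址": ["太原市小店區", "大同市平城區"]})
keywords = ["太原", "小店"]

# Collect the matches per keyword, as the per-keyword loop does
parts = [data_df[data_df["地址"].str.contains(k, na=False)] for k in keywords]
combined = pd.concat(parts, ignore_index=True)

# The first row was matched twice, so it appears twice before deduplication
deduped = combined.drop_duplicates()

print(len(combined), len(deduped))  # 2 1
```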

Section 4: Example run

The original data:

The keywords to keep, assumed to be:

After the code finishes (only the rows containing 太原市 and 陽泉市 remain):

Complete code

import pandas as pd
from tqdm import tqdm
# Read Table 1 (keywords to keep)
keywords_df = pd.read_excel(r"C:\Users\Desktop\需要保留的關鍵詞.xlsx")
# Read Table 2 (data table)
data_df = pd.read_csv(r"C:\Users\Desktop\原始數據.csv", dtype=str)
# Create an empty Table 3
result_df = pd.DataFrame(columns=data_df.columns)
# Iterate over the keywords in Table 1
for keyword in tqdm(keywords_df['關鍵詞'], desc="Processing"):
    # Find rows in Table 2 where the "地址" column matches the keyword
    matched_rows = data_df[data_df['地址'].str.contains(keyword, na=False)]
    # Append the matched rows to Table 3
    result_df = pd.concat([result_df, matched_rows], ignore_index=True)
# Remove duplicate rows from Table 3 based on all columns
result_df = result_df.drop_duplicates()
# Save Table 3 to a CSV file
result_df.to_csv(r"C:\Users\Desktop\篩選出包含關鍵詞的行.csv", index=False)
# Print "Query Complete"
print("Query Complete")

Feature 2: Remove the rows that contain a keyword

Section 1: Loading the data

In this part, the code imports the required libraries, pandas and tqdm. It then loads two datasets from external files.

import pandas as pd
from tqdm import tqdm
# Read Table 1 (keywords to delete)
keywords_df = pd.read_excel(r"C:\Users\Desktop\需要刪除的關鍵詞.xlsx")
# Read Table 2 (same input file as in the complete code for this feature)
data_df = pd.read_csv(r"C:\Users\Desktop\原始數據.csv", dtype=str)

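Section 2: Keyword processing and filtering

This part iterates over every keyword in the "刪除的關鍵詞" column of the keywords_df DataFrame. For each keyword, the code finds the rows of data_df whose "地址" column contains that keyword as a literal substring (regex=False) and stores them in matched_rows. It then reassigns data_df to the rows that do not match, using the ~ operator to invert the mask.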

for keyword in tqdm(keywords_df['刪除的關鍵詞'], desc="Processing"):
    matched_rows = data_df[data_df['地址'].str.contains(keyword, na=False, regex=False)]
    data_df = data_df[~data_df['地址'].str.contains(keyword, na=False, regex=False)]
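The loop computes each str.contains mask twice and does not use matched_rows afterwards. As a hedged alternative sketch (not the article's code), the per-keyword masks can be OR-ed into one mask and inverted once. regex=False keeps each keyword literal, which matters when a keyword contains characters such as ( or a dot. The sample data and keywords are made up.

```python
import pandas as pd

# Hypothetical keywords; the second one contains regex metacharacters
keywords = ["太原市", "(舊)"]
data_df = pd.DataFrame({
    "地址": ["山西省太原市", "山西省大同市", "陽泉市(舊)城區", None],
})

# Start with "nothing to remove", then OR in one literal-match mask per keyword
remove_mask = pd.Series(False, index=data_df.index)
for keyword in keywords:
    remove_mask |= data_df["地址"].str.contains(keyword, na=False, regex=False)

# Keep the rows that matched no keyword; rows with a missing address are kept too
remaining_df = data_df[~remove_mask]

print(remaining_df)
```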

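Section 3: Saving and finishing

In this part, the data remaining in data_df (after the rows with matching keywords have been filtered out) is saved to a new CSV file on the desktop named "去除掉包含關鍵詞的行.csv". The index=False argument ensures the index column is not written to the CSV file. Finally, the script prints "Query Complete" to signal that the keyword processing and filtering have finished.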

data_df.to_csv(r"C:\Users\Desktop\去除掉包含關鍵詞的行.csv", index=False)
print("Query Complete")

Section 4: Example run

The original data:

The keywords to delete, assumed to be:

After the code finishes (the rows containing 太原市 and 陽泉市 have been removed):

Complete code

import pandas as pd
from tqdm import tqdm
# Read Table 1 
keywords_df = pd.read_excel(r"C:\Users\Desktop\需要刪除的關鍵詞.xlsx")
# Read Table 2 
data_df = pd.read_csv(r"C:\Users\Desktop\原始數據.csv", dtype=str)
# Iterate over the keywords in Table 1
for keyword in tqdm(keywords_df['刪除的關鍵詞'], desc="Processing"):
    # Find rows in Table 2 where the "地址" column contains the keyword as a substring
    matched_rows = data_df[data_df['地址'].str.contains(keyword, na=False, regex=False)]
    # Remove the matched rows from Table 2
    data_df = data_df[~data_df['地址'].str.contains(keyword, na=False, regex=False)]
# Save the remaining data to a CSV file
data_df.to_csv(r"C:\Users\Desktop\去除掉包含關鍵詞的行.csv", index=False)
# Print "Query Complete"
print("Query Complete")

Note the file formats used in the code above: some inputs are CSV and some are xlsx. Adjust the program to match your own files as needed.
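One way to handle both formats without editing the script each time is to pick the reader from the file extension. This is a sketch under assumptions: load_table is a hypothetical helper, not part of pandas or the original article, and reading .xlsx files requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path
import tempfile
import pandas as pd

def load_table(path):
    """Hypothetical helper: read a CSV or Excel file as an all-text DataFrame."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path, dtype=str)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(path, dtype=str)
    raise ValueError(f"Unsupported file type: {suffix}")

# Usage with a temporary CSV file (made-up content)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample.csv"
    csv_path.write_text("關鍵詞\n太原市\n陽泉市\n", encoding="utf-8")
    loaded_df = load_table(csv_path)

print(loaded_df["關鍵詞"].tolist())  # ['太原市', '陽泉市']
```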
