Tips for Processing Datasets with Python

Updated: 2024-12-27 08:44:59  Author: engchina

This article starts with loading data, then walks step by step through formatting and saving it, and ends with loading a processed dataset.

1. Import the required libraries
2. Load the pretraining dataset
3. View the first 5 samples of the dataset
4. Load the company fine-tuning dataset
5. Format the data
6. Format the data with a template
7. Build the fine-tuning dataset
8. Save the processed data
9. Load the processed data
Summary

1. Import the required libraries

First, we import the Python libraries used throughout the rest of the article:

import jsonlines
import itertools
import pandas as pd
from pprint import pprint

import datasets
from datasets import load_dataset

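Explanation:

jsonlines: reads and writes files in JSON Lines format.

itertools: provides efficient tools for working with iterators.

pandas: handles tabular data, such as Excel or CSV files.

pprint: pretty-prints data so that nested structures are easier to read.

datasets: a library dedicated to loading and processing datasets.

2. Load the pretraining dataset

Next, we load a pretraining dataset. Here we use the allenai/c4 dataset, a corpus of English text.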

pretrained_dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

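Explanation:

load_dataset: loads a dataset.

"allenai/c4": the name of the dataset.

"en": selects the English configuration only.

split="train": loads only the training split.

streaming=True: loads the data as a stream instead of downloading it all up front, which suits large datasets.

3. View the first 5 samples of the dataset

The following code prints the first 5 samples of the dataset: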

n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
  print(i)

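Explanation:

n = 5: the number of samples to look at.

itertools.islice: takes the first 5 samples from the dataset without consuming the rest of the stream.

for i in top_n: iterates over those 5 samples and prints each one.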
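Because a streamed dataset is consumed lazily, itertools.islice pulls only the first n items and never touches the rest. The sketch below illustrates this with a stand-in generator; fake_stream is a hypothetical name and not part of the datasets library, so the example runs without downloading anything.

```python
import itertools


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records one at a time."""
    i = 0
    while True:  # endless, like a dataset too large to hold in memory
        yield {"text": f"sample {i}"}
        i += 1


n = 5
# islice stops after n items, so the endless generator is never exhausted.
top_n = list(itertools.islice(fake_stream(), n))
for record in top_n:
    print(record)
```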

4. Load the company fine-tuning dataset

Suppose we have a file named lamini_docs.jsonl that stores a set of questions and answers. We can load it with the following code:

filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df

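Explanation:

pd.read_json: with lines=True, reads a JSON Lines file and converts it into a table (DataFrame).

instruction_dataset_df: as the last line of a notebook cell, this displays the table; in a plain script, use print(instruction_dataset_df) instead.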
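JSON Lines is a text format with one JSON object per line. As a rough sketch of what reading such a file involves, the following parses JSON Lines text with only the standard library. The two sample records are invented for illustration and are not taken from lamini_docs.jsonl.

```python
import io
import json

# Invented sample content; the real file's records are not shown in this article.
sample_jsonl = io.StringIO(
    '{"question": "What is A?", "answer": "A is a letter."}\n'
    '{"question": "What is 2?", "answer": "2 is a number."}\n'
)

# Parse each non-empty line as its own JSON object.
rows = [json.loads(line) for line in sample_jsonl if line.strip()]

print(len(rows))
print(rows[0]["question"])
```

Each element of rows corresponds to one row of the table, and each key to one column.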

5. Format the data

We can concatenate a question and its answer into a single string for later processing:

examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
text

Explanation:

to_dict(): converts the table into a dictionary.

examples["question"][0]: the text of the first question.

examples["answer"][0]: the text of the first answer.

text: the question and answer concatenated into one string.
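With its default arguments, DataFrame.to_dict() returns a nested dictionary of the form {column: {row_index: value}}, which is why examples["question"][0] works. The sketch below builds a dictionary of that shape by hand, with invented data, and shows that plain concatenation leaves no separator between question and answer.

```python
# Hand-built dictionary mimicking the {column: {row_index: value}} shape;
# the contents are invented for illustration.
examples_like = {
    "question": {0: "What is A?", 1: "What is 2?"},
    "answer": {0: "A is a letter.", 1: "2 is a number."},
}

text = examples_like["question"][0] + examples_like["answer"][0]
print(text)  # What is A?A is a letter.
```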

6. Format the data with a template

A template gives the question and answer a consistent, clearly delimited layout:

prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template

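Explanation:

prompt_template_qa: a template with a "Question" part and an "Answer" part.

format: inserts the question and the answer into the template.

7. Build the fine-tuning dataset

We can format every question-answer pair and collect the results in lists: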

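# Question-only template. The original article does not define prompt_template_q;
# this definition is an assumption modeled on prompt_template_qa.
prompt_template_q = """### Question:
{question}

### Answer:"""

num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]

  # Full prompt containing both the question and the answer
  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  # Prompt containing only the question; the answer is stored separately
  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})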

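Explanation:

num_examples: the number of questions.

finetuning_dataset_text_only: holds each formatted question-answer pair as a single text field.

finetuning_dataset_question_answer: holds the formatted question and the answer as separate fields.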
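The loop can also be packaged as a function that works on an iterable of (question, answer) pairs. This is only a sketch: build_finetuning_records is a hypothetical helper, the question-only template is an assumption modeled on the question-and-answer template, and the sample pair is invented.

```python
PROMPT_TEMPLATE_QA = """### Question:
{question}

### Answer:
{answer}"""

# Assumed question-only variant of the template above.
PROMPT_TEMPLATE_Q = """### Question:
{question}

### Answer:"""


def build_finetuning_records(pairs):
    """Return (text_only, question_answer) lists for (question, answer) pairs."""
    text_only = []
    question_answer = []
    for question, answer in pairs:
        text_only.append(
            {"text": PROMPT_TEMPLATE_QA.format(question=question, answer=answer)}
        )
        question_answer.append(
            {"question": PROMPT_TEMPLATE_Q.format(question=question), "answer": answer}
        )
    return text_only, question_answer


# Invented sample pair for illustration.
text_only, question_answer = build_finetuning_records(
    [("What is A?", "A is a letter.")]
)
print(text_only[0]["text"])
```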

8. Save the processed data

We can save the processed data to a new file:

with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)

Explanation:

jsonlines.open: opens a file, here in write mode.

writer.write_all: writes every record in the list to the file.
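Each record ends up as one JSON object per line. As a sketch of that file layout, the following writes and reads back two invented records using only the standard library json module instead of jsonlines, inside a temporary directory so nothing in the working directory is touched.

```python
import json
import os
import tempfile

# Invented records for illustration.
records = [
    {"question": "### Question:\nWhat is A?\n\n### Answer:", "answer": "A is a letter."},
    {"question": "### Question:\nWhat is 2?\n\n### Answer:", "answer": "2 is a number."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed.jsonl")

    # Write one JSON object per line; json.dumps escapes embedded newlines.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back line by line.
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]

print(loaded == records)
```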

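9. Load the processed data

Finally, we load a processed dataset. Note that the code below loads a dataset named lamini/lamini_docs by name through load_dataset, that is, from the Hugging Face Hub, rather than reading the local lamini_docs_processed.jsonl file written in the previous step: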

finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

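Explanation:

load_dataset: loads the dataset with the given name.

print(finetuning_dataset): prints a summary of the loaded dataset.

Summary

In this article we covered how to load, process, and save datasets with Python. We started with basic data loading, then formatted the data, saved it, and finally loaded a processed dataset.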
