25 Python Text Processing Examples Worth Bookmarking
1. Extract PDF Content
# pip install PyPDF2
import PyPDF2

# Create a pdf file object.
pdf = open("test.pdf", "rb")

# Create a pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Check the total number of pages in the pdf file.
print("Total number of Pages:", pdf_reader.numPages)

# Create a page object (here, page index 200).
page = pdf_reader.getPage(200)

# Extract text from that page.
print(page.extractText())

# Close the file object.
pdf.close()
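Note that PdfFileReader, numPages, getPage(), and extractText() were removed in PyPDF2 3.x. If the snippet above raises errors on a current install, here is a minimal equivalent sketch using the maintained pypdf package (same test.pdf assumed):

# pip install pypdf
from pypdf import PdfReader

reader = PdfReader("test.pdf")
print("Total number of Pages:", len(reader.pages))

# Extract text from a specific page (index 200).
print(reader.pages[200].extract_text())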
2. Extract Word Content
# pip install python-docx
import docx


def main():
    try:
        doc = docx.Document('test.docx')  # Create a word reader object.
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)  # Join the paragraphs once, after the loop.

        print(data)

    except IOError:
        print('There was an error opening the file!')
        return


if __name__ == '__main__':
    main()

3. Extract Web Page Content
# pip install bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()

# Parse the page
soup = BeautifulSoup(webpage, 'html.parser')

# Pretty-print the parsed html
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract the title and a meta tag value
print(soup.title.string)
print(soup.find('meta', attrs={'property':'og:description'}))

# Extract anchor tag values
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag values
for x in soup.find_all('p'):
    print(x.text)
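Since x.string returns None for anchors that wrap nested tags, here is a small follow-up sketch (reusing the same soup object) that collects link targets instead:

# Extract the href attribute of every anchor that has one.
for a in soup.find_all('a', href=True):
    print(a['href'])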
4. Read JSON Data
import requests
import json
r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()
# Extract specific node content.
print(res['quiz']['sport'])
# Dump data as string
data = json.dumps(res)
print(data)
5. Read CSV Data
import csv
with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the header row
    for row in reader:
        print(row)
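If the CSV has a header row, csv.DictReader maps each row onto the column names automatically; a minimal sketch, assuming the same test.csv:

import csv

with open('test.csv', 'r') as csv_file:
    for row in csv.DictReader(csv_file):  # The header row becomes the keys
        print(row)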
6. Remove Punctuation from a String
import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!"

# Method 1: regex
# Remove the listed special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)


# Method 2: translate()
# Make a translator object that deletes all punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)

7. Remove Stop Words with NLTK
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # The stop-word list must be downloaded once.

data = ['Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!']

# Build the stop-word set
stop_words = set(stopwords.words('english'))

output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stop_words:
            temp_list.append(word)
    output.append(' '.join(temp_list))

print(output)

8. Correct Spelling with TextBlob
# python -m textblob.download_corpora  # TextBlob's corpora must be installed once.
from textblob import TextBlob

# The misspellings below are intentional; correct() should fix them.
data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."
output = TextBlob(data).correct()
print(output)
9. Word Tokenization with NLTK and TextBlob
import nltk
from textblob import TextBlob

nltk.download('punkt')  # The punkt tokenizer models must be downloaded once.

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."
nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words
print(nltk_output)
print(textblob_output)
Output:
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
10. Stem Sentence or Phrase Words with NLTK
from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))

Output:
where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
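The Porter stemmer is deliberately aggressive ('was' becomes 'wa', 'excellent' becomes 'excel'). NLTK also ships the Snowball ('Porter2') stemmer, a revised algorithm that smooths several Porter quirks; a minimal sketch:

from nltk.stem import SnowballStemmer

snow = SnowballStemmer('english')
print(snow.stem('jumping'), snow.stem('jumps'), snow.stem('jumped'))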
11. Lemmatize a Sentence or Phrase with NLTK
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # The WordNet data must be downloaded once.
wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
? ? ? ? 'Her car was in full view.',
? ? ? ? 'A number of cars carried out of state license plates.']
output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))
for item in output:
    print(item)
print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))
print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))

Output:
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
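lemmatize() treats every word as a noun unless told otherwise, which is why the verbs above ('gripped', 'passed', 'carried') pass through unchanged. Here is a sketch that feeds the lemmatizer POS tags from nltk.pos_tag (assuming the punkt data from example 9 and the tagger data are downloaded):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')

def wn_pos(treebank_tag):
    # Map a Penn Treebank tag to a WordNet POS constant; default to noun.
    return {'J': wordnet.ADJ, 'V': wordnet.VERB,
            'R': wordnet.ADV}.get(treebank_tag[0], wordnet.NOUN)

wnl = WordNetLemmatizer()
tokens = nltk.word_tokenize('A number of cars carried out of state license plates.')
print(" ".join(wnl.lemmatize(w, wn_pos(t)) for w, t in nltk.pos_tag(tokens)))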
12. Find Each Word's Frequency in a Text File with NLTK
import nltk
from nltk.corpus import webtext

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than three characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

data_analysis = nltk.FreqDist(filter_words)

data_analysis.plot(25, cumulative=False)

Output:
[nltk_data] Downloading package webtext to
[nltk_data] C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...
13. Create a Word Cloud from a Corpus
import nltk
from nltk.corpus import webtext
from wordcloud import WordCloud  # pip install wordcloud
import matplotlib.pyplot as plt

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = nltk.FreqDist(wt_words)

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plot the word cloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

14. NLTK Lexical Dispersion Plot
import nltk
from nltk.corpus import webtext
import matplotlib.pyplot as plt

words = ['data', 'science', 'dataset']

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

# Collect an (offset, word-index) pair for every occurrence of a target word.
points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]

if points:
    x, y = zip(*points)
else:
    x = y = ()

plt.plot(x, y, "rx", scalex=.1)
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
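NLTK can draw the same figure in one call; a minimal sketch over the same corpus:

import nltk

# Text.dispersion_plot handles the offsets and plotting internally.
nltk.Text(wt_words).dispersion_plot(['data', 'science', 'dataset'])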
15. Convert Text to Numbers with CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create the dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())  # get_feature_names() on sklearn < 1.0

# Change column headers
df2.columns = df1.columns
print(df2)

Output:
Go Java Python
and 2 2 2
application 0 1 0
are 1 0 1
bytecode 0 1 0
can 0 1 0
code 0 1 0
comes 1 0 1
compiled 0 1 0
derived 0 1 0
develops 0 1 0
for 0 2 0
from 0 1 0
functional 1 0 1
imperative 1 0 1
...
16. Create a Document-Term Matrix with TF-IDF
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create the dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())  # get_feature_names() on sklearn < 1.0

# Change column headers
df2.columns = df1.columns
print(df2)

Output:
Go Java Python
and 0.323751 0.137553 0.323751
application 0.000000 0.116449 0.000000
are 0.208444 0.000000 0.208444
bytecode 0.000000 0.116449 0.000000
can 0.000000 0.116449 0.000000
code 0.000000 0.116449 0.000000
comes 0.208444 0.000000 0.208444
compiled 0.000000 0.116449 0.000000
derived 0.000000 0.116449 0.000000
develops 0.000000 0.116449 0.000000
for 0.000000 0.232898 0.000000
...
17. Generate N-grams for a Given Sentence
Natural Language Toolkit: NLTK
import nltk
from nltk.util import ngrams

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))文本處理工具:TextBlob
from textblob import TextBlob

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))Output:
1-gram: ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram: ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram: ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram: ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
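N-grams need no library at all; here is a dependency-free sketch that slides a window over a pre-split token list (note that plain split() keeps the trailing period attached to the last token):

def ngrams_plain(tokens, num):
    # One n-gram per window position.
    return [' '.join(tokens[i:i + num]) for i in range(len(tokens) - num + 1)]

print(ngrams_plain('A class is a blueprint for the object.'.split(), 2))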
18. Build a Bigram Vocabulary with sklearn CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

# Initialize with bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create the dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())  # get_feature_names() on sklearn < 1.0

# Change column headers
df2.columns = df1.columns
print(df2)

Output:
Assembly Machine
also either 0 1
and or 0 1
are also 0 1
are readable 1 0
are still 1 0
assembly language 5 0
because each 1 0
but difficult 0 1
by computers 0 1
by people 0 1
can execute 0 1
...
19. Extract Noun Phrases with TextBlob
from textblob import TextBlob
# Extract noun phrases
blob = TextBlob("Canada is a country in the northern part of North America.")
for nouns in blob.noun_phrases:
    print(nouns)

Output:
canada
northern part
america
20. Compute a Word-Word Co-occurrence Matrix
import numpy as np
import nltk
from nltk import bigrams
import itertools
import pandas as pd


def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in the corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise the co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams, taking the current and previous word
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # Return the matrix and the index
    return co_occurrence_matrix, vocab_index


text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python', 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Flatten the list of lists into one corpus
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)

data_matrix = pd.DataFrame(matrix, index=vocab_index,
                           columns=vocab_index)
print(data_matrix)

Output:
best use What Where ... in is Python used
best 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
use 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0
What 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Where 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Why 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
companies 0.0 1.0 0.0 1.0 ... 1.0 0.0 0.0 0.0
in 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0
is 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
Python 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
used 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
[10 rows x 10 columns]
21. Sentiment Analysis with TextBlob
from textblob import TextBlob


def sentiment(polarity):
    if polarity < 0:
        print("Negative")
    elif polarity > 0:
        print("Positive")
    else:
        print("Neutral")

blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

Output:
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
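TextBlob's polarity comes from a pattern-based lexicon. For comparison, here is a sketch with NLTK's VADER analyzer, which returns a compound score in [-1, 1] for the same sentences:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # The lexicon must be downloaded once.
sia = SentimentIntensityAnalyzer()
for text in ["The movie was excellent!", "The movie was not bad.", "The movie was ridiculous."]:
    print(text, sia.polarity_scores(text))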
22. Language Translation with Goslate
# pip install goslate
import goslate

# goslate relies on Google's free web endpoint and may fail if that service changes.
text = "Comment vas-tu?"
gs = goslate.Goslate()
translatedText = gs.translate(text, 'en')
print(translatedText)
translatedText = gs.translate(text, 'zh')
print(translatedText)
translatedText = gs.translate(text, 'de')
print(translatedText)
23. Language Detection and Translation with TextBlob
from textblob import TextBlob

blob = TextBlob("Comment vas-tu?")

print(blob.detect_language())

print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))

Output:
fr
¿Como estas tu?
How are you?
你好嗎?
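detect_language() and translate() were removed from recent TextBlob releases because the Google endpoint they wrapped was retired, so this example needs an older TextBlob. A rough replacement sketch, assuming the third-party deep-translator package and its GoogleTranslator class:

# pip install deep-translator
from deep_translator import GoogleTranslator

text = "Comment vas-tu?"
print(GoogleTranslator(source='auto', target='en').translate(text))
print(GoogleTranslator(source='auto', target='zh-CN').translate(text))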
24. Get Definitions and Synonyms with TextBlob
from textblob import Word

text_word = Word('safe')

print(text_word.definitions)

synonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

print(synonyms)

Output:
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}
25. Get a List of Antonyms with TextBlob
from textblob import Word

text_word = Word('safe')
antonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            antonyms.add(lemma.antonyms()[0].name())
print(antonyms)

Output:
{'dangerous', 'out'}
This concludes our 25 Python text processing examples. For more Python text processing material, search 腳本之家's earlier articles, and we hope you will continue to support 腳本之家!