Python腳本提取fasta文件單序列信息實(shí)現(xiàn)

更新時(shí)間：2022年07月01日 09:12:38 作者：土雕藝術(shù)家

這篇文章主要為大家介紹了Python腳本提取fasta文件單序列信息實(shí)現(xiàn)示例，有需要的朋友可以借鑒參考下，希望能夠有所幫助，祝大家多多進(jìn)步，早日升職加薪

Python腳本編輯

使用Python對(duì)fasta格式的序列進(jìn)行基本信息統(tǒng)計(jì)

預(yù)期設(shè)計(jì)輸出文件中包括fasta文件名，序列長(zhǎng)度，GC含量以及ATCG各自的含量。

使用的文件

test.fasta

stat.py

輸入 sys模塊

#!/usr/bin/env python
import sys

從命令行獲得文件名稱

file_fasta = sys.argv[1]
#獲得文件名
file_name = file_fasta.split('.')
name = file_name[0]

sys.argv[1]模塊是從程序外部獲取參數(shù)的橋梁，可以將命令行的參數(shù)輸入到py程序內(nèi)。

sys.argv[0]是程序本身，sys.argv[1]是程序后跟著第一個(gè)參數(shù)。

我們將文件名作為輸入?yún)?shù)，這一步在最后運(yùn)行有展示。

在結(jié)束輸出時(shí)會(huì)輸出一個(gè)包含統(tǒng)計(jì)信息的txt文件，我們將用fasta文件名作為txt文件的前綴，所以我們需要獲取fasta文件的名字。

.split('.')是將file_fasta以.為分隔符分開(kāi)會(huì)生成'test','txt'，賦值給file_name則file_name會(huì)包含著兩個(gè)字符。

file_name[0]則是取第一個(gè)值'test'，python中默認(rèn)第一個(gè)數(shù)字是0，所以不能輸入1。

進(jìn)行序列信息統(tǒng)計(jì)的函數(shù)

序列的長(zhǎng)度很好統(tǒng)計(jì)使用len函數(shù)即可，但是GC含量和ACTG的百分比計(jì)算需要費(fèi)點(diǎn)事情。

使用def制作一個(gè)函數(shù)

Python 使用def 開(kāi)始函數(shù)定義，緊接著是函數(shù)名，括號(hào)內(nèi)部為函數(shù)的參數(shù)，內(nèi)部為函數(shù)的具體功能實(shí)現(xiàn)代碼

def get_info(chr):
    chr = chr.upper()
    count_g = chr.count('G')
    count_c = chr.count('C')
    count_a = chr.count('A')
    count_t = chr.count('T')

命名這個(gè)函數(shù)為get_info，內(nèi)部參數(shù)為chr

在咱們會(huì)將fasta中ATCG的堿基內(nèi)容賦值給chr，堿基可能有大寫(xiě)有小寫(xiě)，所以我們使用.upper將所以字符變成大寫(xiě)。

再使用.count('G')統(tǒng)計(jì)ATCG各自的數(shù)量并賦值給對(duì)應(yīng)count_g，我們用ATCG各自的統(tǒng)計(jì)數(shù)可以在后面計(jì)算中免疫N值干擾。

gc = (count_g + count_c) / (count_a + count_t + count_c + count_g)
A = (count_a) / (count_a + count_t + count_c + count_g)
T = (count_t) / (count_a + count_t + count_c + count_g)
C = (count_c) / (count_a + count_t + count_c + count_g)
G = (count_g) / (count_a + count_t + count_c + count_g)
gc_con = '{:.2%}'.format(gc)
A_content = '{:.2%}'.format(A)
T_content = '{:.2%}'.format(T)
C_content = '{:.2%}'.format(C)
G_content = '{:.2%}'.format(G)
return (gc_con,A_content,T_content,C_content,G_content)

gc含量計(jì)算其等于（G的數(shù)量+C的數(shù)量）/（A的數(shù)量+T的數(shù)量+C的數(shù)量+G的數(shù)量）

A的含量等于（A的數(shù)量）/（A的數(shù)量+T的數(shù)量+C的數(shù)量+G的數(shù)量），其他值的計(jì)算以此類推。

.format使用：

"{1} {0} {1}".format("hello", "world")設(shè)置指定位置。

'world hello world'

{:.2f} 保留小數(shù)點(diǎn)后兩位

最后，使用return返回函數(shù)結(jié)果(gc_con,A_content,T_content,C_content,G_content)

進(jìn)行函數(shù)計(jì)算

#進(jìn)行函數(shù)計(jì)算
with open(file_fasta,'r') as read_fa:

讀取文件內(nèi)容賦值給read_fa

python中有兩個(gè)方式打開(kāi)文件一種是直接使用open("test.fasta","r")，執(zhí)行完以后f.close()關(guān)閉。
注釋："r"只讀模式打開(kāi)文件；"w"以只寫(xiě)模式打開(kāi)文件，這種模式下輸入內(nèi)容會(huì)覆蓋原有內(nèi)容；"a"以追加模式打開(kāi)一個(gè)文件,這個(gè)模式會(huì)把新內(nèi)容追加到原有內(nèi)容的末尾，不會(huì)覆蓋。
這里使用的是第二方式with內(nèi)置函數(shù)，它可以將文件自動(dòng)關(guān)閉。

    for val in read_fa:
        val = val.strip()
        if not val.startswith("&gt;"):
            seq_info = get_info(val)
            len_fasta = len(val)

將read_fa內(nèi)容賦值給val。

strip() 方法用于移除字符串頭尾指定的字符(默認(rèn)為空格或換行符)，這里使用默認(rèn)。

然后使用startswith() 方法用于檢查字符串是否是以指定子字符串開(kāi)頭，在當(dāng)不是>開(kāi)頭的行時(shí)候，才對(duì)核酸序列才進(jìn)行信息統(tǒng)計(jì)。

len() 方法返回字符長(zhǎng)度獲得片段長(zhǎng)度

結(jié)果屏幕展示

#結(jié)果屏幕展示
print('******\n{0}\nlength:{1}\ngc content :{2}\nA content :{3}\nT content :{4}\nC content :{5}\nG content :{6}\n******'.format(name,len_fasta,seq_info[0],seq_info[1],seq_info[2],seq_info[3],seq_info[4]))

使用\n進(jìn)行換行，用.format指定值輸出位置。

結(jié)果輸出文件

os.write(fd, str)

write() 方法用于寫(xiě)入字符串到文件描述符 fd

#結(jié)果輸出文件
file_output = open("{}sum.txt".format(name),'a')
file_output.write('******\n')
file_output.write('{}\n'.format(name))
file_output.write('length:{:d}\n'.format(len_fasta))
file_output.write('gc content :{}\n'.format(seq_info[0]))
file_output.write('A content :{}\n'.format(seq_info[1]))
file_output.write('T content :{}\n'.format(seq_info[2]))
file_output.write('C content :{}\n'.format(seq_info[3]))
file_output.write('G content :{}\n'.format(seq_info[4]))
file_output.write('******')
file_output.close()