
Comparing File Similarity in Python Using Fuzzy Hashing

Updated: 2023-01-28 14:57:18   Author: nick

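To compare the similarity of two files in Python, you can use difflib.SequenceMatcher, ssdeep, python_mmdt, or tlsh. When many comparisons are needed and the files are large, efficiency matters more, and fuzzy hashing (for example ssdeep or python_mmdt) is worth considering.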

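Findings from testing:

  • difflib: after reading both files into memory, it outputs a match ratio directly.
  • ssdeep/mmdt/tlsh: the fuzzy hash can be computed ahead of time, so at verification time each file is read only once to complete the comparison, which cuts comparison time as well as memory and CPU usage.
  • tlsh: a smaller value means higher similarity; its results on small files were poor.
  • On small files the three methods perform about the same; on large files (81 MB in this test), difflib is unacceptably slow.
  • In practice, mmdt is recommended, because ssdeep scores varied widely when comparing binary files and lost their value as a reference. Which other file types have this problem remains to be checked.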
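The "compute the hash ahead of time" idea from the list above can be sketched as a small signature cache. This is only a minimal illustration under stated assumptions: `load_or_compute`, the `store` dictionary, and the size/mtime staleness check are hypothetical names and choices, not part of ssdeep, python_mmdt, or tlsh. The `hash_func` argument stands for whichever hashing call you use, such as `ssdeep.hash_from_file`.

```python
import os
from typing import Callable, Dict


def load_or_compute(path: str, store: Dict[str, dict],
                    hash_func: Callable[[str], str]) -> str:
    """Return the cached signature for path, rehashing only if the file changed.

    hash_func is any callable that maps a file path to a signature string
    (hypothetical parameter; pass the fuzzy-hash function of your choice).
    """
    stat = os.stat(path)
    key = os.path.abspath(path)
    entry = store.get(key)
    # Reuse the stored signature when size and modification time are unchanged.
    if entry and entry["size"] == stat.st_size and entry["mtime"] == stat.st_mtime_ns:
        return entry["sig"]
    sig = hash_func(path)  # the expensive step: reads the whole file
    store[key] = {"size": stat.st_size, "mtime": stat.st_mtime_ns, "sig": sig}
    return sig
```

Because the store is a plain dictionary of strings and integers, it could be saved with `json.dump` between runs, so that later comparisons only need the stored signatures rather than the original files.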

Test environment:

OS: Ubuntu 20.04

Python: 3.8.10

py-tlsh==4.7.2

python-mmdt==0.3.1

ssdeep==3.4

# -*- coding: utf-8 -*-

import ssdeep
import time
from python_mmdt.mmdt.mmdt import MMDT
from difflib import SequenceMatcher

def difflib_test(file1, file2):
    start_time = time.time()
    # Both files are read fully into memory as bytes before comparison
    with open(file1, 'rb') as f:
        s1 = f.read()
    with open(file2, 'rb') as f:
        s2 = f.read()
    # ratio() returns a float between 0 and 1; 1 means identical
    match_obj = SequenceMatcher(None, s1, s2)
    print("difflib match:", match_obj.ratio())
    end_time = time.time()
    print('difflib_test cost :', end_time - start_time)

def mmdt_test(file1,file2):
    start_time = time.time()
    mmdt=MMDT()
    r1 = mmdt.mmdt_hash(file1)
    print(r1)
    r2 = mmdt.mmdt_hash_streaming(file2)
    print(r2)
    # sim1 = mmdt.mmdt_compare(file1, file2)
    # print("mmdt match:",sim1)
    sim2 = mmdt.mmdt_compare_hash(r1, r2)
    print("mmdt match:",sim2)
    end_time = time.time()
    print('mmdt_test cost :',end_time-start_time)

def ssdeep_test(file1,file2):
    start_time = time.time()
    sig1=ssdeep.hash_from_file(file1)
    sig2=ssdeep.hash_from_file(file2)
    print(sig1)
    print(sig2)
    print("ssdeep match:",ssdeep.compare(sig1,sig2))
    end_time = time.time()
    print('ssdeep_test cost :',end_time-start_time)

if __name__ == '__main__':
    start_time = time.time()
    file1='/root/test/fstab'
    file2='/root/test/fstab2'
    # file1 = '/root/test/initrd.img-5.4.0-125-generic'
    # file2 = '/root/test/initrd.img-5.4.0-135-generic'
    mmdt_test(file1,file2)    
    ssdeep_test(file1,file2)
    difflib_test(file1,file2)
    end_time = time.time()
    print('total elapsed time:', end_time - start_time)
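Since difflib_test becomes very slow on large inputs, one possible mitigation is to check the cheap upper bounds that `SequenceMatcher` provides before calling `ratio()`. The sketch below is an illustration, not the author's method: `tiered_ratio` and the 0.6 threshold are hypothetical, while `real_quick_ratio()` and `quick_ratio()` are standard difflib methods documented as upper bounds on `ratio()`.

```python
from difflib import SequenceMatcher
from typing import Optional


def tiered_ratio(a: bytes, b: bytes, threshold: float = 0.6) -> Optional[float]:
    """Return ratio(), or None if a cheap upper bound already falls below threshold.

    threshold is an illustrative value, not a recommendation.
    """
    matcher = SequenceMatcher(None, a, b)
    # real_quick_ratio() only uses the two lengths.
    if matcher.real_quick_ratio() < threshold:
        return None
    # quick_ratio() compares element counts and ignores order.
    if matcher.quick_ratio() < threshold:
        return None
    # Only inputs that pass both bounds pay for the full comparison.
    return matcher.ratio()
```

This only helps when many pairs are clearly dissimilar. Pairs that pass both bounds still pay the full cost of `ratio()`, so it does not replace fuzzy hashing for large files.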

[Figure: results for the small-file and large-file comparisons]

Testing tlsh

import tlsh
import time

def tlsh_test(file1, file2):
    start_time = time.time()
    # Each file is read fully into memory and hashed
    with open(file1, 'rb') as f:
        s1 = tlsh.hash(f.read())
    with open(file2, 'rb') as f:
        s2 = tlsh.hash(f.read())
    # tlsh.diff returns a distance: the smaller the value, the more similar the files
    match_obj = tlsh.diff(s1, s2)
    print("tlsh match:", match_obj)
    end_time = time.time()
    print('tlsh_test cost :', end_time - start_time)


if __name__ == '__main__':
    start_time = time.time()
    # file1='/root/test/fstab'
    # file2='/root/test/fstab2'
    file1 = '/root/test/initrd.img-5.4.0-125-generic'
    file2 = '/root/test/initrd.img-5.4.0-135-generic'
    tlsh_test(file1,file2)
    end_time = time.time()
    print('total elapsed time:', end_time - start_time)
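Because tlsh reports a distance (lower is more similar) while difflib and ssdeep report a similarity (higher is more similar), it is easy to compare scores in the wrong direction. The helper below is a small sketch of one way to keep the directions straight. The function name `is_similar` and every threshold value are hypothetical and chosen only for illustration. They are not recommended cut-offs from any of the libraries.

```python
from typing import Dict, Optional, Tuple

# (threshold, higher_is_more_similar); all thresholds are illustrative assumptions.
DEFAULT_RULES: Dict[str, Tuple[float, bool]] = {
    "difflib": (0.8, True),   # SequenceMatcher.ratio(): float in [0, 1]
    "ssdeep": (80, True),     # ssdeep.compare(): integer in [0, 100]
    "tlsh": (50, False),      # tlsh.diff(): distance, 0 means identical
}


def is_similar(tool: str, score: float,
               rules: Optional[Dict[str, Tuple[float, bool]]] = None) -> bool:
    """Interpret a raw score according to the direction used by each tool."""
    table = DEFAULT_RULES if rules is None else rules
    if tool not in table:
        raise ValueError(f"unknown tool: {tool}")
    threshold, higher_is_better = table[tool]
    return score >= threshold if higher_is_better else score <= threshold
```

For example, `is_similar("tlsh", 10)` and `is_similar("ssdeep", 95)` both count as matches under these assumed thresholds, even though the raw numbers point in opposite directions.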

[Figure: tlsh results for the small-file and large-file comparisons]
