大量小文件的實(shí)時(shí)同步的解決方案分析

發(fā)布時(shí)間：2009-01-10 15:46:00 作者：佚名

這類需求也是經(jīng)常存在的，在這篇blog中作者給出了一個(gè)不錯(cuò)的方案。

傳統(tǒng)的文件同步方案有rsync(單向) 和 unison(雙向)等，它們需要掃描所有文件后進(jìn)行比對(duì)，差量傳輸。如果文件數(shù)量達(dá)到了百萬(wàn)甚至千萬(wàn)量級(jí)，掃描所有文件將非常耗時(shí)。而且正在發(fā)生變化的往往是其中很少的一部分，這是非常低效的方式。

之前看了Amazon的Dynamo的設(shè)計(jì)文檔，它們每個(gè)節(jié)點(diǎn)的數(shù)據(jù)是通過(guò)Hash Tree來(lái)實(shí)現(xiàn)同步，既有通過(guò)日志來(lái)同步的軟實(shí)時(shí)特點(diǎn)(msyql, bdb等)，也可以保證最終數(shù)據(jù)的一致性(rsync, unison等)。Hash Tree的大體思路是將所有數(shù)據(jù)存儲(chǔ)成樹(shù)狀結(jié)構(gòu)，每個(gè)節(jié)點(diǎn)的Hash是其所有子節(jié)點(diǎn)的Hash的Hash，葉子節(jié)點(diǎn)的Hash是其內(nèi)容的Hash。這樣一旦某個(gè)節(jié)點(diǎn)發(fā)生變化，其Hash的變化會(huì)迅速傳播到根節(jié)點(diǎn)。需要同步的系統(tǒng)只需要不斷查詢跟節(jié)點(diǎn)的hash，一旦有變化，順著樹(shù)狀結(jié)構(gòu)就能夠在logN級(jí)別的時(shí)間找到發(fā)生變化的內(nèi)容，馬上同步。

文件系統(tǒng)天然的是樹(shù)狀結(jié)構(gòu)，盡管不是平衡的數(shù)。如果文件的修改時(shí)間是可靠的，可以表征文件的變化，那就可以用它作為文件的Hash值。另一方面，文件的修改通常是按順序執(zhí)行的，后修改的文件比早修改的文件具有更大的修改時(shí)間，這樣就可以把一個(gè)目錄內(nèi)的最大修改時(shí)間作為它的修改時(shí)間，以實(shí)現(xiàn)Hash Tree。這樣，一旦某個(gè)文件被修改，修改時(shí)間的信息就會(huì)迅速傳播到根目錄。

一般的文件系統(tǒng)都不是這樣做的，目錄的修改時(shí)間表示的是目錄結(jié)構(gòu)最后發(fā)生變化的時(shí)間，不包括子目錄，否則會(huì)不堪重負(fù)。因?yàn)槲覀冃枰约簩?shí)現(xiàn)這個(gè)功能，利用Linux 2.6內(nèi)核的新特性inotify獲得某個(gè)目錄內(nèi)文件發(fā)生變化的信息，并把其修改時(shí)間傳播到它的上級(jí)目錄(以及再上級(jí)目錄)。Python 有 pyinotify，watch.py的代碼如下：

復(fù)制代碼

代碼如下:

#!/usr/bin/python
from pyinotify import *
import os, os.path
flags = IN_CLOSE_WRITE|IN_CREATE|IN_Q_OVERFLOW
dirs = {}
base = '/log/lighttpd/cache/images/icon/u241'
base = 'tmp'
class UpdateParentDir(ProcessEvent):
def process_IN_CLOSE_WRITE(self, event):
print 'modify', event.pathname
mtime = os.path.getmtime(event.pathname)
p = event.path
while p.startswith(base):
m = os.path.getmtime(p)
if m < mtime:
print 'update', p
os.utime(p, (mtime,mtime))
elif m > mtime:
mtime = m
p = os.path.dirname(p)
process_IN_MODIFY = process_IN_CLOSE_WRITE
def process_IN_Q_OVERFLOW(self, event):
print 'over flow'
max_queued_events.value *= 2
def process_default(self, event):
pass
wm = WatchManager()
notifier = Notifier(wm, UpdateParentDir())
dirs.update(wm.add_watch(base, flags, rec=True, auto_add=True))
notifier.loop()

在已經(jīng)有Hash Tree的時(shí)候，同步就比較簡(jiǎn)單了，不停地獲取根目錄的修改時(shí)間并順著目錄結(jié)構(gòu)往下找即可。需要注意的是，在更新完文件后，需要設(shè)置修改時(shí)間為原文件的修改時(shí)間，目錄也是，保證Hash Tree的一致性，否則沒(méi)法同步。mirror.py的代碼如下

復(fù)制代碼

代碼如下:

#!/usr/bin/python
import sys,time,re,urllib
import os,os.path
from os.path import exists, isdir, getmtime
src = sys.argv[1]
dst = sys.argv[2]
def local_mirror(src, dst):
if exists(dst) and mtime == getmtime(dst):
return
if not isdir(src):
print 'update:', dst
open(dst,'wb').write(open(src).read())
else:
if not exists(dst):
os.makedirs(dst)
for filename in os.listdir(src):
local_mirror(os.path.join(src,filename), os.path.join(dst,filename))
os.utime(dst, (mtime,mtime))
def get_info(path):
f = urllib.urlopen(path)
mtime = f.headers.get('Last-Modified')
if mtime:
mtime = time.mktime(time.strptime(mtime, '%a, %d %b %Y %H:%M:%S %Z'))
content = f.read()
f.close()
return int(mtime), content
p = re.compile(r'([\d.]+?) +([\w/]+)')
def remote_mirror(src, dst):
mtime, content = get_info(src)
if exists(dst) and mtime == int(getmtime(dst)):
return
print 'update:', dst, src
if not src.endswith('/'):
open(dst,'wb').write(content)
else:
if not exists(dst):
os.makedirs(dst)
for mt,filename in p.findall(content):
mt = int(float(mt))
lpath = dst+filename
if not exists(lpath) or int(getmtime(lpath)) != mt:
remote_mirror(src+filename, lpath)
os.utime(dst, (mtime,mtime))
if src.startswith('http://'):
mirror = remote_mirror
else:
mirror = local_mirror
while True:
mirror(src, dst)
time.sleep(1)

如果源文件不在同一臺(tái)機(jī)器上，可以通過(guò)NFS等共享過(guò)來(lái)。或者可以通過(guò)支持列目錄的HTTP服務(wù)器來(lái)訪問(wèn)遠(yuǎn)程目錄，mirror.py 已經(jīng)支持這種訪問(wèn)方式。server.py 是用webpy做的一個(gè)簡(jiǎn)單的只是列目錄的文件服務(wù)器。由于瓶頸在IO上，它的性能不是關(guān)鍵。server.py的代碼如下：

復(fù)制代碼

代碼如下:

#!/usr/bin/python
import os,os.path
import web
import time
root = 'tmp'
HTTP_HEADER_TIME = '%a, %d %b %Y %H:%M:%S %Z'
class FileServer:
def GET(self, path):
path = root + path
if not os.path.exists(path):
return 404
mtime = time.localtime(os.path.getmtime(path))
web.header('Last-Modified', time.strftime(HTTP_HEADER_TIME, mtime))
if os.path.isdir(path):
for file in os.listdir(path):
if file.startswith('.'): continue
p = os.path.join(path,file)
m = os.path.getmtime(p)
if os.path.isdir(p):
file += '/'
print m, file
else:
print open(path,'rb').read()
urls = (
"(/.*)", "FileServer",
)
if __name__ == '__main__':
web.run(urls, globals())

為了獲得更好性能，以達(dá)到更好的實(shí)時(shí)性，Hash Tree最好是平衡的，比如BTree。如果一個(gè)文件發(fā)生變化，同步它需要進(jìn)行的IO操作為N*M，其中N為數(shù)的層數(shù)，M為每層的文件數(shù)目。現(xiàn)在我們N為2，M最大為10000，適當(dāng)減少它可以獲得更好的性能，比如N為4，M為100。在以后創(chuàng)建目錄結(jié)構(gòu)時(shí)，最好能夠考慮這方面的因素。

之前hongqn推薦過(guò)一個(gè)利用inotify的文件同步方案，同步方式類似于mysql和bdb等，由于過(guò)于復(fù)雜導(dǎo)致不可靠而沒(méi)有采用。上面這個(gè)方案只用了一百多行Python代碼就基本解決問(wèn)題了，是不是很帥？:-)