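Implementing a Code Line-Counting Tool in Python (Final Part)

Updated: 2016-07-04 10:53:15  Author: clover_toeic

This article describes how to implement a code line-counting tool in Python and is intended as a reference for interested readers.

This article takes the C/Python line-counting tool (CPLineCounter) built in the earlier articles of this series, optimizes it by rewriting the core algorithm in C behind a C extension interface, and compares it with line counters commonly found online. In the author's tests, CPLineCounter beat the other tools in both counting accuracy and speed. On a code base of roughly ten million lines, CPLineCounter running under CPython and PyPy was 14.5 and 29 times faster, respectively, than the foreign tool cloc 1.64, and 1.8 and 3.6 times faster than the Chinese tool SourceCounter 3.4.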

Test Environment
The code examples in this article were run and tested on Windows. The platform details are:

>>> import sys, platform
>>> print '%s %s, Python %s' %(platform.system(), platform.release(), platform.python_version())
Windows XP, Python 2.7.11
>>> sys.version
'2.7.11 (v2.7.11:6d1b6a68f775, Dec 5 2015, 20:32:19) [MSC v.1500 32 bit (Intel)]'

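Note that Python syntax differs between versions, so some of the examples need minor changes before they will run on older Python versions.
1. Implementation and Optimization
To avoid fragmentation, this section gives the complete implementation. Some variable and function definitions differ slightly from those in the earlier articles of the series, so read them carefully.
1.1 Implementation
First, define two lists that hold the counting results: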

import os, sys
rawCountInfo = [0, 0, 0, 0, 0]
detailCountInfo = []

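rawCountInfo holds the coarse totals: its elements are, in order, the total numbers of file lines, code lines, comment lines and blank lines, followed by the number of files. detailCountInfo holds the detailed statistics: the line counts and name of each file, plus the totals over all files.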
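
The layout of rawCountInfo can be illustrated with a small sketch. It is written for Python 3 rather than the article's Python 2.7, the names safe_div and add_file are hypothetical, and the per-file numbers are taken from the sample run shown later in the article (hard.c and test.c).

```python
def safe_div(dividend, divisor):
    """Division that tolerates a zero divisor (same idea as SafeDiv in the article)."""
    if divisor:
        return float(dividend) / divisor
    return -1 if dividend else 0

# [file lines, code lines, comment lines, blank lines, number of files]
raw_count_info = [0, 0, 0, 0, 0]

def add_file(totals, file_counts):
    """Fold one file's [file, code, comment, blank] counts into the totals."""
    totals[:-1] = [x + y for x, y in zip(totals[:-1], file_counts)]
    totals[-1] += 1

add_file(raw_count_info, [6, 3, 4, 0])      # hard.c in the sample run
add_file(raw_count_info, [44, 34, 3, 7])    # test.c in the sample run

# comment ratio = comment lines / (comment lines + code lines)
ratio = safe_div(raw_count_info[2], raw_count_info[2] + raw_count_info[1])
print(raw_count_info, '%.2f' % ratio)
```

The slice assignment updates the list in place, which is why the totals survive across calls without being returned.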

The implementation follows. To avoid pasting one large block of code, it is presented function by function with a brief description of each.

def CalcLinesCh(line, isBlockComment):
 lineType, lineLen = 0, len(line)
 if not lineLen:
  return lineType

 line = line + '\n' #append one character so that line[iChar+1] never goes out of range
 iChar, isLineComment = 0, False
 while iChar < lineLen:
  if line[iChar] == ' ' or line[iChar] == '\t': #whitespace
   iChar += 1; continue
  elif line[iChar] == '/' and line[iChar+1] == '/': #line comment
   isLineComment = True
   lineType |= 2; iChar += 1 #skip the second '/'
  elif line[iChar] == '/' and line[iChar+1] == '*': #block comment start
   isBlockComment[0] = True
   lineType |= 2; iChar += 1
  elif line[iChar] == '*' and line[iChar+1] == '/': #block comment end
   isBlockComment[0] = False
   lineType |= 2; iChar += 1
  else:
   if isLineComment or isBlockComment[0]:
    lineType |= 2
   else:
    lineType |= 1
  iChar += 1

 return lineType #Bitmap: 0 blank line, 1 code, 2 comment, 3 code and comment

def CalcLinesPy(line, isBlockComment):
 #isBlockComment[single quotes, double quotes]
 lineType, lineLen = 0, len(line)
 if not lineLen:
  return lineType

 line = line + '\n\n' #append two characters so that line[iChar+2] never goes out of range
 iChar, isLineComment = 0, False
 while iChar < lineLen:
  if line[iChar] == ' ' or line[iChar] == '\t': #whitespace
   iChar += 1; continue
  elif line[iChar] == '#':   #line comment
   isLineComment = True
   lineType |= 2
  elif line[iChar:iChar+3] == "'''": #single-quoted block comment
   if isBlockComment[0] or isBlockComment[1]:
    isBlockComment[0] = False
   else:
    isBlockComment[0] = True
   lineType |= 2; iChar += 2
  elif line[iChar:iChar+3] == '"""': #double-quoted block comment
   if isBlockComment[0] or isBlockComment[1]:
    isBlockComment[1] = False
   else:
    isBlockComment[1] = True
   lineType |= 2; iChar += 2
  else:
   if isLineComment or isBlockComment[0] or isBlockComment[1]:
    lineType |= 2
   else:
    lineType |= 1
  iChar += 1

 return lineType #Bitmap: 0 blank line, 1 code, 2 comment, 3 code and comment

CalcLinesCh() and CalcLinesPy() classify each line according to C and Python syntax respectively, so that code, comment and blank lines can be counted separately.
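
To see the bitmap in action, here is a condensed Python 3 sketch of the C-syntax classifier. The function name classify_c_line and the LABELS table are hypothetical, and the logic mirrors CalcLinesCh() above, including its limitation that comment markers inside string literals are not recognized.

```python
def classify_c_line(line, in_block):
    """Return a bitmap for one C source line: 0 blank, 1 code, 2 comment, 3 both.

    in_block is a one-element list so that the block-comment state
    survives across calls (the same trick as isBlockComment above).
    """
    line = line.strip()
    line_type, in_line_comment = 0, False
    i, n = 0, len(line)
    while i < n:
        pair = line[i:i + 2]          # slicing never raises IndexError
        if line[i] in ' \t':
            pass                      # whitespace carries no information
        elif pair == '//':
            in_line_comment = True
            line_type |= 2
            i += 1                    # skip the second '/'
        elif pair == '/*':
            in_block[0] = True
            line_type |= 2
            i += 1
        elif pair == '*/':
            in_block[0] = False
            line_type |= 2
            i += 1
        elif in_line_comment or in_block[0]:
            line_type |= 2
        else:
            line_type |= 1
        i += 1
    return line_type


LABELS = {0: 'blank', 1: 'code', 2: 'comment', 3: 'code+comment'}
state = [False]
sample = ['int i = 0; // counter', '/* start of block', 'still inside */', '', 'return i;']
kinds = [LABELS[classify_c_line(s, state)] for s in sample]
print(kinds)
```

A line that mixes code and a trailing comment sets both bits, which is why such lines are later added to both the code and the comment totals.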

from ctypes import c_uint, c_ubyte, CDLL
CFuncObj = None
def LoadCExtLib():
 try:
  global CFuncObj
  CFuncObj = CDLL('CalcLines.dll')
 except Exception: #does not catch SystemExit or KeyboardInterrupt
  pass

def CalcLines(fileType, line, isBlockComment):
 try:
  #do not call CDLL('CalcLines.dll') inside this function; doing so can slow execution down badly
  bCmmtArr = (c_ubyte * len(isBlockComment))(*isBlockComment)
  CFuncObj.CalcLinesCh.restype = c_uint
  if fileType == 'ch': #compare by value; 'is' tests object identity and only works on strings by accident of interning
   lineType = CFuncObj.CalcLinesCh(line, bCmmtArr)
  else:
   lineType = CFuncObj.CalcLinesPy(line, bCmmtArr)

  isBlockComment[0] = True if bCmmtArr[0] else False
  isBlockComment[1] = True if bCmmtArr[1] else False
  #do not write the following instead: it rebinds the local name, so the caller's isBlockComment list keeps its old contents
  #isBlockComment = [True if i else False for i in bCmmtArr]
 except Exception, e:
  #print e
  if fileType == 'ch':
   lineType = CalcLinesCh(line, isBlockComment)
  else:
   lineType = CalcLinesPy(line, isBlockComment)

 return lineType

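To speed things up, the author rewrote CalcLinesCh() and CalcLinesPy() in C and compiled them into a dynamic-link library. Section 1.2 covers the C implementation and how to build it. LoadCExtLib() and CalcLines() load that library and call the C versions of the counting functions. If loading fails, CFuncObj stays None, the call raises an exception, and the except branch falls back to the slower Python versions.

The code above runs under CPython and loads the C library through the ctypes module, which has been built in since Python 2.5. ctypes is Python's foreign function library: it provides C-compatible data types and allows calling functions in DLLs or shared libraries. It is therefore often used to wrap external dynamic libraries in pure Python code.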
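
The copy-back step in CalcLines() is easy to get wrong, so the following Python 3 sketch isolates it. No DLL is loaded: writing c_state[1] = 1 merely stands in for what the C function would do, and the helper names (to_c_flags, copy_back, rebind_only) are hypothetical.

```python
from ctypes import c_ubyte

def to_c_flags(flags):
    """Copy a Python list of booleans into a C unsigned-char array."""
    return (c_ubyte * len(flags))(*flags)

def copy_back(flags, c_flags):
    """Write the C array back element by element so the caller's list changes."""
    for i in range(len(flags)):
        flags[i] = bool(c_flags[i])

def rebind_only(flags, c_flags):
    """Rebinding the local name does NOT touch the caller's list."""
    flags = [bool(x) for x in c_flags]
    return flags

state = [False, False]
c_state = to_c_flags(state)
c_state[1] = 1                 # stand-in for a flag set by the C function

rebind_only(state, c_state)
after_rebind = list(state)     # caller's list is unchanged

copy_back(state, c_state)
after_copy = list(state)       # caller's list now reflects the C array
print(after_rebind, after_copy)
```

The ctypes array is a separate copy of the Python list, so the state has to be written back explicitly after every call.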

Under PyPy, the C code is called through the cffi interface instead:

from cffi import FFI
CFuncObj, ffiBuilder = None, FFI()
def LoadCExtLib():
 try:
  global CFuncObj
  ffiBuilder.cdef('''
  unsigned int CalcLinesCh(char *line, unsigned char isBlockComment[2]);
  unsigned int CalcLinesPy(char *line, unsigned char isBlockComment[2]);
  ''')
  CFuncObj = ffiBuilder.dlopen('CalcLines.dll')
 except Exception: #does not catch SystemExit or KeyboardInterrupt
  pass

def CalcLines(fileType, line, isBlockComment):
 try:
  bCmmtArr = ffiBuilder.new('unsigned char[2]', isBlockComment)
  if fileType == 'ch': #compare by value; 'is' tests object identity and only works on strings by accident of interning
   lineType = CFuncObj.CalcLinesCh(line, bCmmtArr)
  else:
   lineType = CFuncObj.CalcLinesPy(line, bCmmtArr)

  isBlockComment[0] = True if bCmmtArr[0] else False
  isBlockComment[1] = True if bCmmtArr[1] else False
  #do not write the following instead: it rebinds the local name, so the caller's isBlockComment list keeps its old contents
  #isBlockComment = [True if i else False for i in bCmmtArr]
 except Exception, e:
  #print e
  if fileType == 'ch':
   lineType = CalcLinesCh(line, isBlockComment)
  else:
   lineType = CalcLinesPy(line, isBlockComment)

 return lineType

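cffi is used much like ctypes, but it can also take C source directly and call the functions in it (the source is compiled automatically at run time). For consistency, the dynamic library is still loaded here.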

def SafeDiv(dividend, divisor):
 if divisor: return float(dividend)/divisor
 elif dividend:  return -1
 else:    return 0

gProcFileNum = 0
def CountFileLines(filePath, isRawReport=True, isShortName=False):
 fileExt = os.path.splitext(filePath)
 if fileExt[1] == '.c' or fileExt[1] == '.h':
  fileType = 'ch'
 elif fileExt[1] == '.py': #== (comparison operator) tests whether the values are equal
  fileType = 'py'
 else:
  return

 global gProcFileNum; gProcFileNum += 1
 sys.stderr.write('%d files processed...\r'%gProcFileNum)

 isBlockComment = [False]*2 #could also be a global, to keep the value from the previous call
 lineCountInfo = [0]*5  #[file lines, code lines, comment lines, blank lines, comment ratio]
 with open(filePath, 'r') as file:
  for line in file:
   lineType = CalcLines(fileType, line.strip(), isBlockComment)
   lineCountInfo[0] += 1
   if lineType == 0: lineCountInfo[3] += 1
   elif lineType == 1: lineCountInfo[1] += 1
   elif lineType == 2: lineCountInfo[2] += 1
   elif lineType == 3: lineCountInfo[1] += 1; lineCountInfo[2] += 1
   else:
    assert False, 'Unexpected lineType: %d(0~3)!' %lineType

 if isRawReport:
  global rawCountInfo
  rawCountInfo[:-1] = [x+y for x,y in zip(rawCountInfo[:-1], lineCountInfo[:-1])]
  rawCountInfo[-1] += 1
 elif isShortName:
  lineCountInfo[4] = SafeDiv(lineCountInfo[2], lineCountInfo[2]+lineCountInfo[1])
  detailCountInfo.append([os.path.basename(filePath), lineCountInfo])
 else:
  lineCountInfo[4] = SafeDiv(lineCountInfo[2], lineCountInfo[2]+lineCountInfo[1])
  detailCountInfo.append([filePath, lineCountInfo])

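Note the "%d files processed..." progress message. The script cannot tell from sys.stdout or sys.argv whether its output has been redirected to a file on the command line (sys.stdout is unchanged and sys.argv does not contain ">out"), so a progress message written to standard output would end up in the output file, one line per file: N code files would produce N lines of progress text. The workaround is to rely on the fact that redirection affects only standard output by default and to write the progress message to standard error, which still goes to the console. In addition, the -o option was added so that writing to a file is explicit and users have less reason to redirect.

Also, the line passed to CalcLines() has already had its leading and trailing whitespace removed by strip(), so CalcLinesCh() and CalcLinesPy() need no branch for line terminators.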
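
The stdout/stderr split described above can be sketched as follows (Python 3). The function count_with_progress is hypothetical, and two in-memory buffers simulate a run whose standard output has been redirected.

```python
import io
import sys

def count_with_progress(names, out=sys.stdout, err=sys.stderr):
    """Write the report to out and the progress counter to err."""
    for done, name in enumerate(names, 1):
        err.write('%d files processed...\r' % done)   # stays on the console
        out.write('%s\n' % name)                      # may be redirected

# Simulate "script > out.txt": stdout goes to one buffer, stderr to another.
fake_out, fake_err = io.StringIO(), io.StringIO()
count_with_progress(['a.c', 'b.h'], out=fake_out, err=fake_err)
report = fake_out.getvalue()
progress = fake_err.getvalue()
```

Because the progress text never touches the output stream, the redirected file contains only the report.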

SORT_ORDER = (lambda x:x[0], False)
def SetSortArg(sortArg=None):
 global SORT_ORDER
 if not sortArg:
  return
 if any(s in sortArg for s in ('file', '0')): #deliberately lenient match
 #if sortArg in ('rfile', 'file', 'r0', '0'):
  keyFunc = lambda x:x[1][0]
 elif any(s in sortArg for s in ('code', '1')):
  keyFunc = lambda x:x[1][1]
 elif any(s in sortArg for s in ('cmmt', '2')):
  keyFunc = lambda x:x[1][2]
 elif any(s in sortArg for s in ('blan', '3')):
  keyFunc = lambda x:x[1][3]
 elif any(s in sortArg for s in ('ctpr', '4')):
  keyFunc = lambda x:x[1][4]
 elif any(s in sortArg for s in ('name', '5')):
  keyFunc = lambda x:x[0]
 else: #since argparse can already restrict the allowed sort arguments, an assert would also work here
  print >>sys.stderr, 'Unsupported sort order(%s)!' %sortArg
  return

 isReverse = sortArg[0]=='r' #False: ascending; True: descending
 SORT_ORDER = (keyFunc, isReverse)

def ReportCounterInfo(isRawReport=True, stream=sys.stdout):
 #comment ratio = comment lines / (comment lines + code lines)
 print >>stream, 'FileLines CodeLines CommentLines BlankLines CommentPercent %s'\
   %(not isRawReport and 'FileName' or '')

 if isRawReport:
  print >>stream, '%-11d%-11d%-14d%-12d%-16.2f<Total:%d Code Files>' %(rawCountInfo[0],\
    rawCountInfo[1], rawCountInfo[2], rawCountInfo[3], \
    SafeDiv(rawCountInfo[2], rawCountInfo[2]+rawCountInfo[1]), rawCountInfo[4])
  return

 total = [0, 0, 0, 0]
 #sort detailCountInfo; the default is ascending order on the first column (file name), for readability
 detailCountInfo.sort(key=SORT_ORDER[0], reverse=SORT_ORDER[1])
 for item in detailCountInfo:
  print >>stream, '%-11d%-11d%-14d%-12d%-16.2f%s' %(item[1][0], item[1][1], item[1][2], \
    item[1][3], item[1][4], item[0])
  total[0] += item[1][0]; total[1] += item[1][1]
  total[2] += item[1][2]; total[3] += item[1][3]
 print >>stream, '-' * 90 #print 90 hyphens as a separator
 print >>stream, '%-11d%-11d%-14d%-12d%-16.2f<Total:%d Code Files>' \
   %(total[0], total[1], total[2], total[3], \
   SafeDiv(total[2], total[2]+total[1]), len(detailCountInfo))

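ReportCounterInfo() prints the report. Before a detailed report is printed, its rows are sorted by the requested sort order. Note also that the term for empty lines has changed from EmptyLines to BlankLines: the former means a line containing nothing but the line terminator, while the latter means a line containing only whitespace (spaces, tabs, line terminators and so on).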
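
The sort step can be tried in isolation with the sketch below (Python 3). parse_sort_arg is a hypothetical, stricter variant of SetSortArg(): it accepts only the keyword forms, not the numeric aliases, and raises KeyError for anything else. The rows are taken from the sample run shown later in the article.

```python
def parse_sort_arg(sort_arg):
    """Map e.g. 'code' or 'rcode' to (key function, reverse flag)."""
    columns = {'file': 0, 'code': 1, 'cmmt': 2, 'blan': 3, 'ctpr': 4}
    reverse = sort_arg.startswith('r')      # no keyword itself starts with 'r'
    name = sort_arg[1:] if reverse else sort_arg
    if name == 'name':
        return (lambda row: row[0]), reverse
    index = columns[name]
    return (lambda row: row[1][index]), reverse

# [file name, [file lines, code lines, comment lines, blank lines, comment ratio]]
rows = [['hard.c', [6, 3, 4, 0, 0.57]],
        ['line.c', [33, 19, 15, 4, 0.44]],
        ['test.c', [44, 34, 3, 7, 0.08]]]

key, reverse = parse_sort_arg('rcmmt')
by_comment_desc = [name for name, _ in sorted(rows, key=key, reverse=reverse)]
key, reverse = parse_sort_arg('code')
by_code_asc = [name for name, _ in sorted(rows, key=key, reverse=reverse)]
print(by_comment_desc, by_code_asc)
```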

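To support counting several directories and/or files in one run, ParseTargetList() parses the mixed directory/file list and splits its elements into a file list and a directory list: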

def ParseTargetList(targetList):
 fileList, dirList = [], []
 if not targetList:
  targetList = [os.getcwd()] #default to the current directory without modifying the caller's list
 for item in targetList:
  if os.path.isfile(item):
   fileList.append(os.path.abspath(item))
  elif os.path.isdir(item):
   dirList.append(os.path.abspath(item))
  else:
   print >>sys.stderr, "'%s' is neither a file nor a directory!" %item
 return [fileList, dirList]

LineCounter() does the counting from the file and directory lists:

def CountDir(dirList, isKeep=False, isRawReport=True, isShortName=False):
 for dir in dirList:
  if isKeep:
   for file in os.listdir(dir):
    CountFileLines(os.path.join(dir, file), isRawReport, isShortName)
  else:
   for root, dirs, files in os.walk(dir):
    for file in files:
     CountFileLines(os.path.join(root, file), isRawReport, isShortName)

def CountFile(fileList, isRawReport=True, isShortName=False):
 for file in fileList:
  CountFileLines(file, isRawReport, isShortName)

def LineCounter(isKeep=False, isRawReport=True, isShortName=False, targetList=[]):
 fileList, dirList = ParseTargetList(targetList)
 if fileList != []:
  CountFile(fileList, isRawReport, isShortName)
 if dirList != []:
  CountDir(dirList, isKeep, isRawReport, isShortName)

Next, add command-line parsing:

import argparse
def ParseCmdArgs(argv=sys.argv):
 parser = argparse.ArgumentParser(usage='%(prog)s [options] target',
      description='Count lines in code files.')
 parser.add_argument('target', nargs='*',
   help='space-separated list of directories AND/OR files')
 parser.add_argument('-k', '--keep', action='store_true',
   help='do not walk down subdirectories')
 parser.add_argument('-d', '--detail', action='store_true',
   help='report counting result in detail')
 parser.add_argument('-b', '--basename', action='store_true',
   help='do not show file\'s full path')
## sortWords = ['0', '1', '2', '3', '4', '5', 'file', 'code', 'cmmt', 'blan', 'ctpr', 'name']
## parser.add_argument('-s', '--sort',
##  choices=[x+y for x in ['','r'] for y in sortWords],
##  help='sort order: {0,1,2,3,4,5} or {file,code,cmmt,blan,ctpr,name},' \
##    "prefix 'r' means sorting in reverse order")
 parser.add_argument('-s', '--sort',
   help='sort order: {0,1,2,3,4,5} or {file,code,cmmt,blan,ctpr,name}, ' \
    "prefix 'r' means sorting in reverse order")
 parser.add_argument('-o', '--out',
   help='save counting result in OUT')
 parser.add_argument('-c', '--cache', action='store_true',
   help='use cache to count faster(unreliable when files are modified)')
 parser.add_argument('-v', '--version', action='version',
   version='%(prog)s 3.0 by xywang')

 args = parser.parse_args()
 return (args.keep, args.detail, args.basename, args.sort, args.out, args.cache, args.target)

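Note the -s option added in ParseCmdArgs(). It selects the column the output is sorted by, and an r prefix switches from ascending to descending order. For example, -s 0 or -s file sorts the output by file line count in ascending order, while -s r0 or -s rfile sorts it in descending order.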
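
The commented-out choices= variant in ParseCmdArgs() lets argparse itself reject unknown sort keywords. A minimal Python 3 sketch of that variant follows; build_parser is a hypothetical name and only a subset of the article's options is reproduced.

```python
import argparse

SORT_WORDS = ['0', '1', '2', '3', '4', '5',
              'file', 'code', 'cmmt', 'blan', 'ctpr', 'name']

def build_parser():
    parser = argparse.ArgumentParser(description='Count lines in code files.')
    parser.add_argument('target', nargs='*')
    parser.add_argument('-d', '--detail', action='store_true')
    # Accept every keyword with or without the 'r' (reverse) prefix.
    parser.add_argument('-s', '--sort',
                        choices=[p + w for p in ('', 'r') for w in SORT_WORDS])
    return parser

args = build_parser().parse_args(['-d', 'lctest', '-s', 'rcode'])
print(args.detail, args.target, args.sort)
```

With choices= in place, an unsupported value such as -s size makes argparse print a usage error and exit, so SetSortArg() never sees it. The drawback is that the full list of choices appears in the help text, which may be why the article keeps this variant commented out.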
The -c cache option is most useful when only the sort order changes between runs. To support it, the report is persisted with the json module:

CACHE_FILE = 'Counter.dump'
CACHE_DUMPER, CACHE_GEN = None, None

from json import dump, JSONDecoder
def CounterDump(data):
 global CACHE_DUMPER
 if CACHE_DUMPER == None:
  CACHE_DUMPER = open(CACHE_FILE, 'w')
 dump(data, CACHE_DUMPER)

def ParseJson(jsonData):
 endPos = 0
 while True:
  jsonData = jsonData[endPos:].lstrip()
  try:
   pyObj, endPos = JSONDecoder().raw_decode(jsonData)
   yield pyObj
  except ValueError:
   break

def CounterLoad():
 global CACHE_GEN
 if CACHE_GEN == None:
  CACHE_GEN = ParseJson(open(CACHE_FILE, 'r').read())

 try:
  return next(CACHE_GEN)
 except StopIteration, e:
  return []

def shouldUseCache(keep, detail, basename, cache, target):
 if not cache: #caching was not requested
  return False

 try:
  (_keep, _detail, _basename, _target) = CounterLoad()
 except (IOError, EOFError, ValueError): #cache file is missing, empty or invalid
  return False

 if keep == _keep and detail == _detail and basename == _basename \
  and sorted(target) == sorted(_target):
  return True
 else:
  return False

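Note that JSON persistence raises character-encoding issues. For example, when a source file name contains GBK-encoded Chinese characters, the name should be converted to Unicode with unicode(os.path.basename(filePath), 'gbk') before it is stored in detailCountInfo; otherwise dump() raises an error. Fortunately, only source files used for testing are likely to have Chinese characters in their names, so encoding can usually be ignored.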
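
The cache file holds two JSON documents written back to back, which json.load() cannot read in one call. The Python 3 sketch below shows the raw_decode() technique on an in-memory buffer. iter_json_docs is a hypothetical variant of ParseJson(): it advances an index instead of re-slicing the string, and the sample values are illustrative only.

```python
import io
from json import dump, JSONDecoder

def iter_json_docs(text):
    """Yield each JSON document from a string of back-to-back documents."""
    decoder, pos = JSONDecoder(), 0
    while True:
        # raw_decode() does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)   # pos becomes the end index
        yield obj

buf = io.StringIO()
dump((False, True, False, ['lctest']), buf)     # options of the run
dump(([397, 259, 66, 83, 6], []), buf)          # counting result
docs = list(iter_json_docs(buf.getvalue()))
print(docs)
```

Note that tuples come back as lists after the round trip, which is harmless here because the loaded values are only unpacked and compared element by element.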

With these functions in place, the code can be counted and the report printed:

def main():
 global rawCountInfo, detailCountInfo
 (keep, detail, basename, sort, out, cache, target) = ParseCmdArgs()
 stream = sys.stdout if not out else open(out, 'w')
 SetSortArg(sort); LoadCExtLib()
 cacheUsed = shouldUseCache(keep, detail, basename, cache, target)
 if cacheUsed:
  try:
   (rawCountInfo, detailCountInfo) = CounterLoad()
  except (EOFError, ValueError), e: #unlikely to happen
   print >>sys.stderr, 'Unexpected Cache Corruption(%s), Try Counting Directly.'%e
   LineCounter(keep, not detail, basename, target)
 else:
  LineCounter(keep, not detail, basename, target)

 ReportCounterInfo(not detail, stream)
 CounterDump((keep, detail, basename, target))
 CounterDump((rawCountInfo, detailCountInfo))

To measure how fast the tool runs, timing code can be added as follows:

if __name__ == '__main__':
 from time import clock
 startTime = clock()
 main()
 endTime = clock()
 print >>sys.stderr, 'Time Elapsed: %.2f sec.' %(endTime-startTime)

time.clock() is used to measure elapsed time here in order to avoid the overhead of cProfile.
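
time.clock() was deprecated in Python 3.3 and removed in Python 3.8. Readers porting the timing code to Python 3 could use time.perf_counter() instead, as in this sketch (the helper timed is a hypothetical name, and sum() stands in for main()):

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

total, elapsed = timed(sum, range(100000))
print('Time Elapsed: %.2f sec.' % elapsed)
```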
1.2 Optimization
Apart from len(), CalcLinesCh() and CalcLinesPy() use no Python library functions, so they are easy to rewrite in C. The first C version was as follows:

#include <stdio.h>
#include <stdlib.h> //calloc, free
#include <string.h>
#define TRUE 1
#define FALSE 0

unsigned int CalcLinesCh(char *line, unsigned char isBlockComment[2]) {
 unsigned int lineType = 0;
 unsigned int lineLen = strlen(line);
 if(!lineLen)
  return lineType;

 char *expandLine = calloc(lineLen + 1/*\n*/, 1);
 if(NULL == expandLine)
  return lineType;
 memmove(expandLine, line, lineLen);
 expandLine[lineLen] = '\n'; //append one character so that iChar+1 stays in range

 unsigned int iChar = 0;
 unsigned char isLineComment = FALSE;
 while(iChar < lineLen) {
  if(expandLine[iChar] == ' ' || expandLine[iChar] == '\t') { //whitespace
   iChar += 1; continue;
  }
  else if(expandLine[iChar] == '/' && expandLine[iChar+1] == '/') { //line comment
   isLineComment = TRUE;
   lineType |= 2; iChar += 1; //skip the second '/'
  }
  else if(expandLine[iChar] == '/' && expandLine[iChar+1] == '*') { //block comment start
   isBlockComment[0] = TRUE;
   lineType |= 2; iChar += 1;
  }
  else if(expandLine[iChar] == '*' && expandLine[iChar+1] == '/') { //block comment end
   isBlockComment[0] = FALSE;
   lineType |= 2; iChar += 1;
  }
  else {
   if(isLineComment || isBlockComment[0])
    lineType |= 2;
   else
    lineType |= 1;
  }
  iChar += 1;
 }

 free(expandLine);
 return lineType; //Bitmap: 0 blank line, 1 code, 2 comment, 3 code and comment
}

unsigned int CalcLinesPy(char *line, unsigned char isBlockComment[2]) {
 //isBlockComment[single quotes, double quotes]
 unsigned int lineType = 0;
 unsigned int lineLen = strlen(line);
 if(!lineLen)
  return lineType;

 char *expandLine = calloc(lineLen + 2/*\n\n*/, 1);
 if(NULL == expandLine)
  return lineType;
 memmove(expandLine, line, lineLen);
 //append two characters so that iChar+2 stays in range
 expandLine[lineLen] = '\n'; expandLine[lineLen+1] = '\n';

 unsigned int iChar = 0;
 unsigned char isLineComment = FALSE;
 while(iChar < lineLen) {
  if(expandLine[iChar] == ' ' || expandLine[iChar] == '\t') { //whitespace
   iChar += 1; continue;
  }
  else if(expandLine[iChar] == '#') { //line comment
   isLineComment = TRUE;
   lineType |= 2;
  }
  else if(expandLine[iChar] == '\'' && expandLine[iChar+1] == '\''
    && expandLine[iChar+2] == '\'') { //single-quoted block comment
   if(isBlockComment[0] || isBlockComment[1])
    isBlockComment[0] = FALSE;
   else
    isBlockComment[0] = TRUE;
   lineType |= 2; iChar += 2;
  }
  else if(expandLine[iChar] == '"' && expandLine[iChar+1] == '"'
    && expandLine[iChar+2] == '"') { //double-quoted block comment
   if(isBlockComment[0] || isBlockComment[1])
    isBlockComment[1] = FALSE;
   else
    isBlockComment[1] = TRUE;
   lineType |= 2; iChar += 2;
  }
  else {
   if(isLineComment || isBlockComment[0] || isBlockComment[1])
    lineType |= 2;
   else
    lineType |= 1;
  }
  iChar += 1;
 }

 free(expandLine);
 return lineType; //Bitmap: 0 blank line, 1 code, 2 comment, 3 code and comment
}

This version stays closest to the original Python code, but it can be optimized further:

#define TRUE 1
#define FALSE 0
unsigned int CalcLinesCh(char *line, unsigned char isBlockComment[2]) {
 unsigned int lineType = 0;

 unsigned int iChar = 0;
 unsigned char isLineComment = FALSE;
 while(line[iChar] != '\0') {
  if(line[iChar] == ' ' || line[iChar] == '\t') { //whitespace
   iChar += 1; continue;
  }
  else if(line[iChar] == '/' && line[iChar+1] == '/') { //line comment
   isLineComment = TRUE;
   lineType |= 2; iChar += 1; //skip the second '/'
  }
  else if(line[iChar] == '/' && line[iChar+1] == '*') { //block comment start
   isBlockComment[0] = TRUE;
   lineType |= 2; iChar += 1;
  }
  else if(line[iChar] == '*' && line[iChar+1] == '/') { //block comment end
   isBlockComment[0] = FALSE;
   lineType |= 2; iChar += 1;
  }
  else {
   if(isLineComment || isBlockComment[0])
    lineType |= 2;
   else
    lineType |= 1;
  }
  iChar += 1;
 }

 return lineType; //Bitmap: 0 blank line, 1 code, 2 comment, 3 code and comment
}

unsigned int CalcLinesPy(char *line, unsigned char isBlockComment[2]) {
 //isBlockComment[single quotes, double quotes]
 unsigned int lineType = 0;

 unsigned int iChar = 0;
 unsigned char isLineComment = FALSE;
 while(line[iChar] != '\0') {
  if(line[iChar] == ' ' || line[iChar] == '\t') { //whitespace
   iChar += 1; continue;
  }
  else if(line[iChar] == '#') { //line comment
   isLineComment = TRUE;
   lineType |= 2;
  }
  else if(line[iChar] == '\'' && line[iChar+1] == '\''
    && line[iChar+2] == '\'') { //single-quoted block comment
   if(isBlockComment[0] || isBlockComment[1])
    isBlockComment[0] = FALSE;
   else
    isBlockComment[0] = TRUE;
   lineType |= 2; iChar += 2;
  }
  else if(line[iChar] == '"' && line[iChar+1] == '"'
    && line[iChar+2] == '"') { //double-quoted block comment
   if(isBlockComment[0] || isBlockComment[1])
    isBlockComment[1] = FALSE;
   else
    isBlockComment[1] = TRUE;
   lineType |= 2; iChar += 2;
  }
  else {
   if(isLineComment || isBlockComment[0] || isBlockComment[1])
    lineType |= 2;
   else
    lineType |= 1;
  }
  iChar += 1;
 }

 return lineType; //Bitmap: 0 blank line, 1 code, 2 comment, 3 code and comment
}

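The optimized version scans up to the terminating '\0' and relies on the short-circuit behavior of &&: line[iChar+1] is at worst the terminator itself, and line[iChar+2] is read only when line[iChar+1] has already matched a quote character. No access goes past the end of the string, so the padded copy, and with it the dynamic allocation and release, is no longer needed.

The author's Windows machine initially had no Microsoft VC++ tools installed, so the DLL was built with the MinGW environment that was already installed. Save the C code above as CalcLines.c and compile it with:
gcc -shared -o CalcLines.dll CalcLines.c
Note that MinGW uses the same command to build a DLL and an .so file. The -shared option requests a shared library, which is a .dll file on Windows and an .so file on Unix systems.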

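Along the way the author also tried other C extension tools, such as PyInline. Download the archive from http://pyinline.sourceforge.net/, unpack it and copy the PyInline-0.03 directory into Lib\site-packages. Then change to that directory in a command prompt window and run python setup.py install to install PyInline.
Running the example failed with BuildError: error: Unable to find vcvarsall.bat. After searching online, the author downloaded and installed Microsoft Visual C++ Compiler for Python 2.7. In practice, however, PyInline proved very awkward to use, so the author gave up on it.

Doubting the quality of MinGW's output, the author eventually installed VS2008 Express Edition. The 2008 version was chosen because the Windows build of CPython 2.7 is based on the VS2008 runtime library. After installation, cl.exe (the compiler) and link.exe (the linker) can be found in C:\Program Files\Microsoft Visual Studio 9.0\VC\bin. Once the environment variables are set up as described in online tutorials, programs can be compiled and linked from the Visual Studio 2008 Command Prompt. Enter cl /help or cl -help to list the compiler options.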

Before CalcLines.c is compiled into a DLL, the function headers must be marked with __declspec(dllexport) to indicate that the functions are exported from the DLL:
__declspec(dllexport) unsigned int CalcLinesCh(char *line, unsigned char isBlockComment[2]) {...
__declspec(dllexport) unsigned int CalcLinesPy(char *line, unsigned char isBlockComment[2]) {...
Otherwise, after loading the library, the Python program reports that the corresponding C functions cannot be found.

With the export markers added, compile the source with:
cl /Ox /Ot /Wall /LD /FeCalcLines.dll CalcLines.c
Here /Ox enables maximum optimization, /Ot favors fast code and /Wall enables all warnings; /LD creates a dynamic-link library, and /Fe sets its file name.

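The DLL files can be compressed with UPX. The DLL built with MinGW is 13KB before and 11KB after UPX compression; the one built with VS2008 is 41KB before and 20KB after. In the author's tests the two run at about the same speed. Given the size difference, the rest of this article uses only the MinGW build.

With the C extension DLL, the tool becomes dramatically faster under CPython 2.7. PyPy, by contrast, already speeds the script up so much by itself that the DLL brings little additional gain. When only a few files are counted, the DLL can also be left out (the Python version of the algorithm is then used); when there are many files, the DLL raises the counting speed noticeably. Section 2 gives the detailed measurements.

The author used PyPy 5.1, whose Win32 package can be downloaded from the official site. The package ships with cffi 1.6; for its usage, see the Chinese-language article "Python学习入门手册以及CFFI" (an introductory guide to Python and CFFI) or the official CFFI documentation. After installing PyPy 5.1, enter pypy in a command prompt window to see the PyPy and cffi versions: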

E:\PyTest>pypy
Python 2.7.10 (b0a649e90b66, Apr 28 2016, 13:11:00)
[PyPy 5.1.1 with MSC v.1500 32 bit] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>> import cffi
>>>> cffi.__version__
'1.6.0'

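To run CPLineCounter on a machine without Python installed, first convert the CPython version of the code to an exe and compress it, then distribute it together with the compressed DLL. Users put both files in one directory and add that directory to the PATH environment variable; CPLineCounter can then be run from the Windows command prompt. For example: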

D:\pytest>CPLineCounter -d lctest -s code
FileLines CodeLines CommentLines BlankLines CommentPercent FileName
6   3   4    0   0.57   D:\pytest\lctest\hard.c
27   7   15   5   0.68   D:\pytest\lctest\file27_code7_cmmt15_blank5.py
33   19   15   4   0.44   D:\pytest\lctest\line.c
44   34   3    7   0.08   D:\pytest\lctest\test.c
44   34   3    7   0.08   D:\pytest\lctest\subdir\test.c
243  162  26   60   0.14   D:\pytest\lctest\subdir\CLineCounter.py
------------------------------------------------------------------------------------------
397  259  66   83   0.20   <Total:6 Code Files>
Time Elapsed: 0.04 sec.

2. Accuracy and Performance Evaluation
To check CPLineCounter's accuracy and performance, the author downloaded several common line-counting tools: cloc 1.64 (10.9MB), linecount 3.7 (451KB), SourceCounter 3.4 (8.34MB) and SourceCount_1.0 (644KB).

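Accuracy was tested first. With line.c as the target, the tools produced the output shown in the table below ("-" means the tool does not report that item directly):

Manual inspection confirms that CPLineCounter's result is exactly right. linecount and SourceCounter are also fairly reliable.
Next, 82 source files were counted; the tools produced the output shown in the table below: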

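The total-line and blank-line counts follow simple rules and are hard to get wrong, so the tools that agree most closely on these two items, CPLineCounter and linecount, are taken as the baseline. For code lines and comment lines, CPLineCounter and SourceCounter agree. Judging by this overlap, there is good reason to regard CPLineCounter as the most accurate.

Finally, performance was tested. On the author's Windows XP machine (Pentium G630 at 2.7GHz, 2GB of RAM), 5857 C source files totalling close to ten million lines were counted. The table below shows how the tools performed. Only the totals are listed, but per-file line counts were still computed. Note that for this test linecount must have its "include same-named files when counting a directory" option checked (labelled 目录统计时包含同名文件 in the UI), and cloc needs the --skip-uniqueness and --by-file options.

CPLineCounter's performance depends on how it is run: the counting time ranges from 29 seconds to 281 seconds. Note also that cloc counted only 5733 files.
The bar chart below shows the performance of the tools: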

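In the chart, "Opt-c" means CPLineCounter run with the -c option, "CPython2.7+ctypes(O)" means CPLineCounter run under CPython 2.7 with the old DLL, "Pypy5.1+cffi1.6(N)" means CPLineCounter run under PyPy 5.1 with the new DLL, and so on.

Because CPLineCounter is not a purely CPU-bound program, optimizing the algorithm inside the DLL brought no significant gain (compare the old and new DLLs). PyPy, by contrast, has a built-in JIT (just-in-time) compiler that speeds up the whole Python script enormously, to the point of rivalling C. Performance figures are also affected by the target code, CPU architecture, warm-up, caching, background programs and other factors, so the results for a given tool or combination may differ somewhat from the author's.

Overall, CPLineCounter is the fastest, gives reliable results and is small (exe 1.3MB, dll 11KB). SourceCounter is fairly reliable and fairly fast, and has built-in project management information. cloc has a large error in its file count and linecount a large error in its code-line count, and both are slow; cloc, however, is highly configurable and can be built from source to reduce its size. SourceCount is the slowest, and its results are not very reliable either.

Readers familiar with parallel computing in Python can also modify the CPLineCounter source to add multiprocessing and keep all cores busy, or try multithreading to improve I/O performance. The following is an excerpt of the line_profiler results for CountFileLines():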

E:\PyTest>kernprof -l -v CPLineCounter.py source -d > out.txt
140872  93736  32106   16938  0.26   <Total:82 Code Files>
Wrote profile results to CPLineCounter.py.lprof
Timer unit: 2.79365e-07 s

Total time: 5.81981 s
File: CPLineCounter.py
Function: CountFileLines at line 143

Line #  Hits   Time Per Hit % Time Line Contents
==============================================================
 143           @profile
 144           def CountFileLines(filePath, isRawReport=True, isShortName=False):
... ... ... ... ... ... ... ...
 162  82  7083200 86380.5  34.0  with open(filePath, 'r') as file:
 163 140954  1851877  13.1  8.9   for line in file:
 164 140872  6437774  45.7  30.9    lineType = CalcLines(fileType, line.strip(), isBlockComment)
 165 140872  1761864  12.5  8.5    lineCountInfo[0] += 1
 166 140872  1662583  11.8  8.0    if lineType == 0: lineCountInfo[3] += 1
 167 123934  1499176  12.1  7.2    elif lineType == 1: lineCountInfo[1] += 1
 168  32106  406931  12.7  2.0    elif lineType == 2: lineCountInfo[2] += 1
 169  1908  27634  14.5  0.1    elif lineType == 3: lineCountInfo[1] += 1; lineCountInfo[2] += 1
... ... ... ... ... ... ... ...

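line_profiler can be installed with pip install line_profiler. Add the @profile decorator to the function to be measured and run the kernprof command; it reports the time spent on each line of the decorated function. The -l option requests line-by-line profiling, and -v prints the timing results to the screen when the run finishes. Lines with large Hits (execution count) or Time values have the most room for optimization.

The line_profiler results show that this function leans towards CPU-bound work (lines 164~169 account for 56.7% of the function's time). Taking directory traversal and similar operations into account, however, the program as a whole is quite possibly I/O-bound, so whether multiprocessing or multithreading gives the better speed-up still has to be tested. The simplest improvement would be to move lines 162~169 (reading the file and classifying its lines) into C. The remaining parts are either I/O-bound or rely on Python libraries, so rewriting them in C would take twice the effort for half the gain.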
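
As a starting point for such experiments, the sketch below counts files concurrently with a thread pool (Python 3). It is not part of CPLineCounter: count_lines is a hypothetical stand-in for CountFileLines(), and the three temporary files exist only for the demonstration.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def count_lines(path):
    """Return (path, number of lines); stands in for CountFileLines()."""
    with open(path, 'r') as f:
        return path, sum(1 for _ in f)

def count_all(paths, workers=4):
    """Count files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_lines, paths))

# Build three throwaway files with 1, 2 and 3 lines.
tmp_dir = tempfile.mkdtemp()
paths = []
for n in (1, 2, 3):
    path = os.path.join(tmp_dir, 'f%d.c' % n)
    with open(path, 'w') as f:
        f.write('int x;\n' * n)
    paths.append(path)

counts = [n for _, n in count_all(paths)]
shutil.rmtree(tmp_dir)      # remove the temporary files
print(counts)
```

Threads mainly help while the program waits on file I/O. If profiling shows the classification itself dominating, swapping in concurrent.futures.ProcessPoolExecutor is one way to try multiprocessing; note that the worker then returns its counts rather than updating module-level lists, since globals are not shared between processes.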

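Finally, if all that is needed is a line count, the following shell commands can be used on Linux or macOS:
find ./codeDir -name "*.c" -or -name "*.h" | xargs cat | grep -v "^$" | wc -l #total lines excluding empty lines
find ./codeDir -name "*.c" -or -name "*.h" | xargs wc -l #line count of each file and the total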
