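A Detailed Analysis of Multithreading Misuse in an Application

Updated: 2013-05-13 11:12:59

This article analyzes in detail a case of misused multithreading in an application. Readers who need it may find it a useful reference.

I. Requirements and initial implementation
A very simple Windows service: the client connects to a mail server, downloads the messages (including attachments), saves them in .eml format, and deletes them from the server once they have been saved successfully. The pseudocode is roughly as follows: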

        public void Process()
        {
            var recordCount = 1000; // maximum number of messages fetched per pass
            while (true)
            {
                using (var client = new Pop3Client())
                {
                    // 1. Connect and authenticate
                    client.Connect(server, port, useSSL);
                    client.Authenticate(userName, pwd);

                    var messageCount = client.GetMessageCount(); // messages currently in the mailbox
                    if (messageCount > recordCount)
                    {
                        messageCount = recordCount;
                    }
                    if (messageCount < 1)
                    {
                        break;
                    }
                    var listAllMsg = new List<Message>(messageCount); // temporary store for the fetched messages

                    // 2. Fetch the messages into the list, at most recordCount per pass
                    for (int i = 1; i <= messageCount; i++) // message numbers are 1-based: [1, messageCount]
                    {
                        listAllMsg.Add(client.GetMessage(i));
                    }

                    // 3. Save each message on the client side in .eml format
                    foreach (var message in listAllMsg)
                    {
                        var emlInfo = new System.IO.FileInfo(string.Format("{0}.eml", Guid.NewGuid().ToString("n")));
                        message.SaveToFile(emlInfo);
                    }

                    // 4. Delete the messages that were fetched
                    for (int messageNumber = 1; messageNumber <= listAllMsg.Count; messageNumber++)
                    {
                        // Only marks the message as deleted; nothing is actually removed until the connection is closed
                        client.DeleteMessage(messageNumber);
                    }

                    // 5. Disconnect, which actually completes the deletion
                    client.Disconnect();

                    if (messageCount < recordCount)
                    {
                        break;
                    }
                }
            }
        }

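Mail is received with the open-source component Mail.Net (which is really the union of two projects, OpenSMTP.Net and OpenPop), and calling its interface is simple. Once the code was written, the basic functionality was in place. In the spirit of "faster and more efficient, on a stable base", the last step was performance tuning.

II. Performance tuning and analysis of the resulting bug
Never mind for now whether the time-consuming work here is CPU-bound or I/O-bound: some people, the moment they see a collection being traversed and processed one item at a time, cannot resist the urge to make it multithreaded, asynchronous, and parallel. Go asynchronous whenever conditions allow, and if they don't, create the conditions and go asynchronous anyway; really exploit multithreading and make full use of the server's processing power. I was also confident, having written plenty of conventional multithreaded programs. The business logic here is simple and exception handling is easy to control (even if something goes wrong there are compensating measures, which can be refined later), and in theory the number of messages to collect each day is not large, so the service would not hog CPU and memory for long. A multithreaded asynchronous implementation therefore seemed acceptable. Moreover, analysis makes it obvious that this is a typical I/O-bound application with frequent network access, so of course the effort should go into the I/O handling.

1. Retrieving mail
The Mail.Net sample code shows that fetching a message requires an index starting at 1, and the indexes must be in order. If several requests are issued asynchronously, how is that index passed in? The ordering requirement made me hesitate at first: with a synchronization construct such as lock or Interlocked, the advantage of multithreading would obviously be lost, and I guessed the result might not even be as fast as fetching sequentially.

Analysis aside, let's write some code and see how efficient it is.

I quickly wrote an asynchronous method that takes an integer argument, used Interlocked to track the number of messages fetched, and had each asynchronous call add its Message to the listAllMsg list under a lock when it finished.

The test mail server did not hold many messages. I tested fetching one or two of them: good, the fetch succeeded. The first adjustment had already paid off, which was gratifying.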
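
The approach just described can be sketched roughly as follows. This is a reconstruction under assumptions, not the original service code: fetch is a hypothetical stand-in for client.GetMessage, and NaiveParallelFetch and FetchAll are illustrative names. The sketch only shows how message numbers are handed out with Interlocked and how results are collected under a lock; it says nothing about whether the real fetch call tolerates being invoked from several threads at once.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class NaiveParallelFetch
{
    // fetch is a hypothetical stand-in for client.GetMessage(int).
    public static List<T> FetchAll<T>(int messageCount, int workerCount, Func<int, T> fetch)
    {
        var results = new List<T>(messageCount);
        var gate = new object();
        int next = 0; // last message number handed out; POP3 numbers start at 1

        var workers = new Task[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = Task.Run(() =>
            {
                while (true)
                {
                    // Atomically claim the next message number.
                    int number = Interlocked.Increment(ref next);
                    if (number > messageCount)
                    {
                        break;
                    }

                    T message = fetch(number);

                    // List<T> is not thread-safe, so additions are serialized.
                    lock (gate)
                    {
                        results.Add(message);
                    }
                }
            });
        }

        Task.WaitAll(workers);
        return results;
    }
}
```

A call such as NaiveParallelFetch.FetchAll(messageCount, 4, i => client.GetMessage(i)) would mirror what the service did. The results arrive in no particular order, and the call is only as safe as the fetch delegate itself.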

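2. Saving mail
The tuning went like this: the code that iterates and saves as .eml was changed to use multiple threads, so that the message.SaveToFile calls run in parallel. Tested by saving one or two messages, CPU usage did not visibly rise and saving seemed slightly more efficient. A little more progress.

3. Deleting mail
More tuning: following the multithreaded save, I modified the code that iterates and deletes messages so that deletion also runs in parallel on multiple threads. Good, very good, excellent. By now I was thinking of Thread, ThreadPool, CCR, TPL, EAP, APM: try everything I know, pick the easiest and most efficient one, and look very sophisticated doing it. Ha ha.

Then I quickly wrote an asynchronous delete method and started testing. With few messages, say two or three, it worked and seemed quite fast.

At this point I was already getting ready to celebrate.

4. Analysis of the cause of the bug
Judging from the separate results of steps 1, 2, and 3, each thread seemed able to run independently with no need to communicate or share data, and with asynchronous multithreading, fetching, saving, and deleting were all fast. Mail processing looked set to reach its best state. Then came the integration test of fetch, save, and delete together. After it had run for a while I checked the log, and disaster had struck:

With more test messages, around twenty or thirty, the log showed PopServerException, apparently with some garbled text, and the garbled text seemed to differ each time. Testing again with two or three messages, it sometimes worked and sometimes threw PopServerException, again with garbled text. The stack trace pointed to the place where messages are deleted.

Damn, what is going on here? Am I on bad terms with the mail server? Why is it always PopServerException?

Could it be that the asynchronous delete method is at fault? Asynchronous delete, message numbers starting at 1... an index problem? I still wasn't sure.

Can you already see why multithreaded deletion throws? You know the reason? OK, then the rest of this article has nothing for you, and you can stop reading here.

Here is how I tracked it down.

From the log I first suspected the delete method, but on inspection it looked sound. Next I guessed that the message encoding was wrong at deletion time, then decided that was unlikely: with the synchronous code, the same messages went through fetch, save, and delete without any exception. Still uneasy, I ran several more tests on different messages, with and without attachments, HTML and plain text, and the synchronous code handled them all fine.

At a loss, I opened the Mail.NET source and followed DeleteMessage down to the SendCommand method in Mail.Net's Pop3Client class, and suddenly had a lead. The source of DeleteMessage is: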

        public void DeleteMessage(int messageNumber)
        {
            AssertDisposed();

            ValidateMessageNumber(messageNumber);

            if (State != ConnectionState.Transaction)
                throw new InvalidUseException("You cannot delete any messages without authenticating yourself towards the server first");

            SendCommand("DELE " + messageNumber);
        }

The last line, SendCommand, submits a DELE command. Let's step in and see how it is implemented:

        private void SendCommand(string command)
        {
            // Convert the command with CRLF afterwards as per RFC to a byte array which we can write
            byte[] commandBytes = Encoding.ASCII.GetBytes(command + "\r\n");

            // Write the command to the server
            OutputStream.Write(commandBytes, 0, commandBytes.Length);
            OutputStream.Flush(); // Flush the content as we now wait for a response

            // Read the response from the server. The response should be in ASCII
            LastServerResponse = StreamUtility.ReadLineAsAscii(InputStream);

            IsOkResponse(LastServerResponse);
        }

Note the InputStream and OutputStream properties, defined as follows (auto-properties declared private, which is a rare sight):

        /// <summary>
        /// This is the stream used to read off the server response to a command
        /// </summary>
        private Stream InputStream { get; set; }

        /// <summary>
        /// This is the stream used to write commands to the server
        /// </summary>
        private Stream OutputStream { get; set; }

They are assigned in the Pop3Client method public void Connect(Stream inputStream, Stream outputStream), which in turn is reached through the following Connect overload:

        /// <summary>
        /// Connects to a remote POP3 server
        /// </summary>
        /// <param name="hostname">The <paramref name="hostname"/> of the POP3 server</param>
        /// <param name="port">The port of the POP3 server</param>
        /// <param name="useSsl">True if SSL should be used. False if plain TCP should be used.</param>
        /// <param name="receiveTimeout">Timeout in milliseconds before a socket should time out from reading. Set to 0 or -1 to specify infinite timeout.</param>
        /// <param name="sendTimeout">Timeout in milliseconds before a socket should time out from sending. Set to 0 or -1 to specify infinite timeout.</param>
        /// <param name="certificateValidator">If you want to validate the certificate in a SSL connection, pass a reference to your validator. Supply <see langword="null"/> if default should be used.</param>
        /// <exception cref="PopServerNotAvailableException">If the server did not send an OK message when a connection was established</exception>
        /// <exception cref="PopServerNotFoundException">If it was not possible to connect to the server</exception>
        /// <exception cref="ArgumentNullException">If <paramref name="hostname"/> is <see langword="null"/></exception>
        /// <exception cref="ArgumentOutOfRangeException">If port is not in the range [<see cref="IPEndPoint.MinPort"/>, <see cref="IPEndPoint.MaxPort"/>] or if any of the timeouts is less than -1.</exception>
        public void Connect(string hostname, int port, bool useSsl, int receiveTimeout, int sendTimeout, RemoteCertificateValidationCallback certificateValidator)
        {
            AssertDisposed();

            if (hostname == null)
                throw new ArgumentNullException("hostname");

            if (hostname.Length == 0)
                throw new ArgumentException("hostname cannot be empty", "hostname");

            if (port > IPEndPoint.MaxPort || port < IPEndPoint.MinPort)
                throw new ArgumentOutOfRangeException("port");

            if (receiveTimeout < -1)
                throw new ArgumentOutOfRangeException("receiveTimeout");

            if (sendTimeout < -1)
                throw new ArgumentOutOfRangeException("sendTimeout");

            if (State != ConnectionState.Disconnected)
                throw new InvalidUseException("You cannot ask to connect to a POP3 server, when we are already connected to one. Disconnect first.");

            TcpClient clientSocket = new TcpClient();
            clientSocket.ReceiveTimeout = receiveTimeout;
            clientSocket.SendTimeout = sendTimeout;

            try
            {
                clientSocket.Connect(hostname, port);
            }
            catch (SocketException e)
            {
                // Close the socket - we are not connected, so no need to close stream underneath
                clientSocket.Close();

                DefaultLogger.Log.LogError("Connect(): " + e.Message);
                throw new PopServerNotFoundException("Server not found", e);
            }

            Stream stream;
            if (useSsl)
            {
                // If we want to use SSL, open a new SSLStream on top of the open TCP stream.
                // We also want to close the TCP stream when the SSL stream is closed
                // If a validator was passed to us, use it.
                SslStream sslStream;
                if (certificateValidator == null)
                {
                    sslStream = new SslStream(clientSocket.GetStream(), false);
                }
                else
                {
                    sslStream = new SslStream(clientSocket.GetStream(), false, certificateValidator);
                }
                sslStream.ReadTimeout = receiveTimeout;
                sslStream.WriteTimeout = sendTimeout;

                // Authenticate the server
                sslStream.AuthenticateAsClient(hostname);

                stream = sslStream;
            }
            else
            {
                // If we do not want to use SSL, use plain TCP
                stream = clientSocket.GetStream();
            }

            // Now do the connect with the same stream being used to read and write to
            Connect(stream, stream); // initializes the InputStream and OutputStream properties
        }

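There was the TcpClient object: isn't this simply Socket-based, implementing the POP3 protocol commands through socket programming? It certainly has to open a TCP connection, with the three-way handshake, sending commands to drive the server... it all came back to me.

We know that one TCP connection is one session, and that commands (such as retrieve and delete) have to travel over that TCP connection to the mail server. If multiple threads send commands on one session (retrieval with TOP or RETR, deletion with DELE), none of those operations is thread-safe, so the data on OutputStream and InputStream can easily get out of step and collide. This is very likely the source of the garbled text in the log. Speaking of thread safety, it suddenly dawned on me that fetching mail must have the same problem. To check, I looked at the source of GetMessage: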

        public Message GetMessage(int messageNumber)
        {
            AssertDisposed();

            ValidateMessageNumber(messageNumber);

            if (State != ConnectionState.Transaction)
                throw new InvalidUseException("Cannot fetch a message, when the user has not been authenticated yet");

            byte[] messageContent = GetMessageAsBytes(messageNumber);

            return new Message(messageContent);
        }

Sure enough, the internal GetMessageAsBytes method ends up going through SendCommand as well:

            if (askOnlyForHeaders)
            {
                // 0 is the number of lines of the message body to fetch, therefore it is set to zero to fetch only headers
                SendCommand("TOP " + messageNumber + " 0");
            }
            else
            {
                // Ask for the full message
                SendCommand("RETR " + messageNumber);
            }

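As far as I could trace, the garbled text in the exceptions thrown during testing comes from LastServerResponse (This is the last response the server sent back when a command was issued to it). In the IsOkResponse method, a response that does not start with "+OK" causes a PopServerException: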

        /// <summary>
        /// Tests a string to see if it is a "+OK" string.
        /// An "+OK" string should be returned by a compliant POP3
        /// server if the request could be served.
        ///
        /// The method does only check if it starts with "+OK".
        /// </summary>
        /// <param name="response">The string to examine</param>
        /// <exception cref="PopServerException">Thrown if server did not respond with "+OK" message</exception>
        private static void IsOkResponse(string response)
        {
            if (response == null)
                throw new PopServerException("The stream used to retrieve responses from was closed");

            if (response.StartsWith("+OK", StringComparison.OrdinalIgnoreCase))
                return;

            throw new PopServerException("The server did not respond with a +OK response. The response was: \"" + response + "\"");
        }

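At this point I finally understood that the biggest trap is that Pop3Client is not thread-safe. Found the cause at last, ha ha ha. I was as thrilled as if a goddess had appeared before me, so happy that I nearly forgot the faulty code was my own.

A moment later I calmed down and reflected that I had made a very basic mistake. How could I have forgotten about TCP and thread safety? Aaargh, so tired; I feel like I will never be able to use a library again.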
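
To make the failure mode concrete, here is a minimal, deterministic sketch of one interleaving that could produce the symptom. Everything in it is an assumption for illustration: FakePop3Session is a hypothetical in-memory stand-in for a single POP3 session, not part of Mail.Net, and the response texts are made up. The point is only that when two callers share one response stream, the caller issuing DELE can end up reading a line that belongs to the other caller's RETR response, which then fails the "+OK" check.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for one POP3 session: commands go in and
// response lines come back, in order, on a single shared stream.
public sealed class FakePop3Session
{
    private readonly Queue<string> _responseLines = new Queue<string>();

    public void Write(string command)
    {
        if (command.StartsWith("RETR", StringComparison.Ordinal))
        {
            _responseLines.Enqueue("+OK message follows");
            _responseLines.Enqueue("Subject: hello");
            _responseLines.Enqueue("body text");
            _responseLines.Enqueue("."); // terminator of a multi-line response
        }
        else if (command.StartsWith("DELE", StringComparison.Ordinal))
        {
            _responseLines.Enqueue("+OK message deleted");
        }
        else
        {
            _responseLines.Enqueue("-ERR unknown command");
        }
    }

    public string ReadLine()
    {
        return _responseLines.Count > 0 ? _responseLines.Dequeue() : null;
    }
}

public static class InterleavingDemo
{
    // Replays one possible schedule of two callers sharing a session.
    // Returns { status line read by the RETR caller, status line read by the DELE caller }.
    public static string[] Replay()
    {
        var session = new FakePop3Session();

        session.Write("RETR 1");                    // caller A sends its command
        string statusA = session.ReadLine();        // caller A reads its status line

        session.Write("DELE 2");                    // caller B cuts in before A has read the body
        string statusB = session.ReadLine();        // caller B reads a line of A's message instead

        return new[] { statusA, statusB };
    }

    // Same rule as the library's check: only the "+OK" prefix is examined.
    public static bool IsOk(string response)
    {
        return response != null && response.StartsWith("+OK", StringComparison.OrdinalIgnoreCase);
    }
}
```

In this replay the RETR caller sees a normal "+OK" status, while the DELE caller reads "Subject: hello". With real messages that line would be arbitrary message content, which would be consistent with the text in the exception looking garbled and differing from run to run.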

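Incidentally, saving as .eml goes through the Message object's SaveToFile method, which does not need to talk to the mail server, so saving asynchronously caused no exceptions (and the RawMessage byte array cannot get out of step either). Its source is: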

        /// <summary>
        /// Save this <see cref="Message"/> to a file.
        ///
        /// Can be loaded at a later time using the <see cref="LoadFromFile"/> method.
        /// </summary>
        /// <param name="file">The File location to save the <see cref="Message"/> to. Existent files will be overwritten.</param>
        /// <exception cref="ArgumentNullException">If <paramref name="file"/> is <see langword="null"/></exception>
        /// <exception>Other exceptions relevant to file saving might be thrown as well</exception>
        public void SaveToFile(FileInfo file)
        {
            if (file == null)
                throw new ArgumentNullException("file");

            File.WriteAllBytes(file.FullName, RawMessage);
        }
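
Because SaveToFile only writes already-downloaded bytes to disk, the save step is the one part that can be parallelized without touching the connection. A minimal sketch, assuming the messages were fetched sequentially beforehand; ParallelSave, SaveAll, and the rawMessages parameter are illustrative names, with plain byte arrays standing in for Message.RawMessage:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class ParallelSave
{
    // rawMessages stands in for the RawMessage bytes of messages that were
    // already fetched; no server connection is touched in this method.
    public static string[] SaveAll(IList<byte[]> rawMessages, string directory)
    {
        Directory.CreateDirectory(directory);
        var paths = new string[rawMessages.Count];

        // Each iteration writes its own file and its own array slot, so the
        // iterations share no mutable state.
        Parallel.For(0, rawMessages.Count, i =>
        {
            string path = Path.Combine(directory, Guid.NewGuid().ToString("n") + ".eml");
            File.WriteAllBytes(path, rawMessages[i]);
            paths[i] = path;
        });

        return paths;
    }
}
```

Whether this is actually faster than a plain loop depends on the disk and the message sizes, so it is worth measuring before keeping it.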

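To sum up how this bug came about: I was not sensitive or wary enough about TCP and thread safety, started tuning the moment I saw a for loop, tested with insufficient data, and stepped on a mine. At root, the error came from not thinking thread safety through and from choosing the wrong scenario for asynchrony. There are many similar misuses; a typical one is the misuse of database connections. I once read an article on misusing database connection objects, "解析为何要关闭数据库连接,可不可以不关闭的问题详解" (roughly, "Why database connections must be closed, and whether they can be left open"), and wrote up my own notes at the time, so it stuck with me. It bears repeating: when a single connection is shared for network access, as with a Pop3Client or a SqlConnection inside a using block, multithreading is probably not appropriate, especially under intensive communication with the server. Even with the multithreading techniques applied correctly, performance will not necessarily improve.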

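Some of the libraries and .NET clients we use often, such as FastDFS, Memcached, RabbitMQ, Redis, MongoDB, and ZooKeeper, all access the network, talk to a server, and parse a protocol. I have read the source of several of these clients; as I recall, the FastDFS, Memcached, and Redis clients each contain a pool implementation, and my impression is that they carry no thread-safety risk.

In my experience, they must still be used with due respect. The language and libraries you use may offer a friendly programming experience, the API documentation may be easy to follow, and the calls may look effortless, but using everything well and correctly is not always that easy. It is best to skim the source and grasp the general implementation. Otherwise, taking a library and using it blindly without knowing how it works inside, you may well fall into a trap without realizing it.

When refactoring or tuning with multithreading, one deep question must never be ignored: you need a clear idea of which scenarios suit asynchronous processing, just as you need to know which scenarios suit caching. I would even say that understanding this matters more than knowing how to write the code. Refactoring and tuning must also be done with care, and the test data must be thoroughly prepared; this has been proven many times in real work and has left a particularly deep impression on me. Many business systems run fine when data volumes are small, but under high concurrency and larger data volumes all sorts of baffling problems appear. Take this article: when I tested multithreaded asynchronous fetching and deletion, the mail server held only one or two messages with small bodies and attachments, and asynchronous fetch and delete both ran normally with no exception in the log. With more data, exceptions appeared in the log: investigate, debug, read the source, investigate again... and so this article came to be.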
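
The pooling idea can be sketched as follows. This is not how any of the clients named above actually implement their pools; SimplePool<T> is a hypothetical, minimal illustration in which a borrowed connection belongs to exactly one thread until it is returned, so no single connection ever carries two conversations at once.

```csharp
using System;
using System.Collections.Concurrent;

// A minimal fixed-size pool: each borrower gets exclusive use of one
// connection until it is handed back.
public sealed class SimplePool<T>
{
    private readonly BlockingCollection<T> _idle;

    public SimplePool(int size, Func<T> factory)
    {
        if (size < 1) throw new ArgumentOutOfRangeException("size");
        if (factory == null) throw new ArgumentNullException("factory");

        _idle = new BlockingCollection<T>(new ConcurrentQueue<T>(), size);
        for (int i = 0; i < size; i++)
        {
            _idle.Add(factory());
        }
    }

    // Borrows a connection for the duration of work, blocking while none is idle.
    public TResult Use<TResult>(Func<T, TResult> work)
    {
        T item = _idle.Take();
        try
        {
            return work(item);
        }
        finally
        {
            _idle.Add(item); // always hand the connection back
        }
    }
}
```

A real pool would also need to detect and replace broken connections and dispose of them on shutdown, which this sketch leaves out. Note too that for POP3 a pool of sessions on the same mailbox may not be usable at all, since many servers lock the mailbox for the duration of a session.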