Process hangs caused by calling Process.waitFor, and how to fix them

Updated: 2021-12-14 09:04:38 Author: 我是安靜的美男子

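This article describes a process hang caused by calling Process.waitFor and how to resolve it. I hope it is a useful reference; corrections and notes on anything I have overlooked are welcome.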

Background

To call a shell script from Java, you can use Runtime.exec or ProcessBuilder.start. Both return a Process object, through which you can read the script's output and then handle it in Java.

For example:

		try 
		{
			Process process = Runtime.getRuntime().exec(cmd);
			process.waitFor();
			// do something ...
		} 
		catch (Exception e) 
		{			
			e.printStackTrace();
		}

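Secure coding guidelines usually point out that using Process.waitFor can cause a process to block or even deadlock. What does that mean in practice? A concrete example makes it clear.

Problem description

Call a shell script from Java code. Once it starts, both the Java process and the shell process hang and never finish.

Java code, processtest.java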

		try 
		{
			Process process = Runtime.getRuntime().exec(cmd);
			System.out.println("start run cmd=" + cmd);
			
			process.waitFor();
			System.out.println("finish run cmd=" + cmd);
		} 
		catch (Exception e) 
		{			
			e.printStackTrace();
		}

The shell script being called, doecho.sh

#!/bin/bash
for((i=0; ;i++))
do    
    echo -n "0123456789"
    echo $i >> count.log
done

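Cause of the hang

When the main process calls Runtime.exec, a child process is created to run the shell script. Once created, the child runs independently of the main process.
The main process needs to wait for the script to finish before it handles the return value or output, so it calls Process.waitFor to wait for the child.
As the shell script shows, the child does nothing but print continuously. The main process could read and handle that output through Process.getInputStream and Process.getErrorStream.
Here, however, the child keeps sending data to the main process while the main process is already blocked in Process.waitFor and reads nothing. Once the buffer between the child and the main process fills up, the child can no longer write and blocks as well.
The child is now waiting for the main process to read data, and the main process is waiting for the child to exit. Each waits for the other, which is a deadlock.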
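
The hang described above can be observed safely by bounding the wait. The following is a minimal sketch, not the author's test code: the class name UnreadOutputDemo, the helper finishesWithoutReading, and the inline script BIG_OUTPUT_SCRIPT are made up for illustration, and it assumes a Unix-like system with sh on the PATH and a pipe buffer well below 220,000 bytes. It uses Process.waitFor(long, TimeUnit) so that the parent gives up instead of waiting forever.

```java
import java.util.concurrent.TimeUnit;

public class UnreadOutputDemo {
    // Hypothetical script: prints 20000 lines of 11 bytes (about 220 KB) to stdout.
    public static final String BIG_OUTPUT_SCRIPT =
            "i=0; while [ $i -lt 20000 ]; do echo 0123456789; i=$((i+1)); done";

    /**
     * Starts the command, never reads its output, and waits at most timeoutMillis.
     * Returns true if the child exited in time, false if it was still running.
     */
    public static boolean finishesWithoutReading(long timeoutMillis, String... command)
            throws Exception {
        Process process = new ProcessBuilder(command).start();
        try {
            return process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Kill the child so a blocked writer does not linger after the demo.
            process.destroyForcibly();
        }
    }

    public static void main(String[] args) throws Exception {
        // A child that prints a few bytes fits in the buffer and exits normally.
        System.out.println("small output finished: "
                + finishesWithoutReading(2000, "sh", "-c", "echo hi"));
        // A child that prints more than the buffer holds blocks until someone reads.
        System.out.println("large output finished: "
                + finishesWithoutReading(2000, "sh", "-c", BIG_OUTPUT_SCRIPT));
    }
}
```

If the second call reports that the child did not finish, the child is blocked on a full buffer exactly as described: nothing is reading its output.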

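Solution

Based on this analysis, it is enough for the main process to keep draining the buffer before it calls waitFor. So, before calling waitFor, we can start two extra threads, one to consume the InputStream and one to consume the ErrorStream. Sample code: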

		try 
		{
			final Process process = Runtime.getRuntime().exec(cmd);
			System.out.println("start run cmd=" + cmd);
			
			// Thread that drains the child's standard output (InputStream)
			new Thread()
			{
				@Override
				public void run()
				{
					BufferedReader in = new BufferedReader(new InputStreamReader(process.getInputStream())); 
					String line = null;
					
					try 
					{
						while((line = in.readLine()) != null)
						{
							System.out.println("output: " + line);
						}
					} 
					catch (IOException e) 
					{						
						e.printStackTrace();
					}
					finally
					{
						try 
						{
							in.close();
						} 
						catch (IOException e) 
						{
							e.printStackTrace();
						}
					}
				}
			}.start();
			
			// Thread that drains the child's standard error (ErrorStream)
			new Thread()
			{
				@Override
				public void run()
				{
					BufferedReader err = new BufferedReader(new InputStreamReader(process.getErrorStream())); 
					String line = null;
					
					try 
					{
						while((line = err.readLine()) != null)
						{
							System.out.println("err: " + line);
						}
					} 
					catch (IOException e) 
					{						
						e.printStackTrace();
					}
					finally
					{
						try 
						{
							err.close();
						} 
						catch (IOException e) 
						{
							e.printStackTrace();
						}
					}
				}
			}.start();
			
			process.waitFor();
			System.out.println("finish run cmd=" + cmd);
		} 
		catch (Exception e) 
		{			
			e.printStackTrace();
		}
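
An alternative that avoids the extra threads is to merge stderr into stdout and drain the single stream in the calling thread before waitFor. This is a sketch of a different technique from the two-thread code above, not the author's method. ProcessBuilder.redirectErrorStream is a standard JDK call; the class name MergedOutputRunner and the run helper are hypothetical, and UTF-8 output is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class MergedOutputRunner {
    /**
     * Runs the command, appends every output line (stdout and stderr merged)
     * to sink, and returns the exit code.
     */
    public static int run(List<String> command, StringBuilder sink)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // stderr now arrives on getInputStream()
        Process process = builder.start();
        process.getOutputStream().close(); // we send no input, so the child sees EOF on stdin

        // Drain the only output stream until the child closes it.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sink.append(line).append('\n');
            }
        }
        // The stream is at EOF, so waitFor cannot be blocked by a full buffer.
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        StringBuilder out = new StringBuilder();
        int exit = run(List.of("sh", "-c", "echo to-stdout; echo to-stderr 1>&2"), out);
        System.out.print(out);
        System.out.println("exit code: " + exit);
    }
}
```

The trade-off is that stdout and stderr can no longer be told apart. When they must stay separate, the two-thread approach still applies.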

What the JDK documentation says

By default, the created subprocess does not have its own terminal or console.

All its standard I/O (i.e. stdin, stdout, stderr) operations will be redirected to the parent process, where they can be accessed via the streams obtained using the methods getOutputStream(), getInputStream(), and getErrorStream().

The parent process uses these streams to feed input to and get output from the subprocess.

Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, or even deadlock.

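Two points follow from the JDK documentation:

If the buffers the platform uses for the standard input and output streams are limited in size, reads and writes may block or even deadlock. (This was analyzed above.)
The child's standard I/O is redirected to the parent, which can access it through the corresponding methods. (But how is the I/O redirected?)

Behind the scenes

To answer that question, we can try to analyze it at the operating-system level.

First, ps shows two additional processes on Linux: a Java process and a shell process, with the shell being a child of the Java process.

Next, ps shows pipe_w for the shell process. At first I assumed pipe_w stood for pipe_write, but /proc/pid/wchan shows that it is actually pipe_wait. /proc/pid/wchan normally shows the name, or the address, of the kernel function in which the process is sleeping. This suggests that the process went to sleep while operating on a pipe. A pipe is a form of IPC commonly used between parent and child processes, so a reasonable guess is that the parent and child communicate over a pipe and that this communication has blocked.

In addition, look at the fd information of the parent and the child, i.e. /proc/pid/fd. The child's fds 0/1/2 (stdin/stdout/stderr) are each redirected to one of three pipes, and the parent process holds references to the same three pipes.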
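
The same /proc/pid/fd check can be done from Java. The sketch below is Linux-only and assumes /proc is mounted, the child belongs to the same user, and Java 9 or later (for Process.pid()). The class name ChildFdInspector and the stdioTargets helper are hypothetical names for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ChildFdInspector {
    /**
     * Starts the command and returns the symlink targets of the child's
     * /proc/<pid>/fd/0, /1 and /2 (stdin, stdout, stderr).
     */
    public static String[] stdioTargets(String... command) throws IOException {
        Process process = new ProcessBuilder(command).start();
        try {
            String[] targets = new String[3];
            for (int fd = 0; fd < 3; fd++) {
                Path link = Paths.get("/proc", Long.toString(process.pid()),
                        "fd", Integer.toString(fd));
                targets[fd] = Files.readSymbolicLink(link).toString();
            }
            return targets;
        } finally {
            process.destroyForcibly(); // the child is only needed for the lookup
        }
    }

    public static void main(String[] args) throws IOException {
        String[] targets = stdioTargets("sleep", "5");
        for (int fd = 0; fd < targets.length; fd++) {
            System.out.println("fd " + fd + " -> " + targets[fd]);
        }
    }
}
```

On Linux, a target of the form pipe:[inode] means the descriptor is one end of a pipe, which matches the observation above.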

To sum up, the sequence is presumably this: the child keeps writing to the pipe while the parent never reads from it, so the pipe fills up, the child cannot write any more, and it ends up in pipe_wait. So how big is a pipe?

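Measuring the pipe size

Because doecho.sh logs a counter for every 10-character write, count.log shows how much data the child ultimately sent. After the child hung, the value in count.log stayed at 6543. So the child hung after writing 6543*10=65430 bytes to the pipe. 65536-65430=106 bytes, i.e. 106 bytes short of 64K.

A second test writes 1K at a time and records how much can be written in total. The code is in test_pipe_size.sh. The result is 64K. The two results differ by 106 bytes, so how big is the pipe really?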
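
test_pipe_size.sh is not reproduced here. As a separate illustration of the same idea in Java, the sketch below writes fixed-size chunks into a pipe that nobody reads, using a non-blocking sink so the probe stops instead of hanging. It assumes that java.nio.channels.Pipe is backed by an ordinary OS pipe, as I understand it to be on Linux; on other platforms the number may reflect a different mechanism. The class name PipeCapacityProbe and the probe helper are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

public class PipeCapacityProbe {
    /**
     * Writes chunkSize-byte chunks into an unread pipe until a non-blocking
     * write accepts nothing, then returns the total number of bytes accepted.
     */
    public static long probe(int chunkSize) throws IOException {
        Pipe pipe = Pipe.open();
        try (Pipe.SinkChannel sink = pipe.sink();
             Pipe.SourceChannel source = pipe.source()) { // kept open, never read
            sink.configureBlocking(false); // a full pipe yields 0 instead of blocking
            ByteBuffer chunk = ByteBuffer.allocate(chunkSize);
            long total = 0;
            while (true) {
                chunk.clear(); // offer the whole chunk again
                int written = sink.write(chunk);
                if (written == 0) {
                    return total; // the pipe accepts no more data
                }
                total += written;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bytes accepted with 1K chunks: " + probe(1024));
        System.out.println("bytes accepted with 10-byte chunks: " + probe(10));
    }
}
```

Running it with both chunk sizes gives a way to check whether the usable capacity depends on the write size, which is the question the two shell tests raise.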

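Pipes on Linux

The most direct approach is to read the source. The pipe implementation lives mainly in linux/fs/pipe.c. The function of interest is pipe_wait; the listing below is pipe_read, which calls pipe_wait when there is nothing left to read.

 static ssize_t
 pipe_read(struct kiocb *iocb, struct iov_iter *to)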
 {
         size_t total_len = iov_iter_count(to);
         struct file *filp = iocb->ki_filp;
         struct pipe_inode_info *pipe = filp->private_data;
         int do_wakeup;
         ssize_t ret;
 
         /* Null read succeeds. */
         if (unlikely(total_len == 0))
                 return 0;
 
         do_wakeup = 0;
         ret = 0;
         __pipe_lock(pipe);
         for (;;) {
                 int bufs = pipe->nrbufs;
                 if (bufs) {
                         int curbuf = pipe->curbuf;
                         struct pipe_buffer *buf = pipe->bufs + curbuf;
                         const struct pipe_buf_operations *ops = buf->ops;
                         size_t chars = buf->len;
                         size_t written;
                         int error;
 
                         if (chars > total_len)
                                 chars = total_len;
 
                         error = ops->confirm(pipe, buf);
                         if (error) {
                                 if (!ret)
                                         ret = error;
                                 break;
                         }
 
                         written = copy_page_to_iter(buf->page, buf->offset, chars, to);
                         if (unlikely(written < chars)) {
                                 if (!ret)
                                         ret = -EFAULT;
                                 break;
                         }
                         ret += chars;
                         buf->offset += chars;
                         buf->len -= chars;
 
                         /* Was it a packet buffer? Clean up and exit */
                         if (buf->flags & PIPE_BUF_FLAG_PACKET) {
                                 total_len = chars;
                                 buf->len = 0;
                         }
 
                         if (!buf->len) {
                                 buf->ops = NULL;
                                 ops->release(pipe, buf);
                                 curbuf = (curbuf + 1) & (pipe->buffers - 1);
                                 pipe->curbuf = curbuf;
                                 pipe->nrbufs = --bufs;
                                 do_wakeup = 1;
                         }
                         total_len -= chars;
                         if (!total_len)
                                 break;  /* common path: read succeeded */
                 }
                 if (bufs)       /* More to do? */
                         continue;
                 if (!pipe->writers)
                         break;
                 if (!pipe->waiting_writers) {
                         /* syscall merging: Usually we must not sleep
                          * if O_NONBLOCK is set, or if we got some data.
                          * But if a writer sleeps in kernel space, then
                          * we can wait for that data without violating POSIX.
                          */
                         if (ret)
                                 break;
                         if (filp->f_flags & O_NONBLOCK) {
                                 ret = -EAGAIN;
                                 break;
                         }
                 }
                 if (signal_pending(current)) {
                         if (!ret)
                                 ret = -ERESTARTSYS;
                         break;
                 }
                 if (do_wakeup) {
                         wake_up_interruptible_sync_poll(&pipe->wait, POLLOUT | POLLWRNORM);
                         kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
                 }
                 pipe_wait(pipe);
         }
         __pipe_unlock(pipe);
 
         /* Signal writers asynchronously that there is more room. */
         if (do_wakeup) {
                 wake_up_interruptible_sync_poll(&pipe->wait, POLLOUT | POLLWRNORM);
                 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
         }
         if (ret > 0)
                 file_accessed(filp);
         return ret;
 }

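As the code shows, a pipe is organized as a ring: a circular array of struct pipe_buffer entries indexed by curbuf, which wraps around via (curbuf + 1) & (pipe->buffers - 1). Each pipe_buffer corresponds to one page. The ring has 16 entries by default, so the total pipe buffer size is 16*page. With a 4K page size, the total is 16*4K=64K.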
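
The ring indexing and the capacity arithmetic can be sketched in a few lines of Java. This is only an illustration of the index expression seen in pipe_read, not kernel code; the class name PipeRingSketch and its constants are hypothetical, and the 4K page size is an assumption that varies by platform.

```java
public class PipeRingSketch {
    // Number of slots in the ring; must be a power of two for the mask to work.
    public static final int BUFFERS = 16;
    // Assumed page size in bytes.
    public static final int PAGE_SIZE = 4096;

    /** Advances a slot index by one, wrapping to 0 after the last slot. */
    public static int next(int curbuf) {
        return (curbuf + 1) & (BUFFERS - 1);
    }

    /** Total capacity when every slot holds one full page. */
    public static int capacityBytes() {
        return BUFFERS * PAGE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println("slot after 0:  " + next(0));
        System.out.println("slot after 15: " + next(15));
        System.out.println("capacity: " + capacityBytes() + " bytes");
    }
}
```

Masking with BUFFERS - 1 is equivalent to taking the remainder modulo BUFFERS only because BUFFERS is a power of two.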