
A Tutorial on Installing and Using Ray, a Distributed Framework for Python

Updated: 2024-08-28 12:24:45   Author: DECHIN
With Ray, you can not only build a cluster very conveniently with conda and Python, but also have distributed tasks executed concurrently automatically, with support for submitting GPU tasks as well. This article introduces the basic installation and usage of Ray, a Python-based distributed framework; interested readers, read on.

Technical Background

Suppose we have several workstations (not servers) on a local network. Is there a simple way to turn them into a small cluster and submit distributed tasks to it? Ray provides a good solution: it lets you build a cluster environment flexibly with conda and Python and submit distributed tasks to it.

This article gives a brief introduction to installing Ray and to its basic usage.

Installation

Since Ray is a Python framework, it can be installed and managed directly with pip:

$ python3 -m pip install ray[default]

Note, however, that the Python and Ray versions must be identical on every machine that joins the cluster. It is therefore advisable to first create the same virtual environment with conda on each node, and then install the same version of Ray. Otherwise, you may hit the following error when adding a node to the cluster:

RuntimeError: Version mismatch: The cluster was started with:
    Ray: 2.7.2
    Python: 3.7.13
This process on node 172.17.0.2 was started with:
    Ray: 2.7.2
    Python: 3.7.5
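One way to keep versions aligned is to create an identical conda environment on every node before installing Ray. A minimal sketch (the environment name and version pins here are illustrative, not from the original article):

```shell
# Run on every node that will join the cluster (versions are illustrative).
conda create -n ray-cluster python=3.10 -y
conda activate ray-cluster
python3 -m pip install "ray[default]==2.7.2"
```

Pinning both the Python version in `conda create` and the Ray version in pip avoids the mismatch error above.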

Starting and Connecting the Service

When setting up the cluster, it is convenient to first configure key-based SSH login:

$ ssh-keygen -t rsa
$ ssh-copy-id user_name@ip_address

With just these two steps, passwordless SSH login to the remote machine is set up (you may be asked for the password once during the process). Then start the Ray service on the head (master) node:

$ ray start --head --dashboard-host='0.0.0.0' --dashboard-port=8265
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: xxx.xxx.xxx.xxx
--------------------
Ray runtime started.
--------------------
Next steps
  To add another node to this Ray cluster, run
    ray start --address='xxx.xxx.xxx.xxx:6379'
  To connect to this Ray cluster:
    import ray
    ray.init()
  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://xxx.xxx.xxx.xxx:8265' ray job submit --working-dir . -- python my_script.py
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on submitting Ray jobs to the Ray cluster.
  To terminate the Ray runtime, run
    ray stop
  To view the status of the cluster, use
    ray status
  To monitor and debug Ray, view the dashboard at
    xxx.xxx.xxx.xxx:8265
  If connection to the dashboard fails, check your firewall settings and network configuration.

That completes the startup, and the output tells you what to do next. For example, to add another node to the cluster, run the following on that node:

$ ray start --address='xxx.xxx.xxx.xxx:6379'

As mentioned above, this requires the Python and Ray versions to match; if they do not, the node will fail to join with the same version-mismatch error shown earlier.

At this point the Ray cluster is fully deployed. It really is that simple.

Basic Usage

Let's start with the simplest possible test:

# test_ray.py
import ray

ray.init()
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

This script prints the compute resources of the remote cluster. We can submit it as a local job like this:

$ RAY_ADDRESS='http://xxx.xxx.xxx.xxx:8265' ray job submit --working-dir . -- python test_ray.py 
Job submission server address: http://xxx.xxx.xxx.xxx:8265
2024-08-27 07:35:10,751 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_4b79155b5de665ce.zip.
2024-08-27 07:35:10,751 INFO packaging.py:518 -- Creating a file package for local directory '.'.
-------------------------------------------------------
Job 'raysubmit_7Uqy8LjP4dxjZxGa' submitted successfully
-------------------------------------------------------
Next steps
  Query the logs of the job:
    ray job logs raysubmit_7Uqy8LjP4dxjZxGa
  Query the status of the job:
    ray job status raysubmit_7Uqy8LjP4dxjZxGa
  Request the job to be stopped:
    ray job stop raysubmit_7Uqy8LjP4dxjZxGa
Tailing logs until the job exits (disable with --no-wait):
2024-08-27 15:35:14,079 INFO worker.py:1330 -- Using address xxx.xxx.xxx.xxx:6379 set in the environment variable RAY_ADDRESS
2024-08-27 15:35:14,079 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: xxx.xxx.xxx.xxx:6379...
2024-08-27 15:35:14,103 INFO worker.py:1639 -- Connected to Ray cluster. View the dashboard at http://xxx.xxx.xxx.xxx:8265 
This cluster consists of
    1 nodes in total
    48.0 CPU resources in total
------------------------------------------
Job 'raysubmit_7Uqy8LjP4dxjZxGa' succeeded
------------------------------------------

The output tells us that the remote cluster has a single node with 48 available CPU cores. Besides the terminal output, more detailed job-management information is available through the dashboard link shown in the log.

Let's also submit a script that prints the location of an installed package, to confirm that the job really runs remotely rather than locally:

import ray

ray.init()

import numpy as np
print(np.__file__)

The returned log is:

$ RAY_ADDRESS='http://xxx.xxx.xxx.xxx:8265' ray job submit --working-dir . -- python test_ray.py 
Job submission server address: http://xxx.xxx.xxx.xxx:8265
2024-08-27 07:46:10,645 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_5bba1a7144beb522.zip.
2024-08-27 07:46:10,658 INFO packaging.py:518 -- Creating a file package for local directory '.'.
-------------------------------------------------------
Job 'raysubmit_kQ3XgE4Hxp3dkmuU' submitted successfully
-------------------------------------------------------
Next steps
  Query the logs of the job:
    ray job logs raysubmit_kQ3XgE4Hxp3dkmuU
  Query the status of the job:
    ray job status raysubmit_kQ3XgE4Hxp3dkmuU
  Request the job to be stopped:
    ray job stop raysubmit_kQ3XgE4Hxp3dkmuU
Tailing logs until the job exits (disable with --no-wait):
2024-08-27 15:46:12,456 INFO worker.py:1330 -- Using address xxx.xxx.xxx.xxx:6379 set in the environment variable RAY_ADDRESS
2024-08-27 15:46:12,457 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: xxx.xxx.xxx.xxx:6379...
2024-08-27 15:46:12,470 INFO worker.py:1639 -- Connected to Ray cluster. View the dashboard at http://xxx.xxx.xxx.xxx:8265 
/home/dechin/anaconda3/envs/mindspore-latest/lib/python3.7/site-packages/numpy/__init__.py
------------------------------------------
Job 'raysubmit_kQ3XgE4Hxp3dkmuU' succeeded
------------------------------------------
$ python3 -m pip show numpy
Name: numpy
Version: 1.21.6
Summary: NumPy is the fundamental package for array computing with Python.
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email: 
License: BSD
Location: /usr/local/python-3.7.5/lib/python3.7/site-packages
Requires: 
Required-by: CyFES, h5py, hadder, matplotlib, mindinsight, mindspore, mindspore-serving, pandas, ray, scikit-learn, scipy

Here we can see that the numpy used by the submitted job lives in the remote mindspore-latest virtual environment, whereas the local numpy is not in a virtual environment, which confirms that the job really ran remotely. The submission logs can likewise be viewed on the dashboard.

Next, let's test the concurrency of the distributed framework:

import ray

ray.init()

@ray.remote(num_returns=1)
def cpu_task():
    import time
    import numpy as np
    time.sleep(2)
    # Monte Carlo estimate of pi: sample points in the unit square and
    # count how many fall inside the unit circle.
    nums = 100000
    arr = np.random.random((2, nums))
    arr2 = arr[0]**2 + arr[1]**2
    pi = np.where(arr2 <= 1, 1, 0).sum() * 4 / nums
    return pi

num_conc = 10
res = ray.get([cpu_task.remote() for _ in range(num_conc)])
print(sum(res) / num_conc)
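The estimator used by each task can be sanity-checked locally without Ray. A minimal stdlib-only sketch (the sample count and seed are illustrative, not part of the original workflow):

```python
# Pure-Python Monte Carlo estimate of pi, no Ray or numpy required.
import random

def estimate_pi(num_samples: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # Count points falling inside the quarter unit circle.
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

With 100000 samples the estimate typically lands within about 0.01 of pi, matching the scale of accuracy seen in the Ray run below.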

The Ray script above uses the Monte Carlo method to estimate pi: it submits 10 tasks at once, and each task samples 100000 points and sleeps for 2 s. Run sequentially, the sleeps alone would take 20 s in theory. Submitting the job produces the following output:

$ time RAY_ADDRESS='http://xxx.xxx.xxx.xxx:8265' ray job submit --working-dir . --entrypoint-num-cpus 10 -- python test_ray.py 
Job submission server address: http://xxx.xxx.xxx.xxx:8265
2024-08-27 08:30:13,315 INFO dashboard_sdk.py:385 -- Package gcs://_ray_pkg_d66b052eb6944465.zip already exists, skipping upload.
-------------------------------------------------------
Job 'raysubmit_Ur6MAvP7DYiCT6Uz' submitted successfully
-------------------------------------------------------
Next steps
  Query the logs of the job:
    ray job logs raysubmit_Ur6MAvP7DYiCT6Uz
  Query the status of the job:
    ray job status raysubmit_Ur6MAvP7DYiCT6Uz
  Request the job to be stopped:
    ray job stop raysubmit_Ur6MAvP7DYiCT6Uz
Tailing logs until the job exits (disable with --no-wait):
2024-08-27 16:30:15,032 INFO worker.py:1330 -- Using address xxx.xxx.xxx.xxx:6379 set in the environment variable RAY_ADDRESS
2024-08-27 16:30:15,033 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: xxx.xxx.xxx.xxx:6379...
2024-08-27 16:30:15,058 INFO worker.py:1639 -- Connected to Ray cluster. View the dashboard at http://xxx.xxx.xxx.xxx:8265 
3.141656
------------------------------------------
Job 'raysubmit_Ur6MAvP7DYiCT6Uz' succeeded
------------------------------------------
real    0m7.656s
user    0m0.414s
sys     0m0.010s
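The wall-clock effect in the timing above can be mimicked locally: tasks that each sleep finish in roughly one sleep's time when run in parallel, not the sum of the sleeps. A minimal thread-based sketch (threads stand in for Ray workers here; the durations are illustrative):

```python
# Ten 0.2 s sleeps run in parallel should take ~0.2 s of wall time, not 2 s.
import time
from concurrent.futures import ThreadPoolExecutor

def sleepy_task(duration: float) -> float:
    time.sleep(duration)
    return duration

start = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(sleepy_task, [0.2] * 10))
elapsed = time.monotonic() - start
print(f"total sleep requested: {sum(results):.1f} s, wall time: {elapsed:.2f} s")
```

The same reasoning applies to the Ray job: ten 2-second sleeps overlap, so the compute portion of the run takes about 2 s.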

The total run time is 7.656 s, of which roughly 5 s is network delay, so the actual concurrent run took about 2 s, roughly the sleep time of a single task. In other words, the remotely submitted tasks really do execute concurrently. Averaging the returned results gives a pi estimate of 3.141656. Besides ordinary CPU tasks, GPU tasks can also be submitted:

import ray

ray.init()

@ray.remote(num_returns=1, num_gpus=1)
def test_ms():
    import os
    os.environ['GLOG_v'] = '4'
    # Note: the variable name is CUDA_VISIBLE_DEVICES (plural).
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    import mindspore as ms
    ms.set_context(device_target="GPU", device_id=0)
    a = ms.Tensor([1, 2, 3], ms.float32)
    return a.asnumpy().sum()

res = ray.get(test_ms.remote())
ray.shutdown()
print(res)

This task uses MindSpore to create a simple Tensor, compute its sum, and return it to the local side. The output is:

$ RAY_ADDRESS='http://xxx.xxx.xxx.xxx:8265' ray job submit --working-dir . --entrypoint-num-gpus 1 -- python test_ray.py 
Job submission server address: http://xxx.xxx.xxx.xxx:8265
2024-08-28 01:16:38,712 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_10019cd9fa9bdc38.zip.
2024-08-28 01:16:38,712 INFO packaging.py:518 -- Creating a file package for local directory '.'.

-------------------------------------------------------
Job 'raysubmit_RUvkEqnkjNitKmnJ' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_RUvkEqnkjNitKmnJ
  Query the status of the job:
    ray job status raysubmit_RUvkEqnkjNitKmnJ
  Request the job to be stopped:
    ray job stop raysubmit_RUvkEqnkjNitKmnJ

Tailing logs until the job exits (disable with --no-wait):
2024-08-28 09:16:41,960 INFO worker.py:1330 -- Using address xxx.xxx.xxx.xxx:6379 set in the environment variable RAY_ADDRESS
2024-08-28 09:16:41,960 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: xxx.xxx.xxx.xxx:6379...
2024-08-28 09:16:41,974 INFO worker.py:1639 -- Connected to Ray cluster. View the dashboard at http://xxx.xxx.xxx.xxx:8265 
6.0

------------------------------------------
Job 'raysubmit_RUvkEqnkjNitKmnJ' succeeded
------------------------------------------

The returned result is 6.0, which is correct.

Viewing and Managing Jobs

Each of the job submissions above printed a corresponding job ID, which we can use on the head node to check the job's status:

$ ray job status raysubmit_RUvkEqnkjNitKmnJ

We can view the job's output:

$ ray job logs raysubmit_RUvkEqnkjNitKmnJ

And we can stop a running job:

$ ray job stop raysubmit_RUvkEqnkjNitKmnJ

Summary

This article introduced the basic installation and usage of Ray, a Python-based distributed framework. With Ray, you can not only build a cluster very conveniently with conda and Python, but also have distributed tasks executed concurrently automatically, with support for submitting GPU tasks, which greatly reduces the effort of hand-rolling distributed code.

Copyright Notice

This article was first published at: https://www.cnblogs.com/dechinphy/p/ray.html

Author ID: DechinPhy

More original articles: https://www.cnblogs.com/dechinphy/

Buy the author a coffee: https://www.cnblogs.com/dechinphy/gallery/image/379634.html
