
Deep Q-Network (DQN) Reinforcement Learning: Principles and Practice

Updated: 2025-04-04 10:15:39   Author: wx62088446a1f70
Deep Q-learning combines deep neural networks with reinforcement learning, overcoming the limitations of classical Q-learning in high-dimensional state spaces. With techniques such as experience replay and a target network, DQN can learn effective policies in complex environments. This article walks through a complete implementation on the CartPole environment to show the core ideas and implementation details of DQN.

DQN (Deep Q-Network) is an algorithm that combines deep learning with reinforcement learning. Proposed by DeepMind, it solves Markov decision process (MDP) problems with discrete action spaces and was one of the first algorithms to successfully apply deep learning to reinforcement learning tasks. In short, DQN is the Q-learning algorithm with the Q function approximated by a deep neural network.

1. Reinforcement Learning Basics

Reinforcement learning is an important branch of machine learning. Its core idea is to learn an optimal policy by interacting with an environment. Unlike supervised learning, reinforcement learning does not require pre-prepared input-output pairs; instead, it learns by trial and error, guided by reward signals from the environment.

1.1 Core Concepts

• Agent: the entity that learns and makes decisions
• Environment: what the agent interacts with
• State: the current situation of the environment
• Action: the behavior the agent takes
• Reward: the environment's feedback for an action
• Policy: a mapping from states to actions
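
To make these concepts concrete, here is a minimal interaction loop (a sketch that assumes the gym CartPole-v1 environment, the pre-0.26 gym API, and a random policy as a stand-in for a learned one), showing how an agent observes states, takes actions, and receives rewards:

import gym

env = gym.make('CartPole-v1')   # the environment
state = env.reset()             # initial state (pre-0.26 gym API)
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()               # random placeholder policy
    next_state, reward, done, _ = env.step(action)   # feedback from the environment
    total_reward += reward
    state = next_state

print("Episode reward:", total_reward)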

1.2 Markov Decision Process

A reinforcement learning problem is usually modeled as a Markov decision process (MDP), defined by the five-tuple (S, A, P, R, γ):
• S: the set of states
• A: the set of actions
• P: the state transition probabilities
• R: the reward function
• γ: the discount factor (0 ≤ γ < 1)
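
Under this model, the agent's objective is to maximize the expected discounted return, with γ weighting future rewards:

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k≥0} γ^k · r_{t+k+1}

A γ close to 0 makes the agent short-sighted, while a γ close to 1 makes it value long-term reward.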

2. Q-Learning and Deep Q-Networks

2.1 The Q-Learning Algorithm

Q-learning is a classic reinforcement learning algorithm. It maintains a table of Q-values that estimate the long-term return of taking each action in each state.
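
The tabular update rule, which the code below implements, moves the current estimate toward the temporal-difference (TD) target:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') − Q(s, a) ]

where α is the learning rate and γ is the discount factor.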

import numpy as np

# Initialize the Q-table (state_space_size and action_space_size are
# placeholders for a discrete environment)
q_table = np.zeros((state_space_size, action_space_size))

# Q-learning hyperparameters
alpha = 0.1   # learning rate
gamma = 0.99  # discount factor

for episode in range(total_episodes):
    state = env.reset()   # pre-0.26 gym API; gym>=0.26 returns (obs, info)
    done = False
    
    while not done:
        action = select_action(state)  # epsilon-greedy policy (see sketch below)
        next_state, reward, done, _ = env.step(action)
        
        # Q-value update: move Q(s, a) toward the TD target
        q_table[state, action] = q_table[state, action] + alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
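
The select_action helper above is not defined in the snippet; a minimal ε-greedy sketch, reusing q_table and action_space_size from above with an assumed exploration rate epsilon, could look like this:

import random

epsilon = 0.1  # exploration rate (a hypothetical value, for illustration only)

def select_action(state):
    # With probability epsilon pick a random action (explore),
    # otherwise pick the greedy action from the current Q-table (exploit).
    if random.random() < epsilon:
        return random.randint(0, action_space_size - 1)
    return int(np.argmax(q_table[state]))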

2.2 Deep Q-Network (DQN)

When the state space is large, a Q-table becomes impractical. DQN instead uses a neural network to approximate the Q function:

import torch
import torch.nn as nn
import torch.optim as optim

class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, output_dim)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
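
With this network, the Q-table lookup becomes a forward pass. A quick usage sketch (the 4-dimensional state and 2 actions are just CartPole's sizes, used here for illustration):

net = DQN(input_dim=4, output_dim=2)               # e.g. CartPole: 4 state dims, 2 actions
state = torch.FloatTensor([0.0, 0.1, 0.02, -0.3])  # an example state vector
q_values = net(state)                              # tensor of shape (2,): one Q-value per action
action = q_values.argmax().item()                  # greedy action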

3. Key Techniques in DQN

3.1 Experience Replay

Experience replay addresses sample correlation and non-stationary data distributions: transitions are stored in a buffer and later sampled uniformly at random for training:

from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        return len(self.buffer)
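
A brief usage sketch of the buffer (the transition values below are made up purely for illustration):

buffer = ReplayBuffer(capacity=10000)

# Store a few made-up transitions (state, action, reward, next_state, done)
buffer.push([0.0, 0.1], 0, 1.0, [0.1, 0.2], False)
buffer.push([0.1, 0.2], 1, 1.0, [0.2, 0.3], False)
buffer.push([0.2, 0.3], 0, 0.0, [0.0, 0.0], True)

batch = buffer.sample(2)   # uniform random sampling breaks temporal correlation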

3.2 Target Network

A separate target network, updated only periodically, stabilizes training by keeping the bootstrap targets fixed between updates:

target_net = DQN(input_dim, output_dim).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

# Periodically sync the target network with the policy network
# (steps_done is a global step counter; the full implementation below syncs per episode instead)
if steps_done % TARGET_UPDATE == 0:
    target_net.load_state_dict(policy_net.state_dict())
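
For a sampled transition (s, a, r, s', done), the loss DQN minimizes is the squared TD error, with the bootstrap term coming from the frozen target network:

L(θ) = ( r + (1 − done) · γ · max_a' Q_target(s', a') − Q(s, a; θ) )²

which is exactly what the train() function in the full implementation below computes.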

4. Complete DQN Implementation (CartPole)

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque
import matplotlib.pyplot as plt

# Run on GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyperparameters
BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 10
LR = 0.001

# Initialize the environment (pre-0.26 gym API: reset() returns the state, step() returns 4 values)
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

# Network definition (same architecture as section 2.2, with 64 hidden units)
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, output_dim)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Initialize the networks, optimizer, and replay buffer (ReplayBuffer as defined in section 3.1)
policy_net = DQN(state_dim, action_dim).to(device)
target_net = DQN(state_dim, action_dim).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()
optimizer = optim.Adam(policy_net.parameters(), lr=LR)
memory = ReplayBuffer(10000)

# Training step: sample a minibatch and perform one gradient update
def train():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    batch = list(zip(*transitions))
    
    state_batch = torch.FloatTensor(np.array(batch[0])).to(device)
    action_batch = torch.LongTensor(np.array(batch[1])).to(device)
    reward_batch = torch.FloatTensor(np.array(batch[2])).to(device)
    next_state_batch = torch.FloatTensor(np.array(batch[3])).to(device)
    done_batch = torch.FloatTensor(np.array(batch[4])).to(device)
    
    # Q(s, a) for the actions actually taken
    current_q = policy_net(state_batch).gather(1, action_batch.unsqueeze(1))
    # max_a' Q_target(s', a'), detached so no gradients flow into the target network
    next_q = target_net(next_state_batch).max(1)[0].detach()
    # TD target; (1 - done) removes the bootstrap term for terminal states
    expected_q = reward_batch + (1 - done_batch) * GAMMA * next_q
    
    loss = nn.MSELoss()(current_q.squeeze(), expected_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Main training loop
episode_rewards = []
for episode in range(500):
    state = env.reset()   # pre-0.26 gym API
    total_reward = 0
    done = False
    
    while not done:
        # Epsilon-greedy action selection with exponential decay over episodes
        eps_threshold = EPS_END + (EPS_START - EPS_END) * \
            np.exp(-1. * episode / EPS_DECAY)
        if random.random() > eps_threshold:
            with torch.no_grad():
                action = policy_net(torch.FloatTensor(state).to(device)).argmax().item()
        else:
            action = random.randint(0, action_dim - 1)
        
        next_state, reward, done, _ = env.step(action)
        memory.push(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        
        train()
    
    # Periodically sync the target network with the policy network
    if episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())
    
    episode_rewards.append(total_reward)
    if episode % 10 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward}")

# Plot the training curve
plt.plot(episode_rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('DQN Training Progress')
plt.show()
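
Once training finishes, the learned policy can be sanity-checked with a purely greedy rollout (no exploration); a short sketch, again assuming the pre-0.26 gym API:

# Greedy evaluation of the trained policy
state = env.reset()
done = False
eval_reward = 0
while not done:
    with torch.no_grad():
        action = policy_net(torch.FloatTensor(state).to(device)).argmax().item()
    state, reward, done, _ = env.step(action)
    eval_reward += reward
print("Evaluation reward:", eval_reward)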

5. Limitations of DQN and Further Developments

  1. Overestimation bias: Double DQN mitigates this by decoupling action selection from Q-value evaluation (see the sketch after this list)
  2. Prioritized experience replay: gives important transitions a higher sampling probability
  3. Dueling network architecture: Dueling DQN separates the state-value function from the action advantage function
  4. Distributional reinforcement learning: learns the full distribution of returns rather than only its expectation
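
As a sketch of the first point, Double DQN changes only how the TD target is computed in the train() function from section 4: the policy network selects the next action and the target network evaluates it (a minimal sketch reusing the variable names from that function):

# Double DQN target: select the action with policy_net, evaluate it with target_net
next_actions = policy_net(next_state_batch).argmax(1, keepdim=True)
next_q = target_net(next_state_batch).gather(1, next_actions).squeeze(1).detach()
expected_q = reward_batch + (1 - done_batch) * GAMMA * next_q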

6. Summary

Deep Q-learning combines deep neural networks with reinforcement learning and overcomes the limitations of tabular Q-learning in high-dimensional state spaces. With experience replay and a target network, DQN can learn effective policies in complex environments. This article illustrated the core ideas and implementation details of DQN through a complete implementation on the CartPole environment. Going forward, combined with the improvements above and more powerful network architectures, deep reinforcement learning is expected to play an even larger role in areas such as robot control and game AI.

