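How to Build a House Price Prediction Model with PyTorch

Updated: 2023-03-27 14:57:54  Author: Mr.長安

This article shows how to build a house price prediction model with PyTorch. It concentrates on the modeling side of PyTorch and, as a small extra, demonstrates neuron importance for a model developed in PyTorch.

1. Project overview

The goal of this project is to predict house prices in Ames, Iowa, given 81 variables that describe the house, its area, the land, infrastructure, utilities, and more. The Ames dataset has a good mix of categorical and continuous features, is of moderate size and, perhaps most importantly, does not suffer from potential redlining or data entry problems the way other, similar datasets (such as Boston Housing) do. The discussion here concentrates on the modeling side of PyTorch; as a small extra, it also demonstrates neuron importance for a model developed in PyTorch. You can try different network architectures or model types in PyTorch. The emphasis of this project is on methodology rather than an exhaustive search for the best solution.

2. Preparation

To prepare for this project, we first download the data and apply some preprocessing in the following steps.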

from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=42165, as_frame=True)

For a full description of the dataset, see https://www.openml.org/d/42165.

Inspecting the features

import pandas as pd
data_ames = pd.DataFrame(data.data, columns=data.feature_names)
data_ames['SalePrice'] = data.target
data_ames.info()

This is the DataFrame information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null float64
MSSubClass       1460 non-null float64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null float64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null float64
OverallCond      1460 non-null float64
YearBuilt        1460 non-null float64
YearRemodAdd     1460 non-null float64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null float64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null float64
BsmtUnfSF        1460 non-null float64
TotalBsmtSF      1460 non-null float64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null float64
2ndFlrSF         1460 non-null float64
LowQualFinSF     1460 non-null float64
GrLivArea        1460 non-null float64
BsmtFullBath     1460 non-null float64
BsmtHalfBath     1460 non-null float64
FullBath         1460 non-null float64
HalfBath         1460 non-null float64
BedroomAbvGr     1460 non-null float64
KitchenAbvGr     1460 non-null float64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null float64
Functional       1460 non-null object
Fireplaces       1460 non-null float64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null float64
GarageArea       1460 non-null float64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null float64
OpenPorchSF      1460 non-null float64
EnclosedPorch    1460 non-null float64
3SsnPorch        1460 non-null float64
ScreenPorch      1460 non-null float64
PoolArea         1460 non-null float64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null float64
MoSold           1460 non-null float64
YrSold           1460 non-null float64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null float64
dtypes: float64(38), object(43)
memory usage: 924.0+ KB

Next, we will also use a library, Captum, which can inspect feature and neuron importance in PyTorch models.

pip install captum

With these preparations done, let's look at how to predict house prices.

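3. Experiment

3.1 Data preprocessing

First we scale the data, because the variables are all on different scales. Categorical variables need to be converted to numeric types before they can be fed into the model. We can choose one-hot encoding, where we create a dummy variable for each categorical level, or ordinal encoding, where we number all levels and replace the strings with these numbers. Dummy variables can be fed in like any other float variable, whereas ordinal encoding requires embeddings, that is, linear neural network projections that re-order the categories in a multi-dimensional space. We take the embedding route here.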

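import numpy as np
from category_encoders.ordinal import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

# Float columns are treated as numeric, object columns as categorical.
num_cols = list(data_ames.select_dtypes(include='float'))
cat_cols = list(data_ames.select_dtypes(include='object'))

# Replace every category by an integer code (codes start at 1).
ordinal_encoder = OrdinalEncoder().fit(
    data_ames[cat_cols]
)
# Scale numeric columns, including the target SalePrice, to zero mean and unit variance.
standard_scaler = StandardScaler().fit(
    data_ames[num_cols]
)

X = pd.DataFrame(
    data=np.column_stack([
        ordinal_encoder.transform(data_ames[cat_cols]),
        # StandardScaler keeps missing values as NaN. As an added safeguard
        # (an assumption, not part of the original recipe), replace them
        # with 0, which is the column mean after scaling.
        np.nan_to_num(standard_scaler.transform(data_ames[num_cols]))
    ]),
    columns=cat_cols + num_cols
)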
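To make the difference between the two encodings concrete, here is a minimal sketch on a made-up column (the toy values are hypothetical and not taken from the Ames data). It uses pandas helpers rather than category_encoders; note that `pd.factorize` numbers categories from 0, whereas the encoder above starts at 1.

```python
import pandas as pd

# Hypothetical toy column, for illustration only.
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave', 'Pave']})

# One-hot encoding: one dummy column per category.
one_hot = pd.get_dummies(toy['Street'], prefix='Street')

# Ordinal encoding: every category is replaced by an integer code,
# numbered in order of first appearance.
codes, categories = pd.factorize(toy['Street'])

print(one_hot.columns.tolist())        # ['Street_Grvl', 'Street_Pave']
print(codes.tolist(), list(categories))  # [0, 1, 0, 0] ['Pave', 'Grvl']
```

One-hot encoding widens the input by one column per category, while ordinal codes keep a single column per variable and leave it to the embedding layer to learn a useful representation.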

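3.2 Splitting the dataset

Before building the model, we need to split the data into a training set and a test set. Here we add stratification on a numeric variable: the sale price is cut into five quantile bins, and the split ensures that each bin is represented in the same proportion in both the training and the test set.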

np.random.seed(12)  
from sklearn.model_selection import train_test_split
 
bins = 5
sale_price_bins = pd.qcut(
    X['SalePrice'], q=bins, labels=list(range(bins))
)
X_train, X_test, y_train, y_test = train_test_split(
    X.drop(columns='SalePrice'),
    X['SalePrice'],
    random_state=12,
    stratify=sale_price_bins
)
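The effect of the stratification can be checked on synthetic data. The following is a sketch under stated assumptions: the log-normal `prices` series is a hypothetical stand-in for SalePrice, and the bin shares are compared between the two splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
# Hypothetical skewed target standing in for SalePrice.
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1000))

bins = 5
price_bins = pd.qcut(prices, q=bins, labels=list(range(bins)))

train, test = train_test_split(prices, random_state=12, stratify=price_bins)

# Share of each quantile bin in the two splits.
train_share = price_bins.loc[train.index].value_counts(normalize=True).sort_index()
test_share = price_bins.loc[test.index].value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```

With stratification, both columns should show close to one fifth per bin; without the `stratify` argument the shares would fluctuate from split to split.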

3.3 Building the PyTorch model

Now we start building our PyTorch model. We will implement a neural network regression with batched inputs in PyTorch, which involves the following steps.

1. Convert the data to Torch tensors
2. Define the model architecture
3. Define the loss criterion and the optimizer
4. Create a data loader for batches
5. Run the training

3.3.1 Converting the data

First, convert the data to torch tensors.

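import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Numeric inputs: everything except the target and the Id column.
# sorted() keeps the column order reproducible, unlike a bare set.
num_features = sorted(set(num_cols) - {'SalePrice', 'Id'})

def to_tensor(values, dtype):
    '''Copy a NumPy array into a tensor on the chosen device.

    Plain tensors replace the deprecated Variable wrapper.
    '''
    return torch.tensor(values, dtype=dtype, device=device)

# Floats for numeric features and targets, longs for category codes.
X_train_num_pt = to_tensor(X_train[num_features].values, torch.float32)
X_train_cat_pt = to_tensor(X_train[cat_cols].values, torch.long)
y_train_pt = to_tensor(y_train.values, torch.float32).view(-1, 1)
X_test_num_pt = to_tensor(X_test[num_features].values, torch.float32)
X_test_cat_pt = to_tensor(X_test[cat_cols].values, torch.long)
y_test_pt = to_tensor(y_test.values, torch.float32).view(-1, 1)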

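This makes sure that we load numeric and categorical data into separate variables, similar to NumPy. If you mix data types in a single variable (array/matrix), they become objects. We want the numeric variables as floats and the categorical variables as long (or int) values that index our categories. We also keep the training and test sets separate. Clearly, an ID variable should not matter in a model. In the worst case, if the ID has any correlation with the target, it could introduce target leakage. We have removed it from processing at this step.

3.3.2 Defining the model architecture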

import torch

class RegressionModel(torch.nn.Module):

    def __init__(self, X, num_cols, cat_cols, device=None, embed_dim=2, hidden_layer_dim=2, p=0.5):
        super().__init__()
        if device is None:
            # Default to the GPU when available, otherwise the CPU.
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.num_cols = num_cols
        self.cat_cols = cat_cols
        self.embed_dim = embed_dim
        self.hidden_layer_dim = hidden_layer_dim

        # One embedding table per categorical column. ModuleList registers
        # the embeddings, so the optimizer sees (and trains) their weights.
        self.embeddings = torch.nn.ModuleList([
            torch.nn.Embedding(
                # number of levels, counting missing values as a level
                num_embeddings=X[col].nunique(dropna=False),
                embedding_dim=embed_dim
            )
            for col in cat_cols
        ])
        # input width of the hidden layer: numeric features plus all embeddings
        hidden_dim = len(num_cols) + len(cat_cols) * embed_dim

        # hidden layer
        self.hidden = torch.nn.Linear(hidden_dim, hidden_layer_dim)
        self.dropout_layer = torch.nn.Dropout(p=p)
        self.hidden_act = torch.nn.ReLU()

        # output layer
        self.output = torch.nn.Linear(hidden_layer_dim, 1)

        # move all registered layers to the target device in one go
        self.to(device)

    def forward(self, num_inputs, cat_inputs):
        '''Forward method with two input variables -
        numeric and categorical.
        '''
        # ordinal codes start at 1, embedding indices at 0
        cat_x = [
            embed(cat_inputs[:, i] - 1)
            for i, embed in enumerate(self.embeddings)
        ]
        x = torch.cat(cat_x + [num_inputs], dim=1)
        x = self.hidden(x)
        x = self.dropout_layer(x)
        x = self.hidden_act(x)
        y_pred = self.output(x)
        return y_pred

house_model = RegressionModel(
    data_ames, num_features, cat_cols
)

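The activation function between the two linear layers is the rectified linear unit (ReLU). Note that we cannot (easily) wrap the same architecture in a sequential model, because the categorical and the numeric inputs go through different operations.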
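The step that keeps this model from being a plain sequential stack is the embedding lookup followed by concatenation. Here is a minimal sketch of just that step, with hypothetical sizes (a batch of 4 rows, 3 numeric features, two categorical columns with 5 and 3 levels):

```python
import torch

torch.manual_seed(0)

# Hypothetical inputs, for illustration only.
num_inputs = torch.randn(4, 3)
cat_inputs = torch.tensor([[1, 3], [5, 1], [2, 2], [4, 3]])  # codes start at 1

embeddings = torch.nn.ModuleList([
    torch.nn.Embedding(num_embeddings=5, embedding_dim=2),
    torch.nn.Embedding(num_embeddings=3, embedding_dim=2),
])

# Shift the 1-based codes to 0-based indices, look up one vector per row
# and column, then concatenate the result with the numeric block.
cat_x = [embed(cat_inputs[:, i] - 1) for i, embed in enumerate(embeddings)]
x = torch.cat(cat_x + [num_inputs], dim=1)

print(x.shape)  # torch.Size([4, 7])
```

The width of 7 is 2 embedding dimensions for each of the 2 categorical columns plus the 3 numeric features, which is the same sum the model uses for the input size of its hidden layer.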

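3.3.3 Defining the loss criterion and the optimizer

Next, we define the loss criterion and the optimizer. We use the mean squared error (MSE) as the loss and stochastic gradient descent as our optimization algorithm.

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(house_model.parameters(), lr=0.001)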

3.3.4 Creating the data loader

Now we create a data loader that feeds in one batch of data at a time.

data_batch = torch.utils.data.TensorDataset(
    X_train_num_pt, X_train_cat_pt, y_train_pt
)
dataloader = torch.utils.data.DataLoader(
    data_batch, batch_size=10, shuffle=True
)

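We set a batch size of 10. Now we can move on to training.

3.3.5 Training the model

Basically, we loop over the epochs; within each epoch we run the model, compute the error, and let the optimizer adjust the weights according to that error. This is the loop over the epochs, shown without the inner training loop.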
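As an illustration of what the loader yields, the following sketch iterates over a small `TensorDataset` built from random tensors (the shapes are hypothetical) and records the size of each batch:

```python
import torch

# Hypothetical tensors: 25 rows, 3 numeric features, 2 categorical codes.
x_num = torch.randn(25, 3)
x_cat = torch.randint(0, 4, (25, 2))
y = torch.randn(25, 1)

dataset = torch.utils.data.TensorDataset(x_num, x_cat, y)
loader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)

# Every iteration returns one aligned slice of all three tensors.
batch_sizes = [len(y_batch) for _, _, y_batch in loader]
print(len(loader), batch_sizes)  # 3 [10, 10, 5]
```

The last batch is smaller because 25 is not a multiple of 10; `len(loader)` counts batches, not rows, which is why the training code divides accumulated losses by it.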

from tqdm.notebook import trange

train_losses, test_losses = [], []
n_epochs = 30
for epoch in trange(n_epochs):
    train_loss, test_loss = 0, 0

    # the inner training loop over all batches goes here

    # print the errors in training and test:
    if epoch % 10 == 0:
        print(
            'Epoch: {}/{}\t'.format(epoch, n_epochs),
            'Training Loss: {:.3f}\t'.format(
                train_loss / len(dataloader)
            ),
            'Test Loss: {:.3f}'.format(
                test_loss / len(dataloader)
            )
        )

The training itself happens inside this loop, over all batches of the training data.

    for (x_train_num_batch, x_train_cat_batch, y_train_batch) in dataloader:
        # The batches are slices of tensors that already live on the model's device.
        house_model.train()  # enable dropout for the training step
        pred_ytrain = house_model(x_train_num_batch, x_train_cat_batch)
        loss = torch.sqrt(criterion(pred_ytrain, y_train_batch))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

        # Evaluate on the full test set without dropout or gradient tracking.
        with torch.no_grad():
            house_model.eval()
            pred_ytest = house_model(X_test_num_pt, X_test_cat_pt)
            batch_test_loss = torch.sqrt(criterion(pred_ytest, y_test_pt)).item()
        test_loss += batch_test_loss

        # Record one value per batch; the plot averages them per epoch.
        train_losses.append(loss.item())
        test_losses.append(batch_test_loss)


We take the square root of nn.MSELoss, which turns it into the root mean squared error (RMSE), because nn.MSELoss in PyTorch is defined as follows:

((input-target)**2).mean()
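A quick numerical check of this definition, using three made-up predictions and targets:

```python
import torch

# Hypothetical values: the errors are 1, 0 and 2.
pred = torch.tensor([[2.0], [0.0], [3.0]])
target = torch.tensor([[1.0], [0.0], [1.0]])

mse = torch.nn.MSELoss()(pred, target)
manual_mse = ((pred - target) ** 2).mean()  # (1 + 0 + 4) / 3
rmse = torch.sqrt(mse)

print(mse.item(), manual_mse.item(), rmse.item())
```

The first two numbers agree, and the third is their square root, which is on the same scale as the target and therefore easier to interpret.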

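Let's plot how our model performs on the training and validation datasets during training.

import matplotlib.pyplot as plt

# float() also handles loss values that were stored as tensors.
plt.plot(
    np.array([float(v) for v in train_losses]).reshape((n_epochs, -1)).mean(axis=1),
    label='Training loss'
)
plt.plot(
    np.array([float(v) for v in test_losses]).reshape((n_epochs, -1)).mean(axis=1),
    label='Validation loss'
)
plt.legend(frameon=False)
plt.xlabel('epochs')
plt.ylabel('RMSE')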
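We stopped training just in time, before our validation loss stopped decreasing. We can also sort and bin the target variable and plot the predictions against it, to see how the model performs across the whole range of house prices. This guards against a situation that is common in regression, especially with MSE as the loss: the model predicts well only for a mid-range of values close to the mean and does poorly on everything else.

We can see that the model's predictions are in fact very close across the whole range of house prices. We get a Spearman rank correlation of about 93% with very high significance, which confirms that the model performs with high accuracy.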
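The binning and the rank correlation can be sketched as follows. The data here are synthetic (a hypothetical target with noisy predictions), so the numbers it prints are not the article's results; with the real model, `y_true` and `y_pred` would be the test targets and the model's test predictions.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical targets and noisy predictions, for illustration only.
y_true = rng.normal(size=500)
y_pred = y_true + rng.normal(scale=0.3, size=500)

df = pd.DataFrame({'true': y_true, 'pred': y_pred})

# Cut the target into 10 quantile bins and compare the bin means.
df['bin'] = pd.qcut(df['true'], q=10, labels=False)
per_bin = df.groupby('bin')[['true', 'pred']].mean()

# Spearman's rho measures how well the predictions preserve the ranking.
rho, p_value = spearmanr(df['true'], df['pred'])

print(per_bin)
print(rho, p_value)
```

If the model were only good near the mean, the `pred` column would flatten out in the lowest and highest bins while the `true` column keeps moving.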

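4. How it works

Deep learning frameworks offer different optimization algorithms. Popular ones include stochastic gradient descent (SGD), Root Mean Square Propagation (RMSProp), and Adaptive Moment Estimation (Adam). We defined stochastic gradient descent as our optimization algorithm. Alternatively, we could define other optimizers.

# Each optimizer would normally get its own copy of the model;
# house_model is reused here only so that the lines run as written.
LR = 0.001  # same learning rate as in section 3.3.3
opt_SGD = torch.optim.SGD(house_model.parameters(), lr=LR)
opt_Momentum = torch.optim.SGD(house_model.parameters(), lr=LR, momentum=0.6)
opt_RMSprop = torch.optim.RMSprop(house_model.parameters(), lr=LR, alpha=0.1)
opt_Adam = torch.optim.Adam(house_model.parameters(), lr=LR, betas=(0.8, 0.98))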

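SGD works the same way as gradient descent, except that it works on a single example at a time (or, as with our batches of 10, on a small mini-batch). Interestingly, its convergence is similar to that of gradient descent, and it is easier on the computer's memory.

RMSProp adapts the learning rate of each parameter by dividing the gradient by a running root mean square of recent gradients. It grew out of Rprop, which works on gradient signs: the simplest variant checks the signs of the last two gradients and then adjusts the step size, increasing it by a fraction if they are the same and decreasing it by a fraction if they differ.

Adam is one of the most popular optimizers. It is an adaptive learning algorithm that changes the learning rate according to the first and second moments of the gradients.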
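The update rules can be written out in a few lines of plain Python. This is a simplified sketch on a one-dimensional toy objective, f(w) = (w - 3)^2; the hyperparameter values are illustrative assumptions, and PyTorch's optimizers offer further options (momentum, weight decay, and so on) that are left out here.

```python
import math

def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def sgd(w, steps, lr=0.1):
    # Plain gradient step: move against the gradient.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def rmsprop(w, steps, lr=0.1, alpha=0.9, eps=1e-8):
    # Divide the gradient by a running root mean square of past gradients.
    sq_avg = 0.0
    for _ in range(steps):
        g = grad(w)
        sq_avg = alpha * sq_avg + (1 - alpha) * g * g
        w -= lr * g / (math.sqrt(sq_avg) + eps)
    return w

def adam(w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Track running first (m) and second (v) moments of the gradient,
    # correct their start-up bias, and step along m / sqrt(v).
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

for name, optimizer in [('SGD', sgd), ('RMSProp', rmsprop), ('Adam', adam)]:
    print(name, optimizer(0.0, steps=500))
```

All three end up close to the minimum at w = 3. With a fixed learning rate, RMSProp keeps taking steps of roughly constant size and therefore hovers around the minimum rather than settling exactly on it.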

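Captum is a tool that helps us understand the ins and outs of a neural network model trained on a dataset. It can help us learn about the following:

Feature importance
Layer importance
Neuron importance

This matters a great deal when building interpretable neural networks. Here, integrated gradients are applied to understand feature importance. After that, layer conductance is used to demonstrate neuron importance.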
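To give an idea of what integrated gradients compute, here is a hand-written sketch that does not use Captum. It averages the gradient along the straight line from a baseline to the input and scales it by the difference between input and baseline. The function `integrated_gradients` and the toy model `f` are hypothetical helpers for illustration; Captum's implementation differs in its details (for example, in the integration rule it uses by default).

```python
import torch

def integrated_gradients(f, x, baseline, n_steps=50):
    # Midpoint Riemann sum of the gradient along the straight path
    # from the baseline to the input x.
    alphas = (torch.arange(n_steps, dtype=x.dtype) + 0.5) / n_steps
    total_grad = torch.zeros_like(x)
    for alpha in alphas:
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        f(point).backward()
        total_grad += point.grad
    return (x - baseline) * total_grad / n_steps

# Hypothetical toy model: f(x) = x0^2 + 3 * x1.
def f(x):
    return x[0] ** 2 + 3 * x[1]

x = torch.tensor([2.0, 1.0])
baseline = torch.zeros(2)
attr = integrated_gradients(f, x, baseline)

print(attr)                 # approximately tensor([4., 3.])
print(f(x) - f(baseline))   # tensor(7.)
```

The attributions add up to the difference between the model output at the input and at the baseline (4 + 3 = 7 here). That difference is what the convergence delta returned by Captum measures the approximation against.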

5. Supplement

Now that we have defined and trained our neural network, let's use the captum library to find the important features and neurons.

from captum.attr import (
    IntegratedGradients,
    LayerConductance,
    NeuronConductance
)

# Run the attribution on the CPU and in evaluation mode (dropout off).
house_model.cpu()
for embedding in house_model.embeddings:
    embedding.cpu()
house_model.eval()

ing_house = IntegratedGradients(forward_func=house_model.forward)
# Attribute with respect to the numeric inputs; the categorical codes
# are passed along unchanged as an additional forward argument.
X_test_num_cpu = X_test_num_pt.detach().cpu().requires_grad_()
attr, delta = ing_house.attribute(
    X_test_num_cpu,
    target=None,
    return_convergence_delta=True,
    additional_forward_args=X_test_cat_pt.cpu()
)
attr = attr.detach().numpy()

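Now we have a NumPy array of feature importances. Layer and neuron importances can be obtained with this tool as well. Let's look at the neuron importances of our first layer. We can pass house_model.hidden_act, which is the ReLU activation on top of the first linear layer.

cond_layer1 = LayerConductance(house_model, house_model.hidden_act)
# Pass the same two inputs that the model's forward method expects.
cond_vals = cond_layer1.attribute(
    X_test_num_pt.detach().cpu(),
    target=None,
    additional_forward_args=X_test_cat_pt.cpu()
)
cond_vals = cond_vals.detach().numpy()
df_neuron = pd.DataFrame(data=np.mean(cond_vals, axis=0), columns=['Neuron Importance'])
# One row per neuron in the hidden layer.
df_neuron['Neuron'] = range(len(df_neuron))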

Plotting this table shows the neuron importances. Apparently, one neuron is simply not important. We can also see the most important variables by sorting the NumPy array we obtained earlier.

df_feat = pd.DataFrame(np.mean(attr, axis=0), columns=['feature importance'] )
df_feat['features'] = num_features
df_feat.sort_values(
    by='feature importance', ascending=False
).head(10)

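This lists the 10 most important variables.

In general, feature importances help us both to understand the model and to prune it, so that it becomes less complex (and, hopefully, overfits less).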
