
Using the CLIP Multimodal Model Library in Python

Updated: April 28, 2025, 10:52:45 Author: 彬彬俠

CLIP is a model developed by OpenAI that learns a joint representation of language and images. This article walks through using the CLIP library in Python and may be a useful reference for interested readers.

1. Installing the official OpenAI CLIP

pip install git+https://github.com/openai/CLIP.git

Dependencies: torch, numpy, PIL (installed as the Pillow package)

2. Quick-start example

import clip
import torch
from PIL import Image

# Load the model and its preprocessing function
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load the image and preprocess it
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)

# Write the text descriptions
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

# Extract features and compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
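
The `softmax` call at the end of the snippet above is what turns the raw image-text logits into the printed probabilities. As a rough illustration in plain Python (no model needed), the sketch below applies a numerically stable softmax to a made-up logit row. The values in `example_logits` are hypothetical, not real CLIP output.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for ["a photo of a cat", "a photo of a dog"].
example_logits = [25.0, 19.0]
example_probs = softmax(example_logits)
print("Label probabilities:", example_probs)
```

Only the differences between logits matter here, so adding the same constant to every logit leaves the probabilities unchanged.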

3. Model options

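Supported models include:

"ViT-B/32": fastest and most commonly used
"ViT-B/16": uses smaller 16×16 patches; slower but generally more accurate
"RN50", "RN101": ResNet-based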

4. Text encoding

text = ["a photo of a banana", "a dog", "a car"]
tokens = clip.tokenize(text).to(device)

with torch.no_grad():
    text_features = model.encode_text(tokens)

5. Image encoding

from PIL import Image

image = Image.open("example.jpg")
image_input = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)

6. Similarity comparison

import torch.nn.functional as F

# Cosine similarity between the image and each text embedding
similarity = F.cosine_similarity(image_features, text_features)
print(similarity)
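
For readers who want to see what `F.cosine_similarity` computes, here is a dependency-free sketch: the dot product of two vectors divided by the product of their lengths. The three-dimensional vectors are toy values chosen for illustration (real ViT-B/32 embeddings have 512 dimensions), and the `cosine_similarity` helper is written for this example rather than taken from any library.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "image" embedding and two toy "text" embeddings.
image_vec = [1.0, 2.0, 2.0]
text_vecs = {
    "a dog": [2.0, 4.0, 4.0],   # same direction as image_vec
    "a car": [2.0, -1.0, 0.0],  # orthogonal to image_vec
}

scores = {label: cosine_similarity(image_vec, vec) for label, vec in text_vecs.items()}
print(scores)
```

A vector pointing in the same direction scores 1 regardless of its length, while an orthogonal one scores 0, which is why cosine similarity suits embeddings whose magnitudes are not meaningful.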

7. Zero-shot image classification

labels = ["a dog", "a cat", "a car"]
text_inputs = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    # Use the preprocessed tensor from section 5; `image` there is a raw PIL image
    image_features = model.encode_image(image_input)

# Normalize to unit length
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity scores
logits = (image_features @ text_features.T)
pred = logits.argmax().item()

print(f"Predicted label: {labels[pred]}")
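
`argmax` reports only the single best label. When the runner-up matters, the similarity row can be ranked instead. The sketch below assumes the scores have already been computed (the numbers are invented) and uses a hypothetical helper named `top_k`.

```python
def top_k(labels, scores, k=2):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

labels = ["a dog", "a cat", "a car"]
# Hypothetical cosine similarities between one image and each label prompt.
scores = [0.21, 0.29, 0.17]

for label, score in top_k(labels, scores):
    print(f"{label}: {score:.2f}")
```

With real tensors, `torch.topk` on the similarity row serves the same purpose.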

8. Comparison with other models

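Feature	CLIP	BLIP / Flamingo	BERT / GPT
Image-text alignment	Yes	Yes	No
Multimodal capability	Strong (image + text)	Stronger (supports generation)	Weak
Zero-shot capability	Strong	Strong	None (for image tasks)
Suitable tasks	Image-text retrieval, matching, classification	Caption generation, question answering, VQA	Language tasks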

9. More powerful: open_clip

open_clip is a community-maintained implementation that supports many more pretrained models (such as those trained on LAION datasets):

pip install open_clip_torch

import open_clip

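# create_model_and_transforms returns (model, training preprocess, evaluation preprocess);
# the tokenizer is obtained separately with get_tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')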
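
Because open_clip identifies a checkpoint by an architecture name plus a pretrained tag, it can help to see which tags exist for one architecture. The package exposes `open_clip.list_pretrained()`, which returns `(architecture, tag)` pairs. The sketch below filters such pairs with a hypothetical helper `tags_for`, using a small hand-written sample list in place of the real call so that it runs without the package installed.

```python
def tags_for(arch, pairs):
    """Return the pretrained tags listed for one architecture name."""
    return [tag for name, tag in pairs if name == arch]

# Hand-written sample; in practice: pairs = open_clip.list_pretrained()
pairs = [
    ("ViT-B-32", "openai"),
    ("ViT-B-32", "laion2b_s34b_b79k"),
    ("ViT-L-14", "laion2b_s32b_b82k"),
]

vit_b32_tags = tags_for("ViT-B-32", pairs)
print(vit_b32_tags)
```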

10. Summary

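Function	Method
Load a model	`clip.load()`
Text encoding	`model.encode_text()`
Image encoding	`model.encode_image()`
Image-text similarity	`model(image, text)` or cosine similarity
Image classification (zero-shot)	Embed the text descriptions, then pick the highest similarity
Supported models	`"ViT-B/32"`, `"ViT-B/16"`, and others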

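CLIP is a leading example of modern multimodal AI and can be applied to image retrieval, image-text classification, image question answering, cross-modal search, and similar scenarios. It performs well even in zero-shot settings, which makes it a powerful tool for building general-purpose image-text understanding systems.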
