smolagents supports vision-language models (VLMs), enabling agents to process and interpret images effectively.

Providing Images at the Start of the Agent's Execution

Example:

from PIL import Image
import requests
from io import BytesIO

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e8/The_Joker_at_Wax_Museum_Plus.jpg", # Joker image
    "https://upload.wikimedia.org/wikipedia/en/9/98/Joker_%28DC_Comics_character%29.jpg" # Joker image
]

# Inputs for the model
images = []
for url in image_urls:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" 
    }
    response = requests.get(url, headers=headers)
    image = Image.open(BytesIO(response.content)).convert("RGB")  # we only need the RGB information
    images.append(image)
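Very large photos can exceed a VLM's effective input resolution, so it can help to downscale them before handing them to the agent. A minimal helper sketch (the `downscale` name and the 1024-pixel limit are illustrative choices, not part of smolagents):

```python
from PIL import Image

def downscale(image: Image.Image, max_side: int = 1024) -> Image.Image:
    """Return a copy whose longest side is at most max_side pixels."""
    w, h = image.size
    scale = max_side / max(w, h)
    if scale >= 1:
        return image.copy()  # already small enough
    return image.resize((int(w * scale), int(h * scale)))

# Example: a 2048x1536 image is reduced to 1024x768
img = Image.new("RGB", (2048, 1536))
print(downscale(img).size)  # (1024, 768)
```

Applying `images = [downscale(img) for img in images]` before the agent run keeps token usage predictable without changing the workflow.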

from smolagents import CodeAgent, OpenAIServerModel

# Which model to use
model = OpenAIServerModel(model_id="gpt-4o")

# Instantiate the agent
agent = CodeAgent(
    tools=[],
    model=model,
    max_steps=20,
    verbosity_level=2
)

response = agent.run(
    """
    Describe the costume and makeup that the comic character in these photos is wearing and return the description.
    Tell me if the guest is The Joker or Wonder Woman.
    """,
    images=images  # pass the input images
)

In this example, the agent uses the model to describe the input images.

Providing Images with Dynamic Retrieval

When the user does not have the images at hand, the agent needs to retrieve them from the internet or another database. In this approach, images are added to the agent's memory dynamically during execution. Recall that a multi-step agent is an abstraction over the ReAct framework, built from three kinds of steps:

  1. SystemPromptStep: stores the system prompt.
  2. TaskStep: logs the user query and any provided input.
  3. ActionStep: captures the logs and results of the agent's actions.

These steps form a loop. Within it, the agent dynamically incorporates visual information and responds adaptively to the evolving task.
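As a toy illustration of this cycle (plain Python dictionaries, not smolagents' actual step classes), the memory starts with a system prompt step and a task step, then grows by one action record per loop iteration:

```python
# Toy model of the agent's memory: a system prompt step, a task step,
# then one action step appended per loop iteration.
system = {"kind": "system_prompt", "content": "You are a web agent."}
task = {"kind": "task", "content": "Describe these photos.", "images": []}
memory = [system, task]

for step_number in range(1, 4):  # the action loop
    memory.append({
        "kind": "action",
        "step": step_number,
        "observations": f"observation {step_number}",
        "observations_images": None,  # a step callback can fill this in
    })

print([m["kind"] for m in memory])
# ['system_prompt', 'task', 'action', 'action', 'action']
```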

We will use Selenium and Helium, which are browser automation tools. We also need a set of agent tools designed specifically for browsing, such as search_item_ctrl_f, go_back, and close_popups. These tools allow the agent to browse the web like a human.
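Before defining these tools, a browser has to be running. A setup sketch, assuming Chrome is installed (the window size and flags below are illustrative choices, not requirements):

```python
import helium
from selenium import webdriver

# Configure Chrome with a predictable viewport for screenshots
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1000,1350")
chrome_options.add_argument("--disable-pdf-viewer")

# helium.start_chrome launches Chrome and returns the Selenium WebDriver,
# which the tools below reference as a global `driver`
driver = helium.start_chrome(headless=False, options=chrome_options)
```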

# These tools assume a global Selenium WebDriver named `driver`,
# e.g. obtained from helium.start_chrome(...)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from smolagents import tool

@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result

@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()

@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows!
    This does not work on cookie consent banners.
    """
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    return "Sent ESC to close any open popup."

We also need a function for saving screenshots, since this will be a key part of how the VLM agent completes its task. The function captures a screenshot and saves it in step_log.observations_images = [image.copy()], allowing the agent to store and process images dynamically as it navigates.

from io import BytesIO
from time import sleep

import helium
from PIL import Image
from smolagents import ActionStep, CodeAgent

def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
    driver = helium.get_driver()
    current_step = step_log.step_number
    if driver is not None:
        # Remove previous screenshots from logs for lean processing
        for previous_step in agent.logs:
            if isinstance(previous_step, ActionStep) and previous_step.step_number <= current_step - 2:
                previous_step.observations_images = None
        png_bytes = driver.get_screenshot_as_png()
        image = Image.open(BytesIO(png_bytes))
        print(f"Captured a browser screenshot: {image.size} pixels")
        step_log.observations_images = [image.copy()]  # Create a copy to ensure it persists, important!

    # Update observations with the current URL
    url_info = f"Current url: {driver.current_url}"
    step_log.observations = (
        url_info if step_log.observations is None else step_log.observations + "\n" + url_info
    )
    return
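The screenshot-pruning step can be illustrated in isolation. A toy sketch (simplified stand-in objects, not smolagents classes) showing that only the two most recent steps keep their screenshots:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    step_number: int
    observations_images: Optional[list] = None

def prune_old_images(steps, current_step, keep_last=2):
    """Drop images from all but the last `keep_last` steps."""
    for s in steps:
        if s.step_number <= current_step - keep_last:
            s.observations_images = None

steps = [Step(i, observations_images=[f"shot{i}"]) for i in range(1, 6)]
prune_old_images(steps, current_step=5)
print([s.observations_images for s in steps])
# [None, None, None, ['shot4'], ['shot5']]
```

Keeping only recent screenshots matters because each image costs VLM input tokens on every subsequent step; older pages are rarely worth re-inspecting.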

This save_screenshot function is passed to the agent as a step_callback (it is not a tool), because it is triggered at the end of each step during the agent's execution. This lets the agent capture and save screenshots dynamically throughout its process:

from smolagents import CodeAgent, OpenAIServerModel, DuckDuckGoSearchTool
model = OpenAIServerModel(model_id="gpt-4o")

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],  # the callback is registered here
    max_steps=20,
    verbosity_level=2,
)

With this in place, the user can submit a request:

agent.run("""
Please search for images of Wonder Woman and generate a detailed visual description based on those images. Additionally, navigate to Wikipedia to gather key details about her appearance. With this information, I can determine whether to grant her access to the event.
""" + helium_instructions)

For more details, see the accompanying YouTube video.