Here is the workflow of the ReAct pattern.

What this example does:

  1. Process image documents
  2. Use a VLM to extract the text in the images
  3. Call regular tools when needed
  4. Analyse the content and provide a summary
  5. Execute specific instructions related to the document

Key points

Recap: the ReAct structure

An agent with the composite ReAct structure needs 3 steps:

  • Reasoning (Thought) - let the model analyse the tools' output and decide what the next action is, e.g. calling another tool, or returning the answer directly to the user
  • Action - let the agent's control logic call the tools
  • Observation - pass the tools' output back to the model

Think first, then act, then observe the result (ReAct).
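
A minimal sketch of this loop in plain Python, for illustration only: it assumes llm_with_tools is a chat model with tools already bound (as defined later in the code) and tools_by_name is a hypothetical dict mapping tool names to LangChain tool objects.

from langchain_core.messages import ToolMessage

def react_loop(llm_with_tools, tools_by_name, messages):
    while True:
        ai_msg = llm_with_tools.invoke(messages)       # Reasoning (Thought)
        messages.append(ai_msg)
        if not ai_msg.tool_calls:                      # no tool call -> final answer
            return ai_msg
        for call in ai_msg.tool_calls:                 # Action: the agent executes the call
            result = tools_by_name[call["name"]].invoke(call["args"])
            messages.append(                           # Observation: feed the result back to the model
                ToolMessage(content=str(result), tool_call_id=call["id"])
            )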

ReAct

KAQ: Is tool calling a behavior of the Agent or of the LLM?

Tool calling is the Agent's behavior, because the Agent manages the execution flow (calling the tools and handling their results). But the LLM generates the tool call; that is, we rely on the LLM to produce a structured instruction such as {"tool": "web_search", "query": "Shanghai city walk"}, which the Agent's control logic then actually executes.

Some open-source LLMs support tool calling and some do not. If the LLM does not support tool calling, the Agent cannot reliably parse and execute tool calls, and its functionality is limited.

An LLM that supports tool calling can, guided by the prompt, reliably generate structured, parseable tool-call instructions (such as JSON), so that the external Agent control logic can execute them correctly.
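
For example, with a tool-calling chat model in LangChain, generating and executing the call are two separate steps. A minimal sketch; the web_search tool and its placeholder body are made up for illustration:

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def web_search(query: str) -> str:
    """Search the web for the given query."""
    return f"(placeholder results for: {query})"

model = ChatOpenAI(model="gpt-4o").bind_tools([web_search])
ai_msg = model.invoke("Plan a Shanghai city walk")  # the LLM only *generates* the tool call
for call in ai_msg.tool_calls:                      # the Agent's control logic *executes* it
    print(call["name"], call["args"])               # e.g. web_search {'query': 'Shanghai city walk'}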

When deploying a local application with an open-source LLM, I once hit an error saying "the current LLM does not support tool calling". Why? Possible reasons:

  1. It is a base model.
  2. The LLM is not an -Instruct (instruction-tuned) version.

Which LangGraph features are used (see the code below):

  • AnyMessage
  • How LangGraph adds the "loop" shown in the figure above, i.e. how the tools' output goes back to the brain (the model)
  • Specifying vision_llm
  • Wrapping an LLM call inside a tool, which makes the tool more powerful
  • Gathering the defined tools into one list
  • When building the graph, connecting the tools' output to the assistant via add_edge("tools", "assistant"), which embodies the ReAct structure

Code

pip install -q -U langchain_openai langchain_core langgraph

The tool function extract_text uses an LLM to extract text from an image, i.e. the tool itself contains an LLM call:

import os

# Please set your own key.
# OpenAI's GPT-4o model requires an official OpenAI API key to work.
os.environ["OPENAI_API_KEY"] = "sk-xxxxxx"

import base64
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

vision_llm = ChatOpenAI(model="gpt-4o")


def extract_text(img_path: str) -> str:
    """
    Extract text from an image file using a multimodal model.

    Args:
        img_path: A local image file path (strings).

    Returns:
        A single string containing the concatenated text extracted from each image.
    """
    all_text = ""
    try:
        # Read image and encode as base64
        with open(img_path, "rb") as image_file:
            image_bytes = image_file.read()

        image_base64 = base64.b64encode(image_bytes).decode("utf-8")

        # Prepare the prompt including the base64 image data
        message = [
            HumanMessage(
                content=[
                    {
                        "type": "text",
                        "text": (
                            "Extract all the text from this image. "
                            "Return only the extracted text, no explanations."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_base64}"
                        },
                    },
                ]
            )
        ]

        # Call the vision-capable model
        response = vision_llm.invoke(message)

        # Append extracted text
        all_text += response.content + "\n\n"

        return all_text.strip()
    except Exception as e:
        # You can choose whether to raise or just return an empty string / error message
        error_msg = f"Error extracting text: {str(e)}"
        print(error_msg)
        return ""


llm = ChatOpenAI(model="gpt-4o")

def divide(a: int, b: int) -> float:
    """Divide a and b."""
    return a / b

# The collection of tools
tools = [
    divide,
    extract_text
]
# parallel_tool_calls=False makes the model emit at most one tool call per turn
llm_with_tools = llm.bind_tools(tools, parallel_tool_calls=False)
from typing import TypedDict, Annotated, Optional
from langchain_core.messages import AnyMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # The input document
    input_file: Optional[str]  # Contains file path, type (PNG)
    messages: Annotated[list[AnyMessage], add_messages]
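
# Note: add_messages is a reducer. Messages returned by a node are appended to the
# existing list instead of overwriting it, which is what keeps the conversation history
# growing across the assistant/tools loop. A standalone illustration (assumption: the
# reducer can also be called directly, outside of the graph):
# from langchain_core.messages import AIMessage
# add_messages([HumanMessage(content="hi")], [AIMessage(content="hello")])  # -> both messages kept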

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.utils.function_calling import convert_to_openai_tool


def assistant(state: AgentState):
    # System message
    textual_description_of_tool = """
extract_text(img_path: str) -> str:
    Extract text from an image file using a multimodal model.

    Args:
        img_path: A local image file path (strings).

    Returns:
        A single string containing the concatenated text extracted from each image.
divide(a: int, b: int) -> float:
    Divide a and b
"""
    image = state["input_file"]
    sys_msg = SystemMessage(
        content=f"You are a helpful agent that can analyse some images and run some computations "
                f"with the provided tools:\n{textual_description_of_tool}\n"
                f"You have access to some optional images. Currently the loaded image is: {image}"
    )

    return {"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])], "input_file": state["input_file"]}
'''
1. define a `tools` node with our list of tools.
2. The `assistant` node is just our model with bound tools.
3. add a tools_condition edge, which routes to END or to tools based on whether the assistant calls a tool
4. connect the `tools` node back to the `assistant`, forming a loop.

    - After the assistant node executes, tools_condition checks if the model's output is a tool call.
    - If it is a tool call, the flow is directed to the tools node.
    - The tools node connects back to assistant.
    - This loop continues as long as the model decides to call tools.
    - If the model response is not a tool call, the flow is directed to END, terminating the process.

'''
from langgraph.graph import START, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition
from IPython.display import Image, display

# Graph
builder = StateGraph(AgentState)

# Define nodes: these do the work
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))

# Define edges: these determine how the control flow moves
builder.add_edge(START, "assistant")
builder.add_conditional_edges(
    "assistant",
    # If the latest message (result) from assistant is a tool call -> tools_condition routes to tools
    # If the latest message (result) from assistant is a not a tool call -> tools_condition routes to END
    tools_condition,
)
builder.add_edge("tools", "assistant")
react_graph = builder.compile()

# Show the graph structure
display(Image(react_graph.get_graph(xray=True).draw_mermaid_png()))
# Tool call example
messages = [HumanMessage(content="Divide 6790 by 5")]
messages = react_graph.invoke({"messages": messages, "input_file": None})
for m in messages['messages']:
    m.pretty_print()


# The input is an image that contains text
messages = [HumanMessage(content="According the note provided by MR wayne in the provided images. \
                                    What's the list of items I should buy for the dinner menu ?")]
messages = react_graph.invoke({"messages": messages, "input_file": "Batman_training_and_meals.png"})

# The returned messages are structured objects, so pretty-print them to extract the main information
for m in messages['messages']:
    m.pretty_print()