Here is the workflow of the ReAct pattern.

What this example does:

  1. Process image documents
  2. Use a VLM to extract the text in the images
  3. Call regular tools when needed
  4. Analyse the content and provide a summary
  5. Execute specific instructions related to the document

Key points

Recap: the ReAct structure

An agent with the composite ReAct structure needs 3 steps:

  • Reasoning (Thought) - let the model analyse the tools' output and decide what the next action is, e.g. calling another tool, or returning the answer directly to the user
  • Action - let the agent's control logic call the tools
  • Observation - pass the tools' output back to the model

Think first, then act, then observe the result (ReAct).
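
A minimal sketch of this loop in plain Python, for illustration only: it assumes llm_with_tools is a chat model with tools already bound (as defined later in the code) and tools_by_name is a hypothetical dict mapping tool names to LangChain tool objects.

from langchain_core.messages import ToolMessage

def react_loop(llm_with_tools, tools_by_name, messages):
    while True:
        ai_msg = llm_with_tools.invoke(messages)       # Reasoning (Thought)
        messages.append(ai_msg)
        if not ai_msg.tool_calls:                      # no tool call -> final answer
            return ai_msg
        for call in ai_msg.tool_calls:                 # Action: the agent executes the call
            result = tools_by_name[call["name"]].invoke(call["args"])
            messages.append(                           # Observation: feed the result back to the model
                ToolMessage(content=str(result), tool_call_id=call["id"])
            )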

ReAct

KAQ: Is tool calling a behavior of the Agent or of the LLM?

Tool calling is the Agent's behavior, because the Agent manages the execution flow (calling the tools and handling their results). But the LLM generates the tool call; that is, we rely on the LLM to produce a structured instruction such as {"tool": "web_search", "query": "Shanghai city walk"}, which the Agent's control logic then actually executes.

Some open-source LLMs support tool calling and some do not. If the LLM does not support tool calling, the Agent cannot reliably parse and execute tool calls, and its functionality is limited.

An LLM that supports tool calling can, guided by the prompt, reliably generate structured, parseable tool-call instructions (such as JSON), so that the external Agent control logic can execute them correctly.
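
For example, with a tool-calling chat model in LangChain, generating and executing the call are two separate steps. A minimal sketch; the web_search tool and its placeholder body are made up for illustration:

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def web_search(query: str) -> str:
    """Search the web for the given query."""
    return f"(placeholder results for: {query})"

model = ChatOpenAI(model="gpt-4o").bind_tools([web_search])
ai_msg = model.invoke("Plan a Shanghai city walk")  # the LLM only *generates* the tool call
for call in ai_msg.tool_calls:                      # the Agent's control logic *executes* it
    print(call["name"], call["args"])               # e.g. web_search {'query': 'Shanghai city walk'}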

When deploying a local application with an open-source LLM, I once hit an error saying "the current LLM does not support tool calling". Why? Possible reasons:

  1. It is a base model.
  2. The LLM is not an -Instruct (instruction-tuned) version.

Which LangGraph features are used (see the code below):

  • AnyMessage
  • How LangGraph adds the "loop" shown in the figure above, i.e. how the tools' output goes back to the brain (the model)
  • Specifying vision_llm
  • Wrapping an LLM call inside a tool, which makes the tool more powerful
  • Gathering the defined tools into one list
  • When building the graph, connecting the tools' output to the assistant via add_edge("tools", "assistant"), which embodies the ReAct structure

Code

pip install -q -U langchain_openai langchain_core langgraph

The tool function extract_text uses an LLM to extract text from an image, i.e. the tool itself contains an LLM call:

import os

# Please set your own key.
# OpenAI's GPT-4o model requires an official OpenAI API key to work.
os.environ["OPENAI_API_KEY"] = "sk-xxxxxx"

import base64
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

vision_llm = ChatOpenAI(model="gpt-4o")


def extract_text(img_path: str) -> str:
    """
    Extract text from an image file using a multimodal model.

    Args:
        img_path: A local image file path (strings).

    Returns:
        A single string containing the concatenated text extracted from each image.
    """
    all_text = ""
    try:
        # Read image and encode as base64
        with open(img_path, "rb") as image_file:
            image_bytes = image_file.read()

        image_base64 = base64.b64encode(image_bytes).decode("utf-8")

        # Prepare the prompt including the base64 image data
        message = [
            HumanMessage(
                content=[
                    {
                        "type": "text",
                        "text": (
                            "Extract all the text from this image. "
                            "Return only the extracted text, no explanations."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_base64}"
                        },
                    },
                ]
            )
        ]

        # Call the vision-capable model
        response = vision_llm.invoke(message)

        # Append extracted text
        all_text += response.content + "\n\n"

        return all_text.strip()
    except Exception as e:
        # You can choose whether to raise or just return an empty string / error message
        error_msg = f"Error extracting text: {str(e)}"
        print(error_msg)
        return ""


llm = ChatOpenAI(model="gpt-4o")

def divide(a: int, b: int) -> float:
    """Divide a and b."""
    return a / b

# The collection of tools
tools = [
    divide,
    extract_text
]
# parallel_tool_calls=False makes the model emit at most one tool call per turn
llm_with_tools = llm.bind_tools(tools, parallel_tool_calls=False)
from typing import TypedDict, Annotated, Optional
from langchain_core.messages import AnyMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # The input document
    input_file: Optional[str]  # Contains file path, type (PNG)
    messages: Annotated[list[AnyMessage], add_messages]
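
# Note: add_messages is a reducer. Messages returned by a node are appended to the
# existing list instead of overwriting it, which is what keeps the conversation history
# growing across the assistant/tools loop. A standalone illustration (assumption: the
# reducer can also be called directly, outside of the graph):
# from langchain_core.messages import AIMessage
# add_messages([HumanMessage(content="hi")], [AIMessage(content="hello")])  # -> both messages kept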

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.utils.function_calling import convert_to_openai_tool


def assistant(state: AgentState):
    # System message
    textual_description_of_tool = """
extract_text(img_path: str) -> str:
    Extract text from an image file using a multimodal model.

    Args:
        img_path: A local image file path (strings).

    Returns:
        A single string containing the concatenated text extracted from each image.
divide(a: int, b: int) -> float:
    Divide a and b
"""
    image = state["input_file"]
    sys_msg = SystemMessage(
        content=f"You are a helpful agent that can analyse some images and run some computations "
                f"with the provided tools:\n{textual_description_of_tool}\n"
                f"You have access to some optional images. Currently the loaded image is: {image}"
    )

    return {"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])], "input_file": state["input_file"]}
'''
1. define a `tools` node with our list of tools.
2. The `assistant` node is just our model with bound tools.
3. add a tools_condition edge, which routes to END or to tools based on whether the assistant calls a tool
4. connect the `tools` node back to the `assistant`, forming a loop.

    - After the assistant node executes, tools_condition checks if the model's output is a tool call.
    - If it is a tool call, the flow is directed to the tools node.
    - The tools node connects back to assistant.
    - This loop continues as long as the model decides to call tools.
    - If the model response is not a tool call, the flow is directed to END, terminating the process.

'''
from langgraph.graph import START, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition
from IPython.display import Image, display

# Graph
builder = StateGraph(AgentState)

# Define nodes: these do the work
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))

# Define edges: these determine how the control flow moves
builder.add_edge(START, "assistant")
builder.add_conditional_edges(
    "assistant",
    # If the latest message (result) from assistant is a tool call -> tools_condition routes to tools
    # If the latest message (result) from assistant is a not a tool call -> tools_condition routes to END
    tools_condition,
)
builder.add_edge("tools", "assistant")
react_graph = builder.compile()

# Show the graph structure
display(Image(react_graph.get_graph(xray=True).draw_mermaid_png()))
# Tool call example
messages = [HumanMessage(content="Divide 6790 by 5")]
messages = react_graph.invoke({"messages": messages, "input_file": None})
for m in messages['messages']:
    m.pretty_print()


# The input is an image that contains text
messages = [HumanMessage(content="According the note provided by MR wayne in the provided images. \
                                    What's the list of items I should buy for the dinner menu ?")]
messages = react_graph.invoke({"messages": messages, "input_file": "Batman_training_and_meals.png"})

# The returned messages are structured objects, so pretty-print them to extract the main information
for m in messages['messages']:
    m.pretty_print()