The workflow of the ReAct pattern.
This example's purpose and functionality:
- Process image documents
- Use a VLM to extract the text contained in an image
- Call regular Tools when needed
- Analyze the content and provide a summary
- Execute specific instructions related to the document
 
Key points
Reviewing the ReAct structure
An Agent built on the ReAct structure needs 3 steps:
- Reasoning (Thought) - the model analyses the tools' output and decides what to do next, e.g. call another Tool or return the output directly to the user
- Action - the Agent's control logic invokes the tools
- Observation - the tools' output is passed back to the model

Think first, then act, then observe the result (ReAct).
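To make the loop concrete, here is a minimal hand-written sketch of the Thought → Action → Observation cycle (hypothetical helpers: fake_llm stands in for a model call that returns either a tool request or a final answer, run_tool dispatches to a tool; neither is a real API):
def react_loop(question: str, fake_llm, run_tool, max_steps: int = 5) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        # Thought + Action: the model reads the history and either answers
        # or requests a tool, e.g. {"tool": "divide", "args": {"a": 6790, "b": 5}}.
        step = fake_llm("\n".join(history))
        if "final_answer" in step:
            return step["final_answer"]
        # Observation: run the requested tool and feed the result back to the model.
        observation = run_tool(step["tool"], step["args"])
        history.append(f"Action: {step['tool']}({step['args']})")
        history.append(f"Observation: {observation}")
    return "Stopped after reaching the step limit."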

KAQ: Is tool calling a behavior of the Agent or of the LLM?
Tool calling is the Agent's behavior, because the Agent manages the execution flow (invoking tools, handling results). But the LLM generates the tool call: the Agent relies on the LLM to produce a structured instruction such as {"tool": "web_search", "query": "Shanghai city walk"}, and the Agent's control logic then actually executes that instruction.
Some open-source LLMs support tool calling and some do not. If the LLM does not support tool calling, the Agent cannot reliably parse and execute tool calls, which limits its functionality.
An LLM that supports tool calling can, guided by prompts, reliably generate structured, parseable tool-call instructions (e.g. JSON) that the external Agent control logic can then execute correctly.
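As a minimal illustration of that division of labor (hypothetical names; this is not the LangGraph code used later), the Agent's control logic might parse the model's JSON and dispatch to a matching Python function:
import json

# Hypothetical tool registry; web_search here is a stand-in, not a real search API.
def web_search(query: str) -> str:
    return f"(pretend search results for '{query}')"

TOOLS = {"web_search": web_search}

def execute_tool_call(raw_llm_output: str) -> str:
    """Parse a JSON tool call emitted by the LLM and run the matching tool."""
    call = json.loads(raw_llm_output)  # e.g. {"tool": "web_search", "query": "Shanghai city walk"}
    tool_name = call.pop("tool")
    return TOOLS[tool_name](**call)    # remaining keys become the tool's arguments

print(execute_tool_call('{"tool": "web_search", "query": "Shanghai city walk"}'))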
When deploying a local application with an open-source LLM, you may hit an error saying "the current LLM does not support tool calling". Why? Possible reasons:
- It is a Base Model.
- The LLM is not an -Instruct version.
Which LangGraph features are used (see the code below for details):
- The AnyMessage class - how LangGraph adds the "loop" in the diagram above, i.e. feeding a tool's output back to the brain.
- A dedicated vision_llm - wrapping an LLM call inside a tool makes the tool more powerful.
- Collecting the defined tools into one list.
- When creating the Graph, add_edge("tools", "assistant") connects the tools' output back to the assistant, which embodies the ReAct structure.
Code
pip install -q -U langchain_openai langchain_core langgraph
The extract_text tool uses an LLM to extract text from an image, i.e. the tool itself contains an LLM call.
import os
# Please set your own key.
# When using OpenAI's GPT-4o model, an official OpenAI API key is required.
os.environ["OPENAI_API_KEY"] = "sk-xxxxxx"
import base64
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
vision_llm = ChatOpenAI(model="gpt-4o")
def extract_text(img_path: str) -> str:
    """
    Extract text from an image file using a multimodal model.
    Args:
        img_path: A local image file path (string).
    Returns:
        A single string containing the concatenated text extracted from each image.
    """
    all_text = ""
    try:
        # Read image and encode as base64
        with open(img_path, "rb") as image_file:
            image_bytes = image_file.read()
        image_base64 = base64.b64encode(image_bytes).decode("utf-8")
        # Prepare the prompt including the base64 image data
        message = [
            HumanMessage(
                content=[
                    {
                        "type": "text",
                        "text": (
                            "Extract all the text from this image. "
                            "Return only the extracted text, no explanations."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_base64}"
                        },
                    },
                ]
            )
        ]
        # Call the vision-capable model
        response = vision_llm.invoke(message)
        # Append extracted text
        all_text += response.content + "\n\n"
        return all_text.strip()
    except Exception as e:
        # You can choose whether to raise or just return an empty string / error message
        error_msg = f"Error extracting text: {str(e)}"
        print(error_msg)
        return ""
llm = ChatOpenAI(model="gpt-4o")
def divide(a: int, b: int) -> float:
    """Divide a and b."""
    return a / b
# The collection of tools
tools = [
    divide,
    extract_text
]
llm_with_tools = llm.bind_tools(tools, parallel_tool_calls=False)
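As a quick sanity check (not part of the original walkthrough), invoking the tool-bound model directly should return an AIMessage whose tool_calls field carries a structured call rather than a plain-text answer:
# The bound model emits a structured tool call instead of answering in prose.
response = llm_with_tools.invoke("What is 6790 divided by 5?")
print(response.tool_calls)
# Roughly: [{'name': 'divide', 'args': {'a': 6790, 'b': 5}, 'id': '...', 'type': 'tool_call'}]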
from typing import TypedDict, Annotated, Optional
from langchain_core.messages import AnyMessage
from langgraph.graph.message import add_messages
class AgentState(TypedDict):
    # The input document
    input_file: Optional[str]  # Contains file path, type (PNG)
    messages: Annotated[list[AnyMessage], add_messages]
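The add_messages annotation is the reducer that makes the "loop" work: each node returns only its new messages, and LangGraph appends them to the existing list instead of overwriting it. A small standalone illustration of the reducer's behavior (shown only for clarity, outside the graph):
from langchain_core.messages import AIMessage, HumanMessage
from langgraph.graph.message import add_messages

existing = [HumanMessage(content="Divide 6790 by 5")]
new = [AIMessage(content="The result is 1358.")]
# add_messages appends the new messages to the existing list (merging by message id).
merged = add_messages(existing, new)
print(len(merged))  # 2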
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.utils.function_calling import convert_to_openai_tool
def assistant(state: AgentState):
    # System message
    textual_description_of_tool = """
extract_text(img_path: str) -> str:
    Extract text from an image file using a multimodal model.
    Args:
        img_path: A local image file path (strings).
    Returns:
        A single string containing the concatenated text extracted from each image.
divide(a: int, b: int) -> float:
    Divide a and b
"""
    image = state["input_file"]
    sys_msg = SystemMessage(
        content=(
            f"You are a helpful agent that can analyse some images and run "
            f"some computations with the provided tools:\n{textual_description_of_tool}\n"
            f"You have access to some optional images. Currently the loaded image is: {image}"
        )
    )
    return {"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])], "input_file": state["input_file"]}
'''
1. Define a `tools` node with our list of tools.
2. The `assistant` node is just our model with bound tools.
3. Add a tools_condition edge, which routes to END or to tools based on whether the assistant calls a tool.
4. Connect the `tools` node back to the `assistant`, forming a loop:
    - After the assistant node executes, tools_condition checks whether the model's output is a tool call.
    - If it is a tool call, the flow is directed to the tools node.
    - The tools node connects back to assistant.
    - This loop continues as long as the model decides to call tools.
    - If the model response is not a tool call, the flow is directed to END, terminating the process.
'''
from langgraph.graph import START, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition
from IPython.display import Image, display
# Graph
builder = StateGraph(AgentState)
# Define nodes: these do the work
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))
# Define edges: these determine how the control flow moves
builder.add_edge(START, "assistant")
builder.add_conditional_edges(
    "assistant",
    # If the latest message (result) from assistant is a tool call -> tools_condition routes to tools
    # If the latest message (result) from assistant is a not a tool call -> tools_condition routes to END
    tools_condition,
)
builder.add_edge("tools", "assistant")
react_graph = builder.compile()
# Show the graph structure
display(Image(react_graph.get_graph(xray=True).draw_mermaid_png()))
# Example: a plain tool call
messages = [HumanMessage(content="Divide 6790 by 5")]
messages = react_graph.invoke({"messages": messages, "input_file": None})
for m in messages['messages']:
    m.pretty_print()
# Example: the input is an image containing text
messages = [HumanMessage(content="According to the note provided by Mr. Wayne in the provided images. \
                                    What's the list of items I should buy for the dinner menu ?")]
messages = react_graph.invoke({"messages": messages, "input_file": "Batman_training_and_meals.png"})
# The returned messages are structured objects, so the key information needs to be extracted from them
for m in messages['messages']:
    m.pretty_print()
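If you only need the final answer rather than the whole trace, the last message in the returned state is the assistant's final (non-tool-call) reply:
# Print only the assistant's final reply.
print(messages['messages'][-1].content)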