AGI / MFM: Translation and Interpretation of "Multimodal Foundation Models: From Specialists to General-Purpose Assistants" (Part: Multimodal Agents Chaining Tools with LLMs; Conclusions and Research Trends)



6. Multimodal Agents: Chaining Tools with LLM

Key point: a new modeling paradigm chains multiple tools or experts with LLMs to solve complex, open problems; no training is required, only a few demonstration examples to teach the LLM.

Large Language Models (LLMs) (Chowdhery et al., 2022; OpenAI, 2023a) have shown intriguing properties generalizing to user prompts in various domains, and rapidly adapting to new scenarios, using in-context learning with a few examples. Inspired by such strong capabilities, researchers are now exploring a new modeling paradigm that shifts from standalone models for solving finite, pre-defined problems, into synergistically chaining multiple tools or experts with LLMs to solve complicated, open problems. Unlike what has been introduced in Chapter 5, such a system can be built without any training involved, just by using a few demonstration examples to teach the LLM to generate proper calls to existing tools.

In this chapter, we review the fast-evolving literature on chaining different multimodal experts with LLMs to solve complicated multimodal understanding problems, referred to as multimodal agents. We start with an overview of the evolution of this modeling paradigm in Section 6.1, highlighting the differences between traditional approaches and the new modeling paradigm of chaining tools with LLM. Section 6.2 gives a general overview of multimodal agents. Pivoting on an exemplary multimodal agent, MM-REACT (Yang* et al., 2023), Section 6.3 comprehensively reviews how to build a multimodal agent, its emerging capabilities in multimodal understanding, and how it can be easily extended to incorporate the latest and strongest LLM and potentially millions of tools. Finally, in Section 6.4, we end the chapter with discussions on advanced topics, such as how to improve and evaluate multimodal agents, and the diverse applications powered by multimodal agents.

6.1 Overview

Evolution of modeling paradigms: task-specific models → large multimodal models → chaining tools with an LLM (no training needed; builds on tools already available through open-source platforms or APIs)

We first revisit the evolution of modeling paradigms, from task-specific models to the most recent large multimodal models, which all require data curation and model training. We then introduce the new modeling paradigm of chaining tools with LLM, which may not require any training, but instead directly takes advantage of a pre-trained LLM and existing tools that are widely available through open-source platforms or APIs.

(1) Evolution of the modeling paradigm

Task-specific dedicated models → the pre-trained model phase (pretrain-then-finetune paradigm, e.g., the BERT family in NLP and UNITER/OSCAR in VL, still finetuned per task) →

As summarized in Figure 6.1, we are witnessing the transition from task-specific models towards general-purpose assistants across language, vision, and multimodal research.

We started with task-specific models that are trained on small-scale, well-annotated data. This results in dedicated models (Anderson et al., 2018; Li et al., 2019a; Yu et al., 2019) for each task or even each dataset.

We then transitioned to the phase of pre-trained models, with the pretrain-then-finetune paradigm widely adopted across both NLP and vision-language (VL) research. During pre-training, the model can take advantage of large-scale, web-crawled noisy data, for example, millions to billions of image-text pairs (Chen et al., 2020d; Wang et al., 2022a), or billions of text tokens (Devlin et al., 2019; Liu et al., 2019). However, the model is still mostly finetuned per task, requiring similarly small-scale, well-annotated data as the data used for training task-specific models. This paradigm has led to many well-known models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) in NLP, and UNITER (Chen et al., 2020d) and OSCAR (Li et al., 2020b) in VL. These early VL foundation models were considered to be large-scale (trained with 10M image-text pairs), but may be of intermediate or even small size in today's view (billions of pairs).

→ generalist large language/multimodal models (e.g., the GPT family, the PaLM family, the LLaMA family, Flamingo) → instruction-following models built on top of generalist models (e.g., Alpaca, LLaVA with visual instruction tuning)

Nowadays, we are entering a new era of generalist modeling, where the pre-training has been further scaled up to trillions of text tokens (Gao et al., 2023b). For downstream adaptation, these generalist models have shown strong performance with in-context few-shot learning on a few demonstration examples, or even zero-shot evaluation. These models are what we now refer to as large language/multimodal models, including the GPT family (OpenAI, 2022, 2023a), the PaLM family (Chowdhery et al., 2022; Driess et al., 2023), LLaMa (Touvron et al., 2023), and Flamingo (Alayrac et al., 2022).

Based on the generalist models, the pipeline of building instruction-following models covered in Chapter 5 similarly follows the pretrain-then-finetune paradigm. For example, Alpaca (Taori et al., 2023) is built on top of the pre-trained LLaMa (Touvron et al., 2023), then finetuned on a smaller-scale instruction tuning dataset. Similarly, for instruction-following VL models (e.g., LLaVA (Li et al., 2023e)), an additional stage of image-text alignment pre-training is introduced to align the visual representations to the frozen LLM first, followed by visual instruction tuning.

(2) New modeling paradigm: chaining tools with LLM

Pain points: challenges with basic functionalities (mathematical reasoning, information retrieval), and a general limitation that the model only represents the world described by its training data, which becomes outdated over time.

New modeling paradigm: chaining tools with LLM. LLMs (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023a) have demonstrated exceptional abilities to tackle new tasks with only a few examples or textual instructions, showing the promise of serving as general-purpose foundations for many applications. Despite being versatile and impressive, they encounter challenges with basic functionalities, such as mathematical reasoning and information retrieval. Furthermore, a fundamental limitation of not only LLMs but also other large-scale models nowadays is that they only represent the world described by their training data, which will inevitably become outdated over time. Regularly re-training the model with the latest information is simply not feasible.

A new paradigm in language modeling: supplement LLMs with external NLP tools (calculators, search engines, translation systems, calendars, etc.) → toward future intelligent agents that use tools to give LLMs perception of multimodal signals.

Meanwhile, many tasks with real-world impact cannot be readily tackled by LLMs alone. For example, accessing up-to-date information and performing computations can be done via existing tools (e.g., a search engine or calculator). Hence, recent research in language modeling has explored a new modeling paradigm by supplementing LLMs with external NLP tools (Nakano et al., 2021; Huang et al., 2022b; Ahn et al., 2022), including calculators, search engines, translation systems, calendars, or even API calls on other models.

The above studies mainly focus on a single modality, i.e., language, in which the outputs of the tools are in text format and can therefore naturally be fed into LLMs as additional knowledge. However, we live in a multimodal world, and a truly intelligent agent should be able to perform advanced multimodal reasoning and actions. How to enable LLMs with perception of multimodal signals via tool use is the focus of the remaining part of this chapter.

6.2 Multimodal Agents

Representative works: VISPROG (the first to use a programming language to chain different vision tools with an LLM), Visual ChatGPT (dialogue-based image editing by combining ChatGPT with various image generation tools), and MM-ReAct (ChatGPT collaborating with multiple vision experts to perform complex multimodal actions and reasoning).

There are several representative works on building multimodal agents with tool use of vision experts, including VISPROG (Gupta and Kembhavi, 2022b), Visual ChatGPT (Wu et al., 2023a) and MM-ReAct (Yang* et al., 2023). VISPROG is the very first work on using programming language to chain different vision tools with a LLM. Visual ChatGPT enables dialogue-based image editing by complementing ChatGPT (OpenAI, 2022) with various image generation tools. MM-ReAct shows that when collaborating with various advanced vision experts, ChatGPT can perform complex multimodal actions and reasoning. Figure 6.2 presents the fast-evolving literature on multimodal agents from November 18, 2022 to July 26, 2023. Among them, we include a few more exemplary multimodal agents in Table 6.1, along with two representative works in the NLP domain.

A typical multimodal agent framework: the user interacts directly with a tool allocator whose brain is an LLM; the LLM plans the steps that use a single tool or multiple tools together to fulfill the request, executes them, and gathers the results as inputs to the LLM to generate the response.

An overview of a typical multimodal agent framework is illustrated in Figure 6.3. The user directly interacts with the Tool Allocator, which functions as the brain of the agent. In current literature, the tool allocator is usually a LLM. To achieve the user's goal, the LLM will outline all the steps necessary with either a single tool or multiple tools collaborating together. Subsequently, it will retrieve the needed tools from all the candidate tools, and execute possibly multiple rounds of tools to fulfill the human requirement. Finally, the execution results from the tools are gathered as inputs to the LLM to generate a response to the user. Next, we cover the three key components of multimodal agents.

The three key components of a multimodal agent

Tools: provide the multimodal information that the LLM lacks, e.g., open-source models, APIs, and code interpreters

Tools. Tools are external modules that are callable by the LLM to obtain extra information that is missing from the model weights, including open-source models, public/private APIs, or code interpreters. As LLMs only accept language inputs, one must include tools that can process multimodal inputs to build a multimodal agent.
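To make the role of tools concrete, below is a minimal sketch (not from the paper) of wrapping a vision model or API as a callable tool that exposes a name, a natural-language description for the LLM, and a text-serialized output. The class, function, and tool names are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A callable expert that the LLM can invoke; all fields are illustrative."""
    name: str                   # e.g., "image_captioning"
    description: str            # natural-language description shown to the LLM
    run: Callable[[str], str]   # takes a file path or text query, returns text

def caption_image(image_path: str) -> str:
    # Placeholder for a real captioning model or API call.
    return f"[caption produced by a captioning model for {image_path}]"

TOOLS = {
    "image_captioning": Tool(
        name="image_captioning",
        description="Generates a one-sentence description of an image. "
                    "Input: image file path. Output: caption text.",
        run=caption_image,
    ),
}
```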

Planning: decompose the user request into executable steps that call tools

Planning. During planning, the LLM decomposes the user request into smaller, manageable sub-problems, and outlines a step-by-step solution, each step of which involves calling an external tool. There are two ways to teach LLMs to plan. One is to prompt the LLM with in-context few-shot examples of all candidate tools. This approach can extend the general model directly but is limited by the context length. The other approach relies on large amounts of annotated data to fine-tune the LLM, which most likely will damage the robustness and generalizability of the model.
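As a rough illustration of the in-context route, the snippet below assembles a planning prompt from tool descriptions and a few demonstration dialogues. The prompt wording, the "CALL" convention, and the character budget standing in for the context-length limit are all assumptions, not any specific agent's prompt.

```python
def build_planning_prompt(tool_descriptions: dict[str, str],
                          demos: list[str],
                          user_request: str,
                          max_chars: int = 8000) -> str:
    """Compose a few-shot planning prompt; truncation stands in for the
    context-length limit that caps how many tools can be described."""
    tool_specs = "\n".join(
        f"- {name}: {desc}" for name, desc in tool_descriptions.items()
    )
    prompt = (
        "You can call the following tools:\n"
        f"{tool_specs}\n\n"
        "Demonstrations:\n" + "\n\n".join(demos) + "\n\n"
        f"User: {user_request}\n"
        "If a tool is needed, reply with: CALL <tool name> <input>\n"
    )
    return prompt[:max_chars]
```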

Execution: the LLM translates the plan into tool calls, and the results feed back into the LLM's dialogue with the user

Execution. The generated plan is further translated into executable calls to the required tools, which can be done via regular expression matching (Yang* et al., 2023); directly prompting LLMs to generate executable programs (Surís et al., 2023); or leveraging the in-context few-shot learning capability of LLMs by providing natural language instructions that describe the roles of each module together with a few calling examples (Lu et al., 2023b). The execution results are fed back to the LLM to generate a response to the user.
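Putting the three components together, a bare-bones tool-allocator loop might look like the following sketch. `call_llm` is a placeholder for a real LLM endpoint, the two toy tools stand in for real experts, and the `CALL <tool> <input>` convention is an assumed output format rather than a documented protocol.

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (e.g., a chat-completion endpoint)."""
    raise NotImplementedError

TOOLS = {
    "image_captioning": lambda path: f"[caption of {path}]",  # toy stand-in
    "ocr": lambda path: f"[scene text found in {path}]",      # toy stand-in
}

ACTION = re.compile(r"CALL\s+(\w+)\s+(.+)")

def run_agent(user_request: str, max_rounds: int = 5) -> str:
    history = f"User: {user_request}\n"
    for _ in range(max_rounds):
        reply = call_llm(history)                  # planning
        match = ACTION.search(reply)
        if match is None:                          # no tool needed: final answer
            return reply
        tool_name, tool_input = match.groups()     # plan -> executable call
        observation = TOOLS[tool_name](tool_input.strip())   # execution
        history += f"{reply}\nObservation: {observation}\n"  # feed result back
    return call_llm(history + "Summarize the findings for the user.\n")
```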

6.3 Case Study: MM-REACT

We use MM-REACT (Yang* et al., 2023) as a case study to show how to build a multimodal agent, its emerging capabilities in multimodal understanding, and how it can be easily extended to incorporate the latest and strongest LLM and potentially millions of tools.

6.3.1 System Design: MM-REACT uses ChatGPT as the brain and pairs it with multimodal vision experts, supporting multimodal inputs and outputs such as images and video for multimodal reasoning and action

MM-ReAct designs a system paradigm that composes numerous multimodal tools with ChatGPT (OpenAI, 2022) for multimodal reasoning and action. By augmenting the language-only ChatGPT with various multimodal tools, MM-REACT supports both inputs and outputs in multiple modalities, including text, image and video, as shown in Figure 6.4.

Figure 6.5 shows the system design of MM-REACT. The multimodal tools explored in MM-REACT are mainly computer vision models that take an image as input and interpret the image content from different perspectives. For instance, the image captioning model generates a natural description, the OCR model extracts the scene text in the image, the celebrity recognition model identifies the celebrity names, and the object detection model extracts the salient objects with bounding box locations. An LLM such as ChatGPT serves as the brain of the agent, which plans which tools to use, and in what order, based on the input image and the user intent. Next, with the example in Figure 6.5, we unfold the planning and execution of MM-REACT behind the scenes.

User prompt: MM-REACT passes the image file path as the ChatGPT input, letting ChatGPT call vision tools during planning to understand the image content and answer the user's question

User prompt. As ChatGPT only accepts language inputs, to enable images as inputs, we simply use the file path as the input to ChatGPT. The file path functions as a placeholder, allowing ChatGPT to treat it as a black box and later seek help from different tools during the planning stage. Besides the input image, the user can also provide the intent in text format (e.g., a question about the input image). When there is no text input from the user, the goal is to get a general understanding of the image.

Planning: MM-REACT uses watchwords and string matching to decide whether an external tool is needed, and supplies tool descriptions with usage examples to guide ChatGPT in calling the right vision experts

Planning. Upon receiving the input image and user prompt, ChatGPT plans what tools to use. Inspired by ReAct (Yao et al., 2022c), MM-REACT instructs ChatGPT to respond with certain watchwords, such as "Assistant, what objects are there in the image? <file path>", if a specific tool is required (i.e., the action request in Figure 6.5). In practice, one can tell whether a multimodal tool is needed by simply string-matching the keyword "Assistant," in ChatGPT's response.

MM-ReAct encourages ChatGPT to show the thought (reasoning) process to highlight why an external tool is needed, which has been proven beneficial in NLP studies (Yao et al., 2022c). In addition, for generating proper calls to each tool, both instructions and in-context examples are added as the prefix when prompting ChatGPT. Each tool is described with the model name, a general description of its capability, the input data format, and the output information. After describing each tool, a few in-context dialogue examples are also included for enhanced performance.
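A minimal sketch of the watchword check described above: simple string matching decides whether ChatGPT's reply is an action request, and a regular expression pulls out the question and file path. The exact pattern and response format are assumptions; MM-REACT's actual prompts and parsing are more elaborate.

```python
import re

# Per the text, an action request looks roughly like:
#   "Assistant, what objects are there in the image? <file path>"
ACTION_REQUEST = re.compile(r"Assistant,\s*(?P<question>[^<]+)<(?P<file_path>[^>]+)>")

def parse_action_request(llm_response: str):
    """Return (question, file_path) if ChatGPT asked for a tool, else None."""
    if "Assistant," not in llm_response:   # keyword match: no tool needed
        return None
    match = ACTION_REQUEST.search(llm_response)
    if match is None:
        return None
    return match.group("question").strip(), match.group("file_path").strip()

print(parse_action_request(
    "Assistant, what objects are there in the image? <images/example.jpg>"))
# -> ('what objects are there in the image?', 'images/example.jpg')
```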

Execution: MM-REACT parses ChatGPT's action requests via regular-expression matching, invokes the corresponding tools for the various vision tasks, then feeds the aggregated results back into the dialogue with ChatGPT to answer the user's question.

Execution. Given the action request from ChatGPT, the tool name and the file path can be parsed via regular expression matching, which are used to invoke the tool (action execution).

Take the example shown in Figure 6.5: upon receiving the input image, ChatGPT first invokes a series of tools for a general understanding of the image. The invoked tools include image captioning for an overall description of the image; dense captioning to get the region-level, more detailed description of the objects in the image; object tagging to get the tags of the objects in the image; and face detection to get the box coordinates of the two faces mentioned in the object tags. The outputs from the tools (i.e., observations) are serialized as text and fed back to ChatGPT.

Combining the observations with the chat history, ChatGPT can further invoke additional experts or return the final answer to the user. In this specific example, ChatGPT invokes a second round of thought-action-observation over the two faces detected in the image and calls celebrity recognition to get the names of these two persons.

Response generation: when no further tools are needed, MM-REACT summarizes all gathered observations into the reply to the user; for follow-up questions about details it does not yet know, it can combine the recognized names and bounding boxes with a Bing search.

When ChatGPT decides no external tools are needed, it takes into consideration all the observations gathered and summarizes them as the response to the user, which is "This image contains two celebrities, Kobe Bryant and Paul Pierce. They are both basketball players." for the example shown in Figure 6.5.

If the user continues to interact with MM-REACT, it repeats the process described above, but with all observations and chat history available when planning for the tools needed. For instance, if the user then asks "how many championship rings did the player on the left win in his career", the answer is not available in the existing observations or chat history, but ChatGPT has the bounding boxes to decide who is on the left, and also the names of the players. It plans to invoke Bing Search to find the right answer, which should be 5.

6.3.2 Capabilities: MM-REACT demonstrates a range of representative capabilities and application scenarios

Figure 6.6 shows the representative capabilities and application scenarios that MM-REACT demonstrates, including visual math and text reasoning, understanding visual-conditioned jokes/memes, spatial/coordinate understanding, visual planning and prediction, multi-image reasoning, multi-hop document understanding, open-world concept understanding, and video analysis and summarization.

In addition, we show an example of the full response from MM-REACT on multi-image reasoning in Figure 6.7, which may not be easily achievable by the visual instruction tuning in Chapter 5. For more comprehensive examples of all emerging capabilities of MM-REACT, we refer the reader to the original paper.

6.3.3 Extensibility (an advantage of building multimodal agents by tool chaining): two ways to extend the system

One favorable property of tool chaining to build multimodal agents is that the system can be easily extended and enhanced, from two perspectives. One is to upgrade the core part of the system, the LLM, and the other is to expand the number of external tools.

(1) Upgrading the LLM: MM-REACT's system design allows the LLM to be upgraded to a newer, stronger model without retraining, e.g., swapping ChatGPT for language-only GPT-4 to potentially match multimodal GPT-4

The system design of MM-REACT allows for upgrading the core part of the system, the LLM, to newer and more powerful models as they come out, without the need for re-training. We show an example in Figure 6.8 of upgrading ChatGPT to language-only GPT-4, which improves MM-REACT to potentially match the performance of multimodal GPT-4.

(2) Plug-and-play (adding more tools): existing multimodal agents integrate tools through a plug-and-play mechanism (e.g., HuggingGPT, Chameleon, RestGPT) that allows adding tools without training → scaling to thousands or millions of tools remains challenging (the future potential shown by TaskMatrix.AI) → SAM can serve as a tool that enables additional ways for humans to interact with the multimodal agent

Plug-and-play (adding more tools). Existing multimodal agents incorporate tools via a plug-and-play mechanism, allowing more tools to be added without training. One prominent work along this direction is HuggingGPT (Shen et al., 2023b), which proposes to leverage all open-source models hosted on huggingface. Chameleon (Lu et al., 2023b) incorporates not only huggingface models, but also open-source models from GitHub, the Bing search API, and a python compiler. RestGPT (Song et al., 2023) proposes a multi-level online planning framework that effectively handles the practical challenges associated with integrating LLMs with more than 100 RESTful APIs. However, it remains challenging to scale this framework to thousands to millions of tools, which is the potential future demonstrated in TaskMatrix.AI (Liang et al., 2023b).
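The plug-and-play idea can be sketched as a registry that new tools join without retraining anything; the LLM simply sees an updated catalog of descriptions at the next request. The decorator, tool names, and stub implementations below are illustrative and are not the APIs of HuggingGPT, Chameleon, or RestGPT.

```python
TOOL_REGISTRY: dict[str, dict] = {}

def register_tool(name: str, description: str):
    """Decorator that adds a tool to the catalog; no model weights change."""
    def wrap(fn):
        TOOL_REGISTRY[name] = {"description": description, "fn": fn}
        return fn
    return wrap

@register_tool("web_search", "Searches the web and returns a text snippet.")
def web_search(query: str) -> str:
    return f"[top search snippet for: {query}]"          # stand-in for a real API

@register_tool("image_segmentation", "Segments the object at a clicked point.")
def image_segmentation(image_and_point: str) -> str:
    return f"[mask description for {image_and_point}]"   # stand-in for a model

def tool_catalog_for_prompt() -> str:
    """Rebuilt per request, so newly registered tools become visible at once."""
    return "\n".join(f"- {n}: {m['description']}" for n, m in TOOL_REGISTRY.items())
```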

Moreover, one can leverage SAM (Kirillov et al., 2023) as a tool to allow for more types of human interaction with the multimodal agent other than text. Recall that in MM-REACT, the user intent is fully captured by the natural language query from the user. In InternGPT (Liu et al., 2023l), by connecting the tool SAM with GPT, the system allows for more ways to interact, for example, via clicks, scribbles, and drawing bounding boxes. These additional interactions, to some extent, mimic the action of finger-pointing when we humans have a conversation.

6.4 Advanced Topics

In this section, we discuss more advanced topics and shed light on potential future directions.

6.4.1 Comparison to Training with LLM in Chapter 5

Two approaches to building advanced LLM-based multimodal systems → the possibility of an intermediate regime that combines the strengths of both paradigms → open questions: can an LMM such as LLaVA replace the LLM as the tool allocator, and which capabilities still require tools?
T1: instruction tuning yields an end-to-end model that directly interprets rich semantics in multimodal inputs, but it requires data curation and training and is therefore costly.
T2: chaining the LLM with off-the-shelf tools, taught to plan via in-context few-shot examples, but the right tool may not be selected, and weak domain experts can lead to poor performance.

We have covered two directions for building advanced multimodal systems based on LLMs. As the key distinction, the multimodal agents in this chapter leverage LLMs' high-level planning abilities to allocate various multimodal tools, while training multimodal models with LLMs in Chapter 5 solely leverages LLMs for text generation conditioned on multimodal inputs.

Nonetheless, both of these methods exhibit their respective advantages and disadvantages. On one hand, instruction tuning enables an end-to-end model that directly interprets rich semantics in multimodal inputs, but requires data curation and training, hence is more computationally expensive. Moreover, limited instruction tuning data may limit its capabilities in certain scenarios, such as OCR. On the other hand, one can build a multimodal agent without any training by chaining LLMs with abundant off-the-shelf models/APIs/code interpreters as tools, and leveraging in-context few-shot examples to teach LLMs planning. However, as there is no training, the system may fail to invoke the right tool. Moreover, weak domain experts may produce noisy outputs, which can confuse the LLM in planning or reasoning, leading to weak performance.

Though these two approaches exhibit distinct variations, we envision the possibility of an intermediate domain that amalgamates the strengths of both paradigms, and raise the following questions. Now that we have open-source LMMs such as LLaVA (Liu et al., 2023c), can we replace the LLM with LLaVA as the tool allocator? If so, what capabilities would require a tool to be enabled? And what problems can be solved by instruction tuning? These are interesting directions that may be worth exploring in the near future.

6.4.2 Improving Multimodal Agents

Pain point: current agents mainly rely on in-context few-shot examples to teach the LLM to plan, which can be unreliable and inaccurate

Existing multimodal agents mainly rely on in-context few-shot examples to teach the LLM to plan, which can be unreliable, leading to inaccurate tool use. To improve the accuracy in planning, several works have been proposed, and we group them into three categories below.

Composing tools via code generation (accuracy issues remain since the code is still generated by an LLM): explore programming languages instead of natural language for more accurate tool-use planning, using GPT-3 or Codex to generate Python code from natural-language instructions, e.g., Visual Programming and ViperGPT

Most existing multimodal agents use natural language to prompt the LLM to plan which tool to use. Researchers (Gupta and Kembhavi, 2023; Surís et al., 2023) have also been exploring the use of programming languages for more accurate execution. Visual programming (Gupta and Kembhavi, 2023) is a prominent work along this direction, which uses the in-context learning ability of GPT-3 (Brown et al., 2020) to generate python-like modular programs from natural language instructions for compositional visual tasks. ViperGPT (Surís et al., 2023) instructs GPT-3 Codex (Chen et al., 2021a) to generate Python code that composes multimodal tools for one-round query answering. However, as the code is still generated by an LLM, the problem of inaccurate tool use remains.
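To illustrate the code-generation route, the snippet below shows the kind of modular program an LLM might be prompted to emit for a query such as "How many cars are to the left of the bus?". The `detect` helper and its toy outputs are hypothetical stand-ins for real vision APIs, not ViperGPT's actual interface.

```python
# Hypothetical helper a generated program could rely on.
def detect(image, category):
    """Return bounding boxes [x1, y1, x2, y2] for a category (toy data here)."""
    return {
        "bus": [[300, 50, 500, 300]],
        "car": [[10, 80, 120, 200], [150, 90, 260, 210], [600, 100, 700, 220]],
    }.get(category, [])

# A program in the style an LLM might generate for:
#   "How many cars are to the left of the bus?"
def execute_query(image):
    buses = detect(image, "bus")
    cars = detect(image, "car")
    if not buses:
        return 0
    bus_left_edge = buses[0][0]
    return sum(1 for car in cars if car[2] < bus_left_edge)

print(execute_query(image=None))  # -> 2 with the toy detections above
```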

Improving accuracy in tool use: self-assessment (AssistGPT attempts to improve tool-use accuracy through self-assessment)

A recent work, AssistGPT (Gao et al., 2023a), tries to improve the accuracy of tool use via self-assessment. It adds an inspection-and-learning loop to the system. When a round of planning and execution is finished, the system inspects the outcome and determines whether the reasoning path of the tool call was a success; if so, it saves the path as an in-context example to teach the LLM more accurate tool calling in future rounds.
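A rough sketch of the inspect-and-learn idea: after each round, an outcome check decides whether the reasoning path succeeded, and successful traces are stored as future in-context examples. The judge callable and data structures are assumptions, not AssistGPT's implementation.

```python
successful_traces: list[str] = []   # reused later as in-context examples

def inspect_and_learn(plan: str, observations: list[str], final_answer: str,
                      judge) -> None:
    """`judge` is any callable scoring whether the round succeeded, e.g.,
    another LLM call or a heuristic check against the user's feedback."""
    trace = (f"Plan: {plan}\n"
             + "\n".join(f"Observation: {o}" for o in observations)
             + f"\nAnswer: {final_answer}")
    if judge(trace):                        # deemed a success
        successful_traces.append(trace)     # teach future planning rounds

def planning_prefix(max_examples: int = 3) -> str:
    """Prepend recent successful traces to the next planning prompt."""
    return "\n\n".join(successful_traces[-max_examples:])
```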

Improving accuracy in tool use: instruction tuning (generate an instruction-API pair dataset via self-instruct and finetune a smaller LLM to improve tool-use accuracy)

Improving accuracy in tool using: instruction tuning. Another thread on improving accuracy in tool use is to combine the system with instruction tuning (Patil et al., 2023; Yang et al., 2023c). One can generate a dataset of instruction-API pairs via self-instruct to tune a smaller LLM (e.g., Vicuna-7B (Vicuna, 2023)).
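One plausible shape for such instruction-API training pairs is sketched below; the JSON fields, tool name, and response template are made up for illustration in the self-instruct style and do not correspond to any released dataset.

```python
import json

# A single synthetic training example pairing a user instruction with the
# tool call the finetuned model should emit; field names are illustrative.
example = {
    "instruction": "What does the sign in this photo say? <demo/street.jpg>",
    "api_call": {
        "tool": "ocr",
        "arguments": {"image_path": "demo/street.jpg"},
    },
    "response_template": "The sign reads: {ocr_output}",
}

print(json.dumps(example, indent=2))
```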

Replacing the LLM with an LMM as the tool allocator would remove the need to unify tool outputs into text sequences and support more natural interaction with multimodal tools, e.g., using multimodal GPT-4

LMM as the tool allocator?

In addition, as LMMs evolve, we envision that the LLM can be replaced by an LMM as the tool allocator in the system, to enable even more advanced application scenarios. If the tool allocator can take multimodal inputs, there is no need to unify the outputs of tools into text sequences, allowing more natural interactions between the tool allocator and multimodal tools, particularly those producing multimodal outputs. For instance, one can imagine using multimodal GPT-4 (OpenAI, 2023a) to coordinate various image or video generation tools to make a short movie by providing it with a sketch of the storyline and visual examples of the main characters.

6.4.3 Diverse Applications of Multimodal Agents

By composing tools from a specific domain, this system paradigm supports diverse domain-specific applications, such as image synthesis, robotic execution, audio generation, 3D scene generation, medical image understanding, and vision-language navigation

By composing tools from a specific domain, this new system paradigm can also support diverse domain-specific applications.

Yu et al. (2023b) compose LLMs with image synthesis tools and object-level/pixel-level image understanding tools to build a data synthesis pipeline that provides diverse annotations on synthesized images. Instruct2Act (Huang et al., 2023c) complements the LLM with robotic executors, to enable robotic actions based on multimodal instructions. When chaining a pool of audio models with an LLM, AudioGPT (Huang et al., 2023a) can understand and generate speech, music, sound and talking heads. Similarly, WavJourney (Liu et al., 2023i) further supports compositional audio creation with storylines encompassing speech, music, and sound effects. With tracking, captioning, and audio understanding models, ChatVideo (Wang et al., 2023c) enables ChatGPT to understand multi-channel videos. Other application scenarios include 3D scene generation (Lin et al., 2023; Feng et al., 2023), medical image understanding (Liu and Zuo, 2023; Sun et al., 2023c) and vision-language navigation (Zhou et al., 2023b).

6.4.4 Evaluation of Multimodal Agents

Multimodal tool use enables broad capabilities, but tool-use accuracy has not been studied quantitatively; API-Bank is a starting point for systematically evaluating tool-augmented LLMs

Multimodal tool using. Although we have seen qualitative examples of new scenarios enabled by multimodal agents, it remains unclear how these agents perform in terms of the accuracy of tool use. API-Bank (Li et al., 2023k) is a starting point for building a pipeline to systematically evaluate tool-augmented LLMs.


Emergent capabilities: existing vision-language benchmarks do not examine the emergent capabilities of large multimodal models and agents; researchers have started designing comprehensive evaluation samples to facilitate LMM evaluation, e.g., the six core VL capabilities defined by MM-Vet

Emergent capabilities. Existing VL benchmarks focus on specific capabilities of interest, such as visual recognition (Antol et al., 2015) and image description (Chen et al., 2015; Agrawal et al., 2019), as well as other benchmarks for specialized capabilities such as scene text understanding (Sidorov et al., 2020; Gurari et al., 2018), commonsense reasoning (Zellers et al., 2019), and outside knowledge (Schwenk et al., 2022). The intriguing abilities shown in large multimodal models and multimodal agents are not examined by existing benchmarks, such as solving math problems written on a blackboard, reasoning about events and celebrities in news images, or explaining visual jokes. Furthermore, the long, chatty outputs from these systems pose challenges to today's evaluation metrics. Researchers (Fu et al., 2023; Liu et al., 2023j) have started to design comprehensive evaluation samples to facilitate LMM evaluation. As an attempt to test multimodal systems on integrated capabilities, MM-Vet (Yu et al., 2023d) defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combinations (Figure 6.10). In addition, to accommodate the open-ended free-form text outputs, MM-Vet proposes an LLM-based evaluator to enable evaluation across different question types and answer styles.
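The LLM-based evaluation idea can be sketched as prompting a judge model with the question, a reference answer, and the agent's free-form output, then parsing a numeric score. The prompt wording and 0-1 scale below are assumptions rather than MM-Vet's exact grader.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a judge LLM (e.g., a chat-completion API)."""
    raise NotImplementedError

def grade_answer(question: str, reference: str, prediction: str) -> float:
    prompt = (
        "Grade how well the prediction answers the question, given the "
        "reference. Reply with a single number between 0.0 and 1.0.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Prediction: {prediction}\n"
        "Score:"
    )
    try:
        return float(call_llm(prompt).strip())
    except ValueError:
        return 0.0   # fall back when the judge's reply is not a number
```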

6.4.5 Tool Creation

In NLP: explore creating tools on the fly, by writing code or instructions, to meet user needs, e.g., CREATOR
For multimodal agents: find ways to create tools that can process multimodal inputs, e.g., ViperGPT-style program generation or AutoML GPT

Imagine if we have a completely new scenario without a robust tool to use. Can we create a tool based on the user need on-the-fly? In NLP, CREATOR (Qian et al., 2023) proposes to create tools by writing python code for math reasoning, as opposed to calling a math solver API such as Wolfram Alpha. Cai et al. (2023) further explore the capabilities of LLMs to make tools, and experiment with two LLMs, one as the tool maker and the other as the tool user, to collaboratively solve complicated tasks, such as scheduling a meeting. In terms of multimodal agents, the challenge is how to create a tool that can process multimodal inputs. One may follow ViperGPT (Surís et al., 2023) to instruct LLMs to generate python programs leveraging pre-existing python packages such as OpenCV. AutoML GPT (Zhang et al., 2023j) envisions that one can utilize LLMs to automate the model training pipeline. There may be potential to develop novel multimodal deep learning tools tailored to more effectively address the requirements of users.

6.4.6 Retrieval-Augmented Multimodal Agents

Background: much information resides in databases, and users may need it retrieved accurately
In NLP: augment LLMs with external data encoded in structured language and relation representations; a retriever fetches relevant documents and a generator produces predictions, since all world knowledge cannot be encoded into the weights of pre-trained models

In real-life applications, a substantial portion of information resides within databases, and user needs may require accurate retrieval of such information. Meanwhile, it is infeasible to encode all the world knowledge into the weights of pre-trained models, particularly when it comes to long-tail concepts and fast-evolving data.

In NLP, several works augment LLMs with external data encoded with structured language and relation representations (Peters et al., 2019; Guu et al., 2020; Lewis et al., 2020). Given input texts, such retrieval-augmented models utilize a retriever that retrieves relevant documents from an external memory, and use a generator to generate predictions given the retrieved documents.

For multimodal agents: inspired by retrieval-augmented models, leverage visual and/or textual knowledge to improve vision tasks; retrieved external knowledge supplies the core model with additional information, e.g., RAC, K-LITE, REACT, and RA-CM3

Motivated by retrieval-augmented models in NLP, several recent works leverage visual and/or textual knowledge to improve vision tasks, such as image classification (Long et al., 2022), captioning (Yang et al., 2023a), question answering (Wu et al., 2021; Marino et al., 2021; Yang et al., 2022d; Chen et al., 2022e), image generation (Blattmann et al., 2022; Sheynin et al., 2022; Chen et al., 2022f; Zhou et al., 2022c), and multimodal tasks simultaneously (Yasunaga et al., 2022). RAC (Long et al., 2022) improves long-tail classification by retrieving from a non-parametric memory consisting of pre-encoded images and text. K-LITE (Shen et al., 2022a) enhances the text prompts with retrieved external knowledge that is encoded in natural language. REACT (Liu et al., 2023d) retrieves from billions of paired image-text knowledge and aims to improve task transfer performance for core vision problems. Among them, RA-CM3 (Yasunaga et al., 2022) builds the first retrieval-augmented LMM, with a multimodal retriever to retrieve multimodal documents and a retrieval-augmented generator that can generate both text and images. Chaining tools with LLM shares a strong connection with these retrieval-augmented methods in that both leverage external knowledge to provide additional information for the core model to utilize. In the multimodal regime, the image itself can be used as the query to gain external knowledge, either retrieved from a knowledge base or extracted from another pre-trained vision expert model.
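A minimal sketch of using the image itself as the retrieval query against a multimodal vector index: embed the query image, find the nearest stored entries, and hand the retrieved text to the LLM as extra context. The `embed` function is a random stand-in for a CLIP-style encoder, and the index is brute-force cosine similarity rather than a production vector store.

```python
import numpy as np

def embed(item) -> np.ndarray:
    """Stand-in for a CLIP-style encoder mapping an image path or text to a vector."""
    rng = np.random.default_rng(abs(hash(str(item))) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

# Tiny external memory of (text, embedding) pairs.
knowledge = [
    "The Golden Gate Bridge opened in 1937.",
    "Kobe Bryant won five NBA championships.",
]
memory = [(text, embed(text)) for text in knowledge]

def retrieve(query_image_path: str, k: int = 1) -> list[str]:
    q = embed(query_image_path)   # the image itself is the query
    scored = sorted(memory, key=lambda kv: float(q @ kv[1]), reverse=True)
    return [text for text, _ in scored[:k]]

retrieved = retrieve("photos/query.jpg")
prompt = "Context:\n" + "\n".join(retrieved) + "\nAnswer the user's question."
```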

7. Conclusions and Research Trends

Multimodal foundation models are evolving rapidly; the shared overarching goal is to create general-purpose models that follow human intent and perform a wide range of vision tasks in the wild

Multimodal foundation models have garnered significant interest among scholars in the fields of computer vision and multimodal vision-language research. Although the prevailing research topics, approaches and methodologies have been evolving (encompassing image self-supervised learning, language-image contrastive learning, text-to-image generation, unified vision modeling, and large language-and-vision assistants), they converge on a common overarching objective: the creation of general-purpose models and systems capable of following human intents and effortlessly executing a diverse array of vision and vision-language tasks in the wild. In this chapter, we provide a concise summary of what has been reviewed, and delve into the prevailing research tendencies in the field.

7.1 Summary and Conclusions: two classes of recent advances in multimodal foundation model research

Specific-purpose multimodal foundation models: pre-training on problem-related data plus zero-/few-shot transfer

This paper surveys the most recent advances at the frontier of multimodal foundation model research, categorized into the two classes discussed below.

>>Specific-purpose multimodal foundation models. There is a diverse set of problems to tackle in the computer vision community. To lay a comprehensive foundation for the introduction of general-purpose visual assistants, we have discussed many seminal papers in the era of pre-training. The major paradigm during this period is pre-training on a large amount of problem-related data, and then transferring to a number of real-world scenarios of the same problem type in a zero- or few-shot fashion. More specifically, we have presented two general topics: (i) Visual Understanding in Chapter 2: individual multimodal foundation models have been developed to analyze the content of visual data at the image, region, and pixel levels, respectively. The language-augmented vision models are a popular family, contributing to the recent success of visual understanding tasks in the wild. (ii) Visual Generation in Chapter 3: text-to-image generation models have served as the foundation for image synthesis, which has been successfully extended to allow user controllability and customization in more fine-grained manners. The availability and creation of large amounts of problem-related data has played a key role in making these multimodal foundation models possible.

通用型助手關注具有統(tǒng)一網(wǎng)絡架構和接口的通用型助手模型研究,為視覺任務提供了類似于NLP中的通用助手的解決方案

>>General-purpose assistants. We have reviewed the recently emerged literature on building general-purpose assistants, which often possess a unified network architecture, a unified input-output data format, and a general interface that facilitates easy interaction with humans. Inspired by the success in NLP that an LLM such as ChatGPT/GPT-4 is a general assistant for a wide range of language tasks, researchers in computer vision have explored various solutions to their counterpart for vision tasks. Based on how the LLM is leveraged in the methodology, existing works can be categorized into three topics: (i) Unified Vision Models in Chapter 4: the spirit of unified modeling in LLMs is borrowed to build unified vision models at different levels and across different tasks. (ii) Training with LLM in Chapter 5: starting with a pre-trained LLM, visual data is connected to the LLM for end-to-end training. (iii) Chaining with LLM in Chapter 6: by freezing the LLM, existing vision experts can be triggered by prompt-engineering the LLM to complete specific vision tasks.

The comparisons among these models are summarized in Table 7.1.

7.2 Towards Building General-Purpose AI Agents

專門的多模態(tài)基礎模型通用視覺助手目前已出現(xiàn)強大的視覺助手(如Flamingo和 multimodal GPT-4相),但相比未來的多模態(tài)AI智能體處于初級階段

At the end of each chapter, we have discussed future trends for individual topics. The paper itself is organized to demonstrate the transition from specialist multimodal foundation models to general-purpose visual assistants. Though powerful, existing visual assistants such as Flamingo (Alayrac et al., 2022) and multimodal GPT-4 (OpenAI, 2023b) are in a preliminary form, compared with the grand vision of building a general-purpose multimodal AI agent via foundation models. In what follows, we highlight a number of research trends towards this goal.

Generalist agents with multi-modality: a research trend toward a single generalist agent that interacts with the world by fusing multiple channels and that perceives and synthesizes visual signals (e.g., Gato, PaLM-E); visual perception is a crucial component and also a challenge

Generalist agents with multi-modality. This aligns with the grand goal of building a single generalist agent that interacts with the world like humans, through fusing multiple channels such as language, vision, speech and actions. From this perspective, the notion of multimodal foundation models becomes somewhat indistinct on its own. Instead, it serves as a crucial component of the agent for perceiving and synthesizing visual signals. For example, Gato (Reed et al., 2022) and PaLM-E (Driess et al., 2023) perform a wide range of language, multimodal and control tasks with a single set of model weights, where visual perception is a crucial component in understanding the environment. This also raises challenges in designing effective and scalable pre-training objectives for unified vision and multimodal modeling.

Alignment with human intents: visual prompts can express some intents more precisely than language, and multimodal human-machine interaction based on visual prompts is key to unlocking new scenarios

Alignment with human intents. AI alignment research focuses on steering AI systems towards humans' intended goals, values, or ethical guidelines. An AI system is deemed aligned when it effectively promotes the desired goals. Though language has exhibited its generality in expressing human intents, it is not always the best option. As demonstrated in SAM (Kirillov et al., 2023) and ControlNet/GLIGEN (Zhang and Agrawala, 2023; Li et al., 2023n), human intents can be more precisely and conveniently represented with visual prompts such as key points, bounding boxes and sketch drawings, for visual understanding and generation tasks, respectively. Building foundation models that are well equipped with such a multimodal human-machine interaction interface is a key step to unlock new use scenarios where human intents are best represented visually, for example, the spatial arrangement of elements within a scene, or the artistic style and visual appeal of a piece of visual art.

An LLM-powered AI agent system = an LLM as the agent's brain plus three key components, each with a role for the visual modality: planning (requires better fine-grained visual understanding), memory (short-term memory via in-context learning and interleaved multimodal prompts; long-term memory via fast retrieval from a multimodal vector space), and tool use (the agent learns to call external APIs for knowledge missing from the foundation model weights)

Planning, memory, and tool use. It is highlighted in Weng (2023) that an LLM-powered autonomous agent system can be built, where the LLM functions as the agent's brain, complemented by several key components: planning, memory and tool use. Following this framework, we can foresee the role of multimodal foundation models in such an AI agent system. (i) Planning. To complete complex tasks in real-world scenarios, the agent should be able to decompose large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks. In the ideal case, the AI agent possesses the self-improvement ability, engaging in self-assessment and introspection regarding previous actions, enabling it to learn from errors and enhance its approach for subsequent endeavors, ultimately leading to better outcomes. The visual modality is a common channel to represent the state of the environment. To facilitate planning, this raises challenges in improving the capability of current visual understanding models to perceive more fine-grained visual details and longer video sequences. (ii) Memory. For short-term memory, in-context learning (or prompt engineering) is utilized as the short-term memory for the model to learn. Interleaved multimodal prompts can enable new scenarios to clarify human intents. For long-term memory, it provides the agent with the capability to recall external knowledge over extended sessions, which can be implemented by fast retrieval from a multimodal vector space (Liu et al., 2023d). In terms of modeling, foundation models are required to learn new skills to effectively leverage both types of memory. (iii) Tool use. The agent learns to utilize external APIs for knowledge that is missing from the foundation model weights. New capabilities are required to deal with the vision modality in several scenarios. For example, based on both the input visual signal and instructions, the model decides and plans whether certain external APIs are needed to complete the goal, such as code execution of detection/segmentation/OCR/generator experts.

The field of multimodal foundation models is evolving at a rapid speed, with new directions and methods emerging frequently. There are many important research topics that are not discussed in this paper, mostly due to the daily-updated research innovation. We are optimistic about the future of multimodal foundation models, not only because we are convinced that foreseeable exciting research innovations and ideas in individual areas are becoming reality by following the path of LLMs in the near future, but also because connecting computer vision with the broader AI community and building general-purpose AI agents is going to significantly advance the daily life of human beings.

Acknowledgments

This book is largely based on our CVPR 2023 tutorial on vision foundation models. Many people have supported us and provided valuable feedback on the writing of this book. We thank all the authors who have contributed to the related papers, which makes the tutorial and book possible. We are also grateful to Mark de Jongh, the editor from the journal Foundations and Trends® in Computer Graphics and Vision, for inspiring and encouraging us to write this book on multimodal foundation models.
