Introduction
Large language models sometimes generate convincing but completely incorrect factual information, a problem known as “hallucination.” A new technique called Chain-of-Verification (CoVe) [1], from researchers at Meta AI, aims to address this by having models verify their own responses.
A fundamental assumption of CoVe is that, given the right prompt, the LLM can generate and execute a verification plan to check its own work, and then use this verification analysis to produce an improved response.
The overall process consists of four steps (a minimal code sketch of the pipeline follows this list):
Generate Baseline Response: The LLM generates an initial response to the query.
Plan Verifications: The LLM is given the query and the initial response and generates a list of verification questions. These questions are designed to make the LLM examine its original response for possible mistakes.
Execute Verifications: The LLM answers each verification question independently and compares each answer against the original response, looking for inconsistencies or mistakes.
Generate Final Verified Response: The LLM generates a revised response based on the verification results. This response is expected to be more accurate.
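For concreteness, here is a minimal Python sketch of the four steps. The llm callable, the prompt wording, and the line-based question parsing are all assumptions made for illustration; the paper uses few-shot prompts but does not prescribe a particular implementation.

```python
from typing import Callable

# Minimal sketch of the four CoVe steps. `llm` is a placeholder for any
# text-completion call (model API, local model, etc.); the prompt wording
# is illustrative, not the paper's exact few-shot prompts.
def chain_of_verification(query: str, llm: Callable[[str], str]) -> str:
    # Step 1: Generate Baseline Response.
    baseline = llm(f"Answer the following question.\n\nQuestion: {query}\nAnswer:")

    # Step 2: Plan Verifications - ask for fact-checking questions, one per line.
    plan = llm(
        "Write verification questions that fact-check the answer below, one per line.\n\n"
        f"Question: {query}\nAnswer: {baseline}\nVerification questions:"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # Step 3: Execute Verifications - answer each question independently
    # (this corresponds to the "factored" variant described later).
    qa_pairs = [(q, llm(f"Answer concisely.\n\nQuestion: {q}\nAnswer:")) for q in questions]

    # Step 4: Generate Final Verified Response using the verification results.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        "Revise the original answer so it is consistent with the verification results.\n\n"
        f"Question: {query}\nOriginal answer: {baseline}\n"
        f"Verification results:\n{evidence}\nRevised answer:"
    )
```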
CoVe Approach
In this section, we will go over the CoVe steps in more detail.
Generate Baseline Response: This is the first step in the pipeline, and its output serves as the baseline that CoVe seeks to improve. The LLM simply generates a response to the given query without any special prompting.
Plan Verifications: The original query and the baseline response are used to prompt the LLM to generate a series of verification questions for fact-checking. The questions are not bound to any template: the LLM can phrase them freely and does not have to match the phrasing of the original text closely. Fig. 1 shows example verification questions for this step.
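As an illustration of such a planning prompt, the sketch below pairs a query and baseline response with freely phrased verification questions. The few-shot demonstration and the prompt wording are hypothetical, written only to show the expected output format; they are not taken from the paper or from Fig. 1.

```python
# Sketch of a verification-planning prompt (Step 2). The demonstration pair
# below is hypothetical and only shows the expected output format; the LLM
# is free to phrase its questions however it likes.
PLAN_TEMPLATE = """\
Write verification questions that fact-check the answer, one per line.

Question: Who wrote the novel "1984"?
Answer: "1984" was written by George Orwell and published in 1949.
Verification questions:
Who is the author of the novel "1984"?
In what year was "1984" published?

Question: {query}
Answer: {baseline}
Verification questions:
"""

def plan_verifications(query: str, baseline: str, llm) -> list[str]:
    output = llm(PLAN_TEMPLATE.format(query=query, baseline=baseline))
    # Parse the model output into one verification question per line.
    return [line.strip() for line in output.splitlines() if line.strip()]
```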
Execute Verifications: Given the verification questions planned in Step 2, this step answers them to assess whether the original response contains hallucinations. There are four variants of verification execution:
Joint: This variant performs the planning and execution steps (Steps 2 and 3) jointly with a single prompt, whose few-shot examples include both the verification questions and their answers. A potential drawback is that the verification answers are conditioned on the verification questions and the initial response, which can lead to hallucinations similar to those in the baseline response.
2-step: This variant addresses the drawback of the joint method by splitting planning and execution into separate steps, each with its own prompt. The planning prompt contains the baseline response, while the execution prompt contains only the verification questions produced by the planning step, so the answers are not conditioned directly on the baseline response.
Factored: This more sophisticated approach answers each verification question independently, in its own separate prompt that does not contain the baseline response. This removes potential interference between the answers to different verification questions (and from the original response), and it can handle more verification questions, since the independent prompts can be run in parallel to offset the computational cost (a sketch contrasting joint and factored execution follows this list of variants).
Factor+Revise: This variant extends the factored approach. Once the verification questions have been answered, the pipeline cross-checks whether the answers are consistent with the original response. To achieve this, a separate “cross-check” prompt is used for each question, containing the baseline response together with the verification question and its answer.
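To make the contrast between these variants concrete, the sketch below compares joint execution (questions and answers produced in one prompt that contains the baseline response) with factored execution (one independent prompt per question, optionally run in parallel). The function names and prompt wording are assumptions, not the paper's prompts.

```python
from concurrent.futures import ThreadPoolExecutor

# Joint execution (Steps 2 and 3 in one prompt): the answers are conditioned
# on the baseline response, so they can inherit its hallucinations.
def execute_joint(query: str, baseline: str, llm) -> str:
    return llm(
        "Write verification questions for the answer below and answer each of them.\n\n"
        f"Question: {query}\nAnswer: {baseline}\nVerification questions and answers:"
    )

# Factored execution: each question gets its own prompt with no access to the
# baseline response; the independent prompts can also be answered in parallel.
def execute_factored(questions: list[str], llm, max_workers: int = 4) -> list[tuple[str, str]]:
    def answer(question: str) -> str:
        return llm(f"Answer concisely.\n\nQuestion: {question}\nAnswer:")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(answer, questions))
    return list(zip(questions, answers))
```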
Generate Final Verified Response: The revised response is generated from the outcome of the verification process, using a few-shot prompt that contains the previous reasoning steps, the baseline response, and the verification question-answer pairs. For the Factor+Revise variant, the output of the cross-check inconsistency detection is also included in the prompt.
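Putting the last two pieces together, the sketch below shows a per-question cross-check prompt in the style of Factor+Revise feeding into the final revision prompt. As before, the prompt wording and function signatures are assumed for illustration rather than taken from the paper.

```python
# Factor+Revise cross-check (per question): compare one verification answer
# against the baseline response and flag any inconsistency.
def cross_check(baseline: str, question: str, answer: str, llm) -> str:
    return llm(
        "Does the verification answer contradict anything in the original answer? "
        "Reply CONSISTENT or INCONSISTENT with a short explanation.\n\n"
        f"Original answer: {baseline}\n"
        f"Verification question: {question}\n"
        f"Verification answer: {answer}\nVerdict:"
    )

# Final verified response: revise the baseline using the question-answer pairs
# and, when the Factor+Revise variant is used, the cross-check verdicts.
def final_response(query, baseline, qa_pairs, llm, verdicts=None):
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    prompt = (
        "Rewrite the original answer so that it agrees with the verification results.\n\n"
        f"Question: {query}\nOriginal answer: {baseline}\n"
        f"Verification results:\n{evidence}\n"
    )
    if verdicts:
        prompt += "Cross-check verdicts:\n" + "\n".join(verdicts) + "\n"
    prompt += "Revised answer:"
    return llm(prompt)
```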
We also note that the generations CoVe produces come with their verifications; if shown to the user, these add interpretability to the model's decisions. The trade-off is increased computational expense from generating more tokens in the output, similar to other reasoning methods such as Chain-of-Thought.
Experimental key findings
The experiments used Llama 65B with greedy decoding as the base LLM. CoVe's performance was measured on the following tasks:
Wikidata and Wiki Category List: Questions with answers in the form of a list of entities (e.g., “Who are some politicians born in Boston?”)
MultiSpan QA: Questions that seek multiple independent answers from non-adjacent text sections. For example, Q: “Who invented the first printing press and in what year?”, A: “Johannes Gutenberg, 1450”.
Longform generation of biographies: Questions that require detailed or lengthy responses, such as biographical questions. For example, “Tell me a bio of <entity>.”
The main experimental findings are the following:
CoVe improves performance on all tasks (i.e., list-based answer tasks, closed-book QA, and longform generation)
Factored and 2-step CoVe enhance performance across all tasks
Further explicit reasoning helps remove hallucinations
Instruction-tuning and CoT do not reduce hallucinations
CoVe-based Llama outperforms InstructGPT, ChatGPT and PerplexityAI
Shortform verification questions are more accurately answered than longform queries
LLM-based verification questions outperform heuristics
Open verification questions outperform yes/no-based questions
Limitations
Fundamental limitations of the Chain-of-Verification (CoVe) method:
1. Incomplete Removal of Hallucinations: Even though CoVe improves over the baseline response, it does not guarantee that hallucinations are completely eliminated.
2. Limited Scope of Hallucination Mitigation: Not all forms of hallucination, such as reasoning errors, are handled effectively. CoVe primarily addresses hallucinations that take the form of directly stated factual inaccuracies.
3. Increased Computational Expense: Like Chain-of-Thought, CoVe incurs additional computational cost from generating and executing verifications alongside responses.
4. Upper Bound on Improvement: The effectiveness of CoVe is bound by the overall capabilities of the underlying language model, particularly in its ability to identify and rectify its own mistakes.
References
[1] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.