When faced with logic puzzles that require reasoning about others' knowledge and counterfactual scenarios, large language models (LLMs) exhibit a "distinctive and revealing pattern of failure," according to a bulletin from the Bank for International Settlements (BIS).
With ChatGPT capturing public interest and central banks worldwide exploring LLM applications, the BIS has been testing these models' cognitive capabilities. In one test, the BIS presented GPT-4 with the well-known Cheryl's birthday logic puzzle, and the model solved it when given the original phrasing.
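For readers unfamiliar with the puzzle, its solution reduces to successive rounds of elimination over a shared list of candidate dates, driven by what each character can infer from the other's statements. A minimal sketch of that elimination logic (this is an illustration, not code from the BIS bulletin; it assumes the date set from the original 2015 version of the puzzle):

```python
# Cheryl's birthday: Albert is told the month, Bernard the day.
# The candidate dates from the original 2015 puzzle.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]

def count(dates, key, value):
    """How many dates share the given month or day."""
    return sum(1 for d in dates if key(d) == value)

month = lambda d: d[0]
day = lambda d: d[1]

# 1. Albert: "I don't know, but I know Bernard doesn't know either."
#    So every date in Albert's month must have a non-unique day.
step1 = [d for d in DATES
         if all(count(DATES, day, day(e)) > 1
                for e in DATES if month(e) == month(d))]

# 2. Bernard: "At first I didn't know, but now I do."
#    So his day is unique among the dates that survived step 1.
step2 = [d for d in step1 if count(step1, day, day(d)) == 1]

# 3. Albert: "Now I know too."
#    So his month is unique among the dates that survived step 2.
step3 = [d for d in step2 if count(step2, month, month(d)) == 1]

print(step3)  # [('July', 16)]
```

The point the bulletin makes is precisely that this kind of mechanical elimination is invariant to the names and dates used, so a solver that genuinely understood the logic would be unaffected by such substitutions.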
However, the authors observed that GPT-4 consistently struggled when minor details, such as the characters' names or the specific dates, were modified. According to the bulletin, this indicates that the model lacks a genuine understanding of the underlying logic.
Despite these findings, the BIS notes that machine learning has already made significant inroads into data management, macroeconomic analysis, and regulation within central banking. Nonetheless, it cautions that large language models should be used carefully in scenarios that demand meticulous and rigorous economic reasoning.
The evidence thus far suggests that the current generation of LLMs does not meet the standards of rigor and clarity essential for the high-stakes analyses involved in central banking operations.