The Nonlinear Library: Alignment Forum

AF - Approaching Human-Level Forecasting with Language Models by Fred Zhang

2024-02-29



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Approaching Human-Level Forecasting with Language Models, published by Fred Zhang on February 29, 2024 on The AI Alignment Forum.
TL;DR: We present a retrieval-augmented LM system that nears the human crowd performance on judgemental forecasting.
Paper: https://arxiv.org/abs/2402.18563 (Danny Halawi*, Fred Zhang*, Chen Yueh-Han*, and Jacob Steinhardt)
Twitter thread: https://twitter.com/JacobSteinhardt/status/1763243868353622089
Abstract
Forecasting future events is important for policy and decision-making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms.
Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters and in some settings, surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help inform institutional decision-making.
For the safety motivations behind automated forecasting, see Unsolved Problems in ML Safety (2021). In the following, we summarize our main research findings.
Current LMs are not naturally good at forecasting
First, we find that LMs are not naturally good at forecasting when evaluated zero-shot (with no fine-tuning and no retrieval). On 914 test questions that were opened after June 1, 2023 (after the knowledge cut-offs of these models), most LMs perform near chance.
Here, all questions are binary, so random guessing gives a Brier score of 0.25. Averaging across all community predictions over time, the human crowd gets 0.149. We present the score of the best model of each series. Only GPT-4 and Claude-2 series beat random guessing (by a margin of >0.3), though still very far from human aggregates.
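To make the 0.25 baseline concrete, here is a minimal sketch of the Brier score for binary questions. The function name and the toy outcomes are illustrative assumptions, not taken from the paper's code. Forecasting 0.5 on every question gives a squared error of 0.25 whether the question resolves yes or no, which is why that constant forecast scores exactly 0.25.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    forecasts: probabilities in [0, 1] that each question resolves "yes".
    outcomes:  resolutions, 1 for "yes" and 0 for "no".
    Lower is better; 0.0 is a perfect score.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Toy resolutions, made up for illustration only.
outcomes = [1, 0, 0, 1]

# Always answering 0.5 costs (0.5)^2 = 0.25 on every question.
uninformed = [0.5] * len(outcomes)
print(brier_score(uninformed, outcomes))  # 0.25
```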
System building
Towards better automated forecasting, we build and optimize a retrieval-augmented LM pipeline for this task.
It functions in 3 steps, mimicking the traditional forecasting procedure:
Retrieval, which gathers relevant information from news sources. Here, we use an LM to generate search queries given a question, use these queries to search a news corpus for articles, filter out irrelevant articles, and summarize the remaining ones.
Reasoning, which weighs available data and makes a forecast. Here, we prompt base and fine-tuned GPT-4 models to generate forecasts and (verbal) reasoning.
Aggregation, which ensembles individual forecasts into an aggregated prediction. We use a trimmed mean to aggregate all the predictions.
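The aggregation step can be illustrated with a small sketch. This uses a common symmetric trimmed mean, which drops an equal share of the lowest and highest forecasts before averaging. The exact trimming rule used in the paper is not described in this summary, so the function name, the `trim_fraction` parameter, and the sample forecasts below are all assumptions made for illustration.

```python
def trimmed_mean(forecasts, trim_fraction=0.2):
    """Average the forecasts after dropping the extremes on both ends.

    trim_fraction is the share of forecasts removed from EACH end
    (an illustrative parameter, not the paper's setting).
    """
    if not forecasts:
        raise ValueError("need at least one forecast")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(forecasts)
    k = int(len(ordered) * trim_fraction)   # number dropped per side
    kept = ordered[k:len(ordered) - k]      # never empty because k < n / 2
    return sum(kept) / len(kept)


# Hypothetical probabilities from five independent LM reasoning runs.
individual_forecasts = [0.05, 0.30, 0.35, 0.40, 0.95]

# The outliers 0.05 and 0.95 are dropped; the middle three are averaged.
print(round(trimmed_mean(individual_forecasts), 2))
```

Compared with a plain mean, trimming keeps a single extreme forecast from dragging the ensemble toward 0 or 1.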
We optimize the system's hyperparameters and apply a self-supervised approach to fine-tune a base GPT-4 to obtain the fine-tuned LM. See Section 5 of the full paper for details.
Data and models
We use GPT-4-1106 and GPT-3.5 in our system, whose knowledge cut-offs are in April 2023 and September 2021, respectively.
To optimize and evaluate the system, we collect a dataset of forecasting questions from 5 competitive forecasting platforms: Metaculus, Good Judgment Open, INFER, Polymarket, and Manifold.
The test set consists only of questions published after June 1st, 2023. Crucially, this is after the knowledge cut-off date of GPT-4 and GPT-3.5, preventing leakage from pre-training.
The training and validation sets contain questions published before June 1st, 2023, and are used for hyperparameter search and for fine-tuning a GPT-4 base model.
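The temporal split described above can be sketched as follows. The record layout (a dict with `id` and `published` fields) and the sample questions are hypothetical, and placing questions published exactly on June 1st in the test set is an assumption, since the text only says "before" and "after".

```python
from datetime import date

# Split date taken from the text above.
SPLIT_DATE = date(2023, 6, 1)


def split_by_publication_date(questions, split_date=SPLIT_DATE):
    """Return (train_val, test) so that no test question predates split_date.

    Each question is a dict with a "published" date (hypothetical schema).
    Boundary handling (the split date itself goes to test) is an assumption.
    """
    train_val = [q for q in questions if q["published"] < split_date]
    test = [q for q in questions if q["published"] >= split_date]
    return train_val, test


# Made-up example records, for illustration only.
questions = [
    {"id": "q1", "published": date(2022, 11, 3)},
    {"id": "q2", "published": date(2023, 5, 31)},
    {"id": "q3", "published": date(2023, 6, 15)},
    {"id": "q4", "published": date(2023, 9, 1)},
]

train_val, test = split_by_publication_date(questions)
print([q["id"] for q in train_val])  # ['q1', 'q2']
print([q["id"] for q in test])       # ['q3', 'q4']
```

Splitting on publication date rather than at random is what keeps the test questions beyond the models' pre-training data.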
Evaluation results
For each question, we perform information retrieval at up to 5 different dates during the question's time span and e...
