Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Are language models good at making predictions?, published by dynomight on November 6, 2023 on LessWrong.
To get a crude answer to this question, we took 5000 questions from Manifold markets that were resolved after GPT-4's current knowledge cutoff of Jan 1, 2022. We gave the text of each of them to GPT-4, along with these instructions:
You are an expert superforecaster, familiar with the work of Tetlock and others. For each question in the following json block, make a prediction of the probability that the question will be resolved as true.
Also you must determine category of the question. Some examples include: Sports, American politics, Science etc. Use make_predictions function to record your decisions. You MUST give a probability estimate between 0 and 1 UNDER ALL CIRCUMSTANCES. If for some reason you can't answer, pick the base rate, but return a number between 0 and 1.
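For concreteness, here is a rough sketch of how a query like this might be issued through the OpenAI chat API with function calling. The prompt text is quoted from above, but the model string, the batching, and the exact make_predictions schema are assumptions for illustration, not details confirmed by the original experiment.

```python
# Sketch only: the prompt is quoted from the article, but the schema, batching,
# and model name are assumptions, not the authors' exact setup.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an expert superforecaster, familiar with the work of Tetlock and "
    "others. For each question in the following json block, make a prediction "
    "of the probability that the question will be resolved as true. "
    "Also you must determine category of the question. Some examples include: "
    "Sports, American politics, Science etc. Use make_predictions function to "
    "record your decisions. You MUST give a probability estimate between 0 and "
    "1 UNDER ALL CIRCUMSTANCES. If for some reason you can't answer, pick the "
    "base rate, but return a number between 0 and 1."
)

# Hypothetical schema for the "make_predictions" function named in the prompt.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "make_predictions",
        "parameters": {
            "type": "object",
            "properties": {
                "predictions": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "question": {"type": "string"},
                            "probability": {"type": "number"},
                            "category": {"type": "string"},
                        },
                        "required": ["question", "probability", "category"],
                    },
                }
            },
            "required": ["predictions"],
        },
    },
}]

def predict(questions):
    """Send a batch of market question texts and parse the tool-call arguments."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps(questions)},
        ],
        tools=TOOLS,
        tool_choice={"type": "function", "function": {"name": "make_predictions"}},
    )
    args = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(args)["predictions"]
```

In this sketch, each batch of question texts is passed as a JSON block and the forced tool call returns a probability and a category for every question.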
This produced a big table:
| question | prediction P(YES) | category | actually happened? |
|---|---|---|---|
| Will the #6 Golden State Warriors win Game 2 of the West Semifinals against the #7 LA Lakers in the 2023 NBA Playoffs? | 0.5 | Sports | YES |
| Will Destiny's main YouTube channel be banned before February 1st, 2023? | 0.4 | Social Media | NO |
| Will Qualy show up to EAG DC in full Quostume? | 0.3 | Entertainment | NO |
| Will I make it to a NYC airport by 2pm on Saturday, the 24th? | 0.5 | Travel | YES |
| Will this market have more Yes Trades then No Trades | 0.5 | Investment | CANCEL |
| Will Litecoin (LTC/USD) Close Higher July 22nd Than July 21st? | 0.5 | Finance | NO |
| Will at least 20 people come to a New Year's Resolutions live event on the Manifold Discord? | 0.4 | Social Event | YES |
| hmmmm {i} | 0.5 | Uncategorized | YES |
| Will there be multiple Masters brackets in Leagues season 4? | 0.4 | Gaming | NO |
| Will the FDA approve OTC birth control by the end of February 2023? | 0.5 | Health | NO |
| Will Max Verstappen win the 2023 Formula 1 Austrian Grand Prix? | 0.5 | Sports | YES |
| Will SBF make a tweet before Dec 31, 2022 11:59pm ET? | 0.9 | Social Media | YES |
| Will Balaji Srinivasan actually bet $1m to 1 BTC, BEFORE 90 days pass? (June 15st, 2023) | 0.3 | Finance | YES |
| Will a majority of the Bangalore LessWrong/ACX meet-up attendees on 8th Jan 2023 find the discussion useful that day? | 0.7 | Community Event | YES |
| Will Jessica-Rose Clark beat Tainara Lisboa? | 0.6 | Sports | NO |
| Will X (formerly twitter) censor any registered U.S presidential candidates before the 2024 election? | 0.4 | American Politics | CANCEL |
| test question | 0.5 | Test | YES |
| stonk | 0.5 | Test | YES |
| Will I create at least 100 additional self-described high-quality Manifold markets before June 1st 2023? | 0.8 | Personal Goal | YES |
| Will @Gabrielle promote to ??? | 0.5 | Career Advancement | NO |
| Will the Mpox (monkeypox) outbreak in the US end in February 2023? | 0.45 | Health | YES |
| Will I have taken the GWWC pledge by Jul 1st? | 0.3 | Personal | NO |
| FIFA U-20 World Cup - Will Uruguay win their semi-final against Israel? | 0.5 | Sports | YES |
| Will Manifold display the amount a market has been tipped by end of September? | 0.6 | Technology | NO |
In retrospect, maybe we should have filtered these out. Many of the questions are a bit silly for our purposes, though they're typically classified as "Test", "Uncategorized", or "Personal".
Is this good?
One way to measure if you're good at predicting stuff is to check your calibration: When you say something has a 30% probability, does it actually happen 30% of the time?
To check this, you need to make a lot of predictions. Then you dump all your 30% predictions together, and see how many of them happened.
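As an illustration only (not the article's actual analysis code), here is a minimal sketch of that check: group the predictions into probability bins, then compare each bin's average predicted probability to the fraction of questions that actually resolved YES. The function name, the 5% bin width, and the treatment of outcomes as 0/1 are our own choices.

```python
import numpy as np

def calibration_table(predictions, outcomes, bin_width=0.05):
    """Group predictions into probability bins and compare each bin's
    average predicted probability to the observed frequency of YES."""
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)  # 1.0 = resolved YES, 0.0 = NO
    bins = np.arange(0.0, 1.0 + bin_width, bin_width)
    rows = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        if hi >= 1.0:
            mask = (predictions >= lo) & (predictions <= hi)  # include p = 1.0
        else:
            mask = (predictions >= lo) & (predictions < hi)
        if mask.sum() == 0:
            continue
        rows.append({
            "bin": f"{lo:.2f}-{hi:.2f}",
            "n": int(mask.sum()),
            "mean_prediction": float(predictions[mask].mean()),
            "observed_frequency": float(outcomes[mask].mean()),
        })
    return rows

# For a well-calibrated forecaster, mean_prediction would roughly equal
# observed_frequency in every bin.
```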
GPT-4 is not well-calibrated.
Here, the x-axis is the range of probabilities GPT-4 gave, broken down into bins of size 5%. For each bin, the green line shows how often those things actually happened. Ideally, this would match the dotted black line. For reference, the bars show how many predictions GPT-4 gave that fell into each of the bins. (The lines are labeled on the y-axis on the left,...