How predictable is having a child in the next three years? Results of the first data challenge in population research

Talk
Prediction
Fertility

A data challenge is a competition in which many researchers work on the same dataset to make the best predictions of a well-defined outcome. Clear measures of predictive ability make it easy to compare models and encourage participants to improve upon existing benchmarks. Data challenges have led to major progress in data science by fostering the development of new algorithms with increasing predictive ability. They can also accelerate scientific progress, because comparing different models gives insights into the research problem at hand. In the social sciences, however, data challenges are still rare. Here we present results of the first data challenge in population research, the Predicting Fertility (PreFer) data challenge. Hundreds of researchers engaged in the first phase of the challenge, in which they used Dutch survey data to predict who had a child in 2021-2023 based on thousands of variables from 2007-2020. Selected researchers were also invited to work with Dutch administrative data to predict the same outcome. This setup provided a unique opportunity to study the limits of the predictability of fertility. Models combining theoretical knowledge about fertility behaviour with advanced machine learning algorithms outperformed theory-driven models in predictive ability, providing novel insights into the complexity of fertility behaviour. We discuss the limits of the predictability of fertility and how survey data and administrative data place different constraints on predictive ability. We further discuss how data-driven approaches can complement theory-driven research and how data science techniques can advance population research.

Author

Gert Stulp

Published

January 20, 2025

Summary


     Zurich Population Research Conference, Switzerland
