SIAM News Blog
Research

Forecasting the Unseen: AI Weather Models and Gray Swan Extreme Events

Artificial intelligence (AI) has revolutionized weather forecasting at lead times of one day to two weeks. Certain deep learning models—like NVIDIA’s FourCastNet [8], Google DeepMind’s GraphCast and GenCast [5, 9], Microsoft’s Aurora [1], and the Artificial Intelligence Forecasting System by the European Centre for Medium-Range Weather Forecasts (ECMWF) [7]—outperform traditional numerical weather prediction models in both accuracy and computational speed. These developments are transforming short- and medium-range weather forecasting—and potentially long-term climate projections—but a crucial question remains: Can these models reliably predict the most extreme and rare weather events, including those on which they were never trained? Such events often cause the greatest societal impact.

This question lies at the heart of out-of-distribution (OOD) generalization in statistical learning, the limits of data-driven inference in physical systems, and philosophical debates about AI’s capacity to truly learn and understand. In a recent study, we addressed this question by investigating an AI model’s ability to forecast so-called gray swan tropical cyclones (TCs): physically possible but exceedingly rare OOD events [12]. Our work has major practical applications and highlights a key research area in the fast-paced AI revolution of weather and climate modeling.

Experimental Setup: A Controlled OOD Test

We trained five versions of the FourCastNet model—a transformer-based deep neural network that predicts the evolution of the three-dimensional atmospheric state every six hours—on variations of the ECMWF Reanalysis v5 (ERA5) dataset (see Figure 1) [12]. In the version called noTC, we removed all training samples from 1979 to 2015 that contained major TCs (Category 3 to 5 storms) anywhere in the world [12]. We then tested each model’s ability to forecast 20 Category 5 TCs from 2018 to 2023.
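To make the data-removal step concrete, here is a minimal sketch of how one might filter six-hourly training samples that coincide with major TCs, assuming a best-track record of storm times and categories; the file name, column names, and thresholds are hypothetical and do not reproduce the exact procedure of [12].

```python
import pandas as pd

# Hypothetical best-track table: one row per storm fix, with a
# timestamp and a Saffir-Simpson category column (assumed layout).
tracks = pd.read_csv("best_tracks.csv", parse_dates=["time"])

# Times at which a Category 3-5 TC exists anywhere on the globe,
# rounded down to the six-hourly analysis times.
major_tc_times = set(
    tracks.loc[tracks["category"] >= 3, "time"].dt.floor("6h")
)

# Six-hourly sample timestamps for the 1979-2015 training period.
sample_times = pd.date_range("1979-01-01", "2015-12-31 18:00", freq="6h")

# Keep only samples with no major TC present (the "noTC" variant);
# the "Rand" variant would instead drop a random subset of equal size.
notc_times = [t for t in sample_times if t not in major_tc_times]
print(f"kept {len(notc_times)} of {len(sample_times)} samples")
```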

Figure 1. FourCastNet fails to predict unseen strong tropical cyclones (TCs). 1a. Schematic of our experimental framework, including training data variants and forecast evaluation for Hurricane Lee in 2023. Lower mean sea-level pressure (MSLP) corresponds to stronger TCs. 1b. An example that is consistent with the results from all 20 test cases, indicating that the model that was trained with Category 3 to 5 TCs (FourCastNet-Full and FourCastNet-Rand) can accurately predict out-of-sample Category 5 TCs like Hurricane Lee (top two panels). The version that did not see such strong TCs during training (FourCastNet-noTC) fails to do so (bottom panel). Figure courtesy of [12].

Our highly controlled experimental setup offered an ideal opportunity to study OOD generalization, specifically extrapolation, in a system that is governed by complex physics. One can reasonably imagine that FourCastNet and other similar models might be able to predict unseen events, such as Category 5 TCs, by learning fundamental physical relations from weaker events, such as Category 1 and 2 TCs, and extrapolating based on those relations. This type of thinking was inspired by recent encouraging results, which suggest that AI weather models can learn atmospheric dynamics with some success [3].

Our analysis aimed to diagnose AI models’ capability to learn weather dynamics well enough for extrapolation, which is a key aspect of “learning” [4]. This type of controlled experiment is uncommon because training such models is extremely computationally intensive, and training resources are typically devoted to improving operational performance rather than to understanding how AI models work. We also included models with randomly removed data (Rand) and with targeted removal of Category 3 to 5 TCs from only the North Atlantic (noNA) or Western Pacific (noWP) basins.

Results: Failure to Extrapolate

The Full and Rand models successfully predicted the intensification of Category 5 storms. The noTC model, however, failed entirely. When forecasting Hurricane Lee (an out-of-sample Category 5 TC), all ensemble members in noTC predicted a weakening storm (see Figure 1b). In fact, the lowest predicted pressure never dropped below 980 hectopascals (the Category 2 TC range as defined in ERA5), far above the 960-hectopascal minimum in ERA5. Note that for TCs, lower pressure means stronger storms, i.e., stronger winds.
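As a concrete illustration of the comparison behind Figure 1b, the sketch below finds the deepest pressure that each ensemble member reaches along a forecast track and checks it against the observed minimum; the forecast array here is synthetic and merely stands in for actual model output.

```python
import numpy as np

# Synthetic forecast MSLP (hPa) at the storm center:
# shape (n_members, n_lead_times).
forecast_mslp = np.random.default_rng(0).uniform(980, 1005, size=(50, 40))
era5_min_mslp = 960.0  # observed minimum MSLP in ERA5 (hPa)

# Deepest (lowest) pressure reached by each ensemble member.
member_minima = forecast_mslp.min(axis=1)

# How many members ever deepen the storm to the observed value?
n_hits = int((member_minima <= era5_min_mslp).sum())
print(f"ensemble minimum: {member_minima.min():.1f} hPa; "
      f"{n_hits}/{len(member_minima)} members reach {era5_min_mslp} hPa")
```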

Mathematically, this failure reveals a key breakdown: the model does not extrapolate from Category 1 and 2 storms to Category 5 storms. Instead, it reverts to its learned distribution and confidently predicts mild conditions even as a catastrophe approaches.

Why Physics Matters: The Joint Distributions

Figure 2. The pressure-wind relationship differs in the tropics and the middle latitudes. 2a. Joint probability density function (PDF) of mean sea-level pressure (MSLP) and 10-meter winds in the tropics. 2b. Probability density of 10-meter winds in the tropics. 2c. Joint PDF of MSLP and 10-meter winds in the Northern Hemisphere, an extratropical region. 2d. Probability density of 10-meter winds in the Northern Hemisphere. Figure courtesy of [12].

One might think that the breakdown occurs simply because FourCastNet-noTC has never seen the intense low pressures and high wind speeds that correspond to Category 5 TCs. However, this is not the case; such extreme values still exist outside of the tropics in the context of extratropical cyclones, which are abundant in all training datasets. The question thus becomes: Why don’t these extratropical training examples help FourCastNet-noTC?

To explain the failure, we utilized a mathematical diagnostic: the joint probability density function of mean sea-level pressure and 10-meter wind speeds. The results in Figure 2 demonstrate why extratropical storms—which the model had witnessed during training—could not substitute for TCs.

In the tropics, low pressure is tightly coupled with high winds, a manifestation of convective dynamics and latent heating. In the middle latitudes, the seasonal cycle and other large-scale dynamics obscure the pressure-wind relationship. Although the noTC model had seen strong pressure anomalies in the middle latitudes, the dynamics that are associated with those anomalies are substantially different from TC dynamics. As a result, FourCastNet-noTC could not learn from such middle-latitude low-pressure events for OOD generalization in the tropics.
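For readers who want to compute a similar diagnostic, the following sketch estimates a joint PDF of MSLP and 10-meter wind with a two-dimensional histogram; the input arrays are random placeholders for flattened reanalysis fields restricted to a latitude band.

```python
import numpy as np

def joint_pdf(mslp, wind, bins=80):
    """Estimate the joint PDF of MSLP (hPa) and 10-m wind speed (m/s)
    from flattened gridded fields restricted to one latitude band."""
    hist, p_edges, w_edges = np.histogram2d(
        mslp.ravel(), wind.ravel(), bins=bins, density=True
    )
    return hist, p_edges, w_edges

# Placeholder samples standing in for ERA5 fields in the tropics
# (roughly 30S-30N); a second call with midlatitude fields would
# produce the counterpart panels of Figure 2.
rng = np.random.default_rng(1)
tropics_mslp = rng.normal(1010.0, 4.0, size=100_000)
tropics_wind = rng.gamma(2.0, 3.0, size=100_000)

pdf, p_edges, w_edges = joint_pdf(tropics_mslp, tropics_wind)
# A tight pressure-wind coupling appears as probability mass
# concentrated along a narrow ridge in (MSLP, wind) space.
```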

Some Hope: Transfer Across Basins

Encouragingly, the noNA and noWP models—which were trained with TCs removed from only one tropical basin—still forecasted strong storms across basins, just not as well as the model with access to all of the data. This outcome suggests that FourCastNet captures some dynamical similarity across regions. In other words, it did not overfit to location-specific data in latitude-longitude coordinates, but instead learned structures that likely live in a low-dimensional representation. Further work has demonstrated similar behavior in other AI weather models for extreme precipitation, allowing the models to forecast events that are gray swans for a given region but common in other parts of the world [11]. Atmospheric dynamicists and applied mathematicians will naturally wish to understand this transfer, which may motivate further work in mathematical climate science to (i) identify invariant representations that support better generalization across heterogeneous regions or over time, and (ii) embed such representations in the next generation of AI weather and climate models.

Physics-informed AI: A Mathematical Necessity

Our study indicates that augmenting the training data can aid extrapolation beyond the historical record. We also learned that despite its apparent forecasting skill, FourCastNet does not obey known physical laws. Specifically, its wind and pressure fields violate the gradient-wind balance, a foundational dynamical equilibrium in TCs. From a mathematical modeling perspective, such a violation suggests that physics-agnostic learning yields solutions that lie outside the lower-dimensional manifold on which the physical system concentrates. This realization invites the question: Can we improve extrapolation by enforcing more physical constraints?
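To illustrate the kind of consistency check involved, the sketch below evaluates the gradient-wind balance residual, v²/r + fv − (1/ρ)(∂p/∂r), for an idealized axisymmetric vortex; the radial profiles are synthetic stand-ins for azimuthally averaged model output and are not the diagnostic used in [12].

```python
import numpy as np

# Synthetic radial profiles around the storm center.
r = np.linspace(5e3, 300e3, 200)              # radius from center (m)
v = 60.0 * (r / 4e4) * np.exp(1 - r / 4e4)    # tangential wind (m/s), peaks at 60 m/s
p = 101_000.0 - 6_000.0 * np.exp(-r / 5e4)    # pressure (Pa), deepest at the center
rho, f = 1.15, 5e-5                           # air density (kg/m^3), Coriolis parameter (1/s)

# Gradient-wind balance: v^2/r + f*v = (1/rho) * dp/dr.
# A large residual indicates that the predicted wind and pressure
# fields are dynamically inconsistent with each other.
dpdr = np.gradient(p, r)
residual = v**2 / r + f * v - dpdr / rho
print(f"max |residual| = {np.max(np.abs(residual)):.3e} m/s^2")
```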

Implications for AI Applications and Beyond

While our study focused on one AI model and one type of extreme event, we expect the key results to largely apply to other AI models and weather extremes. Our findings pose both a challenge and an opportunity for the applied math and broader AI communities:

  • How can we characterize and quantify OOD generalization in high-dimensional, physical, and spatiotemporal systems?
  • What is the role of manifold learning, operator theory, or rare-event sampling in the construction of more robust AI weather and climate models?
  • Can dynamical systems theory help explain or guide the training process for edge-case extremes?

Moreover, this work motivates increased collaboration between atmospheric scientists, mathematicians, and computer scientists who wish to develop methods that preserve physical structure, improve uncertainty quantification, and extend trustworthiness to rare but societally critical events. One example of such a collaborative approach couples AI weather models and mathematical frameworks from rare event sampling; when used with traditional weather models, the latter have shown promising results in handling gray swan events [2, 10]. Coupling these frameworks with fast AI models can improve both the frameworks and the models themselves [6].
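As a rough sketch of what such a coupling might look like, the code below implements a generic clone-and-prune (splitting) step in which a fast emulator advances an ensemble and trajectories are resampled toward extreme outcomes; the emulator, score function, and selection strength are hypothetical placeholders rather than the specific algorithms of [2, 6, 10].

```python
import numpy as np

def splitting_step(states, step_fn, score_fn, k=0.5, rng=None):
    """One clone-and-prune step of a simple rare-event splitting scheme:
    advance each member with a fast emulator, weight members by an
    exponential of an extremity score, and resample to keep the ensemble
    size fixed while steering it toward extreme outcomes."""
    rng = rng or np.random.default_rng()
    new_states = [step_fn(s) for s in states]        # emulator advances each member
    scores = np.array([score_fn(s) for s in new_states])
    weights = np.exp(k * (scores - scores.max()))    # shift for numerical stability
    weights /= weights.sum()
    idx = rng.choice(len(new_states), size=len(new_states), p=weights)
    # Note: unbiased statistics would also require tracking the
    # resampling weights (likelihood ratios), which we omit here.
    return [new_states[i] for i in idx]

# Hypothetical usage with a scalar toy state and a stand-in "emulator."
rng = np.random.default_rng(0)
step_fn = lambda s: 0.9 * s + rng.normal(scale=0.5)  # stand-in for a fast AI model step
score_fn = lambda s: s                               # extremity score, e.g., storm intensity
ensemble = [0.0] * 100
for _ in range(20):
    ensemble = splitting_step(ensemble, step_fn, score_fn, k=0.8, rng=rng)
print(f"mean of biased ensemble after splitting: {np.mean(ensemble):.2f}")
```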

Conclusion: Predicting the Unpredictable

AI weather models are revolutionizing forecasting techniques, but scientists do not yet fully understand their limits, or how and what they learn. Our study reveals that these models do not extrapolate to gray swan events that they have not seen during training, despite thousands of accurate forecasts in normal conditions. Overcoming this limitation may require mathematical machinery from rare-event theory, physics-informed learning, or dynamical constraints. Predicting the unpredictable is not just a computational problem; it’s a scientific and mathematical frontier.


Y. Qiang Sun delivered a minisymposium presentation on this research at the 2025 SIAM Conference on Applications of Dynamical Systems, which took place in Denver, Colo., earlier this year.

References  
[1] Bodnar, C., Bruinsma, W.P., Lucic, A., Stanley, M., Allen, A., Brandstetter, J., … Perdikaris, P. (2025). A foundation model for the Earth system. Nature, 641, 1180-1187. 
[2] Finkel, J., Gerber, E.P., Abbot, D.S., & Weare, J. (2023). Revealing the statistics of extreme events hidden in short weather forecast data. AGU Advances, 4(2), e2023AV000881. 
[3] Hakim, G.J., & Masanam, S. (2024). Dynamical tests of a deep learning weather prediction model. Artif. Intell. Earth Syst., 3(3), e230090. 
[4] Lake, B.M., Ullman, T.D., Tenenbaum, J.B., & Gershman, S.J. (2017). Building machines that learn and think like people. Behav. Brain Sci., 40, e253. 
[5] Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., … Battaglia, P. (2023). Learning skillful medium-range global weather forecasting. Science, 382(6677), 1416-1421. 
[6] Lancelin, A., Wikner, A., Hassanzadeh, P., Abbot, D., Bouchet, F., Dubus, L., & Weare, J. (2025). Coupling AI emulators and rare event algorithms to sample extreme heatwaves. In EGU general assembly 2025. Vienna, Austria: European Geosciences Union. 
[7] Lang, S., Alexe, M., Chantry, M., Dramsch, J., Pinault, F., Raoult, B., … Rabier, F. (2024). AIFS — ECMWF’s data-driven forecasting system. Preprint, arXiv:2406.01465. 
[8] Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., … Anandkumar, A. (2022). FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. Preprint, arXiv:2202.11214. 
[9] Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T.R., El-Kadi, A., Masters, D., … Willson, M. (2025). Probabilistic weather forecasting with machine learning. Nature, 637(8044), 84-90. 
[10] Ragone, F., & Bouchet, F. (2021). Rare event algorithm study of extreme warm summers and heatwaves over Europe. Geophys. Res. Lett., 48(12), e2020GL091197. 
[11] Sun, Y.Q., Hassanzadeh, P., Shaw, T.A., & Pahlavan, H.A. (2025). Predicting beyond training data via extrapolation versus translocation: AI weather models and Dubai’s unprecedented 2024 rainfall. Preprint, arXiv:2505.10241. 
[12] Sun, Y.Q., Hassanzadeh, P., Zand, M., Chattopadhyay, A., Weare, J., & Abbot, D.S. (2025). Can AI weather models predict out-of-distribution gray swan tropical cyclones? Proc. Natl. Acad. Sci., 122(21), e2420914122.

About the Authors

Y. Qiang Sun

Research scientist, University of Chicago

Y. Qiang Sun is a research scientist at the University of Chicago who is passionate about the chaotic Earth system. His current work uses data-driven models and high-performance computing to understand and predict extreme weather and “gray swan” events.

Jonathan Weare

Professor, New York University

Jonathan Weare is a professor of mathematics in the Courant Institute of Mathematical Sciences at New York University. His work primarily focuses on the design, mathematical analysis, and application of stochastic algorithms and models.

Dorian S. Abbot

Professor, University of Chicago

Dorian S. Abbot is a professor of geophysical sciences at the University of Chicago.