Which models suck and why?

Rob H

Hell yeah, that's a baited topic :)

But seriously though, there are a bunch of chaser "truths" that get tossed around all the time, such as:

  • GFS sucks
  • ECMWF is better than the GFS
  • American models suck
  • NAM/RAP overdoes moisture
  • GFS always has systems too fast/too slow
  • Off hour 06z/18z are worthless
  • It's pointless to look at anything besides high-res models on the day of
  • and so on...

But then you start looking into these truths and they can fall apart. Take the off-hour runs of the GFS, and the statement from this PDF that "The 06Z 18-hr fcst is always better than the 00Z 24-hr fcst."

http://www.emc.ncep.noaa.gov/gmb/wx24fy/doc/GFS4cycle_GCWMB_briefing_13dec2012_fyang.pdf

So what are all of these "truths"? Let's try debunking, or verifying, some of them!
 


Well, the 'truth' is always a bit more complicated, but the generalizations have some truth to them. Take your 'off-hours 06Z/18Z are worthless' example. Everyone knows from experience that the off-hour runs 'suck'. It just depends on how you define suckiness. :) The PDF you link is interesting, and it basically backs up the idea that the off hours suck, especially the 06Z. You just cherry-picked the only optimistic statement from the whole study, which was used as an attempt to justify the run's utility. I wouldn't argue for eliminating that run either--but that doesn't change the fact that it still sucks!
That highlighted conclusion only applies to one forecast time, for one model, and for basically one parameter (500mb height, 'cause no one cares about tropical winds or whatever...). If you look at the study as a whole, it supports the conclusion that the 06Z GFS sucks. And incidentally, so does the 06Z NAM...etc, etc. ;)
 
If you look at the study as a whole, it supports the conclusion that the 06Z GFS sucks. And incidentally, so does the 06Z NAM...etc, etc. ;)

The meat and bones of what I was getting at! :)

Yes, the study shows that the 00Z will have the best forecast skill for 500mb heights, and it sounds like that may be the case for most (all?) models in continental scenarios. But is the difference noted between runs significant, from either a statistical view or a real-world view? I guess I would say "suckiness" implies that the data is useless. Are we intentionally crippling ourselves by including or excluding off-hour runs as amateur forecasters? Does that change when switching contexts to professional forecasting?

Do WFOs ignore 6z runs for the long term? The short term?
 
Do WFOs ignore 6z runs for the long term? The short term?

Can't speak for the WFOs, but personally I don't ignore them. Any additional data is useful, so long as you know the limitations. Suckiness doesn't imply uselessness to me; it's just a term you throw out in frustration at the models. Which sort of leads toward your last 'stereotype'--should one only use high-res models 'the day of'? I certainly don't spend hours analyzing the ECMWF 250mb prog on a moderate risk day, but I can recall many instances when the high-res solutions gave an unrealistic expectation, while the general idea of the low-res forecasts was more correct. Sometimes it's better to look at the big picture than to get lost in the noise.
 
When I've done my own stats work, I've seen big changes in model skill based on region and forecast variable, and unfortunately the best model in one specific area is not necessarily the best elsewhere. Some forecasts are indeed worthless (at least from a pure statistics point of view, with no gain in MAE/RMSE/large-error rate/bias when including that forecast in the group consensus). Every once in a while you have a forecast that is so bad it has negative value, decreasing the group forecast's skill if you include that model at all (NAM near-surface winds, I'm looking at you). I have found that no single modern model is universally bad or good.
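To make that concrete, here's a minimal sketch of the kind of leave-one-out check I mean, in Python with numpy; the model names and numbers are invented placeholders, not real verification data:

```python
import numpy as np

# Matched forecast/observation pairs; all numbers are invented placeholders.
obs = np.array([12.0, 14.5, 10.2, 16.8, 13.3])            # observed 2 m dewpoint (C)
fcst = {
    "gfs":   np.array([11.5, 15.0, 10.0, 17.5, 12.9]),
    "nam":   np.array([13.8, 16.2, 12.1, 18.9, 15.0]),    # hypothetically moist-biased
    "ecmwf": np.array([12.2, 14.1, 10.5, 16.5, 13.6]),
}

def mae(forecast, observed):
    """Mean absolute error of a forecast series against observations."""
    return float(np.mean(np.abs(forecast - observed)))

members = list(fcst)
full_consensus = np.mean([fcst[m] for m in members], axis=0)
print(f"all members     : MAE = {mae(full_consensus, obs):.2f}")

# Leave each member out in turn; if MAE improves without it, that member
# is adding negative value to this (tiny, illustrative) consensus.
for left_out in members:
    reduced = np.mean([fcst[m] for m in members if m != left_out], axis=0)
    print(f"without {left_out:8s}: MAE = {mae(reduced, obs):.2f}")
```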
 
For me, yesterday all of the models sucked :)

Hence why this is a timely topic to learn about, IMHO. For me, the models sucked, but I took them only as a tool and not as a solid prediction of what the weather was going to do. I used to think the models were 'the truth', as it were, and now see they are very much 'possible' truths, limited in their validity by the real-world elements they don't seem to account for (such as local variations below the scale or sensitivity they are built to resolve). I for one would like to eventually see a "mesoscale local" model that could do 1-2 hour forecasts of a storm that develops in an area and what it might do in the next couple of hours. Is that even a possibility? And if so, would it 'suck' too?
 
I particularly like this thread, and I want to go back to what Rob H was saying about these little pearls of wisdom that some of the more experienced model forecasters can throw out. I've been doing a lot of studying on MetEd and I've been getting lost in some of the lessons about the reliability and limitations of the models. You have to have a very intimate knowledge of each particular model to be able to follow along and apply the knowledge of limitations and reliability. I understand in general that the parameterizations, equations, resolutions, cold starts, etc. all vary from model to model and have an impact on the forecast solution. Usually it takes lots of hours and experience with each of the models to identify their idiosyncrasies. So, what specific guidelines or rules for each of the models should people keep in mind? I find it particularly interesting that everyone agrees that the off hour model runs are trash. That kind of suck factor is good to keep in mind.
 
I find it particularly interesting that everyone agrees that the off hour model runs are trash. That kind of suck factor is good to keep in mind.

The presentation I linked to, and the thread on American WX that I got it from, seem to disagree, though. There are "red tags", i.e. meteorologists by career, who say there is nothing substantially wrong with off-hour runs. They incorporate nearly as much data as the rawinsonde runs, and at least for a few parameters like 500mb heights, the most current off-hour run will always be more accurate than the rawinsonde run before it. Granted, this is at one height for one parameter, but 500mb heights dictate a *lot*. If your 500s aren't accurate, your 700s won't be, and your 850s won't be, and so on.

Maybe it would be easier to tackle individual topics? One that seems to burn a lot of people is "NAM/RAP consistently overdoes moisture". Surface dewpoints of 57° when the NAM forecast 62° can seemingly be the major difference between tornadic and non-tornadic storms in spring. This could also lead to higher CAPE values, all else being equal. Not that CAPE is a good tornadic discriminator, but I feel a lot better about 1700 MLCAPE than I do 1200 MLCAPE.

So is there truth to this specific "fact"? This brings a whole host of associated questions:

1) How do you assess how the model is performing compared to other models and reality?
2) How do you integrate any discrepancies into your forecasting?
3) How is this accommodated for? Are things manually adjusted, or are the numbers left as is, and forecasters just expected to know this bias?
 
1) How do you assess how the model is performing compared to other models and reality?

Compile error statistics. Yes, for some fields it's pretty hard to get reliable real-world measurements, but for many fields you can compare model forecasts directly with observations and accumulate the errors over time. This will give you direct mathematical evidence of the best-performing models, once you get a large enough sample size.
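A bare-bones sketch of what compiling those stats looks like; the (forecast, observation) pairs below are made-up placeholders:

```python
import math

# Matched (forecast, observation) pairs for one field at one site;
# the values are invented placeholders.
pairs = [
    (21.0, 20.1), (18.5, 19.2), (25.3, 24.0), (15.0, 15.4), (22.2, 20.8),
]

n = len(pairs)
errors = [f - o for f, o in pairs]

bias = sum(errors) / n                                    # mean error, sign matters
mae = sum(abs(e) for e in errors) / n                     # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / n)          # punishes big misses harder
large_error_rate = sum(abs(e) > 2.0 for e in errors) / n  # share of misses over 2 C

print(f"bias={bias:+.2f}  MAE={mae:.2f}  RMSE={rmse:.2f}  "
      f"large-error rate={large_error_rate:.0%}  (n={n})")
# Accumulate this over months of runs, per model / region / season, before
# trusting any "model X sucks" conclusion.
```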

2) How do you integrate any discrepancies into your forecasting?
3) How is this accommodated for? Are things manually adjusted, or are the numbers left as is, and forecasters just expected to know this bias?

You can write computer code to automatically adjust for calculated biases, the modeller can add this manually to computer model forecasts after his or her own research, or you can let the forecasters know about it and let them adjust for it on their own. All three work, but I prefer the first option for simple error stats and the second option for more complex regime-switching decision trees.
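For the first option, a minimal sketch of an automatic bias adjustment might look like the following; the window length, field, and numbers are just illustrative assumptions:

```python
from collections import deque

class BiasCorrector:
    """Subtract the mean error of the last `window` verified forecasts."""

    def __init__(self, window: int = 30):
        self.errors = deque(maxlen=window)

    def update(self, forecast: float, observed: float) -> None:
        # Store forecast-minus-observation once the verifying ob arrives.
        self.errors.append(forecast - observed)

    def correct(self, raw_forecast: float) -> float:
        if not self.errors:
            return raw_forecast  # no history yet, pass the raw value through
        bias = sum(self.errors) / len(self.errors)
        return raw_forecast - bias

# All numbers below are made up for the sketch.
corrector = BiasCorrector(window=30)
corrector.update(forecast=62.0, observed=57.0)   # yesterday's moist miss
corrector.update(forecast=61.0, observed=58.5)   # the day before
print(corrector.correct(63.0))                   # today's raw 63 gets nudged down
```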
 
This is one of those threads that requires a textbook to answer.

Numerical weather prediction is extremely complex! Here is a simplified list of all of the aspects of NWP forecasts:
-mathematical formulation of the PDEs that govern the atmosphere (typically called "model dynamics")
-treatment of sub-grid scale processes (depends on the model resolution, typically called "model physics" or "parameterizations")
-initial and lateral boundary condition data
-model configuration (horizontal and vertical resolution, finite difference or spectral, time step, vertical coordinate, number of soil levels or ocean levels, topography)
-post-processing

Things you need to understand about each model to really get an idea of how it should differ from other models:
Model dynamics
-Which schemes are used to discretize the equations? Leapfrog? Adams-Bashforth? Forward Euler? Backward Euler? Each one has known strengths and weaknesses (a toy time-stepping example follows this list).
-What order of truncation was used for each scheme? Higher order schemes generally give better results, but also increase computational expense.
-Is this model using finite differencing to represent the derivatives or is it using Fourier series and waves to represent the fields?
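For a feel of why the time-differencing choice matters, here's a toy example (not any model's actual solver) integrating a simple decay equation with forward Euler and comparing against the exact answer:

```python
import math

k, y0, t_end = 1.0, 1.0, 5.0   # decay rate, initial value, end time (arbitrary)

def forward_euler(dt):
    """Integrate dy/dt = -k*y from 0 to t_end with a fixed step dt."""
    y, t = y0, 0.0
    while t < t_end - 1e-9:
        y += dt * (-k * y)     # y_{n+1} = y_n + dt * f(y_n)
        t += dt
    return y

exact = y0 * math.exp(-k * t_end)
for dt in (0.5, 0.1, 0.01):
    approx = forward_euler(dt)
    print(f"dt={dt:5.2f}: forward Euler={approx:.5f}, exact={exact:.5f}, "
          f"error={approx - exact:+.5f}")
# For small enough dt the error shrinks roughly in proportion to dt
# (first-order accuracy); higher-order schemes shrink it much faster per step.
```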

Physics parameterizations
-Which sub-grid scale processes are being parameterized? Deep convection? Shallow convection? cloud/rain physics? boundary layer? land surface? urban surface? sub-surface? radiation?
-For each process that is being parameterized, which scheme is being used? For example, there are about 3 or 4 different cumulus parameterization schemes that operational forecast models use. Some are well documented and their strengths and weaknesses well known, while others are new or are improved versions of well-known schemes but haven't been rigorously verified or documented. For some schemes, no documentation exists at all (it was written and maintained by one person). Keep in mind that although Weisman et al. (1997) is typically cited as the paper that said you don't need to use convective parameterization starting at 4 km grid spacing, convective processes are not resolved at 4 km! The entire range between about 1 km and 10 km is a gray zone where conventional convective parameterization schemes used in many modern forecast models are not meant to be used, but deep convection is still not fully resolved. It's inaccurate and unfair to call 4 km models "convection-resolving", because they aren't.

Initial and lateral boundary condition data
-This is where the meat of the PDF that Rob linked to falls. The amount, type, and quality of data ingested and processed by data assimilation schemes must be known. Also, there are different types of data assimilation (3DVAR, 4DVAR, EnKF, etc.), and different configurations within each type of assimilation. There are also different ways of taking irregularly spaced data and transforming it to a gridded array (Cressman, Barnes, etc.; a toy Cressman example follows this list). Many of these are well documented and have known strengths and weaknesses (advantages/disadvantages), but you need to know which model system uses what.
-Global models don't need lateral boundary condition data, but "limited area" models like the NAM, RAP, HRRR, SREF etc. do. Limited area model output is strongly correlated with the skill of the model that provided the lateral boundary conditions past a certain forecast hour (depending on the size of the limited area model domain). Also, how was the lateral boundary condition data used? Was it only applied to the outermost grid point? The outer 5? Was it filtered at all?
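As a toy illustration only (real assimilation systems are far more sophisticated), here's the classic Cressman weighting applied to a handful of made-up observations around one grid point:

```python
# Blend irregularly spaced observations onto one grid point using the
# Cressman weight w = (R^2 - r^2) / (R^2 + r^2) inside a radius of
# influence R. Distances, values, and R are made up for the sketch.
obs = [  # (distance from the grid point in km, observed temperature in C)
    (12.0, 21.5),
    (40.0, 20.2),
    (75.0, 18.9),
    (180.0, 15.0),   # outside the radius of influence, gets no weight
]
R = 100.0  # radius of influence in km (arbitrary choice)

num = den = 0.0
for r, value in obs:
    if r >= R:
        continue
    w = (R * R - r * r) / (R * R + r * r)
    num += w * value
    den += w

analysis = num / den if den > 0 else float("nan")
print(f"analyzed value at the grid point: {analysis:.2f} C")
```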

Model configuration
-Horizontal resolution is big, obviously. But one thing many people tend to overlook is the vertical resolution. Back in the day when grid spacings were tens of kilometers, grid columns were wide and short, as the vertical resolution was much finer than the horizontal resolution. Vertical resolution hasn't increased nearly as much as horizontal resolution has, so in convection-allowing models today the grid columns are much skinnier than they used to be, and individual grid boxes are much taller relative to their width (a quick back-of-the-envelope comparison follows this list). This impacts how processes such as convection are treated.
-Vertical coordinate: while the model output you see on websites is generally given on isobaric surfaces, NWP models generally do not use an isobaric or fixed height vertical coordinate. Most models use a terrain following sigma or eta vertical coordinate, or an isentropic one (the RUC used a hybrid isentropic-sigma coordinate).
-Topography: when you set up a WRF run, you can select the quality of the topography that the model assumes. This is hugely significant when considering processes impacted by interaction with the Earth's surface.
-Is the model strictly an atmospheric model (having only grid points within the atmosphere)? Many climate models are actually "Earth system" models that include grid points in the soil and under water, and include dynamics and physics parameterizations to prognosticate soil temperature, soil moisture, SST etc.
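A quick back-of-the-envelope comparison of grid-box shapes; the spacings below are rough ballpark assumptions, not any specific model's configuration:

```python
# Grid-box aspect ratio: horizontal spacing divided by a representative
# mid-tropospheric vertical spacing. All numbers are rough ballpark
# assumptions, not any specific model's configuration.
configs = {
    "older synoptic-scale model": {"dx_km": 80.0, "dz_km": 0.5},
    "modern global model":        {"dx_km": 13.0, "dz_km": 0.4},
    "convection-allowing model":  {"dx_km": 3.0,  "dz_km": 0.3},
}

for name, c in configs.items():
    ratio = c["dx_km"] / c["dz_km"]
    print(f"{name:27s}: dx={c['dx_km']:5.1f} km, dz={c['dz_km']:.2f} km, "
          f"box is ~{ratio:.0f}x wider than it is tall")
# Horizontal spacing has shrunk far faster than vertical spacing, so grid
# boxes are far less flat (closer to cubes) than they used to be, which
# changes how vertically developing features like convection are handled.
```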

Post-processing
-As mentioned before, the output you see on a website is not the raw model output. Rather, the output was post-processed from the native model levels to isobaric or iso-height surfaces. There are different ways to interpolate vertically (a bare-bones sketch follows this list).
-Was there a post-processing scheme or method used to alter the raw model output to either correct for known biases in the model or to force ensemble output to fit a Gaussian distribution? This is especially important when viewing output from ensembles. Also keep in mind that while you can find "CAPE" as a field to view in model output, you should determine if it's surface-based, mixed-layer, most-unstable, or some other level CAPE. Some websites don't distinguish between those types. Also, did they use the virtual temperature correction? The GFS didn't until a few years ago. Not sure about anything else.
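Here's a bare-bones sketch of that vertical interpolation step for a single column, using linear interpolation in log(pressure); the column values are invented and real post-processors do far more than this:

```python
import numpy as np

# One model column on native levels; the values are invented placeholders.
p_model = np.array([1000.0, 925.0, 850.0, 700.0, 500.0, 300.0])  # hPa
t_model = np.array([298.0, 293.5, 289.0, 282.0, 267.0, 241.0])   # K

def to_isobaric(p_target_hpa, p_levels, field):
    """Interpolate `field` to a target pressure, linear in log(pressure)."""
    x = -np.log(p_levels)          # np.interp needs increasing x
    return float(np.interp(-np.log(p_target_hpa), x, field))

print(to_isobaric(600.0, p_model, t_model))   # temperature at 600 hPa, in K
```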

As MClarkson said, you can compile your own error statistics by obtaining a large sample size to determine any deficiencies or particular strengths of a model. However, to really be sure, you'll need to know every aspect of the model. Also, keep in mind that error statistics are heavily quantitative, yet Rob asked questions like "GFS always has systems that are too fast/slow", which is much more qualitative and isn't easily addressed by examining basic error statistics. That crosses the line from pure quantitative statistics into feature-based identification, where computers are much farther behind than they are with pure quantitative statistics.

I've spent the last 5 years or so in grad school learning about many of these elements and I still know that I don't know s--t about models. They're just so crazy complicated, and the complexity will only increase in the future.
 
Compile error statistics. Yes, for some fields it's pretty hard to get reliable real-world measurements, but for many fields you can compare model forecasts directly with observations and accumulate the errors over time. This will give you direct mathematical evidence of the best-performing models, once you get a large enough sample size.



You can write computer code to automatically adjust for calculated biases, the modeller can add this manually to computer model forecasts after his or her own research, or you can let the forecasters know about it and let them adjust for it on their own. All three work, but I prefer the first option for simple error stats and the second option for more complex regime-switching decision trees.

This is good information but way beyond the reach of 99% of StormTrack. I've written a NEXRAD parser from scratch and I've written custom formulas for IDV and this is still way beyond my means :)

Has someone already done this work, where Joe Chaser can look at a website or even an AMS article or something that says "NAM overshoots surface moisture in the Central CONUS by 1-4° in late spring/summer"? The best I've been able to do is look at meteograms, but that's still rather time consuming. Most chasers don't have several hours every week available for analyzing these things.
 
I agree a large and meaningful study almost certainly has to be automated. If you can automatically download and decode grib2 files, you can do this on your own. No, it is not particularly easy (but not insanely hard either, at least not like coding those models in the first place), and you probably want a command-line Linux platform. After I was first exposed to Linux and grib2 processing, it took another couple of months before I had everything ready to go for comparing various model data to observations (often METARs for surface data). The good news is that NCEP and Environment Canada provide this data to the public completely free. Some local university WRFs and UKMET/JMA might also give you data for free if you are not using it for commercial interests.
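For anyone curious what that looks like in practice, here's a rough sketch (Python with pygrib) of pulling a 2 m dewpoint out of an already-downloaded GRIB2 file and comparing it to a station ob; the file name, station location, and observed value are placeholders, and the exact field name your file uses may differ:

```python
import numpy as np
import pygrib

# Placeholders: a GRIB2 file you've already downloaded (NCEP serves them
# publicly), a station location, and the dewpoint pulled from its METAR.
GRIB_PATH = "gfs.t00z.pgrb2.0p25.f018"
STATION_LAT, STATION_LON = 44.88, -93.22
OBSERVED_TD_C = 14.0

grbs = pygrib.open(GRIB_PATH)
msg = grbs.select(name="2 metre dewpoint temperature")[0]
td_k = msg.values                 # 2D field, Kelvin
lats, lons = msg.latlons()

# Nearest-neighbor match of the grid to the station (crude but serviceable).
lons = np.where(lons > 180.0, lons - 360.0, lons)   # GFS longitudes run 0-360
dist2 = (lats - STATION_LAT) ** 2 + (lons - STATION_LON) ** 2
j, i = np.unravel_index(np.argmin(dist2), dist2.shape)

forecast_td_c = float(td_k[j, i]) - 273.15
print(f"forecast Td = {forecast_td_c:.1f} C, observed = {OBSERVED_TD_C:.1f} C, "
      f"error = {forecast_td_c - OBSERVED_TD_C:+.1f} C")
# Log this error for every run over a season and you have the start of a
# "does the NAM overdo moisture here?" dataset.
```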

Studies on the subject, at least those publicly available, are usually pretty generalized (like 500mb heights over a whole continent). For something specific like NAM moisture fields in a certain location at a certain time of year, it's very likely that you (or a ST member) will have to do it yourself. In the medium-range future my site could have this publicly available, but that tool is not yet complete.
 
The downside of such a study is that the models are tweaked at least once every two years, if not more often. For example, a major NAM upgrade is in the testing stage now and will be implemented next month, so everything you determined last year would be invalid this summer.
 
WOW! I would like to give a huge thanks to Jeff for his detailed and thorough breakdown of the complications of dealing with models. I've been reading a lot about these factors during my studying, but Good Lord! I read his post and thought that it's amazing that anyone besides the most brilliant scientists could look at models and get anything reliable out of them! It makes me wonder how chasers are really using model data. Don't get me wrong... many chasers have years and years of experience and often degrees (multiple degrees) in this field, and they know what to look for. But, the way it sounds, looking at the models and diagnosing the atmosphere would have to be your full-time job to use them confidently. No wonder there are so many SPC chasers out there! I mean really... are most chasers cross-examining a few models for basic parameters the day before and day of and driving out to a general area that looks good? Why not!? It seems like the best you could do without giving your life over to model study. Just check the NAM, GFS, HRRR, and SREF for upper-level pressure and vorticity, 500 mb winds, 850 mb moisture and winds, soundings, CAPE, LIs, TT, SRH, hodos, surface obs, and satellite... Then take your dart and throw it at the map on your wall and you're ready to go!
 