Which models suck and why?

Rob H

EF5
Mar 11, 2009
825
6
0
Twin Cities, MN
Hell yeah, that's a baited topic :)

But seriously though, there are a bunch of chaser "truths" that get tossed around all the time, such as:

  • GFS sucks
  • ECMWF is better than the GFS
  • American models suck
  • NAM/RAP overdoes moisture
  • GFS always has systems too fast/too slow
  • Off hour 06z/18z are worthless
  • It's pointless to look at anything besides high-res models on the day of
  • and so on...

But then you start looking into these truths and they can fall apart. Take the off-hour runs of the GFS, and the statement from this PDF that "The 06Z 18-hr fcst is always better than the 00Z 24-hr fcst."

http://www.emc.ncep.noaa.gov/gmb/wx24fy/doc/GFS4cycle_GCWMB_briefing_13dec2012_fyang.pdf

So what are all of these "truths"? Let's try debunking, or verifying, some of them!
 

Well, the 'truth' is always a bit more complicated, but the generalizations have some truth to them. Take your 'off-hours 06Z/18Z are worthless' example. Everyone knows from experience that the off-hour runs 'suck'. It just depends on how you define suckiness. :) The PDF you link is interesting, and it basically backs up the idea that the off hours suck, especially the 06Z. You just cherry-picked the only optimistic statement from the whole study, which was used as an attempt to justify the run's utility. I wouldn't argue for eliminating that run either--but that doesn't change the fact that it still sucks!
That highlighted conclusion only applies to one forecast time, for one model, and for basically one parameter (500mb height, cause no one cares about tropical winds or whatever...) If you look at the study as a whole, it supports the conclusion that the 06Z GFS sucks. And incidentally, so does the 06Z NAM...etc, etc. ;)
 

Rob H

EF5
Mar 11, 2009
825
6
0
Twin Cities, MN
If you look at the study as a whole, it supports the conclusion that the 06Z GFS sucks. And incidentally, so does the 06Z NAM...etc, etc. ;)
The meat and bones of what I was getting at! :)

Yes, the study shows that the 00z will have the best forecast skill for 500mb heights, and it sounds like that may be the case for most (all?) models in continental scenarios. But is the difference noted between runs significant, from either a statistical view or a real-world view? I guess I would say "suckiness" implies that the data is useless. Are we intentionally crippling ourselves by including or excluding off-hour runs as amateur forecasters? Does that change when switching contexts to professional forecasting?

Do WFOs ignore 6z runs for the long term? The short term?
 
Do WFOs ignore 6z runs for the long term? The short term?
Can't speak for the WFOs, but personally I don't ignore them. Any additional data is useful, so long as you know the limitations. Suckiness doesn't imply uselessness to me; it's just a term you throw out in frustration at the models. Which sort of leads towards your last 'stereotype'--should one only use high-res models 'the day of'? I certainly don't spend hours analyzing the ECMWF 250mb prog on a moderate risk day, but I can recall many instances when the high-res solutions gave an unrealistic expectation, while the general idea of the low-res forecasts was more correct. Sometimes it's better to look at the big picture than to get lost in the noise.
 

MClarkson

EF5
Sep 2, 2004
892
28
11
Blacksburg, VA
When I've done my own stats work, I've seen big changes in model skill based on region and forecast variable, and unfortunately the best model in one specific area is not necessarily the best elsewhere. Some forecasts are indeed worthless (at least from a pure statistics point of view, with no gain in MAE/RMSE/large-error rate/bias when including that forecast in the group consensus). Every once in a while you have a forecast that is so bad it has negative value, decreasing the group forecast's skill if you consider that model at all (NAM near-surface winds, I'm looking at you). I have found that no single modern model is universally bad or good.
 
Apr 22, 2009
230
11
11
preferrably near a storm
For me, yesterday all of the models sucked :)
Hence why this is a timely topic to learn about IMHO. For me, the models sucked, but I took them only as a tool and not as a definitive statement of what the weather was going to do. I used to think the models were 'the truth', as it were, and now see they are very much 'possible' truths, but limited in their validity by the elements of real-world weather they don't seem to account for (such as local variations below the scale or sensitivity they are programmed to sense). I for one would like to eventually see a "mesoscale local" model that could do 1-2 hour forecasts of a storm that develops in an area and what it might do in the next couple hours. Is that even a possibility? And if so, would it 'suck' too?
 
Oct 14, 2008
293
114
11
39
Tulsa, OK
I particularly like this thread and I want to go back to what Rob H was saying about these little pearls of wisdom that some of the more experienced model forecasters can throw out. I've been doing a lot of studying on MetEd and I've been getting lost in some of the lessons about the reliability and limitations of the models. You have to have a very intimate knowledge of each particular model to be able to follow along and apply the knowledge of limitations and reliability. I understand in general that the parameterizations, equations, resolutions, cold starts, etc. all vary from model to model and have an impact on the forecast solution. Usually it takes lots of hours and experience with each of the models to identify their idiosyncrasies. So, what specific guidelines or rules for each of the models should people keep in mind? I find it particularly interesting that everyone agrees that the off hour model runs are trash. That kind of suck factor is good to keep in mind.
 

Rob H

EF5
Mar 11, 2009
825
6
0
Twin Cities, MN
I find it particularly interesting that everyone agrees that the off hour model runs are trash. That kind of suck factor is good to keep in mind.
The presentation I linked to, and the thread on American WX that I got it from, seem to disagree though. There are "red tags", i.e. career meteorologists, who say there is nothing substantially wrong with off-hour runs. They're incorporating nearly as much data as the rawinsonde runs, and at least for a few parameters like 500mb heights, the most current off-hour run will always be more accurate than the rawinsonde run before it. Granted this is at one height for one parameter, but 500mb heights dictate a *lot*. If your 500s aren't accurate, your 700s won't be, and your 850s won't be, and so on.

Maybe it would be easier to tackle individual topics? One that seems to burn a lot of people is "NAM/RAP consistently overdoes moisture". Surface dewpoints of 57° when the NAM forecast 62° can seemingly be the major difference between tornadic and non-tornadic storms in spring. This could also lead to higher CAPE values, everything else being the same. Not that CAPE is a good tornadic discriminator, but I feel a lot better about 1700 MLCAPE than I do 1200 MLCAPE.

So is there truth to this specific "fact"? This brings a whole host of associated questions:

1) How do you assess how the model is performing compared to other models and reality?
2) How do you integrate any discrepancies into your forecasting?
3) How is this accommodated for? Are things manually adjusted, or are the numbers left as is, and forecasters just expected to know this bias?
 

MClarkson

EF5
Sep 2, 2004
892
28
11
Blacksburg, VA
1) How do you assess how the model is performing compared to other models and reality?
Compile error statistics. Yes, for some fields it's pretty hard to get reliable real-world measurements, but for many fields you can compare model forecasts directly with observations and integrate the errors over time. This will give you direct mathematical evidence of the best performing models, once you get a large enough sample size.
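To make that concrete, here's a bare-bones sketch in Python. The numbers are made up, and real verification needs matched valid times and quality-controlled obs, but the math is this simple:

```python
# Toy verification: compare paired model forecasts and observations
# (e.g. 2 m dewpoints in deg F) accumulated over many forecast cycles.
import math

def error_stats(forecasts, observations):
    """Return bias, MAE, and RMSE for paired forecast/observation lists."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    n = len(errors)
    bias = sum(errors) / n                             # mean error (sign matters)
    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)   # penalizes big misses harder
    return bias, mae, rmse

# Hypothetical sample: model 2 m dewpoint forecasts vs. METAR obs
model_td = [62, 64, 59, 61, 66]
obs_td = [57, 61, 58, 57, 63]
bias, mae, rmse = error_stats(model_td, obs_td)
print(f"bias={bias:+.1f}F  MAE={mae:.1f}F  RMSE={rmse:.2f}F")
```

A persistent positive bias in a sample like that is exactly the kind of "overdoes moisture" evidence Rob is asking about.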

2) How do you integrate any discrepancies into your forecasting?
3) How is this accommodated for? Are things manually adjusted, or are the numbers left as is, and forecasters just expected to know this bias?
You can write computer code to automatically statistically adjust for calculated biases, the modeller can manually add corrections to computer model forecasts after his or her own research, or you can let the forecasters know about the bias and let them adjust for it on their own. All three work, but I prefer the first option for simple error stats and the second option for more complex regime-switching decision trees.
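The first option can be as simple as learning the mean error over a training window and subtracting it from new forecasts. A toy sketch (the sample values are hypothetical):

```python
# Automatic statistical bias correction: learn the mean error from a
# training window of past (forecast, observation) pairs, then subtract
# it from incoming forecasts.
def learn_bias(forecasts, observations):
    errors = [f - o for f, o in zip(forecasts, observations)]
    return sum(errors) / len(errors)

def debias(forecast, bias):
    return forecast - bias

past_fcst = [62, 64, 59, 61, 66]   # hypothetical past model dewpoints (deg F)
past_obs  = [57, 61, 58, 57, 63]   # matching observations
b = learn_bias(past_fcst, past_obs)   # positive -> model has been running moist
print(debias(65, b))                  # today's raw 65 gets nudged down
```

Real-world schemes (Kalman-filter MOS-type corrections, etc.) are fancier, but this is the core idea.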
 

Jeff Duda

EF6+, PhD
Staff member
Supporter
Oct 7, 2008
3,455
2,403
21
Broomfield, CO
www.meteor.iastate.edu
This is one of those threads that requires a textbook to answer.

Numerical weather prediction is extremely complex! Here is a simplified list of all of the aspects of NWP forecasts:
-mathematical formulation of the PDEs that govern the atmosphere (typically called "model dynamics")
-treatment of sub-grid scale processes (depends on the model resolution, typically called "model physics" or "parameterizations")
-initial and lateral boundary condition data
-model configuration (horizontal and vertical resolution, finite difference or spectral, time step, vertical coordinate, number of soil levels or ocean levels, topography)
-post-processing

Things you need to understand about each model to really get an idea how it should differ from other models
Model dynamics
-Which schemes are used to discretize the equations? Leapfrog? Adams-Bashforth? Forward Euler? Backward Euler? Each one has known strengths and weaknesses.
-What order of truncation was used for each scheme? Higher order schemes generally give better results, but also increase computational expense.
-Is this model using finite differencing to represent the derivatives or is it using Fourier series and waves to represent the fields?

Physics parameterizations
-Which sub-grid scale processes are being parameterized? Deep convection? Shallow convection? cloud/rain physics? boundary layer? land surface? urban surface? sub-surface? radiation?
-For each process that is being parameterized, which scheme is being used? For example, there are about 3 or 4 different cumulus parameterization schemes that operational forecast models use. Some are well documented and their strengths and weaknesses well known, while others are new or are improved versions of well known schemes but haven't been rigorously verified or documented. For some schemes, no documentation exists at all (it was written and maintained by one person). Keep in mind that although Weisman et al. (1997) is typically cited as the paper that said you don't need to use convective parameterization starting at 4 km grid spacing, convective processes are not resolved at 4 km! The entire range between about 1 km and 10 km is a gray zone where conventional convective parameterization schemes used in many modern forecast models are not meant to be used, but deep convection is still not fully resolved. It's inaccurate and unfair to call 4 km models "convection-resolving", because they aren't.

Initial and lateral boundary condition data
-This is where the meat of the PDF that Rob linked to falls. The amount, type, and quality of data ingested and processed by data assimilation schemes must be known. Also, there are different types of data assimilation (3DVAR, 4DVAR, EnKF etc.), and different configurations within each type of assimilation. There are also different ways of taking irregularly spaced data and transforming it to a gridded array (Cressman, Barnes etc.). Many of these are well documented and have known strengths and weaknesses (advantages/disadvantages), but you need to know which model system uses what.
-Global models don't need lateral boundary condition data, but "limited area" models like the NAM, RAP, HRRR, SREF etc. do. Limited area model output is strongly correlated with the skill of the model that provided the lateral boundary conditions past a certain forecast hour (depending on the size of the limited area model domain). Also, how was the lateral boundary condition data used? Was it only applied to the outermost grid point? The outer 5? Was it filtered at all?

Model configuration
-Horizontal resolution is big, obviously. But one thing many people tend to overlook is the vertical resolution. Back in the day when grid spacings were tens of kilometers, grid columns were wide and short, as the vertical resolution was much finer than the horizontal resolution. Vertical resolution hasn't increased nearly as much as horizontal resolution has. In convection-allowing models today, the grid columns are much skinnier than they used to be, and individual grid boxes are much taller relative to their width than they used to be. This impacts how processes such as convection are treated.
-Vertical coordinate: while the model output you see on websites is generally given on isobaric surfaces, NWP models generally do not use an isobaric or fixed height vertical coordinate. Most models use a terrain following sigma or eta vertical coordinate, or an isentropic one (the RUC used a hybrid isentropic-sigma coordinate).
-Topography: when you set up a WRF run, you can select the quality of the topography that the model assumes. This is hugely significant when considering processes impacted by interaction with the Earth's surface.
-Is the model strictly an atmospheric model (having only grid points within the atmosphere)? Many climate models are actually "Earth system" models that include grid points in the soil and under water, and include dynamics and physics parameterizations to prognosticate soil temperature, soil moisture, SST etc.

Post-processing
-As mentioned before, the output you see on a website is not the raw model output. Rather, the output was post-processed from the native model levels to isobaric or iso-height surfaces. There are different ways to interpolate vertically.
-Was there a post-processing scheme or method used to alter the raw model output to either correct for known biases in the model or to force ensemble output to fit a Gaussian distribution? This is especially important when viewing output from ensembles. Also keep in mind that while you can find "CAPE" as a field to view in model output, you should determine if it's surface-based, mixed-layer, most-unstable, or some other level CAPE. Some websites don't distinguish between those types. Also, did they use the virtual temperature correction? The GFS didn't until a few years ago. Not sure about anything else.
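For what the virtual temperature correction actually does, here's a minimal sketch using the common approximate form (r is the water vapor mixing ratio in kg/kg; the sample values are hypothetical):

```python
# Virtual temperature: the temperature a dry parcel would need to have
# the same density as the moist parcel. Parcels are buoyant relative to
# Tv, not T, so skipping this correction systematically underdoes CAPE
# in moist low-level environments.
def virtual_temp(temp_k, mixing_ratio):
    """Approximate virtual temperature (K): Tv = T * (1 + 0.61 * r)."""
    return temp_k * (1.0 + 0.61 * mixing_ratio)

T = 300.0    # K, hypothetical surface parcel
r = 0.016    # 16 g/kg of water vapor, a juicy Plains boundary layer
print(virtual_temp(T, r) - T)   # roughly 3 K of extra effective warmth
```

A couple degrees K of extra buoyancy integrated over a deep layer is not a trivial difference in CAPE, which is why it matters whether a site applied the correction.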

As MClarkson said, you can compile your own error statistics by obtaining a large sample size to determine any deficiencies or particular strengths of a model. However, to really know, you'll need to understand every aspect of the model to be sure. Also, keep in mind that error statistics are heavily quantitative, yet Rob asked questions like "GFS always has systems that are too fast/slow", which is much more qualitative and isn't easily addressed by examining basic error statistics. That question crosses the line from pure quantitative statistics into feature-based identification, where computers lag much farther behind than they do with pure quantitative statistics.

I've spent the last 5 years or so in grad school learning about many of these elements and I still know that I don't know s--t about models. They're just so crazy complicated, and the complexity will only increase in the future.
 
Last edited by a moderator:

Rob H

EF5
Mar 11, 2009
825
6
0
Twin Cities, MN
Compile error statistics. Yes, for some fields it's pretty hard to get reliable real-world measurements, but for many fields you can compare model forecasts directly with observations and integrate the errors over time. This will give you direct mathematical evidence of the best performing models, once you get a large enough sample size.



You can write computer code to automatically statistically adjust for calculated biases, the modeller can manually add corrections to computer model forecasts after his or her own research, or you can let the forecasters know about the bias and let them adjust for it on their own. All three work, but I prefer the first option for simple error stats and the second option for more complex regime-switching decision trees.
This is good information but way beyond the reach of 99% of StormTrack. I've written a NEXRAD parser from scratch and I've written custom formulas for IDV and this is still way beyond my means :)

Has someone already done this work, where Joe Chaser can look at a website or even an AMS article or something that says "NAM overshoots surface moisture in the Central CONUS by 1-4° in late spring/summer"? The best I've been able to do is look at meteograms, but that's still rather time consuming. Most chasers don't have several hours every week available for analyzing these things.
 

MClarkson

EF5
Sep 2, 2004
892
28
11
Blacksburg, VA
I agree a large and meaningful study almost certainly has to be automated. If you can automatically download and decode grib2 files, you can do this on your own. No, it is not particularly easy (but not insanely hard either, at least not like coding those models in the first place), and you probably want a command-line linux platform. After I was first exposed to linux grib2 processing, it took another couple months before I had everything ready to go for comparing various model data to observations (often METARs for surface data). The good news is that NCEP and Environment Canada provide this data to the public completely free. Some local university WRFs and UKMET/JMA might also give you data for free if you are not using it for commercial purposes.

Studies on the subject, at least those publicly available, are usually pretty generalized (like 500mb heights over a whole continent). For something specific like NAM moisture fields in a certain location at a certain time of year, it's very likely that you (or a ST member) will have to do it yourself. In the medium-range future my site could have this publicly available, but that tool is not yet complete.
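The download step can be scripted with nothing but the standard library. The NOMADS URL pattern below is illustrative only, since NCEP reorganizes the directories from time to time, so check the current layout before relying on it:

```python
# Sketch of the automated-download step. Build the path for one GFS
# grib2 file on NOMADS; the directory pattern here is an assumption
# and should be verified against the live server.
from datetime import date

def gfs_grib_url(run_date, cycle, fhr):
    """Build a (hypothetical) NOMADS path for one GFS grib2 file."""
    ymd = run_date.strftime("%Y%m%d")
    return (f"https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/"
            f"gfs.{ymd}/{cycle:02d}/atmos/"
            f"gfs.t{cycle:02d}z.pgrb2.0p25.f{fhr:03d}")

print(gfs_grib_url(date(2013, 6, 1), 6, 18))
# From here: fetch with urllib, decode with wgrib2 or pygrib, and match
# each forecast valid time against archived METAR observations.
```

Loop that over cycles and forecast hours in a daily cron job and the data side of the study takes care of itself.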
 

rdale

EF5
Mar 1, 2004
7,343
914
21
51
Lansing, MI
skywatch.org
The downside of such a study is that the models are tweaked at least once every two years, if not more often. For example a major NAM upgrade is in testing stage now and will be implemented next month, so everything you determined last year would be invalid this summer.
 
Oct 14, 2008
293
114
11
39
Tulsa, OK
WOW! I would like to give a huge thanks to Jeff for his detailed and thorough breakdown of the complications of dealing with models. I've been reading a lot about these factors during my studying, but Good Lord! I read his post and thought that it's amazing that anyone besides the most brilliant scientists could look at models and get anything reliable out of them! It makes me wonder how chasers are really using model data. Don't get me wrong... many chasers have years and years of experience and often degrees (multiple degrees) in this field, and they know what to look for. But, the way it sounds, looking at the models and diagnosing the atmosphere would have to be your full time job to use them confidently. No wonder there are so many SPC chasers out there! I mean really... are most chasers cross examining a few models for basic parameters the day before and day of and driving out to a general area that looks good? Why not!? It seems like the best you could do without giving your life over to model study. Just check the NAM, GFS, HRRR, and SREF for upper level pressure and vorticity, 500 mb winds, 850 mb moisture and winds, soundings, CAPE, LI's, TT, SRH, Hodos, surface obs, and satellite... Then take your dart and throw it at the map on your wall and you're ready to go!
 
Well, I think it's becoming increasingly obvious that chasers are relying more and more on the high-res models, and this will be the wave of the future. The HRRR is the classic example. Two years ago, or even last year for that matter, I had little faith in it, and while I looked at it I rarely relied on it in isolation for decision making. Recently, I haven't a clue what the developers are doing to tweak it, but it's obviously wayyyy improved this year. For the first time I relied heavily on early morning runs, and in several cases the model was spot on. And evidently other people recognized the trends too, 'cause I saw more chasers arrive early at very specific spots that they wouldn't have chosen otherwise. Of course, it hasn't been exactly right all the time, but it's headed in the direction where these sorts of models are going to be serious game changers. The ND tornado (Watford 5-26) is a classic example. Unfortunately, the HRRR only goes out 15 hours on a delay, so no time to race from TX (where all the chasers were that day, including yours truly) to ND in 12 hours! SPC didn't even have a slight risk up there. Not even a 2% tornado--nada! But if you had used the 12Z (or 13Z...or...) HRRR run that day it would have been a no-brainer to target that area, and you would have placed yourself no more than 5-10 miles from where the tornado hit. Imagine that run being available 2 days in advance! It will be a game changer, I'm telling you--a much bigger influence on the future of chasing than cell phones or whatever. Come back here 10 years from now and I'll say 'I told ya so!' ;)
 
Last edited by a moderator:

MClarkson

EF5
Sep 2, 2004
892
28
11
Blacksburg, VA
The downside of such a study is that the models are tweaked at least once every two years, if not more often.
All the more reason to automate the process. A one-off manual study might be outdated, but if you have code that just sits there computing, say, the previous 3 months' error stats once per day... it's a simple matter of waiting for the sample size to rebuild after a major update. You don't even have to do any more work that way, unless NCEP changes the file names, in which case you'd have to do like 3 minutes of work.
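The "code that just sits there computing" can be as simple as a rolling window that ages out old forecast/observation pairs, so the stats automatically re-converge on the new model version after an upgrade. A toy sketch with made-up values:

```python
# Rolling verification: keep only the last N days of (forecast, obs)
# pairs so a model upgrade naturally ages out of the statistics.
from collections import deque
from datetime import date, timedelta

class RollingVerifier:
    def __init__(self, window_days=90):
        self.window = timedelta(days=window_days)
        self.pairs = deque()          # (valid_date, forecast, observation)

    def add(self, valid_date, fcst, obs):
        self.pairs.append((valid_date, fcst, obs))
        cutoff = valid_date - self.window
        while self.pairs and self.pairs[0][0] < cutoff:
            self.pairs.popleft()      # drop pairs older than the window

    def mae(self):
        return sum(abs(f - o) for _, f, o in self.pairs) / len(self.pairs)

v = RollingVerifier(window_days=90)
v.add(date(2013, 1, 1), 60, 58)      # old pair, will age out
v.add(date(2013, 6, 1), 62, 57)      # only this pair is inside the window
print(v.mae())
```

Run the update once a day from cron and the stats maintain themselves.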
 
Jan 14, 2011
2,941
2,753
21
St. Louis
stormhighway.com
Stan, I feel the same way. I think the short term high-res models (specifically the Updraft Helicity plot) will be all that is needed to pick a chase target in the next 5 years or so. I'm wondering if there will be any advantage to being experienced or having advanced (or even basic) forecasting knowledge in 5 to 10 years.

Right now, the HRRR/4km WRF models still blow it enough that one can't yet rely on them. June 3 in Nebraska is one example. When they nail it though, they really nail it - and so they're definitely worth heeding even if one suspects they may be wrong on any given day. They'll only improve as time goes on.
 

rdale

EF5
Mar 1, 2004
7,343
914
21
51
Lansing, MI
skywatch.org
A one-off manual study might be outdated, but if you have the code that just sits there computing once per day, for example, the previous 3 months error stats...
At that point seasonal differences come into play... It may not have handled moisture advection well in the spring, but nailed it in July. Yet your algorithm says it struggles... I'm not saying it wouldn't be valuable, just not sure how much value it'd have.

http://www.hpc.ncep.noaa.gov/mdlbias/

http://www.hpc.ncep.noaa.gov/html/model2.shtml#verification
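One partial fix for the seasonal problem is to stratify the same error archive by month instead of lumping it into one number. A toy sketch with hypothetical values:

```python
# Stratified verification: a single lumped MAE hides seasonal behavior,
# so bucket the errors by month before judging the model.
from collections import defaultdict

def monthly_mae(records):
    """records: list of (month, forecast, observation) tuples."""
    by_month = defaultdict(list)
    for month, f, o in records:
        by_month[month].append(abs(f - o))
    return {m: sum(errs) / len(errs) for m, errs in sorted(by_month.items())}

# Hypothetical archive: big April moisture misses, small July ones
records = [(4, 62, 56), (4, 63, 58), (7, 64, 63), (7, 61, 61)]
print(monthly_mae(records))   # April looks bad, July looks fine
```

The cost, of course, is that each bucket needs its own adequate sample size, which takes longer to accumulate.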
 

Rob H

EF5
Mar 11, 2009
825
6
0
Twin Cities, MN
There are a lot of insightful details popping up in here, but since no one is directly answering the questions I started off with, and brought up again later, I'm assuming that means that you can't make blanket statements like the following:

The NAM overdoes moisture

even when you further define the question by saying "in the Central Plains, in late spring".

What I'm getting out of this thread is that you can't make statements like I listed out in the OP, and that every model needs to be independently evaluated within a certain time period, in a specific location, for a specific parameter. Any generic biases tend to be identified and corrected. And if you want to know how the NAM is handling moisture there isn't really a good way, other than digging into the data yourself. Is all that correct?
 

MClarkson

EF5
Sep 2, 2004
892
28
11
Blacksburg, VA
Any generic biases tend to be identified and corrected.
That one is not entirely true. Some of these are in regions of low priority to model developers, and other times some models are intentionally tuned to the performance of one specific field, at the possible expense of others. Also, some bias comes about, especially in coarser-resolution models, due to averaging over a grid square. That 1 km wide mountain valley's temperature will never be correct in the raw ~30km GFS or GEM output.
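A toy illustration of that grid-square averaging (the sub-grid values are made up):

```python
# A coarse grid box carries one value for everything inside it, so a
# narrow valley's temperature gets smeared with the surrounding ridges.
# Hypothetical sub-grid temperatures (deg C) inside one ~30 km cell:
subgrid = [12.0, 11.5, 2.0, 12.5]   # third point is the cold valley floor
cell_value = sum(subgrid) / len(subgrid)
print(cell_value)   # the raw model output can never show the valley's 2.0
```

No amount of bias correction applied to the grid-box value can recover a feature the grid can't represent in the first place.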
 

Jeff Duda

EF6+, PhD
Staff member
Supporter
Oct 7, 2008
3,455
2,403
21
Broomfield, CO
www.meteor.iastate.edu
There are a lot of insightful details popping up in here, but since no one is directly answering the questions I started off with, and brought up again later, I'm assuming that means that you can't make blanket statements like the following:

The NAM overdoes moisture

even when you further define the question by saying "in the Central Plains, in late spring".

What I'm getting out of this thread is that you can't make statements like I listed out in the OP, and that every model needs to be independently evaluated within a certain time period, in a specific location, for a specific parameter. Any generic biases tend to be identified and corrected. And if you want to know how the NAM is handling moisture there isn't really a good way, other than digging into the data yourself. Is all that correct?
Go back to what I said in my lengthy post. You need to understand what's going on in a model to see how it might be mis-handling things. One thing the NAM seemed to do this spring was indeed overdo moisture. There was a fairly straightforward explanation for that, however. The NAM uses a monthly greenness fraction climatology in the land-surface model. Well, this spring the vegetation green-up was far from climatology. In fact, green-up was significantly delayed. So the model was assuming vegetation was more green than it actually was. Hence it was assuming too much ET and thus you would see too much near-surface moisture.

You just kinda have to know things like that. Unfortunately it's not always obvious or easy to determine things like this.
 

Rob H

EF5
Mar 11, 2009
825
6
0
Twin Cities, MN
Go back to what I said in my lengthy post. You need to understand what's going on in a model to see how it might be mis-handling things. One thing the NAM seemed to do this spring was indeed overdo moisture. There was a fairly straightforward explanation for that, however. The NAM uses a monthly greenness fraction climatology in the land-surface model. Well, this spring the vegetation green-up was far from climatology. In fact, green-up was significantly delayed. So the model was assuming vegetation was more green than it actually was. Hence it was assuming too much ET and thus you would see too much near-surface moisture.

You just kinda have to know things like that. Unfortunately it's not always obvious or easy to determine things like this.
I read your lengthy post and it was great, but only addressed the first part of what I was getting at. :)

To use your example, I'm making an assumption that every central plains forecaster and every SPC forecaster and every researcher needs to know that the NAM is overshooting moisture because it can affect day-to-day operations at their job. So is it assumed that these dozens or hundreds of people are all familiar with the NAM and recognized that the monthly greenness fraction climatology is off? There isn't a status product, or a bulletin, or something somewhere where the NAM gatekeeper said "hold up, this green up isn't happening the way it should" that disseminates that information to everyone that needs it? And no one goes in and tweaks it so that it's more accurate? It's up to each individual forecaster to notice this and account for it in their own way?
 

rdale

EF5
Mar 1, 2004
7,343
914
21
51
Lansing, MI
skywatch.org
Well, no, there is no "status product." And you can't just go in and tweak climatology. Shortcuts have to be taken just because resources are not unlimited (well, maybe they are for the ECMWF production :) ) But yes, those of us who forecast because it's our job and a daily responsibility know that. Just as we know that the GFS at 240hrs in November will ALWAYS have a snowstorm, and in the summer it will ALWAYS have a Gulf hurricane.

Some of your education might be found at https://www.meted.ucar.edu/training_module.php?id=902

and the full set of NWP trainings at https://www.meted.ucar.edu/training...guageSorting=1&module_sorting=publishDateDesc