This is one of those threads that requires a textbook to answer.
Numerical weather prediction is extremely complex! Here is a simplified list of the major aspects that go into an NWP forecast:
-mathematical formulation of the PDEs that govern the atmosphere (typically called "model dynamics")
-treatment of sub-grid scale processes (depends on the model resolution, typically called "model physics" or "parameterizations")
-initial and lateral boundary condition data
-model configuration (horizontal and vertical resolution, finite difference or spectral, time step, vertical coordinate, number of soil levels or ocean levels, topography)
-post-processing
Things you need to understand about each model to really get an idea of how it should differ from other models:
Model dynamics
-Which schemes are used to discretize the equations? Leapfrog? Adams-Bashforth? Forward Euler? Backward Euler? Each one has known strengths and weaknesses (see the toy comparison after this list).
-What order of accuracy (truncation error) does each scheme have? Higher-order schemes generally give better results, but also increase computational expense.
-Does the model use finite differencing to represent the spatial derivatives, or a spectral method that represents the fields as sums of waves?
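As a toy illustration of why the time-integration choice matters, here is a minimal sketch (in Python, with made-up values) comparing forward Euler and leapfrog on a simple oscillation equation, a standard stand-in for the wave-like behavior of the governing PDEs. Forward Euler spuriously amplifies the oscillation every step, while leapfrog keeps the amplitude essentially constant (at the cost of a computational mode that real models must damp):

```python
# Toy comparison of two time-integration schemes on d(psi)/dt = i*omega*psi,
# whose exact solution has constant amplitude. The step count and omega*dt
# are hypothetical toy values, not anything from an operational model.
import numpy as np

omega, dt, nsteps = 1.0, 0.1, 500

# Forward Euler: psi^{n+1} = psi^n + dt*i*omega*psi^n
psi = 1.0 + 0j
for _ in range(nsteps):
    psi = psi + dt * 1j * omega * psi
print("forward Euler amplitude:", abs(psi))       # grows well above 1

# Leapfrog: psi^{n+1} = psi^{n-1} + 2*dt*i*omega*psi^n (exact first step)
psi_old, psi_now = 1.0 + 0j, np.exp(1j * omega * dt)
for _ in range(nsteps - 1):
    psi_new = psi_old + 2 * dt * 1j * omega * psi_now
    psi_old, psi_now = psi_now, psi_new
print("leapfrog amplitude:     ", abs(psi_now))   # stays ~1 for omega*dt < 1
```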
Physics parameterizations
-Which sub-grid scale processes are being parameterized? Deep convection? Shallow convection? Cloud/rain microphysics? The boundary layer? Land surface? Urban surface? Sub-surface? Radiation? (A toy sketch of what a parameterization actually does follows this list.)
-For each process that is being parameterized, which scheme is being used? For example, there are about 3 or 4 different cumulus parameterization schemes that operational forecast models use. Some are well documented and their strengths and weaknesses well known, while others are new or are improved versions of well-known schemes but haven't been rigorously verified or documented. For some schemes, no documentation exists at all (the scheme was written and maintained by one person). Keep in mind that although Weisman et al. (1997) is typically cited as the paper that said you don't need convective parameterization starting at 4 km grid spacing, convective processes are not resolved at 4 km! The entire range between about 1 km and 10 km is a gray zone where the conventional convective parameterization schemes used in many modern forecast models are not meant to be used, yet deep convection is still not fully resolved. It's inaccurate and unfair to call 4 km models "convection-resolving", because they aren't.
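To make the word "parameterization" concrete, here is a minimal sketch of one of the simplest examples: a bulk formula for the surface sensible heat flux, where the net effect of unresolved turbulent eddies is written in terms of resolved grid-scale quantities plus a tunable exchange coefficient. The constants below are hypothetical placeholders, not values from any operational scheme:

```python
# Toy bulk-aerodynamic parameterization: the grid cell cannot resolve
# individual eddies, so their net heat transport is closed in terms of
# resolved variables (wind speed, temperatures) and a tuned coefficient.

RHO = 1.2       # air density (kg m^-3)
CP = 1004.0     # specific heat of dry air (J kg^-1 K^-1)
C_H = 1.5e-3    # bulk exchange coefficient for heat (dimensionless, assumed value)

def sensible_heat_flux(wind_speed, t_surface, t_air):
    """Bulk sensible heat flux (W m^-2): SH = rho * cp * C_H * |V| * (Ts - Ta)."""
    return RHO * CP * C_H * wind_speed * (t_surface - t_air)

# Example: 5 m/s wind, surface 3 K warmer than the lowest model level.
print(sensible_heat_flux(wind_speed=5.0, t_surface=303.0, t_air=300.0))  # ~27 W m^-2
```

Every parameterization, from cumulus schemes to radiation, is in this spirit: a closure that replaces unresolved physics with a function of resolved variables and tunable constants, and the choice of closure and constants is where the schemes differ.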
Initial and lateral boundary condition data
-This is where the meat of the PDF that Rob linked to lies. The amount, type, and quality of data ingested and processed by the data assimilation scheme must be known. There are also different types of data assimilation (3DVAR, 4DVAR, EnKF, etc.), with different configurations within each type, and different ways of taking irregularly spaced observations and mapping them to a gridded array (Cressman, Barnes, etc.; see the 1-D sketch after this list). Many of these are well documented and have known strengths and weaknesses, but you need to know which modeling system uses what.
-Global models don't need lateral boundary condition data, but "limited area" models like the NAM, RAP, HRRR, SREF, etc. do. Past a certain forecast hour (depending on the size of the limited-area domain), a limited-area model's skill is strongly tied to the skill of the model that supplied its lateral boundary conditions. Also, how was the lateral boundary condition data applied? Only to the outermost grid point? The outer 5? Was it filtered at all?
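For a feel of what an objective-analysis step does with irregularly spaced observations, here is a minimal one-pass Cressman sketch on a 1-D domain. Real schemes are multi-pass, 2-D/3-D, and embedded in a full assimilation system; the grid, radius of influence, and values below are all hypothetical:

```python
# One-pass Cressman objective analysis in 1-D: spread observation-minus-
# background "innovations" onto a regular grid with distance-dependent weights.
import numpy as np

def cressman_1d(grid_x, background, obs_x, obs_val, radius):
    """Correct a background field with Cressman-weighted observation increments."""
    analysis = background.copy()
    # Innovation = observation minus background interpolated to the ob location.
    innovations = obs_val - np.interp(obs_x, grid_x, background)
    for j, xg in enumerate(grid_x):
        r2 = (obs_x - xg) ** 2
        w = np.where(r2 < radius**2, (radius**2 - r2) / (radius**2 + r2), 0.0)
        if w.sum() > 0:
            analysis[j] += np.sum(w * innovations) / w.sum()
    return analysis

grid_x = np.linspace(0.0, 100.0, 51)          # regular analysis grid (km)
background = np.full_like(grid_x, 290.0)      # constant first guess (K)
obs_x = np.array([12.0, 40.0, 71.0])          # irregular observation locations (km)
obs_val = np.array([292.0, 288.5, 291.0])     # observed temperatures (K)
print(cressman_1d(grid_x, background, obs_x, obs_val, radius=20.0))
```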
Model configuration
-Horizontal resolution is big, obviously. But one thing many people tend to overlook is the vertical resolution. Back in the day when grid spacings were tens of kilometers, grid boxes were wide and short, as the vertical spacing was much finer than the horizontal spacing. Vertical resolution hasn't increased nearly as much as horizontal resolution has, so in convection-allowing models today the grid columns are much skinnier than they used to be: individual grid boxes are much taller relative to their width. This aspect ratio impacts how processes such as convection are treated.
-Vertical coordinate: while the model output you see on websites is generally given on isobaric surfaces, NWP models generally do not use an isobaric or fixed-height vertical coordinate. Most models use a terrain-following sigma or eta coordinate, or an isentropic one (the RUC used a hybrid isentropic-sigma coordinate). A small sigma-coordinate sketch follows this list.
-Topography: when you set up a WRF run, you can select the quality of the topography that the model assumes. This is hugely significant when considering processes impacted by interaction with the Earth's surface.
-Is the model strictly an atmospheric model (having only grid points within the atmosphere)? Many climate models are actually "Earth system" models that include grid points in the soil and under water, and include dynamics and physics parameterizations to prognosticate soil temperature, soil moisture, SST etc.
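To see what "terrain-following" means in practice, here is a small sketch of a plain sigma coordinate, where sigma = 0 at the model top and 1 at the ground, so p = p_top + sigma * (p_sfc - p_top). The same sigma level sits at different pressures over a mountain than over the plains, which is one reason output must later be interpolated to isobaric surfaces. The level count and pressures are illustrative:

```python
# Pressure on evenly spaced sigma levels for two columns with different
# surface pressures (a "plains" column and a "mountain" column).
import numpy as np

P_TOP = 100.0e2                          # model-top pressure (Pa), assumed
sigma = np.linspace(0.0, 1.0, 11)        # 11 evenly spaced sigma levels

for name, p_sfc in [("plains", 1000.0e2), ("mountain", 850.0e2)]:
    p_levels = P_TOP + sigma * (p_sfc - P_TOP)
    print(name, np.round(p_levels / 100.0, 1))   # hPa on each sigma level
```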
Post-processing
-As mentioned before, the output you see on a website is not the raw model output. Rather, the output was post-processed from the native model levels to isobaric or constant-height surfaces, and there are different ways to interpolate vertically (see the sketch after this list).
-Was a post-processing scheme or method used to alter the raw model output, either to correct for known biases in the model or to force ensemble output to fit a Gaussian distribution? This is especially important when viewing output from ensembles. Also keep in mind that while you can find "CAPE" as a field to view in model output, you should determine whether it's surface-based, mixed-layer, most-unstable, or some other flavor of CAPE; some websites don't distinguish between those types. And did they use the virtual temperature correction? The GFS didn't until a few years ago; I'm not sure about the others.
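As an example of the vertical-interpolation step, here is a sketch that moves a temperature profile from native model levels to a requested isobaric surface, interpolating linearly in ln(p). Other packages interpolate linearly in pressure or in height, and the choice changes the numbers slightly; the profile values are made up for illustration:

```python
# Interpolate a model-level field to an isobaric surface, linear in ln(p).
import numpy as np

p_model = np.array([950.0, 850.0, 700.0, 500.0, 300.0]) * 100.0   # Pa, native levels (surface up)
t_model = np.array([293.0, 287.0, 278.0, 260.0, 230.0])           # K on those levels

def to_isobaric(p_target, p_levels, field):
    """Return the field value on a single isobaric surface (linear in ln p)."""
    # np.interp needs increasing x, so flip the surface-up ordering.
    return np.interp(np.log(p_target), np.log(p_levels[::-1]), field[::-1])

print(to_isobaric(800.0e2, p_model, t_model))   # temperature at 800 hPa
```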
As MClarkson said, you can compile your own error statistics from a large sample to identify a model's particular strengths and deficiencies (a minimal example is below). However, to really understand why a model behaves the way it does, you'd need to know every aspect listed above. Also, keep in mind that error statistics are heavily quantitative, yet Rob asked questions like "GFS always has systems that are too fast/slow", which is much more qualitative and isn't easily addressed by examining basic error statistics. That crosses the line from pure quantitative statistics into feature-based identification, where computers lag well behind what they can do with pure quantitative statistics.
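For the quantitative side, the basic statistics are easy to compile yourself once you have matched forecast/observation (or forecast/analysis) pairs; here is a minimal sketch with made-up 500 hPa height values:

```python
# Mean error (bias) and RMSE from matched forecast/verification pairs.
# The arrays are hypothetical 500 hPa heights, not real data.
import numpy as np

forecast = np.array([5640.0, 5712.0, 5588.0, 5655.0, 5701.0])   # forecast values (m)
observed = np.array([5632.0, 5705.0, 5590.0, 5649.0, 5688.0])   # verifying values (m)

errors = forecast - observed
bias = errors.mean()                     # systematic over/under-forecast
rmse = np.sqrt((errors ** 2).mean())     # overall error magnitude
print(f"bias = {bias:.1f} m, RMSE = {rmse:.1f} m")
```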
I've spent the last 5 years or so in grad school learning about many of these elements and I still know that I don't know s--t about models. They're just so crazy complicated, and the complexity will only increase in the future.