I emailed Pat almost two months ago and told him that there would be something new on Doubt About It on Easter Sunday. Yes, there is some unintentional symbolism there, but it had more to do with the events of April 1st. March 31st was more or less a strict deadline to get this on the record.

Let me back up just a bit. If you've been here before, then you know Doubt About It was a Pittsburgh sports blog written by four friends. We felt that we had some ideas on things people may like to read regarding our favorite teams and the time to write about them. Both the ideas and time started to run out after a while. So for the past 3 years it has sat dormant. Then a couple of months ago I started thinking about the upcoming Pirates season. I have a very difficult time being objective when judging the Pirates and sports teams in general. At the same time, Alex, a former co-worker, got me thinking about statistical modeling. Putting two and two together, I figured my Pirates prediction for the upcoming year would be a lot better if I could remove myself from it as much as possible. Hence the totally unoriginal idea of using statistical projections to forecast the Pirates season.

I don't expect this to be any better than any other projection system out there, but hopefully it shows a relatively simple way of going about it. I am hoping this is a bit more transparent than other projections, starting with showing the probability of each scenario instead of just predicting a single number of wins for the entire season. Dave Cameron makes a good case for this here.

The loose goal for this was to forecast the Pirates season, giving what I thought was an appropriate amount of uncertainty, and to present it in a clean and transparent manner. From there, I am hoping to do the same for the other teams in the division (the entire MLB seems a bit daunting at this point). I'm posting it online instead of just emailing it to my friends because maybe someone has an idea, critique, or comment that can help in some way for a future post. Or maybe this is the last post for another three years. We'll see.

The model uses the Pythagorean expectation formula with the Pythagenpat modified exponent ($n$) to predict the win percentage, $W\%$:

\begin{equation} W\% = \frac{R^{n}}{R^{n}+RA^{n}} \end{equation}

\begin{equation} n = \left(\frac{R+RA}{G}\right)^{0.287} \end{equation}

where $R$ is runs scored, $RA$ is runs against, and $G$ is games. For the sake of time and space, the remaining equations and a fleshed-out explanation of their use will be added later. To get runs scored, the wRC stat created by Tangotiger is useful. It takes the wOBA for each player and converts it into a number of runs created based on plate appearances and league-average wOBA. Add up each player's contribution and you have the total runs scored by the team for the year. The projected wOBA for each player was taken as the average of the projections published at Fangraphs: Bill James, Steamer, Oliver, and ZiPS. These projections were used because they were on each player's page and for no other reason. Runs against is a bit trickier, mostly because the ability to accurately gauge defensive ability is underdeveloped relative to pitching and hitting. I decided to trust those same projections for the pitchers. You'll notice both projected stats are appearance-based, so plate appearances and innings pitched also had to be projected. While the projection systems give these values, they were either adjusted or ignored for certain players in order to retain an accurate value for team plate appearances and innings pitched. Final numbers for hitters and pitchers are given below:

Hitter | PA | wOBA | Dev. | Pitcher | ERA | IP | Dev.
---|---|---|---|---|---|---|---
Andrew McCutchen | 673 | 0.3695 | 0.1 | A.J. Burnett | 3.983 | 193.0 | 0.1
Pedro Alvarez | 600 | 0.33175 | 0.15 | Wandy Rodriguez | 3.773 | 192.5 | 0.1
Neil Walker | 650 | 0.3325 | 0.1 | James McDonald | 3.993 | 161.4 | 0.15
Garrett Jones | 500 | 0.33075 | 0.1 | Jeff Karstens | 3.97 | 116.8 | 0.1
Clint Barmes | 480 | 0.27925 | 0.15 | Francisco Liriano | 3.908 | 114.9 | 0.15
Jose Tabata | 420 | 0.316 | 0.2 | Gerrit Cole | 4.193 | 81.4 | 0.15
Russell Martin | 484 | 0.31725 | 0.1 | Chris Leroux | 3.953 | 69.0 | 0.15
Starling Marte | 590 | 0.33 | 0.15 | Jared Hughes | 3.848 | 65.7 | 0.15
Travis Snider | 480 | 0.332 | 0.15 | Mark Melancon | 3.465 | 60.0 | 0.15
Michael McKenry | 162 | 0.30275 | 0.1 | Jeff Locke | 4.175 | 59.5 | 0.15
Brandon Inge | 182 | 0.289 | 0.1 | Tony Watson | 3.418 | 56.3 | 0.1
Jordy Mercer | 120 | 0.288333333 | 0.15 | Jason Grilli | 3.22 | 56.0 | 0.1
Gaby Sanchez | 400 | 0.32575 | 0.2 | Charlie Morton | 4.38 | 51.5 | 0.15
Pitchers | 359 | 0.125 | 0.15 | Jonathan Sanchez | 4.598 | 50.0 | 0.15
 | | | | Bryan Morris | 3.66 | 25.0 | 0.15
 | | | | Kyle McPherson | 4.113 | 37.1 | 0.15
 | | | | Justin Wilson | 4.423 | 33.0 | 0.15
 | | | | Jeanmar Gomez | 4.548 | 25.0 | 0.15
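To make the conversion from table rows to runs and wins concrete, here is a minimal Python sketch. It is not the code behind the projection. It uses the commonly published wRC formula, treats ERA as earned runs per nine innings, and applies the two Pythagenpat equations above. The league constants (`LG_WOBA`, `WOBA_SCALE`, `LG_R_PER_PA`) are illustrative placeholders rather than values from this post, the function names are made up for the example, and the team totals of 700 and 680 runs are invented purely to show the call.

```python
# Illustrative league constants (assumptions, NOT taken from the post).
LG_WOBA = 0.315      # league-average wOBA
WOBA_SCALE = 1.245   # wOBA-to-runs scaling factor
LG_R_PER_PA = 0.114  # league runs per plate appearance


def wrc(woba, pa):
    """Runs created by one hitter from his wOBA and plate appearances."""
    return ((woba - LG_WOBA) / WOBA_SCALE + LG_R_PER_PA) * pa


def earned_runs(era, ip):
    """Earned runs allowed by one pitcher (ERA is earned runs per 9 innings)."""
    return era * ip / 9.0


def pythagenpat_win_pct(runs, runs_against, games=162):
    """Win percentage from the two equations given in the post."""
    n = ((runs + runs_against) / games) ** 0.287
    return runs ** n / (runs ** n + runs_against ** n)


# One hitter row and one pitcher row from the table.
mccutchen_runs = wrc(0.3695, 673)
burnett_earned_runs = earned_runs(3.983, 193.0)
print(round(mccutchen_runs, 1), round(burnett_earned_runs, 1))

# Team totals are the sums over all hitters and all pitchers (plus unearned
# runs). The 700 and 680 below are made-up totals for illustration only.
print(round(162 * pythagenpat_win_pct(700, 680), 1))
```

With this form of wRC, a league-average hitter creates exactly the league-average number of runs per plate appearance, which is why the pitchers' 0.125 wOBA row contributes a negative number of runs relative to zero.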

Again, the above values are averages of other projections from Fangraphs - the accuracy of this model is first and foremost dependent on those numbers. For the projection, a player-dependent standard deviation of 10-20% (the Dev. column) was used to randomize the above statistics following a Gaussian distribution. An explanation for the numbers used will be added later. The total unearned runs (UER) was assumed to scale with ER, based on a regression between these two stats over the past three years. The correlation was weak ($R^2 = 0.137$), but it served as a quick and easy way to obtain UER. In each trial, the wOBA and ERA were drawn for each player, and the corresponding RS, RA, UER, and wins were calculated for the entire team. This was repeated 100,000 times, giving the distribution of win totals summarized below.
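The procedure described above can be sketched in a few lines of Python. This is a reconstruction under stated assumptions, not the original code: Dev. is read as a fractional standard deviation (sigma = Dev. × value), the league constants are illustrative placeholders, and `UER_SLOPE` and `UER_INTERCEPT` are hypothetical stand-ins for the regression coefficients, which the post does not give. Because of those placeholders, the output should not be expected to reproduce the numbers reported here.

```python
import random

# Assumptions (NOT from the post): league constants and UER regression
# coefficients are illustrative placeholders.
LG_WOBA, WOBA_SCALE, LG_R_PER_PA = 0.315, 1.245, 0.114
UER_SLOPE, UER_INTERCEPT = 0.08, 0.0  # hypothetical: UER = slope * ER + intercept

# (PA, wOBA, Dev.) for each hitter row, in table order.
HITTERS = [
    (673, 0.3695, 0.10), (600, 0.33175, 0.15), (650, 0.3325, 0.10),
    (500, 0.33075, 0.10), (480, 0.27925, 0.15), (420, 0.316, 0.20),
    (484, 0.31725, 0.10), (590, 0.33, 0.15), (480, 0.332, 0.15),
    (162, 0.30275, 0.10), (182, 0.289, 0.10), (120, 0.288333333, 0.15),
    (400, 0.32575, 0.20), (359, 0.125, 0.15),
]

# (ERA, IP, Dev.) for each pitcher row, in table order.
PITCHERS = [
    (3.983, 193.0, 0.10), (3.773, 192.5, 0.10), (3.993, 161.4, 0.15),
    (3.97, 116.8, 0.10), (3.908, 114.9, 0.15), (4.193, 81.4, 0.15),
    (3.953, 69.0, 0.15), (3.848, 65.7, 0.15), (3.465, 60.0, 0.15),
    (4.175, 59.5, 0.15), (3.418, 56.3, 0.10), (3.22, 56.0, 0.10),
    (4.38, 51.5, 0.15), (4.598, 50.0, 0.15), (3.66, 25.0, 0.15),
    (4.113, 37.1, 0.15), (4.423, 33.0, 0.15), (4.548, 25.0, 0.15),
]


def simulate_wins(n_sims=100_000, games=162, seed=None):
    """Return a list of simulated season win totals, one per trial."""
    rng = random.Random(seed)
    wins = []
    for _ in range(n_sims):
        # Draw each hitter's wOBA and convert it to runs created (wRC).
        runs = sum(
            ((rng.gauss(woba, dev * woba) - LG_WOBA) / WOBA_SCALE + LG_R_PER_PA) * pa
            for pa, woba, dev in HITTERS
        )
        # Draw each pitcher's ERA and convert it to earned runs.
        earned = sum(rng.gauss(era, dev * era) * ip / 9.0 for era, ip, dev in PITCHERS)
        # Unearned runs scale linearly with earned runs (placeholder fit).
        runs_against = earned + UER_SLOPE * earned + UER_INTERCEPT
        # Pythagenpat win percentage, scaled to a full season.
        n = ((runs + runs_against) / games) ** 0.287
        wins.append(games * runs ** n / (runs ** n + runs_against ** n))
    return wins


# A smaller run than the post's 100,000 trials, to keep the example quick.
sample = simulate_wins(n_sims=10_000, seed=1)
print(round(sum(sample) / len(sample), 2))
```

Each player is drawn independently here, which is the simplest reading of the description; correlated injuries or playing-time changes are not modeled.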

The mean value was 82.56 wins, with the 95% confidence interval between 67.92 and 95.79 wins. More importantly, the model predicts the Pirates will finish 0.500 or better 61.25% of the time. The O/U gambling line for the Pirates that I've seen is 77.5, which the model predicts the Pirates will be over 75.6% of the time.
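For anyone who wants to pull the same kind of summary out of a list of simulated win totals, a sketch follows. It assumes the 95% interval is the 2.5th to 97.5th percentile of the simulated wins, which is one reasonable reading and not something stated here. The function name `summarize` and the eight-value `toy_wins` list are made up for illustration and are not simulation output.

```python
def summarize(wins, over_under=77.5, games=162):
    """Mean, central 95% range, and two probabilities from simulated wins."""
    ordered = sorted(wins)
    count = len(ordered)

    def percentile(p):
        # Nearest-rank percentile on the sorted values.
        return ordered[round(p * (count - 1))]

    return {
        "mean": sum(ordered) / count,
        "low_95": percentile(0.025),
        "high_95": percentile(0.975),
        # Share of trials at .500 or better (81 wins in a 162-game season).
        "p_500_or_better": sum(w >= games / 2 for w in ordered) / count,
        # Share of trials that beat the over/under line.
        "p_over_line": sum(w > over_under for w in ordered) / count,
    }


# Made-up win totals, only to show the call; real input would be the
# full list of simulated seasons.
toy_wins = [70.0, 76.0, 79.0, 81.0, 83.0, 85.0, 88.0, 92.0]
print(summarize(toy_wins))
```

Reporting the two probabilities alongside the mean is what lets the projection be compared directly against an over/under line rather than against a single win total.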

The model clearly assumes quite a bit. Not only can the projected statistics be questioned, but the number of IP/PA and the deviation for each player can also be criticized. The model is bullish compared to PECOTA or to a very similar analysis done by David Manel. This will have to do for now. An explanation for the variation statistic used and an overview of the assumptions made will be given later.

## 1 comment:

DAI is risen!
