A principal component analysis depends greatly on the variables fed into it. For hitters, I used singles, doubles, triples, homers, walks, and strikeouts per plate appearance as the input variables. While I could do the same here, I thought I would use variables over which the pitcher has more direct control. Using Fangraphs pitch data, I used the following: % of Fastballs Thrown (including cutters), % of Sliders, % of Changeups, Velocity of Fastball, Ground Ball%, Walks per PA, and Strikeouts per PA. I thought about using Hits per PA and HR per PA, but since those are largely a function of luck and I didn’t want to measure that, I decided to leave them out. Like before, each variable was normalized before being put into the model.
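As a rough illustration of the normalization step, here is a minimal sketch that assumes "normalized" means converting each variable to z-scores (mean 0, standard deviation 1), so that a variable measured in miles per hour does not swamp one measured as a percentage. The column names and the four rows of numbers are made up for the example; they are not the Fangraphs data.

```python
import numpy as np

# Hypothetical column names for the seven pitcher variables (not from the source data).
COLUMNS = ["fb_pct", "sl_pct", "ch_pct", "fb_velo", "gb_pct", "bb_per_pa", "k_per_pa"]


def standardize(X):
    """Center each column at 0 and scale it to unit standard deviation (z-scores)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation of each column
    return (X - mean) / std


# Made-up rows, one per pitcher, in the order given by COLUMNS.
pitchers = np.array([
    [0.62, 0.18, 0.10, 93.5, 0.44, 0.08, 0.24],
    [0.55, 0.12, 0.20, 88.9, 0.51, 0.05, 0.15],
    [0.70, 0.22, 0.03, 95.8, 0.39, 0.10, 0.29],
    [0.58, 0.09, 0.18, 90.2, 0.47, 0.06, 0.17],
])

Z = standardize(pitchers)
```

After this step every column of `Z` is on the same scale, which is what lets pitch percentages and velocity be compared in the same model.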
For hitters I was uncertain of what to expect; for pitchers, however, I had a fairly good idea. I expected the two groupings to be power pitchers and control pitchers, though I wasn’t sure exactly how the analysis would break them down. Running the analysis, the factor loadings for the first principal component were as follows: …
And here’s the “two types of hitters” post:
For those unfamiliar with this type of analysis, the point of it is to reduce a large number of potentially correlated variables down to a few key underlying factors that explain the variables. The researcher feeds the computer a bunch of records (in this case, players) and several key variables (in this case, their statistics). The computer, blind to what those variables actually mean, spits out a set of underlying factors which explain the “true” underlying causes for the variables in question. It does this by finding the combination of variables along which the players vary the most. It’s then up to the researcher to interpret what each factor represents. In this case, I’m looking for the one underlying factor that best describes a player.
In the baseball world, I wondered what one underlying factor best determined a player’s statistics. Normally, this type of analysis would be done on many more variables, but I wanted to see what it would pick out from players’ basic, non-team-influenced statistics: 1B, 2B, 3B, HR, BB, K.
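For readers who want to see the mechanics, the procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: it standardizes each variable to z-scores, extracts the first principal component with a singular value decomposition, and runs on randomly generated numbers standing in for the six per-plate-appearance rates. The function name `first_principal_component` is hypothetical, and the weights it returns are unit-length component weights, which some software rescales before reporting them as "loadings".

```python
import numpy as np


def first_principal_component(X):
    """Return (weights, scores, explained_share) for the first principal component.

    X has one row per player and one column per statistic.
    """
    X = np.asarray(X, dtype=float)
    # Standardize so every statistic contributes on the same scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    weights = Vt[0]                          # one weight per statistic
    scores = Z @ weights                     # one score per player
    explained_share = S[0] ** 2 / np.sum(S ** 2)
    return weights, scores, explained_share


# Random placeholder data: 50 "players" by 6 rates (1B, 2B, 3B, HR, BB, K per PA).
rng = np.random.default_rng(0)
stats = rng.random((50, 6))

weights, scores, explained_share = first_principal_component(stats)
```

The sign of the weights is arbitrary (flipping every sign gives an equally valid component), so interpretation rests on which statistics share a sign and which oppose each other, not on which end is labeled positive.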