Statistical Analysis: The Fallacy of Small Sample Sizes


Statistics are one of the key ways we evaluate a player and his contributions to a baseball team. When looking at them, however, it is critical that the sample size be large enough to support the conclusions we draw. Let’s take a look at some numbers and see what conclusions we can draw from them.

Player A: .300 batting average, 3 HR, 10 RBI
Player B: .300 batting average, 3 HR, 10 RBI

Without any other information, these two players appear to be essentially the same in terms of production, right? At least until we look at some more details.

Player A: .300 BA (6/20), 3 HR, 10 RBI
Player B: .300 BA (90/300), 3 HR, 10 RBI

Knowing the sample size we are looking at helps us make better judgments about a player. In the example above, Player A’s 20 at bats are simply too few to judge him effectively without further information. In his case, we would want either a full season’s worth of statistics or statistics from previous seasons or levels.
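To make the difference concrete, here is a minimal Python sketch of how much a single extra hit moves each average. The function name `average_after_extra_hits` is made up for this illustration; the hit and at-bat totals are the ones from the example above.

```python
def average_after_extra_hits(hits, at_bats, extra_hits=1):
    """Batting average if `extra_hits` of the outs had been hits instead (same at bats)."""
    return (hits + extra_hits) / at_bats


# (name, hits, at bats) from the two example players
players = [("Player A", 6, 20), ("Player B", 90, 300)]

for name, hits, at_bats in players:
    before = hits / at_bats
    after = average_after_extra_hits(hits, at_bats)
    print(f"{name}: {before:.3f} -> {after:.3f} with one more hit")
```

One bloop single turns Player A into a .350 hitter, while Player B barely moves to .303.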

Player B lets us draw a few more conclusions. Over 300 at bats, a player is likely to have shown most of the skills he has. The fact that he hit .300 over that span gives an evaluator at least some confidence that he can do so again in the future. The evaluator can also infer that the player is not a particularly powerful hitter at this point, as evidenced by only 3 HR and 10 RBI. While some factors are outside the player’s control, the sample size is large enough that some reasonable conclusions can be drawn.
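One rough way to put numbers on that confidence is a 95% interval around each batting average. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the comparison above depends on. It also assumes every at bat is an independent trial with the same chance of a hit, which is a simplification of real baseball. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(hits, at_bats, z=1.96):
    """Approximate 95% Wilson score interval for the underlying hit rate.

    Assumes each at bat is an independent trial with the same hit probability.
    """
    p = hits / at_bats
    denom = 1 + z**2 / at_bats
    center = (p + z**2 / (2 * at_bats)) / denom
    half_width = z * math.sqrt(p * (1 - p) / at_bats + z**2 / (4 * at_bats**2)) / denom
    return center - half_width, center + half_width


for name, hits, at_bats in [("Player A", 6, 20), ("Player B", 90, 300)]:
    low, high = wilson_interval(hits, at_bats)
    print(f"{name}: plausible range {low:.3f} to {high:.3f}")
```

Under those assumptions, Player A's plausible range runs from roughly .145 to .520, while Player B's is roughly .250 to .355. Twenty at bats cannot tell a .150 hitter from a .500 hitter; three hundred narrow things down considerably.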

The poster child for small sample sizes is Chris Shelton. Back in 2006, Shelton had fallen into the first base job for the Tigers after Carlos Pena performed poorly and was released. He had never played a full season in the majors, and he got off to a hot start in April.

April: .326/.404/.783, 10 HR, 20 RBI in 104 plate appearances

Unfortunately, that was easily the best month he had of the season.

May: .286/.340/.363, 1 HR, 8 RBI in 100 plate appearances
June: .205/.286/.364, 4 HR, 9 RBI in 97 plate appearances
July: .289/.344/.386, 1 HR, 9 RBI in 90 plate appearances

He missed all of August and most of September due to injuries. I have to imagine that more than a few fantasy owners wish they could take back trades involving him. In his minor league career to that point, the most home runs he had hit in any single season was 21, so anyone who looked at the 10 homers in a month and thought he could hit 40 was in for a rude awakening.
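The trap is easy to see by scaling the numbers up. The sketch below projects a home run total to a full season; the 600 plate appearance season length and the function name `pace` are assumptions made for illustration, while the monthly totals are the ones listed above.

```python
def pace(home_runs, plate_appearances, season_pa=600):
    """Scale a home run total to a hypothetical season of `season_pa` plate appearances."""
    return home_runs * season_pa / plate_appearances


# (month, HR, plate appearances) from the monthly lines above
months = [("April", 10, 104), ("May", 1, 100), ("June", 4, 97), ("July", 1, 90)]

total_hr = sum(hr for _, hr, _ in months)
total_pa = sum(pa for _, _, pa in months)

print(f"April only: {pace(10, 104):.1f} HR per 600 PA")
print(f"April through July: {pace(total_hr, total_pa):.1f} HR per 600 PA")
```

April alone scales to a pace of nearly 58 home runs, while the four months together scale to about 25, much closer to his minor league high of 21.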

Conclusions

The key thing to remember when doing any statistical analysis is to look at the period of time your statistics are drawn from and determine whether it is long enough to be relevant. It is unfair to judge a player and his abilities on too small a time period.
