MLB park factors analysis

Goal Link to heading

To investigate whether individual MLB stadiums have unique park factors that influence offensive performance.

Unlike the NBA, NHL, or NFL, MLB stadiums vary widely in dimensions, wall heights, and environmental conditions. This project explores whether those differences contribute to variations in offensive outcomes, and if certain parks consistently favor hitters or pitchers.

Data was scraped from statcast website using the baseballr package, includes every pitch thrown during the 2024 MLB season. Was filtered down to only batted balls that were put into play.

dftestcood <- mlbam_xy_transformation(df2024)

comerica.plot <- dftestcood %>%
  filter(home_team == "DET") %>%
  filter(events == "home_run" | events == "double") %>%
  ggplot(aes(x = hc_x_, y = hc_y_, color = events)) +
  geom_mlb_stadium(stadium_ids = "tigers", stadium_segments = "all", stadium_transform_coords = TRUE) +
  geom_point(size = 2, alpha = 0.75) +
  scale_color_manual(
    values = c("home_run" = "#0C2340", "double" = "#FA4616")
  ) +
  coord_fixed() +
  labs(subtitle = "Comerica Park") +
  theme_void()

fenway.plot <- dftestcood %>%
  filter(home_team == "BOS") %>%
  filter(events %in% c("home_run", "double")) %>%
  ggplot(aes(x = hc_x_, y = hc_y_, color = events)) +
  geom_mlb_stadium(stadium_ids = "red_sox", stadium_segments = "all", stadium_transform_coords = TRUE) +
  geom_point(size = 2, alpha = 0.75) +
  scale_color_manual(
    values = c("home_run" = "#BD3039", "double" = "#0C2340")
  ) +
  coord_fixed() +
  labs(subtitle = "Fenway Park") +
  theme_void()

comerica.plot + fenway.plot +
  plot_annotation(
  title = "Spray Chart Comparison",
  subtitle = "Home Runs & Doubles by Stadium",
  caption = "Data: Statcast via baseballr"
)

com.heat <- dftestcood %>%
  filter(home_team == "DET") %>%
  filter(events %in% c("home_run", "double")) %>%
  ggplot(aes(x = hc_x_, y = hc_y_)) +
  geom_hex() +
  scale_fill_distiller(palette = "Oranges", direction = 1) +
  geom_mlb_stadium(stadium_ids = "tigers", stadium_transform_coords = T, stadium_segments = "all") +
  coord_fixed() +
  theme_void() +
  labs(subtitle = "Comerica Park")

fenway.heat <- dftestcood %>%
  filter(home_team == "BOS") %>%
  filter(events %in% c("home_run", "double")) %>%
  ggplot(aes(x = hc_x_, y = hc_y_)) +
  geom_hex() +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  geom_mlb_stadium(
    stadium_ids = "red_sox", 
    stadium_transform_coords = TRUE, 
    stadium_segments = "all"
  ) +
  coord_fixed() +
  theme_void() +
  labs(subtitle = "Fenway Park")

com.heat + fenway.heat +
  plot_annotation(title = "Heat Maps", subtitle = "For HR's and Doubles")

To begin, the plots shows how home runs and doubles compare across two stadiums. Comerica is known for their deep center field and fenway is known for the very high left field fence (green monster) and we can see there is a higher distribution of home runs in center field in Fenway than Comerica. It can also be seen there are more doubles along the whole left field than at Comerica.

theoretical derivations Link to heading

home_summary <- df2024 %>%
  group_by(game_pk, home_team) %>%
  summarize(
    home_runs_scored = max(post_home_score, na.rm = TRUE),
    home_runs_allowed = max(post_away_score, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(home_team) %>%
  summarize(
    home_games = n(),
    total_home_runs_scored = sum(home_runs_scored),
    total_home_runs_allowed = sum(home_runs_allowed),
    .groups = "drop"
  )


away_summary <- df2024 %>%
  group_by(game_pk, away_team) %>%
  summarize(
    away_runs_scored = max(post_away_score, na.rm = TRUE),
    away_runs_allowed = max(post_home_score, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(away_team) %>%
  summarize(
    away_games = n(),
    total_away_runs_scored = sum(away_runs_scored),
    total_away_runs_allowed = sum(away_runs_allowed),
    .groups = "drop"
  )


team_run_summary <- full_join(
  home_summary, 
  away_summary, 
  by = c("home_team" = "away_team")
) %>%
  rename(team = home_team) %>%
  arrange(team)

table1

	Home			Away
Team	Games	Runs Scored	Runs Allowed	Games	Runs Scored	Runs Allowed
ATL	77	298	297	78	356	297
AZ	74	416	371	80	420	373
BAL	78	353	327	75	381	331
BOS	74	343	363	79	375	340
CHC	76	284	271	77	369	371
CIN	78	343	374	73	324	284
CLE	74	354	286	78	322	299
COL	75	366	422	76	272	426
CWS	74	236	350	76	242	410
DET	74	312	301	81	352	320
HOU	79	379	319	74	328	299
KC	76	363	329	73	313	265
LAA	75	296	369	78	300	379
LAD	77	371	310	74	388	328
MIA	79	340	470	71	240	306
MIL	74	353	294	83	417	324
MIN	75	379	337	78	337	351
NYM	78	386	317	78	367	344
NYY	78	373	334	75	393	284
OAK	79	318	366	73	282	349
PHI	75	383	298	77	360	330
PIT	79	317	343	73	276	312
SD	81	362	348	70	339	265
SEA	73	278	232	79	360	343
SF	74	306	312	77	344	336
STL	75	306	307	78	322	370
TB	77	293	313	75	276	308
TEX	81	321	324	71	297	386
TOR	77	320	371	75	317	313
WSH	70	287	327	81	313	393

If we assume that the only factors affecting runs production are the teams offense, teams defense, and then average opponent on offensive and defensive (since we are including every match up it averages out), then the only difference between production at home and away would be the park factor. Note there may be some games left out due to the way scraping worked with the baseballr package and since it included alot of data but since we are averaging it should be negligible.

we can calculate park factor derivation as Runs per game at home stadium divided by runs per game away.
Example calculation for CLE would be [(354 + 286) / 74] / [(322 + 299) / 78] = 1.086, ie roughly 108 runs scored at Progressive field is equal to 100 runs scored at the average stadium.

park.plt <- team_run_summary %>%
  mutate(prkFactor = ((total_home_runs_scored + total_home_runs_allowed) / home_games) / 
                     ((total_away_runs_scored + total_away_runs_allowed) / away_games)) %>%
  ggplot(aes(x = fct_reorder(team, prkFactor), y = prkFactor)) +
  geom_col(aes(color = team, fill = team)) +
  mlbplotR::scale_color_mlb(type = "secondary") +
  mlbplotR::scale_fill_mlb(alpha = 0.54) +
  theme_minimal() +
  labs(
    y = "Park Factor",
    title = "Estimated Park Factor by Team (2024)",
    x = "Team",
    subtitle = "Teams listed represent their home stadium"
  ) +
  coord_flip() 



park.plt2 <- team_run_summary %>%
  mutate(prkFactor = ((total_home_runs_scored + total_home_runs_allowed) / home_games) / 
                     ((total_away_runs_scored + total_away_runs_allowed) / away_games)) %>%
  ggplot(aes(x = fct_reorder(team, prkFactor), y = prkFactor)) +
  geom_mlb_dot_logos(aes(team_abbr = team), width = 0.035, alpha = 0.9) +
  theme_minimal() +
  labs(
    y = "Park Factor",
    title = "Estimated Park Factor by Team (2024)",
    x = "Team"
  ) +
  coord_flip()

park.plt2

Now looking back at the spraycharts between Comerica and Fenway, Fenway had some area that were more concentrated with home runs and doubles where comerica did not and the park factor agrees saying that Fenway has a factor over 1 whereas Comerica’s is at 1 exactly. It raises the question, are some teams really better offensively than other or does their home stadium just inflate those numbers?

To explore this question we can again assume that away fields average out to 1, thus a new runs scored stat including park factor would be runs scored at home/park factor + runs scored away / 1.

rpg.df <- team_run_summary %>%
  mutate(prkFactor = ((total_home_runs_scored + total_home_runs_allowed) / home_games) / 
                     ((total_away_runs_scored + total_away_runs_allowed) / away_games)) %>%
  mutate(rpg = (((total_home_runs_scored / prkFactor) + total_away_runs_scored) / 
              (home_games + away_games)))
  
plt8 <- ggplot(rpg.df, aes(y = fct_reorder(team, rpg))) +
  geom_mlb_dot_logos(aes(x = R.G, team_abbr = team), width = 0.035, alpha = 0.4) +
  geom_point(aes(x = rpg), size = 3, alpha = 0.9, color = "darkred") +
  labs(title = "Runs/Game plot", y = "Team", x = "Runs Per Game", subtitle = "Logos are actual value, red is adjusted for park factor")

plt8

From this plot we can see adjusting for park factors some offensives can be viewed differently, like Miami may have seemed better than they were from their stadium. Even with the park factors the Dodgers were still the best offense which makes sense as they won the World Series in 2024. This plot is interesting as the teams who did really well over the season are still ranked in the top but we can see some small adjustments like maybe the Diamondbacks would have struggled more at a different stadium.

Clustering by rates Link to heading

stadium_summary <- df2024 %>%
  filter(!is.na(launch_angle),!is.na(launch_speed),!is.na(events)) %>%
  group_by(home_team) %>%
  summarize(
    avg_launch_angle=mean(launch_angle),
    avg_exit_velo=mean(launch_speed),
    hr_rate=mean(events=="home_run"),
    double_rate=mean(events=="double"),
    triple_rate=mean(events=="triple"),
    single_rate=mean(events=="single"),
 )
stadium_scaled<-stadium_summary %>%
  column_to_rownames("home_team") %>%
  scale(center=T)

fviz_nbclust(stadium_scaled,hcut, method="wss",k.max=6)

k.out <- kmeans(stadium_scaled, centers = 2)
fviz_cluster(k.out, data = stadium_scaled)

As a way of finding which parks were similar, I clustered the stadium by rates of hits (single, double, triple, home run) and values like exit velocity and launch angle. Using k-means clustering is appear that stadiums fall into 2 groups which may suggest some sort of hitter friendly or pitcher friendly park. Interestingly we can see that LAD and AZ are in different clusters even when they were the top 2 offensive teams in 2024.

Individual park effect Link to heading

Next to further investigate park factors well take a look at exact locations for batted ball. The data includes locations as where the position players closest was, it follows the normal convention of listing positions as number. (pitcher = 1, catcher = 2, first base = 3…)

df_model <- df2024 %>%
  filter(!is.na(hit_location)) %>%
  filter(!is.na(launch_speed)) %>%
  filter(!is.na(launch_angle)) %>%
  mutate(hit_location = as.factor(hit_location))


exPlot <- dftestcood %>%
  filter(home_team == "BOS") %>%
  filter(hit_location == 8) %>%
  ggplot(aes(x = hc_x_, y = hc_y_)) +
  geom_mlb_stadium(stadium_ids = "red_sox", stadium_segments = "all", stadium_transform_coords = TRUE) +
  geom_point(size = 2, alpha = 0.65, col = "#BD3039") +
  coord_fixed() +
  labs(subtitle = "Fenway Park", title = "Hit Location 8") +
  theme_void()

exPlot

This plot shows an example of all batted balls at Fenway Park that were labeled hit location 8 or center fielder.

plot4 <- df_model %>%
  filter(home_team %in% c("COL", "CHC"), events %in% c("home_run", "double")) %>%
  as.data.frame() %>%
  ggplot(aes(x = launch_speed, y = launch_angle, color = hit_location)) +
  geom_point(alpha = 0.4, size = 1.5) +
  facet_wrap(~ home_team) +
  labs(
    title = "Hit Location by Stadium",
    subtitle = "Home Runs & Doubles",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (degrees)",
    color = "Location"
  ) +
  theme_minimal()

plot4

This plot compares launch conditions for home runs and doubles in Wrigley Field (CHC) and Coors Field (COL) which were rated low and high for park factor above. While both parks feature high exit velocities, Coors Field saw batted balls with higher launch angles and similar exit velocities to become hit. This illustrates how the same kind of batted ball can lead to different outcomes depending on the stadium.

set.seed(127)

df_shuffled <- df_model %>% slice_sample(prop = 1)

# 70/30
train_df <- df_shuffled %>% slice_head(prop = 0.7)
test_df  <- df_shuffled %>% slice_tail(prop = 0.3)

rf1 <- randomForest(hit_location ~ launch_speed + launch_angle + home_team,
                    data = train_df)

confPlot

This confusion matrix evaluates how well the random forest model predicts the fielder location (hit_location) based on exit velo, angle, and stadium. While the model performs reasonably well for outfield positions (e.g., 7, 8, 9), there is still some misclassification, particularly in the infield.

Linear model for park effect Link to heading

The data includes variables, wOBA, and expected wOBA, where wOBA is weighted on base percentage, and expected wOBA is formulated using exit velocity, launch angle, sprint speed and batted ball type. We can assume then that the park factor would contribute to the difference in expected value and the actual value. So if we model the difference between these 2 with linear regression and use launch angle and exit velocity as covariates then adding in which ball park it took place in the coefficients will give an estimate to the park factor. So positive values would indicate inflated stats where as negative values would indicate deflated stats.

df2024$woba_diff <- df2024$woba_value - df2024$estimated_woba_using_speedangle

lm_model <- lm(woba_diff ~ launch_speed + launch_angle + home_team, data = df2024)

woba_effects <- coef(summary(lm_model)) %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  filter(grepl("home_team", term)) %>%
  mutate(team = gsub("home_team", "", term))

prkfactorplot <- ggplot(woba_effects, aes(x = reorder(team, Estimate), y = Estimate)) +
  geom_col(fill = "#333366", color = "#C4CED4") +
  coord_flip() +
  labs(title = "Estimated Park Effect w/ wOBA (Actual - Expected)",
       y = "wOBA Difference coefficient", x = "Stadium") +
  theme_minimal()
prkfactorplot

This plot shows the estimated effect of each stadium on a batted ball’s value using a linear model of actual wOBA minus expected wOBA. Parks like COL and BOS inflate expected value, while SEA and STL slightly suppress outcomes.

Random Forest simulation Link to heading

Finally lets explore whether a batted ball with the same launch conditions ie. velocity and angle would be different from stadium to stadium. To test this I will again use random forest to model the event being a hit (single, double, triple, home run) using launch conditions and also stadium, then I will predict on a random 20% of the events from the dataset and compare how the model treats them all across every stadium.

df_model <- df_model %>%
  mutate(is_hit = ifelse(events != "out", 1, 0)) 
df_model$is_hit <- as.factor(df_model$is_hit)

rf2 <- randomForest(is_hit ~ launch_speed + launch_angle + home_team,
                    data = df_model, ntree = 200)

set.seed(42)
example_hits <- df_model %>%
  sample_n(20000) %>%
  select(launch_speed, launch_angle)


stadiums <- sort(unique(df_model$home_team))

simulated_hits <- example_hits %>%
  slice(rep(1:n(), each = length(stadiums))) %>%
  mutate(home_team = rep(stadiums, times = nrow(example_hits)))

simulated_hits$prob_hr <- predict(rf2, newdata = simulated_hits, type = "prob")[, "1"]


prkplot3 <- ggplot(simulated_hits, aes(x = fct_reorder(home_team, prob_hr, .fun = median), y = prob_hr)) +
  geom_boxplot(fill = "#00A3E0", alpha = 0.7, color = "#000000") +
  coord_flip() +
  labs(
    title = "Predicted Hit Probability Across Parks",
    subtitle = "Same Launch Conditions Simulated Across All Stadiums",
    y = "Predicted Hit Probability",
    x = "Stadium"
  ) +
  theme_minimal()
prkplot3

By simulating identical batted ball across all stadiums, I estimated the prob of a hit in each stadium. Again Colorado at the top (median value) while others were lower again. This reinforces the idea that park factors exist but with probability very similar in median and spread it might not be the most important aspect.

Conclusion Link to heading

This project supports the idea that MLB stadiums have some park factors that influence the game, like offensive performance. Through spray charts, park factor calculations, clustering stadiums and supervised learning techniques, it appears appropriate to conclude not all batted balls are equal and depends on where they happened. By simulating identical batted balls across every stadium we found a consistent difference between certain stadiums, like Colorado’s vs Seattle. These finding can support the importance of adjusting for location when evaluating performance.