CSAS 2025: Quantifying MLB Home Run Attempts

2025 Sports Analytics R, Statistical Modeling

1. Overview

For this data challenge, your goal is to use new baseball data on bat speed and swing length to analyze some aspect of the pitcher/batter interaction. We provide pitch-level data from Baseball Savant for 346,250 Major League Baseball plate appearances from 4/2/2024 to 6/30/2024, including relevant Statcast data along with bat speed and swing length on pitches with a swing tracked. Data from the second half of the season will be added after the conclusion of the regular season. Your analysis should involve bat speed and swing length to study any topic related to the batter, pitcher, or batter-pitcher interaction during an at bat.

Since these data are new, there are a variety of topics that have not previously been studied. Below are a few example topics. However, we note that this list is far from exhaustive. Participants should feel free to study any aspect of the batter, pitcher, or batter-pitcher interaction that interests them, provided that bat speed and swing length are used in the analysis in some meaningful way.

1-1. Proposal

Questions:

1. Which conditions indicate that a batter is most likely to try to hit a home run? (Michael & Austin)

  • Define which variables determine a home run attempt (swing velocity, angle, etc.)
  • Find the distribution of the numerical variables that constitute a home run attempt
  • Transform the events variable into a separate variable indicating home run attempts
  • Fit various classification models to see which scores, innings, batting counts, out counts, and runners are related to successful home runs
  • Find trends related to home run attempts

1-2. Variables

Variable List
column name | description
pitch_type | The type of pitch derived from Statcast.
game_date | Date of the Game.
release_speed | Pitch velocities from 2008-16 are via Pitch F/X, and adjusted to roughly out-of-hand release point. All velocities from 2017 and beyond are Statcast, which are reported out-of-hand.
release_pos_x | Horizontal Release Position of the ball measured in feet from the catcher's perspective.
release_pos_z | Vertical Release Position of the ball measured in feet from the catcher's perspective.
player_name | Player's name tied to the event of the search.
batter | MLB Player Id tied to the play event.
pitcher | MLB Player Id tied to the play event.
events | Event of the resulting Plate Appearance.
description | Description of the resulting pitch.
spin_dir | * Deprecated field from the old tracking system.
spin_rate_deprecated | * Deprecated field from the old tracking system. Replaced by release_spin.
break_angle_deprecated | * Deprecated field from the old tracking system.
break_length_deprecated | * Deprecated field from the old tracking system.
zone | Zone location of the ball when it crosses the plate from the catcher's perspective.
des | Plate appearance description from game day.
game_type | Type of Game. E = Exhibition, S = Spring Training, R = Regular Season, F = Wild Card, D = Divisional Series, L = League Championship Series, W = World Series.
stand | Side of the plate batter is standing.
p_throws | Hand pitcher throws with.
home_team | Abbreviation of home team.
away_team | Abbreviation of away team.
type | Short hand of pitch result. B = ball, S = strike, X = in play.
hit_location | Position of first fielder to touch the ball.
bb_type | Batted ball type: ground_ball, line_drive, fly_ball, popup.
balls | Pre-pitch number of balls in count.
strikes | Pre-pitch number of strikes in count.
game_year | Year game took place.
pfx_x | Horizontal movement in feet from the catcher's perspective.
pfx_z | Vertical movement in feet from the catcher's perspective.
plate_x | Horizontal position of the ball when it crosses home plate from the catcher's perspective.
plate_z | Vertical position of the ball when it crosses home plate from the catcher's perspective.
on_3b | Pre-pitch MLB Player Id of Runner on 3B.
on_2b | Pre-pitch MLB Player Id of Runner on 2B.
on_1b | Pre-pitch MLB Player Id of Runner on 1B.
outs_when_up | Pre-pitch number of outs.
inning | Pre-pitch inning number.
inning_topbot | Pre-pitch top or bottom of inning.
hc_x | Hit coordinate X of batted ball.
hc_y | Hit coordinate Y of batted ball.
tfs_deprecated | * Deprecated field from old tracking system.
tfs_zulu_deprecated | * Deprecated field from old tracking system.
fielder_2 | Pre-pitch MLB Player Id of Catcher.
umpire | * Deprecated field from old tracking system.
sv_id | Non-unique Id of play event per game.
vx0 | The velocity of the pitch, in feet per second, in x-dimension, determined at y=50 feet.
vy0 | The velocity of the pitch, in feet per second, in y-dimension, determined at y=50 feet.
vz0 | The velocity of the pitch, in feet per second, in z-dimension, determined at y=50 feet.
ax | The acceleration of the pitch, in feet per second per second, in x-dimension, determined at y=50 feet.
ay | The acceleration of the pitch, in feet per second per second, in y-dimension, determined at y=50 feet.
az | The acceleration of the pitch, in feet per second per second, in z-dimension, determined at y=50 feet.
sz_top | Top of the batter's strike zone set by the operator when the ball is halfway to the plate.
sz_bot | Bottom of the batter's strike zone set by the operator when the ball is halfway to the plate.
hit_distance | Projected hit distance of the batted ball.
launch_speed | Exit velocity of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included based on the process described here.
launch_angle | Launch angle of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included based on the process described here.
effective_speed | Derived speed based on the extension of the pitcher's release.
release_spin | Spin rate of pitch tracked by Statcast.
release_extension | Release extension of pitch in feet as tracked by Statcast.
game_pk | Unique Id for Game.
pitcher | MLB Player Id tied to the play event.
fielder_2 | MLB Player Id for catcher.
fielder_3 | MLB Player Id for 1B.
fielder_4 | MLB Player Id for 2B.
fielder_5 | MLB Player Id for 3B.
fielder_6 | MLB Player Id for SS.
fielder_7 | MLB Player Id for LF.
fielder_8 | MLB Player Id for CF.
fielder_9 | MLB Player Id for RF.
release_pos_y | Release position of pitch measured in feet from the catcher's perspective.
estimated_ba_using_speedangle | Estimated Batting Avg based on launch angle and exit velocity.
estimated_woba_using_speedangle | Estimated wOBA based on launch angle and exit velocity.
woba_value | wOBA value based on result of play.
woba_denom | wOBA denominator based on result of play.
babip_value | BABIP value based on result of play.
iso_value | ISO value based on result of play.
launch_speed_angle | Launch speed/angle zone based on launch angle and exit velocity. 1: Weak, 2: Topped, 3: Under, 4: Flare/Burner, 5: Solid Contact, 6: Barrel.
at_bat_number | Plate appearance number of the game.
pitch_number | Total pitch number of the plate appearance.
pitch_name | The name of the pitch derived from the Statcast Data.
home_score | Pre-pitch home score.
away_score | Pre-pitch away score.
bat_score | Pre-pitch bat team score.
fld_score | Pre-pitch field team score.
post_home_score | Post-pitch home score.
post_away_score | Post-pitch away score.
post_bat_score | Post-pitch bat team score.
if_fielding_alignment | Infield fielding alignment at the time of the pitch.
of_fielding_alignment | Outfield fielding alignment at the time of the pitch.
spin_axis | The Spin Axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.
delta_home_win_exp | The change in Win Expectancy before the Plate Appearance and after the Plate Appearance.
delta_run_exp | The change in Run Expectancy before the Pitch and after the Pitch.

2. Data Cleaning

2-1. Factors influencing a home run

Because a home run attempt can end in either success or failure, it cannot be defined from the outcome alone. We therefore decided to define it using two simple swing factors:

  1. Bat speed: how fast a batter swings the bat
  2. Swing length: the distance the bat travels from the start of the swing to the point of impact

2-2. Setting optimal threshold

Having chosen the factors that define a home run attempt, we need a threshold for each one, that is, the point beyond which we consider that the batter tried to hit a home run. For this, we took the following steps:

  1. Create an ROC curve comparing home runs against bat speed and swing length
  2. Calculate the Youden index from the sensitivity and specificity at every possible threshold value \( t \), where \( J = \max_t \{\text{sensitivity}(t) + \text{specificity}(t) - 1\} \)
ROC curve and Youden index
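
The two steps above can be sketched in a few lines. This is a minimal Python illustration rather than the R code used in the project: the function name `youden_threshold`, the rule that a swing is classified as positive when its value is at or above the threshold, and the toy data are all assumptions made for the example.

```python
def youden_threshold(values, labels):
    """Return (threshold, J) maximizing sensitivity + specificity - 1.

    ASSUMPTION: a swing is predicted positive when value >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_j = None, float("-inf")
    # Every observed value is a candidate threshold (one ROC point each).
    for t in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if v >= t and y == 1)
        tn = sum(1 for v, y in zip(values, labels) if v < t and y == 0)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Made-up bat speeds (mph) and home run flags, for illustration only.
bat_speed = [65.0, 68.5, 70.2, 71.9, 73.1, 74.8, 76.0, 78.3]
home_run = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, j = youden_threshold(bat_speed, home_run)
print(threshold, round(j, 3))
```

The same function can be applied to swing length to obtain its own threshold. The brute-force loop is quadratic in the number of swings, so a real pitch-level dataset would call for a sorted, cumulative version or a library ROC routine.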

These steps gave optimal thresholds of 72.65 mph for bat speed and 7.05 ft for swing length, which we used to create our response variable, home_run_attempt. The graph below shows the distributions of bat speed and swing length by home run outcome, along with their optimal thresholds.

Distribution of Bat Speed and Swing Length by Home Run
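
As a rough illustration of how the response variable could be derived from the two thresholds, the Python sketch below flags a swing as an attempt when both thresholds are met. The write-up does not state how the two thresholds are combined or whether the comparison is inclusive, so the "both, inclusive" rule, the function name `is_home_run_attempt`, and the sample swings are assumptions.

```python
# Thresholds reported in the text above.
BAT_SPEED_THRESHOLD_MPH = 72.65
SWING_LENGTH_THRESHOLD_FT = 7.05


def is_home_run_attempt(bat_speed, swing_length):
    """Flag a swing as a home run attempt.

    ASSUMPTION: both thresholds must be met (inclusive comparison).
    Returns None when the swing was not tracked.
    """
    if bat_speed is None or swing_length is None:
        return None
    return (bat_speed >= BAT_SPEED_THRESHOLD_MPH
            and swing_length >= SWING_LENGTH_THRESHOLD_FT)


# Made-up swings, for illustration only.
swings = [
    {"bat_speed": 75.2, "swing_length": 7.6},
    {"bat_speed": 75.2, "swing_length": 6.8},
    {"bat_speed": 68.0, "swing_length": 7.4},
    {"bat_speed": None, "swing_length": None},  # pitch without a tracked swing
]
for swing in swings:
    swing["home_run_attempt"] = is_home_run_attempt(
        swing["bat_speed"], swing["swing_length"]
    )

flags = [swing["home_run_attempt"] for swing in swings]
print(flags)
```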

2-3. Variable Selection Method

For variable selection, we first picked 30 potentially significant predictors of home run attempts, including game situation (inning, outs), pitch characteristics (release position, velocity), and batter/pitcher attributes.

We then tried three variable selection methods:

  1. Forward stepwise selection using BIC
  2. Backward stepwise selection using BIC
  3. Recursive Feature Elimination (RFE) with 5-fold CV

The table below shows the result:

Variable Selection Method | Number of features
Forward Selection | 9
Backward Selection | 9
RFE | 2

Forward and backward stepwise selection produced the same number of features, but we chose backward selection because it had a lower BIC. Although RFE is generally a sound approach to variable selection, it retained only two variables here, so we decided not to use it.

The final selected features are listed below:
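
For reference, the BIC used to compare the stepwise models has the standard form below, where \( k \) is the number of estimated parameters, \( n \) is the number of observations, and \( \hat{L} \) is the maximized likelihood of the fitted model. These symbols are notation introduced here for illustration; lower values indicate the preferred model.

\[ \mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L}) \]

Because the penalty term \( k \ln(n) \) grows with the sample size, each additional feature has to improve the likelihood noticeably to be retained on a pitch-level dataset of this size.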

  • balls
  • strikes
  • at_bat_number
  • release_pos_z
  • vz0
  • az
  • effective_speed
  • stand

3. Data Modeling

3-1. Model Fitting

After cleaning the data, we fit several models to predict home_run_attempt:

  1. Logistic regression: a basic linear model
  2. GAM: accounts for non-linear relationships while remaining interpretable
  3. Mixed effects model: accounts for the individual tendencies of batters and pitchers
  4. Random forest: a general-purpose machine learning model

We chose our best model based on accuracy; the chart below shows the result:

Model accuracy comparison

3-2. Result of The Best Performing Model

Since the mixed effects model performed best, we took a closer look at its output to identify the significant factors behind home run attempts.

Mixed Effects Model output

Result summary:

  • The variance of the batter effect is notably higher than that of the pitcher effect, suggesting that the tendency to attempt a home run varies more from batter to batter than from pitcher to pitcher
  • balls: a higher ball count increases the likelihood that the batter attempts a home run
  • strikes: a higher strike count decreases the likelihood that the batter attempts a home run
  • at_bat_number: a higher at-bat number slightly decreases the likelihood
  • release_pos_z: a higher release position of the pitcher decreases the likelihood
  • vz0: a higher pitch velocity in the z-dimension decreases the likelihood
  • az: a higher pitch acceleration in the z-dimension decreases the likelihood
  • effective_speed: a higher pitch speed decreases the likelihood
  • standR: right-handed batters are more likely to attempt a home run
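
One way to read the summary above is against a model of the following form. This is a sketch under assumptions: it supposes a logistic link and crossed random intercepts for batter and pitcher, which is consistent with the two variance components reported but is not confirmed by the write-up, and all symbols are notation introduced here.

\[ \operatorname{logit} P(y_{ijt} = 1) = \beta_0 + \mathbf{x}_{ijt}^{\top} \boldsymbol{\beta} + u_i + v_j, \qquad u_i \sim N(0, \sigma_b^2), \quad v_j \sim N(0, \sigma_p^2) \]

Here \( y_{ijt} \) is home_run_attempt for swing \( t \) by batter \( i \) against pitcher \( j \), \( \mathbf{x}_{ijt} \) holds the selected features, and \( u_i \) and \( v_j \) are the batter and pitcher effects. Under this form, a positive coefficient \( \beta_k \) raises the log-odds of an attempt and a negative one lowers it, with \( \exp(\beta_k) \) giving the odds ratio for a one-unit increase in that feature. A larger \( \sigma_b^2 \) than \( \sigma_p^2 \) corresponds to the first bullet above.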

4. Top 10 Batters with Highest Home Run Success Rate

The chart below lists the ten batters with the highest home run success rate among those whose number of home run attempts is above the 75th percentile. This filter is meant to restrict the list to regular batters who are most likely to play the whole game.

Top 10 batters home run success rate
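
The filtering and ranking described above can be sketched as follows. This is a minimal Python illustration, not the project's R code: it assumes success rate means home runs divided by home run attempts, uses the inclusive quantile method for the 75th percentile, and runs on made-up per-batter totals with placeholder names.

```python
from statistics import quantiles


def top_batters(rows, top_n=10):
    """Rank batters by home run success rate.

    rows: (name, home_run_attempts, home_runs) tuples.
    ASSUMPTION: success rate = home_runs / home_run_attempts.
    Only batters with attempts above the 75th percentile are kept.
    """
    attempts = [att for _, att, _ in rows]
    q3 = quantiles(attempts, n=4, method="inclusive")[2]  # 75th percentile
    eligible = [(name, hr / att) for name, att, hr in rows if att > q3]
    return sorted(eligible, key=lambda row: row[1], reverse=True)[:top_n]


# Made-up totals, for illustration only.
batters = [
    ("batter_A", 40, 3), ("batter_B", 55, 6),
    ("batter_C", 60, 2), ("batter_D", 80, 9),
    ("batter_E", 100, 7), ("batter_F", 120, 11),
    ("batter_G", 150, 12), ("batter_H", 200, 10),
]

ranking = top_batters(batters)
for name, rate in ranking:
    print(f"{name}: {rate:.3f}")
```

With only eight toy batters, the percentile filter leaves just two of them, so fewer than ten rows are returned; on the full dataset the same call would return the top ten.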

After looking at this chart, another question came up: "Do higher bat speed and swing length lead to a better home run success rate?" To answer it, we set the following hypothesis:

Hypothesis: Larger swing length and bat speed will contribute to higher home run success rate

Swing length and bat speed for top batters
Home run rate vs average bat speed and swing length

Looking at the charts above, bat speed and swing length do not appear to correlate with home run success rate as strongly as we expected. This might mean a couple of things:

  1. There might be an optimal range that yields a higher home run success rate, rather than a "higher is better" relationship.
  2. The timing and the point of contact might matter more than bat speed and swing length alone.
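
One way to put a number on the visual impression above is a Pearson correlation between each batter's average bat speed (or swing length) and their home run success rate. The Python sketch below is illustrative only: the helper `pearson_r` is written out by hand, the per-batter values are made up, and no such coefficient is reported in the original analysis.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up per-batter averages, for illustration only.
avg_bat_speed = [71.0, 72.5, 73.0, 74.2, 75.1, 76.3]
hr_success_rate = [0.06, 0.04, 0.07, 0.05, 0.06, 0.05]

r = pearson_r(avg_bat_speed, hr_success_rate)
print(f"r = {r:.2f}")
```

A value near zero would be consistent with the weak relationship seen in the charts. Pearson correlation only measures linear association, so it would not detect the "optimal range" pattern suggested in the first point.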