Eight Thirty Four

A Primer for Understanding Survival Curves in Relation to How Players Accrue Events

Kaplan-Meier plots are a way of estimating and visualizing survival curves (survival functions). A survival curve, in general, maps the length of time that elapses before an event occurs. Here, the curves give the probability that a player has "survived" to a certain time without recording a particular numbered event (rebound, assist, etc.). These curves are useful for understanding how a player accrues events while accounting for the total length of time during which the player is followed, and they allow us to compare how different events are accrued. Furthermore, Kaplan-Meier curves can account for right censoring.

If games were infinitely long, and players continued to play indefinitely, we would observe every event for every player. As games are of finite length, every player is subject to some form of right censoring due to the end of follow-up time. Therefore, it makes sense that higher-numbered events are subject to sampling bias: they are observed only in games where they happened before time ran out.

Let's use an example to illustrate. Imagine a game where Russell Westbrook registers 12 assists. Westbrook plays the entire 4th quarter of the game and records his 11th assist with 5 minutes remaining in the game. He then registers his 12th assist on a buzzer beater to end the game. The time between his 11th and 12th assists is recorded as 5 minutes, or 300 seconds. Now imagine a game where he has 11 assists. Again, Westbrook plays the entire 4th quarter and records his 11th assist with 5 minutes remaining in the game. But this time he never gets a 12th assist, and therefore there is no recorded time between the 11th and 12th assists. Had the game continued into overtime, we might have seen Westbrook record his 12th assist. But the game ends after regulation, so the time to event for the 12th assist is censored. A raw analysis examining the time between 11th and 12th assists would therefore only count the time from the first game, even though at least 5 minutes elapsed without an assist in the second game. Survival models and Kaplan-Meier curves help account for this censoring by using the fact that Westbrook was still playing for the final 5 minutes of the second game, even though we don't observe the 12th assist.
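The two imaginary games above can be turned into a small worked example. The following is a minimal sketch of the standard Kaplan-Meier product-limit estimator in plain Python, not the code behind this site; the function name kaplan_meier and the two-game data set are assumptions made for illustration. When an observed event and a censored gap share the same time, the sketch uses the usual convention that the censored gap is still at risk at that time.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each distinct event time.

    durations: seconds from the previous event to the next event,
               or to the end of follow-up if the next event never came.
    observed:  True if the next event was seen, False if the gap was censored.
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # every gap leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the share that "survives" time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Remove both observed and censored gaps that end at time t.
        at_risk -= exits[t]
    return curve

# Hypothetical gaps between the 11th and 12th assists in the two games:
# game 1 saw the 12th assist after 300 s, game 2 was censored at 300 s.
durations = [300, 300]
observed = [True, False]
curve = kaplan_meier(durations, observed)
```

A raw analysis would use only the first game and conclude that every such gap ends by 300 seconds. Here the censored game stays in the risk set, so the estimated probability of still waiting for the 12th assist at 300 seconds is 0.5 rather than 0.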

However, we must also bear in mind that, because of censoring, we never see a hypothetical 13th assist in either of our imaginary games. This would cause comparison problems if our analysis also included a game where Westbrook accrues 15 assists.

Our broad, crude method for partially adjusting for this censoring is to restrict analysis to games where at least the median number of events happened. For example, when analyzing Westbrook's assists in his historic triple-double 2016-2017 season, the restricted analysis only examines games with at least 11 assists. This limits the analysis to only 37 games, so bear in mind that the restricted analyses have a much smaller sample size by design. Furthermore, while such a restriction allows for good comparisons of early events, we still have the selection bias problem from the censoring of later events. We can increase sample size by combining data across years, though none of the current analyses account for differences between seasons.
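The median restriction described above can be sketched in a few lines. This is only an illustration under assumptions: the function name restrict_to_median, the game ids, and the assist totals are made up and are not real box scores.

```python
from statistics import median

def restrict_to_median(events_per_game):
    """Keep only games with at least the median number of events.

    events_per_game: dict mapping a game id to the event count in that game.
    Returns the cutoff and the dict of games that pass it.
    """
    cutoff = median(events_per_game.values())
    kept = {game: n for game, n in events_per_game.items() if n >= cutoff}
    return cutoff, kept

# Made-up assist totals for five games (illustrative only).
games = {"g1": 7, "g2": 11, "g3": 15, "g4": 9, "g5": 12}
cutoff, kept = restrict_to_median(games)
```

With these made-up totals the cutoff is 11 and three of the five games are kept, which shows how quickly the restriction shrinks the sample.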

To provide more statistical rigor, we analyze our players using a conditional risk set model for ordered events estimated using a stratified Cox proportional hazards (PH) model. This method models the hazard at each event time as a function of the current number of events accumulated and time since the last event. Under this model, the assumption is made that an observation does not become at risk of a second event until the first event is experienced. The model is flexible and can include other covariates as needed. Currently, we allow the user to adjust for the score difference between the two teams and the amount of game time left on the clock.
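To make the "time since the last event" idea concrete, here is a hedged sketch of how one game might be laid out before fitting such a model: one row per event interval, with the event number available as the stratum. The function gap_time_rows, its arguments, and the sample times are hypothetical, and the sketch assumes the player is on the court for the whole game. It only prepares the data and does not fit the Cox model itself.

```python
def gap_time_rows(event_times, end_time, start_time=0.0):
    """Build one row per event interval for a single game.

    event_times: sorted clock times (seconds played) at which events occurred.
    end_time:    when follow-up stops (the end of the game).
    Each row is (event_number, gap_seconds, observed).
    """
    rows = []
    previous = start_time
    for k, t in enumerate(event_times, start=1):
        # The player becomes at risk of event k only after event k-1.
        rows.append((k, t - previous, True))
        previous = t
    # After the last event the player is still at risk of the next one
    # until follow-up ends, which gives a censored row.
    if end_time > previous:
        rows.append((len(event_times) + 1, end_time - previous, False))
    return rows

# Hypothetical 48-minute game (2880 s) with three events,
# the last one with 5 minutes remaining.
rows = gap_time_rows([120.0, 400.0, 2580.0], end_time=2880.0)
```

The final row records that 300 seconds passed without a 4th event, which is exactly the information a raw analysis would discard. Rows in this shape could then be passed to a Cox PH routine that supports strata (for example, the fit method of CoxPHFitter in the lifelines package accepts a strata argument), though the software actually used here is not stated.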

The coefficients given by the model are (log) hazard ratios for that event with respect to the baseline of the first event. The model does not yield a strict probability of an event, as a Cox PH model only estimates comparisons between events. We are not modeling the probability that Westbrook gets his 5th assist or his 10th assist. Rather, we are modeling how likely it is that Westbrook will get his 5th assist given that he already has 4 assists, and comparing that to the chance that he gets his 10th assist given that he already has 9 assists.

There are a lot of sophisticated and subtle statistical ideas at work here.

Here are some suggestions for players and stats to look at to get started:

Developed by © Udam Singh Saini (@UdamSaini), Katherine L. Evans (@CausalKathy).