
Three weeks of horse race data from tracks worldwide.


A `data.frame` object with 36,418 observations and 19 columns.

The columns are defined as follows:

`EventId`

An integer ID denoting the event (race). These range from 1 to 4486.

`TrackId`

An integer ID number of the track. There are 64 different tracks represented.

`Type`

The type of event, one of “Thoroughbred” or “Harness”.

`RaceNum`

The integer race number within a group of races at a track on a given date.

`CorrectedPostTime`

The ‘corrected’ post time of the race, in the form `%Y-%m-%d %H:%M:%S`, presumably in the PDT time zone. Has values like “2019-03-05 02:30:00”.

`Yards`

The length of the race, in yards.

`SurfaceText`

A string, one of “Turf”, “Dirt”, “All-Weather” or `NA`.

`HorseName`

The string name of the horse.

`HorseId`

A unique integer ID for each horse. As different horses can have the same name, this ID is constructed from the name of the Horse, the Sire and the Dam.

`Age`

The age of the horse, in integer years, at the time of the event. Typically less than 10.

`Sex`

A single character denoting the sex of the horse. I believe the codes are “M” for “Mare” (female four years or older), “G” for “Gelding”, “F” for “Filly” (female under four years of age), “C” for “Colt” (male under four years of age), “H” for “Horse” (male four years of age and up), “R” for “Rig” (hard to explain), “A” for “???”. There are some `NA` values as well.

`Weight_lbs`

The weight in integer pounds of the jockey and any equipment. Typically around 120.

`PostPosition`

The integer starting position of the horse. Typically there is a slight advantage to starting at the first or second post position.

`Medication`

One of several codes indicating any medication the horse may be taking at the time of the race. I believe “L” stands for “Lasix”, a common medication for lung conditions that is thought to give horses a slight boost in speed.

`MorningLine`

A double indicating the “morning betting line” for win bets on the horse. It is not clear how to interpret this value; perhaps it is the return on a dollar. Values range from 0.40 to 80.

`WN_pool`

The total combined pool in win bets, in dollars, on this horse at post time.

`PL_pool`

The total combined pool in place bets, in dollars, on this horse at post time.

`SH_pool`

The total combined pool in show bets, in dollars, on this horse at post time.

`Finish`

The integer finishing position of the horse. A 1 means first place. We only collect values of 1, 2, and 3, while the remaining finishing places are unknown and left as `NA`.
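As a minimal sketch of how a few of these columns might be prepared for analysis, the snippet below works on a small made-up data frame (`toy`, not part of the dataset) that mimics the column names above. The time zone passed to `as.POSIXct` is an assumption, since the post times are only presumed to be Pacific time, and the derived column names (`PostTime`, `Furlongs`, `Won`) are hypothetical.

```r
# Toy rows mimicking a few race_data columns; all values are made up.
toy <- data.frame(
  EventId = c(1L, 1L, 1L, 2L, 2L),
  CorrectedPostTime = c(rep("2019-03-05 02:30:00", 3),
                        rep("2019-03-05 03:00:00", 2)),
  Yards = c(1760, 1760, 1760, 1320, 1320),
  Finish = c(1L, NA, 2L, NA, 1L),
  stringsAsFactors = FALSE
)

# Parse the post time string. ASSUMPTION: times are US Pacific time.
toy$PostTime <- as.POSIXct(toy$CorrectedPostTime,
                           format = "%Y-%m-%d %H:%M:%S",
                           tz = "America/Los_Angeles")

# Convert race length from yards to furlongs (1 furlong = 220 yards).
toy$Furlongs <- toy$Yards / 220

# Finish is NA outside the top three, so test for a win in an NA-safe way.
toy$Won <- !is.na(toy$Finish) & toy$Finish == 1L
```

The same three assignments should carry over to `race_data` itself, provided its columns match the descriptions above.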

The author makes no guarantees regarding correctness of this data.

Steven E. Pav shabbychef@gmail.com

Data were sourced from the web. Don't ask.

```
library(dplyr)
data(race_data)

# compute win bet efficiency: compare each horse's share of the win pool
# (the implied win probability) with how often such horses actually win
efficiency <- race_data %>%
  group_by(EventId) %>%
  mutate(ImpliedOdds=WN_pool / sum(WN_pool,na.rm=TRUE)) %>%
  ungroup() %>%
  # bucket horses by implied win probability
  mutate(OddsBucket=cut(ImpliedOdds,c(0,0.05,seq(0.1,1,by=0.10)),include.lowest=TRUE)) %>%
  group_by(OddsBucket) %>%
  # Finish is NA outside the top three, so treat NA as "did not win"
  summarize(PropWin=mean(as.numeric(coalesce(Finish==1,FALSE)),na.rm=TRUE),
            MedImpl=median(ImpliedOdds,na.rm=TRUE),
            nObs=n()) %>%
  ungroup()

# plot observed against implied win probability, if plotting packages are available
if (require('ggplot2') && require('scales')) {
  efficiency %>%
    ggplot(aes(MedImpl,PropWin,size=nObs)) +
    geom_point() +
    scale_x_sqrt(labels=percent) +
    scale_y_sqrt(labels=percent) +
    geom_abline(slope=1,intercept=0,linetype=2,alpha=0.6) +
    labs(title='actual win probability versus implied win probability',
         size='# horses',
         x='implied win probability',
         y='observed win probability')
}
```
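The `MorningLine` column is described above as possibly a return on a dollar. Purely under that assumption (it is not confirmed anywhere in this documentation), a morning line of `x` would correspond to a raw win probability of `1 / (1 + x)`, which can then be normalized within each race and set beside the pool-implied probability. The sketch below uses a made-up data frame `toy`; the columns `MLProb`, `MLProbNorm` and `PoolProb` are hypothetical names.

```r
# Toy rows for a single race; all values are made up.
toy <- data.frame(
  EventId = c(1L, 1L, 1L),
  MorningLine = c(1.5, 3, 8),
  WN_pool = c(5000, 3000, 2000)
)

# ASSUMPTION: MorningLine is profit per dollar staked (odds-to-one),
# so the raw implied win probability is 1 / (1 + MorningLine).
toy$MLProb <- 1 / (1 + toy$MorningLine)

# Normalize within each event so the probabilities sum to one.
toy$MLProbNorm <- toy$MLProb / ave(toy$MLProb, toy$EventId, FUN = sum)

# Pool-implied win probability: each horse's share of the win pool.
toy$PoolProb <- toy$WN_pool / ave(toy$WN_pool, toy$EventId, FUN = sum)
```

If the two probability columns disagree badly on the real data, that would be a hint that the assumed interpretation of the morning line is wrong.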
