Festival of Champions

So, having nothing better to do, I downloaded the scores into Minitab. Here are the boxplots of the scores. The median score is the line within the box. The interquartile-range box represents the middle 50% of the data. The whiskers extend from either side of the box and represent the bottom 25% and the top 25% of the data values, excluding outliers.
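For anyone who wants to poke at this without Minitab, the pieces of a boxplot can be computed directly. A minimal sketch in Python, using made-up scores (not the actual class data):

```python
# Boxplot anatomy: quartiles, IQR (the box), and whisker extents.
# The scores below are invented for illustration.
import statistics

scores = [68.5, 69.0, 70.2, 71.5, 72.8]  # hypothetical scores from 5 judges

q1, median, q3 = statistics.quantiles(scores, n=4)  # quartiles (exclusive method)
iqr = q3 - q1                                       # the "box" spans Q1..Q3

# Whiskers extend to the most extreme points within 1.5 * IQR of the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = min(s for s in scores if s >= lower_fence)
whisker_high = max(s for s in scores if s <= upper_fence)
```

Anything beyond the fences would be plotted as an individual outlier point instead of being covered by a whisker.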

Scores Boxplot

I find it interesting to look at the spread of the scores and the amount of “overlap” of the variation in scores.

Last place scores are obviously different…but look at the scores of the riders placed 3rd thru 6th place. Lots of overlap in that data.

If I were doing this in manufacturing to discriminate quality between production runs, there would be no difference in “samples” 3 thru 6.

I will continue to play with the data.

2 Likes

Sorry, coffee hasn’t kicked in yet. Is your x-axis plotting riders and all their scores for the week?

The x-axis is rider placement, from 1st to 12th place, for the class in the link that @evtrmom provided in my post.

The class in that link seems to have started the whole discussion about the judges and judging system. The boxplots indicate the “spread,” or variation, in scores given by the 5 judges.

Here are the individual scores plotted by place.

Scores Individual Values

So you can see that some rides drew lots of “opinions,” as shown by the wide spread of scores; for other rides, the judges seem to agree, as shown by the smaller spread in opinions (scores).

Non-data types probably want to skip the following.

How were “outliers” defined? Also, doesn’t the quality determination regarding variation depend on the tolerance established in advance? I think that’s the question at the core of the discussion of how much variation in judging is expected/acceptable.

That has been Schockemoehle’s business model for decades - esp. for stallions. I personally know two people who bought stallions from him that they exchanged for other stallions when the first one didn’t do as well on his SPT as hoped. I also know a third person who was made the same offer but elected to keep hers anyway because he was a great fit for her size-wise and temperament-wise.

1 Like

I would not consider Steffen Peters as an example of someone who buys young horses and produces them himself. While he may have done that with Udon (I don’t know if he did or didn’t buy him as a youngster), his other top horses were secured after they had reached a pretty good level of training.

And while I am sure there are some top riders who have gotten their mounts as youngsters (i.e., not yet under saddle), at the moment I can’t think of anyone except Graves. Tarjan acquires many of hers as youngsters but the jury is out as to whether she can be considered a “top rider” (despite her recent success). Even top riders in Europe are rarely getting their mounts when young or super green and producing them themselves to international GP. Super promising foals and young horses are usually started by young horse specialists and often brought up to a certain level by those specialists or another rider who takes a “started” horse and brings it up to FEI, often (usually) under the tutelage of a top trainer. A “star” rider then takes the horse and finishes it to GP/international GP. Top, top riders with international ranking generally don’t want to climb on youngsters - they either don’t have the time or don’t want to risk getting injured (a legitimate concern).

5 Likes

I’m not a statistics nerd but I’m loving your graphs!

For the second graph, is there a way to use different colored dots for the different judges? I think that would be super interesting.

1 Like

Minitab uses an asterisk (*) symbol to identify outliers. There do not seem to be any “outliers” in the boxplots. Minitab defines outliers as observations that are at least 1.5 times the interquartile range (Q3 – Q1) from the edge of the box.

There is also a national standard for determining the treatment of outlier observations: ASTM E178, Standard Practice for Dealing With Outlying Observations.
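The 1.5 × IQR rule is easy to apply outside Minitab too. A small Python sketch with invented numbers (the 95.0 is planted as an obvious outlier):

```python
# Flag points more than 1.5 * IQR beyond the box edges (the Minitab rule).
# All numbers below are made up -- not the real judge scores.
import statistics

def iqr_outliers(data):
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

scores = [70.0, 70.5, 71.0, 71.5, 72.0, 72.5, 73.0, 95.0]
flagged = iqr_outliers(scores)   # the 95.0 gets flagged
```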

Perhaps you are thinking of the “level of significance”…aka “alpha.” There is no “sure thing.” Alpha measures the risk of reaching a wrong conclusion (specifically, the Type I error: rejecting a null hypothesis that is actually true). Alpha is specified in a statistical test as the tolerance for error when determining whether there are real differences between samples.

Most statistical tests assume that there is NO difference. This assumption is called “the null hypothesis.” So the statistical tests are really tests of whether to REJECT that assumption.

For example, an alpha = 0.05 says that there is a 5% chance of the observed differences between samples being due to random chance alone. This alpha of 0.05 corresponds to the 95% confidence interval (95% CI). With an alpha of 0.05 you can say that you are 95% confident in concluding that there are real differences between the samples.
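One way to see what alpha means in practice is a permutation test: shuffle the combined scores many times and count how often a random regrouping produces a difference as big as the observed one. A Python sketch on two invented samples (not real scores):

```python
# Permutation test: the p-value estimates the chance that random regrouping
# alone produces a gap this large. If p < alpha (0.05), reject the null
# hypothesis of "no difference." Data below is invented for illustration.
import random

random.seed(1)
a = [72.0, 71.5, 72.5, 73.0, 72.2]   # hypothetical scores, sample A
b = [68.0, 68.5, 69.0, 67.5, 68.8]   # hypothetical scores, sample B
observed = abs(sum(a) / len(a) - sum(b) / len(b))

combined = a + b
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(combined)
    diff = abs(sum(combined[:5]) / 5 - sum(combined[5:]) / 5)
    if diff >= observed:
        count += 1

p_value = count / trials
reject_null = p_value < 0.05   # these samples barely overlap, so yes
```

With samples this well separated, almost no random shuffle reproduces the gap, so the p-value lands far below 0.05 and we reject the null hypothesis.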

You are correct in asking how much variability in dressage judging is acceptable. It has never been studied.

That is why I say that a measurement system analysis is done to validate a measurement system…BEFORE the measurement system is used to measure.

1 Like

We will see if we can change that.

Minitab has changed their user interface (for the worse, IMHO) and buried commands from the prior version I was used to using. Let me see what is possible about coloring the dots.

1 Like

I imagine this business model is very common as it’s done that way for lots of things for the uberwealthy, especially if they don’t mind losing money on the deal.

1 Like

I am curious who Christian Simonson trains with. Anyone know?

Adrienne.

3 Likes

Here is the ANOVA 95% Confidence Interval for the mean.

The “Mean” is the average of the 5 judges’ scores, which is the score that was awarded for the ride.

The 95% CI is the range within which the true mean would be expected to fall, given a significance level of 0.05.

Look at the placings for positions 3, 4, 5, and 6. There is no statistical difference between those scores, as their 95% CIs all overlap.

Unfortunately the table formatting goes away when I copy/paste it here.

Place   N   Mean     StDev   95% CI
   1    5   72.760   2.32    (71.58, 73.95)
   2    5   71.794   0.583   (70.607, 72.980)
   3    5   70.970   1.405   (69.784, 72.156)
   4    5   70.882   0.560   (69.696, 72.068)
   5    5   70.735   0.771   (69.549, 71.921)
   6    5   70.647   1.502   (69.461, 71.833)
   7    5   69.588   0.669   (68.401, 70.774)
   8    5   67.411   2.045   (66.225, 68.597)
   9    5   66.676   0.910   (65.490, 67.862)
  10    5   66.293   1.439   (65.107, 67.480)
  11    5   65.764   1.296   (64.578, 66.950)
  12    5   62.764   0.903   (61.578, 63.950)

Pooled StDev = 1.31914
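A side note on the table: every CI has the same half-width because Minitab computes it from the pooled standard deviation, mean ± t × s_pooled / √n, with df = 12 × (5 − 1) = 48. A quick Python check against the place-3 row (the t critical value 2.0106 for 48 df is taken from standard t tables):

```python
# Reproduce the 95% CI half-width from the pooled standard deviation.
# Values are taken from the Minitab output above; t_crit is from t tables.
import math

pooled_sd = 1.31914   # Minitab's pooled StDev
n = 5                 # five judges per placing
t_crit = 2.0106       # t(0.975, df = 48) from standard tables

half_width = t_crit * pooled_sd / math.sqrt(n)

# Check against place 3: mean 70.970, CI reported as (69.784, 72.156)
lower = round(70.970 - half_width, 3)
upper = round(70.970 + half_width, 3)
```

The computed interval matches the place-3 row of the table, which is how you can tell a pooled estimate is being used rather than each placing's own standard deviation.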

2 Likes

The judges are going to need to have a conference after this Children’s class. Phew. There’s a big difference between a 68% and an 82%, and between a 67% and a 79% :open_mouth:

1 Like

Got a link?

Here is the Analysis of Means (ANOM). It shows that Judge B was right in line with the “average.” Judge C scored lower than the average and Judge M scored higher than the average across all placings.

ANOM for Score
E: Janet Lee Foy
H: Natalie Lamping
C: William Warren
M: Michael Osinski
B: Heidi Berry
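For the curious, the core idea of ANOM can be sketched in a few lines: compare each judge's mean to the grand mean. A formal ANOM also draws decision limits from a tabled constant; this sketch skips that and just shows the deviations, on invented scores (the judge letters match the post, the numbers do not):

```python
# Simplified Analysis of Means idea: each judge's mean vs. the grand mean.
# Scores below are invented; C is made to score low and M high, as in the post.
judge_scores = {
    "B": [70.0, 71.0, 69.5, 70.5],   # hypothetical: in line with average
    "C": [67.0, 68.0, 66.5, 67.5],   # hypothetical: scores low
    "M": [73.0, 74.0, 72.5, 73.5],   # hypothetical: scores high
}

all_scores = [s for scores in judge_scores.values() for s in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Positive deviation = judge scores above the grand mean, negative = below.
deviation = {j: sum(s) / len(s) - grand_mean for j, s in judge_scores.items()}
```

A real ANOM chart would flag a judge whose deviation falls outside the decision limits; here the pattern alone (B near zero, C negative, M positive) mirrors what the Minitab chart showed.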

That would agree with what I have known about how these judges typically score, so at least the data agrees.

https://equestrian-hub.com/public/show/184925/competition/dressage/2415484

And there are two judges at B for this class!

If you click on a rider to see the score breakdown, the judge at C is judging a regular test, but the judge(s?) at B are judging collective marks like position and seat, use of aids, etc.

https://equestrian-hub.com/public/show/184925/competition/dressage/2415484