Musings on the meanings of scores

This morning I’m seeing posts from FB friends who received these medal certificates from USDF:

They are well deserved, no question.

However, I have trouble reconciling that a score of 67% merits distinction (and I definitely think that it should), yet 70% means, on average, only “fairly good.” I know that many of the individual scores that make up a test score of 67% are higher than 7. I also know that at the same time that would mean that many are lower, as well.
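To make the arithmetic concrete, here is a minimal Python sketch of how a test percentage falls out of the individual marks. It assumes the usual scheme of marks from 0 to 10, optionally multiplied by a coefficient, divided by the maximum possible points. The function name `percentage_score` and the marks themselves are invented for illustration and do not come from any real test sheet.

```python
def percentage_score(marks, coefficients=None):
    """Return the test percentage for a list of 0-10 marks.

    coefficients is an optional list of multipliers (1 if omitted),
    one per movement.
    """
    if coefficients is None:
        coefficients = [1] * len(marks)
    earned = sum(m * c for m, c in zip(marks, coefficients))
    possible = sum(10 * c for c in coefficients)
    return 100 * earned / possible


# Hypothetical marks for a ten-movement test (not real data).
marks = [7, 7.5, 6, 6.5, 7, 6, 7.5, 6.5, 7, 6]

sevens_or_better = sum(m >= 7 for m in marks)
print(f"{sevens_or_better} of {len(marks)} marks are 7 or better")
print(f"Test score: {percentage_score(marks):.1f}%")
```

With these invented marks, half the movements earn a 7 or better and the total still comes out at exactly 67%.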

Again, I have no beef with what USDF considers distinction; I have a beef with 67%, on average, essentially being defined as less than “fairly good.” I don’t know too many people who are disappointed to see a 7 on a test. Is it really only “fairly good?”

I would like USEF to take a hard look at the meanings of individual scores and come up with more meaningful, um, meanings.

7 Likes

The USEF needs professional “test writers”…people who validate whether a test instrument actually measures performance against some standard (the rule book).

The “meanings” of the scores are there in the rulebook:

10 Excellent
9 Very Good
8 Good
7 Fairly Good
6 Satisfactory
5 Marginal
4 Insufficient
3 Fairly Bad
2 Bad
1 Very Bad
0 Not executed

This ranking is called a “Likert Scale.”

The USDF/USEF (the dressage committees are populated by the same people) changed the definition of a score of 5 from “Sufficient,” which is still used by the FEI, to “Marginal.” This really adds to the confusion.

Even when I go to the dictionary, it doesn’t clarify things…

Of the many definitions of “marginal”…
“close to the lower limit of qualification, acceptability, or function”…or
“barely exceeding the minimum requirements.”

I have audited three L-Judge trainings, riding in one as a demo rider for 2nd Level. I found it interesting that in the videos used in the instruction, there were few (maybe no) examples of how to use the low scores. There did not seem to be any videos of riding that merited 0, 1, 2, 3…there might have been a few examples providing discussion of movements meriting a 4…but only in the discussions of “Is it a 4 or a 5?” I don’t recall any examples of “this is clearly a 1, 2, 3” for movements illustrated in vids.

If the judges are not educated on the use of the full grading scale, so that they clearly understand what riding merits scores in the lower half of the scale, then we have “score compression”…i.e., the scores are all clustered around 5-6-7-8…and thus the current need for “half-points.”
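One rough way to put a number on “score compression” is to count how many marks land in the 5-to-8 band. The Python sketch below does only that. The names `band_share` and `histogram` are hypothetical helpers, and the marks are invented for illustration rather than taken from any real score sheet.

```python
from collections import Counter


def band_share(marks, low=5, high=8):
    """Fraction of marks that fall within [low, high], inclusive."""
    return sum(low <= m <= high for m in marks) / len(marks)


def histogram(marks):
    """Count marks per whole number 0-10; half-points are binned downward."""
    counts = Counter(int(m) for m in marks)
    return {mark: counts.get(mark, 0) for mark in range(11)}


# Hypothetical marks from one test (not real data).
marks = [6, 6.5, 7, 5, 6, 7.5, 8, 6, 5.5, 7, 4, 6.5]

print(f"Share of marks in the 5-8 band: {band_share(marks):.0%}")
for mark, count in histogram(marks).items():
    print(f"{mark:>2} {'#' * count}")
```

For these made-up marks, 11 of the 12 fall in the 5-to-8 band, and nothing below a 4 appears at all.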

This is not news to anyone who studies measurement systems and does MSAs (Measurement System Analyses) professionally…as I did for at least 15 years.

Also, in measurement theory, these scales are considered “linear”…i.e., there is the same “spacing” between the numbers.

The way it SHOULD BE
0…1…2…3…4…5…6…7…8…9…10

As I see the dressage scoring scale currently applied, the scale is extremely NON-linear…eg:

The way IT IS NOW…something like this
0--1--2--3-----4------5--------6------7-----------8-------------9---------------------------10

Edited for formatting to show non-linearity

25 Likes

Ah well.

It’s hard to hit 70 if the horse is not a dressage talented breed, and easier to hit 70 in the lower levels if you have a big floaty beast of a baby WB. Dressage scores are very breed and gait linked.

Also the score distribution varies a lot between levels of dressage and even areas. If we watch real FEI CDI level competition we get used to seeing 8s and 9s. But if we watch even FEI tests at the local level, the scores are nowhere near that. Arguably the scores get a bit inflated locally and a horse might get a score in the 70s that wouldn’t even make it into the arena at a CDI.

Also what’s the point of the medals? Canada just eliminated them; as far as I could see, here they were mostly pursued by amateurs. If coaches want credentials they do the extensive, widely recognized EC accreditation process, not medals. I realize coaches in the US get and advertise their medals, so that’s a different culture.

I always figured medals were primarily to engage ammies who were not necessarily getting ribbons but wanted some goal to persevere and pay fees to rated shows. They are not primarily for coaches.

Taking all that in mind, 67 % for distinction and whatever lower 60s for the medal seems to me really fair for an amateur award. This really does weed out an awful lot of people. Going up to third level with scores of 67% is huge for most ammies. Most wash out before 2nd level.

Also strategically if you put the distinction level up to say 70% then you are likely to be limiting it to people who are already regularly getting good ribbons. People who are already getting the dopamine rush of winning and are moving up the levels may not be that interested in back tracking to get scores for lower levels. Absolutely a wall of ribbons for PSG and up is going to outweigh a Bronze Medal.

So the scores for medals need to capture that group that are not necessarily getting reliable ribbons and want a “personal goal” to work towards. And as I said above quality of gait can really be a deciding factor in ribbons.

6 Likes

The FEI has a dressage judging handbook that covers how to use the scores - it’s pretty interesting reading. https://inside.fei.org/fei/about-fei/fei-library/dressage-handbook.

Typically the reason you don’t see more 1s and 2s is that would mean the horse is not doing anything resembling the movement - the horse would be rearing or otherwise severely resisting, completely in the wrong gait (for example, trotting the entire free walk), etc. - and fortunately we don’t see that too often in dressage competition.

3 Likes

Incorrect. The above bolded is what YOU think this means. The descriptors for the scores are:

3 Fairly Bad
2 Bad
1 Very Bad
0 Not executed

Only when you get to a score of “0” is the movement “not executed.” So we really don’t know how the judges would score, or how they would differentiate a “Fairly Bad” from a simple “Bad” from a really “Very Bad” piaffe, passage…etc.

To test the level of absolute agreement among appraisers, researchers often use the Kappa statistic. In its basic, unweighted form, Kappa treats every disagreement alike: it cannot tell a 1-versus-5 disagreement from a 1-versus-2 disagreement. To assess the level of agreement when you have a ranking system, use Kendall’s Coefficient of Concordance (KCC). KCC, unlike Kappa, accounts for the magnitude of the difference among scores.
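Here is a small Python sketch of the difference, using textbook formulas for unweighted Cohen’s Kappa (two raters) and Kendall’s W (no tie correction). The three “judges” and their marks are hypothetical, chosen so that no two judges ever give the same mark to the same movement. One caveat: W measures agreement in rank order, so a judge who is consistently one mark higher than another still gets perfect concordance.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's Kappa for two raters marking the same movements."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def ranks(marks):
    """Rank marks from 1 to n, giving tied marks their average rank."""
    order = sorted(range(len(marks)), key=lambda i: marks[i])
    result = [0.0] * len(marks)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and marks[order[j + 1]] == marks[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def kendall_w(*raters):
    """Kendall's coefficient of concordance W, without tie correction."""
    m, n = len(raters), len(raters[0])
    rank_rows = [ranks(r) for r in raters]
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical marks for six movements (not real data).
judge_a = [4, 5, 6, 7, 8, 9]
judge_near = [5, 6, 7, 8, 9, 10]  # always one mark higher than judge_a
judge_far = [9, 8, 7, 6, 5, 4]    # ranks the movements in reverse order

for name, other in (("near", judge_near), ("far", judge_far)):
    print(name, round(cohen_kappa(judge_a, other), 3),
          round(kendall_w(judge_a, other), 3))
```

With these invented marks, Kappa is negative for both pairings because it only counts exact matches, while W is 1.0 for the judge who is off by one mark and 0.0 for the judge who ranks the movements in the opposite order.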

This stuff is out there being used on a daily basis by industries who rely on having quality products and processes. From reading a mammogram, (shades of grey on a CRT) to quality of a weld on a nuclear reactor (reading radiographs and ultrasound scans), this sort of validation of the judge (the doctor or weld inspector) is done every day in industry.

7 Likes

I haven’t shown in quite a while but have done my share of spectating! I would agree that the lower scores are rarely used - perhaps less than they should be. It seems especially with L grads and newer r judges, something dramatic has to happen before a score drops below a 4. Even if the horse is inverted and crooked and crawling across the diagonal in a “lengthening” the score rarely drops.

I think there is some insecurity with some judges and a desire to be “encouraging”. They also want jobs and giving out low scores (even deservedly) might cost them jobs.

Even at the higher levels, this can be an issue. Some might remember the California ride that was livestreamed that looked to be the definition of “insufficient” or worse, yet scored in the 50s.

1 Like

So when I mark college composition papers that are insufficient I can do a C minus, a D or an F. Students need a C plus in the course to satisfy program requirements. Going C or below will mean they need to retake it.

D is in the range of 50s, F is under 50 and can be a zero if it’s egregious. I usually only do this for actual smoking gun plagiarism and it’s a 0.

For my purposes, with the D papers and below there’s not that much mileage in deciding if it’s a 30 or a 40 or a 50.

The distinctions that really interest everyone are the C plus or B minus, or the B plus to A range. We are only talking about a couple of points on the course grade but there is a huge psychological component.

At the average ammie lower level show, the real thought is going to go into distinguishing the 5/6/7 range.

The person whose horse bolts across the diagonal is going to know they messed up big time and whether they get a 0 or a compassionate 3 or 4 and “nicely ridden” as a comment, they will still be mortified and accept any score and just be glad they didn’t jump the rail and get rung out :)

So I’m not sure there’s huge mileage in fine tuning the degrees of failure, when the scores that people scrutinize and take to heart are the middling ones.

I would not worry about the more rarely used fail scores, but would hope that the judge was consistent in how they awarded the middling scores which take much more focus.

4 Likes

Fair - I should have said they aren’t doing MUCH resembling the movement. For example, they may have followed the path from H - X- F, but mostly in the wrong gait, as in the example of trotting in the free walk. One break is typically a 4 (I have scribed a lot and continue to do so, and for what it’s worth am a recently licensed judge for eventing) but multiple breaks or consistently the wrong gait will go lower - 3 or 2 depending on the extent of the problem. I gave a 2 recently for a horse that did this - had much more trot than walk in what was supposed to be a free walk.

Why not go lower? Well, what happens if the next horse comes in and runs sideways throughout, or bucks, or rears? That needs to be scored lower than the horse that was trotting because it is more severe resistance. But if you’ve gone to the bottom of the scale just for trotting, you’ve got nowhere left to go to identify that this was worse.

Just because the lower numbers aren’t used often doesn’t mean they aren’t on the scale - but the scale has to cover everything a horse might do. Yes that means the majority of horses are going to fall somewhere near the upper middle of the scale (6 - 8) if they are reasonably well prepared for the test, but that doesn’t mean the scale is wrong, just that there is great variability in what might be seen in the ring, with the extreme behaviours not often presented for judging. Or hopefully at judging clinics. And therefore not often seen in the scores.

I’d also recommend checking out the FEI judging book if this is confusing - it’s a great resource as to how the scores should be applied, broken down by every gait and movement.

3 Likes

None of this is confusing. I know whereof I speak.

I have done this professionally for over 15 years…in situations having life-and-death consequences such as passing judgement whether a 7,000 psi reactor (not nuclear) is fit for service and not going to blow up and level the local manufacturing plant…not just some horse & rider going round and round a 20x60 arena.

If a certifying entity is going to certify (or license) “appraisers”…eg., judges, it needs to show that the appraisers can judge to the standard. Statistics can measure how well appraisers can do this…and also how much appraisers vary in ranking. As far as I know this has not been rigorously done…eg., to look for Kendall or Kappa scores for judging.

The point is that you need an absolute scale, then hold the judges to that scale.

The fact that we are having this discussion tells me that there is no clear vision of how to judge a movement…NOT in a “handbook”…that’s theory…but in the reality of a real horse and rider in front of a judge.

Back in the dark ages (1980’s), auditors could attend judge training at Gladstone. I attended 2x. Auditors sat down the long side, silent, while riders rode a demo test. The instructor judges commented on what was expected then another rider rode the test from start to finish.

The judges who were up for license renewal had paddles with scores. After each movement the judges held a paddle to say what their score was. The scores varied all over the place. I remember Klaus Fraesdorf (RIP) was a hardass giving lower scores than most of the other judges. I don’t believe things have changed much, just that the dissenters have turned in their judge’s cards or else they keep silent these days. I know at the L-Judges training the L-candidates who would speak candidly said they did not ask controversial questions because they did not want to be black-balled.

Oh, I have scribed…a lot…up to and including GP at Dressage at Devon…though again, back in the dark ages, including for O-judge Axel Steiner.

2 Likes

When I judge, I would definitely give that a zero. The movement is free walk. The rider did not demonstrate a free walk good, bad, or indifferent - that’s a zero. However, I agree with Pluvinel in that you rarely see judges use the entire scale. 5,6,7,8, then almost never 9, and more rarely 10. I had a horse bolt off with me during a test. I got one zero, one 2, and one 3 before I got her under control again. I wish I could recall the judge. (30+ years ago, sorry). I was shocked that I wasn’t excused, but I wasn’t. I finished the test.

3 Likes

This is where I differ with you. Yes, judges should be able to judge to the standard, but that’s where you get off track. You absolutely cannot judge to a standard that isn’t hard and fast. These aren’t machines that have hard and fast specs that they must conform to. Dressage is extremely subjective. Yes, you can say that the canter must be a clear, consistent 3 beat. OK. But is that the standard? A warmblood that has a clear, consistent 3-beat canter is going to have an entirely different canter than an Arab for instance. The quality of both canters being equal, one judge may score the warmblood higher, another judge may score them the same, a third judge may score the Arab higher. It’s entirely subjective. There is no subjectivity (or shouldn’t be) in QC of machinery. Here’s the specs, here’s what this particular machine has, ergo, this machine is not to the standard. I really feel like this is an apples and oranges comparison. YMMV

(I’ve also scribed for Axel Steiner back in the day. As well as 3 other O Judges at a CDI-W) :)

11 Likes

I totally agree with what you’ve said. I do think this:

Is the reason for the score compression: it doesn’t matter. Where the scores fall is only important to the rider (and possibly their coach).
If someone gets a 6 for a movement that deserved a 4, it might not teach them what they need to know, but it’s not going to level a building.

Dressage is a subjective art rather than an objective sport like jumpers where either the rail falls or it doesn’t.

Like marking college papers, in each contest or assignment the judge/professor tries to be consistent across the rubric. But it’s absolutely true that here at my little 4 year college my marks are much softer than the ones I gave when teaching as a PhD candidate at a top research university. There, the quality of undergrads was much higher. The standards demanded were higher. That’s why colleges are ranked and tiered. That said, the range of grades in the top university was much smaller because all the students were excellent. They did everything you could ask of them. Like a CDI where everyone is getting 8 and 9.

Anyhow judging and marking are both subjective but that doesn’t mean entirely without standards. It does mean there is no fail safe engineering test to evaluate either a paper or a ride. Also as noted the stakes are much lower. And the student who gets a C minus instead of a D or the rider who gets a 3 instead of a 0 on a move knows they missed.

7 Likes

You are confused if you think this only applies to “machines.”

How do you think medical professionals make subjective judgements about a patient’s symptoms? By education and training to some standard for evaluating a patient’s condition, whether it is a psychiatric condition or a physical issue.

You absolutely cannot judge to a standard that isn’t hard and fast.

And THAT is the problem…there is NO STANDARD.

But is that the standard? A warmblood that has a clear, consistent 3-beat canter is going to have an entirely different canter than an Arab for instance. The quality of both canters being equal, one judge may score the warmblood higher, another judge may score them the same, a third judge may score the Arab higher. It’s entirely subjective.

You just made my point. There is no standard. There should be a caucus of the dressage decision makers to arrive at a consensus opinion on how to judge a canter…regardless of breed.

1 Like

Sure it matters. It matters to the competitors. It matters to the integrity of the sport.

This is an Olympic level sport! I would think that there would be rigorous standards for evaluating performance.

1 Like

Well…if dressage is THAT subjective, then it should not be a competitive sport. It should then be an exhibition art and perhaps be judged by the audience.

We do not have “competitive ballet.” We do have gymnastics where performance is judged to a standard.

As far as judging term papers…welcome to my world…I just spent spring break reading 200+ pages of term papers. You bet I put the rubric in front of me and judge to the rubric. And no, I do not have a sliding scale. I have an absolute scale as these are going to be graduating engineers whose designs may impact life-safety decisions.

So I am very familiar with subjective judging…and it does not just have to be for “machines”… human performance can be and regularly is quantified by metrics to measure performance against a standard.

1 Like

I agree that it matters within the sport, but obviously it doesn’t matter enough to rigorously train judges to use the entire set of scores available to them without those judges then not being invited back to judge again.

It’s incredibly subjective; otherwise non-warmbloods could score as well as warmbloods, because it’s about the movements performed by the horse that is on course, not comparing the horse you’re looking at to the best one you’ve seen all day.

2 Likes

Gymnastics and ice skating are more like showjumping: you either did it or didn’t do it. Dressage is more like ice dancing or artistic gymnastics, where there has to be a lot more subjective judging.

3 Likes

Follow the money…By jove, I think you have hit the nail on the head…

2 Likes

I scribe quite a lot and have seen 2s and 9s on the same test. In Canada at least, I think judging clinics in recent years have focused on encouraging judges to use the entire scale. In my experience the less experienced / lower level judges tend to stick between 4-6.5, while the FEI judges and the more senior national judges are more comfortable giving 8s and 9s. None of them want to give a mark below 4 and always make a point of clarifying why when they do: ie multiple mistakes in the same line of changes, little of the movement shown, etc.

As to whether 7 should be something more than “fairly good?” I’m not sure - at least not if we want all dressage riders judged on the same standard scale. To most of us lower level ammies, 7 is a great score, but when you look at the top pros, 7 is just “fairly good.” Total scores of 69-72 are solidly mediocre in that context, while the best of the best are well in the 80s and some into the 90s.

That doesn’t mean we shouldn’t be proud of getting “fairly good” scores, especially if they are better than the scores we got last year. I’d be beyond thrilled to score above 70%. For an average amateur who rides only one horse 3-4 times a week, “fairly good” in competition is a solid achievement.

3 Likes