[QUOTE=tom;3692513]
The most fundamental problem with the analysis is that they are sampling on the dependent variable – international showjumpers. They have a list of international showjumpers (their dependent variable) and they are sorting the showjumpers into their respective motherlines. Then they are ranking the motherlines by the number of international showjumpers each motherline has produced.
What is wrong with this approach? You are an epidemiologist so let’s use an example from your field. Let’s say there has been a large increase in brain cancer deaths in a jurisdiction and you are asked to investigate it.
An inexperienced researcher will develop a study that contains, let’s say, all the people who died of brain cancer and then try to identify causes. The inexperienced researcher notes that 98.8% of the people who died of brain cancer watched Monday Night Football and this variable is the one that the greatest number of brain cancer patients shared. There you go! Monday Night Football must have caused the brain cancer!
But of course that is a silly conclusion.
A more appropriate technique would have been to draw a random sample (of some type; a purely random sample would not provide enough cases of people with brain cancer, since the incidence of brain cancer is still low in the population despite the spike in cases) and estimate a model that attempts to discriminate (to sort) between those with brain cancer and those without. You could use a statistical estimator such as discriminant analysis or logistic regression (died of brain cancer = 1 and did not die of brain cancer = 0) to test your theory. If your theory is good, your data are good, and you have enough statistical power, you might find one or more variables that are important predictors/explainers, from both a substantive (remember Monday Night Football?) and a statistical point of view, of why certain people died of brain cancer and others did not. That model would also have to include control variables such as age, sex, race, etc. – variables that may be correlated with your key independent variables but are themselves certainly not possible causes of brain cancer.
In the case of these mare families we have a similar problem of sampling on the dependent variable. The authors took as their population to be studied the population of international showjumpers (i.e., the brain cancer casualties). They should have taken as their population to be studied the entire population of each mare family (i.e., the brain cancer casualties AND the people without brain cancer) and then sorted each mare family into those horses that are/were international showjumpers and those that are/were not. So the solution is much easier here with the mare families than it is with the cancer study: all they have to do is control for the size of each mare family! Simple.
Let me give another example. Ireland was (and maybe still is) the second largest exporter of software in the world after the USA. So the USA is like the most successful motherline in the study and Ireland is like the second most successful motherline in the study. But wait a minute. Ireland's population is only about 1.5 percent of the US population. So on a per capita basis Ireland is a much more “powerful” exporter of software than the USA.
Or consider whose record is more impressive in producing Nobel Laureates in Literature. Ireland has 4 Nobel Prizes in Literature, one-third as many as the USA (12). Who impresses you the most by their production of Nobels in Literature? Ireland or the USA? For me the answer is easy: Ireland.
Let’s go back to the mare families. For this ranking to have any utility we have to know the denominator: how many horses are in each of the mare families. We know stamm 776 is big. But how big is it? If we took into account the size of each of the mare families, would the ranking be the same? Holding the size of the mareline constant, does the ranking stay the same? Or do the positions shift dramatically once we take into account how many horses are in each mare family?
Is stamm 776’s production of international showjumpers impressive? I do not know and the only way to truly know is if we control for size. And the most reader-friendly way to do that is to compute a rate – number of international showjumpers per 100 horses in each motherline.
Without controlling for the size of the mare family and/or computing a rate we simply have no idea which mare families are truly powerful producers of international showjumpers and which ones only appear to be so because they are so big.[/QUOTE]
Been reading this forum off and on for what, 10 years… or however long it’s been around… this discussion is one of the most useful ever. Thank you.
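
For anyone who wants to try the rate idea from the quoted post, here is a rough sketch in Python. The family names and all the numbers are made up purely for illustration (they are not from the study or from any real stamm), and `rate_per_100` is just a helper name I picked. It only shows how a ranking by raw count can flip once you divide by the size of each mare family.

```python
def rate_per_100(jumpers, family_size):
    """International showjumpers per 100 horses in a mare family."""
    if family_size <= 0:
        raise ValueError("family_size must be positive")
    return 100.0 * jumpers / family_size


# Hypothetical data, invented for this example only:
# family name -> (international showjumpers, total horses in the family)
families = {
    "family A": (40, 4000),
    "family B": (12, 300),
    "family C": (5, 50),
}

# Ranking the way the study does it: by raw number of showjumpers.
by_count = sorted(families, key=lambda f: families[f][0], reverse=True)

# Ranking after controlling for size: showjumpers per 100 horses.
by_rate = sorted(families, key=lambda f: rate_per_100(*families[f]), reverse=True)

print("ranked by raw count:   ", by_count)
print("ranked by rate per 100:", by_rate)
```

With these invented numbers the biggest family comes first by raw count and last by rate, which is exactly the question being asked about the real ranking. Whether the real mare families would reshuffle like this can only be answered with the actual family sizes.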