Categories
Statistics

13. Study design and choosing a statistical test

Design

In many ways the design of a study is more important than the analysis. A badly designed study can never be retrieved, whereas a poorly analysed one can usually be reanalysed. (1) Consideration of design is also important because the design of a study will govern how the data are to be analysed.

Most medical studies consider an input, which may be a medical intervention or exposure to a potentially toxic compound, and an output, which is some measure of health that the intervention is supposed to affect. The simplest way to categorise studies is with reference to the time sequence in which the input and output are studied.

The most powerful studies are prospective studies, and the paradigm for these is the randomised controlled trial. In this, subjects with a disease are randomised to one of two (or more) treatments, one of which may be a control treatment. Methods of randomisation have been described in Chapter 3. The importance of randomisation is that we know that, in the long run, treatment groups will be balanced in known and unknown prognostic factors. It is important that the treatments are concurrent – that the active and control treatments occur in the same period of time.

A parallel group design is one in which treatment and control are allocated to different individuals. To allow for the therapeutic effect of simply being given treatment, the control may consist of a placebo, an inert substance that is physically identical to the active compound. If possible a study should be double blinded – neither the investigator nor the subject being aware of what treatment the subject is undergoing. Sometimes it is impossible to blind the subjects, for example when the treatment is some form of health education, but often it is possible to ensure that the people evaluating the outcome are unaware of the treatment. An example of a parallel group trial is given in table 7.1, in which different bran preparations have been tested on different individuals.

A matched design comes about when randomisation is between matched pairs, such as in Exercise 6.2, in which randomisation was between different parts of a patient’s body.

A crossover study is one in which two or more treatments are applied sequentially to the same subject. The advantage is that each subject then acts as their own control, and so fewer subjects may be required. The main disadvantage is that there may be a carry-over effect, in that the action of the second treatment is affected by the first treatment. An example of a crossover trial is given in table 7.2, in which different dosages of bran are compared within the same individual. A number of excellent books are available on clinical trials.(2, 3)

One of the major threats to validity of a clinical trial is compliance. Patients are likely to drop out of trials if the treatment is unpleasant, and often fail to take medication as prescribed. It is usual to adopt a pragmatic approach and analyse by intention to treat, that is, analyse the study by the treatment that the subject was assigned to, not the one they actually took. The alternative is to analyse per protocol or on study. Dropouts should of course be reported by treatment group. A checklist for writing reports on clinical trials is available.(4, 5)

A quasi experimental design is one in which treatment allocation is not random. An example of this is given in table 9.1 in which injuries are compared in two dropping zones. This is subject to potential biases in that the reason why a person is allocated to a particular dropping zone may be related to their risk of a sprained ankle.

A cohort study is one in which subjects, initially disease free, are followed up over a period of time. Some will be exposed to some risk factor, for example cigarette smoking. The outcome may be death and we may be interested in relating the risk factor to a particular cause of death. Clearly, these have to be large, long term studies and tend to be costly to carry out. If records have been kept routinely in the past then a historical cohort study may be carried out, an example of which is the appendicitis study discussed in Chapter 6. Here, the cohort is all cases of appendicitis admitted over a given period and a sample of the records could be inspected retrospectively. A typical example would be to look at birth weight records and relate birth weight to disease in later life.

These studies differ in essence from retrospective studies, which start with diseased subjects and then examine possible exposure. Such case control studies are commonly undertaken as a preliminary investigation, because they are relatively quick and inexpensive. The comparison of the blood pressure in farmers and printers given in Chapter 3 is an example of a case control study. It is retrospective because we argued from the blood pressure to the occupation and did not start out with subjects assigned to occupation. There are many confounding factors in case control studies. For example, does occupational stress cause high blood pressure, or do people prone to high blood pressure choose stressful occupations? A particular problem is recall bias, in that the cases, with the disease, are more motivated to recall apparently trivial episodes in the past than controls, who are disease free.

Cross sectional studies are common and include surveys, laboratory experiments and studies to examine the prevalence of a disease. Studies validating instruments and questionnaires are also cross sectional studies. The study of urinary concentration of lead in children described in Chapter 1 and the study of the relationship between height and pulmonary anatomical dead space in Chapter 11 were also cross sectional studies.

Sample size

One of the most common questions asked of a statistician about design is the number of patients to include. It is an important question, because if a study is too small it will not be able to answer the question posed, and would be a waste of time and money. It could also be deemed unethical because patients may be put at risk with no apparent benefit. However, studies should not be too large because resources would be wasted if fewer patients would have sufficed. The sample size depends on four critical quantities: the type I and type II error rates α and β (discussed in Chapter 5), the variability of the data σ², and the effect size d. In a trial the effect size is the amount by which we would expect the two treatments to differ, or is the difference that would be clinically worthwhile.

Usually α and β are fixed at 5% and 20% (or 10%), respectively. A simple formula for a two group parallel trial with a continuous outcome is that the required sample size per group is given by n = 16σ²/d² for a two sided α of 5% and β of 20%. For example, in a trial to reduce blood pressure, if a clinically worthwhile effect for diastolic blood pressure is 5 mmHg and the between subjects standard deviation is 10 mmHg, we would require n = 16 x 100/25 = 64 patients per group in the study. The sample size goes up as the square of the standard deviation of the data (the variance) and goes down inversely as the square of the effect size. Doubling the effect size reduces the sample size by a factor of four – it is much easier to detect large effects! In practice, the sample size is often fixed by other criteria, such as finance or resources, and the formula is used to determine a realistic effect size. If this is too large, then the study will have to be abandoned or increased in size. Machin et al. give advice on sample size calculations for a wide variety of study designs.(6)
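The arithmetic above can be sketched in a few lines of Python. This is an illustrative helper of my own, not from the original text; the multiplier 16 corresponds to the conventional two sided α of 5% and β of 20%.

```python
import math

def per_group_n(sd, effect, multiplier=16):
    """Approximate sample size per group for a two group parallel trial.

    sd: between subjects standard deviation of the outcome
    effect: smallest clinically worthwhile difference d
    multiplier: 16 for two sided alpha of 5% and power of 80%
    """
    return math.ceil(multiplier * sd ** 2 / effect ** 2)

# Blood pressure example from the text: sd = 10 mmHg, effect = 5 mmHg
print(per_group_n(10, 5))  # 64 patients per group
```

Note how halving the worthwhile effect from 10 to 5 mmHg quadruples the requirement, as the text says.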

Choice of test

In terms of selecting a statistical test, the most important question is “what is the main study hypothesis?” In some cases there is no hypothesis; the investigator just wants to “see what is there”. For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, then there is no statistical test. It is important to decide a priori which hypotheses are confirmatory (that is, are testing some presupposed relationship), and which are exploratory (are suggested by the data). No single study can support a whole series of hypotheses.

A sensible plan is to limit severely the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as very tentative until confirmed by subsequent studies. A useful guide is to use a Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0.05/n. Thus if there were two independent hypotheses a result would be declared significant only if P<0.025. Note that, since tests are rarely independent, this is a very conservative procedure – one unlikely to reject the null hypothesis.
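As a trivial illustration (hypothetical code, not from the original text), the Bonferroni correction simply divides the significance level by the number of independent hypotheses:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per test significance level when testing n independent hypotheses."""
    return alpha / n_tests

# Two independent hypotheses: declare a result significant only if P < 0.025
print(bonferroni_threshold(0.05, 2))  # 0.025
```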

The investigator should then ask “are the data independent?” This can be difficult to decide but as a rule of thumb results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case control study in which the controls were matched to the cases by age, sex and social class, are not independent. It is generally true that the analysis should reflect the design, and so a matched design should be followed by a matched analysis. Results measured over time require special care.(7) One of the most common mistakes in statistical analysis is to treat dependent variables as independent. For example, suppose we were looking at treatment of leg ulcers, in which some people had an ulcer on each leg. We might have 20 subjects with 30 ulcers but the number of independent pieces of information is 20 because the state of an ulcer on one leg may influence the state of the ulcer on the other leg and an analysis that considered ulcers as independent observations would be incorrect. For a correct analysis of mixed paired and unpaired data consult a statistician.

The next question is “what types of data are being measured?” The test used should be determined by the data. The choice of test for matched or paired data is described in table 13.1 and for independent data in table 13.2.

It is helpful to decide the input variables and the outcome variables. For example, in a clinical trial the input variable is type of treatment – a nominal variable – and the outcome may be some clinical measure, perhaps Normally distributed. The required test is then the t test (table 13.2). However, if the input variable is continuous, say a clinical score, and the outcome is nominal, say cured or not cured, logistic regression is the required analysis. A t test in this case may help but would not give us what we require, namely the probability of a cure for a given value of the clinical score. As another example, suppose we have a cross sectional study in which we ask a random sample of people whether they think their general practitioner is doing a good job, on a five point scale, and we wish to ascertain whether women have a higher opinion of general practitioners than men have. The input variable is gender, which is nominal. The outcome variable is the five point ordinal scale. Each person’s opinion is independent of the others, so we have independent data. From table 13.2 we should use a χ² test for trend, or a Mann-Whitney U test (with correction for ties). Note, however, if some people share a general practitioner and others do not, then the data are not independent and a more sophisticated analysis is called for.

Note that these tables should be considered as guides only, and each case should be considered on its merits.

(a) If data are censored.

(b) The Kruskal-Wallis test is used for comparing ordinal or non-Normal variables for more than two groups, and is a generalisation of the Mann-Whitney U test. The technique is beyond the scope of this book, but is described in more advanced books and is available in common software (Epi-Info, Minitab, SPSS).

(c) Analysis of variance is a general technique, and one version (one way analysis of variance) is used to compare Normally distributed variables for more than two groups, and is the parametric equivalent of the Kruskal-Wallis test.

(d) If the outcome variable is the dependent variable, then provided the residuals (see ) are plausibly Normal, the distribution of the independent variable is not important.

(e) There are a number of more advanced techniques, such as Poisson regression, for dealing with these situations. However, they require certain assumptions and it is often easier to either dichotomise the outcome variable or treat it as continuous.

References

  1. Campbell MJ, Machin D. Medical Statistics: A Common-sense Approach, 2nd edn. Chichester: Wiley, 1993:2.
  2. Pocock SJ. Clinical Trials: A Practical Approach. Chichester: Wiley, 1982.
  3. Senn SJ. The Design and Analysis of Cross-Over Trials. Chichester: Wiley, 1992.
  4. Gardner MJ, Altman DG (eds). Statistics with Confidence. London: BMJ Publishing Group, 1989:103-5.
  5. Gardner MJ, Machin D, Campbell MJ. The use of checklists in assessing the statistical content of medical studies. BMJ 1986;292:810-12.
  6. Machin D, Campbell MJ, Fayers P, Pinol A. Statistical Tables for the Design of Clinical Studies. Oxford: Blackwell Scientific Publications, 1996.
  7. Matthews JNS, Altman DG, Campbell MJ, Royston JP. Analysis of serial measurements in medical research. BMJ 1990;300:230-5.
  8. Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall, 1991.
  9. Armitage P, Berry G. Statistical Methods in Medical Research. Oxford: Blackwell Scientific Publications, 1994.

Exercises

State the type of study described in each of the following.

13.1 To investigate the relationship between egg consumption and heart disease, a group of patients admitted to hospital with myocardial infarction were questioned about their egg consumption. A group of age and sex matched patients admitted to a fracture clinic were also questioned about their egg consumption using an identical protocol.

13.2 To investigate the relationship between certain solvents and cancer, all employees at a factory were questioned about their exposure to an industrial solvent, and the amount and length of exposure measured. These subjects were regularly monitored, and after 10 years a copy of the death certificate for all those who had died was obtained.

13.3 A survey was conducted of all nurses employed at a particular hospital. Among other questions, the questionnaire asked about the grade of the nurse and whether she was satisfied with her career prospects.

13.4 To evaluate a new back school, patients with lower back pain were randomly allocated to either the new school or to conventional occupational therapy. After 3 months they were questioned about their back pain, and observed lifting a weight by independent monitors.

13.5 A new triage system has been set up at the local Accident and Emergency Unit. To evaluate it the waiting times of patients were measured for 6 months and compared with the waiting times at a comparable nearby hospital.



12. Survival analysis

Survival analysis is concerned with studying the time between entry to a study and a subsequent event. Originally the analysis was concerned with time from treatment until death, hence the name, but survival analysis is applicable to many areas besides mortality. Recent examples include time to discontinuation of a contraceptive, maximum dose of bronchoconstrictor required to reduce a patient’s lung function to 80% of baseline, time taken to exercise to maximum tolerance, time that a transdermal patch can be left in place, and time for a leg fracture to heal.

When the outcome of a study is the time between one event and another, a number of problems can occur.

  1. The times are most unlikely to be Normally distributed.
  2. We cannot afford to wait until events have happened to all the subjects, for example until all are dead. Some patients might have left the study early – they are lost to follow up. Thus the only information we have about some patients is that they were still alive at the last follow up. These are termed censored observations.

Kaplan-Meier survival curve

We look at the data using a Kaplan-Meier survival curve. Suppose that the survival times, including censored observations, after entry into the study (ordered by increasing duration) of a group of n subjects are t₁, t₂, …, tₙ. The proportion of subjects, S(t), surviving beyond any follow up time t is estimated by

S(t) = [(r₁ − d₁)/r₁] x [(r₂ − d₂)/r₂] x … x [(rₚ − dₚ)/rₚ]

where tₚ is the largest survival time less than or equal to t, rᵢ is the number of subjects alive just before time tᵢ (the ith ordered survival time), and dᵢ denotes the number who died at time tᵢ, where i can be any value between 1 and p. For censored observations dᵢ = 0.

Method

Order the survival times by increasing duration, starting with the shortest one. At each event (i) work out the number alive immediately before the event, rᵢ. Before the first event all the patients are alive and so S(t) = 1. If we denote the start of the study as t₀, where t₀ = 0, then we have S(t₀) = 1. We can now calculate the survival probabilities S(tᵢ), for each value of i from 1 to n, by means of the following recurrence formula.

Given the number of events (deaths), dᵢ, at time tᵢ and the number alive, rᵢ, just before tᵢ, calculate

S(tᵢ) = S(tᵢ₋₁) x (rᵢ − dᵢ)/rᵢ

We do this only for the events and not for censored observations. The survival curve is unchanged at the time of a censored observation, but at the next event after the censored observation the number of people “at risk” is reduced by the number censored between the two events.
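The recurrence above can be sketched in Python. This is an illustrative implementation of my own, and the data at the end merely mimic the pattern described in the worked example below, not the actual Table 12.2 values.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: events is 1 for a death, 0 for a censored time.

    Returns a list of (time, S(t)) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)          # r_i: number alive just before time t_i
    s = 1.0                      # S(t0) = 1 at the start of the study
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        same = [e for tt, e in data if tt == t]
        d = sum(same)            # d_i: deaths at time t
        if d > 0:
            s *= (at_risk - d) / at_risk
            curve.append((t, s))
        # Censored observations reduce the risk set without changing S(t)
        at_risk -= len(same)
        i += len(same)
    return curve

# Illustrative data: 25 patients, two censored before 6 months, two deaths at
# 6 months, one censored at 9 months, two deaths at 10 months, rest censored.
times = [1, 5, 6, 6, 9, 10, 10] + [12] * 18
events = [0, 0, 1, 1, 0, 1, 1] + [0] * 18
print(kaplan_meier(times, events))  # S(6) = 21/23 = 0.913, S(10) = 0.913 x 0.90 = 0.822
```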

Example of calculation of survival curve

McIllmurray and Turkie(2) describe a clinical trial of 69 patients for the treatment of Dukes’ C colorectal cancer. The data for the two treatments, linoleic acid or control, are given in Table 12.1.(3)

The calculation of the Kaplan-Meier survival curve for the 25 patients randomly assigned to receive linoleic acid is described in Table 12.2. The + sign indicates censored data. Until 6 months after treatment there are no deaths, so S(t) = 1. The effect of the censoring is to remove from the alive group those that are censored. At time 6 months two subjects have been censored and so the number alive just before 6 months is 23. There are two deaths at 6 months.

Thus, S(6) = (23 − 2)/23 = 0.913.

We now reduce the number alive (“at risk”) by two. The censored event at 9 months reduces the “at risk” set to 20. At 10 months there are two deaths, so the proportion surviving is 18/20 = 0.90 and the cumulative proportion surviving is 0.913 x 0.90 = 0.8217. The cumulative survival is conveniently stored in the memory of a calculator. As one can see the effect of the censored observations is to reduce the number at risk without affecting the survival curve S(t).

Finally we plot the survival curve, as shown in figure 12.1. The censored observations are shown as ticks on the line.

Figure 12.1 Survival curve of 25 patients with Dukes’ C colorectal cancer treated with linoleic acid.

Log rank test

To compare two survival curves produced from two groups A and B we use the rather curiously named log rank test,(1) so called because it can be shown to be related to a test that uses the logarithms of the ranks of the data.

The assumptions used in this test are:

  1. That the survival times are ordinal or continuous.
  2. That the risk of an event in one group relative to the other does not change with time. Thus if linoleic acid reduces the risk of death in patients with colorectal cancer, then this risk reduction does not change with time (the so called proportional hazards assumption ).

We first order the data for the two groups combined, as shown in Table 12.3 . As for the Kaplan-Meier survival curve, we now consider each event in turn, starting at time t = 0.

At each event (death) at time tᵢ we consider the total number still alive and, of these, the number still alive in group A up to that point. If there were dᵢ events at time tᵢ then, under the null hypothesis, we consider what proportion of these would have been expected in group A: the expected number of deaths in group A is dᵢ multiplied by the proportion of those at risk who belong to group A. Clearly the more people at risk in one group the more deaths (under the null hypothesis) we would expect.

The effect of the censored observations is to reduce the numbers at risk, but they do not contribute to the expected numbers.
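The expected-number argument above can be sketched in Python (an illustrative implementation of my own). It returns observed and expected deaths per group and the simple chi squared statistic (O_A − E_A)²/E_A + (O_B − E_B)²/E_B; statistical packages usually compute a variance-based version of the log rank statistic, which gives slightly different values.

```python
def log_rank(times_a, events_a, times_b, events_b):
    """Simple log rank comparison of two groups (event = 1, censored = 0)."""
    groups = [list(zip(times_a, events_a)), list(zip(times_b, events_b))]
    event_times = sorted({t for g in groups for t, e in g if e == 1})
    obs = [0.0, 0.0]
    exp = [0.0, 0.0]
    for t in event_times:
        at_risk = [sum(1 for tt, _ in g if tt >= t) for g in groups]
        deaths = [sum(1 for tt, e in g if tt == t and e == 1) for g in groups]
        d, r = sum(deaths), sum(at_risk)
        for k in (0, 1):
            obs[k] += deaths[k]
            # Expected deaths in group k: total deaths at this time multiplied
            # by the proportion of those at risk who are in group k
            exp[k] += d * at_risk[k] / r
    chi2 = sum((obs[k] - exp[k]) ** 2 / exp[k] for k in (0, 1))
    return obs, exp, chi2

# Tiny made-up example: all deaths in group A occur earlier than in group B
print(log_rank([1, 2], [1, 1], [3, 4], [1, 1]))
```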

Further methods

In the same way that multiple regression is an extension of linear regression, an extension of the log rank test includes, for example, allowance for prognostic factors. This was developed by DR Cox, and so is called Cox regression. It is beyond the scope of this book, but is described elsewhere.(4, 5)

Common questions

Do I need to test for a constant relative risk before doing the log rank test?

This is a similar problem to testing for Normality for a t test. The log rank test is quite “robust” against departures from proportional hazards, but care should be taken. If the Kaplan-Meier survival curves cross then this is a clear departure from proportional hazards, and the log rank test should not be used. This can happen, for example, in a two drug trial for cancer, if one drug is very toxic initially but produces more long term cures. In this case there is no simple answer to the question “is one drug better than the other?”, because the answer depends on the time scale.

If I don’t have any censored observations, do I need to use survival analysis?

Not necessarily; you could use a rank test such as the Mann-Whitney U test, but the survival method would yield an estimate of risk, which is often required, and lends itself to a useful way of displaying the data.

References

  1. Peto R, Pike MC, Armitage P et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient: II. Analysis and examples. Br J Cancer 1977;35:1-39.
  2. McIllmurray MB, Turkie W. Controlled trial of linoleic acid in Dukes’ C colorectal cancer. BMJ 1987;294:1260, 295:475.
  3. Gardner MJ, Altman DG (eds). Statistics with Confidence: Confidence Intervals and Statistical Guidelines. London: BMJ Publishing Group, 1989: Chapter 7.
  4. Armitage P, Berry G. Statistical Methods in Medical Research. Oxford: Blackwell Scientific Publications, 1994:477-81.
  5. Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall, 1991.

Exercises

12.1 Twenty patients, ten of normal weight and ten severely overweight, underwent an exercise stress test, in which they had to lift a progressively increasing load for up to 12 minutes, but they were allowed to stop earlier if they could do no more. On two occasions the equipment failed before 12 minutes. The times (in minutes) achieved were:

Normal weight: 4, 10, 12*, 2, 8, 12*, 8**, 6, 9, 12*

Overweight: 7**, 5, 11, 6, 3, 9, 4, 1, 7, 12*

*Reached end of test; **equipment failure. (I am grateful to C Osmond for these data). What are the observed and expected values? What is the value of the log rank test to compare these groups?

12.2 What is the risk of stopping in the normal weight group compared with the overweight group, and a 95% confidence interval?



11. Correlation and regression

The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of wheeziness. However, in statistical terms we use correlation to denote association between two quantitative variables. We also assume that the association is linear – that is, that one variable increases or decreases a fixed amount for a unit increase or decrease in the other. The other technique that is often used in these circumstances is regression, which involves estimating the best straight line to summarise the association.

Correlation coefficient

The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes called Pearson’s correlation coefficient after its originator and is a measure of linear association. If a curved line is needed to express the relationship, other and more complicated measures of the correlation must be used.

The correlation coefficient is measured on a scale that varies from +1 through 0 to −1. Complete correlation between two variables is expressed by either +1 or −1. When one variable increases as the other increases the correlation is positive; when one decreases as the other increases it is negative. Complete absence of correlation is represented by 0. Figure 11.1 gives some graphical representations of correlation.

Figure 11.1 Correlation illustrated.

Looking at data: scatter diagrams

When an investigator has collected two series of observations and wishes to see whether there is a relationship between them, he or she should first construct a scatter diagram. The vertical scale represents one set of measurements and the horizontal scale the other. If one set of observations consists of experimental results and the other consists of a time scale or observed classification of some kind, it is usual to put the experimental results on the vertical axis. These represent what is called the “dependent variable”. The “independent variable”, such as time or height or some other observed classification, is measured along the horizontal axis, or baseline.

The words “independent” and “dependent” could puzzle the beginner because it is sometimes not clear what is dependent on what. This confusion is a triumph of common sense over misleading terminology, because often each variable is dependent on some third variable, which may or may not be mentioned. It is reasonable, for instance, to think of the height of children as dependent on age rather than the converse, but consider a positive correlation between mean tar yield and nicotine yield of certain brands of cigarette. The nicotine liberated is unlikely to have its origin in the tar: both vary in parallel with some other factor or factors in the composition of the cigarettes. The yield of the one does not seem to be “dependent” on the other in the sense that, on average, the height of a child depends on his age. In such cases it often does not matter which scale is put on which axis of the scatter diagram. However, if the intention is to make inferences about one variable from the other, the observations from which the inferences are to be made are usually put on the baseline. As a further example, a plot of monthly deaths from heart disease against monthly sales of ice cream would show a negative association. However, it is hardly likely that eating ice cream protects from heart disease! It is simply that the mortality rate from heart disease is inversely related – and ice cream consumption positively related – to a third factor, namely environmental temperature.

Calculation of the correlation coefficient

A paediatric registrar has measured the pulmonary anatomical dead space (in ml) and height (in cm) of 15 children. The data are given in table 11.1 and the scatter diagram is shown in figure 11.2. Each dot represents one child, and it is placed at the point corresponding to the measurement of the height (horizontal axis) and the dead space (vertical axis). The registrar now inspects the pattern to see whether it seems likely that the area covered by the dots centres on a straight line or whether a curved line is needed. In this case the paediatrician decides that a straight line can adequately describe the general trend of the dots. His next step will therefore be to calculate the correlation coefficient.

When making the scatter diagram (figure 11.2 ) to show the heights and pulmonary anatomical dead spaces in the 15 children, the paediatrician set out figures as in columns (1), (2), and (3) of table 11.1 . It is helpful to arrange the observations in serial order of the independent variable when one of the two variables is clearly identifiable as independent. The corresponding figures for the dependent variable can then be examined in relation to the increasing series for the independent variable. In this way we get the same picture, but in numerical form, as appears in the scatter diagram.

Figure 11.2 Scatter diagram of relation in 15 children between height and pulmonary anatomical dead space.

The calculation of the correlation coefficient is as follows, with x representing the values of the independent variable (in this case height) and y representing the values of the dependent variable (in this case anatomical dead space). The formula to be used is:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² x Σ(y − ȳ)²]

which can be shown to be equal to:

r = (Σxy − n x̄ ȳ) / [(n − 1) SD(x) SD(y)]

Calculator procedure

Find the mean and standard deviation of x, as described earlier.

Find the mean and standard deviation of y.

Subtract 1 from n and multiply by SD(x) and SD(y): (n − 1)SD(x)SD(y).

This gives us the denominator of the formula. (Remember to exit from “Stat” mode.)

For the numerator multiply each value of x by the corresponding value of y, add these values together and store them.

110 x 44 = Min

116 x 31 = M+

etc.

This stores Σxy in memory. Subtract n x̄ ȳ:

MR − 15 x 144.6 x 66.93 (= 5426.6)

Finally divide the numerator by the denominator.

r = 5426.6/6412.0609 = 0.846.
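The two algebraically equivalent forms of the formula can be checked against each other in Python. This is illustrative code of my own with made-up data, not the registrar's full table.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation via the definitional formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den

def pearson_r_calc(x, y):
    """The calculator form: (sum xy - n*mean(x)*mean(y)) / ((n-1) SD(x) SD(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sdy = sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return (sxy - n * mx * my) / ((n - 1) * sdx * sdy)

# Made-up heights (cm) and dead spaces (ml); both forms agree
x = [110, 116, 128, 135, 144]
y = [44, 31, 45, 53, 56]
print(pearson_r(x, y), pearson_r_calc(x, y))
```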

The correlation coefficient of 0.846 indicates a strong positive correlation between size of pulmonary anatomical dead space and height of child. But in interpreting correlation it is important to remember that correlation is not causation. There may or may not be a causative connection between the two correlated variables. Moreover, if there is a connection it may be indirect.

A part of the variation in one of the variables (as measured by its variance) can be thought of as being due to its relationship with the other variable and another part as due to undetermined (often “random”) causes. The part due to the dependence of one variable on the other is measured by r², the square of the correlation coefficient. For these data r² = 0.716, so we can say that 72% of the variation between children in size of the anatomical dead space is accounted for by the height of the child. If we wish to label the strength of the association, for absolute values of r, 0-0.19 is regarded as very weak, 0.2-0.39 as weak, 0.40-0.59 as moderate, 0.6-0.79 as strong and 0.8-1 as very strong correlation, but these are rather arbitrary limits, and the context of the results should be considered.

Significance test

To test whether the association is merely apparent, and might have arisen by chance, use the t test in the following calculation:

t = r √[(n − 2)/(1 − r²)]     (11.1)

The t table (Appendix Table B) is entered at n − 2 degrees of freedom.

For example, the correlation coefficient for these data was 0.846.

The number of pairs of observations was 15. Applying equation 11.1, we have:

t = 0.846 x √[13/(1 − 0.716)] = 5.72

Entering table B at 15 − 2 = 13 degrees of freedom we find that for t = 5.72, P<0.001, so the correlation coefficient may be regarded as highly significant. Thus (as could be seen immediately from the scatter plot) we have a very strong correlation between dead space and height which is most unlikely to have arisen by chance.
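Equation 11.1 is easy to check numerically (a hypothetical helper of my own):

```python
from math import sqrt

def t_from_r(r, n):
    """t statistic for testing a correlation r against zero, on n - 2 df."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Values from the text: r = 0.846 from n = 15 pairs of observations
print(round(t_from_r(0.846, 15), 2))  # 5.72
```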

The assumptions governing this test are:

  1. That both variables are plausibly Normally distributed.
  2. That there is a linear relationship between them.
  3. The null hypothesis is that there is no association between them.

The test should not be used for comparing two methods of measuring the same quantity, such as two methods of measuring peak expiratory flow rate. Its use in this way appears to be a common mistake, with a significant result being interpreted as meaning that one method is equivalent to the other. The reasons have been extensively discussed(2) but it is worth recalling that a significant result tells us little about the strength of a relationship. From the formula it should be clear that even with a very weak relationship (say r = 0.1) we would get a significant result with a large enough sample (say n over 1000).

Spearman rank correlation

A plot of the data may reveal outlying points well away from the main body of the data, which could unduly influence the calculation of the correlation coefficient. Alternatively the variables may be quantitative discrete such as a mole count, or ordered categorical such as a pain score. A non-parametric procedure, due to Spearman, is to replace the observations by their ranks in the calculation of the correlation coefficient.

This results in a simple formula for Spearman's rank correlation coefficient, rs:

rs = 1 – 6Σd²/(n³ – n)

where d is the difference in the ranks of the two variables for a given individual, and n is the number of pairs. Thus we can derive table 11.2 from the data in table 11.1.

Substituting into this formula gives the value of the rank correlation coefficient.

In this case the value is very close to that of the Pearson correlation coefficient. For n> 10, the Spearman rank correlation coefficient can be tested for significance using the t test given earlier.
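The rank-and-correlate procedure can be sketched as follows (an illustrative sketch; ties are given midranks, and with no ties the result agrees with the 1 – 6Σd²/(n³ – n) formula):

```python
def midranks(values):
    """Rank the values, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rank correlation: Pearson's formula applied to the ranks."""
    rx, ry = midranks(x), midranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical example: any perfectly monotonic relationship gives rs = 1
print(spearman([1, 2, 3, 4, 5], [2, 4, 8, 16, 32]))  # 1.0
```

Because only the ranks enter the calculation, outlying points have far less influence than they do on the Pearson coefficient.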

The regression equation

Correlation describes the strength of an association between two variables, and is completely symmetrical: the correlation between A and B is the same as the correlation between B and A. However, if the two variables are related it means that when one changes by a certain amount the other changes on average by a certain amount. For instance, in the children described earlier greater height is associated, on average, with greater anatomical dead space. If y represents the dependent variable and x the independent variable, this relationship is described as the regression of y on x.

The relationship can be represented by a simple equation called the regression equation. In this context “regression” (the term is a historical anomaly) simply means that the average value of y is a “function” of x, that is, it changes with x.

The regression equation representing how much y changes with any given change of x can be used to construct a regression line on a scatter diagram, and in the simplest case this is assumed to be a straight line. The direction in which the line slopes depends on whether the correlation is positive or negative. When the two sets of observations increase or decrease together (positive) the line slopes upwards from left to right; when one set decreases as the other increases the line slopes downwards from left to right. As the line must be straight, it will probably pass through few, if any, of the dots. Given that the association is well described by a straight line, we have to define two features of the line if we are to place it correctly on the diagram. The first of these is its distance above the baseline; the second is its slope. They are expressed in the following regression equation:

y = α + βx

With this equation we can find a series of values of y, the dependent variable, that correspond to each of a series of values of x, the independent variable. The parameters α and β have to be estimated from the data. The parameter α signifies the distance above the baseline at which the regression line cuts the vertical (y) axis; that is, the value of y when x = 0. The parameter β (the regression coefficient) signifies the amount by which a change in x must be multiplied to give the corresponding average change in y, or the amount y changes for a unit increase in x. In this way it represents the degree to which the line slopes upwards or downwards.

The regression equation is often more useful than the correlation coefficient. It enables us to predict y from x and gives us a better summary of the relationship between the two variables. If, for a particular value of x, xᵢ, the regression equation predicts a value ŷᵢ ("y fitted"), the prediction error is yᵢ – ŷᵢ. It can easily be shown that any straight line passing through the mean values x̄ and ȳ will give a total prediction error of zero, because the positive and negative terms exactly cancel. To remove the negative signs we square the differences, and the regression line is chosen to minimise the sum of squares of the prediction errors, Σ(yᵢ – ŷᵢ)². We denote the sample estimates of α and β by a and b. It can be shown that the one straight line that minimises this sum of squares, the least squares estimate, is given by

b = Σ(xᵢ – x̄)(yᵢ – ȳ)/Σ(xᵢ – x̄)²   (11.2)

and

a = ȳ – bx̄

which is of use because we have calculated all the components of equation (11.2) in the calculation of the correlation coefficient.

The calculation of the correlation coefficient on the data in table 11.2 gave the following:

Applying these figures to the formulae for the regression coefficients, we have:

Therefore, in this case, the equation for the regression of y on x becomes

y = 1.033x – 82.4

This means that, on average, for every increase in height of 1 cm the increase in anatomical dead space is 1.033 ml over the range of measurements made.

The line representing the equation is shown superimposed on the scatter diagram of the data in figure 11.2. The way to draw the line is to take three values of x, one on the left side of the scatter diagram, one in the middle and one on the right, and substitute these in the equation, as follows:

If x = 110, y = (1.033 x 110) – 82.4 = 31.2

If x = 140, y = (1.033 x 140) – 82.4 = 62.2

If x = 170, y = (1.033 x 170) – 82.4 = 93.2

Although two points are enough to define the line, three are better as a check. Having put them on a scatter diagram, we simply draw the line through them.
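The least squares estimates themselves can be computed directly (an illustrative sketch; the function name is ours). As a check, feeding in the three points just computed from the text's equation recovers b = 1.033 and a = –82.4:

```python
def least_squares(x, y):
    """Least squares estimates of the line y = a + b*x.

    b = sum((x - xbar)*(y - ybar)) / sum((x - xbar)**2),  a = ybar - b*xbar
    """
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# The three points used in the text to draw the line
a, b = least_squares([110, 140, 170], [31.23, 62.22, 93.21])
print(round(b, 3), round(a, 1))  # 1.033 -82.4
```

Because these three points lie exactly on the fitted line, the regression recovers the slope and intercept exactly; with the raw data of table 11.1 the same function gives the estimates reported in the text.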

Figure 11.3 Regression line drawn on scatter diagram relating height and pulmonary anatomical dead space in 15 children

The standard error of the slope SE(b) is given by:

SE(b) = s_res/√(Σ(xᵢ – x̄)²)   (11.3)

where s_res is the residual standard deviation, given by:

s_res = √(Σ(yᵢ – ŷᵢ)²/(n – 2))

This can be shown to be algebraically equal to

s_res = √((1 – r²)Σ(yᵢ – ȳ)²/(n – 2))

We already have to hand all of the terms in this expression. Thus s_res is 13.08445, and the denominator of (11.3), √(Σ(xᵢ – x̄)²), is 72.4680. Thus SE(b) = 13.08445/72.4680 = 0.18055.

We can test whether the slope is significantly different from zero by:

t = b/SE(b) = 1.033/0.18055 = 5.72.

Again, this has n – 2 = 15 – 2 = 13 degrees of freedom. The assumptions governing this test are:

  1. That the prediction errors are approximately Normally distributed. Note this does not mean that the x or y variables have to be Normally distributed.
  2. That the relationship between the two variables is linear.
  3. That the scatter of points about the line is approximately constant – we would not wish the variability of the dependent variable to be growing as the independent variable increases. If this is the case try taking logarithms of both the x and y variables.

Note that the test of significance for the slope gives exactly the same value of P as the test of significance for the correlation coefficient. Although the two tests are derived differently, they are algebraically equivalent, which makes intuitive sense.

We can obtain a 95% confidence interval for b from

b – t × SE(b) to b + t × SE(b)

where the t statistic has 13 degrees of freedom and its two sided 5% point, from Appendix Table B.pdf, is 2.160.

Thus the 95% confidence interval is

1.033 – 2.160 × 0.18055 to 1.033 + 2.160 × 0.18055 = 0.643 to 1.422.
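These standard error and confidence interval computations can be sketched in code (an illustrative sketch; the tabulated t value is passed in, e.g. 2.160 for 13 degrees of freedom, since the function does not look up Appendix Table B):

```python
import math

def slope_inference(x, y, t_crit):
    """Slope b, its standard error, and the interval b +/- t_crit * SE(b)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    # residual standard deviation: sqrt(residual sum of squares / (n - 2))
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_res = math.sqrt(rss / (n - 2))
    se_b = s_res / math.sqrt(sxx)
    return b, se_b, b - t_crit * se_b, b + t_crit * se_b
```

Applied to the 15 data pairs of table 11.1 (not reproduced here), this yields SE(b) = 0.18055 and hence the interval 0.643 to 1.422, as in the text.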

Regression lines give us useful information about the data they are collected from. They show how one variable changes on average with another, and they can be used to find out what one variable is likely to be when we know the other – provided that we ask this question within the limits of the scatter diagram. To project the line at either end – to extrapolate – is always risky because the relationship between x and y may change or some kind of cut off point may exist. For instance, a regression line might be drawn relating the chronological age of some children to their bone age, and it might be a straight line between, say, the ages of 5 and 10 years, but to project it up to the age of 30 would clearly lead to error. Computer packages will often produce the intercept from a regression equation, with no warning that it may be totally meaningless. Consider a regression of blood pressure against age in middle aged men. The regression coefficient is often positive, indicating that blood pressure increases with age. The intercept is often close to zero, but it would be wrong to conclude that this is a reliable estimate of the blood pressure in newly born male infants!

More advanced methods

More than one independent variable is possible – in such a case the method is known as multiple regression.(3,4) This is the most versatile of statistical methods and can be used in many situations. Examples include: allowing for more than one predictor (age as well as height in the above example); allowing for covariates – in a clinical trial the dependent variable may be outcome after treatment, the first independent variable can be binary (0 for placebo and 1 for active treatment) and the second independent variable may be a baseline variable, measured before treatment but likely to affect outcome.

Common questions

If two variables are correlated are they causally related?

It is a common error to confuse correlation and causation. All that correlation shows is that the two variables are associated. There may be a third variable, a confounding variable that is related to both of them. For example, monthly deaths by drowning and monthly sales of ice-cream are positively correlated, but no-one would say the relationship was causal!

How do I test the assumptions underlying linear regression?

Firstly, always look at the scatter plot and ask: is it linear? Having obtained the regression equation, calculate the residuals eᵢ = yᵢ – ŷᵢ. A histogram of the residuals will reveal departures from Normality, and a plot of the residuals against the fitted values ŷᵢ will reveal whether the residuals increase in size as ŷᵢ increases.
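Those residual checks can be sketched as (a minimal illustration; the plotting itself is left to whatever graphics tool is to hand):

```python
def residuals(x, y, a, b):
    """Fitted values a + b*x_i and residuals e_i = y_i - (a + b*x_i).

    Plot a histogram of the residuals to check Normality, and plot them
    against the fitted values to check that the scatter is constant.
    """
    fitted = [a + b * xi for xi in x]
    return fitted, [yi - fi for yi, fi in zip(y, fitted)]
```

For data lying exactly on the line the residuals are all zero; in practice one looks for symmetry about zero and a constant spread.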

References

  1. Russell MAH, Cole PY, Idle MS, Adams L. Carbon monoxide yields of cigarettes and their relation to nicotine yield and type of filter. BMJ 1975; 3:713.
  2. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; i:307-10.
  3. Brown RA, Swanson-Beck J. Medical Statistics on Personal Computers , 2nd edn. London: BMJ Publishing Group, 1993.
  4. Armitage P, Berry G. In: Statistical Methods in Medical Research , 3rd edn. Oxford: Blackwell Scientific Publications, 1994:312-41.

Exercises

11.1 A study was carried out into the attendance rate at a hospital of people in 16 different geographical areas, over a fixed period of time. The distance of the centre from the hospital of each area was measured in miles. The results were as follows:

(1) 21%, 6.8; (2) 12%, 10.3; (3) 30%, 1.7; (4) 8%, 14.2; (5) 10%, 8.8; (6) 26%, 5.8; (7) 42%, 2.1; (8) 31%, 3.3; (9) 21%, 4.3; (10) 15%, 9.0; (11) 19%, 3.2; (12) 6%, 12.7; (13) 18%, 8.2; (14) 12%, 7.0; (15) 23%, 5.1; (16) 34%, 4.1.

What is the correlation coefficient between the attendance rate and mean distance of the geographical area?

11.2 Find the Spearman rank correlation for the data given in 11.1.

11.3 If the values of x from the data in 11.1 represent mean distance of the area from the hospital and values of y represent attendance rates, what is the equation for the regression of y on x? What does it mean?

11.4 Find the standard error and 95% confidence interval for the slope

Answers to exercises Ch 11.pdf


10. Rank score tests

Population distributions are characterised, or defined, by parameters such as the mean and standard deviation. For skew distributions we would need to know other parameters such as the degree of skewness before the distribution could be identified uniquely, but the mean and standard deviation identify the Normal distribution uniquely. The t test described earlier depends for its validity on an assumption that the data originate from a Normally distributed population and, when two groups are compared, that the two samples differ only in their mean value, not in their variability. However, if we are concerned that the data do not originate from a Normally distributed population, there are tests available which do not make use of this assumption. Because the data are no longer assumed to follow a distribution characterised by a few parameters, the tests are often called "non-parametric". This is somewhat of a misnomer because, as we shall see, to be able to say anything useful about the population we must compare parameters. As was mentioned in Chapter 5, if the sample sizes in both groups are large, lack of Normality is of less concern, and the large sample tests described in that chapter would apply.

Wilcoxon signed rank sum test

Wilcoxon, and Mann and Whitney, described rank sum tests which have been shown to be equivalent. Convention has now ascribed the Wilcoxon test to paired data and the Mann-Whitney U test to unpaired data.

Boogert et al (1) (data also given in Shott (2)) used ultrasound to record fetal movements before and after chorionic villus sampling. The percentage of time the fetus spent moving is given in table 10.1 for ten pregnant women.

If we are concerned that the differences in percentage of time spent moving are unlikely to be Normally distributed we could use the Wilcoxon signed rank test using the following assumptions:

  1. The paired differences are independent.
  2. The differences come from a symmetrical distribution.

We do not need to perform a test to ensure that the differences come from a symmetrical distribution: an "eyeball" test will suffice. A plot of the differences in column (4) of table 10.1 is given in figure 10.1, and shows that the distribution of the differences is plausibly symmetrical. The differences are then ranked in column (5) (negative values are ignored and zero values omitted). When two or more differences are identical each is allotted the point half way between the ranks they would fill if distinct, irrespective of the plus or minus sign. For instance, the differences of -1 (patient 6) and +1 (patient 9) fill ranks 1 and 2. As (1 + 2)/2 = 1.5, they are allotted rank 1.5. In column (6) the ranks are repeated from column (5), but to each is attached the sign of the difference from column (4). A useful check is that the sum of the ranks must add to n(n + 1)/2. In this case 10(10 + 1)/2 = 55.

The numbers representing the positive ranks and the negative ranks in column (6) are added up separately and only the smaller of the two totals is used. Irrespective of its sign, the total is referred to Appendix Table D.pdf

against the number of pairs used in the investigation. Rank totals larger than those in the table are nonsignificant at the level of probability shown. In this case the smaller of the rank totals is 23.5. This is larger than the number (8) given for ten pairs in table D and so the result is not significant. A confidence interval for the median difference is described by Campbell and Gardner (3) and by Gardner and Altman (4), and is easily obtained from the programs CIA (5) or MINITAB.(6) The median difference is zero. CIA gives the 95% confidence interval as -2.50 to 4.00. This is quite narrow and so from this small study we can conclude that we have little evidence that chorionic villus sampling alters the movement of the fetus.

Note, perhaps contrary to intuition, that the Wilcoxon test, although a rank test, may give a different value if the data are transformed, say by taking logarithms. Thus it may be worth plotting the distribution of the differences for a number of transformations to see if they make the distribution appear more symmetrical.
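The ranking steps above can be sketched in code (an illustrative sketch only; significance still has to be judged from Appendix Table D):

```python
def wilcoxon_rank_sums(before, after):
    """Positive and negative rank sums for paired differences.

    Zero differences are dropped; tied absolute differences share the
    average (mid) rank. The smaller of the two sums is the one referred
    to the table of critical values.
    """
    diffs = [b - a for a, b in zip(before, after) if b != a]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    rank = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            rank[order[k]] = avg
        i = j + 1
    pos = sum(r for r, d in zip(rank, diffs) if d > 0)
    neg = sum(r for r, d in zip(rank, diffs) if d < 0)
    return pos, neg
```

As in the text, a useful check is that the two sums together equal n(n + 1)/2, where n counts the non-zero differences.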

Unpaired samples

A senior registrar in the rheumatology clinic of a district hospital has designed a clinical trial of a new drug for rheumatoid arthritis.

Twenty patients were randomised into two groups of ten to receive either the standard therapy A or a new treatment, B. The plasma globulin fractions after treatment are listed in table 10.2

We wish to test whether the new treatment has changed the plasma globulin, and we are worried about the assumption of Normality.

The first step is to plot the data (see fig 10.2).

The clinician was concerned about the lack of Normality of the underlying distribution of the data and so decided to use a nonparametric test. The appropriate test is the Mann-Whitney U test and is computed as follows.

The observations in the two samples are combined into a single series and ranked in order but in the ranking the figures from one sample must be distinguished from those of the other. The data appear as set out in table 10.3 . To save space they have been set out in two columns, but a single ranking is done. The figures for sample B are set in bold type. Again the sum of the ranks is n(n + 1)/2.

Totals of ranks: sample A, 81.5; sample B, 128.5

The ranks for the two samples are now added separately, and the smaller total is used. It is referred to Appendix Table E.pdf, with n1 equal to the number of observations in one sample and n2 equal to the number of observations in the other sample. In this case they both equal 10. At n1 = 10 and n2 = 10 the upper part of the table shows the figure 78. The smaller total of the ranks is 81.5. Since this is slightly larger than 78 it does not reach the 5% level of probability. The result is therefore not significant at that level. In the lower part of the table, which gives the figures for the 1% level of probability, the figure for n1 = 10 and n2 = 10 is 71. As expected, the result is further from that than from the 5% figure of 78.

To calculate a meaningful confidence interval we assume that if the two samples come from different populations the distributions of these populations differ only in that one appears shifted to the left or right of the other. This means, for example, that we do not expect one sample to be strongly right skewed and one to be strongly left skewed. If the assumption is reasonable then a confidence interval for the median difference can be calculated.(3,4) Note that the computer program does not calculate the difference in medians, but rather the median of all possible differences between the two samples. This is usually close to the difference in medians and has theoretical advantages. From CIA we find that the difference in medians is -5.5 and the approximate 95% confidence interval is -10 to 1.0. As might be expected from the significance test this interval includes zero. Although this result is not significant it would be unwise to conclude that there is no evidence that treatments A and B differ, because the confidence interval is quite wide. This suggests that a larger study should be planned.

If the two samples are of unequal size a further calculation is needed after the ranking has been carried out as in table 10.3 .

Let n1 = the number of patients or objects in the smaller sample and T1 = the total of the ranks for that sample. Let n2 = the number of patients or objects in the larger sample. Then calculate T2 from the following formula:

T2 = n1(n1 + n2 + 1) – T1

Finally enter table E with the smaller of T1 or T2.

As before, only totals smaller than the critical points in table E are significant. See Exercise 10.2 for an example of this method.

If there are only a few ties, that is if two or more values in the data are equal (say less than 10% of the data), then for sample sizes outside the range of table E we can calculate

z = (T1 – n1(n1 + n2 + 1)/2)/√(n1 n2 (n1 + n2 + 1)/12)

On the null hypothesis that the two samples come from the same population, z is approximately Normally distributed, with mean zero and standard deviation one, and can be referred to Appendix table A.pdf to calculate the P value.

From the data of table 10.2 we obtain

z = (81.5 – 105)/√(10 × 10 × 21/12) = -23.5/13.23 = -1.78

and from Appendix table A.pdf we find that P is about 0.075, which corroborates the earlier result.
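The whole procedure – combined ranking with midranks for ties, rank sums, and the Normal approximation – can be sketched as (an illustrative sketch; names are ours):

```python
import math

def mann_whitney(sample_a, sample_b):
    """Rank sums T1, T2 from a single combined ranking (midranks for ties),
    and the Normal approximation z based on T1."""
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # midrank for this group of tied values
        for k in range(i, j + 1):
            rank[k] = avg
        i = j + 1
    t1 = sum(r for r, (_, g) in zip(rank, combined) if g == 0)
    t2 = sum(r for r, (_, g) in zip(rank, combined) if g == 1)
    n1, n2 = len(sample_a), len(sample_b)
    mean = n1 * (n1 + n2 + 1) / 2  # expected rank sum under the null
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return t1, t2, (t1 - mean) / sd
```

As a check, T1 + T2 always equals n(n + 1)/2 for the combined sample size n.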

The advantages of these tests based on ranking are that they can be safely used on data that are not at all Normally distributed, that they are quick to carry out, and that no calculator is needed. Non-Normally distributed data can sometimes be transformed by the use of logarithms or some other method to make them Normally distributed, and a t test performed. Consequently the best procedure to adopt may require careful thought. The extent and nature of the difference between two samples is often brought out more clearly by standard deviations and t tests than by non-parametric tests.

Common questions

Non-parametric tests are valid for both non-Normally distributed data and Normally distributed data, so why not use them all the time?

It would seem prudent to use non-parametric tests in all cases, which would save one the bother of testing for Normality. Parametric tests are preferred, however, for the following reasons:

  1. As I have tried to emphasise in this book, we are rarely interested in a significance test alone; we would like to say something about the population from which the samples came, and this is best done with estimates of parameters and confidence intervals.
  2. It is difficult to do flexible modelling with non-parametric tests, for example allowing for confounding factors using multiple regression (see Chapter 11).

Do non-parametric tests compare medians?

It is a commonly held belief that a Mann-Whitney U test is in fact a test for differences in medians. However, two groups could have the same median and yet have a significant Mann-Whitney U test. Consider the following data for two groups, each with 100 observations. Group 1: 98 values of 0, one 1 and one 2; Group 2: 51 values of 0, one 1 and 48 values of 2. The median in both cases is 0, but from the Mann-Whitney test P < 0.0001.

Only if we are prepared to make the additional assumption that the difference in the two groups is simply a shift in location (that is, the distribution of the data in one group is simply shifted by a fixed amount from the other) can we say that the test is a test of the difference in medians. However, if the groups have the same distribution, then a shift in location will move medians and means by the same amount and so the difference in medians is the same as the difference in means. Thus the Mann-Whitney U test is also a test for the difference in means.

How is the Mann- Whitney U test related to the t test?

If one were to input the ranks of the data rather than the data themselves into a two sample t test program, the P value obtained would be very close to that produced by a Mann-Whitney U test.

References

  1. Boogert A, Manhigh A, Visser GHA. The immediate effects of chorionic villus sampling on fetal movements. Am J Obstet Gynecol 1987; 157:137-9.
  2. Shott S. Statistics for Health Professionals. Philadelphia: WB Saunders, 1990.
  3. Campbell MJ, Gardner MJ. Calculating confidence intervals for some non-parametric analyses. BMJ 1988; 296:1369-71.
  4. Gardner MJ, Altman DG. Statistics with Confidence . Confidence Intervals and Statistical Guidelines. London: BMJ Publishing Group, 1989.
  5. Gardner MJ, Gardner SB, Winter PD. CIA (Confidence Interval Analysis) . London: BMJ Publishing Group, 1989.
  6. Ryan BF, Joiner BL, Ryan TA. Minitab Handbook , 2nd ed. Boston: Duxbury Press, 1985.

Exercises

10.1 A new treatment in the form of tablets for the prophylaxis of migraine has been introduced, to be taken before an impending attack. Twelve patients agree to try this remedy in addition to the usual general measures they take, subject to advice from their doctor on the taking of analgesics also.

A crossover trial with identical placebo tablets is carried out over a period of 8 months. The numbers of attacks experienced by each patient on, first, the new treatment and, secondly, the placebo were as follows: patient (1) 4 and 2; patient (2) 12 and 6; patient (3) 6 and 6; patient (4) 3 and 5; patient (5)15 and 9; patient (6) 10 and 11; patient (7) 2 and 4; patient (8) 5 and 6; patient (9)11 and 3; patient (10) 4 and 7; patient (11) 6 and 0; patient (12) 2 and 5. In a Wilcoxon rank sum test what is the smaller total of ranks? Is it significant at the 5% level?

10.2 Another doctor carried out a similar pilot study with this preparation on 12 patients, giving the same placebo to ten other patients. The numbers of migraine attacks experienced by the patients over a period of 6 months were as follows.

Group receiving new preparation: patient (1) 8; (2) 6; (3) 0; (4) 3; (5) 14; (6) 5; (7) 11; (8) 2

Group receiving placebo: patient (9) 7; (10) 10; (11) 4; (12) 11; (13) 2; (14) 8; (15) 8; (16) 6; (17)1; (18) 5.

In a Mann-Whitney two sample test what is the smaller total of ranks? Which sample of patients provides it? Is the difference significant at the 5% level?

Answers Ch 10.pdf


9. Exact probability test

Sometimes in a comparison of the frequency of observations in a fourfold table the numbers are too small for the χ² test (Chapter 8). The exact probability test devised by Fisher, Irwin, and Yates (1) provides a way out of the difficulty. Tables based on it have been published – for example by Geigy (2) – showing levels at which the null hypothesis can be rejected. The method will be described here because, with the aid of a calculator, the exact probability is easily computed.

Consider the following circumstances. Some soldiers are being trained as parachutists. One rather windy afternoon 55 practice jumps take place at two localities, dropping zone A and dropping zone B. Of 15 men who jump at dropping zone A, five suffer sprained ankles, and of 40 who jump at dropping zone B, two suffer this injury. The casualty rate at dropping zone A seems unduly high, so the medical officer in charge decides to investigate the disparity. Is it a difference that might be expected by chance? If not it deserves deeper study. The figures are set out in table 9.1. The null hypothesis is that the probability of injury is the same at each dropping zone, and that the observed difference in the proportions of injured men arose by chance.

Table showing numbers of men injured and uninjured in parachute training at two dropping zones

The method to be described tests the exact probability of observing the particular set of frequencies in the table if the marginal totals (that is, the totals in the last row and column) are kept at their present values. But to the probability of getting this particular set of frequencies we have to add the probability of getting a set of frequencies showing greater disparity between the two dropping zones. This is because we are concerned to know the probability not only of the observed figures but also of the more extreme cases. This may seem obscure, but it ties in with the idea of calculating tail areas in the continuous case.

For convenience of computation the table is changed round to get the smallest number in the top left hand cell. We therefore begin by constructing table 9.2 from table 9.1 by transposing the upper and lower rows.

Table 9.1
                  Injured  Uninjured  Total
Dropping zone A      5        10        15
Dropping zone B      2        38        40
Total                7        48        55

Table 9.2
                  Injured  Uninjured  Total
Dropping zone B    2 (a)    38 (b)      40
Dropping zone A    5 (c)    10 (d)      15
Total                7        48        55

The exact probability for any table is now determined from the following formula:

P = (a + b)! (c + d)! (a + c)! (b + d)! / (N! a! b! c! d!)

where a, b, c and d are the four cell frequencies and N is the grand total.

The exclamation mark denotes “factorial” and means successive multiplication by cardinal numbers in descending series; for example 4! means 4 x 3 x 2 x 1. By convention 0! = 1. Factorial functions are available on most calculators, but care is needed not to exceed the maximum number available on the calculator. Generally factorials can be cancelled out for easy computation on a calculator (see below).

With this formula we have to find the probability attached to the observations in table 9.1 , which is equivalent to table 9.2 , and is denoted by set 2 in table 9.3 . We also have to find the probabilities attached to the more extreme cases. If ad-bc is negative, then the extreme cases are obtained by progressively decreasing cells a and d and increasing b and c by the same amount. If ad – bc is positive, then progressively increase cells a and d and decrease b and c by the same amount.(3) For table 9.2 ad – bc is negative and so the more extreme cases are sets 0 and 1.

The best way of doing this is to start with set 0. Call the probability attached to this set p0. Then, applying the formula, we get:

p0 = (40! × 15! × 7! × 48!)/(55! × 0! × 40! × 7! × 8!)

This cancels down to

p0 = (15! × 48!)/(55! × 8!)

For computation on a calculator the factorials can be cancelled out further by removing 8! from 15! and 48! from 55! to give

p0 = (15 × 14 × 13 × 12 × 11 × 10 × 9)/(55 × 54 × 53 × 52 × 51 × 50 × 49)

We now start from the left and divide and multiply alternately. However, on an eight digit calculator we would thereby obtain the result 0.0000317, which does not give enough significant figures. Consequently we first multiply the 15 by 1000. Alternate dividing and multiplying then gives 0.0317107. We continue to work with this figure, which is p0 × 1000, and we now enter it in the memory while also retaining it on the display.

Remembering that we are now working with units 1000 times larger than the real units, to calculate the probability for set 1 we take the value of p0, multiply it by b and c from set 0, and divide it by a and d from set 1. That is:

p1 × 1000 = 0.0317107 × (40 × 7)/(1 × 9) = 0.9866

The figure for p1 is retained on the display.

Likewise, to calculate the probability for set 2:

p2 × 1000 = 0.9866 × (39 × 6)/(2 × 10) = 11.5427

This is as far as we need go, but for illustration table 9.3 lists the probabilities for all possible tables with the given marginal totals.

A useful check is that all the probabilities should sum to one (within the limits of rounding).

The observed set has a probability of 0.0115427. The P value is the probability of getting the observed set, or one more extreme. A one tailed P value would be

0.0115427 + 0.0009866 + 0.0000317 = 0.01256

and this is the conventional approach. Armitage and Berry (1) favour the mid P value, which is

(0.5) × 0.0115427 + 0.0009866 + 0.0000317 = 0.0068.

To get the two tailed value we double the one tailed result, thus P = 0.025 for the conventional or P = 0.0136 for the mid P approach.

The conventional approach to calculating the P value for Fisher’s exact test has been shown to be conservative (that is, it requires more evidence than is necessary to reject a false null hypothesis). The mid P is less conservative (that is more powerful) and also has some theoretical advantages. This is the one we advocate. For larger samples the P value obtained from a χ² test with Yates’ correction will correspond to the conventional approach, and the P value from the uncorrected test will correspond to the mid P value.

In either case, the P value is less than the conventional 5% level; the medical officer can conclude that there is a problem in dropping zone A. The calculation of confidence intervals for the difference in proportions for small samples is complicated, so we rely on the large sample formula given in Chapter 6. The way to present the results is: injury rate in dropping zone A was 33%, in dropping zone B 5%; difference 28% (95% confidence interval 3.5% to 53.1%), P = 0.0136 (Fisher's exact test, mid P).
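The whole computation can be expressed compactly with binomial coefficients, which are equivalent to the factorial formula (an illustrative sketch; it steps toward smaller a, matching the ad – bc < 0 case in the text):

```python
import math

def fisher_exact(a, b, c, d, mid_p=False):
    """One tailed Fisher's exact P for a 2x2 table with ad - bc < 0.

    Sums the probability of the observed table and of every more extreme
    table (progressively smaller a and d). C(a+b, a)*C(c+d, c)/C(N, a+c)
    equals the factorial formula given in the text.
    """
    def prob(a, b, c, d):
        n = a + b + c + d
        return math.comb(a + b, a) * math.comb(c + d, c) / math.comb(n, a + c)

    p_obs = prob(a, b, c, d)
    total = 0.0
    while a >= 0 and d >= 0:
        total += prob(a, b, c, d)
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
    return total - 0.5 * p_obs if mid_p else total

# Table 9.2: a = 2, b = 38, c = 5, d = 10
print(round(fisher_exact(2, 38, 5, 10), 5))              # 0.01256
print(round(fisher_exact(2, 38, 5, 10, mid_p=True), 4))  # 0.0068
```

This reproduces the conventional one tailed P of 0.01256 and the mid P of 0.0068 obtained step by step above.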

Common questions

Why is Fisher’s test called an exact test?

Because of the discrete nature of the data, and the limited amount of it, combinations of results which give the same marginal totals can be listed, and probabilities attached to them. Thus, given these marginal totals we can work out exactly the probability of getting an observed result, in the same way that we can work out exactly the probability of getting six heads out of ten tosses of a fair coin. One difficulty is that there may not be combinations which correspond "exactly" to 95%, so we cannot get an "exact" 95% confidence interval, but (say) one with 97% coverage or one with 94% coverage.

  1. Armitage P, Berry G. In: Statistical Methods in Medical Research. Oxford: Blackwell Scientific Publications, 1994:1234.
  2. Lentner C, ed. Geigy Scientific Tables, 8th ed. Basle: Geigy, 1982.
  3. Strike PW. Statistical Methods in Laboratory Medicine. Oxford: Butterworth-Heinemann, 1991.

Exercises

9.1 Of 30 men employed in a small workshop 18 worked in one department and 12 in another department. In one year five of the 18 reported sick with septic hands, and of the 12 men in the other department one did so. Is there a difference in the departments and how would you report this result?

Answers Ch 9.pdf


8. The Chi squared tests

The χ² tests

The distribution of a categorical variable in a sample often needs to be compared with the distribution of a categorical variable in another sample. For example, over a period of 2 years a psychiatrist has classified by socioeconomic class the women aged 20-64 admitted to her unit suffering from self poisoning (sample A). At the same time she has likewise classified the women of similar age admitted to a gastroenterological unit in the same hospital (sample B). She has employed the Registrar General's five socioeconomic classes, and generally classified the women by reference to their father's or husband's occupation. The results are set out in table 8.1.

The psychiatrist wants to investigate whether the distribution of the patients by social class differed in these two units. She therefore erects the null hypothesis that there is no difference between the two distributions. This is what is tested by the chi squared (χ²) test (pronounced with a hard ch as in “sky”). By default, all χ² tests are two sided.

It is important to emphasise here that χ² tests may be carried out for this purpose only on the actual numbers of occurrences, not on percentages, proportions, means of observations, or other derived statistics. Note, we distinguish here the Greek (χ²) for the test and the distribution and the Roman (x²) for the calculated statistic, which is what is obtained from the test.

The χ² test is carried out in the following steps:

For each observed number (O) in the table find an “expected” number (E); this procedure is discussed below.

To calculate the expected number for each cell of the table consider the null hypothesis, which in this case is that the numbers in each cell are proportionately the same in sample A as they are in sample B. We therefore construct a parallel table in which the proportions are exactly the same for both samples. This has been done in columns (2) and (3) of table 8.2 . The proportions are obtained from the totals column in table 8.1 and are applied to the totals row. For instance, in table 8.2 , column (2), 11.80 = (22/289) x 155; 24.67 = (46/289) x 155; in column (3) 10.20 = (22/289) x 134; 21.33 = (46/289) x 134 and so on.
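The expected-number calculation can be sketched in a few lines of Python. The observed counts below are those implied by table 8.1 as quoted in the text; the two middle counts for each sample (39, 42 and 34, 49) are inferred here from the marginal totals and the quoted x² of 7.147, so treat them as an illustration:

```python
# Expected numbers and x^2 for a two-sample table of categorical counts.
sample_a = [17, 25, 39, 42, 32]   # self poisoning unit (table 8.1)
sample_b = [5, 21, 34, 49, 25]    # gastroenterological unit (table 8.1)

row_totals = [a + b for a, b in zip(sample_a, sample_b)]   # 22, 46, 73, 91, 57
total_a, total_b = sum(sample_a), sum(sample_b)            # 155, 134
grand = total_a + total_b                                  # 289

# Expected number for a cell = (row total) x (column total) / (grand total)
expected_a = [r * total_a / grand for r in row_totals]
expected_b = [r * total_b / grand for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o, e in zip(sample_a + sample_b, expected_a + expected_b))
print(round(chi2, 3))   # 7.146, the text's 7.147 up to rounding
```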

Thus by simple proportions from the totals we find an expected number to match each observed number. The sum of the expected numbers for each sample must equal the sum of the observed numbers for each sample, which is a useful check. We now subtract each expected number from its corresponding observed number.

The results are given in columns (4) and (5) of table 8.2 . Here two points may be noted.

  1. The sum of these differences always equals zero in each column.
  2. Each difference for sample A is matched by the same figure, but with opposite sign, for sample B.

Again these are useful checks.

The figures in columns (4) and (5) are then each squared and divided by the corresponding expected numbers in columns (2) and (3). The results are given in columns (6) and (7). Finally these results, (O – E)²/E, are added. The sum of them is x².

A helpful technical procedure in calculating the expected numbers may be noted here. Most electronic calculators allow successive multiplication by a constant multiplier by a short cut of some kind. To calculate the expected numbers a constant multiplier for each sample is obtained by dividing the total of the sample by the grand total for both samples. In table 8.1 for sample A this is 155/289 = 0.5363. This fraction is then successively multiplied by 22, 46, 73, 91, and 57. For sample B the fraction is 134/289 = 0.4636. This too is successively multiplied by 22, 46, 73, 91, and 57.

The results are shown in table 8.2 , columns (2) and (3).

Having obtained a value for x², we look up the probability attached to it in a table of the χ² distribution (Appendix Table C.pdf ). Just as with the t table, we must enter this table at a certain number of degrees of freedom. To ascertain these requires some care.

When a comparison is made between one sample and another, as in table 8.1 , a simple rule is that the degrees of freedom equal (number of columns minus one) x (number of rows minus one) (not counting the row and column containing the totals). For the data in table 8.1 this gives (2 – 1) x (5 – 1) = 4. Another way of looking at this is to ask for the minimum number of figures that must be supplied in table 8.1 , in addition to all the totals, to allow us to complete the whole table. Four numbers disposed anyhow in samples A and B provided they are in separate rows will suffice.

Entering Table C at four degrees of freedom and reading along the row we find that the value of x² (7.147) lies between 3.357 and 7.779. The corresponding probability is 0.10<P<0.50. This is well above the conventionally significant level of 0.05, or 5%, so the null hypothesis is not disproved. It is therefore quite conceivable that, in their distribution between socioeconomic classes, the population from which sample A was drawn was the same as the population from which sample B was drawn.

Quick method

The above method of calculating x² illustrates the nature of the statistic clearly and is often used in practice. A quicker method, similar to the quick method for calculating the standard deviation, is particularly suitable for use with electronic calculators.(1)

The data are set out as in table 8.1 . Take the left hand column of figures (sample A) and call each observation a. Their total, which is 155, is then Σa.

Let p = the proportion formed when each observation a is divided by the corresponding figure in the total column. Thus here p in turn equals 17/22, 25/46… 32/57.

Let p̄ = the proportion formed when the total of the observations in the left hand column, Σa, is divided by the total of all the observations.

Here p̄ = 155/289. Let q̄ = 1 – p̄, which is the same as 134/289.

Then

x² = (Σ(ap) – p̄Σa) / (p̄ × q̄)
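The quick formula x² = (Σ(ap) – p̄Σa)/(p̄q̄) can be checked numerically. A sketch in Python using the sample A figures quoted in the text (the two middle observations, 39 and 42, are inferred from the totals, so treat them as illustrative):

```python
# Quick method for x^2: (sum of a*p minus pbar*sum(a)) divided by pbar*qbar.
sample_a = [17, 25, 39, 42, 32]
row_totals = [22, 46, 73, 91, 57]

sum_a = sum(sample_a)            # 155
grand = sum(row_totals)          # 289
pbar = sum_a / grand             # 155/289
qbar = 1 - pbar                  # 134/289

# sum of a*p, where p = a / (row total)
sum_ap = sum(a * a / n for a, n in zip(sample_a, row_totals))
numerator = sum_ap - pbar * sum_a        # about 1.777, the value in memory
chi2 = numerator / (pbar * qbar)
print(round(chi2, 3))   # 7.146, the same result as the long method
```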

Calculator procedure

Working with the figures in table 8.1, we use this formula on an electronic calculator (Casio fx-350) in the following way:

Withdraw result from memory on to display screen

MR (1.7769764)

We now have to divide this by p̄q̄. Here p̄ = 155/289 and q̄ = 134/289.

This gives us x²= 7.146.

The calculation naturally gives the same result if the figures for sample B are used instead of those for sample A. Owing to rounding off of the numbers the two methods for calculating x² may lead to trivially different results.

Fourfold tables

A special form of the χ² test is particularly common in practice and quick to calculate. It is applicable when the results of an investigation can be set out in a “fourfold table” or “2 x 2 contingency table”.

For example, the practitioner whose data we displayed earlier believed that the wives of the printers and farmers should be encouraged to breast feed their babies. She has records for her practice going back over 10 years, in which she has noted whether the mother breast fed the baby for at least 3 months or not, and these records show whether the husband was a printer or a sheep farmer (or some other occupation less well represented in her practice). The figures from her records are set out in table 8.3 .

The disparity seems considerable, for, although 28% of the printers’ wives breast fed their babies for three months or more, as many as 45% of the farmers’ wives did so. What is its significance?

The null hypothesis is set up that there is no difference between printers’ wives and farmers’ wives in the period for which they breast fed their babies. The χ² test on a fourfold table may be carried out by a formula that provides a short cut to the conclusion. If a, b, c, and d are the numbers in the cells of the fourfold table as shown in table 8.4 (in this case Variable 1 is breast feeding (<3 months (0), ≥3 months (1)) and Variable 2 is husband’s occupation (printer (0) or farmer (1))), x² is calculated from the following formula:

x² = ((ad – bc)² × (a + b + c + d)) / ((a + b)(c + d)(a + c)(b + d))

With a fourfold table there is one degree of freedom in accordance with the rule given earlier.
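The shortcut formula for a fourfold table gives exactly the same answer as the general observed/expected method. A sketch in Python with hypothetical counts (the values of a, b, c, d below are illustrative, not the table 8.3 figures):

```python
# Shortcut x^2 for a fourfold table, checked against the general method.
def chi2_fourfold(a, b, c, d):
    """x^2 = N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi2_general(a, b, c, d):
    """Sum of (O - E)^2 / E over the four cells."""
    n = a + b + c + d
    obs = [a, b, c, d]
    exp = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
           (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

a, b, c, d = 10, 26, 18, 22   # hypothetical fourfold table
assert abs(chi2_fourfold(a, b, c, d) - chi2_general(a, b, c, d)) < 1e-9
```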

As many electronic calculators have a capacity limited to eight digits, it is advisable not to do all the multiplication or all the division in one series of operations, lest the number become too big for the display.

Calculator procedure

Multiply a by d and store in memory

Multiply b by c and subtract from memory

Entering the χ² table with one degree of freedom we read along the row and find that 3.418 lies between 2.706 and 3.841. Therefore 0.05<P<0.1. So, despite an apparently considerable difference between the proportions of printers’ wives and the farmers’ wives breast feeding their babies for 3 months or more, the probability of this result or one more extreme occurring by chance is more than 5%.

We now calculate a confidence interval of the difference between the two proportions, as described in Chapter 6. In this case we use the standard error based on the observed data, not the null hypothesis. We could calculate the confidence interval on either the rows or the columns, and it is important that we compare proportions of the outcome variable, that is, breast feeding.

The 95% confidence interval is

0.17 – 1.96 x 0.0924 to 0.17 + 1.96 x 0.0924 = -0.011 to 0.351

Thus the 95% confidence interval is wide, and includes zero, as one might expect because the χ² test was not significant at the 5% level.

Increasing the precision of the P value in 2 x 2 tables

It can be shown mathematically that if X is a Normally distributed variable with mean zero and variance 1, then X² has a χ² distribution with one degree of freedom. The converse also holds true, and we can use this fact to improve the precision of our P values. In the above example we have x² = 3.418, with one degree of freedom. Thus X = 1.85, and from a table of the Normal distribution (Appendix table A) we find P to be about 0.065. However, we do need the χ² tables for more than one degree of freedom.
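The conversion from a one-degree-of-freedom x² to a Normal deviate is immediate in code; a sketch:

```python
# Sharpening the P value: the square of a standard Normal variable has a
# chi-squared distribution with one degree of freedom, so z = sqrt(x^2)
# can be referred to the Normal distribution.
from math import sqrt, erfc

chi2 = 3.418
z = sqrt(chi2)                    # about 1.85
p_two_sided = erfc(z / sqrt(2))   # two sided Normal tail area, about 0.065
print(round(z, 2))
```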

Small numbers

When the numbers in a 2 x 2 contingency table are small, the χ² approximation becomes poor. The following recommendations may be regarded as a sound guide. (2) In fourfold tables a χ² test is inappropriate if the total of the table is less than 20, or if the total lies between 20 and 40 and the smallest expected (not observed) value is less than 5; in contingency tables with more than one degree of freedom it is inappropriate if more than about one fifth of the cells have expected values less than 5 or any cell an expected value of less than 1. An alternative to the χ² test for fourfold tables is known as Fisher’s Exact test and is described in Chapter 9

When the values in a fourfold table are fairly small a “correction for continuity” known as the “Yates’ correction” may be applied (3). Although there is no precise rule defining the circumstances in which to use Yates’ correction, a common practice is to incorporate it into χ² calculations on tables with a total of under 100 or with any cell containing a value less than 10. The χ² test on a fourfold table is then modified as follows:

x² = ((|ad – bc| – (a + b + c + d)/2)² × (a + b + c + d)) / ((a + b)(c + d)(a + c)(b + d))

The vertical bars on either side of ad – bc mean that the smaller of those two products is taken from the larger. Half the total of the four values is then subtracted from that difference to provide Yates’ correction. The effect of the correction is to reduce the value of x².
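The corrected formula can be sketched alongside the uncorrected one; the counts below are hypothetical, and the example simply shows that the correction lowers x² here:

```python
# Yates' continuity correction for a fourfold table: |ad - bc| is reduced
# by half the grand total before squaring.
def chi2_plain(a, b, c, d):
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi2_yates(a, b, c, d):
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    return num / ((a + b) * (c + d) * (a + c) * (b + d))

a, b, c, d = 10, 26, 18, 22   # hypothetical fourfold table
assert chi2_yates(a, b, c, d) < chi2_plain(a, b, c, d)
```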

Applying it to the figures in table 8.3 gives the following result:

In this case x²=2.711 falls within the same range of P values as the x²= 3.418 we got without Yates’ correction (0.05<P<0.1), but the P value is closer to 0.1 than it was in the previous calculation. In fourfold tables containing lower frequencies than table 8.3 the reduction in P value by Yates’ correction may change a result from significant to non-significant; in any case care should be exercised when making decisions from small samples.

Comparing proportions

Earlier in this chapter we compared two samples by the χ² test to answer the question “Are the distributions of the members of these two samples between five classes significantly different?” Another way of putting this is to ask “Are the relative proportions of the two samples the same in each class?”

For example, an industrial medical officer of a large factory wants to immunise the employees against influenza. Five vaccines of various types based on the current viruses are available, but nobody knows which is preferable. From the work force 1350 employees agree to be immunised with one of the vaccines in the first week of December, so the medical officer divides the total into five approximately equal groups. Disparities occur between their total numbers owing to the layout of the factory complex. In the first week of the following March he examines the records he has been keeping to see how many employees got influenza and how many did not. These records are classified by the type of vaccine used ( table 8.5 ).

In table 8.6 the figures are analysed by the χ² test. For this we have to determine the expected values. The null hypothesis is that there is no difference between vaccines in their efficacy against influenza. We therefore assume that the proportion of employees contracting influenza is the same for each vaccine as it is for all combined. This proportion is derived from the total who got influenza, and is 225/1350. To find the expected number in each vaccine group who would contract the disease we multiply the actual numbers in the Total column of table 8.5 by this proportion. Thus 280 x (225/1350) = 46.7; 250 x (225/1350) = 41.7; and so on. Likewise the proportion who did not get influenza is 1125/1350.

The expected numbers of those who would avoid the disease are calculated in the same way from the totals in table 8.5, so that 280 x (1125/1350) = 233.3; 250 x (1125/1350) = 208.3; and so on.

The procedure is thus the same as shown in table 8.1 and table 8.2 .

The calculations made in table 8.6 show that χ² with four degrees of freedom is 16.564, and 0.001<P<0.01. This is a highly significant result. But what does it mean?

Splitting of χ²

Inspection of table 8.6 shows that the largest contribution to the total x² comes from the figures for vaccine III. They are 8.889 and 1.778, which together equal 10.667. If this figure is subtracted from the total x², 16.564 – 10.667 = 5.897. This gives an approximate figure for x² for the remainder of the table with three degrees of freedom (by removing vaccine III we have reduced the table to four rows and two columns). We then find that 0.1<P<0.5, a non-significant result. However, this is only a rough approximation. To check it exactly we apply the χ² test to the figures in table 8.5 minus the row for vaccine III. In other words, the test is now performed on the figures for vaccines I, II, IV, and V. On these figures x² = 2.983; d.f. = 3; 0.1<P<0.5. Thus the probability falls within the same broad limits as obtained by the approximate short cut given above. We can conclude that the figures for vaccine III are responsible for the highly significant result of the total x² of 16.564.

But this is not quite the end of the story. Before concluding from these figures that vaccine III is superior to the others we ought to carry out a check on other possible explanations for the disparity. The process of randomisation in the choice of the persons to receive each of the vaccines should have balanced out any differences between the groups, but some may have remained by chance. The sort of questions worth examining now are: Were the people receiving vaccine III as likely to be exposed to infection as those receiving the other vaccines? Could they have had a higher level of immunity from previous infection? Were they of comparable socioeconomic status? Of similar age on average? Were the sexes comparably distributed? Although some of these characteristics could have been more or less balanced by stratified randomisation, it is as well to check that they have in fact been equalised before attributing the numerical discrepancy in the result to the potency of the vaccine.

χ² Test for trend

Table 8.1 is a 5 x 2 table, because there are five socioeconomic classes and two samples. Socioeconomic groupings may be thought of as an example of an ordered categorical variable, as there are some outcomes (for example, mortality) in which it is sensible to state that (say) social class II is between social class I and social class III. The χ² test described at that stage did not make use of this information; if we had interchanged any of the rows the value of x² would have been exactly the same. Looking at the proportions p in table 8.1 we can see that there is no real ordering by social class in the proportions of self poisoning; social class V is between social classes I and II. However in many cases, when the outcome variable is an ordered categorical variable, a more powerful test can be devised which uses this information.

Consider a randomised controlled trial of health promotion in general practice to change people’s eating habits.(4) Table 8.7 gives the results from a review at 2 years, to look at the change in the proportion eating poultry.

If we give each category a score x, the χ² test for trend is calculated in the following way:

x² for trend = (Σ(ax) – Σa × Σ(nx)/N)² / (p̄ × q̄ × (Σ(nx²) – (Σ(nx))²/N))

where a is the number with the given outcome in each category, n is the total in that category, N is the grand total, p̄ = Σa/N and q̄ = 1 – p̄.

This has one degree of freedom because the linear scoring means that when one expected value is given all the others are fixed, and we find P = 0.02. The usual χ² test gives a value of x² = 5.51; d.f. = 2; 0.05<P<0.10. Thus the more sensitive χ² test for trend yields a significant result because the test used more information about the experimental design. The values for the scores are to some extent arbitrary. However, it is usual to choose them equally spaced on either side of zero. Thus if there are four groups the scores would be -3, -1, +1, +3, and for five groups -2, -1, 0, +1, +2. The statistic is quite robust to other values for the scores provided that they are steadily increasing or steadily decreasing.
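A sketch of the trend calculation in Python, on a hypothetical 3 x 2 table (a is the number with the outcome in each scored category, n the category total); it also computes the overall x² so the two can be compared:

```python
# x^2 for trend with equally spaced scores, versus the overall x^2.
a = [10, 18, 27]          # hypothetical counts with the outcome
n = [50, 50, 50]          # hypothetical category totals
x = [-1, 0, 1]            # equally spaced scores

N, A = sum(n), sum(a)
pbar, qbar = A / N, 1 - A / N

sum_ax = sum(ai * xi for ai, xi in zip(a, x))
sum_nx = sum(ni * xi for ni, xi in zip(n, x))
sum_nx2 = sum(ni * xi * xi for ni, xi in zip(n, x))

trend = (sum_ax - A * sum_nx / N) ** 2 / (
    pbar * qbar * (sum_nx2 - sum_nx ** 2 / N))   # one degree of freedom

overall = 0.0   # ordinary x^2 over all six cells
for ai, ni in zip(a, n):
    for obs, col_total in ((ai, A), (ni - ai, N - A)):
        exp = ni * col_total / N
        overall += (obs - exp) ** 2 / exp

assert trend <= overall + 1e-9   # trend x^2 never exceeds the overall x^2
```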

Note that this is another way of splitting the overall x² statistic. The overall x² will always be greater than the x² for trend, but because the latter uses only one degree of freedom, it is often associated with a smaller probability. Although one is often counselled not to decide on a statistical test after having looked at the data, it is obviously sensible to look at the proportions to see if they are plausibly monotonic (go steadily up or down) with the ordered variable, especially if the overall χ² test is nonsignificant.

Comparison of an observed and a theoretical distribution

In the cases so far discussed the observed values in one sample have been compared with the observed values in another. But sometimes we want to compare the observed values in one sample with a theoretical distribution.

For example, a geneticist has a breeding population of mice in his laboratory. Some are entirely white, some have a small patch of brown hairs on the skin, and others have a large patch. According to the genetic theory for the inheritance of these coloured patches of hair the population of mice should include 51.0% entirely white, 40.8% with a small brown patch, and 8.2% with a large brown patch. In fact, among the 784 mice in the laboratory 380 are entirely white, 330 have a small brown patch, and 74 have a large brown patch. Do the proportions differ from those expected?

The data are set out in table 8.8 . The expected numbers are calculated by applying the theoretical proportions to the total, namely 0.510 x 784, 0.408 x 784, and 0.082 x 784. The degrees of freedom are calculated from the fact that the only constraint is that the total for the expected cases must equal the total for the observed cases, and so the degrees of freedom are the number of rows minus one. Thereafter the procedure is the same as in previous calculations of x². In this case it comes to 2.875. The x² table is entered at two degrees of freedom. We find that 0.2<P<0.3. Consequently the null hypothesis of no difference between the observed distribution and the theoretically expected one is not disproved. The data conform to the theory.

McNemar’s test

McNemar’s test for paired nominal data was described in Chapter 6, using a Normal approximation. In view of the relationship between the Normal distribution and the χ² distribution with one degree of freedom, we can recast the McNemar test as a variant of a χ² test. The results are often expressed as in table 8.9.

From appendix-table-c.pdf we find that for both χ² values 0.02<P<0.05. The result is identical to that given using the Normal approximation described in Chapter 6, which is the square root of this result.
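The χ² form of McNemar's test uses only the discordant pairs. A sketch in Python with hypothetical discordant counts r and s (not the table 8.9 figures), giving both the uncorrected and the continuity-corrected values:

```python
# McNemar's test as a one-degree-of-freedom x^2 on the discordant pairs.
def mcnemar(r, s, corrected=True):
    """r and s are the counts of the two kinds of discordant pair."""
    if corrected:
        return (abs(r - s) - 1) ** 2 / (r + s)   # continuity corrected
    return (r - s) ** 2 / (r + s)

r, s = 5, 15   # hypothetical discordant counts
print(round(mcnemar(r, s, corrected=False), 2),   # 5.0
      round(mcnemar(r, s), 2))                    # 4.05
```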

Extensions of the χ² test

If the outcome variable in a study is nominal, the χ² test can be extended to look at the effect of more than one input variable, for example to allow for confounding variables. This is most easily done using multiple logistic regression , a generalisation of multiple regression , which is described in Chapter 11. If the data are matched, then a further technique ( conditional logistic regression ) should be employed. This is described in advanced textbooks and will not be discussed further here.

Common questions

I have matched data, but the matching criteria were very weak. Should I use McNemar’s test?

The general principle is that if the data are matched in any way, the analysis should take account of it. If the matching is weak then the matched analysis and the unmatched analysis should agree. In some cases when there are a large number of pairs with the same outcome, it would appear that the McNemar’s test is discarding a lot of information, and so is losing power. However, imagine we are trying to decide which of two high jumpers is the better. They each jump over a bar at a fixed height, and then the height is increased. It is only when one fails to jump a given height and the other succeeds that a winner can be announced. It does not matter how many jumps both have cleared.

References

  1. Snedecor GW, Cochran WG. Statistical Methods, 7th ed. Iowa: Iowa State University Press, 1980:47.
  2. Cochran WG. Some methods for strengthening the common χ² tests. Biometrics 1954;10:417-51.
  3. Yates F. Contingency tables involving small numbers and the χ² test. J Roy Stat Soc Suppl 1934;1:217-35.
  4. Cupples ME, McKnight A. Randomised controlled trial of health promotions in general practice for patients at high cardiovascular risk. BMJ 1994;309:993-6.

Exercises

8.1 In a trial of a new drug against a standard drug for the treatment of depression the new drug caused some improvement in 56% of 73 patients and the standard drug some improvement in 41% of 70 patients. The results were assessed in five categories as follows:

What is the value of x² which takes no account of the ordered value of the data, and what is the value of the x² test for trend? How many degrees of freedom are there? What is the value of P in each case?

Answer

8.2 An outbreak of pediculosis capitis is being investigated in a girls’ school containing 291 pupils. Of 130 children who live in a nearby housing estate 18 were infested and of 161 who live elsewhere 37 were infested. What is the x² value of the difference, and what is its significance? Find the difference in infestation rates and a 95% confidence interval for the difference.

Answers Ch 8.pdf

8.3 The 55 affected girls were divided at random into two groups of 29 and 26. The first group received a standard local application and the second group a new local application. The efficacy of each was measured by clearance of the infestation after one application. By this measure the standard application failed in ten cases and the new application in five. What is the χ² value of the difference (with Yates’ correction), and what is its significance? What is the difference in clearance rates and an approximate 95% confidence interval?

Answers Ch 8.pdf

8.4 A general practitioner reviewed all patient notes in four practices for 1 year. Newly diagnosed cases of asthma were noted, and whether or not the case was referred to hospital. The following referrals were found (total cases in parentheses): practice A, 14 (103); practice B, 11 (92); practice C, 39 (166); practice D, 31 (221). What are the x² and P values for the distribution of the referrals in these practices? Do they suggest that any one practice has significantly more referrals than others?

Answers Ch 8.pdf


7. The t tests

Previously we have considered how to test the null hypothesis that there is no difference between the mean of a sample and the population mean, and no difference between the means of two samples. We obtained the difference between the means by subtraction, and then divided this difference by the standard error of the difference. If the difference is 1.96 times its standard error, or more, it is likely to occur by chance with a frequency of only 1 in 20, or less.

With small samples, where more chance variation must be allowed for, these ratios are not entirely accurate because the uncertainty in estimating the standard error has been ignored. Some modification of the procedure of dividing the difference by its standard error is needed, and the technique to use is the t test. Its foundations were laid by WS Gosset, writing under the pseudonym “Student” so that it is sometimes known as Student’s t test. The procedure does not differ greatly from the one used for large samples, but is preferable when the number of observations is less than 60, and certainly when they amount to 30 or less.

The application of the t distribution to the following four types of problem will now be considered.

  1. The calculation of a confidence interval for a sample mean.
  2. The mean and standard deviation of a sample are calculated and a value is postulated for the mean of the population. How significantly does the sample mean differ from the postulated population mean?
  3. The means and standard deviations of two samples are calculated. Could both samples have been taken from the same population?
  4. Paired observations are made on two samples (or in succession on one sample). What is the significance of the difference between the means of the two sets of observations?

In each case the problem is essentially the same – namely, to establish multiples of standard errors to which probabilities can be attached. These multiples are the number of times a difference can be divided by its standard error. We have seen that with large samples 1.96 times the standard error has a probability of 5% or less, and 2.576 times the standard error a probability of 1% or less (Appendix table A ). With small samples these multiples are larger, and the smaller the sample the larger they become.

Confidence interval for the mean from a small sample

A rare congenital disease, Everley’s syndrome, generally causes a reduction in concentration of blood sodium. This is thought to provide a useful diagnostic sign as well as a clue to the efficacy of treatment. Little is known about the subject, but the director of a dermatological department in a London teaching hospital is known to be interested in the disease and has seen more cases than anyone else. Even so, he has seen only 18. The patients were all aged between 20 and 44.

The mean blood sodium concentration of these 18 cases was 115 mmol/l, with standard deviation of 12 mmol/l. Assuming that blood sodium concentration is Normally distributed what is the 95% confidence interval within which the mean of the total population of such cases may be expected to lie?

The data are set out as follows:

To find the 95% confidence interval above and below the mean we now have to find a multiple of the standard error. In large samples we have seen that the multiple is 1.96 (Chapter 4). For small samples we use the table of t given in Appendix Table B.pdf. As the sample becomes smaller t becomes larger for any particular level of probability. Conversely, as the sample becomes larger t becomes smaller and approaches the values given in table A, reaching them for infinitely large samples.

Since the size of the sample influences the value of t, the size of the sample is taken into account in relating the value of t to probabilities in the table. Some useful parts of the full t table appear in Appendix Table B.pdf . The left hand column is headed d.f. for “degrees of freedom”. The use of these was noted in the calculation of the standard deviation (Chapter 2). In practice the degrees of freedom amount in these circumstances to one less than the number of observations in the sample. With these data we have 18 – 1 = 17 d.f. This is because only 17 observations plus the total number of observations are needed to specify the sample, the 18th being determined by subtraction.

To find the number by which we must multiply the standard error to give the 95% confidence interval we enter table B at 17 in the left hand column and read across to the column headed 0.05 to discover the number 2.110. The 95% confidence intervals of the mean are now set as follows:

Mean – 2.110 SE to Mean + 2.110 SE

which gives us:

115 – (2.110 x 2.83) to 115 + (2.110 x 2.83) or 109.03 to 120.97 mmol/l.

We may then say, with a 95% chance of being correct, that the range 109.03 to 120.97 mmol/l includes the population mean.
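The interval can be checked with a few lines of Python, using the summary figures from the text (mean 115 mmol/l, SD 12, n = 18, and the tabulated t of 2.110 for 17 degrees of freedom):

```python
# 95% confidence interval for a mean from a small sample.
from math import sqrt

mean, sd, n = 115, 12, 18
t_05 = 2.110                      # Table B, 17 degrees of freedom, P = 0.05
se = sd / sqrt(n)                 # standard error of the mean, about 2.83
lower, upper = mean - t_05 * se, mean + t_05 * se
print(round(lower, 2), round(upper, 2))   # 109.03 120.97
```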

Likewise from Appendix Table B.pdf the 99% confidence interval of the mean is as follows:

Mean – 2.898 SE to Mean + 2.898 SE

which gives:

115 – (2.898 x 2.83) to 115 + (2.898 x 2.83) or 106.80 to 123.20 mmol/l.

Difference of sample mean from population mean (one sample t test)

Estimations of plasma calcium concentration in the 18 patients with Everley’s syndrome gave a mean of 3.2 mmol/l, with standard deviation 1.1. Previous experience from a number of investigations and published reports had shown that the mean was commonly close to 2.5 mmol/l in healthy people aged 20-44, the age range of the patients. Is the mean in these patients abnormally high?

We set the figures out as follows:

t = difference between means divided by the standard error of the sample mean. Ignoring the sign of the t value, and entering table B at 17 degrees of freedom, we find that 2.69 comes between probability values of 0.02 and 0.01, in other words between 2% and 1%, and so 0.01<P<0.02. It is therefore unlikely that the sample with mean 3.2 came from the population with mean 2.5, and we may conclude that the sample mean is, at least statistically, unusually high. Whether it should be regarded clinically as abnormally high is something that needs to be considered separately by the physician in charge of that case.
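A sketch of the one sample t calculation with the figures quoted above (the text's 2.69 reflects a rounded intermediate standard error; unrounded arithmetic gives about 2.70, which falls in the same probability band):

```python
# One sample t test: plasma calcium, mean 3.2 mmol/l, SD 1.1, n = 18,
# postulated population mean 2.5 mmol/l.
from math import sqrt

mean, sd, n, mu = 3.2, 1.1, 18, 2.5
t = (mean - mu) / (sd / sqrt(n))   # about 2.70
assert 2.567 < t < 2.898           # Table B, 17 d.f.: 0.01 < P < 0.02
```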

Difference between means of two samples

Here we apply a modified procedure for finding the standard error of the difference between two means and testing the size of the difference by this standard error (see Chapter 5 for large samples). For large samples we used the standard deviation of each sample, computed separately, to calculate the standard error of the difference between the means. For small samples we calculate a combined standard deviation for the two samples.

The assumptions are:

  1. that the data are quantitative and plausibly Normal
  2. that the two samples come from distributions that may differ in their mean value, but not in the standard deviation
  3. that the observations are independent of each other.

The third assumption is the most important. In general, repeated measurements on the same individual are not independent. If we had 20 leg ulcers on 15 patients, then we have only 15 independent observations.

The following example illustrates the procedure.

The addition of bran to the diet has been reported to benefit patients with diverticulosis. Several different bran preparations are available, and a clinician wants to test the efficacy of two of them on patients, since favourable claims have been made for each. Among the consequences of administering bran that requires testing is the transit time through the alimentary canal. Does it differ in the two groups of patients taking these two preparations?

The null hypothesis is that the two groups come from the same population. By random allocation the clinician selects two groups of patients aged 40-64 with diverticulosis of comparable severity. Sample 1 contains 15 patients who are given treatment A, and sample 2 contains 12 patients who are given treatment B. The transit times of food through the gut are measured by a standard technique with marked pellets and the results are recorded, in order of increasing time, in Table 7.1 .

Table 7.1

These data are shown in figure 7.1 . The assumptions of approximate Normality and equality of variance are satisfied. The design suggests that the observations are indeed independent. Since it is possible for the difference in mean transit times for A – B to be positive or negative, we will employ a two sided test.

Figure 7.1

With treatment A the mean transit time was 68.40 h and with treatment B 83.42 h. What is the significance of the difference, 15.02 h?

The procedure is as follows:

Obtain the standard deviation in sample 1:

SD1 = √(Σ(x – mean1)²/(n1 – 1))

Obtain the standard deviation in sample 2:

SD2 = √(Σ(x – mean2)²/(n2 – 1))

Multiply the square of the standard deviation of sample 1 by the degrees of freedom, which is the number of subjects minus one: SD1² x (n1 – 1).

Repeat for sample 2: SD2² x (n2 – 1).

Add the two together and divide by the total degrees of freedom to obtain the pooled variance:

SD² = [SD1²(n1 – 1) + SD2²(n2 – 1)]/(n1 + n2 – 2)

The standard error of the difference between the means is

SE(diff) = √(SD²/n1 + SD²/n2)

which can be written

SE(diff) = SD x √(1/n1 + 1/n2)

When the difference between the means is divided by this standard error the result is t. Thus,

t = (mean1 – mean2)/SE(diff)

The table of the t distribution, Table B (Appendix), which gives two sided P values, is entered at n1 + n2 – 2 degrees of freedom.

For the transit times of table 7.1,

t = 15.02/6.582 = 2.282

Table B shows that at 25 degrees of freedom (that is (15 – 1) + (12 – 1)), t = 2.282 lies between 2.060 and 2.485, so the P value lies between 0.05 and 0.02. Consequently, the P value is smaller than the conventional level of 5%. The null hypothesis that there is no difference between the means is therefore somewhat unlikely.

A 95% confidence interval is given by

difference ± t(5%, 25 df) x SE(diff)

This becomes

83.42 – 68.40 ± 2.060 x 6.582

15.02 – 13.56 to 15.02 + 13.56, or 1.46 to 28.58 h.
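As a check on this arithmetic, the calculation can be reproduced in a few lines; only figures quoted above are used (the two means, the pooled standard error of 6.582 h, and the 5% point of t on 25 degrees of freedom, 2.060).

```python
# Two sample t test arithmetic for the transit time data, using only the
# summary figures quoted in the text.
mean_a, mean_b = 68.40, 83.42   # mean transit times (h), treatments A and B
n_a, n_b = 15, 12               # sample sizes
se_diff = 6.582                 # pooled standard error of the difference (h)
t_crit = 2.060                  # two sided 5% point of t on 25 df

diff = mean_b - mean_a                       # 15.02 h
t = diff / se_diff                           # 2.282
lo, hi = diff - t_crit * se_diff, diff + t_crit * se_diff
print(f"t = {t:.3f} on {n_a + n_b - 2} df")
print(f"95% CI: {lo:.2f} to {hi:.2f} h")
```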

Unequal standard deviations

If the standard deviations in the two groups are markedly different, for example if the ratio of the larger to the smaller is greater than two, then one of the assumptions of the t test (that the two samples come from populations with the same standard deviation) is unlikely to hold. An approximate test due to Satterthwaite, described by Armitage and Berry,(1) which allows for unequal standard deviations, is as follows.

Rather than use the pooled estimate of variance, compute

SE(diff) = √(SD1²/n1 + SD2²/n2)

This is analogous to calculating the standard error of the difference in two proportions under the alternative hypothesis as described in Chapter 6.

We now compute

t = (mean1 – mean2)/SE(diff)

We then test this using a t statistic, in which the degrees of freedom are:

df = (SD1²/n1 + SD2²/n2)² / [(SD1²/n1)²/(n1 – 1) + (SD2²/n2)²/(n2 – 1)]

Although this may look very complicated, it can be evaluated very easily on a calculator without having to write down intermediate steps. It can produce a degree of freedom which is not an integer, and so not available in the tables; in this case one should round to the nearest integer. Many statistical packages now carry out this test as the default, and to get the equal variance t statistic one has to ask for it specifically. The unequal variance t test tends to be less powerful than the usual t test if the variances are in fact the same, since it uses fewer assumptions. However, it should not be used indiscriminately because, if the standard deviations are different, how can we interpret a non-significant difference in means, for example? Often a better strategy is to try a data transformation, such as taking logarithms as described in Chapter 2. Transformations that render distributions closer to Normality often also make the standard deviations similar. If a log transformation is successful, use the usual t test on the logged data.
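A sketch of the unequal variance procedure is given below. The means, standard deviations, and sample sizes in the example call are hypothetical, invented for illustration with an SD ratio greater than two.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unequal variance (Satterthwaite) t statistic and approximate df.

    The standard error uses each sample's own variance, and the degrees
    of freedom follow the formula given in the text, rounded to the
    nearest integer for use with the tables.
    """
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2     # separate variance terms
    se = math.sqrt(v1 + v2)                   # SE of the difference
    t = (mean1 - mean2) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, round(df)

# Hypothetical figures: SD ratio 25/10 > 2, so the pooled test is doubtful.
t, df = welch_t(68.4, 10.0, 15, 83.4, 25.0, 12)
print(f"t = {t:.2f} on approximately {df} df")
```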

Difference between means of paired samples (paired t test)

When the effects of two alternative treatments or experiments are compared, for example in cross over trials, randomised trials in which randomisation is between matched pairs, or matched case control studies (see Chapter 13 ), it is sometimes possible to make comparisons in pairs. Matching controls for the matched variables, and so can lead to a more powerful study.

The test is derived from the single sample t test, using the following assumptions.

  1. The data are quantitative
  2. The distribution of the differences (not the original data) is plausibly Normal.
  3. The differences are independent of each other.

The first case to consider is when each member of the sample acts as his own control. Whether treatment A or treatment B is given first or second to each member of the sample should be determined by the use of the table of random numbers Table F (Appendix). In this way any effect of one treatment on the other, even indirectly through the patient’s attitude to treatment, for instance, can be minimised. Occasionally it is possible to give both treatments simultaneously, as in the treatment of a skin disease by applying a remedy to the skin on opposite sides of the body.

Let us use as an example the studies of bran in the treatment of diverticulosis discussed earlier. The clinician wonders whether transit time would be shorter if bran is given in the same dosage in three meals during the day (treatment A) or in one meal (treatment B). A random sample of patients with disease of comparable severity and aged 20-44 is chosen and the two treatments administered on two successive occasions, the order of the treatments also being determined from the table of random numbers. The alimentary transit times and the differences for each pair of treatments are set out in Table 7.2

Table 7.2

In calculating t on the paired observations we work with the difference, d, between the members of each pair. Our first task is to find the mean of the differences between the observations and then the standard error of that mean, proceeding as follows: the mean difference is –6.5 h, and dividing the standard deviation of the differences by √n gives its standard error, 4.37 h. Thus

t = –6.5/4.37 = –1.49 with n – 1 = 11 degrees of freedom.

Entering Appendix table B at 11 degrees of freedom (n – 1) and ignoring the minus sign, we find that this value lies between 0.697 and 1.796. Reading off the probability value, we see that 0.1<P<0.5. The null hypothesis is that there is no difference between the mean transit times on these two forms of treatment. From our calculations, it is not disproved. However, this does not mean that the two treatments are equivalent. To help us decide this we calculate the confidence interval.

A 95% confidence interval for the mean difference is given by

mean difference ± t(5%, n – 1 df) x SE(mean difference)

In this case t with 11 degrees of freedom at P = 0.05 is 2.201 (table B), and so the 95% confidence interval is:

–6.5 – 2.201 x 4.37 to –6.5 + 2.201 x 4.37 h, or –16.1 to 3.1 h.

This is quite wide, so we cannot really conclude that the two preparations are equivalent, and should look to a larger study.
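The paired arithmetic can be redone from the quoted summary figures (mean difference –6.5 h, standard error 4.37 h, and the 5% point of t on 11 degrees of freedom, 2.201):

```python
# Paired t test arithmetic for the paired bran data, from the text's figures.
d_bar, se_d, n = -6.5, 4.37, 12   # mean difference (h), its SE, number of pairs
t = d_bar / se_d                  # t on n - 1 = 11 df
t_crit = 2.201                    # two sided 5% point of t on 11 df
lo, hi = d_bar - t_crit * se_d, d_bar + t_crit * se_d
print(f"t = {t:.2f} on {n - 1} df")
print(f"95% CI: {lo:.1f} to {hi:.1f} h")
```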

The second case of a paired comparison to consider is when two samples are chosen and each member of sample 1 is paired with one member of sample 2, as in a matched case control study. As the aim is to test the difference, if any, between two types of treatment, the choice of members for each pair is designed to make them as alike as possible. The more alike they are, the more apparent will be any differences due to treatment, because they will not be confused with differences in the results caused by disparities between members of the pair. The likeness within the pairs applies to attributes relating to the study in question. For instance, in a test for a drug reducing blood pressure the colour of the patients’ eyes would probably be irrelevant, but their resting diastolic blood pressure could well provide a basis for selecting the pairs. Another (perhaps related) basis is the prognosis for the disease in patients: in general, patients with a similar prognosis are best paired. Whatever criteria are chosen, it is essential that the pairs are constructed before the treatment is given, for the pairing must be uninfluenced by knowledge of the effects of treatment.

Further methods

Suppose we had a clinical trial with more than two treatments. It is not valid to compare each treatment with each other treatment using t tests because the overall type I error rate will be bigger than the conventional level set for each individual test. A method of controlling for this is to use a one way analysis of variance.(2)

Common questions

Should I test my data for Normality before using the t test?

It would seem logical that, because the t test assumes Normality, one should test for Normality first. The problem is that the test for Normality is dependent on the sample size. With a small sample a non-significant result does not mean that the data come from a Normal distribution. On the other hand, with a large sample, a significant result does not mean that we could not use the t test, because the t test is robust to moderate departures from Normality – that is, the P value obtained can be validly interpreted. There is something illogical about using one significance test conditional on the results of another significance test. In general it is a matter of knowing and looking at the data. One can "eyeball" the data and if the distributions are not extremely skewed, and particularly if (for the two sample t test) the numbers of observations are similar in the two groups, then the t test will be valid. The main problem is often that outliers will inflate the standard deviations and render the test less sensitive. Also, it is not generally appreciated that if the data originate from a randomised controlled trial, then the process of randomisation will ensure the validity of the t test, irrespective of the original distribution of the data.

Should I test for equality of the standard deviations before using the usual t test?

The same argument prevails here as for the previous question about Normality. The test for equality of variances is dependent on the sample size. A rule of thumb is that if the ratio of the larger to smaller standard deviation is greater than two, then the unequal variance test should be used. With a computer one can easily do both the equal and unequal variance t test and see if the answers differ.

Why should I use a paired test if my data are paired? What happens if I don’t?

Pairing provides information about an experiment, and the more information that can be provided in the analysis the more sensitive the test. One of the major sources of variability is between subjects variability. By repeating measures within subjects, each subject acts as his or her own control, and the between subjects variability is removed. In general this means that if there is a true difference between the pairs the paired test is more likely to pick it up: it is more powerful. When the pairs are generated by matching, the matching criteria may not be important. In this case, the paired and unpaired tests should give similar results.

References

  1. Armitage P, Berry G. Statistical Methods in Medical Research. 3rd ed. Oxford: Blackwell Scientific Publications, 1994:112-13.
  2. Armitage P, Berry G. Statistical Methods in Medical Research. 3rd ed. Oxford: Blackwell Scientific Publications, 1994:207-14.

Exercises

7.1 In 22 patients with an unusual liver disease the plasma alkaline phosphatase was found by a certain laboratory to have a mean value of 39 King-Armstrong units, standard deviation 3.4 units. What is the 95% confidence interval within which the mean of the population of such cases whose specimens come to the same laboratory may be expected to lie?

Answer

7.2 In the 18 patients with Everley’s syndrome the mean level of plasma phosphate was 1.7 mmol/l, standard deviation 0.8. If the mean level in the general population is taken as 1.2 mmol/l, what is the significance of the difference between that mean and the mean of these 18 patients?

Answer

7.3 In two wards for elderly women in a geriatric hospital the following levels of haemoglobin were found:

Ward A: 12.2, 11.1, 14.0, 11.3, 10.8, 12.5, 12.2, 11.9, 13.6, 12.7, 13.4, 13.7 g/dl;

Ward B: 11.9, 10.7, 12.3, 13.9, 11.1, 11.2, 13.3, 11.4, 12.0, 11.1 g/dl.

What is the difference between the mean levels in the two wards, and what is its significance? What is the 95% confidence interval for the difference in treatments?

Answer

7.4 A new treatment for varicose ulcer is compared with a standard treatment on ten matched pairs of patients, where treatment between pairs is decided using random numbers. The outcome is the number of days from start of treatment to healing of ulcer. One doctor is responsible for treatment and a second doctor assesses healing without knowing which treatment each patient had. The following treatment times were recorded.

Standard treatment: 35, 104, 27, 53, 72, 64, 97, 121, 86, 41 days;

New treatment: 27, 52, 46, 33, 37, 82, 51, 92, 68, 62 days.

What are the mean difference in the healing time, the value of t, the number of degrees of freedom, and the probability? What is the 95% confidence interval for the difference?

Answer


6. Differences between percentages and paired alternatives

Standard error of difference between percentages or proportions

The surgical registrar who investigated appendicitis cases, referred to in Chapter 3 , wonders whether the percentages of men and women in the sample differ from the percentages of all the other men and women aged 65 and over admitted to the surgical wards during the same period. After excluding his sample of appendicitis cases, so that they are not counted twice, he makes a rough estimate of the number of patients admitted in those 10 years and finds it to be about 12 000 to 13 000. He selects a systematic random sample of 640 patients, of whom 363 (56.7%) were women and 277 (43.3%) men.

The percentage of women in the appendicitis sample was 60.8% and differs from the percentage of women in the general surgical sample by 60.8 – 56.7 = 4.1%. Is this difference of any significance? In other words, could this have arisen by chance?

There are two ways of calculating the standard error of the difference between two percentages: one is based on the null hypothesis that the two groups come from the same population; the other on the alternative hypothesis that they are different. For Normally distributed variables these two are the same if the standard deviations are assumed to be the same, but in the binomial case the standard deviations depend on the estimates of the proportions, and so if these are different so are the standard deviations. Usually both methods give almost the same result.

Confidence interval for a difference in proportions or percentages

The calculation of the standard error of a difference in proportions p1 – p2 follows the same logic as the calculation of the standard error of two means; sum the squares of the individual standard errors and then take the square root. It is based on the alternative hypothesis that there is a real difference in proportions (further discussion on this point is given in Common questions at the end of this chapter).

The standard error of the difference is

SE(diff %) = √(p1(100 – p1)/n1 + p2(100 – p2)/n2)

Note that this is an approximate formula; the exact one would use the population proportions rather than the sample estimates. With our appendicitis data we have:

√(60.8 x 39.2/120 + 56.7 x 43.3/640) = 4.87

Thus a 95% confidence interval for the difference in percentages is

4.1 – 1.96 x 4.87 to 4.1 + 1.96 x 4.87 = -5.4 to 13.6%.
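This confidence interval calculation can be sketched in code from the two sample percentages:

```python
import math

def se_diff_pct(p1, n1, p2, n2):
    """SE of a difference in percentages under the alternative hypothesis:
    sum the squared individual standard errors and take the square root."""
    return math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)

# Appendicitis data: 60.8% women among 120 cases, 56.7% among 640 others.
se = se_diff_pct(60.8, 120, 56.7, 640)
diff = 60.8 - 56.7
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"SE = {se:.2f}%, 95% CI: {lo:.1f} to {hi:.1f}%")
```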

Significance test for a difference in two proportions

For a significance test we have to use a slightly different formula, based on the null hypothesis that both samples have a common population proportion, estimated by p:

SE(diff %) = √(p(100 – p)(1/n1 + 1/n2))

To obtain p we must amalgamate the two samples and calculate the percentage of women in the two combined; 100 – p is then the percentage of men in the two combined. The numbers in each sample are n1 = 120 and n2 = 640.

Number of women in the samples: 73 + 363 = 436

Number of people in the samples: 120 + 640 = 760

Percentage of women: (436 x 100)/760 = 57.4

Percentage of men: (324 x 100)/760 = 42.6

Putting these numbers in the formula, we find the standard error of the difference between the percentages is

√(57.4 x 42.6 x (1/120 + 1/640)) = 4.92

This is very close to the standard error estimated under the alternative hypothesis.

The difference between the percentage of women (and men) in the two samples was 4.1%. To find the probability attached to this difference we divide it by its standard error: z = 4.1/4.92 = 0.83. From table A (Appendix) we find that P is about 0.4, and so the difference between the percentages in the two samples could have been due to chance alone, as might have been expected from the confidence interval. Note that this test gives results identical to those obtained by the Chi square test without continuity correction (described in Chapter 8).
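The null hypothesis version of the standard error, using the amalgamated proportion, can be checked the same way:

```python
import math

# SE of the difference under the null hypothesis: both samples share a
# common population proportion p, estimated from the combined samples.
women, total = 73 + 363, 120 + 640        # 436 women among 760 people
p = women * 100 / total                   # 57.4% women overall
se = math.sqrt(p * (100 - p) * (1 / 120 + 1 / 640))
z = 4.1 / se                              # observed difference / SE
print(f"p = {p:.1f}%, SE = {se:.2f}%, z = {z:.2f}")
```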

Standard error of a total

The total number of deaths in a town from a particular disease varies from year to year. If the population of the town or area where they occur is fairly large, say, some thousands, and provided that the deaths are independent of one another, the standard error of the number of deaths D from a specified cause is given approximately by its square root, √D. Further, the standard error of the difference between two numbers of deaths, D1 and D2, can be taken as √(D1 + D2).

This can be used to estimate the significance of a difference between two totals by dividing the difference by its standard error:

z = (D1 – D2)/√(D1 + D2)   (6.1)

It is important to note that the deaths must be independently caused; for example, they must not be the result of an epidemic such as influenza. The reports of the deaths must likewise be independent; for example, the criteria for diagnosis must be consistent from year to year and not suddenly change in accordance with a new fashion or test, and the population at risk must be the same size over the period of study.

In spite of its limitations this method has its uses. For instance, in Carlisle the number of deaths from ischaemic heart disease in 1973 was 276. Is this significantly higher than the total for 1972, which was 246? The difference is 30. The standard error of the difference is √(276 + 246) = 22.8. We then take z = 30/22.8 = 1.313. This is clearly less than 1.96, the value required for significance at the 5% level of probability. Reference to table A (Appendix) shows that P = 0.2. The difference could therefore easily be a chance fluctuation.
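A minimal sketch of formula (6.1) applied to the Carlisle figures:

```python
import math

# z for the difference between two independent counts of deaths: the SE
# of each count is approximately its square root, so the SE of the
# difference is sqrt(D1 + D2).
d1, d2 = 276, 246            # ischaemic heart disease deaths, 1973 and 1972
se = math.sqrt(d1 + d2)
z = (d1 - d2) / se
print(f"SE = {se:.1f}, z = {z:.3f}")
```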

This method should be regarded as giving no more than approximate but useful guidance, and is unlikely to be valid over a period of more than very few years owing to changes in diagnostic techniques. An extension of it to the study of paired alternatives follows.

Paired alternatives

Sometimes it is possible to record the results of treatment or some sort of test or investigation as one of two alternatives. For instance, two treatments or tests might be carried out on pairs obtained by matching individuals chosen by random sampling, or the pairs might consist of successive treatments of the same individual (see Chapter 7 for a comparison of pairs by the t test). The result might then be recorded as "responded or did not respond", "improved or did not improve", "positive or negative", and so on. This type of study yields results that can be set out as shown in table 6.1.

Table 6.1

Member of pair receiving treatment A Member of pair receiving treatment B
Responded Responded (1)
Responded Did not respond (2)
Did not respond Responded (3)
Did not respond Did not respond (4)

The significance of the results can then be simply tested by McNemar’s test in the following way. Ignore rows (1) and (4), and examine rows (2) and (3). Let the larger number of pairs in either of rows (2) or (3) be called n1 and the smaller number of pairs in either of those two rows be n2. We may then use formula ( 6.1 ) to obtain the result, z. This is approximately Normally distributed under the null hypothesis, and its probability can be read from appendix-table.pdftable A.

However, in practice, the fairly small numbers that form the subject of this type of investigation make a correction advisable. We therefore diminish the difference between n1 and n2 by using the following formula:

z = (|n1 – n2| – 1)/√(n1 + n2)

where the vertical lines mean "take the absolute value".

Again, the result is Normally distributed, and its probability can be read from table A (Appendix). As for the unpaired case, there is a slightly different formula for the standard error used to calculate the confidence interval.(1) Suppose N is the total number of pairs; then

SE(diff) = (1/N) x √(n1 + n2 – (n1 – n2)²/N)

For example, a registrar in the gastroenterological unit of a large hospital in an industrial city sees a considerable number of patients with severe recurrent aphthous ulcer of the mouth. Claims have been made that a recently introduced preparation stops the pain of these ulcers and promotes quicker healing than existing preparations.

Over a period of 6 months the registrar selected every patient with this disorder and paired them off as far as possible by reference to age, sex, and frequency of ulceration. Finally she had 108 patients in 54 pairs. To one member of each pair, chosen by the toss of a coin, she gave treatment A, which she and her colleagues in the unit had hitherto regarded as the best; to the other member she gave the new treatment, B. Both forms of treatment are local applications, and they cannot be made to look alike. Consequently to avoid bias in the assessment of the results a colleague recorded the results of treatment without knowing which patient in each pair had which treatment. The results are shown in Table 6.2

Table 6.2

Member of pair receiving treatment A Member of pair receiving treatment B Pairs of patients
Responded Responded 16
Responded Did not respond 23
Did not respond Responded 10
Did not respond Did not respond 5
Total 54

Here n1 = 23, n2 = 10. Entering these values in the corrected formula we obtain

z = (|23 – 10| – 1)/√(23 + 10) = 12/5.745 = 2.089

The probability value associated with 2.089 is about 0.04 (table A). Therefore we may conclude that treatment A gave significantly better results than treatment B. The standard error for the confidence interval is

(1/54) x √(23 + 10 – (23 – 10)²/54) = 0.101

and the difference in proportions is

23/54 – 10/54 = 0.241

The 95% confidence interval for the difference in proportions is

0.241 – 1.96 x 0.101 to 0.241 + 1.96 x 0.101, that is, 0.043 to 0.439.

Although this does not include zero, the confidence interval is quite wide, reflecting uncertainty as to the true difference because the sample size is small. An exact method is also available.
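The whole calculation for table 6.2 — the corrected McNemar statistic, the difference in proportions, and its confidence interval — can be reproduced in a few lines:

```python
import math

# McNemar's test with continuity correction, plus the confidence interval
# for the difference in paired proportions, for the aphthous ulcer data.
n1, n2, N = 23, 10, 54                      # discordant pairs and total pairs

z = (abs(n1 - n2) - 1) / math.sqrt(n1 + n2)          # corrected statistic
diff = n1 / N - n2 / N                               # difference in proportions
se_ci = math.sqrt(n1 + n2 - (n1 - n2) ** 2 / N) / N  # SE for the CI
lo, hi = diff - 1.96 * se_ci, diff + 1.96 * se_ci
print(f"z = {z:.3f}, difference = {diff:.3f}")
print(f"95% CI: {lo:.3f} to {hi:.3f}")
```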

Common questions

Why is the standard error used for calculating a confidence interval for the difference in two proportions different from the standard error used for calculating the significance?

For nominal variables the standard deviation is not independent of the mean. If we suppose that a nominal variable simply takes the value 0 or 1, then the mean is simply the proportion of 1s and the standard deviation is directly dependent on the mean, being largest when the mean is 0.5. The null and alternative hypotheses are hypotheses about means, either that they are the same (null) or different (alternative). Thus for nominal variables the standard deviations (and thus the standard errors) will also be different for the null and alternative hypotheses. For a confidence interval, the alternative hypothesis is assumed to be true, whereas for a significance test the null hypothesis is assumed to be true. In general the difference in the values of the two methods of calculating the standard errors is likely to be small, and use of either would lead to the same inferences. The reason this is mentioned here is that there is a close connection between the test of significance described in this chapter and the Chi square test described in Chapter 8. The difference in the arithmetic for the significance test, and that for calculating the confidence interval, could lead some readers to believe that they are unrelated, whereas in fact they are complementary. The problem does not arise with continuous variables, where the standard deviation is usually assumed independent of the mean, and is also assumed to be the same value under both the null and alternative hypotheses.

It is worth pointing out that the formula for calculating the standard error of an estimate is not necessarily unique: it depends on underlying assumptions, and so different assumptions or study designs will lead to different estimates for standard errors for data sets that might be numerically identical.

References

1. Gardner MJ, Altman DG, editors. Statistics with Confidence. London: BMJ Publishing, 1989:31.

Exercises

Exercise 6.1

In an obstetric hospital 17.8% of 320 women were delivered by forceps in 1975. What is the standard error of this percentage? In another hospital in the same region 21.2% of 185 women were delivered by forceps. What is the standard error of the difference between the percentages at this hospital and the first? What is the difference between these percentages of forceps delivery with a 95% confidence interval and what is its significance?

Answer

Exercise 6.2

A dermatologist tested a new topical application for the treatment of psoriasis on 47 patients. He applied it to the lesions on one part of the patient’s body and what he considered to be the best traditional remedy to the lesions on another but comparable part of the body, the choice of area being made by the toss of a coin. In three patients both areas of psoriasis responded; in 28 patients the disease responded to the traditional remedy but hardly or not at all to the new one; in 13 it responded to the new one but hardly or not at all to the traditional remedy; and in four cases neither remedy caused an appreciable response. Did either remedy cause a significantly better response than the other?

Answer


5. Differences between means: type I and type II errors and power

We saw in Chapter 3 that the mean of a sample has a standard error, and a mean that departs by more than twice its standard error from the population mean would be expected by chance only in about 5% of samples. Likewise, the difference between the means of two samples has a standard error. We do not usually know the population mean, so we may suppose that the mean of one of our samples estimates it. The sample mean may happen to be identical with the population mean but it more probably lies somewhere above or below the population mean, and there is a 95% chance that it is within 1.96 standard errors of it.

Consider now the mean of the second sample. If the sample comes from the same population its mean will also have a 95% chance of lying within 1.96 standard errors of the population mean, but if we do not know the population mean we have only the means of our samples to guide us. Therefore, if we want to know whether they are likely to have come from the same population, we ask whether they lie within a certain range, represented by their standard errors, of each other.

Large sample standard error of difference between means

If SD1 represents the standard deviation of sample 1 and SD2 the standard deviation of sample 2, n1 the number in sample 1 and n2 the number in sample 2, the formula denoting the standard error of the difference between two means is:

SE(diff) = √(SD1²/n1 + SD2²/n2)   (5.1)

The computation is straightforward.

Square the standard deviation of sample 1 and divide by the number of observations in the sample:(1)

Square the standard deviation of sample 2 and divide by the number of observations in the sample:(2)

Add (1) and (2).

Take the square root, to give equation 5.1. This is the standard error of the difference between the two means.

Large sample confidence interval for the difference in two means

The general practitioner wants to compare the mean of the printers' blood pressures with the mean of the farmers' blood pressures. The figures are set out first as in table 5.1 (which repeats table 3.1 ).

Table 5.1

Analysing these figures in accordance with the formula given above, we have:

SE(diff) = √(SD1²/n1 + SD2²/n2) = 0.81 mmHg

The difference between the means is 88 – 79 = 9 mmHg.

For large samples we can calculate a 95% confidence interval for the difference in means as

9 – 1.96 x 0.81 to 9 + 1.96 x 0.81 which is 7.41 to 10.59 mmHg.
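Equation 5.1 and this interval can be reproduced in code. The chapter quotes only the difference (9 mmHg) and the standard error (0.81 mmHg); the standard deviations and sample sizes below (printers SD 4.5, n = 72; farmers SD 4.2, n = 48) are assumed from the original table 3.1 and should be checked against it.

```python
import math

def se_diff_means(sd1, n1, sd2, n2):
    """Equation 5.1: large sample SE of a difference between two means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Assumed table values: printers SD 4.5, n = 72; farmers SD 4.2, n = 48.
se = round(se_diff_means(4.5, 72, 4.2, 48), 2)   # 0.81 mmHg, as quoted
diff = 88 - 79                                   # 9 mmHg
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"SE = {se} mmHg, 95% CI: {lo:.2f} to {hi:.2f} mmHg")
```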

For a small sample we need to modify this procedure, as described in Chapter 7.

Null hypothesis and type I error

In comparing the mean blood pressures of the printers and the farmers we are testing the hypothesis that the two samples came from the same population of blood pressures. The hypothesis that there is no difference between the population from which the printers’ blood pressures were drawn and the population from which the farmers’ blood pressures were drawn is called the null hypothesis.

But what do we mean by “no difference”? Chance alone will almost certainly ensure that there is some difference between the sample means, for they are most unlikely to be identical. Consequently we set limits within which we shall regard the samples as not having any significant difference. If we set the limits at twice the standard error of the difference, and regard a mean outside this range as coming from another population, we shall on average be wrong about one time in 20 if the null hypothesis is in fact true. If we do obtain a mean difference bigger than two standard errors we are faced with two choices: either an unusual event has happened, or the null hypothesis is incorrect. Imagine tossing a coin five times and getting the same face each time. This has nearly the same probability (6.3%) as obtaining a mean difference bigger than two standard errors when the null hypothesis is true. Do we regard it as a lucky event or suspect a biased coin? If we are unwilling to believe in unlucky events, we reject the null hypothesis, in this case that the coin is a fair one.

To reject the null hypothesis when it is true is to make what is known as a type I error . The level at which a result is declared significant is known as the type I error rate, often denoted by α. We try to show that a null hypothesis is unlikely , not its converse (that it is likely), so a difference which is greater than the limits we have set, and which we therefore regard as “significant”, makes the null hypothesis unlikely . However, a difference within the limits we have set, and which we therefore regard as “non-significant”, does not make the hypothesis likely.

A range of not more than two standard errors is often taken as implying “no difference” but there is nothing to stop investigators choosing a range of three standard errors (or more) if they want to reduce the chances of a type I error.

Testing for differences of two means

To find out whether the difference in blood pressure of printers and farmers could have arisen by chance the general practitioner erects the null hypothesis that there is no significant difference between them. The question is, how many multiples of its standard error does the difference in means represent? Since the difference in means is 9 mmHg and its standard error is 0.81 mmHg, the answer is: 9/0.81 = 11.1. We usually denote the ratio of an estimate to its standard error by "z", that is, z = 11.1. Reference to table A (Appendix) shows that z is far beyond the figure of 3.291 standard deviations, representing a probability of 0.001 (or 1 in 1000). The probability of a difference of 11.1 standard errors or more occurring by chance is therefore exceedingly low, and correspondingly the null hypothesis that these two samples came from the same population of observations is exceedingly unlikely. The probability is known as the P value and may be written P<0.001.

It is worth recapping this procedure, which is at the heart of statistical inference. Suppose that we have samples from two groups of subjects, and we wish to see if they could plausibly come from the same population. The first approach would be to calculate the difference between two statistics (such as the means of the two groups) and calculate the 95% confidence interval. If the two samples were from the same population we would expect the confidence interval to include zero 95% of the time, and so if the confidence interval excludes zero we suspect that they are from a different population. The other approach is to compute the probability of getting the observed value, or one that is more extreme , if the null hypothesis were correct. This is the P value. If this is less than a specified level (usually 5%) then the result is declared significant and the null hypothesis is rejected. These two approaches, the estimation and hypothesis testing approach, are complementary. Imagine if the 95% confidence interval just captured the value zero, what would be the P value? A moment’s thought should convince one that it is 2.5%. This is known as a one sided P value , because it is the probability of getting the observed result or one bigger than it. However, the 95% confidence interval is two sided, because it excludes not only the 2.5% above the upper limit but also the 2.5% below the lower limit. To support the complementarity of the confidence interval approach and the null hypothesis testing approach, most authorities double the one sided P value to obtain a two sided P value (see below for the distinction between one sided and two sided tests).

Sometimes an investigator knows a mean from a very large number of observations and wants to compare the mean of her sample with it. We may not know the standard deviation of the large number of observations or the standard error of their mean but this need not hinder the comparison if we can assume that the standard error of the mean of the large number of observations is near zero or at least very small in relation to the standard error of the mean of the small sample.

This is because in equation 5.1 for calculating the standard error of the difference between the two means,

SE(diff) = √(SD₁²/n₁ + SD₂²/n₂),

when n₁ is very large the term SD₁²/n₁ becomes so small as to be negligible. The formula thus reduces to

SE(diff) = √(SD₂²/n₂)

which is the same as that for the standard error of the sample mean, namely

SE = SD/√n.

Consequently we find the standard error of the mean of the sample and divide it into the difference between the means.
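A minimal numerical check of this reduction; the standard deviations and group sizes below are illustrative only, chosen to make the large group dominate:

```python
import math

def se_diff(sd1, n1, sd2, n2):
    # Equation 5.1: standard error of the difference between two means.
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical figures: a huge reference group and a small sample.
sd1, n1 = 1.1, 1_000_000   # the "very large number of observations"
sd2, n2 = 1.1, 100         # the small sample

full = se_diff(sd1, n1, sd2, n2)
reduced = sd2 / math.sqrt(n2)   # standard error of the sample mean alone

print(round(full, 4), round(reduced, 4))   # both about 0.11
```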

For example, a large number of observations has shown that the mean count of erythrocytes in men is 5.5 × 10¹²/l. In a sample of 100 men a mean count of 5.35 was found with standard deviation 1.1. The standard error of this mean is 1.1/√100 = 0.11. The difference between the two means is 5.5 – 5.35 = 0.15. This difference, divided by the standard error, gives z = 0.15/0.11 = 1.36. This figure is well below the 5% level of 1.96 and in fact is below the 10% level of 1.645 (see Table A). We therefore conclude that the difference could have arisen by chance.
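The same calculation in Python, using the figures from the erythrocyte example:

```python
import math

# Figures from the erythrocyte example in the text.
pop_mean = 5.5        # known mean from a very large number of observations
sample_mean = 5.35    # mean of the sample of 100 men
sd, n = 1.1, 100

se = sd / math.sqrt(n)              # 1.1/10 = 0.11
z = (pop_mean - sample_mean) / se   # 0.15/0.11

print(round(se, 2))   # 0.11
print(round(z, 2))    # 1.36
print(z < 1.645)      # True: below even the 10% level
```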

Alternative hypothesis and type II error

It is important to realise that when we are comparing two groups a non-significant result does not mean that we have proved the two samples come from the same population – it simply means that we have failed to prove that they do not come from the same population. When planning studies it is useful to think of what differences are likely to arise between the two groups, or what would be clinically worthwhile; for example, what benefit do we expect from a new treatment in a clinical trial? This leads to a study hypothesis, which is a difference we would like to demonstrate. To contrast it with the null hypothesis, it is often called the alternative hypothesis. If we do not reject the null hypothesis when in fact there is a difference between the groups we make what is known as a type II error. The type II error rate is often denoted as β. The power of a study is defined as 1 – β and is the probability of rejecting the null hypothesis when it is false. The most common reason for type II errors is that the study is too small.
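For illustration, power can be computed under a Normal approximation. The sketch below uses the standard two sided z test power formula rather than any method given in this chapter, and the true differences supplied are hypothetical:

```python
from statistics import NormalDist

def power_z_test(true_diff, se, alpha=0.05):
    # Probability of rejecting the null hypothesis at two sided level alpha,
    # given the true difference between the groups (Normal approximation).
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    shift = true_diff / se
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

# Hypothetical scenarios: a true difference of 2 and of 4 standard errors.
print(round(power_z_test(2.0, 1.0), 2))   # about 0.52: an underpowered study
print(round(power_z_test(4.0, 1.0), 2))   # about 0.98
```

A study whose plausible true difference is only about two standard errors has barely even odds of reaching significance, which is the sense in which it is "too small".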

The concept of power is really only relevant when a study is being planned (see Chapter 13 for sample size calculations). After a study has been completed, we wish to make statements not about hypothetical alternative hypotheses but about the data, and the way to do this is with estimates and confidence intervals.(1)

Common questions

Why is the P value not the probability that the null hypothesis is true?

A moment’s reflection should convince you that the P value could not be the probability that the null hypothesis is true. Suppose we got exactly the same value for the mean in two samples (if the samples were small and the observations coarsely rounded this would not be uncommon); the difference between the means is then zero. The probability of getting the observed result (zero) or a result more extreme (a result that is either positive or negative) is unity, that is we can be certain that we must obtain a result which is positive, negative or zero. However, we can never be certain that the null hypothesis is true, especially with small samples, so clearly the statement that the P value is the probability that the null hypothesis is true is in error. We can think of it as a measure of the strength of evidence against the null hypothesis, but since it is critically dependent on the sample size we should not compare P values to argue that a difference found in one group is more “significant” than a difference found in another.

References

Gardner MJ, Altman DG, editors. Statistics with Confidence. London: BMJ Publishing Group.

Exercises

5.1 In one group of 62 patients with iron deficiency anaemia the haemoglobin level was 12.2 g/dl, standard deviation 1.8 g/dl; in another group of 35 patients it was 10.9 g/dl, standard deviation 2.1 g/dl.

Answers chapter 5 Q1.pdf

What is the difference? What is the standard error of the difference between the two means, and what is the significance of the difference? Give an approximate 95% confidence interval for the difference.

5.2 If the mean haemoglobin level in the general population is taken as 14.4 g/dl, what is the standard error of the difference between the mean of the first sample and the population mean and what is the significance of this difference?

Answers chapter 5 Q2.pdf


4. Statements of probability and confidence intervals

We have seen that when a set of observations has a Normal distribution, multiples of the standard deviation mark certain limits on the scatter of the observations. For instance, 1.96 (or approximately 2) standard deviations above and 1.96 standard deviations below the mean (±1.96 SD) mark the points within which 95% of the observations lie.

Reference ranges

We noted in Chapter 1 that 140 children had a mean urinary lead concentration of 2.18 µmol/24h, with standard deviation 0.87. The points that include 95% of the observations are 2.18 ± (1.96 × 0.87), giving a range of 0.47 to 3.89. One of the children had a urinary lead concentration of just over 4.0 µmol/24h. This observation is greater than 3.89 and so falls in the 5% beyond the 95% probability limits. We can say that the probability of such an observation occurring is 5% or less. Another way of looking at this is to see that if one chose one child at random out of the 140, the chance that their urinary lead concentration exceeded 3.89 or was less than 0.47 is 5%. This probability is usually expressed as a fraction of 1 rather than of 100, and written P<0.05.
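The reference range calculation can be reproduced directly from the figures in the text:

```python
# Figures from the text: 140 children, urinary lead (µmol/24h).
mean, sd = 2.18, 0.87

lower = mean - 1.96 * sd
upper = mean + 1.96 * sd

print(round(lower, 2), round(upper, 2))   # 0.47 3.89
```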

Standard deviations thus set limits about which probability statements can be made. Some of these are set out in Table A (Appendix table A.pdf). To use Table A to estimate the probability of finding an observed value, say a urinary lead concentration of 4.8 µmol/24h, in sampling from the same population of observations as the 140 children provided, we proceed as follows. The distance of the new observation from the mean is 4.8 – 2.18 = 2.62. How many standard deviations does this represent? Dividing the difference by the standard deviation gives 2.62/0.87 = 3.01. This number is greater than 2.576 but less than 3.291 in Table A, so the probability of finding a deviation as large as this or more extreme lies between 0.01 and 0.001, which may be expressed as 0.001<P<0.01. In fact Table A shows that the probability is very close to 0.0027. This probability is small, so the observation probably did not come from the same population as the 140 other children.
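The same look-up can be done with a Normal distribution function in place of Table A:

```python
from statistics import NormalDist

# Urinary lead observation from the text, against the 140 children.
mean, sd = 2.18, 0.87
observation = 4.8

z = (observation - mean) / sd       # about 3.01 standard deviations
p = 2 * (1 - NormalDist().cdf(z))   # two sided tail probability

print(round(z, 2))   # 3.01
print(round(p, 4))   # about 0.0026, close to the 0.0027 quoted from Table A
```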

To take another example, the mean diastolic blood pressure of printers was found to be 88 mmHg and the standard deviation 4.5 mmHg. One of the printers had a diastolic blood pressure of 100 mmHg. The mean plus or minus 1.96 times its standard deviation gives the following two figures:

88 + (1.96 x 4.5) = 96.8 mmHg

88 – (1.96 x 4.5) = 79.2 mmHg.

We can say therefore that only 1 in 20 (or 5%) of printers in the population from which the sample is drawn would be expected to have a diastolic blood pressure below 79 or above about 97 mmHg. These are the 95% limits. The 99.73% limits lie three standard deviations below and three above the mean. The blood pressure of 100 mmHg noted in one printer thus lies beyond the 95% limit of 97 but within the 99.73% limit of 101.5 (= 88 + (3 x 4.5)).

The 95% limits are often referred to as a “reference range”. For many biological variables, they define what is regarded as the normal (meaning standard or typical) range. Anything outside the range is regarded as abnormal. Given a sample of disease free subjects, an alternative method of defining a normal range would be simply to define points that exclude 2.5% of subjects at the top end and 2.5% of subjects at the lower end. This would give an empirical normal range. Thus in the 140 children we might choose to exclude the three highest and three lowest values. However, it is much more efficient to use the mean ± 2 SD, unless the data set is quite large (say >400).
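The two ways of constructing a normal range can be compared on simulated data; the sample below is randomly generated for illustration, not the actual 140 measurements:

```python
import random
import statistics

# Simulated disease free sample (hypothetical data, for illustration only).
random.seed(1)
values = [random.gauss(2.18, 0.87) for _ in range(140)]

# Parametric normal range: mean ± 2 SD.
m = statistics.mean(values)
s = statistics.stdev(values)
parametric = (m - 2 * s, m + 2 * s)

# Empirical normal range: exclude the 2.5% of subjects at each end
# (for n = 140, roughly the three highest and three lowest values).
ordered = sorted(values)
empirical = (ordered[3], ordered[-4])

print("parametric:", parametric)
print("empirical:", empirical)
```

With only 140 observations the empirical limits are noticeably less stable than the parametric ones, which is the efficiency argument made above.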

Confidence intervals

The means and their standard errors can be treated in a similar fashion. If a series of samples are drawn and the mean of each calculated, 95% of the means would be expected to fall within the range of two standard errors above and two below the mean of these means. This common mean would be expected to lie very close to the mean of the population. So the standard error of a mean provides a statement of probability about the difference between the mean of the population and the mean of the sample.

In our sample of 72 printers, the standard error of the mean was 0.53 mmHg. The sample mean plus or minus 1.96 times its standard error gives the following two figures:

88 + (1.96 x 0.53) = 89.04 mmHg

88 – (1.96 x 0.53) = 86.96 mmHg.

This is called the 95% confidence interval , and we can say that there is only a 5% chance that the range 86.96 to 89.04 mmHg excludes the mean of the population. If we take the mean plus or minus three times its standard error, the range would be 86.41 to 89.59. This is the 99.73% confidence interval, and the chance of this range excluding the population mean is 1 in 370. Confidence intervals provide the key to a useful device for arguing from a sample back to the population from which it came.
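These intervals can be reproduced directly from the figures in the text:

```python
# Figures from the text: mean diastolic blood pressure of 72 printers.
mean, se = 88.0, 0.53

ci_95 = (mean - 1.96 * se, mean + 1.96 * se)
ci_9973 = (mean - 3 * se, mean + 3 * se)

print(round(ci_95[0], 2), round(ci_95[1], 2))       # 86.96 89.04
print(round(ci_9973[0], 2), round(ci_9973[1], 2))   # 86.41 89.59
```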

The standard error for the percentage of male patients with appendicitis, described in Chapter 3, was 4.46. This is also the standard error of the percentage of female patients with appendicitis, since the formula remains the same if p is replaced by 100 – p. With this standard error we can get 95% confidence intervals on the two percentages:

60.8 ± (1.96 x 4.46) = 52.1 and 69.5

39.2 ± (1.96 x 4.46) = 30.5 and 47.9.

These confidence intervals exclude 50%. Can we conclude that males are more likely to get appendicitis? This is the subject of the rest of the book, namely inference.
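The two intervals can be reproduced from the standard error quoted in the text:

```python
# Figures from the text: percentages of male and female appendicitis
# patients, and the shared standard error of the percentage.
p_male = 60.8
p_female = 100 - p_male   # 39.2
se = 4.46

male_ci = (p_male - 1.96 * se, p_male + 1.96 * se)
female_ci = (p_female - 1.96 * se, p_female + 1.96 * se)

print(round(male_ci[0], 1), round(male_ci[1], 1))       # 52.1 69.5
print(round(female_ci[0], 1), round(female_ci[1], 1))   # 30.5 47.9
```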

With small samples – say under 30 observations – larger multiples of the standard error are needed to set confidence limits. This subject is discussed under the t distribution (Chapter 7).

There is much confusion over the interpretation of the probability attached to confidence intervals. To understand it we have to resort to the concept of repeated sampling. Imagine taking repeated samples of the same size from the same population. For each sample calculate a 95% confidence interval. Since the samples are different, so are the confidence intervals. We know that 95% of these intervals will include the population parameter. However, without any additional information we cannot say which ones! Thus with only one sample, and no other information about the population parameter, we can say there is a 95% chance of including the parameter in our interval. Note that this does not mean that we would expect with 95% probability that the mean from another sample is in this interval. In this case we are considering differences between two sample means, which is the subject of the next chapter.
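The repeated sampling interpretation can be illustrated by simulation; the population parameters below are hypothetical, loosely based on the printers example:

```python
import random
import statistics

# Repeated sampling: how often does a 95% confidence interval
# include the (here known) population mean?
random.seed(2)
TRUE_MEAN, TRUE_SD, N = 88.0, 4.5, 72   # hypothetical population

trials, covered = 2000, 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    if m - 1.96 * se <= TRUE_MEAN <= m + 1.96 * se:
        covered += 1

print(covered / trials)   # close to 0.95
```

Each interval either does or does not contain the population mean; the 95% refers to the long run proportion of intervals that do.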

Common questions

What is the difference between a reference range and a confidence interval?

There is precisely the same relationship between a reference range and a confidence interval as between the standard deviation and the standard error. The reference range refers to individuals and the confidence interval to estimates. It is important to realise that samples are not unique. Different investigators taking samples from the same population will obtain different estimates, and have different 95% confidence intervals. However, we know that for 95 of every 100 investigators the confidence interval will include the population mean.

When should one quote a confidence interval?

There is now a great emphasis on confidence intervals in the literature, and some authors attach them to every estimate they make. In general, unless the main purpose of a study is to actually estimate a mean or a percentage, confidence intervals are best restricted to the main outcome of a study, which is usually a contrast (that is, a difference) between means or percentages. This is the topic for the next two chapters.

Exercises

4.1 A count of malaria parasites in 100 fields with a 2 mm oil immersion lens gave a mean of 35 parasites per field, standard deviation 11.6 (note that, although the counts are quantitative discrete, the counts can be assumed to follow a Normal distribution because the average is large). On counting one more field the pathologist found 52 parasites. Does this number lie outside the 95% reference range? What is the reference range?

Answers chapter4 Q1.pdf

4.2 What is the 95% confidence interval for the mean of the population from which this sample count of parasites was drawn?

Answers chapter4 Q2.pdf