
3. Populations and samples

Populations

In statistics the term “population” has a slightly different meaning from the one given to it in ordinary speech. It need not refer only to people or to animate creatures – the population of Britain, for instance, or the dog population of London. Statisticians also speak of a population of objects, or events, or procedures, or observations, including such things as the quantity of lead in urine, visits to the doctor, or surgical operations. A population is thus an aggregate of creatures, things, cases and so on.

Although a statistician should clearly define the population he or she is dealing with, they may not be able to enumerate it exactly. For instance, in ordinary usage the population of England denotes the number of people within England’s boundaries, perhaps as enumerated at a census. But a physician might embark on a study to try to answer the question “What is the average systolic blood pressure of Englishmen aged 40-59?” Who are the “Englishmen” referred to here? Not all Englishmen live in England, and the social and genetic background of those that do may vary. A surgeon may study the effects of two alternative operations for gastric ulcer. But how old are the patients? What sex are they? How severe is their disease? Where do they live? And so on. The reader needs precise information on such matters to draw valid inferences from the sample that was studied to the population being considered.

Statistics such as averages and standard deviations, when taken from populations, are referred to as population parameters. They are often denoted by Greek letters: the population mean is denoted by μ (mu) and the population standard deviation by σ (lower case sigma).

Samples

A population commonly contains too many individuals to study conveniently, so an investigation is often restricted to one or more samples drawn from it. A well chosen sample will contain most of the information about a particular population parameter but the relation between the sample and the population must be such as to allow true inferences to be made about a population from that sample.

Consequently, the first important attribute of a sample is that every individual in the population from which it is drawn must have a known non-zero chance of being included in it; a natural suggestion is that these chances should be equal. We would also like the choices to be made independently; in other words, the choice of one subject should not affect the chance of other subjects being chosen. To ensure this we make the choice by means of a process in which chance alone operates, such as tossing a coin or, more usually, using a table of random numbers. A limited table is given in Table F (Appendix), and more extensive ones have been published.(1-4) A sample so chosen is called a random sample. The word “random” does not describe the sample as such but the way in which it is selected.

To draw a satisfactory sample sometimes presents greater problems than to analyse statistically the observations made on it. A full discussion of the topic is beyond the scope of this book, but guidance is readily available(1)(2). In this book only an introduction is offered.

Before drawing a sample the investigator should define the population from which it is to come. Sometimes he or she can completely enumerate its members before beginning analysis – for example, all the livers studied at necropsy over the previous year, or all the patients aged 20-44 admitted to hospital with perforated peptic ulcer in the previous 20 months. In retrospective studies of this kind numbers can be allotted serially from any point in the table to each patient or specimen.

Suppose we have a population of size 150, and we wish to take a sample of size five. Table F (Appendix) contains a set of computer generated random digits arranged in groups of five. Choose any row and column, say the last column of five digits. Read only the first three digits, and go down the column starting with the first row. Thus we have 265, 881, 722, etc. If a number lies between 001 and 150 then we include it in our sample. Thus, in order, in the sample will be subjects numbered 24, 59, 107, 73, and 65. If necessary we can carry on down the next column to the left until the full sample is chosen.
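The table-reading rule above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the list of three-digit numbers is a hypothetical, abbreviated column that contains just the values quoted in the text (the rejected numbers that lie between them in Table F are not reproduced here). Skipping a number that has already been chosen is also an assumption; the text does not say how repeats are handled.

```python
def sample_from_table(three_digit_numbers, population_size, sample_size):
    """Walk down a column of random numbers, keeping those that identify
    a member of the population (1..population_size) not already chosen."""
    chosen = []
    for number in three_digit_numbers:
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
            if len(chosen) == sample_size:
                break
    return chosen


# Hypothetical abbreviated column: the first three values are rejected
# because they exceed 150, the rest are the subjects named in the text.
column = [265, 881, 722, 24, 59, 107, 73, 65]
print(sample_from_table(column, population_size=150, sample_size=5))
# -> [24, 59, 107, 73, 65]
```

With software to hand, the standard library call `random.sample(range(1, 151), 5)` draws five distinct subject numbers directly, without a printed table.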

The use of random numbers in this way is generally preferable to taking every alternate patient or every fifth specimen, or acting on some other such regular plan. The regularity of the plan can occasionally coincide by chance with some unforeseen regularity in the presentation of the material for study – for example, hospital appointments being made for patients from certain practices on certain days of the week, or specimens being prepared in batches in accordance with some schedule.

As susceptibility to disease generally varies in relation to age, sex, occupation, family history, exposure to risk, inoculation state, country lived in or visited, and many other genetic or environmental factors, it is advisable to examine samples when drawn to see whether they are, on average, comparable in these respects. The random process of selection is intended to make them so, but sometimes it can by chance lead to disparities. To guard against this possibility the sampling may be stratified. This means that a framework is laid down initially, and the patients or objects of the study in a random sample are then allotted to the compartments of the framework. For instance, the framework might have a primary division into males and females and then a secondary division of each of those categories into five age groups, the result being a framework with ten compartments. It is then important to bear in mind that the distributions of the categories in two samples made up on such a framework may be truly comparable, but they will not reflect the distribution of these categories in the population from which the sample is drawn unless the compartments in the framework have been designed with that in mind. For instance, equal numbers might be admitted to the male and female categories, but males and females are not equally numerous in the general population, and their relative proportions vary with age. This is known as stratified random sampling.

For taking a sample from a long list a compromise between strict theory and practicalities is known as a systematic random sample. In this case we choose subjects a fixed interval apart on the list, say every tenth subject, but we choose the starting point within the first interval at random.
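Both schemes are easy to sketch in Python. The function names and the toy list of subjects below are assumptions made for illustration. The stratified version shown here draws a fixed number at random from each compartment, which is one common way of filling such a framework rather than the only one.

```python
import random


def systematic_sample(items, interval, rng=random):
    """Every `interval`-th item, starting at a random point in the first interval."""
    start = rng.randrange(interval)
    return items[start::interval]


def stratified_sample(items, stratum_of, per_stratum, rng=random):
    """Group items into strata, then draw a simple random sample within each."""
    strata = {}
    for item in items:
        strata.setdefault(stratum_of(item), []).append(item)
    return {name: rng.sample(members, min(per_stratum, len(members)))
            for name, members in strata.items()}


rng = random.Random(1)                      # seeded so the run is repeatable
patient_list = list(range(1, 101))          # hypothetical list of 100 patients
print(systematic_sample(patient_list, 10, rng))

# Hypothetical subjects tagged by sex; equal numbers are drawn from each stratum.
subjects = [("M", i) for i in range(30)] + [("F", i) for i in range(50)]
print(stratified_sample(subjects, lambda s: s[0], 5, rng))
```

Note that drawing equal numbers from each stratum, as here, gives comparable compartments but does not mirror the population proportions, which is the caution raised in the text.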

Unbiasedness and precision

The terms unbiased and precise have acquired special meanings in statistics. When we say that a measurement is unbiased we mean that the average of a large set of such measurements will be close to the true value. When we say it is precise we mean that it is repeatable. Repeated measurements will be close to one another, but not necessarily close to the true value. We would like a measurement that is both unbiased and precise. Some authors equate unbiasedness with accuracy, but this is not universal and others use the term accuracy to mean a measurement that is both unbiased and precise. Strike (5) gives a good discussion of the problem.

An estimate of a parameter taken from a random sample is known to be unbiased. As the sample size increases, it gets more precise.

Randomisation

Another use of random number tables is to randomise the allocation of treatments to patients in a clinical trial. This ensures that there is no bias in treatment allocation and, in the long run, the subjects in each treatment group are comparable in both known and unknown prognostic factors. A common method is to use blocked randomisation. This is to ensure that at regular intervals there are equal numbers in the two groups. Usual sizes for blocks are two, four, six, eight, and ten. Suppose we chose a block size of ten. A simple method using Table F (Appendix) is to choose the first five unique digits in any row. If we chose the first row, the first five unique digits are 3, 5, 6, 8, and 4. Thus we would allocate the third, fourth, fifth, sixth, and eighth subjects to one treatment and the first, second, seventh, ninth, and tenth to the other. If the block size was less than ten we would ignore digits bigger than the block size. To allocate further subjects to treatment, we carry on along the same row, choosing the next five unique digits for the first treatment. In randomised controlled trials it is advisable to change the block size from time to time to make it more difficult to guess what the next treatment is going to be.
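The table-based method for one block can be written out as follows. This is a sketch under stated assumptions: the function name is invented, the row of digits is hypothetical (chosen only so that its first five unique digits are 3, 5, 6, 8, and 4, as in the text), and the digit 0 is treated as standing for the tenth subject, which the text does not state explicitly.

```python
def allocate_block(digits, block_size=10):
    """Allocate one block: positions named by the first block_size/2 unique
    usable digits get treatment 'A', the remaining positions get 'B'."""
    half = block_size // 2
    chosen = []
    for d in digits:
        position = 10 if d == 0 else d      # assumption: 0 stands for subject 10
        if position <= block_size and position not in chosen:
            chosen.append(position)
            if len(chosen) == half:
                break
    if len(chosen) < half:
        raise ValueError("not enough digits to fill the block")
    return ["A" if pos in chosen else "B" for pos in range(1, block_size + 1)]


row = [3, 5, 3, 6, 8, 4, 1, 2]              # hypothetical row of random digits
print("".join(allocate_block(row)))
# -> BBAAAABABB  (subjects 3, 4, 5, 6 and 8 receive treatment A)
```

For a smaller block, digits larger than the block size are skipped, as the text describes; for example `allocate_block(row, block_size=4)` uses only the digits 1 to 4.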

It is important to realise that patients in a randomised trial are not a random sample from the population of people with the disease in question but rather a highly selected set of eligible and willing patients. However, randomisation ensures that in the long run any differences in outcome in the two treatment groups are due solely to differences in treatment.

Variation between samples

Even if we ensure that every member of a population has a known, and usually an equal, chance of being included in a sample, it does not follow that a series of samples drawn from one population and fulfilling this criterion will be identical. They will show chance variations from one to another, and the variation may be slight or considerable. For example, a series of samples of the body temperature of healthy people would show very little variation from one to another, but the variation between samples of the systolic blood pressure would be considerable. Thus the variation between samples depends partly on the amount of variation in the population from which they are drawn.

Furthermore, it is a matter of common observation that a small sample is a much less certain guide to the population from which it was drawn than a large sample. In other words, the more members of a population that are included in a sample the more chance will that sample have of accurately representing the population, provided a random process is used to construct the sample. A consequence of this is that, if two or more samples are drawn from a population, the larger they are the more likely they are to resemble each other – again provided that the random technique is followed. Thus the variation between samples depends partly also on the size of the sample. Usually, however, we are not in a position to take a random sample; our sample is simply those subjects available for study. This is a “convenience” sample. For valid generalisations to be made we would like to assert that our sample is in some way representative of the population as a whole and for this reason the first stage in a report is to describe the sample, say by age, sex, and disease status, so that other readers can decide if it is representative of the type of patients they encounter.

Standard error of the mean

If we draw a series of samples and calculate the mean of the observations in each, we have a series of means. These means generally conform to a Normal distribution, and they often do so even if the observations from which they were obtained do not (see Exercise 3.3). This can be proven mathematically and is known as the “Central Limit Theorem”. The series of means, like the series of observations in each sample, has a standard deviation. The standard error of the mean of one sample is an estimate of the standard deviation that would be obtained from the means of a large number of samples drawn from that population.

As noted above, if random samples are drawn from a population their means will vary from one to another. The variation depends on the variation of the population and the size of the sample. We do not know the variation in the population so we use the variation in the sample as an estimate of it. This is expressed in the standard deviation. If we now divide the standard deviation by the square root of the number of observations in the sample we have an estimate of the standard error of the mean, $SEM = SD/\sqrt{n}$. It is important to realise that we do not have to take repeated samples in order to estimate the standard error; there is sufficient information within a single sample. However, the conception is that if we were to take repeated random samples from the population, this is how we would expect the mean to vary, purely by chance.
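A small simulation may make the idea concrete. The sketch below is illustrative only: the population (an exponential distribution with mean 1 and standard deviation 1, deliberately skewed), the sample size of 25, and the function names are all assumptions. It compares the standard error estimated from one sample with the standard deviation of the means of many repeated samples, which theory puts at 1/√25 = 0.2.

```python
import math
import random
import statistics


def standard_error(sample):
    """Estimate of the standard error of the mean from a single sample."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def sd_of_sample_means(draw, n, repeats, rng):
    """Standard deviation of the means of `repeats` samples of size n."""
    means = [statistics.fmean([draw(rng) for _ in range(n)])
             for _ in range(repeats)]
    return statistics.stdev(means)


def draw(rng):
    # One observation from the assumed population (exponential, mean 1, SD 1).
    return rng.expovariate(1.0)


rng = random.Random(2024)                   # seeded so the run is repeatable
n = 25
one_sample = [draw(rng) for _ in range(n)]
print("SE estimated from one sample:", round(standard_error(one_sample), 3))
print("SD of 4000 sample means:     ", round(sd_of_sample_means(draw, n, 4000, rng), 3))
print("Theoretical value:           ", 1 / math.sqrt(n))
```

The single-sample estimate varies from run to run because it relies on the sample standard deviation, whereas the standard deviation of the many means should sit close to 0.2.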

A general practitioner in Yorkshire has a practice which includes part of a town with a large printing works and some of the adjacent sheep farming country. With her patients’ informed consent she has been investigating whether the diastolic blood pressure of men aged 20-44 differs between the printers and the farm workers. For this purpose she has obtained a random sample of 72 printers and 48 farm workers and calculated the mean and standard deviations, as shown in Table 3.1.

To calculate the standard errors of the two mean blood pressures the standard deviation of each sample is divided by the square root of the number of the observations in the sample.

These standard errors may be used to study the significance of the difference between the two means, as described in subsequent chapters.

Table 3.1

Standard error of a proportion or a percentage

Just as we can calculate a standard error associated with a mean so we can also calculate a standard error associated with a percentage or a proportion. Here the size of the sample will affect the size of the standard error but the amount of variation is determined by the value of the percentage or proportion in the population itself, and so we do not need an estimate of the standard deviation. For example, a senior surgical registrar in a large hospital is investigating acute appendicitis in people aged 65 and over. As a preliminary study he examines the hospital case notes over the previous 10 years and finds that of 120 patients in this age group with a diagnosis confirmed at operation 73 (60.8%) were women and 47 (39.2%) were men.

If p represents one percentage, 100 − p represents the other. Then the standard error of each of these percentages is obtained by (1) multiplying them together, (2) dividing the product by the number in the sample, and (3) taking the square root:

$$SE\ \text{percentage} = \sqrt{\frac{p\,(100 - p)}{n}}$$

which for the appendicitis data given above is as follows:

$$SE\ \text{percentage} = \sqrt{\frac{60.8 \times 39.2}{120}} = 4.46$$
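The three steps translate directly into code. A minimal sketch, with an assumed function name, applied to the appendicitis figures quoted above (73 women among 120 patients):

```python
import math


def se_percentage(p, n):
    """Standard error of a percentage p (0-100) observed in a sample of size n."""
    return math.sqrt(p * (100 - p) / n)


women = 100 * 73 / 120                      # percentage of women, about 60.8
print(round(women, 1), round(se_percentage(women, 120), 2))
# -> 60.8 4.46
```

Because p and 100 − p enter the formula symmetrically, the same standard error applies to the percentage of men.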

Problems with non-random samples

In general we do not have the luxury of a random sample; we have to make do with what is available, a “convenience sample”. In order to be able to make generalisations we should investigate whether biases could have crept in, which mean that the patients available are not typical. Common biases are:

  • hospital patients are not the same as ones seen in the community;
  • volunteers are not typical of non-volunteers;
  • patients who return questionnaires are different from those who do not.

In order to persuade the reader that the patients included are typical it is important to give as much detail as possible at the beginning of a report of the selection process and some demographic data such as age, sex, social class and response rate.

Common questions

Given measurements on a sample, what is the difference between a standard deviation and a standard error?

A standard deviation is a sample estimate of the population parameter; that is, it is an estimate of the variability of the observations. Since the population is unique, it has a unique standard deviation, which may be large or small depending on how variable the observations are. We would not expect the sample standard deviation to get smaller because the sample gets larger. However, a large sample would provide a more precise estimate of the population standard deviation than a small sample.

A standard error, on the other hand, is a measure of precision of an estimate of a population parameter. A standard error is always attached to a parameter, and one can have standard errors of any estimate, such as mean, median, fifth centile, even the standard error of the standard deviation. Since one would expect the precision of the estimate to increase with the sample size, the standard error of an estimate will decrease as the sample size increases.

When should I use a standard deviation to describe data and when should I use a standard error?

It is a common mistake to try and use the standard error to describe data. Usually it is done because the standard error is smaller, and so the study appears more precise. If the purpose is to describe the data (for example so that one can see if the patients are typical) and if the data are plausibly Normal, then one should use the standard deviation (mnemonic D for Description and D for Deviation). If the purpose is to describe the outcome of a study, for example to estimate the prevalence of a disease, or the mean height of a group, then one should use a standard error (or, better, a confidence interval; see Chapter 4) (mnemonic E for Estimate and E for Error).

References

  1. Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall, 1991.
  2. Armitage P, Berry G. Statistical Methods in Medical Research. Oxford: Blackwell Scientific Publications, 1994.
  3. Campbell MJ, Machin D. Medical Statistics: A Commonsense Approach. 2nd ed. Chichester: John Wiley, 1993.
  4. Fisher RA, Yates F. Statistical Tables for Biological, Agricultural and Medical Research, 6th ed. London: Longman, 1974.
  5. Strike PW. Measurement and control. Statistical Methods in Laboratory Medicine. Oxford: Butterworth-Heinemann, 1991:255.

Exercises

Exercise 3.1

The mean urinary lead concentration in 140 children was 2.18 µmol/24 h, with standard deviation 0.87 µmol/24 h. What is the standard error of the mean?


Exercise 3.2

In Table F (Appendix), what is the distribution of the digits, and what are the mean and standard deviation?


Exercise 3.3

For the first column of five digits in Table F take the mean value of the five digits and do this for all rows of five digits in the column.

What would you expect a histogram of the means to look like?

What would you expect the mean and standard deviation to be?



2. Mean and standard deviation

 


The median is known as a measure of location; that is, it tells us where the data are. As stated earlier, we do not need to know all the exact values to calculate the median; if we made the smallest value even smaller or the largest value even larger, it would not change the value of the median. Thus the median does not use all the information in the data and so it can be shown to be less efficient than the mean or average, which does use all values of the data. To calculate the mean we add up the observed values and divide by the number of them. The total of the values obtained in Table 1.1 was 22.5, which was divided by their number, 15, to give a mean of 1.5. This familiar process is conveniently expressed by the following symbols:

$$\bar{x} = \frac{\sum x}{n}$$

$\bar{x}$ (pronounced “x bar”) signifies the mean; $x$ is each of the values of urinary lead; $n$ is the number of these values; and $\Sigma$, the Greek capital sigma (our “S”), denotes “sum of”. A major disadvantage of the mean is that it is sensitive to outlying points. For example, replacing 2.2 by 22 in Table 1.1 increases the mean to 2.82, whereas the median will be unchanged.
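The sensitivity of the mean to an outlier is easy to check numerically. The 15 readings below are an assumption: Table 1.1 is not reproduced in this text, and these values were chosen because they agree with the figures quoted for it (a total of 22.5, a mean of 1.5, and a reading of 2.2).

```python
import statistics

# Assumed urinary lead readings, consistent with the totals quoted in the text.
readings = [0.1, 0.4, 0.6, 0.8, 1.1, 1.2, 1.3, 1.5,
            1.7, 1.9, 1.9, 2.0, 2.2, 2.6, 3.2]

# Replace the reading 2.2 by 22, as in the example in the text.
with_outlier = [22 if x == 2.2 else x for x in readings]

for label, data in (("original", readings), ("with outlier", with_outlier)):
    print(label,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data))
```

The mean moves from 1.5 to 2.82 while the median stays at 1.5, because the median depends only on the middle-ranked value.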

As well as measures of location we need measures of how variable the data are. We met two of these measures, the range and interquartile range, in Chapter 1.

The range is an important measurement, for figures at the top and bottom of it denote the findings furthest removed from the generality. However, they do not give much indication of the spread of observations about the mean. This is where the standard deviation (SD) comes in.

The theoretical basis of the standard deviation is complex and need not trouble the ordinary user. We will discuss sampling and populations in Chapter 3. A practical point to note here is that, when the population from which the data arise has a distribution that is approximately “Normal” (or Gaussian), then the standard deviation provides a useful basis for interpreting the data in terms of probability.

The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population. The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population. However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.

Many biological characteristics conform to a Normal distribution closely enough for it to be commonly used – for example, heights of adult men and women, blood pressures in a healthy population, random errors in many types of laboratory measurements and biochemical data. Figure 2.1 shows a Normal curve calculated from the diastolic blood pressures of 500 men, mean 82 mmHg, standard deviation 10 mmHg. The ranges representing ±1 SD, ±2 SD, and ±3 SD about the mean are marked. A more extensive set of values is given in Table A of the print edition.

Figure 2.1


The reason why the standard deviation is such a useful measure of the scatter of the observations is this: if the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it ($\bar{x} \pm 1\,SD$) includes about 68% of the observations; a range of two standard deviations above and two below ($\bar{x} \pm 2\,SD$) about 95% of the observations; and of three standard deviations above and three below ($\bar{x} \pm 3\,SD$) about 99.7% of the observations. Consequently, if we know the mean and standard deviation of a set of observations, we can obtain some useful information by simple arithmetic. By putting one, two, or three standard deviations above and below the mean we can estimate the ranges that would be expected to include about 68%, 95%, and 99.7% of the observations.
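These percentages can be checked against the Normal distribution itself. The sketch below uses `statistics.NormalDist` from the Python standard library with the mean (82 mmHg) and standard deviation (10 mmHg) quoted for Figure 2.1; the helper name `coverage` is an assumption.

```python
from statistics import NormalDist


def coverage(dist, k):
    """Proportion of a Normal distribution within k SDs of its mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)


blood_pressure = NormalDist(mu=82, sigma=10)
for k in (1, 2, 3):
    low, high = 82 - 10 * k, 82 + 10 * k
    print(f"{low} to {high} mmHg: {100 * coverage(blood_pressure, k):.1f}%")
```

The proportions do not depend on the particular mean and standard deviation chosen, only on the number of standard deviations either side of the mean.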

Standard deviation from ungrouped data

The standard deviation is a summary measure of the differences of each observation from the mean. If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero. Consequently the squares of the differences are added. The sum of the squares is then divided by the number of observations minus one to give the mean of the squares, and the square root is taken to bring the measurements back to the units we started with. (The division by the number of observations minus one instead of the number of observations itself to obtain the mean square is because “degrees of freedom” must be used. In these circumstances they are one less than the total. The theoretical justification for this need not trouble the user in practice.)

To gain an intuitive feel for degrees of freedom, consider choosing a chocolate from a box of n chocolates. Every time we come to choose a chocolate we have a choice, until we come to the last one (normally one with a nut in it!), and then we have no choice. Thus we have n-1 choices, or “degrees of freedom”.

The calculation of the variance is illustrated in Table 2.1 with the 15 readings in the preliminary study of urinary lead concentrations (Table 1.2). The readings are set out in column (1). In column (2) the difference between each reading and the mean is recorded. The sum of the differences is 0. In column (3) the differences are squared, and the sum of those squares is given at the bottom of the column.

Table 2.1


The sum of the squares of the differences (or deviations) from the mean, 9.96, is now divided by the total number of observations minus one, to give the variance. Thus,

$$\text{variance} = \frac{\sum (x - \bar{x})^2}{n - 1}$$

In this case we find:

$$\text{variance} = \frac{9.96}{14} = 0.7114$$

Finally, the square root of the variance provides the standard deviation:

$$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

from which we get $SD = \sqrt{0.7114} = 0.843$.
This procedure illustrates the structure of the standard deviation, in particular that the two extreme values 0.1 and 3.2 contribute most to the sum of the differences squared.
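The column-by-column calculation can be followed in code. As Table 2.1 is not reproduced in this text, the 15 readings below are an assumption, chosen because they agree with the quoted totals (sum 22.5, sum of squared deviations 9.96) and with the extreme values 0.1 and 3.2.

```python
import math
import statistics

# Assumed readings for column (1), consistent with the totals in the text.
readings = [0.1, 0.4, 0.6, 0.8, 1.1, 1.2, 1.3, 1.5,
            1.7, 1.9, 1.9, 2.0, 2.2, 2.6, 3.2]

n = len(readings)
mean = sum(readings) / n
deviations = [x - mean for x in readings]          # column (2)
squared = [d * d for d in deviations]              # column (3)

sum_of_squares = sum(squared)
variance = sum_of_squares / (n - 1)                # divide by n - 1, not n
sd = math.sqrt(variance)

print("sum of deviations:", round(sum(deviations), 10))
print("sum of squares:   ", round(sum_of_squares, 2))
print("variance:         ", round(variance, 4))
print("SD:               ", round(sd, 3))
print("statistics.stdev: ", round(statistics.stdev(readings), 3))
```

The last line uses the library routine, which also divides by n − 1, as a cross-check on the hand calculation.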

Calculator procedure

Most inexpensive calculators have procedures that enable one to calculate the mean and standard deviations directly, using the “SD” mode. For example, on modern Casio calculators one presses SHIFT and ‘.’ and a little “SD” symbol should appear on the display. On earlier Casios one presses INV and MODE, whereas on a Sharp 2nd F and Stat should be used. The data are stored via the M+ button. Thus, having set the calculator into the “SD” or “Stat” mode, from Table 2.1 we enter 0.1 M+, 0.4 M+, etc. When all the data are entered, we can check that the correct number of observations have been included by Shift and n, and “15” should be displayed. The mean is displayed by Shift and $\bar{x}$ and the standard deviation by Shift and $\sigma_{n-1}$. Avoid pressing Shift and AC between these operations as this clears the statistical memory.

There is another button, $\sigma_n$, on many calculators. This uses the divisor $n$ rather than $n - 1$ in the calculation of the standard deviation. On a Sharp calculator $\sigma_n$ is denoted $\sigma$, whereas $\sigma_{n-1}$ is denoted $s$. The values with divisor $n$ are the “population” values, and are derived assuming that an entire population is available or that interest focuses solely on the data in hand, and the results are not going to be generalised (see Chapter 3 for details of samples and populations). As this situation very rarely arises, $\sigma_{n-1}$ should be used and $\sigma_n$ ignored, although even for moderate sample sizes the difference is going to be small.

Remember to return to normal mode before resuming calculations because many of the usual functions are not available in “Stat” mode. On a modern Casio this is Shift 0. On earlier Casios and on Sharps one repeats the sequence that calls up the “Stat” mode. Some calculators stay in “Stat” mode even when switched off. Mullee (1) provides advice on choosing and using a calculator.

The calculator formulas use the relationship

$$\frac{1}{n}\sum (x - \bar{x})^2 = \frac{1}{n}\sum x^2 - \bar{x}^2$$

The right hand expression can be easily memorised by the expression “mean of the squares minus the mean square”. The sample variance $s^2$ is obtained from

$$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$

The above equation can be seen to be true in Table 2.1, where the sum of the squares of the observations, $\sum x^2$, is given as 43.71.

We thus obtain

$$43.71 - 15 \times 1.5^2 = 9.96$$

the same value given for the total in column (3). Care should be taken because this formula involves subtracting two large numbers to get a small one, and can lead to incorrect results if the numbers are very large. For example, try finding the standard deviation of 100001, 100002, 100003 on a calculator. The correct answer is 1, but many calculators will give 0 because of rounding error. The solution is to subtract a large number from each of the observations (say 100000) and calculate the standard deviation on the remainders, namely 1, 2 and 3.
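The rounding problem can be reproduced by imitating a calculator that keeps a fixed number of significant digits. The sketch below uses Python's `decimal` module for this; the eight-digit precision and the function name are assumptions made for illustration, not a description of any particular calculator.

```python
from decimal import Decimal, localcontext


def calculator_sd(values, digits=8):
    """SD from sum(x^2) - (sum x)^2 / n, rounding every intermediate
    result to `digits` significant figures, as a simple calculator might."""
    with localcontext() as ctx:
        ctx.prec = digits
        xs = [Decimal(v) for v in values]
        n = len(xs)
        sum_x = sum(xs, Decimal(0))
        sum_x2 = sum((x * x for x in xs), Decimal(0))
        sum_of_squares = sum_x2 - sum_x * sum_x / n   # two large, nearly equal numbers
        return float((sum_of_squares / (n - 1)).sqrt())


values = [100001, 100002, 100003]
print(calculator_sd(values))                          # -> 0.0 (wrong)
print(calculator_sd([v - 100000 for v in values]))    # -> 1.0 (correct)
```

With eight digits each square is rounded before the subtraction, so the small difference that carries all the information is lost; subtracting 100000 first keeps every intermediate value short enough to be held exactly.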

Standard deviation from grouped data

We can also calculate a standard deviation for discrete quantitative variables. For example, in addition to studying the lead concentration in the urine of 140 children, the paediatrician asked how often each of them had been examined by a doctor during the year. After collecting the information he tabulated the data shown in Table 2.2 columns (1) and (2). The mean is calculated by multiplying column (1) by column (2), adding the products, and dividing by the total number of observations.

Table 2.2

As we did for continuous data, to calculate the standard deviation we square each of the observations in turn. In this case the observation is the number of visits, but because we have several children in each class, shown in column (2), each squared number (column (4)) must be multiplied by the number of children. The sum of squares is given at the foot of column (5), namely 1697. We then use the calculator formula to find the variance:

$$s^2 = \frac{\sum f x^2 - n\bar{x}^2}{n - 1} \quad \text{and} \quad SD = \sqrt{s^2}$$

where $x$ is the number of visits, $f$ is the number of children with that number of visits, $\sum f x^2 = 1697$ and $n = 140$.

Note that although the number of visits is not Normally distributed, the distribution is reasonably symmetrical about the mean. The approximate 95% range is given by $\bar{x} \pm 2\,SD$. This excludes two children with no visits and six children with six or more visits. Thus there are eight of 140 = 5.7% outside the theoretical 95% range.

Note that it is common for discrete quantitative variables to have what are known as skewed distributions, that is, they are not symmetrical. One clue to lack of symmetry from derived statistics is when the mean and the median differ considerably. Another is when the standard deviation is of the same order of magnitude as the mean, but the observations must be non-negative. Sometimes a transformation will convert a skewed distribution into a symmetrical one. When the data are counts, such as number of visits to a doctor, often the square root transformation will help, and if there are no zero or negative values a logarithmic transformation will render the distribution more symmetrical.
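The grouped calculation can be sketched as a short function. The frequency table used below is hypothetical and much smaller than Table 2.2 (which is not reproduced in this text); it is there only to show the arithmetic, and the function name is likewise an assumption.

```python
import math


def grouped_mean_sd(values, counts):
    """Mean and SD of a discrete variable given as values with frequencies."""
    n = sum(counts)
    sum_fx = sum(f * x for x, f in zip(values, counts))        # products of columns (1) and (2)
    sum_fx2 = sum(f * x * x for x, f in zip(values, counts))   # squares weighted by frequency
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)           # calculator formula
    return mean, math.sqrt(variance)


visits = [0, 1, 2, 3, 4]          # hypothetical number of visits
children = [3, 7, 10, 7, 3]       # hypothetical number of children in each class
mean, sd = grouped_mean_sd(visits, children)
print(round(mean, 2), round(sd, 3))
print("approximate 95% range:", round(mean - 2 * sd, 2), "to", round(mean + 2 * sd, 2))
```

Expanding the table into one value per child and calling `statistics.stdev` on the result gives the same standard deviation, which is a useful check on the weighting.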

Data transformation

An anaesthetist measures the pain of a procedure using a 100 mm visual analogue scale on seven patients. The results are given in Table 2.3, together with the $\log_e$ transformation (the ln button on a calculator).

Table 2.3

The data are plotted in Figure 2.2, which shows that the outlier does not appear so extreme in the logged data. The mean and median are 10.29 and 3, respectively, for the original data, with a standard deviation of 20.22. Where the mean is bigger than the median, the distribution is positively skewed. For the logged data the mean and median are 1.24 and 1.10 respectively, indicating that the logged data have a more symmetrical distribution. Thus it would be better to analyse the log transformed data in statistical tests than to use the original scale.

Figure 2.2

In reporting these results, the median of the raw data would be given, but it should be explained that the statistical test was carried out on the transformed data. Note that the median of the logged data is the same as the log of the median of the raw data – however, this is not true for the mean. The mean of the logged data is not necessarily equal to the log of the mean of the raw data. The antilog (exp or $e^x$ on a calculator) of the mean of the logged data is known as the geometric mean, and is often a better summary statistic than the mean for data from positively skewed distributions. For these data the geometric mean is 3.45 mm.
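The effect of the transformation can be explored in a few lines. Table 2.3 is not reproduced in this text, so the seven readings below are an assumption: they were chosen because they reproduce the quoted mean of 10.29 mm and the logged mean and median of 1.24 and 1.10.

```python
import math
import statistics

pain_mm = [1, 1, 2, 3, 3, 6, 56]                 # assumed readings, with one outlier
logged = [math.log(x) for x in pain_mm]          # natural logarithm (the ln button)

print("raw:    mean", round(statistics.mean(pain_mm), 2),
      "median", statistics.median(pain_mm),
      "SD", round(statistics.stdev(pain_mm), 2))
print("logged: mean", round(statistics.mean(logged), 2),
      "median", round(statistics.median(logged), 2))

# The geometric mean is the antilog of the mean of the logged data.
geometric_mean = math.exp(statistics.mean(logged))
print("geometric mean:", round(geometric_mean, 2))
print("log of raw median equals median of logs:",
      math.isclose(math.log(statistics.median(pain_mm)), statistics.median(logged)))
```

Carried at full precision the geometric mean of these assumed readings is about 3.47 mm; taking the antilog of the rounded value 1.24 gives roughly 3.45 mm, so small differences of this size reflect rounding only.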

Between subjects and within subjects standard deviation

If repeated measurements are made of, say, blood pressure on an individual, these measurements are likely to vary. This is within subject, or intrasubject, variability and we can calculate a standard deviation of these observations. If the observations are close together in time, this standard deviation is often described as the measurement error. Measurements made on different subjects vary according to between subject, or intersubject, variability. If many observations were made on each individual, and the average taken, then we can assume that the intrasubject variability has been averaged out and the variation in the average values is due solely to the intersubject variability. Single observations on individuals clearly contain a mixture of intersubject and intrasubject variation.

The coefficient of variation (CV%) is the intrasubject standard deviation divided by the mean, expressed as a percentage. It is often quoted as a measure of repeatability for biochemical assays, when an assay is carried out on several occasions on the same sample. It has the advantage of being independent of the units of measurement, but also numerous theoretical disadvantages. It is usually nonsensical to use the coefficient of variation as a measure of between subject variability.
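As a brief illustration, the coefficient of variation for repeated assays of one sample might be computed as below. The assay results are hypothetical and the function name is an assumption.

```python
import statistics


def coefficient_of_variation(repeats):
    """CV%: within-sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(repeats) / statistics.mean(repeats)


assay = [5.1, 4.9, 5.0, 5.2, 4.8]                 # hypothetical repeat results, one sample
print(round(coefficient_of_variation(assay), 2))

# Changing the units (here multiplying by 1000) leaves the CV% unchanged.
rescaled = [1000 * x for x in assay]
print(round(coefficient_of_variation(rescaled), 2))
```

Both lines print the same figure, which shows the independence from units of measurement mentioned in the text.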

Common questions

When should I use the mean and when should I use the median to describe my data?

It is a commonly held misapprehension that for Normally distributed data one uses the mean, and for non-Normally distributed data one uses the median. Alas this is not so: if the data are Normally distributed the mean and the median will be close; if the data are not Normally distributed then both the mean and the median may give useful information. Consider a variable that takes the value 1 for males and 0 for females. This is clearly not Normally distributed. However, the mean gives the proportion of males in the group, whereas the median merely tells us which group contained more than 50% of the people. Similarly, the mean from ordered categorical variables can be more useful than the median, if the ordered categories can be given meaningful scores. For example, a lecture might be rated as 1 (poor) to 5 (excellent). The usual statistic for summarising the result would be the mean. In the situation where there is a small group at one extreme of a distribution (for example, annual income) then the median will be more “representative” of the distribution.

My data must have values greater than zero and yet the mean and standard deviation are about the same size. How does this happen?

If data have a very skewed distribution, then the standard deviation will be grossly inflated, and is not a good measure of variability to use. As we have shown, occasionally a transformation of the data, such as a log transform, will render the distribution more symmetrical. Alternatively, quote the interquartile range.
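The effect of skewness on the standard deviation can be seen in a small numerical sketch. The values below are invented for illustration; all are greater than zero, with one large observation in the upper tail.

```python
import statistics

# Hypothetical positive, skewed data (one large value in the upper tail).
values = [1, 2, 2, 3, 3, 4, 5, 12]

mean = statistics.mean(values)
sd = statistics.stdev(values)          # sample SD, divisor n - 1
median = statistics.median(values)

# A "mean - 2 SD" lower limit falls below zero, although no value can be negative.
lower_limit = mean - 2 * sd

print(f"mean = {mean}, SD = {sd:.2f}, median = {median}")
print(f"mean - 2 SD = {lower_limit:.2f}")
```

Here the mean and standard deviation are of similar size, and the range mean - 2 SD to mean + 2 SD includes impossible negative values, which is why the standard deviation is a poor summary for such data.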

References

1. Mullee M A. How to choose and use a calculator. In: How to do it 2. BMJ Publishing Group, 1995:58-62.

Exercises

Exercise 2.1

In the campaign against smallpox a doctor inquired into the number of times 150 people aged 16 and over in an Ethiopian village had been vaccinated. He obtained the following figures: never, 12 people; once, 24; twice, 42; three times, 38; four times, 30; five times, 4. What is the mean number of times those people had been vaccinated and what is the standard deviation?

Exercise 2.2

Obtain the mean and standard deviation of the data in and an approximate 95% range.

Exercise 2.3

Which points are excluded from the range mean – 2SD to mean + 2SD? What proportion of the data is excluded?

1. Data display and summary

Types of data

The first step, before any calculations or plotting of data, is to decide what type of data one is dealing with. There are a number of typologies, but one that has proven useful is given in Table 1.1. The basic distinction is between quantitative variables (for which one asks “how much?”) and categorical variables (for which one asks “what type?”).

Quantitative variables can be continuous or discrete. Continuous variables, such as height, can in theory take any value within a given range. Examples of discrete variables are: number of children in a family, number of attacks of asthma per week.

Categorical variables are either nominal (unordered) or ordinal (ordered). Examples of nominal variables are male/female, alive/dead, blood group O, A, B, AB. For nominal variables with more than two categories the order does not matter. For example, one cannot say that people in blood group B lie between those in A and those in AB. Sometimes, however, people can provide ordered responses, such as grade of breast cancer, or they can “agree”, “neither agree nor disagree”, or “disagree” with some statement. In this case the order does matter and it is usually important to account for it.

Table 1.1

Variables shown at the left of Table 1.1 can be converted to ones further to the right by using “cut off points”. For example, blood pressure can be turned into a nominal variable by defining “hypertension” as a diastolic blood pressure greater than 90 mmHg, and “normotension” as a diastolic blood pressure less than or equal to 90 mmHg. Height (continuous) can be converted into “short”, “average” or “tall” (ordinal).
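The use of cut off points can be sketched in a few lines of Python. The 90 mmHg threshold is the one quoted above; the function names and the height thresholds are hypothetical, chosen only for illustration.

```python
def classify_diastolic(dbp_mmhg):
    """Nominal category from diastolic pressure, using the 90 mmHg cut off in the text."""
    return "hypertension" if dbp_mmhg > 90 else "normotension"


def classify_height(height_cm, short_below=160.0, tall_from=180.0):
    """Ordinal category from height; both thresholds are illustrative assumptions."""
    if height_cm < short_below:
        return "short"
    if height_cm >= tall_from:
        return "tall"
    return "average"


for dbp in (85, 90, 91):
    print(dbp, classify_diastolic(dbp))
for height in (155, 170, 185):
    print(height, classify_height(height))
```

Note that a patient with a diastolic pressure of 90 mmHg and one with 91 mmHg fall into different categories, which illustrates how much the result can depend on the choice of cut off point.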

In general it is easier to summarise categorical variables, and so quantitative variables are often converted to categorical ones for descriptive purposes. To make a clinical decision about a patient, one does not need to know the exact serum potassium level (continuous) but whether it is within the normal range (nominal). It may be easier to think of the proportion of the population who are hypertensive than the distribution of blood pressure. However, categorising a continuous variable reduces the amount of information available, and statistical tests will in general be more sensitive, that is, they will have more power (see Chapter 5 for a definition of power), for a continuous variable than for the corresponding nominal one, although more assumptions may have to be made about the data. Categorising data is therefore useful for summarising results, but not for statistical analysis. It is often not appreciated that the choice of appropriate cut off points can be difficult, and different choices can lead to different conclusions about a set of data.

These definitions of types of data are not unique, nor are they mutually exclusive, and are given as an aid to help an investigator decide how to display and analyse data. One should not debate overlong the typology of a particular variable!

Stem and leaf plots

Before any statistical calculation, even the simplest, is performed the data should be tabulated or plotted. If they are quantitative and relatively few, say up to about 30, they are conveniently written down in order of size.

For example, a paediatric registrar in a district general hospital is investigating the amount of lead in the urine of children from a nearby housing estate. In a particular street there are 15 children whose ages range from 1 year to under 16, and in a preliminary study the registrar has found the following amounts of urinary lead ( ), given in Table 1.2 in what is called an array:

Table 1.2

A simple way to order, and also to display, the data is to use a stem and leaf plot. To do this we need to abbreviate the observations to two significant digits. In the case of the urinary concentration data, the digit to the left of the decimal point is the “stem” and the digit to the right the “leaf”.

We first write the stems in order down the page. We then work along the data set, writing the leaves down “as they come”. Thus, for the first data point, we write a 6 opposite the 0 stem. These are as given in Figure 1.1.

Figure 1.1

We then order the leaves, as in Figure 1.2.

Figure 1.2

The advantage of first setting the figures out in order of size and not simply feeding them straight from notes into a calculator (for example, to find their mean) is that the relation of each to the next can be looked at. Is there a steady progression, a noteworthy hump, a considerable gap? Simple inspection can disclose irregularities. Furthermore, a glance at the figures gives information on their range. The smallest value is 0.1 and the largest is 3.2.
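The construction of a stem and leaf plot can be sketched in Python as follows. The function name and the seven values are made up for illustration (they are not the data of Table 1.2), and the sketch assumes observations recorded to one decimal place. For brevity it sorts the data first rather than writing the leaves "as they come" and ordering them afterwards.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of an ordered stem and leaf plot.

    Assumes values recorded to one decimal place: the digit(s) to the left
    of the decimal point form the stem, the digit to the right is the leaf.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)          # e.g. 2.3 -> 23
        stem, leaf = divmod(tenths, 10)     # 23 -> stem 2, leaf 3
        leaves[stem].append(leaf)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem} | {row}".rstrip())
    return lines


# Hypothetical observations, in the order they were recorded.
data = [0.6, 2.3, 0.2, 1.4, 1.1, 3.0, 1.7]
for line in stem_and_leaf(data):
    print(line)
```

Stems with no observations are still printed, so that gaps in the data remain visible.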

Median

To find the median (or mid point) we need to identify the point which has the property that half the data are greater than it, and half the data are less than it. For 15 points, the mid point is clearly the eighth largest, so that seven points are less than the median, and seven points are greater than it. This is easily obtained from Figure 1.2 by counting to the eighth leaf, which gives 1.5.

To find the median for an even number of points, the procedure is as follows. Suppose the paediatric registrar obtained a further set of 16 urinary lead concentrations from children living in the countryside in the same county as the hospital (Table 1.3).

Table 1.3

To obtain the median we average the eighth and ninth points (1.8 and 1.9) to get 1.85. In general, if n is even, we average the (n/2)th and the (n/2 + 1)th largest observations.

The main advantage of using the median as a measure of location is that it is “robust” to outliers. For example, if we had accidentally written 34 rather than 3.4 in Table 1.3, the median would still have been 1.85. One disadvantage is that it is tedious to order a large number of observations by hand (there is usually no “median” button on a calculator).
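The rule for odd and even numbers of points, and the robustness of the median, can be illustrated with a short Python sketch. The function name and the values are hypothetical (they are not those of Tables 1.2 and 1.3).

```python
import statistics


def median_by_hand(values):
    """Median as described in the text: the middle value if n is odd,
    otherwise the average of the (n/2)th and (n/2 + 1)th ordered values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data in arbitrary units.
odd_set = [16, 3, 20, 9, 12]
even_set = [3, 9, 12, 16, 20, 24]
with_error = [3, 9, 12, 16, 20, 240]   # 24 mistyped as 240

print(median_by_hand(odd_set))                                   # middle of 5 values
print(median_by_hand(even_set), statistics.mean(even_set))
print(median_by_hand(with_error), statistics.mean(with_error))   # median unchanged
```

The transcription error leaves the median where it was, whereas the mean is pulled sharply upwards.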

Measures of variation

It is informative to have some measure of the variation of observations about the median. The range is very susceptible to what are known as outliers, points well outside the main body of the data. For example, if we had made the mistake of writing 34 instead of 3.4 in Table 1.2, then the range would be written as 0.1 to 34, which is clearly misleading.

A more robust approach is to divide the distribution of the data into four, and find the points below which lie 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile. The variation of the data can be summarised in the interquartile range, the distance between the first and third quartile. With small data sets, and if the sample size is not divisible by four, it may not be possible to divide the data set into exact quarters, and there are a variety of proposed methods to estimate the quartiles. A simple, consistent method is to find the points midway between each end of the range and the median. Thus, from Figure 1.2, there are eight points between and including the smallest, 0.1, and the median, 1.5. The mid point of these eight lies between the fourth and fifth values, 0.8 and 1.1, giving 0.95. This is the first quartile. Similarly the third quartile is midway between 1.9 and 2.0, or 1.95. Thus, the interquartile range is 0.95 to 1.95.
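The "midway" method for quartiles can be sketched in Python as below. The function name is hypothetical and the data are simply the integers 1 to 15, chosen so that the result is easy to check by eye. For an odd number of points the median is counted in both halves, as in the worked example above; treating an even number of points by splitting the data into two equal halves is an assumption of this sketch. Statistical packages may use other definitions and give slightly different quartiles.

```python
import statistics


def quartiles_midway(values):
    """First quartile, median and third quartile by the 'midway' method.

    Each quartile is the median of one half of the ordered data; when n is
    odd the overall median is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: (n + 1) // 2]   # smallest value up to the median
    upper_half = ordered[n // 2:]          # median up to the largest value
    return (statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half))


# Hypothetical data: 15 ordered observations.
q1, q2, q3 = quartiles_midway(range(1, 16))
print(q1, q2, q3)
print("interquartile range:", q1, "to", q3)
```

With 15 points the lower half contains eight values, so the first quartile is the average of the fourth and fifth, exactly as in the calculation from Figure 1.2.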

Data display

The simplest way to show data is a dot plot. Figure 1.3 shows the data from Tables 1.2 and 1.3, together with the median for each set.

Figure 1.3

Sometimes the points in separate plots may be linked in some way, for example the data in Table 1.2 and Table 1.3 may result from a matched case control study (see Chapter 13 for a description of this type of study) in which individuals from the countryside were matched by age and sex with individuals from the town. If possible the links should be maintained in the display, for example by joining matching individuals in Figure 1.3. This can lead to a more sensitive way of examining the data.

When the data sets are large, plotting individual points can be cumbersome. An alternative is a box-whisker plot. The box is marked by the first and third quartile, and the whiskers extend to the range. The median is also marked in the box, as shown in Figure 1.4.

Figure 1.4

It is easy to include more information in a box-whisker plot. One method, which is implemented in some computer programs, is to extend the whiskers only to points that are 1.5 times the interquartile range below the first quartile or above the third quartile, and to show remaining points as dots, so that the number of outlying points is shown.
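The 1.5 times interquartile range rule can be sketched in Python. The function name, the data and the quartiles are hypothetical. The sketch also assumes one common convention, namely that each whisker stops at the most extreme observation lying within the limit; programs differ in this detail.

```python
def whiskers_and_outliers(values, q1, q3):
    """Whisker ends and outlying points under the 1.5 x IQR rule.

    Assumes the whiskers end at the most extreme observations that lie
    within 1.5 x IQR of the quartiles; points beyond are shown as dots.
    """
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = sorted(v for v in values if v < low_limit or v > high_limit)
    return min(inside), max(inside), outliers


# Hypothetical data with quartiles taken as 4 and 12 (so the IQR is 8).
data = [1, 3, 4, 6, 8, 10, 12, 13, 15, 40]
low, high, dots = whiskers_and_outliers(data, q1=4, q3=12)
print("whiskers from", low, "to", high, "; outlying points:", dots)
```

Here the limits are -8 and 24, so the whiskers run from 1 to 15 and the value 40 is plotted separately.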

Histograms

Suppose the paediatric registrar referred to earlier extends the urban study to the entire estate in which the children live. He obtains figures for the urinary lead concentration in 140 children aged over 1 year and under 16. We can display these data as a grouped frequency table (Table 1.4).

Table 1.4

Figure 1.5

Bar charts

Suppose, of the 140 children, 20 lived in owner occupied houses, 70 lived in council houses and 50 lived in private rented accommodation. Figures from the census suggest that for this age group, throughout the county, 50% live in owner occupied houses, 30% in council houses, and 20% in private rented accommodation. Type of accommodation is a categorical variable, which can be displayed in a bar chart. We first express our data as percentages:

14% owner occupied, 50% council house, 36% private rented. We then display the data as a bar chart. The sample size should always be given (Figure 1.6).
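The percentages above follow directly from the counts, as this short Python sketch shows. The counts and the census percentages are those given in the text; the rough text display is only an illustration, not a substitute for a proper bar chart.

```python
# Counts from the text: type of accommodation for the 140 children.
counts = {"owner occupied": 20, "council house": 70, "private rented": 50}
# Census percentages for the county, as quoted in the text.
census_percent = {"owner occupied": 50, "council house": 30, "private rented": 20}

n = sum(counts.values())
percentages = {housing: round(100 * count / n) for housing, count in counts.items()}

print(f"Sample size n = {n}")
for housing, percent in percentages.items():
    bar = "#" * (percent // 2)          # one '#' per 2 percentage points
    print(f"{housing:<15} {bar} {percent}% (census {census_percent[housing]}%)")
```

Printing the sample size alongside the percentages follows the advice that it should always be given.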

Figure 1.6

Common questions

How many groups should I have for a histogram?

In general one should choose enough groups to show the shape of a distribution, but not so many that the shape is lost in the noise. It is partly aesthetic judgement but, in general, between 5 and 15 groups, depending on the sample size, gives a reasonable picture. Try to keep the intervals (known also as “bin widths”) equal. With equal intervals the height of the bars and the area of the bars are both proportional to the number of subjects in the group. With unequal intervals this link is lost, and interpretation of the figure can be difficult.
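Grouping observations into equal intervals can be sketched in Python as follows. The function name, the data, the starting point and the bin width are all hypothetical; each interval includes its lower limit and excludes its upper limit.

```python
def grouped_frequencies(values, start, width, n_bins):
    """Count the observations falling in n_bins equal intervals.

    Interval i covers start + i*width up to, but not including,
    start + (i + 1)*width. Values outside all intervals are ignored.
    """
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical observations grouped into four intervals of width 1.0.
data = [0.2, 0.7, 1.1, 1.5, 1.9, 2.3, 3.6]
counts = grouped_frequencies(data, start=0.0, width=1.0, n_bins=4)
for i, count in enumerate(counts):
    print(f"{i:.1f} to under {i + 1:.1f}: {'#' * count} ({count})")
```

Because the intervals are equal, the length of each row of symbols is proportional to the number of observations in the group.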

What is the distinction between a histogram and a bar chart?

Alas, with modern graphics programs the distinction is often lost. A histogram shows the distribution of a continuous variable and, since the variable is continuous, there should be no gaps between the bars. A bar chart shows the distribution of a discrete variable or a categorical one, and so will have spaces between the bars. It is a mistake to use a bar chart to display a summary statistic such as a mean, particularly when it is accompanied by some measure of variation to produce a “dynamite plunger plot” (1). It is better to use a box-whisker plot.

What is the best way to display data?

The general principle should be, as far as possible, to show the original data and to try not to obscure the design of a study in the display. Within the constraints of legibility, show as much information as possible. If data points are matched or from the same patients, link them with lines (2). When displaying the relationship between two quantitative variables, use a scatter plot (Chapter 11) in preference to categorising one or both of the variables.

References

1. Campbell M J. How to present numerical results. In: How to do it: 2. London: BMJ Publishing, 1995:77-83.

2. Matthews J N S, Altman D G, Campbell M J, Royston J P. Analysis of serial measurements in medical research. BMJ 1990;300:230-5.

Exercises

Exercise 1.1

From the 140 children whose urinary concentration of lead was investigated, 40 were chosen who were aged at least 1 year but under 5 years. The following concentrations of copper (in ) were found.

0.70, 0.45, 0.72, 0.30, 1.16, 0.69, 0.83, 0.74, 1.24, 0.77,

0.65, 0.76, 0.42, 0.94, 0.36, 0.98, 0.64, 0.90, 0.63, 0.55,

0.78, 0.10, 0.52, 0.42, 0.58, 0.62, 1.12, 0.86, 0.74, 1.04,

0.65, 0.66, 0.81, 0.48, 0.85, 0.75, 0.73, 0.50, 0.34, 0.88

Find the median, range and quartiles.

Preface

It is with trepidation that one rewrites a best seller, and Dougal Swinscow’s Statistics at Square One was one of the best selling statistical text books in the UK. It is difficult to decide how much to alter without destroying the quality that proved so popular. I chose to retain the format and structure of the original book. Most of the original examples remain; they are realistic, if not real, and tracking down the original sources to provide references would be impossible. However, I have removed the chromatic pseudonyms of the investigators. All new examples utilise real data, the source of which is referenced.

Much has changed in medical statistics since the original edition was published in 1976. Desktop computers now provide statistical facilities unimaginable then, even for mainframe enthusiasts. I think the main change has been an emphasis now on looking and plotting the data first, and on estimation rather than simple hypothesis testing. I have tried to reflect these changes in the new edition. I have found it a useful pedagogic device to pose questions to the students, and so have incorporated questions commonly asked by students or consultees at the end of each chapter. These questions cover issues often not explicitly addressed in elementary text books, such as how far one should test assumptions before proceeding with statistical tests.

I have included a number of new techniques, such as stem and leaf plots, box whisker plots, data transformation, the χ² test for trend and t test with unequal variance. I have also included a chapter on survival analysis, with the Kaplan-Meier survival curve and the log rank test, as these are now in common use. I have replaced the Kendall rank correlation coefficient by the Spearman; in spite of the theoretical advantages of the former, most statistical packages compute only the latter. The section on linear regression has been extended. I have added a final chapter on the design of studies, and would make a plea for it not to be ignored. Studies rarely fail for want of a significance test, but a flawed design may be fatal. To keep the book short I have removed some details of hand calculation.

I have assumed that the reader will not want to master a complicated statistical program, but has available a simple scientific calculator, which should cost about the same as this book. However, no serious statistical analysis should be undertaken these days without a computer. There are many good and inexpensive statistical programs. Epi-Info, for example, is produced by the Center for Disease Control (CDC) Atlanta and the World Health Organization (WHO) in Geneva. Another useful program is CIA (Confidence Interval Analysis) which is available from the BMJ.

I am most grateful to Tina Perry for secretarial help, to Simon Child for producing the figures, and to Simon Child and Tide Olayinka for help with word processing. I am particularly grateful to Paul Little, Steven Julious, Ruth Pickering and David Coggon who commented on all or part of the book, and who made many suggestions for improvement. Any errors remain my own. Finally, thanks to Deborah Reece of the BMJ who asked me to revise the book and was patient with the delays.

M J Campbell

August 1995