Introduction

“Data”

We are not interested simply in measurements, or individual data points.
We are usually trying to answer a question / learn something / make a decision.
When we use data, we need to bring some order to it.
We will consider structured data

We often say ‘data’, but don’t always mean data.

Data types

Data can take various forms, e.g. measurements, grouping factors, estimates, observations etc.

These forms differ in terms of:

Storage
Methods for summary/processing
Interpretation
Data-generating mechanism

A few major groups of data types:

Numeric
Binary
Categorical

Continuous: values that can be divided ever more finely, so there is always another possible value between any two values
- E.g. height of a person could be 172, 173 or 172.5 cm
- Examples in NHS: physiological measurements like blood pressure
Discrete: values that can only take whole numbers, usually obtained by counting.
- E.g. Number of patients seen in a clinic could be 35 or 36 but not 35.5
- Examples in NHS: counts of patients, waiting time measured in whole minutes, length of stay measured in days, number of patient episodes

Binary: a variable with exactly two mutually exclusive states
- E.g. 0/1, yes/no, TRUE/FALSE
- Examples in NHS: Patient dead or alive?, TRUE or FALSE answer to survey, patient status for a genetic marker

Counting binary events gives discrete numeric data.

We may choose to count only the events, e.g. ‘deaths’ but not ‘survivals’
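As a minimal sketch of this idea in Python, summing a list of TRUE/FALSE flags turns binary events into a discrete count. The outcome values are made up for illustration and do not come from any real data set.

```python
# Hypothetical binary outcomes for eight patients: True = died, False = survived.
died = [False, False, True, False, True, False, False, False]

# Python treats True as 1 and False as 0, so summing the flags counts the events.
deaths = sum(died)
survivors = len(died) - deaths

print(f"{deaths} deaths out of {len(died)} patients")
```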

Taken from: xkcd https://xkcd.com/605/

Nominal: Categories without any notion of order
- E.g. Hair Colour, Brand of car, Country of residence
- Examples in NHS: Ethnicity, Admission method, Treatment speciality

Ordinal: Categories with an order, but without the evenly spaced scale of numeric data
- E.g. Survey answers ‘Good, OK & Bad’. There is order, but ‘OK’ ≠ ‘Bad’ × 2 and ‘Good’ ≠ ‘Bad’ + ‘OK’
- Examples in NHS: Cancer stage, self-assessed patient answers like ‘is your health poor, OK or good,’ Age-groups <1, 1-16, 17-40 etc.
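One way to see the difference between nominal and ordinal categories is to sort them. The sketch below assumes the ordering Bad < OK < Good and uses made-up survey responses; the names `LEVELS` and `responses` are illustrative only.

```python
# Assumed ordering of the survey answers, lowest to highest.
LEVELS = ["Bad", "OK", "Good"]

# Hypothetical survey responses.
responses = ["Good", "Bad", "OK", "OK", "Good", "Bad", "OK"]

# Sort by position in LEVELS rather than alphabetically.
ranked = sorted(responses, key=LEVELS.index)

# The middle response is a meaningful summary for ordinal data;
# a mean would need numeric spacing between levels, which ordinal data lacks.
middle = ranked[len(ranked) // 2]

print(ranked)
print("Middle response:", middle)
```

For nominal categories such as hair colour there is no equivalent of `LEVELS`, so only counts per category make sense.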

Summarising Numeric Data

“Dear BI team, could I have the ages of all patients admitted as an emergency to general medicine in December?”

  [1] 75 81 59 70 64 67 66 54 68 72 80 66 70 76 75 52 59 52 86 56 51 59 72 61 53
 [26] 72 75 69 64 55 74 54 61 74 86 53 68 69 76 58 59 79 59 69 91 55 59 68 58 70
 [51] 68 60 89 54 85 76 56 56 84 91 90 87 90 85 54 76 91 79 53 62 72 69 75 76 76
 [76] 76 63 85 76 85 67 63 91 63 64 69 63 60 57 83 69 60 58 70 59 85 68 85 56 79
[101] 85 76 76 73 60 87 57 67 72 92 58 55 54 71 90 55 58 59 63 77 85 77 53 66 73
[126] 53 79 70 70 77 56 65 85 64 74 66 74 59 68 79 66 56 68 63 66 66 68 70 64 72
[151] 56 83 53 69 67 77 68 63 73 57 86 52 75 78 76 61 71 77 64 62 77 69 69 66 85
[176] 65 61 72 69 73 53 77 54 56 72 70 69 67 62 78 58 54 69 76 86 59 80 84 56 78
[201] 75 57 68 91 91

How would you answer that question?

What is the question really asking? It’s quite unlikely that sending a list of numbers really answers it. The requester may have various questions in mind, but they probably want some idea of the range of ages, and of how the patients are distributed across those ages.

How might we show this in a better way?
Summary figures?

Visualising data types:

Scatter plots
Stem & leaf plot
Histogram or Kernel Density (sounds more impressive than it is)
Box plot

Scatter plots

Plots an x variable against a y variable, one point per observation

Why doesn’t this help?

We’ve only got one variable, not two
No summary information
We want to see some kind of distribution

Stem & Leaf Plots

Easy to do by hand or on computer
Decide on grouping size (5-year in the example below)
Major units on left, minor on right
We essentially create a tally


  The decimal point is 1 digit(s) to the right of the |

  5 | 122233333334444444
  5 | 555566666666677778888889999999999
  6 | 000011112223333333444444
  6 | 5566666666777778888888888999999999999
  7 | 00000000112222222233334444
  7 | 555555666666666666777777788899999
  8 | 0013344
  8 | 55555555556666779
  9 | 0001111112
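The tally idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the routine that produced the plot above: the function name `stem_and_leaf` is hypothetical, the grouping is fixed at 5 years, and only the first 25 ages from the earlier listing are used to keep it short.

```python
from collections import defaultdict

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]


def stem_and_leaf(values):
    """Return stem-and-leaf rows for whole numbers, split into 5-unit groups."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # 75 -> stem 7, leaf 5
        half = leaf // 5                 # 0 for leaves 0-4, 1 for leaves 5-9
        rows[(stem, half)].append(str(leaf))
    # Groups with no values are simply skipped in this sketch.
    return [f"{stem} | {''.join(leaves)}" for (stem, _), leaves in sorted(rows.items())]


lines = stem_and_leaf(ages)
print("\n".join(lines))
```

Each printed row is a tally for one 5-year group, with the tens digit on the left and the units digits on the right.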

Histograms and Bar Charts

Plot of binned counts
Good way to visualise distribution
Bin sizes can vary & do not have to be equal
Bar charts are related, but do not share the ‘binning’ idea. They can be used with categorical data
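The ‘binned counts’ behind a histogram can be sketched without any plotting library. The example below assumes equal-width bins and uses the first 25 ages from the earlier listing; `binned_counts` and `bin_width` are illustrative names.

```python
from collections import Counter

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]


def binned_counts(values, bin_width):
    """Count how many values fall in each bin; keys are the lower bin edges."""
    return Counter((value // bin_width) * bin_width for value in values)


bin_width = 10
counts = binned_counts(ages, bin_width)

# A text 'histogram': one '#' per observation in each bin.
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * counts[lower]}")
```

Changing `bin_width` to 5 gives a finer picture of the same data, which is the main design choice when drawing a histogram.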

Kernel Density

Similar to a smoothed histogram
Plots the probability density of the data rather than counts of values
Conceptually harder, but nicer visualisation
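To make the concept less mysterious, here is a rough sketch of a Gaussian kernel density estimate: each observation contributes a small bell curve, and the curves are added up. The function name `gaussian_kde` and the bandwidth of 5 years are assumptions chosen for illustration; plotting software normally picks the bandwidth automatically.

```python
import math

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]


def gaussian_kde(values, bandwidth):
    """Return a function giving the estimated density at any point x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Each observation contributes a small bell curve centred on itself.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density


density = gaussian_kde(ages, bandwidth=5)  # bandwidth is an arbitrary choice here

for age in range(50, 91, 10):
    print(age, round(density(age), 4))
```

The y-axis is a density, not a count: the area under the whole curve is 1.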

Box Plots

The box spans the “hinges”:

25th percentile
75th percentile
The line inside the box is the median (50th percentile)
Whiskers extend from each hinge to the most extreme point within 1.5 × IQR of it
Outliers (points beyond the whiskers) are plotted individually
These terms are explained in the following slides
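The numbers a box plot draws can be computed directly. The sketch below uses Python's `statistics.quantiles` with the "inclusive" method as one possible quartile convention (different software uses slightly different rules), and the first 25 ages from the earlier listing. The function name `box_plot_stats` is hypothetical.

```python
import statistics

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]


def box_plot_stats(values):
    """Compute the box, whisker ends and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers stop at the most extreme observations inside the fences.
        "whiskers": (min(inside), max(inside)),
        # Anything beyond the fences would be drawn as an individual point.
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


stats = box_plot_stats(ages)
print(stats)
```

In this small sample no value lies more than 1.5 × IQR beyond the box, so the whiskers reach the minimum and maximum and there are no outliers.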

Violin plot

A collision of the box plot and the density plot. This is ‘two-sided’ here, but could be set to a single side.

Summary figures

Quick notation definition

x : is the variable of interest. If we have more than one, they might be y, z etc.
n : is the number of observations, or the count of x
i : is usually used to denote an individual value, rather than all values. E.g. x_i usually means ‘each value of x’ in turn. It is also the index, e.g. i=3 means the value at position 3
\bar{} : is usually used to signify the mean. E.g. \bar{y} is the mean of y.
\Sigma : is a ‘sum’ operator. Usually given with an end value at the top and a starting value at the bottom, e.g. \sum_{i=1}^n reads as ‘sum the n values, starting at index 1’.

### Centre of the data

We commonly want some measure of the central point, and a description of the ‘weight’ of the data

“Mean” is the arithmetic average of the data, a calculated centre: \bar{x} = \frac{\sum_{i=1}^n x_i}{n}
“Median” is the middle value in rank order
“Mode” is the most common value
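As a small worked example, the three measures can be computed with Python's standard `statistics` module. The data are the first 25 ages from the earlier listing; the mean is also written out by hand to mirror the formula.

```python
import statistics

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

# Mean by the formula: sum of the x_i divided by n.
mean_by_formula = sum(ages) / len(ages)

print("mean:  ", mean_by_formula)
print("median:", statistics.median(ages))   # middle value in rank order
print("mode:  ", statistics.mode(ages))     # most common value
```

For these 25 values the mean is 65.76, the median is 66 and the mode is 59, so the three measures need not agree.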

### Spread

There are various measures of spread:

Extremes of the distributions (highest / lowest)
Quantiles (often percentiles) - cut points that divide the ordered observations into ranges, e.g. the middle 50%
The standard deviation (sd or \sigma): roughly the typical distance of a point from the mean (the square root of the average squared distance).

sd = \sqrt{ \frac{\sum_{i=1}^n \left(x_i - \bar{x} \right)^2}{n-1}} or, in easier Excel terms:

sqrt\left(\frac{sum((x_i-\bar{x})^2)}{count(x)-1}\right)
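The same calculation can be followed step by step in Python. This sketch uses the first 25 ages from the earlier listing and compares the hand-built result with `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

n = len(ages)
mean = sum(ages) / n

# Squared distance of each value from the mean.
squared_distances = [(x - mean) ** 2 for x in ages]

# Divide by n - 1 (the sample standard deviation), then take the square root.
sd = math.sqrt(sum(squared_distances) / (n - 1))

print(round(sd, 2))
print(round(statistics.stdev(ages), 2))  # should match the line above
```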

How do we calculate a percentile?

Take some values: e.g. the ages in earlier slides
Order them from low to high
Add an index

How do we calculate a percentile? (2)

Find the percentile you want, using the index

e.g. 10% of 205 age values
0.1 * 205 = 20.5th index value

If the index value is a whole number, use it directly

If the index value falls between whole numbers, there are various rules, including:

Round up/down
Average
Weighted average

10th percentile = 55
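The index method above can be sketched as a short function. The name `percentile_by_index` and the rule labels are hypothetical, and statistical packages use related but not identical rules, so results may differ slightly from software output. The example uses the first 25 ages from the earlier listing, so 10% gives index 0.1 * 25 = 2.5.

```python
import math

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]


def percentile_by_index(values, p, rule="average"):
    """Percentile using a 1-based index position of p * n in the ordered values."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * n                          # e.g. 0.1 * 25 = 2.5
    lower = max(1, math.floor(position))      # index just below (at least 1)
    upper = max(1, min(n, math.ceil(position)))  # index just above (at most n)
    low_value, high_value = ordered[lower - 1], ordered[upper - 1]
    if rule == "round_down":
        return low_value
    if rule == "round_up":
        return high_value
    if rule == "average":
        return (low_value + high_value) / 2
    if rule == "weighted":
        fraction = position - math.floor(position)
        return low_value + fraction * (high_value - low_value)
    raise ValueError(f"unknown rule: {rule}")


for rule in ("round_down", "round_up", "average", "weighted"):
    print(rule, percentile_by_index(ages, 0.25, rule))
```

When the index is a whole number, `lower` and `upper` coincide and every rule returns the same value. When the two neighbouring values are equal, as with the 10th percentile here, the choice of rule makes no difference either.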

The ‘Normal’ or ‘Gaussian’ distribution:

Symmetrically distributed, with the most common values in the centre.
‘Bell-curve’
Mean, median and mode are identical
Principles apply more widely to lots of areas of statistics

Visualising the Normal Distribution

Mean is in the centre of the curve

Mean + standard deviation

How much of the distribution is here?

Adding in more standard deviations

Change the x scale to values.

We modelled a distribution with a mean of 500, and a standard deviation of 150.
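To answer ‘how much of the distribution is here?’ numerically, the share of a Normal distribution within 1, 2 and 3 standard deviations of the mean can be checked with Python's `statistics.NormalDist`. The sketch assumes the same mean of 500 and standard deviation of 150.

```python
from statistics import NormalDist

dist = NormalDist(mu=500, sigma=150)

within = {}
for k in (1, 2, 3):
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    # Area under the curve between lower and upper.
    within[k] = dist.cdf(upper) - dist.cdf(lower)
    print(f"within {k} sd ({lower:.0f} to {upper:.0f}): {within[k]:.1%}")
```

This gives roughly 68%, 95% and 99.7%, and the same proportions hold for any Normal distribution whatever its mean and standard deviation.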

Z-scores

Standard score, “invariant” to raw scale
Changes scale but not distribution
The mean is subtracted from each value and the result is divided by the standard deviation, so:
- Mean = 0
- 1 = Mean + 1sd
- -1 = Mean – 1sd
We are now talking in units of standard deviation

\large z = \frac{(x - \bar{x})}{sd}

Z-scores are a common way to compare different indicators measured on different natural scales
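A short sketch of the z-score formula in Python follows. The helper names `z_score` and `z_scores` are illustrative; the first call uses the modelled mean of 500 and standard deviation of 150, and the second standardises the first 25 ages from the earlier listing using their own mean and sd.

```python
import statistics


def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / sd


def z_scores(values):
    """Standardise a whole sample using its own mean and standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [z_score(x, mean, sd) for x in values]


print(z_score(650, mean=500, sd=150))   # one sd above the mean
print(z_score(350, mean=500, sd=150))   # one sd below the mean

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]
z = z_scores(ages)
print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```

After standardising, the sample has a mean of 0 and a standard deviation of 1, whatever the original units were.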

Skewed Distributions

Not all distributions follow the pattern we’ve seen above. We call non-symmetrical distributions ‘skewed’, and they can’t be treated in quite the same way. Our normal (or “parametric”) methods assume the same shape / distribution on both sides of the mean.

Imagine the case of hospital length-of-stay. What happens here?

Measured in days: numeric, discrete
Can’t be < 0, so there is a hard limit at zero
Rarely, some patients stay in for a really long time.

Summarising Skewed Data

We might describe data as “left” or “right” skewed. Skew pulls the mean towards the long tail, but has much less effect on the median.
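A small illustration of how one long stay moves the mean but barely moves the median. The length-of-stay values are invented for the example and are not real hospital data.

```python
import statistics

# Hypothetical lengths of stay in days; one patient stays much longer.
los = [0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 7, 60]
without_long_stay = los[:-1]

for label, data in [("with long stay", los), ("without long stay", without_long_stay)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label}: mean = {mean:.1f}, median = {median}")
```

Removing the single 60-day stay drops the mean from about 7.4 to about 2.6 days, while the median only moves from 2.5 to 2.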

Summarising Skewed Data with percentiles

Summarising Skewed Data principles

Use median (50th percentile) for mid point
25th and 75th percentiles commonly used
75th - 25th percentile referred to as ‘Inter-quartile range’ (IQR).
We can also list extremes of distribution, or 5th and 95th percentiles.
‘Five-number summary’: minimum, “lower-hinge”, median, “upper-hinge”, maximum
- a ‘hinge’ is the median of the upper or lower half of the data, usually close to the 75th or 25th percentile.
Mean & SD are not a good representation, as the mean is affected by the extreme values and the standard deviation assumes a symmetrical shape either side of the mean.
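The five-number summary can be sketched by taking the median of each half of the ordered data. The function name `five_number_summary` is hypothetical, the length-of-stay values are invented, and the sketch assumes the convention that the middle value belongs to both halves when n is odd.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower hinge, median, upper hinge, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2            # includes the middle value when n is odd
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return (
        ordered[0],
        statistics.median(lower_half),   # lower hinge
        statistics.median(ordered),
        statistics.median(upper_half),   # upper hinge
        ordered[-1],
    )


# Hypothetical lengths of stay in days, with one very long stay.
los = [0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 7, 60]
print(five_number_summary(los))
```

Here the long stay shows up only in the maximum, while the hinges and median still describe where most patients sit.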

Summarising Binary or Categorical Data

We have fewer options here, and some of the methods above are not suitable for these types:

We can use:

Counts / frequency in groups
Percentages for relative values, at the cost of losing the absolute values

As an example, we will use the ages data set but group it into 10-year categories.
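A minimal sketch of this grouping in Python, using only the first 25 ages from the earlier listing to keep it short. The helper `age_group` and the group labels are illustrative choices.

```python
from collections import Counter

# First 25 ages from the listing earlier in this section.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]


def age_group(age):
    """Label an age with its 10-year category, e.g. 75 -> '70-79'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


counts = Counter(age_group(age) for age in ages)
n = len(ages)

# Report both the count and the percentage, so absolute values are not lost.
for group in sorted(counts):
    print(f"{group}: n = {counts[group]}, {100 * counts[group] / n:.0f}%")
```

Showing the count next to the percentage keeps the absolute values visible, which matters when groups are small.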

Summary

Is the question asking about the data, or for you to do something with the data to summarise?
Consider data type:
- Numeric / Binary / Categorical
- Within that, is it discrete / ordered etc.
Plot it accordingly
Choose appropriate summary statistics