[1] 75 81 59 70 64 67 66 54 68 72 80 66 70 76 75 52 59 52 86 56 51 59 72 61 53
[26] 72 75 69 64 55 74 54 61 74 86 53 68 69 76 58 59 79 59 69 91 55 59 68 58 70
[51] 68 60 89 54 85 76 56 56 84 91 90 87 90 85 54 76 91 79 53 62 72 69 75 76 76
[76] 76 63 85 76 85 67 63 91 63 64 69 63 60 57 83 69 60 58 70 59 85 68 85 56 79
[101] 85 76 76 73 60 87 57 67 72 92 58 55 54 71 90 55 58 59 63 77 85 77 53 66 73
[126] 53 79 70 70 77 56 65 85 64 74 66 74 59 68 79 66 56 68 63 66 66 68 70 64 72
[151] 56 83 53 69 67 77 68 63 73 57 86 52 75 78 76 61 71 77 64 62 77 69 69 66 85
[176] 65 61 72 69 73 53 77 54 56 72 70 69 67 62 78 58 54 69 76 86 59 80 84 56 78
[201] 75 57 68 91 91
Introduction
“Data”
We are not interested simply in measurements, or in an individual datum.
We are usually trying to answer a question / learn something / make a decision
When we use data, we need to bring some order to it
We will consider structured data
We often say data, but don’t always mean data
Data types
Data can take various forms, e.g. measurements, grouping factors, estimates, observations etc.
These differ in terms of:
- Storage
- Methods for summary/processing
- Interpretation
- Data-generating mechanism
A few major groups of data types:
- Numeric
- Binary
- Categorical
Numeric
Continuous: values that can be divided indefinitely, so there is always a possible value between any two values
E.g. height of a person could be 172, 173 or 172.5 cm
Examples in NHS: physiological measurements like blood pressure
Discrete: values that can only take whole numbers, usually obtained by counting.
E.g. Number of patients seen in a clinic could be 35 or 36 but not 35.5
Examples in NHS: counts of patients, waiting time measured in whole minutes, length of stay measured in days, number of patient episodes
Binary
A variable with two mutually exclusive states
E.g. 0/1, yes/no, TRUE/FALSE
Examples in NHS: patient dead or alive, TRUE or FALSE answer to a survey, patient status for a genetic marker
Counting binary events gives discrete numeric data.
We may choose to count only the events, e.g. ‘deaths’ not ‘survivals’
Taken from: xkcd https://xkcd.com/605/
Categorical
- Nominal: categories without any notion of order
  - E.g. hair colour, brand of car, country of residence
  - Examples in NHS: ethnicity, admission method, treatment speciality
- Ordinal: categories with order, but not on a linear scale like numeric data
  - E.g. survey answers ‘Good, OK & Bad’. There is order, but ‘OK’ ≠ ‘Bad’ × 2 and ‘Good’ ≠ ‘Bad’ + ‘OK’
  - Examples in NHS: cancer stage, self-assessed patient answers like ‘is your health poor, OK or good?’, age groups (<1, 1-16, 17-40 etc.)
Summarising Numeric Data
“Dear BI team, I would like age of all patients admitted as an emergency to general medicine in December?”
How would you answer that question?
What is the question really asking? It’s quite unlikely that sending a list of numbers really answers the question. The requester may have various questions, but they probably want some idea of the range of ages, and of how people are distributed across those ages.
How might we show this in a better way?
Summary figures?
Visualising data types:
- Scatter plots
- Stem & leaf plot
- Histogram or Kernel Density (sounds more impressive than it is)
- Box plot
Scatter plots
Plots an x variable against a y variable, one point per observation
Why doesn’t this help?
- We’ve only got one variable, not two
- No summary information
- We want to see some kind of distribution
Stem & Leaf Plots
Easy to do by hand or on computer
Decide on grouping size (5-year groups in the example below)
Major units on left, minor on right
We essentially create a tally
The decimal point is 1 digit(s) to the right of the |
5 | 122233333334444444
5 | 555566666666677778888889999999999
6 | 000011112223333333444444
6 | 5566666666777778888888888999999999999
7 | 00000000112222222233334444
7 | 555555666666666666777777788899999
8 | 0013344
8 | 55555555556666779
9 | 0001111112
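The tally idea can be sketched in a few lines of Python. This is only an illustration, not the code that produced the plot above: the function name `stem_and_leaf` is made up for this example, it uses just the first 25 ages from the data set, and it skips empty groups rather than printing blank rows.

```python
from collections import defaultdict

# First 25 values of the ages data shown at the start of these slides
# (a sample only, to keep the sketch short).
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]


def stem_and_leaf(values, group_size=5):
    """Return stem-and-leaf rows as strings, one row per non-empty group."""
    rows = defaultdict(list)
    for value in sorted(values):
        group_start = (value // group_size) * group_size  # e.g. 57 -> 55
        rows[group_start].append(str(value % 10))         # leaf = units digit
    # Stem (tens digit) on the left, tally of leaves on the right.
    return [f"{start // 10} | {''.join(leaves)}"
            for start, leaves in sorted(rows.items())]


plot = stem_and_leaf(ages)
print("\n".join(plot))
```

Each stem appears twice because the groups are 5 years wide: the first row holds leaves 0-4 and the second holds leaves 5-9.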
Histograms and Bar Charts
Plot of binned counts
Good way to visualise distribution
Bin sizes can vary & do not have to be equal
Bar charts are related, but do not share the ‘binning’ idea. They can be used with categorical data
Kernel Density
Similar to a smoothed histogram
Plots the probability density of the data rather than counts of values
Conceptually harder, but nicer visualisation
Box Plots
The box spans the “hinges”:
- 25th percentile
- 75th percentile
- The line inside the box is the median (50th percentile)
- Whiskers extend from each hinge to the most extreme point within 1.5 * IQR of it
- Outliers (points beyond the whiskers) are plotted individually
- Terms will be explained in the following slides
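The numbers behind a box plot can be worked out directly. The sketch below uses the first 25 ages only, and assumes the "inclusive" quartile rule from Python's `statistics` module; plotting software may use a slightly different rule for the hinges, so treat the values as indicative.

```python
import statistics

# First 25 values of the ages data shown at the start of these slides.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

# Quartiles: 25th percentile, median, 75th percentile.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1  # inter-quartile range: the height of the box

# Limits beyond which a point is drawn as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences.
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = sorted(a for a in ages if a < lower_fence or a > upper_fence)

print(q1, median, q3, iqr, whisker_low, whisker_high, outliers)
```

In this sample no age falls outside the fences, so the whiskers simply reach the minimum and maximum.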
Violin plot
A collision of the box plot and the density plot. This is ‘two-sided’ here, but could be set to a single side.
Summary figures
Quick notation definition
x : is the variable of interest. If we have more than one, they might be y, z etc.
n : is the number of observations, or the count of x
i : is usually used to denote an individual value, rather than all values. E.g. x_i usually means ‘each value of x’, rather than all values of x. It is also the index, e.g. i=3 means the value at position 3
\bar{} : is usually used to signify the mean. E.g. \bar{y} is the mean of y.
\Sigma : is a ‘sum’ operator. Usually given with the end of the range at the top and the starting index at the bottom, e.g. \sum_{i=1}^n reads as ‘sum the values from index 1 up to index n’.
### Centre of the data
We commonly want some measure of the central point, and a description of the ‘weight’ of the data
“Mean” is the average of the data, the calculated centre: \bar{x} = \frac{\sum_{i=1}^n x_i}{n}
“Median” is middle value in rank order
“Mode” is the most common value
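As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The sketch uses only the first 25 ages from the data set, so the results describe that sample rather than all 205 values.

```python
import statistics

# First 25 values of the ages data shown at the start of these slides.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

mean_age = sum(ages) / len(ages)      # sum of x_i divided by n
median_age = statistics.median(ages)  # middle value in rank order
mode_age = statistics.mode(ages)      # most common value

print(mean_age, median_age, mode_age)
```

Here the mean and median are close together, while the mode sits lower. In a small sample the mode can land almost anywhere.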
### Spread
There are various measures of spread:
Extremes of the distribution (highest / lowest)
Quantiles (often percentiles): cut points that divide the ordered observations into ranges, e.g. the middle 50%
The standard deviation (sd or \sigma): roughly the typical distance of a point from the mean.
sd = \sqrt{ \frac{\sum_{i=1}^n \left(x_i - \bar{x} \right)^2}{n-1}} or, in easier Excel terms:
sqrt\left(\frac{sum((x_i-\bar{x})^2)}{count(x)-1}\right)
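The same calculation, written out step by step in Python as a sketch. It uses the first 25 ages only, and checks the hand calculation against `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

# First 25 values of the ages data shown at the start of these slides.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

n = len(ages)
mean_age = sum(ages) / n

# Squared distance of each value from the mean: (x_i - mean)^2
squared_deviations = [(x - mean_age) ** 2 for x in ages]

# Sum them, divide by n - 1, then take the square root.
sd = math.sqrt(sum(squared_deviations) / (n - 1))

# The standard library gives the same answer.
sd_check = statistics.stdev(ages)
print(sd, sd_check)
```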
How do we calculate a percentile?
Take some values: e.g. the ages in earlier slides
Order them from low to high
Add an index
Find the percentile you want, using the index
- e.g. 10% of 205 age values
- 0.1 * 205 = 20.5th index value
If the index value is a whole number, use the value at that position directly
If the index value falls between two positions, there are various rules, including:
- Round up/down
- Average
- Weighted average
10th percentile = 55 (the 20th and 21st ordered values are both 55, so each of these rules gives the same answer)
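The steps above can be sketched in Python using the full set of 205 ages. The function name `percentile` is made up for this example, and it applies only the simple ‘average’ rule when the index falls between two positions; statistical software usually offers several rules, which can give slightly different answers.

```python
import math

# The 205 ages printed at the start of these slides.
ages = [
    75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70, 76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53,
    72, 75, 69, 64, 55, 74, 54, 61, 74, 86, 53, 68, 69, 76, 58, 59, 79, 59, 69, 91, 55, 59, 68, 58, 70,
    68, 60, 89, 54, 85, 76, 56, 56, 84, 91, 90, 87, 90, 85, 54, 76, 91, 79, 53, 62, 72, 69, 75, 76, 76,
    76, 63, 85, 76, 85, 67, 63, 91, 63, 64, 69, 63, 60, 57, 83, 69, 60, 58, 70, 59, 85, 68, 85, 56, 79,
    85, 76, 76, 73, 60, 87, 57, 67, 72, 92, 58, 55, 54, 71, 90, 55, 58, 59, 63, 77, 85, 77, 53, 66, 73,
    53, 79, 70, 70, 77, 56, 65, 85, 64, 74, 66, 74, 59, 68, 79, 66, 56, 68, 63, 66, 66, 68, 70, 64, 72,
    56, 83, 53, 69, 67, 77, 68, 63, 73, 57, 86, 52, 75, 78, 76, 61, 71, 77, 64, 62, 77, 69, 69, 66, 85,
    65, 61, 72, 69, 73, 53, 77, 54, 56, 72, 70, 69, 67, 62, 78, 58, 54, 69, 76, 86, 59, 80, 84, 56, 78,
    75, 57, 68, 91, 91,
]


def percentile(values, percent):
    """Percentile by the index method, averaging the two neighbours if needed."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    ordered = sorted(values)                 # order from low to high
    position = percent * len(ordered) / 100  # 1-based index, e.g. 20.5
    if position.is_integer():
        return ordered[int(position) - 1]    # whole number: use it directly
    # Between two positions: average the values either side.
    lower = ordered[max(math.floor(position), 1) - 1]
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2


p10 = percentile(ages, 10)
print(p10)
```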
The ‘Normal’ or ‘Gaussian’ distribution:
- Symmetrically distributed, with the most common values in the centre.
- ‘Bell-curve’
- Mean, median and mode are identical
- Principles apply more widely to lots of areas of statistics
Visualising the Normal Distribution
Mean is in the centre of the curve
Mean + standard deviation
How much of the distribution is here?
Adding in more standard deviations
Change the x scale to values.
We modelled a distribution with a mean of 500 and a standard deviation of 150.
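To put numbers on ‘how much of the distribution is here?’, the sketch below uses `NormalDist` from Python's standard library with the same mean of 500 and standard deviation of 150. The helper `share_within` is a name invented for this example.

```python
from statistics import NormalDist

# The modelled distribution: mean 500, standard deviation 150.
dist = NormalDist(mu=500, sigma=150)


def share_within(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


within = {k: share_within(k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} sd ({500 - 150 * k} to {500 + 150 * k}): {share:.3f}")
```

For any normal distribution these shares are roughly 68%, 95% and 99.7%, whatever the mean and standard deviation.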
Z-scores
- Standard score, “invariant” to raw scale
- Changes scale but not distribution
- Each value has the mean subtracted and is then divided by the standard deviation, so:
  - z = 0 is the mean
  - z = 1 is the mean + 1 sd
  - z = -1 is the mean - 1 sd
- We are now talking in units of standard deviation
z = \frac{x - \bar{x}}{sd}
Z-scores are a common way to compare different indicators that are on different natural scales
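A minimal sketch of the z-score calculation, using the first 25 ages only. After rescaling, the values have a mean of 0 and a standard deviation of 1, but the shape of the distribution is unchanged.

```python
import statistics

# First 25 values of the ages data shown at the start of these slides.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)

# z = (x - mean) / sd for every value
z_scores = [(x - mean_age) / sd_age for x in ages]

# The oldest patient in the sample, expressed in standard deviations from the mean.
print(max(ages), round(max(z_scores), 2))
```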
Skewed Distributions
Not all distributions follow the pattern we’ve seen above. We call these non-symmetrical distributions ‘skewed’ and they can’t be treated in quite the same way. Our normal (or “parametric”) methods assume the same shape / distribution on both sides of the mean.
Imagine the case of hospital length-of-stay. What happens here?
Measured in days: numeric, discrete
Can’t be < 0, so there is a hard limit at zero
Rarely, some patients stay in for a really long time.
Summarising Skewed Data
We might describe data as “left” or “right” skewed. Skew pulls the mean towards the long tail, but has much less effect on the median.
Summarising Skewed Data with percentiles
Summarising Skewed Data principles
Use median (50th percentile) for mid point
25th and 75th percentiles commonly used
The difference between the 75th and 25th percentiles is referred to as the ‘inter-quartile range’ (IQR).
We can also list extremes of distribution, or 5th and 95th percentiles.
‘Five-number summary’: minimum, “lower-hinge”, median, “upper-hinge”, maximum
- A ‘hinge’ is the median of the lower or upper half of the data, usually close to the 25th or 75th percentile.
Mean & SD are not a good representation here, as the mean is affected by the extreme values and the standard deviation is only easy to interpret when the shape is symmetrical either side of the mean.
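A small sketch of why the median is preferred for skewed data. The length-of-stay values below are hypothetical, invented purely for illustration; the only difference between the two lists is that the longest stay changes from 6 days to 60.

```python
import statistics

# Hypothetical lengths of stay in days (made-up values, not real data).
typical_stays = [1, 1, 2, 2, 3, 3, 4, 5, 6]
with_long_stay = [1, 1, 2, 2, 3, 3, 4, 5, 60]  # one patient stays 60 days

mean_before = statistics.mean(typical_stays)
mean_after = statistics.mean(with_long_stay)

median_before = statistics.median(typical_stays)
median_after = statistics.median(with_long_stay)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

One extreme value triples the mean, while the median does not move at all.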
Summarising Binary or Categorical Data
We have fewer options here, and some of the methods above are not suitable for these types.
We can use:
- Counts / frequency in groups
- Percentages for relative values (but we lose the absolute values)
As an example, we will use the ages data set but group it into 10-year categories.
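A sketch of that grouping in Python, using the first 25 ages only. The helper `age_group` and its labels such as "50-59" are choices made for this example, not taken from the original slides.

```python
from collections import Counter

# First 25 values of the ages data shown at the start of these slides.
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]


def age_group(age, width=10):
    """Label the 10-year category an age falls into, e.g. 57 -> '50-59'."""
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


# Counts / frequency in each group.
counts = Counter(age_group(a) for a in ages)

# Percentages: relative size of each group.
total = sum(counts.values())
percentages = {group: 100 * count / total for group, count in counts.items()}

for group in sorted(counts):
    print(f"{group}: n = {counts[group]}, {percentages[group]:.0f}%")
```

Reporting the count alongside the percentage keeps the absolute values visible.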
Summary
Is the question asking for the data itself, or for you to do something with the data to summarise it?
Consider data type:
- Numeric / Binary / Categorical
- Within that, is it discrete / ordered etc.?
Plot it accordingly
Choose appropriate summary statistics