Session 1: Data types and summary statistics

Author

Chris Mainey

Published

4 April 2025

Introduction

“Data”

  • We are not interested simply in measurements, or individual data points.

  • We are usually trying to answer a question / learn something / make a decision

  • When we use data, we need to bring some order to it

  • We will consider structured data

The pyramid of knowledge with a base of data, the next layer built of information, then knowledge and finally wisdom at the peak.

We often say data, but don’t always mean data

Data types

Data can take various forms: E.g. measurements, grouping factors, estimates, observations etc.

Different in terms of:

  • Storage
  • Methods for summary/processing
  • Interpretation
  • Data-generating mechanism

A few major groups of data types:

  • Numeric
  • Binary
  • Categorical


  • Continuous: values that can always be subdivided further, so there is always another possible value between any two values

    • E.g. height of a person could be 172, 173 or 172.5 cm

    • Examples in NHS: physiological measurements like blood pressure

  • Discrete: values that can only take whole numbers, usually obtained by counting.

    • E.g. Number of patients seen in a clinic could be 35 or 36 but not 35.5

    • Examples in NHS: counts of patients, waiting time measured in whole minutes, length of stay measured in days, number of patient episodes

  • Binary: a variable with two mutually exclusive states

    • E.g. 0/1, yes/no, TRUE/FALSE

    • Examples in NHS: Patient dead or alive?, TRUE or FALSE answer to survey, patient status for a genetic marker

Counting binary events becomes discrete numeric.

We may choose to count only the events, e.g. ‘deaths’ not ‘survivals’
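A minimal Python sketch of how counting binary events gives a discrete number. The `died` list is hypothetical, made up purely for illustration.

```python
# Hypothetical binary outcomes for eight patients (True = died, False = survived)
died = [False, True, False, False, True, False, False, False]

# Python treats True as 1 and False as 0, so summing counts the events
deaths = sum(died)
survivors = len(died) - deaths

print(f"{deaths} deaths out of {len(died)} patients")
```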


Taken from: xkcd https://xkcd.com/605/

  • Nominal: Categories without any notion of order
    • E.g. Hair Colour, Brand of car, Country of residence
    • Examples in NHS: Ethnicity, Admission method, Treatment speciality


  • Ordinal: Categories with order, but not linear like numeric
    • E.g. Survey answers ‘Good, OK & Bad’. There is order, but ‘OK’ ≠ ‘Bad’ x 2 and ‘Good’ ≠ ‘Bad’ + ‘OK’
    • Examples in NHS: Cancer stage, self-assessed patient answers like ‘is your health poor, OK or good,’ Age-groups <1, 1-16, 17-40 etc.

Summarising Numeric Data



“Dear BI team, I would like age of all patients admitted as an emergency to general medicine in December?”



  [1] 75 81 59 70 64 67 66 54 68 72 80 66 70 76 75 52 59 52 86 56 51 59 72 61 53
 [26] 72 75 69 64 55 74 54 61 74 86 53 68 69 76 58 59 79 59 69 91 55 59 68 58 70
 [51] 68 60 89 54 85 76 56 56 84 91 90 87 90 85 54 76 91 79 53 62 72 69 75 76 76
 [76] 76 63 85 76 85 67 63 91 63 64 69 63 60 57 83 69 60 58 70 59 85 68 85 56 79
[101] 85 76 76 73 60 87 57 67 72 92 58 55 54 71 90 55 58 59 63 77 85 77 53 66 73
[126] 53 79 70 70 77 56 65 85 64 74 66 74 59 68 79 66 56 68 63 66 66 68 70 64 72
[151] 56 83 53 69 67 77 68 63 73 57 86 52 75 78 76 61 71 77 64 62 77 69 69 66 85
[176] 65 61 72 69 73 53 77 54 56 72 70 69 67 62 78 58 54 69 76 86 59 80 84 56 78
[201] 75 57 68 91 91



How would you answer that question?



What is the question really asking? It’s quite unlikely that sending a list of numbers really answers the question. The requester may have various questions in mind, but they probably want some idea of the range of ages, and of how the people are distributed across different ages.

  • How might we show this in a better way?

  • Summary figures?


Visualising data types:

  • Scatter plots
  • Stem & leaf plot
  • Histogram or Kernel Density (sounds more impressive than it is)
  • Box plot

Scatter plots

Plots an x variable against a y variable, with one point per observation

Why doesn’t this help?

  • We’ve only got one variable, not two
  • No summary information
  • We want to see some kind of distribution

Stem & Leaf Plots

  • Easy to do by hand or on computer

  • Decide on grouping size (5-year in the example below)

  • Major units on left, minor on right

  • We essentially create a tally


  The decimal point is 1 digit(s) to the right of the |

  5 | 122233333334444444
  5 | 555566666666677778888889999999999
  6 | 000011112223333333444444
  6 | 5566666666777778888888888999999999999
  7 | 00000000112222222233334444
  7 | 555555666666666666777777788899999
  8 | 0013344
  8 | 55555555556666779
  9 | 0001111112
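The tally idea can be sketched in Python. This is an illustration rather than the code that produced the plot above; it uses only the first 25 ages from the earlier listing, and `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

# First 25 ages from the listing earlier in this session
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

def stem_and_leaf(values):
    """Return stem-and-leaf rows using 5-year groups (leaves 0-4, then 5-9)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 72 -> stem 7, leaf 2
        upper_half = leaf >= 5           # split each stem into two rows
        rows[(stem, upper_half)].append(str(leaf))
    # Rows with no values are simply omitted in this sketch
    return [f"{stem} | {''.join(leaves)}"
            for (stem, _), leaves in sorted(rows.items())]

stem_rows = stem_and_leaf(ages)
print("\n".join(stem_rows))
```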

Histograms and Bar Charts

  • Plot of binned counts

  • Good way to visualise distribution

  • Bin sizes can vary & do not have to be equal

  • Bar charts are related, but do not share the ‘binning’ idea. Can be used with categorical
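As a rough sketch of what ‘binned counts’ means, the Python below counts the first 25 ages from the earlier listing in 5-year bins and prints a text histogram. The bin width is an assumption chosen to match the stem & leaf grouping.

```python
from collections import Counter

# First 25 ages from the listing earlier in this session
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

bin_width = 5  # assumed bin width in years

# Label each value by the lower edge of its bin, e.g. 72 -> 70, 76 -> 75
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

for lower in sorted(bin_counts):
    upper = lower + bin_width - 1
    count = bin_counts[lower]
    print(f"{lower}-{upper} | {'#' * count} ({count})")
```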

Kernel Density

  • Similar to a smoothed histogram

  • Plots the probability density of data rather than counts of values

  • Conceptually harder, but nicer visualisation
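One way to see what a kernel density does: place a small bell-shaped ‘bump’ on each observation and average them. The sketch below assumes a Gaussian kernel and a bandwidth of 4 years; both are illustrative choices, and plotting software normally picks the bandwidth for you.

```python
from statistics import NormalDist

# First 25 ages from the listing earlier in this session
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

bandwidth = 4.0  # assumed smoothing width in years; larger gives a smoother curve
kernel = NormalDist(mu=0, sigma=1)

def density(x):
    """Average of Gaussian bumps centred on each observation."""
    bumps = sum(kernel.pdf((x - age) / bandwidth) for age in ages)
    return bumps / (len(ages) * bandwidth)

# Evaluate on a grid from 20 to 120 years; the area under a density is about 1
step = 0.5
grid = [20 + i * step for i in range(201)]
area = sum(density(x) * step for x in grid)

print(f"density at 66: {density(66):.4f}, area under curve: {area:.3f}")
```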

Box Plots

The box spans the “hinges”:

  • 25th percentile
  • 75th percentile
  • Line is the median (50th percentile)
  • Whiskers extend to the most extreme value within 1.5 * IQR of each hinge
  • Outliers (points beyond the whiskers) are plotted individually
  • Terms will be explained in the following slides
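The numbers behind a box plot can be sketched in Python using the first 25 ages from the earlier listing. Quartile conventions differ between tools; the `method="inclusive"` setting here is an assumption, so other software may give slightly different hinges.

```python
from statistics import quantiles

# First 25 ages from the listing earlier in this session
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

# 25th, 50th and 75th percentiles (one of several quartile conventions)
q1, med, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the hinges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences would be drawn as an individual point
outliers = sorted(a for a in ages if a < lower_fence or a > upper_fence)

print(q1, med, q3, iqr, whisker_low, whisker_high, outliers)
```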

Violin plot

A collision of the box plot and the density plot. This is ‘two-sided’ here, but could be set to a single side.

Summary figures

Quick notation definition

  • x : is the variable of interest. If we have more than one, they might be y, z etc.

  • n : is the number of observations, or the count of x

  • i : is usually used to denote an individual value, rather than all values. E.g. x_i usually means ‘each value of x’, rather than all values of x. It is also the index, e.g. i=3 means the value at position 3

  • \bar{} : is usually used to signify the mean. E.g. \bar{y} is the mean of y.

  • \Sigma : is a ‘sum’ operator. It is usually given with an end value at the top and a starting value at the bottom, e.g. \sum_{i=1}^n reads as ‘sum the n values, starting at index 1’.


Centre of the data

We commonly want some measure of the central point, and a description of the ‘weight’ of the data

  • “Mean” is average of data, calculated centre: \bar{x} = \frac{\sum_{i=1}^n x_i}{n}

  • “Median” is middle value in rank order

  • “Mode” is the most common value
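A short Python sketch of the three measures, using the first 25 ages from the earlier listing. The mean is written out to mirror the formula, alongside the standard library functions.

```python
from statistics import mean, median, mode

# First 25 ages from the listing earlier in this session
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

age_mean = sum(ages) / len(ages)   # sum of x_i divided by n
age_median = median(ages)          # middle value once sorted
age_mode = mode(ages)              # most common value

print(f"mean={age_mean:.2f}, median={age_median}, mode={age_mode}")
```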


Spread

There are various measures of spread:

  • Extremes of the distributions (highest / lowest)

  • Quantiles (often percentiles) - the group of observations within certain ranges, e.g. middle 50%

  • The standard deviation (sd or \sigma): roughly, the typical distance of a point from the mean.

sd = \sqrt{ \frac{\sum_{i=1}^n \left(x_i - \bar{x} \right)^2}{n-1}}

or, in easier Excel terms:

sqrt\left(\frac{sum((x_i-\bar{x})^2)}{count(x)-1}\right)
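The same calculation, step by step, as a Python sketch on the first 25 ages from the earlier listing. The variable names are illustrative; the result is checked against the standard library's `statistics.stdev`, which also divides by n - 1.

```python
import math
from statistics import stdev

# First 25 ages from the listing earlier in this session
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

n = len(ages)
x_bar = sum(ages) / n                                # the mean
squared_deviations = [(x - x_bar) ** 2 for x in ages]
sd = math.sqrt(sum(squared_deviations) / (n - 1))    # divide by n - 1, then square root

print(f"sd by hand: {sd:.3f}, statistics.stdev: {stdev(ages):.3f}")
```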


How do we calculate a percentile?

  • Take some values: e.g. the ages in earlier slides

  • Order them from low to high

  • Add an index

How do we calculate a percentile? (2)

Find the percentile you want, using the index

  • e.g. 10% of 205 age values
  • 0.1 * 205 = 20.5th index value

If index value is a whole number, use it directly


If index value is between numbers, various rules, including:

  • Round up/down
  • Average
  • Weighted average

10th percentile = 55
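The steps above can be sketched in Python. This is one possible implementation of the ordered-index approach, run on the first 25 ages from the earlier listing; `percentile` and its `rule` argument are hypothetical names, and statistical software often uses more refined interpolation rules.

```python
import math

# First 25 ages from the listing earlier in this session
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

def percentile(values, percent, rule="average"):
    """Percentile via the ordered index; `rule` handles in-between positions."""
    ordered = sorted(values)
    n = len(ordered)
    position = percent * n / 100          # 1-based, e.g. 10 * 205 / 100 = 20.5
    lower = min(max(math.floor(position), 1), n)
    upper = min(max(math.ceil(position), 1), n)
    if lower == upper:                    # whole number: use that value directly
        return ordered[lower - 1]
    if rule == "down":
        return ordered[lower - 1]
    if rule == "up":
        return ordered[upper - 1]
    return (ordered[lower - 1] + ordered[upper - 1]) / 2   # simple average

print(percentile(ages, 20))               # position 5 is a whole number
print(percentile(ages, 70, rule="down"),
      percentile(ages, 70, rule="up"),
      percentile(ages, 70))               # position 17.5 falls between two values
```

With only 25 values the rules can disagree by a year or two; with larger data sets the differences are usually small.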


The ‘Normal’ or ‘Gaussian’ distribution:

  • Symmetrically distributed, with the most common values in the centre.
  • ‘Bell-curve’
  • Mean, median and mode are identical
  • Principles apply more widely to lots of areas of statistics

Visualising the Normal Distribution

Mean is in the centre of the curve

Mean + standard deviation

How much of the distribution is here?

Adding in more standard deviations

Change the x scale to values.

We modelled a distribution with a mean of 500, and a standard deviation of 150.
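To put numbers on ‘how much of the distribution is here?’, the Python sketch below uses the standard library's `NormalDist` with the mean of 500 and standard deviation of 150 from this example. `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

dist = NormalDist(mu=500, sigma=150)

def share_within(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    low, high = 500 - k * 150, 500 + k * 150
    print(f"within {k} sd ({low} to {high}): {share_within(k):.1%}")
```

Roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.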

Z-scores

  • Standard score, “invariant” to raw scale
  • Changes scale but not distribution

  • The mean is subtracted from each value, and the result is divided by the standard deviation, so:
    • Mean = 0
    • 1 = Mean + 1sd
    • -1 = Mean – 1sd
  • We are now talking in units of standard deviation

\large z = \frac{(x - \bar{x})}{sd}



Z-scores are a common way to compare different indicators on different natural scales
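A minimal Python sketch of the z-score formula, using the modelled distribution from earlier (mean 500, standard deviation 150). `z_score` is a hypothetical helper name.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / sd

# Values on the raw scale of the modelled distribution
for value in (350, 500, 650, 725):
    print(f"{value} -> z = {z_score(value, 500, 150):+.2f}")
```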

Skewed Distributions

Not all distributions follow the pattern we’ve seen above. We call these non-symmetrical distributions ‘skewed’ and they can’t be treated in quite the same way. Our normal (or “parametric”) methods assume the same shape / distribution on both sides of the mean.

Imagine the case of hospital length-of-stay. What happens here?

  • Measured in days: numeric, discrete

  • Can’t be < 0, so there is a hard limit at zero

  • Rarely, some patients stay in for a really long time.

Summarising Skewed Data

We might describe data as “left” or “right” skewed. Skew pulls the mean towards the long tail, while the median is much less affected.

Summarising Skewed Data with percentiles

Summarising Skewed Data principles

  • Use median (50th percentile) for mid point

  • 25th and 75th percentiles commonly used

  • 75th - 25th percentile referred to as ‘Inter-quartile range’ (IQR).

  • We can also list extremes of distribution, or 5th and 95th percentiles.

  • ‘Five-number summary’: minimum, “lower-hinge”, median, “upper-hinge”, maximum

    • A ‘hinge’ is the median of the lower or upper half of the data, usually close to the 25th or 75th percentile.
  • Mean & SD are not a good representation, as the mean is affected by the extreme values and the standard deviation assumes a symmetrical shape either side of the mean.
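A Python sketch of these principles on skewed data. The `los` values are hypothetical lengths of stay, made up to show a long right tail, and the `method="inclusive"` quartile convention is an assumption.

```python
from statistics import mean, median, quantiles

# Hypothetical lengths of stay in days, made up to show a long right tail
los = [0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 12, 30, 45]

los_mean = mean(los)       # dragged upwards by the two very long stays
los_median = median(los)   # the middle patient, unaffected by the tail

# 25th and 75th percentiles, then the five-number summary
q1, _, q3 = quantiles(los, n=4, method="inclusive")
five_numbers = (min(los), q1, los_median, q3, max(los))

print(f"mean={los_mean:.1f}, median={los_median}")
print("five-number summary:", five_numbers, "IQR:", q3 - q1)
```

Here the mean is more than double the median, so the median and percentiles give a fairer picture of a typical stay.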

Summarising Binary or categorical

We have fewer options here, and some of the methods above are not suitable for these types:

We can use:

  • Counts / frequency in groups
  • Percentages for relative values, although these lose the absolute values

As an example, we will use the ages data set but group it into 10-year categories.
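A sketch of this grouping in Python, using only the first 25 ages from the earlier listing, so the counts are illustrative rather than those for the full data set. `age_group` is a hypothetical helper name.

```python
from collections import Counter

# First 25 ages from the listing earlier in this session
ages = [75, 81, 59, 70, 64, 67, 66, 54, 68, 72, 80, 66, 70,
        76, 75, 52, 59, 52, 86, 56, 51, 59, 72, 61, 53]

def age_group(age):
    """Label an age with its 10-year category, e.g. 72 -> '70-79'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

group_counts = Counter(age_group(age) for age in ages)
group_percent = {group: 100 * count / len(ages)
                 for group, count in group_counts.items()}

for group in sorted(group_counts):
    print(f"{group}: n={group_counts[group]} ({group_percent[group]:.0f}%)")
```

Reporting the count next to the percentage keeps the absolute values visible.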

Summary

  • Is the question asking about the data, or for you to do something with the data to summarise?

  • Consider data type:

    • Numeric / Binary / Categorical
    • Within that, is it discrete / ordered etc.
  • Plot it accordingly

  • Choose appropriate summary statistics