One Thing You Should Know About Coronavirus Statistics

There are lots of feeds for statistics about the spread of COVID-19 (eg Worldometer, Berliner Morgenpost, UK Gov dashboard, etc). Few are definitive; they aggregate reports from newspapers, government press releases, institution updates and medical studies. The numbers are clear and ‘exciting’ but suffer from all sorts of problems: missed data, double counted data, and the wrong type of data.

Using new cases is the biggest problem. A ‘case’ is a medical administrative term; a case is monitored and tracked. This is not at all the same as new infections, but this distinction is lost in most media. To track the situation we need to know about new infections, not new cases – and the difference is not minor.

Let’s say we have a population, a third of whom have the virus. They are under perfect lockdown and so have few new infections. If we test 3,000 people per day we will find 1,000 new cases every day – new cases that are discoveries of existing infections, not cases of new infections.

Worse, if we improve our test regimes then these new cases accelerate. Yesterday we tested 3,000 people and found 1,000 new cases. Today we test 6,000 and find 2,000 new cases. Tomorrow we will test 9,000 people and expect 3,000 new cases. The number of new cases is accelerating – but only because the number of people we are testing is accelerating.
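The arithmetic in this example can be sketched in a few lines of Python. It is only an illustration under the same simplifying assumptions as the text: a fixed one-third prevalence, no new infections, and tests drawn at random from the population. The function name is made up for this sketch.

```python
from fractions import Fraction

def expected_new_cases(tests_per_day, prevalence):
    """Expected positive results when testing a random sample of the population."""
    return int(tests_per_day * prevalence)

# Assumption taken from the example: a third of the population is already infected.
PREVALENCE = Fraction(1, 3)

for tests in (3000, 6000, 9000):
    # Prevalence never changes, yet "new cases" rise in step with testing.
    print(tests, "tests ->", expected_new_cases(tests, PREVALENCE), "new cases")
```

The reported case count triples over three days although the number of infected people in this toy population has not changed at all.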

New cases are not new infections

To find new infections we need to sample the wider population, not just those with symptoms. We need to sample enough to estimate the proportion of people with the virus, which requires both the number of people tested and the number of positive results. And while it’s easy to find reports of new cases and deaths by just typing ‘coronavirus deaths’ into Google, it’s surprisingly difficult to find regular updates of testing rates and results.

Dodgy Proxy

To get an idea of what proportion of the general population has infections, we compare the number of people tested with the number of positive results.
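As a rough sketch of that comparison, the Python below computes the positive proportion and an approximate 95% Wilson score interval around it. The function names are hypothetical, and the interval only reflects sampling noise: it assumes the people tested are a random sample of the population, which real testing data almost certainly is not.

```python
import math

def positive_proportion(positives, tests):
    """Share of completed tests that came back positive."""
    return positives / tests

def wilson_interval(positives, tests, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p = positives / tests
    denom = 1 + z * z / tests
    centre = (p + z * z / (2 * tests)) / denom
    half_width = z * math.sqrt(p * (1 - p) / tests + z * z / (4 * tests * tests)) / denom
    return centre - half_width, centre + half_width

# Illustrative numbers only (borrowed from the 3,000-test example), not real data.
low, high = wilson_interval(1000, 3000)
print(positive_proportion(1000, 3000), low, high)
```

More tests narrow the interval, but no number of tests corrects for a biased choice of who gets tested.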

[Edit: replaced section that scraped data from a BBC article] Some Twitter monitoring eventually pointed me at this site that publishes tests and positive result counts:

It doesn’t indicate whether ‘completed tests’ means tested individuals or the number of test kits used, and there can be quite a considerable difference, so this will need some more adjusting when this is clear.

We prefer days with relatively large numbers of tests because this reduces the impact of actual new infections that will also be included. However, there are likely delays in early reporting and aggregating results; for example, the odd dip in the number of tests on the 20th March is probably reflected in the dip in the number of new cases a day or so later. The correlation seems clearer by the 24th.

[Update: ‘Murtaman’ has already been scraping some of this information from the published daily updates and created this rather excellent spreadsheet]

These are by no means reliable numbers and certainly should not be used to estimate trends. The combination of the dip in tests on 20th March and various delays in reporting make anything before the 23rd unreliable. If the proportion of cases went down significantly as the number of tests went up, this would tell us that many of the cases are reports of new infections, but this does not appear to be so. Altogether this strongly suggests that between 1/5 and 1/3 of the UK population already has the virus.

We can use cruise ship cases to sanity check this estimate as some of those populations were fully tested. The Diamond Princess had a population that mixed closely for two weeks between first contact and quarantine, and the whole population was tested later. They found over 700 infected in a population of 4,000; nearly a fifth. This is not far from our UK estimate above. (Death rates in these cruise ship cases are harder to generalise from as half the population were over 60, many with age-related complications).
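The sanity check is simple enough to write down. This sketch uses the rounded figures quoted above (700 infected out of 4,000) and the 1/5 to 1/3 range from the UK estimate; the function name is invented for illustration.

```python
from fractions import Fraction

def infected_fraction(infected, population):
    """Exact share of a fully tested population that was infected."""
    return Fraction(infected, population)

# Rounded figures as quoted in the text, not the official counts.
ship = infected_fraction(700, 4000)
uk_low, uk_high = Fraction(1, 5), Fraction(1, 3)

print("ship share:", float(ship))
print("inside UK range:", uk_low <= ship <= uk_high)
print("gap to lower bound:", float(uk_low - ship))
```

On these rounded figures the ship comes out at 17.5%, a little under the lower end of the UK range rather than inside it.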

What does this mean? A ‘worst case scenario’ of 20 million people with a disease that has a 1% death rate is a looming disaster. Or perhaps it shows that the virus is not nearly as dangerous as it seemed initially. What data can we use to help resolve this?

Death rates suffer from reporting and timing problems. We should compare deaths against recoveries of people who were infected at the same time, but we don’t know when they were infected, and deaths occur sooner than recovery. There is no public database that I am aware of that tracks this relationship.

Popular visualisers compare current deaths with current recoveries and current cases.

None of these have any relevant relationship to each other when the situation is rapidly changing.
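A toy simulation shows how far off the comparison can be. The sketch below assumes infections grow by a fixed factor each day, a fixed true death rate, and fixed delays from infection to death and to recovery. Every parameter value and the function name are assumptions chosen for illustration, not estimates from data.

```python
def naive_death_share(days, daily_growth, true_death_rate,
                      days_to_death, days_to_recovery):
    """Deaths / (deaths + recoveries) as counted on the final day.

    Requires days > days_to_death so that at least one outcome is counted.
    """
    infections = [1000.0 * daily_growth ** day for day in range(days)]
    deaths = 0.0
    recoveries = 0.0
    for day, infected in enumerate(infections):
        # Only cohorts infected long enough ago have a recorded outcome.
        if day + days_to_death < days:
            deaths += true_death_rate * infected
        if day + days_to_recovery < days:
            recoveries += (1 - true_death_rate) * infected
    return deaths / (deaths + recoveries)

# Hypothetical inputs: 20% daily growth, 1% true death rate,
# death 10 days after infection, recovery 20 days after infection.
print(naive_death_share(40, 1.2, 0.01, 10, 20))
```

With these made-up inputs the naive share comes out at roughly 6%, several times the 1% built into the simulation, purely because deaths are counted from larger, more recent cohorts than recoveries. Setting both delays to the same value brings the result back to 1%.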

Death rates are also not consistently reported because not everyone who dies is tested for the virus. Half a million people die – mostly of age-related conditions – every year in the UK; over two thousand a day in the winter. In Italy, people who tested positive for the virus were listed as dying of the virus. Germany, early on, didn’t even test many possibly relevant deaths. Visualisers often report the date the death was reported, not the date it happened, showing odd fluctuations that are due to administrative delays.

Data on Intensive Care Unit (ICU) patients may be better indicators of changes in harmfulness as these are a relatively small number of patients that are strongly scrutinised. The UK ICNARC agency audits ICU use and recently released a report; again a daily update would be more useful! This report shows the ICU load in the UK (ignore the shaded ‘lag’ section which is incomplete due to reporting delays):

These ICU cases appear to be in the late stages of the ‘S curve’, which suggests that the likely number of ICU patients with the virus will reach less than 1,000 in a few days. (Contrast this with the Guardian’s more exciting article on the same report, which has phrases like “doubling every three days”, “overwhelming NHS intensive care”, “surge”, “soared”, “surge” again, and so on.)

This should be compared with the number of ICU beds available. The NHS Critical Care Bed Capacity reports that the UK normally has over 4,000 adult critical care beds with a common occupancy rate of 80%. That is, we already have around 800 spare ICU beds across the UK, although obviously clusters of cases will overwhelm individual units.
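For anyone who wants to rerun this with updated figures, the spare-bed arithmetic looks like this. It is a minimal sketch using the rounded numbers quoted above, and the function name is hypothetical.

```python
def spare_icu_beds(total_beds, occupancy_percent):
    """Beds left over at the usual occupancy rate (integer arithmetic)."""
    return total_beds * (100 - occupancy_percent) // 100

# Rounded figures from the text: about 4,000 beds at about 80% occupancy.
spare = spare_icu_beds(4000, 80)
print("spare beds:", spare)
```

This is a national total; it says nothing about whether the spare beds are in the same places as the patients.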

Pointless Early Speculating

All the above is about identifying the right data, and not being distracted by exciting data or hyperbole. We don’t really have enough yet to be confident – another week should be enough – but speculating with what we have is a useful exercise. 

If around 1/4 of the UK population is already infected, and if the ICU cases are reaching the top of the ‘S’ curve, we can see an end to the crisis.

The ICNARC report shows that about half of ICU cases are resolved within 10 days, with about 2/3 of patients recovering well enough to transfer out of ICU, and about 1/3 dying (see figure 6).
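Applied to a given ICU caseload, those proportions work out as follows. This is a sketch using the rounded fractions quoted above; the caseload of 900 and the function name are placeholders for illustration, not figures from the report.

```python
from fractions import Fraction

def icu_outcomes_after_10_days(icu_cases):
    """Split an ICU caseload using the rounded proportions quoted in the text."""
    resolved = icu_cases * Fraction(1, 2)      # about half resolve within 10 days
    transferred = resolved * Fraction(2, 3)    # of those, about 2/3 leave ICU alive
    died = resolved * Fraction(1, 3)           # and about 1/3 die
    return {
        "still_in_icu": icu_cases - resolved,
        "transferred_out": transferred,
        "died": died,
        "beds_released": transferred + died,
    }

# Hypothetical caseload, chosen only because it divides evenly.
for name, value in icu_outcomes_after_10_days(900).items():
    print(name, int(value))
```

Under these proportions every resolved case frees a bed, so half of the occupied beds become available again within about ten days.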

That means that in a week or two new cases will have tailed off and half of the original cases resolved, releasing those ICU beds. Most people could go back to work – keeping the vulnerable in isolation. Even if we double the number of people infected to half the population within the following two weeks after that, we still remain within the emergency capacity of the NHS.

Two weeks after non-vulnerable adults return to work, at the end of the Easter holiday, children go back to school. We see another rise in infection as children further pass around the virus. Even if most of the population (80%) is now infected, much higher than the cruise ship cases, the hospital cases are within current emergency capacity.

In fact we can expect ICU load to be lower as long as the vulnerable remain isolated, because the current ICU load includes those vulnerable people who were infected when not isolated. If, of course, we’re all still happy with the idea of keeping people locked away for their own safety.

In a week we should have enough data to confirm or deny this speculation.

Summarising the One Thing: Use the right data

  • Use new infections to track pandemics, not new reported cases. As testing widens, use infection proportions from broad population samples.
  • Use ICU load, whose fewer patients are better monitored than all deaths, to track changes in the damage the pandemic is doing. 

Bonus Other Thing

As always don’t believe anything you read from a bloke on the internet – and the same applies to mainstream media. The news is full of quotes by experts, some of them contradictory. Ask yourself, when you read a quote by an expert:

  • Are they expert in the thing they’re being quoted about? Someone who is an expert in virus biology may not be an expert in pandemic statistics. 
  • What else did the expert say that is not being quoted? Journalists will report individual phrases without context, and often paraphrase them. A WHO report found a crude death rate of 3-4%, but most articles quoting it dropped the crucial word ‘crude’.
  • Who else are experts but are not being quoted? A pandemic expert whose job is planning for worst cases will provide suitably exciting quotes, but other pandemic experts who predict likely outcomes instead are less likely to reach the limelight.

Stay safe.
