CBCU Research

Technologies for data security and disclosure control

The background
Under Section 60 of the Health and Social Care Act 2001, cancer registries are in the privileged position of being able to collect information on patients without specific informed consent. The registry therefore has a strict data release policy to ensure that identifiable information is never passed to those who do not have a right to see it. However, even information that contains nothing that would by itself identify an individual can be used to identify someone when it is compared or combined with another information source that is already known. Data of this type, although anonymous, is said to be ‘disclosive’. For example, full postcodes, possibly combined with the age and sex of the population of interest, can lead to someone being identified where the population is sparse.

Disclosure becomes more likely as the number of cases in a population falls or where the total population is itself small; it is a particular problem for less common or childhood malignancies. However, disclosure can also occur when two populations do not overlap exactly. For example, if data is published for cancers in both local authority wards and PCTs where the geography is not coterminous, calculating the small number of cases that fall in the non-overlapping region can lead to potential identification. This ‘disclosure through differencing’ applies not just to overlapping geographies, but also to breakdowns by sex and to time series.
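
As a minimal sketch of disclosure through differencing, the Python below subtracts one published count from another. The counts, the variable names and the function `difference_counts` are hypothetical and chosen only for illustration; the threshold of five is the small-number limit commonly used for such data.

```python
THRESHOLD = 5  # assumed small-number limit: counts below this are treated as identifiable


def difference_counts(larger_area: int, smaller_area: int) -> int:
    """Return the number of cases in the part of the larger area
    that is not covered by the smaller area."""
    if smaller_area > larger_area:
        raise ValueError("the smaller area cannot contain more cases than the larger one")
    return larger_area - smaller_area


# Hypothetical published figures, not real data.
pct_cases = 48    # count published for a whole PCT
wards_cases = 46  # summed count for the wards lying inside that PCT

# Neither published figure is small, but their difference is.
sliver_cases = difference_counts(pct_cases, wards_cases)
is_disclosive = sliver_cases < THRESHOLD

print(sliver_cases, is_disclosive)
```

Both published figures pass a small-number check on their own; only the derived figure for the non-overlapping region falls below the threshold.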

The most common policy adopted by the cancer registries and other organisations, such as the Office for National Statistics, to prevent disclosure is to suppress small-number data, usually data cells with counts of fewer than five. All data containing counts of fewer than five is treated as identifiable, and appropriate controls must be in place before it is released. While this is usually effective, it leads to the suppression of a large amount of information, and occasionally the values of small cells can still be calculated from row totals that are not suppressed.
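
The weakness of suppression can be illustrated with a short Python sketch. It assumes a single row of hypothetical counts and an unsuppressed row total; the function names `suppress_row` and `recover_single_suppressed` are invented for this example and do not describe any registry's software.

```python
from typing import List, Optional

THRESHOLD = 5  # cells with counts below this value are suppressed


def suppress_row(counts: List[int], threshold: int = THRESHOLD) -> List[Optional[int]]:
    """Replace every count below the threshold with None (suppressed)."""
    return [c if c >= threshold else None for c in counts]


def recover_single_suppressed(row: List[Optional[int]], row_total: int) -> List[int]:
    """If exactly one cell is suppressed, recover it as the row total
    minus the sum of the cells that were released."""
    missing = [i for i, c in enumerate(row) if c is None]
    if len(missing) != 1:
        raise ValueError("recovery by subtraction needs exactly one suppressed cell")
    known = sum(c for c in row if c is not None)
    filled = list(row)
    filled[missing[0]] = row_total - known
    return filled


# Hypothetical counts, not real data.
counts = [12, 3, 20]
row_total = sum(counts)                                    # published unsuppressed
released = suppress_row(counts)                            # the 3 is hidden
recovered = recover_single_suppressed(released, row_total) # the 3 is found again

print(released, recovered)
```

When two or more cells in a row are suppressed, subtraction gives only their combined total, which is why policies of this kind sometimes suppress an additional cell or the total itself.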

One approach that avoids suppression is to randomly perturb the small numbers so that cells with counts of fewer than five no longer contain exact information. Although this works well to prevent disclosure, it causes problems with row totals, which may no longer equal the sum of their cells, and introduces a uniform ‘noise’ into all data that degrades its quality for all analyses.
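
The effect on row totals can be seen in a small Python sketch. The perturbation rule used here, replacing each small count with a random whole number between 0 and 4, is an assumption made for illustration and is only one of many possible schemes; `perturb_small` and the counts are hypothetical.

```python
import random
from typing import List, Optional


def perturb_small(counts: List[int], threshold: int = 5,
                  rng: Optional[random.Random] = None) -> List[int]:
    """Replace each count below the threshold with a random integer
    from 0 to threshold - 1; larger counts are left unchanged."""
    rng = rng or random.Random()
    return [rng.randint(0, threshold - 1) if c < threshold else c for c in counts]


# Hypothetical counts, not real data.
counts = [12, 3, 20, 1]
perturbed = perturb_small(counts, rng=random.Random(0))  # seeded for repeatability

# The published row total may no longer match the sum of the released cells.
total_gap = sum(perturbed) - sum(counts)

print(perturbed, total_gap)
```

Under this rule only the small cells change, but every total built from them inherits the random error, and a user of the data cannot tell how large that error is for any given cell.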

Current and future work
Over the last year Brian Shand has worked on a new method of disclosure control through the introduction of uncertainty into small-value data. This approach has the advantage that the degree of uncertainty can be applied separately to different dimensions of the data and adjusted depending upon the type of analysis being performed. Data can then be released that is safe from disclosure, with the magnitude of the uncertainty for each data cell explicitly provided.

In addition to developing the theoretical background, we have written computer programs that implement the method in practice. We are working in partnership with the Eastern Region Public Health Observatory to look at how this method can be applied to Hospital Episode Statistics and have had preliminary discussions with the Office for National Statistics.

This work will be of considerable value for data control in many settings in the NHS. Our next step is to implement an online version to allow us to make cancer incidence and mortality data available in a form that can be used on publicly accessible websites.