Use of text-mining methods to improve efficiency in the calculation of drug exposure to support pharmacoepidemiology studies

Stuart McTaggart, Clifford Nangle, Jacqueline Caldwell, Samantha Alvarez-Madrazo, Helen Colhoun, Marion Bennie

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Background: Efficient generation of structured dose instructions that enable researchers to calculate drug exposure is central to pharmacoepidemiology studies. The aim was to design and test an algorithm to codify dose instructions, applied to the NHS Scotland Prescribing Information System (PIS) that records approximately 100 million prescriptions per annum.
Methods: A natural language processing (NLP) algorithm was developed enabling free-text dose instructions to be represented by three attributes: quantity; frequency; and qualifier, each specified by a set of variables. This was tested on a sample of 15 593 distinct dose instructions and manually validated. The final algorithm was then applied to the full dataset.
Results: The dataset comprised 458 227 687 prescriptions, of which 99.67% had dose instructions represented by 4 964 083 distinct free-text dose instructions; 13 593 (0.27%) of these occurred ≥1000 times accounting for 88.85% of all prescriptions. Reviewers identified 767 (5.83%) instances where the structured output (n=13 152) was incorrect, an accuracy of 94.2%. Application of the final NLP algorithm to the dataset generated an overall structured output of 92.3% which varied by therapeutic area (86.7% central nervous system to 96.8% cardiovascular).
Conclusion: We adopted a zero assumption approach to create an NLP algorithm, operational at scale, to produce structured output which enables data users maximum flexibility to formulate, test and apply their own assumptions according to the medicines under investigation. Text mining approaches can provide a solution to the safe and efficient management and provisioning of large volumes of data generated through our health systems.
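The three-attribute representation described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give its vocabulary, rules, or the variables behind each attribute, so the lookup tables, the `StructuredDose` type and the `parse_dose` function below are hypothetical, and each attribute is collapsed to a single field for brevity. In the spirit of the zero assumption approach, anything the text does not state is left as `None` rather than filled with a default.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies, chosen for illustration only.
NUMBER_WORDS = {"one": 1.0, "two": 2.0, "three": 3.0, "four": 4.0, "half": 0.5}
FREQUENCY_PHRASES = {
    "once daily": 1, "once a day": 1, "daily": 1,
    "twice daily": 2, "twice a day": 2,
    "three times daily": 3, "three times a day": 3,
    "four times daily": 4, "four times a day": 4,
}
QUALIFIER_PHRASES = ("as required", "when required", "as directed", "if needed")

# A quantity is a standalone number, written as digits or as a known word.
_QUANTITY = re.compile(r"\b(\d+(?:\.\d+)?|" + "|".join(NUMBER_WORDS) + r")\b")


@dataclass
class StructuredDose:
    quantity: Optional[float]          # units per administration
    frequency_per_day: Optional[int]   # administrations per day
    qualifier: Optional[str]           # e.g. "as required"


def parse_dose(text: str) -> StructuredDose:
    """Map one free-text dose instruction to three attributes."""
    s = " ".join(text.lower().split())  # normalise case and whitespace

    # Frequency: try longer phrases first so "twice daily" beats "daily".
    frequency = None
    rest = s
    for phrase in sorted(FREQUENCY_PHRASES, key=len, reverse=True):
        if phrase in s:
            frequency = FREQUENCY_PHRASES[phrase]
            # Remove the phrase so "three times a day" is not read as quantity 3.
            rest = s.replace(phrase, " ")
            break

    # Quantity: first standalone number left in the text, if any.
    quantity = None
    match = _QUANTITY.search(rest)
    if match:
        token = match.group(1)
        quantity = NUMBER_WORDS[token] if token in NUMBER_WORDS else float(token)

    # Qualifier: first known phrase found, otherwise None (no default assumed).
    qualifier = next((q for q in QUALIFIER_PHRASES if q in s), None)

    return StructuredDose(quantity, frequency, qualifier)
```

For example, `parse_dose("2 puffs twice daily as required")` yields a quantity of 2.0, a frequency of 2 per day and the qualifier "as required", while `parse_dose("apply as directed")` leaves quantity and frequency unset so that the data user decides how to treat them.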
Language: English
Number of pages: 8
Journal: International Journal of Epidemiology
DOI: 10.1093/ije/dyx264
Publication status: Published - 6 Feb 2018

Fingerprint

Pharmacoepidemiology
Data Mining
Natural Language Processing
Prescriptions
Pharmaceutical Preparations
Scotland
Information Systems
Central Nervous System
Research Personnel
Health
Datasets

Keywords

  • text mining
  • natural language processing
  • dose information
  • prescriptions

Cite this

McTaggart, Stuart ; Nangle, Clifford ; Caldwell, Jacqueline ; Alvarez-Madrazo, Samantha ; Colhoun, Helen ; Bennie, Marion. / Use of text-mining methods to improve efficiency in the calculation of drug exposure to support pharmacoepidemiology studies. In: International Journal of Epidemiology. 2018.
@article{c0be826a5d39401e9b6d9cff23bc3929,
title = "Use of text-mining methods to improve efficiency in the calculation of drug exposure to support pharmacoepidemiology studies",
abstract = "Background: Efficient generation of structured dose instructions that enable researchers to calculate drug exposure is central to pharmacoepidemiology studies. The aim was to design and test an algorithm to codify dose instructions, applied to the NHS Scotland Prescribing Information System (PIS) that records approximately 100 million prescriptions per annum. Methods: A natural language processing (NLP) algorithm was developed enabling free-text dose instructions to be represented by three attributes: quantity; frequency; and qualifier, each specified by a set of variables. This was tested on a sample of 15 593 distinct dose instructions and manually validated. The final algorithm was then applied to the full dataset. Results: The dataset comprised 458 227 687 prescriptions, of which 99.67{\%} had dose instructions represented by 4 964 083 distinct free-text dose instructions; 13 593 (0.27{\%}) of these occurred ≥1000 times accounting for 88.85{\%} of all prescriptions. Reviewers identified 767 (5.83{\%}) instances where the structured output (n=13 152) was incorrect, an accuracy of 94.2{\%}. Application of the final NLP algorithm to the dataset generated an overall structured output of 92.3{\%} which varied by therapeutic area (86.7{\%} central nervous system to 96.8{\%} cardiovascular). Conclusion: We adopted a zero assumption approach to create an NLP algorithm, operational at scale, to produce structured output which enables data users maximum flexibility to formulate, test and apply their own assumptions according to the medicines under investigation. Text mining approaches can provide a solution to the safe and efficient management and provisioning of large volumes of data generated through our health systems.",
keywords = "text mining, natural language processing, dose information, prescriptions",
author = "Stuart McTaggart and Clifford Nangle and Jacqueline Caldwell and Samantha Alvarez-Madrazo and Helen Colhoun and Marion Bennie",
year = "2018",
month = "2",
day = "6",
doi = "10.1093/ije/dyx264",
language = "English",
journal = "International Journal of Epidemiology",
issn = "0300-5771",
}

Use of text-mining methods to improve efficiency in the calculation of drug exposure to support pharmacoepidemiology studies. / McTaggart, Stuart; Nangle, Clifford; Caldwell, Jacqueline; Alvarez-Madrazo, Samantha; Colhoun, Helen; Bennie, Marion.

In: International Journal of Epidemiology, 06.02.2018.

TY - JOUR

T1 - Use of text-mining methods to improve efficiency in the calculation of drug exposure to support pharmacoepidemiology studies

AU - McTaggart, Stuart

AU - Nangle, Clifford

AU - Caldwell, Jacqueline

AU - Alvarez-Madrazo, Samantha

AU - Colhoun, Helen

AU - Bennie, Marion

PY - 2018/2/6

Y1 - 2018/2/6

AB - Background: Efficient generation of structured dose instructions that enable researchers to calculate drug exposure is central to pharmacoepidemiology studies. The aim was to design and test an algorithm to codify dose instructions, applied to the NHS Scotland Prescribing Information System (PIS) that records approximately 100 million prescriptions per annum. Methods: A natural language processing (NLP) algorithm was developed enabling free-text dose instructions to be represented by three attributes: quantity; frequency; and qualifier, each specified by a set of variables. This was tested on a sample of 15 593 distinct dose instructions and manually validated. The final algorithm was then applied to the full dataset. Results: The dataset comprised 458 227 687 prescriptions, of which 99.67% had dose instructions represented by 4 964 083 distinct free-text dose instructions; 13 593 (0.27%) of these occurred ≥1000 times accounting for 88.85% of all prescriptions. Reviewers identified 767 (5.83%) instances where the structured output (n=13 152) was incorrect, an accuracy of 94.2%. Application of the final NLP algorithm to the dataset generated an overall structured output of 92.3% which varied by therapeutic area (86.7% central nervous system to 96.8% cardiovascular). Conclusion: We adopted a zero assumption approach to create an NLP algorithm, operational at scale, to produce structured output which enables data users maximum flexibility to formulate, test and apply their own assumptions according to the medicines under investigation. Text mining approaches can provide a solution to the safe and efficient management and provisioning of large volumes of data generated through our health systems.

KW - text mining

KW - natural language processing

KW - dose information

KW - prescriptions

UR - https://academic.oup.com/ije

U2 - 10.1093/ije/dyx264

DO - 10.1093/ije/dyx264

M3 - Article

JO - International Journal of Epidemiology

T2 - International Journal of Epidemiology

JF - International Journal of Epidemiology

SN - 0300-5771

ER -