Abstract
Background The exploding growth of the biomedical literature presents many challenges for biological researchers. One such challenge is the heavy use of abbreviations. Extracting abbreviations and their definitions accurately is very helpful to biologists and also facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule based, machine learning based, text alignment based and statistically based. State-of-the-art methods either focus exclusively on acronym-type abbreviations or cannot recognize rare abbreviations.
We propose a systematic method to extract abbreviations effectively. First, a scoring method is used to classify abbreviations into acronym-type and non-acronym-type abbreviations. Their corresponding definitions are then identified by two different methods: a text alignment algorithm for the former and a statistical method for the latter.
Results A literature mining system, MBA, was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, the Medstract gold standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at a precision of 91% on the Medstract gold standard evaluation corpus.
Conclusion We present a new literature mining system, MBA, for extracting biomedical abbreviations. Our evaluation demonstrates that the MBA system performs better than the others. It can identify the definitions of not only acronym-type abbreviations, including slightly irregular ones (e.g., ), but also non-acronym-type abbreviations (e.g., ).
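
The two-stage idea summarized in the abstract (score a candidate first, then route it to an alignment step or a statistical step) can be illustrated with a minimal sketch. This is not MBA's implementation: the scoring rule, the `ACRONYM_THRESHOLD` value, and all function names are assumptions made for illustration, and the alignment step borrows the right-to-left character matching of Schwartz and Hearst rather than the algorithm used in the paper. The statistical branch for non-acronym-type abbreviations is not sketched.

```python
import re
from typing import Optional

# Hypothetical cut-off; the abstract does not state how MBA scores candidates.
ACRONYM_THRESHOLD = 0.75


def acronym_score(short_form: str, long_form: str) -> float:
    """Fraction of short-form characters matched, in order, to word-initial letters."""
    initials = [w[0].lower() for w in re.split(r"[\s\-/]+", long_form) if w]
    letters = [c.lower() for c in short_form if c.isalnum()]
    if not letters:
        return 0.0
    pos, matched = 0, 0
    for c in letters:
        try:
            idx = initials.index(c, pos)
        except ValueError:
            continue  # this character has no word-initial match
        matched += 1
        pos = idx + 1
    return matched / len(letters)


def align_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Right-to-left character alignment in the style of Schwartz and Hearst."""
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        c = short_form[s_index].lower()
        if not c.isalnum():
            s_index -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != c) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_definition(short_form: str, candidate: str) -> Optional[str]:
    """Route acronym-like candidates to alignment; others are left unresolved here."""
    if acronym_score(short_form, candidate) >= ACRONYM_THRESHOLD:
        return align_long_form(short_form, candidate)
    return None  # a statistical method would handle this case (not sketched)
```

For instance, `extract_definition("HMM", "we use a hidden Markov model")` trims the candidate text down to the span that aligns with the short form, while a pair such as `("Ca", "calcium")` falls below the assumed threshold and would be deferred to the statistical branch.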