NDLI: Biomedical Named Entity Recognition with less Supervision

Content Provider	IEEE Xplore Digital Library
Author	Ghiasvand, O.; Kate, R.J.
Copyright Year	2015
Description	Author affiliation: Univ. of Wisconsin-Milwaukee, Milwaukee, WI, USA (Ghiasvand, O.; Kate, R.J.)
Abstract	Summary form only given: Manually annotating clinical notes is labor-intensive and requires domain expertise, which makes annotation expensive in both human effort and money. Manual annotations also commonly suffer from mistakes, missed tags, and inconsistency. The purpose of this research is to reduce the human annotation effort for clinical notes, improve consistency, and decrease the cost of annotation. The aim is to annotate clinical texts in order to extract biomedical names and terms. We use the Unified Medical Language System (UMLS) as the reference metathesaurus of names and terms used in the biomedical and clinical domains. We performed unsupervised and semi-supervised Named Entity Recognition (NER) through exact matching against UMLS. The data sets were provided by the SemEval 2015 (Task 14) natural language processing competition and comprise 199 clinical notes in the training set and 133 notes in the test set. The analysis done so far can be divided into two steps: mapping and learning. The first step maps all terms to UMLS, covering not only unigrams but also n-grams (usually n is 5). To get the best exact-matching results, we extracted the UMLS terms for diseases and disorders based on semantic groups and mapped each n-gram to that subset of UMLS. A matching n-gram is assumed to be a disease or disorder. When no n-gram with n >= 2 matches, a unigram is nominated as a disease/disorder only if it is a noun phrase, which avoids low precision. This method achieved an F-score of 60% and generated the training files for the next step (training CRFs). The second step uses Conditional Random Fields (CRFs). The output of the first step was used to train the CRF, which learns from the training data the general contexts in which named entities occur.
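The mapping step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tiny `terms` set stands in for the disease/disorder subset of UMLS, the longest-match-first scan and the B/I/O tag names are assumptions, and a noun part-of-speech tag (`NN*`) is used as a simple stand-in for the noun-phrase check on unigrams.

```python
from typing import List, Set, Tuple

MAX_N = 5  # the abstract says n is usually 5


def exact_match_tags(
    tokens: List[str],
    pos_tags: List[str],
    terms: Set[Tuple[str, ...]],
    max_n: int = MAX_N,
) -> List[str]:
    """Tag tokens with B/I/O by exact dictionary lookup, longest n-gram first."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try n-grams with n >= 2 first, from longest to shortest.
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            if tuple(lowered[i:i + n]) in terms:
                matched = n
                break
        # Fall back to a unigram only if it is tagged as a noun (assumption:
        # a noun POS tag approximates the abstract's noun-phrase condition).
        if matched == 0 and (lowered[i],) in terms and pos_tags[i].startswith("NN"):
            matched = 1
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


# Hypothetical toy dictionary and sentence, for illustration only.
terms = {("congestive", "heart", "failure"), ("heart", "failure"), ("cold",)}
tokens = ["Patient", "has", "congestive", "heart", "failure", "and", "cold", "feet"]
pos = ["NN", "VBZ", "JJ", "NN", "NN", "CC", "JJ", "NNS"]
print(exact_match_tags(tokens, pos, terms))
# ['O', 'O', 'B', 'I', 'I', 'O', 'O', 'O']
```

In this toy run the adjective "cold" is in the dictionary but is not tagged, which is the kind of false positive the unigram restriction is meant to suppress.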
Because the training files have different levels of correctness, we decided to modify them before using them to train CRFs and to test on the test data. Level of correctness refers to the differing tagging accuracy across the data set: because exact matching is not very accurate, accuracy varies from note to note, very high in some and low in others. This makes the training files inconsistent. To solve this problem we divided the training files into ten groups. The CRF was trained on only one group and used to tag the other groups, and the results of exact matching and the CRF were combined (a logical OR of the CRF and exact-match results) to get the final results. This was done for all other groups as well, and the procedure was finally applied to the test data. These two steps together constitute unsupervised disease named entity recognition, and the results show a difference of 10.3 percentage points between the unsupervised and supervised approaches: supervised learning gave an F-score of 73%, while the proposed unsupervised approach gave 62.7%. We also developed a semi-supervised disease named entity recognition approach that uses both the annotated files generated by the unsupervised method and human-annotated (gold-standard) files. This method improved the 73% F-score of the supervised approach to 74.2%. Several refinements and additional tasks are planned. To improve the results, we plan to use approximate matching through a process called normalization, which maps a term in clinical notes to a preferred term in UMLS. Such terms do not have exact matches, so normalization is the way to find matches for them. We also plan to do exact/approximate matching over discontinuous mentions in clinical texts, that is, mentions made up of disconnected words in a sentence that together form a named entity.
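The grouping and OR-combination described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: labels are simplified to one boolean per token, `train_and_tag` is a hypothetical callable standing in for CRF training and tagging, notes are dealt into groups round-robin, and only a single round (train on one group, tag the rest) is shown because the abstract does not say how the rounds are merged.

```python
from typing import Callable, Dict, List, Sequence

Labels = List[bool]  # one flag per token; True = token is part of a disease mention
# Hypothetical CRF wrapper: (training note ids, target note ids) -> labels per target note.
Tagger = Callable[[List[str], List[str]], Dict[str, Labels]]


def split_into_groups(note_ids: Sequence[str], k: int = 10) -> List[List[str]]:
    """Deal note ids round-robin into k groups (the abstract uses ten)."""
    return [list(note_ids[i::k]) for i in range(k)]


def combine_or(crf: Labels, exact: Labels) -> Labels:
    """Mark a token as a mention if either the CRF or exact matching marked it."""
    if len(crf) != len(exact):
        raise ValueError("label sequences must have the same length")
    return [a or b for a, b in zip(crf, exact)]


def one_round(
    groups: List[List[str]],
    train_idx: int,
    exact: Dict[str, Labels],
    train_and_tag: Tagger,
) -> Dict[str, Labels]:
    """Train on one group, tag every other group, and OR with exact-match labels."""
    train_ids = groups[train_idx]
    target_ids = [n for g, grp in enumerate(groups) if g != train_idx for n in grp]
    predicted = train_and_tag(train_ids, target_ids)
    return {n: combine_or(predicted[n], exact[n]) for n in target_ids}
```

Because the combination is a logical OR, it can only add mentions to what exact matching found, so it trades some precision for recall.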
This essential step will extract mentions that could not be extracted by the exact-match and normalization approaches. The last item in our plan is to expand the system into a less supervised "Biomedical Named Entity Recognition (BNER)" system that extracts all biomedical and clinical terms, by covering other semantic groups in UMLS such as Activities and Behaviors, Anatomy, Devices, Phenomena, etc. A less supervised annotation system for clinical notes could thus produce annotated notes at a lower manual tagging cost that are more consistent and accurate enough. With this approach it is feasible to extract tags for other semantic groups in UMLS, and ultimately it could become an advanced system that tags all biomedical and clinical mentions based on the semantic groups in UMLS.
Starting Page	495
Ending Page	495
File Size	93950
Page Count	1
File Format	PDF
e-ISBN	9781467395489
DOI	10.1109/ICHI.2015.85
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2015-10-21
Publisher Place	USA
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Exact matching; Unified modeling language; Manuals; Conditional random fields; Unsupervised learning; Diseases; Training; Semantics; Supervised learning; Named entity recognition; Machine learning; Tagging; UMLS; Natural language processing
Content Type	Text
Resource Type	Article

Similar Documents

Named Entity Recognition for Malayalam language: A CRF based approach

Signature automation of UMLS concepts: An un-supervised named entity recognition framework for classification of DNA and RNA in biological text

Named Entity Recognition Using Conditional Random Fields

Active learning technique for biomedical named entity extraction

Thai named entity recognition based on conditional random fields

Personal name and location name recognition based on conditional random fields

Hierarchical Conditional Random Fields (HCRF) for Chinese Named Entity Tagging

Studying the impact of various features on the performance of Conditional Random Field-based Arabic Named Entity Recognition

AMRITA_CEN@FIRE-2014: Named Entity Recognition for Indian Languages using Rich Features

Biomedical Named Entity Recognition with less Supervision