Leitner, F. (Florian)

  • The CHEMDNER corpus of chemicals and drugs and its annotation principles
    (Chemistry Central, 2015) Usié, A. (Anabel); Alves, R. (Rui); Choi, M. (Miji); Zitnik, S. (Slavko); Tzong-Han-Tsai, R. (Richard); Lu, Y. (Yanan); Couto, F.M. (Francisco M.); Krallinger, M. (Martin); Matos, S. (Sérgio); Vazquez, M. (Miguel); Valencia, A. (Alfonso); An, X. (Xin); Munkhdalai, T. (Tsendsuren); Ji, D. (Donghong); Lu, Z. (Zhiyong); Rak, R. (Rafal); Yoshioka, M. (Masaharu); Ata, C. (Caglar); Liu, H. (Hongfang); Sayle, R.A. (Roger A.); Khabsa, M. (Madian); Akhondi, S.A. (Saber A.); Bajec, M. (Marko); Verspoor, K. (Karin); Tang, B. (Buzhou); Lowe, D.M. (Daniel M.); Oyarzabal, J. (Julen); Ravikumar, K.E. (Komandur Elayavilli); Segura-Bedmar, I. (Isabel); Ryu, K.H. (Keun Ho); Batista-Navarro, R.T. (Riza Theresa); Xu, H. (Hua); Dieb, T.M. (Thaer M.); Lamurias, A. (Andre); Dai, H.J (Hong-Jie); Weber, L. (Lutz); Rocktäschel, T. (Tim); Ramanan, S.V. (S.V.); Irmer, M. (Matthias); Rabal, O. (Obdulia); Salgado, D. (David); Martínez, P. (Paloma); Can, T. (Tolga); Sikdar, U.K. (Utpal Kumar); Ekbal, A. (Asif); Huber, T. (Torsten); Kors, J.A. (Jan A.); Giles, C.L. (C. Lee); Xu, S. (Shuo); Leitner, F. (Florian); Nathan, S. (Senthil); Campos, D. (David); Leaman, R. (Robert)
    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
  • CHEMDNER: The drugs and chemical names extraction challenge
    (Chemistry Central, 2015) Krallinger, M. (Martin); Vazquez, M. (Miguel); Valencia, A. (Alfonso); Oyarzabal, J. (Julen); Rabal, O. (Obdulia); Leitner, F. (Florian)
    Natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining) are key to improving access to, and integration of, information from unstructured data such as patents or the scientific literature. Therefore, the BioCreative organizers posed the CHEMDNER (chemical compound and drug name recognition) community challenge, which promoted the development of novel, competitive and accessible chemical text mining systems. This task allowed a comparative assessment of the performance of various methodologies using a carefully prepared collection of text manually labeled by specially trained chemists as Gold Standard data. We evaluated two important aspects: one covered the indexing of documents with chemicals (chemical document indexing - CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition - CEM task). A total of 27 teams (23 academic and 4 commercial, with 87 researchers overall) returned results for the CHEMDNER tasks: 26 teams for the CEM task and 23 for the CDI task. Top scoring teams obtained an F-score of 87.39% for the CEM task and 88.20% for the CDI task, a very promising result when compared to the agreement between human annotators (91%). The strategies used to detect chemicals included machine learning methods (e.g. conditional random fields) using a variety of features, chemistry and drug lexica, and domain-specific rules. We expect that the tools and resources resulting from this effort will have an impact on future developments of chemical text mining applications and will form the basis for finding related chemical information for the detected entities, such as toxicological or pharmacogenomic properties.
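
To make the F-scores quoted above concrete, the following is a minimal sketch of how a mention-level score of the CEM kind could be computed. It is not the official BioCreative evaluation script. It assumes that a predicted mention counts as correct only when its document identifier and character offsets match a gold mention exactly, and that the F-score is the balanced harmonic mean of precision and recall. The function name `mention_f_score`, the `Mention` tuple layout, and the sample offsets are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical mention layout: (document id, start offset, end offset).
Mention = Tuple[str, int, int]


def mention_f_score(
    gold: Set[Mention], predicted: Set[Mention]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) under exact-match scoring.

    Assumption: a prediction is a true positive only if the same
    (document id, start, end) triple appears in the gold annotations.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up example: three gold mentions, four predictions, two exact matches.
gold = {("doc1", 0, 7), ("doc1", 20, 29), ("doc2", 5, 12)}
predicted = {("doc1", 0, 7), ("doc1", 20, 28), ("doc2", 5, 12), ("doc2", 30, 35)}

precision, recall, f_score = mention_f_score(gold, predicted)
print(f"P={precision:.4f} R={recall:.4f} F={f_score:.4f}")
```

In this made-up example the prediction ("doc1", 20, 28) misses the gold span by one character and is therefore counted as both a false positive and a false negative, which illustrates why exact-boundary scoring is strict.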