Note: This dataset was last updated at 4:50 pm on Thursday, April 9, 2020.

We will keep updating this dataset as the CORD-19 dataset is incrementally updated and as our system improves.

On March 16th, 2020, the White House released a Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset. Our Data Mining Group in CS@UIUC has created a comprehensive named entity annotation dataset, CORD-NER, on the COVID-19 Open Research Dataset Challenge (CORD-19) corpus (2020-03-13).


The CORD-NER annotation is a combination of 4 sources:

  1. Pretrained NER on 18 general entity types: spaCy.
  2. Pretrained NER on 18 biomedical entity types: scispaCy.
  3. Knowledge base (KB)-guided NER on 127 biomedical entity types: our distantly supervised NER method, which requires no human-annotated training data. We use UMLS as the input KB for distant supervision.
  4. Seed-guided NER on 9 new entity types related to COVID-19 studies: our weakly supervised NER method, which requires only a few human-provided seeds for each new type.
    • Coronavirus: COVID-19, SARS-COV, MERS-COV, etc.
    • Viral Protein: Hemagglutinin, GP120, etc.
    • Livestock: cattle, sheep, pig, etc.
    • Wildlife: bats, pangolins, African green monkey, etc.
    • Evolution: genetic drift, natural selection, mutation rate, etc.
    • Physical Science: atomic charge, Amber force fields, Van der Waals force, etc.
    • Substrate: blood, sputum, urine, etc.
    • Material: copper, stainless steel, plastic, etc.
    • Immune Response: adaptive immune response, cell mediated immunity, innate immunity, etc.
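
To make the notion of seeds concrete, the sketch below matches a few of the seed terms listed above against raw text using plain dictionary lookup. This is a deliberately simple stand-in and not our weakly supervised NER method, which generalizes beyond the seeds themselves. The `SEEDS` table and the `match_seeds` helper are hypothetical names introduced only for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical seed table: a few of the seed examples listed above.
SEEDS: Dict[str, List[str]] = {
    "Coronavirus": ["COVID-19", "SARS-COV", "MERS-COV"],
    "Livestock": ["cattle", "sheep", "pig"],
    "Substrate": ["blood", "sputum", "urine"],
}

Match = Tuple[int, int, str, str]  # (start, end, surface text, entity type)


def match_seeds(text: str, seeds: Dict[str, List[str]] = SEEDS) -> List[Match]:
    """Return case-insensitive whole-word matches of seed terms in `text`."""
    # Map each lowercased seed to its type, and try longer seeds first so a
    # longer term is preferred over a shorter one starting at the same position.
    seed_to_type = {s.lower(): t for t, terms in seeds.items() for s in terms}
    ordered = sorted(seed_to_type, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # (?<!\w) and (?!\w) behave like word boundaries but also work for
    # seeds that contain hyphens or digits.
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)
    return [
        (m.start(), m.end(), m.group(0), seed_to_type[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Exact matching like this misses inflected forms (for example, "pigs" does not match the seed "pig") and unseen names, which is the gap a learned seed-guided model is meant to close.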

We reorganized all the entity types from the 4 sources and merged them into a single entity type hierarchy with 75 fine-grained entity types for annotation. The entity type hierarchy used in CORD-NER (CORD-NER-types.xlsx) is included in our dataset.
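
As a rough illustration of what merging labels from several sources into one hierarchy involves, the sketch below normalizes a (source, label) pair to a single unified type through a lookup table. Every entry in `TYPE_MAP`, the unified type names, and the `unify_type` helper are placeholders made up for this example; the actual hierarchy is the one in CORD-NER-types.xlsx.

```python
from typing import Dict, Optional, Tuple

# Hypothetical excerpt of a (source, source label) -> unified type table.
# Every label below is a placeholder; the actual hierarchy is the one in
# CORD-NER-types.xlsx.
TYPE_MAP: Dict[Tuple[str, str], str] = {
    ("spacy", "GPE"): "GPE",
    ("scispacy", "DISEASE"): "DISEASE_OR_SYNDROME",
    ("umls", "Disease or Syndrome"): "DISEASE_OR_SYNDROME",
    ("seed", "Coronavirus"): "CORONAVIRUS",
}


def unify_type(source: str, label: str) -> Optional[str]:
    """Map a source-specific label to the merged hierarchy, or None if unmapped."""
    return TYPE_MAP.get((source.lower(), label))
```

Returning `None` for unmapped labels keeps the gaps visible, so a label that has not yet been placed in the hierarchy is not silently assigned to a wrong type.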

CORD-NER: Dataset Download

The CORD-NER dataset (CORD-NER-full.json) can be downloaded here. It combines all the related information (metadata, full-text corpus, and NER results) in one file for users’ convenience. The dataset is about 1.2GB. The input corpus is generated from the 29,500 documents in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus (2020-03-13). A detailed description of the file schemas can be found in the README file in our dataset.
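
Because the file is large, it may be more practical to stream it than to load it whole. The sketch below assumes the file holds one JSON object per line and that each document carries an `entities` list whose items have a `type` field. Both the layout and the field names are assumptions, and the README in the dataset remains the authority on the schema.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed document per non-empty line (JSON Lines assumption)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_entity_types(docs: Iterable[dict]) -> Counter:
    """Count annotated entities per type across all documents."""
    counts: Counter = Counter()
    for doc in docs:
        # "entities" and "type" are assumed field names; check the README.
        for entity in doc.get("entities", []):
            counts[entity["type"]] += 1
    return counts


# Usage (file name as given above):
# with open("CORD-NER-full.json", encoding="utf-8") as handle:
#     print(count_entity_types(iter_documents(handle)).most_common(10))
```

Passing the open file handle straight to `iter_documents` keeps only one document in memory at a time.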

NER Annotation Results

Below is the NER performance comparison between SciSpacy and our annotation results on the COVID-19 corpus.


Below is an example of our annotation results on the CORD-19 corpus.

annotation example

Our NER methods are domain-independent and can be applied to corpora in different domains. Below is an example of our annotation on the New York Times corpus.

annotation example

Below are some annotation comparisons with existing supervised NER methods.

  • Example 1:

comparison 1

  • Example 2:

comparison 2

  • Example 3:

comparison 3

Top-Frequent Entity Summarization

Below are some examples of the most frequent entities for each type.

New entity types:

  • Coronavirus: sars, cov, mers, covid-19, sars-cov-2
  • Evolution: mutation, phylogenetic, evolution, recombination, substitutions
  • Wildlife: bat, wild birds, wild animals, fruit bats, pteropus
  • Physical Science: positively charged, negatively charged, force field, highly hydrophobic, van der waals interactions
  • Livestock: pigs, poultry, calves, chicken, pig
  • Material: air, plastic, fluids, copper, silica
  • Substrate: blood, urine, sputum, saliva, fecal
  • Immune Response: immunization, immunity, immune cells, innate immune, inflammatory response

UMLS types that have not been annotated before:

  • collaboration, sharing, herd, mediating, adoption
  • hand hygiene, disclosures, absenteeism, compliance, empathy
  • detection, vaccination, isolation, stimulation, inoculation
  • rt-pcr, sequencing, screening, diagnosis, prevention
  • health education, workshops, nursery, medical education, residency
  • machine learning, data processing, automation, deconvolution, telecommunication
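
A per-type frequency summary like the one above can be approximated with a simple counter. The sketch below is one possible way to compute it and is not necessarily how the lists above were produced. It assumes the annotations are available as (surface text, entity type) pairs and that surface forms are lowercased before counting; `top_entities` is a hypothetical helper name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def top_entities(
    entities: Iterable[Tuple[str, str]], k: int = 5
) -> Dict[str, List[Tuple[str, int]]]:
    """Return the k most frequent lowercased surface forms for each type.

    `entities` yields (surface text, entity type) pairs.
    """
    per_type: Dict[str, Counter] = defaultdict(Counter)
    for text, entity_type in entities:
        per_type[entity_type][text.lower()] += 1
    return {t: counter.most_common(k) for t, counter in per_type.items()}
```

Counting exact lowercased strings treats "pig" and "pigs" as different entries; merging such variants would need an extra normalization step such as lemmatization.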


Our team for creating this CORD-NER dataset:


If you find our CORD-NER dataset useful, please cite our paper. Thanks!