eCare Web Corpus: Companion Website

Go back to the eCare@home project website

Last Updated: 08 June 2018

This website presents and documents the progress of the eCare web corpus (eCareCorpus). From this page you can download the corpus, the datasets, the seed terms used to bootstrap the corpus, the gold standard used to evaluate domainhood, some of the R scripts, and the outputs of the machine learning models used for the lay-specialized text classification task.


The project and the corpus are described in the following papers:

Corpus Statistics

Medical Terms a.k.a. "seeds"

A "seed" is a medical term used as a keyword in a search engine query to retrieve documents about a disease. For the eCare web corpus, only medical terms denoting chronic diseases were used as seeds.

           # initial   # retrieved   # bootcatted   Web docs per retrieved seed
           seeds       seeds         web docs       Mean     Median   SDev
Unigrams   13          13            112            8.61     9        3.57
Bigrams    215         142           689            4.85     4        3.16
Total      228         155           801            5.16     5        3.35
Table 1. Seeds: Statistics
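The per-seed figures in Table 1 (mean, median and standard deviation of web documents per retrieved seed) can be recomputed from a simple seed-to-document-count mapping. The following is a minimal Python sketch, not the project's own R code: the function name, the input layout and the example seeds are hypothetical, and the use of the sample standard deviation (n - 1) is an assumption, since the page does not say which variant was used.

```python
from statistics import mean, median, stdev


def summarize_docs_per_seed(doc_counts):
    """Summarize how many web documents were retrieved per seed.

    doc_counts maps a seed term to the number of documents downloaded
    for it. Seeds that retrieved nothing are assumed to be absent.
    """
    values = list(doc_counts.values())
    return {
        "retrieved_seeds": len(values),
        "web_docs": sum(values),
        "mean": mean(values),
        "median": median(values),
        # Sample standard deviation (assumption); needs at least two seeds.
        "sdev": stdev(values),
    }


# Hypothetical counts for three seeds, for illustration only.
example = {"seed_a": 9, "seed_b": 12, "seed_c": 6}
print(summarize_docs_per_seed(example))
```

Applied to the real per-seed counts, the "web_docs" and "retrieved_seeds" values should match the third and second columns of Table 1.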

Wordcounts per Subcorpus

The eCare web corpus contains 155 subcorpora, one per retrieved seed. Each subcorpus contains the documents that were retrieved and downloaded for that seed.

Seed subcorpora   Total wordcount   Mean       Median   SD
Unigram (13)      91,118            7009.07    7199     3770.957
Bigram (142)      618,491           4355.57    3401     4072.31
Total (155)       709,609           4578.123   3797     4103.22
Table 2. Wordcounts per Subcorpus: Statistics

Wordcounts at Web Document Level

Web documents by seed type   Total wordcount   Mean       Median   SD
Unigrams (13)                89,921            802.8661   555.5    3770.957
Bigrams (142)                618,491           4355.57    3401     982.3716
Total (155)                  610,669           886.312    582      1639.207
Table 3. Wordcount per Web Document: Statistics
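Tables 2 and 3 aggregate the same wordcounts at two levels: per seed subcorpus and per individual web document. A minimal Python sketch of that two-level aggregation follows. It is an illustration under stated assumptions, not the project's R scripts: words are counted by whitespace splitting, the standard deviation is the sample variant (n - 1), and the function names and the in-memory input layout (seed mapped to a list of document texts) are hypothetical.

```python
from statistics import mean, median, stdev


def count_words(text):
    # Assumption: a "word" is any whitespace-delimited token.
    return len(text.split())


def wordcount_stats(counts):
    """Total, mean, median and sample SD of a list of wordcounts."""
    return {
        "total": sum(counts),
        "mean": mean(counts),
        "median": median(counts),
        "sd": stdev(counts),  # needs at least two values
    }


def corpus_wordcounts(subcorpora):
    """Return (per-subcorpus stats, per-document stats).

    subcorpora maps each seed to the list of document texts
    retrieved for that seed.
    """
    per_document = [
        count_words(doc) for docs in subcorpora.values() for doc in docs
    ]
    per_subcorpus = [
        sum(count_words(doc) for doc in docs) for docs in subcorpora.values()
    ]
    return wordcount_stats(per_subcorpus), wordcount_stats(per_document)


# Hypothetical toy corpus with two seeds, for illustration only.
toy = {
    "seed_a": ["one two", "three four five six"],
    "seed_b": ["seven eight"],
}
subcorpus_level, document_level = corpus_wordcounts(toy)
print(subcorpus_level)
print(document_level)
```

Both levels sum to the same total wordcount; only the unit over which the mean, median and SD are taken differs.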

Available for Download

Attention: The eCare web corpus (namely eCare_Sv_01) is distributed under the following disclaimer: "Copyright is held by the author/owner(s) of the web documents included in the corpus. The documents in the corpus can be used for research purposes ONLY. We are ready to delete any documents in the corpus upon the author/owner(s)' request."

Corpus      Raw (txt) & annotated by sublanguage (xml)
Datasets    ARFF files (zip)
Outputs     Models (zip) & statistical tests (zip)
R Scripts   Cleaning the corpus, conversion of the corpus into a dataset, etc. (zip)
Table 4. Corpus, datasets, outputs and R scripts

Contact: Marina Santini (marinasantini dot ms a t g-m_a[i](l) dot c...

LinkedIn Profile
