eCare Web Corpus: Companion Website

Go back to the E-care@home project website

Last Updated: 03 Feb 2020

On this website we present and document the progress of the eCareCorpus from 2017 to 2019. The creation of this corpus has been supported by the E-care@home project, funded by the Swedish Knowledge Foundation.

From this page you can download the corpus, the datasets, the term seeds used to bootstrap the corpus, the gold standard used for the evaluation of domainhood, some of the R scripts, and the outputs of the machine learning models used for the lay-specialized text classification task.

Publications based on eCareCorpus and corpus downloads (the eCare corpus is an evolving corpus; check the different versions available)

The project and the corpus are described in the following papers (reverse chronological order):

2019

Santini M., Strandqvist W. and Jönsson, A. (2019). Profiling specialized web corpus qualities: A progress report on "Domainhood". Argentinian Journal of Applied Linguistics, 7(1) (journal article, AJAL Journal)

Corpora used in Experiment 1

  • (Eng) ukWaCsample (872 565 words): a random subset of ukWaC, a general-purpose web corpus (txt).
  • (Eng) eCare_En_02: Gold (544 677 words): a domain-specific web corpus collected with hand-picked term seeds from the E-Care personas and use cases/scenarios (rar, txt).
  • (Eng) eCare_En_02: Auto (492 479 words): a domain-specific web corpus collected with automatically extracted term seeds from the E-Care personas and use cases/scenarios (rar, txt).

Corpora used in Experiment 2

  • (Sv) eCare_ch_sv_01 (to be flattened) (zip).
  • (Sv) eCare uc_sv_02 (flattened) (txt).

Santini, M., Jönsson, A., Strandqvist, W., Cederblad, G., Nyström, M., Alirezaie, M., Lind, L., Blomqvist, E., Lindén, M. and Kristoffersson, A. (2019). Designing an Extensible Domain-Specific Web Corpus for “Layfication”: A Case Study in eCare at Home. In Cyber-Physical Systems for Social Applications (pp. 98-155). IGI Global. (chapter, book).

  • (Sv) The datasets (arff) used in Experiments 1, 2 and 3 are available here: zip.
  • (Sv) eCare_Sv_01+ (flattened) used to create the "nitty-gritty" distributional thesaurus is here: txt.
  • (Eng) ukWaCsample (872 565 words): a random subset of ukWaC, a general-purpose web corpus (txt).
  • (Eng) eCare_En_02: Gold (544 677 words): a domain-specific web corpus collected with hand-picked term seeds from the E-Care personas and use cases/scenarios (rar, txt).
  • (Eng) eCare_En_02: Auto (492 479 words): a domain-specific web corpus collected with automatically extracted term seeds from the E-Care personas and use cases/scenarios (rar, txt).

2018

Cederblad G. (2018). Finding Synonyms in Medical Texts – Creating a system for automatic synonym extraction from medical texts. Bachelor thesis in Cognitive Science, 2018. Linköping University, Department of Computer Science (thesis).
  • (Sv) eCare_ch_sv_01 expanded with 15 terms (flattened) (txt).
Santini M., Strandqvist W. and Jönsson A. (2018). Profiling Domain Specificity of Specialized Web Corpora using Burstiness. Explorations and Open Issues. SLTC2018 - Swedish Language Technology Conference 2018, 7-9 November 2018, Stockholm, Sweden. (paper, poster).
  • (Sv) eCare_ch_sv_01 (to be flattened) (zip).
  • (Sv) eCare uc_sv_02 (flattened) (txt).
Santini M., Strandqvist W., Nyström M., Alirezaie M. and Jönsson A. (2018). Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora. In International Conference on Database and Expert Systems Applications (pp. 207-217). Springer, Cham. (paper, presentation).
  • (Sv) SUC (flattened) (txt).
  • (Sv) eCare_Sv_01 (to be flattened) (zip).
Strandqvist W., Santini M., Lind L. and Jönsson A. (2018). Towards a Quality Assessment of Web Corpora for Language Technology Applications. In: Read T., Montaner S. and Sedano B. (2018). Technological Innovation for Specialized Linguistic Domains: Languages for Digital Lives and Cultures. Proceedings of TISLID’18. Éditions universitaires européennes. (abstract, paper, presentation).
  • (Eng) ukWaCsample (872 565 words): a random subset of ukWaC, a general-purpose web corpus (txt).
  • (Eng) Gold (544 677 words): a domain-specific web corpus collected with hand-picked term seeds from the E-Care personas and use cases/scenarios (rar, txt).
  • (Eng) Auto (492 479 words): a domain-specific web corpus collected with automatically extracted term seeds from the E-Care personas and use cases/scenarios (rar, txt).

2017

Santini M., Jönsson A., Nyström M. and Alirezaie M. (2017). A Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague. (paper, presentation).
  • (Sv) Download the raw corpus. This is the corpus that was created with BootCaT. This version of the corpus is in text format: zip.
  • (Sv) Download the annotated corpus. This version of the corpus is annotated for sublanguage (lay and/or specialized). The texts have been labelled by two annotators, a lay annotator and an expert annotator, so each text has two labels. This version of the corpus is in xml format: zip.
  • (Sv) Download the datasets. Several datasets have been extracted from the xml version of the corpus. The datasets are in arff (csv) format and have been used with the Weka package: zip.

Corpus Statistics related to Santini et al. (2017)

Medical Terms a.k.a. "seeds"

A "seed" is a medical term that has been used as a keyword in the search engine to retrieve documents about a disease. For the eCare web corpus, we have used only medical terms representing chronic diseases.

| Seed type | # initial seeds | # retrieved seeds | # bootcatted web doc. | # web doc. per retrieved seed: Mean | # web doc. per retrieved seed: Median | # web doc. per retrieved seed: SDev |
|---|---|---|---|---|---|---|
| Unigrams | 13 | 13 | 112 | 8.61 | 9 | 3.57 |
| Bigrams | 215 | 142 | 689 | 4.85 | 4 | 3.16 |
| Total | 228 | 155 | 801 | 5.16 | 5 | 3.35 |
Table 1. Seeds: Statistics
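
The Mean column in Table 1 can be recovered by dividing the number of downloaded web documents by the number of retrieved seeds. The following is a minimal Python sketch of that arithmetic, using only the counts reported in Table 1. The names `TABLE_1` and `docs_per_seed` are illustrative and do not come from the project's R scripts.

```python
# Counts copied from Table 1: (retrieved seeds, bootcatted web documents).
TABLE_1 = {
    "Unigrams": (13, 112),
    "Bigrams": (142, 689),
    "Total": (155, 801),
}


def docs_per_seed(retrieved_seeds: int, web_docs: int) -> float:
    """Mean number of web documents per retrieved seed."""
    return web_docs / retrieved_seeds


if __name__ == "__main__":
    for label, (seeds, docs) in TABLE_1.items():
        print(f"{label}: {docs_per_seed(seeds, docs):.3f} documents per seed")
```

The recomputed values agree with the reported means (8.61, 4.85 and 5.16) to within 0.01. The median and standard deviation cannot be recomputed this way, because they depend on the per-seed document counts, which are not listed on this page.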

Wordcounts per Subcorpus

The eCare web corpus contains 155 subcorpora. Each subcorpus includes the documents that have been retrieved and downloaded for each seed.

| Subcorpus type (number of subcorpora) | Wordcount per seed subcorpus | Mean | Median | SD |
|---|---|---|---|---|
| Unigram seed subcorpora (13) | 91 118 | 7009.07 | 7199 | 3770.957 |
| Bigram seed subcorpora (142) | 618 491 | 4355.57 | 3401 | 4072.31 |
| All seed subcorpora (155) | 709 609 | 4578.123 | 3797 | 4103.22 |
Table 2. Wordcounts per Subcorpus: Statistics
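
Per-subcorpus statistics of this kind could be computed along the following lines. This is a sketch under stated assumptions, not the procedure actually used for Table 2: it assumes a hypothetical layout with one folder per seed containing plain-text documents, it counts words as whitespace-separated tokens, and it uses the sample standard deviation (as R's `sd()` does). The function names are illustrative.

```python
import statistics
from pathlib import Path


def subcorpus_wordcounts(corpus_dir: Path) -> dict[str, int]:
    """Sum whitespace-separated tokens over all .txt files in each seed folder.

    Assumes (hypothetically) one sub-folder per seed inside corpus_dir.
    """
    counts = {}
    for seed_dir in sorted(p for p in corpus_dir.iterdir() if p.is_dir()):
        total = 0
        for doc in seed_dir.glob("*.txt"):
            text = doc.read_text(encoding="utf-8", errors="replace")
            total += len(text.split())
        counts[seed_dir.name] = total
    return counts


def summarize(wordcounts: list[int]) -> dict[str, float]:
    """Total, mean, median and sample standard deviation of the counts.

    statistics.stdev needs at least two values.
    """
    return {
        "total": sum(wordcounts),
        "mean": statistics.mean(wordcounts),
        "median": statistics.median(wordcounts),
        "sd": statistics.stdev(wordcounts),
    }
```

Calling `summarize(list(subcorpus_wordcounts(Path("some_corpus_dir")).values()))` would return one row of figures in the shape of Table 2. A different tokenizer or a population standard deviation would give different numbers.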

Wordcounts at Web Document Level

| Web documents by seed type (number of seeds) | Wordcount per web document | Mean | Median | SD |
|---|---|---|---|---|
| Web documents, unigram seeds (13) | 89 921 | 802.8661 | 555.5 | 3770.957 |
| Web documents, bigram seeds (142) | 618 491 | 4355.57 | 3401 | 982.3716 |
| Web documents, all seeds (155) | 610 669 | 886.312 | 582 | 1639.207 |
Table 3. Wordcount per Web Document: Statistics

Available for Download (related to Santini et al., 2017)

Attention: The eCare web corpus (namely eCare_Sv_01) is distributed under the following disclaimer: "Copyright is held by the author/owner(s) of the web documents included in the corpus. The documents in the corpus can be used for research purposes ONLY. We are ready to delete any documents in the corpus upon the author/owner(s)' request."

| Corpus | Datasets | Outputs | R Scripts |
|---|---|---|---|
| Raw (txt) & Annotated by sublanguage (xml) | arff files (zip) | Models (zip) & Statistical Tests (zip) | Cleaning the corpus, conversion of the corpus into a dataset, etc. (zip) |
Table 4. Corpus, datasets, outputs and R scripts

Contact: Marina Santini (marinasantini.ms@gmail.com)

LinkedIn Profile
