Go back to the E-care@home project website
On this website we present and document the progress of the eCareCorpus from 2017 to 2019. The creation of this corpus has been supported by the E-care@home project funded by the Swedish Knowledge Foundation.
From this page you can download the corpus, the datasets, the term seeds to bootstrap the corpus, the gold standard we used for the evaluation of domainhood, some of the R scripts and the outputs of the machine learning models that we have used for the lay-specialized text classification task.
The project and the corpus are described in the following papers (reverse chronological order):
Santini M., Strandqvist W. and Jönsson, A. (2019). Profiling specialized web corpus qualities: A progress report on "Domainhood". Argentinian Journal of Applied Linguistics, 7(1) (journal article, AJAL Journal)
Corpora used in Experiment 1
Corpora used in Experiment 2
Santini, M., Jönsson, A., Strandqvist, W., Cederblad, G., Nyström, M., Alirezaie, M., Lind, L., Blomqvist, E., Lindén, M. and Kristoffersson, A. (2019). Designing an Extensible Domain-Specific Web Corpus for “Layfication”: A Case Study in eCare at Home. In Cyber-Physical Systems for Social Applications (pp. 98-155). IGI Global. (chapter, book).
Cederblad G. (2018). Finding Synonyms in Medical Texts – Creating a system for automatic synonym extraction from medical texts. Bachelor thesis in Cognitive Science. Linköping University, Department of Computer Science.
Santini M., Strandqvist W. and Jönsson A. (2018). Profiling Domain Specificity of Specialized Web Corpora using Burstiness. Explorations and Open Issues. SLTC2018 - Swedish Language Technology Conference 2018, 7-9 November 2018, Stockholm, Sweden. (paper, poster).
Santini M., Strandqvist W., Nyström M., Alirezaie M. and Jönsson A. (2018). Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora. In International Conference on Database and Expert Systems Applications (pp. 207-217). Springer, Cham. (paper, presentation).
Strandqvist W., Santini M., Lind L. and Jönsson A. (2018). Towards a Quality Assessment of Web Corpora for Language Technology Applications. In: Read T., Montaner S. and Sedano B. (2018). Technological Innovation for Specialized Linguistic Domains: Languages for Digital Lives and Cultures. Proceedings of TISLID’18. Éditions universitaires européennes.
Santini M., Jönsson A., Nyström M. and Alirezaie M. (2017). A Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague.
A "seed" is a medical term that has been used as a keyword in the search engine to retrieve documents about a disease. For the eCare web corpus, we have used only medical terms representing chronic diseases.
|# initial seeds||# retrieved seeds||# bootcatted web doc.||# web doc. per retrieved seeds: Mean||# web doc. per retrieved seeds: Median||# web doc. per retrieved seeds: SDev|
The eCare web corpus contains 155 subcorpora. Each subcorpus includes the documents that have been retrieved and downloaded for each seed.
|Seed type (number of subcorpora)||Wordcount: Total||Wordcount per seed subcorpus: Mean||Wordcount per seed subcorpus: Median||Wordcount per seed subcorpus: SD|
|Wordcount per Unigram Seed Subcorpus (13)||91 118||7009.07||7199||3770.957|
|Wordcount per Bigram Seed Subcorpus (142)||618 491||4355.57||3401||4072.31|
|Wordcount per Total Seed Subcorpus (155)||709 609||4578.123||3797||4103.22|
|Seed type (number of seeds)||Wordcount: Total||Wordcount per web document: Mean||Wordcount per web document: Median||Wordcount per web document: SD|
|Wordcount per Web Doc (Unigrams) (13)||89 921||802.8661||555.5||3770.957|
|Wordcount per Web Doc (Bigrams) (142)||618 491||4355.57||3401||982.3716|
|Wordcount per Web Doc (Total) (155)||610 669||886.312||582||1639.207|
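The summary figures in the two tables above (total, mean, median and standard deviation of wordcounts, per seed subcorpus and per web document) can be illustrated with a minimal Python sketch. This is not one of the R scripts distributed on this page. It assumes whitespace tokenization and the sample standard deviation (n - 1 denominator, the same convention as R's `sd()`); the function names, the in-memory layout (a dictionary mapping each seed to the list of its document texts) and the toy seeds are all hypothetical.

```python
from statistics import mean, median, stdev


def wordcount(text: str) -> int:
    # Assumption: words are whitespace-separated tokens.
    return len(text.split())


def summarize(counts):
    # Total, mean, median and sample standard deviation of a list of counts.
    return {
        "n": len(counts),
        "total": sum(counts),
        "mean": mean(counts),
        "median": median(counts),
        # stdev() needs at least two values; report 0.0 otherwise.
        "sd": stdev(counts) if len(counts) > 1 else 0.0,
    }


def corpus_stats(subcorpora):
    # subcorpora: dict mapping a seed to the list of its document texts
    # (hypothetical layout, one entry per seed subcorpus).
    per_seed = [sum(wordcount(doc) for doc in docs) for docs in subcorpora.values()]
    per_doc = [wordcount(doc) for docs in subcorpora.values() for doc in docs]
    return summarize(per_seed), summarize(per_doc)


if __name__ == "__main__":
    # Toy data with made-up seeds and placeholder texts, for illustration only.
    toy = {
        "astma": ["ett två tre", "ett två tre fyra fem"],
        "kronisk bronkit": ["ett två"],
    }
    seed_stats, doc_stats = corpus_stats(toy)
    print("per seed subcorpus:", seed_stats)
    print("per web document:", doc_stats)
```

Because the tokenization and the standard deviation convention are assumptions, figures computed this way on the downloadable corpus may differ slightly from the ones reported in the tables.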
Attention! The eCare web corpus (namely eCare_Sv_01) is distributed under the following disclaimer: "Copyright is held by the author/owner(s) of the web documents included in the corpus. The documents in the corpus can be used for research purposes ONLY. We are ready to delete any documents in the corpus upon the author/owner(s)' request."
|Raw (txt) & Annotated by sublanguage (xml)||arff files (zip)||Models (zip) & Statistical Tests (zip)||Cleaning the corpus, conversion of the corpus into a dataset, etc. (zip)|
Contact: Marina Santini (firstname.lastname@example.org)