Computer-mediated communication (CMC) comprises public and private online communication, such as posts on blogs and forums, comments on online news sites, social media and networking sites such as Twitter and Facebook, mobile phone applications such as WhatsApp, e-mail and chat rooms. Because corpora that compile computer-mediated communication often include very informal styles of writing, they are interesting for a wide range of research fields, such as language variation, pragmatics, and media and communication studies. They are also very important for the development of robust tools that can deal with non-standard spelling, vocabulary and grammar. Compilation and dissemination of such corpora are hindered by the unclear legal status of CMC data when distributed as a resource to the scientific community, which is further exacerbated by the rapidly changing terms of service of content providers. The CLARIN infrastructure offers 17 CMC corpora. Most are for Slovenian, but corpora are also available for Czech, Dutch, Estonian, Finnish, French, German, Italian, and Lithuanian. Most of the corpora are richly tagged and available under public licences. We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated. For comments, changes to the existing content or inclusion of new corpora, send us an email.