The following results are related to COVID-19. Are you interested to view more results? Visit OpenAIRE - Explore.
7 Research products, page 1 of 1

  • COVID-19
  • Research data
  • Research software
  • Other research products
  • 2013-2022
  • Bulgarian

Date (most recent)
  • Restricted Bulgarian
    Authors: 
    Irina Temnikova; Silvia Gargova; Veneta Kireva; Tsvetelina Stefanova;
    Publisher: Zenodo

    This dataset has been created within Project TRACES (more information: https://traces.gate-ai.eu/). The dataset contains 61411 tweet IDs of tweets, written in Bulgarian, with annotations. The dataset can be used for general use or for building lies and disinformation detection applications.

    The tweets were collected via the Twitter API under academic access between 1 Jan 2020 and 28 June 2022, with the following keywords:
      • (Covid OR коронавирус OR Covid19 OR Covid-19 OR Covid_19): without replies and without retweets
      • (Корона OR корона OR Corona OR пандемия OR пандемията OR Spikevax OR SARS-CoV-2 OR бустерна доза): with replies, but without retweets

    Explanations of which fields can be used as markers of lies (or of intentional disinformation) are provided in our forthcoming paper (please follow the updates on our website: https://traces.gate-ai.eu/?page_id=20).

    The dataset contains the following fields:
      • tweet_id: the ID of each tweet
      • sentence_count: the number of sentences in the post
      • words_per_sentence: the number of words in each sentence
      • words_count: the total number of words in each post
      • average_words_per_sentence: the average number of words per sentence in the post
      • clarin_classla_ner: the named entity tags in the post, by Clarin Classla
      • slavic_bert_ner: the social media post, with named entity tags, by Slavic BERT
      • slavic_bert_ner_words: the social media post tokenized by Slavic BERT
      • ner_count_bert: the number of named entities by type, by Slavic BERT
      • ner_count_classla: the number of named entities by type, by Clarin Classla
      • ner_count_all_bert: total number of all named entities in the social media post, by Slavic BERT
      • ner_count_all_classla: total number of all named entities in the social media post, by Clarin Classla
      • NE_in_message_bert: lists of the named entities in the post by type, by Slavic BERT
      • NE_in_message_classla: lists of the named entities in the post by type, by Clarin Classla
      • count_upos_all: number of words, grouped by part-of-speech tag, by Clarin Classla
      • upos_in_message: list of words, grouped by part-of-speech tag, by Clarin Classla
      • type_token_ratio: the number of unique word forms divided by the number of all words in the post
      • content_word_diversity: the number of unique content words divided by the number of all content words
      • passive_voice_count: the number of occurrences of passive voice in the post
      • past_tense_count: the number of occurrences of past tense in the post
      • negative_count: the number of occurrences of negative forms in the post
      • self-ref/pronouns: counts in the post of self-reference pronouns (1st person Sg./Pl.) and other personal pronouns (2nd and 3rd person)
      • еmotiveness: (total number of adjectives + total number of adverbs) / (total number of nouns + total number of verbs) (from Zhou et al, 2004)
      • pausality: total number of punctuation marks / total number of sentences (from Zhou et al, 2004)
      • num_funct_words: total number of function words
      • num_conj: total number of conjunctions
      • redundancy: total number of function words / total number of sentences (from Zhou et al, 2004)
      • volition_words: count in the post of volitional words (will, wish, coerce, impose) in Bulgarian
      • expression_time: occurrence of words from the time expressions list in each sentence, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • expression_spatial: occurrence of words from the spatial expressions list in each sentence, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • expression_negative: occurrence of words from the negative expressions list in each sentence, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • expression_cognitive_operations: occurrence of words from the cognitive operations expressions list in each sentence, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • expression_verbs_detail: occurrence of words from the details expressions list in each sentence, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • sense_expressions: occurrence of words from the sense expressions list in the whole post, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • feeling_expressions: occurrence of words from the feeling expressions list in the whole post, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • doubt_confidence_expressions: occurrence of words from the doubt/confidence expressions list in the whole post, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • disc_mark: total number of discourse markers, and a list of the detected discourse markers
      • generaliz_markers: total number of generalization markers, followed by the list of recognized discourse markers
      • attention_expressions: occurrence of words from the attention-attracting expressions list in each sentence, with a weight 2 if the marker is at the beginning of the sentence and 1 if not
      • duplicate_phrases: repeated words or expressions, a potential characteristic of automatically generated messages (e.g. deepfakes)
      • uppercase_middle_words: words with uppercase letters in the middle, a potential characteristic of automatically generated messages (e.g. deepfakes)
      • lowercase_beginning_sentences: information for each sentence on whether the sentence begins with a lowercase letter (True) or not (False)
      • num_of_urls: number of links per post
      • num_of_hashtags: number of hashtags per post
      • num_of_mentions: number of mentions per post
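
Several of the fields above are simple ratios or weighted counts, so a short sketch may help make the definitions concrete. The Python below is an illustration only, not the TRACES project's code: the function names, the input shape (lists of `(form, upos)` pairs), the lowercasing, the set of part-of-speech tags treated as function words, and the restriction to single-word markers are all assumptions made for this example.

```python
from collections import Counter

# Assumption: Universal Dependencies tags treated as "function words" here.
# The dataset description does not say which tags the authors used.
FUNCTION_TAGS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ratio_features(tokens, n_sentences):
    """Compute ratio features for one post.

    tokens: list of (form, upos) pairs for the whole post (hypothetical shape).
    n_sentences: number of sentences in the post.
    """
    counts = Counter(upos for _, upos in tokens)
    # Words are all tokens except punctuation; lowercasing is an assumption.
    words = [form.lower() for form, upos in tokens if upos != "PUNCT"]
    n_function = sum(counts[tag] for tag in FUNCTION_TAGS)
    return {
        "type_token_ratio": safe_div(len(set(words)), len(words)),
        "emotiveness": safe_div(counts["ADJ"] + counts["ADV"],
                                counts["NOUN"] + counts["VERB"]),
        "pausality": safe_div(counts["PUNCT"], n_sentences),
        "redundancy": safe_div(n_function, n_sentences),
    }


def weighted_expression_scores(sentences, markers):
    """Score each sentence: 2 for a marker in first position, 1 elsewhere.

    sentences: list of sentences, each a list of word forms.
    markers: set of lowercase single-word markers (a simplification).
    """
    scores = []
    for sentence in sentences:
        score = 0
        for position, word in enumerate(sentence):
            if word.lower() in markers:
                score += 2 if position == 0 else 1
        scores.append(score)
    return scores
```

For instance, `weighted_expression_scores([["now", "we", "know"], ["we", "know", "now"]], {"now"})` gives one score per sentence, with the sentence-initial marker counted double.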

  • Open Access Bulgarian
    Authors: 
    Erjavec, Tomaž; Ogrodniczuk, Maciej; Osenova, Petya; Ljubešić, Nikola; Simov, Kiril; Grigorova, Vladislava; Rudolf, Michał; Pančur, Andrej; Kopp, Matyáš; Barkarson, Starkaður; +29 more
    Publisher: CLARIN ERIC

    ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after October 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the ParlaMint TEI-encoded corpora with the derived plain text version of the corpus along with TSV metadata on the speeches. Also included is the 2.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists the linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1405.

  • Open Access Bulgarian
    Authors: 
    Tomaž Erjavec; Maciej Ogrodniczuk; Petya Osenova; Nikola Ljubešić; Kiril Simov; Vladislava Grigorova; Michał Rudolf; Andrej Pančur; Matyáš Kopp; Starkaður Barkarson; +34 more
    Publisher: CLARIN ERIC
    Country: Italy

    ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (from November 1st 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the linguistically marked-up version of the corpus, while the text version is available at http://hdl.handle.net/11356/1432. The ParlaMint.ana linguistic annotation includes tokenization, sentence segmentation, lemmatisation, Universal Dependencies part-of-speech, morphological features, and syntactic dependencies, and the 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, such as PoS tagging or named entities according to language-specific schemes, with their corpus TEI headers giving further details on the annotation vocabularies and tools. 
The compressed files include the ParlaMint.ana XML TEI-encoded linguistically annotated corpus; the derived corpus in CoNLL-U with TSV speech metadata; and the vertical files (with registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included is the 2.1 release of the data and scripts available at the GitHub repository of the ParlaMint project. As opposed to the previous version 2.0, this version corrects some errors in various corpora and adds the information on upper / lower house for bicameral parliaments. The vertical files have also been changed to make them easier to use in the concordancers.
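
The COVID-19 / "reference" split described in this entry depends only on the session date, so it can be sketched in a few lines. This is a minimal illustration assuming the cutoff stated in this entry (from November 1st 2019); the function name and the label strings are hypothetical and are not taken from the ParlaMint schemas.

```python
from datetime import date

# Cutoff taken from the description above: sessions from 1 November 2019
# onward belong to the COVID-19 period, earlier ones are "reference".
COVID_PERIOD_START = date(2019, 11, 1)


def session_period(session_date):
    """Return a hypothetical period label for a session held on session_date."""
    return "covid" if session_date >= COVID_PERIOD_START else "reference"


if __name__ == "__main__":
    for iso_day in ("2019-10-31", "2019-11-01", "2020-06-15"):
        print(iso_day, session_period(date.fromisoformat(iso_day)))
```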

  • Open Access Bulgarian
    Authors: 
    Erjavec, Tomaž; Ogrodniczuk, Maciej; Osenova, Petya; Ljubešić, Nikola; Simov, Kiril; Grigorova, Vladislava; Rudolf, Michał; Pančur, Andrej; Kopp, Matyáš; Barkarson, Starkaður; +31 more
    Publisher: CLARIN ERIC

    ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after October 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the linguistically marked-up version of the corpus, while the text version is available at http://hdl.handle.net/11356/1388. The ParlaMint.ana linguistic annotation includes tokenization, sentence segmentation, lemmatisation, Universal Dependencies part-of-speech, morphological features, and syntactic dependencies, and the 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, such as PoS tagging or named entities according to language-specific schemes, with their corpus TEI headers giving further details on the annotation vocabularies and tools. 
The compressed files include the ParlaMint.ana XML TEI-encoded linguistically annotated corpus; the derived corpus in CoNLL-U with TSV speech metadata; and the vertical files (with registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included is the 2.0 release of the data and scripts available at the GitHub repository of the ParlaMint project.
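
The derived CoNLL-U files mentioned above use the standard ten tab-separated columns, with `#` comment lines and a blank line between sentences. The sketch below reads that format using only the Python standard library. The sample sentence and its annotation are invented for illustration and are not taken from the corpus, and the file layout inside the ParlaMint archives is not assumed here.

```python
# Column names of the CoNLL-U format, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts).

    Comment lines are skipped. Multiword-token ranges (IDs like "1-2") and
    empty nodes (IDs like "1.1") are skipped too, so only syntactic words
    with plain integer IDs are returned.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS) or not columns[0].isdigit():
            continue
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Invented sample sentence, for illustration only.
SAMPLE = "\n".join([
    "# sent_id = demo.1",
    "# text = Това е тест.",
    "\t".join(["1", "Това", "това", "PRON", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "е", "съм", "AUX", "_", "_", "3", "cop", "_", "_"]),
    "\t".join(["3", "тест", "тест", "NOUN", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        print([(token["form"], token["upos"]) for token in sentence])
```

For files from the corpus itself, the same function can be applied to the contents of each `.conllu` file after reading it as UTF-8 text.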

  • Other research product . Other ORP type . 2021
    Bulgarian
    Authors: 
    Tomaž Erjavec; Maciej Ogrodniczuk; Petya Osenova; Nikola Ljubešić; Kiril Simov; Vladislava Grigorova; Michał Rudolf; Andrej Pančur; Matyáš Kopp; Starkaður Barkarson; +34 more
    Country: Italy

    ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after November 1st 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the ParlaMint TEI-encoded corpora with the derived plain text version of the corpus along with TSV metadata on the speeches. Also included is the 2.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists the linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1431.

  • Open Access Bulgarian
    Authors: 
    Erjavec, Tomaž; Ogrodniczuk, Maciej; Osenova, Petya; Ljubešić, Nikola; Simov, Kiril; Grigorova, Vladislava; Rudolf, Michał; Pančur, Andrej; Kopp, Matyáš; Barkarson, Starkaður; +32 more
    Publisher: CLARIN ERIC

    ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after November 1st 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the ParlaMint TEI-encoded corpora with the derived plain text version of the corpus along with TSV metadata on the speeches. Also included is the 2.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists the linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1431.

  • Open Access Bulgarian
    Authors: 
    Erjavec, Tomaž; Grigorova, Vladislava; Ljubešić, Nikola; Ogrodniczuk, Maciej; Osenova, Petya; Pančur, Andrej; Rudolf, Michał; Simov, Kiril;
    Publisher: CLARIN ERIC

    ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting at the end of 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after October 2019), or being "reference" (before that date). The corpora have extensive meta-data about the speakers (name, gender, party affiliation, MP status), are structured into time-stamped terms, sessions and meetings, with each speech being marked by its speaker and their role (chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation, but have been validated to the compatible but much stricter ParlaMint schemas. The schemas are included in the distribution, along with scripts to convert the corpora into other formats. The ZIP files with the TEI encoded corpora also include the automatically derived plain text version of the corpus, along with metadata on the speeches. In addition to the ParlaMint TEI encoded corpora, their linguistically encoded variants (".ana") are also available. The annotation includes named entities, lemmatisation, part-of-speech tagging, and morphological features and syntactic parses according to the Universal Dependencies recommendations. State-of-the-art tools have been used to perform the annotations. The .ana.zip corpora include the ParlaMint encoded XML, as well as derived formats, in particular, CoNLL-U and vertical files.