Advanced search in Research products
The following results are related to COVID-19. Are you interested to view more results? Visit OpenAIRE - Explore.
44 Research products, page 1 of 5

  • COVID-19
  • Research data
  • Research software
  • Zenodo

  • Open Access English
    Authors: 
    Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo;
    Publisher: Zenodo

    Version 117 of the dataset. MAJOR CHANGE NOTE: The dataset files full_dataset.tsv.gz and full_dataset_clean.tsv.gz have been split into 1 GB parts using the Linux utility split, so make sure to join the parts before unzipping. We had to make this change because we had major issues uploading files larger than 2 GB (hence the delay in the dataset releases). The peer-reviewed publication for this dataset has now been published in Epidemiologia, an MDPI journal, and can be accessed here: https://doi.org/10.3390/epidemiologia2030024. Please cite it when using the dataset.

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th, to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French.

    We release all tweets and retweets in the full_dataset.tsv file (1,341,292,548 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (347,137,128 unique tweets). There are several practical reasons for us to leave the retweets in; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/ More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter and in our pre-print about the dataset (https://arxiv.org/abs/2004.03688). As always, the tweets distributed here are only tweet identifiers (with date and time added), due to Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
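The note about joining the split parts before unzipping can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: the function names are made up for this example, and it assumes the part file names sort lexicographically into their original order (as the default suffixes produced by split do). The actual part names are not given in the description, so check them in the release before use.

```python
import gzip
import shutil


def join_parts(part_paths, joined_path):
    """Concatenate split parts, in sorted name order, into one file.

    Assumption: sorting the part names restores the original order.
    """
    with open(joined_path, "wb") as out:
        for part in sorted(part_paths):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return joined_path


def gunzip(gz_path, out_path):
    """Decompress a gzip file to out_path without loading it into memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as out:
        shutil.copyfileobj(src, out)
    return out_path
```

A typical call would pass the list of downloaded part files to join_parts to rebuild full_dataset.tsv.gz, then pass the result to gunzip. Both functions stream the data in chunks, which matters for archives of this size.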

  • Open Access English
    Authors: 
    Tekumalla, Ramya; Baig, Zia; Pan, Michelle; Robles Hernandez, Luis Alberto; Wang, Michael; Banda, Juan M.;
    Publisher: Zenodo

    This is the dataset, trained model, and software companion for the paper "Characterizing Anti-Asian Rhetoric During The COVID-19 Pandemic: A Sentiment Analysis Case Study on Twitter", accepted for the Workshop on Data for the Wellbeing of Most Vulnerable of the ICWSM 2022 conference. The COVID-19 pandemic has shown a measurable increase in the usage of sinophobic comments or terms on online social media platforms. In the United States, Asian Americans have been primarily targeted by violence and hate speech stemming from negative sentiments about the origins of the novel SARS-CoV-2 virus. While most published research focuses on extracting these sentiments from social media data, it does not connect the specific news events during the pandemic with changes in negative sentiment on social media platforms. In this work we combine and enhance publicly available resources with our own manually annotated set of tweets to create machine learning classification models to characterize the sinophobic behavior. We then applied our classifier to a pre-filtered longitudinal dataset spanning two years of pandemic-related tweets and overlaid our findings with relevant news events.

  • Closed Access
    Authors: 
    Ana Isabela Lopes Sales-Moioli; Leonardo J. Galvão-Lima; Talita K. B. Pinto; Pablo H. Cardoso; Rodrigo D. Silva; Felipe Fernandes; Ingridy M. P. Barbalho; Fernando L. O. Farias; Nicolas V. R. Veras; Daniele M. S. Barros; +4 more
    Publisher: Zenodo

    Data Repository
    Dataset name: covid19_rn-br.csv
    Version: 1.0
    Data collection period: 04/2020 - 08/2021
    Dataset Characteristics: Multivalued
    Number of Instances: 12,635
    Number of Attributes: 16
    Missing Values: Yes
    Area(s): Health
    Sources:
    - Primary: RegulaRN (https://regulacao.saude.rn.gov.br/sala-situacao/sala_publica/).
    - Secondary: RN Mais Vacina (https://rnmaisvacina.lais.ufrn.br/cidadao/).

    Description: The covid19_rn-br.csv dataset is composed of data from individuals who were hospitalized due to the SARS-CoV-2 virus. The data comes from the ecosystem of services that includes the regulatory system for clinical and critical beds related to Covid-19 (RegulaRN) and the vaccination system against Covid-19 that records the data of the general population (RN Mais Vacina) from Rio Grande do Norte state, Brazil. This dataset provides elementary data to analyze the impact of vaccination on patients hospitalized in the state. Table 1 presents the dictionary used during the data analysis.

    Table 1: Description of Dataset Features.
    Attribute | Description | Datatype | Value
    usp | Unified Score for Prioritization scale, which combines the parameters described in the quick Sequential Organ Failure Assessment (qSOFA), the Charlson Comorbidity Index (CCI), the Clinical Frailty Scale (CFS) and the Karnofsky Performance Status scores | Numerical | 2.0, 3.0, 4.0, 5.0, 6.0+
    age | Informs the patient's age | Numerical | An integer value for age
    outcome | Informs the outcome of the hospitalized patient after leaving the hospital | Categorical | “Discharge” or “Death”
    comorbidities | Informs if the patient has comorbidities | Categorical | “Yes” or “No”
    vaccine | Informs which type of vaccine was applied to the patient | Categorical | “Vaccine #1”, “Vaccine #2” or NaN
    bed_date_admission | Informs the date the patient was hospitalized | Date | Date
    bed_date_outcome | Informs the date that the patient left the hospital bed | Date | Date
    length_hospitalization | Informs the number of days that the patient was hospitalized | Numerical | An integer value for days
    interval_d1_hospitalization | Informs the interval (in days) that the patient had between the first dose and admission | Numerical | An integer value for days or NaN
    interval_d2_hospitalization | Informs the interval (in days) that the patient had between the second dose and admission | Numerical | An integer value for days or NaN
    dt_d1 | Informs the date of application of the patient's first dose | Date | Date or NaN
    dt_d2 | Informs the patient's second dose application date | Date | Date or NaN
    comorbidities_txt | Informs patients' comorbidities | Categorical | Free text or NaN
    immunization | Informs the patient's immunization level according to the number of doses received and the interval (in days) of application of these doses | Categorical | “Partially”, “Fully” or “Not vaccinated”
    health_professionals | Informs if the patient is a health professional | Boolean | 0 or 1
    age_group | Informs the age group of the hospitalized patient according to their age | Categorical | 0-19, 20-49, 50-59, 60-69, 70-79, 80-89, 90+

    Article: Effectiveness of COVID-19 vaccination on reduction of hospitalizations and deaths in elderly patients in Rio Grande do Norte, Brazil
    Authors: Ana Isabela Lopes Sales-Moioli, Leonardo J. Galvão-Lima, Talita K. B. Pinto, Pablo H. Cardoso, Rodrigo D. Silva, Felipe Fernandes, Ingridy M. P. Barbalho, Fernando L. O. Farias, Nicolas V. R. Veras, Daniele M. S. Barros, Agnaldo S. Cruz, Ion G. M. Andrade, Lúcio Gama and Ricardo A. M. Valentim
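Several attributes in the data dictionary are derived from others: age_group from age, and the day counts from pairs of dates. The sketch below shows one way such values could be computed. The age-group boundaries are taken from the table; everything else is an assumption made for illustration, including the function names, the use of None for missing values (the table says NaN), and the plain date difference, since the description does not say whether the authors count days inclusively.

```python
from datetime import date

# Age-group labels as listed in the data dictionary.
# Each pair is (inclusive upper bound, label); ages above 89 fall in "90+".
AGE_GROUPS = [
    (19, "0-19"),
    (49, "20-49"),
    (59, "50-59"),
    (69, "60-69"),
    (79, "70-79"),
    (89, "80-89"),
]


def age_group(age):
    """Map an integer age to the age_group label from the data dictionary."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper, label in AGE_GROUPS:
        if age <= upper:
            return label
    return "90+"


def days_between(start, end):
    """Whole days from start to end, or None if either date is missing.

    Assumption: a plain date difference; the dataset's own convention
    (for example inclusive counting) is not stated in the description.
    """
    if start is None or end is None:
        return None
    return (end - start).days
```

Under these assumptions, length_hospitalization would correspond to days_between(bed_date_admission, bed_date_outcome), and interval_d1_hospitalization to days_between(dt_d1, bed_date_admission), which is None for patients with no first dose.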

  • Open Access
    Authors: 
    Jones, Adrian; Zhang, Daoyu; Deigin, Yuri; Quay, Steven C.;
    Publisher: Zenodo

    Supplementary Figures, Information and Data to accompany "Analysis of pangolin metagenomic datasets reveals significant contamination, raising concerns for pangolin CoV host attribution" by Adrian Jones, Daoyu Zhang, Yuri Deigin and Steven C. Quay (https://arxiv.org/abs/2108.08163).

  • Open Access
    Authors: 
    Thanasis Vergoulis; Ilias Kanellos; Serafeim Chatzopoulos; Danae Pla Karidi; Theodore Dalamagas;
    Publisher: Zenodo

    This dataset contains impact metrics and indicators for a set of publications that are related to the COVID-19 infectious disease and the coronavirus that causes it. It is based on the CORD-19 dataset released by the Semantic Scholar team [1] and the curated data provided by the LitCovid hub [2]. These data have been cleaned and integrated with data from COVID-19-TweetIDs and from other sources (e.g., PMC). The result was a dataset of 550,214 unique articles along with relevant metadata (e.g., the underlying citation network). We utilized this dataset to produce, for each article, the values of the following impact measures:

    - Influence: Citation-based measure reflecting the total impact of an article. This is based on the PageRank [3] network analysis method. In the context of citation networks, it estimates the importance of each article based on its centrality in the whole network. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
    - Influence_alt: Citation-based measure reflecting the total impact of an article. This is the Citation Count of each article, calculated based on the citation network between the articles contained in the BIP4COVID19 dataset.
    - Popularity: Citation-based measure reflecting the current impact of an article. This is based on the AttRank [5] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). AttRank alleviates this problem by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to read papers which received a lot of attention recently. This is why it is more suitable to capture the current "hype" of an article.
    - Popularity alternative: An alternative citation-based measure reflecting the current impact of an article (this was the basic popularity measure provided by BIP4COVID19 until version 26). This is based on the RAM [6] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). RAM alleviates this problem using an approach known as "time-awareness". This is why it is more suitable to capture the current "hype" of an article. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
    - Social Media Attention: The number of tweets related to this article. Relevant data were collected from the COVID-19-TweetIDs dataset. In this version, tweets between 5/5/22-15/5/22 have been considered from the previous dataset.

    We provide five CSV files, all containing the same information, but each with its entries ordered by a different impact measure. All CSV files are tab-separated and have the same columns (PubMed_id, PMC_id, DOI, influence_score, popularity_alt_score, popularity score, influence_alt score, tweets count).

    The work is based on the following publications:
    [1] COVID-19 Open Research Dataset (CORD-19). 2020. Version 2022-05-31. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2022-05-31. doi:10.5281/zenodo.3715506
    [2] Chen Q, Allot A, & Lu Z. (2020) Keep up with the latest coronavirus research, Nature 579:193 (version 2022-05-31)
    [3] L. Page, S. Brin, R. Motwani and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
    [4] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019
    [5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
    [6] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380

    A Web user interface that uses these data to facilitate COVID-19 literature exploration can be found here. More details are given in our peer-reviewed publication here (an outdated preprint version is also available here).

    Funding: We acknowledge support of this work by the project "Moving from Big Data Management to Data Science" (MIS 5002437/3) which is implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).

    Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the Creative Commons Attribution 4.0 International license.
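To make the PageRank-based Influence measure more concrete, here is a small textbook power-iteration sketch on a citation network. It is not the PaperRanking library's implementation, and the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from articles that cite nothing are common defaults assumed here, not values stated in the dataset description.

```python
def pagerank(citations, damping=0.85, iterations=100):
    """Generic PageRank by power iteration.

    citations maps each article to the list of articles it cites.
    Returns a dict of scores that sum to 1.
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    if not nodes:
        return {}
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by articles with no outgoing citations is spread evenly.
        dangling = sum(score[v] for v in nodes if not citations.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each citing article passes an equal share of its rank to every
        # article it cites.
        for citing, cited in citations.items():
            for target in cited:
                new[target] += damping * score[citing] / len(cited)
        score = new
    return score
```

In a toy network where articles A and B both cite C, C receives the highest score, which matches the intuition that Influence rewards articles that are central in the citation network. Because rank only flows along citations that already exist, a recently published article starts with little incoming rank, which is the bias against new articles that the popularity measures are designed to offset.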

  • Open Access English
    Authors: 
    Sharma, Maryada; Vanam, Hari Pankaj; Panda, Naresh K; Patro, Sourabha K; Arora, Rhythm; Bhadada, Sanjay K; Rudramurthy, Sivaprakash M; Singh, Mini P; Koppula, Purushotham Reddy;
    Publisher: Zenodo

    Raw data files of transcriptomic profiling experiments, which form the basis for our publication (https://www.mdpi.com/2673-4540/3/1/13).

  • Open Access
    Authors: 
    Huang, Xiaolei; Jamison, Amelia; Broniatowski, David; Quinn, Sandra; Dredze, Mark;
    Publisher: Zenodo

    This dataset contains tweets related to COVID-19. The dataset contains Twitter IDs, which you can use to download the original data directly from Twitter. Additionally, we include the date, keywords related to COVID-19 and the inferred geolocation. For detailed information, see http://twitterdata.covid19dataresources.org/index.

  • Open Access
    Authors: 
    Thanasis Vergoulis; Ilias Kanellos; Serafeim Chatzopoulos; Danae Pla Karidi; Theodore Dalamagas;
    Publisher: Zenodo

    This dataset contains impact metrics and indicators for a set of publications that are related to the COVID-19 infectious disease and the coronavirus that causes it. It is based on the CORD-19 dataset released by the Semantic Scholar team [1] and the curated data provided by the LitCovid hub [2]. These data have been cleaned and integrated with data from COVID-19-TweetIDs and from other sources (e.g., PMC). The result was a dataset of 546,324 unique articles along with relevant metadata (e.g., the underlying citation network). We utilized this dataset to produce, for each article, the values of the following impact measures:

    - Influence: Citation-based measure reflecting the total impact of an article. This is based on the PageRank [3] network analysis method. In the context of citation networks, it estimates the importance of each article based on its centrality in the whole network. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
    - Influence_alt: Citation-based measure reflecting the total impact of an article. This is the Citation Count of each article, calculated based on the citation network between the articles contained in the BIP4COVID19 dataset.
    - Popularity: Citation-based measure reflecting the current impact of an article. This is based on the AttRank [5] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). AttRank alleviates this problem by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to read papers which received a lot of attention recently. This is why it is more suitable to capture the current "hype" of an article.
    - Popularity alternative: An alternative citation-based measure reflecting the current impact of an article (this was the basic popularity measure provided by BIP4COVID19 until version 26). This is based on the RAM [6] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). RAM alleviates this problem using an approach known as "time-awareness". This is why it is more suitable to capture the current "hype" of an article. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
    - Social Media Attention: The number of tweets related to this article. Relevant data were collected from the COVID-19-TweetIDs dataset. In this version, tweets between 24/4/22-30/4/22 have been considered from the previous dataset.

    We provide five CSV files, all containing the same information, but each with its entries ordered by a different impact measure. All CSV files are tab-separated and have the same columns (PubMed_id, PMC_id, DOI, influence_score, popularity_alt_score, popularity score, influence_alt score, tweets count).

    The work is based on the following publications:
    [1] COVID-19 Open Research Dataset (CORD-19). 2020. Version 2022-05-16. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2022-05-16. doi:10.5281/zenodo.3715506
    [2] Chen Q, Allot A, & Lu Z. (2020) Keep up with the latest coronavirus research, Nature 579:193 (version 2022-05-16)
    [3] L. Page, S. Brin, R. Motwani and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
    [4] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019
    [5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
    [6] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380

    A Web user interface that uses these data to facilitate COVID-19 literature exploration can be found here. More details are given in our peer-reviewed publication here (an outdated preprint version is also available here).

    Funding: We acknowledge support of this work by the project "Moving from Big Data Management to Data Science" (MIS 5002437/3) which is implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).

    Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the Creative Commons Attribution 4.0 International license.
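Since the files are tab-separated and differ only in how their rows are ordered, any one of them can be re-ranked by another measure after loading. The sketch below shows one way to do that with the Python standard library. It assumes each file starts with a header row and that the chosen column holds a numeric value in every row; neither is confirmed by the description, and the column names listed there are not spelled consistently, so the column is passed in as a parameter and should be checked against the actual header. The function name is hypothetical.

```python
import csv


def top_by_measure(path, column, n=10):
    """Return the n rows with the highest value in the given column.

    Assumptions: the file is tab-separated, has a header row, and the
    chosen column is numeric in every row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows[:n]
```

Loading all rows into memory is acceptable for a file of roughly half a million articles; for larger inputs, heapq.nlargest over the reader would avoid holding every row at once.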

  • Open Access
    Authors: 
    Fobbe, Sean;
    Publisher: Zenodo

    Overview: This R script downloads the Corpus der Entscheidungen des Bundesverfassungsgerichts (CE-BVerfG), the corpus of decisions of the German Federal Constitutional Court, searches it for vocabulary associated with SARS-CoV-2, and saves the relevant decisions. It is the basis for the dataset Corona-Rechtsprechung des Bundesverfassungsgerichts (BVerfG-Corona). All datasets created with this script are published permanently, free of charge and free of copyright on Zenodo, the scientific archive of CERN. All versions are assigned a persistent Digital Object Identifier (DOI). The latest version of the dataset is always available via the Concept DOI link: https://doi.org/10.5281/zenodo.4459405

    Updates: This software is updated approximately every 6 months. I always announce new and updated datasets immediately on Twitter at @FobbeSean.

    New in version 2022-02-01:
    - Complete update of the data
    - Strict version control of R packages with renv
    - Compilation is now configurable in detail, in particular the parallelization
    - Parallelization now done entirely with future instead of foreach and doParallel
    - Failed compilations are cleaned up fully automatically before the next compilation
    - All results are automatically packaged and sorted into the 'output' folder
    - README and CHANGELOG are now external Markdown files that are included automatically during compilation

    System requirements:
    Operating system: The script in its published form can only be run on Linux, because it uses Linux-specific optimizations (e.g. fork cluster) and shell commands (e.g. OpenSSL). The script was developed and tested on Fedora Linux. For the version used for compilation, see the sessionInfo() output at the end of the respective Compilation Report.
    Software: You must have the R programming language installed. Then start a session in the project folder; you should be prompted automatically to install all packages in the recommended version. Otherwise, run the following command: renv::restore() To compile the PDF reports you need a LaTeX installation. On Fedora you can install it as follows: sudo dnf install texlive-scheme-full Alternatively, you can install the R package tinytex.
    Parallelization: By default, the script fully automatically uses the maximum number of cores/threads available on the system. The number of cores used can be adjusted in the configuration file. If the number of threads is set to 1, parallelization is disabled.
    Disk space: 4 GB of disk space should be available on the hard drive.

    Further open access publications (Fobbe):
    - Website: www.seanfobbe.de
    - Open Data: https://zenodo.org/communities/sean-fobbe-data/
    - Source Code: https://zenodo.org/communities/sean-fobbe-code/
    - Full texts of regular publications: https://zenodo.org/communities/sean-fobbe-publications/

    Copyright: The source code and all raw data provided by me are released under an MIT No Attribution (MIT-0) license. You may use them freely for any purpose.

    Contact: Found a bug? Have suggestions? Report them in the issue tracker on GitHub or send me an email at fobbe-data@posteo.de

  • Open Access English
    Authors: 
    Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo;
    Publisher: Zenodo

    Version 114 of the dataset. MAJOR CHANGE NOTE: The dataset files full_dataset.tsv.gz and full_dataset_clean.tsv.gz have been split into 1 GB parts using the Linux utility split, so make sure to join the parts before unzipping. We had to make this change because we had major issues uploading files larger than 2 GB (hence the delay in the dataset releases). The peer-reviewed publication for this dataset has now been published in Epidemiologia, an MDPI journal, and can be accessed here: https://doi.org/10.3390/epidemiologia2030024. Please cite it when using the dataset.

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th, to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French.

    We release all tweets and retweets in the full_dataset.tsv file (1,337,295,758 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (345,843,722 unique tweets). There are several practical reasons for us to leave the retweets in; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/ More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter and in our pre-print about the dataset (https://arxiv.org/abs/2004.03688). As always, the tweets distributed here are only tweet identifiers (with date and time added), due to Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.

Advanced search in Research products
Research products
arrow_drop_down
Searching FieldsTerms
Any field
arrow_drop_down
includes
arrow_drop_down
Include:
The following results are related to COVID-19. Are you interested to view more results? Visit OpenAIRE - Explore.
44 Research products, page 1 of 5
  • Open Access English
    Authors: 
    Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo;
    Publisher: Zenodo

    Version 117 of the dataset. MAJOR CHANGE NOTE: The dataset files: full_dataset.tsv.gz and full_dataset_clean.tsv.gz have been split in 1 GB parts using the Linux utility called Split. So make sure to join the parts before unzipping. We had to make this change as we had huge issues uploading files larger than 2GB's (hence the delay in the dataset releases). The peer-reviewed publication for this dataset has now been published in Epidemiologia an MDPI journal, and can be accessed here: https://doi.org/10.3390/epidemiologia2030024. Please cite this when using the dataset. Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started from March 11th yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th, to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, gracefully provided to us by: Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emoijis and their frequencies the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets. The data collected from the stream captures all languages, but the higher prevalence are: English, Spanish, and French. We release all tweets and retweets on the full_dataset.tsv file (1,341,292,548 unique tweets), and a cleaned version with no retweets on the full_dataset-clean.tsv file (347,137,128 unique tweets). 
There are several practical reasons for us to leave the retweets, tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/ More details can be found (and will be updated faster at: https://github.com/thepanacealab/covid19_twitter) and our pre-print about the dataset (https://arxiv.org/abs/2004.03688) As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter to re-distribute Twitter data ONLY for research purposes. They need to be hydrated to be used.

  • Open Access English
    Authors: 
    Tekumalla, Ramya; Baig, Zia; Pan, Michelle; Robles Hernandez, Luis Alberto; Wang, Michael; Banda, Juan M.;
    Publisher: Zenodo

    This is the dataset, trained model, and software companion for the paper titled: Characterizing Anti-Asian Rhetoric During The COVID-19 Pandemic: A Sentiment Analysis Case Study on Twitter accepted for the Workshop on Data for the Wellbeing of Most Vulnerable of the ICWSM 2022 conference. The COVID-19 pandemic has shown a measurable increase in the usage of sinophobic comments or terms on online social media platforms. In the United States, Asian Americans have been primarily targeted by violence and hate speech stemming from negative sentiments about the origins of the novel SARS-CoV-2 virus. While most published research focuses on extracting these sentiments from social media data, it does not connect the specific news events during the pandemic with changes in negative sentiment on social media platforms. In this work we combine and enhance publicly available resources with our own manually annotated set of tweets to create machine learning classification models to characterize the sinophobic behavior. We then applied our classifier to a pre-filtered longitudinal dataset spanning two years of pandemic related tweets and overlay our findings with relevant news events.

  • Closed Access
    Authors: 
    Ana Isabela Lopes Sales-Moioli; Leonardo J. GalvĂŁo-Lima; Talita K. B. Pinto; Pablo H. Cardoso; Rodrigo D. Silva; Felipe Fernandes; Ingridy M. P. Barbalho; Fernando L. O. Farias; Nicolas V. R. Veras; Daniele M. S. Barros; +4 more
    Publisher: Zenodo

    Data Repository
    Dataset name: covid19_rn-br.csv
    Version: 1.0
    Data collection period: 04/2020 - 08/2021
    Dataset Characteristics: Multivalued
    Number of Instances: 12,635
    Number of Attributes: 16
    Missing Values: Yes
    Area(s): Health
    Sources:
    - Primary: RegulaRN (https://regulacao.saude.rn.gov.br/sala-situacao/sala_publica/).
    - Secondary: RN Mais Vacina (https://rnmaisvacina.lais.ufrn.br/cidadao/).
    Description: The covid19_rn-br.csv dataset is composed of data from individuals who were hospitalized due to the Sars-CoV-2 virus. The data comes from the ecosystem of services that includes the regulatory system for clinical and critical beds related to Covid-19 (RegulaRN) and the vaccination system against Covid-19 that records the data of the general population (RN Mais Vacina) from Rio Grande do Norte state, Brazil. This dataset provides elementary data to analyze the impact of vaccination on patients hospitalized in the state. Table 1 presents the dictionary used during the data analysis.
    Table 1: Description of Dataset Features.
    Attribute | Description | Datatype | Value
    usp | Unified Score for Prioritization scale, which combines the parameters described in the quick Sequential Organ Failure Assessment (qSOFA), the Charlson Comorbidity Index (CCI), the Clinical Frailty Scale (CFS) and the Karnofsky Performance Status scores | Numerical | 2.0, 3.0, 4.0, 5.0, 6.0+
    age | Informs the patient's age | Numerical | Integer value for age
    outcome | Informs the outcome of the hospitalized patient after leaving the hospital | Categorical | “Discharge” or “Death”
    comorbidities | Informs if the patient has comorbidities | Categorical | “Yes” or “No”
    vaccine | Informs which type of vaccine was applied to the patient | Categorical | “Vaccine #1”, “Vaccine #2” or NaN
    bed_date_admission | Informs the date the patient was hospitalized | Date | Date
    bed_date_outcome | Informs the date that the patient left the hospital bed | Date | Date
    length_hospitalization | Informs the number of days that the patient was hospitalized | Numerical | An integer value for days
    interval_d1_hospitalization | Informs the interval (in days) that the patient had between the first dose and admission | Numerical | An integer value for days or NaN
    interval_d2_hospitalization | Informs the interval (in days) that the patient had between the second dose and admission | Numerical | An integer value for days or NaN
    dt_d1 | Informs the date of application of the patient's first dose | Date | Date or NaN
    dt_d2 | Informs the patient's second dose application date | Date | Date or NaN
    comorbidities_txt | Informs patients' comorbidities | Categorical | Free text or NaN
    immunization | Informs the patient's immunization level according to the number of doses received and the interval (in days) of application of these doses | Categorical | “Partially”, “Fully” or “Not vaccinated”
    health_professionals | Informs if the patient is a health professional | Boolean | 0 or 1
    age_group | Informs the age group of the hospitalized patient according to their age | Categorical | 0-19, 20-49, 50-59, 60-69, 70-79, 80-89, 90+
    Article: Effectiveness of COVID-19 vaccination on reduction of hospitalizations and deaths in elderly patients in Rio Grande do Norte, Brazil
    Authors: Ana Isabela Lopes Sales-Moioli, Leonardo J. Galvão-Lima, Talita K. B. Pinto, Pablo H. Cardoso, Rodrigo D. Silva, Felipe Fernandes, Ingridy M. P. Barbalho, Fernando L. O. Farias, Nicolas V. R. Veras, Daniele M. S. Barros, Agnaldo S. Cruz, Ion G. M. Andrade, Lúcio Gama and Ricardo A. M. Valentim
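    The derived attributes in the data dictionary above (age_group and length_hospitalization) can be illustrated with a minimal Python sketch. It is not the authors' code. The comma delimiter, the ISO date format, the definition of length of stay as the difference between the two bed dates, and the two sample rows are all assumptions made for illustration; only the column names and the age bands come from the description.

```python
import csv
import io
from datetime import date

# Age bands exactly as listed for the age_group attribute; 90+ is the fallback.
AGE_BANDS = [(0, 19), (20, 49), (50, 59), (60, 69), (70, 79), (80, 89)]


def age_group(age):
    """Map an integer age to one of the age_group labels."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "90+"


def stay_in_days(admission, outcome):
    """Days between admission and outcome (assumes YYYY-MM-DD dates)."""
    return (date.fromisoformat(outcome) - date.fromisoformat(admission)).days


# Made-up rows that only mimic the documented column names.
SAMPLE = (
    "age,outcome,bed_date_admission,bed_date_outcome\n"
    "67,Discharge,2021-03-01,2021-03-11\n"
    "91,Death,2021-04-05,2021-04-07\n"
)


def summarize(text):
    """Return one dict per row with the derived attributes added."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "outcome": row["outcome"],
            "age_group": age_group(int(row["age"])),
            "length_hospitalization": stay_in_days(
                row["bed_date_admission"], row["bed_date_outcome"]
            ),
        })
    return rows


if __name__ == "__main__":
    for item in summarize(SAMPLE):
        print(item)
```

    If the real file stores dates in another format, only stay_in_days needs to change.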

  • Open Access
    Authors: 
    Jones, Adrian; Zhang, Daoyu; Deigin, Yuri; Quay, Steven C.;
    Publisher: Zenodo

    Supplementary Figures, Information and Data to accompany: "Analysis of pangolin metagenomic datasets reveals significant contamination, raising concerns for pangolin CoV host attribution" by Adrian Jones, Daoyu Zhang, Yuri Deigin and Steven C. Quay (https://arxiv.org/abs/2108.08163).

  • Open Access
    Authors: 
    Thanasis Vergoulis; Ilias Kanellos; Serafeim Chatzopoulos; Danae Pla Karidi; Theodore Dalamagas;
    Publisher: Zenodo

    This dataset contains impact metrics and indicators for a set of publications that are related to the COVID-19 infectious disease and the coronavirus that causes it. It is based on:
    - The CORD-19 dataset released by the team of Semantic Scholar [1]
    - The curated data provided by the LitCovid hub [2]
    These data have been cleaned and integrated with data from COVID-19-TweetIDs and from other sources (e.g., PMC). The result is a dataset of 550,214 unique articles along with relevant metadata (e.g., the underlying citation network). We used this dataset to produce, for each article, the values of the following impact measures:
    - Influence: Citation-based measure reflecting the total impact of an article. This is based on the PageRank [3] network analysis method. In the context of citation networks, it estimates the importance of each article based on its centrality in the whole network. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
    - Influence_alt: Citation-based measure reflecting the total impact of an article. This is the Citation Count of each article, calculated based on the citation network between the articles contained in the BIP4COVID19 dataset.
    - Popularity: Citation-based measure reflecting the current impact of an article. This is based on the AttRank [5] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). AttRank alleviates this problem by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to read papers that received a lot of attention recently. This is why it is more suitable for capturing the current "hype" of an article.
    - Popularity alternative: An alternative citation-based measure reflecting the current impact of an article (this was the basic popularity measure provided by BIP4COVID19 until version 26). This is based on the RAM [6] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). RAM alleviates this problem using an approach known as "time-awareness". This is why it is more suitable for capturing the current "hype" of an article. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
    - Social Media Attention: The number of tweets related to this article. Relevant data were collected from the COVID-19-TweetIDs dataset. In this version, tweets between 5/5/22-15/5/22 have been considered from the previous dataset.
    We provide five CSV files, all containing the same information but each with its entries ordered by a different impact measure. All CSV files are tab separated and have the same columns (PubMed_id, PMC_id, DOI, influence_score, popularity_alt_score, popularity score, influence_alt score, tweets count).
    The work is based on the following publications:
    [1] COVID-19 Open Research Dataset (CORD-19). 2020. Version 2022-05-31. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2022-05-31. doi:10.5281/zenodo.3715506
    [2] Chen Q, Allot A, & Lu Z. (2020) Keep up with the latest coronavirus research, Nature 579:193 (version 2022-05-31)
    [3] R. Motwani, L. Page, S. Brin and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
    [4] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019
    [5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
    [6] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380
    A Web user interface that uses these data to facilitate COVID-19 literature exploration can be found here. More details are in our peer-reviewed publication here (an outdated preprint version is also available here).
    Funding: We acknowledge support of this work by the project "Moving from Big Data Management to Data Science" (MIS 5002437/3), which is implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).
    Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the Creative Commons Attribution 4.0 International license.
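    To make the difference between the Influence (PageRank-based) and Influence_alt (citation count) measures concrete, here is a small Python sketch on a toy citation network. It is a textbook power-iteration PageRank, not the PaperRanking library's implementation. The damping factor of 0.85, the uniform redistribution of score from articles that cite nothing, and the four-article network are assumptions chosen for illustration.

```python
def _nodes(citations):
    """All articles that cite or are cited, in a stable order."""
    cited = {target for targets in citations.values() for target in targets}
    return sorted(set(citations) | cited)


def pagerank(citations, damping=0.85, iterations=100):
    """Power-iteration PageRank; citations maps article -> list of cited articles."""
    nodes = _nodes(citations)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Articles with no outgoing citations spread their score uniformly.
        dangling = sum(score[node] for node in nodes if not citations.get(node))
        nxt = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in citations.items():
            if not targets:
                continue
            share = damping * score[source] / len(targets)
            for target in targets:
                nxt[target] += share
        score = nxt
    return score


def citation_counts(citations):
    """Number of incoming citations per article (the Influence_alt idea)."""
    counts = {node: 0 for node in _nodes(citations)}
    for targets in citations.values():
        for target in targets:
            counts[target] += 1
    return counts


# Hypothetical network: A and B cite C, and C cites D.
TOY = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}

if __name__ == "__main__":
    print(citation_counts(TOY))
    print(pagerank(TOY))
```

    In this toy network C has the most citations, yet D receives the highest PageRank score because its single citation comes from a highly ranked article. This is the kind of distinction the two influence measures are meant to expose.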

  • Open Access English
    Authors: 
    Sharma, Maryada; Vanam, Hari Pankaj; Panda, Naresh K; Patro, Sourabha K; Arora, Rhythm; Bhadada, Sanjay K; Rudramurthy, Sivaprakash M; Singh, Mini P; Koppula, Purushotham Reddy;
    Publisher: Zenodo

    Raw data files of transcriptomic profiling experiments, which form the basis for our publication (https://www.mdpi.com/2673-4540/3/1/13).

  • Open Access
    Authors: 
    Huang, Xiaolei; Jamison, Amelia; Broniatowski, David; Quinn, Sandra; Dredze, Mark;
    Publisher: Zenodo

    This dataset contains tweets related to COVID-19. The dataset contains Twitter IDs, with which you can download the original data directly from Twitter. Additionally, we include the date, keywords related to COVID-19 and the inferred geolocation. Detailed information is available at http://twitterdata.covid19dataresources.org/index.

  • Open Access
    Authors: 
    Thanasis Vergoulis; Ilias Kanellos; Serafeim Chatzopoulos; Danae Pla Karidi; Theodore Dalamagas;
    Publisher: Zenodo

    This dataset contains impact metrics and indicators for a set of publications that are related to the COVID-19 infectious disease and the coronavirus that causes it. It is based on:
    - The CORD-19 dataset released by the team of Semantic Scholar [1]
    - The curated data provided by the LitCovid hub [2]
    These data have been cleaned and integrated with data from COVID-19-TweetIDs and from other sources (e.g., PMC). The result is a dataset of 546,324 unique articles along with relevant metadata (e.g., the underlying citation network). We used this dataset to produce, for each article, the values of the following impact measures:
    - Influence: Citation-based measure reflecting the total impact of an article. This is based on the PageRank [3] network analysis method. In the context of citation networks, it estimates the importance of each article based on its centrality in the whole network. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
    - Influence_alt: Citation-based measure reflecting the total impact of an article. This is the Citation Count of each article, calculated based on the citation network between the articles contained in the BIP4COVID19 dataset.
    - Popularity: Citation-based measure reflecting the current impact of an article. This is based on the AttRank [5] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). AttRank alleviates this problem by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to read papers that received a lot of attention recently. This is why it is more suitable for capturing the current "hype" of an article.
    - Popularity alternative: An alternative citation-based measure reflecting the current impact of an article (this was the basic popularity measure provided by BIP4COVID19 until version 26). This is based on the RAM [6] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). RAM alleviates this problem using an approach known as "time-awareness". This is why it is more suitable for capturing the current "hype" of an article. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
    - Social Media Attention: The number of tweets related to this article. Relevant data were collected from the COVID-19-TweetIDs dataset. In this version, tweets between 24/4/22-30/4/22 have been considered from the previous dataset.
    We provide five CSV files, all containing the same information but each with its entries ordered by a different impact measure. All CSV files are tab separated and have the same columns (PubMed_id, PMC_id, DOI, influence_score, popularity_alt_score, popularity score, influence_alt score, tweets count).
    The work is based on the following publications:
    [1] COVID-19 Open Research Dataset (CORD-19). 2020. Version 2022-05-16. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2022-05-16. doi:10.5281/zenodo.3715506
    [2] Chen Q, Allot A, & Lu Z. (2020) Keep up with the latest coronavirus research, Nature 579:193 (version 2022-05-16)
    [3] R. Motwani, L. Page, S. Brin and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
    [4] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019
    [5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
    [6] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380
    A Web user interface that uses these data to facilitate COVID-19 literature exploration can be found here. More details are in our peer-reviewed publication here (an outdated preprint version is also available here).
    Funding: We acknowledge support of this work by the project "Moving from Big Data Management to Data Science" (MIS 5002437/3), which is implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).
    Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the Creative Commons Attribution 4.0 International license.
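    Since the files are described as tab separated despite being called CSV, a reader has to set the delimiter explicitly. The Python sketch below shows one way to load such a file and re-rank it by a chosen score. The presence of a header row, the sample rows and the placeholder DOIs are assumptions; only the column names DOI and influence_score are taken from the description.

```python
import csv
import io

# Made-up sample with a header row; the real files may differ in layout.
SAMPLE = (
    "PubMed_id\tPMC_id\tDOI\tinfluence_score\n"
    "1\tPMC1\t10.0000/example.1\t0.002\n"
    "2\tPMC2\t10.0000/example.2\t0.010\n"
    "3\tPMC3\t10.0000/example.3\t0.005\n"
)


def top_by(text, column, k):
    """Return the k rows with the largest numeric value in the given column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows[:k]


if __name__ == "__main__":
    for row in top_by(SAMPLE, "influence_score", 2):
        print(row["DOI"], row["influence_score"])
```

    For a file on disk, the same DictReader call works on an object returned by open(path, newline="", encoding="utf-8").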

  • Open Access
    Authors: 
    Fobbe, Sean;
    Publisher: Zenodo

    Overview: This R script downloads the Corpus of Decisions of the Federal Constitutional Court (Corpus der Entscheidungen des Bundesverfassungsgerichts, CE-BVerfG), searches it for vocabulary associated with SARS-CoV-2 and saves the relevant decisions. It is the basis for the dataset Corona-Rechtsprechung des Bundesverfassungsgerichts (BVerfG-Corona), the coronavirus case law of the Federal Constitutional Court. All datasets created with this script are published permanently, free of charge and free of copyright on Zenodo, the scientific archive of CERN. All versions carry a persistent Digital Object Identifier (DOI). The latest version of the dataset is always available via the Concept DOI link: https://doi.org/10.5281/zenodo.4459405
    Updates: This software is updated approximately every 6 months. I always announce new and updated datasets immediately on Twitter at @FobbeSean.
    NEW in version 2022-02-01:
    - Full update of the data
    - Strict version control of R packages with renv
    - Compilation is now configurable in detail, in particular the parallelization
    - Parallelization is now done entirely with future instead of foreach and doParallel
    - Failed compilations are cleaned up fully automatically before the next compilation
    - All results are automatically packaged and sorted into the 'output' folder
    - README and CHANGELOG are now external Markdown files that are included automatically during compilation
    System requirements
    Operating system: The script in its published form can only be run on Linux, because it uses Linux-specific optimizations (e.g. Fork Cluster) and shell commands (e.g. OpenSSL). The script was developed and tested on Fedora Linux. For the version used for compilation, please refer to the sessionInfo() output at the end of the respective Compilation Report.
    Software: You must have the R programming language installed. Then start a session in the project folder; you should be prompted automatically to install all packages in the recommended version. Otherwise, please run the following command: renv::restore() To compile the PDF reports you need a LaTeX installation. On Fedora you can install it as follows: sudo dnf install texlive-scheme-full Alternatively, you can install the R package tinytex.
    Parallelization: By default, the script automatically uses the maximum number of cores/threads available on the system. The number of cores used can be adjusted in the configuration file. Setting the number of threads to 1 disables parallelization.
    Disk space: 4 GB of disk space should be available.
    Further Open Access publications (Fobbe):
    - Website: www.seanfobbe.de
    - Open Data: https://zenodo.org/communities/sean-fobbe-data/
    - Source Code: https://zenodo.org/communities/sean-fobbe-code/
    - Full texts of regular publications: https://zenodo.org/communities/sean-fobbe-publications/
    Copyright: The source code and all raw data provided by me are released under an MIT No Attribution (MIT-0) license. You can use them freely for any purpose.
    Contact: Found a bug? Have suggestions? Please report them in the Issue Tracker on GitHub or send me an email at fobbe-data@posteo.de

  • Open Access English
    Authors: 
    Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo;
    Publisher: Zenodo

    Version 114 of the dataset. MAJOR CHANGE NOTE: The dataset files full_dataset.tsv.gz and full_dataset_clean.tsv.gz have been split into 1 GB parts using the Linux utility split, so make sure to join the parts before unzipping. We had to make this change because we had major issues uploading files larger than 2 GB (hence the delay in the dataset releases). The peer-reviewed publication for this dataset has now been published in Epidemiologia, an MDPI journal, and can be accessed here: https://doi.org/10.3390/epidemiologia2030024. Please cite it when using the dataset. Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (1,337,295,758 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (345,843,722 unique tweets).
    There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/ More details can be found at https://github.com/thepanacealab/covid19_twitter (which is updated faster) and in our pre-print about the dataset (https://arxiv.org/abs/2004.03688). As always, the tweets distributed here are only tweet identifiers (with date and time added), because Twitter's terms and conditions allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
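    The note about joining the split parts before unzipping can be sketched in Python as follows, as an alternative to concatenating the parts on the command line. This is a minimal illustration, not the maintainers' procedure. The part file names, the glob pattern and the column names in the demo are hypothetical; the only assumption about the parts is that their names sort in the order in which split wrote them, as its default suffixes (aa, ab, ...) do.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def join_parts(parts_dir, pattern, joined_path):
    """Concatenate split parts, in sorted name order, into one file."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no parts match {pattern!r} in {parts_dir}")
    with open(joined_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts


def first_lines(gz_path, n):
    """Read the first n lines of a gzip-compressed text file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for _, line in zip(range(n), handle)]


def demo():
    """Emulate split on a tiny gzip file, then join and read it back."""
    payload = gzip.compress(b"tweet_id\tdate\ttime\n1\t2020-03-11\t00:00:01\n")
    third = len(payload) // 3
    chunks = [payload[:third], payload[third:2 * third], payload[2 * third:]]
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        for suffix, chunk in zip(["aa", "ab", "ac"], chunks):
            (tmp_path / f"sample.tsv.gz.part-{suffix}").write_bytes(chunk)
        joined = tmp_path / "sample.tsv.gz"
        join_parts(tmp_path, "sample.tsv.gz.part-*", joined)
        return first_lines(joined, 2)


if __name__ == "__main__":
    print(demo())
```

    Reading through gzip.open avoids writing the uncompressed file to disk, which may matter for files of this size.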