Editing COVID-19 Digital Observatory

From CommunityData

== Overview and objectives ==
As people struggle to make sense of the COVID-19 pandemic, many turn to social media and [[:wikipedia:Social_computing|social computing systems]] to share information, to understand what's happening, and to find new ways to support one another. As scholars, scientists, technologists, and concerned members of the public, we are building a digital observatory to understand where and how people are talking about COVID-19-related topics. The observatory collects, aggregates, and distributes social data related to how people are responding to the ongoing public health crisis of COVID-19. The public datasets and freely licensed tools, techniques, and knowledge created through this project will allow researchers, practitioners, and public health officials to more efficiently gather, analyze, understand, and act to improve these crucial sources of information during crises.


The public data we are focused on is available on public webpages and in public APIs but requires technical skills and computational resources that are less widely distributed than the ability to analyze data. In particular, we are attempting to make datasets that researchers can download and analyze on personal computers.


Everything here is a work in progress as we get the project running, create communication channels, and start releasing datasets. Learn how you can stay connected, use our resources as we produce them, and get involved below.
The digital observatory data, code, and other resources will exist in a few locations, all linked from this page. More details on the different datasets and sources follow below.


Our initial releases should provide a good starting point for investigating social computing and social media content related to COVID-19. We're currently releasing three types of material:


;Code: We are releasing and inviting collaboration on all of the materials used to collect, parse, and publish our datasets. All of these materials are being developed publicly.
;Keywords: We are building and releasing a list of keywords generated by daily Google search trends, Wikidata entities, and translations into many languages based on Wikidata entity links. We plan to expand these offerings with new material including data from Twitter, Reddit, and localized content specific to particular geographic regions. We also plan to build infrastructure to provide rapid and frequent updates of datasets in a variety of forms.
;Data: We are currently publishing static datasets (in raw text and structured formats like [[:wikipedia:Comma-separated values|CSV]], [[:wikipedia:Tab-separated values|TSV]], and [[:wikipedia:JSON|JSON]]) from Wikipedia as well as search engine results pages (SERPs) for a set of searches on COVID-19-relevant terms.
Each is described below.
=== Code ===
For code used to produce the data and get started with analysis we have a [https://github.com/CommunityDataScienceCollective/COVID-19_Digital_Observatory github repository] where almost everything lives. If you want to get involved or start using our work please clone the repository! You'll find example analysis scripts that walk through downloading data directly into something like R and producing some minimal analysis to help you get started.


===Data ===
The best way to find the data is to visit https://covid19.communitydata.science/datasets/. The <code>search_results</code> directory contains compressed raw data generated by Nick Vincent's [https://github.com/nickmvincent/LinkCoordMin SERP scraping project]. The <code>wikipedia</code> directory has view counts and revision histories for Wikipedia pages of COVID-19-related articles in <code>.json</code> and <code>.tsv</code> format. The <code>keywords</code> directory has <code>.csv</code> files with COVID-19 related keywords translated into many languages and associated Wikidata item identifiers.
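As a minimal sketch of how these directories map to download URLs — the file name <code>example_keywords.csv</code> below is a hypothetical placeholder, so check the directory listing for real names:

```python
from urllib.parse import urljoin

# Base of the dataset listing described above.
BASE = "https://covid19.communitydata.science/datasets/"

def dataset_url(directory, filename):
    """Build the URL for a file in one of the observatory's
    directories (search_results, wikipedia, or keywords)."""
    return urljoin(BASE, f"{directory}/{filename}")

# Hypothetical file name, for illustration only.
print(dataset_url("keywords", "example_keywords.csv"))
```

A URL built this way can be passed directly to tools like <code>pandas.read_csv</code> for the <code>.csv</code> and <code>.tsv</code> files.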


====Search Engine Results Pages (SERP) Data====
The SERP data in our initial data release includes the first search result page from Google and Bing for a variety of COVID-19 related terms gathered from Google Trends and Google and Bing's autocomplete "search suggestions." Specifically, using a set of six "stem keywords" about COVID-19 and online communities ("coronavirus", "coronavirus reddit", "coronavirus wiki", "covid 19", "covid 19 reddit", and "covid 19 wiki"), we collect related keywords from Google Trends (using open source software [https://www.npmjs.com/package/google-trends-api]) and autocomplete suggestions from Google and Bing (using open source software [https://github.com/gitronald/suggests]). In addition to COVID-19 keywords, we also collect SERP data for the top daily trending queries. Currently, the SERP data collection process does not specify location in its searches. Consequently, the default location used is the location of our machine, at Northwestern University's Evanston campus. We are working on collecting SERP data with location specified beyond the Chicago area (i.e., other 'localized' content).
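For illustration, the six stem keywords above can be expanded into query URLs in a few lines. This sketch builds plain Google search URLs; it is not the project's actual collection pipeline, which uses the open source tools linked above:

```python
from urllib.parse import quote_plus

# The six "stem keywords" listed above.
STEM_KEYWORDS = [
    "coronavirus", "coronavirus reddit", "coronavirus wiki",
    "covid 19", "covid 19 reddit", "covid 19 wiki",
]

def search_url(query):
    """Build a Google search URL for one keyword (illustrative only)."""
    return "https://www.google.com/search?q=" + quote_plus(query)

urls = [search_url(k) for k in STEM_KEYWORDS]
print(urls[4])  # the "covid 19 reddit" query
```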


The SERP data is released as a series of compressed archives (7z), one archive per day, that follow the naming convention <code>covid_search_data-[YYYYMMDD].7z</code>. You will need a 7z extractor; "7z Opener" on Windows worked well for us. Within these compressed archives, there is a folder for each device emulated in the data collection (currently two: Chrome on Windows and iPhone X) which contains all of the respective SERP data. Within each device subdirectory, the SERP data is organized into folders titled by the URL of the search query (e.g. <code>'https---www.google.com-search?q=Krispy Kreme'</code>), and each SERP folder contains three data files:
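Given that naming convention, the archive for a particular day can be located programmatically; a small sketch:

```python
from datetime import date

def archive_name(day):
    """Daily SERP archive name, per covid_search_data-[YYYYMMDD].7z."""
    return f"covid_search_data-{day.strftime('%Y%m%d')}.7z"

print(archive_name(date(2020, 4, 1)))  # covid_search_data-20200401.7z
```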