Human Centered Data Science/Datasets: Difference between revisions

From CommunityData
m (Jtmorgan moved page HCDS (Fall 2017)/Datasets to Human Centered Data Science/Datasets: permanent resource)
 
__FORCETOC__
In order to complete your project, you will each need a dataset. If you are at the stage of your career where you already have a dataset, great! If not, there are many datasets to draw from. Here are some ideas:


* Take a look at datasets available in the [https://dataverse.harvard.edu/ Harvard Dataverse] (the largest collection of social science research data) or one of the other members of the [http://dataverse.org/ Dataverse network].
* Look at the collection of social scientific datasets at [https://www.icpsr.umich.edu/icpsrweb/ICPSR/ ICPSR] (UW is a member). There are an enormous number of very rich datasets.
* Use the [http://scientificdata.isa-explorer.org/index.html ISA Explorer] to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
* Many municipalities host public civic data. Two of the best sources are [https://data.cityofchicago.org/ Data.cityofchicago.org] and [https://data.seattle.gov/ Data.seattle.gov].
* If you're interested in using Wikimedia/Wikipedia data and want to know what's available, ask the course instructors.
<!-- * [https://docs.google.com/document/d/1RPjvoxYX87DM_px8UX9my6rEg0V3rTcNvRHzSAHxAzU/edit# This Google Doc] provides documentation and descriptions of datasets (and potential research questions) from Yelp, , and Wikimedia. -->


;Other ideas:
* If there's an author of a study you loved, you can send a polite email asking if they are able or willing to share an archival or replication version of the dataset used in their paper. Be very polite and make it clear that this is starting as a class project but that it might turn into a paper for publication. Make your timeline clear. In communication research, replication datasets are still very rare, so be prepared for a negative answer.
* Do some Google Scholar and normal Google searching for datasets in your research area of interest. You'd be surprised at what's available.


== Other open online datasets and hosting sites ==
Note: there is no guarantee that the datasets you find here are licensed for re-use! Make sure to investigate this on your own.
* [https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0 Data Is Plural] is an open archive of datasets curated by a British journalist. ''BEWARE: this database may contain links to open datasets that have sensitive personal information or that might have other ethical implications for re-use!''
* [https://www.data.gov/developers/apis Data.gov] provides access to a variety of US Federal datasets and data sources, along with an API and online tools for searching for data.  
* The [https://sunlightfoundation.com/api/ Sunlight Foundation] used to provide a one-stop shop for datasets and tools around political activity in the USA. The Foundation has closed down, but their website points to a variety of other organizations, datasets, and tools for accessing public civic data.  
* [https://www.reddit.com/r/datasets/top/?sort=top&t=all r/datasets on Reddit] contains links to many interesting datasets, although some of these data may not be freely licensed.
* [https://elitedatascience.com/datasets Elite Data Science] has published a list of public datasets that can be used for a variety of data science and machine learning projects.
* [https://www.kaggle.com/datasets Kaggle] also provides many data science datasets that can be explored and used. However, the license terms of these datasets are often absent or incorrect!
 
== Dataset documentation examples ==
 
;Examples of well-documented open research projects
* Keegan, Brian. [https://github.com/brianckeegan/WeatherCrime ''WeatherCrime'']. GitHub, 2014.
* Geiger, Stuart R. and Halfaker, Aaron. [https://github.com/halfak/are-the-bots-really-fighting ''Operationalizing conflict and cooperation between automated software agents in Wikipedia: A replication and expansion of "Even Good Bots Fight"'']. GitHub, 2017.
* Narayan, Sneha et al. [https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6HPRIG ''Replication Data for: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users'']. Harvard Dataverse, 2017.
<!-- * * Warnke-Wang, Morten. ''[https://meta.wikimedia.org/wiki/Research:Autoconfirmed_article_creation_trial Autoconfirmed article creation trial].'' Wikimedia, 2017. -->
 
;Examples of not-so-well documented open research projects
* Eclarke. [https://github.com/eclarke/swga_paper SWGA paper]. GitHub, 2016.
* David Lefevre. [https://figshare.com/articles/Lefevre_and_Cox_Delayed_instructional_feedback_may_be_more_effective_but_is_this_contrary_to_learners_preferences_/2061303 ''Lefevre and Cox: Delayed instructional feedback may be more effective, but is this contrary to learners’ preferences?''] Figshare, 2016.
* Alneberg. [https://github.com/BinPro/paper-data ''CONCOCT Paper Data'']. GitHub, 2014.


[[Category:HCDS (Fall 2017)]]
[[Category:Human Centered Data Science]]

Latest revision as of 21:35, 31 October 2019


In order to complete your project, you will each need a dataset. If you are at the stage of your career where you already have a dataset, great! If not, there are many datasets to draw from. Here are some ideas:

  • Take a look at datasets available in the Harvard Dataverse (the largest collection of social science research data) or one of the other members of the Dataverse network.
  • Look at the collection of social scientific datasets at ICPSR (UW is a member). There are an enormous number of very rich datasets.
  • Use the ISA Explorer to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
  • Many municipalities host public civic data. Two of the best sources are Data.cityofchicago.org and Data.seattle.gov.
  • If you're interested in using Wikimedia/Wikipedia data and want to know what's available, ask the course instructors.
Other ideas
  • If there's an author of a study you loved, you can send a polite email asking if they are able or willing to share an archival or replication version of the dataset used in their paper. Be very polite and make it clear that this is starting as a class project but that it might turn into a paper for publication. Make your timeline clear. In communication research, replication datasets are still very rare, so be prepared for a negative answer.
  • Do some Google Scholar and normal Google searching for datasets in your research area of interest. You'd be surprised at what's available.

Other open online datasets and hosting sites

Note: there is no guarantee that the datasets you find here are licensed for re-use! Make sure to investigate this on your own.

  • Data Is Plural is an open archive of datasets curated by a British journalist. BEWARE: this database may contain links to open datasets that have sensitive personal information or that might have other ethical implications for re-use!
  • Data.gov provides access to a variety of US Federal datasets and data sources, along with an API and online tools for searching for data.
  • The Sunlight Foundation used to provide a one-stop shop for datasets and tools around political activity in the USA. The Foundation has closed down, but their website points to a variety of other organizations, datasets, and tools for accessing public civic data.
  • Figshare contains many open datasets across scientific disciplines.
  • The Internet Archive contains thousands of curated datasets of all types, including the complete corpus of StackExchange.com, among many others.
  • r/datasets on Reddit contains links to many interesting datasets, although some of these data may not be freely licensed.
  • Elite Data Science has published a list of public datasets that can be used for a variety of data science and machine learning projects.
  • Kaggle also provides many data science datasets that can be explored and used. However, the license terms of these datasets are often absent or incorrect!

Dataset documentation examples

Examples of well-documented open research projects
Examples of not-so-well documented open research projects