Community Data Science Workshops (Winter 2020)/Resources

Resources Winter 2020

This list of resources was created to help CDSW participants continue developing their skills and to answer some of the questions that arose during the workshops.

Scraping Data from the Web

Helena-Lang.org offers a tool for scraping data from the web automatically. It requires no programming and is available as a free Chrome plug-in. A series of tutorials is available on the website.

Quantitative Data Analysis

Tea-Lang.org lets you provide a high-level specification of your data and hypotheses and get back valid statistical test results and explanations. It requires minimal programming and some comfort with Python, and it is available as a free Python package.

Data Visualization

Altair-Viz.github.io lets you write a high-level specification of your data and the visualization you want and get back a data visualization. It requires some programming and comfort with Python, and it is available as a free Python package.

Finding a dataset

If you are looking for datasets for your projects, here are some potential leads:

Search Google Scholar and the wider web for datasets in your research area. You'll probably be surprised at what's available.
Take a look at datasets available in the Harvard Dataverse (a very large collection of social science research data) or one of the other members of the Dataverse network.
Look at the collection of social scientific datasets at ICPSR at the University of Michigan (NU is a member). It holds an enormous number of very rich datasets.
Use the ISA Explorer to find datasets. Keep in mind that the large majority of the datasets it searches are drawn from the natural sciences.
The City of Chicago has one of the best data portal sites of any municipality in the U.S. (and better than many federal agencies). There are also numerous administrative datasets released by other public entities (try searching!) that you might find inspiring.
FiveThirtyEight.com has published a GitHub repository and an R package with pre-processed and cleaned versions of many of the datasets they use for articles published on their website.