User:Groceryheist/drafts/Data Science Syllabus: Difference between revisions

From CommunityData
Revision as of 21:00, 20 March 2019

Data Science and Organizational Communication
Principal instructor
Nate TeBlunthuis
Course Catalog Description
Fundamental principles of data science and its implications, including research ethics; data privacy; legal frameworks; algorithmic bias, transparency, fairness and accountability; data provenance, curation, preservation, and reproducibility; human computation; data communication and visualization; the role of data science in organizational context and the societal impacts of data science.

Course Description

The rise of "data science" reflects a broad and ongoing shift in how many teams, organizational leaders, communities of practice, and entire industries create and use knowledge. This class teaches "data science" as practiced by data-intensive knowledge workers but also as it is positioned in historical, organizational, institutional, and societal contexts. Students will gain an appreciation for the technical and intellectual aspects of data science, consider critical questions about how data science is often practiced, and envision ethical and effective data science practice in their current and future organizational roles. The format of the class will be a mix of lecture, discussion, in-class activities, and qualitative and quantitative research assignments.

The course is designed around two high-stakes projects. In the first stage of the course, students will attend the Community Data Science Workshop (CDSW). I am one of the organizers and instructors of this three-week intensive workshop on basic programming and data analysis skills. The first course project is to apply these skills, together with the conceptual material covered in the course so far, to conduct an original data analysis on a topic of the student's interest. The second high-stakes project is a critical analysis of an organization or work team. For this project students will serve as consultants to an organizational unit involved in data science. Through interviews and workplace observations they will gain an understanding of the socio-technical and organizational context of their team. They will then synthesize this understanding with the knowledge they gained from the course material to compose a report offering actionable insights to their team.

This version of the syllabus is designed around a weekly schedule.

Learning Objectives

By the end of this course, students will be able to:

  • Understand what it means to analyze large and complex data effectively and ethically with an understanding of human, societal, organizational, and socio-technical contexts.
  • Take into account the ethical, social, organizational, and legal considerations of data science in organizational and institutional contexts.
  • Combine quantitative and qualitative data to generate critical insights into human behavior.
  • Discuss and evaluate ethical, social, organizational and legal trade-offs of different data analysis, testing, curation, and sharing methods.

Schedule


This page is a work in progress.





Week 1

Introduction to Human Centered Data Science
What is data science? What is human centered? What is human centered data science?
Assignments due


Readings assigned
Homework assigned
  • Reading reflection
  • Attend week 2 of CDSW





Week 2

Ethical considerations
privacy, informed consent and user treatment
Assignments due
  • Week 1 reading reflection


Readings assigned
Homework assigned





Week 3

Reproducibility and Accountability
data curation, preservation, documentation, and archiving; best practices for open scientific research
Assignments due
  • Week 2 reading reflection
  • Attend week 2 of CDSW


Readings assigned
Homework assigned
  • Reading reflection
  • Attend week 3 of CDSW







Week 4

Interrogating datasets
causes and consequences of bias in data; best practices for selecting, describing, and implementing training data


Assignments due


Readings assigned (Read both, reflect on one)
  • Barley, S. R. (1986). Technology as an occasion for structuring: evidence from observations of CT scanners and the social order of radiology departments. Administrative Science Quarterly, 31(1), 78–108.
  • Orlikowski, W. J., & Barley, S. R. (2001). Technology and institutions: what can research on information technology and research on organizations learn from each other? MIS Quarterly, 25(2), 145–165. https://doi.org/10.2307/3250927
Homework assigned
  • Reading reflection






Week 5

Technology and Organizing
Assignments due


Readings assigned
  • Passi, S., & Jackson, S. J. (2018). Trust in Data Science: Collaboration, Translation, and Accountability in Corporate Data Science Projects. Proc. ACM Hum.-Comput. Interact., 2(CSCW), 136:1–136:28. https://doi.org/10.1145/3274405
Homework assigned




Week 6

Data science in Organizational Contexts
Assignments due
Readings assigned (Read both, reflect on one)




Week 7

Introduction to mixed-methods research
Big data vs thick data; integrating qualitative research methods into data science practice; crowdsourcing


Assignments due
  • Reading reflection


Readings assigned (Read both, reflect on one)


Homework assigned
  • Reading reflection







Week 8

Algorithms
algorithmic fairness, transparency, and accountability; methods and contexts for algorithmic audits
Assignments due
  • Reading reflection
  • A4: Final Project Plan


Readings assigned
Homework assigned
  • Reading reflection






Week 9

Data science for social good
Community-based and participatory approaches to data science; Using data science for society's benefit
Assignments due
  • Reading reflection
  • A4: Final project plan


Readings assigned
Homework assigned
  • Reading reflection
Resources






Week 10

User experience and big data
Design considerations for machine learning applications; human centered data visualization; data storytelling
Assignments due
  • Reading reflection


Readings assigned
  • NONE
Homework assigned
  • A5: Final presentation





Week 11

Final presentations
course wrap up, presentation of student projects


Assignments due
  • A5: Final presentation


Readings assigned
  • none!
Homework assigned
  • A6: Final project report (by 11:59pm)





Week 12: Finals Week (No Class Session)

  • NO CLASS
  • A6: FINAL PROJECT REPORT DUE BY 11:59PM


Assignments

Your grade in this class will be assigned through:

  • 9 Reading reflections (25%)
  • 6 Project assignments (50%)
  • Participation (25%)

Assignments comprise weekly in-class activities, weekly reading reflections, written assignments, and programming/data analysis assignments. Weekly in-class reading groups will discuss the assigned readings from the course, and students are expected to have read the material in advance. In-class activities each week are posted to Canvas and may require time outside of class to complete.

Project Assignments 1 and 2 are extensions of exercises from the Community Data Science Workshop and will get you started on your project.

Unless otherwise noted, all assignments are due before 5pm on the day of the following week's class.

Unless otherwise noted, all assignments are individual assignments.

Weekly reading reflections

This course will introduce you to cutting edge research and opinion from major thinkers in the domain of human centered data science. By reading and writing about this material, you will have an opportunity to explore the complex intersections of technology, methodology, ethics, and social thought that characterize this budding field of research and practice.

As a participant in the course, you are responsible for intellectually engaging with all assigned readings and developing an understanding of the ideas discussed in them.

The weekly reading reflections assignment is designed to encourage you to reflect on these works and make connections during our class discussions. To this end, you will be responsible for posting reflections on the previous week's assigned reading before the next class session.

There will generally be multiple readings assigned each week. You are responsible for reading all of them. However, you only need to write a reflection on one reading per week. Unless your instructor specifies otherwise, you can choose which reading you would like to reflect on.

These reflections are meant to be succinct but meaningful. Follow the instructions below, demonstrate that you engaged with the material, and turn the reflection in on time, and you will receive full credit. Late reading reflections will never be accepted.

Instructions
  1. Read all assigned readings.
  2. Select a reading to reflect on.
  3. In at least 2-3 full sentences, answer the question "How does this reading inform your understanding of human centered data science?"
  4. Using full sentences, list at least 1 question that this reading raised in your mind, and say why the reading caused you to ask this question.
  5. Post your reflection to Canvas before the next class session.

You are encouraged, but not required, to make connections between different readings (from the current week, from previous weeks, or other relevant material you've read/listened to/watched) in your reflections.


Project Assignments

This section provides basic descriptions of all scheduled course assignments.

A1: Project proposal and data acquisition

For this assignment you will propose a midterm project and use the skills you have learned in the CDSW to collect or present a dataset. You will turn in a one-page project description that

  • Identifies a dataset for analysis, and what makes it interesting to you.
  • Explains the source of the data and how you obtained it.
  • Describes 2-3 questions that you hope the data can help answer.
  • Includes a table of summary statistics (minimum, maximum, median, and mean values) for variables in your dataset related to these questions

I hope that you find a dataset related to your own interests, such as data from your workplace, community, or any other organization you may be involved in. If you have trouble finding a dataset related to your current interests, HCDS (Fall 2017)/Datasets has examples of freely available datasets that you can use for this project.
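For the summary statistics table, a minimal sketch using only Python's standard library might look like the following. The variable names and values are hypothetical placeholders, not part of any assigned dataset; substitute the variables related to your own questions.

```python
import statistics

def summarize(values):
    """Return the minimum, maximum, median, and mean of a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
    }

# Hypothetical dataset: variable name -> observed values.
dataset = {
    "messages_per_day": [12, 7, 30, 18, 3],
    "response_hours": [1.5, 4.0, 0.5, 2.0],
}

# One row of summary statistics per variable.
summary_table = {name: summarize(values) for name, values in dataset.items()}

print(f"{'variable':<20}{'min':>8}{'max':>8}{'median':>8}{'mean':>8}")
for name, stats in summary_table.items():
    print(f"{name:<20}{stats['min']:>8}{stats['max']:>8}"
          f"{stats['median']:>8}{stats['mean']:>8}")
```

The printed table can be pasted into the one-page description, or the same dictionary can be written out with whatever tool you prefer.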

Rubric

  • Dataset identification: 30%
  • Explanation of data source: 20%
  • Example questions: 30%
  • Summary statistics: 20%


Helpful tips

  • Read all instructions carefully before you begin
  • Read all API documentation carefully before you begin
  • Experiment with queries in the sandbox of the technical documentation for each API to familiarize yourself with the schema and the data
  • Ask questions on Slack if you're unsure about anything
  • When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"

A2: Bias in data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries. Your analysis will consist of a series of tables that show:

  1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
  2. the countries with the highest and lowest proportion of high quality articles about politicians.

You are also expected to write a short reflection on the project that describes how this assignment helps you understand the causes and consequences of bias on Wikipedia.

A repository with a README framework and examples of querying the ORES datastore in R and Python can be found here

Getting the article and population data

The first step is getting the data, which lives in several different places. The Wikipedia dataset can be found on Figshare. Read through the documentation for this repository, then download and unzip it.

The population data is on Dropbox. Download this data as a CSV file (hint: look for the 'Microsoft Excel' icon in the upper right).

Getting article quality predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. For this step, we're using a Wikimedia API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:

  1. FA - Featured article
  2. GA - Good article
  3. B - B-class article
  4. C - C-class article
  5. Start - Start-class article
  6. Stub - Stub-class article

For context, these quality classes are a subset of quality assessment categories developed by Wikipedia editors. If you're curious, you can read more about what these assessment classes mean on English Wikipedia. We will talk about what these categories mean, and how the ORES model predicts which category an article goes into, next week in class. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any article you send it.

The ORES API is configured fairly similarly to the pageviews API we used last assignment; documentation can be found here. It expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "wp10". The sample iPython notebooks for this assignment provide examples of a correctly-structured API query that you can use to understand how to gather your data, and also to examine the query output.

To get predictions for each article in the Wikipedia dataset, you will need to read page_data.csv into Python (or R), and then read through the dataset line by line, using the value of the last_edit column in the API query. If you're working in Python, the csv module will help with this.
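As a rough sketch of the line-by-line read, the snippet below uses csv.DictReader on an in-memory stand-in for page_data.csv. Only the last_edit column name comes from the assignment text; the other column names and all of the values are invented for illustration.

```python
import csv
import io

# Stand-in for page_data.csv. Only the last_edit column name is taken from
# the assignment; the other columns and every value here are made up.
sample_csv = io.StringIO(
    "page,country,last_edit\n"
    "Example politician A,Exampleland,100000001\n"
    "Example politician B,Samplestan,100000002\n"
)

def read_revision_ids(csv_file):
    """Read the file row by row and collect the last_edit value of each row."""
    reader = csv.DictReader(csv_file)
    return [row["last_edit"] for row in reader]

revision_ids = read_revision_ids(sample_csv)
print(revision_ids)
```

With the real file, the same function can be given a file object opened with open("page_data.csv", newline="", encoding="utf-8").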

When you query the API, you will notice that ORES returns a prediction value that contains the name of one category, as well as probability values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for prediction. We'll talk more about what the other values mean in class next week.
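The sketch below shows one way to build a query and pull out only the prediction value. It makes no network request. The endpoint URL and the nesting of the response are assumptions based on the v3 ORES scoring API, and the sample response is invented; compare both against the API documentation and the sample notebooks before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint for the v3 ORES scoring API on English Wikipedia.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"

def build_ores_url(revision_ids, model="wp10"):
    """Build a query URL asking one model to score one or more revisions."""
    query = urlencode({
        "models": model,
        "revids": "|".join(str(rev_id) for rev_id in revision_ids),
    })
    return f"{ORES_ENDPOINT}?{query}"

def extract_predictions(response, model="wp10", wiki="enwiki"):
    """Map each revision ID to its predicted class, or None if no score exists."""
    predictions = {}
    for rev_id, models in response[wiki]["scores"].items():
        score = models.get(model, {}).get("score")
        predictions[rev_id] = score["prediction"] if score else None
    return predictions

# Invented response in the assumed shape: one scored revision, one error.
sample_response = {
    "enwiki": {
        "scores": {
            "100000001": {"wp10": {"score": {
                "prediction": "Stub",
                "probability": {"FA": 0.01, "GA": 0.02, "B": 0.03,
                                "C": 0.04, "Start": 0.1, "Stub": 0.8},
            }}},
            "100000002": {"wp10": {"error": {"message": "could not score"}}},
        }
    }
}

url = build_ores_url(["100000001", "100000002"])
predictions = extract_predictions(sample_response)
print(url)
print(predictions)
```

Returning None for revisions without a score keeps failed lookups visible, so they can be logged or dropped explicitly rather than silently skipped.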

Combining the datasets

Some processing of the data will be necessary! In particular, after retrieving and including the ORES data for each article, you'll need to merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa. You will need to remove the rows that do not have matching data.
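One way to do this merge is an inner join on the country name, sketched below with plain dictionaries. The rows and the population lookup are hypothetical; how you build the lookup depends on the column names in the population file.

```python
def merge_on_country(articles, populations):
    """Join article rows to population figures by country name.

    Returns the merged rows and, separately, the article rows whose country
    has no population entry, so the removed rows can be inspected.
    """
    merged, unmatched = [], []
    for article in articles:
        country = article["country"]
        if country in populations:
            merged.append({**article, "population": populations[country]})
        else:
            unmatched.append(article)
    return merged, unmatched

# Hypothetical article rows that already carry an ORES prediction.
articles = [
    {"country": "Exampleland", "article_name": "Example politician A",
     "revision_id": "100000001", "article_quality": "Stub"},
    {"country": "Nowhereland", "article_name": "Example politician C",
     "revision_id": "100000003", "article_quality": "GA"},
]

# Hypothetical country -> population lookup built from the population file.
populations = {"Exampleland": 10000, "Samplestan": 250000}

merged, unmatched = merge_on_country(articles, populations)
print(merged)
print(unmatched)
```

Countries that appear only in the population lookup (Samplestan here) never produce a row, which covers the "vice versa" case.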

Consolidate the remaining data into a single CSV file which looks something like this:


Column
country
article_name
revision_id
article_quality
population

Note: revision_id here is the same thing as last_edit, which you used to get scores from the ORES API.
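A minimal sketch of writing the consolidated file with csv.DictWriter follows, using the five column names listed above. The row is hypothetical, and the example writes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Column names taken from the table in the assignment text.
FIELDNAMES = ["country", "article_name", "revision_id",
              "article_quality", "population"]

def write_rows(rows, out_file):
    """Write a header line followed by one CSV line per merged row."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical merged row.
rows = [
    {"country": "Exampleland", "article_name": "Example politician A",
     "revision_id": "100000001", "article_quality": "Stub",
     "population": 10000},
]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file object opened with open(path, "w", newline="", encoding="utf-8"), where path is whatever file name your naming convention requires.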

Analysis

Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.

Examples:

  • if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.
  • if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
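The two examples above can be reproduced with a short sketch. The function names are illustrative, not part of any provided code.

```python
def articles_per_population_pct(article_count, population):
    """Articles about politicians as a percentage of the country's population."""
    return 100 * article_count / population

def high_quality_pct(qualities):
    """Percentage of articles whose predicted class is FA or GA."""
    high_quality = sum(1 for quality in qualities if quality in {"FA", "GA"})
    return 100 * high_quality / len(qualities)

# First example: 10 articles for a population of 10,000.
coverage = articles_per_population_pct(10, 10000)
print(coverage)  # 0.1

# Second example: 10 articles, 2 of them FA or GA.
quality = high_quality_pct(["FA", "GA", "B", "C", "C",
                            "Start", "Start", "Stub", "Stub", "Stub"])
print(quality)  # 20.0
```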

Tables

The tables should be pretty straightforward. Produce four tables that show:

  1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
  2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
  3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
  4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

Embed them in the iPython notebook.
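A minimal sketch of the ranking step, assuming you already have a dictionary that maps each country to one of the two percentages. The country names and values are hypothetical, and n is set to 2 only to keep the example short.

```python
def rank_countries(rates, n=10):
    """Return the n highest-ranked and n lowest-ranked (country, rate) pairs."""
    highest = sorted(rates.items(), key=lambda item: item[1], reverse=True)[:n]
    lowest = sorted(rates.items(), key=lambda item: item[1])[:n]
    return highest, lowest

# Hypothetical percentages per country.
rates = {"Exampleland": 0.5, "Samplestan": 0.1, "Testonia": 0.3, "Mockovia": 0.05}

highest, lowest = rank_countries(rates, n=2)
print(highest)
print(lowest)
```

Calling the function once per measure (articles-per-population and high-quality proportion) yields the four lists needed for the four tables.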

Writeup

Write a few paragraphs, either in the README or in the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning. Particular questions you might want to answer:

  1. What biases did you expect to find in the data, and why?
  2. What are the results?
  3. What theories do you have about why the results are what they are?

Submission instructions

  1. Complete your Notebook and datasets in Jupyter Hub.
  2. Create the data-512-a2 repository on GitHub with your code and data.
  3. Complete and add your README and LICENSE file.
  4. Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1244514/assignments/4376107

Required deliverables

A directory in your GitHub repository called data-512-a2 that contains the following files:

  1. 1 final data file in CSV format that follows the formatting conventions.
  2. 1 Jupyter notebook named hcds-a2-bias that contains all code as well as information necessary to understand each programming step, as well as your writeup (if you have not included it in the README) and the tables.
  3. 1 README file in .txt or .md format that contains information to reproduce the analysis, including data descriptions, attributions and provenance information, and descriptions of all relevant resources and documentation (inside and outside the repo) and hyperlinks to those resources, and your writeup (if you have not included it in the notebook). A prototype framework is included in the sample repository
  4. 1 LICENSE file that contains an MIT LICENSE for your code.

Helpful tips

  • Read all instructions carefully before you begin
  • Read all API documentation carefully before you begin
  • Experiment with queries in the sandbox of the technical documentation for the API to familiarize yourself with the schema and the data
  • Explore the data a bit before starting to be sure you understand how it is structured and what it contains
  • Ask questions on Slack if you're unsure about anything. Please email Os to set up a meeting, or come to office hours, if you want to! This time is set aside specifically for you - it is not an imposition.
  • When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"

A3: Crowdwork ethnography

For this assignment, you will go undercover as a member of the Amazon Mechanical Turk community. You will preview or perform Mechanical Turk tasks (called "HITs"), lurk in Turk worker discussion forums, and write an ethnographic account of your experience as a crowdworker, and how this experience changes your understanding of the phenomenon of crowdwork.

The full assignment description is available as a Google doc and as a PDF.

A4: Final project plan

For this assignment, you will write up a study plan for your final class project. The plan will cover a variety of details about your final project. Identify the organization that you will work with, the data you will use, what you will do with the data (e.g. statistical analysis, train a model), what results you expect or intend, and most importantly, why your project is interesting or important (and to whom, besides yourself).

A5: Final project presentation

For this assignment, you will give an in-class presentation of your final project. The goal of this assignment is to demonstrate that you are able to effectively communicate your research questions, methods, conclusions, and implications to your target audience.

A6: Final project report

For this assignment, you will publish the complete code, data, and analysis of your final research project. The goal is to demonstrate that you can incorporate all of the human-centered design considerations you learned in this course and create research artifacts that are understandable, impactful, and reproducible.


Policies

The following general policies apply to this course.

Attendance

As detailed in my page on assessment, attendance in class is expected of all participants. If you need to miss class for any reason, please contact a member of the teaching team ahead of time (email is best). Multiple unexplained absences will likely result in a lower grade or (in extreme circumstances) a failing grade. In the event of an absence, you are responsible for obtaining class notes, handouts, assignments, etc.

Respect

Students are expected to treat each other, and the instructors, with respect. Students are prohibited from engaging in any kind of harassment or derogatory behavior, which includes offensive verbal comments or imagery related to gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, or religion. In addition, students should not engage in any form of inappropriate physical contact or unwelcome sexual attention, and should respect each other's right to privacy with regard to their personal life. In the event that you feel you (or another student) have been subject to a violation of this policy, please reach out to the instructors in whichever form you prefer.

The instructors are committed to providing a safe and healthy learning environment for students. As part of this, students are asked not to wear any clothing, jewelry, or any related medium for symbolic expression which depicts an indigenous person or cultural expression reappropriated as a mascot, logo, or caricature. These include, but are not limited to, iconography associated with the following sports teams:

  1. Chicago Blackhawks
  2. Washington Redskins
  3. Cleveland Indians
  4. Atlanta Braves


Devices in Class

Electronic devices (e.g., phones, tablets, laptops) will not be permitted in class. If you have a documented need to use a device, please contact me ahead of time to let me know. If you do get permission to use a device, I will ask you to sit in the very back of the classroom.

The goal of this policy is to help you stay focused and avoid distractions for yourself and your peers in the classroom. This is really important and turns out to be much more difficult in the presence of powerful computing devices with brightly glowing screens and fast connections to the Internet. For more on the rationale behind this policy, please read Clay Shirky’s thoughtful discussion of his approach to this issue.


Electronic Mail Standards of Conduct

Email communications (and all communications generally) among UW community members should seek to respect the rights and privileges of all members of the academic community. This includes not interfering with university functions or endangering the health, welfare, or safety of other persons. With this in mind, in addition to the University of Washington's Student Conduct Code, I establish the following standards of conduct with respect to electronic communications among students and faculty:

  • If, as a student, you have a question about course content or procedures, please use the online discussion board designed for this purpose. If you have specific questions about your performance, contact me directly.
  • I strive to respond to email communications within 48 hours. If you do not hear from me, please come to my office, call me, or send me a reminder email.
  • Email communications should be limited to occasional messages necessary to the specific educational experience at hand.
  • Email communications should not include any CC-ing of anyone not directly involved in the specific educational experience at hand.
  • Email communications should not include any blind-CC-ing to third parties, regardless of the third party’s relevance to the matter at hand.


Academic integrity and plagiarism

As a University of Washington student, you are expected to practice high standards of academic honesty and integrity. You are responsible for understanding and abiding by UW’s Student Governance Code on Academic Misconduct and the UW’s Administrative Code on Academic Misconduct, and for complying with verbal or written instructions from the professor or TA of this course. Academic misconduct includes plagiarism, which is a serious offense. All assignments will be reviewed for integrity. All rules regarding academic integrity extend to electronic communication and the use of online sources. If you are not sure what constitutes plagiarism, read this overview in addition to UW’s policy statements.

I am committed to upholding the academic standards of the University of Washington’s Student Conduct Code. If I suspect a student violation of that code, I will first engage in a conversation with that student about my concerns. If we cannot successfully resolve a suspected case of academic misconduct through our conversations, I will refer the situation to the Department of Communication advising office, which can then work with the COM Chair to seek further input and, if necessary, move the case up through the College.

While evidence of academic misconduct may result in a lower grade, I will not unilaterally lower a grade without addressing the issue with you first through the process outlined above.

Other academic integrity resources:

Notice: The University has a license agreement with VeriCite, an educational tool that helps prevent or identify plagiarism from Internet resources. Your instructor may use the service in this class by requiring that assignments are submitted electronically to be checked by VeriCite. The VeriCite Report will indicate the amount of original text in your work and whether all material that you quoted, paraphrased, summarized, or used from another source is appropriately referenced.


Disability and accommodations

As part of ensuring that the class is as accessible as possible, the instructors are entirely comfortable with you using whatever form of note-taking method or recording is most comfortable to you, including laptops and audio recording devices. The instructors will do their best to ensure that all slides and scripts/notes are immediately available online after a lecture has concluded. In addition, if asked ahead of time, we can try to record the audio of individual lectures for students who have learning differences that make audiovisual notes preferable to written ones.

If you require additional accommodations, please contact Disabled Student Services: 448 Schmitz, 206-543-8924 (V/TTY). If you have a letter from DSS indicating that you have a disability which requires academic accommodations, please present the letter to the instructors so we can discuss the accommodations you might need in the class. If you have any questions about this policy, reach out to the instructors directly.

For more information on disability accommodations, and how to apply for one, please review UW's Disability Resources for Students.

Assignments and coursework

Grades will be determined as follows:

  • 20% Participation
  • 20% Reading reflections
  • 20% Midterm project
  • 40% Final project

You are expected to produce work in all of the assignments that reflects the highest standards of professionalism. For written documents, this means proper spelling, grammar, and formatting.

Late assignments will not be accepted; if your assignment is late, you will receive a zero score. Again, if you run into an issue that necessitates an extension, please reach out.