Various books, research papers, podcasts, online courses, essays, news articles, and blogs relevant to the coursework.
For more resources, check out Jonathan's 'hcds' tag on Pinboard.
Books
- Payton, Theresa, and Ted Claypoole. Privacy in the age of big data: Recognizing threats, defending your rights, and protecting your family. Rowman & Littlefield, 2014.
- Schutt, Rachel, and Cathy O'Neil. Doing data science: Straight talk from the frontline. O'Reilly Media, Inc., 2013.
- O'Neil, Cathy. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown, 2016. https://weaponsofmathdestructionbook.com/
Podcasts
Blog posts
News articles
Academic papers
Videos and lectures
HCDS workshop position papers
From the 2016 CSCW workshop
- Developing a research agenda for human centered data science
- Aragon et al.
- Big data approaches lack the 'rich detail' of qualitative approaches
- How do we preserve this richness while enjoying the benefits (e.g., scale) of big data approaches?
- "How do we uncover social nuances and consider ethics and values in data use?"
- Note: need to show how this conceptualization connects to the definition of "human centered design"
- Qual challenges: validation, generalizability, extension, verification
- Quant challenges: insights gleaned are often shallow
- 'why bots fight' is an example of this?
- "what happens as qual datasets grow even larger?"
- How do we preserve the richness associated with qualitative approaches in data-driven research?
- How can we not lose the compelling and inspiring stories of individuals?
- Human interpretation of data is always necessary, even in quant.
- Some researchers have incorporated quant methods into qual workflows
- "data science tools that integrate seamlessly into the domain they were designed for have demonstrated the greatest success"
- "human centered design is particularly effective for developing software for the analysis of large datasets."
- Issues for HCDS:
- Sampling
- Selection
- Privacy
- What are the ethical questions raised by the necessity of processing large datasets?
- How should we treat crowdworkers?
- Who owns personal medical data?
- Can design be effectively crowdsourced?
- What policies do we need to develop to protect human rights in the era of "big data"?
- Cited re: the above questions:
- danah boyd and Kate Crawford. 2012. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon.
- France Bélanger and Robert Crossler. 2011. Privacy in the digital age: a review of information privacy research in information systems.
- Sangita Ganesh and Rani Malhotra. 2014. Designing to scale: A human centered approach to designing applications for the Internet of Things.
- Yang Wang, Yun Huang, and Claudia Louis. 2013. Towards A Framework for Privacy-Aware Mobile Crowdsourcing.
- Yang Wang, Huichuan Xia, and Yun Huang. 2016. Examining American and Chinese Internet Users' Contextual Privacy Preferences of Behavioral Advertising.
- Some other key questions:
- Human-centered algorithm design: how do we design machine learning algorithms tailored to human use and understanding?
- Understanding community data: how can we integrate knowledge about communities from their aggregated social data as well as their personal experiences?
- On Usability Analytics and Beyond with Human-Centered Data Science
- Interpretability in Human Centered Data Science
- Can we develop (machine learning?) methods that are simultaneously interpretable, highly predictive, and representationally complex?
- Many domains use predictive models not to predict things (like who wrote this unknown text), but to identify characteristic features of a dataset (for example, the hallmarks of a writer's style)
- Other domains are required by law to be transparent about what the model bases its predictions on
- Further, "for cases where predictive accuracy is the primary concern, the information gained from understanding what a model is learning can be instructive in suggesting new features to include"
- But there is often a tension between the predictive accuracy of a model, the interpretability of the model, and the representational complexity of the model
- Accuracy
- How well does this model perform on test data? (offline evaluation)
- Interpretability:
- What features broadly distinguish class A from B, or
- Why was data point x classified as A?
- Representational complexity
- Models are simplifications. What level of metadata about each datapoint creates the best 'fit' for the task? (bias-variance tradeoff)
- Interpretability depends on the use case: different ends entail very different model designs.
- For lending institutions, interpretability just means enumerating the input features
- To make sure unacceptable variables like race or gender are not included in the model
- Decision trees lend themselves to interpretability, unless they are very deep
- So do (simpler) binary logistic regressions, which weigh features by their positive predictive power
- A random forest is an aggregation of decision trees and is not very interpretable; the same goes for neural networks (a sketch contrasting the two appears after these notes)
- Interpretability is important for problems such as presenting users with rationales for the predictive decisions that impact them
- Giving users rationales helps with collaborative filtering and context-aware computing
- encourages trust
- Gives users control over inferences made about them
- We need both syntactic and semantic descriptions of models for them to be interpretable (model size vs. interactions between features)
- What models are "interpretable" enough to give rise to new knowledge?
- Understanding what predictive models are learning is becoming increasingly important for establishing audit trails, for suggesting and prioritizing hypotheses to test, and for facilitating the general sensemaking process
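A minimal sketch (mine, not from the paper) of the accuracy-vs-interpretability tension described in these notes: a logistic regression exposes a signed weight per named feature, while a random forest yields only aggregate importances and no per-prediction rationale. The dataset and scikit-learn usage are illustrative assumptions.

```python
# Sketch: interpretable vs. less interpretable models on a stock dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Interpretable: each coefficient weighs one named feature's contribution.
logit = LogisticRegression(max_iter=5000).fit(X_train, y_train)
top = sorted(zip(data.feature_names, logit.coef_[0]),
             key=lambda t: abs(t[1]), reverse=True)[:5]
for name, coef in top:
    print(f"{name}: {coef:+.3f}")

# Less interpretable: an ensemble of trees gives aggregate feature
# importances but no human-readable rationale for a single prediction.
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("logit accuracy: ", logit.score(X_test, y_test))
print("forest accuracy:", forest.score(X_test, y_test))
```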
- Two is better than one - a mixed-methods approach to Human-centered data science
- Maddock et al.
- Big data techniques cannot provide the descriptive depth required to understand individual actors
- Qualitative approaches lack the analytical scope to describe and model overarching social mechanisms
- No single methodological approach shows a complete picture of human behavior
- Quantitative and qualitative techniques complement each other, surfacing insights not evident through a single methodological approach
- Researchers have yet to formalize (effective? Ideal?) mixed-method approaches
- Other projects that have coded data and then run stats on the coded data don't describe their methodologies very well
- By labeling rumor-related tweets and then examining them, they noticed that many of these tweets at critical times contained URLs. Graphing URL frequency over time, they found that the curve followed a similar pattern.
- They noticed that misinformation tweets seemed to have lower lexical diversity, and correction tweets higher; they confirmed this with corpus analysis (a toy version of one such measure is sketched after these notes)
- Developed an "uncertainty" code that they applied to the data, and found that uncertainty was a critical predictor of rumoring.
- They observed that "uncertainty" tweets seemed to have distinct lexical patterns
- "each analytical iteration suggested a new measure to observe" and "ultimately we derived multi-dimensional signatures from several relatively simple analyses in order to describe complex human phenomena"
- Scaling Up Qualitative Data Analysis with Interfaces Powered by Interpretable Machine Learning
- Glassman et al.
- We can use machine learning to help scale content analysis
- But the machine learning algorithm's decisions need to be interpretable by the coder, so they can refine the algorithm (or their codes) in order to produce good quality data
- Interpretable machine learning methods communicate the "why" behind their rule-based decisions and may even let the human interact with the algorithm to inject their knowledge or better understand the machine's output
- See second author PhD thesis
- See also "algorithms for interpretable machine learning" talk
- A list of models that are 'interpretable'. This could be useful as an exercise: use one of these models to analyze some data, and create an interface that explains the model's judgements. Or create an interface that explains the ORES model's judgements. Or the trending API? Or some recommendation API? (A toy version of the first idea is sketched after these notes.)
- People need to understand the 'why' of clusters to know whether to trust them or not
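A toy version of the exercise idea above, assuming scikit-learn: train a shallow decision tree over word counts and print its rules, so a human coder can inspect the "why" behind each machine-assigned code. The texts, labels, and tree depth are all hypothetical.

```python
# Sketch: an interpretable classifier for content analysis whose learned
# rules a qualitative coder can read and critique.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

texts = [
    "not sure if this is true, can anyone confirm?",   # uncertainty
    "is this real? someone please verify",             # uncertainty
    "officials confirmed the road is closed",          # statement
    "the mayor announced the closure today",           # statement
]
labels = ["uncertainty", "uncertainty", "statement", "statement"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)

# Print the learned decision rules: the "why" behind each code assignment.
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
```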
- A Human-Centered Approach to Data Privacy - Political Economy, Power, and Collective Data Subjects
- Young
- Conventional approaches to data privacy (de-identification, notice and consent, collection limitation, obfuscation) cannot protect privacy from re-identification or inferential analysis
- Interview based: how do City of Seattle data managers consider privacy in data collection, management, and release?
- Datasets can be more or less identifiable at the subject level, data can be more or less sensitive (potential for privacy harm). These concepts are not sufficient for protecting data privacy at the group level.
- Datasets can be easily combined to re-identify
- Notice and consent approaches place too much responsibility on the subject's ability to decide
- Individual data that is collected, circulated, and then aggregated at a later point can spill beyond its original context
- Differential privacy (see Dwork, 2011)
- Adds statistically calibrated noise so aggregate results stay consistent while individual records are protected (a minimal sketch appears after these notes)
- Folk data practices: trace ways that data subjects thwart attempts to collect data about them
- Why should we analyze these practices? To what purpose?
- Race and social justice impacts of datasets: privacy beyond the individual level
- My example: using bike traffic data to decide which neighborhoods to invest more bike infrastructure in
- Value tensions around for-profit uses of 'open data'
- While inaccurate data might harm users, so too might data that is too accurate
- Cites Scott: accurate data makes subjects legible and amenable to control
- A critical question for HCDS: what activities does a dataset make possible? (unintended consequences, gaming the system)
- Interrogate values behind open data initiatives
- HCDS should attend to…
- Hacks, obfuscations, folk data practices
- Move from individual to collective data subjects
- Analyze the political economy of datasets
- Examine disciplinary power and user exposure
- Create a typology of open data
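A minimal sketch of the differential-privacy idea flagged above (Dwork, 2011): add Laplace noise calibrated to a query's sensitivity so that released aggregates stay useful while any one individual's presence barely changes the output. The epsilon value and record data are illustrative.

```python
# Sketch: the Laplace mechanism for a count query (sensitivity = 1).
import numpy as np

def dp_count(records, epsilon=0.5):
    """Release a count with Laplace noise scaled to sensitivity/epsilon."""
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return len(records) + noise

# Hypothetical records, e.g., bike trips logged in one neighborhood.
trips = list(range(1042))
print("true count: ", len(trips))
print("noisy count:", dp_count(trips, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the noise is "statistically consistent" in the sense that it has zero mean, so repeated aggregate queries remain unbiased.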
- Learning With Data - Designing for Community Introspection and Exploration
- Dasgupta and Hill
- Allow users access to their own online community data
- How do we leverage data science to study, evaluate, and improve the design of systems that enable end-user data science?
- Scratch data is publicly available, but you need familiarity with tools and expertise in methods to use it
- Analyzing one's own data could be both useful and engaging for Scratch users
- You can use 'community data blocks' to discover code blocks you have never tried, or the most common types of blocks you use
- Find all projects shared by a user that use blocks in the 'pen' category
- Identify all followers of a Scratch user who are from India (both queries are sketched in Python after these notes)
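The two queries above, sketched in Python rather than Scratch blocks; the data structures are hypothetical stand-ins for what community data blocks actually expose.

```python
# Sketch: the kinds of community-data queries the paper describes.
projects = [
    {"owner": "ada", "blocks": ["pen down", "move", "pen up"]},
    {"owner": "ada", "blocks": ["say", "move"]},
    {"owner": "lin", "blocks": ["pen down", "turn"]},
]
followers = [
    {"username": "ravi", "country": "India"},
    {"username": "mei", "country": "Taiwan"},
]

# Find all projects shared by 'ada' that use blocks in the 'pen' category.
pen_projects = [p for p in projects
                if p["owner"] == "ada"
                and any(b.startswith("pen") for b in p["blocks"])]

# Identify all followers who are from India.
indian_followers = [f["username"] for f in followers if f["country"] == "India"]

print(len(pen_projects), "pen projects;", indian_followers)
```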
- Data Empathy - A Call for Human Subjectivity in Data Science
- Tanweer et al.
- Data science sometimes claims or seems to displace human subjectivity: machine learning, no need for theory, sensor data over self-reporting
- But we should embrace subjectivity through data empathy: the ability to share and understand different data valences, i.e., the values, intentions, and expectations around data
- It's a naïve assumption that just "inserting a little data" into a decision-making process will improve the decision making.
- Human judgement is not a contaminant to be removed from data, but an inherent ingredient in the construction of datasets
- Data valence: data means different things to different people at different times
- Case studies: different uses of self-tracking health data by consumers, practitioners, companies
- Human-Centered Data Science - Mixed Methods and Intersecting Evidence, Inference, and Scalability
- Leavitt
- Mixed methods: multiple methods applied to the same research question yields results that speak to more than one perspective
- Mixed methods research becomes more challenging in online contexts as 1) datasets and tracking become increasingly large and complex, and 2) computational methods advance far beyond the grasp of traditional qualitative approaches.
- Note: check out J. Burrell's 3-part "ethnography matters" blogs about big data
- A lot of quantitative social science research uses qualitative approaches, but doesn’t talk about them in the publication
- Qual research lets you understand what your data really means: what is missing and why, how to interpret actions.
- Various ways that qual and quant methods can combine for better research
- Some ML approaches (neural networks) make it more difficult to say what features produced the results and whether those features should be questioned
- Qual methods can lead to biased inferences as well