HCDS (Fall 2017)/Bib

Semi-annotated bibliography of research papers, essays, news articles, and blogs relevant to the coursework.

HCDS workshop position papers
From the 2016 CSCW workshop


 * Developing a research agenda for human centered data science: Aragon et al.
 * Big data approaches lack the 'rich detail' of qualitative approaches
 * How do we preserve this richness while enjoying the benefits (what? Scale?) of big data approaches?
 * "How do we uncover social nuances and consider ethics and values in data use?"


 * Note: need to show how this conceptualization connects to the definition of "human centered design"
 * Qual challenges: validation, generalizability, extension, verification
 * Quant challenges: insights gleaned are often shallow
 * 'why bots fight' is an example of this?


 * "what happens as qual datasets grow even larger?"
 * How do we preserve the richness associated with qualitative approaches in data-driven research?
 * How can we not lose the compelling and inspiring stories of individuals?
 * Human interpretation of data is always necessary, even in quant.
 * Some researchers have incorporated quant methods into qual workflows
 * "data science tools that integrate seamlessly into the domain they were designed for have demonstrated the greatest success"
 * "human centered design is particularly effective for developing software for the analysis of large datasets."
 * Issues for HCDS:
 * Sampling
 * Selection
 * Privacy


 * What are the ethical questions raised by the necessity of processing large datasets?
 * How should we treat crowdworkers?
 * Who owns personal medical data?
 * Can design be effectively crowdsourced?
 * What policies do we need to develop to protect human rights in the era of "big data"?
 * Cited re: the above questions:
 * danah boyd and Kate Crawford. 2012. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon.
 * France Bélanger and Robert Crossler. 2011. Privacy in the digital age: A review of information privacy research in information systems.
 * Sangita Ganesh and Rani Malhotra. 2014. Designing to scale: A human centered approach to designing applications for the Internet of Things.
 * Yang Wang, Yun Huang, and Claudia Louis. September 2013. Towards a Framework for Privacy-Aware Mobile Crowdsourcing.
 * Yang Wang, Huichuan Xia, and Yun Huang. 2016. Examining American and Chinese Internet Users' Contextual Privacy Preferences of Behavioral Advertising.


 * Some other key questions:
 * Human-centered algorithm design: how do we design machine learning algorithms tailored to human use and understanding?
 * Understanding community data: how can we integrate knowledge about communities from their aggregated social data as well as their personal experiences?


 * On Usability Analytics and Beyond with Human-Centered Data Science


 * Interpretability in Human Centered Data Science
 * Can we develop (machine learning?) methods that are simultaneously interpretable, highly predictive, and representationally complex?
 * Many domains use predictive models not to predict things (like who wrote this unknown text) but to identify characteristic features of a dataset (for example, the hallmarks of a writer's style).
 * Other domains are required by law to be transparent about what the model bases its predictions on
 * Further, "for cases where predictive accuracy is the primary concern, the information gained from understanding what a model is learning can be instructive in suggesting new features to include"
 * But there is often a tension among the predictive accuracy, the interpretability, and the representational complexity of a model:
 * Accuracy:
 * How well does this model perform on test data? (offline evaluation)
 * Interpretability:
 * What features broadly distinguish class A from B, or
 * Why was data point x classified as A?
 * Representational complexity:
 * Models are simplifications. What level of metadata about each datapoint creates the best 'fit' for the task (bias-variance tradeoff)?


 * Interpretability depends on the use case: different ends entail very different model designs.
 * For lending institutions, interpretability just means enumerating the input features
 * To make sure unacceptable variables like race or gender are not included in the model
 * Decision trees lend themselves to interpretability, unless they have very great depth
 * So do (simpler) binary logistic regressions, which weigh features by their positive predictive power
 * Random forest is an aggregation of decision trees: not very interpretable, same with neural networks
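
As a rough illustration of the point about simpler logistic regressions, the sketch below fits a tiny logistic regression by gradient descent in plain Python and lists the features by the size of their learned weights. The feature names (`income`, `debt`, `noise`), the toy rows, and the helper names `fit_logistic` and `rank_features` are invented for this example and do not come from the paper.

```python
import math

# Hypothetical toy data (not from the paper): each row is
# [income, debt, noise], scaled to 0..1; label 1 = loan approved.
FEATURES = ["income", "debt", "noise"]
ROWS = [
    [1.0, 0.1, 0.5], [0.9, 0.2, 0.1], [0.8, 0.3, 0.9], [0.7, 0.1, 0.3],
    [0.2, 0.9, 0.4], [0.3, 0.8, 0.8], [0.1, 0.7, 0.2], [0.4, 0.9, 0.6],
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, row):
    """Probability of class 1 for a single row."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            err = predict_proba(weights, bias, row) - y
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def rank_features(weights, names):
    """Return (name, weight) pairs, largest absolute weight first."""
    return sorted(zip(names, weights), key=lambda nw: abs(nw[1]), reverse=True)


if __name__ == "__main__":
    weights, bias = fit_logistic(ROWS, LABELS)
    for name, weight in rank_features(weights, FEATURES):
        print(f"{name}: {weight:+.2f}")
```

The sign of each weight shows which way a feature pushes the prediction. The magnitudes are only comparable here because all three toy features share the same 0..1 scale.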


 * Interpretability is important for problems such as presenting users with rationales for the predictive decisions that impact them
 * Giving users rationales helps with collaborative filtering and context-aware computing
 * encourages trust
 * Gives users control over inferences made about them


 * We need both syntactic and semantic descriptions of models for them to be interpretable (model size vs. interactions between features)
 * What models are "interpretable" enough to give rise to new knowledge?
 * Understanding what predictive models are learning is becoming increasingly important for establishing audit trails, for suggesting and prioritizing hypotheses to test, and for facilitating the general sensemaking process


 * Two is better than one - a mixed-methods approach to Human-centered data science: Maddock et al.


 * Big data techniques cannot provide the descriptive depth required to understand individual actors
 * Qualitative approaches lack the analytical scope to describe and model overarching social mechanisms
 * No single methodological approach shows a complete picture of human behavior
 * Quantitative and qualitative techniques complement each other, surfacing insights not evident through a single methodological approach
 * Researchers have yet to formalize (effective? Ideal?) mixed-method approaches
 * Other projects that have coded data and then run stats on the coded data don't describe their methodologies very well
 * By labeling rumor-related tweets and then examining them, the authors noticed that many tweets posted at critical times contained URLs. They then graphed URLs over time and found that the graph looked similar.
 * They noticed that misinformation tweets seemed to have lower lexical diversity and correction tweets higher, and confirmed this with corpus analysis.
 * They developed an "uncertainty" code that they applied to the data, and found that uncertainty was a critical predictor of rumoring.
 * They observed that "uncertainty" tweets seemed to have distinct lexical patterns.
 * "each analytical iteration suggested a new measure to observe" and "ultimately we derived multi-dimensional signatures from several relatively simple analyses in order to describe complex human phenomena"
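
The notes above do not say which lexical-diversity measure Maddock et al. used. One common and simple choice is the type-token ratio (unique words divided by total words). The sketch below assumes that measure; the `tokenize` and `type_token_ratio` helpers and the sample strings are invented for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(texts):
    """Unique tokens divided by total tokens across all texts (0.0 if empty)."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Made-up example strings, not real tweets from the study.
    repetitive = ["breaking news breaking news", "breaking news again"]
    varied = ["officials now say the earlier report was wrong"]
    print(type_token_ratio(repetitive))
    print(type_token_ratio(varied))
```

The type-token ratio tends to fall as the amount of text grows, so a comparison between coded groups is more meaningful when the groups contain similar numbers of tokens.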


 * Scaling Up Qualitative Data Analysis with Interfaces Powered by Interpretable Machine Learning: Glassman et al.
 * We can use machine learning to help scale content analysis
 * But the machine learning algorithm's decisions need to be interpretable by the coder, so they can refine the algorithm (or their codes) in order to produce good quality data
 * Interpretable machine learning methods communicate the "why" behind their rule-based decisions and may even let the human interact with the algorithm to inject their knowledge or better understand machine output(?)
 * See second author PhD thesis
 * See also "algorithms for interpretable machine learning" talk


 * A list of models that are 'interpretable'. This could be useful as an exercise: use one of these models to analyze some data, and create an interface that explains the model's judgements. Or, create an interface that explains the ORES model's judgements. Or the trending API? Or some recommendation API?
 * People need to understand the 'why' of clusters to know whether to trust them or not


 * A Human-Centered Approach to Data Privacy - Political Economy, Power, and Collective Data Subjects: Young
 * Conventional approaches to data privacy (de-identification, notice and consent, collection limitation, obfuscation) cannot protect privacy from re-identification or inferential analysis
 * Interview based: how do City of Seattle data managers consider privacy in data collection, management, and release?
 * Datasets can be more or less identifiable at the subject level, data can be more or less sensitive (potential for privacy harm). These concepts are not sufficient for protecting data privacy at the group level.
 * Datasets can be easily combined to re-identify
 * Notice and consent approaches place too much responsibility on the subject's ability to decide
 * Individual data that is collected, circulated, and then aggregated at a later point can spill
 * Differential privacy (see Dwork, 2011)
 * Produce statistically consistent noise
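
The notes do not spell out how the noise is produced. A minimal sketch, assuming the standard Laplace mechanism for a counting query (sensitivity 1, noise scale 1/epsilon): the function names, the `epsilon` value, and the sample ages below are illustrative assumptions, not taken from Young's paper.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that match predicate.

    Adding or removing one record changes a count by at most 1, so
    Laplace noise with scale 1/epsilon is used (smaller epsilon = more noise).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 67, 70, 18, 52]  # made-up records
    print(private_count(ages, lambda age: age >= 65, epsilon=0.5))
```

Each call returns a different value, but the noise averages out to zero, so aggregate statistics stay close to the truth while any single record's presence is masked.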


 * Folk data practices: trace ways that data subjects thwart attempts to collect data about them
 * Why should we analyze this? To what purpose?


 * Race and social justice impacts of datasets: privacy beyond the individual level
 * My example: using bike traffic data to decide which neighborhoods to invest more bike infrastructure in


 * Values Tensions around for-profit uses of 'open data'
 * While inaccurate data might harm users, so too might too much accurate data
 * Cites Scott: accurate data makes subjects legible and amenable to control
 * A critical question for HCDS: what activities does a dataset make possible? (unintended consequences, gaming the system)
 * Interrogate values behind open data initiatives
 * HCDS should attend to…
 * Hacks, obfuscations, folk data practices
 * Move from individual to collective data subjects
 * Analyze the political economy of datasets
 * Examine disciplinary power and user exposure
 * Create a typology of open data


 * Learning With Data - Designing for Community Introspection and Exploration: Dasgupta and Hill
 * Allow users access to their own online community data
 * How do we leverage data science to study, evaluate, and improve the design of systems that enable end-user data science?
 * Scratch data is publicly available, but you need familiarity with tools and methods expertise to use it
 * Analyzing one's own data could be both useful and engaging for Scratch users
 * You can use 'community data blocks' to discover code blocks you have never tried, or the most common type of blocks you use
 * Find all projects shared by a user that use blocks in the 'pen' category
 * Identify all followers of a Scratch user who are from India


 * Data Empathy - A Call for Human Subjectivity in Data Science: Tanweer et al.
 * Data science sometimes claims or seems to displace human subjectivity: machine learning, no need for theory, sensor data over self-reporting
 * But we should embrace subjectivity through data empathy, which is the ability to share and understand different data valences: the values, intentions, and expectations around data
 * It's a naïve assumption that just "inserting a little data" into a decision-making process will improve the decision making.
 * Human judgement is not a contaminant to be removed from data, but an inherent ingredient in the construction of datasets
 * Data valence: data means different things to different people at different times
 * Case studies: different uses of self-tracking health data by consumers, practitioners, companies


 * Human-Centered Data Science - Mixed Methods and Intersecting Evidence, Inference, and Scalability: Leavitt
 * Mixed methods: multiple methods applied to the same research question yield results that speak to more than one perspective
 * Mixed methods research becomes more challenging in online contexts as 1) datasets and tracking become increasingly large and complex, and 2) computational methods advance far beyond the grasp of traditional qualitative approaches.
 * Note: check out J. Burrell's 3-part "ethnography matters" blogs about big data
 * A lot of quantitative social science research uses qualitative approaches, but doesn’t talk about them in the publication
 * Qual research lets you understand what your data really means: what is missing and why, how to interpret actions.
 * There are various ways that qual and quant methods can be combined to benefit research
 * Some ML approaches (neural networks) make it more difficult to say what features produced the results and whether those features should be questioned
 * Qual methods can lead to biased inferences as well