Community Data Science Workshops (Fall 2014)/Reflections: Difference between revisions

(added structure)
(30 intermediate revisions by the same user not shown)
Line 1: Line 1:
Over three weekends in Fall 2014, a group of volunteers organized the [[Community Data Science Workshops (Fall 2014)]] (CDSW), the first series of four sessions designed to introduce some of the basic tools of programming and analysis of data from online communities to absolute beginners. This version of the [[CDSW]] was held between November 7th and 22nd, 2014, at the University of Washington in Seattle.
:''If you're interested in putting on your own CDSW, you should also see our [[Community Data Science Workshops (Spring 2014)/Reflections|reflections from Spring 2014]].''
 
Over three weekends in Fall 2014, a group of volunteers organized the [[Community Data Science Workshops (Fall 2014)]], the latest in [[CDSW|a series of four-session workshops]] designed to introduce some of the basic tools of programming and analysis of data from online communities to absolute beginners. The [[CDSW (Fall 2014)|Fall 2014 events]] were held between November 7th and 22nd, 2014, at the University of Washington in Seattle.


This page hosts reflections on organization and curriculum and is written for anybody interested in organizing their own CDSW — including the authors!


In general, the mentors and students suggested that the workshops were a huge success. Students suggested that they learned an enormous amount and benefited enormously. Mentors were also generally very excited about running similar projects in the future. That said, we all felt there were many ways to improve on the sessions, which are detailed below.
In general, the mentors and students suggested that the workshops were a huge success. Students suggested that they learned an enormous amount and benefited enormously. Mentors were also generally very excited about running similar projects in the future. That said, we all felt there were many ways to improve on the sessions, which we have detailed below.


If you have any questions or issues, you can contact [[Benjamin Mako Hill]] directly or can email the whole group of mentors at cdsw-au2014-mentors@uw.edu.


== Structure ==


The [[Community Data Science Workshops (Fall 2014)]] consisted of [[Community Data Science Workshops (Fall 2014)#Schedule|four sessions]]:


* '''Session 0 (Friday November 7th)''': [[Community Data Science Workshops (Fall 2014)#Session 0|Setup and Programming Practice]]
* '''Session 1 (Saturday November 8th)''': [[Community Data Science Workshops (Fall 2014)#Session 1|Introduction to Python]]
* '''Session 2 (Saturday November 15th)''': [[Community Data Science Workshops (Fall 2014)#Session 2|Building data sets using web APIs]]
* '''Session 3 (Saturday November 22nd)''':  [[Community Data Science Workshops (Fall 2014)#Session 3|Data analysis and visualization]]


Our organization and the curriculum for Sessions 0 and 1 were originally borrowed from the [http://bostonpythonworkshop.com/ Boston Python Workshop] (BPW) although our curriculum has diverged quite a bit as we've improved it and tailored it to the specific learning goals in our sessions.
'''Scheduling the next workshop''': Not too close to the end of the quarter.


'''Session 0''' (Friday November 7th Evening 6:30-9:30pm)
Session 0 was a three-hour evening session to install software. The other three sessions were day-long (10am to 4pm) and were broken up into the following schedule:
Went smoothly. No one reported “many problems with set-up.” All respondents reported either no problems or a few problems.


Anaconda was key to the smoothness compared to the first workshop series. However, Anaconda is not open source. It reduced issues but was not 100% issue free. When one person’s home directory was named in Chinese, Anaconda got confused. This was fixed by a mentor who changed the path.
* '''Morning, 10am-12:20''': A 2 hour lecture
* '''Lunch, 12:20-1pm'''
* '''Afternoon, 1pm-3:30pm''': Practice working on projects in 3 breakout sessions
* '''Wrap-up, 3:30pm-4pm''': Wrap-up, next steps, and upcoming opportunities


At least one mentor was confused about whether mentees should self-report they’d completed the steps or whether the mentor should verify that the steps were all taken. In future, email mentors ahead of time to let them know.
We collected detailed feedback from users at three points using the following Google forms (these are copies):


Improvements for session 0 for next time.  
* [https://docs.google.com/forms/d/1mJpLwobsSpxz4do99wyZWPc-Z3FMF9NS6IfzTEeOXUc/viewform Application to the workshop]
'''The process people use to flag for mentor help''': We didn’t model enough using sticky notes during lectures early on.
* [https://docs.google.com/forms/d/1rCgNZSJ0tBqIXUgKgtgKlrshVC3I7q6y9of1hL5isPE/viewform After Session 1]
* [https://docs.google.com/forms/d/1RJTTwXe2O_C1ZAtMgWRLGXVc-tRpY76NbvorLg644MQ/viewform After Session 2]
* [https://docs.google.com/forms/d/1-BngUwkEmephM2xLl3Ews2LnopF3sI7hlgYhQK4YJL4/viewform After Session 3]
* [https://docs.google.com/forms/d/1v2gNpPSY3gjJ9G_PZgmjt2YZTBz6XxA6-lLUzDKWfMg/viewform After Session 3 (Unretained)] — Unsurprisingly, perhaps, not a single person filled this out so we will not bother with this in the future.


'''Technology improvements''':  Get less ambiguous sticky notes.  
We used this feedback to both evaluate what worked well and what did not and to get a sense of what students wanted to learn in the next session and which afternoon sessions they might find interesting.


 
== Participants ==
'''Space improvements:'''
Set up/arrange/select the space to facilitate better circulation of mentors.
When mentors can circulate easily things are better for mentees.


We had 30 mentors who attended at least one of the sessions and at least 20 mentors at each session. Many of our mentors were UW students in more technical departments like [https://www.cs.washington.edu/ Computer Science and Engineering] and [https://www.hcde.washington.edu Human Centered Design & Engineering]. Perhaps half of them worked outside of the university as software developers.
'''Streamlining the instructions for set-up for next time.'''


Q: How to reduce the number of steps and the number of operating-system-specific versions?
We had about 150 participants apply to attend the sessions. We selected on programming skill (to ensure that all attendees were complete beginners), enthusiasm, and randomly to maintain a learner to mentor ratio of between 4 and 5. We admitted 80 participants. 58 listed a UW affiliation. Affiliations listed by at least three people include the following:


This time CDSW moved to Powershell from CMD. Powershell doesn’t work well on PCs. People were instructed to “find the Windows mentor.” No one had XP. Next time, we might be able to move away from separate instructions for Linux, Mac, and Windows.
{| class=wikitable
! Department !! Participants
|-
| HCDE || 16
|-
| iSchool || 10
|-
| Communication || 8
|-
| Anthropology || 3
|-
| Alumni || 4
|-
| Undergrad || 3
|-
|}


Consider writing install instructions that do not rely on Anaconda so people have a fully open source option.  
We had two people each who listed their affiliations as Bio- and Health Informatics, the Foster School of Management, Microsoft, and Wikipedia.  


In the first two workshop sequences, no mentees were running Linux. Possibly, in future a Linux workshop would be good. Presently, Linux help/instructions may be moot.
We also had people from Psychology, the City of Seattle, the Low Income Housing Project, Seattle Meshnet, Biochemical Engineering, Bio Physical, Chemical Engineering, Game Studies, Linguistics, College of the Environment, Oceanography, the School of Public Health, UW Bothell, Central Washington University, and many people who did not specify an affiliation. We continue to think that it's important that people who are not doing research but who are part of online communities were in the mix with UW-type researchers. Bringing together researchers and participants in online communities is an important goal, and we would like to work toward more balance in this regard and to increase the amount of non-UW participation.


'''Maintenance errors on the wiki.''' There was a need for several on-the-fly corrections of the instructions and files on the wiki during the workshop.
Retention between sessions 0 and 1 was nearly 100%. Retention between sessions 1 and 2 and between sessions 2 and 3 was roughly 75%, leaving us with perhaps 55-60% retention between session 0 and session 3.
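The overall figure follows from multiplying the step-to-step retention rates. A minimal sketch of that arithmetic, assuming exactly 100% for the first step (the text says "nearly") and exactly 75% for the other two (the text says "roughly"):

```python
# Step-to-step retention, taken from the approximate figures in the text.
# The exact values (1.00, 0.75, 0.75) are assumptions for illustration.
step_retention = [1.00, 0.75, 0.75]  # sessions 0->1, 1->2, 2->3

overall = 1.0
for rate in step_retention:
    overall *= rate

print("Overall retention from Session 0 to Session 3: {:.0%}".format(overall))
```

With these assumed rates, the product lands at the low end of the 55-60% range quoted above.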


Anecdotally, there is a sense that those who are dropping were those who had trouble but who didn’t struggle visibly.


'''Q: How to handle when mentees want to refer back to the workshop material that they experienced?'''
Although our participant pool in [[CDSW (Spring 2014)]] was overwhelmingly female (80-90%), there was close to gender balance in both students and mentors this time around.


A: Create an archive template for the page they are looking at during the workshop.
Once again, quite a large number of the people who applied were already skilled programmers. We're still not exactly sure why these people are applying because we think that the fact that the workshops are for absolute beginners is very clear. Perhaps people just want more exposure to data science?
Each project can be its own namespace as opposed to having event-specific pages.


Mentors should post the code generated in the break-outs. Encourage them to capture the code.  
Once again, the constraint on scaling the workshop was the number of mentors. Every mentor we add means that the workshop can accommodate four more participants.


One suggestion was allowing participants who have some programming skills, especially for the second and third workshops (given predictable rates of retention). There was no consensus among the organizers and mentors on this approach; some preferred getting more newbies and investing more in them.


'''General observation about mentoring:''' Being a mentor is kind of hard, especially being a good mentor. Some steps were skipped in helping mentors that were in place last time.
== Morning Lectures ==


It was hard to tell who was a mentor and who wasn’t.
[[User:Mako|Benjamin Mako Hill]] gave lectures in Sessions 1 and 3. Frances Hocutt gave the lecture in Session 2 and we felt that this was an important step. An important future goal is getting other people to give lectures. Tommy is an obvious choice to do one next time. Different faces, perspectives, and backgrounds are useful to communicate the breadth of interest here. [[User:Mako|Mako]] does not want to be the only one giving these lectures.


Improvement: Help the mentors to be visually identifiable. E.g. Paper them head to foot in sticky notes.
Our biggest challenge with growing the workshops was with physical space for the lectures. Basically, rooms that can hold more than 100 people at UW are almost exclusively lecture halls that make it almost impossible for mentors to physically reach students in order to help them debug and solve problems.


Questions about mentorship:
We reserved a lecture hall that fit 200 people and filled it with 100 students in alternating rows to make it at least possible to reach each person. This worked reasonably well although it was still suboptimal.
How to help the mentors to mentor well?


Suggestion for mentors: Walk around to every single person. Ask, “How are you doing? What are you working on? Show me what you’re doing.”
People continue to want a record of lectures. At the very minimum, we should make sure that we turn on console logging so that we can post this after the lectures. We intended to record lectures but, once again, this got lost in all the crazy preparation for the events.


How much do you help somebody?
== Afternoon Sessions ==


Should there be a page of guidelines for mentors?
Projects are done in breakout sessions in a series of three rooms. The general problem was that we insisted on one teacher per topic and topics were very unequal in their popularity. Next time, we will likely prepare to have multiple teachers in multiple rooms for topics we know will be more popular.


Where is uniformity needed in mentor style and where do we want to encourage diverse approaches?
Several changes we hope to make include:


Let’s have a mentors workshop!  At a bar! With BEER! and PIZZA!
* As we refine this process, we are also interested in trying to select or refine breakout sessions so that they are more closely tailored to individuals and their interests. Next time, we will consider mining the registration for a list of research questions we might use.
* We want to emphasize bringing people back together more often. In particular, we found that bringing people back together to share work several times during each session, and then once at the end to show off achievements or interesting results, was effective. We also need to designate a go-between for each session to remind people to reconvene and to create a program of important or inspiring achievements for presenting to the group at the very end.
* There seemed to be broad interest in examples or projects that are focused on public health and/or epidemiological data.
* We would love to create an afternoon project for Session 3 on basic statistical analysis in Python using scipy, statsmodels, and pandas. At least ten participants would have been enthusiastic to take it.


The pizza party, er, mentor workshop could cover: norms, best practices, goals. Planning, etc.
== Session 0: Python Setup ==


The goal of this session was to get users set up with Python and starting to learn some Python basics. We changed the curriculum originally used by BPW enormously to use Continuum's Anaconda instead of Python directly from [http://python.org python.org]. The result was staggering. Not a ''single person'' reported "many problems with set-up" (i.e., respondents reported either "no problems" or a "few problems").


'''Should only fully open source tools be selected for workshops?'''
A: Our job is not to extoll the virtues of open source. Our job is to help mentees solve their data problem. “We are teaching you how to do things with data that help you achieve your goal.” However, open source tools are desirable.
That said, we had several major concerns:


'''Q: Should we be teaching Python 3?'''
* Anaconda is not free software/open source.
A: Yes, but when?  It may solve some technical issues that are occurring now.  
* Anaconda does not support Python 3 which we'd like to move to.
* Anaconda seems to have at least some remaining internationalization (i18n) bugs. For example, one student had a home directory set to a Chinese string which caused the Anaconda installation to fail at a late stage. This was eventually fixed by a mentor who changed the path by hand.


Additionally, we moved the Windows curriculum away from <code>cmd</code> to using Powershell. This was a huge and unqualified improvement because it meant that <code>ls</code> works and the rest of the curriculum could converge. The only concern was that Powershell is not installed on Windows XP, although ''not a single student had Windows XP''.


'''Student demographics.'''
Changes for next time include:
This time there was more gender balance in both students and mentors.


2/3 of mentees were from UW. Included students from random places including someone who works for the city of Seattle. Many random Wikipedians were there. It's cool that people who are not doing research but are part of online communities were in the mix with the researchers.
* Because it was less necessary, we will deemphasize recruiting mentors to the Friday night session. Many folks were standing around.
* Because Powershell was successful, we're going to try to create a single consolidated set of installation instructions for Windows, Mac OSX, and GNU/Linux.
* We will make it more clear to mentors whether participants should self-report they’d completed the steps or whether the mentor should verify that the steps were all taken (the latter). In future, we will email mentors ahead of time to let them know.
* In a related issue, not everybody loves the checkout step. Maybe there's a way we can make it more fun?
* We need to do a better job of modeling sticky notes so folks use them more effectively. 
* The sticky notes we bought were small and ambiguously colored. We should get large red sticky notes next time.
* We should set up/arrange/select space to facilitate better circulation of mentors. Generally, we found that when mentors can circulate easily things are better for participants.
* We are going to try writing additional installation instructions that do not rely on Anaconda so people have a fully open source option.
* Once again, not a single person outside of the mentor group ran GNU/Linux. We should strongly consider how much effort we want to put into maintaining this part of the curriculum which, to date, has never been used.
* We want to seriously investigate the possibility of moving to Python 3 to try to address lingering Unicode issues.


16 students from HCDE were there, but also a bunch of mentors. They were good mentors.
We also had [[Community Data Science Workshops (Fall 2014)/Reflections#Mentorship|a bunch of general feedback on how we could improve mentorship]] that is particularly relevant to this session.


'''Demographics of Applicants.'''
== Session 1: Introduction to Python ==
Several people applied who are already good at programming. Why do they apply? Maybe they want more exposure to data science?


The goal of this session was to teach the basics of programming in Python. The basic curriculum was originally built off the [[Boston Python Workshop]] curriculum which has been used many times and is well tested. Unsurprisingly, it worked well for us as well.


'''Desired applicants.'''
That said, we made several major changes this time around. The biggest is that we retained only the [[Wordplay]] project. We also created a new project,  [[Baby Names]], that uses Social Security Administration data on the frequency of Baby Names.
The constraint on scaling the workshop is the number of mentors. Every mentor means that the workshop can accommodate four more mentees.  


Is it good to have mentees who have some programming skills along with those who don’t have any? Or is it a better use of the seats to only take those with no programming background?
=== Afternoon sessions ===


Who are the priorities?  Get more of the newbies and invest more in them?
We felt that the new [[Baby Names]] project was excellent and feedback was overwhelmingly positive. Because it includes both dictionaries and lists of names (in the form of <code>.keys()</code> methods), it can do everything that [[Wordplay]] can but it has a much stronger feel of data science to it and, generally, a higher ceiling. Wordplay felt relatively boring.
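To illustrate the point about dictionaries and <code>.keys()</code>, here is a minimal sketch of the kind of question the project can pose. The names and counts are made up for illustration and are not Social Security Administration data; the variable names are also hypothetical and not taken from the project files.

```python
# Made-up sample: name -> number of babies given that name.
# These counts are illustrative only, not real SSA figures.
name_counts = {"Ada": 120, "Grace": 95, "Linus": 40, "Guido": 12}

# .keys() gives the names, which can be treated like a word list
# (the same kind of list the Wordplay project works with).
names = list(name_counts.keys())
starts_with_g = sorted(name for name in names if name.startswith("G"))

# The counts add questions a plain word list cannot answer.
most_common = max(name_counts, key=name_counts.get)
total_babies = sum(name_counts.values())

print(starts_with_g)
print(most_common, total_babies)
```

The first half is a Wordplay-style string question; the second half uses the counts, which is where the "data science" feel comes from.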


'''Improving Retention:'''
Suggestions based on feedback include:
Anecdotally, there is a sense that those who are dropping are those who had more trouble but didn’t struggle visibly.


Q: Would it help with retention if we show people what will happen in the following weeks?
* Do a better job of bringing folks back together to walk through potential solutions to the questions posed in the project rooms.
* Consider simply having two smaller rooms doing [[Baby Names]] and perhaps having one that emphasizes more numeric and math operations.
* Prepare questions beforehand, list them all up front, and let folks choose what to work on.


A: Several mentors say “yes.” We’re doing that, but let’s do more of it.
== Session 2: Learning APIs ==


Pair programming for those who want it might be helpful. Working in groups is another possibility.  
The goal of this session was to describe what web APIs are, how they work (making HTTP requests and receiving data back), how to understand JSON data, and how to use common web APIs from Wikipedia and Twitter.
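As a rough sketch of that request-and-response cycle, the example below builds a query URL for the Wikipedia (MediaWiki) API and then parses a JSON reply. To keep it runnable offline, no HTTP request is actually sent: <code>sample_response</code> is a made-up string shaped roughly like a real reply, and the page ID, user names, and timestamps in it are invented. This is not the code used in the workshop.

```python
import json
from urllib.parse import urlencode

# Parameters for a MediaWiki API query asking for recent revisions of a page.
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Seattle",
    "rvprop": "user|timestamp",
    "rvlimit": 2,
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)

# Made-up reply, shaped roughly like what the API sends back.
# The page ID, users, and timestamps are invented for illustration.
sample_response = """
{"query": {"pages": {"12345": {"title": "Seattle", "revisions": [
  {"user": "ExampleUserA", "timestamp": "2014-11-15T10:00:00Z"},
  {"user": "ExampleUserB", "timestamp": "2014-11-14T09:30:00Z"}
]}}}}
"""

# JSON text becomes nested Python dictionaries and lists.
data = json.loads(sample_response)
editors = []
for page in data["query"]["pages"].values():
    for revision in page["revisions"]:
        editors.append(revision["user"])

print(url)
print(editors)
```

In a live session, the URL would be fetched (for example with <code>requests</code>) and the reply passed through the same parsing steps.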


=== Morning lecture ===


'''Mining research interests/goals.'''
The [[Community Data Science Workshops (Fall 2014)/Day 2 lecture|morning lecture]] was given by Frances Hocutt and it was well received. Unsurprisingly, the example of [http://placekitten.com/ PlaceKitten] as an API was an enormous hit: informative ''and'' cute.
Could we help match up people with similar interests?


Frances used excellent slides which are shared [[Community Data Science Workshops (Fall 2014)/Day 2 lecture|on the wiki page]] and which we will reuse. About half found the lecture either too fast or too slow and about half found the lecture to be just right.


'''How can we support self-directed projects?'''
Since many people felt the lecture was on the slower side, we want to use this time to introduce function definitions. We will also devote a bit less time to review which, because of the one week spacing between sessions, feels less important than it did last time.


Can we give mentees more guidance to support their project interests?
=== Afternoon sessions ===
It’s easier to do that if people are pre-clustered.


Bring up people’s ideas at the end.  
There were three parallel afternoon sessions on '''Twitter''', '''Wikipedia API''' and '''SQL'''. All three were successful and we plan to do some version of all three sessions next round:


The size of the breakout workshops varied and that means different degrees of engagement were feasible.
'''Twitter''':


* Once again, the session had too many people for the room. We should consider splitting it if we have mentors who are comfortable teaching it, and we should try to arrange this ahead of time.
* We should make sure that the advance notice asks everybody to download the project zip file ahead of time. If we're going to do this in class instead, we should set up a short URL to help streamline the process without forcing everybody to head to the wiki for things.
* A bunch of people found the Twitter session too fast so we should try to slow this down.
* Tweepy continues to be both poorly documented and opaque. This opaqueness was a problem, and we may want to create an interface to Tweepy that just gives users raw JSON.


The BIG feedback from the first series of workshops: Bring people back together more often. Bringing people together in the end was effective this time. We need a go between for each session to remind people to reconvene. An emcee.
'''Wikipedia''' workshop:


* In terms of delivery, there was mixed feedback including some excellent feedback and some who felt that it was too detailed and slow. This mirrored some of our feedback from last time. One approach would be to make the Wikipedia room be a designated "slower" room.
'''Flow of the workshops'''
* We should consider graduated challenges that go from less challenging to more and more challenging which might help with the fact there is a range of learning levels.
Q: What degree of dependencies should there be between workshops?


'''Feedback on lectures.'''  
'''SQL workshop''':
About half found Frances’s lecture either too fast or too slow and about half found the lecture to be just right.


Getting other people to do some of the lectures.
Jonathan ran a session on using SQL. Although this was a diversion from the strong Python focus, it was well attended and appreciated by students trying to build up this skill.
Diversity is desired.  Mako does not want to be the only one.  


* Generally the session was very successful and seemed to do a good job of giving people an overview of data science and a way to hook themselves in to it.
* Next session, if we do this again, we should consider integrating Python more closely into this. We may either close the loop in this session or perhaps split into two sessions: (1) introduction to SQL; and (2) using Python to bring data from SQL back into Python (e.g., in Pandas).
* We should consider hosting an open SQL database somewhere.
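One way "closing the loop" between SQL and Python might look is sketched below. It uses Python's built-in <code>sqlite3</code> module and an in-memory database rather than Pandas or a hosted server, and the <code>edits</code> table and its rows are hypothetical; the actual session materials may have used a different database and schema.

```python
import sqlite3

# An in-memory SQLite database stands in for a hosted SQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical table of edits: which editor touched which article.
conn.execute("CREATE TABLE edits (article TEXT, editor TEXT)")
conn.executemany(
    "INSERT INTO edits VALUES (?, ?)",
    [("Seattle", "alice"), ("Seattle", "bob"), ("Python", "alice")],
)

# Let SQL do the counting, then bring the result back into Python.
rows = conn.execute(
    "SELECT article, COUNT(*) AS n FROM edits "
    "GROUP BY article ORDER BY n DESC, article"
).fetchall()
conn.close()

# rows is a list of (article, count) tuples; a dict is handy for lookups.
edits_per_article = dict(rows)
print(edits_per_article)
```

Participants who are comfortable with Pandas could load the same query result into a DataFrame instead of a dictionary.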


'''Selecting workshops for next time.'''
== Session 3: Data Analysis and Visualization ==
Do we need more break-out sessions?  OR do we need to break out best of the break-out sessions? Two mentors thumb wrestle.


Wrestler one: Smaller groups of the same break-out session might be good.  
The goal of the lecture was to walk people through the actual mess of writing code from scratch and focused on a single example of code that builds a dataset from Wikipedia.


Precanned sessions make it easier for new mentors to feel confident and be successful.
In general, goals were clearer this time and the use of Anaconda meant that we could use <code>requests</code> which cleaned up several problems last time and led to more clear code.


Wrestler two: Diversity of projects inspires people to do the kinds of things that people can do with this new knowledge.
One challenge, pointed out in a question at the end of the final lecture, is that we don't actually do very much actual data analysis during the lecture. Next time, we should make this much more clear up front. The reality is that we were doing analysis from the very first day and that where analysis starts and where data cleaning and munging ends can be fluid, fuzzy, and subjective. We should foreground this in the beginning of the lecture or even at the beginning of the workshops.


What else can encourage generative-ness?
=== Afternoon sessions ===
Giving mentees generative moments within sessions and lectures might be empowering. Perhaps, calling out mentees who are doing generative things.


'''Basic statistical analysis''' in Python would be a fun thing to teach (says Mako) and at least ten mentees would be enthusiastic about it.  
We ran two sessions this time.


Some people love R some people don’t. The world goes round.  
An '''analysis with spreadsheets session''' similar to what we taught last time. This was improved and more effective. By the end, many participants were modifying the code to build their own datasets and doing their own visualizations. One student built a time series of edits to articles about death by police and another to articles about the NFL. In both cases, real patterns driven by current events became clearly visible.
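A time series of this kind can be sketched by counting edits per day and writing the result as tab-separated lines that a spreadsheet can open. The timestamps below are made up for illustration, and the approach (a <code>Counter</code> keyed on the date part of each timestamp) is an assumption, not the code the participants actually modified.

```python
from collections import Counter

# Made-up edit timestamps in the ISO 8601 style used by the Wikipedia API.
timestamps = [
    "2014-11-20T08:15:00Z",
    "2014-11-20T17:42:10Z",
    "2014-11-21T09:03:55Z",
    "2014-11-22T12:30:00Z",
    "2014-11-22T12:31:40Z",
    "2014-11-22T23:59:59Z",
]

# The first ten characters of each timestamp are the date (YYYY-MM-DD).
edits_per_day = Counter(ts[:10] for ts in timestamps)

# One tab-separated line per day, in date order, ready for a spreadsheet.
for day in sorted(edits_per_day):
    print(day + "\t" + str(edits_per_day[day]))
```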


We also ran a session on '''matplotlib''' which was taught by two mentors we brought in specifically to teach it but who had limited experience with the CDSW. Some people in the session were lost. Because the mentors who taught it were not at the other sessions, they didn’t go in with a good sense of where the participants were at. In the future, we should loop in teachers better to where the participants are at. For example, we might encourage new mentors to do a practice session with some friendly folks before they let loose.


'''Q: How can we strengthen the relationship between the lectures and the break-outs?'''
Also, next session, we are going to consider using [https://pypi.python.org/pypi/seaborn/0.1 SeaBorn] instead of matplotlib which Tommy seemed excited about.


'''Baby names''' is a good project because it feels data-science-y. Baby Names does everything that '''Word Play''' does but it has the stink of science about it. Next time, let’s have two small rooms doing the exact same thing. Wordplay is kind of boring.
== General Feedback ==
'''Twitter''' had too many people in it. If you ask people to do some steps in advance and not others, mayhem ensues. Next time have them download all resources. A bitly URL that helps people find the download more easily streamlined things.


A bunch of people found the Twitter session way too fast. Tweepy is not well documented. Squeeze the JSON out of it before the mentees have to cope with it. Get the mentors on it beforehand. Yay!
* Generally, there was a sense that we should stop creating pages in the wiki by copying and pasting old stuff. This was the BPW model but it's leading to madness. When we archive an old version of a site, we can use MediaWiki to create links to the old version of the pages (we can install templates from English Wikipedia to help make this easier).
* We should try to schedule the workshop not quite so close to the end of the quarter. The beginning or middle of the quarter should be better for UW students.
* Mentors should post the code generated in the break-out sessions. Encourage them to capture the code created in examples and to post these afterward systematically.
* There was general interest in pair programming or more team-based exercises. We should consider changes along this line.
* There was a need for several on-the-fly corrections of the instructions and files on the wiki during the workshop. Better planning and testing for this will be very useful.


=== Mentorship ===


'''Wikipedia''' workshop. The mentor explained stuff very clearly. That was frustrating for those who didn’t need it, BUT super great for people that wanted/needed a lot of explanation.  
Last time through, most of our observations were focused on improving the experience of attendees and we think we didn't spend as much time on helping mentors have a great experience and helping them prepare effectively. We had many new mentors this round. One general concern was the relative lack of mentor training, especially before the first sessions. We had a series of pieces of feedback on how to improve this.


Graduated challenges in a workshop that go from less challenging to more and more challenging help with the fact there is a range of mentee levels.
* Arrange a pre-CDSW mentors meeting (perhaps a day or two before to go over material) and maybe at a bar or other social environment with beer and pizza. We could use this to set norms, best practices, goals, planning, etc.
* Perhaps meet 15-20 minutes early before Session 0 to get to know each other and go over things.
* Create some easier way to distinguish mentors from students (e.g., t-shirts, buttons, paper them head to foot in sticky notes).
* Send out detailed instructions and emails to mentors, or create pages in this wiki, that detail good mentoring. For example:
** How much should you help? Some. But be careful not to just give away the answer or to focus too much on elegance or technical correctness. Be careful not to overwhelm the learners.
** Explicitly encourage mentors to reach out to students and ask them how things are going by walking around to every single person to ask, “How are you doing? What are you working on? Show me what you’re doing.”


'''SQL workshop'''. Seemed to work really well. Did a good job of giving people an overview of data science and a way to hook themselves in to it. Next session, also do a workshop that closes the loop between SQL and Python. Can we host an open SQL database somewhere?
=== More Projects or Better Projects ===


Once again, we had certain afternoon project sessions that were much more effective than others. One thing we were conflicted about was whether we wanted more break-out sessions or whether we should just use the best of the break-out sessions (perhaps in two rooms).


'''Session 3: AM lecture'''. The goal of the lecture was to walk people through the actual mess of writing code.
Arguments for smaller groups of the best break-out session include:


Maybe the week 2 lecture should introduce APIs and functions.  People thought that week 2 lecture was slow, so adding functions would be good. Functions can be reinforced in the week 2 workshops. Lecture 2 is the earliest that makes sense to introduce functions and the latest. Introduce the idea that code is reusable.  
* Focus on a known good thing.
* Pre-canned sessions make it easier for new mentors to feel confident and be successful.


'''Afternoon of Session 3:'''
Arguments against include:


'''The spreadsheets session.''' People were modifying the code to build their own dataset and did their own visualizations. At least a few people. That was cool!
* Diversity of projects inspires people to do the kinds of things that people can do with this new knowledge.


'''The MatPlotLib session'''. Most people in the session were deeply lost. The mentors who taught it were not at any of the other sessions and therefore didn’t go in with a good sense of where the mentees were at. Several people left and went to other room.  In future, ensure mentor success by having them loop in better to where the mentees are at. Consider next time, encouraging new  mentors do a practice session with some friendly folks before they let loose. Also, next session, consider using SeaBorn instead of MatPlotLib.
We should pursue other ways to encourage creativity with code. For example, giving participants creative/flexible moments within sessions and lectures might be empowering in similar ways. We can also continue to call out participants who are doing creative things.


== Budget ==


'''The ethnographers get the last word:'''
We spent a total of $3280 on the CDSW. About $350 of this funded food and refreshments during post-session meetings among the mentors, and about $280 was spent on coffee.
Some observations about the culture of mentoring from a first time mentor: There are some distinct values that came through strongly. There is a clear vision of empowerment through programming. The degree of inclusivity is impressive. The culture of feedback, iteration, and reflection was really surprising such as the amount of effort that goes into improving the materials and the teaching. As is the way that other organizations are able to (and are) using the materials.  The way that this is building the community. For example, how mentees are organizing their own meet-ups (though that could be encouraged even more). 


The pragmatism of what is taught demonstrates a clear value. It would be helpful to make sure that all mentors are clear that part of what is expected of them they give pragmatic coaching. That is they should lead mentees to something that works rather than telling them what an expert would do.
The rest (the large majority) was spent on food. Because we were better able to model retention this time around, we did a much better job of ordering the "right" amount of food. We ordered:


== Mako's Raw Notes ==
* Session 1: Pizza from Jet City Pizza
* Session 2: Indian (four entrees) from Jewel of India
* Session 3: Greek food (e.g., salad, hummus, spinach pies, souvlaki) from Costas


* general
Because [[Mako]] did the ordering, everybody ate vegetarian. At least one person complained about the lack of meat in Session 2 (but seemed to be confused into thinking it was present in Session 1).


anaconda solved problems
<!--  
 
'''The ethnographers get the last word:'''
- next time recruit fewer mentors for the first session
Some observations about the culture of mentoring from a first time mentor: There are some distinct values that came through strongly. There is a clear vision of empowerment through programming. The degree of inclusivity is impressive. The culture of feedback, iteration, and reflection was really surprising such as the amount of effort that goes into improving the materials and the teaching. As is the way that other organizations are able to (and are) using the materials. The way that this is building the community. For example, how participants are organizing their own meet-ups (though that could be encouraged even more).
 
sticky notes didn't work as well this time
 
- especially during the lectures
- we did this better last time
 
-> having better sticky notes would have been helpful
 
rooms:
 
maybe get the odegaard room
architecture of the space has quickly become the limiting factor
 
checkout
- not everybody loves the checkout, maybe there's a way we can make it more fun?
 
communicate the whole setup process to the mentors ahead of time

-> maybe streamline the process
 
-> finding the directory continues to be hard
 
 
 
we moved the curriculum from cmd to powershell
 
- windows xp is broke now. make sure you have a person with xp skills on hand

not a single person in our session had xp
 
we can move away from 3 separate installations in the setup information.
 
- everybody can use zip instead of a zip/tar.gz both
 
 
maybe we can consolidate the wiki pages into a single page which will be much easier to install and keep updated in the future

generally, let's stop copying and pasting new stuff into the wiki. when we archive the old version, we can create links to the old version of the wiki pages (install the templates from english wikipedia)
 
get rid of pages that are event specific
 
* friday evening
 
better material/training and information for mentors on what to expect
 
mentors should meet 15-20 minutes early to get to know each other and go over things

- maybe t-shirts, buttons, etc. or something to distinguish mentors

- encourage people to reach out
 
topics to cover:
 
how much should you help? (not too much)
 
anaconda
 
- non-free and we're unhappy with that
 
- linux seems like we might actually want ot do sidewizse but it does work
 
- if something fully free and almost as good comes along, we'll use it
 
-> write installation instructions for linux
 
3 people who used it out of 80 had problems
 
-> anaconda choked on a person's unicode path because the user's homedir was in simplified chinese

broader unicode support won't be fixed until we can move to python3 and we still seem a little while away from that
 
 
 
* demographics
 
people come in: departments? maybe build a table?
 
* org suggestions
 
let people join in the later session

letting people skip #1 could be useful

-> maybe we can accept people afterwards

-> alternatively, we can try to accept more newbies and improve retention
 
mixed feelings
 
 
 
 
 
 
ways to improve and retain people
 
-> lay out what we're going to do in the next sessions
 
go to show why are learning things up front
 
focus on broad research questions
 
pair programming?
encouraging people to work in teams or with others on problems they suggest
 
next time maybe mine the registration for a list of research questions
 
note:
 
next time make it explicit that folks can work in groups
 
tip: introduce mentors to everybody very clearly
 
introductions would have been good but are hard to do
 
* session 1

bring folks back together to go over things

- post examples of code used in the lectures

- create code base

- turn on logging in the console and post it after the lecture
 
mentor workshop:
 
- get people together before
encourage people to get involved maybe  bar meetup
 
- track diversity of people along more dimensions
 
- the sql workshop was well received although slightly mixed in terms of feedback

more breakout sessions next time
colorwall was gone and nobody missed it
 
* babynames
 
- try to integrate more year
- huge success
- may split rooms into two baby names
 
 
 
list questions up front and let folks choose what to work on and what to bring back together
 
generally:
 
- note places to bring folks back together
 
*  session 3
 
generally:
 
showcase what students have accomplished and places people can change things and do things differently

e.g., the ferguson thing with the example from ha=rry party
 
strong connection between the lecture and the introduction
 
-> more connections and takeaways to emphasize the session more clearly
 
how to tap mentors on topics more effectively
 
wordplay
 
- kinda boring

next time

- public health and epi data session

end of semester was too late. maybe have it early next year
 
twitter:
 
 
- have people do the setup ahead of time
 
-> that was clear ahead of time and it happened in the beginning of class. either fix the instruction and make sure that everybody is doing the same thing

speed was an issue

the opaqueness of tweepy was a problem.. option to create a version of tweepy that just gives you json

or miku or michael for details on how to do that
 
dharma might be able to do this.
 
 
sql session:
 
- maybe split this into two session next time
 
- merge in some more python this time
 
#1 intro into sql
 
#2 using python to grab data and bring in python and pandas
 
wikipedia
 
- too slow
 
we can do it faster
 
lecture
*stress defining functions more and earlier.. maybe in the first project and certainly in session #2 so we can use it in the afternoon projects and tweepy
 
 
 
session 3:
 
show and tell at the end was very effective

we need a designated mc who can go between rooms
 
bring people up to the
 
matplot lib


- maybe replace it with seaborn?
The pragmatism of what is taught demonstrates a clear value. It would be helpful to make sure that all mentors are clear that part of what is expected of them they give pragmatic coaching. That is they should lead participants to something that works rather than telling them what an expert would do.
- tommy will teach it


idiomatic python
-->


talk to chris to try to fix those things

Revision as of 04:37, 27 December 2014

If you're interested in putting on your own CDSW, you should also see our reflections from Spring 2014.

Over three weekends in Fall 2014, a group of volunteers organized the Community Data Science Workshops (Fall 2014), the latest in a series of four-session workshops designed to introduce some of the basic tools of programming and analysis of data from online communities to absolute beginners. The Fall 2014 events were held between November 7th and 22nd, 2014 at the University of Washington in Seattle.

This page hosts reflections on organization and curriculum and is written for anybody interested in organizing their own CDSW — including the authors!

In general, the mentors and students suggested that the workshops were a huge success. Students suggested that they learned an enormous amount and benefited enormously. Mentors were also generally very excited about running similar projects in the future. That said, we all felt there were many ways to improve on the sessions, which we have detailed below.

If you have any questions or issues, you can contact Benjamin Mako Hill directly or can email the whole group of mentors at cdsw-au2014-mentors@uw.edu.

Structure

The Community Data Science Workshops (Fall 2014) consisted of four sessions:

Our organization and the curriculum for Sessions 0 and 1 were originally borrowed from the Boston Python Workshop (BPW) although our curriculum has diverged quite a bit as we've improved it and tailored it to the specific learning goals in our sessions.

Session 0 was a three-hour evening session to install software. The other three sessions were day-long sessions (10am to 4pm) broken up into the following schedule:

  • Morning, 10am-12:20: A 2 hour lecture
  • Lunch, 12:20-1pm
  • Afternoon, 1pm-3:30pm: Practice working on projects in 3 breakout sessions
  • Wrap-up, 3:30pm-4pm: Wrap-up, next steps, and upcoming opportunities

We collected detailed feedback from users at three points using the following Google forms (these are copies):

We used this feedback to both evaluate what worked well and what did not and to get a sense of what students wanted to learn in the next session and which afternoon sessions they might find interesting.

Participants

We had 30 mentors who attended at least one of the sessions and at least 20 mentors at each session. Many of our mentors were UW students in more technical departments like Computer Science and Engineering and Human Centered Design & Engineering. Perhaps half of them worked outside of the university as software developers.

We had about 150 participants apply to attend the sessions. We selected on programming skill (to ensure that all attendees were complete beginners) and enthusiasm, and then randomly, to maintain a learner to mentor ratio of between 4 and 5. We admitted 80 participants. 58 listed a UW affiliation. Affiliations listed by at least three people include the following:

Department Participants
HCDE 16
iSchool 10
Communication 8
Anthropology 3
Alumni 4
Undergrad 3

We had two people each who listed their affiliations as Bio- and Health Informatics, the Foster School of Management, Microsoft, and Wikipedia.

We also had people from Psychology, the City of Seattle, the Low Income Housing Project, Seattle Meshnet, Biochemical Engineering, Bio Physical, Chemical Engineering, Game Studies, Linguistics, College of the Environment, Oceanography, the School of Public Health, UW Bothell, Central Washington University, and many people who did not specify an affiliation. We continue to think that it's important that people who are not doing research but who are part of online communities were in the mix with UW-type researchers. Bringing together researchers and participants in online communities is an important goal, and we would like to work toward more balance in this regard and to increase the amount of non-UW participation.

Retention between sessions 0 and 1 was nearly 100%. Retention between sessions 1 and 2 and sessions 2 and 3 was roughly 75%, leaving us with perhaps 55-60% retention between session 0 and session 3.
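
As a rough check on these figures, the cumulative retention can be sketched in Python. The per-step rates below are the approximate values quoted above (nearly 100%, then roughly 75% twice), so the result is an estimate and not a measured attendance number.

```python
# Approximate step-to-step retention rates taken from the text above.
# These are rounded estimates, not exact attendance counts.
step_retention = {
    "session 0 -> 1": 1.00,  # "nearly 100%"
    "session 1 -> 2": 0.75,  # "roughly 75%"
    "session 2 -> 3": 0.75,  # "roughly 75%"
}


def cumulative_retention(rates):
    """Multiply per-step retention rates to get overall retention."""
    overall = 1.0
    for rate in rates:
        overall *= rate
    return overall


overall = cumulative_retention(step_retention.values())
print(f"Estimated retention from session 0 to session 3: {overall:.0%}")
```

With these rounded inputs the estimate lands at the low end of the 55-60% range reported above.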

Anecdotally, there is a sense that those who are dropping were those who had trouble but who didn’t struggle visibly.

Although our participant pool in CDSW (Spring 2014) was overwhelmingly female (80-90%), there was close to gender balance in both students and mentors this time around.

Once again, quite a large number of the people who applied were already skilled programmers. We're still not exactly sure why these people are applying because we think that the fact that the workshops are for absolute beginners is very clear. Perhaps people just want more exposure to data science?

Once again, the constraint on scaling the workshop was the number of mentors. Every mentor we add means that the workshop can accommodate four more participants.

One suggestion was allowing participants who have some programming skills, especially for the second and third workshops (given predictable rates of retention). There was no consensus among the organizers and mentors on this approach; some preferred admitting more newbies and investing more in them.

Morning Lectures

Benjamin Mako Hill gave lectures in Session 1 and 3. Frances Hocutt gave the lecture in Session 2 and we felt that this was an important step. An important future goal is getting other people to give lectures. Tommy is an obvious choice to do one next time. Different faces, perspectives, and backgrounds are useful to communicate the breadth of interest here. Mako does not want to be the only one giving these lectures.

Our biggest challenge with growing the workshops was with physical space for the lectures. Basically, rooms that can hold more than 100 people at UW are almost exclusively lecture halls that make it almost impossible for mentors to physically reach students in order to help them debug and solve problems.

We reserved a lecture hall that fit 200 people and filled it with 100 students in alternating rows to make it at least possible to reach each person. This worked reasonably well although it was still suboptimal.

People continue to want a record of lectures. At the very minimum, we should make sure that we turn on console logging so that we can post this after the lectures. We intended to record lectures but, once again, this got lost in all the crazy preparation for the events.

Afternoon Sessions

Projects are done in breakout sessions in a series of three rooms. The general problem was that we insisted on one teacher per topic and topics were very unequal in their popularity. Next time, we will likely prepare to have multiple teachers for multiple rooms on topics we know will be more popular.

Several changes we hope to make include:

  • As we refine this process, we are also interested in selecting or refining breakout sessions so that they are more closely tailored to individuals and their interests. Next time, we will consider mining the registration for a list of research questions we might use.
  • We want to emphasize bringing people back together more often. In particular, we found that bringing people back together to share work several times during each session, and then once at the end to show off achievements or interesting results, was effective. We also need to designate a person to go between rooms for each session to remind people to reconvene and to create a program of important or inspiring achievements for presenting to the group at the very end.
  • There seemed to be broad interest in examples or projects that are focused on public health and/or epidemiological data.
  • We would love to create an afternoon project for Session 3 on basic statistical analysis in Python using scipy, statsmodels, and pandas. At least ten participants would have been enthusiastic to take it.

Session 0: Python Setup

The goal of this session was to get users set up with Python and starting to learn some Python basics. We changed the curriculum originally used by BPW enormously to use Continuum's Anaconda instead of Python directly from python.org. The result was staggering. Not a single person reported "many problems with set-up" (i.e., respondents reported either "no problems" or a "few problems").

That said, we had several major concerns:

  • Anaconda is not free software/open source.
  • Anaconda does not support Python 3 which we'd like to move to.
  • Anaconda seems to have at least some remaining i18n bugs. For example, one student had a home directory set to a Chinese string which caused the Anaconda installation to fail at a late stage. This was eventually fixed by a mentor who changed the path by hand.

Additionally, we moved the Windows curriculum away from cmd to using Powershell. This was a huge and unqualified improvement because it meant that ls works and the rest of the curriculum could converge. The only concern was that Powershell is not installed on Windows XP, although not a single student had Windows XP.

Changes for next time include:

  • Because it was less necessary, we will deemphasize recruiting mentors to the Friday night session. Many folks were standing around.
  • Because Powershell was successful, we're going to try to create a single consolidated set of installation instructions for Windows, Mac OSX, and GNU/Linux.
  • We will make it more clear to mentors whether participants should self-report they’d completed the steps or whether the mentor should verify that the steps were all taken (the latter). In future, we will email mentors ahead of time to let them know.
  • In a related issue, not everybody loves the checkout step. Maybe there's a way we can make it more fun?
  • We need to do a better job of modeling sticky notes so folks use them more effectively.
  • The sticky notes we bought were small and of an ambiguous color. We should get large red sticky notes next time.
  • We should set up/arrange/select space to facilitate better circulation of mentors. Generally, we found that when mentors can circulate easily things are better for participants.
  • We are going to try writing additional installation instructions that do not rely on Anaconda so people have a fully open source option.
  • Once again, not a single person outside of the mentor group ran GNU/Linux. We should strongly consider how much effort we want to put into maintaining this part of the curriculum which, to date, has never been used.
  • We want to seriously investigate the possibility of moving to Python 3 to try to address lingering Unicode issues.

We also had a bunch of general feedback on how we could improve mentorship that is particularly relevant to this session.

Session 1: Introduction to Python

The goal of this session was to teach the basics of programming in Python. The basic curriculum was originally built off the Boston Python Workshop curriculum which has been used many times and is well tested. Unsurprisingly, it worked well for us too.

That said, we made several major changes this time around. The biggest is that we retained only the Wordplay project. We also created a new project, Baby Names, that uses Social Security Administration data on the frequency of Baby Names.

Afternoon sessions

We felt that the new Baby Names project was excellent and feedback was overwhelmingly positive. Because it includes both dictionaries and lists of names (in the form of .keys() methods), it can do everything that Wordplay can but it has a much stronger feel of data science to it and, generally, a higher ceiling. Wordplay felt relatively boring.
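
To illustrate why a dictionary of names covers the same ground as a word list, here is a minimal sketch. The names and counts are invented for illustration and are not Social Security Administration data, and the variable names are assumptions and not the project's actual code.

```python
# Hypothetical name counts for illustration only; the real project reads
# Social Security Administration data files, which are not reproduced here.
name_counts = {
    "Ada": 120,
    "Grace": 340,
    "Alan": 95,
    "Linus": 60,
}

# .keys() gives the names, which can be treated like a list of words
# (the kind of thing Wordplay does with its word list).
names = sorted(name_counts.keys())
names_starting_with_a = [name for name in names if name.startswith("A")]

# The dictionary values add the "data science" angle: counts to compare.
most_common = max(name_counts, key=name_counts.get)
total_babies = sum(name_counts.values())

print(names_starting_with_a)
print(most_common, total_babies)
```

The list-style questions use only the keys, while the counting questions use the values, which is where the higher ceiling comes from.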

Suggestions based on feedback include:

  • Do a better job of bringing folks back together to walk through potential solutions to the questions posed in the project rooms.
  • Consider simply having two smaller rooms doing Baby Names and perhaps having one that emphasizes more numeric and math operations.
  • Prepare questions beforehand, list them all up front, and let folks choose what to work on.

Session 2: Learning APIs

The goal of this session was to describe what web APIs were, how they worked (making HTTP requests and receiving data back), how to understand JSON data, and how to use common web APIs from Wikipedia and Twitter.
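
The JSON half of that goal can be sketched without a network connection. In the sketch below, a hand-written string stands in for the body of an HTTP response; the field names are hypothetical and do not follow the schema of the Wikipedia or Twitter APIs.

```python
import json

# A hand-written JSON document standing in for an API response.
# The field names here are hypothetical, not a real API's schema.
response_text = """
{
  "query": "Seattle",
  "results": [
    {"title": "Seattle", "length": 1200},
    {"title": "Seattle Center", "length": 450}
  ]
}
"""

# json.loads turns JSON text into Python dictionaries and lists.
data = json.loads(response_text)

titles = [item["title"] for item in data["results"]]
longest = max(data["results"], key=lambda item: item["length"])

print(titles)
print(longest["title"])
```

Once the text is parsed, everything else is ordinary dictionary and list work of the kind covered in Session 1.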

Morning lecture

The morning lecture was given by Frances Hocutt and it was well received. Unsurprisingly, the example of PlaceKitten as an API was an enormous hit: informative and cute.

Frances used excellent slides which are shared on the wiki page and which we will reuse. About half found the lecture either too fast or too slow and about half found the lecture to be just right.

Since many people felt the lecture was on the slower side, we want to use this time to introduce function definitions. We will also devote a bit less time to review which, because of the one week spacing between sessions, feels less important than it did last time.

Afternoon sessions

There were three parallel afternoon sessions on Twitter, Wikipedia API and SQL. All three were successful and we plan to do some version of all three sessions next round:

Twitter:

  • Once again, the session had too many people for the room and we should consider splitting it if we have mentors who are comfortable teaching it and we should try to arrange this ahead of time.
  • We should be careful to make sure that the advance notice asks everybody to download the project zip file ahead of time. If we're going to do this in class instead, we should set up a short URL to help streamline the process without forcing everybody to head to the wiki for things.
  • A bunch of people found the Twitter session too fast so we should try to slow this down.
  • TweePy continues to be both poorly documented and opaque. The opaqueness of TweePy was a problem and we may want to create an interface to TweePy that just gives users raw JSON.

Wikipedia workshop:

  • In terms of delivery, there was mixed feedback including some excellent feedback and some who felt that it was too detailed and slow. This mirrored some of our feedback from last time. One approach would be to make the Wikipedia room be a designated "slower" room.
  • We should consider graduated challenges that go from less challenging to more and more challenging which might help with the fact there is a range of learning levels.

SQL workshop:

Jonathan ran a session on using SQL. Although this was a diversion from the strong Python focus, it was well attended and appreciated by students trying to build up this skill.

  • Generally the session was very successful and seemed to do a good job of giving people an overview of data science and a way to hook themselves in to it.
  • Next session, if we do this again, we should consider integrating Python more closely into this. We may either close the loop in this session or perhaps split into two sessions: (1) introduction to SQL; and (2) using Python to bring data back into Python (e.g., in Pandas).
  • We should consider hosting an open SQL database somewhere.
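
One way that closing the loop between SQL and Python might look is sketched below using the sqlite3 module from the standard library. The table, its columns, and its rows are made up for the example; this is not the material used in the session.

```python
import sqlite3

# In-memory database with a hypothetical "edits" table; the table and
# column names are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edits (page TEXT, editor TEXT, bytes_changed INTEGER)")
conn.executemany(
    "INSERT INTO edits VALUES (?, ?, ?)",
    [
        ("Seattle", "alice", 120),
        ("Seattle", "bob", -30),
        ("Python", "alice", 45),
    ],
)

# Step 1 (SQL): aggregate inside the database.
rows = conn.execute(
    "SELECT page, COUNT(*) AS n_edits FROM edits GROUP BY page ORDER BY page"
).fetchall()

# Step 2 (Python): bring the results back as ordinary Python data.
edits_per_page = dict(rows)
conn.close()

print(edits_per_page)
```

If pandas is installed, pandas.read_sql_query could load the same query result into a DataFrame instead of a dictionary.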

Session 3: Data Analysis and Visualization

The goal of the lecture was to walk people through the actual mess of writing code from scratch and focused on a single example of code that builds a dataset from Wikipedia.

In general, goals were clearer this time and the use of Anaconda meant that we could use requests which cleaned up several problems last time and led to more clear code.

One challenge, pointed out in a question at the end of the final lecture, is that we don't actually do very much actual data analysis during the lecture. Next time, we should make this much more clear up front. The reality is that we were doing analysis from the very first day and that where analysis starts and where data cleaning and munging ends can be fluid, fuzzy, and subjective. We should foreground this in the beginning of the lecture or even at the beginning of the workshops.

Afternoon sessions

We ran two sessions this time.

An analysis with spreadsheets session similar to what we taught last time. This was improved and more effective. By the end, many participants were modifying the code to build their own datasets and doing their own visualizations. One student built a time series of edits to articles about death by police and another to articles about the NFL. In both cases, real patterns driven by current events became clearly visible.
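
A time series like the ones described here can be built by counting edits per day. The sketch below uses a short, invented list of timestamps and a hypothetical helper named edits_per_day; it is not the code from the session.

```python
from collections import Counter
from datetime import datetime

# Hypothetical revision timestamps in ISO 8601 form; in the session these
# would come from the dataset built from Wikipedia, not from a literal list.
timestamps = [
    "2014-11-20T08:15:00Z",
    "2014-11-20T21:40:00Z",
    "2014-11-21T03:05:00Z",
    "2014-11-22T04:37:00Z",
    "2014-11-22T17:02:00Z",
    "2014-11-22T23:59:00Z",
]


def edits_per_day(stamps):
    """Count how many timestamps fall on each calendar day."""
    days = [
        datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").date().isoformat()
        for stamp in stamps
    ]
    return Counter(days)


series = edits_per_day(timestamps)

# Print one "date,count" row per day, ready to paste into a spreadsheet.
for day in sorted(series):
    print(f"{day},{series[day]}")
```

The comma-separated rows can be opened directly in a spreadsheet and charted there.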

We also ran a session on matplotlib which was taught by two mentors we brought in specifically to teach it but who had limited experience with the CDSW. Some people in the session were lost. Because the mentors who taught it were not at the other sessions, they didn’t go in with a good sense of where the participants were at. In the future, we should loop in teachers better to where the participants are at. For example, we might encourage new mentors to do a practice session with some friendly folks before they let loose.

Also, next session, we are going to consider using SeaBorn instead of matplotlib which Tommy seemed excited about.

General Feedback

  • Generally, there was a sense that we should stop creating pages in the wiki by copying and pasting old stuff. This was the BPW model but it's leading to madness. When we archive an old version of a site, we can use MediaWiki to create links to the old version of the pages (we can install templates from English Wikipedia to help make this easier).
  • We should try to schedule the workshop not quite so close to the end of the quarter. The beginning or middle of the quarter should be better for UW students.
  • Mentors should post the code generated in the break-out sessions. Encourage them to capture the code created in examples and to post these afterward systematically.
  • There was general interest in pair programming or more team-based exercises. We should consider changes along this line.
  • There was a need for several on-the-fly corrections of the instructions and files on the wiki during the workshop. Better planning and testing for this will be very useful.

Mentorship

Last time through, most of our observations were focused on improving the experience of attendees and we think we didn't spend as much time on helping mentors have a great experience and helping them prepare effectively. We had many new mentors this round. One general concern was the relative lack of mentor training, especially before the first sessions. We had a series of pieces of feedback on how to improve this.

  • Arrange a pre-CDSW mentors meeting (perhaps a day or two before, to go over material), maybe at a bar or other social environment with beer and pizza. We could use this to set norms, best practices, goals, planning, etc.
  • Perhaps meet 15-20 minutes early before Session 0 to get to know each other and go over things.
  • Create some easier way to distinguish mentors from students (e.g., t-shirts, buttons, paper them head to foot in sticky notes).
  • Send out detailed instructions and emails to mentors, or create pages in this wiki, that detail good mentoring. For example:
    • How much should you help? Some. But be careful not to just give away the answer or to focus too much on elegance or technical correctness. Be careful not to overwhelm the learners.
    • Explicitly encourage mentors to reach out to students and ask them how things are going by walking around to every single person to ask, “How are you doing? What are you working on? Show me what you’re doing.”

More Projects or Better Projects

Once again, we had certain afternoon project sessions that were much more effective than others. One thing we were conflicted about was whether we wanted more break-out sessions or whether we should just use the best of the break-out sessions (perhaps in two rooms).

Arguments for smaller groups of the best break-out session include:

  • Focus on a known good thing.
  • Pre-canned sessions make it easier for new mentors to feel confident and be successful.

Arguments against include:

  • Diversity of projects inspires people to do the kinds of things that people can do with this new knowledge. 


We should pursue other ways to encourage creativity with code. For example, giving participants creative/flexible moments within sessions and lectures might be empowering in similar ways. We can also continue to call out participants who are doing creative things.

Budget

We spent a total of $3280 on the CDSW. About $350 of this funded food and refreshments during post-session meetings among the mentors, and about $280 was spent on coffee.

The rest (the large majority) was spent on food. Because we were better able to model retention this time around, we did a much better job of ordering the "right" amount of food. We ordered:

  • Session 1: Pizza from Jet City Pizza
  • Session 2: Indian (four entrees) from Jewel of India
  • Session 3: Greek food (e.g., salad, hummus, spinach pies, souvlaki) from Costas

Because Mako did the ordering, everybody ate vegetarian. At least one person complained about the lack of meat in Session 2 (but seemed to be confused into thinking it was present in Session 1).