DS4UX (Spring 2016)/Day 4 lecture: Difference between revisions

From CommunityData

Revision as of 21:27, 18 April 2016

Week 3 follow-up

Here are some important concepts that we didn't have a chance to go into in great detail last week. You can use the sections below to review the concepts individually. You can also review how they work together in math_game.py, which is included in the week 4 lecture files.

Return random values with the random module

Use random.choice() to select items at random from a list.

>>> import random
>>> my_list = ["terry j.","john","parrot","michael","terry g.", "graham", "llama"]
>>> random.choice(my_list)
'graham'
>>> random.choice(my_list)
'terry j.'
>>> 

Use random.sample() to gather a given number of random items from a list. The first argument you pass to the random.sample() function is the set of items you are sampling from. The second argument is the number of items you want to gather from that set.

>>> random.sample(my_list,3)
['terry j.', 'llama', 'michael']

Use random.randint() to pick a random whole number from a range of sequential numbers. You specify the range by passing the starting number as the first argument, and the final number as the second argument. Unlike the range() function discussed below, randint() includes both the first and last numbers you specify in the set you are sampling from.

>>> random.randint(1,10)
8
>>> random.randint(1,10)
3
>>> random.randint(1,10)
10
>>> 
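To make the difference between the two endpoints concrete, here is a small sketch. The seed value and the number of draws are arbitrary choices made only so the example is repeatable; they are not part of the lecture files.

```python
import random

random.seed(4)  # arbitrary seed, used only to make the sketch repeatable

# randint(1, 10) can return either endpoint: 1 and 10 are both possible.
draws = [random.randint(1, 10) for _ in range(1000)]
lowest, highest = min(draws), max(draws)

# range(1, 10) stops before 10, so 10 is never included.
range_values = list(range(1, 10))

print(lowest, highest)
print(range_values)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

If you want randint-style behavior from range(), add 1 to the final number, as in range(1, 11).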

Generating a list of numbers easily with range()

>>> list(range(5))
[0, 1, 2, 3, 4]
>>> for i in range(5):
...     print("Hi" * i)
...

Hi
HiHi
HiHiHi
HiHiHiHi

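The range() function generates a sequence of numbers. In Python 3 it returns a range object rather than a list, so wrap it in list() when you want to see the numbers themselves. This is handy for when you want to generate a sequence of numbers on the fly instead of typing out the list yourself.

>>> list(range(5))
[0, 1, 2, 3, 4]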

Use range when you want to loop over a bunch of numbers in a list, or perform an operation a certain number of times:

>>> numbers = range(5)
>>> for number in numbers:
...     print(number * number)
...
0
1
4
9
16

We could rewrite the above example like this:

>>> for number in range(5):
...     print(number * number)
...
0
1
4
9
16

You can also set the start, end, and increment value (called "step") for a range.

>>> for i in range(2,20,2):
...     print(i)
...
2
4
6
8
10
12
14
16
18
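The sketch below shows a few more things you can do with start, end, and step. The variable names are just for illustration.

```python
# A range with a start, an end, and a step of 2.
evens = range(2, 20, 2)

print(evens)        # range(2, 20, 2)
print(list(evens))  # [2, 4, 6, 8, 10, 12, 14, 16, 18]
print(len(evens))   # 9

# A negative step counts downward; the end value (0) is still excluded.
countdown = list(range(5, 0, -1))
print(countdown)    # [5, 4, 3, 2, 1]
```

Notice that printing a range directly shows the range object, not the numbers; list() turns it into an ordinary list.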


Get user input with input()

>>> for i in range(100):
...     my_input = input("Please type something> ")
...     if my_input == "Quit":
...         print("Goodbye!")
...         break
...     else:
...         print("You said: " + my_input)
... 
Please type something> Hello
You said: Hello
Please type something> How are you?
You said: How are you?
Please type something> Quit
Goodbye!
>>>

Things to remember about input()

  • input() simply asks the user to type something.
  • You can test out input() interactively. Just go into the python interpreter and type: input("What's your favorite color?")
  • The stuff that goes inside the parentheses is the "prompt". It's a string, and should be surrounded by quotes. When you run your program, the prompt text will be shown to the user immediately to the left of the blinking cursor where they will type their input.
  • Python will ask the user to type something at the point in the script where input() is called. Remember that Python executes scripts from top to bottom, left to right. If you put input inside a loop, it will ask the user to type something every time the loop is executed in your script.
  • What you DO with that user input is up to you. The best thing to do is to save it as a variable, i.e. user_name = input("Please type your name")
  • Python saves user input as a string, so if the user types "Daria" in the example above, then user_name will equal "Daria".
  • Once you've saved your user's input, you can use it like any other string variable. In the case of the babynames challenges, you probably want to compare it with the keys in one of the babynames dictionaries (ssadata.boys or ssadata.girls), so that you can find out how many people share that name. These keys are also strings.
  • REMEMBER: the keys in the babynames dictionaries are all in lowercase, but you can't necessarily control how a user will type their input. It's natural that people will want to capitalize their own name! Fortunately, there are string methods (https://docs.python.org/3/library/stdtypes.html#string-methods) that will convert any string into all lowercase. You can make a string lowercase by adding .lower() to the end of the string (or the variable that holds the string)!
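Here is a rough sketch of that lowercase-then-look-up pattern. The dictionary below is a made-up stand-in for ssadata.boys or ssadata.girls (the names and counts are invented), and count_for_name is a hypothetical helper, not something from the lecture files.

```python
# Made-up stand-in for a babynames dictionary; keys are lowercase strings.
sample_names = {"daria": 120, "ada": 45}

def count_for_name(raw_name, names):
    """Return the count for a name, ignoring capitalization and stray spaces."""
    cleaned = raw_name.strip().lower()
    # .get() returns 0 instead of raising an error when the name is missing.
    return names.get(cleaned, 0)

# In a real script, the first argument would come from
# input("Please type your name> ").
print(count_for_name("Daria", sample_names))   # 120
print(count_for_name("  ADA ", sample_names))  # 45
print(count_for_name("Zelda", sample_names))   # 0
```

Putting the cleanup in one place means every comparison against the dictionary keys uses the same lowercase form.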

Iterating an indeterminate number of times with while loops

Use while loops when you don't know how many times you want to repeat ("iterate") an operation.

grocery_list = []
testAnswer = input('Press y if you want to enter more groceries: ')
while testAnswer == 'y':
    food = input('Next item:')
    grocery_list.append(food)
    testAnswer = input('Press y if you want to enter more groceries: ')
print('Your grocery list:')
for food in grocery_list:
    print(food)

Most of the time, you will find that for loops are more common for the kind of coding that you will be doing. For example, if you are reading through a CSV file, a for loop makes perfect sense: there are a set number of lines in the file, and you want to loop through the file line by line until you reach the end of the file. However, whenever your code is accepting input from a person or an API, you may find that you don't know ahead of time how many times you will need to perform an operation before stopping. In these cases, it's useful to know how to keep looping until a particular condition is met, and then stop.
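The sketch below shows the same "keep going until a condition is met" idea without needing anyone at the keyboard. The list typed_responses and the stop word "done" are assumptions made for illustration; in a real script each value would come from input().

```python
# Stand-in for what a user might type, one response at a time.
typed_responses = ["apples", "bread", "milk", "done", "eggs"]

grocery_list = []
position = 0
item = typed_responses[position]

# Keep looping until the stop word appears; we don't know in advance when.
while item != "done":
    grocery_list.append(item)
    position += 1
    item = typed_responses[position]

print(grocery_list)  # ['apples', 'bread', 'milk']
```

Anything after the stop word ("eggs" here) is never reached, because the loop ends as soon as its condition becomes false.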

Splicing list items together with .join

Use .join() when you have a list of strings that you want to combine into a single string. Write the DELIMITER (the text you want to place between the items) in quotes first, then append a dot (".") followed by the word join and, inside the parentheses, the list that you want to join together.

>>> print("The members of Monty Python are: %s" % (", ".join(my_list)))
The members of Monty Python are: terry j., john, parrot, michael, terry g., graham, llama
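A short sketch of two more things worth knowing about .join(): the delimiter can be any string, and every item in the list must already be a string. The shortened list and the scores list are invented for illustration.

```python
short_list = ["terry j.", "john", "graham"]

# Any string can be the delimiter, including a newline.
piped = " | ".join(short_list)
stacked = "\n".join(short_list)
print(piped)    # terry j. | john | graham
print(stacked)  # one name per line

# join() only works on strings, so convert numbers with str() first.
scores = [3, 10, 8]
score_text = ", ".join(str(score) for score in scores)
print(score_text)  # 3, 10, 8
```

Calling ", ".join(scores) directly would raise a TypeError, because the items are integers rather than strings.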


Putting it all together with a math game

"""
This program uses the concepts that we just reviewed (random.choice, range, input, while, and join) to build a math guessing game.

It asks people to add together two random numbers between 1 and 1000, and keeps asking new questions as long as they answered the previous math problem correctly. Once they give an incorrect answer, it prints out how many they got right, and also prints all their correct responses using join.
"""
import random

numbers_to_add = list(range(1,1001))
correct_answers = []
true_answer = 0
your_answer = 0
while true_answer == your_answer:
    num1 = random.choice(numbers_to_add)
    num2 = random.choice(numbers_to_add)
    true_answer = num1 + num2
    your_answer = int(input("%d + %d = " % (num1,num2)))
    if your_answer == true_answer:
        print("Correct! Let's try another.")
        correct_answers.append("%d + %d = %s" % (num1, num2, your_answer))
    else:
        print("Incorrect!")

print("You got %d problems right:" % (len(correct_answers)))
print(", ".join(correct_answers))
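One thing to be aware of in the game above: int() raises a ValueError if the player types something that isn't a whole number, which stops the program. The sketch below is one possible way to guard against that; parse_answer is a hypothetical helper and is not part of math_game.py.

```python
def parse_answer(typed_text):
    """Convert typed text to an int, or return None if it isn't a whole number."""
    try:
        return int(typed_text)
    except ValueError:
        # The text was something like "forty" or "4.5" or an empty string.
        return None

print(parse_answer("42"))     # 42
print(parse_answer(" 7 "))    # 7 (int() ignores surrounding spaces)
print(parse_answer("forty"))  # None
```

In the game, you could call parse_answer(input(...)) and treat None as an incorrect answer instead of letting the program crash.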


Reading and writing files

One of the most common operations you will perform when you use Python for data analysis will be reading and writing data to and from flat files. These are sometimes referred to as flat file databases, not because the files themselves are a special kind of code, but because they hold multiple pieces of data in a simple, structured format that facilitates common data processing operations like querying, overwriting, appending, and ingesting.

Probably the most basic type of "flat file database" is a plain text file with a different piece of data on each line. To process a file like this in Python, you generally follow these steps:

  1. Open the file and save it in working memory (also sometimes referred to as "buffer") as a file object.
  2. Read through the file object line-by-line and do something with it.
  3. Close the file object (whether you have changed it, or just read it and done something else with what you read).
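The three steps above can be sketched as follows. The file name names.txt and its contents are made up, and the file is created in a temporary folder only so that the example runs on its own.

```python
import os
import tempfile

# Setup: create a small throwaway file to read from.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "names.txt")
with open(path, "w") as output_file:
    output_file.write("daria\njane\nquinn\n")

# Step 1: open the file and save it as a file object.
input_file = open(path, "r")

# Step 2: read through the file object line by line.
names = []
for line in input_file:
    names.append(line.strip())  # strip() removes the trailing newline

# Step 3: close the file object.
input_file.close()

print(names)              # ['daria', 'jane', 'quinn']
print(input_file.closed)  # True
```

The setup portion uses the with open(...) pattern, which performs step 3 automatically when the indented block ends.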

Plain text flat files with one piece of data per line (for example, a list of names) are very easy to create and can hold a lot of data. In the simplest case, the new line in the file (defined by an invisible newline character \n at the end of each line) serves as the delimiter between different pieces of data. This delimiter allows Python to know where one piece of data ends and the next begins, when it reads through the file.

However, if you want to store more complex data (such as a list of names, genders, and # of people of that gender with that name), you need to come up with a way to store associated datapoints together. To do this, you need to define a new delimiter so that you can separate different associated datapoints in each row. Theoretically, you could use any character as a delimiter, but two of the most common ones are a tab \t and a comma.

Now that you have two delimiters, one that delimits vertically and the other that delimits horizontally, you have a two-dimensional matrix, a way of structuring data that is so common that we don't generally even use its technical name. Instead, we generally use the name for the kind of software application we often use to display matrix data: a spreadsheet.
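As a small illustration of the two-delimiter idea, the sketch below splits a block of text first on newlines (rows) and then on tabs (columns). The names and counts are invented.

```python
# Rows are separated by \n, columns by \t. The data is made up.
raw_text = "name\tgender\tcount\ndaria\tF\t120\nquinn\tF\t87\n"

rows = []
for line in raw_text.splitlines():  # vertical delimiter: newline
    rows.append(line.split("\t"))   # horizontal delimiter: tab

print(rows)
# [['name', 'gender', 'count'], ['daria', 'F', '120'], ['quinn', 'F', '87']]

# Row 1, column 2 is daria's count. Note that it is still a string.
print(rows[1][2])  # 120
```

The result is a list of lists: the outer list holds the rows, and each inner list holds the cells in that row, just like a spreadsheet.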


Goals

In the exercises below, you will learn the basic syntax for reading common types of files into Python, and transforming them into other types of files. We will only cover a couple of the most basic types of file conversion you might want to do here:

  1. reading a raw (and messy!) comma-separated text file into Python with the generic open method, parsing through it line by line to clean it up, and outputting a much cleaner matrix version with newlines and tabs as delimiters.
  2. reading that file with Python's csv library, which provides you with additional options for reading, formatting, and writing files with datapoints that are separated by newlines (vertically) and commas (horizontally).


Instructions

  1. If you haven't done so already, download the ZIPPED code and data file called lecture.zip, unzip it, and navigate to it in your Terminal or Powershell.
  2. Open the code and datafiles in that directory in TextWrangler as well, so you can see the code while we walk through it.


Notes

  1. The input file we will be working with in this exercise is intentionally messy—it has lots of extra spaces and tabs scattered through the various lines of the file. One of the reasons we're devoting a whole mini-lecture on reading and writing flat files is that the data you want to analyze is OFTEN messy like this when you first get it, and often creating a "clean" version is one of the first things you'll want to do with that data.
  2. File suffixes (.txt, .tsv, .csv) are in some cases more a matter of convention than requirement. If a file is just plain text, Python may not care what its file suffix is: you may be able to open it and read through it line-by-line whether it has ".txt" on it or not. However, in many cases, applications use the file suffix to decide how to render or execute a file, so it's best to always use the proper (conventional) suffix for any file database you have: .csv for comma-separated value files, .tsv for tab-separated value files, and .txt for generic text files (or for text files where you don't know or can't guarantee that the structure is complete or consistent).


Outputting a TSV (tab-separated value) file

  1. open the input file class_names_raw.txt in TextWrangler. From the drop-down menu, select View->Text display->Show Invisibles. What do you see?
  2. run class_names_txt_to_tsv.py. Look at the output printed on the terminal. Now examine the output file itself. What has changed?
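The sketch below shows the general idea of cleaning a messy comma-separated file and writing it back out with tabs. It is not the code from class_names_txt_to_tsv.py; the file names, the messy sample lines, and the cleaning rules are all assumptions made for illustration.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
raw_path = os.path.join(folder, "messy_names.txt")
tsv_path = os.path.join(folder, "clean_names.tsv")

# Setup: write a made-up messy file with stray spaces, tabs, and a blank line.
with open(raw_path, "w") as raw_file:
    raw_file.write("  Daria ,\tMorgendorffer \nJane,  Lane\t\n\n Quinn\t,Morgendorffer\n")

with open(raw_path, "r") as raw_file, open(tsv_path, "w") as tsv_file:
    for line in raw_file:
        if not line.strip():
            continue  # skip lines that are empty or only whitespace
        # Split on commas, then strip stray spaces and tabs from each piece.
        fields = [field.strip() for field in line.split(",")]
        tsv_file.write("\t".join(fields) + "\n")

with open(tsv_path, "r") as tsv_file:
    cleaned_text = tsv_file.read()

print(repr(cleaned_text))
# 'Daria\tMorgendorffer\nJane\tLane\nQuinn\tMorgendorffer\n'
```

Printing with repr() shows the invisible \t and \n characters, much like turning on Show Invisibles in a text editor.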


Outputting a CSV (comma-separated value) file

  1. run class_names_txt_to_csv.py, which takes the same messy input file as the previous script. Look at the output printed on the terminal. Now examine the output file itself. How is this file different from the .tsv file we created earlier?


Important syntax concepts

  1. using with open(FILENAME, "r") as VARIABLE will almost always be the best way to open a file (for reading or writing). This saves you some steps, and makes sure that the file is properly closed (removed from working memory) once you're done with it. There are other patterns for opening files, but I suggest you get used to this pattern before you explore any others.
  2. when you specify "r" after the filename in open(), you're saying that you want to open the file to READ it; when you specify "w", you are saying you want to open the file (or create it, if it doesn't exist) so that you can write to it, that is, add data into the file to save. Note that "w" erases any existing contents of the file before writing. There are some other options for reading and writing, such as "rb" and "wb", which are used in different circumstances and for different types of data. We'll cover these in less detail next week.
  3. .readlines() vs csv.reader: both of these make it easy to read through a file object line-by-line once you have opened the file. csv.reader provides you with additional parameters that allow you to control how the data from the file is interpreted, which often comes in handy when dealing with messy data, but to use it you have to import csv first. The same goes for .write(), used to write the .tsv file in the first script, and csv.writer, used to write the .csv file in the second script.
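Here is a minimal sketch of csv.reader and csv.writer in action. To keep it self-contained, it uses io.StringIO objects as in-memory stand-ins for files opened with open(); the sample text is made up, and this is not the code from class_names_txt_to_csv.py.

```python
import csv
import io

# A made-up CSV with an extra space after each comma.
messy_text = "Daria, Morgendorffer\nJane, Lane\n"

# skipinitialspace=True tells csv.reader to ignore spaces right after a comma.
reader = csv.reader(io.StringIO(messy_text), skipinitialspace=True)
rows = [row for row in reader]
print(rows)  # [['Daria', 'Morgendorffer'], ['Jane', 'Lane']]

# csv.writer adds the commas for you, and quotes any value that contains one.
output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerows(rows)
writer.writerow(["Lane, Jane", "artist"])
csv_text = output.getvalue()
print(repr(csv_text))
# 'Daria,Morgendorffer\nJane,Lane\n"Lane, Jane",artist\n'
```

With a real file, you would pass the file object from with open(FILENAME, "r") as VARIABLE to csv.reader in place of the StringIO object.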


Case study: Wikipedia Notifications

A screenshot of the survey.
A screenshot of the Notifications feature.

The Wikimedia Foundation is working on updates to the Notifications feature of MediaWiki, a panel that alerts editors to important, relevant events that have occurred since they last logged in.

Types of notifications
There are currently 8 types of notification that a user can receive:
  • someone @mentioned you in a discussion
  • someone left you a message on your user page
  • someone reverted one of your edits
  • someone changed your user rights
  • someone sent you an email
  • someone thanked you for your edit
  • someone reviewed an article you created
  • someone linked to an article you created


In order to understand how to prioritize their work, the product team at Wikimedia that is in charge of the Notifications feature needs to know how editors currently use the feature: what they value, what annoys them, what confuses them, etc.

We released a survey targeted at these editors, and asked them a variety of questions about their usage of Notifications, and how we could improve the feature to better meet their needs. In that survey, we asked respondents to identify notifications that they have seen before, and then asked them to rank these notifications in order of how important/informative they were.

The ZIP file linked below contains a file called notifications_ranking_survey_data.csv, which contains anonymized responses from over 100 users, as well as some heavily-commented scripts for processing that file and outputting aggregated statistics from those responses. In class, we'll walk through this step by step, to show the process involved in transforming raw data into research findings!


Code and data

Click here to download the Notifications data and scripts

Analysis

To see how these questions are answered, read the multi-line comments (the ones surrounded by """) and un-comment the print statements one by one in the files below as you walk through the code.

  • How do we import this dataset into Python? How should we structure the data we import? Run notifications.py.
  • How many people report having seen each type of notification? Run count_notifications.py.
  • Which types of notifications are ranked #1 priority by the most people? Run rank_notifications1.py.
  • Which types of notifications are ranked among the top THREE highest priority by the most people? Run rank_notifications2.py.
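To give a feel for the kind of counting involved in the last two questions, here is a rough sketch using collections.Counter. It does not reproduce the course scripts: the responses below are invented, and the structure (one list per respondent, ordered from highest to lowest priority) is an assumption about how the data might be arranged after import, not a description of the real CSV file.

```python
from collections import Counter

# Invented responses: each inner list is one respondent's ranking,
# with the highest-priority notification type first.
responses = [
    ["mention", "thanks", "revert"],
    ["revert", "mention", "email"],
    ["mention", "revert", "thanks"],
]

# How many respondents ranked each type #1?
top_ranked = Counter(response[0] for response in responses)
print(top_ranked.most_common())  # [('mention', 2), ('revert', 1)]

# How many respondents placed each type anywhere in their top three?
top_three = Counter()
for response in responses:
    top_three.update(response[:3])
print(top_three["revert"])  # 3
print(top_three["email"])   # 1
```

A Counter works like a dictionary whose values are tallies, so you can look up a single type or ask for the most common ones.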