Community Data Science Course (Spring 2023)/Week 3 lecture notes

Revision as of 01:48, 11 April 2023 by Benjamin Mako Hill

Online Data Sets: An Important Question

Can you get bulk access to data?

Bad signs:

  • You must authenticate as a particular user in order to access data, and you can only see data for that user. (For example, you must log into Instagram's API as a particular user.)

Good signs:

  • The organization that owns the data wants everyone to access it, as with Wikipedia or most government data.
  • You may have to authenticate as a particular user, but you can access general data.
  • For example: once you log into Reddit, you can get all posts about almost anything (Twitter API Docs)

Programming lecture outline

  • Dictionaries!
    • Purpose (use dictionaries to store key/value pairs)
    • Initialization {}
    • Accessing elements
    • Adding elements
    • Changing elements
    • .keys() and .values()
    • using in to look into dictionaries
    • using for loops to iterate over dictionaries (e.g., let's build a list of every letter in the alphabet using the wordplay data)
    • A few notes about dictionaries:
      • A given key can only have one value, but multiple keys can have the same value.
      • Dictionaries were historically unordered, but since Python 3.7 they are guaranteed to preserve the order in which keys were inserted.
  • Additional loop control
    • break
    • continue
    • Note: These can be useful in combination with if statements and can also be super useful for debugging!
  • Writing to files:
    • Using the open("whatever.tsv", "w") function
    • Using the with open() as my_file: statement
    • Writing to a file with print(..., file=my_file)
    • Writing a tab-separated value file using "\t" (make sure we include a header!)
    • Now let's open it up and make a little graph
  • Defining our own functions!
  • A little bit on looking for help (if it hasn't come up already)
    • Looking at StackOverflow
    • Walking through the Python API documentation
    • Using a reference card or cheatsheet
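
The loop-control and file-writing items in the outline can be sketched together as follows. This is a minimal illustration rather than the code used in class: the flavors dictionary, the "STOP" marker, the flavors.tsv file name, and the choice to write into the system temporary directory are all assumptions made for the example.

```python
import os
import tempfile

# Hypothetical example data: names mapped to favorite flavors.
flavors = {"Alice": "chocolate", "Bob": "", "Cara": "mint chip",
           "STOP": "n/a", "Dora": "vanilla"}

# Assumed output location: a file in the system temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "flavors.tsv")

with open(out_path, "w") as my_file:
    # Header row first, with the columns separated by a tab.
    print("name\tflavor", file=my_file)
    for name, flavor in flavors.items():
        if flavor == "":
            continue  # skip this entry and move on to the next one
        if name == "STOP":
            break     # leave the loop entirely; "Dora" is never reached
        print(name + "\t" + flavor, file=my_file)

# Read the file back to check what was written.
with open(out_path) as my_file:
    lines = my_file.read().splitlines()
print(lines)
```

Because continue skips Bob's empty entry and break stops at the "STOP" marker, the file should contain the header plus the rows for Alice and Cara only.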

Resources and Example Code

Initialization

>>> my_dict = {}
>>> my_dict
{}
>>> your_dict = {"Alice" : "chocolate", "Bob" : "strawberry", "Cara" : "mint chip"}
>>> your_dict
{'Alice': 'chocolate', 'Bob': 'strawberry', 'Cara': 'mint chip'}

Adding elements to a dictionary

>>> your_dict["Dora"] = "vanilla"
>>> your_dict
{'Alice': 'chocolate', 'Bob': 'strawberry', 'Cara': 'mint chip', 'Dora': 'vanilla'}

Accessing elements of a dictionary

>>> your_dict["Alice"]
'chocolate'
>>> your_dict.get("Alice")
'chocolate'
>>> your_dict["Eve"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Eve'
>>> "Eve" in your_dict
False
>>> "Alice" in your_dict
True
>>> your_dict.get("Eve")
>>> person = your_dict.get("Eve")
>>> print(person)
None
>>> print(type(person))
<class 'NoneType'>
>>> your_dict.get("Alice")
'chocolate'
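
As a small extension of the session above, .get() also accepts an optional second argument that is returned when the key is missing. The sketch below is written as a script rather than an interpreter session, and the "unknown" placeholder is just an assumption for illustration.

```python
your_dict = {"Alice": "chocolate", "Bob": "strawberry", "Cara": "mint chip"}

# The second argument to .get() is returned instead of None
# when the key is not in the dictionary.
eve_flavor = your_dict.get("Eve", "unknown")
print(eve_flavor)  # prints: unknown

# Checking with `in` first avoids the KeyError that square brackets raise.
if "Eve" in your_dict:
    message = your_dict["Eve"]
else:
    message = "Eve is not in the dictionary"
print(message)
```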

Changing elements of a dictionary

>>> your_dict["Alice"] = "coconut"
>>> your_dict
{'Alice': 'coconut', 'Bob': 'strawberry', 'Cara': 'mint chip', 'Dora': 'vanilla'}

"Histograms"

Challenge: using the wordplay example from last week, count the number of words that start with each letter.

This kind of problem is very common in data science, and it is easy with a dictionary.

(note: I will post the solution after class)
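
In the meantime, here is a minimal sketch of the general counting pattern. It is not the class solution: it uses a short made-up word list instead of the wordplay data, and the function name count_first_letters is hypothetical.

```python
def count_first_letters(words):
    """Return a dictionary mapping each first letter to a count of words."""
    counts = {}
    for word in words:
        if word == "":
            continue  # nothing to count for an empty string
        letter = word[0].lower()
        if letter in counts:
            # We have seen this letter before, so add one to its count.
            counts[letter] = counts[letter] + 1
        else:
            # First time we see this letter, so start its count at one.
            counts[letter] = 1
    return counts

# Made-up sample data standing in for the real word list.
sample_words = ["apple", "avocado", "banana", "Blueberry", "cherry"]
letter_counts = count_first_letters(sample_words)
print(letter_counts)  # prints: {'a': 2, 'b': 2, 'c': 1}
```

The same check-then-add-or-initialize pattern works for counting almost anything, which is why dictionaries come up so often for this kind of task.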

For-loops and dictionaries

There are two common ways to iterate through dictionaries:

>>> ages = {'Tommy': 34, 'Heather': 30, 'Joanna': 20}
>>> for key in ages:
...     print(key + " is " + str(ages[key]) + " years old")
...
>>> for key, value in ages.items():
...     print(key + " is " + str(value) + " years old")
...
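
The .keys() and .values() methods mentioned in the outline fit naturally alongside these loops. The following is a small sketch, written as a script rather than an interpreter session, that reuses the same example ages; the variable names are assumptions for illustration.

```python
ages = {"Tommy": 34, "Heather": 30, "Joanna": 20}

# .keys() and .values() give views of the dictionary;
# wrapping them in list() turns them into ordinary lists.
names = list(ages.keys())
all_ages = list(ages.values())
print(names)     # prints: ['Tommy', 'Heather', 'Joanna']
print(all_ages)  # prints: [34, 30, 20]

# sorted() lets us loop over the keys in alphabetical order.
for name in sorted(ages.keys()):
    print(name + " is " + str(ages[name]) + " years old")

# The values can be passed straight to functions like sum().
average_age = sum(ages.values()) / len(ages)
print(average_age)  # prints: 28.0
```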