Not logged in
Talk
Contributions
Create account
Log in
Navigation
Main page
About
People
Publications
Teaching
Resources
Research Blog
Wiki Functions
Recent changes
Help
Licensing
Page
Discussion
Edit
View history
Editing
DS4UX (Spring 2016)/Reading and writing files
From CommunityData
Jump to:
navigation
,
search
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
One of the most common operations you will perform when you use Python for data analysis will be reading and writing data to and from flat files. These are somtimes referred to as [[w:Flat_file_database|flat file databases]]—not because the files themselves are a special kind of code, but because they hold multiple pieces of data in a simple, structured format that facilitates common data processing operations like querying, overwriting, appending, and ingesting. Probably to most basic type of "flat file database" is a plain text file with a different piece of data on each line. To process a file like this in Python, you generally follow these steps: # '''Open''' the file and save it in working memory (also sometimes referred to as "buffer") as a file object. # '''Read''' through the file object line-by-line and do something with it. # '''Close''' the file object (whether or not you have changed it, or just read it and done something else with what your read). Plain text flat files with one piece of data per line (for example, a list of names) are very easy to create and can hold a lot of data. In the simplest case, the new line in the file (defined by an invisible newline character <code>\n</code> at the end of each line) serves as the ''delimiter'' between different pieces of data. This delimiter allows Python to know where one piece of data ends and the next begins, when it reads through the file. However, if you want to store more complex data (such as a list of names, genders, and # of people of that gender ''with'' that name), you need to come up with a way to store ''associated'' datapoints together. To do this, you need to define a new ''delimiter'' so that you can separate different associated datapoints in each row. Theoretically, you could use any character as a delimiter, but two of the most common ones are a tab <code>\t</code> and a comma. Now that you have two delimiters, one that delimits vertically and the other that delimits horizontally, you have a two-dimensional ''matrix'', a way of structuring data that is so common that we don't generally even use it's technical name—instead, we generally use the name for the kind of software application we often use to display matrix data: a spreadsheet. ===Goals=== In the exercises below, you will learn the basic syntax for reading common types of file into Python, and transforming them into other types of files. We will only cover a couple of the most basic types of file conversation you might want to do here: # reading a raw (and messy!) comma-separated text file into Python with the generic <code>open</code> method, parsing through it line by line to clean it up, and output a much cleaner matrix version with newlines and tabs as delimiters. # reading that file with Python's <code>csv</code> library, which provides you with additional options for reading, formatting, and writing files with datapoints that are separated by newlines (vertically) and commas (horizontally). == Instructions == # If you haven't done so already, [http://jtmorgan.net/ds4ux/week4/lecture.zip download the ZIPPED code and data file] called <code>lecture.zip</code>, unzip it, and navigate to it in your Terminal or Powershell. # Open the code and datafiles in that directory in TextWrangler as well, so you can see the code while we walk through it. === Notes === # The input file we will be working with in this exercise is intentionally messy—it has lots of extra spaces and tabs scattered through the various lines of the file. One of the reasons we're devoting a whole mini-lecture on reading and writing flat files is that the data you want to analyze is OFTEN messy like this when you first get it, and often creating a "clean" version is one of the first things you'll want to do with that data. # File suffixes (.txt, .tsv, .csv) are in some cases more a matter of ''conventions'' than requirements. If a file is just plain text, Python may not care what its file suffix is: you may be able to read it in and read through it line-by-line whether it has ".txt" on it or now. However, in many cases, Python (like most applications) uses the file suffix to decide how to render/execute a file, so it's best to always use the proper (conventional) suffix for any file database you have: .csv for comma-separated value files, .tsv for tab-separated value files, and .txt for generic text files (or for text files where you don't know or can't guarantee that the structure is complete or consistent) == Outputting a TSV (tab-separated value) file == # open the input file <code>class_names_raw.txt</code> in TextWrangler. From the drop-down menu, select View->Text display->Show Invisibles. What do you see? # run <code>class_names_txt_to_tsv.py</code>. Look at the output printed on the terminal. Now examine the output file itself. What has changed? == Outputting a CSV (comma-separated value) file == # run <code>class_names_txt_to_csv.py</code>, which takes the same messy input file as the previous script. Look at the output printed on the terminal. Now examine the output file itself. How is this file different from the .tsv file we created earlier? == Important syntax concepts == #using <code>with open(FILENAME, "r") as VARIABLE</code> will almost always be the best way to open a file (for reading or writing). This saves you some steps, and makes sure that the file is properly closed (removed from working memory) once you're done with it. There are other patterns for opening files, but I suggest you get used to this pattern before you explore any others. #when you specify <code>"r"</code> after the filename in <code>open()</code> you're saying whether you want to open the file to READ it; when you specify <code>"w"</code>, you are saying you want to open the file (or create it, if it doesn't exist) so that you can write to it—add data into the file to save. There are some other options for reading and writing, such as <code>"rb"</code> and <code>"wb"</code>, which are used in different circumstances and for different types of data. We'll cover these in less detail next week. #<code>.readlines()</code> vs <code>csv.reader</code>: both of these are functions that make it easy to read through a file object line-by-line once you have opened the file. <code>csv.reader</code> provides you with additional ''parameters'' that allow you to control how the data from the file is interpreted, which often comes in handy when dealing with messy data, but to use it you have to <code>import csv</code> first. The same goes for <code>write</code>, used to write the .tsv file in the first script, and <code>csv.writer</code>, used to write the .csv file in the second script. [[Category:DS4UX (Spring 2016)]]
Summary:
Please note that all contributions to CommunityData are considered to be released under the Attribution-Share Alike 3.0 Unported (see
CommunityData:Copyrights
for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource.
Do not submit copyrighted work without permission!
To protect the wiki against automated edit spam, we kindly ask you to solve the following CAPTCHA:
Cancel
Editing help
(opens in new window)
Tools
What links here
Related changes
Special pages
Page information