User:Kaylea/Using Google Cloud to Analyze Wikipedia

Because Wikipedia view data is large, I used the version hosted via Google's BigQuery to avoid parsing log files. Here's a basic version of my advice, workflow, etc.

1 - Check the list of datasets here -- make sure they include the files and fields you're looking for.

2 - Use a Google account to sign up for a free trial of Google Cloud. There are tutorials available, and they're decent, although the whole site feels a little diffuse and disorganized.

3 - Configure your account to deal with financial issues (an area of paranoia for me). Your new account should come with $300 in free credit, and there is also a monthly free quota after your year-long trial expires. The setup process does ask you to connect a credit card, but the good news is that you can set limits on the amount you spend -- for example, I set my project budget to $100 and have done many large queries and stored a lot of data without coming near it. You can customize all of this via the Cloud Console.
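If you'd rather script the budget than click through the console, newer versions of the gcloud tool can create one too; this is just a rough sketch -- the billing account ID and display name are placeholders, and the exact flags may vary by gcloud version:

# Hypothetical example: create a $100 budget on a billing account.
# Budgets send alert emails as spending approaches the amount; replace the IDs with your own.
gcloud billing budgets create \
  --billing-account=0X0X0X-0X0X0X-0X0X0X \
  --display-name="wikipedia-views" \
  --budget-amount=100USD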

4 - Do a few tutorials that seem to apply to what you want to do; they're quick and practical. Notice whether you are going to be writing your queries in legacy SQL or in BigQuery's standard SQL, since the syntax differs between the two.
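For example, the two dialects reference tables differently. Here's a small illustration using the public pagecounts table that appears in the queries below; both forms should return the same rows:

-- legacy SQL: square brackets, with a colon between project and dataset
SELECT title, requests FROM [fh-bigquery:wikipedia.pagecounts_201101] WHERE language = "en"

-- standard SQL: backticks, with dots throughout
SELECT title, requests FROM `fh-bigquery.wikipedia.pagecounts_201101` WHERE language = "en"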

5 - Set up a project in BigQuery to organize your stuff. Then be bold and make some queries. You can use the Schema, Details, and Preview tabs on a table to make sure you're getting what you want. Definitely use the "Show Options" button for your query to set a destination table for your results. I set mine to allow large results and to disallow flattening of repeated data, because I was dealing with hourly views and didn't want to overwrite anything when I dropped the datetime field. Once I was happy with the faster-executing queries in Interactive mode, I switched to Batch mode to save a little money. My queries took a minute or so to run interactively and maybe 3 minutes elapsed in Batch mode.
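If you'd rather run queries from the command line than through Show Options, the same settings map onto flags of the bq tool. A rough sketch, with a made-up destination dataset and table name (note that allow_large_results and flatten_results only matter for legacy SQL):

# Hypothetical example: run a legacy SQL query in batch mode, writing results to a
# destination table, allowing large results and keeping repeated data unflattened.
bq query --destination_table=PageViews.pagecounts_201101_en \
  --allow_large_results --noflatten_results --batch \
  --use_legacy_sql=true \
  'SELECT title, requests FROM [fh-bigquery:wikipedia.pagecounts_201101] WHERE language = "en"'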

Some of my queries:

SELECT title, requests FROM `fh-bigquery.wikipedia.pagecounts_201101` WHERE language="en"

SELECT title, SUM(requests) FROM PageViews.201112 GROUP BY title

It cost me a little under $7 of my free credit to extract the English-language Wikipedia view data -- the number of views each page received in a given month -- for all of 2011.

6 - Transfer the results. Because the data is somewhat large, it couldn't simply be downloaded out of BigQuery directly -- instead, I had to export it to Google's Cloud Storage and then download it from there. BigQuery will shard the data, write it as gzipped CSV, and give the files unique names (just specify the filename like this when you export: gs://myCloudBucketNameHere/FilenameHere*.csv.gz and it will insert shard numbers in place of the *).
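The export can also be scripted from the command line with bq extract; a quick sketch, reusing the table and bucket names from above:

# Hypothetical example: export a results table to sharded, gzipped CSV files in Cloud Storage.
bq extract --destination_format=CSV --compression=GZIP \
  PageViews.201112 'gs://myCloudBucketNameHere/FilenameHere*.csv.gz'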

7 - Download the data. You can click to download each file in the browser, or fetch them using Google's gsutil. I followed the directions to install gsutil directly into my Hyak home directory, then I did things like gsutil cp gs://allviewdata/201108*
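For a whole month's worth of shards, gsutil's parallel mode speeds things up. A small sketch using the same bucket name, assuming you want the files in a local data directory:

# Hypothetical example: copy all of the August 2011 shards in parallel, then decompress them.
mkdir -p data
gsutil -m cp 'gs://allviewdata/201108*' data/
gunzip data/*.gz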