User:Kaylea/Using Google Cloud to Analyze Wikipedia
From CommunityData
Because Wikipedia view data is large, I used the version hosted via Google's BigQuery to avoid parsing logfiles. Here's a basic version of my advice, workflow, etc.

1 - Check the list of datasets [https://bigquery.cloud.google.com/ here] -- make sure these are the files and fields you're looking for.

2 - Use a Google account to sign up for your free trial of Google Cloud. There are tutorials available, and they're decent, although the whole site feels a little diffuse and disorganized.

3 - Configure your account to deal with financial issues (an area of paranoia for me). Your new account should come with $300 in free credit, and there is also a monthly free quota after your year-long trial expires. The setup process does ask to connect to your credit card, but the good news is that you can set boundaries on the amount you spend -- for example, I set my project budget to $100 (i.e., well below my free credit level) and have done many large queries and stored a lot of data without coming near it. You can customize all of this via the Cloud Console.

4 - Do a few tutorials that seem to apply to what you want to do; they're quick and practical. Notice whether the query terms you'll be using are part of legacy SQL or part of Google's standard SQL dialect.

5 - Set up a project in BigQuery to organize your stuff, then be bold and make some queries. You can use the Schema, Details, and Preview tabs on a table to make sure you're getting what you want. Definitely use the "Show Options" button for your query to set a destination table for your results. I set mine to allow large results and to disallow flattening of repeated data, because I was dealing with hourly views and didn't want to overwrite anything when I dropped the datetime field. Once I was happy with the faster-executing queries in Interactive mode, I set the mode to Batch to save a little money. My queries took a minute or so to run interactively, and maybe 3 minutes elapsed in Batch mode.
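The queries in step 5 against the hourly pagecount tables boil down to dropping the datetime field and summing requests per title. A minimal local sketch of that same aggregation, on toy rows with illustrative field names (not the exact BigQuery schema):

```python
from collections import defaultdict

# Toy rows shaped roughly like the hourly pagecounts data:
# (title, hour, requests). Values here are made up for illustration.
rows = [
    ("Main_Page", "2011-12-01 00:00", 120),
    ("Main_Page", "2011-12-01 01:00", 95),
    ("Cat",       "2011-12-01 00:00", 7),
]

# Equivalent of SELECT title, SUM(requests) ... GROUP BY title:
# drop the hour field and accumulate requests per title.
totals = defaultdict(int)
for title, _hour, requests in rows:
    totals[title] += requests

print(dict(totals))  # {'Main_Page': 215, 'Cat': 7}
```

Letting BigQuery do this server-side is the whole point -- you only pay to scan the table once and store the much smaller grouped result in your destination table.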
Some of my queries:

''SELECT title, requests FROM `fh-bigquery.wikipedia.pagecounts_201101` WHERE language="en"''

''SELECT title, SUM(requests) FROM PageViews.201112 GROUP BY title''

It cost me a little under $7 of my free credit to extract the English-language Wikipedia view data -- the number of views of a given page in a given month -- for all of 2011.

6 - Transfer the results. Because the data is somewhat large, it didn't want to simply download out of BigQuery directly -- instead, I had to export it to Google's Cloud Storage and then download it from there. It will shard, CSV, and gzip the data for you, and give the pieces unique names (just specify the filename like this when you export: gs://myCloudBucketNameHere/FilenameHere*.csv.gz and it will insert shard numbers in place of the *).

7 - Download the data. You can take a click-to-download approach in the browser, or fetch the files using Google's gsutil. I followed the [https://cloud.google.com/storage/docs/gsutil directions] to install gsutil directly into my hyak homedir, then did things like ''gsutil cp gs://allviewdata/201108* .''
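Once the shards are on disk, they're just gzipped CSV pieces, so stitching them back together needs nothing beyond the standard library. A sketch under made-up shard names that mirror the export pattern above (the real exports carry twelve-digit shard numbers where the * was):

```python
import csv
import glob
import gzip
import os
import tempfile

# Create a scratch directory holding two toy shards, mimicking what the
# gs://.../FilenameHere*.csv.gz export produces after download.
# Shard names and row contents here are hypothetical.
workdir = tempfile.mkdtemp()
shards = {
    "views000000000000.csv.gz": [("Main_Page", "215")],
    "views000000000001.csv.gz": [("Cat", "7")],
}
for name, shard_rows in shards.items():
    with gzip.open(os.path.join(workdir, name), "wt", newline="") as f:
        csv.writer(f).writerows(shard_rows)

# Read every shard back in sorted order and concatenate the rows.
merged = []
for path in sorted(glob.glob(os.path.join(workdir, "views*.csv.gz"))):
    with gzip.open(path, "rt", newline="") as f:
        merged.extend(tuple(row) for row in csv.reader(f))

print(merged)  # [('Main_Page', '215'), ('Cat', '7')]
```

Reading the shards in sorted order keeps the output deterministic; if row order matters to your analysis, sort on a data column instead, since BigQuery makes no promises about which rows land in which shard.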