CommunityData:Introduction to CDSC Resources
== Computation: Servers, data, and more ==

Much of our work is computational and quantitative and involves large datasets. We have multiple computing resources and servers.

;Hyak: Hyak is a supercomputer system that is hosted at UW but that the whole group uses for statistical analysis and data processing. Hyak is necessary if you need large amounts of storage (e.g., tens of terabytes) or large amounts of computational resources (e.g., CPU time, memory, etc.). ''Servers in Hyak do not have direct access to the Internet'' (except for 'build' machines). That means that Hyak is not useful for collecting data from APIs, etc. Access requires a UW NetID, but one will be sponsored for you. You can learn more at [[CommunityData:Hyak]], which also links to various tutorials and documentation.
:In order to use Hyak, you need to get an account set up. This is documented on [[CommunityData:Hyak setup]].
;Kibo: Kibo is a research server hosted at Northwestern that came online in 2018-2019. Kibo is only a single machine, but it is very powerful and is connected to the Internet. It has several dozen terabytes of space, a large amount of memory, and many CPUs. We use it primarily for (a) data collection from APIs and (b) publication of large datasets like the data from the CDSC [[COVID-19 Digital Observatory]]. Access requires an NU NetID, but one will be sponsored for you. More details are on [[CommunityData:Kibo]].
;Nada: Nada is a server at UW that is used primarily for infrastructure. It runs the blogs, mailing lists, git repositories, and so on. We back up all of Nada, and these backups can be very expensive. Before you download or use data on Nada, please read the page [[CommunityData:Backups (nada)]], which provides details on what is, and what isn't, backed up from Nada.
;Asha: Asha is a server at UW that is used for storing and analyzing Scratch data. Only people on the IRB protocol for Scratch have access.
When using servers, these pages might be helpful:
* [[CommunityData:Tmux]] - You can use tmux (terminal multiplexer) to keep a persistent session on a server, even if you're not logged into the server. This is especially helpful when you ssh to a server and start a long-running job but can't stay logged in the whole time. Check out the [https://github.com/tmux/tmux/wiki tmux wiki on GitHub] or its [https://en.wikipedia.org/wiki/Tmux Wikipedia page] for more information.
* [[CommunityData:Hyak Spark]] - Spark is a powerful tool that helps build programs dealing with large datasets. It's great for Wikimedia and Reddit data dumps.

=== Wiki data in particular ===

Multiple people in the group work on large datasets gathered from Wikipedia, Wikia (Fandom), or other projects running MediaWiki software. We have some specific resources and tools for these kinds of data:
* [[CommunityData:ORES]] - Using ORES with Wikipedia data.
* [[CommunityData:Wikia data]] - How to get and validate Wikia dumps.
* [[CommunityData:Wikiq]] - Processing MediaWiki XML dumps and outputting the parsed dumps as TSV (which can then be processed with Spark).
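
As a rough illustration of what working with a parsed dump as TSV can look like, here is a minimal Python sketch that counts revisions per page. It is not taken from the Wikiq documentation: the sample data and the column names (<code>title</code>, <code>revid</code>, <code>editor</code>) are hypothetical stand-ins, so check the header of your actual file before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a parsed dump in TSV form.
# The real column names in your file may differ; check its header row.
SAMPLE_TSV = (
    "title\trevid\teditor\n"
    "Main Page\t1\tAlice\n"
    "Main Page\t2\tBob\n"
    "Sandbox\t3\tAlice\n"
)


def count_revisions(tsv_file, key="title"):
    """Count rows (one per revision) for each distinct value of `key`."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[row[key]] += 1
    return counts


# io.StringIO stands in for an open file; with a real file you would use
# open(path, newline="", encoding="utf-8") instead.
counts = count_revisions(io.StringIO(SAMPLE_TSV))
for title, n in counts.most_common():
    print(f"{title}\t{n}")
```

A single-machine script like this is only reasonable for small files. For full Wikipedia-scale dumps, the Spark route described on [[CommunityData:Hyak Spark]] is the better fit.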