Editing CommunityData:Introduction to CDSC Resources
Latest revision | Your text | ||
Line 1: | Line 1: | ||
If you're new to the group, welcome! | If you're new to the group, welcome! | ||
This is an introduction to | This is an introduction to the many technical tools we use in our research work. It may be helpful to read before you dive in and start your research with this group. You can find all of the resources mentioned below on the [https://wiki.communitydata.science/CommunityData:Resources Resources] page, which generally lists more resources than this introduction does. | ||
To start, here's some [https://wiki.communitydata.science/CommunityData:Jargon common shorthand] that members might use. It's a little outdated but has some | To start, here's some [https://wiki.communitydata.science/CommunityData:Jargon common shorthand] that members might use. It's a little outdated but has some main shorthands that might pop up in conversation. | ||
== Communication Channels == | == Communication Channels == | ||
We communicate on multiple channels! | We communicate on multiple channels! | ||
* We communicate (chat) frequently on [ | * We communicate (chat) frequently on [https://wiki.communitydata.science/CommunityData:Resources#Chat_on_IRC IRC] | ||
* We use email lists to communicate things relevant to the ''entire'' group or subgroup, like upcoming events or circulating papers for feedback: [ | * We use email lists to communicate things relevant to the ''entire'' group or subgroup, like upcoming events or circulating papers for feedback: [https://wiki.communitydata.science/CommunityData:Email CDSC - Email] | ||
* One can also contact specific [ | * One can also contact specific [https://wiki.communitydata.science/People members] directly. | ||
* For weekly meetings and other (video)calls, we | * For weekly meetings and other (video)calls, we use Jitsi. There are a lot of us, which can make calls a little hectic, so please keep in mind some [https://wiki.communitydata.science/CommunityData:Jitsi Jitsi etiquette]. | ||
* We also have a calendar of group-wide events: | * We also have a calendar of group-wide events: [https://wiki.communitydata.science/Schedule CDSC Calendar], such as the retreats. | ||
== | == Shared Resources == | ||
* | * We maintain a large shared [https://wiki.communitydata.science/CommunityData:Zotero Zotero] directory that helps with finding relevant papers and smooths collaboration, since you can see the papers and sources stored by collaborators. Please review the Zotero etiquette described in the "Adding and Organizing References" and "Tips and Tricks" sections of [https://wiki.communitydata.science/CommunityData:Zotero Zotero] before using the shared folder. | ||
* We also have a Git repository with some shared resources (both technical and non-technical) on it: | |||
* | ** [[CommunityData:Code]] — List of software projects maintained by the collective. | ||
** [[CommunityData:Git]] — How to get set up on the git server | |||
** | |||
** | |||
== | == Servers and Data Stuff == | ||
Much of our work is pretty computational/quantitative and involves large datasets. We have multiple computing resources and servers. | Much of our work is pretty computational/quantitative and involves large datasets. We have multiple computing resources and servers. For any given project, you might not end up needing all of them. | ||
;Hyak: Hyak is a supercomputer system that is hosted at UW but that the whole group uses for conducting statistical analysis and data processing. Hyak is necessary if you need large amounts of storage (e.g., tens of terabytes) or if you need large amounts of computational resources (e.g., CPU time, memory, etc). '' | ;Hyak: Hyak is a supercomputer system that is hosted at UW but that the whole group uses for conducting statistical analysis and data processing. Hyak is necessary if you need large amounts of storage (e.g., tens of terabytes) or if you need large amounts of computational resources (e.g., CPU time, memory, etc). ''Servers in Hyak do not have direct access to the Internet.'' That means that Hyak is not useful for collecting data from APIs, etc. Access requires a UW NetID, but one will be sponsored for you. You can learn more about it at [[CommunityData:Hyak]], which also links to various tutorials and documentation. | ||
:In order to use Hyak, you need to get an account set up. This is documented on [[CommunityData:Hyak setup]]. | :In order to use Hyak, you need to get an account set up. This is documented on [[CommunityData:Hyak setup]]. | ||
Line 44: | Line 35: | ||
When using servers, these pages might be helpful: | When using servers, these pages might be helpful: | ||
* [[CommunityData:Tmux]] — You can use tmux (terminal multiplexer) to keep a persistent session on a server, even if you're not logged into the server. This is especially helpful when you ssh to a server and start a long-running job but can't stay logged in the whole time. Check out the [https://github.com/tmux/tmux/wiki tmux wiki on GitHub] or its [https://en.wikipedia.org/wiki/Tmux Wikipedia page] for more information about this. | * [[CommunityData:Tmux]] — You can use tmux (terminal multiplexer) to keep a persistent session on a server, even if you're not logged into the server. This is especially helpful when you ssh to a server and start a long-running job but can't stay logged in the whole time. Check out the [https://github.com/tmux/tmux/wiki tmux wiki on GitHub] or its [https://en.wikipedia.org/wiki/Tmux Wikipedia page] for more information about this. | ||
* [[CommunityData:Hyak Spark]] — Spark is a powerful tool that helps build programs dealing with large datasets. It's great for Wikimedia | * [[CommunityData:Hyak Spark]] — Spark is a powerful tool that helps build programs dealing with large datasets. It's great for Wikimedia data dumps. | ||
=== Wiki Data | === Re: Wiki Data === | ||
* [[CommunityData:ORES]] - Using ORES with Wikipedia data | |||
* [[CommunityData:ORES]] - Using ORES with | * [[CommunityData:Wikia data]] — Documents how to get and validate Wikia dumps. | ||
* [[CommunityData:Wikia data]] — | * [[CommunityData:Wikiq]] - Wikiq is a handy tool we use to process Wikipedia XML dumps, outputting them as TSV (which can then be easily processed with Spark). | ||
* [[CommunityData:Wikiq]] - | |||
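As a rough illustration of why TSV output is convenient to work with, here is a minimal Python sketch that counts revisions per page from tab-separated data. The column names (<code>title</code>, <code>editor</code>, <code>text_chars</code>) and the sample rows are assumptions made up for this example, not Wikiq's documented schema; see [[CommunityData:Wikiq]] for the actual output format.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data. Real Wikiq output may use different column names.
SAMPLE_TSV = (
    "title\teditor\ttext_chars\n"
    "Seattle\tAlice\t1200\n"
    "Seattle\tBob\t1350\n"
    "Tacoma\tAlice\t400\n"
)


def count_revisions_per_title(tsv_file):
    """Count how many rows (revisions) each page title has in a TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[row["title"]] += 1
    return counts


# io.StringIO stands in for an open file here; with a real dump you would
# pass something like open("dump.tsv", newline="", encoding="utf-8") instead.
counts = count_revisions_per_title(io.StringIO(SAMPLE_TSV))
print(counts.most_common())
```

For dumps too large to fit in memory on one machine, the same kind of group-and-count is what you would hand to Spark instead.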
== Creating Documents and Presentations == | == Creating Documents and Presentations == | ||
=== Planning === | === Planning === | ||
You can develop a research plan in whatever way works best, but one thing that may be useful is the outline of a | You can develop a research plan in whatever way works best, but one thing that may be useful is the outline of a Matsuzaki-style planning document. You can see a detailed outline description [https://wiki.communitydata.science/CommunityData:Planning_document here] to help guide the planning process. If you scroll to the bottom, you'll see who to contact to get some good examples of planning documents. | ||
Also helpful in developing a research plan might be some of the readings in this course taught by Aaron to PhD students: [https://wiki.communitydata.science/Practice_of_scholarship_(Spring_2019) Practice of Scholarship (SP19)]. | Also helpful in developing a research plan might be some of the readings in this course taught by Aaron to PhD students: [https://wiki.communitydata.science/Practice_of_scholarship_(Spring_2019) Practice of Scholarship (SP19)]. | ||
Line 73: | Line 63: | ||
* [[CommunityData:reveal.js]] — Using RMarkdown to create reveal.js HTML presentations | * [[CommunityData:reveal.js]] — Using RMarkdown to create reveal.js HTML presentations | ||
== | == Misc. Resources == | ||
=== Technical === | === Technical === | ||
* [[CommunityData:Exporting from Python to R]] | * [[CommunityData:Exporting from Python to R]] | ||
Line 80: | Line 70: | ||
=== Non-technical === | === Non-technical === | ||
* [[CommunityData:Advice on writing a background section to an academic paper]] | * [[CommunityData:Advice on writing a background section to an academic paper]] | ||
* See some past and upcoming lab retreats [[CommunityData:Resources#Ongoing_and_Future_Meetings_and_Meetups | * See some past and upcoming lab retreats [https://wiki.communitydata.science/CommunityData:Resources#Ongoing_and_Future_Meetings_and_Meetups here].