CommunityData:Introduction to CDSC Resources: Difference between revisions

Revision as of 23:20, 7 May 2020

If you're new to the group, welcome!

This is an introduction to the many technical tools we use in our research work. It may be helpful to read before diving in and starting your research with the group. You can find any of the resources mentioned below on the Resources page, which generally lists more resources than this intro does.

To start, here's some common shorthand that members might use.

Communication Channels

We communicate on multiple channels.

You can contact specific members directly.
We chat much more frequently on IRC.
We use email lists to communicate things relevant to the entire group or subgroup, like upcoming events or circulating papers for feedback: CDSC - Email
For weekly meetings and other video calls, we use Jitsi. There are a lot of us, which can make calls a little hectic, so please keep in mind some Jitsi etiquette.
We also have a calendar of group-wide events, such as the retreats: CDSC Calendar.

Shared Resources

We maintain a large shared Zotero directory that is really helpful for finding relevant papers and makes collaboration smoother, since you can also see the papers and sources stored by collaborators. Please review the Zotero etiquette described in the "Adding and Organizing References" and "Tips and Tricks" sections of the Zotero page before using the shared folder.
We also have a Git repository with some shared resources (both technical and non-technical) on it:
- CommunityData:Git — Getting set up on the git server
- CommunityData:Code — List of software projects maintained by the collective.

Servers and Data Stuff

Much of our work is quantitative and involves large datasets. We have multiple computing resources:

Hyak is a supercomputer system that may or may not be relevant to your research project. For example, if you're running code on a huge dataset, you might want to use Hyak. You can learn more about it at:

Hyak, which also has various links to tutorials and documentation. We used to use Hyak-ikt but are transitioning to Hyak-mox; information about Hyak-mox can be found here.

If you want to get an account and get set up on Hyak, look at the Hyak Set-Up page:

CommunityData:Hyak setup

Kibo is a server we use, hosted at NU:

Kibo

Nada is used for backups.

CommunityData:Backups (nada) — Details on what is, and what isn't, backed up from nada.

When using servers, these pages might be helpful:

CommunityData:Tmux - You can use tmux (terminal multiplexer) to keep a persistent session on a server, even when you're not logged into the server. This is especially helpful when you ssh to a server and start a long-running job but can't stay logged in the whole time. Check out the tmux git repo or its Wikipedia page for more information.
CommunityData:Hyak Spark - Spark is a powerful tool for building programs that deal with large datasets.
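
As a rough illustration of the tmux workflow described above, here is a minimal Python sketch that builds the commands for starting a detached tmux session and for re-attaching to it later. The session name `longjob` and the job `python3 analysis.py` are made-up placeholders, not anything from the CDSC pages. The sketch only prints the commands; running them assumes tmux is installed on the server.

```python
import shlex


def tmux_new_session_cmd(session, command):
    """Build the argv that starts a detached tmux session running `command`."""
    # -d: do not attach to the new session; -s: name the session
    return ["tmux", "new-session", "-d", "-s", session, command]


def tmux_attach_cmd(session):
    """Build the argv that re-attaches to an existing session by name."""
    return ["tmux", "attach-session", "-t", session]


if __name__ == "__main__":
    # Hypothetical session name and job, for illustration only.
    start = tmux_new_session_cmd("longjob", "python3 analysis.py")
    attach = tmux_attach_cmd("longjob")
    # Print the shell-quoted commands instead of running them.
    print(shlex.join(start))
    print(shlex.join(attach))
    # On a server with tmux installed, the commands could be run with
    # subprocess.run(start, check=True) after importing subprocess.
```

The job keeps running inside the session after you log out. Running the second command after you ssh back in returns you to it.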

Re: Wiki Data

CommunityData:ORES - Using ORES with Wikipedia data
CommunityData:Wikia data — Documents information about how to get and validate wikia dumps.
CommunityData:Wikiq - Wikiq is a handy tool we use to process Wikipedia XML dumps. It outputs the dumps as TSV, which can then be processed easily with Spark.
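
To give a feel for what working with TSV output looks like, here is a minimal sketch that counts rows per value of one column. It uses Python's standard `csv` module rather than Spark, purely to stay self-contained. The column names (`revid`, `editor`) and the sample rows are made up for illustration and are not taken from Wikiq's actual output format.

```python
import csv
import io
from collections import Counter


def count_rows_by_column(tsv_file, column):
    """Count how many rows share each value of `column` in a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up sample data; real dumps will have different columns and far more rows.
SAMPLE_TSV = "revid\teditor\n1\talice\n2\tbob\n3\talice\n"

if __name__ == "__main__":
    counts = count_rows_by_column(io.StringIO(SAMPLE_TSV), "editor")
    for editor, n in counts.most_common():
        print(editor, n)
```

For a real file, you would pass an open file handle instead of the `io.StringIO` wrapper. For datasets too large for one machine, the same group-and-count idea is what you would express in Spark.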

Creating Documents and Presentations

Planning

You can develop a research plan in whatever way works best, but one thing that may be useful is the outline of a Matsuzaki-style planning document. You can see a detailed outline description here to help guide the planning process. If you scroll to the bottom, you'll see who to contact to get some good examples of planning documents.

Some of the readings in this course, taught by Aaron to PhD students, might also be helpful in developing a research plan: Practice of Scholarship (SP19).

Paper building

We typically write papers in LaTeX. One option is to use the web-based Overleaf. Another option, using CDSC TeX templates, is detailed here. These templates come with some assumptions about your workflow, which you can learn about here: CommunityData:Build papers.

If you're creating graphs and tables or formatting numbers in R that you want to put into a TeX document, you should look at the knitr package.

Some more specific things that might crop up in building the La/TeX document:

CommunityData:Embedding fonts in PDFs — ggplot2 creates PDFs with fonts that are not embedded which, in turn, causes the ACM to bounce our papers back. This page describes how to fix it.

Building presentation slides

Below are some options for creating presentation slides (though feel free to use whatever you are most comfortable with):

CommunityData:Beamer — Beamer is a LaTeX document class for creating presentation slides. This is a link to installing/using Mako's beamer templates.
- Like the CDSC TeX templates, these Beamer templates come with some assumptions about your workflow, which you can learn about here: CommunityData:Build papers.

CommunityData:reveal.js — Using RMarkdown to create reveal.js HTML presentations

@@ Line 1: / Line 1: @@
 If you're new to the group, welcome!
-This is an introduction to the various technical tools we use (as we use many) in our research work. It may be helpful to look at before diving into everything and starting your research with/in this group. You can find any of the resources mentioned below on the [https://wiki.communitydata.science/CommunityData:Resources Resources] page, (mostly) organized by alphabetical order for quick finding.
+This is an introduction to the various technical tools we use (as we use many) in our research work. It may be helpful to look at before diving into everything and starting your research with/in this group. You can find any of the resources mentioned below on the [https://wiki.communitydata.science/CommunityData:Resources Resources] page. The Resources page will generally list more resources than those listed in the intro here.
 To start, here's some [https://wiki.communitydata.science/CommunityData:Jargon common shorthand] that members might use.
 == Communication Channels ==
+We communicate on multiple channels.
 * One might contact specific [https://wiki.communitydata.science/People members] directly.
+* We communicate (chat) much more frequently on [https://wiki.communitydata.science/CommunityData:Resources#Chat_on_IRC IRC]
 * We use email lists to communicate things relevant to the ''entire'' group or subgroup, like upcoming events or circulating papers for feedback: [https://wiki.communitydata.science/CommunityData:Email CDSC - Email]
-* We communicate (chat) much more frequently on [https://wiki.communitydata.science/CommunityData:Resources#Chat_on_IRC IRC]
 * For weekly meetings and other (video)calls, we videocall using Jitsi. There are a lot of us, which can make calls a little hectic, so please keep in mind some [https://wiki.communitydata.science/CommunityData:Jitsi Jitsi etiquette].
-* We also have a calendar of group-wide events: [https://wiki.communitydata.science/Schedule CDSC Calendar].
+* We also have a calendar of group-wide events: [https://wiki.communitydata.science/Schedule CDSC Calendar], such as the retreats.
 == Shared Resources ==
-* We maintain a large shared [https://wiki.communitydata.science/CommunityData:Zotero Zotero] directory that is really helpful for finding relevant papers and smooths the process of collaboration (as one can see the papers and sources stored by collaborators as well).
+* We maintain a large shared [https://wiki.communitydata.science/CommunityData:Zotero Zotero] directory that is really helpful for finding relevant papers and smooths the process of collaboration (as one can see the papers and sources stored by collaborators as well). Please review the Zotero etiquette described on the "Adding and Organizing References" and "Tips and Tricks" sections of [https://wiki.communitydata.science/CommunityData:Zotero Zotero] before using the shared folder.
 * We also have a Git repository with some shared resources (both technical and non-technical) on it:
 ** [[CommunityData:Git]] — Getting set up on the git server
@@ Line 19: / Line 21: @@
 == Servers and Data Stuff ==
-Hyak is a supercomputer system that may or may not be relevant to your research project. For example, if you're running code on a huge dataset, you might want to use Hyak. You can learn more about it at:
+Much of our work is quantitative and involves large datasets. We have multiple computing resources:
-* [https://wiki.communitydata.science/CommunityData:Hyak Hyak], which has various links to tutorials/documentation as well.
+<b>Hyak</b> is a supercomputer system that may or may not be relevant to your research project. For example, if you're running code on a huge dataset, you might want to use Hyak. You can learn more about it at:
+* [https://wiki.communitydata.science/CommunityData:Hyak Hyak], which has various links to tutorials/documentation as well. We used to use Hyak-ikt but are transitioning to Hyak-mox, information about which can be found [[https://wiki.communitydata.science/CommunityData:Hyak-Mox here]].
 If you want to get an account and get set up on Hyak, look at the Hyak Set-Up page:
 * [[CommunityData:Hyak setup]]
-When using Hyak (or other servers), these pages might be helpful:
+<b>Kibo</b> is a server we use, hosted at NU:
-* [[CommunityData:Tmux]] — You can use tmux (terminal multiplexer) to keep a persistent session on a server. Check out the [https://github.com/tmux/tmux/wiki tmux git repo] or its [https://en.wikipedia.org/wiki/Tmux Wikipedia page] for more information about this.
+* [[https://wiki.communitydata.science/CommunityData:Kibo Kibo]]
-* [[CommunityData:Hyak Spark]] — Spark is a powerful tool that helps build programs dealing with large datasets.
-Nada is used for backups.
+<b>Nada</b> is used for backups.
 * [[CommunityData:Backups (nada)]] — Details on what is, and what isn't, backed up from nada.
+When using servers, these pages might be helpful:
+* [[CommunityData:Tmux]] — You can use tmux (terminal multiplexer) to keep a persistent session on a server, even if you're not logged into the server. This is especially helpful when you ssh to a server and then run a job that runs for quite a while and then you can't stay logged in the whole time. Check out the [https://github.com/tmux/tmux/wiki tmux git repo] or its [https://en.wikipedia.org/wiki/Tmux Wikipedia page] for more information about this.
+* [[CommunityData:Hyak Spark]] — Spark is a powerful tool that helps build programs dealing with large datasets.
 === Re: Wiki Data ===
 * [[CommunityData:ORES]] - Using ORES with wikipedia data
 * [[CommunityData:Wikia data]] — Documents information about how to get and validate wikia dumps.
+* [[CommunityData:Wikiq]] - Wikiq is a handy tool we use to process Wikipedia XML dumps, outputting dumps as tsv (which can then be easily processed by the very powerful Spark).
 == Creating Documents and Presentations ==
@@ Line 40: / Line 48: @@
 You can develop a research plan in whatever way works best, but one thing that may be useful is the outline of a Matsuzaki-style planning documents. You can see a detailed outline description [https://wiki.communitydata.science/CommunityData:Planning_document here] to help guide the planning process. If you scroll to the bottom, you'll see who to contact to get some good examples of planning documents.
-Also helpful in developing a research plan might be some of the readings in this course taught by Aaron: [https://wiki.communitydata.science/Practice_of_scholarship_(Spring_2019) Practice of Scholarship (SP19)].
+Also helpful in developing a research plan might be some of the readings in this course taught by Aaron to PhD students: [https://wiki.communitydata.science/Practice_of_scholarship_(Spring_2019) Practice of Scholarship (SP19)].
 === Paper building ===