CommunityData:Introduction to CDSC Resources: Difference between revisions

Latest revision as of 19:46, 4 May 2025

If you're new to the group, welcome!

This is an introduction to some of the tools we use (and we use many!) in our research work. It may be helpful to look at before diving into everything and starting your research with/in this group. You can find additional information on the resources mentioned below on the Resources page. The Resources page will generally list more resources and details than those listed here.

To start, here's some common shorthand that members might use. It's a little outdated but has some acronyms, names of things, etc. that might pop up in conversation.

Usually, we try to walk new group members through some of this in an orientation session shortly after they get started. There's a recording of one of these sessions from June 29, 2020 online here. It's hosted in Mako's cdsc_only video repository, so there's a username and password, but you can ask anybody in the group and they should be able to get it by searching their email for "cdsc_only".

Once you take a look through this, you might proceed to the onboarding checklist to find a set of tasks you should do to make sure you've got some key things set up.

Basics and routines

The group is always evolving, but here is some current basic information and routines.

Lab structure: We are a research lab spread out across several institutions. The biggest groups of us are based at Northwestern University, Purdue University, and the University of Washington. You can read about the people in the group on our People page. This includes information about current-ish affiliations, interests, and more.
Weekly lab meetings and workshops (local/global meetings): We have ~30-minute lab meetings every week. These alternate between local (within-university) and global meetings. The lab meetings are remote by default via Jitsi and mostly involve collective coordination, announcements (e.g., upcoming deadlines), etc. We also have hour-long workshop sessions every week. These typically involve discussion and feedback on a presentation or work-in-progress, but we sometimes use this time for other things, like a guest or a "lab date" with another lab. If you are joining the group, you should talk with someone like a supervisor or faculty PI about what meetings you ought to attend. In general, we expect current students to participate in lab meetings and workshops. This includes bringing something to be workshopped a couple of times per year.
Yearly (or slightly more often) retreats/meetups: We have been gathering annually, lately in ~September, for in-person meetings. These typically involve workshop sessions, social activities, and often a public-facing portion like a poster session or symposium.
Summer reading group: We have an optional summer reading group, christened the Sociotechnocanonicon.

Communication Channels

We communicate on multiple channels!

We communicate (chat) frequently on Matrix
We use email lists to communicate things relevant to the entire group or subgroup, like upcoming events or circulating papers for feedback: CDSC - Email
One can also contact specific members directly.
For weekly meetings and other (video)calls, we typically use Jitsi. There are a lot of us, which can make calls a little hectic, so please keep in mind some Jitsi etiquette.
We also have a calendar of group-wide events: CDSC Calendar, such as the retreats.

We also have some public facing channels:

We have a variety of social media accounts, including the Community Data Science blog, Twitter, YouTube, Mastodon, and so on. That page has details about how to get accounts.
Lately, we have been asking everyone to post on the blog on a rotating basis. As a new member of the group, you should write a blog post introducing yourself! The blog post schedule is here.

Collaboration tools

This wiki: The CDSC Wiki includes group resources, as well as things like research project pages and course websites. It is highly recommended that you create an account and then reach out to someone else in the group to make you an admin. This will help you to avoid having your edits reverted.
Bibliographic references: We maintain a large shared Zotero directory that is really helpful for finding relevant papers and smooths the process of collaboration (as one can see the papers and sources stored by collaborators as well). Please review the Zotero etiquette described on the "Adding and Organizing References" and "Tips and Tricks" sections of Zotero before using the shared folder.
LaTeX authoring: Many of us work on papers and presentations together in Overleaf. See additional info about this below. You can get a free account to join a project or two and use the basic functionalities of Overleaf. More sustained use of more features probably means you should join the CDSC account or another paid account. We don't have a CDSC Overleaf info page (yet). If you think you need to join the CDSC group account, contact Aaron about that.
Meeting Poll Tools: We use When Is Good for a lot of our meeting polls. Here are some tips, tricks and norms about filling out meeting polls.
Version control: We also have a Git repository with some shared resources (both technical and non-technical) on it:
- Git repositories: CommunityData:Git — How to get set up on the git server to create, clone, work on/in shared git repositories we maintain.
- Software projects: CommunityData:Code — List of software projects maintained by the collective.

Computation: Servers, data, and more

Much of our work is pretty computational/quantitative and involves large datasets. We have multiple computing resources and servers.

Hyak: Hyak is a supercomputer system that is hosted at UW but that the whole group uses for conducting statistical analysis and data processing. Hyak is necessary if you need large amounts of storage (e.g., tens of terabytes) or large amounts of computational resources (e.g., CPU time, memory, etc.). Servers in Hyak do not have direct access to the Internet (except for 'build' machines). That means that Hyak is not useful for collecting data from APIs, etc. Access requires a UW NetID, but one will be sponsored for you. You can learn more about it at CommunityData:Hyak, which has various links to tutorials/documentation as well.

In order to use Hyak, you need to get an account set up. This is documented on CommunityData:Hyak setup.

Kibo: Kibo is a server we use for research hosted at Northwestern that came online in 2018-2019. Kibo is only a single machine, but it is very powerful and is connected to the Internet. It has several dozen terabytes of space, a large amount of memory, and many CPUs. We use it primarily for (a) data collection from APIs and (b) publication of large datasets like the data from the CDSC COVID-19 Digital Observatory. Access requires an NU NetID, but one will be sponsored for you. More details are on CommunityData:Kibo.

Nada: Nada is a server at UW that is used primarily for infrastructure. It runs the blogs, mailing lists, git repositories, and so on. We back up all of Nada, and these backups can be very expensive. Before you download or use data on Nada, please read the page CommunityData:Backups (nada), which provides details on what is, and what isn't, backed up from Nada.

Asha: Asha is a server at UW that is used for storing and analyzing Scratch data. Only people on the IRB protocol for Scratch have access.

When using servers, these pages might be helpful:

CommunityData:Tmux — You can use tmux (terminal multiplexer) to keep a persistent session on a server, even if you're not logged into the server. This is especially helpful when you ssh to a server and start a long-running job but can't stay logged in the whole time. Check out the tmux git repo or its Wikipedia page for more information about this.
CommunityData:Hyak Spark — Spark is a powerful tool that helps build programs dealing with large datasets. It's great for Wikimedia and Reddit data dumps.

Wiki Data in particular

Multiple people in the group work on large datasets gathered from Wikipedia, Wikia (Fandom), or other projects running MediaWiki software. We have some specific resources and tools for these kinds of data.

CommunityData:ORES - Using ORES with Wikipedia data
CommunityData:Wikia data — How to get and validate wikia dumps.
CommunityData:Wikiq - Processing MediaWiki XML dumps, outputting parsed dumps as tsv (which can then be processed by the very powerful Spark).
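
As a rough illustration of what working with tab-separated output can look like before reaching for Spark, here is a minimal Python sketch that counts revisions per page. The sample data and the column names (title, revid, editor) are hypothetical stand-ins chosen for this example; they are not Wikiq's actual output schema, so check the Wikiq page for the real columns.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a parsed dump. The column names are
# illustrative assumptions, not Wikiq's actual output schema.
SAMPLE_TSV = (
    "title\trevid\teditor\n"
    "Main Page\t1\talice\n"
    "Main Page\t2\tbob\n"
    "Sandbox\t3\talice\n"
)


def revisions_per_page(tsv_file):
    """Count rows (one per revision) for each page title in a TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["title"] for row in reader)


# io.StringIO lets the sample string be read like an open file.
counts = revisions_per_page(io.StringIO(SAMPLE_TSV))
for title, n in counts.most_common():
    print(title, n)
```

For a real file, you would pass an open file handle instead of the in-memory sample. For dumps too large for one machine's memory, Spark is likely the better fit.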

Creating Documents and Presentations

Planning

You can develop a research plan in whatever way works best, but one thing that may be useful is the outline of the Matsuzaki-style planning document and the qualitative planning document. You can see a detailed outline description here to help guide the planning process. If you scroll to the bottom, you'll see who to contact to get some good examples of planning documents.

Also helpful in developing a research plan might be some of the readings in this course taught by Aaron to PhD students: Practice of Scholarship (SP19).

Paper building

We typically write LaTeX documents when writing papers. One option to do this is to use the web-based Overleaf. Another option, using CDSC TeX templates, is detailed here. These come with some assumptions about your workflow, which you can learn about here: CommunityData:Build papers.

If you're creating graphs and tables or formatting numbers in R that you want to put into a TeX document, you should look at the knitr package.

Some more specific things that might crop up in building the La/TeX document:

CommunityData:Embedding fonts in PDFs — ggplot2 creates PDFs with fonts that are not embedded which, in turn, causes the ACM to bounce our papers back. This page describes how to fix it.

Building presentation slides

Below are some options for creating presentation slides (though feel free to use whatever you are most comfortable with):

CommunityData:Beamer — Beamer is a LaTeX document class for creating presentation slides. This is a link to installing/using Mako's beamer templates.
- Again, like the CDSC TeX templates, these Beamer templates also come with some assumptions about your workflow, which you can learn about here (again): CommunityData:Build papers.

CommunityData:reveal.js — Using RMarkdown to create reveal.js HTML presentations

@@ Line 1: / Line 1: @@
 If you're new to the group, welcome!
-This is an introduction to the various technical tools we use (as we use many) in our research work. It may be helpful to look at before diving into everything and starting your research with/in this group. You can find any of the resources mentioned below on the [https://wiki.communitydata.science/CommunityData:Resources Resources] page. The Resources page will generally list more resources than those listed in the intro here.
+This is an introduction to some of the tools we use (and we use many!) in our research work. It may be helpful to look at before diving into everything and starting your research with/in this group. You can find additional information on the resources mentioned below on the [[CommunityData:Resources|Resources]] page. The Resources page will generally list more resources and details than those listed here.
-To start, here's some [https://wiki.communitydata.science/CommunityData:Jargon common shorthand] that members might use.
+To start, here's some [https://wiki.communitydata.science/CommunityData:Jargon common shorthand] that members might use. It's a little outdated but has some acronyms, names of things, etc. that might pop up in conversation.
+Usually, we try to walk new group members through some of this in an orientation session shortly after they get started. There's [https://communitydata.science/~mako/cdsc_only/jitsi-onboarding_session-20200629.mp4 a recording of one of these sessions from June 29 2020 online here]. It's hosted in [[Mako's]] <code>cdsc_only</code> video repository so there's a username and and password but you can ask anybody in the group and they should be able to get it by searching their email for "cdsc_only".
+Once you take a look through this, you might proceed to the [[CommunityData:Onboarding_Checklist|onboarding checklist]] to find a set of tasks you should do to make sure you've got some key things setup.
+== Basics and routines ==
+The group is always evolving, but here is some current basic information and routines.
+;Lab structure: We are a research lab spread out across several institutions. The biggest groups of us are based at Northwestern University, Purdue University, and the University of Washington.
+:You can read about the people in the group on our [[People]] page. This includes information about current-ish affiliations, interests, and more.
+; Weekly lab meetings and workshops (local/global meetings): We have ~30 minute lab meetings every week. These alternate between local (within-university) and global meetings. The lab meetings are default remote via Jitsi and mostly involve collective coordination, announcements (e.g., upcoming deadlines), etc.
+:We also have hour-long workshop sessions every week. These typically involve discussion and feedback on a presentation or work-in-progress, but we sometimes use this time for other things, like a guest or a "lab date" with another lab.
+:If you are joining the group you should talk with someone like a supervisor or faculty PI about what meetings you ought to attend. In general, we expect current students to participate in lab meetings and workshops. This includes bringing something to be workshopped a couple of times per year.
+;Yearly (or slightly more often) retreats/meetups: We have been gathering annually, lately in ~September, for in-person meetings. These typically involve workshop sessions, social activities, and often a public-facing portion like a poster session or symposium.
+;Summer reading group: We have an optional summer reading group, christened the [[Sociotechnocanonicon]]
 == Communication Channels ==
-We communicate on multiple channels.
+We communicate on multiple channels!
+* We communicate (chat) frequently on [[CommunityData:Matrix|Matrix]]
+* We use email lists to communicate things relevant to the ''entire'' group or subgroup, like upcoming events or circulating papers for feedback: [[CommunityData:Email|CDSC - Email]]
+* One can also contact specific [[People|members]] directly.
+* For weekly meetings and other (video)calls, we typically use Jitsi. There are a lot of us, which can make calls a little hectic, so please keep in mind some [[CommunityData:Jitsi|Jitsi etiquette]].
+* We also have a calendar of group-wide events: [[Schedule|CDSC Calendar]], such as the retreats.
+We also have some public facing channels:
-* One might contact specific [https://wiki.communitydata.science/People members] directly.
+* We have a variety of [[CommunityData:Blog and social media| various social media accounts]] including [https://blog.communitydata.science the Community Data Science blog], Twitter, Youtube, Mastodon, and so on. That page has details about to get accounts.
-* We communicate (chat) much more frequently on [https://wiki.communitydata.science/CommunityData:Resources#Chat_on_IRC IRC]
+* Lately, we have been asking everyone to post on the blog in a sort of rotating basis. As a new member of the group, you should write a blog post introducing yourself! [[CommunityData:Blog_post_schedule|The blog post schedule is here]].
-* We use email lists to communicate things relevant to the ''entire'' group or subgroup, like upcoming events or circulating papers for feedback: [https://wiki.communitydata.science/CommunityData:Email CDSC - Email]
-* For weekly meetings and other (video)calls, we videocall using Jitsi. There are a lot of us, which can make calls a little hectic, so please keep in mind some [https://wiki.communitydata.science/CommunityData:Jitsi Jitsi etiquette].
-* We also have a calendar of group-wide events: [https://wiki.communitydata.science/Schedule CDSC Calendar], such as the retreats.
-== Shared Resources ==
+== Collaboration tools ==
-* We maintain a large shared [https://wiki.communitydata.science/CommunityData:Zotero Zotero] directory that is really helpful for finding relevant papers and smooths the process of collaboration (as one can see the papers and sources stored by collaborators as well). Please review the Zotero etiquette described on the "Adding and Organizing References" and "Tips and Tricks" sections of [https://wiki.communitydata.science/CommunityData:Zotero Zotero] before using the shared folder.
+* ''[[CommunityData:Wiki|This wiki]]'': The CDSC Wiki includes group resources, as well as things like research project pages and course websites. It is highly recommended that you create an account and then reach out to someone else in the group to make you an admin. This will help you to avoid having your edits reverted.
-* We also have a Git repository with some shared resources (both technical and non-technical) on it:
+* ''Bibliographic references'': We maintain a large shared [[CommunityData:Zotero|Zotero]] directory that is really helpful for finding relevant papers and smooths the process of collaboration (as one can see the papers and sources stored by collaborators as well). Please review the Zotero etiquette described on the "[[CommunityData:Zotero#Adding_and_Organizing_References|Adding and Organizing References]]" and "[[CommunityData:Zotero#Tips_and_Tricks|Tips and Tricks]]" sections of [[CommunityData:Zotero|Zotero]] before using the shared folder.
-** [[CommunityData:Git]] — Getting set up on the git server
+* ''LaTeX authoring'': Many of us work on papers and presentations together in [https://overleaf.com Overleaf]. See additional info about this [[CommunityData:Introduction_to_CDSC_Resources#Creating_Documents_and_Presentations|below]]. You can get a free account to join a project or two and use the basic functionalities of Overleaf. More sustained use of more features probably means you should join the cdsc account or another paid account. We don't have a CDSC overleaf info page (yet). if you think you need to join the CDSC group account, contact Aaron about that.
-** [[CommunityData:Code]] — List of software projects maintained by the collective.
+* ''Meeting Poll Tools'': We use [http://whenisgood.net When Is Good] for a lot of our meeting polls. Here are some [[CommunityData:HowToWhenToMeet|tips, tricks and norms about filling out meeting polls]]
+* ''Version control'': We also have a Git repository with some shared resources (both technical and non-technical) on it:
+** ''Git repositories'': [[CommunityData:Git]] — How to get set up on the git server to create, clone, work on/in shared git repositories we maintain.
+** ''Software projects'': [[CommunityData:Code]] — List of software projects maintained by the collective.
-== Servers and Data Stuff ==
+== Computation: Servers, data, and more ==
-Much of our work is quantitative and involves large datasets. We have multiple computing resources and servers. For any given project, you might not need it eventually.
+Much of our work is pretty computational/quantitative and involves large datasets. We have multiple computing resources and servers.
-;Hyak: Hyak is a supercomputer system that is hosted at UW but that the whole group uses for conducting statistical analysis and data processing. Hyak is necessary if you need large amounts of storage (e.g., tens of terabytes) or if you need large amount of computational resources (e.g., CPU time, memory, etc). ''Severs in Hyak do not direct access to the Internet.'' That means that Hyak is not useful for collecting data from APIs, etc. Access requires a UW NetID but they will be sponsored for you. You can learn more about it at: [[CommunityData:Hyak]] which has various links to tutorials/documentation as well.
+;Hyak: Hyak is a supercomputer system that is hosted at UW but that the whole group uses for conducting statistical analysis and data processing. Hyak is necessary if you need large amounts of storage (e.g., tens of terabytes) or if you need large amount of computational resources (e.g., CPU time, memory, etc). ''Servers in Hyak do not have direct access to the Internet.'' (except for 'build' machines). That means that Hyak is not useful for collecting data from APIs, etc. Access requires a UW NetID but they will be sponsored for you. You can learn more about it at: [[CommunityData:Hyak]] which has various links to tutorials/documentation as well.
 :In order to use Hyak, you need to get an account setup. This is documented on [[CommunityData:Hyak setup]].
@@ Line 30: / Line 54: @@
 ;Nada: Nada is a sever at UW that is used primarily for infrastructure. It runs the blogs, mailing lists, git repositories and so on. We backup all of nada and these backups can be very expensive. Before you download or use data on Nada, please read the page [[CommunityData:Backups (nada)]] which provide details on what is, and what isn't, backed up from nada.
+;Asha: Asha is a server at UW that is used for storing and analyzing Scratch data. Only people on the IRB protocol for Scratch are online.
 When using servers, these pages might be helpful:
 * [[CommunityData:Tmux]] — You can use tmux (terminal multiplexer) to keep a persistent session on a server, even if you're not logged into the server. This is especially helpful when you ssh to a server and then run a job that runs for quite a while and then you can't stay logged in the whole time. Check out the [https://github.com/tmux/tmux/wiki tmux git repo] or its [https://en.wikipedia.org/wiki/Tmux Wikipedia page] for more information about this.
-* [[CommunityData:Hyak Spark]] — Spark is a powerful tool that helps build programs dealing with large datasets.
+* [[CommunityData:Hyak Spark]] — Spark is a powerful tool that helps build programs dealing with large datasets. It's great for Wikimedia and Reddit data dumps.
-=== Re: Wiki Data ===
+=== Wiki Data in particular===
-* [[CommunityData:ORES]] - Using ORES with wikipedia data
+Multiple people in the group work on large datasets gathered from Wikipedia, Wikia (Fandom), or other projects running MediaWiki software. We have some specific resources and tools for these kinds of data
-* [[CommunityData:Wikia data]] — Documents information about how to get and validate wikia dumps.
+* [[CommunityData:ORES]] - Using ORES with Wikipedia data
-* [[CommunityData:Wikiq]] - Wikiq is a handy tool we use to process Wikipedia XML dumps, outputting dumps as tsv (which can then be easily processed by the very powerful Spark).
+* [[CommunityData:Wikia data]] — How to get and validate wikia dumps.
+* [[CommunityData:Wikiq]] - Processing MediaWiki XML dumps, outputting parsed dumps as tsv (which can then be processed by the very powerful Spark).
 == Creating Documents and Presentations ==
 === Planning ===
-You can develop a research plan in whatever way works best, but one thing that may be useful is the outline of a Matsuzaki-style planning documents. You can see a detailed outline description [https://wiki.communitydata.science/CommunityData:Planning_document here] to help guide the planning process. If you scroll to the bottom, you'll see who to contact to get some good examples of planning documents.
+You can develop a research plan in whatever way works best, but one thing that may be useful is the outline of a [[CommunityData:Planning document |Matsuzaki-style planning documents]] and the [[CommunityData:Qualitative planning document|qualitative planning document]]. You can see a detailed outline description [https://wiki.communitydata.science/CommunityData:Planning_document here] to help guide the planning process. If you scroll to the bottom, you'll see who to contact to get some good examples of planning documents.
 Also helpful in developing a research plan might be some of the readings in this course taught by Aaron to PhD students: [https://wiki.communitydata.science/Practice_of_scholarship_(Spring_2019) Practice of Scholarship (SP19)].
@@ Line 61: / Line 88: @@
 * [[CommunityData:reveal.js]] — Using RMarkdown to create reveal.js HTML presentations
-== Misc. Resources ==
+== A few additional resources ==
 === Technical ===
 * [[CommunityData:Exporting from Python to R]]
@@ Line 68: / Line 95: @@
 === Non-technical ===
 * [[CommunityData:Advice on writing a background section to an academic paper]]
-* See some past and upcoming lab retreats [[https://wiki.communitydata.science/CommunityData:Resources#Ongoing_and_Future_Meetings_and_Meetups here]].
+* See some past and upcoming lab retreats [[CommunityData:Resources#Ongoing_and_Future_Meetings_and_Meetups|here]].