CommunityData:Dataverse
The Harvard Dataverse is an archive for datasets and code that is hosted by Harvard but available to anybody. The Community Data Science Collective (CDSC) Dataverse is a portal within the Harvard Dataverse that is for CDSC projects and that is managed by our team.
How should I add things to the CDSC Dataverse?
If you have not done so before, begin by doing the following:
- Create an account. You can use your institutional login or create a new account with a username/email and password.
- Ask an existing administrator to make you a member/administrator of the CDSC Dataverse (everybody in the group should be an admin, so it's best to ask on IRC).
You now need to choose between two options: (1) The first is to create your dataset within the CDSC Dataverse. This is usually best because anyone in the group can manage it. (2) The second is to create it outside of the CDSC Dataverse but to "link" it. You will typically do this when there is access-restricted data that should not be available to everyone in the group.
If you want to create a dataset in the CDSC Dataverse, create your replication package or dataset release as follows:
- Go to the CDSC Dataverse Page
- Click "+ Add Data" → "New Dataset"
- Make sure that "Host Dataverse" says "Community Data Science Collective Dataverse"
- Upload and fill out metadata fields (minimally, include a README.txt file to explain how to use your data and code)
- Publish/release!
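As a rough illustration of the README.txt mentioned in the steps above, here is a small Python sketch that writes a skeleton file you can fill in before uploading. The section headings and the `write_readme_skeleton` helper are assumptions made up for this example; neither Dataverse nor CDSC requires this particular layout.

```python
from pathlib import Path

# Hypothetical section headings; adapt them to your own project.
README_SECTIONS = [
    "Overview",
    "Contents of this package",
    "Software requirements",
    "How to run the code",
    "Data sources",
]


def write_readme_skeleton(package_dir, title):
    """Write a README.txt skeleton into package_dir and return its path."""
    lines = [title, "=" * len(title), ""]
    for section in README_SECTIONS:
        # Each section starts as a TODO placeholder to be filled in by hand.
        lines += [section, "-" * len(section), "TODO", ""]
    readme_path = Path(package_dir) / "README.txt"
    readme_path.write_text("\n".join(lines), encoding="utf-8")
    return readme_path
```

Calling `write_readme_skeleton("my_package", "Replication package for my paper")` would create `my_package/README.txt`, assuming the `my_package` directory already exists.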
If you want to create your dataset outside the CDSC Dataverse but have it listed there, you will need to:
- Go to the Main Harvard Dataverse Page
- Click "+ Add Data" → "New Dataset"
- Make sure that "Host Dataverse" says "Harvard Dataverse"
- Upload and fill out metadata fields (minimally, include a README.txt file to explain how to use your data and code)
- Publish/release!
- Click the "Link Dataset" button on your dataset page and then type/select the Community Data Science Collective Dataverse.
Finally, if you have already created a dataset and want it moved into the CDSC Dataverse, you will need to click the Support button at the top of each page and write a message asking the Dataverse support team to move it for you. They usually do this very quickly.
An open science workflow using Dataverse
There are many ways to follow open science practices. One way to fit the CDSC Dataverse into your open science workflow is as follows:
Step 1: Anonymous while under review
Some publication venues ask for an anonymized release of code and data. This is easy to do without breaking double-blind anonymity. Generate a code and data package that doesn't include information that will identify you. When uploading, do not fill out metadata fields with authorship information, delete any places where your name is autofilled, and do not release (publish) your archive. Once your files are uploaded, there's an option under 'Edit Dataset' to 'Generate Private URL' (see details in the user guide). This creates a blue box at the top of your archive which reads "Unpublished Dataset Private URL – Privately share this dataset before it is published:" followed by a link. That's the link to share with your reviewers. Test the link in another browser to be sure that it doesn't reveal anything.
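Before uploading an anonymized package, it can help to search it for strings that would identify you. The following Python sketch is one possible way to do that; the `find_identifying_strings` helper and the example identifiers are hypothetical, and it only looks inside files that decode as UTF-8 text, so binary files (PDFs, images, .RData files, and so on) and file names still need a manual check.

```python
from pathlib import Path


def find_identifying_strings(package_dir, identifiers):
    """Return (relative path, line number, identifier) for each match found.

    Matching is case-insensitive. Files that cannot be decoded as UTF-8
    text are skipped, so binary files still need a manual check.
    """
    hits = []
    root = Path(package_dir)
    needles = [s.lower() for s in identifiers]
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # not readable as text; check it by hand
        for lineno, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            for original, needle in zip(identifiers, needles):
                if needle in lowered:
                    rel = path.relative_to(root).as_posix()
                    hits.append((rel, lineno, original))
    return hits
```

For example, `find_identifying_strings("my_package", ["Jane Doe", "jdoe@example.edu"])` returns a list of places to clean up, and an empty list means none of the given strings were found in the text files.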
Step 2: Identified after acceptance
You might like to include a link to your Dataverse archive in your paper; you might also want to add it to your accepted preprint before uploading the paper to arXiv. Fill out as many metadata fields as you find useful (authors, description, subject, keywords), ask a colleague to take a look at your archive, and then release it.
Step 3: Updated after publication
After your paper is published and the DOI goes live, why not add this information into your archive so that others can find it (the 'Related Publication' metadata field)?
Potential questions and problems
Oh no, I made an error in my archive!
After an archive is released, you can make updates. But if you've realized that the previous version is sufficiently bad that you don't want it to be findable, the archive needs to be deleted or 'deaccessioned'.
What's this message about my data format and 'tabular ingest failed'?
Dataverse wants to be able to present your data in tabular form for people to view live without downloading, and is having trouble parsing what you uploaded. You can reformat, or you can ignore this error.
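If you decide to reformat, one thing worth checking is whether every row of a delimited file has the same number of fields as the header. The Python sketch below does only that. It is an assumption that ragged rows are the cause in your case, since Dataverse's ingest can fail for other reasons, and `find_ragged_rows` is a hypothetical helper, not part of Dataverse.

```python
import csv


def find_ragged_rows(csv_path, delimiter=","):
    """Return (row number, field count) for rows whose width differs from the header.

    Row numbers count parsed rows starting at 1 for the header, so they can
    differ from line numbers if quoted fields contain line breaks.
    """
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            return problems  # empty file: nothing to compare
        for row_number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append((row_number, len(row)))
    return problems
```

An empty result only means the file is rectangular; it does not guarantee that the ingest will succeed.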
My replication package has a main directory and subdirectories -- how do I represent this?
Dataverse assumes everything is in the root. If you have subdirectories, upload the files from those subdirectories and then specify each file's path using the UI, which only shows up after you do the upload.
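Since the file paths have to be entered after the upload, it may be handy to have a list of which path belongs to which file. This Python sketch prints nothing and uploads nothing; it just walks a local package directory and pairs each file name with its directory path relative to the package root. The `list_files_with_paths` helper is a made-up name for this example.

```python
from pathlib import Path


def list_files_with_paths(package_dir):
    """Return (file name, directory path) pairs for every file under package_dir.

    Files in the top-level directory get an empty directory path.
    Results are ordered by each file's path relative to package_dir.
    """
    root = Path(package_dir)
    relative_files = sorted(
        (p.relative_to(root) for p in root.rglob("*") if p.is_file()),
        key=lambda rel: rel.as_posix(),
    )
    entries = []
    for rel in relative_files:
        parent = rel.parent.as_posix()
        # Path("file.txt").parent is ".", which we report as the root.
        entries.append((rel.name, "" if parent == "." else parent))
    return entries
```

You can then loop over the result, for example `for name, directory in list_files_with_paths("my_package"): print(name, directory)`, and use it as a checklist while filling in the file path fields.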