Data and Donuts
How to write a data management plan
Script

Slide 1: Hi, and welcome to Data and Donuts. I'm Tobin Magle, the data management specialist at the Morgan Library at Colorado State University. This lecture will cover how to write a successful data management plan.

Slide 2: Before we start, let's define what research data is. According to the White House Office of Management and Budget, it is the "recorded factual material commonly accepted in the scientific community as necessary to validate research findings". In practice, you can use the same strategies to manage any digital product of your research.

Slide 3: A data management plan is a description of how you plan to describe, preserve, and share your research data. These plans are often required for grant applications.

Slide 4: A good place to look for funding agency requirements is DMPTool, an online tool for writing data management plans. You can review funder requirements at the link listed here. You can also create an account to use DMP templates and search public DMPs.

Slide 5: While the specifics vary by funding agency, successful data management plans include:
* A data inventory
* A strategy for describing the data
* A plan for preserving the data
* A method for others to access the data
Always make sure to check your funding agency's requirements.

Slide 6: Let's start with the contents of the data inventory section. This section should include:
* The type of data you will collect
* The file types that will be produced
* The size and number of the files
* What other research products will be produced, such as code and templates that might be useful for other researchers

Slide 7: I work on a data curation project with a research lab that studies microRNAs in mice. In this genomics example:
* The type of data they're collecting is miRNA sequences.
* The file type is FASTQ files.
* There will be 1 GB files for 64 strains in replicates of 3, so a total of roughly 200 GB, depending on the number of reads per file.
* We will also be producing code, R packages, and tutorials for how to use the data.

Slide 8: While building your data inventory, it's important to consider what file types you store your data in. This matters for ensuring that the data remain accessible into the future. For example, you could store tabular data in Excel, but what happens if Microsoft changes the file format so that newer versions of the software no longer open older files? This is a problem with all proprietary formats. If you can, convert your data into non-proprietary formats like plain text or .csv files. The same concept applies to more complicated data like images, audio, and movies, which can be stored as .tiff, .mp3, or .mp4 files respectively.

Slide 9: Let's take a moment to write a data inventory. Open a text editor, and describe:
1. The kind of data you'll collect
2. What file types you'll produce
3. How big these files will be
4. Any other research outputs you will produce

------------------------------------------------------------
Recording Break
------------------------------------------------------------

Slide 10: The next aspect of data management that you should consider is how you will describe your data. This description is commonly referred to as metadata, or "data about data". At minimum, your metadata should include contact information for the researchers who produced the data, a description of how the data were collected, when and where the data were collected, and the units being measured. The format can be as simple as a text file.
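To make this concrete, here is a minimal sketch of writing such a metadata text file in Python. Every field name and value below is a hypothetical placeholder for illustration, not a required format:

```python
# A minimal sketch of a plain-text metadata (readme) file.
# All field values here are hypothetical placeholders.
metadata = {
    "Contact": "Jane Researcher <jane.researcher@example.edu>",
    "Collection method": "brief description of instruments and protocols",
    "When and where collected": "2016, Fort Collins, CO",
    "Units measured": "e.g., read counts per sample",
}

with open("README.txt", "w") as f:
    f.write("Metadata for: example dataset\n")
    for field, value in metadata.items():
        f.write(f"{field}: {value}\n")
```

A plain-text file like this can travel alongside the data files themselves, so the description never gets separated from the data.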
Slide 11: For the genomics example, I created a readme file that describes:
* What the data are (mouse miRNA sequences)
* How they were produced (RNA extracted from mouse brains, then sequenced)
* Links to where the data can be found
* Contact information

Slide 12: However, if you want your data to be integrated with other similar data, it's useful to use metadata standards. Dublin Core is a generic metadata standard that covers the elements listed on the last slide. Many disciplines have field-specific metadata standards, like EML for ecology or MIAME for microarray experiments. If you're not sure what the common metadata standard is for your field, you can use the Digital Curation Centre (or Biosharing, if you're in the biosciences) to find metadata standards.

Slide 13: Another way to decide on a metadata standard is to look at the requirements of the repository your data will end up in. For the genomics example, I downloaded a template specific to my data type from NCBI and filled out the table. Each column is a variable, like organism, age, etc., and each row is an observation from a single sample. Some fields were required, some weren't, but I included as much information as I could. The important part is coming up with a plan and being consistent throughout the project.

Slide 14: Let's take a moment to decide on a metadata standard and add it to the text file with your data inventory. Important questions to ask yourself:
* What do people need to know to reuse your data? Like I said, even if the data will never be shared widely, this is a good head space to be in when deciding how to describe them.
* Are there any standards that are common in your field?
* What format will the metadata be in?
* What metadata fields will you include?
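The one-row-per-sample table described above can be sketched with Python's standard csv module. The column names here (organism, strain, age_weeks, tissue) are illustrative placeholders, not the actual fields of an NCBI submission template:

```python
import csv

# Sketch of a sample-metadata table: each column is a variable,
# each row is an observation from a single sample.
# Field names are illustrative only, not a real repository template.
fields = ["sample_id", "organism", "strain", "age_weeks", "tissue"]
samples = [
    {"sample_id": "S1", "organism": "Mus musculus",
     "strain": "A", "age_weeks": 8, "tissue": "brain"},
    {"sample_id": "S2", "organism": "Mus musculus",
     "strain": "B", "age_weeks": 8, "tissue": "brain"},
]

with open("sample_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(samples)
```

Writing the table as .csv rather than a spreadsheet also follows the non-proprietary-format advice from Slide 8.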
------------------------------------------------------------
Recording Break
------------------------------------------------------------

Slide 15: Now that you know what data you will have and how to describe it, you need to think about how you will provide for its safety over the long term:
* What are you going to do to ensure the data are stored properly and preserved?
* What metadata or other products need to go with the data?
* Also note that your data preservation techniques will change over the course of the project, so account for all stages. We'll talk about this a little bit more later.

Slide 16: When thinking about preservation, you should ask yourself:
* What will you store?
* Who will be in charge? This is especially important when backing up manually.
* How long will you store it? This is often determined by the granting agency; a rule of thumb is 5-10 years.
* Where will you store it? For example, during the project, you'll probably have your data stored on your machine and backed up on a departmental server or in the cloud. After the project is done, it might be easier to move the data to a repository or to a less accessible central IT server, both to make room for new data and so you don't have to worry about remembering your own backups.

Slide 17: When thinking about backing up your data, here are some things to consider:
* Storing copies in geographically distinct locations can be important in the case of natural disaster. We know CSU has had issues with flooding, so backing up your data on an external hard drive that sits next to your computer isn't a good solution for this problem. Cloud solutions can be advantageous here. CSU provides researchers 1 TB on Microsoft OneDrive, and allows the use of other systems like Dropbox and Google Drive.
* Is your system automated?
A backup system is only as good as the last time you synced, so make sure this happens often.
* Finally, not every solution is secure enough if you're dealing with private information. Generally, services like Dropbox are not certified secure for private data, and these data need to be stored locally.

Slide 18: Let's take a moment to write out your preservation plan. Open up your document and answer the following questions:
* What will you store?
* Who will be responsible for the data? This can be a specific person or a position.
* How long will you store it?
* Where will you store it at different stages of the project?
* How will you back it up?

Slide 19: Now that you know what data you will have, how to describe it, and how to preserve it, we need to figure out how others will be able to access your data. This is particularly important to funding agencies because they want the public to see what their taxes are paying for. Your data must be easily available: saying the data "will be made available upon request" doesn't really cut it anymore. However, it is accepted to embargo your data for about 12 months after publication or project completion. As always, if you're working with private data, please consider security when assessing this plan.

Slide 20: When sharing your data, it's best to use non-proprietary formats, because not everyone will have access to the same software that you do. Including metadata is also essential so others can make sense of your data and use it responsibly. As always, you want to make sure the data are stored properly with a backup system in place. A good place to do this is a trusted repository.

Slide 21: Trusted repositories have a dual purpose: storage and sharing. Optimally, you would put your data in a discipline-specific repository so others in your field can find it. You can use re3data to look for repositories in your field.
If you do not have a repository that fits your needs, generic repositories like FigShare and Dryad can be good options to accommodate diverse data types, but they often have an associated fee. Finally, Colorado State has its own digital repository that you can store your research data in.

Slide 22: To see datasets that are in the repository, you can go to the repository and look in the "data collection". It's a flexible system that uses Dublin Core metadata, but other standards can be integrated as needed. Also, storing your data here costs nothing for less than a TB of data. Above that, the cost is $150 per terabyte for 5 years, or $300 per terabyte for longer than 5 years.

Slide 23: Trusted repositories also provide what are known as stable identifiers. We all know that URLs break, so stable IDs were created to prevent links from breaking over time. You have probably seen DOIs, or digital object identifiers, on journal articles, but they can be applied to any digital object, including data. To find a digital object, search for the DOI at dx.doi.org, which will direct you to the specific digital object on the web. If the object's URL changes, the DOI record is updated, preserving the association. CSU's digital repository provides another stable identifier, called a Handle, by default, but can mint DOIs on request.

Slide 24: Even if you share your data, you don't have to do so without restrictions. You can state your conditions for reuse in a license. A common condition for reuse is that users cite your data and/or your associated publications. You can also issue disclaimers about your dataset. However, you must justify your usage limitations. A good starting point for licenses is the Creative Commons website, but we won't go into that here because licensing is an entire session all on its own.

Slide 25: You're almost done with your data management plan! Take a moment to consider your data sharing plan with the following questions:
- Where will people access the data?
  Does your discipline have a repository?
- What kind of stable identifier do you want?
- What are the conditions for reuse?
- What limitations do you want to place on use of your data, and why?

Slide 26: Thanks for listening! If you need help with any of these topics, my contact information is here. You can also consult DMPTool or the CSU Libraries data management website. Thanks again!