FAIR Access to Data

“FAIR” is the acronym for Findable, Accessible, Interoperable and Reusable, and applies to scientific data used by people and computers. The FAIR data principles were established in 2016. More information is available from the 2016 publication, the 2018 editorial publication, the FAIR website and the GO FAIR Initiative webpage.

The FAIR principles recognize that:

published data should be stored where it can be found and identified accurately using globally unique persistent identifiers;
data can be retrieved using standard protocols (like through a web interface);
data is in a standard format such that it can be used in other applications; and
data is licensed properly to allow reuse.

At SoyBase, we value data and believe that it is the responsibility of every soybean researcher and breeder to make their publicly funded data FAIR. Here we outline some basic guidelines for good data management. We hope that our breeding and research community will lead the worldwide charge to FAIR data! We are always happy to answer your questions on these issues.

Why is this important?

Published data continues to increase dramatically in volume and complexity. Data sets in individual publications are now routinely so large that computer analysis is required. New standards are being established so that data is BOTH human readable and “machine readable” (that is, the data can be manipulated by a computer program). Also, journal publishers often do not accept large data sets, so other data repositories must be used, and it can be difficult for submitters to find the right database for their data. It can also be difficult for researchers to find data that is associated with a publication but is not in its supplementary data, if persistent globally unique identifiers are not specified in the published article.

Why should YOUR Data be FAIR?

If you make your data FAIR, it will be more visible, easier to reuse, and more frequently cited. If others make their data FAIR, it will help the entire community harvest the vast depth and breadth of soybean data, and discovery will proceed more quickly, especially as new analysis methods come online.

What can YOU do to make your data FAIR?

Start with these tasks:

Understand the FAIR data principles, and make your data FAIR

Start by reading FAIR Guiding Principles for scientific data. It is also important to budget time for data management, just as you budget time for the other aspects of your research.

Put your data in a permanent and stable database

There are permanent and stable repositories for many types of scientific data. Data should go into the correct repository, and then it can be pulled into SoyBase for further curation and use with our tools. Wherever you deposit data, get a DOI (or other persistent, globally unique identifier) and put it in your publication. Not sure where to put your data? Nature provides an excellent list of data repositories and recommendations. The re3data.org and FAIRsharing.org websites have extensive lists of databases, resources, and repositories. If you are still unsure where to submit data, or need help submitting, please ask anyone at SoyBase. If your journal article refers to data NOT published with your article, please make sure to obtain a persistent identifier for your data and state it, along with the location of the data, in your article.
Genome Assemblies: Submit genome assemblies to EBI or NCBI Genomes. We understand this can take some time to complete. We can help, so please do not be tempted to simply submit contigs to GenBank.
DNA/RNA/Protein Sequences: All DNA, RNA and protein sequences need to be submitted to NCBI, EBI, or DDBJ. These databases provide stable, long-term storage for DNA, RNA and protein sequence data and create stable identifiers for datasets. These three organizations share sequence data on a daily basis, so data deposited at one is available at all three.
SNPs: All soybean SNPs should be submitted to EVA at EBI.
Gene Expression: Data used in gene expression studies should be submitted to the NCBI GEO.
Protein/Proteomics/Metabolomics: Explore UniProt, MassIVE, MetaboLights, PeptideAtlas, and PRIDE. Metabolomics data should be submitted following the MSI guidelines. Submit proteomics data to members of the ProteomeXchange, following the MIAPE recommendations.
General Repositories: Data Dryad or Figshare

Understand “Machine Readable”

This simply means that data is in a format that can be read and processed by a computer without human intervention. Computers are good at exact matches; for example, “lg1” does NOT equal “liguleless1” and “Chr1” does NOT equal “1” to a computer. Word documents and PDFs are NOT machine readable. Formats such as spreadsheets with column headers that can be exported as comma separated values (CSV), or standard formats for specific data types like FASTA, FASTQ, BED, GFF3, BAM, SAM, VCF, etc., ARE machine readable. Repositories often describe the machine-readable file formats they accept. If you have questions about what file formats SoyBase accepts, please contact the SoyBase team.
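The exact-match point above can be illustrated with a small Python sketch. The two tables, their column names, and the "Chr1" labeling convention are made up for illustration; they are not a SoyBase format. The sketch reads two CSV tables, shows that a gene on chromosome "1" is missed when compared against "Chr1", and shows that it is found once both sides are normalized to one convention.

```python
import csv
import io

# Hypothetical example data: two tables that label the same chromosome
# differently ("Chr1" in one, "1" in the other).
markers_csv = "marker,chromosome,position\nm1,Chr1,1200\nm2,Chr2,560\n"
genes_csv = "gene,chromosome,start\ngA,1,1000\ngB,Chr2,500\n"


def read_rows(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def normalize_chromosome(label):
    """Map labels such as '1' or 'chr1' to 'Chr1' (an assumed convention)."""
    core = label.strip()
    if core.lower().startswith("chr"):
        core = core[3:]
    return "Chr" + core


markers = read_rows(markers_csv)
genes = read_rows(genes_csv)

# Exact comparison: "1" is not the same string as "Chr1", so gA is missed.
marker_chroms = {row["chromosome"] for row in markers}
exact = [g["gene"] for g in genes if g["chromosome"] in marker_chroms]

# Comparison after normalizing both sides to the same convention.
normalized_chroms = {normalize_chromosome(c) for c in marker_chroms}
normalized = [
    g["gene"]
    for g in genes
    if normalize_chromosome(g["chromosome"]) in normalized_chroms
]

print(exact)
print(normalized)
```

Normalizing after the fact works only when the mapping is unambiguous. Using one consistent label from the start avoids the problem.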

Attach complete and detailed metadata to your data sets, and use accepted file formats

When you deposit data, you are asked for information about your data (metadata). Please give this the same careful attention you give to your bench work and analysis. Datasets that are not adequately described are not reusable or reproducible, and raise questions about the carefulness and accuracy of the research. You should supply enough metadata so that your experiment can be reproduced. Be sure to use community standards for your data type, such as MIxS (Minimum Information about any (x) Sequence) or MIAPPE (Minimum Information About a Plant Phenotyping Experiment). These standards tell you what information to provide and which file formats are accepted for your type of data. Standards can also be found at data repositories.
Use ontology terms to describe your data. Ontologies provide a powerful organizing framework for data, and help data to be machine readable.
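One way to catch incomplete metadata before you submit is a simple completeness check, sketched below in Python. The field names in REQUIRED_FIELDS and the sample record are hypothetical and are not taken from MIxS, MIAPPE, or any repository. The real list of required fields comes from the standard or repository for your data type.

```python
# Hypothetical required fields, chosen only for illustration. Replace them
# with the fields required by the standard or repository you are using.
REQUIRED_FIELDS = ("organism", "cultivar", "tissue", "collection_date", "method")


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Example record with one blank field and one field left out entirely.
sample = {
    "organism": "Glycine max",
    "cultivar": "Williams 82",
    "tissue": "",
    "method": "RNA-seq",
}

problems = missing_metadata(sample)
print(problems)
```

A check like this only confirms that fields are filled in. It cannot tell whether the values are accurate, so careful review is still needed.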

Do not rename genes that already have names

Once upon a time, the name of a soybean gene was its unique, persistent identifier. But now, renaming of genes that already have names is a big problem. Many names for the same gene make it difficult to find all information for that gene. Even worse, when the same name is used for different genes, how can a human, much less a computer, know they are different genes? Please look up your gene at SoyBase before assigning a name, and follow the SoyBase nomenclature guidelines.
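Both naming problems described above can be detected mechanically, as in the following Python sketch. The gene names and identifiers are made up for illustration; they are not real soybean gene names or SoyBase identifiers, and this is not how SoyBase itself tracks names.

```python
from collections import defaultdict

# Made-up (name, identifier) pairs for illustration only.
assignments = [
    ("abc1", "GENE_0001"),
    ("xyz2", "GENE_0001"),  # a second name for the same gene
    ("abc1", "GENE_0002"),  # the same name reused for a different gene
    ("def3", "GENE_0003"),
]


def index_names(pairs):
    """Build name -> identifiers and identifier -> names lookups."""
    ids_by_name = defaultdict(set)
    names_by_id = defaultdict(set)
    for name, gene_id in pairs:
        ids_by_name[name].add(gene_id)
        names_by_id[gene_id].add(name)
    return ids_by_name, names_by_id


ids_by_name, names_by_id = index_names(assignments)

# A name that points at more than one gene cannot be resolved by name alone.
ambiguous_names = sorted(n for n, ids in ids_by_name.items() if len(ids) > 1)

# A gene with several names must be searched for under every synonym.
genes_with_synonyms = sorted(
    g for g, names in names_by_id.items() if len(names) > 1
)

print(ambiguous_names)
print(genes_with_synonyms)
```

Detecting a conflict is the easy part. Resolving it still takes a curator, which is why checking existing names before publishing matters.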

Let SoyBase know about your work!

Please let us know about your publications, and provide links to your data. If you have published on a gene or genes, or if you have a dataset that will be useful to others, let us know. Review the pages on genes or other information that you study, and let us know if corrections should be made. We want to get it right!