
scientific_papers

  • Description :

The scientific_papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have two features:

  • article: the body of the document, pagragraphs seperated by "/n".
  • abstract: the abstract of the document, pagragraphs seperated by "/n".

section_names: titles of sections, seperated by "/n".

Additional Documentation : Explore on Papers With Code

Homepage : https://github.com/armancohan/long-summarization

Source code : tfds.datasets.scientific_papers.Builder

Versions :

  • 1.1.0 : No release notes.
  • 1.1.1 (default): No release notes.

Download size : 4.20 GiB

Auto-cached ( documentation ): No

Feature structure :
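The feature-structure listing did not survive extraction. Based on the three string features described above, it is presumably of this form (a reconstruction, not copied verbatim from the catalog page):

FeaturesDict({
    'abstract': Text(shape=(), dtype=string),
    'article': Text(shape=(), dtype=string),
    'section_names': Text(shape=(), dtype=string),
})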

  • Feature documentation :

Supervised keys (See as_supervised doc ): ('article', 'abstract')

Figure ( tfds.show_examples ): Not supported.

scientific_papers/arxiv (default config)

Config description : Documents from ArXiv repository.

Dataset size : 7.07 GiB

  • Examples ( tfds.as_dataframe ):

scientific_papers/pubmed

Config description : Documents from PubMed repository.

Dataset size : 2.34 GiB
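As a usage sketch (not part of the original page), the dataset can be loaded with the standard tfds.load API; as_supervised=True yields (article, abstract) pairs per the supervised keys listed above:

import tensorflow_datasets as tfds

# Load the train split of the default arxiv config as (article, abstract) pairs.
ds = tfds.load('scientific_papers/arxiv', split='train', as_supervised=True)
for article, abstract in ds.take(1):
    print(abstract.numpy()[:200])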


Dataset card for "scientific_papers"

Dataset Summary

The scientific_papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have two features:

  • article: the body of the document, paragraphs separated by "/n".
  • abstract: the abstract of the document, paragraphs separated by "/n".
  • section_names: titles of sections, separated by "/n".

Supported Tasks and Leaderboards

More Information Needed

Dataset Structure

Data Instances

arxiv

  • Size of downloaded dataset files: 4.50 GB
  • Size of the generated dataset: 7.58 GB
  • Total amount of disk used: 12.09 GB

An example of 'train' looks as follows.

pubmed

  • Size of the generated dataset: 2.51 GB
  • Total amount of disk used: 7.01 GB

An example of 'validation' looks as follows.

Data Fields

The data fields are the same among all splits.

  • article : a string feature.
  • abstract : a string feature.
  • section_names : a string feature.
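As a loading sketch with the Hugging Face datasets library, using the config and field names documented above (recent library versions may additionally require trust_remote_code=True for script-based datasets such as this one):

from datasets import load_dataset

# Load the arxiv config; each record has article, abstract and section_names.
ds = load_dataset("scientific_papers", "arxiv", split="train")
print(ds[0]["section_names"])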

Data Splits

The card also includes skeleton sections on dataset creation (curation rationale, source data, initial data collection and normalization, source language producers, annotations, annotation process, annotators, personal and sensitive information), considerations for using the data (social impact of the dataset, discussion of biases, other known limitations) and additional information (dataset curators, licensing information, citation information, contributions).

Thanks to @thomwolf , @jplu , @lewtun , @patrickvonplaten for adding this dataset.

Models trained or fine-tuned on scientific_papers

  • google/bigbird-pegasus-large-arxiv
  • google/bigbird-pegasus-large-pubmed
  • allenai/led-large-16384-arxiv
  • patrickvonplaten/led-large-16384-pubmed
  • mse30/bart-base-finetuned-pubmed
  • dagar/t5-small-science-papers


Support for data sets associated with arXiv articles

arXiv is primarily an archive and distribution service for research articles. arXiv provides support for data sets and other ancillary materials only in direct connection with research articles submitted.

arXiv supports the inclusion of ancillary files of modest size with articles. If you are including multi-page datasets or code with your submission, please use the ancillary files option rather than embedding them in the full text. The ancillary files are stored in the source package on arXiv, and facilities are available to download either the entire source package or individual files. The ability to add ancillary files is available as part of the normal arXiv submission process.


  • Data Descriptor
  • Open access
  • Published: 13 July 2020

A dataset describing data discovery and reuse practices in research

  • Kathleen Gregory (ORCID: orcid.org/0000-0001-5475-8632)

Scientific Data, volume 7, Article number: 232 (2020)


Subjects: Research data, Social sciences

This paper presents a dataset produced from the largest known survey examining how researchers and support professionals discover, make sense of and reuse secondary research data. 1677 respondents in 105 countries representing a variety of disciplinary domains, professional roles and stages in their academic careers completed the survey. The results represent the data needs, sources and strategies used to locate data, and the criteria employed in data evaluation of these respondents. The data detailed in this paper have the potential to be reused to inform the development of data discovery systems, data repositories, training activities and policies for a variety of general and specific user communities.

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12445034


Background & Summary

Reusing data created by others, so-called secondary data 1 , holds great promise in research 2 . This is reflected in the creation of policies 3 , platforms (i.e. the European Open Science Cloud 4 ), metadata schemas (i.e. the DataCite schema 5 ) and search tools, i.e. Google Dataset ( https://datasetsearch.research.google.com/ ) or DataSearch ( https://datasearch.elsevier.com ), to facilitate the discovery and reuse of data. Despite the emergence of these systems and tools, not much is known about how users interact with data in search scenarios 6 or the particulars of how such data are used in research 7 .

This paper describes a dataset first analysed in the article Lost or Found? Discovering data needed for research 8 . The dataset includes quantitative and qualitative responses from a global survey, with 1677 complete responses, designed to learn more about data needs, data discovery behaviours, and criteria and strategies important in evaluating data for reuse. This survey was conducted as part of a project investigating contextual data search undertaken by two universities, a data archive, and an academic publisher, Elsevier. The involvement of Elsevier enabled the recruitment strategy, namely drawing the survey sample from academic authors who have published an article in the past three years that is indexed in the Scopus literature database (https://www.scopus.com/). This recruitment strategy helped to ensure that the sample consisted of individuals active in research, across disciplines and geographic locations.

The data themselves are presented in two data files, according to the professional role of respondents. The dataset as a whole consists of these two data files, one for researchers (with 165 variables) and one for research support professionals (with 167 variables), the survey questionnaire and detailed descriptions of the data variables 9 . The survey questionnaire contains universal questions which could be applicable to similar studies; publishing the questionnaire along with the data files not only facilitates understanding the data, but it also fosters possible harmonization with other survey-based studies.

The dataset has the potential to answer future research questions, some of which are outlined in the usage notes of this paper, and to be applied at a practical level. Designers of both general and specific data repositories and data discovery systems could use this dataset as a starting point to develop and enhance search and sensemaking interfaces. Data metrics could be informed by information about evaluation criteria and data uses present in the dataset, and educators and research support professionals could build on the dataset to design training activities.

The description below of the methods used to design the questionnaire and to collect the data, as well as the description of potential biases in the technical validation section, all build on those presented in the author’s previous work 8 .

Questionnaire design

The author’s past empirical work investigating data search practices 10 , 11 (see also Fig.  1 ), combined with established models of interactive information retrieval 12 , 13 , 14 , 15 and information seeking 16 , 17 and other studies of data practices 18 , 19 , were used to design questions examining the categories identified in Table  1 . Specifically, questions explored respondents’ data needs, their data discovery practices, and their methods for evaluating and making sense of secondary data.

Figure 1: Creation of dataset in relation to prior empirical work by the author. Bolded rectangles indicate steps with associated publications, resulting from an analytical literature review 10 , semi-structured interviews 11 and an analysis of the survey data 8 .

The questionnaire used a branching design, consisting of a maximum of 28 primarily multiple choice items (Table  1 ). The final question of the survey, which provided space for respondents to provide additional comments in an open text field, is not represented in Table  1 . The individual items were constructed in accordance with best practices in questionnaire design, with special attention given to conventions for wording questions and the construction of Likert scale questions 20 , 21 . Nine of the multiple choice questions were constructed to allow multiple responses. There were a maximum of three optional open response questions. The majority of multiple choice questions also included the possibility for participants to write in an "other" response.

The first branch in the questionnaire design was based on respondents’ professional role. Respondents selecting “librarians, archivists or research/data support providers,” a group referred to here as research support professionals , answered a slightly different version of the questionnaire. The items in this version of the questionnaire were worded to reflect possible differences in roles, i.e. whether respondents seek data for their own use or to support other individuals. Four additional questions were asked to research support professionals in order to further probe their professional responsibilities; four questions were also removed from this version of the questionnaire. This was done in order to maintain a reasonable completion time for the survey and because the removed questions were deemed to be more pertinent to respondents with other professional roles, i.e. researchers. The questionnaire is available in its entirety with the rest of the dataset 12 .

Sampling, recruitment and administration

Individuals involved in research, across disciplines, who seek and reuse secondary data comprised the population of interest. This is a challenging population to target, as it is difficult to trace instances of data reuse, particularly given the fact that data citation, and other forms of indexing, are still in their infancy 22 . The data reuse practices of individuals in certain disciplines have been better studied than others 23 , in part because of the existence of established data repositories within these disciplines 24 . In order to recruit individuals active in research across many disciplinary domains, a broad recruitment strategy was adopted.

Recruitment emails were sent to a random sample of 150,000 authors who are indexed in Elsevier’s Scopus database and who have published in the past three years. The recruitment sample was created to reflect the distribution of published authors by country within Scopus. Two batches of recruitment emails were sent: one of 100,000 and the other of 50,000. One reminder email was sent two weeks after the initial email. A member of the Elsevier Research and Academic Relations team created the sample and sent the recruitment letter, as access to the email addresses was not available to the investigator due to privacy regulations. The questionnaire was scripted and administered using the Confirmit software ( https://www.confirmit.com/ ).

1637 complete responses were received during a four-week survey period between September and October 2018 using this methodology. Only seven of the 1637 responses came from research support professionals. In a second round of recruitment in October 2018, messages were posted to discussion lists in research data management and library science to further recruit support professionals. Individuals active in these lists spontaneously posted notices about the survey on their own Twitter feeds. These methods resulted in an additional 40 responses, yielding a total of 1677 complete responses.

Ethical review and informed consent

This study was approved by the Ethical Review Committee Inner City faculties (ERCIC) at Maastricht University, Netherlands, on 17 May 2018 under the protocol number ERCIC_078_01_05_2018.

Prior to beginning the study, participants had the opportunity to review the informed consent form. They indicated their consent by clicking on the button to proceed to the first page of survey questions. Respondents were informed about the purpose of the study, its funding sources, the types of questions which would be asked, how the survey data would be managed and any foreseen risks of participation.

Specifically, respondents were shown the text below, which also states that the data would be made available in the DANS-EASY data repository ( https://easy.dans.knaw.nl ), which is further described in the Data Records section of this paper.

Your responses will be recorded anonymously, although the survey asks optional questions about demographic data which could potentially be used to identify respondents. The data will be pseudonymized (e.g. grouping participants within broad age groups rather than giving specific ages) in order to prevent identification of participants. The results from the survey may be compiled into presentations, reports and publications. The anonymized data will be made publicly available in the DANS-EASY data repository.

Respondents were also notified that participation was voluntary, and that withdrawal from the survey was possible at any time. They were further provided with the name and contact information of the primary investigator.

Data Records

Preparation of data files.

The data were downloaded from the survey administration system as csv files by the employee from Elsevier and were sent to the author. The downloads were performed in two batches: the 1637 responses received before the additional recruiting of research support professionals, and the 40 responses received after this second stage of recruitment. The seven responses from research support professionals from the first round of recruitment were extracted and added to the csv file from the second batch. This produced separate files for research support professionals and the remainder of respondents, who are referred to as researchers in this description. This terminology is appropriate as the first recruitment strategy ensured that respondents were published academic authors, making it likely that they had been involved in conducting research at some point in the past three years.

The following formatting changes were made to the data files in order to enhance understandability for future data reusers. All changes were made using the analysis program R 25 .

Open responses were checked for any personally identifiable information, particularly email addresses. This was done by searching for symbols and domains commonly used in email addresses (i.e. “@”; “.com,” and “.edu”). Two email addresses were identified in the final question recording additional comments about the survey. In consultation with an expert at the DANS-EASY data repository, all responses from this final question were removed from both data files as a precautionary measure.
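A minimal sketch of this kind of screen, in Python rather than the R actually used for the published checks (the column names below are drawn from the open-response column lists in Box 1):

import pandas as pd

df = pd.read_csv("datadiscovery_researchers.csv")
# Flag open responses containing symbols or domains common in email addresses.
pattern = r"@|\.com|\.edu"
for col in ["need_open", "source_open", "find_litdatopen"]:
    hits = df[col].astype(str).str.contains(pattern, regex=True, na=False)
    print(col, int(hits.sum()), "responses flagged for manual review")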

Variables representing questions asked only to research support professionals were removed from datadiscovery_researchers.csv. Variables representing questions asked only to researchers were removed from datadiscovery_supportprof.csv.

Variables were renamed using mnemonic names to facilitate understanding and analysis. Variable names for questions asked to both research support professionals and researchers have the same name in both data files.

Variables were re-ordered to match the order of the questions presented in the questionnaire. Demographic variables, including role, were grouped together at the end of the data files.

Multiple choice options which were not chosen by respondents were recorded by the survey system as zeros. If a respondent was not asked a question, this is coded as "Not asked." If a respondent wrote "NA" or a similar phrase in the open response questions, this was left unchanged to reflect the respondent's engagement with the survey. If a respondent did not complete an optional open response question, this was recorded as a space, which appears as an empty cell; in the analysis program R, e.g., this empty space is represented as " ".
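For reuse, these sentinel values can be normalized on load; a sketch in Python (pandas), under the coding conventions just described:

import pandas as pd

df = pd.read_csv("datadiscovery_researchers.csv")
# Treat branching skips ("Not asked") and blank optional open responses (" ")
# as true missing values, while keeping "0" (option shown but not chosen)
# as a valid category for multiple-response questions.
df = df.replace({"Not asked": pd.NA, " ": pd.NA})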

Description of data and documentation files

The dataset described here consists of one text readme file, four csv files, and one pdf file with the survey questionnaire. These files should be used in conjunction with each other in order to appropriately use the data. Table  2 provides a summary and description of the files included in the dataset.

Descriptions of the variable names are provided in two files (Table  2 ). Variables were named following a scheme that matches the structure of the questionnaire; each variable name begins with a mnemonic code representing the related research aim. The primary codes are summarised in Table  3 . The values of the variables for multiple choice items are represented as either a “0” for non-selected options, as described above, or with a textual string representing the selected option.

The dataset is available at the DANS-EASY data repository 9 . DANS-EASY is a principal component of the federated national data infrastructure of the Netherlands 26 and is operated by the Data Archive and Networked Services (DANS), an institution of the Royal Netherlands Academy for Arts and Sciences and the Dutch Research Council. DANS-EASY has a strong history of providing secure long-term storage and access to data in the social sciences 27 . The repository has been awarded a CoreTrustSeal certification for data repositories ( https://www.coretrustseal.org/ ), which assesses the trustworthiness of repositories according to sixteen requirements. These requirements focus on organisational infrastructure (e.g. licences, continuity of access and sustainability), digital object management (e.g. integrity, authenticity, preservation, and re-use) and technology (e.g. technical infrastructure and security).

Sample characteristics

Respondents identified their disciplinary domains of specialization from a list of 31 possible domains developed after the list used by Berghmans et al. 28 . Participants could select multiple responses for this question. The domain selected most often was engineering and technology, followed by the biological, environmental and social sciences (Fig.  2a ). Approximately half of the respondents selected two or more domains, with one quarter selecting more than three.

Figure 2: (a) Disciplinary domains selected by respondents; multiple responses possible (n = 3431). (b) Respondents' years of professional experience; percentages denote percent of respondents (n = 1677). (c) Number of respondents by country of employment (n = 1677).

Forty percent of respondents have been professionally active for 6–15 years (Fig.  2b ). The majority identified as being researchers (82%) and are employed at universities (69%) or research institutions (17%). Respondents work in 105 countries; the most represented countries include the United States, Italy, Brazil and the United Kingdom (Fig.  2c ).

Technical Validation

Several measures were performed to ensure the validity of the data, both before and after data collection. Sources of uncertainty and potential bias in the data are also outlined below in order to facilitate understanding and data reuse.

Questionnaire development

The questionnaire items were developed after extensively reviewing relevant literature 10 , 29 , 30 , 31 , 32 and conducting semi-structured interviews to test the validity of our guiding constructs. To test the validity and usability of the questionnaire itself, a two-phase pilot study was conducted. In the first phase, four researchers, recruited using convenience sampling, were observed as they completed the online survey. During these observations, the researchers “thought out loud” as they completed the survey; they were encouraged to ask questions and to make remarks about the clarity of wording and the structure of the survey. Based on these comments, the wording of questions was fine tuned and additional options were added to two multiple choice items.

In the second pilot phase, an initial sample of 10,000 participants was recruited, using the primary recruitment methodology detailed in the methods section of this paper. After 102 participants interacted with the survey, the overall completion rate (41%) was measured and points where individuals stopped completing the survey were noted. Based on this information, option-intensive demographic questions (i.e. country of employment, discipline of specialization) were moved to the final section of the survey in order to minimize survey fatigue. The number of open-ended questions was also reduced, and the remaining open-response questions were made optional.

The online presentation of the survey questions also helped to counter survey fatigue. Only one question was displayed at a time; the branching logic of the survey ensured that respondents were only shown the questions which were relevant to them, based on their previous answers.

Questionnaire completion

1677 complete responses to the survey questionnaire were received. Using the total number of recruitment emails as the denominator, this yields a response rate of 1.1%. Taking into account the number of non-delivery reports which were received (29,913), the number of invalid emails which were reported (81) and the number of recruited participants who elected to opt out of the survey (448) yields a slightly higher response rate of 1.4%. It is likely that not all of the 150,000 individuals who received recruitment emails match our targeted population of data seekers and reusers. Knowledge about the individuals who did not respond to the survey, and about the frequency of data discovery and reuse within research as a whole, is limited; this complicates the calculation of a more accurate response rate, such as the methodology described in 33 .

A total of 2,306 individuals clicked on the survey link, but did not complete it, yielding a completion rate of 42%. Of the non-complete responses, fifty percent stopped responding after viewing the introduction page with the informed consent statement. This point of disengagement could be due to a variety of reasons, including a lack of interest in the content of the survey or a disagreement with the information in the consent form. The majority of individuals who did not complete the survey stopped responding within the first section of the survey (75% of non-complete responses). Only data from complete responses are included in this dataset.
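The reported rates can be reproduced directly from the figures above (a worked check, not code from the paper):

# Response rate with all 150,000 recruitment emails as the denominator:
print(1677 / 150000)                       # ~0.011, reported as 1.1%
# Adjusted denominator excluding non-deliveries, invalid emails and opt-outs:
print(1677 / (150000 - 29913 - 81 - 448))  # ~0.014, reported as 1.4%
# Completion rate among everyone who clicked the survey link:
print(1677 / (1677 + 2306))                # ~0.42, reported as 42%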

Of the 1677 complete responses, there was a high level of engagement with the optional open response questions. Seventy-eight percent of all respondents answered Q2 regarding their data needs; 92% of respondents who were asked Q5a provided an answer; and 69% of respondents shown Q10a described how their processes for finding academic literature and data differ.

Data quality and completeness

Checks for missing values and NAs were performed using standard checks in R. As detailed in the section on data preparation, multiple choice responses not selected by respondents were recorded as a zero. If a respondent was not asked a question, this was coded as "Not asked." If a respondent wrote "NA" or a similar phrase in the open response questions, this was left unchanged to reflect the respondent's engagement with the survey. If a respondent did not complete an optional open response question, this was recorded as a space, which appears as an empty cell; in the analysis program R, e.g., this empty space is represented as " ".

Due to the limited available information about non-responders to the survey and about the frequency of data seeking and discovery behaviours across domains in general, the data as they stand are representative only of the behaviours of our nearly 1700 respondents - a group of data-aware people already active in data sharing and reuse and confident in their ability to respond to an English-language survey. Surveys in general tend to attract a more active, communicative part of the targeted population and do not cover non-users at all 34 . While not generalizable to broader populations, the data could be transferable 35 , 36 to similar situations or communities. Creating subsets of the data, i.e. by discipline, may provide insights that can be applied to particular disciplinary communities.

There are potential sources of bias in the data. The recruited sample was drawn to mirror the distribution of published authors by country in Scopus; the geographic distribution of respondents does not match that of the recruited sample (Table  4 ). This is especially noticeable for Chinese participants, who comprised 15% of the recruited sample, but only 4% of respondents. This difference could be due to a number of factors, including language differences, perceived power differences 37 , or the possibility that data seeking is not a common practice.

Our respondents were primarily drawn from the pool of published authors in the Scopus database. Some disciplinary domains are under-represented within Scopus, most notably the arts and humanities 38 , 39 . Subject indexing within Scopus occurs at the journal or source level. As of January 2020, 30.4% of titles in Scopus are from the health sciences; 15.4% from the life sciences; 28% from the physical sciences and 26.2% from the social sciences 40 . Scopus has an extensive and well-defined review process for journal inclusion; 10% of the approximately 25,000 sources indexed in Scopus are published by Elsevier 40 .

Self-reported responses also tend to be pro-attitudinal, influenced by a respondent’s desire to provide a socially acceptable answer. Survey responses can also be influenced by the question presentation, wording and multiple choice options provided. The pilot studies and the provision of write-in options for individual items helped to mitigate this source of error.

Usage Notes

Notes for data analysis.

It is key to note which questions were designed to allow for multiple responses. This will impact the type of analysis which can be performed and the interpretation of the data. These nine questions are marked with an asterisk in Table  1 ; the names of the variables related to these questions are summarized in Table  5 .

The data are available in standard csv formats and may be imported into a variety of analysis programs, including R and Python. The data are well-suited in their current form to be treated as factors or categories in these programs, with the exception of open response questions and the write-in responses to the "other" selection options, which should be treated as character strings. An example of the code needed to load the data into R and Python, as well as how to change the open and other response variables to character strings, is provided in the section on code availability. To further demonstrate potential analysis approaches, the code used to create Fig.  2a in R is also provided.

Certain analysis programs, i.e. SPSS, may require that the data be represented numerically; responses in the data files are currently represented in textual strings. The survey questionnaire, which is available with the data files, contains numerical codes for each response which may be useful in assigning codes for these variables.

Future users may wish to integrate the two data files to examine the data from all survey respondents together. This can easily be done by creating subsets of the variables of interest from each data file (i.e. by using the subset and select commands in R) and combining the data into a single data frame (i.e. using the rbind command in R). Variables that are common between both of the data files have the same name, facilitating this type of integration. An example of the code needed to do this is provided in the code for creating Fig.  2a .
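For reusers working in Python instead of R, a pandas equivalent of this subset-and-rbind approach (a sketch; file names as in the Data Records section, discipline column names as in Box 3):

import pandas as pd

researcher = pd.read_csv("datadiscovery_researchers.csv")
support = pd.read_csv("datadiscovery_supportprof.csv")

# Shared variables have identical names in both files, so select the
# discipline columns plus the respondent id and stack the two frames.
cols = ["responseid"] + [c for c in researcher.columns if c.startswith("disc_")]
combined = pd.concat([researcher[cols], support[cols]], ignore_index=True)
print(combined.shape)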

Open and write-in responses are included in the same data file with the quantitative data. These variables can be removed and analysed separately, if desired.

To ease computational processing, the data do not include embedded information about the question number or the detailed meaning of each variable name. This information is found in the separate variable_labels csv file associated with each data file.

Potential questions and applications

The data have the potential to answer many interesting questions, including those identified below.

How do the identified practices vary by demographic variables? The data could be sub-setted to examine practices along the lines of:

Country of employment

Career stage, e.g. early career researchers

Disciplinary domain

What correlations exist among the different variables, particularly the variables allowing for multiple responses? Such questions could examine:

Possible correlations between the frequency of use of particular sources and the type of data needed or uses of data

Possible correlations between particular challenges for data discovery and needed data or data use

How representative are these data of the behaviours of broader populations?

How will these behaviours change as new technologies are developed? The data could serve as a baseline for comparison for future studies.

How do practices within a particular domain relate to the existence of data repositories and infrastructures within a domain? Given the practices identified in this survey, how can repositories and infrastructures better support data seekers and reusers?

Box 1. R code for loading data and changing selected columns to character strings.

#Set the working directory; the data files should be in the working directory.
setwd("~/Desktop/survey/")

#Import the data files as data frames and store as "researcher.df" and "support.df".
#If you don't want to use factors, set stringsAsFactors = FALSE.
researcher.df <- read.csv(file = "datadiscovery_researchers.csv", header = TRUE, stringsAsFactors = TRUE)
support.df <- read.csv(file = "datadiscovery_supportprof.csv", header = TRUE, stringsAsFactors = TRUE)

#Select columns to be treated as character strings.
cols.res <- c("need_open","need_othresp","use_othresp","find_whoothresp","source_open","strategy_othresp","find_litdatopen","find_chalothresp","eval_infopen","eval_stratopen","eval_trstopen","eval_qualopen","disc_othresp")
cols.sup <- c("whosupprt_othresp","supprt_othresp","need_open","need_othresp","use_othresp","source_open","strategy_othresp","findaccseval_oth","find_litdatopen","eval_infopen","eval_spprtdopen","eval_stratopen","eval_trstopen","eval_qualopen","disc_othresp")

#Change these columns from factors to characters.
researcher.df[cols.res] <- lapply(researcher.df[cols.res], as.character)
support.df[cols.sup] <- lapply(support.df[cols.sup], as.character)

#Inspect the structure of each data frame to confirm the changes.
str(researcher.df)
str(support.df)

Box 2. Python code for loading data and changing selected columns to character strings.

#Import required libraries.
import pandas as pd

#Read in the csv as a pandas dataframe. Pandas will infer data types, but we
#explicitly set all columns to "category" initially and change the string columns later.
df = pd.read_csv('./datadiscovery_researchers.csv', index_col='responseid', dtype="category")

#Create a list of the columns which are not categories but should be treated as strings.
str_cols = ['need_open',
            'need_othresp',
            'use_othresp',
            'find_whoothresp',
            'source_open',
            'strategy_othresp',
            'find_litdatopen',
            'find_chalothresp',
            'eval_infopen',
            'eval_stratopen',
            'eval_trstopen',
            'eval_qualopen',
            'disc_othresp']

#Change data type for columns to be treated as strings.
df[str_cols] = df[str_cols].astype("str")

#Print data types to confirm; cat_cols holds the remaining category columns.
cat_cols = [col for col in df.columns if col not in str_cols]
df[cat_cols].dtypes
df[str_cols].dtypes

Box 3. R code for creating Fig.  2a .

#Install packages and load libraries for the plot.
install.packages("ggplot2")
install.packages("reshape2")
install.packages("dplyr")

library(ggplot2)
library(reshape2)
library(dplyr)

#Select and combine variables from both data files to use in the plot.
researcherdisc.df <- subset(researcher.df, select = c(responseid, disc_agricul:disc_other))
supportdisc.df <- subset(support.df, select = c(responseid, disc_agricul:disc_other))
disc.df <- rbind(researcherdisc.df, supportdisc.df)

#Transform data from wide to long.
disclong.df <- disc.df %>% melt(id.vars = c("responseid"), value.name = "discipline")

#Create a data frame with frequencies.
discfreq.df <- disclong.df %>%
    filter(discipline != "0") %>%
    select(responseid, discipline) %>%
    count(discipline)

#Create plot of frequencies.
discplot <- ggplot(discfreq.df, aes(x = reorder(discipline, n), y = n)) +
    geom_bar(stat = "identity", fill = "#238A8DFF") +
    coord_flip()

#Format plot and add labels.
discplot <- discplot +
    theme(plot.title = element_text(hjust = 0), axis.ticks.y = element_blank(),
          axis.text = element_text(size = 15), text = element_text(size = 15),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank()) +
    ylab("Frequency") + xlab("Disciplinary domain")

Code availability

All R scripts used in data preparation and technical validation, along with the un-prepared data, are available upon request from the corresponding author. Examples of how to load the data and how to change factor/category columns to character columns in R (Box  1 ) and Python (Box  2 ) are provided. Additionally, the code used to create Fig.  2a in R (Box  3 ) is listed as an example of how to combine data from both data files into a single plot.

Allen, M. In The SAGE Encyclopedia of Communication Research Methods, Vols. 1-4 (ed. Allen, M.) Secondary data (SAGE Publications, Inc, 2017).

Wilkinson, M. D. et al . The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3 , 160018 (2016).


European Commission. Facts and figures for open research data. European Commission website https://ec.europa.eu/info/research-and-innovation/strategy/goals-research-and-innovation-policy/open-science/open-science-monitor/facts-and-figures-open-research-data_en (2019).

European Commission. EOSC declaration: European Open Science Cloud: new research & innovation opportunities. European Commission website , https://ec.europa.eu/research/openscience/pdf/eosc_declaration.pdf#view=fit&pagemode=none (2017).

DataCite Metadata Working Group. DataCite metadata schema documentation for the publication and citation of research data, version 4.3. DataCite website https://doi.org/10.14454/7xq3-zf69 (2019).

Noy, N., Burgess, M. & Brickley, D. In The World Wide Web Conference Google Dataset Search: building a search engine for datasets in an open Web ecosystem (ACM Press, 2019).

Pasquetto, I. V., Randles, B. M. & Borgman, C. L. On the reuse of scientific data. Data Sci J. 16 , 1–9 (2017).

Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. Lost or found? Discovering data needed for research. Harvard Data Science Review 2 (2020).

Gregory, K. M. Data Discovery and Reuse Practices in Research. Data Archiving and Networked Services (DANS) https://doi.org/10.17026/dans-xsw-kkeq (2020).

Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A. & Wyatt, S. Searching data: a review of observational data retrieval practices in selected disciplines. J. Assoc. Inf. Sci. Technol. 70 , 419–432 (2019).


Gregory, K. M., Cousijn, H., Groth, P., Scharnhorst, A. & Wyatt, S. Understanding data search as a socio-technical practice. J. Inf. Sci . 0165551519837182 (2019).

Ingwersen, P. Information retrieval interaction . (Taylor Graham, 1992).

Ingwersen, P. Cognitive perspectives of information retrieval interaction: elements of a cognitive IR theory. J. Doc. 52 , 3–50 (1996).

Belkin, N. J. In Information Retrieval '93: Von der Modellierung zur Anwendung (eds. Knorz, G., Krause, J. & Womser-Hacker, C.) Interaction with texts: Information retrieval as information-seeking behavior (Universitaetsverlag Konstanz, 1993).

Belkin, N. J. In ISI ’96: Proceedings of the Fifth International Symposium for Information Science (eds. Krause, J., Herfurth, M. & Marx, J.) Intelligent information retrieval: whose intelligence? (Universtaetsverlag Konstanz, 1996).

Blandford, A. & Attfield, S. Interacting with information: synthesis lectures on human-centered informatics (Morgan & Claypool, 2010).

Adams, A. & Blandford, A. In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries . Digital libraries’ support for the user’s ‘information journey’ (ACM Press, 2005).

Borgman, C. L. Big data, little data, no data: Scholarship in the networked world . (MIT press, 2015).

Faniel, I. M. & Yakel, E. In P Curating Research Data, Volume 1: Practical Strategies for Your Digital Repository (ed. Johnson, L.) Ch.4 (Association of College & Research Libraries, 2017).

de Vaus, D. Surveys In Social Research . (Routledge, 2013).

Robson, C. & McCartan, K. Real World Research . (John Wiley & Sons, 2016).

Park, H., You, S. & Wolfram, D. Informal data citation for data sharing and reuse is more common than formal data citation in biomedical fields. J. Assoc. Inf. Sci. Technol. 69 , 1346–1354 (2018).

Borgman, C. L., Wofford, M. F., Darch, P. T. & Scroggins, M. J. Collaborative ethnography at scale: reflections on 20 years of data integration. Preprint at, https://escholarship.org/content/qt5bb8b1tn/qt5bb8b1tn.pdf (2020).

Leonelli, S. Integrating data to acquire new knowledge: three modes of integration in plant science. Stud. Hist. Philos. Sci. C 44 , 503–514 (2013).


R Core Team. R: A language and environment for statistical computing. R-project website , https://www.r-project.org (2017).

Dillo, I. & Doorn, P. The front office–back office model: supporting research data management in the Netherlands. Int. J. Digit. Curation 9 , 39–46 (2014).

Doorn, P. K. Archiving and managing research data: data services to the domains of the humanities and social sciences and beyond: DANS in the Netherlands. Archivar 73 , 44–50 (2020).

Berghmans, S. et al . Open data: the researcher perspective. Elsevier website , https://www.elsevier.com/about/open-science/research-data/open-data-report (2017).

Kim, Y. & Yoon, A. Scientists’ data reuse behaviors: a multilevel analysis. J. Assoc. Inf. Sci. Technol. 68 , 2709–2719 (2017).

Kratz, J. E. & Strasser, C. Making data count. Sci. Data 2 , 150039 (2015).

Schmidt, B., Gemeinholzer, B. & Treloar, A. Open data in global environmental research: the Belmont Forum’s open data survey. PLoS ONE 11 , e0146695 (2016).

Tenopir, C. et al . Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PLoS ONE 10 , e0134826 (2015).

American Association for Public Opinion Research. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys . (American Association for Public Opinion Research, 2016).

Wyatt, S. M. In How Users Matter: The Co-Construction of Users and Technology (eds. Oudshoorn, N. & Pinch, T.) Ch. 3 (MIT press, 2003).

Lincoln, Y. & Guba, E. Naturalistic inquiry . (SAGE Publications, 1985).

Firestone, W. A. Alternative arguments for generalizing from data as applied to qualitative research. Educ. Res. 22 (4), 16–23 (1993).

Harzing, A.-W. Response styles in cross-national survey research: a 26-country study. Int. J. Cross Cult. Manag. 6 , 243–266 (2006).


Mongeon, P. & Paul-Hus, A. The journal coverage of Web of Science and Scopus: a comparative analysis. Scientometrics 106 , 213–228 (2016).

Vera-Baceta, M.-A., Thelwall, M. & Kousha, K. Web of Science and Scopus language coverage. Scientometrics 121 , 1803–1813 (2019).

Elsevier. Scopus content coverage guide. Elsevier website https://www.elsevier.com/__data/assets/pdf_file/0007/69451/Scopus_ContentCoverage_Guide_WEB.pdf (2020).


Acknowledgements

Paul Groth, Andrea Scharnhorst and Sally Wyatt provided valuable feedback and comments on this paper. I also wish to acknowledge Ricardo Moreira for his assistance in creating the sample and distributing the survey, Wouter Haak for his organizational support, Helena Cousijn for her advice in designing the survey, and Emilie Kraaikamp for her advice regarding personally identifiable information. This work is part of the project Re-SEARCH: Contextual Search for Research Data and was funded by the NWO Grant 652.001.002.

Author information

Authors and Affiliations

Data Archiving and Networked Services, Royal Netherlands Academy of Arts & Sciences, Anna van Saksenlaan 51, 2593 HW, Den Haag, The Netherlands

Kathleen Gregory


Corresponding author

Correspondence to Kathleen Gregory .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.


About this article

Cite this article.

Gregory, K. A dataset describing data discovery and reuse practices in research. Sci Data 7 , 232 (2020). https://doi.org/10.1038/s41597-020-0569-5


Received : 06 April 2020

Accepted : 12 June 2020

Published : 13 July 2020

DOI : https://doi.org/10.1038/s41597-020-0569-5



  • Maps & Floorplans
  • Libraries A-Z

University of Missouri Libraries

  • Ellis Library (main)
  • Engineering Library
  • Geological Sciences
  • Journalism Library
  • Law Library
  • Mathematical Sciences
  • MU Digital Collections
  • Veterinary Medical
  • More Libraries...
  • Instructional Services
  • Course Reserves
  • Course Guides
  • Schedule a Library Class
  • Class Assessment Forms
  • Recordings & Tutorials
  • Research & Writing Help
  • More class resources
  • Places to Study
  • Borrow, Request & Renew
  • Call Numbers
  • Computers, Printers, Scanners & Software
  • Digital Media Lab
  • Equipment Lending: Laptops, cameras, etc.
  • Subject Librarians
  • Writing Tutors
  • More In the Library...
  • Undergraduate Students
  • Graduate Students
  • Faculty & Staff
  • Researcher Support
  • Distance Learners
  • International Students
  • More Services for...
  • View my MU Libraries Account (login & click on My Library Account)
  • View my MOBIUS Checkouts
  • Renew my Books (login & click on My Loans)
  • Place a Hold on a Book
  • Request Books from Depository
  • View my ILL@MU Account
  • Set Up Alerts in Databases
  • More Account Information...

Data Sets for Quantitative Research: Public Use Datasets


Finding Datasets on the Internet

There are many research organizations making data available on the web, but still no perfect mechanism for searching the content of all these collections. The links below will take you to data search portals which seem to be among the best available. Note that these portals point to both free and pay sources for data, and to both raw data and processed statistics.

  • PEW Research Center
  • Open Access Directory (OAD) Data Repositories
  • UK Data Archive
  • Socioeconomic Applications Data Center
  • Council of European Social Science Data Archives (CESSDA)
  • NTIS Federal Computer Products Center.* Includes databases, data files, CD-ROMs, etc., available for purchase.
  • Harvard DataVerse
  • re3data.org Registry of Research Data Repositories
  • Open Data: European Commission Launches European Data Portal (over 1 million datasets From 36 countries)
  • Awesome Public Datasets (on github)*. Includes a mix of free and pay resources.
  • SNAP (Stanford Network Analysis Project)
  • Statistics, Resources and Big Data on the Internet, 2020 *

 * Resources that are not entirely free are marked with an asterisk.

Transform web information into machine-readable data for analysis

Have you found fantastic numeric information in a less-than-ideal format, such as PDF or HTML? Here are some software products that may help you transform those formats into numbers that you can read into a spreadsheet or statistical software program (a short code sketch for one of these follows the list). Some of these are free or offer limited-time free trials:

  • Spark OCR : Find tables in images, visually identify rows and columns, and extract data from cells into data frames. Turn scans from financial disclosures, academic papers, lab results and more into usable data. 
  • PDFTables : PDF to Excel Converter
  • Tabula : Extract tables from PDFs
  • table-ocr : For those who know Python
  • Abbyy Finereader : Access and modify information locked in paper-based documents and PDF files
  • OCR Space : This free service transforms PDFs into plain text files directly in your browser.  Rows and columns are preserved, making it easier to import the file into Excel using the Import Text Wizard .  See further explanation and instructions here:  Table recognition with OCR .  
  • Parsehub : Data mining tool for data scientists and journalists
  • Webhose : Turn unstructured web content into machine-readable data feeds
  • Data Streamer : Index weblogs, mainstream news, and social media
  • Outwit : Turn websites into structured data
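As one concrete example from the list above, a minimal sketch using the tabula-py wrapper around Tabula (this assumes Java and the tabula-py package are installed; the file name is a placeholder):

import tabula

# Extract every table Tabula can detect in the PDF as a list of pandas DataFrames.
tables = tabula.read_pdf("report.pdf", pages="all")
# Write the first detected table to CSV for import into Excel or stats software.
tables[0].to_csv("table1.csv", index=False)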

Feeling intrigued, but unsure how to leverage web-based data for your own research?  Here are some how-to guides:

  • Data Journalism: What it is and why should I care?
  • How to get data from the Web
  • Manipulating data
  • Data for journalists: a practical guide for computer-assisted reporting by Brant Houston (2019)
  • Scraping for journalists by Paul Bradshaw (2013)
  • Data Journalism Heist by Paul Bradshaw (2013)

Selected datasets on the Internet, arranged by topic

These are some of the most significant datasets available on the internet, arranged by topic.  Almost everything here is freely available. The few that do involve fees are marked with asterisks (*). Note that some of the listings below are also available in ICPSR.

Political Science/Public Policy

  • American National Election Studies
  • Conflict and Peace Data Bank, 1948-1978  available through ICPSR
  • Correlates of War
  • Cross-National Time-Series Data Archive  (available as a library item on CD-ROM)
  • International Country Risk Guide (ICRG) Table 3B: Political Risk Points by Component, 1984-2009  (available through MU Library to current affiliates)
  • Polidata Presidential Results By Congressional District 1992-2004  (available through MU Library to current affiliates)
  • Record of American Democracy, 1984-1990
  • Survey of Income and Program Participation 

Demographics

  • IPUMS: Integrated Public Use Microdata Series
  • Geocorr  -- Geographic Correspondence Engine
  • Missouri Census Data Center UEXPLORE/Dexter  ( explanation )
  • National Historical Geographic Information System

Business and Economics

  • Consumer Expenditure Surveys microdata
  • National Bureau of Economic Research data
  • National Longitudinal Surveys from the Bureau of Labor Statistics
  • Panel Study of Income Dynamics
  • World Bank – Poverty and Equity Data
  • International industrial development  (manufacturing, mining, utilities, etc.) data from the United Nations (UNIDO)
Health

  • Biologic Specimen and Data Repository Information Coordinating Center (bioLINCC)
  • Demographic and Health Surveys (mainly 3rd world countries)
  • Global Health Observatory data repository  from the World Heath Organization
  • ICPSR Health and Medical Care Archive  
  • ICPSR National Addiction and HIV Data Archive Program
  • National Center for Health Statistics Public Use Data Files  from the U.S. Centers for Disease Control
  • Missouri Information for Community Assessment (MICA) health datasets
  • National Longitudinal Study of Adolescent Health
  • National Cancer Institute SEER data
Environment

  • DataONE Earth and environmental data
  • EPA Environmental dataset gateway
Social Sciences

  • General Social Survey
  • National Longitudinal Surveys  (U.S. Bureau of Labor Statistics)
  • National Survey of Households and Families
  • Pew Internet & Technology
  • World Values Survey
Education

  • National Center for Education Statistics DataLab
  • The National Survey of College Graduates (NSCG)
  • NCES Public Elementary & Secondary Schools Universe Survey Data
  • The Survey of Doctorate Recipients (SDR)

Miscellaneous

  • American Religion Data Archive
  • National Household Travel Survey
  • Roper Opinion Polls * 

*Resources that are not entirely free are marked with an asterisk

  • << Previous: Home
  • Next: ICPSR >>
  • Last Updated: Nov 3, 2023 1:05 PM
  • URL: https://libraryguides.missouri.edu/datasets

Facebook Like

AMIA Jt Summits Transl Sci Proc (2021)

Recommender system of scholarly papers using public datasets

The exponential growth of public datasets in the era of Big Data demands new solutions for making these resources findable and reusable. A scholarly recommender system for public datasets is therefore an important tool in the field of information filtering: it can aid scholars in identifying prior and related literature for datasets, save them time, and enhance dataset reusability. In this work, we developed a scholarly recommendation system that recommends research papers from PubMed relevant to public datasets from the Gene Expression Omnibus (GEO). Different techniques for representing textual data are employed and compared in this work. Our results show that term-frequency based methods (BM25 and TF-IDF) outperformed all others, including popular Natural Language Processing embedding models such as doc2vec, ELMo and BERT.

Introduction

Recommendation systems, or recommenders, are information filtering systems that employ data mining and analytics of users' behaviors, including preferences and activities, to predict users' interests in information, products or services. There are broadly two types of recommenders: collaborative filtering and content-based. The former works by utilizing the rating activities of items or users, while the latter works by comparing descriptions of items or profiles of users' preferences.

With the ever-growing public information online, recommendation systems have proven to be an effective strategy to deal with information overload. In fact, recommenders are thriving in this era of Big Data with wide commercial applications in recommending products (e.g. Amazon), music 1 , movies 2 , books 3 , news articles 4 , and many more.

Applications of recommendation systems are currently expanding beyond the commercial to include scholarly activities. The first recommendation system for research papers was introduced in the CiteSeer project 5 . Following that, Science Concierge 6 , PURE 7 , pmra 8 were also developed for recommending articles. More recent experiments include Colin and Beel's 9 and A. Mohamed Hassan et al.'s 10 , in which they experimented with Natural Language Processing (NLP) models.

The aforementioned systems are all paper-to-paper recommenders, i.e., they provide recommendations of papers similar to a given paper. To the best of our knowledge, no prior research has been performed on recommending papers based on public datasets. There are many public datasets available on the internet which might be useful to researchers for further exploration. A scholarly literature recommendation system for datasets is an important and very helpful tool in the field of information filtering: it can aid in identifying prior and related literature on a dataset's topic, save researchers' time, and enhance dataset reusability. Further, recommending literature for datasets is a field of research yet to be explored.

In this paper, we describe the development of a content-based recommendation system that recommends articles from PubMed corresponding to datasets (referred to as data series) from the Gene Expression Omnibus (GEO). GEO is a public repository for high-throughput microarray and next-generation sequence functional genomics data. As of Feb 05, 2020, there were 124,825 data series available in GEO (a series record links together a group of related samples and provides a focal point and description of the whole study 11 ). Many of these series' data were collected at enormous effort only to be used once. We believe that dataset use and reuse can be significantly improved by recommending research papers relevant to such datasets to researchers, an idea consistent with the NIH Strategic Plan for Data Science 12 . We experimented with and compared a variety of vector representations, from traditional term-frequency based methods and topic-modeling to embeddings, and evaluated the resulting recommendations using existing citations as a reference. The work described herein is part of the dataset re-usability platform (GETc Research Platform) developed at The University of Texas Health Science Center at Houston, available at http://genestudy.org .

Relevant work

CiteSeer 5 is a content-based recommender based on keyword matching, Term Frequency-Inverse Document Frequency (TF-IDF) for word information, and Common Citation-Inverse Document Frequency (CCIDF) for citation information. Science Concierge 6 is another content-based research article recommendation system, using Latent Semantic Analysis (LSA) and Rocchio Algorithms with large-scale approximate nearest neighbor search based on ball trees. PURE 7 is a content-based PubMed article recommender developed using a finite mixture model for soft clustering with the Expectation-Maximization (EM) algorithm; it achieved 78.2% precision at 10% recall with 200 training articles. Lin and Wilbur developed pmra 8 , a probabilistic topic-based content similarity model for PubMed articles. Their method achieved a slight but statistically significant improvement in precision@5 compared to BM25.

With the popularity of NLP models such as Google's doc2vec, USE, and most recently BERT, there have been efforts to incorporate these embedding methods in research paper recommenders. Colin and Beel 9 experimented with doc2vec, TF-IDF and key phrases for providing related-article recommendations to both the digital library Sowiport 13 and the open-source reference manager JabRef 14 . A. Mohamed Hassan et al. 10 evaluated USE, InferSent, ELMo, BERT and SciBERT for reranking BM25 results for research paper recommendations.

We used data series from GEO and MEDLINE articles from PubMed. For GEO series, metadata such as the title, summary, date of publication and names of authors were collected using a web crawler. We also collected the PMIDs of the articles associated with each series. From these PMIDs, metadata of the corresponding articles, such as the title, abstract, authors, affiliations, MeSH terms and publisher name, were also collected. Figure 1 shows an example of a GEO data series; Figure 2 shows an example of a PubMed publication.

Figure 1. An example of a GEO data series.

Figure 2. An example of a PubMed publication.

In order to automatically evaluate our recommendations using metrics such as precision and recall, we kept only the series that have associated citations (publications). That left us with a total of 72,971 unique series and 50,159 associated unique publications. Multiple series can reference the same paper(s); 96% of the series have only 1 related publication and the rest have between 2 and 10.

We adopted an information retrieval strategy, where the data series are treated as queries and the list of recommended publications as retrieved documents. In our experiments, series were represented by their titles and summaries; while publications were represented by their titles and abstracts. Further, we removed stop words, punctuation, and URLs from summaries of series before transforming them into vectors.
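As a concrete illustration of that preprocessing step, here is a minimal sketch; the paper does not name its tooling, so the use of NLTK's stop-word list and this exact URL regex are assumptions.

    import re
    import string

    from nltk.corpus import stopwords  # first run: nltk.download("stopwords")

    STOP_WORDS = set(stopwords.words("english"))
    URL_RE = re.compile(r"https?://\S+|www\.\S+")

    def preprocess(text: str) -> str:
        """Lowercase, strip URLs and punctuation, then drop stop words."""
        text = URL_RE.sub(" ", text.lower())
        text = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

    # e.g. a series title plus summary becomes a compact bag of content words
    print(preprocess("Human cleavage stage embryos are chromosomally unstable. "
                     "See http://genestudy.org for details."))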

We used cosine similarity as the ranking score, a popular measure of feature similarity in query-document analysis 15 due to its low complexity and intuitive definition. We returned only the top 10 recommendations by cosine similarity, a realistic scenario since few people would check the end of a long recommendation list. Figure 3 shows our recommender's architecture.
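A minimal sketch of this ranking step, assuming the query and document vectors have already been produced by one of the representation methods described below; the shapes and random vectors are placeholders, not the real data.

    import numpy as np

    def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 10) -> np.ndarray:
        """Indices of the k rows of doc_matrix most cosine-similar to query_vec."""
        sims = doc_matrix @ query_vec
        sims = sims / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec) + 1e-12)
        return np.argsort(-sims)[:k]

    rng = np.random.default_rng(0)
    docs = rng.normal(size=(50159, 300))   # one row per publication (title + abstract)
    query = rng.normal(size=300)           # one GEO series (title + summary)
    print(top_k(query, docs))              # indices of the 10 recommended publications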

Figure 3. Literature recommendation system architecture.

The recommendations were then evaluated using existing series-articles relationships from series metadata using MRR@10, recall@1, recall@10, precision@1, and MAP@10.

Vector representation

Methods of representing textual data in recommendation systems range from traditional term-frequency based methods and topic-modeling to embeddings. Below is the list of methods we experimented with in this study:

TF-IDF : a numerical statistical representation of how important a word is to a document in a collection or corpus 16 . For each vocabulary term V, the value increases proportionally to the number of times V appears in the document (term frequency, TF) and is offset by the number of documents that contain V (inverse document frequency, IDF). We used the TF-IDF implementation from scikit-learn 17 .
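A hedged sketch of this route with scikit-learn; the two "publications" and the query string are invented stand-ins for real title+abstract text.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    publications = [
        "human cleavage stage embryos chromosomally unstable",
        "genome-wide copy number variation in single cells",
    ]
    vectorizer = TfidfVectorizer()
    pub_vectors = vectorizer.fit_transform(publications)   # fit on the corpus
    query_vector = vectorizer.transform(["chromosome instability in embryos"])
    print(cosine_similarity(query_vector, pub_vectors).ravel())  # one score per paper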

BM25 : a ranking function based on a probabilistic retrieval framework that utilizes adjusted values of TF and IDF along with document length 18 . We used the BM25 implementation from gensim 19 .
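gensim's BM25 module (gensim.summarization.bm25) was removed in gensim 4.x, so this sketch substitutes the standalone rank_bm25 package (pip install rank-bm25) to show the same idea; the corpus is a toy stand-in.

    from rank_bm25 import BM25Okapi

    publications = [
        "human cleavage stage embryos chromosomally unstable",
        "genome-wide copy number variation in single cells",
    ]
    bm25 = BM25Okapi([doc.split() for doc in publications])    # tokenized corpus
    scores = bm25.get_scores("chromosome instability embryos".split())
    print(scores)  # one BM25 score per publication; sort descending for the top 10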

LSA : a topic modeling technique that applies singular value decomposition (SVD) to a term-frequency matrix to find a low-rank approximate representation. We used TruncatedSVD from scikit-learn for the LSA implementation, with a reduced dimension equal to 300.
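An illustrative LSA pipeline: TF-IDF followed by TruncatedSVD. The paper reduced to 300 dimensions; the toy corpus below is far too small for that, so 2 components are used here.

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = [
        "human cleavage stage embryos chromosomally unstable",
        "genome-wide copy number variation in single cells",
        "aneuploidy and uniparental isodisomy in preimplantation embryos",
    ]
    tfidf = TfidfVectorizer().fit_transform(texts)
    lsa = TruncatedSVD(n_components=2, random_state=0)  # paper: n_components=300
    dense_vectors = lsa.fit_transform(tfidf)            # one low-rank row per text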

word2vec 20 , 21 : a two-layer neural network trained to reconstruct linguistic contexts of words by mapping each unique word to a corresponding vector. We utilized word2vec implemented in gensim, with an embedding dimension of 200.

doc2vec 21 : a neural network method that extends word2vec and learns continuous distributed vector representations for variable-length pieces of texts. We utilized doc2vec implemented in gensim, with an embedding dimension of 300.
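A gensim doc2vec sketch; vector_size=300 follows the paper, while the corpus and remaining hyperparameters are placeholders.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    texts = [
        "human cleavage stage embryos chromosomally unstable",
        "genome-wide copy number variation in single cells",
    ]
    tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]
    model = Doc2Vec(tagged, vector_size=300, min_count=1, epochs=40)
    query_vec = model.infer_vector("chromosome instability embryos".split())
    print(query_vec.shape)  # shape (300,), ready for cosine ranking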

ELMo 22 : a deep, contextualized bi-directional Long Short-Term Memory (LSTM) model that was pre-trained on 1B Word Benchmark 23 . We used the latest TensorFlow Hub implementation 24 of ELMo to obtain embeddings of 1024 dimensions.

InferSent 25 : a bi-directional LSTM encoder with max-pooling that was pre-trained on the supervised data of Stanford Natural Language Inference (SNLI) 26 . There are two versions of InferSent models, and we used one with fastText word embeddings from Facebook's github 27 , with the resulting embedding dimension of 4096.

USE 28 : Universal Sentence Encoder, developed by Google, has two variations of model structures: one is transformer-based while the other one is Deep Average Network (DAN)-based, both of which were pre-trained on unsupervised data such as Wikipedia, web news and web question-answer pages, discussion forums, and further on supervised data of SNLI. We used the TensorFlow Hub implementation of transformer USE to obtain embeddings of 512 dimensions.

BERT 29 : Bidirectional Encoder Representations from Transformers, developed by Google, which has previously achieved state-of-the-art performance in many classical natural language processing tasks. It was pre-trained on the 800M-word BooksCorpus and 2500M-word English Wikipedia using masked language modeling (MLM) and next sentence prediction (NSP) as the pre-training objectives. We used the package Sentence-BERT 30 to obtain vectors optimized for the Semantic Textual Similarity (STS) task, which are of 768 dimensions.
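A Sentence-BERT sketch; the paper says only that the vectors were optimized for STS, so this particular checkpoint name is an assumption.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")  # 768-dim vectors
    sentences = [
        "chromosome instability in human embryos",              # the series query
        "human cleavage stage embryos chromosomally unstable",   # candidate papers
        "genome-wide copy number variation in single cells",
    ]
    embeddings = model.encode(sentences, convert_to_tensor=True)
    print(util.cos_sim(embeddings[0], embeddings[1:]))  # similarity to each paper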

SciBERT 31 : a BERT model further pre-trained on a 1.14M full-paper corpus from semanticscholar.org 32 . Similarly, we used Sentence-BERT to obtain vectors of 768 dimensions.

BioBERT 33 : a BERT model further pre-trained on a large-scale biomedical corpus, i.e. 4.5B-word PubMed abstracts and 13.5B-word PubMed Central full-text articles. As with BERT, vectors of 768 dimensions were obtained using Sentence-BERT.

RoBERTa 34 : a robustly optimized version of BERT, further pre-trained on the CC-News 35 corpus with enhanced hyperparameter choices, including batch sizes, epochs, and dynamic masking patterns in the pre-training process. We used Sentence-BERT to obtain vectors of 768 dimensions.

DistilBERT 36 : a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of the original performance. We used Sentence-BERT to obtain vectors of 768 dimensions.

For all term-frequency based methods, the experiments were performed on 8 Intel(R) Xeon(R) Gold 6140 CPUs @ 2.30GHz. For embedding-based methods, the experiments were performed using 1 Tesla V100-PCIE-16GB GPU. The implementations of the experiments are at https://github.com/chocolocked/RecommendersOfScholarlyPapers

Evaluation metrics

The following metrics were used to evaluate our system:

Mean reciprocal rank (MRR)@k : The Reciprocal Rank (RR) measures the reciprocal of the rank at which the first relevant document was retrieved. RR is 1 if the relevant document was retrieved at rank 1, RR is 0.5 if the first relevant document was retrieved at rank 2, and so on. When RR over the top k retrieved items is averaged across queries, the measure is called Mean Reciprocal Rank@k 37 . In our case, we chose k=10.

Recall@k : At the k-th retrieved item, this metric measures the proportion of relevant items that are retrieved. We evaluated both recall@1 and recall@10.

Precision@k : At the k-th retrieved item, this metric measures the proportion of the retrieved items that are relevant. We are interested in precision@1, since most of our data series have only 1 corresponding publication and therefore only 1 relevant item.

Mean average precision (MAP)@k : Average Precision is the average of the precision values obtained for the set of top k items after each relevant document is retrieved. When Average Precision is averaged again over all queries, this value becomes the Mean Average Precision.
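Plain-Python versions of these metrics, following the standard definitions above; here recommended is a ranked list of PMIDs and relevant is the set of PMIDs cited by a series. This is a sketch, not the authors' evaluation code.

    def mrr_at_k(recommended, relevant, k=10):
        """Reciprocal rank of the first relevant item within the top k, else 0."""
        for rank, pid in enumerate(recommended[:k], start=1):
            if pid in relevant:
                return 1.0 / rank
        return 0.0

    def recall_at_k(recommended, relevant, k):
        return sum(pid in relevant for pid in recommended[:k]) / len(relevant)

    def precision_at_k(recommended, relevant, k):
        return sum(pid in relevant for pid in recommended[:k]) / k

    def average_precision_at_k(recommended, relevant, k=10):
        hits, total = 0, 0.0
        for rank, pid in enumerate(recommended[:k], start=1):
            if pid in relevant:
                hits += 1
                total += hits / rank
        return total / min(len(relevant), k)

    # MAP@10 is then the mean of average_precision_at_k over all 72,971 series.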

Detailed procedure-example

Below, we demonstrate the detailed procedure using BM25 and the data series 'GSE11663' as an example:

  • For each of the 50,159 publications, we concatenated processed titles with abstracts. We then created a BM25 object, its dictionary and corpus out of the list.
  • For 'GSE11663', we concatenated the title ( 'human cleavage stage embryos chromosomally unstable' ) and the processed summary ( 'embryonic chromosome aberrations cause birth defects reduce human fertility however neither nature incidence known develop assess genome-wide copy number variation loss heterozygosity single cells apply screen blastomeres vitro fertilized preimplantation embryos complex patterns chromosome-arm imbalances segmental deletions duplications amplifications reciprocal sister blastomeres detected large proportion embryos addition aneuploidies uniparental isodisomies frequently observed since embryos derived young fertile couples data indicate chromosomal instability common human embryogenesis comparative genomic hybridisation' ) and got its vector representation using the dictionary: [ (27, 1), (32, 1), (44, 1), (46, 1), (80, 1), (116, 1), (141, 1), (175, 1), (182, 1), (190, 1), (360, 2), (390, 1), (407, 1), (530, 1), (649, 1), (663, 1), (725, 1), (842, 1), (844, 1), (999, 1), (1034, 1), (1186, 1), (1235, 1), (1370, 1), (1634, 1), (1635, 1), (1636, 1), (1761, 1), (1862, 1), (2023, 1), (2174, 1), (2224, 1), (2292, 1), (2675, 1), (2677, 1), (3023, 1), (3082, 1), (3113, 1), (3144, 2), (3145, 2), (3153, 1), (3697, 1), (4265, 1), (4935, 1), (5021, 1), (5105, 1), (5775, 1), (6665, 1), (6772, 1), (6828, 1), (7298, 1), (7372, 1), (7684, 1), (7808, 1), (7949, 1), (8211, 1), (8344, 1), (8569, 2), (8974, 1), (9009, 1), (9302, 1), (9705, 1), (10480, 1), (11360, 1), (17139, 1), (24769, 1), (28560, 1), (38594, 1), (54855, 1), (228500, 1), (250370, 1) ]. Then we used sklearn's cosine_similarity to get similarity scores of all 50,159 publications with this series.
  • 'GSE11663' has the citations ['19396175', '21854607'] (unordered), and our top 10 recommendations were ['19396175', '23526301', '16698960', '25475586', '29040498', '23054067', '27197242', '23136300', '24035391', '18713793']; the first recommendation was a hit. In this case, MRR@10 = 1/1 = 1, recall@1 = 1/3 = 0.33, recall@10 = 1/3 = 0.33, precision@1 = 1/1 = 1, and MAP@10 = (1/2)(1 + 0) = 0.5.
  • We repeated the above two steps and computed averages over all 72,971 series.

Table 1 shows the results of our experiments with different vector representations. BM25 outperformed all other methods in terms of all evaluation metrics, with MRR@10, recall@1, recall@10, precision@1, and MAP@10 of 0.785, 0.742, 0.833, 0.756, and 0.783 respectively, followed closely by TF-IDF. None of the embedding methods alone was able to outperform BM25. Furthermore, word2vec, doc2vec, and BioBERT were among the top embedding methods outperforming ELMo, USE, and the rest.

Our findings show that traditional term-frequency based methods (BM25, TF-IDF) were more effective for recommendations than embedding methods. This contrasts with the belief that embeddings, given their performance in standardized general NLP tasks such as sentiment analysis, Questions & Answering (Q&A), and Named Entity Recognition (NER), should win everywhere; they failed to show an advantage in the simple scenario of capturing semantic similarity as measured by cosine similarity. Even though the context was not exactly the same, Colin and Beel 9 found in their studies that doc2vec failed to beat TF-IDF or key phrases in their two experimental setups of publication recommendations for the digital library Sowiport and the reference manager JabRef. Moreover, A. Mohamed Hassan et al. 10 concluded in their study that none of the sentence embeddings they employed (USE, InferSent, ELMo, BERT and SciBERT) was able to outperform BM25 alone for research paper recommendations.

One possible reason could be that traditional statistical methods produce better features when the queries are relatively homogeneous: Ogilvie and Callan 38 showed that single-database (homogeneous) queries with TF-IDF performed uniformly better than multi-database (heterogeneous) queries when no additional IR techniques, such as query expansion, were involved. Currently, we are only using GEO datasets for queries, which are all related to gene expression. But as we introduce more diverse datasets to our platform in the future, e.g. immunology and infectious disease datasets, the heterogeneity might require more advanced embedding methods. Further, as we observed an approximately 8% improvement from regular BERT to BioBERT, we think it might be important for NLP models to be further trained on domain-specific corpora to produce better feature representations for cosine similarity. Another possible reason could be that, since these embeddings were pre-trained on standardized tasks, they might be specialized towards those tasks instead of representing simple semantic information. This could explain the observation that general text embeddings, e.g. word2vec and doc2vec, performed better than more specialized NLP models, e.g. ELMo and BERT, which were pre-trained to perform tasks such as Q&A and sequence classification. We might therefore be able to take full advantage of their potential by reformulating our problem from simple cosine similarity between query and documents to, for example, matching classification, a format closer to what these models were designed for in the first place. That is also the direction we are heading towards in future experiments.

Even though we do not currently have user feedback for manual evaluation, we did manually inspect the recommendation results for the completeness of our experiments, particularly for those cases where the cited articles did not appear within the top 5 recommendations. We randomly sampled 20 such data series and examined the recommended papers by thoroughly reading through each paper's abstract, introduction, and methods. We had some interesting observations regarding those cases. For example, for the 'GSE96174' data series, even though our top 5 recommendations did not include the existing related article, three of them actually cited and used the data series as relevant research material. Another example is that of 'GSE27139', where our top recommendations were from the same author that submitted the data series, and those articles were extensions of their previous research work. Due to time limitations, we could not check all 13,013 cases, but we found at least 10 cases ('GSE96174', 'GSE836', 'GSE92231', 'GSE78579', 'GSE96211', 'GSE27139', 'GSE10903', 'GSE105628', 'GSE44657', 'GSE81888') with similar situations, where the top 3 recommendations were, to the best of our judgement, associated with the data series of concern even though they did not appear in its citations at the time of our experiments. Therefore, we believe that our recommendation system might do even better in a real setting than the evaluations presented here suggest.

We also experimented with re-ranking, where the final ranking score is defined as the previous cosine similarity plus a re-ranking score computed as the cosine similarity of only the titles of the queried dataset and the articles. We did not find statistically significant improvements, and therefore do not report those results in this paper.

In this work, we developed a scholarly recommendation system to identify and recommend research papers relevant to public datasets. The sources of papers and datasets are PubMed and Gene Expression Omnibus (GEO) series, respectively. Different techniques for representing textual data, ranging from traditional term-frequency based methods and topic-modeling to embeddings, were employed and compared in this work. Our results show that embedding models that perform well on standardized NLP tasks failed to outperform term-frequency based probabilistic methods such as BM25. General embeddings (word2vec and doc2vec) performed better than more specialized embeddings (ELMo and BERT), and domain-specific embeddings (BioBERT) performed better than non-domain-specific embeddings (BERT). In future experiments, we plan to develop a hybrid method combining the strengths of term-frequency approaches and embeddings to maximize their potential in different (heterogeneous vs. homogeneous) problem scenarios. In addition, we plan to engage users in rating our recommendations, use an inter-rater agreement approach to further evaluate results, and incorporate the feedback to improve our system. We hope to combine content-based and collaborative filtering for better recommendations.

Given their usefulness, extending the application of recommender systems to aid scholars in finding relevant information and resources will significantly enhance research productivity and ultimately promote data and resource reusability.


Open Data

Open Data is a strategy for incorporating research data into the permanent scientific record by releasing it under an Open Access license. Whether data is deposited in a purpose-built repository or published as Supporting Information alongside a research article, Open Data practices ensure that data remains accessible and discoverable for verification, replication, reuse, and enhanced understanding of research.

Benefits of Open Data

Readers rely on raw scientific data to enhance their understanding of published research, for purposes of verification, replication and reanalysis, and to inform future investigations.

Ensure reproducibility Proactively sharing data ensures that your work remains reproducible over the long term.

Inspire trust Sharing data demonstrates rigor and signals to the community that the work has integrity.

Receive credit Making data public opens opportunities to get academic credit for collecting and curating data during the research process.

Make a contribution Access to data accelerates progress. According to the 2019 State of Open Data report, more than 70% of researchers use open datasets to inform their future research.

Preserve the scientific record Posting datasets in a repository or uploading them as Supporting Information prevents data loss.


PLOS Open Data policy

Publishing in a PLOS journal carries with it a commitment to make the data underlying the conclusions in your research article publicly available upon publication.

Our data policy underscores the rigor of the research we publish, and gives readers a fuller understanding of each study.


Data repositories

All methods of data sharing facilitate reproduction, improve trust in science, ensure appropriate credit, and prevent data loss. When you choose to deposit your data in a repository, those benefits are magnified and extended.

Data posted in a repository is…

…more discoverable.

Detailed metadata and bidirectional linking to and from related articles help to make data in public repositories easily findable.

…more reusable.

Machine-readable data formatting allows research in a repository to be incorporated into future systematic reviews or meta analyses more easily.

…easier to cite.

Repositories assign data its own unique DOI, distinct from that of related research articles, so datasets can accumulate citations in their own right, illustrating the importance and lasting relevance of the data itself.

…more likely to earn citations.

A 2020 study of more than 500,000 published research articles found that articles linking to data in a public repository were likely to have a 25% higher citation rate on average than articles whose data was available only on request or as Supporting Information.

Open Data is more discoverable and accessible than ever

Deposit your data in a repository and earn an accessible data icon.


You already know depositing research data in a repository yields benefits like improved reproducibility, discoverability, and more attention and citations for your research.

PLOS helps to magnify these benefits even further with our Accessible Data icon. When you link to select, popular data repositories, your article earns an eye-catching graphic with a link to the associated dataset, so it’s more visible to readers.

Participating data repositories include: 

  • Open Science Framework (OSF)
  • Gene Expression Omnibus
  • NCBI Bioproject
  • NCBI Sequence Read Archive
  • Demographic and Health Surveys

We aim to add more repositories to the list in the future.

The PLOS Open Science Toolbox

The future is open

The PLOS Open Science Toolbox is your source for sci-comm tips and best practices. Learn practical strategies and hands-on tips to improve reproducibility, increase trust, and maximize the impact of your research through Open Science.

Sign up to have new issues delivered to your inbox every week.

Learn more about the benefits of Open Science.

Yale Scientific Article Summarization Dataset

Background: What is SciSumm?

The ScisummNet Corpus

  • significantly larger than the previous CL-SciSumm 2017 corpus
  • more citation information is available

The following paper introduces the corpus in detail and shows how ScisummNet enables the training of data-driven neural summarization models for scientific papers. Read the paper (AAAI 2019). NEWS: the ScisummNet Corpus was featured in the CL-SciSumm 2019 shared task! Check out the project page.

Getting Started

Download the dataset (distributed under the CC BY-SA 4.0 license): ScisummNet ver1.1 (15 MB). When unzipped, the package contains a dataset description and subdirectories for the 1000 papers. Each paper directory contains the paper's PDF file, XML file, annotated citation information (in JSON format), and manual summary. Please see the included documentation for more detail.
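A hypothetical loading sketch for the unzipped package; the root directory name and per-paper file names are assumptions (hence the glob patterns), so consult the bundled dataset description for the authoritative layout.

    import json
    from pathlib import Path

    root = Path("scisummnet")  # hypothetical: wherever the archive was unzipped
    for paper_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        citation_files = list(paper_dir.glob("**/*.json"))  # annotated citations
        xml_files = list(paper_dir.glob("**/*.xml"))        # structured full text
        if citation_files:
            citations = json.loads(citation_files[0].read_text())
        print(paper_dir.name, len(citation_files), len(xml_files))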

If you use our corpus or summarization models, please consider citing the following papers.

Acknowledgment

We thank the members of the CL-Scisumm team, Kokil Jaidka, Muthu Kumar Chandrasekaran, and Min-Yen Kan, for their help on this project. We are also grateful to the developers of the SQuAD website , from which this website design is adapted.


Download Datasets

Pew Research Center makes its data available to the public for secondary analysis after a period of time. See this post for more information on how to use our datasets and contact us at  [email protected]  with any questions.

Find a dataset by research area:

U.S. Politics & Policy

Journalism & Media

Internet & Tech

Science & Society

Religion & Public Life

Hispanic Trends

Global Attitudes & Trends

Social & Demographic Trends

American Trends Panel

Methodology

About Pew Research Center

Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping the world. It conducts public opinion polling, demographic research, media content analysis and other empirical social science research. Pew Research Center does not take policy positions. It is a subsidiary of The Pew Charitable Trusts.

ScienceDaily

Integrated dataset enables genes-to-ecosystems research

A first-ever dataset bridging molecular information about the poplar tree microbiome to ecosystem-level processes has been released by a team of Department of Energy scientists led by Oak Ridge National Laboratory. The project aims to inform research regarding how natural systems function, their vulnerability to a changing climate, and ultimately how plants might be engineered for better performance as sources of bioenergy and natural carbon storage.

The data, described in Nature Publishing Group's Scientific Data , provides in-depth information on 27 genetically distinct variants, or genotypes, of Populus trichocarpa , a poplar tree of interest as a bioenergy crop. The genotypes are among those that the ORNL-led Center for Bioenergy Innovation previously included in a genome-wide association study linking genetic variations to the trees' physical traits. ORNL researchers collected leaf, soil and root samples from poplar fields in two regions of Oregon -- one in a wetter area subject to flooding and the other drier and susceptible to drought.

Details in the newly integrated dataset range from the trees' genetic makeup and gene expression to the chemistry of the soil environment, analysis of the microbes that live on and around the trees and compounds the plants and microbes produce.

The dataset "is unprecedented in its size and scope," said ORNL Corporate Fellow Mitchel Doktycz, section head for Bioimaging and Analytics and project co-lead. "It is of value in answering many different scientific questions." By mining the data with machine learning and statistical approaches, scientists can better understand how the genetic makeup, physical traits and chemical diversity of Populus relate to processes such as cycling of soil nitrogen and carbon, he said.

"The knowledge we generated from this one plant will be folded back into projects that produce biofuels from poplar," said Melanie Mayes, leader of ORNL's Ecosystem Processes group and a collaborator on the project. "The procedure we built here will be needed for bioengineering of other plants, and to help us build climate resilience -- to advance soil carbon storage and reduce greenhouse gas emissions."

The complete dataset comprises more than 25 terabytes. Links to the data are available as part of the National Microbiome Data Collaborative, or NMDC, a DOE initiative supporting data-sharing on the association of microbiomes with environmental processes.

"The dataset represents the largest publicly available metagenomics repository on a tree endosphere," the plant tissue environment that is home to complex microbial communities, said Christopher Schadt, project co-lead and ORNL distinguished staff scientist.

Detailed analyses of the samples resulted in 318 metagenomes, revealing the diversity of microbes living in and around trees through genetic sequencing. Ninety-eight plant transcriptomes provide information on the full range of messenger RNA molecules expressed in the plant roots. The dataset includes 314 metabolomic profiles, supplying information on the small molecules produced by plants and microbes as they grow or in response to stress. Data are also included on associated soil physical and biogeochemical characteristics, examining chemicals present and how they cycle through the environment.

Integrating this "multi-omics" data will provide essential information to scientists studying how plant-related molecular and cellular events are connected to ecosystem processes and behaviors.

Understanding plant, soil nitrogen cycling triggers

The Joint Genome Institute, a DOE Office of Science user facility at Lawrence Berkeley National Laboratory, was a close collaborator on the project. JGI led the metabolomics profiling of the leaf, root and soil environment, or rhizosphere, the plant root transcriptomics sequencing, and the soil rhizosphere and endosphere metagenomics work.

"The combination of metagenomics and metabolomics from leaf, root and soils, along with Populus host transcriptomes, make this a truly unique dataset for the research community and could serve as a central data resource to explore plant-microbe interactions," said Emiley Eloe-Fadrosh, Metagenome Program head at JGI.

The project began as an ORNL pilot called Bio-Scales, supported by the Biological Systems Science Division in the DOE Office of Science's Biological and Environmental Research program. Bio-Scales pursues a better understanding of the plant-microbe relationship with a focus on nitrogen cycling. Nitrogen is an essential nutrient for life, but when overused in agriculture and other applications it can harm water quality or be emitted as the potent greenhouse gas nitrous oxide, or N2O.

"The project required the integration of a lot of diverse expertise," Doktycz said. "It started with a team who went out in the midst of COVID-19 to collect all these diverse materials and got them back to the lab, then prepared, analyzed and extracted data from them. We also had an incredible technical support team who processed hundreds of these samples in a tracked and coordinated way, interfacing with the Joint Genome Institute for the sequence analysis."

In addition to its size and scope, the dataset stands out as being heavily annotated with metadata -- with precise details, for instance, on where and how the sampling took place and a standard format for subsequent data reporting. Adding those elements to data makes information easier to find, understand and reuse.

ORNL's Stanton Martin, who led data management for the project in close coordination with the NMDC, noted that the data-first approach supports artificial intelligence and other analytical approaches to help resolve scientific questions. "The data management we performed on this project is hugely valuable to data practices for other projects like the Plant-Microbe Interfaces Scientific Focus Area and the Center for Bioenergy Innovation at ORNL. It plays to ORNL's strengths in what I call data management's three V's -- data volume, variety and velocity -- and allowed us to take a first step in integrating very large 'omics data in a way that has not been done before."

The project started with Schadt and Mayes traveling to Oregon for sampling. "It normally would have been six scientists, but we had travel restrictions on groups traveling together due to the pandemic," Schadt said. They also had to work around encroaching wildfires, as Oregon experienced an active fire season that year. Schadt and Mayes worked with the assistance of Oregon State University volunteers to gather extensive geotagged samples at the two sites.

Beneficial bioengineering

Mayes said the project "gets at the role of genes in influencing not just the fate of the plant itself, but also the environment around it, such as the soil. For instance, we wanted to understand the potential of soil microbes to either make more nitrate or to remove excess nitrate from the system. We wanted to learn more about how plant genomics influence what soil microbes are doing." Knowing more about the plant and soil nitrogen cycle can affect emissions of N2O, a gas that accounts for 6% of all greenhouse gas emissions in the United States.

"If you know which genes to target that result in the minimization of N2O or nitrate production, then you have the potential to affect both greenhouse gas-related warming and water quality," Mayes said. "You could, for instance, select and further bioengineer plants with the best genetic profile for controlling these emissions."

"This project is unique because it gets at the connection between plant genomes and environmental outcomes like nitrous oxide emissions or nitrate production," Mayes said. "Building one of the first, comprehensive datasets on the plant-microbe relationship also tells us how much we still can learn."


Story Source:

Materials provided by DOE/Oak Ridge National Laboratory . Note: Content may be edited for style and length.

Journal Reference :

  • Christopher Schadt, Stanton Martin, Alyssa Carrell, Allison Fortner, Dan Hopp, Dan Jacobson, Dawn Klingeman, Brandon Kristy, Jana Phillips, Bryan Piatkowski, Mark A. Miller, Montana Smith, Sujay Patil, Mark Flynn, Shane Canon, Alicia Clum, Christopher J. Mungall, Christa Pennacchio, Benjamin Bowen, Katherine Louie, Trent Northen, Emiley A. Eloe-Fadrosh, Melanie A. Mayes, Wellington Muchero, David J. Weston, Julie Mitchell, Mitchel Doktycz. An integrated metagenomic, metabolomic and transcriptomic survey of Populus across genotypes and environments . Scientific Data , 2024; 11 (1) DOI: 10.1038/s41597-024-03069-7


Use Cases for High Performance Research Desktops

4 Apr 2024 · Robert Henschel, Jonas Lindemann, Anders Follin, Bernd Dammann, Cicada Dennis, Abhinav Thota

High Performance Research Desktops are used by HPC centers and research computing organizations to lower the barrier of entry to HPC systems. These Linux desktops are deployed alongside HPC systems, leveraging the investments in HPC compute and storage infrastructure. By serving as a gateway to HPC systems they provide users with an environment to perform setup and infrastructure tasks related to the actual HPC work. Such tasks can take significant amounts of time, are vital to the successful use of HPC systems, and can benefit from a graphical desktop environment. In addition to serving as a gateway to HPC systems, High Performance Research Desktops are also used to run interactive graphical applications like MATLAB, RStudio or VMD. This paper defines the concept of High Performance Research Desktops and summarizes use cases from Indiana University, Lund University and Technical University of Denmark, which have implemented and operated such a system for more than 10 years. Based on these use cases, possible future directions are presented.


IMAGES

  1. How to Use Tables & Graphs in a Research Paper

  2. 3000+ Research Datasets For Machine Learning Researchers By Papers With Code

  3. Data Visualization

  4. Example of dataset description

  5. How we published a successful dataset on Kaggle

  6. FacetSum Dataset

VIDEO

  1. PubMedQA: A Dataset for Biomedical Research Question Answering

  2. [short] RewardBench: Evaluating Reward Models for Language Modeling

  3. Python for Data Analysis: Data Loading, Storage, and File Formats (py4da01 6)

  4. Best Dataset Websites for Data Science

  5. [short] Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition

  6. Statistics for Genomics: Introduction to RNAseq

COMMENTS

  1. Machine Learning Datasets

    9587 datasets • 123865 papers with code. The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images.

  2. Dataset Search

    Learn more about Dataset Search.

  3. Datasets

    Paper--dataset pairs for datasets mentioned or referenced in CORD-19 papers, an open research datasets of papers relevant for COVID-19. Specifically, the content contributes the metadata for these datasets collected from their descriptions in schema.org across data repositories on the Web.

  4. arXiv Dataset

    arXiv dataset and metadata of 1.7M+ scholarly papers across STEM.

  5. scientific_papers

    scientific_papers/arxiv (default config) scientific_papers/pubmed. Description: Scientific papers datasets contains two sets of long and structured documents. The datasets are obtained from ArXiv and PubMed OpenAccess repositories. Both "arxiv" and "pubmed" have two features: article: the body of the document, paragraphs separated by "/n".

  6. arxiv_dataset · Datasets at Hugging Face

    A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces. ... For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the ...

  7. scientific_papers · Datasets at Hugging Face

    Dataset Card for "scientific_papers". Dataset Summary. Scientific papers datasets contains two sets of long and structured documents. The datasets are obtained from ArXiv and PubMed OpenAccess repositories. Both "arxiv" and "pubmed" have two features: article: the body of the document, paragraphs separated by "/n".

  8. SciSciNet: A large-scale open data lake for the science of science research

    The science of science has attracted growing research interests, partly due to the increasing availability of large-scale datasets capturing the innerworkings of science. These datasets, and the ...

  9. Datasets

    Support for data sets associated with arXiv articles. arXiv is primarily an archive and distribution service for research articles. arXiv provides support for data sets and other ancillary materials only in direct connection with research articles submitted. arXiv supports the inclusion of ancillary files of modest size with articles. If you are including multiple page datasets or code with ...

  10. Machine Learning Datasets

    SSN (short for Semantic Scholar Network) is a scientific papers summarization dataset which contains 141K research papers in different domains and 661K citation relationships. Internet Archive Scholar Reference Dataset. A scholarly named entity recognition dataset with focus on machine learning models and datasets.

  11. A dataset describing data discovery and reuse practices in research

    This paper presents a dataset produced from the largest known survey examining how researchers and support professionals discover, make sense of and reuse secondary research data. 1677 respondents ...

  12. Datasets: A Community Library for Natural Language Processing

    The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for ...

  13. Everything you always wanted to know about a dataset ...

    1. In this paper, a "dataset" refers to structured or semi-structured information collected by an individual or organisation, which is distributed in a standard format, for instance as CSV files. In the context of search, it refers to the artifacts returned by a search algorithm in response to a user query. 2.

  14. [2105.03011] A Dataset of Information-Seeking Questions and Answers

    A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner. Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the ...

  15. The latest in Machine Learning

    Papers With Code highlights trending Machine Learning research and the code to implement it. ... Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets.

  16. Data Sets for Quantitative Research: Public Use Datasets

    Convert PDF charts and tables into machine-readable, numeric datasets Spark OCR: Find tables in images, visually identify rows and columns, and extract data from cells into data frames. Turn scans from financial disclosures, academic papers, lab results and more into usable data. PDFTables: PDF to Excel Converter; Tabula : Extract tables from PDFs

  17. Recommender system of scholarly papers using public datasets

    We believe that dataset use and reuse can be significantly improved when recommending research papers that are relevant to such dataset to researchers, an idea consistent with NIH Strategic Plan for Data Science 12. We experimented and compared a variety of vector representations from traditional term-frequency based methods and topic-modeling ...

  18. Open Data

    Open Data is a strategy for incorporating research data into the permanent scientific record by releasing it under an Open Access license. Whether data is deposited in a purpose-built repository or published as Supporting Information alongside a research article, Open Data practices ensure that data remains accessible and discoverable.

  19. ScisummNet

    Getting Started. Download the dataset (distributed under the CC BY-SA 4.0 license): ScisummNet ver1.1 (15 MB). When unzipped, the package contains a dataset description and subdirectories for the 1000 papers. Each paper directory contains the paper's PDF file, XML file, annotated citation information (in JSON format), and manual summary.

  20. [1405.0312] Microsoft COCO: Common Objects in Context

    We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise ...

  21. Download Datasets

    Pew Research Center makes its data available to the public for secondary analysis after a period of time. See this post for more information on how to use our datasets and contact us at [email protected] with any questions. Find a dataset by research area: U.S. Politics & Policy. Journalism & Media. Internet & Tech. Science & Society.

  22. Datasets at your fingertips in Google Search

    Dataset Search, a dedicated search engine for datasets, powers this feature and indexes more than 45 million datasets from more than 13,000 websites. Datasets cover many disciplines and topics, including government, scientific, and commercial datasets. Dataset Search shows users essential metadata about datasets and previews of the data where ...

  23. arXiv Summarization Dataset Dataset

    arXiv Summarization Dataset. Introduced by Cohan et al. in A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. This is a dataset for evaluating summarisation methods for research papers. Source: A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. Homepage.

  24. Research Papers Dataset

    Comprehensive Research Papers Dataset - A Goldmine of Scholarly Knowledge.

  25. Integrated dataset enables genes-to-ecosystems research

    Integrated dataset enables genes-to-ecosystems research. ScienceDaily. Retrieved April 8, 2024 from www.sciencedaily.com/releases/2024/04/240408125933.htm

  26. [2404.04003] BuDDIE: A Business Document Dataset for Multi-task

    The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse ...

  27. The Role of Firm Dynamics in Aggregate Productivity, Job Flows, and

    Abstract: This paper examines the role of firm dynamics in aggregate total factor productivity, job flows, and wage inequality in Ecuador. Utilizing a comprehensive employer-employee dataset, the paper documents firm dynamics and job flow patterns that are consistent with the presence of market distortions.

  28. Use Cases for High Performance Research Desktops

    This paper defines the concept of High Performance Research Desktops and summarizes use cases from Indiana University, Lund University and Technical University of Denmark, which have implemented and operated such a system for more than 10 years. Based on these use cases, possible future directions are presented. PDF Abstract.