Special Issue on Bibliographic Data Sources
Ludo Waltman, Vincent Larivière; Special issue on bibliographic data sources. Quantitative Science Studies 2020; 1(1): 360–362. doi: https://doi.org/10.1162/qss_e_00026


Research in quantitative science studies relies on an increasingly broad range of data sources, providing data on scholarly publications, social media activity, peer review, research funding, the scholarly workforce, scientific prizes, and so on. However, one type of data source remains at the heart of research in quantitative science studies: bibliographic databases. These data sources have diversified considerably over the last decade, and several organizations now provide large-scale databases of metadata on scholarly publications. For this special issue of Quantitative Science Studies, we invited the providers of major bibliographic data sources to explain how their data can be used to support research in quantitative science studies.

This special issue comprises six papers. Three papers cover the most important commercial bibliographic data sources: Web of Science (Clarivate Analytics), Scopus (Elsevier), and Dimensions (Digital Science). Three other papers cover open data sources: Microsoft Academic, Crossref, and OpenCitations.[1] There are of course many other bibliographic data sources. However, for this special issue, we have chosen to consider only data sources that cover publications from all fields of science and from all parts of the world. Data sources that focus on specific scientific fields (e.g., PubMed from the US National Library of Medicine) or specific countries (e.g., national databases of scholarly publications) are therefore not included.

We hope that this special issue will help authors of submissions to Quantitative Science Studies to choose the most suitable bibliographic data source for their research. In the past, Web of Science and Scopus were often the only data sources between which researchers could choose, and researchers typically used the data source to which their institution happened to have a subscription. In recent years, however, the number of options has increased considerably. This special issue aims to characterize the most important data sources currently available and to show how they differ along various dimensions, for instance in the data they provide, their level of openness, and their support for making research reproducible. As editors of Quantitative Science Studies, we consider openness and reproducibility to be of major importance. Research published in Quantitative Science Studies is expected to be as reproducible as possible, and reproducibility can be promoted by making use of open data.

Below we provide a brief overview of some of the key differences between the bibliographic data sources considered in this special issue, focusing on three questions: What does the data source provide? How can the data be accessed? And how can the data be used?

Web of Science and Scopus are selective data sources. Both the Web of Science Core Collection and Scopus aim to cover only content that meets certain standards. For instance, these data sources try to ensure that they do not cover journals that adopt questionable publication practices (‘predatory journals’). In the case of Scopus, content selection is carried out by an external Content Selection and Advisory Board consisting of independent researchers.

Rather than being selective, the other data sources aim to be comprehensive. Microsoft Academic obtains most of its data by crawling the web, although it also makes use of data provided by publishers. Microsoft decides which content retrieved from the web is considered to be of a scholarly nature and deserves to be included in Microsoft Academic. Data curation is performed using artificial intelligence techniques, and human intervention is kept to a minimum. Crossref obtains its data from publishers that work with Crossref to obtain digital object identifiers (DOIs) for their content. Publishers decide whether they want to work with Crossref and what data they want to make available through Crossref. Dimensions builds on data from sources such as Crossref and PubMed and complements this with data received from publishers. Finally, OpenCitations also obtains its data from other sources, such as Crossref and PubMed Central; it does not receive data from publishers. In addition to other formats, OpenCitations makes its data available in RDF format as linked open data, using semantic web technologies.

Access to Web of Science and Scopus normally requires payment. Dimensions has a free and a paid version; the paid version offers additional features and provides access to data that is not accessible in the free version. For research purposes, it is possible to apply for no-cost access to the full version of Dimensions, including access through an API. Likewise, Elsevier’s International Center for the Study of Research plans to create a ‘virtual laboratory’ that provides free access to Scopus data for research purposes. While Dimensions data has already been made available for bibliometric research, the details of Elsevier’s plan are not yet clear.

Microsoft Academic, Crossref, and OpenCitations make all their data openly available. Microsoft Academic can be queried through an API. Up to a certain limit, this API can be used free of charge. Crossref and OpenCitations can also be queried through APIs. Their APIs can be freely used without any limit. Crossref also offers a paid service called Metadata Plus, which provides improved API access and the possibility to download a snapshot of the full Crossref database. A snapshot of the full Microsoft Academic database can be downloaded through Microsoft’s Azure platform. A small fee may be required to cover the costs associated with the use of this platform. OpenCitations also releases snapshots of its databases. These snapshots can be freely downloaded.
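To give a concrete sense of this kind of API access, the sketch below builds a DOI lookup URL for the Crossref REST API (whose public endpoint pattern is `https://api.crossref.org/works/{doi}`) and extracts a few core fields from a response object. The sample response is abridged and its values are illustrative, not real Crossref data; no network request is made here.

```python
# Minimal sketch of working with the Crossref REST API.
# The endpoint pattern is Crossref's public API; sample values are invented.
from urllib.parse import quote

CROSSREF_API = "https://api.crossref.org/works/"

def crossref_work_url(doi: str) -> str:
    """Build the Crossref REST API URL for a single DOI (DOI is percent-encoded)."""
    return CROSSREF_API + quote(doi, safe="")

def extract_core_fields(message: dict) -> dict:
    """Pull a few core bibliographic fields out of a Crossref 'message' object."""
    return {
        "doi": message.get("DOI"),
        "title": (message.get("title") or [None])[0],
        "container": (message.get("container-title") or [None])[0],
        "referenced_by": message.get("is-referenced-by-count"),
    }

# Abridged, illustrative example of the JSON 'message' a request might return.
sample_message = {
    "DOI": "10.1162/qss_e_00026",
    "title": ["Special issue on bibliographic data sources"],
    "container-title": ["Quantitative Science Studies"],
    "is-referenced-by-count": 42,  # invented value
}

print(crossref_work_url("10.1162/qss_e_00026"))
print(extract_core_fields(sample_message))
```

In a real script, the URL would be fetched with any HTTP client and the `message` member of the JSON response passed to the extraction function.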

All data sources allow their data to be used for research purposes. For some of the data sources, in particular Web of Science and Dimensions, the providers ask researchers that use their data to share their results and to report problems identified in the data.

Ideally, bibliographic data used in research projects is made openly available. The commercial data sources (Web of Science, Scopus, and Dimensions) do not allow their data to be made openly available. They impose restrictions on the sharing or redistribution of their data. Such restrictions make it more difficult to reproduce research based on data from these sources. Even if different research teams all have access to a specific data source, it may be hard for one team to reproduce the work of another team, as the data is a moving target because of continuous updates of the data source. To address this problem, data providers would need to provide access to archived time-stamped versions of their data.
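Short of providers archiving time-stamped versions, a research team can at least record a checksum and retrieval date for every data dump it uses, so that a later team can verify it is working with the same snapshot. A minimal sketch (the manifest fields and the stand-in payload are our own invention, not any provider's format):

```python
import hashlib
import json
from datetime import datetime, timezone

def snapshot_manifest(data: bytes, source: str) -> dict:
    """Record what was downloaded and when, so the exact snapshot can be cited."""
    return {
        "source": source,
        "sha256": hashlib.sha256(data).hexdigest(),
        "retrieved": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "size_bytes": len(data),
    }

dump = b'{"items": []}'  # stand-in for a downloaded database snapshot
manifest = snapshot_manifest(dump, "example-bibliographic-dump")
print(json.dumps(manifest, indent=2))
```

Publishing such a manifest alongside a paper does not make closed data shareable, but it does let others confirm whether their copy of the data matches the one the study used.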

Microsoft Academic, Crossref, and OpenCitations make their data openly available. Researchers are therefore allowed to share or redistribute their data, which makes research easier to reproduce. Microsoft Academic releases its data under an ODC-BY license. This license requires Microsoft to be acknowledged when Microsoft Academic data is used. Crossref considers its data to be facts. The data cannot be owned and is therefore made available without a license. Finally, OpenCitations makes its data available under a CC0 license, releasing the data into the public domain and minimizing restrictions on the use of the data.

We hope that this special issue will help researchers working with bibliographic data to better understand the characteristics of different data sources and to choose the most suitable data source for their research. There are advantages and disadvantages to each data source. While the selectivity of Web of Science and Scopus may for instance be beneficial for some studies, it may be problematic for others. Likewise, some studies may use an open data source because reproducibility is considered essential, while other studies may have to rely on a closed data source because they require data that is not openly available.

This special issue does not provide comparisons of the different data sources in terms of their coverage, completeness, and data quality. Such comparisons can best be performed by independent research groups rather than by the data providers themselves. Submissions of papers presenting comparisons of the data provided by different bibliographic data sources are very much welcomed at Quantitative Science Studies. We hope to publish such papers in the near future.

Finally, we would like to express our gratitude to Clarivate Analytics, Elsevier, Digital Science, Microsoft, Crossref, and OpenCitations for working together with us in making this special issue possible.

Ludo Waltman is deputy director of the Centre for Science and Technology Studies (CWTS) at Leiden University. CWTS has commercial relationships with Clarivate Analytics, Elsevier, and Digital Science. CWTS and Digital Science also work together as founding partners of the Research on Research Institute (RoRI; http://researchonresearch.org). Waltman works together with OpenCitations in his capacity as chair of the Advisory Board of the Research Centre for Open Scholarly Metadata.

Vincent Larivière is associate scientific director of the Observatoire des sciences et des technologies at Université du Québec à Montréal (OST-UQAM). OST-UQAM has a commercial relationship with Clarivate Analytics. Larivière also sits on the Advisory Board of the Research Centre for Open Scholarly Metadata.

[1] The team of Google Scholar was also invited to contribute to this special issue, but they did not respond to our invitation.

Online ISSN 2641-3337

Open Access Bibliographic Resources for Maintaining a Bibliographic Database of Research Organization

  • Published: 02 November 2023
  • Volume 50, pages 211–223 (2023)


N. A. Mazov & V. N. Gureyev


Drawing on external bibliographic systems is an inevitable stage in organizing in-house resources in research libraries and information services. On the one hand, data from external sources can be widely used when working with institutional repositories, e.g., when searching for data on the organization’s papers, creating alerts for new entries, or exporting data in appropriate formats to cut the time needed for bibliographic metadata processing. On the other hand, the more complete data from in-house databases can be used to correct data in external bibliographic systems, increasing the accuracy of the organization’s publication profiles and enhancing the visibility of bibliographic information for the scientific community. In the current era of open science, the use of open access bibliographic systems is becoming more promising, especially given the cost of, and other problems with, access to commercial bibliographic products. The paper draws on the set of papers of one organization of the Russian Academy of Sciences to demonstrate the capabilities of open access bibliographic resources for working with an institutional repository. We compare the previously used commercial systems Web of Science and Scopus with the open access Russian Science Citation Index, Dimensions, and Lens with regard to their coverage of the organization’s papers, as well as their appropriateness for library technological processes.
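A recurring step in this kind of workflow is merging records harvested from several external databases into a single deduplicated set before loading them into the repository. A minimal sketch using the DOI as the deduplication key (the record structure and sample data are invented for illustration):

```python
def merge_by_doi(*sources):
    """Merge record lists from several databases; the first source wins per DOI."""
    merged = {}
    for records in sources:
        for rec in records:
            doi = rec.get("doi", "").lower()  # DOIs are case-insensitive
            if doi and doi not in merged:
                merged[doi] = rec
    return list(merged.values())

# Invented sample records from two hypothetical harvests.
wos = [{"doi": "10.1000/a", "title": "Paper A", "src": "wos"}]
lens = [{"doi": "10.1000/A", "title": "Paper A", "src": "lens"},
        {"doi": "10.1000/b", "title": "Paper B", "src": "lens"}]

merged = merge_by_doi(wos, lens)
print([r["src"] for r in merged])
```

Real pipelines also need fuzzy title matching for records lacking DOIs, but DOI-first deduplication handles the bulk of the overlap between databases.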


Stukalova, A.A., Functional capabilities of higher education institutions’ repositories-members of the program “Priority-2030”, Tr. GPNTB Sib. Otd. Ross. Akad. Nauk , 2022, no. 2, pp. 36–47. https://doi.org/10.20913/2618-7515-2022-2-36-47

Mazov, N.A. and Gureyev, V.N., Publication databases of research organizations as a tool for information studies, Sci. Tech. Inf. Process. , 2022, vol. 49, no. 2, pp. 108–118. https://doi.org/10.3103/s0147688222020071


Zakharova, S.S. and Gureeva, Ju.A., Scientific publications: From a library card index to bibliographic profiles, Bibliosfera , 2017, no. 2, pp. 85–89. https://doi.org/10.20913/1815-3186-2017-2-85-89

Al’perin, B.L., Vedyagin, A.A., and Zibareva, I.V., SciAct—An information-analytical system of the Institute of Catalysis of Siberian Branch, Russian Academy of Sciences, to monitor and promote scientific activities, Tr. GPNTB Sib. Otd. Ross. Akad. Nauk, 2015, vol. 9, pp. 95–102.


Kovyazina, E.V., Digital archive of scientific publications: The stages of development, Nauchn. Tekh. Bibl. , 2014, no. 2, pp. 19–26.

Vlasova, S.A., Automated system for supporting a database of scientific works of academic institution’s employees, Inf. Resur. Ross. , 2020, no. 5, pp. 29–31.

Busygina, T.V., Balutkina, N.A., Lavrik, O.L., Mandrinina, L.A., and Elepov, B.S., Library database of the staff of institutions of the Siberian Branch of the Russian Academy of Sciences as a tool for scientometric studies, Inf. Byulleten RBA , 2013, vol. 66, pp. 192–200.

Bazhenov, S.R., Balutkina, N.A., and Stukalova, A.A., The concept of new information system of SB RAS State Public Scientific and Technological Library based on IRBIS64+, Nauchn. Tekh. Bibl. , 2023, no. 3, pp. 80–101. https://doi.org/10.33186/1027-3689-2023-3-80-101

Bazhenov, S.R., Danilin, M.V., and Rogoznikova, O.A., Integrating database of organization publications with scientific citation indices with IRBIS64 ALIS functionality, Biblioteki i informatsionnye resursy v sovremennom mire nauki, kul’tury, obrazovaniya i biznesa. Trudy 22-i Mezhdunarodnoi konferentsii Krym-2015 (Libraries and Information Resources in Modern World of Science, Culture, Education, and Business: Proc. 22nd Int. Conf.), Sudak, Russia, 2015, Moscow: Izd-vo GPNTB Rossii, 2015, pp. 1–4.

Lappalainen, Yr. and Narayanan, N., Harvesting publication data to the institutional repository from Scopus, Web of Science, Dimensions and Unpaywall using a custom R script, J. Acad. Librarianship , 2023, vol. 49, no. 1, p. 102653. https://doi.org/10.1016/j.acalib.2022.102653

Zhang, H., Boock, M., and Wirth, A.A., It takes more than a mandate: Factors that contribute to increased rates of article deposit to an institutional repository, J. Librarianship Scholarly Commun. , 2015, vol. 3, no. 1. https://doi.org/10.7710/2162-3309.1208

Bull, J. and Schultz, T.A., Harvesting the academic landscape: Streamlining the ingestion of professional scholarship metadata into the institutional repository, J. Librarianship Scholarly Commun. , 2018, vol. 6, no. 1. https://doi.org/10.7710/2162-3309.2201

Li, Yu., Harvesting and repurposing metadata from Web of Science to an institutional repository using web services, D-Lib Mag. , 2016, vol. 22, no. 3/4. https://doi.org/10.1045/march2016-li

Peroni, S. and Shotton, D., OpenCitations, an infrastructure organization for open scholarship, Quant. Sci. Stud. , 2020, vol. 1, no. 1, pp. 428–444. https://doi.org/10.1162/qss_a_00023

Gureev, V.N. and Mazov, N.A., Increased role of open bibliographic data in the context of restricted access to proprietary information systems, Upr. Naukoi: Teoriya Prakt. , 2023, vol. 5, no. 2, pp. 49–76. https://doi.org/10.19181/smtp.2023.5.2.4

Herzog, C., Hook, D., and Konkiel, S., Dimensions: Bringing down barriers between scientometricians and data, Quant. Sci. Stud. , 2020, vol. 1, no. 1, pp. 387–395. https://doi.org/10.1162/qss_a_00020

Hook, D.W., Porter, S.J., and Herzog, C., Dimensions: Building context for search and evaluation, Front. Res. Metrics Analytics , 2018, vol. 3, p. 23. https://doi.org/10.3389/frma.2018.00023

Orduña-Malea, E. and Delgado-López-Cózar, E., Dimensions: Redescubriendo el ecosistema de la información científica, Profesional Información , 2018, vol. 27, no. 2, p. 420. https://doi.org/10.3145/epi.2018.mar.21

Thelwall, M., Dimensions: A competitor to Scopus and the Web of Science?, J. Informetrics , 2018, vol. 12, no. 2, pp. 430–435. https://doi.org/10.1016/j.joi.2018.03.006

Visser, M., van Eck, N.J., and Waltman, L., Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic, Quant. Sci. Stud., 2021, vol. 2, no. 1, pp. 20–41. https://doi.org/10.1162/qss_a_00112

Jefferson, O.A., Koellhofer, D., Warren, B., and Jefferson, R., The Lens MetaRecord and LensID: An open identifier system for aggregated metadata and versioning of knowledge artefacts, 2019. https://doi.org/10.31229/osf.io/t56yh

Penfold, R., Using the Lens database for staff publications, J. Med. Libr. Assoc. , 2020, vol. 108, no. 2, pp. 341–344. https://doi.org/10.5195/jmla.2020.918

Research Metrics Guidebook , Elsevier, 2018. https://www.elsevier.com/research-intelligence/resource-library/research-metrics-guidebook.

Vera-Baceta, M., Thelwall, M., and Kousha, K., Web of Science and Scopus language coverage, Scientometrics , 2019, vol. 121, no. 3, pp. 1803–1813. https://doi.org/10.1007/s11192-019-03264-z

Mongeon, P. and Paul-Hus, A., The journal coverage of Web of Science and Scopus: A comparative analysis, Scientometrics , 2016, vol. 106, no. 1, pp. 213–228.

Martín-Martín, A., Orduna-Malea, E., and Delgado López-Cózar, E., Coverage of highly-cited documents in Google Scholar, Web Sci., Scopus: A multidisciplinary comparison, Scientometrics , 2018, vol. 116, no. 3, pp. 2175–2188. https://doi.org/10.1007/s11192-018-2820-9

Mazov, N.A. and Gureev, V.N., IPGGTR: Database of treatises of the staff of the Trofimuk Institute of Petroleum Geology and Geophysics, Siberian Branch of the Russian Academy of Sciences (referative-fulltext bibliography), RF Certificate of State Registration of Software 2020621025, 2020.

Mazov, N.A. and Gureev, V.N., IPGGAU: Author’s identification profiles, RF Certificate of State Registration of Software 2020621128, 2020.

Mazov, N.A. and Gureev, V.N., Bibliographic database of the staff of an organization: Purposes, functions, sphere of use in scientometrics, Vestn. Dal’nevostochnoi Gos. Nauchn.i Bibl. , 2016, no. 2, pp. 84–87.

Gureev, V.N. and Mazov, N.A., Editing organization profiles in SCOPUS and the RSCI: Facilities comparison, Sci. Tech. Inf. Process. , 2016, vol. 43, no. 1, pp. 66–77. https://doi.org/10.3103/S0147688216010135

Web of Science Release Notes, April 13 2023: Automatic updates to claimed profiles, Clarivate, 2023. https://clarivate.com/webofsciencegroup/release-notes/wos/web-of-science-release-notes-april-13-2023-2/. Cited June 13, 2023.

Haak, L.L., Fenner, M., Paglione, L., Pentz, E., and Ratner, H., ORCID: A system to uniquely identify researchers, Learned Publishing , 2012, vol. 25, no. 4, pp. 259–264. https://doi.org/10.1087/20120404

Lutai, A.V. and Lyubushko, E.E., Comparison of metadata quality in databases CrossRef, Lens, OpenAlex, Scopus, Semantic Scholar, Web of Science Core Collection, RFFI, 2023. https://podpiska.rfbr.ru/storage/reports2021/2022_meta_quality.html. Cited June 13, 2023.

Sterligov, I.A., The Russian conference outbreak: Description, causes and possible policy measures, Upr. Naukoi: Teoriya Prakt. , 2021, vol. 3, no. 2, pp. 222–251. https://doi.org/10.19181/smtp.2021.3.2.10

Kosyakov, D.V., Anatomy of the abnormal growth in the number of Russian publications in conference proceedings in Scopus, Sci. Tech. Inf. Process. , 2023, vol. 50, no. 2, pp. 96–108. https://doi.org/10.3103/S0147688223020028

Singh, V.K., Singh, P., Karmakar, M., Leta, J., and Mayr, P., The journal coverage of Web of Science, Scopus and Dimensions: A comparative analysis, Scientometrics , 2021, vol. 126, no. 6, pp. 5113–5142. https://doi.org/10.1007/s11192-021-03948-5


The study was performed under projects of the State Public Scientific Technological Library of the Siberian Branch of the Russian Academy of Sciences (122040600059-7) and the Trofimuk Institute of Petroleum Geology and Geophysics of the Siberian Branch of the Russian Academy of Sciences (FWZZ-2022-0028).

Author information

Authors and affiliations

State Public Scientific Technological Library, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia

N. A. Mazov & V. N. Gureyev

Trofimuk Institute of Petroleum Geology and Geophysics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia


Corresponding authors

Correspondence to N. A. Mazov or V. N. Gureyev .

Ethics declarations

The authors declare that they have no conflict of interest.

Additional information

Translated by L. Solovyova

About this article

Mazov, N.A., Gureyev, V.N. Open Access Bibliographic Resources for Maintaining a Bibliographic Database of Research Organization. Sci. Tech. Inf. Proc. 50 , 211–223 (2023). https://doi.org/10.3103/S0147688223030115


Received : 13 July 2023

Published : 02 November 2023

Issue Date : September 2023

DOI : https://doi.org/10.3103/S0147688223030115


  • bibliographic database
  • institutional repository
  • scholarly output
  • scientific communication
  • Web of Science


Original Research Article

Comparative Analysis of the Bibliographic Data Sources Dimensions and Scopus: An Approach at the Country and Institutional Levels


  • 1 Departamento de Información y Comunicación, Universidad de Extremadura, Badajoz, Spain
  • 2 Instituto de Políticas y Bienes Públicos (IPP), Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
  • 3 Facultad de Ingeniería, Universidad Panamericana, Zapopan, Mexico
  • 4 SCImago Research Group, Granada, Spain

This paper presents a large-scale document-level comparison of two major bibliographic data sources: Scopus and Dimensions. The focus is on the differences in their coverage of documents at two levels of aggregation: by country and by institution. The main goal is to analyze whether Dimensions offers as good new opportunities for bibliometric analysis at the country and institutional levels as it does at the global level. Differences in the completeness and accuracy of citation links are also studied. The results allow a profile of Dimensions to be drawn in terms of its coverage by country and institution. Dimensions’ coverage is more than 25% greater than that of Scopus, which is consistent with previous studies. However, the main finding of this study is the lack of affiliation data in a large fraction of Dimensions documents. We found that close to half of all documents in Dimensions are not associated with any country of affiliation, while the proportion of documents without this data in Scopus is much lower. This situation mainly limits what Dimensions can offer as an instrument for carrying out bibliometric analyses at the country and institutional levels. Both of these aspects are highly pragmatic considerations for information retrieval and the design of policies for the use of scientific databases in research evaluation.

Introduction

As new multidisciplinary scientific bibliographic data sources come onto the market, there is growing interest in comparative studies of the coverage they offer. Scholarly databases have begun to play an increasingly important role in the academic ecosystem. There are several reasons for this, including burgeoning competitiveness in research, greater availability of data, and the need to justify the use of public funds. This context has driven a diversification of publication and citation data use cases, as well as research use cases that have not been met by existing scholarly databases (Hook et al., 2018). Since bibliometric methods are used in multiple areas for a variety of purposes, especially research evaluation, the results they provide may vary depending on the representativeness of the database used (Mongeon and Paul-Hus, 2016; Huang et al., 2020). The new data sources can offer several benefits for research evaluators: they may have better coverage or capabilities that make them a better fit for a given impact evaluation task, and they can reduce the cost of evaluations and make informal self-evaluations of impact possible for researchers who would not pay to access that kind of data (Thelwall, 2018). Given the potential value of these data sources for research evaluation, it is important to assess their key properties to better understand their strengths and weaknesses and, in particular, to decide whether their data is sufficient in volume, completeness, and accuracy to be useful for scientists, policymakers, and other stakeholders.

Traditionally, the only homogeneous record of published research available when funders and governments sought additional information to help them make evidence-driven decisions was the Web of Science (WoS). The appearance of the Scopus database (Baas et al., 2020) and Google Scholar in 2004 as “competitors” to WoS, providing metadata on scientific documents and on citation links between these documents, led to an immense quantity of studies focused on comparative analyses of these new bibliographic sources, the basic intention being to look for novel bibliometric opportunities that these tools might bring to the academic community and policymakers.

At that time, it appeared that Scopus and WoS had entered into head-on competition (Pickering, 2004), and any comparison of them called for the utmost care and methodological consistency. One large-scale comparison at the journal level was done using Ulrich’s directory as the gold standard by Moya-Anegón et al. (2007). The results outlined a profile of Scopus in terms of its coverage by geographic and thematic areas and the significance of peer review in the publications. Both of these aspects are of highly pragmatic significance for policymakers and the users of scientific databases. Years later, Mongeon and Paul-Hus (2016) revisited the issue and compared the coverage of WoS and Scopus to examine whether preexisting biases (such as language, geography, and theme) were still to be found in Scopus. They concluded that some biases remained in both databases and stated that this should be taken into account in assessing scientific activities. For example, most languages and countries are underrepresented, which contributes to the known lack of visibility of research done in some countries. Hence, when using bibliometric methods for research evaluation, it is important to understand what each tool has to offer and what its limitations are, and to choose the right tool for the task at hand before drawing conclusions for research evaluation purposes (Mongeon and Paul-Hus, 2016).

Google Scholar appeared to be an alternative to WoS and Scopus, but its suitability for research evaluation and other bibliometric analyses was strongly called into question. For a comprehensive review of this data source in research evaluation, we refer the reader to Martín-Martín et al. (2018a) and Martín-Martín et al. (2020).

At the beginning of 2018, Digital Science launched Dimensions, a new integrated database covering the entire research process: from funding to research, from publishing results through attention, both scholarly and beyond, to commercial applications and policymaking, consistently matched in multiple dimensions (Adams et al., 2018). This new scholarly data source was created to overcome significant constraints of the existing databases. It sought to understand the research landscape through the lens of publication and citation data and to help the academic community formulate and develop its own metrics that can tell the best stories and give the best context to a line of research (Bode et al., 2019).

Previous studies have compared data quality between Dimensions and other data sources in order to evaluate its reliability and validity (Bornmann, 2018; Martín-Martín et al., 2018; Thelwall, 2018; Visser et al., 2020). Most of them have focused on publication and citation data in specific thematic fields, but few have taken a global perspective. The findings of these studies in the field of Food Science show Dimensions to be a competitor to WoS and Scopus in making nonevaluative citation analyses and in supporting some types of formal research evaluations (Thelwall, 2018). Similarly, Martín-Martín et al. (2018b) conclude that Dimensions is a clear alternative for carrying out citation studies, being capable of rivalling Scopus, but the reliability and validity of its field classification scheme were questioned. This scheme is not based on journal classification systems, as it is in WoS or Scopus, but on machine learning. This feature makes it desirable to undertake large-scale investigations in future studies to ensure that metrics such as the field-normalized citation scores presented in Dimensions, calculated on the basis of its field classification scheme, are indeed reliable (Bornmann, 2018).
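Field-normalized citation scores of the kind mentioned here are typically computed by dividing a paper's citation count by the mean citation count of the papers in its field (in practice, also matched on publication year and document type). A toy sketch with invented counts, ignoring the year and type dimensions:

```python
from collections import defaultdict

def field_normalized_scores(papers):
    """Score = citations / mean citations of the paper's field (toy version)."""
    totals, counts = defaultdict(int), defaultdict(int)
    for p in papers:
        totals[p["field"]] += p["citations"]
        counts[p["field"]] += 1
    means = {f: totals[f] / counts[f] for f in totals}
    return [p["citations"] / means[p["field"]] for p in papers]

# Invented counts: a score of 1.0 means "cited exactly at the field average".
papers = [
    {"field": "physics", "citations": 10},
    {"field": "physics", "citations": 30},
    {"field": "history", "citations": 2},
    {"field": "history", "citations": 2},
]
print(field_normalized_scores(papers))
```

The point made in the paragraph above follows directly from this construction: if the field assignment (the `"field"` key here) comes from an unreliable classifier, every normalized score inherits that unreliability.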

A large-scale comparison of five multidisciplinary bibliographic data sources, including Dimensions and Scopus, was carried out recently by Visser et al. (2020). They used Scopus as the baseline for comparing and analyzing not just the coverage of documents over time by document type and discipline but also the completeness and accuracy of the citation links. The results of this comparison shed light on the types of documents covered by Dimensions but not by Scopus: these are basically meeting abstracts and other short items that do not seem to make a very substantial contribution to science. The authors concluded that differences between data sources should be assessed in accordance with the purpose for which the data sources are used. For example, it may be desirable to work within a more restricted universe of documents, such as a specific thematic field or a specific level of aggregation. This is the case with the study of Huang et al. (2020), which compared WoS, Scopus, and Microsoft Academic and their implications for the robustness of university rankings.

The present communication extends previous comparisons with Scopus by expanding the study to distinct levels of aggregation (by country and by institution) across a larger selection of characteristics and measures. A particular aim is to inquire closely into just how balanced Dimensions’ coverage is compared with that of the Scopus database.

Objectives/Research Questions

The goal of this study was to compare Dimensions’ coverage with that of Scopus at the geographic and institutional levels. The following research questions were posed:

(1) How comprehensive is Dimensions’ coverage compared with that of Scopus in terms of documents?

(2) Are the distributions of publications by country and by institution in Dimensions comparable with those in Scopus?

(3) Are Dimensions’ citation counts by country and by institution interchangeable with those of Scopus in the sense of their being strongly correlated?

(4) Is Dimensions a reliable new bibliometric data source at the country and institutional levels?

Material and Methods

Scopus is a bibliographic database created by Elsevier in 2004 ( Hane, 2004 ; Pickering, 2004 ) that has been extensively characterized ( Moya-Anegón et al., 2007 ; Archambault et al., 2009 ; Leydesdorff et al., 2010 ) and widely used in scientometric studies ( Gorraiz et al., 2011 ; Jacsó, 2011 ; Guerrero-Bote and Moya-Anegón, 2015 ; Moya-Anegón et al., 2018 ). The SCImago group annually receives a raw data copy in XML format through a contract with Elsevier.

In 2018, Digital Science published the Dimensions database, covering scientific publications and citations, grants, patents, and clinical trials ( Hook et al., 2018 ; Herzog et al., 2020 ). Since then, several characterizations of it have been published ( Bornmann, 2018 ; Harzing, 2019 ; Visser et al., 2020 ). In the present study, we consider only the scientific publications.

Author affiliations are a frequent source of problems for bibliometric studies, because bibliographic databases usually do not include standardized names of institutions. One of the improvements that Dimensions incorporates is the mapping of the author affiliations in documents to an entity list of research organizations: the GRID (Global Research Identifier Database) system ( Hook et al., 2018 ). This mapping is not an addition to but a replacement for the original author affiliations. If the mapping is rigorous and complete, it is an important improvement. But if the list of organizations or the mapping is incomplete, this could be a major problem: there would be loose documents that cannot be associated with any institution or country, leaving the recorded output of the affected institutions and countries incomplete.

The SCImago group has been able to download a copy of Dimensions in JSON format through an agreement with Digital Science.

From the Scopus and Dimensions data of April 2020, the SCImago group created a relational database for internal use that allows large-scale computations that would otherwise be unfeasible.

The analysis pursued in this study required a matching procedure between the Dimensions and Scopus databases. To this end, we applied the method developed by the SCImago group to match PATSTAT NPL references with Scopus documents ( Guerrero-Bote et al., 2019 ). This method has two phases: a broad generation of candidate pairs, followed by a second phase of pair validation.

In this case, a modification similar to that of Visser et al. (2020) was made: not all the candidate pairs were generated at once. Instead, each time a set of candidate pairs was available, a validation procedure was applied, accepting as valid the matches that exceeded a certain threshold. This reduced the combinatorial variability of subsequent generations of candidates. Pairs that did not exceed the threshold were not discarded but saved, in case they remained unpaired at the end and were the pairs with the greatest similarity.

In more detail, our procedure began by normalizing the fields to facilitate pairing. Unlike Visser et al. (2020) , however, we did not keep only the numerical values of the volume, issue, or pages, because those fields do not always contain numerical values. This is the case with journals such as PLOS One or the Frontiers journals, for instance.
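As a minimal sketch of this kind of normalization (the function names and exact cleaning rules here are our own assumptions, not the authors' implementation), one might keep both an alphanumeric and a numeric form of each field, so that non-numeric page identifiers such as those used by PLOS One are not lost:

```python
import re

def normalize_field(value):
    """Lowercase, trim, and collapse whitespace in a bibliographic field.

    Unlike a purely numeric normalization, alphanumeric values such as
    electronic article numbers are preserved for comparison.
    """
    if value is None:
        return ""
    return re.sub(r"\s+", " ", value.strip().lower())

def numeric_part(value):
    """Extract the first run of digits, if any, for numeric comparison."""
    match = re.search(r"\d+", normalize_field(value))
    return match.group(0) if match else None
```

Both forms can then be compared during validation: numerically when both records carry digits, and alphanumerically otherwise.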

Then we started to generate candidate pairs in phases, each centered on one of the following conditions:

(1) Same year of publication, title with a high degree of similarity, and the same DOI.

(2) Same year of publication, title with a high degree of similarity, and the same authors.

(3) Same year of publication, title, and first author.

As can be seen, some conditions subsume those of previous phases. It should be borne in mind, however, that each candidate-pair generation phase is followed by a validation phase. The first phases are quite specific: they generate a relatively small number of candidate pairs, most of which are accepted and come to constitute the majority of the definitively matched pairs. In this way, the lists of documents waiting to be matched are reduced, allowing broader searches in the following phases without greatly increasing the computational cost. Logically, the rate of success among the candidate pairs decreases from phase to phase.

For validation, all of the reference data were compared: DOI, year of publication, authors, title, publication, volume, issue, and pages. The last three were compared both numerically and alphanumerically. The comparison of each field produced a numerical score corresponding to the number of matching characters, with some adjustments based on the Levenshtein 1 distance, as in Guerrero-Bote et al. (2019) and Visser et al. (2020) . Once the coincidence score had been calculated for each field, we took the product of the field scores as the total score. The individual field scores never take a zero value, because that would force the total score to zero; in case of noncoincidence, the field score is one if the field is considered nonessential, 0.75 if it is considered important, and so on. (In either database, some fields of some records may be empty.) With this procedure, coincidence in several fields increases the total score geometrically rather than arithmetically.
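The scoring just described can be sketched roughly as follows. This is a hedged illustration only: the function names are ours, and the per-field miss scores follow only the examples given in the text (1.0 for a nonessential field, 0.75 for an important one), while the character-matching adjustment follows footnote 1 (length of the longer string minus 1.3 times the Levenshtein distance):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def field_score(a, b, miss_score=1.0):
    """Score one field as in footnote 1: the length of the longer string
    minus 1.3 times the Levenshtein distance.  A non-positive result
    falls back to miss_score so a single mismatch never zeroes the
    product over all fields."""
    score = max(len(a), len(b)) - 1.3 * levenshtein(a, b)
    return score if score > 0 else miss_score

def total_score(record_a, record_b, fields):
    """Multiply per-field scores, so agreement across several fields
    grows the total geometrically rather than arithmetically."""
    total = 1.0
    for name, miss in fields:
        total *= field_score(record_a.get(name, ""),
                             record_b.get(name, ""), miss)
    return total
```

A pair agreeing on several long fields (title, authors, DOI) thus accumulates a very large product, while disagreement on an important field merely damps the total by a factor of 0.75.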

Once the candidate pairs of a phase had been validated, we took as matched those pairs with a total score greater than 1,000 in which neither the Scopus nor the Dimensions record scored higher in any other pair. The threshold of 1,000 was set after sampling and verifying that, under these conditions, no mismatched pairs were found.

Once the five phases had been carried out, a repechage operation was applied to the rejected candidate pairs. It accepted pairs in which both components obtained lower scores in all their other pairs, down to a total score of 50. Also accepted were pairs whose score was greater than 300 but in which one of the components had another pair with exactly the same score; this was done because both databases contain some duplicated records.

The Results of Matching

The general results are given in Table 1 . Even though our study covers more years than that of Visser et al. (2020) , it yields fewer matched documents for the period 2008–2017.


TABLE 1 . Overall results of the linking procedure.

The number of matched pairs grows from year to year, and in Scopus the percentage of matches also grows. This is not the case for Dimensions, however, owing to that database’s great year-on-year growth.

In summary, Dimensions’ coverage is more than 25% greater than Scopus’s, although there is a significant overlap in coverage between the two data sources. Almost three-quarters of the Scopus documents and more than half of the Dimensions documents match. The question now is whether these percentage differences are maintained at lower levels of aggregation (countries and institutions).

The percentages of matching in Scopus by document type are presented in Table 2 . The greatest percentages correspond to articles, reviews, letters, conference proceedings, errata, editorials, book chapters, and short surveys. (Some document types are not listed because of their low output.) For the primary output types (articles, reviews, conference proceedings, and short surveys), matching is over 75%.


TABLE 2 . Scopus matching percentages by most frequent document type.

Table 3 presents the same information, but for Dimensions. Articles and conference proceedings are the most matched types.


TABLE 3 . Dimensions matching percentages by document type.

Figure 1 shows that the total and matched output distributed by country is systematically greater in Scopus than in Dimensions. The solid line represents the ideal positions of the countries if they had the same output in Scopus and Dimensions. It is noticeable at a glance that most countries appear above the solid line in the graph, indicating that the Scopus output by country tends to be greater than the Dimensions output.


FIGURE 1 . Scatter plot of the total and matched Dimensions/Scopus output by country.

Figure 2 shows the relationship between Dimensions and Scopus in output by institution. The solid line represents the positions the institutions would occupy if they had the same output in both databases. It is again noticeable at a glance that most institutions lie above the solid line, indicating that most institutions have more output in Scopus than in Dimensions.


FIGURE 2 . Scatter plot of the total and matched Dimensions/Scopus output by institution.

Figure 3 allows one to analyze the evolution of the average number of countries per document, i.e., the countries whose institutions appear in the authors’ affiliations, in each database. What stands out most in this graph is the difference between the two databases: the two evolutions should be very similar, and yet they are not. These differences remain stable over time and need to be confirmed with data on the evolution of the number of institutions appearing in the authors’ affiliations.


FIGURE 3 . Evolution of the average number of countries per document in Scopus and Dimensions in total and in the matched subsets.

Figure 4 confirms this from the institutional perspective, showing the evolution of the average number of institutions per document, i.e., the average number of institutional affiliations associated with the items in the four subsets of the two data sources. As can be seen, the two graphical representations are consistent with each other.


FIGURE 4 . Evolution of the average number of institutions per document in Scopus and Dimensions in total and in the matched subsets.

In order to check the influence of documents without a country on the averages presented in Figures 3 , 4 , Figure 5 shows the evolution of the percentage of items in the four subsets of documents that record no country for some reason. As can be seen in the figure, these percentages trend downwards over the years in the different subsets, and the order of the curves is the reverse of that in Figures 3 , 4 , which is consistent from the perspective of data interpretation.


FIGURE 5 . Evolution of the annual percentage of items without country in the four subsets of documents belonging to Dimensions and Scopus.

In general terms, one can say that the information about institutional affiliations that allows documents to be broken down by country and institution is more complete in Scopus than in Dimensions. The same holds when the matched documents are analyzed. In terms of temporal evolution, despite the positive trend in the number of countries and institutions associated with the items in both databases, the difference between the two sources tends to be maintained over time.

A more detailed characterization of the Dimensions documents for which no country affiliation data are available is provided in Table 4 . The distribution shows that several distinct document types are affected by this situation.


TABLE 4 . Distribution of document types where no country affiliation data is available.

Turning to the citation data ( Figure 6 ), it is easy to see that, both for total documents and for matched documents, the volume of citations in Scopus is in all cases greater than in Dimensions, as noted previously by Visser et al. (2020) . The same holds when the problem is analyzed from the point of view of the citing year ( Figure 7 ).


FIGURE 6 . Citations by cited year.


FIGURE 7 . Citations by citing year.

When the citations of the documents in the two databases are distributed by country, one observes that all countries, regardless of the size of their output, accumulate more citations in Scopus than in Dimensions. Figure 8 shows that both total citations and the citations of matched documents are consistently greater in Scopus than in Dimensions for all countries. The same holds for the distribution of citations by institution over the observation period, with Scopus greater than Dimensions in more than 97% of cases. Figure 9 shows very clearly that only a small group of institutions lies below the straight line: the 2.5% of cases with more citations in Dimensions than in Scopus.


FIGURE 8 . Relationship between total citations and matched documents by country.


FIGURE 9 . Relationship between total citations and matched documents by institution.

Our starting hypothesis was that the difference in overall coverage between the two databases should be broadly similar when the total set of documents is fragmented into smaller levels of aggregation. From our perspective, it is important that overall coverage levels be maintained on average when the source is split into smaller groupings (countries or institutions, for example) in order to guarantee the bibliometric relevance of the source. For this reason, we continued along the path begun by other researchers, trying to deepen the comparative analysis of the coverage of the two sources.

Our first conclusion is that, for reasons that have to do with the data structures themselves, the two sources show notable differences in coverage at the level of countries and institutions, with a tendency toward greater coverage in Scopus than in Dimensions at those levels. This is so even though, given the overall differences in coverage between the two sources, the opposite was to be expected.

Second, despite the fact that Dimensions has a larger raw coverage of documents than Scopus, close to half of the documents in Dimensions lack country or institutional affiliation information. As a result, when documents are aggregated by country or by institutional affiliation, Scopus systematically provides more documents and citations than Dimensions. In 2014, Dimensions started working on the problem of creating an entity list of organizations, to provide a consistent view of an organization within one content source but also across the various types of content. This became the GRID (Global Research Identifier Database) system, and at that time a set of policies about how to handle the definition of a research entity was developed. 2 At the time of writing, GRID contains 98,332 unique organizations, whose data have been curated and to each of which a persistent identifier has been assigned. This set of institutions represents international coverage of the world’s leading research organizations, indexing 92% of funding allocated globally. It is clear, however, that the repeated differences between Scopus and Dimensions in output and citations are related to the fact that Dimensions’ method of linking institutional affiliations to GRID, while a promising idea, is still a work in progress. At present, it limits the linkage of items to countries and institutions, which in turn limits the possibilities the two sources can offer as instruments for carrying out bibliometric analyses.

As Bode et al. (2019) point out in Dimensions’ Guide v.6 (p. 3), “Linked and integrated data from multiple sources are core to Dimensions. These matchings are data driven, then, the content and enrichment pipeline is as automated as possible. However, while an automated approach allows us to offer a more open, free approach it also results in some data issues, which we will continue to have to work on and improve.” Such improvement is advisable for both the publications and the citation links because, as Visser et al. (2020) noted, “Dimensions incorrectly has not identified citation links. Hence, this data source fails to identify a substantial number of citation links” (p. 20). Dimensions also has the limitation that it does not provide data for references that have not been matched with a cited document (p. 23).

The results described should help fill the gap in exploring differences between Scopus and Dimensions at the country and institutional levels. The missing affiliation data whose evolution is shown in Figure 5 appear to be the main cause of most of the other results in this manuscript. These findings allow a profile of Dimensions to be outlined in terms of its coverage at different levels of aggregation in comparison with Scopus. Both aspects are highly pragmatic considerations for bibliometric researchers and practitioners, and in particular for policymakers who rely on such databases as a principal criterion for research assessment (hiring, promotion, and funding).

At the country level, this study has shown that not all articles have complete address data. Even though the number of documents with no country information in the address data shows a decreasing trend over time, in 2018 more than 40% of documents in Dimensions still lacked a country. Given the size of the data source and its goals in the scientific market, missing country information in the affiliation data has important implications at all levels of aggregation and analysis. Thus, Dimensions does not currently appear to be a reliable data source for defining and evaluating output at the country level.

At the institutional level, according to Huang et al. (2020) , “Universities are increasingly evaluated on the basis of their outputs which are often converted to rankings with substantial implications for recruitment, income, and perceived prestige.” The present study has shown that Dimensions does not record all the institutional affiliations of the authors, which has implications for metrics and rankings at the institutional scale. In this case, it seems advisable to integrate diverse data sources into any institutional evaluation framework ( Huang et al., 2020 ).

We have not been comparing document types as such, but presenting results derived from the matching procedure. As in Visser et al. (2020) , we found many articles in Dimensions for which our procedure found no matching document. This appears to be because any document published in a journal is classified as an article in Dimensions.

Finally, as in previous studies examining data sources’ coverage ( Moya-Anegón et al., 2007 ), and with possible future bibliometric studies in mind, the above considerations form an important part of the context of scientific output and evaluation. They should be taken into account to avoid bias when comparing research results across domains or at different aggregation levels. All data sources suffer from problems of incompleteness and inaccuracy of citation links ( Visser et al., 2020 , p. 23), and GRID is not yet perfect and never will be ( Bode et al., 2019 , p. 6). But we are confident that studies like the present one will help to improve this tool and its data in the near future.

Data Availability Statement

The datasets presented in this article are not readily available because the SCImago group receives the Scopus raw data in XML format through a contract with Elsevier and downloads a copy of Dimensions in JSON format through an agreement with Digital Science. We are not allowed to redistribute the Scopus and Dimensions data used in this paper. Requests to access the datasets should be directed to [email protected] .

Author Contributions

VG-B: conception, data curation, and writing. AM: data curation. ZC-R: conception, data analysis, and writing. FM-A: conception, data analysis, and writing. All authors read and approved the final manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The SCImago group annually receives a raw data copy in XML format through a contract with Elsevier and has been able to download a copy of Dimensions in JSON format through an agreement with Digital Science. We are not allowed to redistribute the Scopus and Dimensions data used in this paper. We thank the Dimensions and Scopus development teams for providing us with access to the information necessary to carry out this analysis.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frma.2020.593494/full#supplementary-material .

1 In our case, we subtract the Levenshtein distance (multiplied by 1.3) from the number of characters in the largest of the fields to be compared, thus obtaining a number indicative of the number of matching characters between the fields (with a 30% penalty). Recall that the Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.

2 https://www.grid.ac/pages/policies

Adams, J., Jones, P., Porter, S., Szomszor, M., Draux, H., and Osipov, I. (2018). Dimensions–A collaborative approach to enhancing research discovery. Technical report. Digital Science. doi:10.6084/m9.figshare.5783160.v1


Archambault, É., Campbell, D., Gingras, Y., and Larivière, V. (2009). Comparing bibliometric statistics obtained from the Web of Science and Scopus. J. Am. Soc. Inf. Sci. 60 (7), 1320–1326. doi:10.1002/asi.21062

Baas, J., Schotten, M., Plume, A., Côté, G., and Karimi, R. (2020). Scopus as a curated, high-quality bibliometric data source for academic research in quantitative science studies. Quant. Sci. Stud. 1 (1), 377–386. doi:10.1162/qss_a_00019

Bode, C., Herzog, C., Hook, D., and McGrath, R. (2019). A guide to the dimensions data approach. Technical report. Digital Science. doi:10.6084/m9.figshare.5783094.v5

Bornmann, L. (2018). Field classification of publications in Dimensions: a first case study testing its reliability and validity. Scientometrics 117 (1), 637–640. doi:10.1007/s11192-018-2855-y

Gorraiz, J., Gumpenberger, C., and Wieland, M. (2011). Galton 2011 revisited: a bibliometric journey in the footprints of a universal genius. Scientometrics , 88 (2), 627–652. doi:10.1007/s11192-011-0393-y

Guerrero-Bote, V. P., and Moya-Anegón, F. (2015). Analysis of scientific production in food science from 2003 to 2013. J. Food Sci. 80 (12), R2619–R2626. doi:10.1111/1750-3841.13108


Guerrero-Bote, V. P., Sánchez-Jiménez, R., and Moya-Anegón, F. (2019). The citation from patents to scientific output revisited: a new approach to Patstat/Scopus matching. El Prof. Inf. 28 (4), e280401. doi:10.3145/epi.2019.jul.01

Hane, P. (2004). Elsevier announces Scopus service. Information Today . Available at: http://newsbreaks.infotoday.com/nbreader.asp?ArticleID=16494 (Accessed June 9, 2011).


Harzing, A.-W. (2019). Two new kids on the block: how do crossref and dimensions compare with Google scholar, Microsoft academic, Scopus and the Web of Science? Scientometrics , 120 (1), 341–349. doi:10.1007/s11192-019-03114-y

Herzog, C., Hook, D., and Konkiel, S. (2020). Dimensions: bringing down barriers between scientometricians and data. Quant. Sci Stud. 1 (1), 387–395. doi:10.1162/qss_a_00020

Hook, D. W., Porter, S. J., and Herzog, C. (2018). Dimensions: building context for search and evaluation. Front. Res. Metr. Anal. 3, 23. doi:10.3389/frma.2018.00023

Huang, C.-K., Neylon, C., Brookes-Kenworthy, C., Hosking, R., Montgomery, L., Wilson, K., et al. (2020). Comparison of bibliographic data sources: implications for the robustness of university rankings. Quant. Sci. Stud. 1, 445–478. doi:10.1162/qss_a_00031

Jacsó, P. (2011). The h‐index, h‐core citation rate and the bibliometric profile of the Scopus database. Online Inf. Rev. 35 (3), 492–501. doi:10.1108/14684521111151487

Leydesdorff, L., Moya Anegón, F., and Guerrero Bote, V. P. (2010). Journal maps on the basis of Scopus data: a comparison with the journal citation reports of the ISI. J. Am. Soc. Inf. Sci. Technol. 61 (2), 352–369. doi:10.1002/asi.21250

Martín-Martín, A., Orduna-Malea, E., and Delgado López-Cózar, E. (2018a). Coverage of highly-cited documents in Google scholar, Web of Science, and Scopus: a multidisciplinary comparison. Scientometrics 116 (3), 2175–2188. doi:10.1007/s11192-018-2820-9

Martín-Martín, A., Orduna-Malea, E., Thelwall, M., and Delgado López-Cózar, E. (2018b). Google Scholar, Web of Science, and Scopus: a systematic comparison of citations in 252 subject categories. J. Inf. 12 (4), 1160–1177. doi:10.1016/j.joi.2018.09.002

Martín-Martín, A., Thelwall, M., Orduna-Malea, E., and López-Cózar, E. D. (2020). Google scholar, Microsoft academic, Scopus, dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations. arXiv:2004.14329 [Preprint].

Mongeon, P., and Paul-Hus, A. (2016). The journal coverage of Web of Science and Scopus: a comparative analysis. Scientometrics 106 (1), 213–228. doi:10.1007/s11192-015-1765-5

Moya-Anegón, F., Chinchilla-Rodríguez, Z., Vargas-Quesada, B., Corera-Álvarez, E., González-Molina, A., Muñoz-Fernández, F. J., et al. (2007). Coverage analysis of SCOPUS: a journal metric approach. Scientometrics 73 (1), 57–58. doi:10.1007/s11192-007-1681-4

Moya-Anegón, F., Guerrero-Bote, V. P., Lopez-Illescas, C., and Moed, H. F. (2018). Statistical relationships between corresponding authorship, international co-authorship and citation impact of national research systems. J. Inf. 12 (4), 1251–1262. doi:10.1016/j.joi.2018.10.004

Pickering, B. (2004). Elsevier prepares Scopus to rival ISI Web of Science. Information World Review, 200, 1.

Thelwall, M. (2018). Dimensions: a competitor to Scopus and the Web of Science? J. Inf. 12, 430–435. doi:10.1016/j.joi.2018.03.006

Visser, M., van Eck, N., and Waltman, L. (2020). Large-scale comparison of bibliographic data sources: Scopus, Web of Science, dimensions, crossref, and Microsoft academic. arXiv:2005.10732 [Preprint]. Available at: https://arxiv.org/abs/2005.10732

Keywords: Dimensions, Scopus, bibliographic data sources, database coverage, research evaluation, scientometrics, bibliometrics

Citation: Guerrero-Bote VP, Chinchilla-Rodríguez Z, Mendoza A and de Moya-Anegón F (2021) Comparative Analysis of the Bibliographic Data Sources Dimensions and Scopus: An Approach at the Country and Institutional Levels. Front. Res. Metr. Anal. 5:593494. doi: 10.3389/frma.2020.593494

Received: 10 August 2020; Accepted: 10 November 2020; Published: 22 January 2021.


Copyright © 2021 Guerrero-Bote, Chinchilla-Rodríguez, Mendoza and de Moya-Anegón. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zaida Chinchilla-Rodríguez, [email protected]

This article is part of the Research Topic

Best Practices in Bibliometrics & Bibliometric Services

An introduction to library linked data

bibliographic data research

  • Jeff Mixter
  • 26 March 2024
  • Linked data


We recently announced a comprehensive strategy to bring linked data into mainstream library cataloging workflows. It’s a long-term approach, recognizing that most libraries will move to linked data slowly and incrementally—and we’re committed to providing tools and resources to support the transition for everyone.

Working closely with libraries around the world, we know that staff at some libraries are already educating themselves on the topic, piloting linked data services, and taking part in ongoing research. But we also know that many others have a lot of questions. In addition to technical issues, librarians are also wondering how linked data will impact and affect the work they are currently doing. To help, OCLC is rolling out linked data infrastructure and services that meet libraries where they are today and provide meaningful improvement to challenges facing libraries.

What is linked data?

At its simplest, linked data is about connections. It’s a way to organize and connect data on the web so it can be easily, automatically, and programmatically shared and used by various systems and services.

For a brief, more technical introduction, jump down to the end of this post. But the super-short version is that linked data consists of standardized, machine-readable statements, often embedded in web pages, that computers use to link different concepts by their relationships to each other.

If you look at the “knowledge panel” in a Google result, you’ll often see information about a subject from many sources. That “info card” is populated with linked data from many other sites (including direct links to library resources, using information from WorldCat). Other related linked data sources, including VIAF, Wikidata, and DBpedia, are already being used to connect services and create new applications.

As more related linked data comes online, we’ll see more opportunities for additional library-focused applications. By breaking up the valuable, library-focused data locked in MARC records and publishing it using URIs (Uniform Resource Identifiers), library staff will be able to provide greater context for information and build rich connections across library resources, their communities, and beyond.
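To make the idea concrete, here is a minimal illustration (the identifiers and values are invented for the example, not real WorldCat or VIAF records) of how a bibliographic description might be published as JSON-LD using the schema.org vocabulary, with URIs serving as the links between entities:

```python
import json

# A hypothetical JSON-LD description of a book using schema.org terms.
# The URIs act as globally unique identifiers that other systems can
# reference, which is the core idea behind linked data.
book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "@id": "http://worldcat.org/entity/example-work",  # hypothetical work URI
    "name": "An Introduction to Library Linked Data",
    "author": {
        "@type": "Person",
        "@id": "http://viaf.org/viaf/0000000",  # hypothetical VIAF identifier
        "name": "Example Author",
    },
    "inLanguage": "en",
}

print(json.dumps(book, indent=2))
```

Because the author node carries its own URI rather than just a text string, any other dataset using the same identifier is automatically connected to this description, with no record-by-record reconciliation needed.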

How is linked data different? Is it better?

Traditional, fixed data formats—like MARC records—have two major limitations: it’s hard to get useful data from other, nonlibrary sources into library workflows, and it’s hard for potential users of library information to get MARC data into their workflows.

The first is a challenge because, as we know, there are many sources of information to help improve the discovery and use of library materials. That could be across campus—in another department or system that is more heavily used by students and researchers—or from experts around the world. The second is a lost opportunity, because library metadata is created by cataloging workers (at libraries and OCLC) who are among the most talented data specialists in the world. Many other industries and areas could benefit from the work they do.

Linked data helps address both challenges. For example, OCLC works with organizations like Google to insert library linked data into their services. These efforts make library materials more visible in places where people search online. And there are opportunities for partners to help do the same in reverse, getting their information into systems and services where library workers and users can connect. For example, linked data makes connecting works across languages much easier, meaning that publishers can direct inquiries in one language to materials available in others.

In both cases, it helps connect library work to the wider web, promoting libraries while improving efficiency.

What about MARC?

If we look at the history of metadata, there’s a consistent record of libraries moving to systems and services that let more people interact with that metadata in more ways.

  • Closed stacks were the ultimate data filter. When users had to ask library staff to fetch resources from a closed room, there was no chance for direct interaction.
  • Shelf browsing , using systems like Dewey and LCC (Library of Congress Classification), allowed users to interact with metadata themselves, making their own choices. Library workers moved from being data gatekeepers to being guides, educators, and advocates.
  • Centralized databases , such as WorldCat, connected library catalogs for cooperative record creation and improvement, as well as new discovery and resource sharing options within library-based services.
  • Online access to library databases, in places like WorldCat.org, meant that anyone with access to a web browser could find and use library metadata online. Early OCLC partnerships also meant that library data could—with some additional work—be shared in other online resources.

Linked data is the next step in this evolution. Until now, everything we’ve done was primarily to make library metadata more accessible to people. Now we’re putting library data out there in a way that’s more accessible to today’s online services, programs, machine learning systems, and artificial intelligence (AI) applications.

MARC will be with us for the foreseeable future. After all, it took nearly 50 years for many libraries to fully make the transition from printed cards to online cataloging. Our plan is to continue to support MARC-based functions while actively building powerful library linked data tools and resources.

Why should I care about linked data today?

As libraries continue to focus on new ways to facilitate the creation and sharing of knowledge, and as the volume and variety of information increases, metadata and metadata expertise are more important than ever. Evolving library data into linked data frees the knowledge in library collections and connects it to the knowledge streams that inform our everyday lives—on the web, through smart devices, and using technologies like AI.

Here are some of the reasons I think you should be excited about what’s happening with linked data today:

  • It allows us to harness the collective expertise of library workers at thousands of institutions. That’s exciting both in terms of partnerships and original research.
  • It synchronizes and enhances library data at scale. WorldCat Entities is a set of centralized data that establishes the context for bibliographic metadata curation. And we’re connecting it to existing systems like the DDC (Dewey Decimal Classification) and FAST (Faceted Application of Subject Technology) to integrate linked data into other library workflows.
  • It helps current systems and workflows through the transition to linked data by integrating data such as WorldCat Entities URIs into WorldCat.
  • We’re creating new tools that will let cataloging workers add linked data to existing records. This will allow for enhanced cataloging applications, record output with identifiers, and soon, the launch of OCLC Meridian, a WorldCat Entities linked data management tool.
  • We’ll also launch a bibliographic editing tool that works seamlessly between BIBFRAME and MARC data, helping to meet the needs of librarians as they transition to non-MARC formats.

There’s a lot to be excited about. And this will be a marathon, not a sprint. But for today? Know that OCLC is working toward a linked data future that supports all libraries as they transition at their own pace and in ways that provide value without impacting current processes.

This is the first of three posts about linked data. Keep an eye on this space, check out the main page for OCLC linked data strategy and news, and sign up for updates on this important subject.

Technical background for linked data

When Tim Berners-Lee and the team at CERN invented the basic protocols for the web in 1989, they proposed three basic technologies to connect people to resources:

  • Uniform Resource Identifiers (URIs) for anything that can be connected on the web; URLs (Uniform Resource Locators)—commonly known as “web page names”—are a type of URI
  • The Hypertext Markup Language (HTML) code used to format documents on the web
  • The Hypertext Transfer Protocol (HTTP), which is used to establish connections between web pages and related assets (pictures, sound, video, apps, and data)

When you—a human user of the web—click on a link that says, for example, “Boston Symphony Orchestra Archives,” you have an expectation that it will take you to another page with related information. The context for that journey is based on how people use documents and links to find and access related resources.

Later, Berners-Lee expanded this, outlining principles to link data between computers rather than people. He proposed that “conceptual things” should have a URI for an online name that returns data about that thing in a standard format, and that other related things should also be given a URI. In this way, similarly to how people use links, computer programs can move from page to page (URI to URI), using common technology to search for and utilize related information.

The URI for a “thing” (commonly called an “entity,” which could be any object, person, date, concept, place, etc.) is just a web page that has linked data code on it. That code contains information about the subject, and also links to other entities using something called “a triple,” which is just:

[Thing 1] <has this relationship> to [Thing 2]

So, for example:

[Octavia E. Butler] <is the author of> [Parable of the Sower]

That information would be found in a line of code on the page for both Butler and the novel. So, when a computer program finds either page, it will be able to “know” the relationship between those two entities. And when billions of pieces of linked data are published and connected all over the web, it becomes possible to build applications that utilize previously disconnected information in unique and powerful ways.

For example, another site might publish linked data about where famous people are born, and could have the following triple on the page for Pasadena, California, USA:

[Pasadena, California, USA] <is the birthplace of> [Octavia E. Butler]

And a third application might be pulling data from many sites in order to display interesting travel-related information for vacation planning. Its service could pull linked data from the birthplace site, and then search for related, interesting links. So that when you use its software to plan a trip to Pasadena, it would search that linked data, which would then connect to library data, and provide library links to works by authors from that city.
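The travel-planner scenario can be sketched in plain code: each site’s linked data is just a set of (subject, predicate, object) triples, and an application merges and traverses them. The data below is a toy stand-in for what real sites would publish:

```python
# Toy illustration with hypothetical data: each "site" publishes triples as
# (subject, predicate, object) tuples, and an application merges them and
# follows relationships across sources.
birthplace_site = {
    ("Pasadena, California, USA", "is the birthplace of", "Octavia E. Butler"),
}
library_site = {
    ("Octavia E. Butler", "is the author of", "Parable of the Sower"),
    ("Octavia E. Butler", "is the author of", "Kindred"),
}

merged = birthplace_site | library_site  # one graph, two sources

def objects_of(graph, subject, predicate):
    """Return every object linked from `subject` via `predicate`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}

# Trip planner: who was born in Pasadena, and what did they write?
for author in objects_of(merged, "Pasadena, California, USA", "is the birthplace of"):
    works = objects_of(merged, author, "is the author of")
    print(f"{author}: {sorted(works)}")
    # → Octavia E. Butler: ['Kindred', 'Parable of the Sower']
```

The key point is that neither site knows about the other; the shared entity names (in real linked data, shared URIs) are what let the application join them.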

The main thing to keep in mind is that linked data is simply computer code on ordinary web pages that provides contextual information about things (“entities”). That data is then read by automated programs that put it together with linked data from other sources to create new applications and services.


Scientists tend to inflate how ethical they are in doing their research.

We have known for a long time that people tend to paint a rosy picture of how good they are. Now we know that scientists are no exception, at least when it comes to conducting their own research. This is especially surprising since scientists are regularly thought to be objective.


This new discovery emerged from a massive survey of 11,050 scientific researchers in Sweden, conducted by Amanda M. Lindkvist, Lina Koppel, and Gustav Tinghög at Linköping University and published in the journal Scientific Reports. The survey was very simple, with only two questions. Here was the first:

Question One: In your role as a researcher, to what extent do you perceive yourself as following good research practices—compared to other researchers in your field?

Rather than allowing the survey participants to each define what ‘good research practice’ is, the researchers gave them these criteria:

(1) Tell the truth about one’s research.

(2) Consciously review and report the basic premises of one’s studies.

(3) Openly account for one’s methods and results.

(4) Openly account for one’s commercial interests and other associations.


(5) Not make unauthorized use of the research results of others.

(6) Keep one’s research organized, for example through documentation and filing.

(7) Strive to conduct one’s research without doing harm to people, animals or the environment.

(8) Be fair in one’s judgement of others’ research.

Note that many of these criteria have to do with honesty, but there are also ones on conscientiousness, non-malevolence, and fairness.

What were the results? Participants used a scale to rate themselves from 1 to 7, with 1 = Much less than other researchers, 4 = As much as other researchers, and 7 = Much more than other researchers. This is what the responses revealed:

44% rated themselves as more ethical in their research practices than other researchers in their field.

55% rated themselves as the same as their peers.

Not even 1% rated themselves as less ethical than their peers.

Of course these results can’t all be accurate, since mathematically more than 1% of scientists must be less ethical than the average in their field.

The other question that Lindkvist and colleagues asked these scientific researchers was this:

Question Two: To what extent do you perceive researchers within your field as following good research practices—compared to researchers within other fields?

Here too the results were very skewed. 29% said their field followed good research practices to a greater extent than did scientists in other fields. Only 8% said it was the other way around.

These results should surprise us for a couple of reasons. One is that they go against the popular narrative of scientists as objective and neutral. When it comes to their own ethical behavior in conducting their research, they appear as a whole to be biased and overconfident. Another reason these results are surprising is that many scientists are likely aware of the existence of scientific research on how people in general tend to have an inflated view of their own virtue. So you’d expect that they would be on guard against such a tendency in their own case.

There are dangers that come with scientists having an overly positive view of their own research ethics. Lindkvist helpfully explains one of them: it “may lead researchers to underestimate the ethical implications of the decisions they make and to sometimes be blind to their own ethical failures. For example, researchers may downplay their own questionable practices but exaggerate those of other researchers, perhaps especially researchers outside their field.” Another danger that Lindkvist notes is a greater tendency to ignore warnings and ethical safeguards, if they are dismissed by a scientist as applying to others but not to her since she thinks she is above average.

It would be interesting in future work to see if similar patterns emerge with researchers in other countries besides Sweden. It would also be interesting to look at researchers anonymously rating the research ethics of their colleagues in their own departments and schools.

If these results hold up, it will be important to find ways to encourage scientific researchers to correct their inflated perceptions. As Lindkvist urges, “To restore science’s credibility, we need to create incentive structures, institutions, and communities that foster ethical humility and encourage us to be our most ethical selves in an academic system that otherwise incentivizes us to be bad.”

Christian B. Miller


Frontiers in Research Metrics and Analytics (Front Res Metr Anal), PMC10541951

Editorial: Linked open bibliographic data for real-time research assessment

Research evaluation has long been important as a means of deciding academic tenure, awarding research grants, tracking the evolution of scholarly institutions, and assessing doctoral students (King, 1987 ). However, this effort has been limited by the poor findability and accessibility of the bibliographic databases that enable such assessment, and by the legal and financial burdens on their reuse (Herther, 2009 ). With the Internet age, scholarly publications came to be issued online in electronic formats, allowing accurate bibliographic information to be extracted from them (Borgman, 2008 ) and their readership, download, sharing, and search patterns to be tracked (Markscheffel, 2013 ). Online resources called bibliographic knowledge graphs have consequently appeared, providing free bibliographic data and usage statistics for scholarly publications (Markscheffel, 2013 ). These resources are structured as triples, making them manageable through APIs and query endpoints (Ji et al., 2021 ), and they are kept up to date in near real time through automated methods for enrichment and validation.

Currently, many of these resources are released under permissive licenses such as CC0, CC-BY 4.0, MIT, and GNU, covering various aspects of research evaluation (Markscheffel, 2013 ), including citation data (Peroni and Shotton, 2020 ), patent information (Verluise et al., 2020 ), research metadata (Stocker et al., 2022 ), bibliographic metadata (Hendricks et al., 2020 ), author information (Haak et al., 2012 ), and data about scholarly journals and conferences (Ley, 2009 ). Multilingual and multidisciplinary open knowledge graphs provide large-scale information about a variety of topics, including bibliographic metadata, thanks to user contributions and crowdsourcing within the framework of Linked Open Data (Turki et al., 2022 ). Owing to their flexible data model, they can integrate and centralize knowledge from multiple open and linked bibliographic resources based on persistent identifiers (PIDs), becoming a secondary resource for research data (Nielsen et al., 2017 ). They also include a large body of non-bibliographic information, such as country and prize data, that can be used to augment bibliographic data and to study the effect of social factors on research efforts (Turki et al., 2022 ). The gathered information can then be used to generate research evaluation dashboards that are updated in real time via SPARQL queries (Nielsen et al., 2017 ) or API queries (Lezhnina et al.). This will allow a new generation of knowledge-driven, living research evaluation (Markscheffel, 2013 ). Beyond resources with permissive licenses, several bibliographic databases are available online under an All Rights Reserved license, such as Google Scholar, maintained by Google (Orduña-Malea et al., 2015 ), and PubMed, provided by the National Center for Biotechnology Information (Fiorini et al., 2017 ). These resources can be very useful for feeding private research dashboards and real-time research evaluation reports for scholarly institutions.
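As an illustration of the kind of query such a dashboard might run, the sketch below builds (but does not execute) a SPARQL query in the style of Wikidata’s endpoint, counting a researcher’s publications per year. P50 (“author”) and P577 (“publication date”) are real Wikidata properties; the author item ID is a made-up placeholder:

```python
# Build (but do not execute) a SPARQL query counting a researcher's
# publications per year, as a real-time dashboard might issue it.
AUTHOR_QID = "Q00000000"  # hypothetical Wikidata item ID for a researcher

query = f"""
SELECT (YEAR(?date) AS ?year) (COUNT(DISTINCT ?work) AS ?works)
WHERE {{
  ?work wdt:P50 wd:{AUTHOR_QID} ;   # P50 = author
        wdt:P577 ?date .            # P577 = publication date
}}
GROUP BY ?year
ORDER BY ?year
"""

print(query)
```

A dashboard would send this string to a SPARQL endpoint on a schedule, so the resulting chart reflects the knowledge graph’s current state rather than a static snapshot.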

Despite the value of open bibliographic resources, they can contain inconsistencies that should be resolved for better accuracy. As an example, OpenCitations mistakenly includes 1,370 self-citations and 1,498 symmetric citations as of April 30, 2022. 1 They can also exhibit biases that present a distorted mirror of research efforts across the world (Martín-Martín et al., 2021 ). That is why these databases need to be enhanced in terms of data modeling, data collection, and data reuse. This goes in line with the current perspective of the European Union on reforming research assessment (CoARA, 2022 ). In this topical collection, we are honored to feature novel research works on the automatic generation of real-time research assessment reports from open bibliographic resources. We are happy to host research efforts emphasizing the importance of open research data as a basis for transparent and responsible research assessment, assessing the data quality of open resources for use in real-time research evaluation, and providing implementations of how online databases can be combined to feed dashboards for real-time scholarly assessment.
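The two kinds of inconsistency mentioned above are straightforward to characterize in code. A toy check over made-up (citing DOI, cited DOI) pairs:

```python
# Toy check, over made-up DOIs, for two kinds of citation-data
# inconsistency: self-citations (a work citing itself) and symmetric
# citations (A cites B while B also cites A).
citations = [
    ("10.1000/a", "10.1000/a"),  # self-citation
    ("10.1000/b", "10.1000/c"),
    ("10.1000/c", "10.1000/b"),  # symmetric with the previous pair
    ("10.1000/d", "10.1000/e"),
]

self_citations = [(citing, cited) for citing, cited in citations if citing == cited]

pairs = set(citations)
symmetric = {
    tuple(sorted(p)) for p in pairs
    if p[0] != p[1] and (p[1], p[0]) in pairs
}

print(len(self_citations), len(symmetric))
```

Real audits of a citation index work on the same principle, just at the scale of hundreds of millions of citation pairs.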

The four accepted papers in this Research Topic provide insight into the use of open bibliographic data to evaluate academic performance. Majeti et al. present an interface that harvests bibliographic and research funding data from online sources. The authors of this paper address systematic biases in collected data through nominal and normalized metrics and present the results of an evaluation survey taken by senior faculty. Porter and Hook explore the deployment of scientometric data into the hands of practitioners through cloud-based data infrastructures. The authors present an approach that connects Dimensions and World Bank data on Google BigQuery to study international collaboration between countries of different economic classifications. Schnieders et al. evaluate the readiness of research institutions for partially automated research reporting using open, public research information collected via persistent identifiers (PIDs) for organizations (ROR), persons (ORCID), and research outputs (DOI). The authors use internally maintained lists of persons to investigate ORCID coverage in external open data sources and present recommendations for future actions. Lezhnina et al. propose a dashboard using scholarly knowledge graphs to visualize research contributions, combining computer science, graphic design, and human-technology interaction. The user survey showed the dashboard's appeal and potential to enhance scholarly communication through knowledge graph-powered dashboards in different domains.
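As a sketch of the kind of coverage analysis described above, one can compare an internally maintained list of persons against the ORCID iDs recorded for them. All names and iDs below are invented placeholders:

```python
# Sketch of an ORCID-coverage calculation over an internally maintained
# list of persons. All names and iDs here are invented placeholders.
staff = [
    {"name": "A. Researcher", "orcid": "0000-0002-1825-0097"},
    {"name": "B. Scholar", "orcid": None},
    {"name": "C. Scientist", "orcid": "0000-0001-0000-0000"},
]

with_orcid = sum(1 for person in staff if person["orcid"])
coverage = with_orcid / len(staff)
print(f"ORCID coverage: {with_orcid}/{len(staff)} = {coverage:.0%}")
# → ORCID coverage: 2/3 = 67%
```

An institution would then check the covered iDs against external open sources (e.g., via the PIDs mentioned above) to see how much of its output is automatically reportable.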

The research papers featured here underscore the critical importance of open bibliographic data in transforming the landscape of research evaluation. These papers not only shed light on this pivotal role but also offer invaluable practical tools for both researchers and practitioners. By harnessing linked open data, these resources empower individuals within the academic community to navigate the intricacies of scholarly communication more effectively, ultimately leading to improved research assessment practices among scholars and institutions.

Author contributions

MB: Writing—original draft, Writing—review and editing. HT: Writing—original draft, Writing—review and editing. MH: Writing—original draft, Writing—review and editing.

1 A detailed list of deficient self-citations and symmetric citations at OpenCitations can be found at https://github.com/csisc/OCDeficiency .

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

  • Borgman C. L. (2008). Data, disciplines, and scholarly publishing . Learn. Pub. 21 , 29–38. 10.1087/095315108X254476 [ CrossRef ] [ Google Scholar ]
  • CoARA (2022). Agreement on Reforming Research Assessment. Brussels, Belgium: Science Europe and European University Association. [ Google Scholar ]
  • Fiorini N., Lipman D. J., Lu Z. (2017). Cutting edge: toward PubMed 2.0 . eLife 6 , e28801. 10.7554/eLife.28801 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Haak L. L., Fenner M., Paglione L., Pentz E., Ratner H. (2012). ORCID: a system to uniquely identify researchers . Learn. Pub. 25 , 259–264. 10.1087/20120404 [ CrossRef ] [ Google Scholar ]
  • Hendricks G., Tkaczyk D., Lin J., Feeney P. (2020). Crossref: the sustainable source of community-owned scholarly metadata . Quant. Sci. Stud. 1 , 414–427. 10.1162/qss_a_00022 [ CrossRef ] [ Google Scholar ]
  • Herther N. K. (2009). Research evaluation and citation analysis: key issues and implications . Elect. Lib. 27 , 361–375. 10.1108/02640470910966835 [ CrossRef ] [ Google Scholar ]
  • Ji S., Pan S., Cambria E., Marttinen P., Philip S. Y. (2021). A survey on knowledge graphs: representation, acquisition, and applications . IEEE Transact. Neural Networks Learn. Sys. 33 , 494–514. 10.1109/TNNLS.2021.3070843 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • King J. (1987). A review of bibliometric and other science indicators and their role in research evaluation . J. Inform. Sci. 13 , 261–276. 10.1177/016555158701300501 [ CrossRef ] [ Google Scholar ]
  • Ley M. (2009). DBLP: some lessons learned . Proceed. VLDB Endow. 2 , 1493–1500. 10.14778/1687553.1687577 [ CrossRef ] [ Google Scholar ]
  • Markscheffel B. (2013). New Metrics, a Chance for changing Scientometrics. A Preliminary Discussion of Recent Approaches. Scientometrics: Status and Prospects for Development (p. 37). Moscow, Russia: Institute for the Study of Science of RAS. Available online at: https://www.researchgate.net/publication/258926049_New_Metrics_a_Chance_for_Changing_Scientometrics_A_Preliminary_Discussion_of_Recent_Approaches (accessed August 10, 2023).
  • Martín-Martín A., Thelwall M., Orduna-Malea E., Delgado López-Cózar E. (2021). Google scholar, microsoft academic, scopus, dimensions, web of science, and opencitations' COCI: a multidisciplinary comparison of coverage via citations . Scientometrics 126 , 871–906. 10.1007/s11192-020-03690-4 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Nielsen F. Å., Mietchen D., Willighagen E. (2017). Scholia, scientometrics and wikidata . European Semantic Web Conference . Cham: Springer, 237–259. 10.1007/978-3-319-70407-4_36 [ CrossRef ] [ Google Scholar ]
  • Orduña-Malea E., Ayllón J. M., Martín-Martín A., Delgado López-Cózar E. (2015). Methods for estimating the size of Google Scholar . Scientometrics 104 , 931–949. 10.1007/s11192-015-1614-6 [ CrossRef ] [ Google Scholar ]
  • Peroni S., Shotton D. (2020). OpenCitations, an infrastructure organization for open scholarship . Quant. Sci. Stud. 1 , 428–444. 10.1162/qss_a_00023 [ CrossRef ] [ Google Scholar ]
  • Stocker M., Heger T., Schweidtmann A., Cwiek-Kupczyńska H., Penev L., Dojchinovski M., et al.. (2022). SKG4EOSC-scholarly knowledge graphs for EOSC: establishing a backbone of knowledge graphs for FAIR scholarly information in EOSC . Res. Ideas Out. 8 , e83789. 10.3897/rio.8.e83789 [ CrossRef ] [ Google Scholar ]
  • Turki H., Hadj Taieb M. A., Shafee T., Lubiana T., Jemielniak D., Ben Aouicha M., et al.. (2022). Representing COVID-19 information in collaborative knowledge graphs: the case of Wikidata . Sem. Web 13 , 233–264. 10.3233/SW-210444 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Verluise C., Cristelli G., Higham K., de Rassenfosse G. (2020). The Missing 15 Percent of Patent Citations. Lausanne: EPFL. [ Google Scholar ]


Quantum Physics

Title: Natural Language, AI, and Quantum Computing in 2024: Research Ingredients and Directions in QNLP

Abstract: Language processing is at the heart of current developments in artificial intelligence, and quantum computers are becoming available at the same time. This has led to great interest in quantum natural language processing, and several early proposals and experiments. This paper surveys the state of this area, showing how NLP-related techniques including word embeddings, sequential models, attention, and grammatical parsing have been used in quantum language processing. We introduce a new quantum design for the basic task of text encoding (representing a string of characters in memory), which has not been addressed in detail before. As well as motivating new technologies, quantum theory has made key contributions to the challenging questions of 'What is uncertainty?' and 'What is intelligence?' As these questions are taking on fresh urgency with artificial systems, the paper also considers some of the ways facts are conceptualized and presented in language. In particular, we argue that the problem of 'hallucinations' arises through a basic misunderstanding: language expresses any number of plausible hypotheses, only a few of which become actual, a distinction that is ignored in classical mechanics, but present (albeit confusing) in quantum mechanics.



Americans’ use of ChatGPT is ticking up, but few trust its election information

It’s been more than a year since ChatGPT’s public debut set the tech world abuzz. And Americans’ use of the chatbot is ticking up: 23% of U.S. adults say they have ever used it, according to a Pew Research Center survey conducted in February, up from 18% in July 2023.

The February survey also asked Americans about several ways they might use ChatGPT, including for workplace tasks, for learning and for fun. While growing shares of Americans are using the chatbot for these purposes, the public is more wary than not of what the chatbot might tell them about the 2024 U.S. presidential election. About four-in-ten adults have not too much or no trust in the election information that comes from ChatGPT. By comparison, just 2% have a great deal or quite a bit of trust.

Pew Research Center conducted this study to understand Americans’ use of ChatGPT and their attitudes about the chatbot. For this analysis, we surveyed 10,133 U.S. adults from Feb. 7 to Feb. 11, 2024.

Everyone who took part in the survey is a member of the Center’s American Trends Panel (ATP), an online survey panel that is recruited through national, random sampling of residential addresses. This way, nearly all U.S. adults have a chance of selection. The survey is weighted to be representative of the U.S. adult population by gender, race, ethnicity, partisan affiliation, education and other categories. Read more about the ATP’s methodology.

Here are the questions used for this analysis, along with responses, and the survey methodology.

Below we’ll look more closely at:

  • Which U.S. adults have used ChatGPT
  • How Americans are using it
  • How much Americans trust ChatGPT’s election information

Who has used ChatGPT?

A line chart showing that ChatGPT use has ticked up since July, particularly among younger adults.

Most Americans still haven’t used the chatbot, despite the uptick since our July 2023 survey on this topic. But some groups remain far more likely to have used it than others.

Differences by age

Adults under 30 stand out: 43% of these young adults have used ChatGPT, up 10 percentage points since last summer. Use of the chatbot is also up slightly among those ages 30 to 49 and 50 to 64. Still, these groups remain less likely than their younger peers to have used the technology. Just 6% of Americans 65 and up have used ChatGPT.

Differences by education

Highly educated adults are most likely to have used ChatGPT: 37% of those with a postgraduate or other advanced degree have done so, up 8 points since July 2023. This group is more likely to have used ChatGPT than those with a bachelor’s degree only (29%), some college experience (23%) or a high school diploma or less (12%).

How have Americans used ChatGPT?

Since March 2023, we’ve also tracked three potential reasons Americans might use ChatGPT: for work, to learn something new or for entertainment.

Line charts showing that the share of employed Americans who have used ChatGPT for work has risen by double digits in the past year.

The share of employed Americans who have used ChatGPT on the job increased from 8% in March 2023 to 20% in February 2024, including an 8-point increase since July.

Turning to U.S. adults overall, about one-in-five have used ChatGPT to learn something new (17%) or for entertainment (17%). These shares have increased from about one-in-ten in March 2023.

Line charts showing that about a third of employed Americans under 30 have now used ChatGPT for work.

Use of ChatGPT for work, learning or entertainment has largely risen across age groups over the past year. Still, there are striking differences between these groups (those 18 to 29, 30 to 49, and 50 and older).

For example, about three-in-ten employed adults under 30 (31%) say they have used it for tasks at work – up 19 points from a year ago, with much of that increase happening since July. These younger workers are more likely than their older peers to have used ChatGPT in this way.

Adults under 30 also stand out in using the chatbot for learning. And when it comes to entertainment, those under 50 are more likely than older adults to use ChatGPT for this purpose.

A third of employed Americans with a postgraduate degree have used ChatGPT for work, compared with smaller shares of workers who have a bachelor’s degree only (25%), some college (19%) or a high school diploma or less (8%).

Those shares have each roughly tripled since March 2023 for workers with a postgraduate degree, bachelor’s degree or some college. Among workers with a high school diploma or less, use is statistically unchanged from a year ago.

Using ChatGPT for other purposes also varies by education level, though the patterns are slightly different. For example, a quarter each of postgraduate and bachelor’s degree-holders have used ChatGPT for learning, compared with 16% of those with some college experience and 11% of those with a high school diploma or less education. Each of these shares is up from a year ago.

ChatGPT and the 2024 presidential election

With more people using ChatGPT, we also wanted to understand whether Americans trust the information they get from it, particularly in the context of U.S. politics.

A horizontal stacked bar chart showing that about 4 in 10 Americans don’t trust information about the election that comes from ChatGPT.

About four-in-ten Americans (38%) don’t trust the information that comes from ChatGPT about the 2024 U.S. presidential election – that is, they say they have not too much trust (18%) or no trust at all (20%).

A mere 2% have a great deal or quite a bit of trust, while 10% have some trust.

Another 15% aren’t sure, while 34% have not heard of ChatGPT.

Distrust far outweighs trust regardless of political party. About four-in-ten Republicans and Democrats alike (including those who lean toward each party) have not too much or no trust at all in ChatGPT’s election information.

Notably, however, very few Americans have actually used the chatbot to find information about the presidential election: Just 2% of adults say they have done so, including 2% of Democrats and Democratic-leaning independents and 1% of Republicans and GOP leaners.

These survey findings come amid growing national attention on chatbots and misinformation. Several tech companies have recently pledged to prevent the misuse of artificial intelligence – including chatbots – in this year’s election. But recent reports suggest chatbots themselves may provide misleading answers to election-related questions.

Note: Here are the questions used for this analysis, along with responses, and the survey methodology.
