Open and free content on JSTOR and Artstor

Our partnerships with libraries and publishers help us make content discoverable and freely accessible worldwide

Search open content on JSTOR

Explore our growing collection of Open Access journals

Early Journal Content (articles published more than 95 years ago in the United States, or more than 143 years ago if first published internationally) is freely available to all
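
The two rolling windows above amount to a simple year comparison. The following is a minimal sketch of that arithmetic, not JSTOR's actual eligibility logic: the function name, the year-level granularity, and the strict boundary comparison are all assumptions made for illustration.

```python
# Rolling windows taken from the text: 95 years for US publications,
# 143 years for content first published internationally.
US_WINDOW_YEARS = 95
INTERNATIONAL_WINDOW_YEARS = 143


def is_early_journal_content(publication_year, published_in_us, current_year):
    """Return True if the article falls before the rolling cutoff.

    Assumption: the comparison is made on whole years and the boundary
    year itself is excluded; the source does not specify either detail.
    """
    window = US_WINDOW_YEARS if published_in_us else INTERNATIONAL_WINDOW_YEARS
    cutoff_year = current_year - window
    return publication_year < cutoff_year
```

Under these assumptions, a US article from 1920 checked in 2024 falls before the cutoff year of 1929, while an internationally published article from 1900 does not fall before the cutoff year of 1881.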

Even more content is available when you register to read – millions of articles from nearly 2,000 journals

Thousands of Open Access ebooks are available from top scholarly publishers, including Brill, Cornell University Press, University College London, and University of California Press – at no cost to libraries or users.

This includes Open Access titles in Spanish:

Collaboration with El Colegio de México
Partnership with the Latin American Council of Social Sciences

Images and media

JSTOR hosts a growing number of public collections, including Artstor’s Open Access collections, from museums, archives, libraries, and scholars worldwide.

Research reports

A curated set of more than 34,000 research reports from more than 140 policy institutes selected with faculty, librarian, and expert input.

Resources for librarians

Open content title lists:

Open Access Journals (xlsx)
Open Access Books (xlsx)
JSTOR Early Journal Content (xlsx)
Research Reports (txt)

Open Access ebook resources for librarians

Library-supported collections

Shared Collections: We have a growing corpus of digital special collections published on JSTOR by our institutional partners.

Reveal Digital: A collaboration with libraries to fund, source, digitize and publish open access primary source collections from under-represented voices.

JSTOR Daily

JSTOR Daily is an online publication that contextualizes current events with scholarship. All of our stories contain links to publicly accessible research on JSTOR. We’re proud to publish articles based in fact and grounded by careful research and to provide free access to that research for all of our readers.

Discover relevant research today

Advance your research field in the open

Reach new audiences and maximize your readership

ScienceOpen puts your research in the context of

Publications

For Publishers

ScienceOpen offers content hosting, context building and marketing services for publishers. See our tailored offerings

For academic publishers to promote journals and interdisciplinary collections
For open access journals to host journal content in an interactive environment
For university library publishing to develop new open access paradigms for their scholars
For scholarly societies to promote content with interactive features

For Institutions

ScienceOpen offers state-of-the-art technology and a range of solutions and services

For faculties and research groups to promote and share your work
For research institutes to build up your own branding for OA publications
For funders to develop new open access publishing paradigms
For university libraries to create an independent OA publishing environment

For Researchers

Make an impact and build your research profile in the open with ScienceOpen

Search and discover relevant research in over 96 million Open Access articles and article records
Share your expertise and get credit by publicly reviewing any article
Publish your poster or preprint and track usage and impact with article- and author-level metrics
Create a topical Collection to advance your research field

Create a Journal powered by ScienceOpen

Launching a new open access journal or an open access press? ScienceOpen now provides full end-to-end open access publishing solutions – embedded within our smart interactive discovery environment. A modular approach allows open access publishers to pick and choose among a range of services and design the platform that fits their goals and budget.

What can a Researcher do on ScienceOpen?

ScienceOpen provides researchers with a wide range of tools to support their research – all for free. Here is a short checklist to make sure you are getting the most out of the technological infrastructure and content that we have to offer.

ScienceOpen on the Road

Upcoming events.

15 June – Scheduled Server Maintenance, 13:00 – 01:00 CEST

Past Events

20 – 22 February – ResearcherToReader Conference
09 November – Webinar for the Discoverability of African Research
26 – 27 October – Attending the Workshop on Open Citations and Open Scholarly Metadata
18 – 22 October – ScienceOpen at Frankfurt Book Fair
27 – 29 September – Attending OA Tage, Berlin
25 – 27 September – ScienceOpen at Open Science Fair
19 – 21 September – OASPA 2023 Annual Conference
22 – 24 May – ScienceOpen sponsoring Pint of Science, Berlin
16 – 17 May – ScienceOpen at 3rd AEUP Conference
20 – 21 April – ScienceOpen attending Scaling Small: Community-Owned Futures for Open Access Books

What is ScienceOpen?

Smart search and discovery within an interactive interface
Researcher promotion and ORCID integration
Open evaluation with article reviews and Collections
Business model based on providing services to publishers

Where scientists empower society

Creating solutions for healthy lives on a healthy planet.

9.4 million

2.8 billion

article views and downloads

Find a journal

We have a home for your research. Our community-led journals cover more than 1,500 academic disciplines and are some of the largest and most cited in their fields.

Submit your research

Start your submission and get more impact for your research by publishing with us.

Author guidelines

Ready to publish? Check our author guidelines for everything you need to know about submitting, from choosing a journal and section to preparing your manuscript.

Peer review

Our efficient collaborative peer review means you’ll get a decision on your manuscript in an average of 61 days.

Article publishing charges (APCs) apply to articles that are accepted for publication by our external and independent editorial boards

Press office

Visit our press office for key media contact information, as well as Frontiers’ media kit, including our embargo policy, logos, key facts, leadership bios, and imagery.

Institutional partnerships

Join more than 555 institutions around the world already benefiting from an institutional membership with Frontiers, including CERN, Max Planck Society, and the University of Oxford.

Publishing partnerships

Partner with Frontiers and make your society’s transition to open access a reality with our custom-built platform and publishing expertise.

Policy Labs

Connecting experts from business, science, and policy to strengthen the dialogue between scientific research and informed policymaking.

How we publish

All Frontiers journals are community-run and fully open access, so every research article we publish is immediately and permanently free to read.

Editor guidelines

Reviewing a manuscript? See our guidelines for everything you need to know about our peer review process.

Become an editor

Apply to join an editorial board and collaborate with an international team of carefully selected independent researchers.

My assignments

It’s easy to find and track your editorial assignments with our platform, 'My Frontiers' – saving you time to spend on your own research.

How ‘vaccinating’ plants could reduce pesticide use and secure global food supplies

Induced resistance, where plants’ immune systems are activated in a controlled way that prepares them to fight pests and disease, could help build a sustainable and resilient agricultural system.

Safeguarding peer review to ensure quality at scale

Making scientific research open has never been more important. But for research to be trusted, it must be of the highest quality. Facing an industry-wide rise in fraudulent science, Frontiers has increased its focus on safeguarding quality.

Understanding regional climate change is essential for guiding effective climate adaptation policy, study finds

From intensified monsoons and storm tracks to polar precipitation shifts, a new synthesis of regional climate data emphasizes the need for climate adaptation policy based on the latest regional climate science.

Oceanic life found to be thriving thanks to Saharan dust blown from thousands of kilometers away

US scientists found that the further Saharan dust travels, the more iron in it becomes bioreactive. This is crucial for understanding iron's impact on phytoplankton growth, terrestrial ecosystems, and carbon cycling, especially under global change.

Your Zoom background could influence how tired you feel after a video call

On many videoconferencing platforms users can set virtual backgrounds. But could this choice have varying effects on how tired people feel after a video call?

When procrastination becomes unhealthy: Here are five Frontiers articles you won’t want to miss

At Frontiers, we bring some of the world’s best research to a global audience. But with tens of thousands of articles published each year, it’s impossible to cover all of them. Here are just five amazing papers you may have missed.

Three Research Topics exploring dementia diagnosis and treatment

While dementia remains a complex challenge, scientists are making significant progress in understanding and treating it.

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

Find and Read Articles

PLOS publishes a suite of peer-reviewed Open Access journals that feature quality research, expert commentary, and critical analysis across all scientific disciplines.

New Articles by Email

Journal alerts

PLOS articles publish daily and roll up into monthly issues. To receive a regular email alert with a list of articles published in each issue, sign up at the bottom of any journal homepage.

Choose how often to receive alerts and manage subscriptions from your PLOS account.

Sign in to your account, then click the Profile button at the top of the page.
Navigate to the Alerts tab.
Under Journal Alerts, check the boxes to choose weekly or monthly delivery.

The weekly email alert from PLOS ONE delivers all new articles by default.

Select Send me a custom email alert to receive articles from specific subject areas. Select one or more subject areas to add them to the email.

Search alerts

Set up a search alert to be notified when new PLOS articles are published relevant to a personalized search. To start, create or sign in to your PLOS account. Read more about PLOS accounts.

Click the link underneath the search box to navigate to the advanced search page.

Customize the search criteria by subject, article type, author, or a variety of other fields. Choose any combination of PLOS journals.

Once you have a satisfactory set of results, select a delivery method at the top of the page. PLOS offers two delivery methods for these notices: email and RSS feed. Read more about PLOS RSS feeds.

To receive new articles by email, click the search alert button. Name the search, choose the email frequency, and click Save.

Unsubscribe

Unsubscribe from email alerts in your PLOS account. Sign in to your account from the top of any journal page, and click the Profile button.

Journal alerts: Navigate to Alerts > Journal Alerts. Uncheck the box to unsubscribe from a specific mailing list.

Saved search alerts: Navigate to Alerts > Search Alerts. Click X to stop receiving the alert.

New Articles by RSS

PLOS RSS feeds are regular updates with article titles and abstracts, collected in your browser or a feed reader. Use the feeds if you do not want alerts delivered by email but do want to know when new articles are published or added to a saved search. Find out how to create a search alert feed.

RSS feeds are available by clicking the RSS icon on each journal home page. Use any feed reader to collect and read the article list.
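
For readers who prefer a script to a feed reader, here is a minimal sketch of extracting titles, links, and abstracts from an RSS 2.0 document with Python's standard library. The embedded sample feed, its titles, and its URLs are hypothetical placeholders rather than real PLOS data, and the function name `list_articles` is an assumption. A journal's actual feed may use the Atom format instead, in which case the element names differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a journal feed.
# Titles, links, and descriptions are placeholders, not real articles.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Journal: New Articles</title>
    <item>
      <title>First example article</title>
      <link>https://example.org/articles/1</link>
      <description>Abstract of the first example article.</description>
    </item>
    <item>
      <title>Second example article</title>
      <link>https://example.org/articles/2</link>
      <description>Abstract of the second example article.</description>
    </item>
  </channel>
</rss>"""


def list_articles(feed_xml):
    """Parse an RSS 2.0 string and return one dict per <item> element."""
    root = ET.fromstring(feed_xml)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "abstract": item.findtext("description", default=""),
        })
    return articles


if __name__ == "__main__":
    for article in list_articles(SAMPLE_FEED):
        print(article["title"], "-", article["link"])
```

To use this against a live feed, the string passed to `list_articles` would come from downloading the URL behind the RSS icon, which is omitted here so the sketch runs without network access.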

Browse by Subject Area

Each PLOS article offers a list of subject area tags based on its subject matter. Click on any term to view other PLOS articles that use the same tag.

Sign in or create a PLOS account.
Click the Browse button at the top of any PLOS ONE page.

Select a subject area from the menu to view the associated articles from PLOS ONE.

Click the email alert button in the top right corner of the page. Then, click Save to subscribe.

Sign Up for a PLOS Account

Create an account

Click create account at the top of any PLOS journal page to sign up.

After you fill in the form, you’ll receive an email to verify your account and complete registration.

Customize and edit your account

Click the Profile button at the top of the page to edit your account details, including personal information, email address, password, and email subscriptions.

The first time you sign in to your account, create a profile with a username and basic identifying information. Your username is attached to all comments that you post on PLOS web sites.

Land Forces and Warfare
Military Administration
Military Life and Institutions
Naval Forces and Warfare
Other Warfare and Defence Issues
Peace Studies and Conflict Resolution
Weapons and Equipment

Open access

Our open access publishing is key to delivering on our mission

Open access (OA) is a key part of how Oxford University Press (OUP) supports our mission to achieve the widest possible dissemination of high-quality research. We publish rigorously peer-reviewed, world-leading, trusted open access research, upholding the highest standards of publication ethics and integrity.

We work closely with our publishing partners to ensure that we offer open access in a sustainable way, supporting publications for their communities and offering researchers publishing options for making their research available to all and compliant with funder mandates.

Our open access publishing in numbers

Our open access articles have the highest number of policy and patent document mentions, relative to volume of output, compared to other major academic publishers*

Our open access articles have the 2nd highest mean lifetime citation rate compared to other major academic publishers**

12 of our journals are diamond OA, meaning authors publish for free and readers access for free

We publish over 140 fully open access journals

We have published over 350 open access books to date

Over 450 of our journals offer an open access publishing option

Our Read & Publish agreements cover over 1000 institutions at which authors can use funds to publish their article open access in an OUP journal

In 2023 we published over 27,000 open access journal articles

Open access for Journals

OUP’s options for publishing open access in journals include:

Fully open access

Articles published in fully OA journals are available to all; no subscription is required. OUP’s fully OA journals use Creative Commons licences and there is usually an Article Processing Charge (APC) for OA publication.

Hybrid open access

Hybrid journals include a mix of open access articles and articles available to those with a journal subscription.

Hybrid journals offer authors the option of gold open access publishing. With gold open access, authors usually pay an APC to make their research articles available immediately upon publication, under a Creative Commons licence with re-use rights for readers.

For articles published under a Creative Commons licence, readers can re-use the work under the terms of the applicable licence.

‘Read and Publish’ transformative agreements

OUP has agreements with many institutions to provide access to OUP journals for faculty and students and provide funding for open access publishing for affiliated researchers. Find out which institutions are participating, and how to take advantage of available funding for publishing in an OUP journal.

Green open access and self-archiving

OUP has self-archiving policies that permit authors to take advantage of green open access by depositing their accepted manuscript (i.e. the post-acceptance version, before copyediting) into a non-commercial repository. In non-commercial repositories, articles can become freely available after the prescribed embargo period. Find out more about OUP green OA for journals.

Inclusive publishing

OUP believes that the move to open access and open research needs to be equitable and inclusive for all. We want to ensure that authors can publish in their journal of choice. As part of our Developing Countries Initiative, corresponding authors based in qualifying countries publishing in any of OUP’s fully open access journals are eligible for a full waiver of their open access charge.

Open access for Books

OUP has supported OA for books since 2012 as part of our mission to publish high-quality academic and research publications and ensure they are accessible and discoverable.

Publishing your book on an OA basis makes your work freely available online, with no barriers to access. OUP applies the same peer review and editorial development processes to all books whether published open access or under a customer sales model.

If you are considering publishing a book on an OA basis with OUP, please discuss the idea with your Editor. In most instances, the open access fee for books is met by a research funder under their funding and open access policy. All prospective authors are encouraged to provide information on any funding which directly supports the research for a proposed book so that we can plan the publishing route accordingly. You can also consult our information on funders and funder policies.

When a book is published OA it is:

available to read on the Oxford Academic platform both in a browser and as a downloadable PDF

available on Google Books as a full preview

indexed in, and available from, the OAPEN online library and the Directory of Open Access Books (DOAB) as a PDF

sold in print and as an eBook

As well as publishing new books on an open access basis, we are also able to convert backlist titles to OA. If you are the author of a published work and a funder has made funds available to help accelerate OA by converting existing published works, please contact your Editor.

See the full list of our open access books.

Find out more about licences, charges and self-archiving for your open access book.

*Data source: Altmetric. Comparing number of policy and patent document mentions, relative to number of articles published, to Cambridge University Press, Elsevier, Frontiers, Hindawi, Institute of Physics Publishing, MDPI, PLOS, Sage, Springer Nature, Taylor & Francis, and Wiley.

**Data source: Dimensions. Comparing the mean lifetime citation rate of open access articles to those published by Cambridge University Press, Elsevier, Frontiers, Hindawi, Institute of Physics Publishing, MDPI, PLOS, Sage, Springer Nature, Taylor & Francis, and Wiley.

Related information

Complying with funder policies on open access
Charges, licences, and self-archiving
Read and publish agreements

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

Copyright © 2024 Oxford University Press

SpringerOpen

The SpringerOpen portfolio has grown tremendously since its launch in 2010, so that we now offer researchers from all areas of science, technology, medicine, the humanities and social sciences a place to publish open access in journals. Publishing with SpringerOpen makes your work freely available online for everyone, immediately upon publication, and our high-level peer-review and production processes guarantee the quality and reliability of the work. Open access books are published by our Springer imprint.

Find the right journal for you

Explore our subject areas, learn all about open access.

Browse our alphabetical journal list
Explore our journals by subject
Tips for finding the right journal
Find the right journal with our Journal Suggester
Find out if open access book publishing is right for you

Visit our subject pages covering all subject areas in science, technology, medicine, the humanities and social sciences

Visit the Springer Author & Reviewer tutorials and learn all about open access, your benefits, mandates, funding, copyright and more in the different interactive tutorials. And take our quiz to test your knowledge!

Check out the free tutorials
Take the quiz

Video library

Our video library contains how-to videos, videos on the research we publish and journal videos.

200 million monthly downloads

24 million monthly readers

3 million authors submit annually

Springer Nature Link - Home for all research

Discover open access.

Publish with us

Track your research

Featured articles and journals

Browse by subject

About Springer Nature Link

Calls for papers

Edge-Computing AI for Medical Devices

Electronic medical devices can be implantable, wearable, or remote. These devices can collect various multimodal and multigrain bio-signals invasively or non-invasively. This large heterogeneous data from medical devices contains tremendous information that can be analyzed with artificial...

Interruption of Amino Acids Supply as Anti-Tumor Strategy

The interruption of amino acids supply has emerged as a promising anti-tumor strategy, with research focusing on the metabolic vulnerabilities of cancer cells. Depletion of essential amino acids necessary for tumor growth has shown potential in inhibiting cancer cell proliferation and inducing cell...

Photothermal Membranes for Water Treatment

This topical collection aims to showcase the latest developments in photothermal materials and their applications in solar-driven water treatment and desalination technologies. The journal will publish innovative research on the design, synthesis, and performance evaluation of various photothermal...

Trending research

Ultra-processed food intake in toddlerhood and mid-childhood in the UK: cross sectional and longitudinal perspectives

A core in a star-forming disc as evidence of inside-out growth in the early Universe

A single amplified genome catalog reveals the dynamics of mobilome and resistome in the human microbiome

Dance displays in gibbons: biological and linguistic perspectives on structured, intentional, and rhythmic body movement.

Natural forest regeneration is projected to reduce local temperatures

Age-adapted painting descriptions change the viewing behavior of young visitors to the Rijksmuseum

Featured journals

Discover Applied Sciences is a multi-disciplinary open access journal covering applied life sciences, chemistry, earth and environmental sciences,...

Discover Sustainability is an open access journal publishing research across all fields relevant to sustainability. Indexed in Web of Science’s...

The Journal of Epidemiology and Global Health is an international peer reviewed journal which aims to impact global epidemiology and international...

Featured books

As part of Springer Nature, Springer Nature Link delivers fast access to the depth and breadth of our online collection of journals, eBooks, reference works and protocols across a vast range of subject disciplines.

Springer Nature Link is the reading platform of choice for hundreds of thousands of researchers worldwide. Find out how to publish your research with Springer Nature.

Open access journals

We have published over 124,000 open access articles via gold open access across disciplines, from the life sciences to the humanities, representing 33% of all Springer Nature articles in 2020. Authors can also publish their article under an open access licence in more than 2,200 of our hybrid journals.

Our portfolio focuses on robust and insightful research, supporting the development of new areas of knowledge and making ideas and information accessible around the globe.

Across our publishing imprints there are leading multidisciplinary and community-focused journals that offer rigorous, high-impact open access. Many of our titles are also published in partnership with academic societies, enabling them to achieve their own open research ambitions.

Fully open access journals

Download a list of our fully open access journals, including APC and licence information.

This list indicates the standard article processing charge (APC) for each journal. APCs are payable for articles upon acceptance. While we make every effort to keep this list updated, please note that APCs are subject to change and may vary from the price listed. For further information on the licences and other currencies available, self-archiving embargoes, manuscript deposition, and abstracting & indexing, visit the individual journal’s website. VAT or local taxes will be added where applicable.

Questions about paying for open access?

View our frequently asked questions about article processing charges (APCs).

Visit our imprint sites

Hybrid journals

Download a list of our hybrid journals, including Springer Open Choice titles. We publish more than 2,200 journals that offer open access at the article level, allowing optional open access in the majority of Springer Nature's subscription-based journals.

Find out more by imprint

Springer Open Choice
Springer Nature hybrid journals on nature.com
Palgrave Macmillan hybrid journals

Stay up to date

Here to foster information exchange with the library community

Connect with us on LinkedIn and stay up to date with news and development.

We are a world leading research, educational and professional publisher. Visit our main website for more information.

© 2024 Springer Nature

Understanding Open Access

In this guide.

What is Open Access?
Open Access Policies
Open Access at Lane Library
Frequently Asked Questions

What is OA?

Open access (OA) is a set of principles and practices through which research outputs like journal articles are distributed online, free of cost or other access barriers.

In "traditional" scholarly publishing, the publisher owns the rights to the articles in their journals. Individuals looking to read these articles may encounter a paywall, requiring them to pay a fee for access. Institutions and libraries (including Lane Library) help provide access to such paywalled research by negotiating with the publishers and paying costly subscription fees. In contrast, open access ensures that the outputs of the research process can be read and built upon by everyone.

Open access to publications is a component of Open Science, which encompasses a variety of efforts focused on making scientific research more transparent and accessible. Though the term is frequently used to refer to efforts aimed at ensuring access to the products of the research process - journal articles, datasets, code, and other materials - open science also encompasses efforts to ensure that the scientific enterprise is inclusive and equitable.

This guide is intended to help you understand open access-related policies, the various routes you may use to make work "open", and the OA-related resources available to you through Lane Library.

If you have specific questions about open access, please do not hesitate to contact your liaison librarian. If you are interested in engaging in a broader discussion about open access and other open science-related issues, consider attending a meeting of the Open Science Reading Group.

Methods of Making Work Open

There are a variety of ways to make work "open". Below we have highlighted some of the most common and provided detail about how they differ during the writing and submission process, during the evaluation (peer review) process, during the production and publishing process, and how readers are able to access and read articles.

Please note that open access is best conceived of as a continuum of practice. As shown by visualizations like How Open Is It?, individual journals may exhibit greater or lesser degrees of "openness".

Open Access Publishing

Open Access publishing (also sometimes called Gold OA) is a form of open access in which a publisher makes all articles and related content associated with a certain journal available for free immediately on the journal's website. In this model, authors are often asked to bear the cost of publication, typically through an article processing charge (APC). Examples of this form of open access are journals like eLife and those published by PLOS.

Self Archiving

Self-archiving (also sometimes called Green OA) is a form of open access in which, independently of publication by a journal publisher, an author posts their work to a website where it can be accessed and read by others. The NIH Public Access Policy can be considered an example of this type of open access. Stanford University's proposed open access policy includes self-archiving.

There are a variety of ways to find and read articles that have been self-archived in this manner. We recommend the Unpaywall browser extension.

Preprints are a special case of self-archiving where authors submit a copy of an article that has not yet gone through peer review to a preprint repository so it can be accessed and read by others. Preprint servers for biomedical and health sciences-related work include bioRxiv and medRxiv. Europe PMC can be used to search for preprints, and there are a limited number of COVID-19 related preprints in PubMed Central.

Other Forms of OA

In addition to the forms of open access discussed above, there may be cases where a "traditional" journal temporarily removes paywalls for specific articles, or instances where paywalls are removed for articles after a certain period of time following publication. There are also cases where a "traditional" journal will make an individual article free to read if the authors pay a fee. In some cases, the journal may maintain copyright of the articles under these models.
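As a rough illustration, the routes described above can be sketched as a small decision function. This is a simplification under stated assumptions: the field names and the `classify` helper are hypothetical labels invented for this sketch, not a metadata standard, and real journals sit on a continuum of openness rather than in tidy categories.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    GOLD = "open access publishing (Gold OA)"
    GREEN = "self-archiving (Green OA)"
    PREPRINT = "preprint (self-archived before peer review)"
    OTHER = "other (paywall lifted for one article, after a delay, or for a fee)"
    PAYWALLED = "paywalled"


@dataclass
class Article:
    # Hypothetical fields for illustration only.
    free_on_journal_site: bool  # free to read on the journal's website now
    whole_journal_free: bool    # the journal makes all its articles free immediately
    author_posted_copy: bool    # the author posted a copy elsewhere
    peer_reviewed: bool         # the posted copy has been through peer review


def classify(a: Article) -> Route:
    """Map an article to the route in this guide that it most resembles."""
    if a.free_on_journal_site and a.whole_journal_free:
        return Route.GOLD
    if a.free_on_journal_site:
        # Free on the journal site, but the journal as a whole is not open.
        return Route.OTHER
    if a.author_posted_copy:
        return Route.GREEN if a.peer_reviewed else Route.PREPRINT
    return Route.PAYWALLED


if __name__ == "__main__":
    example = Article(False, False, True, False)
    print(classify(example).value)
```

The ordering of the checks is a design choice of the sketch: an article that is free on the journal's own site is classified by that fact first, even if the author has also self-archived a copy.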

More Information

The video below, created by Jorge Cham and featuring Nick Shockey and Jonathan Eisen, provides a quick introduction to the motivations behind and principles of open access.

For even more information about open access, see the list of resources below:

SPARC: SPARC (the Scholarly Publishing and Academic Resources Coalition) works to enable the open sharing of research outputs and educational materials in order to democratize access to knowledge, accelerate discovery, and increase the return on our investment in research and education.
Open Access (The Book): Peter Suber's excellent book provides an introduction to Open Access. It is freely available through a variety of sources.
Open Science Reading Group: The Open Science Reading Group is intended to bring together members of the Stanford Medicine community to learn about open science and discuss the application of open science practices in a biomedical context.
Last Updated: Jul 17, 2024 3:27 PM
URL: https://laneguides.stanford.edu/openaccess

Creative Commons

Open access.

Open access literature is digital, online, free of charge, and free of most copyright and licensing restrictions.

There’s an incredible amount of scientific research conducted at universities and institutions around the world. Historically, the findings of this research have been published in scholarly journals. However, access to this research is typically restricted: it is available only to those who have permission through a university affiliation, or who purchase access to individual articles. This is fundamentally problematic, for many reasons:

Governments provide most of the funding for research — hundreds of billions of dollars annually — and public institutions employ a large portion of all researchers.
Researchers publish their findings without the expectation of compensation. Unlike other authors, they hand their work over to publishers without payment, in the interest of advancing human knowledge.
Through the process of peer review, researchers review each other’s work for free.
Once published, those that contributed to the research (from taxpayers to the institutions that supported the research itself) have to pay again to access the findings. Though research is produced as a public good, it isn’t available to the public who paid for it.

Open access publishing is a solution to these problems. Open access literature is defined as “digital, online, free of charge, and free of most copyright and licensing restrictions.” The recommendations of the Budapest Open Access Declaration, including the use of liberal licensing (such as CC BY), are widely recognized in the community as a means to make a work truly open access.

The existing system for producing and distributing publicly funded research articles is expensive and doesn’t take advantage of the possibilities of innovations like open licensing. Without a free-flowing system, access to the results of scientific research is limited to institutions that are able to commit to hefty journal subscriptions — paid for year after year — which don’t allow for broad redistribution, or repurposing for activities such as text and data mining without additional permissions from the rightsholder. This closed system limits the impact on the scientific and scholarly community and progress is slowed significantly.

When funding cycles for research include open license requirements for publications, increased access and opportunities for reuse extend the value of research funding. As an example, the US National Institutes of Health (NIH) Public Access Policy requires the published results of all NIH-funded research to be deposited in PubMed Central’s repository, the peer-reviewed manuscript immediately, and the final journal article within twelve months of publication. Similarly, the directive issued by the White House Office of Science and Technology Policy mandates that federal agencies with more than $100 million in research expenditures must make the results of their research publicly available within one year of publication, and better manage the resultant data supporting their results. These policies utilize aspects of the optimized cycle below, and are a step in the right direction for making better use of public funding for research articles.

Open access policies and practices are being adopted in a variety of different settings. Above and beyond open licensing policies for publicly funded research, philanthropic foundations, NGOs, and intergovernmental organizations are using open licensing to share the research that they, or their grantees, are creating. Open access journals are publishing Creative Commons-licensed research, which promotes access and re-use of scientific and scholarly research.

Kathryn A. Martin Library

Research & Collections

Open Access (OA) Resources Research Guide

Green Open Access
Gold Open Access
Other types of Open Access
Gratis vs. Libre
Preprint vs. Postprint
Sustainable and equitable models
Open access infrastructure
Reduced author publishing fees
Open Access Organizations
OA Journals
Types of Open Access
OA Databases
Go to OER Guide

What is Open Access (OA)?

Open Access is the free, immediate, online availability of research articles combined with the rights to use these articles fully in the digital environment. Open Access is the needed modern update for the communication of research that fully utilizes the Internet for what it was originally built to do: accelerate research. (Defined by SPARC*)

Some barriers

Open access research outputs are not free to produce, publish, disseminate, or preserve since all have costs associated with them.
Nor does open access mean universal access, as there are language, technological and censorship barriers to overcome in many parts of the world.

Short explainer videos

Open Access - Myth vs. Fact (Editage Insights - 2:36)
What is “Open Access?” by Open Society Foundations (Open Access Explained! - 8:23)
Open Access Policies: an Introduction from COAPI (SPARC - 1:38)

What is Open Access? (Video Explanation)

Resources from video

Green and Gold

SHERPA/RoMEO

Creative Commons

Center for Open Science

Open Access Poll

Myths about Open Access (UMN Libraries)

Open Access at the University of Minnesota

University-wide policy on Open Access to Scholarly Articles that took effect in January of 2015.

FAQ Open Access to Scholarly Articles

Last Updated: Oct 1, 2024 12:08 PM
URL: https://libguides.d.umn.edu/OA

Open Access

Open Access is the free, immediate, online availability of research articles coupled with the rights to use these articles fully in the digital environment. Open Access ensures that anyone can access and use these results—to turn ideas into industries and breakthroughs into better lives.

Research provides the foundation of modern society. Research leads to breakthroughs, and communicating the results of research is what allows us to turn breakthroughs into better lives—to provide new treatments for disease, to implement solutions for challenges like global warming, and to build entire industries around what were once just ideas.

However, our current system for communicating research is crippled by a centuries-old model that hasn’t been updated to take advantage of 21st-century technology:

Governments provide most of the funding for research—hundreds of billions of dollars annually—and public institutions employ a large portion of all researchers.
Researchers publish their findings without the expectation of compensation. Unlike other authors, they hand their work over to publishers without payment, in the interest of advancing human knowledge.
Through the process of peer review, researchers review each other’s work for free.
Once published, those that contributed to the research (from taxpayers to the institutions that supported the research itself) have to pay again to access the findings. Though research is produced as a public good, it isn’t available to the public who paid for it.

Our current system for communicating research uses a print-based model in the digital age. Even though research is largely produced with public dollars by researchers who share it freely, the results are hidden behind technical, legal, and financial barriers. These artificial barriers are maintained by legacy publishers and restrict access to a small fraction of users, locking out most of the world’s population and preventing the use of new research techniques.

This fundamental mismatch between what is possible with digital technology—an open system for communicating research results in which anyone, anywhere can contribute—and our outdated publishing system has led to the call for Open Access.

Funders invest in research to advance human knowledge and ultimately improve lives. Open Access increases the return on that investment by ensuring the results of the research they fund can be read and built on by anyone.

Breakthroughs often come from unexpected places; the Theory of Relativity was developed by a patent clerk. Open Access expands the number of potential contributors to research from just those at institutions wealthy enough to afford journal subscriptions to anyone with an internet connection.

Researchers benefit from having the widest possible audience. Researchers provide their articles to publishers for free, because their compensation comes in the form of recognition for their findings. Open Access means more readers, more potential collaborators, more citations for their work, and ultimately more recognition.

The research enterprise itself benefits when the latest techniques can be easily used. For years, we have had powerful text and data mining tools that can analyze the entire research literature, uncovering trends and connections that no human reader could. While publishers’ technical and legal barriers currently prevent their widespread use, Open Access empowers anyone to use these tools, which hold the potential of revolutionizing how research is conducted.

Even the best ideas remain just that until they are shared, until they can be utilized by others. The more people that can access and build upon the latest research, the more valuable that research becomes and the more likely we are to benefit as a society. More eyes make for smaller problems.

Learn about SPARC’s policy priorities
Download SPARC's OA policy one-pager

Open Access Impact Stories

The impact of embracing community over commercialization.

To catalyze discussion around the 2023 International Open Access Week theme of “Community over...

Zenodo’s Open Repository Streamlines Sharing Science

A decade ago, the scientific community recognized that to move from open access to open science,...

African Open Access Textbook and Journal Publishing Gains...

The high cost of college textbooks and scholarly journals puts many students and institutions at a...

Popular Resources

Open Access 101 series
2021 Update to the SPARC Landscape Analysis & Roadmap for Action
Data Analysis for Negotiation

Latest News

Deepening our efforts to prioritize community over commercialization
SPARC releases second vendor privacy report urging action to address concerns with SpringerLink...
Arcadia provides SPARC with key support through 2030

Upcoming Events

Learn more about our work

Open access
Published: 07 June 2023

CORE: A Global Aggregation Service for Open Access Papers

Petr Knoth (ORCID: orcid.org/0000-0003-1161-7359), Drahomira Herrmannova (ORCID: orcid.org/0000-0002-2730-1546), Matteo Cancellieri, Lucas Anastasiou, Nancy Pontika, Samuel Pearce, Bikash Gyawali & David Pride

Scientific Data volume 10, Article number: 366 (2023)


This paper introduces CORE, a widely used scholarly service, which provides access to the world’s largest collection of open access research publications, acquired from a global network of repositories and journals. CORE was created with the goal of enabling text and data mining of scientific literature and thus supporting scientific discovery, but it is now used in a wide range of use cases within higher education, industry, not-for-profit organisations, as well as by the general public. Through the provided services, CORE powers innovative use cases, such as plagiarism detection, in market-leading third-party organisations. CORE has played a pivotal role in the global move towards universal open access by making scientific knowledge more easily and freely discoverable. In this paper, we describe CORE’s continuously growing dataset and the motivation behind its creation, present the challenges associated with systematically gathering research papers from thousands of data providers worldwide at scale, and introduce the novel solutions that were developed to overcome these challenges. The paper then provides an in-depth discussion of the services and tools built on top of the aggregated data and finally examines several use cases that have leveraged the CORE dataset and services.


Introduction

Scientific literature contains some of the most important information we have assembled as a species, such as how to treat diseases, solve difficult engineering problems, and answer many of the world’s challenges we are facing today. The entire body of scientific literature is growing at an enormous rate with an annual increase of more than 5 million articles (almost 7.2 million papers were published in 2022 according to Crossref, the largest Digital Object Identifier (DOI) registration agency). Furthermore, it was estimated that the amount of research published increases by about 10% annually [1]. At the same time, an ever-growing amount of research literature, which was estimated to be well over 1 million publications per year in 2015 [2], is being published as open access (OA), and can therefore be read and processed with limited or no copyright restrictions. As reading this knowledge is now beyond the capacities of any human being, text mining offers the potential not only to improve the way we access and analyse this knowledge [3], but also to lead to new scientific insights [4].

However, systematically gathering scientific literature to enable automated methods to process it at scale is a significant problem. Scientific literature is spread across thousands of publishers, repositories, journals, and databases, which often lack common data exchange protocols and other support for interoperability. Even when protocols are in place, the lack of infrastructure for collecting and processing this data, restrictive copyrights, and the fact that OA is not yet the default publishing route in most parts of the world further complicate the machine processing of scientific knowledge.

To alleviate these issues and support text and data mining of scientific literature we have developed CORE ( https://core.ac.uk/ ). CORE aggregates open access research papers from thousands of data providers from all over the world including institutional and subject repositories, open access and hybrid journals. CORE is the largest collection of OA literature–at the time of writing this article, it provides a single point of access to scientific literature collected from over ten thousand data providers worldwide and it is constantly growing. It provides a number of ways for accessing its data for both users and machines, including a free API and a complete dump of its data.

As of January 2023, there are 4,700 registered API users and 2,880 registered dataset users, and more than 70 institutions have registered to use the CORE Recommender in their repository systems.

The main contributions of this work are the development of CORE’s continuously growing dataset and the tools and services built on top of this corpus. In this paper, we describe the motivation behind the dataset’s creation and the challenges and methods of assembling it and keeping it continuously up-to-date. Overcoming the challenges posed by creating a collection of research papers of this scale required devising innovative solutions to harvesting and resource management. Our key innovations in this area which have contributed to the improvement of the process of aggregating research literature include:

Devising methods to extend the functionality of existing widely-adopted metadata exchange protocols which were not designed for content harvesting, to enable efficient harvesting of research papers’ full texts.

Developing a novel harvesting approach (referred to here as CHARS) which allows us to continuously utilise the available compute resources while providing improved horizontal scalability, recoverability, and reliability.

Designing an efficient algorithm for scheduling updates of harvested resources which optimises the recency of our data while effectively utilising the compute resources available to us.

This paper is organised as follows. First, in the remainder of this section, we present several use cases requiring large scale text and data mining of scientific literature, and explain the challenges in obtaining data for these tasks. Next, we present the data offered by CORE and our approach for systematically gathering full text open access articles from thousands of repositories and key scientific publishers.

Terminology

In digital libraries the term record is typically used to denote a digital object such as text, image, or video. In this paper and when referring to data in CORE, we use the term metadata record to refer to the metadata of a research publication, i.e. the title, authors, abstract, project funding details, etc., and the term full text record to describe a metadata record which has an associated full text.

We use the term data provider to refer to any database or a dataset from which we harvest records. Data providers harvested by CORE include disciplinary and institutional repositories, publishers and other databases.

When talking about open access (OA) to scientific literature, we refer to the Budapest Open Access Initiative (BOAI) definition which defines OA as “free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose” ( https://www.budapestopenaccessinitiative.org/read ). There are two routes to open access, 1) OA repositories and 2) OA journals. The first can be achieved by self-archiving (depositing) publications in repositories (green OA), and the latter by directly publishing articles in OA journals (gold OA).

Text and Data Mining of Scientific Literature

Text and data mining (TDM) is the discovery by a computer of new, previously unknown information, by automatically extracting information from different written resources ( http://bit.ly/jisc-textm ). The broad goal of TDM of scientific literature is to build tools that can retrieve useful information from digital documents, improve access to these documents, or use these documents to support scientific discovery. OA and TDM of scientific literature have one thing in common–they both aim to improve access to scientific knowledge for people. While OA aims to widen the availability of openly available research, TDM aims to improve our ability to discover, understand and interpret scientific knowledge.

TDM of scientific literature is being used in a growing number of applications, many of which were until recently not viable due to the difficulties associated with accessing the data from across many publishers and other data providers. Because many use cases involving text and data mining can only realise their full potential when they are executed on as large a corpus of research papers as possible, these data access difficulties have rendered many of the use cases described below very difficult to achieve. For example, to reliably detect plagiarism in newly submitted publications it is necessary to have access to an always up-to-date dataset of published literature spanning all disciplines. Based on data needs, scientific literature TDM use cases can be broadly categorised into the following two categories, which are shown in Fig. 1:

A priori defined sample use cases: Use cases which require access to a subset of scientific publications that can be specified prior to the execution of the use case. For example, gathering the list of all trialled treatments for a particular disease in the period 2000–2010 is a typical example of such a use case.

Undefined sample use cases: Use cases which cannot be completed using data samples that are defined a priori. The execution of such use cases might require access to data not known prior to the execution or may require access to all data available. Plagiarism detection is a typical example of such a use case.

Fig. 1: Example use cases of text and data mining of scientific literature. Depending on data needs, TDM uses can be categorised into a) a priori defined sample use cases, and b) undefined sample use cases. Furthermore, TDM use cases can broadly be categorised into 1) indirect applications which aim to improve access to and organisation of literature and 2) direct applications which focus on answering specific questions or gaining insights.

However, there are a number of factors that significantly complicate access to data for these applications. The needed data is often spread across many publishers, repositories, and other databases, often lacking interoperability (these factors will be further discussed in the next section). Consequently, researchers and developers working in these areas typically invest a considerable amount of time in corpus collection, which could be up to 90% of the total investigation time [5]. For many, this task can even prove impossible due to technical restrictions and limitations of publisher platforms, some of which will be discussed in the next section. There is therefore a need for a global, continuously updated, and downloadable dataset of full text publications to enable such analysis.

Challenges in machine access to scientific literature

Probably the largest obstacle to the effective and timely retrieval of relevant research literature is that it may be stored in a wide variety of locations with little to no interoperability: repositories of individual institutions, publisher databases, conference and journal websites, pre-print databases, and other locations, each of which typically offers different means for accessing their data. While repositories often implement a standard protocol for metadata harvesting, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), publishers typically allow access to their data through custom-made APIs, which are not standardised and are subject to changes [6]. Other data sources may provide static data dumps in a variety of formats or not offer programmatic access to their data at all.
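To make the standardised side of this contrast concrete, an OAI-PMH harvest is driven by plain HTTP requests carrying a `verb` argument. The following is a minimal sketch that only builds `ListRecords` request URLs. The repository base URL is a placeholder and the helper name `list_records_url` is invented for this illustration; it is not part of CORE.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: a follow-up
        # request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not a real repository.
BASE = "https://repository.example.org/oai"
first = list_records_url(BASE, from_date="2023-01-01")
follow_up = list_records_url(BASE, resumption_token="abc123")
print(first)
print(follow_up)
```

A publisher API, by contrast, would need its own request builder for each platform, which is the interoperability cost described above.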

However, even when publication metadata can be obtained, other steps involved in the data collection process complicate the creation of a final dataset suitable for TDM applications. For example, the identification of scientific publications within all downloaded documents, matching these publications correctly to the original publication metadata, and their conversion from formats used in publishing, such as the PDF format, into a textual representation suitable for text and data mining, are just some of the additional difficulties involved in this process. The typical minimum steps involved in this process are illustrated in Fig. 2. As there are no widely adopted solutions providing interoperability across different platforms, custom harvesting solutions need to be created for each.

Fig. 2: Example illustration of the data collection process. The figure depicts the typical minimum steps which are necessary to produce a dataset for TDM of scientific literature. Depending on the use case, tens or hundreds of different data sources may need to be accessed, each potentially requiring a different process, for example accessing a different set of API methods or a different process for downloading publication full text. Furthermore, depending on the use case, additional steps may be needed, such as extraction of references, identification of duplicate items or detection of the publication’s language. In the context of CORE, we provide the details of this process in the Methods section.

Challenges in systematically gathering open access research literature

Open access journals and repositories are increasingly becoming the central providers of open access content, in part thanks to the introduction of funder and institutional open access policies [7]. Open access repositories include institutional repositories such as the University of Cambridge Repository ( https://www.repository.cam.ac.uk/ ) and subject repositories such as arXiv ( https://arxiv.org/ ). As of February 2023, there are 6,015 open access repositories indexed in the Directory of Open Access Repositories (OpenDOAR, http://v2.sherpa.ac.uk/opendoar/ ), as well as 18,935 open access journals indexed in the Directory of Open Access Journals (DOAJ, https://doaj.org/ ). However, open access research literature can be stored in a wide variety of other locations, including publisher and conference websites, individual researcher websites, and elsewhere. Consequently, a system for harvesting open access content needs to be able to harvest effectively from thousands of data providers. Furthermore, a large number of open access repositories (69.4% of repositories indexed in OpenDOAR as of January 2018) expose their data through the OAI-PMH protocol while often not providing any alternatives. An open access harvesting system therefore also needs to be able to effectively utilise OAI-PMH for open access content harvesting. However, these two requirements (harvesting from thousands of data providers and utilising OAI-PMH for content harvesting) pose a number of significant scalability challenges.

Challenges related to harvesting from thousands of data providers

Open access data providers vary greatly in size, with some hosting millions of documents while others host a significantly lower number. New documents are added and old documents are often updated by data providers daily.

Different geographic locations and internet connection speeds may result in vastly differing times needed to harvest information from different providers, even when their size in terms of publication numbers is the same. As illustrated in Table 1, there are also a variety of OAI-PMH implementations across commonly used repository platforms providing significantly different harvesting performance. To construct this table, we analysed OAI-PMH metadata harvesting performances of 1,439 repositories in CORE, covering eight different repository platforms. It should be noted that the OAI-PMH protocol only requires metadata to be expressed in the Dublin Core (DC) format. However, it can also be extended to express the metadata in other formats. Because the Dublin Core standard is constrained to just 15 elements, it is not uncommon for OAI-PMH repositories to also use an extended metadata format such as Rioxx ( https://rioxx.net ) or the OpenAIRE Guidelines ( https://www.openaire.eu/openaire-guidelines-for-literature-institutional-and-thematic-repositories ).

Additionally, harvesting is limited not only by factors related to the data providers, but also by the compute resources (hardware) available to the aggregator. As many use cases listed in the Introduction, such as plagiarism detection or systematic review automation, require access to very recent data, ensuring that the harvested data stays recent and that the compute resources are utilised efficiently both pose significant challenges.
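To illustrate how little structure the 15 Dublin Core elements provide, the sketch below parses a single `oai_dc` record with the Python standard library. The sample record and its URLs are invented for this example, and `parse_dc` is a hypothetical helper, not CORE code.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by the oai_dc metadata format.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Invented sample record; a real one would arrive inside an OAI-PMH response.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example paper</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2020-05-01</dc:date>
  <dc:identifier>https://repository.example.org/handle/123/456</dc:identifier>
  <dc:identifier>https://repository.example.org/bitstream/123/456/1/paper.pdf</dc:identifier>
</oai_dc:dc>"""


def parse_dc(xml_text):
    """Return a dict mapping each present DC element to its list of values."""
    root = ET.fromstring(xml_text)
    record = {}
    for name in DC_ELEMENTS:
        values = [el.text.strip() for el in root.findall(f"dc:{name}", NS) if el.text]
        if values:
            record[name] = values
    return record


record = parse_dc(SAMPLE)
print(record["title"], record["identifier"])
```

Every element is repeatable and untyped, so nothing in the record says which of the two `dc:identifier` values is a landing page and which is the full text. Extended formats such as Rioxx aim to reduce this ambiguity.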

To overcome these challenges, we designed the CORE Harvesting System (CHARS), which relies on two key principles. The first is the application of microservices software principles to open access content harvesting [8]. The second is a strategy we call pro-active harvesting, which means that providers are scheduled automatically according to current need. This strategy is implemented in the harvesting Scheduler (Section CHARS architecture). The Scheduler uses a formula we designed for prioritising data providers.

The combination of the Scheduler with the CHARS microservices architecture enables us to schedule harvesting according to current compute resource utilisation, thus greatly increasing our harvesting efficiency. Since switching from a fixed-schedule approach to pro-active harvesting, we have been able to greatly improve the data recency of our collection as well as to increase the size of the collection threefold within the span of three years.
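The general idea of scheduling by current need, rather than on a fixed timetable, can be sketched as follows. This is only an illustration: the Scheduler's actual prioritisation formula is not given in this section, so the `priority` function, the `Provider` fields, and the numbers below are all invented stand-ins.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    days_since_harvest: float      # staleness of our copy
    expected_new_records: int      # rough estimate of changes to pick up
    expected_harvest_hours: float  # rough compute cost of a harvest


def priority(p):
    """Hypothetical score: expected gain, weighted by staleness, per hour of compute.

    This is NOT the CORE formula, only a placeholder with a plausible shape.
    """
    return p.days_since_harvest * p.expected_new_records / max(p.expected_harvest_hours, 0.1)


def schedule(providers, free_workers):
    """Pick the highest-priority providers for the workers that are currently idle."""
    return [p.name for p in heapq.nlargest(free_workers, providers, key=priority)]


providers = [
    Provider("A", days_since_harvest=10, expected_new_records=100, expected_harvest_hours=1.0),
    Provider("B", days_since_harvest=1, expected_new_records=5000, expected_harvest_hours=10.0),
    Provider("C", days_since_harvest=30, expected_new_records=10, expected_harvest_hours=0.5),
]
print(schedule(providers, free_workers=2))
```

Because the choice is recomputed whenever a worker becomes free, the schedule follows actual resource utilisation instead of a calendar.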

Challenges related to the use of OAI-PMH protocol for content harvesting

As explained above, OAI-PMH is currently the standard method for exchanging data across repositories. While the OAI-PMH protocol was originally designed for metadata harvesting only, it has been, due to its wide adoption and lack of alternatives, used as an entry point for full text harvesting. Full text harvesting is achieved by extracting URLs from the metadata records collected through OAI-PMH; the extracted URLs are then used to discover the location of the actual resource [9]. However, there are a number of limitations of the OAI-PMH protocol which make it unsuitable for large-scale content harvesting:

It directly supports only metadata harvesting, meaning additional functionality has to be implemented in order to use it for content harvesting.

The location of full text links in the OAI-PMH metadata is not standardised and the OAI-PMH metadata records typically contain multiple links. From the metadata it is not clear which of these links points to the described representation of the resource and in many cases none of them does so directly. Therefore, all possible links to the resource itself have to be extracted from the metadata and tested to identify the correct resource. Furthermore, OAI-PMH does not facilitate any validation for ensuring the discovered resource is truly the described resource. In order to overcome these issues, the adoption of the RIOXX ( https://rioxx.net/ ) metadata format or the OpenAIRE guidelines ( https://guidelines.openaire.eu/ ) has been promoted. However, the issue of unambiguously connecting metadata records and the described resource is still present.

The architecture of the OAI-PMH protocol is inherently sequential, which makes it ill-suited for harvesting from very large repositories. This is because the processing of large repositories cannot be parallelised and it is not possible to recover the harvesting in case of failures.

Scalability across different implementations of OAI-PMH differs dramatically. Our analysis (Table 1) shows that performance can differ significantly even when only a single repository platform is considered [10].

Other limitations include difficulties in incremental harvesting, reliability issues, metadata interoperability issues, and scalability issues [11].
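Two of the limitations listed above can be sketched in a few lines. The first function shows why an OAI-PMH harvest is sequential: each page can only be requested with the resumption token returned by the previous one. The second shows the kind of guesswork needed to pick a full text link out of a record. Both are simplified illustrations; `fake_fetch`, the page contents, and the ".pdf first" heuristic are assumptions made for this example and do not describe CORE's implementation.

```python
def harvest_all(fetch_page):
    """Walk a paged list: fetch_page(token) -> (records, next_token or None)."""
    records, token = [], None
    while True:
        # The next request cannot be issued until this response has been
        # received, so the pages of one repository cannot be fetched in parallel.
        page, token = fetch_page(token)
        records.extend(page)
        if token is None:
            return records


# In-memory stand-in for a repository; no network access is involved.
PAGES = {
    None: (["rec1", "rec2"], "t1"),
    "t1": (["rec3"], "t2"),
    "t2": (["rec4"], None),
}


def fake_fetch(token):
    return PAGES[token]


def candidate_fulltext_urls(identifiers):
    """Return web links from a record, trying PDF-looking ones first."""
    urls = [i for i in identifiers if i.startswith(("http://", "https://"))]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(urls, key=lambda u: not u.lower().endswith(".pdf"))


all_records = harvest_all(fake_fetch)
candidates = candidate_fulltext_urls([
    "doi:10.1234/example",
    "https://repository.example.org/handle/1",
    "https://repository.example.org/files/paper.pdf",
])
print(all_records)
print(candidates)
```

Each candidate link would still have to be downloaded and checked, since the protocol offers no way to confirm that a link leads to the described resource.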

We have designed solutions to overcome a number of these issues, which have enabled us to efficiently and effectively utilise OAI-PMH to harvest open access content from repositories. We present these solutions in the section Using OAI-PMH for content harvesting. While we currently rely on a variety of solutions and workarounds to enable content harvesting through OAI-PMH, most of the limitations listed in this section could also be addressed by adopting more sophisticated data exchange protocols, such as the ResourceSync ( http://www.openarchives.org/rs/1.1/resourcesync ) protocol, which was designed with content harvesting in mind [10], and by their adoption in the systems of the data providers we support.

Our solution

In the above sections we have highlighted a critical need of many researchers and organisations globally for large-scale, always up-to-date, seamless machine access to scientific literature originating from thousands of data providers at full text level. Providing this seamless access has become both a defining goal and a feature of CORE and has enabled other researchers to design and test innovative methods on CORE data, often powered by artificial intelligence processes. In order to put together this vast continuously updated dataset, we had to overcome a number of research challenges, such as those related to the lack of interoperability, scalability, regular content synchronisation, content redundancy and inconsistency. Our key innovation in this area is the improvement of the process of aggregating research literature, as specified in the Introduction section.

This underpinning research has allowed CORE to become a leading provider of open access papers. The amount of data made available by CORE has been growing since 2011 [12] and is continuously kept up to date. As of February 2023, CORE provides access to over 291 million metadata records and 32.8 million full text open access articles, making it the world’s largest archive of open access research papers, significantly larger than PubMed, arXiv and JSTOR datasets.

Whilst there are other publication databases that could be initially viewed as similar to CORE, such as BASE or Unpaywall, we will demonstrate the significant differences that set CORE apart and show how it provides access to a unique, harmonised corpus of Open Access literature. A major difference from these existing services is that CORE is completely free to use for the end user, hosts full text content, and offers several methods for accessing its data for machine processing. Consequently, it removes the need to harvest and pre-process full text for text mining, since CORE provides plain text access to the full texts via its raw data services, eliminating the need for text and data miners to work on PDF formats. A detailed comparison with other publication databases is provided in the Discussion. In addition, CORE enables building powerful services on top of the collected full texts, supporting all the categories of use cases outlined in the Use cases section.

As of today, CORE provides three services for accessing its raw data: API, dataset, and a FastSync service. The CORE API provides real-time machine access to both metadata and full texts of research papers. It is intended for building applications that need reliable access to a fraction of CORE data at any time. CORE provides a RESTful API. Users can register for an API key to access the service. Full documentation and Python notebooks containing code examples can be found on the CORE documentation pages online ( https://api.core.ac.uk/docs/v3 ). The CORE Dataset can be used to download CORE data in bulk. Finally, CORE FastSync enables third party systems to keep an always up to date copy of all CORE data within their infrastructure. Content can be transferred as soon as it becomes available in CORE using a data synchronisation service on top of the ResourceSync protocol [13], optimised by us for improved synchronisation scalability with an on-demand resource dumps capability. CORE FastSync provides fast, incremental and enterprise data synchronisation.
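As a rough sketch of what a call to such a RESTful API can look like, the code below prepares a keyword search request using only the standard library. The endpoint path `/search/works`, the `q` and `limit` parameters, and the bearer-token header are assumptions based on our reading of the v3 API and should be checked against the documentation pages cited above; `build_search_request` and `search_works` are illustrative helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of the v3 API; verify against the official documentation.
API_BASE = "https://api.core.ac.uk/v3"


def build_search_request(query, api_key, limit=10):
    """Prepare (but do not send) a search request."""
    params = urlencode({"q": query, "limit": limit})
    return Request(
        f"{API_BASE}/search/works?{params}",        # assumed endpoint path
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )


def search_works(query, api_key, limit=10):
    """Send the request and decode the JSON response (requires network access)."""
    with urlopen(build_search_request(query, api_key, limit), timeout=30) as resp:
        return json.load(resp)


# Building the request needs no network; "YOUR_API_KEY" is a placeholder.
req = build_search_request("text mining", "YOUR_API_KEY", limit=5)
print(req.full_url)
```

Separating request construction from sending makes the sketch easy to inspect without an API key. For bulk work, the dataset dump or FastSync described above is the more appropriate route than many API calls.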

CORE is the largest up-to-date full text open access dataset as well as one of the most widely used services worldwide supporting access to freely available research literature. CORE regularly releases data dumps licensed as ODC-By, making the data freely available for both commercial and non-commercial purposes. Access to CORE data via the API is provided freely to individuals conducting work in their own personal capacity and to public research organisations for unfunded research purposes. CORE offers licenses to commercial organisations wanting to use CORE services to obtain a convenient way of accessing CORE data with a guaranteed level of service support. CORE is operated as a not-for-profit entity by The Open University and this business model makes it possible for CORE to remain free for more than 99.99% of its users.

A large number of commercial organisations have benefited from these licenses in areas as diverse as plagiarism detection in research, building specialised scholarly publication search engines, developing scientific assistants and machine translation systems, and supporting education ( https://core.ac.uk/about/endorsements/partner-projects ). The CORE data services, the CORE API and Dataset, have been used by over 7,000 experts to analyse data, develop text-mining applications and embed CORE into existing production systems.

Additionally, more than 70 repository systems have registered to use the CORE Recommender and the service is notably used by prestigious institutions, including the University of Cambridge, and by popular pre-print services such as arXiv.org. Other CORE services are CORE Discovery and the CORE Repository Dashboard. The first was released in July 2019 and at the time of writing it has more than 5,000 users. The latter is a tool designed specifically for repository managers which provides access to a range of tools for managing the content within their repositories. The CORE Repository Dashboard is currently used by 499 users from 36 countries.

In the rest of this paper we describe the CORE dataset and the methods of assembling it and keeping it continuously up-to-date. We also present the services and tools built on top of the aggregated corpus and provide several examples of how the CORE dataset has been used to create real-world applications addressing specific use-cases.

As highlighted in the Introduction, CORE is a continuously growing dataset of scientific publications for both human and machine processing. As we will show in this section, it is a global dataset spanning all disciplines and containing publications aggregated from more than ten thousand data providers including disciplinary and institutional repositories, publishers, and other databases. To improve access to the collected publications, CORE performs a number of data enrichment steps. These include metadata and full text extraction, language and DOI detection, and linking with other databases. Furthermore, CORE provides a number of services which are built on top of the data: a publications recommender ( https://core.ac.uk/services/recommender/ ), CORE Discovery service ( https://core.ac.uk/services/discovery/ ) (a tool for discovering OA versions of scientific publications), and a dashboard for repository managers ( https://core.ac.uk/services/repository-dashboard/ ).

Dataset size

As of February 2023, CORE is the world’s largest dataset of open access papers (comparison with other systems is provided in the Discussion). CORE hosts over 291 million metadata records including over 34 million articles with full text written in 82 languages and aggregated from over ten thousand data providers located in 150 countries. Full details of CORE Dataset size are presented in Table 2. In the table, “Metadata records” represent all valid (not retracted, deleted, or for some other reason withdrawn) records in CORE. It can be seen that about 13% of records in CORE contain full text. This number represents records for which a manuscript was successfully downloaded and converted to plain text. However, a much higher proportion of records contains links to additional freely available full text articles hosted by third-party providers. Based on analysing a subset of our data, we estimate that about 48% of metadata records in CORE fall into this category, indicating that CORE is likely to contain links to open access full texts for 139 million articles. Due to the nature of academic publishing there will be instances where multiple versions of the same paper are deposited in different repositories. For example, an early version of an article can be deposited by an author to a pre-print server such as arXiv or bioRxiv and then a later version uploaded to an institutional repository. Identifying and matching these different versions is a significant undertaking. CORE has carried out research to develop techniques based on locality sensitive hashing for duplicate identification [8] and integrated these into its ingestion pipeline to link versions of papers from across the network of OA repositories and group these under a single works entity. The large number of records in CORE translates directly into the size of the dataset in bytes, as the uncompressed version of the dataset including PDFs is about 100 TB. The compressed version of the CORE dataset with plain texts only amounts to 393 GB, and the uncompressed version to 3.5 TB.
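The idea behind locality sensitive hashing for duplicate identification can be illustrated with a small MinHash example. This is a generic textbook-style sketch, not the technique as implemented in CORE's pipeline: the shingle size, the number of hash functions and bands, and all function names are assumptions chosen for brevity.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lower-cased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(shingle, seed):
    """Seeded 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value under each hash function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def band_keys(signature, bands=16):
    """Split a signature into bands; each band is one LSH bucket key."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


def are_candidate_duplicates(text_a, text_b):
    """Two texts are candidates if they share at least one LSH bucket."""
    keys_a = band_keys(minhash(shingles(text_a)))
    keys_b = band_keys(minhash(shingles(text_b)))
    return bool(keys_a & keys_b)


preprint = "Open access papers are freely available online"
deposited = "OPEN ACCESS papers are freely available online"
unrelated = "completely unrelated text about marine biology research"
print(are_candidate_duplicates(preprint, deposited))
print(are_candidate_duplicates(preprint, unrelated))
```

In a real pipeline the bucket keys would be stored in an index, so that candidate versions of a paper are found by lookup rather than by comparing every pair of documents.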

Recent studies have estimated that around 24%–28% of all articles are available free to read [2, 14]. There are a number of reasons why the proportion of full text content in CORE is lower than these estimates. The main reason is likely that a significant proportion of the free to read articles represents content hosted on platforms with many restrictions on machine accessibility, i.e. some repositories severely restrict or fully prohibit content harvesting [9].

The growth of CORE has been made possible thanks to the introduction of a novel harvesting system and the creation of an efficient harvesting scheduler, both of which are described in the Methods section. The growth of metadata and full text records in CORE is shown in Fig. 3. Finally, Fig. 4 shows the age of publications in CORE.

Fig. 3: Growth of records in CORE per month since February 2012. “Full text growth” represents growth of records containing full text, while “Metadata growth” represents growth of records without full text, i.e. the two numbers do not overlap. The two area plots are stacked on top of each other; their sum therefore represents the total number of records in CORE.

Fig. 4: Age of publications in CORE. As in Fig. 3, the “Metadata” and “Full text” record bars are stacked on top of each other.

Data sources and languages

As of February 2023, CORE was aggregating content from 10,744 data sources. These data sources include institutional repositories (for example the USC Digital Library or the University of Michigan Library Repository), academic publishers (Elsevier, Springer), open access journals (PLOS), subject repositories, including those hosting eprints (arXiv, bioRxiv, ZENODO, PubMed Central) and aggregators (e.g. DOAJ). The ten largest data sources in CORE are shown in Table 3 . To calculate the total number of data providers in CORE, we consider aggregators and publishers as one data source despite each aggregating data from multiple sources. A full list of all data providers can be found on the CORE website. ( https://core.ac.uk/data-providers ).

The data providers aggregated by CORE are located in 150 different countries. Figure 5 shows the top ten countries in terms of number of data providers aggregated by CORE from each country alongside the top ten languages. The geographic spread of repositories is largely reflective of the size of the research economy in those countries. We see the US, Japan, Germany, Brazil and the UK all in the top six. One result that at first may appear surprising is the significant number of repositories in Indonesia, enough to place it at the top of the list. An article in Nature in 2019 showed that Indonesia may be the world’s OA leader, finding that 81% of 20,000 journal articles published in 2017 with an Indonesia-affiliated author are available to read for free somewhere online ( https://www.nature.com/articles/d41586-019-01536-5 ). Additionally, there are a large number of Indonesian open-access journals registered with Crossref. This subsequently leads to a much higher number of individual repositories in this country.

Top ten languages and top ten provider locations in CORE.

As part of the enrichment process, CORE performs language detection. Language is either extracted from the attached metadata where available or identified automatically from full text in case it is not available in metadata. More than 80% of all documents with language information are in English. Overall, CORE contains publications in a variety of languages, the top 10 of which are shown in Fig. 5 .

Document types

The CORE dataset comprises a collection of documents gathered from various sources, many of which contain articles of different types. Consequently, aside from research articles from journals and conferences, it includes other types of research outputs such as research theses, presentations, and technical reports. To distinguish different types of articles, CORE has implemented a method of automatically classifying documents into one of the following four categories 15 : (1) research article, (2) thesis, (3) presentation, (4) unknown (for articles not belonging to any of the previous three categories). This method is based on a supervised machine learning model trained on article full texts. Figure 6 shows the distribution of articles in CORE into these four categories. It can be seen that the collection aggregated by CORE consists predominantly of research articles. We have observed in the data collected from repositories that the vast majority of research theses deposited in repositories have full text associated with the metadata. As this is not always the case for research articles, and as Fig. 6 is produced on articles with full text only, we expect that the proportion of research articles compared to research theses in CORE is actually higher across the entire collection.

Distribution of document types.

Research disciplines

To analyse the distribution of disciplines in CORE we have leveraged a third-party service. Figure 7 shows a subject distribution of a sample of 20,758,666 publications in CORE. For publications with multiple subjects we count the publication towards each discipline.

Subject distribution of a sample of 20,758,666 CORE publications.

The subject for each article was obtained using Microsoft Academic ( https://academic.microsoft.com/home ) prior to its retirement in November 2021. Our results are consistent with other studies, which have reported Biology, Medicine, and Physics to be the largest disciplines in terms of number of publications 16 , 17 , suggesting that the distribution of articles in CORE is representative of research publications in general.
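The counting rule behind Fig. 7, where a publication with multiple subjects counts towards each of its disciplines, can be illustrated with a minimal sketch. The function name and the sample data below are hypothetical and are not taken from the CORE codebase.

```python
from collections import Counter


def subject_distribution(publications):
    """Count each publication once for every subject it is tagged with."""
    counts = Counter()
    for subjects in publications:
        # set() guards against one paper listing the same subject twice
        counts.update(set(subjects))
    return counts


# Hypothetical sample: three publications with made-up subject labels.
sample = [["Biology", "Medicine"], ["Physics"], ["Medicine"]]
distribution = subject_distribution(sample)
```

Under this rule the per-discipline counts can sum to more than the number of publications in the sample, which is worth keeping in mind when reading the distribution.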

Additional CORE Tools and Services

CORE has built several additional tools for a range of stakeholders including institutions, repository managers and researchers from across all scientific domains. Details of the usage of these services are covered in the Uptake of CORE section.

The Dashboard provides a suite of tools for repository management, content enrichment, metadata quality assessment and open access compliance checking. Further, it can provide statistics regarding content downloads and suggestions for improving the efficiency of harvesting and the quality of metadata.

CORE Discovery helps users to discover freely accessible copies of research papers. There are several methods for interacting with the Discovery tool. First, it can be used as a plugin for repositories, enriching metadata-only pages in repositories with links to open access copies of full text documents. Second, it is available as a browser extension for researchers and anyone interested in reading scientific documents. Finally, it is offered as an API service for developers.

Recommender

The recommender is a plugin for repositories, journal systems and web interfaces that provides suggestions on relevant articles to the one currently displayed. Its purpose is to support users in discovering articles of interest from across the network of open access repositories. It is notably used by prestigious institutions, including the University of Cambridge and by popular pre-prints services such as arXiv.org.

OAI Resolver

An OAI (Open Archives Initiative) identifier is a unique identifier of a metadata record. OAI identifiers are used in the context of repositories using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI Identifiers are viable persistent identifiers for repositories that can be, as opposed to DOIs, minted in a distributed fashion and cost-free, and which can be resolvable directly to the repository rather than to the publisher. The CORE OAI Resolver can resolve any OAI identifier to either a metadata page of the record in CORE or route it directly to the relevant repository page. This approach has the potential to increase the importance of repositories in the process of disseminating knowledge.
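To make the routing idea concrete, the following minimal sketch parses an identifier of the form oai:namespace:local-id and maps it either to a repository page or to a fallback metadata page. It is not the CORE OAI Resolver: the routing table, the URL templates and the example.org and aggregator.example hosts are all hypothetical.

```python
def parse_oai_identifier(identifier):
    """Split 'oai:<namespace>:<local-id>' into its namespace and local parts."""
    scheme, _, rest = identifier.partition(":")
    namespace, _, local_id = rest.partition(":")
    if scheme.lower() != "oai" or not namespace or not local_id:
        raise ValueError(f"not an OAI identifier: {identifier!r}")
    return namespace, local_id


# Hypothetical routing table: namespace -> URL template of the repository page.
REPOSITORY_PAGES = {
    "example.org": "https://example.org/record/{local_id}",
}


def resolve(identifier, fallback="https://aggregator.example/oai/{identifier}"):
    """Route to the repository page if known, else to a fallback metadata page."""
    namespace, local_id = parse_oai_identifier(identifier)
    template = REPOSITORY_PAGES.get(namespace)
    if template is None:
        return fallback.format(identifier=identifier)
    return template.format(local_id=local_id)
```

Because the namespace is part of the identifier itself, a routing table of this kind can be maintained per repository without any central minting authority, which is the property the paragraph above contrasts with DOIs.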

Uptake of CORE

As of February 2023, CORE averages over 40 million monthly active users and is the top 10th website in the category Science and Education according to SimilarWeb ( https://www.similarweb.com/ ). There are currently 4,700 registered API users and 2,880 registered dataset users. The CORE Dashboard is currently used by 499 institutional repositories to manage their open access content, monitor content download statistics, manage issues with metadata within the repository and ensure compliance with OA funder policies, notably REF in the U.K. The CORE Discovery plugin has been integrated into 434 repositories and the browser extension has been downloaded by more than 5,000 users via the Google Chrome Web Store ( https://chrome.google.com/webstore/category/extensions ). The CORE Recommender has been embedded in 70 repository systems including the University of Cambridge and arXiv.

In this section we discuss differences between CORE and other open access aggregation services and present several real-world use cases where CORE was used to develop services to support science. We also present our future plans.

Existing open access aggregation services

Currently there are a number of open access aggregation services available (Table 4 ), with some examples being BASE ( https://base-search.net/ ), OpenAIRE ( https://www.openaire.eu/ ), Unpaywall ( http://unpaywall.org/ ) and Paperity ( https://paperity.org/ ). BASE (Bielefeld Academic Search Engine) is a global metadata harvesting service. It harvests repositories and journals via OAI-PMH and exposes the harvested content through an API and a dataset. OpenAIRE is a network of open access data providers who support open access policies. Even though in the past the project focused on European repositories, it has recently expanded by including institutional and subject repositories from outside Europe. A key focus of OpenAIRE is to assist the European Council to monitor compliance of its open access policies. OpenAIRE data is exposed via an API. Paperity is a service which harvests publications from open access journals. Paperity harvests both metadata and full text but does not host full texts. SHARE (Shared Access Research Ecosystem) is a harvester of open access content from US repositories. Its aim is to assist with the White House Office of Science and Technology Policy (OSTP) open access policies compliance. Even though SHARE harvests both metadata and full text it does not host the latter. Unpaywall is not primarily a harvester, but rather collects content from Crossref, whenever a free to read available version can be retrieved. It processes both metadata and full text but does not host them. It exposes the discovered links to documents through an API.

CORE differs from these services in a number of ways. CORE is currently the largest database of full text OA documents. In addition, CORE offers via its API a rich metadata record for each item in its collection which includes additional enrichments, in contrast, for example, to Unpaywall’s API, which focuses only on informing the user whether a free to read version is available. CORE also provides the largest number of links to OA content. To simplify access to data for end users it provides a number of ways for accessing its collection. All of the above services are free to use for research purposes; however, both CORE and Unpaywall also offer services to commercial partners on a paid-for basis.

Existing publication databases

Apart from OA aggregation services, a number of other services exist for searching and downloading scientific literature (Table 5 ). One of the main publication databases is Crossref ( https://www.crossref.org/ ), an authoritative index of DOI identifiers. Its primary function is to maintain metadata information associated with each DOI. The metadata stored by Crossref includes both OA and non-OA records. Crossref does not store publication full text, but for many publications provides full text links. As of February 2023, 5.9 million records in Crossref were associated with an explicit Creative Commons license (we have used the Crossref API to determine this number). Although Crossref provides an API, it does not offer its data for download in bulk, or provide a data sync service.

The remaining services from Table 5 can be roughly grouped into the following two categories: 1) citation indices, 2) academic search engines and scholarly graphs. The two major citation indices are Elsevier’s Scopus ( https://www.elsevier.com/solutions/scopus ) and Clarivate’s Web of Science ( https://clarivate.com/webofsciencegroup/solutions/web-of-science/ ), both of which are premium subscription services. Google Scholar, the best known academic search engine, does not provide an API for accessing its data and does not permit crawling its website. Semantic Scholar ( https://www.semanticscholar.org/ ) is a relatively new academic search service which aims to create an “intelligent academic search engine” 18 . Dimensions ( https://www.dimensions.ai/ ) is a service focused on data analysis. It integrates publications, grants, policy documents, and metrics. 1findr ( https://1findr.1science.com/home ) is a curated abstract indexing service. It provides links to full text, but no API or dataset for download.

The added value of CORE

There are other services that claim to provide access to a large dataset of open access papers. In particular, Unpaywall 2 claims to provide access to 46.4 million free to read articles, and BASE states that it provides access to full texts of about 60% of its 300 million metadata records. However, these statistics are not directly comparable to the numbers we report and are a product of a different focus of these two projects. This is because both BASE and Unpaywall define “providing access to” in terms of having a list of URLs from which a human user can navigate to the full text of the resource. This means that both Unpaywall and BASE do not collect these full text resources, which is also why they do not face many of the challenges we described in the Introduction. Using this approach, we could say that the CORE Dataset provides access to approximately 139 million full texts, i.e. about 48% of our 291 million metadata records point to a URL from which a human can navigate to the full text. However, to people concerned with text and data mining of scientific literature, it makes little sense to count URLs pointing to many different domains on the Web as the number of full texts made available.

As a result, our 32.8 million statistic refers to the number of OA documents we identified, downloaded, extracted text from, validated their relationship to the metadata record and the full texts of which we host on the CORE servers and make available to others. In contrast, BASE and Unpaywall do not aggregate the full texts of the resources they provide access to and consequently do not offer the means to interact with the full texts of these resources or offer bulk download capability of these resources for text analytics over scholarly literature.

We have also integrated CORE data with the OpenMinTeD infrastructure, a European Commission funded project which aimed to provide a platform for text mining of scholarly literature in the cloud 6 .

A number of academia and industry partners have utilised CORE in their services. In this section we present three existing uses of CORE demonstrating how CORE can be utilised to support text and data mining use cases.

Since 2017, CORE has been collaborating with a range of scholarly search and discovery systems. These include Naver ( https://naver.com/ ), Lean Library ( https://www.leanlibrary.com/ ) and Ontochem ( https://ontochem.com/ ). As part of this work, CORE serves as a provider of full text copies of research papers to existing records in these systems (Lean Library) or even supplies both metadata and full texts for indexing (Ontochem, NAVER). This collaboration also benefits CORE’s data providers as it expands and increases the visibility of their content.

In 2019, CORE entered into a collaboration with Turnitin, a global leader in plagiarism detection software. By using the CORE FastSync service, Turnitin’s proprietary web crawler searches through CORE’s global database of open access content and metadata to check for text similarity. This partnership enables Turnitin to significantly enlarge its content database in a fast and efficient manner. In turn, it also helps protect open access content from misuse, thus protecting authors and institutions.

As of February 2023, CORE Recommender 19 is actively running in over 70 repositories including the University of Cambridge institutional repository and arXiv.org among others. The purpose of the recommender is to improve the discoverability of research outputs by providing suggestions for similar research papers both within the collection of the hosting repository and the CORE collection. Repository managers can install the recommender to advance the accessibility of other scientific papers and outreach to other scientific communities, since the CORE Recommender acts as a gate to millions of open access research papers. The recommender is integrated with the CORE search functionality and is also offered as a plugin for all repository software, for example EPrints, DSpace, etc. as well as open access journals and any other webpage. Based on the fact that CORE harvests open repositories, the recommender only displays research articles where the full text is available as open access, i.e. for immediate use, without access barriers or limited rights’ restrictions. Through the recommender, CORE promotes the widest discoverability and distribution of the open access scientific papers.

Future work

An ongoing goal of CORE is to keep growing the collection to become a single point of access to all of world’s open access research. However, there are a number of other ways we are planning to improve both the size and ease of access to the collection. The CORE Harvesting System was designed to enable adding new harvesting steps and enrichment tasks. There remains scope for adding more of such enrichments. Some of these are machine learning powered, such as classification of scientific citations 20 . Further, CORE is currently developing new methodologies to identify and link different versions of the same article. The proposed system, titled CORE Works, will leverage CORE’s central position in the OA infrastructure landscape and will link different versions of the same paper using a unique identifier. We will continue to keep linking the CORE collection to scholarly entities from other services, thereby making CORE data participate in a global scholarly knowledge graph.

In the Introduction section we focused on a number of challenges researchers face when collecting research literature for text and data mining. In this section, we instead focus on the perspective of a research literature aggregator, i.e. a system whose goal is to continuously provide seamless access to research literature aggregated from thousands of data providers worldwide in a way that enables the resulting research publication collection to be used by others in production applications. We describe the challenges we had to overcome to build this collection and to keep it continuously up-to-date, and present the key technical innovations which allowed us to greatly increase the size of the CORE collection and become a leading provider of open access literature, which we illustrate using our content growth statistics.

CORE Harvesting system (CHARS)

CORE Harvesting System (CHARS) is the backbone of our harvesting process. CHARS uses the Harvesting Scheduler (Section CHARS architecture) to select data providers to be processed next. It manages all the running processes (tasks) and ensures the available compute resources are well utilised.

Prior to implementing CHARS, CORE was centralised around data providers rather than around individual tasks needed to harvest and process these data providers (e.g. metadata download and parsing, full text download, etc.). Consequently, even though the scaling up and the continuation of this system was possible, the infrastructure was not horizontally scalable and the architecture suffered from tight coupling of services. This was not consistent with CORE’s high availability requirements and was regularly causing problems in the complexity of maintenance. In response to these challenges, we designed CHARS using a microservices architecture, i.e. using small manageable autonomous components that work together as part of a larger infrastructure 21 . One of the key benefits of microservices-oriented architecture is that the implementation focus can be put on the individual components which can be improved and redeployed as frequently as needed and independently of the rest of the infrastructure. As the process of open access content harvesting can be inherently split into individual consecutive tasks, a microservices-oriented architecture presents a natural fit for aggregation systems like CHARS.

Tasks involved in open access content harvesting

The harvesting process can be described as a pipeline where each task performs a certain action and where the output of each task feeds into the next task. The input to this pipeline is a set of data providers and the final output is a system populated with records of research papers available from them. The main types of key tasks currently performed as part of CORE’s harvesting system are (Fig. 8 ):

Metadata download: The metadata exposed by a data provider via OAI-PMH are downloaded and stored in the file system (typically as an XML). The downloading process is sequential, i.e. a repository typically provides between 100 and 1,000 metadata records per request together with a resumption token. This token is then used to request the next batch. As a result, full harvesting can take a significant amount of time (hours to days) for large data providers. Therefore, this process has been implemented to provide resilience to a range of communication failures.

Metadata extraction : Metadata extraction parses, cleans, and harmonises the downloaded metadata and stores them into the CORE internal data structure (database). The harmonisation and cleaning process addresses the fact that different data providers/repository platforms describe the same information in different ways (syntactic heterogeneity) as well as having different interpretations for the same information (semantic heterogeneity).

Full text download : Using links extracted from the metadata CORE attempts to download and store publication manuscripts. This process is non-trivial and is further described in the Using OAI-PMH for content harvesting section.

Information extraction : Plain text from the downloaded manuscripts is extracted and processed to create a semi-structured representation. This process includes a range of information extraction tasks, such as references extraction.

Enrichment : The enrichment task works by augmenting both the metadata and the full text harvested from the data providers with additional data from multiple sources. Some of the enrichments are performed directly by specific tasks in the pipeline such as language detection and document type detection. The remaining enrichments that involve external datasets are performed externally and independently of the CHARS pipeline and ingested into the dataset as described in the Enrichments section.

Indexing : The final step in the harvesting pipeline is indexing the harvested data. The resulting index powers CORE’s services, including search, API and FastSync.

CORE Harvesting Pipeline. Each task’s output produces the input for the following task. In some cases the input is considered as a whole, for example all the content harvested from a data provider, while in other cases, the output is split into multiple small tasks performed on a record level.
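The sequential metadata download described in the first task above can be sketched as a loop that follows OAI-PMH resumption tokens. This is an illustrative sketch rather than the CHARS implementation: the `fetch` callable, the `max_retries` parameter and the retry policy are assumptions introduced here so that the example runs without network access.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest_records(fetch, metadata_prefix="oai_dc", max_retries=3):
    """Yield <record> elements batch by batch, following resumption tokens.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    takes a dict of OAI-PMH query parameters and returns the response XML.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        for attempt in range(max_retries):
            try:
                root = ET.fromstring(fetch(params))
                break
            except (OSError, ET.ParseError):
                # Retry transient communication or parsing failures.
                if attempt == max_retries - 1:
                    raise
        yield from root.iterfind(".//oai:record", OAI_NS)
        token = root.find(".//oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            return  # an absent or empty token marks the last batch
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Since each batch can only be requested once the previous response has delivered its token, the requests to a single provider cannot be parallelised, which is why large providers can take hours to days to harvest in full.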

Scalable infrastructure requirements

Based on the experience obtained while developing and maintaining our harvesting system as well as taking into consideration the features of the CiteSeerX 22 architecture, we have defined a set of requirements for a scalable harvesting infrastructure 8 . These requirements are generic and apply to any aggregation or digital library scenario. These requirements informed and are reflected in the architecture design of CHARS (Section CHARS architecture):

Easy to maintain: The system should be easy to manage, maintain, fix, and improve.

High levels of automation: The system should be completely autonomous while allowing manual interaction.

Fail fast: Items in the harvesting pipeline should be validated immediately after a task is performed, instead of having only one and final validation at the end of the pipeline. This has the benefit of recognising issues and enabling fixes earlier in the process.

Easy to troubleshoot: Possible code bugs should be easily discerned.

Distributed and scalable: The addition of more compute resources should allow scalability, be transparent and replicable.

No single point of failure: A single crash should not affect the whole harvesting pipeline, individual tasks should work independently.

Decoupled from user-facing systems: Any failure in the ingestion processing services should not have an immediate impact on user-facing services.

Recoverable: When a harvesting task stops, either manually or due to a failure, the system should be able to recover and resume the task without manual intervention.

Performance observable: The system’s progress must be properly logged at all times and overlay monitoring services should be set up to provide a transparent overview of the services’ progress at all times, to allow early detection of scalability problems and identification of potential bottlenecks.

CHARS architecture

An overview of CHARS is shown in Fig. 9 . The system consists of the following main software components:

Scheduler: it becomes active when a task finishes. It monitors resource utilisation and selects and submits data providers to be harvested.

Queue (Qn): a messaging system that assists with communication between parts of the harvesting pipeline. Every individual task, such as metadata download, metadata parsing, full text download, and language detection, has its own message queue.

Worker (W i ): an independent and standalone application capable of executing a specific task. Every individual task has its own set of workers.

CORE Harvesting System.

A complete harvest of a data provider can be described as follows. When an existing task finishes, the scheduler is activated and informed of the result. It then uses the formula described in Appendix A to assign a score to each data provider. Depending on current resource utilisation, i.e. if there are any idle workers, and the number of data providers already scheduled for harvesting, the data provider with the highest score is then placed in the first queue Q 1 which contains data providers scheduled for metadata download. Once one of the metadata download workers W i -W j becomes available, a data provider is taken out of the queue and a new download of its metadata starts. Upon completion, the worker notifies the scheduler and, if the task is completed successfully, places the data provider in the next queue. This process continues until the data provider passes through the entire pipeline.

While some of the tasks in the pipeline need to be performed at the granularity of data providers, specifically metadata download and parsing, other tasks, such as full text extraction and language detection, can be performed at the granularity of individual records. While these tasks are originally scheduled at the granularity of data providers, only the individual records of a selected data provider which require processing are subsequently independently placed in the appropriate queue. Workers assigned to these tasks then process the individual records in the queue and they move through the pipeline once completed.
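The flow of data providers and records through per-task queues can be sketched as follows. This is a simplified, single-threaded illustration: in CHARS the workers run independently and concurrently, whereas here each stage simply drains its own queue in turn. The `Stage` class, the task functions and the sample data are hypothetical.

```python
from collections import deque


class Stage:
    """One pipeline task with its own queue and a worker function."""

    def __init__(self, name, work):
        self.name = name
        self.work = work      # callable: item -> list of items for the next stage
        self.queue = deque()


def run_pipeline(stages, providers):
    """Push providers through the stages; each stage drains its own queue."""
    stages[0].queue.extend(providers)
    finished = []
    for index, stage in enumerate(stages):
        is_last = index + 1 == len(stages)
        target = finished if is_last else stages[index + 1].queue
        while stage.queue:
            target.extend(stage.work(stage.queue.popleft()))
    return finished


# Hypothetical data: provider -> (record id, already has full text?)
PROVIDERS = {"repo-a": [("a1", False), ("a2", True)], "repo-b": [("b1", False)]}


def download_metadata(provider):      # provider-level task
    return [provider]


def parse_metadata(provider):         # provider-level task, fans out to records
    return [rid for rid, has_text in PROVIDERS[provider] if not has_text]


def download_full_text(record_id):    # record-level task
    return [f"{record_id}:fulltext"]


stages = [
    Stage("metadata download", download_metadata),
    Stage("metadata parsing", parse_metadata),
    Stage("full text download", download_full_text),
]
result = run_pipeline(stages, ["repo-a", "repo-b"])
```

Note how the parsing stage receives whole providers but only places the records that still require processing in the next queue, mirroring the change of granularity described above.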

A more detailed description of CHARS, which includes technologies used to implement it, as well as other details can be found in 8 .

The harvesting scheduler is a component responsible for identifying data providers which need to be harvested next and placing these data providers in the harvesting queue. In the original design of CORE, our harvesting schedule was created manually, assigning the same harvesting frequency to every data provider. However, we found this approach inefficient as it does not scale due to the varying size of data providers, differences in the update frequency of their databases and the maximum data delivery speeds of their repository platforms. To address these limitations, we designed the CHARS scheduler according to our new concept of “pro-active harvesting.” This means that the scheduler is event driven. It is triggered whenever the underlying hardware infrastructure has resources available to determine which data provider should be harvested next. The underlying idea is to maximise the number of ingested documents over a unit of time. The pseudocode and the formula we use to determine which repository to harvest next are described in Algorithm 1.

The size of the metadata download queue, i.e. the queue which represents an entry into the harvesting pipeline, is kept limited in order to keep the system responsive to the prioritisation of data providers. A long queue makes prioritising data providers harder, as it is not known beforehand how long the processing of a particular data provider will take. An appropriate size of the queue ensures a good balance between the reactivity and utilisation of the available resources.
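A minimal sketch of an event-driven scheduler with a bounded entry queue is given below. The scoring function is a stand-in chosen purely for illustration (expected documents ingested per second of harvesting); it is not the formula from Appendix A, and the `Provider` fields, `max_queue_size` and `on_task_finished` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    expected_new_docs: int     # hypothetical estimate of documents gained now
    expected_seconds: float    # hypothetical estimate of the harvest duration


def score(provider):
    """Stand-in score: expected documents ingested per second of harvesting."""
    return provider.expected_new_docs / provider.expected_seconds


def on_task_finished(candidates, queue, idle_workers, max_queue_size=2):
    """Event handler: top up the bounded entry queue with the best providers."""
    free_slots = max_queue_size - len(queue)
    if idle_workers == 0 or free_slots <= 0:
        return queue  # nothing to do until resources or queue slots free up
    queued = {p.name for p in queue}
    ranked = sorted(
        (p for p in candidates if p.name not in queued), key=score, reverse=True
    )
    queue.extend(ranked[:free_slots])
    return queue


candidates = [
    Provider("a", expected_new_docs=1000, expected_seconds=100.0),
    Provider("b", expected_new_docs=50, expected_seconds=10.0),
    Provider("c", expected_new_docs=600, expected_seconds=30.0),
]
entry_queue = on_task_finished(candidates, [], idle_workers=1)
```

Keeping `max_queue_size` small means that scores are re-evaluated shortly before a provider is actually harvested, which reflects the trade-off between reactivity and resource utilisation discussed above.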

Using OAI-PMH for content harvesting

We now describe the third key technical innovation which enables us to harvest full text content (as opposed to just metadata) from data providers using the OAI-PMH protocol. This process represents one step in the harvesting pipeline (Fig. 9 ), specifically, the third step which is activated after data provider metadata have been downloaded and parsed.

The OAI-PMH protocol was originally designed for metadata harvesting only, but due to its wide adoption and lack of alternatives it has been used as an entry point for full text harvesting from repositories. Full text harvesting is achieved by using URLs found in the metadata records to discover the location of the actual resource and subsequently downloading it 9 . We summarised the key challenges of this approach in the Challenges related to the use of OAI-PMH protocol for content harvesting section. The algorithm follows a depth first search strategy with prioritisation and finishes as soon as the first matching document is found.

The procedure works in the following way. First, all metadata records from a selected data provider with no full text are collected. Those records for which full text download was attempted within the retry period ( RP ) (usually six months) are filtered out. This is to avoid repeatedly downloading URLs that do not lead to the sought after documents. The downside of this approach is that if a data provider updates a link in the metadata, it might take up to the duration of the retry period to acquire the full text.
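The retry-period filter can be sketched as below. The record layout (`has_full_text`, `last_attempt`) is hypothetical, and the value of 182 days is only an approximation of the six months mentioned above.

```python
from datetime import datetime, timedelta

RETRY_PERIOD = timedelta(days=182)  # roughly six months; exact value is an assumption


def records_to_attempt(records, now):
    """Keep records without full text whose last attempt is older than the retry period."""
    selected = []
    for record in records:
        if record["has_full_text"]:
            continue
        last = record.get("last_attempt")
        if last is not None and now - last < RETRY_PERIOD:
            continue  # tried recently, skip until the retry period has passed
        selected.append(record)
    return selected


# Hypothetical records for illustration.
now = datetime(2023, 2, 1)
records = [
    {"id": "r1", "has_full_text": True, "last_attempt": datetime(2022, 1, 1)},
    {"id": "r2", "has_full_text": False, "last_attempt": datetime(2023, 1, 1)},
    {"id": "r3", "has_full_text": False, "last_attempt": datetime(2022, 1, 1)},
    {"id": "r4", "has_full_text": False, "last_attempt": None},
]
pending = records_to_attempt(records, now)
```

Record r2 illustrates the downside noted above: even if its provider corrected the link yesterday, it is not retried until the retry period has elapsed.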

Algorithm 1

Next, the records are further filtered using a set of rules and heuristics we developed to a) increase the chances of identifying the URL leading to the described document quickly and b) to ensure that we identify the correct document. These filtering rules include:

Accepted file extensions: URLs are filtered according to a list of accepted file extensions. URLs ending with extensions such as .pptx that clearly indicate that the URL does not link to the required resource are removed from the list.

Same domain policy: URLs in the OAI-PMH metadata can link to any resources and domains. For example, a common practice is to provide a link to the associated presentation, dataset, or another related resource. As these are often stored in external databases, filtering out all URLs that lead to an external domain, i.e. domain different than the domain of the data provider, presents a simple method of avoiding the download of resources which with very high likelihood do not represent the target document. Exceptions include dx.doi.org and hdl.handle.net domains whose purpose is to provide a persistent identifier pointing to the document. The same domain policy is disabled for data providers which are aggregators and link to many different domains by design.

Provider-specific crawling heuristics: Many data providers follow a specific pattern when composing URLs. For example, a link to a full text document may be composed of the following parts: data provider URL + record handle + .pdf . For data providers utilising such patterns, URLs may be composed automatically where the relevant information (record handle) is known to us from the metadata. These generated URLs are then added to the list of URLs obtained from the metadata.

Prioritising certain URLs: As it is more likely for a PDF URL to contain the target record than for an HTML URL, the final step is to sort URLs according to file and URL type. Highest priority is assigned to URLs that use repository software specific patterns to identify full text, document, and PDF filetypes, while the lowest priority is assigned to hdl.handle.net URLs.
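The four filtering rules above can be combined into a single function, sketched here under several simplifying assumptions: extensions are checked against a short illustrative reject list rather than CORE's actual list of accepted extensions, the provider-specific pattern is the generic "provider URL + record handle + .pdf" example from the text, and only three priority levels are used. All names are hypothetical.

```python
from urllib.parse import urlparse

REJECTED_EXTENSIONS = (".pptx", ".ppt", ".zip")          # illustrative list only
PERSISTENT_ID_HOSTS = {"dx.doi.org", "hdl.handle.net"}   # exceptions named in the text


def candidate_urls(urls, provider_host, handle=None, is_aggregator=False):
    """Filter and order the candidate URLs for one metadata record."""
    candidates = list(urls)
    if handle is not None:
        # Hypothetical provider-specific pattern: provider URL + handle + .pdf
        candidates.append(f"https://{provider_host}/{handle}.pdf")

    kept = []
    for url in candidates:
        parsed = urlparse(url)
        host = parsed.hostname or ""
        if parsed.path.lower().endswith(REJECTED_EXTENSIONS):
            continue  # extension shows this is not the target document
        same_domain = host == provider_host or host in PERSISTENT_ID_HOSTS
        if not is_aggregator and not same_domain:
            continue  # same domain policy, disabled for aggregators
        kept.append(url)

    def priority(url):
        parsed = urlparse(url)
        if parsed.hostname == "hdl.handle.net":
            return 2  # lowest priority
        if parsed.path.lower().endswith(".pdf"):
            return 0  # most likely to be the manuscript itself
        return 1

    return sorted(kept, key=priority)  # stable sort keeps metadata order within a level


urls = [
    "https://repo.example.org/view/42",
    "https://slides.example.com/talk.pptx",
    "https://other.example.net/data.csv",
    "https://hdl.handle.net/1234/42",
]
ordered = candidate_urls(urls, "repo.example.org", handle="1234/42")
```

In this example the generated PDF URL is tried first, the repository's HTML overview page second, and the handle URL last, while the external links are discarded.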

The system then attempts to request the document at each URL and download it. After each download, checks are performed to determine whether the downloaded document represents the target record. Currently, the downloaded document has to be a valid PDF with a title matching the original metadata record. If the target record is identified, the downloaded document is stored and the download process for that record ends. If the downloaded document contains an HTML page, URLs are extracted from this page and filtered using the same method mentioned above. This is because it is common in some of the most widely used repository systems such as DSpace for the documents not to be directly referenced from within the metadata records. Instead, the metadata records typically link to an HTML overview page of the document. To deal with this problem, we use the concept of harvesting levels. A maximum harvesting level corresponds to the maximum search depth for the referenced document. The algorithm finishes either as soon as the first matching document is found or after all the available URLs up to the maximum harvesting level have been exhausted. Algorithm 2 describes our approach for collecting the full texts using the OAI-PMH protocol. The algorithm follows a depth first search strategy with prioritisation and finishes as soon as the first matching document is found.
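The depth-limited search with harvesting levels can be sketched as a recursive function. This is not Algorithm 2 itself: the `fetch` callable and the page dictionaries are assumptions made so that the example runs offline, title matching is reduced to a case-insensitive comparison, and the re-filtering of links extracted from HTML pages is omitted for brevity.

```python
def find_full_text(record_title, start_urls, fetch, max_level=2):
    """Depth-first search for a PDF whose title matches the metadata record.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    returns None or a dict with "type" ("pdf" or "html") plus "title" or "links".
    """
    wanted = record_title.strip().lower()
    visited = set()

    def visit(url, level):
        if url in visited or level > max_level:
            return None
        visited.add(url)
        page = fetch(url)
        if page is None:
            return None
        if page["type"] == "pdf":
            # Accept only a PDF whose title matches the metadata record.
            return url if page["title"].strip().lower() == wanted else None
        for link in page.get("links", []):   # HTML page: descend one level
            found = visit(link, level + 1)
            if found:
                return found
        return None

    for url in start_urls:  # assumed to be already filtered and prioritised
        found = visit(url, 0)
        if found:
            return found    # stop at the first matching document
    return None


# Hypothetical miniature "web" standing in for a repository.
WEB = {
    "https://repo.example.org/view/42": {
        "type": "html",
        "links": [
            "https://repo.example.org/files/other.pdf",
            "https://repo.example.org/files/42.pdf",
        ],
    },
    "https://repo.example.org/files/other.pdf": {"type": "pdf", "title": "A Different Paper"},
    "https://repo.example.org/files/42.pdf": {"type": "pdf", "title": "Open Access Harvesting"},
}
match = find_full_text("open access harvesting", ["https://repo.example.org/view/42"], WEB.get)
```

With `max_level=0` the same call returns `None`, because the matching PDF is only reachable through the HTML overview page one level below the URL given in the metadata.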

Algorithm 2
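The download loop described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' Algorithm 2: `fetch` and `prioritise` are hypothetical caller-supplied functions, the title check is reduced to a normalised string comparison, and level 0 is assumed to mean the URLs taken directly from the metadata.

```python
import re


def titles_match(a, b):
    """Compare titles after lowercasing and collapsing punctuation and spaces."""
    def norm(s):
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return norm(a) == norm(b)


def find_full_text(urls, record_title, fetch, prioritise,
                   max_level=2, level=0, visited=None):
    """Depth-first search for a PDF whose title matches the metadata record.

    fetch(url) is a hypothetical function returning None on failure, or
      {"type": "pdf", "title": str} or {"type": "html", "links": [str, ...]}.
    prioritise(urls) returns the URLs in the order they should be tried.
    """
    visited = set() if visited is None else visited
    for url in prioritise(urls):
        if url in visited:
            continue
        visited.add(url)
        doc = fetch(url)
        if doc is None:
            continue
        if doc["type"] == "pdf":
            if titles_match(doc["title"], record_title):
                return url  # first matching document ends the search
        elif doc["type"] == "html" and level < max_level:
            # Follow links on the overview page one harvesting level deeper.
            found = find_full_text(doc["links"], record_title, fetch,
                                   prioritise, max_level, level + 1, visited)
            if found is not None:
                return found
    return None  # all URLs up to the maximum level were exhausted
```

Passing `fetch` as a parameter keeps the sketch free of network code, so the search logic can be exercised against a dictionary of canned responses.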

CHARS limitations

Despite overcoming the key obstacles to scalable harvesting of content from repositories, a number of important challenges remain. The first relates to the difficulty of estimating the optimal number of workers that our system needs to run efficiently. While the worker allocation is still largely established empirically, we are investigating more sophisticated approaches based on formal models of distributed computation, such as Petri Nets. This will allow us to investigate new approaches to dynamically allocating and launching workers to optimise the usage of our resources.

Enrichments

Conceptually, two types of enrichment processes are used within CORE: 1) an online enrichment process enriching a single record at the time of it being processed by the CHARS pipeline and 2) a periodic offline enrichment process which enriches a record based on information in external datasets (Fig. 10).

Fig. 10 CORE Offline Enrichments.

Online enrichments

Online enrichments are fully integrated into the CHARS pipeline described earlier in this section. These enrichments generally involve the application of machine learning models and rule-based tools to gather additional insights about the record, for example through language detection and document type detection. As opposed to offline enrichments, online enrichments are always performed just once for a given record. The following is a list of the current enrichments performed online:

Article type detection: A machine learning algorithm assigns each publication one of the following four types: presentation, thesis, research paper, other. In the future we may include other types.

Language identification: This task uses third-party libraries to identify the language based on the full text of a document. The resulting language is then compared to the one provided by the metadata record. Some heuristics are applied to disambiguate and harmonise languages.
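The comparison between a detected language and the language declared in the metadata can be sketched as follows. The text does not specify the heuristics CORE applies, so everything here is an assumption: the alias table, the confidence threshold and the function names are hypothetical, and the detector itself (a third-party library in CORE) is represented only by its output.

```python
# Hypothetical alias table mapping metadata values onto two-letter codes.
LANGUAGE_ALIASES = {
    "en": "en", "eng": "en", "english": "en",
    "de": "de", "deu": "de", "ger": "de", "german": "de",
}


def harmonise(value):
    """Map a free-form language value to a harmonised code, or None if unknown."""
    if not value:
        return None
    return LANGUAGE_ALIASES.get(value.strip().lower())


def resolve_language(detected, confidence, metadata_value, threshold=0.9):
    """Pick one language from the detector output and the metadata field.

    Assumed rule: trust the metadata unless the detector disagrees with
    a confidence of at least `threshold`.
    """
    declared = harmonise(metadata_value)
    detected = harmonise(detected)
    if declared is None:
        return detected
    if detected is None or detected == declared:
        return declared
    return detected if confidence >= threshold else declared
```

With these assumptions, a record declared as "eng" but detected as German with high confidence would be relabelled, while a low-confidence disagreement would leave the metadata value in place.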

Offline enrichments

Offline enrichments are carried out by means of gathering a range of information from large third-party scholarly datasets (research graphs). Such information includes metadata that do not necessarily change, such as a DOI identifier, as well as metadata that evolve, such as the number of citations. Especially due to the latter, CORE performs offline enrichments periodically, i.e. all records in CORE go through this process repeatedly at specified time intervals (currently once per month).

The process is depicted in Fig. 10. The initial mapping of a record is carried out using a DOI, if available. However, as the majority of records from repositories do not come with a DOI, we carry out a matching process against the Crossref database using a subset of metadata fields including title, authors and year. Once the mapping is performed, we can harmonise fields as well as gather a wide range of additional useful data from relevant external databases, thereby enriching the CORE record. Such data include ORCID identifiers, citation information, additional links to freely available full texts, field of study information and PubMed identifiers. Our solution is based on a set of map-reduce tasks to enrich the dataset and is implemented on a Cloudera Enterprise Data Hub (https://www.cloudera.com/products/enterprise-data-hub.html) 23, 24, 25, 26.
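As an illustration of the matching step, the following Python sketch decides whether a CORE record and a candidate record from an external dataset describe the same work, using a DOI when both sides have one and otherwise falling back to title, authors and year. The similarity measure, the 0.9 threshold, the "Surname, Given" author format and all names are assumptions; the text does not describe the actual matching rules.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lowercase and collapse anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def surnames(authors):
    """Extract surnames, assuming each author is written 'Surname, Given'."""
    return {normalise(author.split(",")[0]) for author in authors}


def is_same_work(record, candidate, title_threshold=0.9):
    """Return True if the two metadata dicts appear to describe one work."""
    # Prefer an exact DOI comparison when both records carry a DOI.
    if record.get("doi") and candidate.get("doi"):
        return record["doi"].lower() == candidate["doi"].lower()
    # Otherwise require the same year, a near-identical title
    # and at least one shared author surname.
    if record["year"] != candidate["year"]:
        return False
    title_similarity = SequenceMatcher(
        None, normalise(record["title"]), normalise(candidate["title"])
    ).ratio()
    shared = surnames(record["authors"]) & surnames(candidate["authors"])
    return title_similarity >= title_threshold and bool(shared)
```

Once a match is found, the candidate's DOI can serve as the key for pulling further fields from other external datasets.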

Data availability

CORE provides several large data dumps of the processed and aggregated data under the ODC-BY licence (https://core.ac.uk/documentation/dataset). The only condition for both commercial and non-commercial reuse of these datasets is to acknowledge the use of CORE in any resulting outputs. Additionally, CORE makes its API and most recent data dump freely available to registered individual users and researchers. Please note that CORE claims no rights in the aggregated content itself, which is open access and therefore freely available to everyone. All CORE data rights correspond to the sui generis database rights of the aggregated and processed collection.

Licences for CORE services, such as the API and FastSync, are available for commercial users wishing to benefit from convenient access to CORE data with a guaranteed level of customer support. The organisation running CORE, i.e. The Open University, is a charitable organisation fully committed to the Open Research mission. CORE is a signatory of the Principles of Open Scholarly Infrastructure (POSI) (https://openscholarlyinfrastructure.org/posse). No profit generation is practised. Instead, CORE’s income from licences to commercial parties is used solely to provide sustainability by enabling CORE to become less reliant on unstable project grants, thus offsetting and reducing the cost of CORE to the taxpayer. This is done in full compliance with the principles and best practices of sustainable open science infrastructure.

Code availability

CORE consists of multiple services. Most of our source code is open source and available in our public repository on GitHub (https://github.com/oacore/). As of today, we are unfortunately not yet able to provide the source code to our data ingestion module. However, as we want to be as transparent as possible with our community, we have documented in this paper the key algorithms and processes which we apply using pseudocode.

References

Bornmann, L. & Mutz, R. Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. JASIST 66(11), 2215–2222 (2015).

Piwowar, H. et al. The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles. PeerJ 6, e4375 (2018).

Saggion, H. & Ronzano, F. Scholarly data mining: making sense of scientific literature. 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL): 1–2 (2017).

Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chemistry of Materials 29(21), 9436–9444 (2017).

Jacobs, N. & Ferguson, N. Bringing the UK’s open access research outputs together: Barriers on the Berlin road to open access. Jisc Repository (2014).

Knoth, P. & Pontika, N. Aggregating Research Papers from Publishers’ Systems to Support Text and Data Mining: Deliberate Lack of Interoperability or Not? In: INTEROP2016 (2016).

Herrmannova, D., Pontika, N. & Knoth, P. Do Authors Deposit on Time? Tracking Open Access Policy Compliance. Proceedings of the 2019 ACM/IEEE Joint Conference on Digital Libraries, Urbana-Champaign, IL (2019).

Cancellieri, M., Pontika, N., Pearce, S., Anastasiou, L. & Knoth, P. Building Scalable Digital Library Ingestion Pipelines Using Microservices. Proceedings of the 11th International Conference on Metadata and Semantics Research (MTSR 2017): 275–285. Springer (2017).

Knoth, P. From open access metadata to open access content: two principles for increased visibility of open access content. Proceedings of the 2013 Open Repositories Conference, Charlottetown, Prince Edward Island, Canada (2013).

Knoth, P., Cancellieri, M. & Klein, M. Comparing the Performance of OAI-PMH with ResourceSync. Proceedings of the 2019 Open Repositories Conference, Hamburg, Germany (2019).

Kapidakis, S. Metadata Synthesis and Updates on Collections Harvested Using the Open Archive Initiative Protocol for Metadata Harvesting. Digital Libraries for Open Knowledge. TPDL 2018. Lecture Notes in Computer Science 11057, 16–31 (2018).

Knoth, P. & Zdrahal, Z. CORE: three access levels to underpin open access. D-Lib Magazine 18(11/12) (2012).

Haslhofer, B. et al. ResourceSync: leveraging sitemaps for resource synchronization. Proceedings of the 22nd International Conference on World Wide Web: 11–14 (2013).

Khabsa, M. & Giles, C. L. The number of scholarly documents on the public web. PLOS One 9(5), e93949 (2014).

Charalampous, A. & Knoth, P. Classifying document types to enhance search and recommendations in digital libraries. Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science 10450, 181–192 (2017).

Rosvall, M. & Bergstrom, C. T. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4), 1118–1123 (2008).

D’Angelo, C. A. & Abramo, G. Publication rates in 192 research fields of the hard sciences. Proceedings of the 15th ISSI Conference: 915–925 (2015).

Ammar, W. et al. Construction of the Literature Graph in Semantic Scholar. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers): 84–91 (2018).

Knoth, P. et al. Towards effective research recommender systems for repositories. Open Repositories, Bozeman, USA (2017).

Pride, D. & Knoth, P. An Authoritative Approach to Citation Classification. Proceedings of the 2020 ACM/IEEE Joint Conference on Digital Libraries (JCDL 2020), Virtual–China (2020).

Newman, S. Building microservices: designing fine-grained systems. O’Reilly Media, Inc. (2015).

Li, H. et al. CiteSeerχ: a scalable autonomous scientific digital library. Proceedings of the 1st International Conference on Scalable Information Systems, ACM (2006).

Bastian, H., Glasziou, P. & Chalmers, I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Medicine 7(9), e1000326 (2010).

Shojania, K. G. et al. How quickly do systematic reviews go out of date? A survival analysis. Annals of Internal Medicine 147(4), 224–233 (2007).

Tsafnat, G. et al. Systematic review automation technologies. Systematic Reviews 3(1), 74 (2014).

Harzing, A.-W. & Alakangas, S. Microsoft Academic is one year old: The Phoenix is ready to leave the nest. Scientometrics 112(3), 1887–1894 (2017).

Acknowledgements

We would like to acknowledge the generous support of Jisc, under a number of grants and service contracts with The Open University. These included projects CORE, ServiceCORE, UK Aggregation (1 and 2) and DiggiCORE, which was co-funded by Jisc with NWO. Since 2015, CORE has been supported in three iterations under the Jisc Digital Services–CORE (JDSCORE) service contract with The Open University. Within Jisc, we would like to thank primarily the CORE project managers, Andy McGregor, Alastair Dunning, Neil Jacobs and Balviar Notay. We would also like to thank the European Commission for funding that contributed to CORE, namely OpenMinTeD (739563) and EOSC Pilot (654021). We would like to show our gratitude to all current CORE Team members who contributed to CORE but are not authors of the manuscript, namely Valeriy Budko, Ekaterine Chkhaidze, Viktoriia Pavlenko, Halyna Torchylo, Andrew Vasilyev and Anton Zhuk. We would like to show our gratitude to all past CORE Team members who have contributed to CORE over the years, namely Lucas Anastasiou, Giorgio Basile, Aristotelis Charalampous, Josef Harag, Drahomira Herrmannova, Alexander Huba, Bikash Gyawali, Tomas Korec, Dominika Koroncziova, Magdalena Krygielova, Catherine Kuliavets, Sergei Misak, Jakub Novotny, Gabriela Pavel, Vojtech Robotka, Svetlana Rumyanceva, Maria Tarasiuk, Ian Tindle, Bethany Walker, Viktor Yakubiv, Zdenek Zdrahal and Anna Zelinska.

Author information

Drahomira Herrmannova

Present address: Oak Ridge National Laboratory, Oak Ridge, TN, USA

Authors and Affiliations

Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, UK

Petr Knoth, Drahomira Herrmannova, Matteo Cancellieri, Lucas Anastasiou, Nancy Pontika, Samuel Pearce, Bikash Gyawali & David Pride

Contributions

P.K. is the Founder and Head of CORE. He conceived the idea and has been the project lead since the start in 2011. He researched and created the first version of CORE, acquired funding, built the team, and has been managing and leading all research and development. M.C., L.A., S.P. and P.K. designed the system, worked out all technical details, and implemented significant parts of it, including CHARS, the harvesting scheduler, and the OAI-PMH content harvesting method. All authors contributed to the maintenance, operation and improvements of the system. D.H. drafted the initial version of the manuscript based on consultations with P.K. D.P. and P.K. wrote the final manuscript with additional input from L.A. and N.P. D.H., M.C. and L.A. performed the data analysis for the paper and D.H. produced the figures. D.H., D.P., B.G. and L.A. participated in research activities and tasks related to CORE, following the instructions of P.K. and under his direct supervision.

Corresponding author

Correspondence to Petr Knoth.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article.

Knoth, P., Herrmannova, D., Cancellieri, M. et al. CORE: A Global Aggregation Service for Open Access Papers. Sci Data 10, 366 (2023). https://doi.org/10.1038/s41597-023-02208-w

Received: 18 May 2021

Accepted: 03 May 2023

Published: 07 June 2023

DOI: https://doi.org/10.1038/s41597-023-02208-w

Explore millions of high-quality primary sources and images from around the world, including artworks, maps, photographs, and more.

Explore migration issues through a variety of media types

Harness the power of visual materials—explore more than 3 million images now on JSTOR.

Enhance your scholarly research with underground newspapers, magazines, and journals.

Explore collections in the arts, sciences, and literature from the world’s leading museums, archives, and scholars.

COMMENTS

Directory of Open Access Journals
DOAJ is a free and comprehensive index of open access journals from various fields and languages. Search for journals and articles by keywords, browse by categories, or support DOAJ's 20th anniversary campaign.
Open and free content on JSTOR and Artstor
JSTOR and Artstor offer millions of articles, ebooks, images, and media that are open access or free to everyone. Learn how to search, access, and use these resources from libraries, publishers, and partners worldwide.
ScienceOpen
ScienceOpen is a platform that offers content hosting, context building and marketing services for publishers, institutions and researchers. It also provides smart search and discovery within an interactive interface, researcher promotion and ORCID integration, and open evaluation with article reviews and Collections.
Frontiers
Frontiers is a leading open access publisher of peer-reviewed articles across more than 1,500 academic disciplines. Find a journal, submit your research, and explore the latest news and events in science.
Find and Read Articles
PLOS ONE is an open access journal that publishes research across all scientific disciplines. Learn how to use PLOS tools to search, browse, and subscribe to articles by journal, subject, or personalized criteria.
Open access
Hybrid journals include a mix of open access articles and articles available to those with a journal subscription. Hybrid journals offer authors the option of gold open access publishing. With gold open access, authors usually pay an APC to make their research articles available immediately upon publication, under a Creative Commons licence ...
SpringerOpen
SpringerOpen offers researchers a platform to publish open access in journals in science, technology, medicine, humanities and social sciences. Learn about open access, find the right journal, explore subject areas and watch videos on SpringerOpen.
Open access
Elsevier is a leading open access publisher with over 2,900 journals and 3.3 million articles. Learn how to publish, access and share research with Elsevier's open access platforms, initiatives and services.
The fundamentals of open access and open research
Learn what open access and open research are, how to publish your work OA, and why it matters. Explore Springer Nature's services and policies for OA journals, books, data, preprints and peer review.
Open Access Journals
Learn how to publish open access articles in Elsevier journals that are peer reviewed, free to access and download, and reusable under Creative Commons licenses. Find out about the publication fee, funding body agreements, and transformative journals.
Home
SpringerLink is the online platform for Springer Nature, offering access to journals, eBooks, reference works and protocols across various disciplines. Find out how to publish, discover open access, browse by subject, and see featured articles and journals.
Open access journals
Springer Nature offers over 124,000 open access articles and 1,300 open access books across disciplines. Find out more about our fully open access and hybrid journals, article processing charges, licences and self-archiving options.
What is Open Access?
Open access (OA) is a set of principles and practices through which research outputs like journal articles are distributed online, free of cost or other access barriers. In "traditional" scholarly publishing, the publisher owns the rights to the articles in their journals. Individuals looking to read these articles may encounter a paywall, requiring them to pay a fee for access.
"Unpaywall is transforming Open Science"
An open database of 51,435,134 free scholarly articles. We harvest Open Access content from over 50,000 publishers and repositories, and make it easy to find, track, and use.
Open Access
Open access literature is digital, online, free of charge, and free of most copyright and licensing restrictions. There's an incredible amount of scientific research conducted at universities and institutions around the world. Historically, the findings of this research have been published in scholarly journals. However, access to this research is typically restricted — granted only…
Open Access (OA) Resources Research Guide
Open Access is the free, immediate, online availability of research articles combined with the rights to use these articles fully in the digital environment. Open Access is the needed modern update for the communication of research that fully utilizes the Internet for what it was originally built to do—accelerate research.
US government reveals big changes to open-access policy
The Biden administration instructs all US agencies to require immediate access to federally funded research after publication, starting in 2026. This is a boost for the open-access movement, but ...
Open Access
SPARC advocates for Open Access, which is the free, immediate, online availability of research articles coupled with the rights to use them fully in the digital environment. Learn how Open Access benefits researchers, funders, and society, and how to access and use research results.
CORE: A Global Aggregation Service for Open Access Papers
As of February 2023, CORE provides access to over 291 million metadata records and 32.8 million full text open access articles, making it the world's largest archive of open access research ...
JSTOR Home
Enrich your research with primary sources. Explore millions of high-quality primary sources and images from around the world, including artworks, maps, photographs, and more. ...
Open access
Hybrid open-access journals contain a mixture of open access articles and closed access articles. [18] [19] A publisher following this model is partially funded by subscriptions, and only provides open access for those individual articles for which the authors (or research sponsor) pay a publication fee. [20]