Research outputs by CORE
Writing about CORE? Discover our research outputs and cite our work. Below we have listed key articles about CORE by the Big Scientific Data and Text Analytics group (BSDTAG).
If you use CORE in your work, we kindly ask you to cite one of our reference publications.
This paper introduces CORE, describes CORE’s continuously growing dataset and the motivation behind its creation, presents the challenges associated with systematically gathering research papers from thousands of data providers worldwide at scale and outlines the solutions developed to overcome these challenges. It provides an in-depth discussion of the services and tools built on top of the indexed content and finally examines several use cases that have leveraged the CORE dataset and services.
Sets the vision for creating the CORE service, developing global-wide content aggregation of all open access research literature (on top of OAI-PMH protocol for metadata harvesting and other protocols). It sets the mission to develop the three access levels (access at the granularity of papers; analytical access; access to raw data) via CORE.
This paper describes the use cases that must be supported by open access aggregators, it establishes the CORE use case and demonstrates the benefits of open access content aggregators.
This paper describes the technical challenges relating to machine interfaces, the interoperability issues on obtaining open access content and the complications of achieving a harmonisation across repositories’ and publishers’ systems.
In this paper, we present CORE-GPT, a novel question-answering platform that combines GPT-based language models and more than 32 million full-text open access scientific articles from CORE. We first demonstrate that GPT3.5 and GPT4 cannot be relied upon to provide references or citations for generated text. We then introduce CORE-GPT which delivers evidence-based answers to questions, along with citations and links to the cited papers, greatly increasing the trustworthiness of the answers and reducing the risk of hallucinations. CORE-GPT's performance was evaluated on a dataset of 100 questions covering the top 20 scientific domains in CORE, resulting in 100 answers and links to 500 relevant articles. The quality of the provided answers and relevance of the links were assessed by two annotators. Our results demonstrate that CORE-GPT can produce comprehensive and trustworthy answers across the majority of scientific domains, complete with links to genuine, relevant scientific articles.
National research evaluation initiatives and incentive schemes choose between simplistic quantitative indicators and time-consuming peer/expert review, sometimes supported by bibliometrics. Here we assess whether machine learning could provide a third alternative, estimating article quality using more multiple bibliometric and metadata inputs. We investigated this using provisional three-level REF2021 peer review scores for 84,966 articles submitted to the U.K. Research Excellence Framework 2021, matching a Scopus record 2014–18 and with a substantial abstract. We found that accuracy is highest in the medical and physical sciences Units of Assessment (UoAs) and economics, reaching 42% above the baseline (72% overall) in the best case. This is based on 1,000 bibliometric inputs and half of the articles used for training in each UoA. Prediction accuracies above the baseline for the social science, mathematics, engineering, arts, and humanities UoAs were much lower or close to zero. The Random Forest Classifier (standard or ordinal) and Extreme Gradient Boosting Classifier algorithms performed best from the 32 tested. Accuracy was lower if UoAs were merged or replaced by Scopus broad categories. We increased accuracy with an active learning strategy and by selecting articles with higher prediction probabilities, but this substantially reduced the number of scores predicted.
We investigate the effect of varying citation context window sizes on model performance in citation intent classification. Prior studies have been limited to the application of fixed-size contiguous citation contexts or the use of manually curated citation contexts. We introduce a new automated unsupervised approach for the selection of a dynamic-size and potentially non-contiguous citation context, which utilises the transformer-based document representations and embedding similarities. Our experiments show that the addition of non-contiguous citing sentences improves performance beyond previous results. Evaluating on the (1) domain-specific (ACL-ARC) and (2) the multi-disciplinary (SDP-ACT) dataset demonstrates that the inclusion of additional context beyond the citing sentence significantly improves the citation classification model’s performance, irrespective of the dataset’s domain.
We present a new gold-standard dataset and a benchmark for the Research Theme Identification task, a sub-task of the Scholarly Knowledge Graph Generation shared task, at the 3rd Workshop on Scholarly Document Processing. The objective of the shared task was to label given research papers with research themes from a total of 36 themes. The benchmark was compiled using data drawn from the largest overall assessment of university research output ever undertaken globally (the Research Excellence Framework-2014). We provide a performance comparison of a transformer-based ensemble, which obtains multiple predictions for a research paper, given its multiple textual fields (eg title, abstract, reference), with traditional machine learning models. The ensemble involves enriching the initial data with additional information from open-access digital libraries and Argumentative Zoning techniques (CITATION). It uses a weighted sum aggregation for the multiple predictions to obtain a final single prediction for the given research paper.
In this paper, we argue why and how the integration of recommender systems for research can enhance the functionality and user experience in repositories.
This paper presents the CORE Repositories Dashboard, a tool designed primarily for repository managers. It describes how the Dashboard improves the quality of the harvested papers, advances the collaboration between the repository managers and CORE, enables a straightforward management of their collections and enhances the transparency of the harvested content.
This paper describes the online tool for annotating citations in a research paper based on their purpose. Citations are also annotated based on how influential it is for research assessment.
The overview paper highlights findings from the first edition of the 3C Citation Context Classification task, organised as part of the workshop, WOSP 2020.