Wikidata: Main Page

Welcome to Wikidata

the free knowledge base with 113,194,393 data items that anyone can edit.

Introduction • Project Chat • Community Portal • Help


Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.

Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

Wikidata also provides support to many other sites and services beyond just Wikimedia projects! The content of Wikidata is available under a free license, exported using standard formats, and can be interlinked to other open data sets on the linked data web.
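
Because each item is exported in the standard Wikibase JSON format, its data can be fetched directly over HTTP. The sketch below is a minimal, unofficial illustration that retrieves one item through the public Special:EntityData endpoint and reads its English label; the requests dependency, the User-Agent string and the printed fields are choices of this example.

    # Minimal sketch: fetch one Wikidata item as JSON via the public
    # Special:EntityData endpoint and read its English label.
    # Assumes the third-party "requests" package; error handling is minimal.
    import requests

    def fetch_entity(qid):
        url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % qid
        resp = requests.get(url, headers={"User-Agent": "wikidata-notes-example/0.1"}, timeout=30)
        resp.raise_for_status()
        return resp.json()["entities"][qid]

    entity = fetch_entity("Q42")  # Douglas Adams, the showcase item mentioned below
    print(entity["labels"]["en"]["value"])        # English label of the item
    print(len(entity.get("claims", {})), "property groups (claims)")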

Learn about Wikidata

  • What is Wikidata? Read the Wikidata introduction.
  • Explore Wikidata by looking at a featured showcase item for author Douglas Adams (Q42).
  • Get started with Wikidata's SPARQL query service.
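
Queries for the SPARQL query service mentioned above can be sent to the public endpoint at https://query.wikidata.org/sparql. The following minimal, unofficial Python sketch runs one such query; the example query (works whose author, P50, is Douglas Adams, Q42) and the requests dependency are choices of this illustration.

    # Minimal sketch: run a SPARQL query against the Wikidata Query Service.
    # P50 is the "author" property; Q42 is Douglas Adams.
    import requests

    SPARQL = """
    SELECT ?work ?workLabel WHERE {
      ?work wdt:P50 wd:Q42 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 5
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SPARQL, "format": "json"},
        headers={"User-Agent": "wikidata-sparql-example/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["work"]["value"], "-", row["workLabel"]["value"])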

Contribute to Wikidata

  • Learn to edit Wikidata: follow the tutorials.
  • Work with other volunteers on a subject that interests you: join a WikiProject.
  • Individuals and organizations can also donate data.

Meet the Wikidata community

  • Visit the community portal or attend a Wikidata event.
  • Create a user account.
  • Talk and ask questions on the Project chat, Telegram groups, or the live IRC chat.

Use data from Wikidata

  • Learn how you can retrieve and use data from Wikidata.
  • 2024-07-10: Wikidata records its 2,200,000,000th edit.
  • 2024-07-10: The Wikidata development team held the Q3 Wikidata+Wikibase office hour on July 10th at 16:00 UTC. They presented their work from the past quarter and discussed what's coming next for Q3. Find the session log here.
  • 2024-05-07: Wikidata records its 2³¹th edit, as revision IDs no longer fit into a 32-bit signed integer.
  • 2024-04-10: The development team at WMDE held the 2024 Q2 Wikidata+Wikibase office hour in the Wikidata Telegram group. You can read the session log.
  • 2024-04: Wikidata held the Leveling Up Days, an online event focused on learning more about how to contribute to Wikidata, from the 5th to 7th and 12th to 14th of April.

More news...

New to the wonderful world of data? Develop and improve your data literacy through content designed to get you up to speed and feeling comfortable with the fundamentals in no time.

Item: Earth (Q2)

  • Tropical Storm Jongdari (Q129393829)
  • Tropical Storm Shanshan (Q129555713)
  • Mike Lynch (Q6833839) (pictured)
  • Delta State College of Physical Education, Mosogar (Q107478063)
  • Rabea Rogge (Q128996418)
  • It Ends With Us (Q118641054)
  • Christopher J. Morvillo (Q129401605)

Innovative applications and contributions from the Wikidata community

Featured WikiProject: WikiProject Music

WikiProject Music is home to editors who help add data about artists, music releases, tracks, awards, and performances. Another focus of the project is importing from and linking Wikidata with the many music databases and streaming services. Read about our data model on our project page and come chat with us on Telegram.

  • Check out Wikidata:Tools for some of our best tools and gadgets for using and exploring Wikidata.

Know of an interesting project or research conducted using Wikidata? You can nominate content to be featured on the Main page here!

  • Wikidata mailing list
  • Wikidata technical mailing list
  • Discussion requests for specific topics
  • Facebook, Mastodon, X/Twitter
  • Leave a message at project chat
  • Telegram General Chat, Telegram Help, or IRC
  • Report a technical problem
  • Keep up-to-date: Weekly summaries



Various places that have Wikimedia datasets, and tools for working with them.

Also, you can now store table and map data using Commons Datasets, and use them from all wikis via Lua and Graphs.

Dataset / Description (Last Updated):

  • Official Wikipedia database dumps. (Present)
  • Exposes the semantics of content in fully rendered HTML and is available for various languages and projects (..., dewikibooks, ...); the prefix pattern is the Wikimedia database name. Users include VE, Flow, Kiwix and Google. Parsoid also supports the conversion of (possibly modified) HTML back to wikitext without introducing dirty diffs. A minimal fetch sketch follows after this list. (Dead)
  • Taxobox: Wikipedia infoboxes with taxonomic information on animal species. (Dead)
  • Wikipedia³: a conversion of the English Wikipedia into RDF; a monthly updated dataset containing around 47 million triples. (Dead)
  • DBpedia: facts extracted from Wikipedia infoboxes and link structure in RDF format (Auer et al., 2007). (2019)
  • Multiple data sets (English Wikipedia articles that have been transformed into XML). (Dead)
  • An alphabetical list of film articles (or sections within articles about films), including made-for-television films. (Dead)
  • Using the Wikipedia page-to-page link database. (Dead)
  • Wikipedia: Lists of common misspellings/For machines. (Dead)
  • Apache Hadoop: a powerful open-source software package designed for sophisticated analysis and transformation of both structured and unstructured complex data. (Dead)
  • Wikipedia XML Data. (2015)
  • Wikipedia Page Traffic Statistics (up to November 2015). (2015)
  • Complete Wikipedia edit history (up to January 2008). (2008)
  • Wikitech-l page counters. (2016)
  • MusicBrainz Database. (Dead)
  • Datasets of networks extracted from User Talk pages. (2011)
  • Wikipedia Statistics. (Present)
  • List of articles created last month/week/day with most users contributing to the article within the same period. (Dead)
  • Wikipedia Taxonomy, automatically generated from the network of categories in Wikipedia (RDF Schema format) (Ponzetto and Strube, 2007a–c; Zirn et al., 2008). (Dead)
  • Semantic Wikipedia: a snapshot of Wikipedia automatically annotated with named entity tags (Zaragoza et al., 2007). (Dead)
  • Cyc to Wikipedia mappings: 50,000 automatically created mappings from Cyc terms to Wikipedia articles (Medelyan and Legg, 2008). (Dead)
  • Topic-indexed documents: a set of 20 Computer Science technical reports indexed with Wikipedia articles as topics; 15 teams of 2 senior CS undergraduates independently assigned topics from Wikipedia to each article (Medelyan et al., 2008). (Dead)
  • Wikipedia Page Traffic API. (Present)
  • Articles published using the tool; both detailed lists and summary statistics are available. (2022)
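
The Parsoid entry above refers to fully rendered HTML with embedded semantics. Single pages can be fetched in that rendered form through the Wikimedia REST API; the sketch below is an illustrative, unofficial example (the User-Agent string and the chosen article are placeholders), and bulk access should still go through the dumps.

    # Minimal sketch: fetch the rendered (Parsoid) HTML of one article through
    # the Wikimedia REST API. Suitable for single pages, not for bulk download.
    import requests
    from urllib.parse import quote

    def fetch_rendered_html(title, wiki="en.wikipedia.org"):
        slug = quote(title.replace(" ", "_"), safe="")
        url = "https://%s/api/rest_v1/page/html/%s" % (wiki, slug)
        resp = requests.get(url, headers={"User-Agent": "dataset-notes-example/0.1"}, timeout=30)
        resp.raise_for_status()
        return resp.text

    html = fetch_rendered_html("Douglas Adams")   # placeholder article
    print(html[:200])  # the markup carries Parsoid's semantic (RDFa) attributes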

Tools to extract data from Wikipedia:

This table might be migrated to the Knowledge Extraction Wikipedia Article

Tool / Description (Last Updated):

  • Wikilytics: Extracting the dumps into a NoSQL database. (2017)
  • Wikipedia2text: Extracting text from Wikipedia. (2008)
  • Traffic Statistics: Wikipedia article traffic statistics. (Dead)
  • Wikipedia to Plain text: Generating a plain-text corpus from Wikipedia. (2009)
  • DBpedia Extraction Framework: The DBpedia software that produces RDF data from over 90 language editions of Wikipedia and Wiktionary (highly configurable for other MediaWikis as well). (2019)
  • Wikiteam: Tools for archiving wikis, including Wikipedia. (2019)
  • History Flow: A tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors. (Dead)
  • WikiXRay: A set of Python and GNU R scripts to obtain statistics, graphics and quantitative results for any Wikipedia language version. (2012)
  • StatMediaWiki: A project that aims to create a tool to collect and aggregate information available in a MediaWiki installation. Results are static HTML pages, including tables and graphics, that can help to analyze the wiki's status and development, or a CSV file for custom processing. (Dead)
  • Java Wikipedia Library (JWPL): An open-source, Java-based application programming interface that allows access to all information contained in Wikipedia. (2016)
  • Wikokit: Wiktionary parser and visual interface. (2019)
  • wiki-network: Python scripts for parsing Wikipedia dumps with different goals. (2012)
  • Pywikipediabot: Python Wikipedia robot framework. (2019)
  • WikiRelate: API for computing semantic relatedness using Wikipedia (Strube and Ponzetto, 2006). (2006)
  • WikiPrep: A Perl tool for preprocessing Wikipedia XML dumps (Gabrilovich and Markovitch, 2007). (2014)
  • W.H.A.T. Wikipedia Hybrid Analysis Tool: An analytic tool for Wikipedia with two main functionalities: an article network and extensive statistics. It contains a visualization of the article networks and a powerful interface to analyze the behavior of authors. (2013)
  • QuALiM: A question answering system; given a question in natural language, it returns relevant passages from Wikipedia (Kaisser, 2008). (2008)
  • Koru: A demo of a search interface that maps topics involved in both queries and documents to Wikipedia articles; supports automatic and interactive query expansion (Milne et al., 2007). (2007)
  • Wikipedia Thesaurus: A large-scale association thesaurus containing 78M associations (Nakayama et al., 2007a, 2008). (Dead)
  • Wikipedia English–Japanese dictionary: A dictionary returning translations from English into Japanese and vice versa, enriched with probabilities of these translations (Erdmann et al., 2008). (Dead)
  • Wikify: Automatically annotates any text with links to Wikipedia articles (Mihalcea and Csomai, 2007). (Dead)
  • Wikifier: Automatically annotates any text with links to Wikipedia articles describing named entities. (Dead)
  • Wikipedia Cultural Diversity Observatory: Creates a dataset named Cultural Context Content (CCC) for each language edition with the articles that relate to its cultural context (geography, people, traditions, history, companies, etc.). (2019)
  • Time-series graph of Wikipedia: Wikipedia web network stored in a Neo4j database; pagecounts data stored in an Apache Cassandra database. Deployment scripts and instructions use the corresponding Wikimedia dumps. (2020)
  • Basic python parsing of dumps: A guide for how to parse Wikipedia dumps in Python (a related sketch follows after this list). (2017)
  • Wiki Dump Reader: A Python package to extract text from Wikipedia dumps. (2019)
  • MediaWiki Parser from Hell: A Python library to parse MediaWiki wikicode. (2020)
  • Mediawiki Utilities: A collection of utilities for interfacing with MediaWiki. (2020)
  • qwikidata: A Python utility for interacting with Wikidata. (2020)
  • Namespace Database: A Python utility. (2020)
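
Tying a few of the tools above together, the sketch below streams pages out of a pages-articles XML dump and strips the wikitext to plain text with mwparserfromhell. It is only an illustrative starting point: the dump file name is a placeholder and the XML namespace handling is kept deliberately generic.

    # Minimal sketch: stream pages out of a Wikipedia pages-articles XML dump
    # and strip the wikitext to plain text with mwparserfromhell.
    import bz2
    import xml.etree.ElementTree as ET
    import mwparserfromhell  # pip install mwparserfromhell

    DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder path

    def local(tag):
        """Drop the XML namespace prefix, which varies between dump versions."""
        return tag.rsplit("}", 1)[-1]

    def iter_plain_text(path, limit=3):
        done = 0
        with bz2.open(path, "rb") as fh:
            for _, elem in ET.iterparse(fh):
                if local(elem.tag) != "page":
                    continue
                title, wikitext = None, ""
                for child in elem.iter():
                    if local(child.tag) == "title":
                        title = child.text
                    elif local(child.tag) == "text":
                        wikitext = child.text or ""
                yield title, mwparserfromhell.parse(wikitext).strip_code()
                elem.clear()  # free memory while streaming
                done += 1
                if done >= limit:
                    return

    for title, text in iter_plain_text(DUMP):
        print(title, "->", text[:80].replace("\n", " "))
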
  • Research:Index
  • Research:Query Library
  • en:Category:Websites which use Wikipedia
  • Data dumps/Other tools
  • Research:Data
  • Data dumps/More resources
  • Help:Export



WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset

Luyu Wang, Yujia Li, Ozlem Aslan, Oriol Vinyals


[WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset](https://aclanthology.org/2021.textgraphs-1.7) (Wang et al., TextGraphs 2021)

  • WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset (Wang et al., TextGraphs 2021)
  • Luyu Wang, Yujia Li, Ozlem Aslan, and Oriol Vinyals. 2021. WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset . In Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15) , pages 67–82, Mexico City, Mexico. Association for Computational Linguistics.


Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.

The dataset is distributed as a knowledge graph, a corpus, and aliases. We provide both the transductive and inductive data splits used in the original paper.

Setting       Split   #Entity     #Relation   #Triplet
Transductive  Train   4,594,485   822         20,614,279
              Valid   4,594,485   822         5,163
              Test    4,594,485   822         5,133
Inductive     Train   4,579,609   822         20,496,514
              Valid   7,374       199         6,699
              Test    7,475       201         6,894
  • Knowledge graph: Transductive split, 160 MB. Inductive split, 160 MB. Raw, 168 MB.
  • Corpus, 991 MB.
  • Entity & relation aliases, 188 MB.

The raw knowledge graph may also contain entities that do not have corresponding Wikipedia pages.

Wikidata5m follows the identifier system used in Wikidata. Each entity and relation is identified by a unique ID. Entities are prefixed by Q , while relations are prefixed by P .

Knowledge Graph

The knowledge graph is stored in the triplet list format; for example, one line corresponds to the triple <Donald Trump, position held, President of the United States>.

Corpus

Each line in the corpus is a document, indexed by entity ID; for example, one line holds the description of Donald Trump.

Aliases

Each line in the alias file lists the aliases for one entity or relation; for example, one line gives the aliases of Donald Trump.
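
As a reading aid, the sketch below shows how the three files might be consumed, assuming plain-text, tab-separated lines as in the commonly distributed archives; the file name in the usage comment is a placeholder.

    # Minimal sketch for reading the three Wikidata5m files, assuming
    # tab-separated plain-text lines. File names are placeholders; adjust
    # them to the archives you actually downloaded.

    def read_triplets(path):
        """Yield (head_qid, relation_pid, tail_qid) from the knowledge-graph file."""
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                head, rel, tail = line.rstrip("\n").split("\t")
                yield head, rel, tail

    def read_corpus(path):
        """Yield (entity_qid, description_text) from the corpus file."""
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                qid, _, text = line.rstrip("\n").partition("\t")
                yield qid, text

    def read_aliases(path):
        """Yield (id, [aliases...]) from the entity/relation alias file."""
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                ident, *aliases = line.rstrip("\n").split("\t")
                yield ident, aliases

    # Example usage (placeholder file name):
    # for head, rel, tail in read_triplets("wikidata5m_transductive_train.txt"):
    #     ...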

Publications

  • KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, Jian Tang. TACL 2021. arXiv BibTeX


WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

google-research-datasets/wit


WIT: Wikipedia-based Image Text Dataset

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

Key Advantages

A few unique advantages of WIT:

  • The largest multimodal dataset (publicly available at the time of this writing) by the number of image-text examples.
  • A massively multilingual dataset (first of its kind) with coverage for 108 languages.
  • First image-text dataset with page-level metadata and contextual information.
  • A collection of a diverse set of concepts and real-world entities.
  • Brings forth challenging real-world test sets.

You can learn more about WIT Dataset from our arXiv paper .

Latest Updates

2021 April: Happy to share the good news that our paper got accepted at SIGIR Conference. From the ACM site, you can find our paper, slides and presentation.

2021 September: WIT Image-Text Competition is live on Kaggle. Our collaborators from Wikimedia Research blogged about this and have made available the raw pixels and resnet50 embeddings for the images in this set. Here is our Google AI blog post.

2022 April: We are happy to share that the WIT paper and dataset were awarded the Wikimedia Foundation's Research Award of the Year (tweet 1, tweet 2). We are deeply honored and thank you for the recognition.

2022 May: We have released the WIT validation set and test set. Please see the data page for download links.

2022 Oct: Authoring Tools for Multimedia Content proposal accepted at TREC 2023

2023 Apr: AToMiC accepted at SIGIR 2023.

2023 Apr: WikiWeb2M Dataset released.

2023 May: Accepted submissions at WikiWorkshop 2023.

  • WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset (pdf, arXiv)
  • Building Authoring Tools for Multimedia Content with Human-in-the-loop Relevance Annotations (pdf)
  • Characterizing Image Accessibility on Wikipedia across Languages (pdf)

WIT Example

Wikipedia Page

For example, let's take the Wikipedia page for Half Dome, Yosemite in CA.

WIT Wikipedia Half Dome Image

From the Wikipedia page for Half Dome: Photo by DAVID ILIFF. License: CC BY-SA 3.0

Wikipedia Page with Annotations of what we can extract

From this page, we highlight the various key pieces of data that we can extract - images, their respective text snippets and some contextual metadata.

WIT Half Dome Page with Annotations

By extracting and filtering these carefully, we get a clean, high quality image-text example that can be used in multimodal modeling.

Multimodal visio-linguistic models rely on a rich dataset to help them learn to model the relationship between images and texts. Having large image-text datasets can significantly improve performance, as shown by recent works. Furthermore, the lack of language coverage in existing datasets (which are mostly only in English) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve our multilingual textual understanding.

To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.

The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

WIT: Dataset Numbers

Type            Train    Val      Test     Total / Unique
Rows / Tuples   37.13M   261.8K   210.7K   37.6M
Unique Images   11.4M    58K      57K      11.5M
Ref. Text       16.9M    150K     104K     17.2M / 16.7M
Attr. Text      34.8M    193K     200K     35.2M / 10.9M
Alt Text        5.3M     29K      29K      5.4M / 5.3M
Context Texts   -        -        -        119.8M

WIT: Image-Text Stats by Language

Languages by number of image-text pairs and by number of unique images:

  • total > 1M image-text pairs: 9 languages; > 1M unique images: 6 languages
  • total > 500K image-text pairs: 10 languages; > 500K unique images: 12 languages
  • total > 100K image-text pairs: 36 languages; > 100K unique images: 35 languages
  • total > 50K image-text pairs: 15 languages; > 50K unique images: 17 languages
  • total > 14K image-text pairs: 38 languages; > 13K unique images: 38 languages

We believe that such a powerful diverse dataset will aid researchers in building better multimodal multilingual models and in identifying better learning and representation techniques leading to improvement of Machine Learning models in real-world tasks over visio-linguistic data.

WIT Dataset is now available for download. Please check the data page.
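
As a starting point for working with a downloaded shard, here is a small, unofficial sketch using pandas. The shard file name is a placeholder, and the language column used in the filter is an assumption taken from the WIT data description; check the columns of the release you actually downloaded.

    # Minimal sketch: load one downloaded WIT shard (gzip-compressed TSV) with
    # pandas and peek at its columns before filtering.
    import pandas as pd

    SHARD = "wit_train_shard.tsv.gz"  # placeholder shard file name

    df = pd.read_csv(SHARD, sep="\t", compression="gzip")
    print(df.shape)
    print(list(df.columns))  # inspect the actual column names of your release

    # Keep only English rows, assuming a 'language' column is present.
    if "language" in df.columns:
        en = df[df["language"] == "en"]
        print(len(en), "English image-text rows in this shard")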

If you use the WIT dataset, you can cite our work as follows.

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Projects using WIT

  • MURAL (Multimodal, Multitask Retrieval Across Languages), a paper accepted at EMNLP 2021.

For any questions, please contact [email protected] .

If WIT dataset is useful to you, please do write to us about it. Be it a blog post, a research project or a paper, we are delighted to learn about it.

WikiQA (Wikipedia open-domain question answering)


The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. In order to reflect the true information need of general users, Bing query logs were used as the question source. Each question is linked to a Wikipedia page that potentially has the answer. Because the summary section of a Wikipedia page provides the basic and usually most important information about the topic, sentences in this section were used as the candidate answers. The corpus includes 3,047 questions and 29,258 sentences, where 1,473 sentences were labeled as answer sentences to their corresponding questions.
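
A convenient way to experiment with the corpus is through the Hugging Face datasets library. The sketch below is an assumption-laden example: the Hub id microsoft/wiki_qa and the field names question, answer and label are taken from the common hosting of this corpus and should be checked against the dataset card.

    # Minimal sketch: load WikiQA via the Hugging Face "datasets" library.
    # The Hub id and field names below are assumptions; verify them first.
    from datasets import load_dataset

    wikiqa = load_dataset("microsoft/wiki_qa")    # splits: train / validation / test
    example = wikiqa["train"][0]
    print(example["question"])                    # the Bing-derived question
    print(example["answer"])                      # a candidate sentence from Wikipedia
    print(example["label"])                       # 1 if the sentence answers the question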



Similar Datasets

  • InsuranceQA

IMAGES

  1. GitHub
  2. Wikipedia Summary Dataset
  3. Wikipedia Knowledge Graph dataset
  4. Diagram of files and relationships of the Wikipedia knowledge graph
  5. Summary of the wikipedia dataset
  6. Examples of images and corresponding texts on Wikipedia dataset

COMMENTS

  1. GitHub

    This dataset gathers 728,321 biographies from wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized).

  2. legacy-datasets/wikipedia · Datasets at Hugging Face

    pip install mwparserfromhell. Then, you can load any subset of Wikipedia per language and per date this way: from datasets import load_dataset. load_dataset("wikipedia", language="sw", date="20220120") You can specify num_proc= in load_dataset to generate the dataset in parallel. You can find the full list of languages and dates here.

  3. WikiBio (Wikipedia Biography Dataset)

    This dataset gathers 728,321 biographies from English Wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized).

  4. Wikipedia:Database download

    Start downloading a Wikipedia database dump file such as an English Wikipedia dump. It is best to use a download manager such as GetRight so you can resume downloading the file even if your computer crashes or is shut down during the download. Download XAMPPLITE from [2] (you must get the 1.5.0 version for it to work).

  5. michaelauli/wiki_bio · Datasets at Hugging Face

    Dataset Summary: This dataset contains 728,321 biographies extracted from Wikipedia, containing the first paragraph of the biography and the tabular infobox.

  6. Wikipedia-biography-dataset

    Wikipedia-biography-dataset : This dataset gathers 728,321 biographies from wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized).

  7. wiki_bio

    WikiBio is constructed using Wikipedia biography pages; it contains the first paragraph and the infobox, both tokenized. The dataset follows a standardized table format.

  8. wikipedia-biography-dataset/wikipedia-biography-dataset.z00 at master

    This dataset gathers 728,321 biographies from wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized).

  9. Wikidata

    Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. Wikidata also provides support to many other sites and services beyond just Wikimedia ...

  10. List of datasets for machine-learning research

    Machine learning and data mining. These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less ...

  11. README.md · legacy-datasets/wikipedia at main

    Then, you can load any subset of Wikipedia per language and per date this way: from datasets import load_dataset. load_dataset("wikipedia", language="sw", date="20220120") You can specify num_proc= in load_dataset to generate the dataset in parallel. You can find the full list of languages and dates here.

  12. wikipedia-biography-dataset/README.md at master

    This dataset gathers 728,321 biographies from wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized). It was used in our work, Neural Text Generation from Structured Data with Application to the Biography Domain. Rémi Lebret, David Grangier and Michael Auli ...

  13. wikipedia

    wikipedia/20230601.ace Config description: Wikipedia dataset for ace, parsed from 20230601 dump.

  14. Generating Wikipedia Article Sections from Diverse Data Sources (PDF)

    WIKIBIO (Lebret et al., 2016) is a biography dataset that pairs Wikipedia infoboxes with initial sentences in corresponding Wikipedia articles. Similarly, Vougiouklis et al. (2017) create a biography dataset from Wikipedia using the first two sentences in Wikipedia articles and the aligned data triples from DBpedia and Wikidata.

  15. wikipedia-biography-dataset.z15

    This dataset gathers 728,321 biographies from wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized).

  16. Datasets

    Datasets. From Meta, a Wikimedia project coordination wiki. Various places that have Wikimedia datasets, and tools for working with them. Also, you can now store table and maps data using Commons Datasets, and use them from all wikis from Lua and Graphs.

  17. UAH satellite temperature dataset

    The UAH satellite temperature dataset, developed at the University of Alabama in Huntsville, infers the temperature of various atmospheric layers from satellite measurements of the oxygen radiance in the microwave band, using Microwave Sounding Unit temperature measurements. It was the first global temperature dataset developed from satellite information and has been used as a tool for ...

  18. WikiText-2 Dataset

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText ...

  19. WikiGraphs: A Wikipedia Text

    We present a new dataset of Wikipedia articles each paired with a knowledge graph, to facilitate the research in conditional text generation, graph generation and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (1 or few sentences), thus limiting the capabilities of the models ...

  20. Wikidata5m

    Publications. KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, Jian Tang. TACL 2021. arXiv BibTeX. Project page of MilaGraph group.

  21. WIT : Wikipedia-based Image Text Dataset

    WIT : Wikipedia-based Image Text Dataset Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

  22. Wiki-en Dataset

    Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia.

  23. WikiQA Dataset

    The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. In order to reflect the true information need of general users, Bing query logs were used as the question source. Each question is linked to a Wikipedia page that potentially has the answer. Because the summary section of a Wikipedia page ...