A large scale dataset for Question Answering in Italian
SQuAD [Rajpurkar et al. 2016] is a large-scale dataset for training question answering systems on factoid questions.
It contains more than 100,000 question-answer pairs about passages from 536 Wikipedia articles covering various domains.
SQuAD-it is derived from the SQuAD dataset through its semi-automatic translation into Italian. It represents a large-scale dataset for open question answering on factoid questions in Italian, and contains more than 60,000 question/answer pairs derived from the original English material. The dataset is split into a training set and a test set to support replicable benchmarking of QA systems:
- SQuAD_it-train.json: contains the training examples derived from the original SQuAD 1.1 training material.
- SQuAD_it-test.json: contains the test/benchmarking examples derived from the original SQuAD 1.1 development material.
More details about SQuAD-it can be found in [Croce et al. 2018] (see the References section below).
How to cite SQuAD-it
If you find SQuAD-it useful for your research, please cite the following paper:
```
@InProceedings{10.1007/978-3-030-03840-3_29,
  author    = "Croce, Danilo and Zelenanska, Alexandra and Basili, Roberto",
  editor    = "Ghidini, Chiara and Magnini, Bernardo and Passerini, Andrea and Traverso, Paolo",
  title     = "Neural Learning for Question Answering in Italian",
  booktitle = "AI*IA 2018 -- Advances in Artificial Intelligence",
  year      = "2018",
  publisher = "Springer International Publishing",
  address   = "Cham",
  pages     = "389--402",
  isbn      = "978-3-030-03840-3"
}
```
Download
To download SQuAD-it, please refer to https://github.com/crux82/squad-it, where a copy of the dataset (distributed under the CC BY-SA 4.0 license) can be downloaded.
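The two splits can also be fetched programmatically. The sketch below assumes the files are published as gzipped JSON under the repository's raw URL, with the file names listed above; check the repository README for the authoritative links.

```python
import gzip
import shutil
import urllib.request

# ASSUMPTION: the splits are published as gzipped JSON under the raw URL of
# https://github.com/crux82/squad-it; verify the links in the repository README.
BASE_URL = "https://github.com/crux82/squad-it/raw/master"

for name in ("SQuAD_it-train.json", "SQuAD_it-test.json"):
    archive = name + ".gz"
    urllib.request.urlretrieve(f"{BASE_URL}/{archive}", archive)
    # Decompress the archive to a plain JSON file.
    with gzip.open(archive, "rb") as src, open(name, "wb") as dst:
        shutil.copyfileobj(src, dst)
    print(f"downloaded and extracted {name}")
```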
Release format
The same format used in the SQuAD dataset is adopted here: both the train and test files are released in the following JSON format:
```
file.json
├── "data"
│   └── [i]
│       ├── "paragraphs"
│       │   └── [j]
│       │       ├── "context": "paragraph text"
│       │       └── "qas"
│       │           └── [k]
│       │               ├── "answers"
│       │               │   └── [l]
│       │               │       ├── "answer_start": N
│       │               │       └── "text": "answer"
│       │               ├── "id": "<uuid>"
│       │               └── "question": "paragraph question?"
│       └── "title": "document id"
└── "version": 1.1
```
The Wikipedia articles are split into paragraphs, and one or more questions are asked about each paragraph. The training set contains exactly one answer per question, while the test set contains one or more answers per question. One of the most important properties of the dataset is that each answer is a literal span of its paragraph, so it can be recovered from the character offset answer_start alone, i.e., context[answer_start : answer_start + len(text)].
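The structure above can be traversed with the standard json module. The following minimal sketch (assuming the released file name SQuAD_it-train.json) walks the nested structure and checks the span property just described:

```python
import json

# Load the training split (file name as released above).
with open("SQuAD_it-train.json", encoding="utf-8") as f:
    squad_it = json.load(f)

# Walk the nested structure: data -> paragraphs -> qas -> answers.
for article in squad_it["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                # Each answer is a literal span of its paragraph,
                # recoverable from the character offset alone.
                assert context[start:start + len(answer["text"])] == answer["text"]
```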
Size of SQuAD-it
The original SQuAD dataset contains the following elements:
| Element | Training Set | Dev Set |
|---|---|---|
| Paragraphs | 18 896 | 2 067 |
| Questions | 87 599 | 10 570 |
| Answers | 87 599 | 34 726 |
Automatic translation yielded nearly 90% of the paragraph-question-answer triplets; the rest could not be translated due to unknown characters or low lexical coverage of the translation system. In many cases, after translation the answer was no longer part of the translated paragraph, which led to discarding those translations and left approximately 45% of the triplets w.r.t. the original dataset.
Further corrections were then applied: the word order of the answer text was changed, different morphological forms of its words were searched for, and so on; whenever such a modified answer was found in the paragraph, it was substituted for the original. This increased the translated triplets to nearly 62% of the original.
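As an illustration of the word-order correction, the hypothetical helper below (a sketch for this write-up, not the authors' actual pipeline) tries token reorderings of a translated answer and keeps the first one that occurs verbatim in the translated paragraph:

```python
from itertools import permutations
from typing import Optional

def recover_answer(answer: str, context: str, max_tokens: int = 5) -> Optional[str]:
    """Return the first token reordering of `answer` that occurs verbatim
    in `context`, or None.

    Hypothetical helper for illustration only: the actual correction also
    searched for different morphological forms of the answer words.
    """
    tokens = answer.split()
    if not tokens or len(tokens) > max_tokens:
        return None  # permutations grow factorially; cap the search
    for candidate in (" ".join(p) for p in permutations(tokens)):
        if candidate in context:
            return candidate
    return None

# Example: a reordering is found even though the literal translated
# answer ("isole Marshall le") is not a span of the paragraph.
print(recover_answer("isole Marshall le",
                     "... le isole Marshall sono un gruppo di atolli ..."))
```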
The final SQuAD-it contains the following elements:
| Element | Training Set IT | Perc. w.r.t. SQuAD | Test Set IT | Perc. w.r.t. SQuAD |
|---|---|---|---|---|
| Paragraphs | 18 506 | 97.9% | 2 010 | 97.2% |
| Questions | 54 159 | 61.8% | 7 609 | 72.0% |
| Answers | 54 159 | 61.8% | 21 489 | 61.9% |
Evaluating a Neural Model over SQuAD-it
The DrQA system [Chen et al. 2017] was adapted to the Italian language and modified so that its performance could be measured on the SQuAD-it dataset. The evaluation script and the description of the measures used (Exact Match and Macro F1) are available on the original SQuAD website, https://rajpurkar.github.io/SQuAD-explorer/ (version v1.0 was used). The results below were obtained on the test set.
| | EM | F1 |
|---|---|---|
| Results over the Test Set | 56.1 | 65.9 |
More details can be found in [Croce et al. 2018].
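For readers who want a concrete feel for the two measures, here is a simplified sketch of Exact Match and token-level F1 over normalized strings. The official evaluation script linked above is the authoritative implementation: among other things, it also removes articles during normalization and, when a question has multiple gold answers, takes the best score per question before averaging.

```python
import string
from collections import Counter

def normalize(text: str) -> str:
    # Simplified normalization: lowercase, strip ASCII punctuation,
    # collapse whitespace (the official script also removes articles).
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    # EM is 1 if the normalized strings are identical, 0 otherwise.
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over
    # the multiset of tokens shared by prediction and gold answer.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("il Pacifico", "Pacifico"))  # 0.0
print(f1("il Pacifico", "Pacifico"))           # 0.666...
```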
Examples
A few examples of the answers given by DrQA-it trained over SQuAD-it are reported below.
- The answer to the question “Dove si trovano le Isole Marshall?” is correct: “Pacifico”.
- Question: Dove si trovano le Isole Marshall?
- Answer: Pacifico
- Sentence: Facente parte della Micronesia, le Isole Marshall sono un gruppo di atolli e isole situate nel Pacifico, poco a nord dell’equatore.
- Wikipedia Page: Isole Marshall
- In this case, the answer to the question “Quali sono gli stati con cui confina l’Austria?” is “Italia, Germania e Slovenia”, which is only partially correct (Austria borders eight countries).
- Question: Quali sono gli stati con cui confina l’Austria?
- Answer: Italia, Germania e Slovenia
- Sentence: Ma quattro direttrici nord-sud che collegano il paese con i principali stati confinanti (Italia, Germania E Slovenia).
- Wikipedia Page: Austria
- The answer to the question “Dove è nato Virgilio?” is “Ulassai”, which is correct if it refers to Virgilio Lai, an Italian photographer, even though most people would be referring to the more famous Publius Vergilius Maro (Publio Virgilio Marone, or simply Virgilio in Italian), born near Mantova in 70 B.C.
- Question: Dove è nato Virgilio?
- Answer: Ulassai
- Sentence: Virgilio Lai (Ulassai, 1º Agosto 1926- Cagliari, 6 Aprile 2009) è fotografo del 1900 italiano.
- Wikipedia Page: Virgilio Lai
References
[Rajpurkar et al. 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), November 1-5, 2016, Austin, Texas, USA.
[Chen et al. 2017] Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2017, Vancouver.
[Croce et al. 2018] Danilo Croce, Alexandra Zelenanska, Roberto Basili. Neural Learning for Question Answering in Italian. In AI*IA 2018 - Advances in Artificial Intelligence, 2018, Trento, pages 389-402.
Contacts
For any questions or suggestions, please send an e-mail to croce@info.uniroma2.it.