Resources 

Here we present the datasets and linguistic resources developed and curated by our research group, with a strong focus on enriching AI capabilities in Italian. These resources cover a wide range of tasks, including question answering, fact verification, image and video captioning, and visual question answering. Many of them are the first large-scale benchmarks of their kind for the Italian language, developed through a combination of machine translation, manual validation, and semantic annotation. These efforts aim to empower the Italian AI ecosystem by providing high-quality, publicly available data for training and evaluating models in diverse linguistic and multimodal contexts.

SQuAD-it – Italian Question Answering Dataset

GitHub: https://github.com/crux82/squad-it

Description: SQuAD-it is a large-scale dataset for open-domain question answering in Italian, derived from the original English SQuAD dataset through semi-automatic translation. It contains over 60,000 question-answer pairs and is split into training and test sets. This dataset supports the development and benchmarking of question answering systems in the Italian language. 
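
As a rough illustration of how a dataset of this kind might be consumed, the sketch below walks a SQuAD-style structure and collects question-answer pairs. It assumes that SQuAD-it keeps the nested layout of the original English SQuAD v1.1 files (data, paragraphs, qas, answers), which should be checked against the repository. The inline sample record and the function name iter_qa_pairs are hypothetical and exist only for this example.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical miniature record in the SQuAD v1.1 layout.
# This is an assumed structure for illustration, not data taken from SQuAD-it.
SAMPLE = {
    "data": [
        {
            "title": "Roma",
            "paragraphs": [
                {
                    "context": "Roma è la capitale d'Italia.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Qual è la capitale d'Italia?",
                            "answers": [{"text": "Roma", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ]
}


def iter_qa_pairs(dataset: Dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (question, answer_text, context) for every answer in a SQuAD-style dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield qa["question"], answer["text"], context


pairs = list(iter_qa_pairs(SAMPLE))
```

With a real file, the dictionary would come from parsing the downloaded JSON instead of the inline sample; the traversal itself would stay the same if the layout assumption holds.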

FEVER-it – Italian Fact Verification Dataset

GitHub: https://github.com/crux82/FEVER-it

Description: FEVER-it is a large-scale dataset designed for training and evaluating fact verification systems in Italian. Derived from the English FEVER dataset, it comprises over 180,000 claims manually verified against Wikipedia, annotated with labels indicating whether the evidence supports, refutes, or provides insufficient information about the claim. This resource facilitates the development of robust fact-checking models tailored to the Italian language.

MSCOCO-it – Italian Image Captioning Dataset

GitHub: https://github.com/crux82/mscoco-it

Description: MSCOCO-it is a large-scale dataset for image captioning in Italian, derived from the MSCOCO dataset. It includes over 600,000 image-caption pairs obtained through semi-automatic translation. The dataset comprises training and validation subsets, with each image annotated with five human-written captions. This resource facilitates the training of image captioning systems in the Italian language. 

MSR-VTT-it – Italian Video Captioning Dataset

GitHub: https://github.com/crux82/msr-vtt-it

Description: MSR-VTT-it is a large-scale dataset for video captioning in Italian. It contains 200,000 video-caption pairs derived from the original English MSR-VTT dataset through semi-automatic translation. The dataset includes 10,000 web video clips, each annotated with 20 human-written captions, and is split into training (6,513 videos), validation (497 videos), and testing (2,990 videos) sets. This resource supports the development of video understanding systems in the Italian language.
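
The figures in this description can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above and assumes that every clip carries exactly 20 captions; the per-split pair counts it derives are a consequence of that assumption, not figures reported by the dataset authors.

```python
# Figures as stated in the description above.
NUM_VIDEOS = 10_000
CAPTIONS_PER_VIDEO = 20  # assumed to hold for every clip
SPLITS = {"train": 6_513, "validation": 497, "test": 2_990}

# The three splits should account for every video exactly once.
assert sum(SPLITS.values()) == NUM_VIDEOS

# Total and per-split video-caption pairs under the 20-captions-per-clip assumption.
total_pairs = NUM_VIDEOS * CAPTIONS_PER_VIDEO
pairs_per_split = {name: count * CAPTIONS_PER_VIDEO for name, count in SPLITS.items()}
```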

HuRIC – Human-Robot Interaction Corpus

GitHub: https://github.com/crux82/huric

Description: HuRIC is a corpus designed for Human-Robot Interaction in natural language, focusing on commands directed at robots in domestic environments. It comprises audio files paired with their transcriptions, annotated with morphological, syntactic, and rich semantic information based on Frame Semantics and Spatial Semantics. This resource aids in developing systems capable of understanding and executing natural language instructions in robotic applications.

GQA-it – Italian Visual Question Answering Dataset

GitHub: https://github.com/crux82/gqa-it

Description: GQA-it is a large-scale Italian dataset for Visual Question Answering, based on the balanced version of the GQA dataset. It contains more than 1 million Italian question-answer pairs across 80,000 images, obtained through Neural Machine Translation. A test set of 3,000 question-answer pairs has been manually validated to provide a reliable benchmark. This dataset enables the development of VQA systems tailored to the Italian language.