Context-based Corpus for Sentiment Analysis in Twitter

The Context-based Corpus for Sentiment Analysis in Twitter is a collection of Twitter messages annotated with classes reflecting their underlying polarity. The corpus was built starting from the training and development sets of SemEval 2013 Task 2 (Sentiment Analysis in Twitter) and from the Evalita 2014 Sentipolc dataset, and it was used for the experimental evaluations presented in (Vanzo et al., 2014) and (Castellucci et al., 2015). Unlike existing corpora for Sentiment Analysis, all messages are provided here together with their contexts.

The basic idea behind the corpus construction is that the sentiment expressed by a tweet can be influenced by the opinions expressed in the context in which it appears. This context can be explicit, given by the Twitter conversation to which a tweet belongs. Otherwise, the context can be implicit, i.e. the tweet expresses an opinion about a topic already discussed by other users. These two settings are reflected in the resource by two types of contextual information: conversational and topical. The former is based on the reply-to chain: the context is built by following the reply information available for Twitter messages. The latter relies on hashtags, which make it possible to aggregate different tweets around user-specified topics; the context of a tweet thus includes preceding messages that share at least one hashtag with it.

How to get the corpus

To download the corpus, please write an email to castellucci@ing.uniroma2.it or croce@info.uniroma2.it

Annotations

Following the SemEval-2013 Sentiment Analysis in Twitter task, tweets are labeled with one of the positive, negative or neutral classes.
The corpus has been annotated with different methods, depending on the source of the tweet. Messages originally belonging to the SemEval-2013 dataset inherit the gold annotations, and they are also the target tweets for the evaluation. Contextual tweets, instead, have been automatically annotated through an SVM multi-classifier trained over the official SemEval-2013 training dataset, as discussed in (Vanzo et al., 2014) and (Castellucci et al., 2015).
The Italian dataset has been gathered and annotated similarly, after filtering out the messages expressing irony, which was manually labeled in the original Sentipolc dataset.

Corpus Details

The corpus is organized in six datasets. Each dataset is composed of target tweets and their contexts. Conversational and topical contexts are separated by an empty line. Each tweet is identified by the original Twitter ID (column 1).

Column 2 contains the sentiment polarity label (positive, negative or neutral). Column 3 specifies whether the message has been labeled manually (tag goldStandard, as it belongs to the SemEval-2013 dataset) or through the multi-classifier (tag classifier). Columns are separated by a [TAB] character. Some tweets do not have a context; in that case, a single line represents the target tweet.
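The three-column layout described above can be read with a few lines of Python. The following is only a minimal sketch, not an official loader. It assumes that each target tweet and its context form a block of consecutive lines, that blocks are separated by an empty line, and that the target tweet of a block is the one tagged goldStandard. The names Tweet, Block and parse_corpus, as well as the IDs and labels in SAMPLE, are hypothetical and do not come from the corpus.

```python
from typing import List, NamedTuple, Optional


class Tweet(NamedTuple):
    tweet_id: str  # column 1: original Twitter ID
    label: str     # column 2: positive, negative or neutral
    source: str    # column 3: goldStandard or classifier


class Block(NamedTuple):
    target: Optional[Tweet]  # manually labeled tweet of the block, if any
    context: List[Tweet]     # remaining tweets of the block


def parse_corpus(text: str) -> List[Block]:
    """Split the text into blank-line-separated blocks of TAB-separated rows."""
    blocks: List[Block] = []
    current: List[Tweet] = []
    # The extra empty string flushes the last block at end of input.
    for line in text.splitlines() + [""]:
        if line.strip():
            tweet_id, label, source = [f.strip() for f in line.split("\t")[:3]]
            current.append(Tweet(tweet_id, label, source))
        elif current:
            # Assumption: the goldStandard row is the target tweet.
            target = next((t for t in current if t.source == "goldStandard"), None)
            context = [t for t in current if t is not target]
            blocks.append(Block(target, context))
            current = []
    return blocks


# Invented sample rows, used only to illustrate the layout.
SAMPLE = (
    "1001\tneutral\tclassifier\n"
    "1002\tpositive\tclassifier\n"
    "1003\tpositive\tgoldStandard\n"
    "\n"
    "2001\tnegative\tgoldStandard\n"
)

blocks = parse_corpus(SAMPLE)
for block in blocks:
    print(block.target.tweet_id, block.target.label, len(block.context))
```

In this sketch a tweet without context simply yields a block whose context list is empty, which matches the single-line case described above.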

For instance, the following fragments

show three target tweets (100001727910121473, 100002367457603584 and 100002521464053761), where tweets 100001727910121473 and 100002367457603584 are supplied with their contexts of size 3 and 2, respectively, while the tweet 100002521464053761 has no context available.

The downloadable zip file contains three folders, each containing six files, corresponding to the datasets adopted in (Castellucci et al., 2015) (IJCOL2015_*) and (Vanzo et al., 2014) (COLING2014):

  • conv_training_set.tsv collects the training tweets and their conversation context;
  • conv_development_set.tsv collects the development tweets and their conversation context;
  • conv_testing_set.tsv collects the test tweets and their conversation context;
  • hash_training_set.tsv collects the training tweets and their hashtag context;
  • hash_development_set.tsv collects the development tweets and their hashtag context;
  • hash_testing_set.tsv collects the test tweets and their hashtag context.

Note that the COLING2014 dataset only contains English data (derived from the SemEval 2013 Twitter corpus). The IJCOL2015_* dataset is composed of the English data (IJCOL2015_ENG) and the Italian data (IJCOL2015_ITA). The English data consist of the same tweets as the COLING2014 dataset, but they have been re-labeled with an improved version of the multi-classifier. For the Italian dataset (IJCOL2015_ITA), only a training and a test split are provided.
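After unpacking the archive, the file naming scheme can be used to check which files are actually present. The sketch below is illustrative only: it assumes that the three folders are named COLING2014, IJCOL2015_ENG and IJCOL2015_ITA and sit directly under the extraction directory, which should be verified against the downloaded zip. The function names and the "corpus" directory are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

CONTEXTS = ("conv", "hash")                      # conversation and hashtag contexts
SPLITS = ("training", "development", "testing")  # split names used in the file names
# Assumed folder names, derived from the dataset labels in the text.
FOLDERS = ("COLING2014", "IJCOL2015_ENG", "IJCOL2015_ITA")


def expected_files(root: str) -> Dict[str, List[Path]]:
    """Build the <context>_<split>_set.tsv paths for every folder."""
    return {
        folder: [
            Path(root) / folder / f"{context}_{split}_set.tsv"
            for context in CONTEXTS
            for split in SPLITS
        ]
        for folder in FOLDERS
    }


def missing_files(root: str) -> List[Path]:
    """List the expected paths that do not exist under root."""
    return [
        path
        for paths in expected_files(root).values()
        for path in paths
        if not path.exists()
    ]


for path in missing_files("corpus"):
    print("not found:", path)
```

Since only a training and a test split are provided for the Italian data, development files reported as missing for that folder are expected rather than a sign of a broken download.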

Reference Publications
Giuseppe Castellucci, Andrea Vanzo, Danilo Croce, Roberto Basili (2015): Context-aware Models for Twitter Sentiment Analysis. In: IJCoL vol. 1, n. 1, December 2015: Emerging Topics at the First Italian Conference on Computational Linguistics, pp. 69, 2015.

Andrea Vanzo, Danilo Croce, Roberto Basili (2014): A context based model for Sentiment Analysis in Twitter. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, pp. 2345–2354, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 2014.

Andrea Vanzo, Giuseppe Castellucci, Danilo Croce, Roberto Basili (2014): A context based model for Sentiment Analysis in Twitter for the Italian Language. In: First Italian Conference on Computational Linguistics CLiC-it, Pisa, Italy, 2014.

If you use the IJCOL2015 version of the corpus, please cite (Castellucci et al., 2015).

Otherwise, if you use the COLING2014 corpus, please cite the COLING 2014 paper (Vanzo et al., 2014).

Contacts

For more information about the corpus, please contact: