We evaluate the ACD and ACP subtasks separately by comparing the classifications provided by the participant systems to the gold standard annotations of the test set.
For the ACD task, we compute Precision, Recall and F-score, defined as
\[
F_1 = \frac{2 \cdot P \cdot R}{P + R},
\]
where Precision ($P$) and Recall ($R$) are defined as
\[
P = \frac{|S \cap G|}{|S|}, \qquad R = \frac{|S \cap G|}{|G|}.
\]
Here $S$ is the set of aspect category annotations that a system returned for all the test sentences, and $G$ is the set of the gold (correct) aspect category annotations.
For instance, if a review is labeled in the gold standard with the two aspects
\[
G = \{a_1, a_2\},
\]
and the system predicts the two aspects
\[
S = \{a_1, a_3\},
\]
we have that $|S \cap G| = 1$, $|S| = 2$ and $|G| = 2$, so that $P = 0.5$, $R = 0.5$ and $F_1 = 0.5$.
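The following minimal sketch shows this micro-averaged computation in Python; the function name \texttt{micro\_prf} and the representation of annotations as sets of (sentence\_id, label) tuples are illustrative assumptions, not part of the official evaluation tools.

\begin{verbatim}
def micro_prf(system, gold):
    """Micro-averaged Precision, Recall and F-score over annotation sets."""
    tp = len(system & gold)                     # size of the intersection S, G
    p = tp / len(system) if system else 0.0     # P = |S intersect G| / |S|
    r = tp / len(gold) if gold else 0.0         # R = |S intersect G| / |G|
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # F1 = 2PR / (P + R)
    return p, r, f1

# The ACD example above: gold {a1, a2}, predicted {a1, a3}.
gold = {(1, "a1"), (1, "a2")}
system = {(1, "a1"), (1, "a3")}
print(micro_prf(system, gold))  # (0.5, 0.5, 0.5)
\end{verbatim}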
For the ACD task, the baseline will be computed by a system that assigns the most frequent aspect category (estimated over the training set) to each sentence.
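Under the same illustrative representation, this baseline can be sketched as below; the helper name \texttt{most\_frequent\_baseline} is again an assumption. The same code covers the ACP baseline described later, where the labels are (aspect category, polarity) pairs instead of bare categories.

\begin{verbatim}
from collections import Counter

def most_frequent_baseline(train_labels, test_sentence_ids):
    """Assign the most frequent training label to every test sentence.

    train_labels: gold labels seen in training (aspect categories for
    ACD, or (aspect, polarity) pairs for ACP).
    """
    top = Counter(train_labels).most_common(1)[0][0]
    return {(sid, top) for sid in test_sentence_ids}

# e.g. most_frequent_baseline(["a1", "a1", "a2"], [1, 2])
# -> {(1, "a1"), (2, "a1")}
\end{verbatim}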
For the ACP task we will evaluate the entire chain, thus considering the aspect categories detected in the sentences together with their corresponding polarities, in the form of (aspect category, polarity) pairs.
We again compute Precision, Recall and F-score, now defined as
\[
F_1 = \frac{2 \cdot P \cdot R}{P + R}.
\]
Precision ($P$) and Recall ($R$) are defined as
\[
P = \frac{|S \cap G|}{|S|}, \qquad R = \frac{|S \cap G|}{|G|},
\]
where $S$ is the set of (aspect category, polarity) pairs that a system returned for all the test sentences, and $G$ is the set of the gold (correct) pairs.
For instance, if a review is labeled in the gold standard with the two pairs
\[
G = \{(a_1, \textit{positive}), (a_2, \textit{negative})\},
\]
and the system predicts the three pairs
\[
S = \{(a_1, \textit{positive}), (a_2, \textit{positive}), (a_3, \textit{negative})\},
\]
we have that $|S \cap G| = 1$, $|S| = 3$ and $|G| = 2$, so that $P = 0.33$, $R = 0.5$ and $F_1 = 0.4$.
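The \texttt{micro\_prf} sketch given earlier reproduces this computation unchanged when the labels are pairs (again with placeholder aspect names):

\begin{verbatim}
gold = {(1, ("a1", "positive")), (1, ("a2", "negative"))}
system = {(1, ("a1", "positive")), (1, ("a2", "positive")),
          (1, ("a3", "negative"))}
print(micro_prf(system, gold))  # (0.333..., 0.5, 0.4)
\end{verbatim}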
For the ACP task, the baseline will be computed by a system that assigns the most frequent (aspect category, polarity) pair (estimated over the training set) to each sentence.
We will produce separate rankings for the two tasks, based on the $F_1$ scores. Participants who submit results for the ACD task only will appear in the first ranking only.