# Evaluation

We evaluate the ACD and ACP subtasks separately by comparing the classifications provided by the participant systems to the gold standard annotations of the test set.

For the ACD task, we compute Precision, Recall and F$_1$-score defined as:
$F1_{a} = \frac{2 P_a R_a}{P_a + R_a}$,
where Precision ($P_a$) and Recall ($R_a$) are defined as:
$P_a = \frac{|S_{a} \cap G_a|}{|S_a|}; R_a = \frac{|S_a \cap G_a|}{|G_a|}$.
Here $S_a$ is the set of aspect category annotations that a system returned for all the test sentences, and $G_a$ is the set of the gold (correct) aspect category annotations.
For instance, if a review is labeled in the gold standard with the two aspects
$G_a=\{\textsc{cleanliness}, \textsc{staff}\}$,
and the system predicts the two aspects
$S_a=\{\textsc{cleanliness},\textsc{comfort}\}$,
we have that $|S_{a} \cap G_a|=1$, $|G_{a}|=2$ and $|S_{a}|=2$ so that $P_a=\frac{1}{2}$, $R_a=\frac{1}{2}$ and $F1_a=\frac{1}{2}$.
For the ACD task the baseline will be computed by considering a system which assigns the most frequent aspect category (estimated over the training set) to each sentence.
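
The ACD scores defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the official scorer: the function name `precision_recall_f1`, the sentence id `"s1"`, and the convention of returning 0 when a denominator is empty are all assumptions made for the example. Each annotation is represented as a `(sentence_id, category)` tuple so that the sets range over all test sentences, as in the definitions of $S_a$ and $G_a$.

```python
def precision_recall_f1(system, gold):
    """Precision, recall and F1 between two sets of annotations.

    Assumption: a score is reported as 0.0 when its denominator is zero.
    """
    system, gold = set(system), set(gold)
    correct = len(system & gold)  # |S ∩ G|
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# The example from the text; "s1" is a hypothetical sentence id.
gold_a = {("s1", "cleanliness"), ("s1", "staff")}
system_a = {("s1", "cleanliness"), ("s1", "comfort")}

p_a, r_a, f1_a = precision_recall_f1(system_a, gold_a)
print(p_a, r_a, f1_a)  # 0.5 0.5 0.5
```

Keying each annotation by its sentence means that the same category predicted for two different sentences is counted twice, which matches pooling the annotations over the whole test set.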

For the ACP task we will evaluate the entire chain, considering the aspect categories detected in the sentences together with their corresponding polarities, in the form of $(aspect, polarity)$ pairs.
We again compute Precision, Recall and F$_1$-score now defined as
$F1_{p} = \frac{2 P_p R_p}{P_p + R_p}$.
Precision ($P_p$) and Recall ($R_p$) are defined as
$P_p = \frac{|S_{p} \cap G_p|}{|S_p|}; R_p = \frac{|S_p \cap G_p|}{|G_p|}$,
where $S_p$ is the set of $(aspect, polarity)$ pairs that a system returned for all the test sentences, and $G_p$ is the set of the gold (correct) pair annotations.

For instance, if a review is labeled in the gold standard with the pairs
$G_p=\{(\textsc{cleanliness}, POS), (\textsc{staff}, POS)\}$,
and the system predicts the three pairs
$S_p=\{(\textsc{cleanliness}, POS), (\textsc{cleanliness}, NEG),(\textsc{comfort}, POS)\}$,
we have that $|S_{p} \cap G_p|=1$, $|G_{p}|=2$ and $|S_{p}|=3$ so that $P_p=\frac{1}{3}$, $R_p=\frac{1}{2}$ and $F1_p=\frac{2}{5}=0.4$.
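
The arithmetic of this ACP example can be checked with a short Python sketch. As before, this is an illustration under stated assumptions rather than the official scorer: the helper name `score_pairs` and the sentence id `"s1"` are hypothetical, and scores with an empty denominator are taken to be 0.

```python
def score_pairs(system, gold):
    """Precision, recall and F1 over sets of (sentence_id, aspect, polarity).

    Assumption: a score is reported as 0.0 when its denominator is zero.
    """
    system, gold = set(system), set(gold)
    correct = len(system & gold)  # a pair counts only if aspect AND polarity match
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# The example from the text; "s1" is a hypothetical sentence id.
gold_p = {("s1", "cleanliness", "POS"), ("s1", "staff", "POS")}
system_p = {
    ("s1", "cleanliness", "POS"),
    ("s1", "cleanliness", "NEG"),
    ("s1", "comfort", "POS"),
}

p_p, r_p, f1_p = score_pairs(system_p, gold_p)
print(round(p_p, 3), round(r_p, 3), round(f1_p, 3))
```

Note that predicting both polarities for the same aspect does not help: the extra `(cleanliness, NEG)` pair enlarges $|S_p|$ and therefore lowers precision.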

For the ACP task, the baseline will be computed by considering a system which assigns the most frequent $(aspect, polarity)$ pair (estimated over the training set) to each sentence.
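
A most-frequent-label baseline of this kind could be built roughly as follows. This is a minimal sketch: the function names, the toy training data, and the sentence ids are all hypothetical, and the tie-breaking rule (first label encountered wins) is an assumption, since the text does not specify one.

```python
from collections import Counter


def most_frequent_label(train_annotations):
    """Return the label occurring most often in (sentence_id, label) tuples.

    Assumption: ties are broken in favour of the label seen first.
    """
    counts = Counter(label for _, label in train_annotations)
    return counts.most_common(1)[0][0]


def baseline_predictions(train_annotations, test_sentence_ids):
    """Assign the single most frequent training label to every test sentence."""
    label = most_frequent_label(train_annotations)
    return {(sentence_id, label) for sentence_id in test_sentence_ids}


# Toy training annotations (hypothetical); each label is an (aspect, polarity) pair.
train_p = [
    ("t1", ("cleanliness", "POS")),
    ("t1", ("staff", "POS")),
    ("t2", ("cleanliness", "POS")),
    ("t3", ("comfort", "NEG")),
]

baseline_p = baseline_predictions(train_p, ["s1", "s2"])
print(sorted(baseline_p))
```

The same functions cover the ACD baseline if each label is a bare aspect category instead of an $(aspect, polarity)$ pair.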

We will produce separate rankings for the two tasks, based on the $F_1$ scores. Participants who submit results for the ACD task only will appear in the first ranking only.