Input Format

In defining the dataset input format for KeLP we largely took inspiration from the SVMlight format, extending it to deal with multiple labels and multiple representations. We did not use JSON because it would have implied a lot of overhead.

Every row of a dataset represents an Example and has the following form:

Each row starts with a list of labels separated by a white space. A label can be a simple string in the case of a classification label, or can have the form propertyName:value (for instance height:10) in the case of a regression value. This formalism makes it possible to deal with multilabel classification tasks as well as with multivariate regression tasks. Note that an isolated number will be considered a classification label.
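The label rules above can be sketched in a few lines of Python. This is only an illustration, not KeLP's own parser: the function name parse_label and the returned tuple layout are hypothetical, and the sketch assumes that the value of a propertyName:value label is always numeric.

```python
def parse_label(token):
    """Classify one label token and return (kind, name, value).

    Assumption: 'height:10' style tokens carry a numeric value and are
    regression labels; anything else, including a bare number, is a
    classification label.
    """
    name, sep, value = token.rpartition(":")
    if sep and name:
        try:
            return ("regression", name, float(value))
        except ValueError:
            pass  # the part after ':' is not a number, treat as a plain label
    return ("classification", token, None)


# Illustrative label list: two classification labels and one regression value.
labels = [parse_label(t) for t in "food height:10 3".split()]
```

Here the isolated number 3 ends up as a classification label, as stated above.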

After the labels, a list of representations begins. Each representation must be enclosed between a preamble of the form |Btype:name| and an end sequence of the form |Etype|, where type is an identifier of the representation type (e.g. V for SparseVector) and name is an identifier for that specific representation (e.g. BoW for a bag-of-words representation). This makes it possible to model an instance with multiple representations, where a representation type can be repeated (e.g. a single instance can be modeled using two different TreeRepresentations and a SparseVector). If no name is specified for a representation, it will be identified by its position within the sequence (i.e. the third representation will be automatically named 3).
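As a rough sketch of how a row decomposes into labels and delimited representations, the following Python snippet splits a row with a regular expression. It is not KeLP code: split_row and REPR are hypothetical names, the sample row is made up for illustration, and the sketch assumes that everything before the first |B...| preamble is the label list.

```python
import re

# Matches |Btype:name| body |Etype|; the ":name" part is optional and the
# closing marker must repeat the same type as the opening one.
REPR = re.compile(
    r"\|B(?P<type>[^:|]+)(?::(?P<name>[^|]*))?\|(?P<body>.*?)\|E(?P=type)\|",
    re.S,
)


def split_row(row):
    """Return (labels, representations) for one dataset row."""
    first = REPR.search(row)
    label_part = row[: first.start()] if first else row
    labels = label_part.split()
    reps = []
    for position, m in enumerate(REPR.finditer(row), start=1):
        # An unnamed representation is named after its 1-based position.
        name = m.group("name") or str(position)
        reps.append((m.group("type"), name, m.group("body").strip()))
    return labels, reps


row = "food height:10 |BV:BoW| a:1 b:2 |EV| |BT| (S (NP x)) |ET|"
labels, reps = split_row(row)
```

In this made-up row the second representation has no name, so the sketch names it 2.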

Each representation has its own formalism. Currently KeLP supports three representation types:

  • DenseVector. Its type identifier is DV and its textual description is a sequence of numbers separated by a white space. For instance:

  • SparseVector. Its type identifier is V and its textual description is a sequence of featureName:featureValue pairs separated by a white space (i.e. the same formalism as SVMlight and LibSVM, but featureName can be a generic string). For instance:

  • TreeRepresentation. Its type identifier is T and its textual description must be in the Penn Treebank notation. For instance:
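To make the three textual descriptions concrete, here is a minimal Python sketch that reads each one into a plain data structure. The function names and the input strings are hypothetical and chosen only for illustration; they are not part of KeLP. Since featureName can be a generic string, the sketch assumes that the last colon in a pair separates the name from the value.

```python
def parse_dense(body):
    """DenseVector (DV): numbers separated by white space."""
    return [float(x) for x in body.split()]


def parse_sparse(body):
    """SparseVector (V): featureName:featureValue pairs."""
    features = {}
    for pair in body.split():
        # Assumption: split on the last ':' so names may contain colons.
        name, _, value = pair.rpartition(":")
        features[name] = float(value)
    return features


def parse_tree(body):
    """TreeRepresentation (T): bracketed Penn Treebank string to nested lists."""
    tokens = body.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # tokens[pos] is "("; returns the parsed node and the next position.
        node = []
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                node.append(child)
            else:
                node.append(tokens[pos])
                pos += 1
        return node, pos + 1

    tree, _ = read(0)
    return tree


dense = parse_dense("0.5 1 -2")
sparse = parse_sparse("good:1 movie:0.5")
tree = parse_tree("(S (NP I) (VP run))")
```

The nested-list form of the tree is just one convenient in-memory shape for the sketch; KeLP itself uses its own representation classes.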

An example of a complete row of a dataset can be: