ids.a.txt
692 original-test.ids.a.txt
1384 total
Tianze's split of the original data was created as follows.
% tail -n +2 sentences.tsv.txt | cut -f 1,3 | shuf | grep "+1" | cut -f 1 > original.pos.ids
% tail -n +2 sentences.tsv.txt | cut -f 1,3 | shuf | grep "\-1" | cut -f 1 > original.neg.ids
% sed -n '1,'`expr 361 \* 8`' p' original.pos.ids > original.pos.train.ids
% sed -n `expr 361 \* 8 + 1`','`expr 361 \* 9`' p' original.pos.ids > original.pos.dev.ids
% sed -n `expr 361 \* 9 + 1`','`expr 361 \* 10`' p' original.pos.ids > original.pos.test.ids
% sed -n '1,'`expr 331 \* 8`' p' original.neg.ids > original.neg.train.ids
% sed -n `expr 331 \* 8 + 1`','`expr 331 \* 9`' p' original.neg.ids > original.neg.dev.ids
% sed -n `expr 331 \* 9 + 1`','`expr 331 \* 10`' p' original.neg.ids > original.neg.test.ids
% for split in train dev test; do (cat original.pos.${split}.ids original.neg.${split}.ids > original.${split}.ids) done
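(For the record, the arithmetic: 361 * 10 = 3610 positive and 331 * 10 = 3310
negative ids, split 8:1:1, give 2888/361/361 positive and 2648/331/331 negative
ids, i.e. 5536 train / 692 dev / 692 test, 6920 in all.)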
#### Sanity check after generation:
% cat original.train.ids original.dev.ids original.test.ids | wc -l
% cat original.train.ids original.dev.ids original.test.ids | sort | uniq | wc -l
#### Both gave 6920.
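A finer-grained check (not part of the original recipe, just a sketch over the
same files) would confirm the per-split sizes directly:
% for split in train dev test; do wc -l original.${split}.ids; done
# Should report 5536, 692, and 692 lines, respectively.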
The challenge data split is as follows. This is not what we talked about
in class, for two reasons: there is some imbalance in Team4_breaker_test.tsv,
and reserving only 10% of the data for training could leave too little room
for interesting variation in fine-tuning-set size.
% cat Team{1,2,3}_breaker_test.tsv
# Then some manual editing (including removing this pair, identical except
# for the label:
# 673_a This quirky, snarky contemporary fairy tale could have been a family blockbuster. -1
# 673_a This quirky, snarky contemporary fairy tale could have been a family blockbuster. 1
# )
#
# to yield challenge.tsv
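# To re-check for leftovers like the 673_a pair above (same id and sentence,
# conflicting labels), a quick sketch:
% cut -f1,2 challenge.tsv | sort | uniq -d
# Prints any id+sentence combination that occurs more than once.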
% cut -f1 challenge.tsv | cut -f1 -d'_' | sort | uniq | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -50 > challenge.train.id-prefixes.txt
The first entry in challenge.train.id-prefixes.txt is "850", so the following
two sentences from challenge.tsv should be in the small challenge training set:
850_a It's basically the videogame version of Top Gun... on steroids! 1
850_b It's basically the videogame version of Top Gun... -1
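Materializing the actual split files from the prefixes might look like this (a
sketch; the awk one-liners and the challenge.train.tsv / challenge.eval.tsv
names are illustrative, not part of the original recipe):
% awk -F'\t' 'NR==FNR {keep[$1]; next} {split($1, id, "_"); if (id[1] in keep) print}' challenge.train.id-prefixes.txt challenge.tsv > challenge.train.tsv
% awk -F'\t' 'NR==FNR {keep[$1]; next} {split($1, id, "_"); if (!(id[1] in keep)) print}' challenge.train.id-prefixes.txt challenge.tsv > challenge.eval.tsv
# Selecting on the id prefix keeps every x_a/x_b group intact within one split.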
Note that there may be "repeated" IDs, as posted about on CampusWire:
Q:
duplicate indices in challenge.tsv
I noticed that there are duplicate indices in challenge.tsv. For one example, there are two instances of 559_b in challenge.tsv:
559_a Unfolds with the creepy elegance and carefully calibrated precision of a Dario Argento horror film. 1
559_b Unfolds with all the creepy elegance and carefully calibrated precision of a Jim Carrey comedy film. -1
559_b Unfolds with the creepy elegance and carefully calibrated precision of a Uwe Boll horror film. -1
I am not sure if this was intentional, or the third 559 example was meant to be encoded as something like 559_c. I first assumed there would only be pairs (a and b) of similar sentences in the challenge dataset, but the above examples show that there can be either pairs or trios of them.
A: This was a design choice, but good to check! Note that the actual sentences for the two 559_b's are different, although both are "challenges" to the same 559_a. So you will want all three 559s to be in the same split, counting as three different examples.
In general, there could be as many as three x_b's, one from each of the three breaker teams' data.
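To see how common these groups are, one can count rows per id prefix (a quick
sketch over challenge.tsv only):
% cut -f1 challenge.tsv | cut -f1 -d'_' | sort | uniq -c | sort -rn | head
# Any count above 2 marks a prefix with multiple x_b challenges; the
# prefix-based split above automatically keeps all of them together.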