There are two splits of the original data, and one split of the challenge data.
Students with the initials AR, EM, HL, or JA should use Lillian's split.
Students with the initials JC, JL, or VS should use Tianze's split.
File listing:
Lillian's split:
original-dev.ids.a.txt
original-test.ids.a.txt
Tianze's split:
original.train.ids
original.dev.ids
original.test.ids
Challenge split:
challenge.train.id-prefixes.txt
Lillian's split of the original data was created as follows.
# Select the positive (+1) sentences, shuffle them, and keep a random 20% (722 of 3610):
% cat sentences.tsv | awk '{if ($(NF-1) == "+1") print $0}' | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -`echo "dummy" | awk '{print int(.2*3610)}'` > original20percent.pos.a.txt
# Likewise for the negative (-1) sentences (662 of 3310):
% cat sentences.tsv | awk '{if ($(NF-1) == "-1") print $0}' | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -`echo "dummy" | awk '{print int(.2*3310)}'` > original20percent.neg.a.txt
# The second half of each 20% sample becomes the dev set, the first half the test set:
% tail -331 original20percent.neg.a.txt | awk '{print $1}' > original-dev.ids.a.txt
% tail -361 original20percent.pos.a.txt | awk '{print $1}' >> original-dev.ids.a.txt
% head -331 original20percent.neg.a.txt | awk '{print $1}' > original-test.ids.a.txt
% head -361 original20percent.pos.a.txt | awk '{print $1}' >> original-test.ids.a.txt
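As a quick sanity check (ours, not part of Lillian's recipe), the two 20% samples
should contain 722 and 662 lines, and the dev and test ID files should be disjoint:
% wc -l original20percent.pos.a.txt original20percent.neg.a.txt
% cat original-dev.ids.a.txt original-test.ids.a.txt | sort | uniq -d | wc -l
The first command should report 722 and 662; the second should print 0.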
Thus, you have files that specify the sentence IDs for the development and test
sets, respectively; the training set consists of the sentence IDs that appear in
neither original-dev.ids.a.txt nor original-test.ids.a.txt. One way to
materialize the training IDs is sketched below.
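This sketch is not part of the original recipe; it assumes sentences.tsv has a
header line (as Tianze's commands below suggest) and that the first
whitespace-delimited field of each data line is the sentence ID:
% cat original-dev.ids.a.txt original-test.ids.a.txt > original-heldout.ids.a.txt
% tail -n +2 sentences.tsv | awk '{print $1}' | grep -vFxf original-heldout.ids.a.txt > original-train.ids.a.txt
Here grep -vFxf keeps only the IDs that do not exactly match any held-out ID.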
Tianze's split of the original data was created as follows.
# Shuffle the positive and negative sentence IDs (field 1 is the ID; field 3 carries the +1/-1 label):
% tail -n +2 sentences.tsv.txt | cut -f 1,3 | shuf | grep "+1" | cut -f 1 > original.pos.ids
% tail -n +2 sentences.tsv.txt | cut -f 1,3 | shuf | grep "\-1" | cut -f 1 > original.neg.ids
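At this point (a check of ours, not Tianze's), the two files should contain 3610
and 3310 IDs, respectively:
% wc -l original.pos.ids original.neg.ids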
# Split each class 80/10/10: the first 8 tenths of the shuffled IDs become train,
# the ninth tenth dev, and the last tenth test:
% sed -n '1,'`expr 361 \* 8`' p' original.pos.ids > original.pos.train.ids
% sed -n `expr 361 \* 8 + 1`','`expr 361 \* 9`' p' original.pos.ids > original.pos.dev.ids
% sed -n `expr 361 \* 9 + 1`','`expr 361 \* 10`' p' original.pos.ids > original.pos.test.ids
% sed -n '1,'`expr 331 \* 8`' p' original.neg.ids > original.neg.train.ids
% sed -n `expr 331 \* 8 + 1`','`expr 331 \* 9`' p' original.neg.ids > original.neg.dev.ids
% sed -n `expr 331 \* 9 + 1`','`expr 331 \* 10`' p' original.neg.ids > original.neg.test.ids
# Merge the per-class ID lists into the final splits:
% for split in train dev test; do (cat original.pos.${split}.ids original.neg.${split}.ids > original.${split}.ids) done
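To turn an ID list back into actual sentence rows, one option (ours, not part of
the recipe; the .tsv output names are hypothetical) is the standard two-file awk
idiom, assuming the ID is the first whitespace-delimited field of sentences.tsv.txt:
% for split in train dev test; do awk 'NR==FNR {keep[$1]; next} ($1 in keep)' original.${split}.ids sentences.tsv.txt > original.${split}.tsv; done
The first pass (NR==FNR) loads the split's IDs into keep; the second pass prints
only the sentence lines whose ID is in that set.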
#### Sanity check after generation:
% cat original.train.ids original.dev.ids original.test.ids | wc -l
% cat original.train.ids original.dev.ids original.test.ids | sort | uniq | wc -l
#### Both gave 6920.
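The individual split sizes can be checked the same way (our addition, not
Tianze's): with 3610 positive and 3310 negative sentences, train/dev/test should
contain 8*(361+331) = 5536, 692, and 692 IDs, respectively:
% wc -l original.train.ids original.dev.ids original.test.ids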
The challenge data split is as follows. This is not what we discussed in class,
both because of some imbalance in Team4_breaker_test.tsv and because reserving
only 10% of the data for training could be too small to allow interesting
variation in fine-tuning-set size.
% cat Team{1,2,3}_breaker_test.tsv > challenge.tsv
# Then some manual editing of challenge.tsv (including removing the following
# pair, which assigns the same sentence both labels):
# 673_a This quirky, snarky contemporary fairy tale could have been a family blockbuster. -1
# 673_a This quirky, snarky contemporary fairy tale could have been a family blockbuster. 1
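One quick way (our suggestion, not from the original notes) to spot IDs that
appear more than once before editing:
% cut -f1 challenge.tsv | sort | uniq -d
An empty result after editing means every sentence ID is unique.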
# Randomly choose 50 ID prefixes (e.g., "850", which covers 850_a and 850_b) for the challenge training set:
% cut -f1 challenge.tsv | cut -f1 -d'_' | sort | uniq | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -50 > challenge.train.id-prefixes.txt
The first entry in challenge.train.id-prefixes.txt is "850", so the following
two sentences from challenge.tsv should be in the small challenge training set:
850_a It's basically the videogame version of Top Gun... on steroids! 1
850_b It's basically the videogame version of Top Gun... -1
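To extract the full challenge training set from the prefixes, a sketch (ours; it
assumes challenge.tsv is tab-separated with the ID in column 1, and the output
name challenge.train.tsv is hypothetical):
% awk -F'\t' 'NR==FNR {keep[$1]; next} {split($1, a, "_"); if (a[1] in keep) print}' challenge.train.id-prefixes.txt challenge.tsv > challenge.train.tsv
The first pass loads the 50 prefixes into keep; the second prints each
challenge.tsv line whose ID prefix (the part before "_") is in that set.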