There are two splits of the original data, and one split of the challenge data. Students with the initials AR, EM, HL, or JA should use Lillian's split. Students with the initials JC, JL, or VS should use Tianze's split.

Lillian's split of the original data was created as follows.

% cat sentences.tsv.txt | awk '{if ($(NF-1) == "+1") print $0}' | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -`echo "dummy" | awk '{print int(.2*3610)}'` > original20percent.pos.a.txt
% cat sentences.tsv.txt | awk '{if ($(NF-1) == "-1") print $0}' | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -`echo "dummy" | awk '{print int(.2*3310)}'` > original20percent.neg.a.txt
% tail -331 original20percent.neg.a.txt | awk '{print $1}' > original-dev.ids.a.txt
% tail -361 original20percent.pos.a.txt | awk '{print $1}' >> original-dev.ids.a.txt
% head -331 original20percent.neg.a.txt | awk '{print $1}' > original-test.ids.a.txt
% head -361 original20percent.pos.a.txt | awk '{print $1}' >> original-test.ids.a.txt

Thus, you have files that specify the sentence ids for the sentences belonging to the development and test sets, respectively; the training set consists of the sentence ids that appear in neither original-dev.ids.a.txt nor original-test.ids.a.txt. (One way to materialize the training-set ids is sketched at the end of this section.)

Tianze's split of the original data was created as follows.

% tail -n +2 sentences.tsv.txt | cut -f 1,3 | shuf | grep "+1" | cut -f 1 > original.pos.ids
% tail -n +2 sentences.tsv.txt | cut -f 1,3 | shuf | grep "\-1" | cut -f 1 > original.neg.ids
% sed -n '1,'`expr 361 \* 8`' p' original.pos.ids > original.pos.train.ids
% sed -n `expr 361 \* 8 + 1`','`expr 361 \* 9`' p' original.pos.ids > original.pos.dev.ids
% sed -n `expr 361 \* 9 + 1`','`expr 361 \* 10`' p' original.pos.ids > original.pos.test.ids
% sed -n '1,'`expr 331 \* 8`' p' original.neg.ids > original.neg.train.ids
% sed -n `expr 331 \* 8 + 1`','`expr 331 \* 9`' p' original.neg.ids > original.neg.dev.ids
% sed -n `expr 331 \* 9 + 1`','`expr 331 \* 10`' p' original.neg.ids > original.neg.test.ids
% for split in train dev test; do (cat original.pos.${split}.ids original.neg.${split}.ids > original.${split}.ids); done

#### Sanity check after generation:
% cat original.train.ids original.dev.ids original.test.ids | wc -l
% cat original.train.ids original.dev.ids original.test.ids | sort | uniq | wc -l
#### Both gave 6920.

The challenge data split is as follows. It is not what we talked about in class, because of some imbalance in Team4_breaker_test.tsv, and because reserving only 10% of the data for training could be too little to allow interesting variation in fine-tuning-set size.

% cat Team{1,2,3}_breaker_test.tsv
# Then some manual editing (including removing the pair
#   673_a  This quirky, snarky contemporary fairy tale could have been a family blockbuster.  -1
#   673_a  This quirky, snarky contemporary fairy tale could have been a family blockbuster.  1
# ) to yield challenge.tsv
% cut -f1 challenge.tsv | cut -f1 -d'_' | sort | uniq | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -50 > challenge.train.id-prefixes.txt

The first entry in challenge.train.id-prefixes.txt is "850", so the following two sentences from challenge.tsv should be in the small challenge training set:

850_a  It's basically the videogame version of Top Gun... on steroids!  1
850_b  It's basically the videogame version of Top Gun...  -1
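
For concreteness, here is one way to materialize the training-set ids for Lillian's split. These commands are not part of the original recipe: the output filenames are made up for illustration, and they assume sentences.tsv.txt has a one-line header (as the tail -n +2 in Tianze's commands suggests) with the sentence id in the first tab-separated column.

% cat original-dev.ids.a.txt original-test.ids.a.txt | sort > original-devtest.ids.a.txt   # ids to exclude
% tail -n +2 sentences.tsv.txt | cut -f1 | sort | comm -23 - original-devtest.ids.a.txt > original-train.ids.a.txt   # ids in neither dev nor test

(Tianze's split needs no such step, since original.train.ids is written out directly.)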
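
Similarly, here is a sketch of how one might pull the small challenge training set itself out of challenge.tsv given the 50 sampled id prefixes; again, this is not part of the original recipe, the filename challenge.train.tsv is made up, and tab-separated columns are assumed.

% awk -F'\t' 'NR==FNR {keep[$1]=1; next} {split($1, p, "_"); if (p[1] in keep) print}' challenge.train.id-prefixes.txt challenge.tsv > challenge.train.tsv   # first file: remember sampled prefixes; second file: keep rows whose id prefix was sampled

Selecting by prefix rather than by full id keeps each pair of variants together, e.g. both 850_a and 850_b above.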