CS 6740/IS 6300 A3 data readme

There are two splits of the original data, and one split of the challenge data.

Students with the initials AR, EM, HL, or JA should use Lillian's
split.

Students with the initials JC, JL, or VS should use Tianze's split.

File listing:

Lillian's split:

sentences.tsv

original-dev.ids.a.txt

original-test.ids.a.txt

Tianze's split:

Challenge split:

challenge.train.id-prefixes.txt

Lillian's split of the original data was created as follows.

% cat sentences.tsv | awk '{if ($(NF-1) == "+1") print $0}' | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -`echo "dummy" | awk '{print int(.2*3610)}'` > original20percent.pos.a.txt
% cat sentences.tsv | awk '{if ($(NF-1) == "-1") print $0}' | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -`echo "dummy" | awk '{print int(.2*3310)}'` > original20percent.neg.a.txt
% tail -331 original20percent.neg.a.txt | awk '{print $1}' > original-dev.ids.a.txt
% tail -361 original20percent.pos.a.txt | awk '{print $1}' >> original-dev.ids.a.txt
% head -331 original20percent.neg.a.txt | awk '{print $1}' > original-test.ids.a.txt
% head -361 original20percent.pos.a.txt | awk '{print $1}' >> original-test.ids.a.txt

Thus, you have files that specify the sentence IDs for the sentences belonging
to the development and test sets, respectively; the training set consists of the
sentences whose IDs are in neither original-dev.ids.a.txt nor original-test.ids.a.txt.
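
The rule above can be turned into a command. The following is only a sketch,
not part of how the split was made: train_rows is a hypothetical helper name,
and it assumes that the sentence ID is the first whitespace-separated field of
each line of sentences.tsv (as the awk '{print $1}' commands above suggest) and
that every line of sentences.tsv is a data row, with no header.

```shell
# Hypothetical helper (not one of the distributed files): print the rows of a
# sentence file whose ID, assumed to be field 1, is in neither held-out ID list.
train_rows() {
    dev_ids=$1
    test_ids=$2
    sentences=$3
    # While held_phase is 1, awk only records IDs; once it is reset to 0 for
    # the last file, awk prints each row whose ID was not recorded.
    awk 'held_phase { held[$1] = 1; next } !($1 in held)' \
        held_phase=1 "$dev_ids" "$test_ids" held_phase=0 "$sentences"
}
```

For Lillian's split this could be run as follows, where the output name
original-train.a.tsv is made up for this example:

% train_rows original-dev.ids.a.txt original-test.ids.a.txt sentences.tsv > original-train.a.tsv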

Tianze's split of the original data was created as follows.

% sed -n '1,'`expr 361 \* 8`' p' original.pos.ids > original.pos.train.ids
% sed -n `expr 361 \* 8 + 1`','`expr 361 \* 9`' p' original.pos.ids > original.pos.dev.ids
% sed -n `expr 361 \* 9 + 1`','`expr 361 \* 10`' p' original.pos.ids > original.pos.test.ids

% sed -n '1,'`expr 331 \* 8`' p' original.neg.ids > original.neg.train.ids
% sed -n `expr 331 \* 8 + 1`','`expr 331 \* 9`' p' original.neg.ids > original.neg.dev.ids
% sed -n `expr 331 \* 9 + 1`','`expr 331 \* 10`' p' original.neg.ids > original.neg.test.ids

% for split in train dev test; do (cat original.pos.${split}.ids original.neg.${split}.ids > original.${split}.ids) done

#### Sanity check after generation:
% cat original.train.ids original.dev.ids original.test.ids | wc -l
% cat original.train.ids original.dev.ids original.test.ids | sort | uniq | wc -l

#### Both gave 6920.
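
The per-split sizes implied by the sed ranges above can be worked out the same
way. This is just the arithmetic, written as a small sketch with variable names
made up here. Each of the ten blocks holds 361 positive and 331 negative IDs,
so if the files were generated as above, wc -l on original.train.ids,
original.dev.ids, and original.test.ids would be expected to report these
numbers.

```shell
# Block sizes taken from the sed commands above.
pos_block=361   # one tenth of the 3610 positive IDs
neg_block=331   # one tenth of the 3310 negative IDs

# Train takes blocks 1-8, dev takes block 9, test takes block 10.
train_size=$(( (pos_block + neg_block) * 8 ))
dev_size=$(( pos_block + neg_block ))
test_size=$(( pos_block + neg_block ))
total_size=$(( train_size + dev_size + test_size ))

echo "train=$train_size dev=$dev_size test=$test_size total=$total_size"
# prints: train=5536 dev=692 test=692 total=6920
```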

The challenge data split is as follows. This is not what we talked about
in class, due to some imbalance in Team4_breaker_test.tsv and the fact that
using only 10% of the data for training could leave too little to allow
interesting variation in fine-tuning-set size.

% cat Team{1,2,3}_breaker_test.tsv

# Then some manual editing (including removing:
# 673_a This quirky, snarky contemporary fairy tale could have been a family blockbuster. -1
# 673_a This quirky, snarky contemporary fairy tale could have been a family blockbuster. 1
# )
#
# to yield challenge.tsv
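
The pair removed above carried the same sentence with both labels. A rough way
to look for such repeats, offered as a sketch rather than the procedure
actually used: repeated_sentences is a hypothetical helper, and it assumes
whitespace-separated lines with the ID as the first field and the label as the
last. It prints each sentence text that occurs more than once after the ID and
label are stripped.

```shell
# Hypothetical helper: list sentence texts that appear on more than one line.
# Assumes field 1 is the ID and the last field is the label.
repeated_sentences() {
    awk 'NF >= 3 {
             $1 = ""; $NF = ""          # blank out the ID and the label
             sub(/^ +/, ""); sub(/ +$/, "")   # trim the leftover separators
             print
         }' "$1" | sort | uniq -d
}
```

An empty result from

% repeated_sentences challenge.tsv

would suggest that no sentence text is repeated, under the assumptions above.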

The first entry in challenge.train.id-prefixes.txt is "850", so the following
two sentences from challenge.tsv should be in the small challenge training set:

850_a It's basically the videogame version of Top Gun... on steroids! 1
850_b It's basically the videogame version of Top Gun... -1
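
Selecting the small challenge training set can be sketched along the same
lines. This is an illustration, not the course's reference code:
challenge_train_rows is a hypothetical helper, and it assumes one prefix per
line in challenge.train.id-prefixes.txt and that the first field of each line
of challenge.tsv is an ID of the form prefix_letter, as in the two lines above.
The prefix is compared whole, so "85" would not match "850_a".

```shell
# Hypothetical helper: print the rows of a challenge file whose ID prefix
# (the part of field 1 before the first "_") is listed in the prefix file.
challenge_train_rows() {
    prefixes=$1
    challenge=$2
    # While loading is 1, awk records prefixes; after it is reset to 0, awk
    # prints each non-empty row whose ID prefix was recorded.
    awk 'loading { if (NF) keep[$1] = 1; next }
         NF { split($1, parts, "_"); if (parts[1] in keep) print }' \
        loading=1 "$prefixes" loading=0 "$challenge"
}
```

A possible invocation, with the output name challenge.train.tsv made up for
this example:

% challenge_train_rows challenge.train.id-prefixes.txt challenge.tsv > challenge.train.tsv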

Here is a link to a page where you can view "diffs" between any two versions: use the "compare selected versions" feature to highlight precisely what text was added or deleted.

Version	Published	Changed By	Comment
CURRENT (v. 5)	Oct 29, 2019 13:37	Lillian Lee	Confirmed that there can be repeated IDs, but should not be repeated sentences, in the challenge data
v. 4	Oct 28, 2019 12:28	Lillian Lee	added sanity check commands
v. 3	Oct 28, 2019 12:21	Lillian Lee	Utterly embarrassing: I had '=' instead of '==' in my command-line stuff, causing every item to be assigned label +1 (say), instead of looking for items that had +1.
v. 2	Oct 25, 2019 13:16	Lillian Lee	add explicit diff tracking
v. 1	Oct 25, 2019 13:13	Lillian Lee
