There are two splits of the original data, and one split of the challenge data.
Students with the following initials: AR, EM, HL, and JA should use Lillian's split.
Students with the following initials: JC, JL, and VS should use Tianze's split.
File listing:
Lillian's split. How to tell you have the new (as of Oct 28) split: the ID 473 should NOT appear twice in original-dev.ids.a.txt. (Thanks very much, Hannah, for noticing this!)
sentences.tsv
original-dev.ids.a.txt (redone on Oct 28)
original-test.ids.a.txt (redone on Oct 28)
Tianze's split:
original.train.ids
original.dev.ids
original.test.ids
Challenge split (see notes at the end of this page):
challenge.tsv
challenge.train.id-prefixes.txt
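A quick way to confirm you have the corrected (Oct 28) version of Lillian's original-dev.ids.a.txt is to look for duplicated IDs; `uniq -d` prints only lines that repeat, so any output flags a stale copy. A toy sketch (the file name here is a stand-in, swap in original-dev.ids.a.txt):

```shell
# Toy stand-in for the ID file, with a deliberately repeated ID:
printf '12\n473\n473\n981\n' > dev-ids.demo.txt
# uniq -d prints only duplicated lines, so any output means duplicates exist:
sort dev-ids.demo.txt | uniq -d    # -> 473
```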
Lillian's split of the original data was created as follows.
% cat sentences.tsv | awk '{if ($(NF-1) == "+1") print $0}' | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -`echo "dummy" | awk '{print int(.2*3610)}'` > original20percent.pos.a.txt
% cat sentences.tsv | awk '{if ($(NF-1) == "-1") print $0}' | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -`echo "dummy" | awk '{print int(.2*3310)}'` > original20percent.neg.a.txt
% tail -331 original20percent.neg.a.txt | awk '{print $1}' > original-dev.ids.a.txt
% tail -361 original20percent.pos.a.txt | awk '{print $1}' >> original-dev.ids.a.txt
% head -331 original20percent.neg.a.txt | awk '{print $1}' > original-test.ids.a.txt
% head -361 original20percent.pos.a.txt | awk '{print $1}' >> original-test.ids.a.txt
Thus, you have files that specify the sentence IDs belonging to the development and test sets, respectively; the training set consists of the sentence IDs that appear in neither original-dev.ids.a.txt nor original-test.ids.a.txt.
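That rule can be applied mechanically with `comm`; here is a sketch using toy stand-in files (hypothetical names; the real inputs are sentences.tsv and the two .a.txt ID files):

```shell
# Toy stand-ins: five sentence IDs, one held out for dev, one for test.
printf '101\ta\n102\tb\n103\tc\n104\td\n105\te\n' > sentences.demo.tsv
printf '102\n' > dev.demo.ids
printf '104\n' > test.demo.ids
# Training IDs = all sentence IDs minus anything in the dev or test ID files.
# comm -23 keeps lines unique to the first (sorted) input.
cut -f1 sentences.demo.tsv | sort > all.sorted
cat dev.demo.ids test.demo.ids | sort > heldout.sorted
comm -23 all.sorted heldout.sorted > train.demo.ids
cat train.demo.ids    # -> 101, 103, 105 (one per line)
```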
#### Sanity checks
% cat original-dev.ids.a.txt original-test.ids.a.txt | sort | uniq -c | sort -nr | head
1 9993 # so nothing appears twice in the concatenation of the "a" files.
% wc -l *ids.a.txt
692 original-dev.ids.a.txt
692 original-test.ids.a.txt
1384 total
Tianze's split of the original data was created as follows.
% tail -n +2 sentences.tsv.txt | cut -f 1,3 | shuf | grep "+1" | cut -f 1 > original.pos.ids
% tail -n +2 sentences.tsv.txt | cut -f 1,3 | shuf | grep "\-1" | cut -f 1 > original.neg.ids
% sed -n '1,'`expr 361 \* 8`' p' original.pos.ids > original.pos.train.ids
% sed -n `expr 361 \* 8 + 1`','`expr 361 \* 9`' p' original.pos.ids > original.pos.dev.ids
% sed -n `expr 361 \* 9 + 1`','`expr 361 \* 10`' p' original.pos.ids > original.pos.test.ids
% sed -n '1,'`expr 331 \* 8`' p' original.neg.ids > original.neg.train.ids
% sed -n `expr 331 \* 8 + 1`','`expr 331 \* 9`' p' original.neg.ids > original.neg.dev.ids
% sed -n `expr 331 \* 9 + 1`','`expr 331 \* 10`' p' original.neg.ids > original.neg.test.ids
% for split in train dev test; do (cat original.pos.${split}.ids original.neg.${split}.ids > original.${split}.ids) done
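Assuming the counts used above are right (3610 positive and 3310 negative sentences, 6920 total), the 8:1:1 sed slicing should yield combined splits of 5536/692/692 IDs, consistent with the 692-line dev and test files in Lillian's split. A quick arithmetic check, no data files needed:

```shell
# 80/10/10 per class: 3610 positives and 3310 negatives (6920 total).
pos=3610; neg=3310
echo $(( pos / 10 * 8 + neg / 10 * 8 ))   # train: prints 5536
echo $(( pos / 10 + neg / 10 ))           # dev:   prints 692
echo $(( pos / 10 + neg / 10 ))           # test:  prints 692
```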
#### Sanity check after generation:
% cat original.train.ids original.dev.ids original.test.ids | wc -l
% cat original.train.ids original.dev.ids original.test.ids | sort | uniq | wc -l
#### Both gave 6920.
The challenge data split is as follows. This is not what we talked about in class, due to some imbalance in Team4_breaker_test.tsv and because using only 10% of the data for training could be too small to allow interesting variation in fine-tuning-set size.
% cat Team{1,2,3}_breaker_test.tsv
# Then some manual editing (including removing:
# 673_a This quirky, snarky contemporary fairy tale could have been a family blockbuster. -1
# 673_a This quirky, snarky contemporary fairy tale could have been a family blockbuster. 1
# )
# to yield challenge.tsv
% cut -f1 challenge.tsv | cut -f1 -d'_' | sort | uniq | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | head -50 > challenge.train.id-prefixes.txt
The first entry in challenge.train.id-prefixes.txt is "850", so the following two sentences from challenge.tsv should be in the small challenge training set:
850_a It's basically the videogame version of Top Gun... on steroids! 1
850_b It's basically the videogame version of Top Gun... -1
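One plausible way to pull those training rows out of challenge.tsv, given the prefix file, is to match on the part of each ID before the underscore. A sketch with toy stand-in files (the real inputs would be challenge.train.id-prefixes.txt and challenge.tsv):

```shell
# Toy stand-ins: one selected prefix, three challenge rows.
printf '850\n' > prefixes.demo.txt
printf '850_a\tsentence one\t1\n850_b\tsentence two\t-1\n761_a\tsentence three\t1\n' > challenge.demo.tsv
# First pass loads the prefixes; second pass keeps rows whose ID prefix
# (the part of column 1 before "_") is in the prefix file.
awk -F'\t' 'NR==FNR {keep[$1]=1; next} {split($1, p, "_"); if (p[1] in keep) print}' \
    prefixes.demo.txt challenge.demo.tsv    # -> the two 850_* rows
```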
Note that there may be "repeated" IDs, as posted about on CampusWire:
Q: duplicate indices in challenge.tsv (#17)
I noticed that there are duplicate indices in challenge.tsv. For example, there are two instances of 559_b in challenge.tsv:
559_a Unfolds with the creepy elegance and carefully calibrated precision of a Dario Argento horror film. 1
559_b Unfolds with all the creepy elegance and carefully calibrated precision of a Jim Carrey comedy film. -1
559_b Unfolds with the creepy elegance and carefully calibrated precision of a Uwe Boll horror film. -1
I am not sure if this was intentional, or if the third 559 example was meant to be encoded as something like 559_c. I first assumed there would only be pairs (a and b) of similar sentences in the challenge dataset, but the above examples show that there can be either pairs or trios of them.
A: This was a design choice, but good to check! Note that the actual sentences for the two 559_b's are different, although both are "challenges" to the same 559_a. So you will want all three 559s to be in the same split, counting as three different examples.
In general, there could be as many as three x_b's, one for each of the three breaker teams' data.
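If you want to see which full IDs repeat (and hence which rows must travel together), `uniq -d` on the ID column does it; a toy sketch (the file name is a stand-in for `cut -f1 challenge.tsv`):

```shell
# Toy stand-in for the first column of challenge.tsv:
printf '559_a\n559_b\n559_b\n850_a\n' > challenge-ids.demo.txt
sort challenge-ids.demo.txt | uniq -d    # -> 559_b (the repeated ID)
```

Since the train/test division is made on ID prefixes, as above, all three 559 rows land in the same split automatically.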