==== Introduction ====

This README v1.0 (June 2012) for the Cornell GMOhedging v1.0 dataset comes from the URL
https://confluence.cornell.edu/display/llresearch/HedgingFramingGMOs .

The data is potentially relevant to research on (a) framing, especially with respect to the debate over the use of genetically modified organisms (GMOs) and/or the differences between "professional-science" and "pop-science" discourse, and (b) hedging.

==== Citation info: ====

This data was first used in:

@InProceedings{Choi+al:2012,
  author = {Eunsol Choi and Chenhao Tan and Lillian Lee and Cristian Danescu-Niculescu-Mizil and Jennifer Spindel},
  title = {Hedge detection as a lens on framing in the {GMO} debates: A position paper},
  booktitle = {Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics},
  year = {2012}
}

==== Zip file contents ====

This README (README.txt), plus the following files:

gmostyle.pdf          154K   [paper describing our work]
lexis                 4.8M   ["pop science"]
wos                   893K   ["professional science"]
pro_GMO               4.7M
processed_pro_GMO     500K
anti_GMO              4.2M
processed_anti_GMO    514K
corpus_table          35K
sample_annotation     39K

==== lexis ====

Format: 928 raw documents contained within a single text file, delimited by lines of the form "Document Number: n", where n ranges from 1 to 928 inclusive. Line breaks from the original LexisNexis source are indicated by ASCII code 13 (carriage return; these may appear as "\r" or "^M"). Other control characters, non-standard characters, and document headers and footers may also have been retained.

Derivation: These 928 documents were collected from the LexisNexis database using the search keywords "genetically modified foods" or "transgenic crops", with the search limited to US newspapers.
We then eliminated articles that did not contain at least two occurrences of the following keywords: GMO, GM, GE, genetically modified, genetic modification, modified, modification, genetic engineering, engineered, bioengineered, franken, transgenic, spliced, G.M.O., tweaked, manipulated, engineering, pharming, aquaculture. We also eliminated articles containing over 2000 words.

==== wos ====

Format: 648 raw abstracts (considered as documents) contained within a single text file, delimited by lines of the form "Document Number: n", where n ranges from 1 to 648 inclusive. Line breaks from the original source are preserved (not indicated by control characters).

Derivation: From Thomson Reuters' Web of Science (WOS), a database of scientific journal and conference articles, 648 scientific paper abstracts were collected using "transgenic foods" as a search keyword. We discarded results containing either of the two off-topic filtering terms "mice" or "rats". After this, we manually removed off-topic texts from the collection.

==== {anti,pro}_GMO ====

Format: Raw (plaintext, no HTML markup) documents contained within a single text file, delimited by lines of the form "Document Number: n", where n is a random number between 1 and 9999 inclusive. Each line represents a paragraph, and each such line within a document is separated from the next by a blank line. (The "Document Number: n" delimiter lines are not separated by a blank line from the first line of the document.)

Derivation: We used our (in particular, Jennifer Spindel's) domain expertise to compile a list of 20 anti-GMO and 20 pro-GMO organization websites. After the initial collection of data from these websites, near-duplicates and irrelevant articles were filtered out using clustering, keyword searches, and distances between word vectors at the document level.
'anti_GMO', a collection of articles representing opponents of GMOs, contains 762 articles, and 'pro_GMO', a collection of articles representing proponents of GMOs, contains 671 articles. The mapping from document number to its position (pro or anti) and source is provided in the 'corpus_table' file.

==== processed_* files ====

Format: Same as in the anti_GMO and pro_GMO files, except that:

1. The two files contain only 404 "pro" and 404 "anti" documents, respectively, to form a balanced corpus.
2. Each retained "document" consists of only the first 200 words after excluding the first 50 words of documents containing over 280 words. This was done to avoid irrelevant sections such as "Educators have permission to reprint articles for classroom use; other users, please contact editor@actionbioscience.org for reprint permission. See reprint policy".

==== corpus_table ====

This file provides a mapping between document number and its source. Each line in the corpus_table file has three columns: the first gives the document number, the second its position ("pro" or "anti"), and the third its source.

Examples:

3171 pro biofortified
3174 anti center_for_food_safety

The following lists the website URL for each source name in the third column (e.g., "biofortified" or "center_for_food_safety"). (These URLs might no longer work, as the data was collected in summer 2011.)

abbreviation            url
---------------------------------------------------------------------------
green_peace             http://www.greenpeace.org
natural_news            http://www.naturalnews.com/list_features_GMOs.html
say_no_to_gmo           http://www.saynotogmos.org/scientists_speak
gmo_watch               http://www.gmwatch.eu
soil_assoc              http://www.soilassociation.org/Whyorganic/GM/News
environment_common      http://environmentalcommons.org/
nano_transform          http://nanotransformation.com
organic_consumers       http://www.organicconsumers.org
responsible_technology  http://www.responsibletechonology.org
center_for_food_safety  http://www.centerforfoodsafety
sierra club             http://www.sierraclub.org
gmfree_scotland         http://gmfreescotland.blogspot.com/
non_gmo_project         http://www.nongmoproject.org/
psrast                  http://psrast.org/
gmo_awareness           http://gmo-awareness.com/
gmo_journal             http://www.gmo-journal.com/
action_bioscience       http://www.actionbioscience.org/biotech/pusztai.html
harvest_of_fear_n       http://www.pbs.org/wgbh/harvest/
gmo_danger              http://userwww.sfsu.edu/~rone/GEessays/gedanger.htm
biofortified            http://www.biofortified.org/page/
agbioworld              http://www.agbioworld.org/newsletter_wm/
better_foods            http://www.betterfoods.org
biotechnow              http://www.biotech-now.org/food-and-agriculture
whybiotech              http://www.whybiotech.com
gmo_africa              http://www.gmoafrica.org
growers                 http://www.growersforwheatbiotechnology.org/html/gwb_news.cfm
isaaa                   http://www.isaaa.org/
golden_rice             http://www.goldenrice.org/
soy_connection          http://www.soyconnection.com/soybean_oil/benefits_of_biotechnology.php
biotechnology           http://www.bio.org/category/41
ncbe                    http://www.ncbe.reading.ac.uk/NCBE/GMFOOD/menu.html
harvest_of_fear         http://www.pbs.org/wgbh/harvest/
monsanto                http://www.monsanto.com/newsviews/Pages/biotech-safety-gmo-advantages.aspx
bt                      http://www.bt.ucsd.edu/gmo.html
who                     http://www.who.int/foodsafety/publications/biotech/20questions/en/

==== sample_annotation ====

This file contains 200 hand-annotated randomly-sampled sentences, half from the wos dataset and half from the lexis
dataset. It is delimited by lines of the form "Sentence Number: n (1|-1|?) (1|-1|?) (LEXIS|WOS) ", where n is a random number between 1 and 200 inclusive. The second column indicates the first annotator's opinion and the third column indicates the second annotator's opinion, according to the following label scheme:

 1 = sentence is certain
-1 = sentence is uncertain (contains hedging)
 ? = not a proper sentence

The fourth column indicates the sentence's source, either lexis or wos. (The annotators were not privy to this source information.)

Example:

Sentence Number: 198 1 -1 LEXIS
But the real future of biotechnology lies in addressing the special problems faced by farmers in less developed nations.

This is the 198th sentence. It was classified as 'certain' by the first annotator and 'uncertain' by the second annotator, and was extracted from the lexis dataset.

The details of the annotation policy are described in Section 4 ("Hedging to distinguish scientific text: Initial annotation") of the accompanying paper.
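
As a rough illustration of the header format described above, the following Python sketch reads sample_annotation-style text into records and counts how often the two annotators agree. It is not part of the original distribution. The function names (parse_annotations, agreement_by_source) are hypothetical, and the code assumes that each sentence's text follows its "Sentence Number" header on the next line(s).

```python
import re
from collections import Counter

# Header layout as described in this README:
#   "Sentence Number: n <annotator 1> <annotator 2> <LEXIS|WOS>"
HEADER = re.compile(
    r"^Sentence Number:\s*(\d+)\s+(1|-1|\?)\s+(1|-1|\?)\s+(LEXIS|WOS)$"
)


def parse_annotations(text):
    """Return one dict per annotated sentence (hypothetical helper)."""
    records = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = HEADER.match(stripped)
        if match:
            number, first, second, source = match.groups()
            current = {
                "number": int(number),
                "first": first,    # first annotator: "1", "-1", or "?"
                "second": second,  # second annotator: "1", "-1", or "?"
                "source": source,  # "LEXIS" or "WOS"
                "sentence": "",
            }
            records.append(current)
        elif current is not None and stripped:
            # Assumption: the sentence text follows its header line.
            current["sentence"] = (current["sentence"] + " " + stripped).strip()
    return records


def agreement_by_source(records):
    """Count sentences per (source, annotators_agree) pair."""
    counts = Counter()
    for record in records:
        counts[(record["source"], record["first"] == record["second"])] += 1
    return counts


sample = """Sentence Number: 198 1 -1 LEXIS
But the real future of biotechnology lies in addressing the special problems faced by farmers in less developed nations.
"""
records = parse_annotations(sample)
print(records[0]["number"], records[0]["first"], records[0]["second"], records[0]["source"])
print(agreement_by_source(records))
```

To run this on the real file, the text would need to be read from sample_annotation first. The file's character encoding is not stated in this README, so a tolerant read (for example, open(path, encoding="utf-8", errors="replace")) may be a reasonable starting point.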