----------------------------------
CTREES: Conversation Trees Dataset
----------------------------------

This dataset contains the corpus used in

    Conversation Trees: A Grammar Model for Topic Structure in Forums,
    Annie Louis and Shay Cohen, EMNLP 2015.

It contains a collection of troubleshooting-related threads collected from
the forums on the CNET.com website. In addition, the dataset contains the
constituency trees created using the grammar described in the above paper.

------------------------------------------------------------------------------

Data
------

There are two corpus folders:

- fullCorpus: contains a collection of 13,352 threads.
- EMNLP2015Corpus: contains a subset of fullCorpus which was used in the
  EMNLP 2015 paper. The test and development data in this corpus is limited
  to threads containing a minimum of 3 and a maximum of 50 posts. The
  training data remains the same.

Each corpus folder has three subfolders, for example:

= fullCorpus/raw =

Contains the thread contents as present on CNET.com. Each row is a post.
The columns are as follows:

1. thread identifier: Rows with the same identifier are posts in the same
   thread.
2. time order: Starting from 0, gives the time order of posts in the thread;
   post i was posted before post i + 1.
3. CNET thread identifier: Another thread identifier field (the original
   identifier used on CNET.com).
4. CNET post identifier: A post identifier assigned on CNET.com.
5. CNET parent: Specifies the parent of this post in the reply structure
   obtained from CNET. 'NR' stands for 'NO PARENT', i.e. it is the first post
   in the thread. When the value is not 'NR', the identifier maps to the post
   with that post identifier (column 4).
6. user identifier: The user name of the person who made the post.
7. post time (h): Human-readable time of posting.
8. post time (l): A long-number representation of the post time, used to
   order the posts.
9. subject line: The subject line of the post.
10. post text: The post content.

= fullCorpus/processed =

Is produced by removing links, stopwords and punctuation from the raw files,
and has been tokenized. This version is used in the paper. Each row is a post,
same as above, except that columns 9 and 10 have been normalized.

= fullCorpus/trees =

Contains trees for the training/dev/test data. It contains data for carrying
out experiments on grammar-based models of topic structure or other tasks.
This data is explained below.

-------------------------------------------------------------------------------------

Trees
-----

There are three types of trees: dep, ctrees, binctrees.

- dep: dependency trees indicating the reply structure
- ctrees: non-binary constituency trees according to the base grammar in the
  paper
- binctrees: binary constituency trees corresponding to the same ctrees

The format of a tree file is as follows. Each row is a node in the tree.

1. thread id: Identifies the thread (maps to column 1 of the raw and
   processed files). Rows with the same thread id are nodes in the same tree.
2. node identifier: Those prefixed with 'ROOT' are root nodes of the tree,
   those prefixed with 'NID' are tree-internal nodes, and those starting with
   a number (the thread id) are leaves.
3. non-terminal/terminal symbol: Root nodes are S, preterminals are T, other
   internal nodes are X. The posts have a symbol starting with 'P'; these are
   terminal symbols. Refer to the paper for the base grammar definition.
4. parent node: Identifier of the parent node. 'NOPAR' indicates 'NO PARENT'
   for root nodes.
5. head node: Identifier of the head node, the leftmost post in the span of
   the node.
6. children nodes: List of child node identifiers (separated by commas).
7. non-projectivity: false for a projective tree, true for a non-projective
   tree.
8. number of crossings: Non-zero only for non-projective trees; indicates the
   number of points in the tree where one edge crosses another.
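The tree-file columns described above could be read along the following lines. This is only a sketch and not part of the released code: the column separator is not stated in this README, so the tab default for `sep` is an assumption, and the names `parse_tree_rows`, `node_kind` and `SAMPLE_ROWS` are hypothetical. The sample rows are made up for illustration from the column descriptions and are not copied from the corpus.

```python
from collections import defaultdict

# Hypothetical rows written from the column descriptions, not taken from the data.
SAMPLE_ROWS = [
    "1034\tROOT1034\tS\tNOPAR\t1034t0\tNID1034t0\tfalse\t0",
    "1034\tNID1034t0\tT\tROOT1034\t1034t0\t1034t0\tfalse\t0",
    "1034\t1034t0\tP-1034t0\tNID1034t0\t1034t0\t\tfalse\t0",
]


def parse_tree_rows(lines, sep="\t"):
    """Group tree-file rows by thread id (column 1).

    `sep` is an assumption: check the actual files before relying on it.
    Returns {thread_id: {node_id: node_dict}}.
    """
    threads = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(sep)
        if len(fields) != 8:
            raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
        thread_id, node_id, symbol, parent, head, children, nonproj, crossings = fields
        threads[thread_id][node_id] = {
            "symbol": symbol,
            # 'NOPAR' marks a root node, which has no parent.
            "parent": None if parent == "NOPAR" else parent,
            "head": head,
            # Column 6 is a comma-separated list; an empty field means no children.
            "children": [c for c in children.split(",") if c],
            "non_projective": nonproj.strip().lower() == "true",
            "crossings": int(crossings),
        }
    return dict(threads)


def node_kind(node_id):
    """Classify a node by its identifier prefix (column 2)."""
    if node_id.startswith("ROOT"):
        return "root"
    if node_id.startswith("NID"):
        return "internal"
    return "leaf"


if __name__ == "__main__":
    for thread_id, nodes in parse_tree_rows(SAMPLE_ROWS).items():
        for node_id, node in nodes.items():
            print(thread_id, node_kind(node_id), node_id, node["symbol"])
```

To read a real file, pass an open file handle instead of `SAMPLE_ROWS` and adjust `sep` if the columns turn out to be separated differently.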
-------------------------------------------------------------------------------------

Code
----

Some code is provided for pretty-printing the trees. The parser code will be
released soon.

Usage examples:

    java printPretty fullCorpus/trees/train.dtrees | less

    P#1034t0[1034t0]
        P#1034t1[1034t1]
        P#1034t2[1034t2]

prints the dependency training trees. In a dependency tree, each node is
printed as P#postid[headpost]; the head post is the same as the current post.
In this tree, posts 1 and 2 are individual replies to post 0.

    java printPretty fullCorpus/trees/train.ctrees | less

    S#ROOT1034[1034t0]
        T#NID1034t0[1034t0]
            P-1034t0#1034t0[1034t0]
        X#NID1034t1[1034t1]
            T#NID1034t2[1034t1]
                P-1034t1#1034t1[1034t1]
        X#NID1034t3[1034t2]
            T#NID1034t4[1034t2]
                P-1034t2#1034t2[1034t2]

prints constituency trees. Here each node is printed as its nonterminal or
terminal symbol, followed by # and then the node id. The head post is shown
within square brackets.
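If the pretty-printed output needs to be processed further, a node label of the form symbol#nodeid[headpost] can be split with a small regular expression. The sketch below is not part of the released code: `NodeLabel` and `parse_label` are hypothetical names, and the pattern assumes that symbols contain no '#' and that identifiers contain no brackets or whitespace, which holds for the examples above but has not been checked against the full output.

```python
import re
from typing import NamedTuple


class NodeLabel(NamedTuple):
    symbol: str   # e.g. 'S', 'T', 'X', 'P' or 'P-1034t0'
    node_id: str  # e.g. 'ROOT1034', 'NID1034t1' or '1034t0'
    head: str     # head post shown in square brackets


# Assumed label shape: symbol#nodeid[headpost]
_LABEL = re.compile(
    r"^(?P<symbol>[^#\s]+)#(?P<node_id>[^\[\s]+)\[(?P<head>[^\]\s]+)\]$"
)


def parse_label(line):
    """Split one pretty-printed node label, ignoring surrounding whitespace."""
    match = _LABEL.match(line.strip())
    if match is None:
        raise ValueError(f"not a node label: {line!r}")
    return NodeLabel(match["symbol"], match["node_id"], match["head"])


if __name__ == "__main__":
    for text in ["P#1034t0[1034t0]", "S#ROOT1034[1034t0]", "P-1034t1#1034t1[1034t1]"]:
        print(parse_label(text))
```

Leading indentation is stripped before matching, so the same function works on nested lines of the printed tree.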