The Penn Discourse Treebank 2.0 (PDTB) is an incredibly rich resource for studying not only the way discourse coherence is expressed but also how information about discourse commitments (content attribution) is conveyed linguistically. However, the file format and annotation methods of the standard distribution can be an obstacle to research with this resource. The goal of this code is to remove those obstacles.
This project was originally part of my LSA Linguistic Institute 2011 course Computational Pragmatics. For much more information on the PDTB, see this page.
pdtb2.csv.zip: Reformatted and repackaged corpus. This link is password protected. I will give out the password to people who have the requisite LDC license. Unzip the file to use it.
pdtb2.py: Python classes for working with the corpus in the pdtb2.csv format.
pdtb2_functions.py: Illustrations of pdtb2.py in use.
pdtb-template.dot: Template for Graphviz output of Datum objects.
The code in this repository is compatible with Python 2 and Python 3. Its only other external dependency is NLTK, with the data installed so that WordNet is available.
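Before using the lemmatizing options, it can help to confirm that NLTK can actually load WordNet. The following is a minimal sketch and not part of this repository; the helper name wordnet_available is made up for illustration.

```python
def wordnet_available():
    """Return True if NLTK is importable and its WordNet data can be loaded."""
    try:
        from nltk.corpus import wordnet
        # Any lookup forces NLTK to load the WordNet data files.
        wordnet.synsets('factory')
        return True
    except (ImportError, LookupError):
        # ImportError: NLTK itself is not installed.
        # LookupError: NLTK is installed but the WordNet data is missing.
        return False


if __name__ == '__main__':
    if wordnet_available():
        print('WordNet is ready.')
    else:
        print('WordNet not found; try: python -m nltk.downloader wordnet')
```

If the check fails, installing NLTK and then downloading the wordnet data package should be enough.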
The main interface provided by pdtb2.py is the CorpusReader.
from pdtb2 import CorpusReader
corpus = CorpusReader('pdtb2.csv')

The central method for CorpusReader objects is iter_data, which
allows you to iterate through the data in the corpus. Intuitively,
iter_data reads each row of the source CSV file pdtb2.csv and
turns it into a Datum object, which has many methods and
attributes for working with the annotations. See pdtb2_functions.relation_count
for a simple illustration (counting datum.Relation instances). There
are 40,600 Datum objects in the corpus.
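As a rough sketch of the counting idea just described (this is not the code in pdtb2_functions.py, only an illustration written for this README), the function below tallies the Relation attribute over any iterable of Datum-like objects. The SimpleNamespace stand-ins are an assumption used so the example runs without the licensed corpus.

```python
from collections import Counter
from types import SimpleNamespace


def relation_count(data):
    """Count the Relation value of each Datum-like object in `data`."""
    return Counter(datum.Relation for datum in data)


# Stand-in objects (not real corpus rows) so the sketch runs on its own.
toy_data = [SimpleNamespace(Relation=r)
            for r in ['Explicit', 'Implicit', 'Explicit', 'EntRel']]
toy_counts = relation_count(toy_data)
print(toy_counts.most_common())

# With the real corpus (needs pdtb2.py and pdtb2.csv), the call would be:
#   from pdtb2 import CorpusReader
#   counts = relation_count(CorpusReader('pdtb2.csv').iter_data())
```

Because the function only relies on the Relation attribute, it works equally well on a filtered subset of the corpus.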
Datum objects have huge numbers of attributes and methods. For lots of details, see here. Here's a simple example of working with text and trees (with row 17 chosen because it's a manageable but illustrative case):
from pdtb2 import CorpusReader, Datum
iterator = CorpusReader('pdtb2.csv').iter_data(display_progress=False)
for _ in range(17): next(iterator)
d = next(iterator)
d.arg1_words()
['that', '*T*-1', 'hung', 'over', 'parts', 'of', 'the', 'factory', ',']
d.arg1_words(lemmatize=True)
['that', '*T*-1', 'hang', 'over', 'part', 'of', 'the', 'factory', ',']
d.arg1_pos(wn_format=True)
[('that', 'wdt'), ('*T*-1', '-none-'), ('hung', 'v'), ('over', 'in'), \
('parts', 'n'), ('of', 'in'), ('the', 'dt'), ('factory', 'n'), (',', ',')]
d.arg1_pos(lemmatize=True)
[('that', 'wdt'), ('*T*-1', '-none-'), ('hang', 'v'), ('over', 'in'), \
('part', 'n'), ('of', 'in'), ('the', 'dt'), ('factory', 'n'), (',', ',')]
len(d.Arg1_Trees)
5
for t in d.Arg1_Trees:
    t.pprint()
(WHNP-1 (WDT that))
(NP-SBJ (-NONE- *T*-1))
(VBD hung)
(PP-LOC
  (IN over)
  (NP (NP (NNS parts)) (PP (IN of) (NP (DT the) (NN factory)))))
(, ,)

There are similarly named methods for Sups, connectives, and attributions.
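The word-level methods can be combined with iteration over the corpus. Here is a hedged sketch that counts lemmatized Arg1 tokens; it uses only the arg1_words(lemmatize=True) call shown above. The function name arg1_lemma_counts and the FakeDatum class are invented for illustration, so that the example runs without the corpus.

```python
from collections import Counter


def arg1_lemma_counts(data):
    """Count lemmatized Arg1 tokens over an iterable of Datum-like objects."""
    counts = Counter()
    for datum in data:
        counts.update(datum.arg1_words(lemmatize=True))
    return counts


class FakeDatum:
    """Stand-in for a Datum object; it just returns a fixed token list."""

    def __init__(self, lemmas):
        self._lemmas = lemmas

    def arg1_words(self, lemmatize=False):
        return list(self._lemmas)


# The lemmas are copied from the row-17 example output; the second
# stand-in is a made-up token list.
row17 = ['that', '*T*-1', 'hang', 'over', 'part', 'of', 'the', 'factory', ',']
lemma_counts = arg1_lemma_counts([FakeDatum(row17), FakeDatum(['the', 'factory'])])
print(lemma_counts['factory'], lemma_counts['hang'])
```

With the real corpus, the argument would be CorpusReader('pdtb2.csv').iter_data() instead of the stand-in list. Trace tokens such as '*T*-1' are counted like any other token here; filter them out first if that is not what you want.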
The SpanList and GornList attributes are for connecting with the
Penn Treebank files. The relevant material is already inserted into
the CSV file and accessible via the _RawText and _Trees
attributes, so you probably won't need it, but it is there just in
case you need to connect with the external files.
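If you do need to connect with the external files, a Gorn address is a path of child indices into a tree. The sketch below is not taken from pdtb2.py. It assumes that an address is a list of zero-based integers whose first element picks the sentence (tree) in the file and whose remaining elements pick children, and it uses nested tuples as a stand-in for parsed trees. The name resolve_gorn is hypothetical.

```python
def resolve_gorn(trees, address):
    """Follow a Gorn address into a list of trees.

    Each tree node is a (label, children) tuple; leaves are plain strings.
    address[0] selects the tree, the rest are zero-based child indices.
    """
    node = trees[address[0]]
    for child_index in address[1:]:
        _label, children = node
        node = children[child_index]
    return node


# Toy tree mirroring the PP-LOC subtree printed in the example above.
toy_tree = ('PP-LOC', [
    ('IN', ['over']),
    ('NP', [
        ('NP', [('NNS', ['parts'])]),
        ('PP', [('IN', ['of']),
                ('NP', [('DT', ['the']), ('NN', ['factory'])])]),
    ]),
])
toy_trees = [toy_tree]

print(resolve_gorn(toy_trees, [0, 0]))        # ('IN', ['over'])
print(resolve_gorn(toy_trees, [0, 1, 1, 0]))  # ('IN', ['of'])
```

NLTK Tree objects accept a tuple of child indices directly, so with real parsed trees the inner loop could likely be replaced by a single indexing step.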
There's a much fuller overview here: http://compprag.christopherpotts.net/swda.html