Corpora: Purchased/Acquired
Access: NLP Lab Only (See the CSE departmental office for copies of license agreements)
- DUC Summarization evaluation, 2005-2007, National Institute of Standards and Technology (NIST)
- TAC Summarization evaluation, 2008-2011, National Institute of Standards and Technology (NIST)
- Wikipedia Corpus, 1.9 billion words of English, 2014, Mark Davies
- Penn Discourse Treebank (PDTB) 3.0, 53600 tokens, 2018, Bonnie Webber
Access: Linguistic Data Consortium Membership (LDC)
- Co-subscriber units (contact name): CSE (Becky Passonneau, rjp49@psu.edu); IST (Ting-Hao Kenneth Huang, txh710@psu.edu); Center for Social Data Analytics: C-SoDA (Burt Monroe, burtmonroe@psu.edu); Center for Language Acquisition (Kevin McManus, kmcmanus@psu.edu)
- Other units are welcome to co-subscribe with us (contact: rjp49@psu.edu).
LDC corpora received under the current subscription (2017-present). Any PSU student or instructor can request access to these corpora, which PSU has already paid for.
All LDC Corpora PSU has rights to are listed here.
Corpus Name | Requester | Path in the Server |
Conll-formatted-ontonotes-5.0 | IST (Sarah Rajtmajer) | /home/nlp/corpora |
Concretely_annotated_gigaword | SoDA (Burt Monroe) | /home/nlp/corpora |
RST_discourse_treebank | CSE (Becky Passonneau) | /home/nlp/corpora |
 | SoDA (Burt Monroe); Applied Ling (Susan Strauss) | /home/nlp/corpora |
 | IST (Prasenjit Mitra) | /home/nlp/corpora |
 | IST (Sarah Rajtmajer) | |
Annotation guidelines we create or help create
Access: For NLP Lab only
- Annotationguidelines.pdf
GitHub: Penn State NLP Group
Access: Open source for anyone / Private for NLP Lab only
- EasyCCG-Tree-Categorization (Private)
- RL-Reading-Group-Fall18, Reinforcement Learning Reading Group (Fall ’18 and Spring ’19)
- Docx2Txt (Private), an REU project on an NSF CyberLearning project: EAGER: Collab: Automated Instruction Assistant.
- SEAView, a tool for annotating content in two-part essays, which contain a summary and an argument.
- DucView, a tool for creating and using pyramids, a method for summary content annotation.