Aaron Ecay

The PYCCLE-TCP corpus

The Penn-York Computer Annotated Corpus of a Large amount of English based on the TCP has a silly name, but a nifty acronym: PYCCLE-TCP. The corpus comprises roughly one billion (10^9) words of English printed between 1473 and 1800. These texts were scanned by the Early English Books Online (EEBO; to 1700) and Eighteenth Century Collections Online (ECCO; 1700–1800) projects, and converted to machine-readable text by the Text Creation Partnership (TCP). I then applied part-of-speech tagging to these texts, using the PPCEME as a training set. The resulting corpus provides an extremely rich source of data on the history of English.

In my own research on do-support, I was able to expand my dataset by roughly two orders of magnitude. This is despite having to “spend” a lot of data (roughly two more orders of magnitude) compensating for the lack of structural annotation in the data. Targeted manual data collection could overcome this challenge in certain cases, as could advances in unsupervised parsing techniques.

Another use of the corpus is locating highly infrequent constructions. For example, do-support with have begins spreading noticeably in the early 1800s in American English, and only perceptibly enters British English a century later. Nonetheless, I was able to locate roughly 25 tokens of this construction in the PYCCLE, an occurrence rate of roughly once per 40 million words. By way of comparison, the parsed corpora of EME (PPCEME + PCEEC) together comprise only roughly 4.5 million words.
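The arithmetic behind this comparison can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the approximate figures quoted above; the function and variable names are illustrative, and the calculation assumes (unrealistically) that the construction is spread evenly across the corpus.

```python
# Approximate figures taken from the text above; all are rough estimates.
PYCCLE_WORDS = 1_000_000_000   # roughly one billion words in the PYCCLE-TCP
HAVE_DO_TOKENS = 25            # roughly 25 tokens of do-support with "have"
PARSED_WORDS = 4_500_000       # PPCEME + PCEEC, roughly 4.5 million words


def words_per_token(corpus_words: int, tokens: int) -> float:
    """Average number of words of text per attested token."""
    return corpus_words / tokens


def expected_tokens(corpus_words: int, rate: float) -> float:
    """Tokens expected in a corpus of the given size, assuming a uniform rate."""
    return corpus_words / rate


rate = words_per_token(PYCCLE_WORDS, HAVE_DO_TOKENS)
expected = expected_tokens(PARSED_WORDS, rate)

print(f"One token per {rate:,.0f} words")
print(f"Expected tokens in the parsed corpora: {expected:.2f}")
```

Under these assumptions, the parsed corpora would be expected to contain well under one token of the construction, which is why a corpus of this size is needed to find it at all.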

The corpus can be accessed from its GitHub page. (Due to licensing restrictions inherited from the TCP, only roughly half of the corpus can be made publicly available. However, I can give you access to the whole corpus if you can verify that your institution subscribes to the TCP.)