Corpus and websites

Useful links to general websites:

Project Gutenberg: offers over 53,000 free e-books

Text Encoding Initiative: consortium which collectively develops and maintains a standard for the representation of texts in digital form.

Corpus of French literature: FranText

British National Corpus: BNC

Corpus Saint Jean, which is a collection of French texts for authorship attribution, available in several formats: lemmatised-UTF8, lemmatised-Windows, texts-UTF8, texts-Windows
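
Because the corpus comes in both UTF8 and Windows variants, a file has to be opened with the matching encoding or accented French characters will be garbled. The following is a minimal sketch only. It assumes that the "Windows" variants use the Windows-1252 code page (cp1252), which this page does not state. The function name and the file name in the usage note are hypothetical.

```python
from pathlib import Path

# Assumption: "-UTF8" files are UTF-8 and "-Windows" files are Windows-1252.
# The page does not state the exact Windows code page.
ENCODINGS = {"UTF8": "utf-8", "Windows": "cp1252"}


def read_corpus_text(path, variant):
    """Return the text of one corpus file, decoded according to its variant.

    `variant` is "UTF8" or "Windows", matching the suffix of the format name.
    """
    return Path(path).read_text(encoding=ENCODINGS[variant])
```

For example, read_corpus_text("texts-UTF8/sample.txt", "UTF8") would return the file as a Python string, where the path is only a placeholder for wherever the downloaded files are stored.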


Links to political websites:

American Presidency Project: contains over 123,530 documents and is growing rapidly

Miller Presidency Project: offers facts, essays, and related content on all U.S. presidents


All the software, information, and corpora distributed on this Web site are covered by the BSD License (see http://www.opensource.org/licenses/bsd-license.html), with Copyright (c) 2017, Dominique Labbé, Denis Monière, Cyril Labbé.

Essentially, this means that you can do what you like with the code and corpora, except claim another copyright for them or claim that they are issued under a different license. No commercial use of these corpora is allowed. The software is also issued without warranties, which means that if anyone suffers through its use, they cannot come back and sue you. You must also alert anyone to whom you give this software, information, or corpus that it is covered by the BSD license.