Westbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)
Westbury Lab Wikipedia Corpus: snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed, as described in detail below, to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP. (1.8 GB)
: a corpus of manually-constructed explanation graphs, explanatory role ratings, and an associated semistructured tablestore for most publicly available elementary science exam questions in the US (8 MB)
Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)
Wikipedia XML Data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. (500 GB)
Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)
Yahoo! Answers consisting of questions asked in French: subset of the Yahoo! Answers corpus from 2006 to 2015 consisting of 1.7 million questions posed in French, and their corresponding answers. (3.8 GB)
Yahoo! Answers Manner Questions: subset of the Yahoo! Answers corpus from the 10/25/2007 dump, selected for their linguistic properties. Contains 142,627 questions and their answers. (104 MB)
Yahoo! HTML Forms Extracted from Publicly Available Webpages: a small sample of pages that contain complex HTML forms; contains 2.67 million complex forms. (50+ GB)
Yahoo! N-Gram Representations: this dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for the word and sentence similarity task, which is common in NLP research. (2.6 GB)
Yahoo! N-Grams, version 2.0: n-grams (n = 1 to 5), extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12,000 news-oriented sites (12 GB)
Yahoo! Search Logs with Relevance Judgments: anonymized Yahoo! Search logs with relevance judgments (1.3 GB)
Yahoo! Semantically Annotated Snapshot of the English Wikipedia: English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. 1,490,688 entries. (6 GB)
Yelp: including restaurant rankings and 2.2M reviews (on request)
Youtube: descriptions of 1.7 million YouTube videos (torrent)
- Awesome public datasets/NLP (includes more lists)
- AWS Public Datasets
- CrowdFlower: Data for Everyone (lots of small surveys they conducted and data obtained by crowdsourcing for a specific task)
- Kaggle 1, 2 (make sure, though, that the Kaggle competition data can be used outside of the competition!)
- Open Library
- Quora (primarily annotated corpora)
- /r/datasets (endless list of datasets; most are scraped by amateurs, though, and not properly documented or licensed)
- Rs.io (another big list)
- Stackexchange: Opendata
- Stanford NLP group (mainly annotated corpora and TreeBanks or actual NLP tools)
- Yahoo! Webscope (also includes papers that use the data that is provided)
- SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)
- Collection of Urdu datasets for POS, NER and NLP tasks.
German Political Speeches Corpus: collection of recent speeches held by top German representatives (25 MB, 11 MTokens)
NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available for free to all universities and non-profit organizations. A form must be signed and sent to obtain it. (on request)
Ten Thousand German News Articles Dataset: 10,273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)
100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)