Corpus Readers and Custom Corpora

What is a corpus?

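A corpus is a large collection, in a structured format, of machine-readable texts that were produced in a natural communicative setting. The word corpora is the plural of corpus. A corpus can be derived in many ways: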

From the text that was originally electronic

From the transcripts of spoken language

From optical character recognition and so on

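Corpus representativeness, corpus balance, sampling and corpus size are the elements that play an important role when designing a corpus. Some of the most popular corpora for NLP tasks are TreeBank, PropBank, VerbNet and WordNet.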

How to build a custom corpus?

While downloading NLTK, we also installed the NLTK data package, so it is already available on our computer. On Windows, we will assume that this data package is installed at C:\nltk_data; on Linux, Unix and Mac OS X, we will assume that it is installed at /usr/share/nltk_data.

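In the following Python recipe, we are going to create custom corpora, which must live within one of the paths defined by NLTK so that NLTK can find them. To avoid conflict with the official NLTK data package, let us create a custom nltk_data directory in our home directory.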


import os
# ~/nltk_data is one of the locations NLTK searches by default
path = os.path.expanduser('~/nltk_data')
if not os.path.exists(path):
    os.mkdir(path)
os.path.exists(path)

Output


True
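
The recipe above uses os.mkdir, which raises an error if a parent directory is missing. A slightly more defensive variant, sketched here on the assumption that creating intermediate directories is acceptable, uses os.makedirs with exist_ok=True. The helper name ensure_corpus_dir is hypothetical, and a temporary directory stands in for the home directory so the sketch does not touch real files.

```python
import os
import tempfile


def ensure_corpus_dir(base_dir, name="nltk_data"):
    """Create base_dir/name if needed and return its path (hypothetical helper)."""
    path = os.path.join(base_dir, name)
    # exist_ok=True makes the call safe to repeat.
    os.makedirs(path, exist_ok=True)
    return path


# Assumption: a temporary directory plays the role of the home directory.
with tempfile.TemporaryDirectory() as fake_home:
    created = ensure_corpus_dir(fake_home)
    existed_after_create = os.path.isdir(created)
    # A second call returns the same path without raising.
    same_path = ensure_corpus_dir(fake_home) == created
```

In real use, base_dir would be os.path.expanduser('~').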

Now, let us check whether the nltk_data directory in our home directory is one of the paths NLTK searches:


import nltk.data
path in nltk.data.path

Output


True

Since the output is True, the nltk_data directory in our home directory is on NLTK's search path.

Now, we will create a wordlist file named wordfile.txt, put it in a folder named corpus inside the nltk_data directory (~/nltk_data/corpus/wordfile.txt), and load it using nltk.data.load:

import nltk.data
# format='raw' returns the file contents as bytes
nltk.data.load('corpus/wordfile.txt', format='raw')

Output

b'tutorialspoint\n'
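
The whole round trip can be sketched in one piece. This is a minimal example, not the only way to do it: it assumes a temporary directory is used in place of ~/nltk_data, temporarily puts that directory at the front of nltk.data.path so the lookup is unambiguous, and passes cache=False so an earlier load of a resource with the same name cannot be returned.

```python
import os
import tempfile

import nltk.data

# Assumption: a temporary directory stands in for ~/nltk_data.
with tempfile.TemporaryDirectory() as data_root:
    corpus_dir = os.path.join(data_root, "corpus")
    os.makedirs(corpus_dir)
    file_path = os.path.join(corpus_dir, "wordfile.txt")
    with open(file_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("tutorialspoint\n")

    # NLTK searches the directories in nltk.data.path in order.
    nltk.data.path.insert(0, data_root)
    try:
        raw_bytes = nltk.data.load(
            "corpus/wordfile.txt", format="raw", cache=False
        )
    finally:
        # Restore the search path so the sketch has no side effects.
        nltk.data.path.remove(data_root)
```

Here raw_bytes holds the file contents as a bytes object, matching the output shown above.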

Corpus readers

NLTK provides various CorpusReader classes. We are going to cover them in the following Python recipes.

Creating wordlist corpus

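NLTK has the WordListCorpusReader class, which provides access to a file containing a list of words. For the following Python recipe, we need to create a wordlist file, which can be a CSV or a plain text file with one word per line. For example, we have created a file named 'list' that contains the following data: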

tutorialspoint
Online
Free
Tutorials

Now, let us instantiate the WordListCorpusReader class to produce the list of words from our file 'list'.

from nltk.corpus.reader import WordListCorpusReader
# '.' is the corpus root directory; ['list'] names the files to read
reader_corpus = WordListCorpusReader('.', ['list'])
reader_corpus.words()

Output

['tutorialspoint', 'Online', 'Free', 'Tutorials']
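
A self-contained version of this recipe is sketched below. It assumes the wordlist is written to a temporary directory rather than the current directory, so the example can run anywhere; the file name 'list' and its contents come from the text above.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Assumption: a temporary directory is used as the corpus root.
with tempfile.TemporaryDirectory() as corpus_root:
    with open(os.path.join(corpus_root, "list"), "w", encoding="utf-8") as handle:
        # One word per line, as WordListCorpusReader expects.
        handle.write("tutorialspoint\nOnline\nFree\nTutorials\n")

    reader = WordListCorpusReader(corpus_root, ["list"])
    words = reader.words()      # list of the non-blank lines
    file_ids = reader.fileids() # files that make up this corpus
```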

Creating POS tagged word corpus

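NLTK has the TaggedCorpusReader class, with the help of which we can create a POS tagged word corpus. POS tagging is the process of identifying the part-of-speech tag for each word.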

One of the simplest formats for a tagged corpus is the 'word/tag' form, as in the following excerpt from the brown corpus:

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.

In the above excerpt, each word has a tag that denotes its part of speech. For example, vb refers to a verb.

Now, let us instantiate the TaggedCorpusReader class to produce POS tagged words from the file 'list.pos', which contains the above excerpt.

from nltk.corpus.reader import TaggedCorpusReader
# The second argument is a regular expression matching the corpus files
reader_corpus = TaggedCorpusReader('.', r'.*\.pos')
reader_corpus.tagged_words()

Output

[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
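
The same recipe can be sketched end to end. The example assumes the excerpt is written to a temporary directory instead of the current directory; it also shows that the reader can return the plain words without their tags. Note that the tags come back in upper case even though the file stores them in lower case.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

EXCERPT = (
    "The/at-tl expense/nn and/cc time/nn involved/vbn "
    "are/ber astronomical/jj ./."
)

# Assumption: a temporary directory is used as the corpus root.
with tempfile.TemporaryDirectory() as corpus_root:
    with open(os.path.join(corpus_root, "list.pos"), "w", encoding="utf-8") as handle:
        handle.write(EXCERPT + "\n")

    reader = TaggedCorpusReader(corpus_root, r".*\.pos")
    # The reader returns lazy views; list() materialises them.
    tagged = list(reader.tagged_words())
    plain = list(reader.words())
```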

Creating Chunked phrase corpus

NLTK has the ChunkedCorpusReader class, with the help of which we can create a chunked phrase corpus. A chunk is a short phrase within a sentence.

For example, we have the following excerpt from the tagged treebank corpus:

[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.

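In the above excerpt, every chunk is a noun phrase, while the words outside the brackets are part of the sentence tree but not part of any noun phrase subtree.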

Now, let us instantiate the ChunkedCorpusReader class to produce chunked phrases from the file 'list.chunk', which contains the above excerpt.

from nltk.corpus.reader import ChunkedCorpusReader
# Use ChunkedCorpusReader (not TaggedCorpusReader) so that chunked_words() is available
reader_corpus = ChunkedCorpusReader('.', r'.*\.chunk')
reader_corpus.chunked_words()

Output

[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ...]
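
A runnable sketch of the chunked recipe follows. It assumes the excerpt is stored in a temporary directory, and it adds one step that is not in the recipe: filtering the result for Tree objects to pick out the noun phrase chunks.

```python
import os
import tempfile

from nltk.corpus.reader import ChunkedCorpusReader
from nltk.tree import Tree

EXCERPT = (
    "[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN "
    "about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./."
)

# Assumption: a temporary directory is used as the corpus root.
with tempfile.TemporaryDirectory() as corpus_root:
    with open(os.path.join(corpus_root, "list.chunk"), "w", encoding="utf-8") as handle:
        handle.write(EXCERPT + "\n")

    reader = ChunkedCorpusReader(corpus_root, r".*\.chunk")
    # Bracketed spans become Tree objects; other words stay (word, tag) tuples.
    chunks = list(reader.chunked_words())
    noun_phrases = [item for item in chunks if isinstance(item, Tree)]
```

Each bracketed span in the file yields one Tree labelled NP, so noun_phrases has three entries for this excerpt.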

Creating Categorized text corpus

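NLTK has the CategorizedPlaintextCorpusReader class, with the help of which we can create a categorized text corpus. It is very useful when we have a large corpus of text and want to divide it into separate sections.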

For example, the brown corpus has several different categories. Let us list them with the help of the following Python code:

from nltk.corpus import brown
brown.categories()

Output

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

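One of the easiest ways to categorize a corpus is to have one file for every category. For example, consider the following two excerpts from the movie_reviews corpus: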

movie_pos.txt

The thin red line is flawed but it provokes.

movie_neg.txt

A big-budget and glossy production cannot make up for a lack of spontaneity that permeates their TV show.

So, from the above two files, we have two categories, namely pos and neg.
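
The step from file name to category can be sketched with a plain regular expression: the first capture group names the category. The helper category_of is hypothetical and uses only the standard library; it illustrates the idea and is not how NLTK is implemented.

```python
import re

# The capture group (\w+) picks out the category part of the file name.
CATEGORY_PATTERN = re.compile(r"movie_(\w+)\.txt")


def category_of(file_name):
    """Return the category encoded in file_name, or None (hypothetical helper)."""
    match = CATEGORY_PATTERN.match(file_name)
    return match.group(1) if match else None


file_names = ["movie_pos.txt", "movie_neg.txt"]
categories = sorted({category_of(name) for name in file_names})
```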

Now, let us instantiate the CategorizedPlaintextCorpusReader class.

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
# cat_pattern captures the category name from each file name
reader_corpus = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')
reader_corpus.categories()
reader_corpus.fileids(categories=['neg'])
reader_corpus.fileids(categories=['pos'])

Output

['neg', 'pos']
['movie_neg.txt']
['movie_pos.txt']
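
The categorized recipe can also be run as one self-contained sketch. It assumes the two review files are written to a temporary directory; the file names and the cat_pattern argument are the ones used in the recipe, and the review sentences are only placeholder contents for the files.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Placeholder file contents keyed by file name.
REVIEWS = {
    "movie_pos.txt": "The thin red line is flawed but it provokes.\n",
    "movie_neg.txt": "A big-budget and glossy production cannot make up "
                     "for a lack of spontaneity.\n",
}

# Assumption: a temporary directory is used as the corpus root.
with tempfile.TemporaryDirectory() as corpus_root:
    for name, text in REVIEWS.items():
        with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    reader = CategorizedPlaintextCorpusReader(
        corpus_root,
        r"movie_.*\.txt",
        cat_pattern=r"movie_(\w+)\.txt",  # group 1 becomes the category
    )
    all_categories = reader.categories()
    neg_files = reader.fileids(categories=["neg"])
    pos_files = reader.fileids(categories=["pos"])
```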