Making projects and corpora

The first two things you need to do when using corpkit are to create a project and to create (and optionally parse) a corpus. Both steps can be done quickly using shell commands, but they can also be done from within the interpreter.

Once you’re in corpkit, the command below will create a new project called iran-news, and move you into it.

> new project named iran-news

Adding a corpus

Adding a corpus copies it to the project’s data directory. The syntax is simple:

> add '../../my_corpus'

Parsing a corpus

To parse a text file, a folder of text files, or a folder of folders of text files, you first set the corpus, and then use the parse command:

> set my_corpus as corpus
> parse corpus

Tokenising, POS tagging and lemmatising

If you don’t want or need full parses, or if you aren’t working with English, you might want to use the tokenise command instead.

> set abstracts as corpus
> tokenise corpus

POS tagging and lemmatisation are switched on by default, but you can disable either or both:

> tokenise corpus with postag as false and lemmatise as false

Working with metadata

Parsing and tokenising become much more useful when your data has some metadata in it. The metadata will be transferred over to the parsed version of the corpus, and then you can search or filter by metadata features, use metadata values as symbolic subcorpora, or display metadata alongside concordances.

Metadata should take the form of an XML tag at the end of a line, which could be a sentence or a paragraph:

I hope everyone is hanging in with this blasted heat. As we all know being hot, sticky,
stressed and irritated can bring on a mood swing super fast. So please make sure your
all takeing your meds and try to stay out of the heat. <metadata username="Emz45"
totalposts="5063" currentposts="4051" date="2011-07-13" postnum="0" threadlength="1">
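If you want to check that your files follow this format before parsing, a minimal Python sketch like the one below can help. It is not part of corpkit: the `split_metadata` function and both regular expressions are hypothetical helpers, and the sketch assumes the tag is named `metadata`, sits at the very end of a single line, and uses double-quoted attribute values, as in the example above.

```python
import re

# Assumption: the tag is called "metadata" and closes the line, as in the example.
METADATA_TAG = re.compile(r"<metadata\s+([^>]*)>\s*$")
# Assumption: attributes are written as key="value" with double quotes.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def split_metadata(line):
    """Return (text, metadata dict) for one line of an unparsed corpus file."""
    match = METADATA_TAG.search(line)
    if match is None:
        # No trailing tag: the whole line is text and there is no metadata.
        return line.rstrip(), {}
    text = line[:match.start()].rstrip()
    metadata = dict(ATTRIBUTE.findall(match.group(1)))
    return text, metadata


line = ('So please make sure your all takeing your meds and try to stay '
        'out of the heat. '
        '<metadata username="Emz45" totalposts="5063" date="2011-07-13">')
text, metadata = split_metadata(line)
print(text)
print(metadata)
```

Lines for which the returned dictionary is empty are the ones whose metadata would not be picked up under these assumptions.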

Then, parse with metadata:

> parse corpus with metadata

The parser output will look something like:

# sent_id 1
# parse=(ROOT (S (NP (PRP I)) (VP (VBP hope) (SBAR (S (NP (NN everyone)) (VP (VBZ is) (VP (VBG hanging) (PP (IN in) (IN with) (NP (DT this) (VBN blasted) (NN heat)))))))) (. .)))
# speaker=Emz45
# totalposts=5063
# threadlength=1
# currentposts=4051
# stage=10
# date=2011-07-13
# year=2011
# postnum=0
1   1   I         I         PRP O   2   nsubj      0       1
1   2   hope      hope      VBP O   0   ROOT       1,5,11  _
1   3   everyone  everyone  NN  O   5   nsubj      0       _
1   4   is        be        VBZ O   5   aux        0       _
1   5   hanging   hang      VBG O   2   ccomp      3,4,10  _
1   6   in        in        IN  O   10  case       0       _
1   7   with      with      IN  O   10  case       0       _
1   8   this      this      DT  O   10  det        0       2
1   9   blasted   blast     VBN O   10  amod       0       2
1   10  heat      heat      NN  O   5   nmod:with  6,7,8,9 2*
1   11  .         .         .   O   2   punct      0       _
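The comment lines at the top of each sentence carry the sentence-level metadata. If you ever need to read them outside corpkit, the sketch below shows one way to do it in plain Python. The `read_sentence_metadata` function is a hypothetical helper, and it assumes every metadata line starts with `# ` and is either `key=value` or, like `sent_id`, a key and value separated by a space.

```python
def read_sentence_metadata(block):
    """Collect the '# ...' lines of one sentence block into a dict."""
    metadata = {}
    for line in block.splitlines():
        if not line.startswith("# "):
            continue  # token rows and blank lines are ignored here
        body = line[2:]
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = body.partition("=")
        if not sep:
            # Lines such as "# sent_id 1" have no "=": split on the first space.
            key, _, value = body.partition(" ")
        metadata[key.strip()] = value.strip()
    return metadata


# A shortened copy of the output above (the parse line is left out for brevity).
sample = """\
# sent_id 1
# speaker=Emz45
# totalposts=5063
# date=2011-07-13
1   1   I         I         PRP O   2   nsubj      0       1
"""

sentence_metadata = read_sentence_metadata(sample)
print(sentence_metadata)
```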

Viewing corpus data

You can work with the parser output interactively.

> get file <n> of corpus

Or, if your corpus has subcorpora:

> get subcorpus <n> of corpus
> get file <n> of sampled

This view can be surprisingly powerful: sorting by lemma, POS or dependency function can show you some recurring lexicogrammatical patterns in a file without the need for searching.
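To illustrate the idea outside the interpreter, here is a rough Python sketch that sorts token rows by dependency function and lemma. It does not use corpkit itself. The column names in `Token` are guesses inferred from the sample output earlier on this page (sentence number, token number, word, lemma, POS, NER, governor, function, dependents, coreference), and `read_tokens` is a hypothetical helper.

```python
from collections import namedtuple

# Assumed column order, inferred from the sample parser output.
Token = namedtuple(
    "Token",
    "sent idx word lemma pos ner governor function dependents coref",
)


def read_tokens(block):
    """Turn the token rows of a parsed file into Token records."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and metadata comments
        fields = line.split()
        if len(fields) == len(Token._fields):
            tokens.append(Token(*fields))
    return tokens


sample = """\
1   1   I         I         PRP O   2   nsubj      0       1
1   2   hope      hope      VBP O   0   ROOT       1,5,11  _
1   3   everyone  everyone  NN  O   5   nsubj      0       _
1   4   is        be        VBZ O   5   aux        0       _
1   5   hanging   hang      VBG O   2   ccomp      3,4,10  _
"""

tokens = read_tokens(sample)
# Sort by dependency function first, then by lemma within each function.
by_function = sorted(tokens, key=lambda t: (t.function, t.lemma))
for token in by_function:
    print(token.function, token.lemma, token.pos)
```

Sorting this way puts tokens that share a function next to each other, which is what makes recurring patterns, such as the typical subjects in a file, easy to spot.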

The next page will show you how to search the corpus you’ve built, and to work with metadata if you’ve added it.