Corpus classes

Corpus

A corpus is an unparsed or parsed collection of files that can be searched, or brought into memory for higher-performance operations.

class corpkit.corpus.Corpus(path_or_data, **kwargs)[source]

Bases: collections.abc.MutableSequence

Model a parsed or unparsed corpus with arbitrary depth of subfolders

insert(i, v)[source]
files = None
filepaths = None
store_as_hdf(**kwargs)[source]

Store a corpus in an HDF5 file for faster loading

subcorpora = None
load(multiprocess=False, load_trees=True, **kwargs)[source]

Load corpus into memory (i.e. create one large pd.DataFrame)

Keyword Arguments:
 
  • multiprocess (int) – how many threads to use
  • load_trees (bool) – Parse constituency trees if present
  • add_gov (bool) – pre-load each token’s governor
  • cols (list) – list of columns to be loaded (can improve performance)
  • just (dict) – restrict load to lines with feature key matching regex value (case insensitive)
  • skip (dict) – the inverse of just
Returns:

corpkit.corpus.LoadedCorpus
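Because the loaded corpus is a MultiIndexed pandas DataFrame (see LoadedCorpus below), its shape can be sketched with plain pandas. The filenames and values here are invented for illustration; this is not corpkit's implementation:

```python
import pandas as pd

# A hypothetical miniature of a loaded corpus: one row per token,
# indexed by (filename, sentence number, token number).
index = pd.MultiIndex.from_tuples(
    [("01.txt", 1, 1), ("01.txt", 1, 2), ("02.txt", 1, 1)],
    names=["file", "s", "i"],
)
loaded = pd.DataFrame(
    {"w": ["The", "cat", "Dogs"],   # words
     "l": ["the", "cat", "dog"],    # lemmas
     "x": ["DT", "NN", "NNS"]},     # XPOS tags
    index=index,
)

# Restricting columns (cols) or rows (just/skip) at load time
# simply yields a smaller frame of the same shape.
one_file = loaded.loc["01.txt"]
```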

search(target, query, **kwargs)[source]

Search a corpus for some linguistic or metadata feature

Parameters:
  • target (str) –

    The name of the column or feature to search

    • ‘w’: words
    • ‘l’: lemmas
    • ‘x’: XPOS
    • ‘p’: POS
    • ‘f’: dependency function
    • ‘year’, ‘speaker’, etc.: arbitrary metadata categories
    • ‘t’: constituency trees via TGrep2 syntax
    • ‘d’: dependency graphs via depgrep
  • query (str/list) – regular expression, TGrep2/depgrep string to match, or list of strings to match against
Keyword Arguments:
 
  • inverse (bool) – get non-matches
  • multiprocess (int) – number of parallel threads to start
  • no_store (bool) – do not store reference corpus in Results object
  • just_index (bool) – return only pointers to matches, not actual data
  • cols (list) – list of columns to be loaded (can improve performance)
  • just (dict) – restrict load to lines with feature key matching regex value (case insensitive)
  • skip (dict) – the inverse of just
Returns:

search result

Return type:

corpkit.interrogation.Results
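For non-tree, non-dependency targets, a search amounts to a case-insensitive regular-expression test against one column of token data. A minimal pandas sketch of the mechanics (the data is invented; this is not corpkit's implementation):

```python
import pandas as pd

# A toy token table standing in for one column of a corpus.
tokens = pd.DataFrame({"w": ["Dog", "dogs", "cat", "dogma"]})

# target 'w', query r"dogs?$": a regex test on the word column.
mask = tokens["w"].str.match(r"dogs?$", case=False)
matches = tokens[mask]        # roughly what a Results object records
non_matches = tokens[~mask]   # what inverse=True would return
```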

trees(query, **kwargs)[source]

Equivalent to .search(‘t’, query)

deps(query, **kwargs)[source]

Equivalent to .search(‘d’, query)

cql(query, **kwargs)[source]

Equivalent to .search(‘c’, query)

words(query, **kwargs)[source]

Equivalent to .search(‘w’, query)

lemmas(query, **kwargs)[source]

Equivalent to .search(‘l’, query)

pos(query, **kwargs)[source]

Equivalent to .search(‘p’, query)

functions(query, **kwargs)[source]

Equivalent to .search(‘f’, query)

parse(parser='corenlp', lang='english', multiprocess=False, **kwargs)[source]

Parse a plaintext corpus

Keyword Arguments:
 
  • parser (str) – name of the parser (only ‘corenlp’ accepted so far)
  • lang (str) – language for parser (english, arabic, chinese, german, french or spanish)
  • multiprocess (int) – number of parallel threads to start
  • memory_mb (int) – megabytes of memory to use per thread (default 2024)
Returns:

parsed corpus

Return type:

corpkit.corpus.Corpus

interrogate(search, **kwargs)[source]
fsi(ix)[source]

Get a slice of a corpus as a DataFrame

Parameters:ix (iterable) –
  • if len(ix) == 1, get the file with that name
  • if len(ix) == 2, get a sentence from that file
  • if len(ix) == 3, get a token from that sentence of that file
Returns:pd.DataFrame
features(subcorpora=False)[source]

Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.

Example:

>>> corpus.features
    SB  Characters  Tokens  Words  Closed class words  Open class words
    01       26873    8513   7308                4809              3704
    02       25844    7933   6920                4313              3620
    03       18376    5683   4877                3067              2616
    04       20066    6354   5366                3587              2767
wordclasses = None
postags = None
lexicon = None
sample(n, level='f')[source]

Get a sample of the corpus

Parameters:
  • n (int/float) – amount of data in the sample. If an int, get n files; if a float, get that fraction of the corpus (i.e. float * 100 as a percentage)
  • level (str) – sample subcorpora (s) or files (f)
Returns:

a Corpus object
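The two senses of n can be sketched with a small helper (hypothetical, not corpkit's implementation):

```python
import random

def sample_files(files, n, seed=None):
    """Pick files as sample() is documented to: an int n takes that
    many files, a float takes that fraction of the corpus."""
    rng = random.Random(seed)
    if isinstance(n, float):
        n = max(1, round(n * len(files)))
    return rng.sample(files, n)

files = [f"{i:02d}.txt" for i in range(1, 11)]
three = sample_files(files, 3, seed=1)    # three files
fifth = sample_files(files, 0.2, seed=1)  # 20% of ten files: two
```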

delete_metadata()[source]

Delete metadata for the corpus. May be needed if the corpus has changed

metadata = None
tokenise(postag=True, lemmatise=True, *args, **kwargs)[source]

Tokenise a plaintext corpus, saving to disk

Returns:The newly created corpkit.corpus.Corpus
annotate(interro, annotation, dry_run=True)[source]

Annotate a corpus

Parameters:
  • interro (corpkit.Interrogation) – Search matches
  • annotation (str/dict) – a tag, or a field: value dict. If a dict, the key is the name of the annotation field, and the value is its content. If the value string matches one of the column names seen when concordancing, the content of that column will be used. If the value is a list, the middle column will be formatted, as per the show arguments for Interrogation.table() and Interrogation.conc().
  • dry_run (bool) – Show the annotations to be made, but don’t do them
unannotate(annotation, dry_run=True)[source]

Delete annotation from a corpus

Parameters:
  • annotation (str/dict) – just as in corpus.annotate().
  • dry_run (bool) – Show the changes to be made, but don’t do them

File

Corpora consist of files, which can be turned into pandas DataFrames and manipulated.

class corpkit.corpus.File(path, **kwargs)[source]

Bases: corpkit.corpus.Corpus

Models a corpus file for reading, interrogating, concordancing.

Methods for interrogating, concordancing and configurations are the same as corpkit.corpus.Corpus, plus methods for accessing the file contents directly as a str, or as a Pandas DataFrame.

read(**kwargs)[source]

Get contents of file as string

document = None
trees = None
plain = None

LoadedCorpus

The load method of Corpus objects returns a MultiIndexed DataFrame, with three levels: filename, sentence number, and token number. This object can be searched very quickly, because all data is in memory.

class corpkit.corpus.LoadedCorpus(data, path=False)[source]

Bases: corpkit.interrogation.Results

Store a corpus in memory as a DataFrame.

This class has all the same methods as a Results object. The only real difference is that slicing it will do some reindexing to speed up searches.

Results

Searching a corpus returns an object that can be searched again, turned into tables or concordance lines, or exported to other formats.

class corpkit.interrogation.Results(matches, reference=False, path=False, qstring=False)[source]

Bases: pandas.core.frame.DataFrame

Search results, a record of matching tokens in a Corpus

keyness(*args, **kwargs)[source]

Calculate keyness for each subcorpus

Returns:DataFrame
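The documentation does not state which statistic is used; a common choice for keyness is Dunning's log-likelihood of a word's frequency against a reference corpus, sketched below. Whether keyness() computes exactly this is an assumption:

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Dunning log-likelihood keyness of one word: how strongly its
    frequency in the target corpus diverges from the reference."""
    total = size_target + size_ref
    expected_t = size_target * (freq_target + freq_ref) / total
    expected_r = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / expected_t)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_r)
    return 2 * ll

# e.g. 40 hits in 10,000 target tokens vs 10 in 20,000 reference tokens
score = log_likelihood(40, 10_000, 10, 20_000)
```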
visualise(**kwargs)[source]

Visualise corpus interrogations.

Keyword Arguments:
 
  • title (str) – A title for the plot
  • x_label (str) – A label for the x axis
  • y_label (str) – A label for the y axis
  • kind (str) – The kind of chart to make
  • style (str) – Visual theme of plot
  • figsize (tuple of dimensions) – Size of plot
  • save (bool/str) – If bool, save with title as name; if str, use str as name
  • legend_pos (str) – Where to place legend
  • reverse_legend (bool) – Reverse the order of the legend
  • num_to_plot (int/‘all’) – How many columns to plot
  • tex (bool) – Use TeX to draw plot text
  • colours (str) – Colourmap for lines/bars/slices
  • cumulative (bool) – Plot values cumulatively
  • pie_legend (bool) – Show a legend for pie chart
  • partial_pie (bool) – Allow plotting of pie slices only
  • show_totals (str) – ‘legend’/‘plot’: print sums in the legend or plot where possible
  • transparent (bool) – Transparent .png background
  • output_format (str) – File format for saved image
  • black_and_white (bool) – Create black and white line styles
  • show_p_val (bool) – Attempt to print p values in legend if contained in df
  • stacked (bool) – When making bar chart, stack bars on top of one another
  • filled (bool) – For area and bar charts, make every column sum to 100
  • legend (bool) – Show a legend
  • rot (int) – Rotate x axis ticks by rot degrees
  • subplots (bool) – Plot each column separately
  • layout (tuple) – Grid shape to use when subplots is True
  • interactive – Experimental interactive options
Returns:

matplotlib figure

multiplot(main_params={}, sub_params={}, **kwargs)[source]

Plot a figure and subplots together

Keyword Arguments:
 
  • main_params (dict) – arguments for Results.visualise(), used to draw the large figure
  • sub_params (dict) – arguments for Results.visualise(), used to draw the sub figures. If a key is ‘data’, its value is used as secondary data to plot.
  • layout (int/float) – a number between 1 and 16, corresponding to the number of subplots. Some numbers have an alternative layout, accessible with floats (e.g. 3.5).
  • kwargs (dict) – arguments to pass to both figures
tabview(decimals=3, **kwargs)[source]
format(*args, **kwargs)[source]
calculate(**kwargs)[source]
table(subcorpora='file', *args, **kwargs)[source]

Create a spreadsheet-like table, showing one or more features by one or more others

Parameters:
  • subcorpora (str/list) – which metadata or word feature(s) to put on the y axis
  • show (str/list) – word or metadata features to put on the x axis
  • relative (bool/DataFrame) – calculate relative frequencies using self or passed data
  • keyness (bool/DataFrame) – calculate keyness frequencies using self or passed data
Returns:

pd.DataFrame
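In pandas terms, table() behaves like a crosstab of one feature against another, optionally row-normalised. A sketch with invented data (not corpkit's implementation):

```python
import pandas as pd

# Toy search hits: a metadata feature and a lemma feature per match.
hits = pd.DataFrame({
    "year": ["2001", "2001", "2002", "2002", "2002"],
    "l":    ["dog",  "cat",  "dog",  "dog",  "cat"],
})

# subcorpora on the y axis, show on the x axis:
absolute = pd.crosstab(hits["year"], hits["l"])

# relative=True divides each row by its total:
relative = absolute.div(absolute.sum(axis=1), axis=0)
```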

conc(*args, **kwargs)[source]

Generate a concordance

Parameters:
  • show (list of strs) – how to display concordance matches
  • n (int) – number to show
  • shuffle (bool) – randomise order
Returns:

generated concordance lines

Return type:

pd.DataFrame
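A concordance line is simply the match flanked by its left and right context. A minimal sketch of that mechanic (hypothetical helper, not corpkit's implementation):

```python
def conc_line(tokens, i, width=4):
    """One concordance line: (left context, match, right context)."""
    left = " ".join(tokens[max(0, i - width):i])
    right = " ".join(tokens[i + 1:i + 1 + width])
    return left, tokens[i], right

sent = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
line = conc_line(sent, 3, width=2)  # match on "fox"
```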

sort(**kwargs)[source]
search(*args, **kwargs)[source]

Equivalent to corpus.search()

deps(*args, **kwargs)[source]

Equivalent to corpus.search(‘d’, query)

trees(*args, **kwargs)[source]

Equivalent to corpus.search(‘t’, query)

pos(*args, **kwargs)[source]

Equivalent to corpus.search(‘p’, query)

xpos(*args, **kwargs)[source]

Equivalent to corpus.search(‘x’, query)

lemmas(*args, **kwargs)[source]

Equivalent to corpus.search(‘l’, query)

words(*args, **kwargs)[source]

Equivalent to corpus.search(‘w’, query)

functions(*args, **kwargs)[source]

Equivalent to corpus.search(‘f’, query)

collapse(feature, values, name=False)[source]

Merge result on entries or metadata

Returns:Results (subset)
just(dct, mode='any')[source]

Reduce a DataFrame by string matching

skip(dct)[source]

Reduce a DataFrame by inverse string matching
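Both methods reduce the frame by case-insensitive regex matching per column, with skip inverting just. A pandas sketch of the mechanics (invented data; not corpkit's implementation):

```python
import pandas as pd

rows = pd.DataFrame({
    "w": ["Dog", "cat", "dogma", "Cats"],
    "x": ["NN",  "NN",  "NN",    "NNS"],
})

def just(df, dct, mode="any"):
    """Keep rows whose column values match the regexes (case-insensitive)."""
    masks = [df[col].str.contains(pat, case=False) for col, pat in dct.items()]
    combined = masks[0]
    for m in masks[1:]:
        combined = combined | m if mode == "any" else combined & m
    return df[combined]

def skip(df, dct):
    """The inverse of just: drop the matching rows."""
    return df.drop(just(df, dct).index)

kept = just(rows, {"w": r"^dogs?$"})
dropped = skip(rows, {"w": r"^dogs?$"})
```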

top(n=50, feature='w')[source]

Get the top n most common results by column

Parameters:
  • n (int) – number of most common results to show
  • feature (str) – which feature to count
Returns:

Results (subset)

save(savename, savedir='saved_interrogations', **kwargs)[source]

Save an interrogation as pickle to savedir.

Example:

>>> o = corpus.interrogate(W, 'any')
### create ./saved_interrogations/savename.p
>>> o.save('savename')
Parameters:
  • savename (str) – a name for the saved file
  • savedir (str) – relative path to directory in which to save file
  • print_info (bool) – show/hide stdout
store_as_hdf(**kwargs)[source]

Store a result within an HDF5 file.