Corpus classes

Corpus

A corpus is an unparsed or parsed collection of files that can be searched, or brought into memory for higher-performance operations.
class corpkit.corpus.Corpus(path_or_data, **kwargs)

    Bases: collections.abc.MutableSequence

    Model a parsed or unparsed corpus with an arbitrary depth of subfolders.
    files = None

    filepaths = None

    subcorpora = None
    load(multiprocess=False, load_trees=True, **kwargs)

        Load the corpus into memory (i.e. create one large pd.DataFrame).

        Keyword Arguments:
            - multiprocess (int) – how many threads to use
            - load_trees (bool) – parse constituency trees if present
            - add_gov (bool) – pre-load each token's governor
            - cols (list) – list of columns to load (can improve performance)
            - just (dict) – restrict the load to lines whose feature (key) matches a regex (value, case-insensitive)
            - skip (dict) – the inverse of just
        Returns: a LoadedCorpus (a MultiIndexed pd.DataFrame)
    search(target, query, **kwargs)

        Search the corpus for some linguistic or metadata feature.

        Parameters:
            - target (str) – the name of the column or feature to search:
                - 'w': words
                - 'l': lemmas
                - 'x': XPOS
                - 'p': POS
                - 'f': dependency function
                - 'year', 'speaker', etc.: arbitrary metadata categories
                - 't': constituency trees, via TGrep2 syntax
                - 'd': dependency graphs, via depgrep
            - query (str/list) – a regular expression, a TGrep2/depgrep string to match, or a list of strings to match against
        Keyword Arguments:
            - inverse (bool) – get non-matches
            - multiprocess (int) – number of parallel threads to start
            - no_store (bool) – do not store the reference corpus in the Results object
            - just_index (bool) – return only pointers to matches, not the actual data
            - cols (list) – list of columns to be loaded (can improve performance)
            - just (dict) – restrict the search to lines whose feature (key) matches a regex (value, case-insensitive)
            - skip (dict) – the inverse of just
        Returns: the search result
        Return type: corpkit.interrogation.Results
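Conceptually, a token-level search is a regular-expression filter over one column of a token table. The sketch below uses plain pandas to illustrate the target/query/inverse semantics; the toy frame and the search helper are invented for illustration and are not corpkit's implementation:

```python
import pandas as pd

# A toy token table with the documented feature columns.
tokens = pd.DataFrame(
    {"w": ["The", "dogs", "barked", "loudly"],
     "l": ["the", "dog", "bark", "loudly"],
     "p": ["DT", "NNS", "VBD", "RB"]}
)

def search(df, target, query, inverse=False):
    """Return rows whose `target` column fully matches `query` (a regex)."""
    hits = df[target].str.fullmatch(query)
    return df[~hits] if inverse else df[hits]

# Lemma search: every token whose lemma starts with 'dog' or 'bark'.
matches = search(tokens, "l", r"(dog|bark).*")
print(list(matches["w"]))  # ['dogs', 'barked']

# inverse=True gets the non-matches, as the keyword argument documents.
non = search(tokens, "p", r"NNS", inverse=True)
print(list(non["w"]))      # ['The', 'barked', 'loudly']
```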
    parse(parser='corenlp', lang='english', multiprocess=False, **kwargs)

        Parse a plaintext corpus.

        Keyword Arguments:
            - parser (str) – name of the parser (only 'corenlp' is accepted so far)
            - lang (str) – language for the parser (english, arabic, chinese, german, french or spanish)
            - multiprocess (int) – number of parallel threads to start
            - memory_mb (int) – megabytes of memory to use per thread (default 2024)
        Returns: the parsed corpus
        Return type: corpkit.corpus.Corpus
    fsi(ix)

        Get a slice of the corpus as a DataFrame.

        Parameters:
            ix (iterable) –
                - if len(ix) == 1, get the file with that name
                - if len(ix) == 2, get a sentence from that file
                - if len(ix) == 3, get a token from a sentence from a file
        Returns: pd.DataFrame
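The length-based dispatch can be pictured with a toy MultiIndexed frame of the (filename, sentence, token) shape that load() is documented to produce. The fsi helper below is an illustration of the selection logic, not corpkit's code:

```python
import pandas as pd

# Toy corpus: (file, sentence, token) MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [("a.txt", 1, 1), ("a.txt", 1, 2), ("a.txt", 2, 1), ("b.txt", 1, 1)],
    names=["file", "s", "i"],
)
df = pd.DataFrame({"w": ["A", "dog", "Barked", "Hello"]}, index=idx)

def fsi(df, ix):
    """Slice by file / (file, sent) / (file, sent, token)."""
    key = ix[0] if len(ix) == 1 else tuple(ix)
    return df.loc[key]

print(fsi(df, ("a.txt",)))       # the whole file: three token rows
print(fsi(df, ("a.txt", 2)))     # one sentence: one row
print(fsi(df, ("a.txt", 1, 2)))  # one token: the row for 'dog'
```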
    features(subcorpora=False)

        Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.

        Example:

        >>> corpus.features
        SB  Characters  Tokens  Words  Closed class words  Open class words
        01       26873    8513   7308                4809              3704
        02       25844    7933   6920                4313              3620
        03       18376    5683   4877                3067              2616
        04       20066    6354   5366                3587              2767
    wordclasses = None

    lexicon = None
    sample(n, level='f')

        Get a sample of the corpus.

        Parameters:
            - n (int/float) – the amount of data in the sample. If an int, get n files; if a float, get float * 100 as a percentage of the corpus
            - level (str) – sample subcorpora ('s') or files ('f')
        Returns: a Corpus object
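How the int/float argument is interpreted can be sketched in plain Python; sample_files is an illustrative stand-in, not corpkit's implementation:

```python
import random

def sample_files(files, n, seed=None):
    """int n: pick n files; float n: pick that proportion of the files."""
    rng = random.Random(seed)
    if isinstance(n, float):
        n = round(len(files) * n)
    return rng.sample(files, n)

files = [f"{i:02d}.txt" for i in range(10)]
print(len(sample_files(files, 3)))    # 3 files
print(len(sample_files(files, 0.2)))  # 20% of 10 files, i.e. 2
```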
    metadata = None
    tokenise(postag=True, lemmatise=True, *args, **kwargs)

        Tokenise a plaintext corpus, saving the result to disk.

        Returns: the newly created corpkit.corpus.Corpus
    annotate(interro, annotation, dry_run=True)

        Annotate the corpus.

        Parameters:
            - interro (corpkit.Interrogation) – search matches
            - annotation (str/dict) – a tag, or a field: value dict. For a dict, the key is the name of the annotation field and the value is the annotation content. If the value string matches one of the column names seen when concordancing, the content of that column is used. If the value is a list, the middle column is formatted as per the show arguments of Interrogation.table() and Interrogation.conc()
            - dry_run (bool) – show the annotations that would be made, but do not make them
File

Corpora are composed of files, which can be turned into pandas DataFrames and manipulated.

class corpkit.corpus.File(path, **kwargs)

    Bases: corpkit.corpus.Corpus

    Models a corpus file for reading, interrogating and concordancing.

    Methods for interrogating, concordancing and configuration are the same as for corpkit.corpus.Corpus, plus methods for accessing the file contents directly as a str or as a pandas DataFrame.
document
= None¶
-
trees
= None¶
-
plain
= None¶
LoadedCorpus

The load method of Corpus objects returns a MultiIndexed DataFrame with three index levels: filename, sentence number and token number. This object can be searched very quickly, because all the data is in memory.

class corpkit.corpus.LoadedCorpus(data, path=False)

    Bases: corpkit.interrogation.Results

    Store a corpus in memory as a DataFrame.

    This class has the same methods as a Results object. The only real difference is that slicing it does some reindexing to speed up searches.
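Because every token sits under the three-level index, a whole-corpus query becomes a single vectorised operation, and per-file access is an index lookup rather than a scan. A sketch with a toy frame (invented data, not corpkit code):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("01.txt", 1, 1), ("01.txt", 1, 2), ("02.txt", 1, 1), ("02.txt", 1, 2)],
    names=["file", "s", "i"],
)
loaded = pd.DataFrame(
    {"w": ["dogs", "bark", "cats", "miaow"],
     "p": ["NNS", "VBP", "NNS", "VBP"]},
    index=idx,
).sort_index()  # a sorted (lexsorted) index is what makes slicing fast

# One vectorised pass over every file at once:
nouns = loaded[loaded["p"] == "NNS"]
print(list(nouns["w"]))                 # ['dogs', 'cats']

# Per-file slices are cheap index lookups:
print(list(loaded.loc["02.txt"]["w"]))  # ['cats', 'miaow']
```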
Results

Searching a corpus returns an object that can be searched again, turned into tables or concordance lines, or exported to other formats.

class corpkit.interrogation.Results(matches, reference=False, path=False, qstring=False)

    Bases: pandas.core.frame.DataFrame

    Search results: a record of matching tokens in a Corpus.
    visualise(**kwargs)

        Visualise corpus interrogations.

        Keyword Arguments:
            - title (str) – a title for the plot
            - x_label (str) – a label for the x axis
            - y_label (str) – a label for the y axis
            - kind (str) – the kind of chart to make
            - style (str) – visual theme of the plot
            - figsize (tuple) – dimensions of the plot
            - save (bool/str) – if bool, save with the title as the name; if str, use the str as the name
            - legend_pos (str) – where to place the legend
            - reverse_legend (bool) – reverse the order of the legend
            - num_to_plot (int/'all') – how many columns to plot
            - tex (bool) – use TeX to draw plot text
            - colours (str) – colourmap for lines/bars/slices
            - cumulative (bool) – plot values cumulatively
            - pie_legend (bool) – show a legend for pie charts
            - partial_pie (bool) – allow plotting of pie slices only
            - show_totals (str: 'legend'/'plot') – print sums in the plot where possible
            - transparent (bool) – transparent .png background
            - output_format (str) – file format for the saved image
            - black_and_white (bool) – create black and white line styles
            - show_p_val (bool) – attempt to print p values in the legend if contained in the df
            - stacked (bool) – when making a bar chart, stack bars on top of one another
            - filled (bool) – for area and bar charts, make every column sum to 100
            - legend (bool) – show a legend
            - rot (int) – rotate x axis ticks by rot degrees
            - subplots (bool) – plot each column separately
            - layout (tuple) – grid shape to use when subplots is True
            - interactive – experimental interactive options
        Returns: matplotlib figure
    multiplot(main_params={}, sub_params={}, **kwargs)

        Plot a main figure and subplots together.

        Keyword Arguments:
            - main_params (dict) – arguments for Results.visualise(), used to draw the main figure
            - sub_params (dict) – arguments for Results.visualise(), used to draw the subplots. If a key is data, its value is used as secondary data to plot
            - layout (int/float) – a number between 1 and 16, corresponding to the number of subplots. Some numbers have an alternative layout, accessible with floats (e.g. 3.5)
            - kwargs (dict) – arguments to pass to both figures
    table(subcorpora='file', *args, **kwargs)

        Create a spreadsheet-like table, showing one or more features by one or more others.

        Parameters:
            - subcorpora (str/list) – which metadata or word feature(s) to put on the y axis
            - show (str/list) – word or metadata features to put on the x axis
            - relative (bool/DataFrame) – calculate relative frequencies using self or the passed data
            - keyness (bool/DataFrame) – calculate keyness using self or the passed data
        Returns: pd.DataFrame
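What a relative-frequency table amounts to can be approximated with ordinary pandas: cross-tabulate counts by subcorpus, then divide each row by its total. The match data below is invented for illustration:

```python
import pandas as pd

# Invented match records: one row per matching token.
matches = pd.DataFrame(
    {"file": ["01", "01", "01", "02", "02"],
     "w": ["dog", "dog", "cat", "dog", "cat"]}
)

# Absolute counts: subcorpora on the y axis, the shown feature on the x axis.
absolute = pd.crosstab(matches["file"], matches["w"])

# Relative frequencies: each row sums to 1.
relative = absolute.div(absolute.sum(axis=1), axis=0)
print(relative)
```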
    conc(*args, **kwargs)

        Generate a concordance.

        Parameters:
            - show (list of strs) – how to display concordance matches
            - n (int) – the number of lines to show
            - shuffle (bool) – randomise the order of lines
        Returns: generated concordance lines
        Return type: pd.DataFrame
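The shape of a concordance line (left context, match, right context) can be sketched without corpkit at all; the kwic helper here is purely illustrative:

```python
def kwic(tokens, match_indices, window=3):
    """Build (left, match, right) lines for each matching token index."""
    lines = []
    for i in match_indices:
        left = " ".join(tokens[max(0, i - window):i])
        right = " ".join(tokens[i + 1:i + 1 + window])
        lines.append((left, tokens[i], right))
    return lines

tokens = "the quick brown fox jumps over the lazy dog".split()
print(kwic(tokens, [3]))  # [('the quick brown', 'fox', 'jumps over the')]
```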
    collapse(feature, values, name=False)

        Merge results on entries or metadata.

        Returns: Results (subset)
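Merging entries is essentially a relabel-then-regroup over a results table. A pandas sketch of that idea, with an invented frame and mapping:

```python
import pandas as pd

# Invented result counts: entries as rows, subcorpora as columns.
counts = pd.DataFrame(
    {"one": [4, 2], "two": [1, 3]},
    index=["walk", "walked"],
)

# Collapse the two inflections into a single entry and sum their counts.
merged = counts.groupby({"walk": "walk", "walked": "walk"}).sum()
print(merged)  # one row, 'walk', with columns one=6, two=4
```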
    top(n=50, feature='w')

        Get the top n most common results by column.

        Parameters:
            - n (int) – the number of most common results to show
            - feature (str) – which feature to count
        Returns: Results (subset)
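Since a Results object is a DataFrame, keeping the n most frequent values of a feature column is conceptually one value_counts away in plain pandas (the series below is invented):

```python
import pandas as pd

words = pd.Series(["dog", "cat", "dog", "bird", "dog", "cat"])

# Count each value, then keep the two most common.
top2 = words.value_counts().head(2)
print(top2)  # dog: 3, cat: 2
```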
    save(savename, savedir='saved_interrogations', **kwargs)

        Save an interrogation as a pickle to savedir.

        Parameters:
            - savename (str) – a name for the saved file
            - savedir (str) – relative path to the directory in which to save the file
            - print_info (bool) – show/hide stdout

        Example:

        >>> o = corpus.interrogate(W, 'any')
        ### create ./saved_interrogations/savename.p
        >>> o.save('savename')
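The pickle-to-directory pattern the method describes can be sketched in plain Python; the save function below is a stand-in for illustration, not corpkit's implementation:

```python
import os
import pickle
import tempfile

def save(obj, savename, savedir="saved_interrogations"):
    """Pickle obj to savedir/savename.p, creating the directory if needed."""
    os.makedirs(savedir, exist_ok=True)
    path = os.path.join(savedir, savename + ".p")
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return path

# Round-trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = save({"dog": 3}, "savename", os.path.join(tmp, "saved_interrogations"))
    with open(path, "rb") as fh:
        print(pickle.load(fh))  # {'dog': 3}
```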