Annotating your corpus
You may also want to add metadata or annotations to your corpus. You can do this by editing the corpus files directly, since they are stored in a human-readable format, or you can automate the annotation.
To annotate, you first run a search command and generate a concordance. After deleting any false positives from the concordance, you can use the annotate command to annotate each sentence for which a concordance line exists.
annotate works a lot like the mark and del commands to begin with, but has some special syntax at the end, which controls whether you annotate using tags, or fields and values.
The first way of annotating is to add a tag to one or more sentences:
> search corpus for pos matching NNP and word matching 'daisy'
> annotate m matching '^daisy$' with tag 'has_daisy'
You can use all to annotate every single concordance line:
> search corpus for governor-function matching nsubjpass \
... showing governor-lemma and lemma
> annotate all with tag 'passive'
If you run these commands, you actually get a dry run, showing what would be modified in your corpus. Once you’re happy with it, run toggle annotation to turn file writing on, and then run the previous line again (use the up arrow to retrieve it).
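The dry-run behaviour can be pictured with a small Python sketch. This is not the interpreter's implementation: the `Sentence` class, the `annotate_with_tag` function and its `write` flag are hypothetical names, and matching the pattern against whitespace-separated tokens, ignoring case, is an assumption made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical stand-in for one sentence in a corpus file."""
    text: str
    tags: set = field(default_factory=set)


def annotate_with_tag(sentences, pattern, tag, write=False):
    """Return the indices of matching sentences.

    With write=False (the default) nothing is modified, which mimics
    the dry run. With write=True the tag is added to each match.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    matched = []
    for index, sentence in enumerate(sentences):
        # Assumption: the pattern is tested against each token in turn.
        if any(regex.search(token) for token in sentence.text.split()):
            matched.append(index)
            if write:
                sentence.tags.add(tag)
    return matched


sentences = [
    Sentence("Daisy smiled"),
    Sentence("Tom left"),
    Sentence("He saw Daisy"),
]

# Dry run: see what would change, without changing it.
preview = annotate_with_tag(sentences, r"^daisy$", "has_daisy")
tags_after_preview = [set(s.tags) for s in sentences]

# Same call again with writing switched on.
written = annotate_with_tag(sentences, r"^daisy$", "has_daisy", write=True)
```

Keeping the preview and the write in one function, separated only by a flag, means the second run touches exactly the sentences the first run reported.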
Creating fields and values
More complex than adding tags is adding fields and values. This creates a new metadata category with multiple possible realisations. Below, we annotate sentences based on whether they contain certain kinds of processes:
> search corpus for function matching roles.process showing lemma
> mark m matching processes.verbal red
# annotate by colour
> annotate red with field as process \
... and value as 'verbal'
# annotate without colouring first
> annotate m matching processes.mental with field as process \
... and value as 'mental'
You can also use m as the value, which passes in the text from the middle column of the concordance.
> search corpus for pos matching NNP showing word
> annotate m matching [gatsby, daisy, tom] \
... with field as character and value as m
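As a rough illustration of what using m as the value means, the Python sketch below copies the middle-column text of each matching concordance line into a field. The `annotate_with_match` function and the `(sentence_id, middle)` data layout are hypothetical, and the case-insensitive comparison is an assumption rather than documented behaviour.

```python
def annotate_with_match(concordance, wordlist, field_name):
    """Map sentence ids to {field_name: middle_text} for matching lines.

    concordance: list of (sentence_id, middle_column_text) pairs.
    wordlist:    the words to match against the middle column.
    """
    wanted = {word.lower() for word in wordlist}
    annotations = {}
    for sentence_id, middle in concordance:
        # Assumption: matching ignores case; the stored value keeps it.
        if middle.lower() in wanted:
            annotations.setdefault(sentence_id, {})[field_name] = middle
    return annotations


concordance = [(0, "Gatsby"), (1, "Daisy"), (2, "Jordan"), (3, "Tom")]
result = annotate_with_match(
    concordance, ["gatsby", "daisy", "tom"], "character"
)
```

Here each annotated sentence gets its own value for the character field, taken from whatever was matched, instead of one fixed string for every line.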
Once these values have been added to your text, you can use them as subcorpora, or as filters for the sentences being processed.
> set subcorpora as process
> set skip character as 'gatsby'
> set skip passive tag
Now, the subcorpora will be the different processes (verbal, mental and none), and any sentence annotated with the character value 'gatsby', or with the passive tag, will be ignored.
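The effect of these settings can be sketched in Python as a group-and-filter step. The dictionary layout, the `group_by_field` function and the sample sentences are all hypothetical; the sketch only mirrors the behaviour described above, with unannotated sentences falling into a 'none' group.

```python
from collections import defaultdict

# Hypothetical annotated sentences: fields are plain keys, tags are a set.
sentences = [
    {"text": "He said hello", "process": "verbal", "tags": set()},
    {"text": "Gatsby thought about it", "process": "mental",
     "character": "gatsby", "tags": set()},
    {"text": "The car was driven", "tags": {"passive"}},
    {"text": "She believed him", "process": "mental", "tags": set()},
    {"text": "They walked home", "tags": set()},
]


def group_by_field(sentences, field_name, skip_fields=None, skip_tags=None):
    """Group sentence texts by a field, ignoring skipped sentences."""
    skip_fields = skip_fields or {}
    skip_tags = set(skip_tags or ())
    groups = defaultdict(list)
    for sentence in sentences:
        # Skip on a field/value match, e.g. character == 'gatsby'.
        if any(sentence.get(k) == v for k, v in skip_fields.items()):
            continue
        # Skip if the sentence carries any of the skipped tags.
        if skip_tags & sentence.get("tags", set()):
            continue
        groups[sentence.get(field_name, "none")].append(sentence["text"])
    return dict(groups)


subcorpora = group_by_field(
    sentences,
    "process",
    skip_fields={"character": "gatsby"},
    skip_tags={"passive"},
)
```

With this sample data, the Gatsby sentence and the passive sentence are dropped before grouping, so they appear in none of the subcorpora.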
To remove a tag or a field across the dataset, the commands are very simple. Note that again, you need to toggle annotation to actually alter any files.
> unannotate character field
> unannotate typo tag
> unannotate all tags