This site provides documentation for three related projects:
- pollux, a web application for text analysis
- corpkit, a Python backend for pollux
- pollux-cl, a command-line natural language interpreter
With pollux, you can create parsed, structured and metadata-annotated corpora, and then search them for complex lexicogrammatical patterns. Search results can be quickly edited, sorted and visualised, saved and loaded within projects, or exported to formats that other tools can handle. You can also work with any dataset in CoNLL-U format, including the freely available, multilingual Universal Dependencies treebanks.
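To make the CoNLL-U format concrete, here is a minimal sketch of reading it with the Python standard library alone. It is not how pollux or corpkit load data; the function name `read_conllu` and the sample sentence are made up for illustration. The only thing taken from the format itself is the ten tab-separated columns per token, with blank lines separating sentences and `#` lines carrying comments.

```python
# The ten tab-separated columns that CoNLL-U defines for each token line.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts.

    Hypothetical helper for illustration, not part of pollux or corpkit.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comments and metadata are skipped in this sketch.
            continue
        else:
            current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# A tiny hand-written sample, not taken from any real treebank.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sentences = read_conllu(SAMPLE)
print([(tok["form"], tok["deprel"]) for tok in sentences[0]])
```

Every value is kept as a string here, including `id` and `head`, because CoNLL-U allows range and decimal IDs for multiword and empty tokens; a fuller reader would need to handle those cases.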
Concordancing is extended to allow the user to query and display grammatical features alongside tokens. Keywording can be restricted to certain word classes or positions within the clause. If your corpus contains multiple documents or subcorpora, you can identify keywords in each, compared to the corpus as a whole.
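The idea of per-subcorpus keywords can be sketched in a few lines of Python. This is an illustration only: it uses the log-likelihood keyness statistic, which is a common choice but is not confirmed here to be the measure pollux uses, and it compares each subcorpus against the remaining subcorpora rather than the whole corpus, which is a simplifying assumption. The names `log_likelihood`, `keywords` and `SUBCORPORA` are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.

    a, b: counts of the word in the target and the reference.
    c, d: total token counts of the target and the reference.
    """
    e1 = c * (a + b) / (c + d)  # expected count in the target
    e2 = d * (a + b) / (c + d)  # expected count in the reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(subcorpora, name, top=3):
    """Rank words that are overused in one subcorpus relative to the rest.

    Assumes at least two non-empty subcorpora.
    """
    target = Counter(subcorpora[name])
    rest = Counter()
    for other, tokens in subcorpora.items():
        if other != name:
            rest.update(tokens)
    c, d = sum(target.values()), sum(rest.values())
    scores = {}
    for word, a in target.items():
        b = rest[word]
        # Keep only words that are relatively more frequent in the target.
        if a / c > b / d:
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy data invented for this example.
SUBCORPORA = {
    "ch1": "the wolf ran and the wolf howled".split(),
    "ch2": "the bear slept and the bear ate".split(),
}

print(keywords(SUBCORPORA, "ch1"))
```

Restricting keywording to a word class would amount to filtering the token lists before counting, for example keeping only tokens tagged as nouns.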
Install pollux with pip:
$ pip install pollux
Or install from source:
$ git clone https://www.github.com/interrogator/pollux
$ cd pollux
$ python setup.py install
Parsing and interrogation of parse trees will also require Stanford CoreNLP. pollux can download and install it for you automatically.
Running the app
After installation, pollux can be started from the command line with:
# load sample project
$ pollux-quickstart
You can parse your own corpus from within the web app, or via the command line:
# parse
$ pollux-parse path/to/corpus
$ mkdir ~/corpora
# add to database
$ cp -R path/to/corpus-parsed ~/corpora
$ pollux-build
# open the tool
$ pollux
pollux-cl is a bit like the Corpus Workbench. You can open it with:
$ pollux-cl
# or, alternatively:
$ python -m pollux.cl
And then start working with natural language commands:
> set junglebook as corpus
> parse junglebook with outname as jb
> set jb as corpus
> search corpus for deps matching "f/nsubj/ <- f/ROOT/"
> calculate result as percentage of self
> plot result as line chart with title as 'Example figure'
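As a rough illustration of what a command like "calculate result as percentage of self" might compute, the sketch below converts a table of raw counts into percentages of each row's own total. Reading "self" as "the row total within the result" is an assumption, and the function name `percentage_of_self` and the counts are invented; this is plain Python, not the corpkit implementation.

```python
def percentage_of_self(table):
    """Convert each row of counts into percentages of that row's total.

    `table` maps a subcorpus name to a dict of {item: count}.
    Hypothetical helper for illustration only.
    """
    result = {}
    for subcorpus, counts in table.items():
        total = sum(counts.values())
        result[subcorpus] = {
            item: (100.0 * n / total if total else 0.0)
            for item, n in counts.items()
        }
    return result


# Invented counts, e.g. how often each word appears as a subject per chapter.
COUNTS = {
    "ch1": {"he": 3, "she": 1},
    "ch2": {"he": 1, "she": 1},
}

relative = percentage_of_self(COUNTS)
print(relative["ch1"])
```

Relative frequencies of this kind make subcorpora of different sizes comparable before plotting.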
From the interpreter, you can enter "jupyter notebook" or "gui" to switch between interfaces, preserving the local namespace and data where possible.
Information about the syntax is available in the Overview.