Add `dico lemmatize` subcommand
Description
The purpose of this merge request is to implement the dico lemmatize
subcommand.
More precisely, it allows the user to transform a dictionary so that all its definitions are lemmatized. The JSON input file can be minimal or tokenized, and the result is a lemmatized dictionary. The subcommand relies on the NLP library spaCy, using its `en_core_web_sm` model by default; the Stanza library is also supported. By default, the JSON input is read from stdin and the output is written to stdout.
Examples
After merge, the following commands are available:
# Transforming a minimal JSON file with Spacy and en_core_web_sm
$ bin/dico lemmatize -i sample/json/small_minimal.json -o lemmatized.json \
> --library spacy --model en_core_web_sm
# Transforming a tokenized JSON file with same model
$ bin/dico lemmatize -i sample/json/small_tokenized.json -o lemmatized.json \
> --library spacy --model en_core_web_sm
# Transforming a minimal JSON file with Stanza and EWT model
$ bin/dico lemmatize -i sample/json/small_minimal.json -o small_lemmatize_stanza.json \
> -l stanza -m ewt
Modified files
- `dico.py`: to add the subcommand
- `lemmatize.py`: to parse the arguments and apply them
- `dictionary.py`: to add the lemmatize process to the `Dico` object
- `nlp.py`: to fill the symbols and definitions using the library and model
- `bats/test_dico_lemmatize.bats`: to add new tests
- `README`: to add subcommand information
Added files
To test the subcommand with valid input JSON files:
sample/json/386_thing_minimal.json
sample/json/386_thing_token.json
sample/json/386_thing_lemmatized_spacy_sm.json
sample/json/386_thing_lemmatized_stanza_ewt.json
sample/json/small_lemmatized_spacy_sm.json
sample/json/small_lemmatized_stanza_ewt.json
sample/json/small_minimal.json
sample/json/small_minimal_invalid_value_key
sample/json/small_tokenized.json
Edited by Alexandre Blondin Massé