Add `dico lemmatize` subcommand
Description
The purpose of this merge request is to implement the dico lemmatize
subcommand.
More precisely, it allows the user to transform a dictionary so that all its definitions are lemmatized. The JSON input file can be minimal or tokenized, and the result is a lemmatized dictionary. The subcommand relies on the NLP library spaCy, using its `en_core_web_sm` model by default; the Stanza library is also supported. By default, the JSON input is read from stdin and the output is written to stdout.
Examples
After merge, the following commands are available:
# Transforming a minimal JSON file with Spacy and en_core_web_sm
$ bin/dico lemmatize -i sample/json/small_minimal.json -o lemmatized.json \
> --library spacy --model en_core_web_sm
# Transforming a tokenized JSON file with same model
$ bin/dico lemmatize -i sample/json/small_tokenized.json -o lemmatized.json \
> --library spacy --model en_core_web_sm
# Transforming a minimal JSON file with Stanza and EWT model
$ bin/dico lemmatize -i sample/json/small_minimal.json -o small_lemmatize_stanza.json \
> -l stanza -m ewt
Modified files
- `dico.py`: to add the subcommand
- `lemmatize.py`: to parse the arguments and apply them
- `dictionary.py`: to add the lemmatize process to the `Dico` object
- `nlp.py`: to fill the symbols and definitions using the library and model
- `bats/test_dico_lemmatize.bats`: to add new tests
- `README`: to add subcommand information
Added files
To test the subcommand with valid input JSON files:
sample/json/386_thing_minimal.json
sample/json/386_thing_token.json
sample/json/386_thing_lemmatized_spacy_sm.json
sample/json/386_thing_lemmatized_stanza_ewt.json
sample/json/small_lemmatized_spacy_sm.json
sample/json/small_lemmatized_stanza_ewt.json
sample/json/small_minimal.json
sample/json/small_minimal_invalid_value_key
sample/json/small_tokenized.json
Edited by Alexandre Blondin Massé