Add `dico lemmatize` subcommand

Medina Cardenas, Lorena Giovanna requested to merge dico-lemmatize into master

Description

This merge request implements the `dico lemmatize` subcommand.

More precisely, it lets the user transform a dictionary so that all of its definitions are lemmatized. The JSON input file can be minimal or tokenized, and the result is a lemmatized JSON file. The subcommand relies on the NLP library spaCy, with en_core_web_sm as the default model; the Stanza library is also supported. By default, the JSON input is read from stdin and the output is written to stdout.
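
To illustrate the transformation, here is a minimal sketch, not the code of this merge request. It assumes a simplified layout with one definition string per headword (the actual minimal and tokenized JSON formats may differ), and the names `Lemmatizer`, `make_spacy_lemmatizer` and `lemmatize_definitions` are hypothetical. Only `spacy.load` and the `lemma_` token attribute come from spaCy itself.

```python
from typing import Callable, Dict, List

# Hypothetical type: a lemmatizer maps a text to the list of its lemmas.
Lemmatizer = Callable[[str], List[str]]


def make_spacy_lemmatizer(model: str = "en_core_web_sm") -> Lemmatizer:
    """Build a lemmatizer backed by a spaCy pipeline (the model must be installed)."""
    import spacy  # imported lazily so the rest of the sketch runs without spaCy

    nlp = spacy.load(model)
    return lambda text: [token.lemma_ for token in nlp(text)]


def lemmatize_definitions(definitions: Dict[str, str],
                          lemmatize: Lemmatizer) -> Dict[str, List[str]]:
    """Replace each definition by the list of lemmas of its tokens."""
    # Assumed layout: one definition string per headword.
    return {word: lemmatize(text) for word, text in definitions.items()}
```

Passing the lemmatizer as a plain callable keeps the dictionary code independent of the library, so a Stanza-backed function with the same signature could be swapped in. With spaCy and the model installed, `lemmatize_definitions(definitions, make_spacy_lemmatizer())` would apply the default model to every definition.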

Examples

After merge, the following commands are available:

# Transforming a minimal JSON file with spaCy and en_core_web_sm
$ bin/dico lemmatize -i sample/json/small_minimal.json -o lemmatized.json \
>     --library spacy --model en_core_web_sm
# Transforming a tokenized JSON file with the same model
$ bin/dico lemmatize -i sample/json/small_tokenized.json -o lemmatized.json \
>     --library spacy --model en_core_web_sm
# Transforming a minimal JSON file with Stanza and the EWT model
$ bin/dico lemmatize -i sample/json/small_minimal.json -o small_lemmatize_stanza.json \
>     -l stanza -m ewt

Modified files

  • dico.py: to add subcommand
  • lemmatize.py: to parse the arguments and apply them
  • dictionary.py: to add the lemmatization process to the Dico object
  • nlp.py: to fill the symbols and definitions using the library and model
  • bats/test_dico_lemmatize.bats: to add new tests
  • README: to add subcommand information
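
The argument handling in lemmatize.py could look roughly like the sketch below. It is an assumption, not the merged code: it uses `argparse` as a stand-alone parser, the function name `build_parser` is hypothetical, and only the options that appear in the examples above (`-i`, `-o`, `-l`/`--library`, `-m`/`--model`) are declared. The defaults follow the description: spaCy with en_core_web_sm, reading from stdin and writing to stdout.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-alone parser mirroring the options used in the examples."""
    parser = argparse.ArgumentParser(prog="dico lemmatize")
    # A value of None stands for stdin (input) or stdout (output).
    parser.add_argument("-i", dest="input", default=None,
                        help="input JSON file (default: stdin)")
    parser.add_argument("-o", dest="output", default=None,
                        help="output JSON file (default: stdout)")
    parser.add_argument("-l", "--library", choices=["spacy", "stanza"],
                        default="spacy", help="NLP library to use")
    parser.add_argument("-m", "--model", default="en_core_web_sm",
                        help="model name for the chosen library")
    return parser
```

For instance, `build_parser().parse_args(["-l", "stanza", "-m", "ewt"])` selects Stanza with the EWT model and leaves input and output on the standard streams. Since the default model name only makes sense for spaCy, a real implementation would probably pick the default model per library.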

Added files

To test the subcommand with valid input JSON files:

  • sample/json/386_thing_minimal.json
  • sample/json/386_thing_token.json
  • sample/json/386_thing_lemmatized_spacy_sm.json
  • sample/json/386_thing_lemmatized_stanza_ewt.json
  • sample/json/small_lemmatized_spacy_sm.json
  • sample/json/small_lemmatized_stanza_ewt.json
  • sample/json/small_minimal.json
  • sample/json/small_minimal_invalid_value_key
  • sample/json/small_tokenized.json
Edited by Alexandre Blondin Massé