[WIP] Add `dico lemmatize` subcommand

Medina Cardenas, Lorena Giovanna requested to merge dico_lemmatize into master

Description

The purpose of this merge request is to implement the dico lemmatize subcommand.

More precisely, it lets the user add tokens to the definition of each word. The input is a JSON file in the minimal format, and the output is a JSON file in the lemmatized format. The subcommand uses spaCy to tokenize each definition text into words, punctuation and so on, and then to get the lemma of each word.

Example

bind/dico -i sample/json/small_minimal.json -o file.json --library spacy --model en_core_web_sm
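As a rough illustration of the tokenize-then-lemmatize step described above, here is a minimal Python sketch. It is not the code from this merge request: the entry layout (`word`, `definition`, `tokens` keys) and the function names `load_pipeline` and `lemmatize_entries` are assumptions made for the example, since the actual minimal and lemmatized JSON schemas are not shown here. The only spaCy calls used are `spacy.load`, calling the pipeline on a string, and the token attributes `text` and `lemma_`.

```python
import json  # used when reading or writing the JSON files around this step


def load_pipeline(model="en_core_web_sm"):
    """Load a spaCy pipeline by model name.

    spaCy is imported lazily so the rest of the module can be used
    (and tested) without spaCy or the model being installed.
    """
    import spacy

    return spacy.load(model)


def lemmatize_entries(entries, nlp):
    """Return copies of the entries with a 'tokens' list added.

    `entries` is assumed (hypothetically) to be a list of dicts that each
    have a 'definition' string. `nlp` is any callable that turns a string
    into an iterable of tokens exposing `.text` and `.lemma_`, which is
    what a spaCy pipeline provides.
    """
    result = []
    for entry in entries:
        doc = nlp(entry["definition"])
        # Keep every token, punctuation included, with its lemma.
        tokens = [{"text": tok.text, "lemma": tok.lemma_} for tok in doc]
        result.append({**entry, "tokens": tokens})
    return result
```

With spaCy and the model installed, this could be called as `lemmatize_entries(entries, load_pipeline("en_core_web_sm"))` and the result written out with `json.dump`. Passing the pipeline in as a parameter keeps the function independent of the library, which fits the idea of choosing the library through an option.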

Checklist TODO

  • Read file arguments
  • Documentation for the functions
  • Update Readme
  • Change branch and subcommand name
  • Create array of tokens of the definitions
  • Add tests
  • Using spaCy, get the lemma of the words
  • Import the spaCy library
  • Add an argument --library (or another keyword) to choose the library. For now spacy would be the default and the only option, but we could possibly add others later: stanza, spacy-stanza, ...
  • Add an argument --model to specify exactly which model to use. For now the default could be en_core_web_sm, which is small and fast for development; there are others we will want to try with more precise language models, e.g. en_core_web_trf.
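The two option items at the end of the checklist could be sketched with Python's standard `argparse` module as follows. This is only an illustration under assumptions: the function name `build_parser`, the `SUPPORTED_LIBRARIES` tuple, and the destination names `input` and `output` are hypothetical, and the real command may parse its arguments differently. Only the flags `-i`, `-o`, `--library` and `--model` and the defaults `spacy` and `en_core_web_sm` come from the description.

```python
import argparse

# Only spaCy for now; stanza or spacy-stanza could be appended later.
SUPPORTED_LIBRARIES = ("spacy",)


def build_parser():
    """Build a parser for the options named in the checklist (a sketch)."""
    parser = argparse.ArgumentParser(prog="dico")
    parser.add_argument("-i", dest="input", required=True,
                        help="input JSON file in the minimal format")
    parser.add_argument("-o", dest="output", required=True,
                        help="output JSON file in the lemmatized format")
    parser.add_argument("--library", choices=SUPPORTED_LIBRARIES,
                        default="spacy",
                        help="NLP library used for tokens and lemmas")
    parser.add_argument("--model", default="en_core_web_sm",
                        help="model name passed to the chosen library")
    return parser
```

Restricting `--library` with `choices` makes an unsupported value fail at parse time, while `--model` stays a free string because valid model names depend on what is installed.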
Edited by Medina Cardenas, Lorena Giovanna
