[WIP] Add `dico lemmatize` subcommand
Description
This merge request implements the dico lemmatize subcommand.
The subcommand adds a list of tokens to the definition of each word. It takes a minimal JSON file as input and writes a JSON file in the lemmatized format. It uses spaCy to tokenize each text definition into words, punctuation, and so on, and then to get the lemma of each word.
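The tokenize-then-lemmatize step described above can be sketched roughly as follows. This is only an illustration, not the code in this merge request: the JSON layout (a list of objects with a "definition" key), the output key "tokens", and the function names load_pipeline, add_tokens and lemmatize_file are all assumptions made for the example. Only spacy.load and the token attributes text and lemma_ come from spaCy itself.

```python
import json


def load_pipeline(model="en_core_web_sm"):
    """Load a spaCy pipeline; the model must already be installed."""
    import spacy  # imported lazily so the rest of the module works without spaCy

    return spacy.load(model)


def add_tokens(entries, nlp):
    """Return copies of the entries with a "tokens" list added to each one.

    `nlp` is any callable that maps a string to an iterable of tokens
    exposing `.text` and `.lemma_`, as a spaCy pipeline does.
    """
    result = []
    for entry in entries:
        doc = nlp(entry["definition"])  # "definition" is an assumed key
        tokens = [{"text": tok.text, "lemma": tok.lemma_} for tok in doc]
        result.append({**entry, "tokens": tokens})  # "tokens" is an assumed key
    return result


def lemmatize_file(input_path, output_path, model="en_core_web_sm"):
    """Read the minimal JSON file, lemmatize it, and write the result."""
    with open(input_path, encoding="utf-8") as f:
        entries = json.load(f)  # assumed to be a list of objects
    lemmatized = add_tokens(entries, load_pipeline(model))
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(lemmatized, f, ensure_ascii=False, indent=2)
```

Under these assumptions, calling lemmatize_file("sample/json/small_minimal.json", "file.json") would mirror the example command, provided the model has been installed first (for instance with python -m spacy download en_core_web_sm). Passing the pipeline into add_tokens as a parameter keeps the library choice out of the core loop, which may make it easier to try other libraries later.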
Example
bind/dico -i sample/json/small_minimal.json -o file.json --library spacy --model en_core_web_sm
Checklist TODO
- Read file arguments
- Documentation for the functions
- Update Readme
- Change branch and subcommand name
- Create array of tokens of the definitions
- Add tests
- Using spaCy, get the lemma of the words
- Import spaCy library
- Add an argument --library (or another keyword) to choose the library. For now spacy would be the default and only option, but others could be added later: stanza, spacy-stanza, ...
- Add an argument --model to specify exactly which model to use. For now the default could be en_core_web_sm, which is small and fast for development; there are others we will want to try with more precise language models, such as en_core_web_trf.
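As a rough sketch of how the --library and --model arguments from the checklist could be declared, assuming the command line is parsed with Python's argparse (the merge request does not say which parser is used). The function name build_parser, the dest names and the help strings are made up for the example. The flags -i, -o, --library and --model and the defaults are taken from the description above.

```python
import argparse


def build_parser():
    """Build a parser for the options mentioned in this merge request."""
    parser = argparse.ArgumentParser(prog="dico")
    parser.add_argument("-i", dest="input", required=True,
                        help="minimal JSON input file")
    parser.add_argument("-o", dest="output", required=True,
                        help="lemmatized JSON output file")
    # Only spacy is accepted for now; other libraries could be added to choices.
    parser.add_argument("--library", choices=["spacy"], default="spacy",
                        help="NLP library used for lemmatization")
    parser.add_argument("--model", default="en_core_web_sm",
                        help="name of the language model to load")
    return parser
```

Restricting --library with choices means an unsupported library name is rejected at parse time, and adding stanza or spacy-stanza later would only require extending that list.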
Edited by Medina Cardenas, Lorena Giovanna