EduNLP.Vector
EduNLP.Vector.rnn
- class EduNLP.Vector.rnn.RNNModel(rnn_type, w2v: (EduNLP.Vector.gensim_vec.W2V, tuple, list, dict, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)
Examples
>>> model = RNNModel("ELMO", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
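The two seq_idx forms in the example look like the same batch before and after padding. The following is a minimal sketch of that step in plain Python, assuming index 0 is the padding token and that padding is applied on the right; pad_sequences and PAD_IDX are hypothetical names for illustration, not part of EduNLP.

```python
from typing import List, Sequence, Tuple

PAD_IDX = 0  # assumption: index 0 is the padding token, as the example suggests


def pad_sequences(
    seqs: Sequence[Sequence[int]], pad_idx: int = PAD_IDX
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad each sequence to the longest length; also return the true lengths."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths, default=0)
    padded = [list(s) + [pad_idx] * (max_len - len(s)) for s in seqs]
    return padded, lengths


padded, lengths = pad_sequences([[1, 2, 3], [1, 2], [3]])
print(padded)   # [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
print(lengths)  # [3, 2, 1]
```

Keeping the true lengths alongside the padded batch lets a recurrent model ignore the padded positions.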
EduNLP.Vector
- class EduNLP.Vector.BowLoader(filepath)
Uses the doc2bow (bag-of-words) model.
Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.
If allow_update is not set, this function is const, aka read-only.
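The behaviour described above can be sketched in plain Python. This is an illustration under stated assumptions, not the gensim implementation: TinyDictionary is a hypothetical stand-in class, new ids are assigned in sorted token order, and words without an id are dropped when allow_update is not set.

```python
from collections import Counter
from typing import Dict, List, Tuple


class TinyDictionary:
    """Hypothetical stand-in for a token-to-id dictionary; not the gensim class."""

    def __init__(self) -> None:
        self.token2id: Dict[str, int] = {}
        self.dfs: Dict[int, int] = {}  # document frequency per token id

    def doc2bow(self, document: List[str], allow_update: bool = False) -> List[Tuple[int, int]]:
        counts = Counter(document)
        if allow_update:
            # Create ids for unseen words (sorted so new ids are deterministic).
            for token in sorted(counts):
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)
        # Assumption of this sketch: words without an id are dropped.
        bow = {self.token2id[t]: c for t, c in counts.items() if t in self.token2id}
        if allow_update:
            # Each distinct word in this document raises its document frequency by one.
            for token_id in bow:
                self.dfs[token_id] = self.dfs.get(token_id, 0) + 1
        return sorted(bow.items())


d = TinyDictionary()
print(d.doc2bow(["x", "plus", "x"], allow_update=True))  # [(0, 1), (1, 2)]
print(d.doc2bow(["x", "minus"]))                          # [(1, 1)]
```

The second call leaves token2id and dfs untouched, which is the read-only case.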
- class EduNLP.Vector.D2V(filepath, method='d2v')
A collection that includes the d2v, bow, and tfidf methods.
- Parameters
filepath –
method (str) – one of d2v, bow, or tfidf
item –
- Returns
d2v model
- Return type
- class EduNLP.Vector.RNNModel(rnn_type, w2v: (EduNLP.Vector.gensim_vec.W2V, tuple, list, dict, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)
Examples
>>> model = RNNModel("ELMO", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
- class EduNLP.Vector.T2V(model: str, *args, **kwargs)
Converts a token list to a vector. If you already have a trained model, use T2V directly. Otherwise, call get_pretrained_t2v, which supplies a pretrained model so that you can obtain vectors without a model of your own.
- Parameters
model (str) – select the model type e.g.: d2v, rnn, lstm, gru, elmo, etc.
Examples
>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$, \
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> path = "examples/test_model/test_gensim_luna_stem_tf_d2v_256.bin"
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item))
[array([...dtype=float32)]
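Selecting a model by name, as the model parameter does, can be pictured as a lookup in a registry. The sketch below is purely illustrative and makes no claim about how T2V is implemented: ToyT2V, REGISTRY, and both toy vectorizers are hypothetical names, and the real model types (d2v, rnn, and so on) are not wired up here.

```python
from typing import Callable, Dict, List

Vectorizer = Callable[[List[str]], List[float]]


def count_vectorizer(tokens: List[str]) -> List[float]:
    """Toy model: a one-dimensional vector holding the token count."""
    return [float(len(tokens))]


def length_vectorizer(tokens: List[str]) -> List[float]:
    """Toy model: token count and total character count."""
    return [float(len(tokens)), float(sum(len(t) for t in tokens))]


# Hypothetical registry mapping a model name to a callable.
REGISTRY: Dict[str, Vectorizer] = {
    "count": count_vectorizer,
    "length": length_vectorizer,
}


class ToyT2V:
    """Hypothetical wrapper that picks a vectorizer by name."""

    def __init__(self, model: str) -> None:
        if model not in REGISTRY:
            raise ValueError(f"unknown model type: {model!r}")
        self._vectorize = REGISTRY[model]

    def __call__(self, items: List[List[str]]) -> List[List[float]]:
        # One vector per item, mirroring the list-in, list-out shape of the example.
        return [self._vectorize(tokens) for tokens in items]


t2v = ToyT2V("length")
print(t2v([["x", "+", "yy"]]))  # [[3.0, 4.0]]
```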
- class EduNLP.Vector.TfidfLoader(filepath)
This module implements functionality related to Term Frequency - Inverse Document Frequency (TF-IDF, https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector space bag-of-words models.
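As a rough illustration of TF-IDF weighting over bag-of-words documents, the sketch below uses one common variant: raw term count times log2(N / df), followed by L2 normalization of each document. The function name tfidf_weights and the choice of variant are assumptions for illustration; the weighting the loaded model actually applies depends on how it was trained.

```python
import math
from typing import Dict, List, Tuple

Bow = List[Tuple[int, int]]  # (token_id, token_count) pairs


def tfidf_weights(corpus: List[Bow]) -> List[List[Tuple[int, float]]]:
    """Weight each term by tf * log2(N / df), then scale each document to unit length."""
    n_docs = len(corpus)
    dfs: Dict[int, int] = {}
    for doc in corpus:
        for token_id, _ in doc:
            dfs[token_id] = dfs.get(token_id, 0) + 1

    result = []
    for doc in corpus:
        weights = [(tid, tf * math.log2(n_docs / dfs[tid])) for tid, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm > 0:  # a document whose terms appear everywhere keeps zero weights
            weights = [(tid, w / norm) for tid, w in weights]
        result.append(weights)
    return result


corpus = [[(0, 1), (1, 2)], [(1, 1)], [(2, 3)]]
for doc in tfidf_weights(corpus):
    print(doc)
```

In the first document, token 0 occurs once but only in that document, so it outweighs token 1, which occurs twice but is shared with another document.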
- class EduNLP.Vector.W2V(filepath, method=None, binary=None)
Uses the gensim library, which provides the FastText, Word2Vec, and KeyedVectors methods, to convert words to vectors.
- Parameters
filepath – path to the pretrained model file
method (str) – fasttext, or other (Word2Vec)
binary –
- EduNLP.Vector.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model')
A convenient way to convert a token list to a vector easily, using a pretrained model.
- Parameters
name (str) – select the pretrained model e.g.: d2v_all_256, d2v_sci_256, d2v_eng_256, d2v_lit_256, w2v_eng_300, w2v_lit_300.
model_dir (str) – path to the model directory; default: MODEL_DIR = '~/.EduNLP/model'
- Returns
t2v model
- Return type
Examples
>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$, \
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> i2v = get_pretrained_t2v("test_d2v", "examples/test_model/data/d2v")
>>> print(i2v(item))
[array([...dtype=float32)]