EduNLP
SIF
- EduNLP.SIF.sif.is_sif(item)
Check whether the input item is in SIF format.
- Parameters
item (str) – a raw item which respects stem
- Returns
True if the item already conforms to the SIF format and needs no modification; False if the item needs to be modified. An error is raised when the item cannot be parsed correctly.
- Return type
bool
Examples
>>> text = '若$x,y$满足约束条件' \
... '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
... '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> is_sif(text)
False
- EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, safe=True, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')
Uses the linear tokenizer by default; specify tokenization_params to change the tokenizer.
- Parameters
item (str) – a raw item which respects stem
figures (dict) – {“FigureID”: Base64 encoding of the figure}
safe (bool) – Check whether the text conforms to the sif format
symbol (str) – select which segment types to symbolize: "t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep
tokenization (bool) – True: tokenize the item
tokenization_params –
- method: which tokenizer to use, "linear" or "ast"
- parameters only used by "linear": None
- parameters only used by "ast":
ord2token: whether to convert variables (mathord) and constants (textord) to special tokens
var_numbering: whether to use a numeric suffix to distinguish different variables
errors – the error-handling mode, one of: warn, raise, coerce, strict, ignore
- Returns
When tokenization is False, return SegmentList; When tokenization is True, return TokenList
- Return type
list
Examples
>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
...         tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...             tokenization_params={
...                 "formula_params": {
...                     "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...                     "link_variable": False}
...             })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
...               tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
...               tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]
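The effect of the symbol parameter can be illustrated with a small standalone sketch. It does not use EduNLP: the `symbolize` helper and the `(type_code, value)` pair layout are assumptions made for illustration, and only the placeholder strings are taken from the outputs shown in the examples.

```python
# Placeholder tokens as they appear in the example outputs of this page.
PLACEHOLDERS = {
    "t": "[TEXT]",
    "f": "[FORMULA]",
    "g": "[FIGURE]",
    "m": "[MARK]",
    "a": "[TAG]",
    "s": "[SEP]",
}


def symbolize(segments, symbol=""):
    """Replace every segment whose type code occurs in `symbol`.

    `segments` is a hypothetical list of (type_code, value) pairs; it is
    not the SegmentList class returned by sif4sci.
    """
    return [
        PLACEHOLDERS[seg_type] if seg_type in symbol else value
        for seg_type, value in segments
    ]


# Hand-written segmentation of the first example item (an assumption).
segments = [
    ("t", "如图所示,则"),
    ("f", r"\bigtriangleup ABC"),
    ("t", "的面积是"),
    ("m", r"\SIFBlank"),
    ("t", "。"),
    ("g", r"\FigureID{1}"),
]

print(symbolize(segments, symbol="tfgm"))
print(symbolize(segments, symbol="gm"))
```

With symbol="tfgm" every segment is replaced by its placeholder, while symbol="gm" keeps text and formula content and only replaces the question mark and the figure.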
- EduNLP.SIF.sif.to_sif(item)
Convert an item to SIF format.
- Parameters
item (str) – a raw item which respects stem
- Returns
item – the item converted to SIF format
- Return type
str
Examples
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
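The relationship between is_sif and to_sif in the examples can be imitated with a rough standalone sketch. It is not the EduNLP implementation: it assumes that the only repair needed is to wrap bare Latin letters outside `$...$` in dollar signs, and both helper names are hypothetical.

```python
import re

# A formula is anything between two dollar signs (non-greedy).
FORMULA = re.compile(r"(\$.*?\$)", re.S)
# A run of bare Latin letters that should become a formula.
LATIN = re.compile(r"[A-Za-z]+")


def wrap_bare_letters(text):
    """Wrap Latin letters that sit outside $...$ in dollar signs."""
    parts = FORMULA.split(text)
    fixed = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd indices hold the captured formulas: leave them untouched.
            fixed.append(part)
        else:
            fixed.append(LATIN.sub(lambda m: "$" + m.group(0) + "$", part))
    return "".join(fixed)


def needs_no_change(text):
    """True when the toy conversion would leave the text as it is."""
    return wrap_bare_letters(text) == text


print(wrap_bare_letters("发芽率y和温度x(单位"))
print(needs_no_change("若$x,y$满足约束条件"))
```

The real functions handle far more cases; the sketch only shows why a text with bare variables is reported as needing modification and what the converted text looks like.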
EduNLP.Formula
- EduNLP.Formula.ast.ast(formula: (<class 'str'>, typing.List[typing.Dict]), index=0, forest_begin=0, father_tree=None, is_str=False)
The original code was written by https://github.com/hxwujinze
- Parameters
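formula (str or List[Dict]) – the formula string, or the structure obtained by parsing it with katex
index (int) – the position of this subtree in the tree
forest_begin (int) – the starting position of this tree in the forest
father_tree (List[Dict]) – the parent tree
is_str (bool) –
- Returns
tree (List[Dict]) – the feature tree produced by re-parsing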
todo (finish all types)
Notes
Some functions are not supported in katex, e.g.,
- tag
\begin{equation} \tag{tagName} F=ma \end{equation}
\begin{align} \tag{1} y=x+z \end{align}
\tag*{hi} x+y^{2x}
- dddot
\frac{ \dddot y }{ x }
For more information, refer to the katex support table.
- EduNLP.Formula.ast.get_edges(forest)
Construct the edge set.
- Parameters
forest (List[Dict]) – the forest
- Returns
edges – the edge set
- Return type
list of tuple(src,dst,type)
EduNLP.I2V
- class EduNLP.I2V.i2v.D2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)
Converts an item directly to a vector.
I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use a pretrained t2v model; False: use your own t2v model
kwargs – the parameters passed to t2v
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
- Returns
i2v model
- Return type
- infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function D2V.<lambda>>, *args, **kwargs) tuple
Converts items to vectors. The model must be loaded before this function is called.
- Parameters
items (str) – the text of question
tokenize (bool) – True: tokenize the item
indexing (bool) –
padding (bool) –
key (lambda function) – the parameter passed to tokenizer, select the text to be processed
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.I2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)
This class is only an interface, so it should not be used directly. To get a vector from an item, use a concrete model such as D2V or W2V.
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
True: use pretrained t2v model
False: use your own t2v model
kwargs – the parameters passed to t2v
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([...dtype=float32)], None)
- Returns
i2v model
- Return type
- class EduNLP.I2V.i2v.W2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)
Converts tokens to vectors.
I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use a pretrained t2v model; False: use your own t2v model
kwargs – the parameters passed to t2v
Examples
>>> i2v = get_pretrained_i2v("test_w2v", "examples/test_model/data/w2v")
>>> item_vector, token_vector = i2v(["有学者认为:‘学习’,必须适应实际"])
>>> item_vector
[array([...], dtype=float32)]
- Returns
i2v model
- Return type
- infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function W2V.<lambda>>, *args, **kwargs) tuple
Converts items to vectors. The model must be loaded before this function is called.
- Parameters
items (str) – the text of question
tokenize (bool) – True: tokenize the item
indexing (bool) –
padding (bool) –
key (lambda function) – the parameter passed to tokenizer, select the text to be processed
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- EduNLP.I2V.i2v.get_pretrained_i2v(name, model_dir='/home/docs/.EduNLP/model')
A convenient way to convert items to vectors with a pretrained model.
- Parameters
name (str) – the name of the item2vector model, e.g.: d2v_all_256, d2v_sci_256, d2v_eng_256, d2v_lit_256, w2v_sci_300, w2v_lit_300
model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’
- Returns
i2v model
- Return type
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> i2v = get_pretrained_i2v("test_d2v", "examples/test_model/data/d2v")
>>> print(i2v(item))
([array([ ...dtype=float32)], None)
EduNLP.Pretrain
- class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)
- Parameters
symbol (str) – select which segment types to symbolize: "t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep; e.g. gms, fgm
depth (int or None) – 0: separate only at SIFSep; 1: separate only at SIFTag; 2: separate at SIFTag and SIFSep; otherwise, separate all segments
- Returns
tokenizer
- Return type
Tokenizer
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{wrong1?}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
- class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)
- Parameters
symbol (str) – select which segment types to symbolize: "t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep; e.g.: gm, fgm, gmas, fgmas
general (bool) –
True: use when the item is not in standard format and formulas (except formulas in figures) should be tokenized linearly.
False: use the 'ast' method instead of 'linear' to tokenize formulas.
- Returns
tokenizer
- Return type
Tokenizer
Examples
>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
- EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)
- Parameters
items (str) – the text of question
w2v_prefix –
embedding_dim (int) – vector_size
method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (model format) – True: bin; False: kv
train_params (dict) – the training parameters passed to model
- Returns
the path of the saved model file, as in the example output
- Return type
str
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{wrong1?}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/data/gensim_luna_stem_t_", 100)
'examples/test_model/data/gensim_luna_stem_t_sg_100.kv'
EduNLP.Tokenizer
- class EduNLP.Tokenizer.PureTextTokenizer(*args, **kwargs)
Deals with text and plain-text formulas, and filters out special formulas such as $\FormFigureID{…}$ and $\FormFigureBase64{…}$.
- Parameters
items (str) –
key –
args –
kwargs –
- Returns
- Return type
token
Examples
>>> tokenizer = PureTextTokenizer()
>>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
>>> items = [{
...     "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
...     "options": ["1", "2"]
... }]
>>> tokens = tokenizer(items, key=lambda x: x["stem"])
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
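The calling convention used in these examples (pass a list of items, optionally a key that selects the text, then pull results with next) can be mimicked by a small generator. This is only a sketch: `toy_tokenizer` is a hypothetical stand-in, and whitespace splitting replaces the real tokenization.

```python
def toy_tokenizer(items, key=lambda x: x):
    """Yield one token list per item, lazily.

    `key` picks the text to tokenize out of each item; splitting on
    whitespace stands in for the real tokenization (an assumption).
    """
    for item in items:
        yield key(item).split()


# Items may be plain strings or richer records such as dicts.
items = [{"stem": "solve x + 1 = 2", "options": ["1", "2"]}]

# The key function selects which field of each item is tokenized.
tokens = toy_tokenizer(items, key=lambda x: x["stem"])
print(next(tokens))
```

Because the result is a generator, nothing is tokenized until next (or a loop) asks for the next item, which is why the examples call next(tokens).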
- class EduNLP.Tokenizer.TextTokenizer(*args, **kwargs)
Deals with text and formulas, including special formulas.
- Parameters
items (str) –
key –
args –
kwargs –
- Returns
- Return type
token
Examples
>>> tokenizer = TextTokenizer()
>>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> items = ["$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$\
... $\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$\
... $\\SIFTag{list_3}$2$\\SIFTag{options_end}$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['[TAG]', '复数', 'z', '=', '1', '+', '2', 'i', '+', 'i']
- EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)
A unified interface for obtaining the different tokenizers by name.
- Parameters
name (str) – the name of tokenizer, e.g. text, pure_text.
args – the parameters passed to tokenizer
kwargs – the parameters passed to tokenizer
- Returns
tokenizer
- Return type
Tokenizer
Examples
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("text")
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
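A name-based factory of this kind is commonly built on a registry that maps names to classes and forwards the remaining arguments to the constructor. The sketch below shows that pattern with toy classes; every name in it is hypothetical, and it is not claimed to match how get_tokenizer is implemented.

```python
class ToyTextTokenizer:
    """Splits each item on whitespace, optionally lowercasing first."""

    def __init__(self, lowercase=False):
        self.lowercase = lowercase

    def __call__(self, items):
        for item in items:
            text = item.lower() if self.lowercase else item
            yield text.split()


class ToyCharTokenizer:
    """Yields the non-space characters of each item."""

    def __call__(self, items):
        for item in items:
            yield [ch for ch in item if not ch.isspace()]


# Hypothetical registry: tokenizer name -> tokenizer class.
TOY_TOKENIZERS = {"text": ToyTextTokenizer, "char": ToyCharTokenizer}


def get_toy_tokenizer(name, *args, **kwargs):
    """Look up a tokenizer class by name and pass the arguments on to it."""
    if name not in TOY_TOKENIZERS:
        raise KeyError(
            f"unknown tokenizer {name!r}; choose from {sorted(TOY_TOKENIZERS)}"
        )
    return TOY_TOKENIZERS[name](*args, **kwargs)


tokenizer = get_toy_tokenizer("text", lowercase=True)
print(next(tokenizer(["Solve X + 1"])))
```

Keeping the lookup in one dictionary means a new tokenizer only needs one extra registry entry, and callers never import the concrete classes.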
Vector
- class EduNLP.Vector.BowLoader(filepath)
Uses the doc2bow model, which works as follows.
Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.
If allow_update is not set, this function is const, aka read-only.
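The doc2bow behaviour described above can be reproduced in a few lines of plain Python. The sketch below is not gensim: `ToyDictionary` is a hypothetical class that only mirrors the described semantics (token ids, counts, and document frequencies updated when allow_update is set), and it assumes that unknown tokens are skipped when allow_update is not set.

```python
from collections import Counter


class ToyDictionary:
    """Minimal token-to-id mapping with a doc2bow-style method."""

    def __init__(self):
        self.token2id = {}  # token -> integer id
        self.dfs = {}       # token id -> number of documents containing it

    def doc2bow(self, document, allow_update=False):
        counts = Counter(document)
        if allow_update:
            for token in counts:
                # Create ids for new words.
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)
                # Each word in this document raises its document frequency by one.
                token_id = self.token2id[token]
                self.dfs[token_id] = self.dfs.get(token_id, 0) + 1
        # Unknown tokens are skipped, so the read-only call changes nothing.
        return sorted(
            (self.token2id[token], count)
            for token, count in counts.items()
            if token in self.token2id
        )


dictionary = ToyDictionary()
print(dictionary.doc2bow(["x", "=", "x", "+", "1"], allow_update=True))
print(dictionary.doc2bow(["x", "y", "x"]))
```

The first call registers four tokens and returns their counts; the second call is read-only, so the unseen token "y" is ignored and the dictionary stays unchanged.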
- class EduNLP.Vector.D2V(filepath, method='d2v')
A collection of document-to-vector methods: d2v, bow and tfidf.
- Parameters
filepath –
method (str) – one of d2v, bow, tfidf
item –
- Returns
d2v model
- Return type
- class EduNLP.Vector.RNNModel(rnn_type, w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)
Examples
>>> model = RNNModel("ELMO", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
- class EduNLP.Vector.T2V(model: str, *args, **kwargs)
Converts a token list to vectors. If you already have a model, use T2V directly. Otherwise, call get_pretrained_t2v, which provides a pretrained model so that you do not need one of your own.
- Parameters
model (str) – select the model type e.g.: d2v, rnn, lstm, gru, elmo, etc.
Examples
>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> path = "examples/test_model/test_gensim_luna_stem_tf_d2v_256.bin"
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item))
[array([...dtype=float32)]
- class EduNLP.Vector.TfidfLoader(filepath)
This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.
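As a reminder of what a TF-IDF weighting computes, here is a minimal sketch in plain Python. It uses one common variant (raw term count multiplied by log2 of the inverse document frequency, without normalization); the `tfidf` helper is hypothetical and is not claimed to reproduce the exact weights produced by TfidfLoader.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {token: weight} dict per tokenized document.

    weight = term count in the document * log2(N / document frequency),
    one common TF-IDF variant chosen here for illustration.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each token occurs.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        weighted.append(
            {token: count * math.log2(n_docs / df[token])
             for token, count in tf.items()}
        )
    return weighted


docs = [["x", "=", "1"], ["x", "=", "2"]]
for weights in tfidf(docs):
    print(weights)
```

Tokens that occur in every document ("x" and "=") get weight 0, while tokens unique to one document get a positive weight, which is the property that makes TF-IDF useful for telling items apart.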
- class EduNLP.Vector.W2V(filepath, method=None, binary=None)
Uses the gensim library, which provides the FastText, Word2Vec and KeyedVectors methods, to convert words to vectors.
- Parameters
filepath – path to the pretrained model file
method (str) – fasttext; any other value selects Word2Vec
binary –
- EduNLP.Vector.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model')
A convenient way to convert token lists to vectors with a pretrained model.
- Parameters
name (str) – select the pretrained model e.g.: d2v_all_256, d2v_sci_256, d2v_eng_256, d2v_lit_256, w2v_eng_300, w2v_lit_300.
model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’
- Returns
t2v model
- Return type
Examples
>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> i2v = get_pretrained_t2v("test_d2v", "examples/test_model/data/d2v")
>>> print(i2v(item))
[array([...dtype=float32)]