torchtext.vocab¶
Vocab¶
class torchtext.vocab.Vocab(counter, max_size=None, min_freq=1, specials=['<pad>'], vectors=None, unk_init=None, vectors_cache=None, specials_first=True)¶
Defines a vocabulary object that will be used to numericalize a field.
Variables:
- freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.
- stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
- itos – A list of token strings indexed by their numerical identifiers.
__init__(counter, max_size=None, min_freq=1, specials=['<pad>'], vectors=None, unk_init=None, vectors_cache=None, specials_first=True)¶
Create a Vocab object from a collections.Counter.
Parameters:
- counter – collections.Counter object holding the frequencies of each value found in the data.
- max_size – The maximum size of the vocabulary, or None for no maximum. Default: None.
- min_freq – The minimum frequency needed to include a token in the vocabulary. Values less than 1 will be set to 1. Default: 1.
- specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary in addition to an <unk> token. Default: ['<pad>']
- vectors – One of the available pretrained vectors, or custom pretrained vectors (see Vocab.load_vectors), or a list of such vectors.
- unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.Tensor.zero_
- vectors_cache – directory for cached vectors. Default: '.vector_cache'
- specials_first – Whether to add the special tokens at the beginning of the vocabulary. If False, they are appended at the end instead. Default: True.
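Example – a minimal sketch of building a vocabulary from raw token counts (the tokens and counts below are made up for illustration; '<unk>' is passed in specials explicitly so that unknown-token lookups have an index):

>>> from collections import Counter
>>> from torchtext.vocab import Vocab
>>> counter = Counter({'hello': 4, 'world': 3, 'rare': 1})
>>> v = Vocab(counter, min_freq=2, specials=['<unk>', '<pad>'])
>>> len(v)                       # 'rare' falls below min_freq and is dropped
4
>>> v.stoi['hello'], v.itos[2]   # specials come first, then tokens by frequency
(2, 'hello')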
load_vectors(vectors, **kwargs)¶
Parameters:
- vectors – One of, or a list containing instantiations of, the GloVe, CharNGram, or Vectors classes. Alternatively, one of, or a list of, the available pretrained vector names: charngram.100d, fasttext.en.300d, fasttext.simple.300d, glove.42B.300d, glove.840B.300d, glove.twitter.27B.25d, glove.twitter.27B.50d, glove.twitter.27B.100d, glove.twitter.27B.200d, glove.6B.50d, glove.6B.100d, glove.6B.200d, glove.6B.300d
- Remaining keyword arguments – Passed to the constructor of the Vectors classes.
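Example – continuing the hypothetical v built in the sketch above, pretrained vectors can be attached either by alias name or with an instantiated Vectors subclass (the glove.6B archive is downloaded and cached on first use):

>>> v.load_vectors('glove.6B.100d')
>>> # equivalently, pass a Vectors instance to control its constructor arguments
>>> from torchtext.vocab import GloVe
>>> v.load_vectors(GloVe(name='6B', dim=100, cache='.vector_cache'))
>>> v.vectors.shape              # one row per token in the vocabulary
torch.Size([4, 100])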
set_vectors(stoi, vectors, dim, unk_init=torch.Tensor.zero_)¶
Set the vectors for the Vocab instance from a collection of Tensors.
Parameters:
- stoi – A dictionary mapping each token string to the index of its vector in the vectors input argument.
- vectors – An indexed iterable (or other structure supporting __getitem__) that, given an input index, returns a FloatTensor representing the vector for the token associated with that index. For example, vectors[stoi["string"]] should return the vector for "string".
- dim – The dimensionality of the vectors.
- unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.Tensor.zero_
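Example – a sketch of wiring externally built vectors into a vocabulary; the tokens and 3-dimensional vectors are invented for illustration. Tokens without an entry in stoi (here '<pad>') are initialized by unk_init, i.e. zeroed by default:

>>> import torch
>>> from collections import Counter
>>> from torchtext.vocab import Vocab
>>> v = Vocab(Counter({'cat': 2, 'dog': 2}), specials=['<pad>'])
>>> ext_stoi = {'cat': 0, 'dog': 1}            # index into ext_vectors
>>> ext_vectors = torch.tensor([[1., 0., 0.],  # vector for 'cat'
...                             [0., 1., 0.]]) # vector for 'dog'
>>> v.set_vectors(ext_stoi, ext_vectors, dim=3)
>>> v.vectors[v.stoi['dog']]
tensor([0., 1., 0.])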
SubwordVocab¶
class torchtext.vocab.SubwordVocab(counter, max_size=None, specials=['<pad>'], vectors=None, unk_init=torch.Tensor.zero_)¶

__init__(counter, max_size=None, specials=['<pad>'], vectors=None, unk_init=torch.Tensor.zero_)¶
Create a revtok subword vocabulary from a collections.Counter.
Parameters:
- counter – collections.Counter object holding the frequencies of each word found in the data.
- max_size – The maximum size of the subword vocabulary, or None for no maximum. Default: None.
- specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary in addition to an <unk> token.
- vectors – One of the available pretrained vectors, or custom pretrained vectors (see Vocab.load_vectors), or a list of such vectors.
- unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.Tensor.zero_
Vectors¶

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)¶

__init__(name, cache=None, url=None, unk_init=None, max_vectors=None)¶
Parameters:
- name – name of the file that contains the vectors
- cache – directory for cached vectors
- url – url for download if vectors not found in cache
- unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size
- max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in the descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
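Example – a sketch of loading a custom word-vector file. The file name and URL below are hypothetical; any plain-text file with one "token v1 v2 …" vector per line works:

>>> from torchtext.vocab import Vectors
>>> vec = Vectors(name='my_vectors.txt', cache='.vector_cache',
...               url='https://example.com/my_vectors.txt',  # fetched only if not already cached
...               max_vectors=50000)
>>> v_king = vec['king']          # per-token lookup; unknown tokens go through unk_init
>>> vec.stoi, vec.itos, vec.dim   # token-to-index map, index-to-token list, dimensionality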
Pretrained Word Embeddings¶
GloVe¶
class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)¶

__init__(name='840B', dim=300, **kwargs)¶
Arguments:
- name – name of the file that contains the vectors
- cache – directory for cached vectors
- url – url for download if vectors not found in cache
- unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size
- max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in the descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
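Example – a sketch of loading a truncated GloVe table to bound memory use (the 6B archive is downloaded and cached on first use):

>>> from torchtext.vocab import GloVe
>>> glove = GloVe(name='6B', dim=100, max_vectors=50000)  # keep only the 50k most frequent words
>>> glove.vectors.shape
torch.Size([50000, 100])
>>> v_lang = glove['language']   # 100-dimensional FloatTensor for a single word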
FastText¶
class torchtext.vocab.FastText(language='en', **kwargs)¶

__init__(language='en', **kwargs)¶
Arguments:
- name – name of the file that contains the vectors
- cache – directory for cached vectors
- url – url for download if vectors not found in cache
- unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size
- max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in the descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
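Example – a sketch; the language code selects which fastText Wikipedia vectors to download ('simple' is a much smaller set than 'en'):

>>> from torchtext.vocab import FastText
>>> ft = FastText(language='simple')   # downloaded and cached on first use
>>> ft.dim                             # fastText Wikipedia vectors are 300-dimensional
300
>>> v_bike = ft['bicycle']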
CharNGram¶
class torchtext.vocab.CharNGram(**kwargs)¶

__init__(**kwargs)¶
Arguments:
- name – name of the file that contains the vectors
- cache – directory for cached vectors
- url – url for download if vectors not found in cache
- unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size
- max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in the descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
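Example – a sketch; CharNGram composes a word's vector from the vectors of its character n-grams, so it can produce an embedding even for words missing from the pretrained table:

>>> from torchtext.vocab import CharNGram
>>> cn = CharNGram()          # downloads the 100-dimensional charngram table on first use
>>> v_hello = cn['hello']     # 100-dimensional vector built from character n-grams
>>> v_oov = cn['helloooo']    # still gets a vector, even though the word itself is unseen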
Misc.¶
pretrained_aliases¶
torchtext.vocab.pretrained_aliases¶
A dict mapping each supported pretrained-vector alias to a functools.partial that constructs the corresponding Vectors subclass:
- 'charngram.100d' – partial(CharNGram)
- 'fasttext.en.300d' – partial(FastText, language='en')
- 'fasttext.simple.300d' – partial(FastText, language='simple')
- 'glove.42B.300d' – partial(GloVe, name='42B', dim='300')
- 'glove.840B.300d' – partial(GloVe, name='840B', dim='300')
- 'glove.twitter.27B.25d' – partial(GloVe, name='twitter.27B', dim='25')
- 'glove.twitter.27B.50d' – partial(GloVe, name='twitter.27B', dim='50')
- 'glove.twitter.27B.100d' – partial(GloVe, name='twitter.27B', dim='100')
- 'glove.twitter.27B.200d' – partial(GloVe, name='twitter.27B', dim='200')
- 'glove.6B.50d' – partial(GloVe, name='6B', dim='50')
- 'glove.6B.100d' – partial(GloVe, name='6B', dim='100')
- 'glove.6B.200d' – partial(GloVe, name='6B', dim='200')
- 'glove.6B.300d' – partial(GloVe, name='6B', dim='300')
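These alias strings are what Vocab.load_vectors accepts; each entry can also be called directly to build the underlying Vectors object (a sketch):

>>> from torchtext.vocab import pretrained_aliases
>>> sorted(pretrained_aliases)[:3]
['charngram.100d', 'fasttext.en.300d', 'fasttext.simple.300d']
>>> vectors = pretrained_aliases['glove.6B.50d']()   # equivalent to GloVe(name='6B', dim='50')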