torchtext.vocab

Vocab

class torchtext.vocab.Vocab(counter, max_size=None, min_freq=1, specials=['<pad>'], vectors=None, unk_init=None, vectors_cache=None, specials_first=True)

Defines a vocabulary object that will be used to numericalize a field.

Variables:

freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.
stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
itos – A list of token strings indexed by their numerical identifiers.
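The relationship between these three attributes can be illustrated without torchtext. The following is a minimal sketch using plain Python containers. The token list, the choice to place '<unk>' at index 0, the alphabetical tie-break, and the helper name numericalize are assumptions made for illustration, not behavior confirmed for the library.

```python
from collections import Counter, defaultdict

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# freqs: how often each token occurs in the data.
freqs = Counter(tokens)

# itos: index -> token. Placing '<unk>' at index 0 is an assumption of this sketch.
itos = ["<unk>"] + sorted(freqs, key=lambda tok: (-freqs[tok], tok))

# stoi: token -> index. Tokens that are not in the vocabulary fall back to index 0.
stoi = defaultdict(lambda: 0)
stoi.update({tok: i for i, tok in enumerate(itos)})


def numericalize(sentence):
    """Map a list of token strings to their numerical identifiers."""
    return [stoi[tok] for tok in sentence]


ids = numericalize(["the", "dog", "sat"])
print(ids)                     # [1, 0, 5]
print([itos[i] for i in ids])  # ['the', '<unk>', 'sat']
```

Note that looking up a missing key in a defaultdict also stores that key, so repeated lookups of unseen tokens grow the mapping.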

__init__(counter, max_size=None, min_freq=1, specials=['<pad>'], vectors=None, unk_init=None, vectors_cache=None, specials_first=True)

Create a Vocab object from a collections.Counter.

Parameters:

counter – collections.Counter object holding the frequencies of each value found in the data.
max_size – The maximum size of the vocabulary, or None for no maximum. Default: None.
min_freq – The minimum frequency needed to include a token in the vocabulary. Values less than 1 will be set to 1. Default: 1.
specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary in addition to an <unk> token. Default: ['<pad>'].
vectors – Either one of the available pretrained vectors or a set of custom pretrained vectors (see Vocab.load_vectors), or a list of such vectors.
unk_init (callback) – Function used to initialize out-of-vocabulary word vectors; it can be any function that takes a Tensor and returns a Tensor of the same size. By default, these vectors are initialized to zero. Default: torch.Tensor.zero_
vectors_cache – Directory for cached vectors. Default: '.vector_cache'.
specials_first – Whether to add the special tokens at the beginning of the vocabulary. If False, they are added at the end. Default: True.
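How max_size, min_freq, specials, and specials_first interact can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name build_itos is hypothetical, '<unk>' handling is left out, ties are assumed to be broken alphabetically, and max_size is assumed to count only non-special tokens.

```python
from collections import Counter


def build_itos(counter, max_size=None, min_freq=1, specials=("<pad>",),
               specials_first=True):
    """Return a token list ordered by descending frequency.

    Simplified sketch: '<unk>' handling is left out, and max_size is assumed
    to count only non-special tokens.
    """
    min_freq = max(min_freq, 1)  # values less than 1 are treated as 1
    specials = list(specials)

    # Special tokens are placed explicitly, so exclude them from the ranking.
    candidates = [(tok, n) for tok, n in counter.items() if tok not in specials]
    # Sort by frequency (descending); ties broken alphabetically (assumption).
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))

    kept = [tok for tok, n in candidates if n >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]

    return specials + kept if specials_first else kept + specials


counter = Counter("a a a b b c d d d d".split())
print(build_itos(counter, max_size=3, min_freq=2))
# ['<pad>', 'd', 'a', 'b']
print(build_itos(counter, min_freq=2, specials_first=False))
# ['d', 'a', 'b', '<pad>']
```

The token 'c' occurs once, so min_freq=2 drops it in both calls.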

load_vectors(vectors, **kwargs)

Parameters:

vectors – One of, or a list containing, instances of the GloVe, CharNGram, or Vectors classes. Alternatively, one of, or a list of, the names of the available pretrained vectors:
charngram.100d
fasttext.en.300d
fasttext.simple.300d
glove.42B.300d
glove.840B.300d
glove.twitter.27B.25d
glove.twitter.27B.50d
glove.twitter.27B.100d
glove.twitter.27B.200d
glove.6B.50d
glove.6B.100d
glove.6B.200d
glove.6B.300d
Remaining keyword arguments – Passed to the constructor of the Vectors classes.

set_vectors(stoi, vectors, dim, unk_init=torch.Tensor.zero_)

Set the vectors for the Vocab instance from a collection of Tensors.

Parameters:

stoi – A dictionary mapping each string to the index of its associated vector in the vectors input argument.
vectors – An indexed iterable (or other structure supporting __getitem__) that, given an input index, returns a FloatTensor representing the vector for the token associated with that index. For example, vectors[stoi["string"]] should return the vector for "string".
dim – The dimensionality of the vectors.
unk_init (callback) – Function used to initialize out-of-vocabulary word vectors; it can be any function that takes a Tensor and returns a Tensor of the same size. By default, these vectors are initialized to zero. Default: torch.Tensor.zero_
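The lookup described by these parameters can be sketched as follows. To keep the example free of dependencies, plain lists of floats stand in for Tensors; set_vectors_sketch and zero_init are hypothetical names, and the code is an illustration of the described behavior rather than the library's implementation.

```python
def zero_init(vec):
    """Stand-in for torch.Tensor.zero_: zero the vector in place and return it."""
    for i in range(len(vec)):
        vec[i] = 0.0
    return vec


def set_vectors_sketch(itos, stoi, vectors, dim, unk_init=zero_init):
    """Build one row per vocabulary token from a pretrained collection."""
    rows = []
    for token in itos:
        index = stoi.get(token)
        if index is not None:
            rows.append(list(vectors[index]))  # copy the pretrained vector
        else:
            # Token has no pretrained vector: hand a placeholder row to unk_init.
            rows.append(unk_init([float("nan")] * dim))
    return rows


vocab_itos = ["<unk>", "cat", "dog"]
pretrained_stoi = {"dog": 0, "cat": 1, "fish": 2}
pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

rows = set_vectors_sketch(vocab_itos, pretrained_stoi, pretrained, dim=2)
print(rows)  # [[0.0, 0.0], [0.3, 0.4], [0.1, 0.2]]
```

The rows follow the order of the vocabulary (vocab_itos), not the order of the pretrained collection, and "fish" is ignored because it is not in the vocabulary.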

SubwordVocab

class torchtext.vocab.SubwordVocab(counter, max_size=None, specials=['<pad>'], vectors=None, unk_init=torch.Tensor.zero_)

__init__(counter, max_size=None, specials=['<pad>'], vectors=None, unk_init=torch.Tensor.zero_)

Create a revtok subword vocabulary from a collections.Counter.

Parameters:

counter – collections.Counter object holding the frequencies of each word found in the data.
max_size – The maximum size of the subword vocabulary, or None for no maximum. Default: None.
specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary in addition to an <unk> token.
vectors – Either one of the available pretrained vectors or a set of custom pretrained vectors (see Vocab.load_vectors), or a list of such vectors.
unk_init (callback) – Function used to initialize out-of-vocabulary word vectors; it can be any function that takes a Tensor and returns a Tensor of the same size. By default, these vectors are initialized to zero. Default: torch.Tensor.zero_

Vectors

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)

__init__(name, cache=None, url=None, unk_init=None, max_vectors=None)

Parameters:

name – Name of the file that contains the vectors.
cache – Directory for cached vectors.
url – URL to download the vectors from if they are not found in the cache.
unk_init (callback) – Function used to initialize out-of-vocabulary word vectors; it can be any function that takes a Tensor and returns a Tensor of the same size. By default, these vectors are initialized to zero.
max_vectors (int) – Limits the number of pretrained vectors loaded. Most pretrained vector sets are sorted in descending order of word frequency, so when the entire set does not fit in memory, or is not needed for another reason, passing max_vectors limits the size of the loaded set.
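The idea behind max_vectors can be sketched with a small reader for whitespace-separated vector files. This is an illustration under assumptions, not the library's loader: load_text_vectors is a hypothetical name, the "token v1 v2 ..." line format is assumed, and the file contents are made up.

```python
import io


def load_text_vectors(lines, max_vectors=None):
    """Parse 'token v1 v2 ...' lines into (itos, stoi, vectors)."""
    itos, vectors = [], []
    for line in lines:
        if max_vectors is not None and len(itos) >= max_vectors:
            break  # stop early: later rows are assumed to be rarer words
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        itos.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi, vectors


# Hypothetical file contents, most frequent word first.
fake_file = io.StringIO("the 0.1 0.2\nof 0.3 0.4\nzebra 0.5 0.6\n")
itos, stoi, vectors = load_text_vectors(fake_file, max_vectors=2)
print(itos)             # ['the', 'of']
print("zebra" in stoi)  # False
```

Because reading stops after max_vectors rows, the rest of the file is never parsed, which is what keeps memory use bounded.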

Pretrained Word Embeddings

GloVe

class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)

__init__(name='840B', dim=300, **kwargs)

Parameters:

name – Name of the file that contains the vectors.
cache – Directory for cached vectors.
url – URL to download the vectors from if they are not found in the cache.
unk_init (callback) – Function used to initialize out-of-vocabulary word vectors; it can be any function that takes a Tensor and returns a Tensor of the same size. By default, these vectors are initialized to zero.
max_vectors (int) – Limits the number of pretrained vectors loaded. Most pretrained vector sets are sorted in descending order of word frequency, so when the entire set does not fit in memory, or is not needed for another reason, passing max_vectors limits the size of the loaded set.

FastText

class torchtext.vocab.FastText(language='en', **kwargs)

__init__(language='en', **kwargs)

Parameters:

name – Name of the file that contains the vectors.
cache – Directory for cached vectors.
url – URL to download the vectors from if they are not found in the cache.
unk_init (callback) – Function used to initialize out-of-vocabulary word vectors; it can be any function that takes a Tensor and returns a Tensor of the same size. By default, these vectors are initialized to zero.
max_vectors (int) – Limits the number of pretrained vectors loaded. Most pretrained vector sets are sorted in descending order of word frequency, so when the entire set does not fit in memory, or is not needed for another reason, passing max_vectors limits the size of the loaded set.

CharNGram

class torchtext.vocab.CharNGram(**kwargs)

__init__(**kwargs)

Parameters:

name – Name of the file that contains the vectors.
cache – Directory for cached vectors.
url – URL to download the vectors from if they are not found in the cache.
unk_init (callback) – Function used to initialize out-of-vocabulary word vectors; it can be any function that takes a Tensor and returns a Tensor of the same size. By default, these vectors are initialized to zero.
max_vectors (int) – Limits the number of pretrained vectors loaded. Most pretrained vector sets are sorted in descending order of word frequency, so when the entire set does not fit in memory, or is not needed for another reason, passing max_vectors limits the size of the loaded set.

Misc.

_default_unk_index

torchtext.vocab._default_unk_index()

pretrained_aliases

torchtext.vocab.pretrained_aliases = {
    'charngram.100d': functools.partial(<class 'torchtext.vocab.CharNGram'>),
    'fasttext.en.300d': functools.partial(<class 'torchtext.vocab.FastText'>, language='en'),
    'fasttext.simple.300d': functools.partial(<class 'torchtext.vocab.FastText'>, language='simple'),
    'glove.42B.300d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='300', name='42B'),
    'glove.6B.100d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='100', name='6B'),
    'glove.6B.200d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='200', name='6B'),
    'glove.6B.300d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='300', name='6B'),
    'glove.6B.50d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='50', name='6B'),
    'glove.840B.300d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='300', name='840B'),
    'glove.twitter.27B.100d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='100', name='twitter.27B'),
    'glove.twitter.27B.200d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='200', name='twitter.27B'),
    'glove.twitter.27B.25d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='25', name='twitter.27B'),
    'glove.twitter.27B.50d': functools.partial(<class 'torchtext.vocab.GloVe'>, dim='50', name='twitter.27B')
}
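Each value in this mapping is a functools.partial that pre-binds constructor arguments, so calling it with any remaining keyword arguments yields a configured instance. The pattern can be sketched as below. FakeGloVe, aliases, and resolve are hypothetical names introduced for illustration; the stand-in class downloads and loads nothing, and this is not torchtext's own lookup code.

```python
from functools import partial


class FakeGloVe:
    """Hypothetical stand-in for a vectors class; it loads nothing."""

    def __init__(self, name="840B", dim=300, **kwargs):
        self.name, self.dim, self.kwargs = name, dim, kwargs


# A reduced alias table in the same shape as pretrained_aliases.
aliases = {
    "glove.6B.100d": partial(FakeGloVe, name="6B", dim="100"),
    "glove.840B.300d": partial(FakeGloVe, name="840B", dim="300"),
}


def resolve(alias, **kwargs):
    """Look up an alias and call its partial with any extra keyword arguments."""
    if alias not in aliases:
        raise ValueError("unknown pretrained vectors: {!r}".format(alias))
    return aliases[alias](**kwargs)


vec = resolve("glove.6B.100d", cache=".vector_cache")
print(vec.name, vec.dim, vec.kwargs)  # 6B 100 {'cache': '.vector_cache'}
```

As in the table above, dim is bound as a string; the sketch keeps that detail rather than converting it to an int.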
