View source on GitHub
Various TensorFlow ops related to text processing.
Modules
metrics module: TensorFlow text-processing metrics.
tflite_registrar module: A module with a Python wrapper for TFLite TFText ops.
Classes
class BertTokenizer: Tokenizer used for BERT.
class ByteSplitter: Splits a string tensor into bytes.
class Detokenizer: Base class for detokenizer implementations.
class FastBertNormalizer: Normalizes a tensor of UTF-8 strings.
class FastBertTokenizer: Tokenizer used for BERT, a faster version with TFLite support.
class FastSentencepieceTokenizer: Sentencepiece tokenizer with tf.text interface.
class FastWordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
class FirstNItemSelector: An ItemSelector that selects the first n items in the batch.
class HubModuleSplitter: Splitter that uses a Hub module.
class HubModuleTokenizer: Tokenizer that uses a Hub module.
class LastNItemSelector: An ItemSelector that selects the last n items in the batch.
class MaskValuesChooser: Assigns values to the items chosen for masking.
class PhraseTokenizer: Tokenizes a tensor of UTF-8 string tokens into phrases.
class RandomItemSelector: An ItemSelector implementation that randomly selects items in a batch.
class Reduction: Type of reduction to be done by the n-gram op.
class RegexSplitter: RegexSplitter splits text on the given regular expression.
class RoundRobinTrimmer: A Trimmer that allocates a length budget to segments via round robin.
class SentencepieceTokenizer: Tokenizes a tensor of UTF-8 strings.
class ShrinkLongestTrimmer: A Trimmer that truncates the longest segment.
class SplitMergeFromLogitsTokenizer: Tokenizes a tensor of UTF-8 string into words according to logits.
class SplitMergeTokenizer: Tokenizes a tensor of UTF-8 string into words according to labels.
class Splitter: An abstract base class for splitting text.
class SplitterWithOffsets: An abstract base class for splitters that return offsets.
class StateBasedSentenceBreaker: A Splitter that uses a state machine to determine sentence breaks.
class Tokenizer: Base class for tokenizer implementations.
class TokenizerWithOffsets: Base class for tokenizer implementations that return offsets.
class Trimmer: Truncates a list of segments using a pre-determined truncation strategy.
class UnicodeCharTokenizer: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
class UnicodeScriptTokenizer: Tokenizes UTF-8 by splitting when there is a change in Unicode script.
class WaterfallTrimmer: A Trimmer that allocates a length budget to segments in order.
class WhitespaceTokenizer: Tokenizes a tensor of UTF-8 strings on whitespaces.
class WordShape: Values for the 'pattern' arg of the wordshape op.
class WordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
Functions
boise_tags_to_offsets(...): Converts the token offsets and BOISE tags into span offsets and span type.
build_fast_bert_normalizer_model(...): build_fast_bert_normalizer_model(arg0: bool) -> bytes
build_fast_wordpiece_model(...): build_fast_wordpiece_model(arg0: list[str], arg1: int, arg2: str, arg3: str, arg4: bool, arg5: bool) -> bytes
case_fold_utf8(...): Applies case folding to every UTF-8 string in the input.
coerce_to_structurally_valid_utf8(...): Coerce UTF-8 input strings to structurally valid UTF-8.
combine_segments(...): Combine one or more input segments for a model's input sequence.
concatenate_segments(...): Concatenate input segments for a model's input sequence.
find_source_offsets(...): Maps the input post-normalized string offsets to pre-normalized offsets.
gather_with_default(...): Gather slices with indices=-1 mapped to default.
greedy_constrained_sequence(...): Performs greedy constrained sequence on a batch of examples.
mask_language_model(...): Applies dynamic language model masking.
max_spanning_tree(...): Finds the maximum directed spanning tree of a digraph.
max_spanning_tree_gradient(...): Returns a subgradient of the MaximumSpanningTree op.
ngrams(...): Create a tensor of n-grams based on the input data.
normalize_utf8(...): Normalizes each UTF-8 string in the input tensor using the specified rule.
normalize_utf8_with_offsets_map(...): Normalizes each UTF-8 string in the input tensor using the specified rule.
offsets_to_boise_tags(...): Converts the given tokens and spans in offsets format into BOISE tags.
pad_along_dimension(...): Add padding to the beginning and end of data in a specific dimension.
pad_model_inputs(...): Pad model input and generate corresponding input masks.
regex_split(...): Split input by delimiters that match a regex pattern.
regex_split_with_offsets(...): Split input by delimiters that match a regex pattern; returns offsets.
sentence_fragments(...): Find the sentence fragments in a given text. (deprecated)
sliding_window(...): Builds a sliding window for data with a specified width.
span_alignment(...): Return an alignment from a set of source spans to a set of target spans.
span_overlaps(...): Returns a boolean tensor indicating which source and target spans overlap.
utf8_binarize(...): Decode UTF8 tokens into code points and return their bits.
viterbi_constrained_sequence(...): Performs Viterbi constrained sequence decoding on a batch of examples.
wordshape(...): Determine wordshape features for each input string.
Other Members

| Member | Value |
|---|---|
| version | '2.19.0' |