niacin.text.en package

Submodules

niacin.text.en.char module

Character-based functions for enriching English language data.

Importable functions include:

  • add_characters
  • add_contractions
  • add_fat_thumbs
  • add_leet
  • add_macbook_keyboard
  • add_whitespace
  • remove_characters
  • remove_contractions
  • remove_punctuation
  • remove_whitespace
  • swap_chars
niacin.text.en.char.add_characters(string: str, p: float = 0.01) → str[source]

Insert individual characters with probability p.

These are chosen at random from the ASCII alphabet (both uppercase and lowercase letters).

Parameters:
  • string – text
  • p – probability of adding a character
Returns:

enriched text
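
A minimal usage sketch; insertion points and characters are drawn at random, so the commented output is illustrative only:

>>> from niacin.text.en import char
>>> char.add_characters('the quick brown fox', p=0.05)  # e.g. 'the quicSk brown fox'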

niacin.text.en.char.add_contractions(string: str, p: float = 0.5) → str[source]

Replace common word pairs with their contraction.

This is done even when the contraction introduces ambiguity, as this is seen as preserving the semantics (arXiv:1812.04718).

Parameters:
  • string – text
  • p – probability of a word pair being replaced
Returns:

enriched text
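
A usage sketch; with p=1.0 every recognized word pair should be contracted, but the exact output depends on the contraction table, so the comment is illustrative:

>>> from niacin.text.en import char
>>> char.add_contractions('I do not think it is ready', p=1.0)  # e.g. "I don't think it's ready"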

niacin.text.en.char.add_fat_thumbs(string: str, p: float = 0.01) → str[source]

Replace characters with QWERTY neighbors.

One source of typographic mistakes comes from pressing a nearby key on a keyboard (or on a touchscreen). With probability p, replace each character in a string with one from a set of its neighbors. The replacement is chosen using random.choice.

Parameters:
  • string – text
  • p – probability of replacing a character
Returns:

enriched text
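
A usage sketch; each replacement is a randomly chosen QWERTY neighbor, so the comment shows only one possible result:

>>> from niacin.text.en import char
>>> char.add_fat_thumbs('hello world', p=0.2)  # e.g. 'hrllo wprld'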

niacin.text.en.char.add_leet(string: str, p: float = 0.2) → str[source]

Replace character groups with visually or aurally similar ones.

Character groups given in LEETMAP.keys() are searched for in priority order (roughly from largest to smallest), and each is replaced with an associated value with probability p, e.g.:

“Hello, you are banned”
“Hello, you are b&”
“Hello, you r b&”
“Hello, u r b&”
“H3110, u r b&”
Parameters:
  • string – text
  • p – conditional probability of replacing a character group
Returns:

enriched text
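
A usage sketch echoing the example above; which substitutions fire is random, so the comment shows one possible result at a high p:

>>> from niacin.text.en import char
>>> char.add_leet('Hello, you are banned', p=0.8)  # e.g. 'H3110, u r b&'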

niacin.text.en.char.add_macbook_keyboard(string: str, p: float = 0.1) → str[source]

Repeat or remove each character with probability p.

Bad keyboards are a common source of typographical errors, repeating characters or omitting them, e.g. because individual keys get stuck. With probability p, we modify a character, with a 50/50 chance of either removing it or repeating it twice.

Parameters:
  • string – text
  • p – probability of changing letter count
Returns:

enriched text
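
A usage sketch; each affected character is either dropped or repeated at random:

>>> from niacin.text.en import char
>>> char.add_macbook_keyboard('sticky keys', p=0.2)  # e.g. 'stickyy kes'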

niacin.text.en.char.add_whitespace(string: str, p: float = 0.01) → str[source]

Add a spacebar character with probability p.

Extraneous whitespace, especially when it occurs in the middle of an important word, can reduce the effectiveness of models which depend on word tokenizers as part of the data pipeline.

Parameters:
  • string – text
  • p – probability of adding a space character
Returns:

enriched text
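
A usage sketch; spaces are inserted at randomly chosen positions:

>>> from niacin.text.en import char
>>> char.add_whitespace('tokenization', p=0.1)  # e.g. 'token ization'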

niacin.text.en.char.remove_characters(string: str, p: float = 0.01) → str[source]

Remove individual characters with probability p.

Parameters:
  • string – text
  • p – probability of removing a character
Returns:

enriched text

niacin.text.en.char.remove_contractions(string: str, p: float = 0.5) → str[source]

Expand a contraction into individual tokens.

See (arXiv:1812.04718).

Parameters:
  • string – text
  • p – probability of a contraction being expanded
Returns:

enriched text

niacin.text.en.char.remove_punctuation(string: str, p: float = 0.25) → str[source]

Remove punctuation with probability p.

The removal of punctuation is a common data cleaning step for fast but high-bias models and data processing algorithms. When that punctuation occurs in the middle of a word (e.g. indicating possession), its removal may change the semantics of the string.

Parameters:
  • string – text
  • p – probability of removing punctuation
Returns:

enriched text
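
A usage sketch; at p=1.0 removal should be exhaustive, although the exact result is shown here only as an expectation:

>>> from niacin.text.en import char
>>> char.remove_punctuation("don't stop, believing!", p=1.0)  # expected: 'dont stop believing'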

niacin.text.en.char.remove_whitespace(string: str, p: float = 0.1) → str[source]

Remove a spacebar character with probability p.

Selective removal of whitespace can reduce the effectiveness of word-based models, or those which depend on word tokenizers as part of the data pipeline.

Parameters:
  • string – text
  • p – probability of removing a space character
Returns:

enriched text

niacin.text.en.char.swap_chars(string: str, p: float = 0.05) → str[source]

Swap adjacent characters.

With probability p, swap two adjacent characters in a string. No character gets swapped more than once, so it cannot end up in any location that is not adjacent to its starting position.

Note

To keep the interface consistent, niacin’s implementation acts on a probability p, applied n-1 times, where n is the total number of characters in the string. The implementation in noisemix (called flip_chars) chooses two letters at random and exchanges their positions, exactly once per string.

Parameters:
  • string – text
  • p – probability of swapping two characters
Returns:

enriched text
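
A usage sketch; the adjacent pairs to swap are chosen at random:

>>> from niacin.text.en import char
>>> char.swap_chars('transpose', p=0.1)  # e.g. 'trasnpose'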

niacin.text.en.sentence module

Sentence-based functions for enriching English language data.

Importable functions include:

  • add_applause
  • add_backtranslation
  • add_bytes
  • add_love
niacin.text.en.sentence.add_applause(string: str, p: float = 0.1) → str[source]

Replace whitespace with clapping emojis.

In online communities, replacing whitespace delimiters with the clapping emoji (👏) is a way of indicating emphasis, possibly as a typographic replacement for the baton gesture. This has the unintended consequence of rendering word- or token-based models ineffective.

Parameters:
  • string – text
  • p – probability of replacing every whitespace character
Returns:

enriched text
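
A usage sketch; with p=1.0 every whitespace delimiter should be replaced (the comment is illustrative):

>>> from niacin.text.en import sentence
>>> sentence.add_applause('this is so important', p=1.0)  # e.g. 'this👏is👏so👏important'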

niacin.text.en.sentence.add_backtranslation(string: str, p: float = 0.5) → str[source]

Translate a sentence into another language and back.

Use a fairseq model to translate a sentence from English into German, then translate the German back into English with another fairseq model (arXiv:1904.01038). Anecdotally, this generates sequences with similar semantic content, but different word choices, and is a popular way to augment small datasets in high-resource languages (arXiv:1904.12848).

Warning

Backtranslation uses large neural machine translation (NMT) models. The first time you call this function, it will download and cache up to 6GB of data, which can take hours depending on your connection speed. The download only happens once, but the model size will impact memory usage every time you use this function.

Parameters:
  • string – text
  • p – probability of backtranslating a sentence
Returns:

enriched text
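
A usage sketch; note the warning above about model downloads, and that the round-tripped wording is model-dependent, so the comment shows only a plausible result:

>>> from niacin.text.en import sentence
>>> # the first call downloads and caches the translation models (see warning above)
>>> sentence.add_backtranslation('I would like an apple.', p=1.0)  # e.g. 'I want an apple.'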

niacin.text.en.sentence.add_bytes(string: str, p: float = 0.1, length: int = 100) → str[source]

Add random bytes to the end of a sentence.

A common spam-disguising technique is to append random sequences of bytes to the end of text data. This can be effective against character-based models, or log-linear models which include total length and character distribution as features. Random bytes are decoded as utf-8 with errors ignored, so the total number of characters will typically be smaller than the length input parameter.

Parameters:
  • string – text
  • p – probability of adding random bytes
  • length – number of random bytes
Returns:

enriched text
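
A usage sketch; the appended bytes are random, so no exact output is shown:

>>> from niacin.text.en import sentence
>>> sentence.add_bytes('totally legitimate message', p=1.0, length=20)  # appends up to ~20 random bytes, decoded as utf-8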

niacin.text.en.sentence.add_love(string: str, p: float = 0.1) → str[source]

Add love to the end of a sentence.

Appends ' love' to the end of a string. Including a word with large positive sentiment can be used to confuse sentiment-based filters for input data (arXiv:1808.0911).

Parameters:
  • string – text
  • p – probability of adding ‘ love’ to a sentence
Returns:

enriched text
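
A usage sketch; with p=1.0 the suffix should always be appended:

>>> from niacin.text.en import sentence
>>> sentence.add_love('this product broke after one day', p=1.0)  # 'this product broke after one day love'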

niacin.text.en.word module

Word-based functions for enriching English language data.

Importable functions include:

  • add_hypernyms
  • add_hyponyms
  • add_misspelling
  • add_parens
  • add_synonyms
  • remove_articles
  • swap_words
niacin.text.en.word.add_hypernyms(string: str, p: float = 0.01) → str[source]

Replace word with a higher-level category.

A common negative sampling technique involves replacing words in a sentence with a word that has the same general meaning, but is too general for the context, e.g.:

“all dogs go to heaven” -> “all quadrupeds go to place”

The replacement words are drawn from wordnet (wordnet). For words with more than one possible replacement, one is selected using random.choice.

Parameters:
  • string – text
  • p – conditional probability of replacing a word
Returns:

enriched text
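
A usage sketch of the example above; the actual hypernym is picked with random.choice, so the comment shows one possibility:

>>> from niacin.text.en import word
>>> word.add_hypernyms('all dogs go to heaven', p=1.0)  # e.g. 'all quadrupeds go to place'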

niacin.text.en.word.add_hyponyms(string: str, p: float = 0.01) → str[source]

Replace word with a lower-level category.

A common negative sampling technique involves replacing words in a sentence with a word that has the same general meaning, but is too specific for the context, e.g.:

“all dogs go to heaven” -> “all Australian shepherds go to heaven”

The replacement words are drawn from wordnet (wordnet). For words with more than one possible replacement, one is selected using random.choice.

Parameters:
  • string – text
  • p – conditional probability of replacing a word
Returns:

enriched text

niacin.text.en.word.add_misspelling(string: str, p: float = 0.1) → str[source]

Replace words with common misspellings.

Replaces a word with a common way that word is misspelled, given one or more known, common misspellings taken from the Wikipedia spelling correction corpus (wikipedia). For words with more than one common misspelling, one is chosen using random.choice.

Parameters:
  • string – text
  • p – conditional probability of replacing a word
Returns:

enriched text
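
A usage sketch; which misspelling is chosen depends on random.choice, so the comment is illustrative:

>>> from niacin.text.en import word
>>> word.add_misspelling('this is definitely an achievement', p=1.0)  # e.g. 'this is definately an acheivement'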

niacin.text.en.word.add_parens(string: str, p: float = 0.01) → str[source]

Wrap individual words in triple parentheses.

Adds parentheses before and after a word, e.g. (((term))). This is a common tactic for disrupting tokenizers and other kinds of word-based models.

Parameters:
  • string – text
  • p – probability of wrapping a word
Returns:

enriched text
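
A usage sketch; with p=1.0 every word should be wrapped:

>>> from niacin.text.en import word
>>> word.add_parens('conspiracy theories abound', p=1.0)  # e.g. '(((conspiracy))) (((theories))) (((abound)))'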

niacin.text.en.word.add_synonyms(string: str, p: float = 0.01) → str[source]

Replace word with one that has a close meaning.

A common data augmentation technique involves replacing words in a sentence with a word that has the same general meaning (arxiv:1509.01626), e.g.:

“all dogs go to heaven” -> “all domestic dog depart to heaven”

The replacement words are drawn from wordnet (wordnet). For words with more than one possible replacement, one is selected using random.choice.

Parameters:
  • string – text
  • p – conditional probability of replacing a word
Returns:

enriched text

niacin.text.en.word.remove_articles(string: str, p: float = 1.0) → str[source]

Remove articles from text data.

Matches and removes the following articles:

  • the
  • a
  • an
  • these
  • those
  • his
  • hers
  • their

with probability p.

Parameters:
  • string – text
  • p – probability of removing a given article
Returns:

enriched text
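
A usage sketch; the default p=1.0 removes every matched article (exact whitespace handling around removed tokens may differ from the comment):

>>> from niacin.text.en import word
>>> word.remove_articles('the cat chased a mouse into their garden')  # e.g. 'cat chased mouse into garden'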

niacin.text.en.word.swap_words(string: str, p: float = 0.01) → str[source]

Swap adjacent words.

With probability p, swap two adjacent words in a string. This preserves the vocabulary of input text while changing token order, and in theory should provide more of a challenge to recursive models than ones that rely on lexical distributions.

Note

To keep the interface consistent, niacin’s implementation acts on a probability p, applied n-1 times, where n is the total number of words in the string. In the original paper (eda), two words are chosen at random and swapped, and this is repeated n times, where n is a hyperparameter.

Parameters:
  • string – text
  • p – probability of swapping two words
Returns:

enriched text

Module contents

Functions for enriching English language data.

Includes transformations which operate on characters, words, and whole sentences. Importable functions include:

Character-based

  • add_characters
  • add_contractions
  • add_fat_thumbs
  • add_leet
  • add_macbook_keyboard
  • add_whitespace
  • remove_characters
  • remove_contractions
  • remove_punctuation
  • remove_whitespace
  • swap_chars

Word-based

  • add_hypernyms
  • add_hyponyms
  • add_misspelling
  • add_parens
  • add_synonyms
  • remove_articles
  • swap_words

Sentence-based

  • add_applause
  • add_backtranslation
  • add_bytes
  • add_love
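
A sketch of composing several of these transforms into a simple augmentation pipeline. The submodule imports follow the paths documented above; the particular p values and the use of random.seed for reproducibility are assumptions (the docstrings above note that replacements are chosen with random.choice, which suggests the stdlib random module):

>>> import random
>>> from niacin.text.en import char, word
>>>
>>> def augment(text: str) -> str:
...     # apply word-level augmentations first, then character-level noise
...     transforms = [
...         (word.add_synonyms, 0.3),
...         (word.add_misspelling, 0.1),
...         (char.add_fat_thumbs, 0.01),
...         (char.remove_punctuation, 0.25),
...     ]
...     for fn, p in transforms:
...         text = fn(text, p=p)
...     return text
...
>>> random.seed(0)  # assumption: niacin draws from the stdlib random module
>>> augment('All dogs go to heaven.')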