docdeid#

docdeid.annotation#

class docdeid.annotation.Annotation(text: str, start_char: int, end_char: int, tag: str, priority: int = 0, start_token: ~typing.Optional[~docdeid.tokenizer.Token] = None, end_token: ~typing.Optional[~docdeid.tokenizer.Token] = None, _key_cache: dict = <factory>)#

Bases: object

An annotation contains information on a specific span of text that is tagged.

text: str#

The exact text.

start_char: int#

The start character.

end_char: int#

The end character.

tag: str#

The tag (e.g. name, location).

priority: int = 0#

An additional priority attribute that can be used to resolve overlaps and merges.

start_token: Optional[Token] = None#

Optionally, the first Token in the sequence of tokens corresponding to this annotation.

Should only be used when the annotation starts on a token boundary.

end_token: Optional[Token] = None#

Optionally, the last Token in the sequence of tokens corresponding to this annotation.

Should only be used when the annotation ends on a token boundary.

__eq__(other)#

Return self==value.

length: int#

The number of characters of the annotation text.

get_sort_key(by: tuple, callbacks: Optional[frozendict[str, Callable]] = None, deterministic: bool = True) tuple#

The sort key of an Annotation is used to order annotations by one or more of its attributes.

Parameters:
  • by – A tuple of attribute names to sort by.

  • callbacks – A map from attribute names to callables that modify the value being sorted on (for example, lambda x: -x to reverse the order).

  • deterministic – Include all attributes in the sort key, so that ties are not broken randomly but deterministically.

Returns:

A tuple of the attributes specified, that can be passed to the key argument of the sorted function of (e.g.) list.
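The way by, callbacks and deterministic combine can be sketched as follows. This is a minimal illustration under assumptions, not the docdeid implementation: SketchAnnotation is a hypothetical stand-in class, the callbacks are passed as a plain dict rather than a frozendict, and the tie-breaking order of the remaining attributes is chosen for illustration only.

```python
from dataclasses import dataclass, fields
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class SketchAnnotation:
    """Hypothetical, simplified stand-in for Annotation."""

    text: str
    start_char: int
    end_char: int
    tag: str
    priority: int = 0

    def get_sort_key(
        self,
        by: tuple,
        callbacks: Optional[Mapping[str, Callable]] = None,
        deterministic: bool = True,
    ) -> tuple:
        callbacks = callbacks or {}
        key = []
        for attr in by:
            value = getattr(self, attr)
            # Apply the callback for this attribute, if one was given.
            if attr in callbacks:
                value = callbacks[attr](value)
            key.append(value)
        if deterministic:
            # Append all remaining attributes so that ties are broken
            # the same way on every run (order here is an assumption).
            key.extend(
                getattr(self, f.name) for f in fields(self) if f.name not in by
            )
        return tuple(key)


annotations = [
    SketchAnnotation("Jan", 0, 3, "name", priority=1),
    SketchAnnotation("Jan Jansen", 0, 10, "name", priority=2),
    SketchAnnotation("Utrecht", 20, 27, "location"),
]

# Sort by start position, then by descending priority.
ordered = sorted(
    annotations,
    key=lambda a: a.get_sort_key(
        by=("start_char", "priority"), callbacks={"priority": lambda x: -x}
    ),
)
ordered_texts = [a.text for a in ordered]
```

Negating priority in the callback puts the higher-priority annotation first among those that start at the same character.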

class docdeid.annotation.AnnotationSet#

Bases: set[Annotation]

Stores any number of annotations in a set.

It extends the builtin set.

sorted(by: tuple, callbacks: Optional[frozendict[str, Callable]] = None, deterministic: bool = True) list[docdeid.annotation.Annotation]#

Get the annotations in sorted order.

Parameters:
  • by – A tuple of Annotation attribute names to sort by.

  • callbacks – A map from Annotation attribute names to callables that modify the value being sorted on (for example, lambda x: -x to reverse the order).

  • deterministic – Include all attributes in the sort key, so that ties are not broken randomly but deterministically.

Returns:

A list with the annotations, sorted as specified.

Raises:

A RuntimeError, if the callbacks are not provided as a frozendict.

has_overlap() bool#

Check if the set of annotations has any overlapping annotations.

Returns:

True if overlapping annotations are found, False otherwise.
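One common way to detect overlap is to sort the spans by start position and compare each span with its successor. The sketch below illustrates that technique on plain (start_char, end_char) tuples; it assumes end_char is exclusive and is not taken from the docdeid source.

```python
from typing import Iterable, Tuple


def has_overlap(spans: Iterable[Tuple[int, int]]) -> bool:
    """Return True if any two (start_char, end_char) spans overlap.

    Assumes end_char is exclusive, so (0, 3) and (3, 5) do not overlap.
    """
    ordered = sorted(spans)
    # After sorting by start, any overlap shows up between neighbours.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return True
    return False
```

Checking only adjacent pairs is sufficient: if a span overlaps a later one, it also overlaps its immediate successor, which starts no later.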

docdeid.deidentifier#

class docdeid.deidentifier.DocDeid#

Bases: object

The main class used for de-identifying text.

This class contains one or more document processors in a DocProcessorGroup, which can be modified directly in DocDeid.processors. Additionally, it stores and passes any number of tokenizers in the DocDeid.tokenizers dictionary, also directly accessible.

tokenizers: dict[str, docdeid.tokenizer.Tokenizer]#

A dictionary of named Tokenizer objects, which are passed to the Document object.

If there is only one tokenizer, you may add it with the default key, so that it will be used when the Document.get_tokens() method is called without a tokenizer name.

processors: DocProcessorGroup#

The processors of this deidentifier, captured in a DocProcessorGroup.

Processors can be added or modified by interacting with this attribute directly.

deidentify(text: str, enabled: Optional[set[str]] = None, disabled: Optional[set[str]] = None, metadata: Optional[dict] = None) Document#

Main interface to de-identifying text.

Parameters:
  • text – The input text, that needs de-identification.

  • enabled – A set of processor names that should be executed for this text. Cannot be used with disabled.

  • disabled – A set of processor names that should not be executed for this text. Cannot be used with enabled.

  • metadata – A dictionary containing additional information on this text, that is accessible to processors.

Returns:

A Document with the relevant information (e.g. Document.annotations, Document.deidentified_text).
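The mutually exclusive enabled and disabled arguments can be pictured as a filter over the processor names. The helper below is a hypothetical sketch, not part of docdeid; in particular, the exception raised when both arguments are given is an assumption, since the reference only states that they cannot be combined.

```python
from typing import Iterable, Optional


def select_processors(
    available: Iterable[str],
    enabled: Optional[set] = None,
    disabled: Optional[set] = None,
) -> list:
    """Return the processor names to run, preserving their original order."""
    if enabled is not None and disabled is not None:
        # Exception type is an assumption made for this sketch.
        raise ValueError("Use either enabled or disabled, not both.")
    if enabled is not None:
        return [name for name in available if name in enabled]
    if disabled is not None:
        return [name for name in available if name not in disabled]
    return list(available)
```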

docdeid.document#

class docdeid.document.MetaData(items: Optional[dict] = None)#

Bases: object

Contains additional information on a text that is provided by the user on input. A MetaData object is kept with the text in a Document, where it can be accessed by document processors. Note that a MetaData object does not allow overwriting keys. This is done to prevent document processors accidentally interfering with each other.

Parameters:

items – A dict of items to initialize with.

__getitem__(key: str) Optional[Any]#

Get an item.

Parameters:

key – The key.

Returns:

The item, with None as default.

__setitem__(key: str, value: Any) None#

Add an item.

Parameters:
  • key – The key.

  • value – The value.

Raises:

RuntimeError – When the key is already present.
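The write-once behaviour described for MetaData can be sketched with a small wrapper around a dict. WriteOnceMetaData is a hypothetical name used for illustration; only the documented behaviour (None as default on lookup, RuntimeError on overwriting) is modelled.

```python
from typing import Any, Optional


class WriteOnceMetaData:
    """Hypothetical sketch of a mapping whose keys cannot be overwritten."""

    def __init__(self, items: Optional[dict] = None) -> None:
        self._items = dict(items or {})

    def __getitem__(self, key: str) -> Optional[Any]:
        # Missing keys yield None instead of raising KeyError.
        return self._items.get(key)

    def __setitem__(self, key: str, value: Any) -> None:
        if key in self._items:
            raise RuntimeError(f"Key {key!r} is already present.")
        self._items[key] = value
```

Refusing to overwrite means that two processors writing to the same key fail loudly instead of silently replacing each other's values.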

class docdeid.document.Document(text: str, tokenizers: Optional[dict[str, docdeid.tokenizer.Tokenizer]] = None, metadata: Optional[dict] = None)#

Bases: object

Contains the text, its tokens, and other derived info after document processors have been applied to it.

Parameters:
  • text – The input text

  • tokenizers – A mapping of tokenizer names to Tokenizer. If only one tokenizer is used, default may be used as name to allow Document.get_tokens() to be called without a tokenizer name.

  • metadata – A dict with items, that can be accessed by document processors. Will be stored in a MetaData object.

metadata#

The MetaData of this Document, that can be interacted with directly.

property text: str#

The document text.

Returns:

The original and unmodified text.

get_tokens(tokenizer_name: str = 'default') TokenList#

Get the tokens corresponding to the input text, for a specific tokenizer.

Parameters:

tokenizer_name – The name of the tokenizer, which should be one of the tokenizers passed when initializing the Document.

Returns:

A TokenList containing the requested tokens.

Raises:
  • RuntimeError – If no tokenizers are initialized.

  • ValueError – If the requested tokenizer is unknown.
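The lookup and error behaviour of get_tokens() can be sketched as below. SketchDocument is a hypothetical stand-in in which tokenizers are plain callables returning lists of strings; caching the result per tokenizer name is an assumption of this sketch, not something the reference states.

```python
from typing import Optional


class SketchDocument:
    """Hypothetical stand-in that models only the get_tokens() lookup."""

    def __init__(self, text: str, tokenizers: Optional[dict] = None) -> None:
        self._text = text
        self._tokenizers = tokenizers
        self._token_cache: dict = {}

    def get_tokens(self, tokenizer_name: str = "default") -> list:
        if self._tokenizers is None:
            raise RuntimeError("No tokenizers initialized.")
        if tokenizer_name not in self._tokenizers:
            raise ValueError(f"Unknown tokenizer: {tokenizer_name!r}")
        # Tokenize on first request only (caching is an assumption).
        if tokenizer_name not in self._token_cache:
            tokenizer = self._tokenizers[tokenizer_name]
            self._token_cache[tokenizer_name] = tokenizer(self._text)
        return self._token_cache[tokenizer_name]


# Registering a single tokenizer under "default" allows calling
# get_tokens() without a name.
doc = SketchDocument("Jan woont in Utrecht", tokenizers={"default": str.split})
tokens = doc.get_tokens()
```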

property annotations: AnnotationSet#

Get the annotations.

Returns:

An AnnotationSet containing the annotations belonging to this document.

property deidentified_text: Optional[str]#

Get the deidentified text.

Returns:

The deidentified text, if set by a document processor (else None).

set_deidentified_text(deidentified_text: str) None#

Set the deidentified text.

Parameters:

deidentified_text – The deidentified text.

docdeid.pattern#

class docdeid.pattern.TokenPattern(tag: str)#

Bases: ABC

A pattern that can be applied to a token, and possibly its neighbours, by matching the Token text.

Parameters:

tag – The tag that the annotations should be tagged with.

doc_precondition(doc: Document) bool#

Use this to check if the pattern is applicable to a document (e.g. check if some piece of metadata is included). By default, returns True.

Parameters:

doc – The Document the pattern will be applied to.

Returns:

True if applicable, False otherwise.

token_precondition(token: Token) bool#

Use this to check if the pattern is applicable to a token (e.g. check if it has neighbours). By default, returns True.

Parameters:

token – The Token the pattern will be applied to.

Returns:

True if applicable, False otherwise.

abstract match(token: Token, metadata: MetaData) Optional[tuple[docdeid.tokenizer.Token, docdeid.tokenizer.Token]]#

Check if the token provided matches this pattern. Subclasses of TokenPattern should implement the logic of the pattern in this method, for example by checking if the text is lowercase, titlecase, longer than a certain number of characters, etc. The Token neighbours may be accessible by Token.previous() and Token.next(), if linked by the tokenizer.

Parameters:
  • token – The token.

  • metadata – The metadata.

Returns:

A tuple with the start and end Token if matching, or None if no match is possible.
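A pattern that inspects a token and its right neighbour might look like the sketch below. To keep the example self-contained, it uses hypothetical stand-ins (SketchToken, SketchTokenPattern) that mirror only the methods described above; the title abbreviations and the pattern itself are invented for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SketchToken:
    """Hypothetical stand-in for Token, linked only to its right neighbour."""

    text: str
    next_token: Optional["SketchToken"] = None

    def next(self) -> Optional["SketchToken"]:
        return self.next_token


class SketchTokenPattern(ABC):
    """Hypothetical stand-in for the TokenPattern base class."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def token_precondition(self, token: SketchToken) -> bool:
        return True

    @abstractmethod
    def match(
        self, token: SketchToken, metadata: object
    ) -> Optional[Tuple[SketchToken, SketchToken]]:
        ...


class TitleThenNamePattern(SketchTokenPattern):
    """Matches a title abbreviation followed by a titlecase token."""

    def token_precondition(self, token: SketchToken) -> bool:
        # The pattern needs a right neighbour to look at.
        return token.next() is not None

    def match(self, token, metadata):
        following = token.next()
        if token.text.lower() in {"dr", "prof"} and following.text.istitle():
            # Return the first and last token of the matched span.
            return token, following
        return None
```

In this sketch, token_precondition is meant to be checked before match is called, so that match can rely on the right neighbour being present.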

docdeid.tokenizer#

class docdeid.tokenizer.Token(text: str, start_char: int, end_char: int, _previous_token: Optional[Token] = None, _next_token: Optional[Token] = None)#

Bases: object

A token is an atomic part of a text, as determined by a tokenizer.

text: str#

The text.

start_char: int#

The start char.

end_char: int#

The end char.

set_previous_token(token: Optional[Token]) None#

Set the previous token, in a linked list fashion.

Parameters:

token – The previous Token.

set_next_token(token: Optional[Token]) None#

Set the next token, in a linked list fashion.

Parameters:

token – The next Token.

previous(num: int = 1) Optional[Token]#

Get the previous Token, if any.

Parameters:

num – Searches the num-th token to the left.

Returns:

The token num positions to the left, if any, or None otherwise.

next(num: int = 1) Optional[Token]#

Get the next Token, if any.

Parameters:

num – Searches the num-th token to the right.

Returns:

The token num positions to the right, if any, or None otherwise.
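The linked-list navigation described by previous() and next() can be sketched as follows. LinkedToken and link() are hypothetical names for this illustration; the sketch only assumes that each token stores a reference to its direct neighbours.

```python
from typing import List, Optional


class LinkedToken:
    """Hypothetical sketch of a token with links to its neighbours."""

    def __init__(self, text: str, start_char: int, end_char: int) -> None:
        self.text = text
        self.start_char = start_char
        self.end_char = end_char
        self._previous_token: Optional["LinkedToken"] = None
        self._next_token: Optional["LinkedToken"] = None

    def set_previous_token(self, token: Optional["LinkedToken"]) -> None:
        self._previous_token = token

    def set_next_token(self, token: Optional["LinkedToken"]) -> None:
        self._next_token = token

    def _walk(self, attr: str, num: int) -> Optional["LinkedToken"]:
        current: Optional["LinkedToken"] = self
        for _ in range(num):
            current = getattr(current, attr)
            if current is None:
                # Ran off the end of the list before taking num steps.
                return None
        return current

    def previous(self, num: int = 1) -> Optional["LinkedToken"]:
        return self._walk("_previous_token", num)

    def next(self, num: int = 1) -> Optional["LinkedToken"]:
        return self._walk("_next_token", num)


def link(tokens: List[LinkedToken]) -> List[LinkedToken]:
    """Link each token to its direct neighbours, in list order."""
    for left, right in zip(tokens, tokens[1:]):
        left.set_next_token(right)
        right.set_previous_token(left)
    return tokens
```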

__len__() int#

The length of the text.

Returns:

The length of the text.

__eq__(other)#

Return self==value.

class docdeid.tokenizer.TokenList(tokens: list[docdeid.tokenizer.Token], link_tokens: bool = True)#

Bases: object

Contains a sequence of tokens, along with some lookup logic.

Parameters:

tokens – The input tokens (must be final).

token_index(token: Token) int#

Find the token index in this list, i.e. its nominal position in the list.

Parameters:

token – The input token.

Returns:

The index in this TokenList.

get_words(matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None) set[str]#

Get all unique words (i.e. token texts) in this TokenList. Evaluates lazily.

Parameters:

matching_pipeline – The matching pipeline to apply.

Returns:

All the words in this TokenList as a set of strings.

token_lookup(lookup_values: set[str], matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None) set[docdeid.tokenizer.Token]#

Lookup all tokens of which the text matches a certain set of lookup values. Evaluates lazily.

Parameters:
  • lookup_values – The set of lookup values to match the token text against.

  • matching_pipeline – The matching pipeline.

Returns:

A set of Token, of which the text matches one of the lookup values.
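A lazily built lookup of this kind can be sketched with an index from (modified) token text to token positions, built on first use. This is an illustration under assumptions: SketchTokenList is hypothetical, the matching pipeline is modelled as a tuple of plain str-to-str callables rather than StringModifier objects, and the lookup returns positions rather than Token objects.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set, Tuple


class SketchTokenList:
    """Hypothetical sketch of a token list with a lazily built text index."""

    def __init__(self, texts: Iterable[str]) -> None:
        self._texts = list(texts)
        # One index per matching pipeline, built only when first requested.
        self._index_cache: Dict[tuple, Dict[str, Set[int]]] = {}

    def _index(self, pipeline: Tuple[Callable[[str], str], ...]) -> Dict[str, Set[int]]:
        key = tuple(pipeline)
        if key not in self._index_cache:
            index: Dict[str, Set[int]] = defaultdict(set)
            for position, text in enumerate(self._texts):
                for modifier in pipeline:
                    text = modifier(text)
                index[text].add(position)
            self._index_cache[key] = index
        return self._index_cache[key]

    def get_words(self, pipeline: tuple = ()) -> Set[str]:
        return set(self._index(pipeline))

    def token_lookup(self, lookup_values: Set[str], pipeline: tuple = ()) -> Set[int]:
        index = self._index(pipeline)
        positions: Set[int] = set()
        # Intersect first so that only matching words are visited.
        for value in set(lookup_values) & set(index):
            positions |= index[value]
        return positions
```

With an index in place, the cost of a lookup depends on the number of distinct words rather than on the number of tokens.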

__len__() int#
__getitem__(index: int) Token#
__eq__(other: object) bool#

Check if two TokenList are equal, by matching the tokens.

Parameters:

other – The other TokenList.

Returns:

True if the tokens exactly match, False otherwise. Does not check text equality but Token equality.

Raises:
  • ValueError – When trying to check equality of something different from a TokenList.

class docdeid.tokenizer.Tokenizer(link_tokens: bool = True)#

Bases: ABC

Abstract class for tokenizers, which split a text into its smallest parts, called tokens. Subclasses should implement Tokenizer._split_text().

Parameters:

link_tokens – Whether the produced TokenList should link the tokens.

tokenize(text: str) TokenList#

Tokenize a text, based on the logic implemented in Tokenizer._split_text().

Parameters:

text – The input text.

Returns:

A TokenList, containing the created tokens.

class docdeid.tokenizer.SpaceSplitTokenizer(link_tokens: bool = True)#

Bases: Tokenizer

Tokenizes based on splitting on whitespaces.

Whitespaces themselves are not included as tokens.

class docdeid.tokenizer.WordBoundaryTokenizer(link_tokens: bool = True)#

Bases: Tokenizer

Tokenizes based on word boundary.

Whitespaces and similar characters are included as tokens.
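The difference between the two splitting strategies can be illustrated with the standard re module. The functions below are a sketch, not the docdeid tokenizers: they return (text, start_char, end_char) tuples, and the exact boundary rules of the library may differ.

```python
import re
from typing import List, Tuple

Piece = Tuple[str, int, int]  # (text, start_char, end_char), end exclusive


def space_split(text: str) -> List[Piece]:
    """Split on whitespace; the whitespace itself is dropped."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def word_boundary_split(text: str) -> List[Piece]:
    """Split at every word boundary; whitespace and punctuation are kept."""
    boundaries = sorted(
        {0, len(text)} | {m.start() for m in re.finditer(r"\b", text)}
    )
    return [(text[a:b], a, b) for a, b in zip(boundaries, boundaries[1:])]


space_pieces = space_split("Hello, world")
boundary_pieces = word_boundary_split("Hello, world")
# space_pieces    -> [("Hello,", 0, 6), ("world", 7, 12)]
# boundary_pieces -> [("Hello", 0, 5), (", ", 5, 7), ("world", 7, 12)]
```

Splitting at word boundaries keeps every character of the input in some piece, so the pieces can be joined back into the original text.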

docdeid.utils#

docdeid.utils.annotate_intext(doc: Document) str#

Annotate in-text, which can be useful for comparing the annotations of two different runs. This function replaces each piece of annotated text of a document with <TAG>text</TAG>.

Parameters:
  • doc – The Document as input, containing a text and zero or more annotations.

Returns:

A string with each annotated span replaced with <TAG>text</TAG>.
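The replacement can be sketched on plain data as below. This is an illustration under assumptions rather than the docdeid implementation: annotations are given as (start_char, end_char, tag) tuples with exclusive end and no overlap, and uppercasing the tag is a choice made for this sketch.

```python
from typing import Iterable, Tuple


def annotate_intext_sketch(
    text: str, annotations: Iterable[Tuple[int, int, str]]
) -> str:
    """Wrap each annotated span in <TAG>...</TAG> markers.

    Assumes non-overlapping (start_char, end_char, tag) tuples with an
    exclusive end_char. Uppercasing the tag is an assumption.
    """
    result = text
    # Work from right to left so earlier offsets stay valid after insertion.
    for start, end, tag in sorted(annotations, reverse=True):
        label = tag.upper()
        result = f"{result[:start]}<{label}>{result[start:end]}</{label}>{result[end:]}"
    return result


annotated = annotate_intext_sketch(
    "Jan woont in Utrecht", [(0, 3, "name"), (13, 20, "location")]
)
```

Comparing two such strings with a plain text diff shows where the annotations of two runs differ.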