docdeid#
- docdeid.ds
- docdeid.process
- docdeid.str
docdeid.annotation#
- class docdeid.annotation.Annotation(text: str, start_char: int, end_char: int, tag: str, priority: int = 0, start_token: ~typing.Optional[~docdeid.tokenizer.Token] = None, end_token: ~typing.Optional[~docdeid.tokenizer.Token] = None, _key_cache: dict = <factory>)#
Bases: object
An annotation contains information on a specific span of text that is tagged.
- text: str#
The exact text.
- start_char: int#
The start character.
- end_char: int#
The end character.
- tag: str#
The tag (e.g. name, location).
- priority: int = 0#
An additional priority attribute, that can be used for resolving overlap/merges.
- start_token: Optional[Token] = None#
Optionally, the first Token in the sequence of tokens corresponding to this annotation. Should only be used when the annotation starts on a token boundary.
- end_token: Optional[Token] = None#
Optionally, the last Token in the sequence of tokens corresponding to this annotation. Should only be used when the annotation ends on a token boundary.
- __eq__(other)#
Return self==value.
- length: int#
The number of characters of the annotation text.
- get_sort_key(by: tuple, callbacks: Optional[frozendict[str, Callable]] = None, deterministic: bool = True) tuple#
The sort key of an Annotation is used to order annotations by one or more of its attributes.
- Parameters:
by – A tuple of attribute names, used for sorting.
callbacks – A map of attributes to a callable function, used to modify the value that is sorted on (for example lambda x: -x for reversing).
deterministic – Include all attributes in the sort key, so that ties are not broken randomly but deterministically.
- Returns:
A tuple of the attributes specified, which can be passed as the key argument to a sorting function (e.g. sorted or list.sort).
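The sort key description above can be illustrated with a small stand-alone sketch. It does not import docdeid: `SpanSketch` and `sort_key_sketch` are hypothetical stand-ins written for this example, and they accept a plain mapping for `callbacks`, whereas the documented signature expects a `frozendict`. How the library orders the extra tie-breaking attributes is not stated here, so the alphabetical order used below is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in with a subset of the Annotation fields."""

    text: str
    start_char: int
    end_char: int
    tag: str
    priority: int = 0


def sort_key_sketch(
    span: SpanSketch,
    by: tuple,
    callbacks: Optional[Mapping[str, Callable]] = None,
    deterministic: bool = True,
) -> tuple:
    """Build a tuple usable as the ``key`` argument of ``sorted``."""
    callbacks = callbacks or {}
    key = []
    for attr in by:
        value = getattr(span, attr)
        # A callback transforms the value before it is compared.
        if attr in callbacks:
            value = callbacks[attr](value)
        key.append(value)
    if deterministic:
        # Append the remaining fields (assumed order: alphabetical) so that
        # ties always break the same way.
        rest = sorted(set(vars(span)) - set(by))
        key.extend(getattr(span, attr) for attr in rest)
    return tuple(key)


spans = [
    SpanSketch("Jan", 0, 3, "name", priority=1),
    SpanSketch("Utrecht", 10, 17, "location"),
    SpanSketch("Jansen", 4, 10, "name", priority=2),
]

# Highest priority first, then by position in the text.
ordered = sorted(
    spans,
    key=lambda s: sort_key_sketch(
        s, by=("priority", "start_char"), callbacks={"priority": lambda x: -x}
    ),
)
print([s.text for s in ordered])  # ['Jansen', 'Jan', 'Utrecht']
```

Negating the priority in a callback keeps a single ascending sort while still placing high-priority spans first.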
- class docdeid.annotation.AnnotationSet#
Bases: set[Annotation]
Stores any number of annotations in a set. It extends the builtin set.
- sorted(by: tuple, callbacks: Optional[frozendict[str, Callable]] = None, deterministic: bool = True) list[docdeid.annotation.Annotation]#
Get the annotations in sorted order.
- Parameters:
by – A tuple of Annotation attribute names, used for sorting.
callbacks – A map of Annotation attributes to a callable function, used to modify the value that is sorted on (for example lambda x: -x for reversing).
deterministic – Include all attributes in the sort key, so that ties are not broken randomly but deterministically.
- Returns:
A list with the annotations, sorted as specified.
- Raises:
RuntimeError – If the callbacks are not provided as a frozen dict.
- has_overlap() bool#
Check if the set of annotations has any overlapping annotations.
- Returns:
True if overlapping annotations are found, False otherwise.
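One common way to perform such an overlap check is to sort the spans by start position and compare each span with its successor. The sketch below is a stand-alone illustration and not necessarily how docdeid implements `has_overlap()`. It works on plain `(start_char, end_char)` tuples and assumes that `end_char` is exclusive and that spans are non-empty.

```python
def has_overlap_sketch(spans: list[tuple[int, int]]) -> bool:
    """Return True if any two (start_char, end_char) spans share a character."""
    ordered = sorted(spans)
    # After sorting by start, an overlap exists only if some span starts
    # before the span directly preceding it has ended.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return True
    return False


print(has_overlap_sketch([(0, 3), (4, 10)]))  # False
print(has_overlap_sketch([(0, 5), (3, 8)]))   # True
print(has_overlap_sketch([(0, 3), (3, 6)]))   # False: the spans only touch
```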
docdeid.deidentifier#
- class docdeid.deidentifier.DocDeid#
Bases: object
The main class used for de-identifying text.
This class contains one or more document processors in a DocProcessorGroup, which can be modified directly in DocDeid.processors. Additionally, it stores any number of tokenizers in the DocDeid.tokenizers dictionary, which is also directly accessible, and passes them to each Document.
- tokenizers: dict[str, docdeid.tokenizer.Tokenizer]#
A dictionary of named Tokenizer objects that are passed to the Document object. If there is only one tokenizer, you may add it with the default key, so that it will be used when the Document.get_tokens() method is called without a tokenizer name.
- processors: DocProcessorGroup#
The processors of this deidentifier, captured in a DocProcessorGroup. Processors can be added or modified by interacting with this attribute directly.
- deidentify(text: str, enabled: Optional[set[str]] = None, disabled: Optional[set[str]] = None, metadata: Optional[dict] = None) Document#
Main interface to de-identifying text.
- Parameters:
text – The input text, that needs de-identification.
enabled – A set of processor names that should be executed for this text. Cannot be used with disabled.
disabled – A set of processor names that should not be executed for this text. Cannot be used with enabled.
metadata – A dictionary containing additional information on this text, that is accessible to processors.
- Returns:
A Document with the relevant information (e.g. Document.annotations, Document.deidentified_text).
docdeid.document#
- class docdeid.document.MetaData(items: Optional[dict] = None)#
Bases: object
Contains additional information on a text that is provided by the user on input. A MetaData object is kept with the text in a Document, where it can be accessed by document processors. Note that a MetaData object does not allow overwriting keys. This is done to prevent document processors accidentally interfering with each other.
- Parameters:
items – A dict of items to initialize with.
- __getitem__(key: str) Optional[Any]#
Get an item.
- Parameters:
key – The key.
- Returns:
The item, with None as default.
- __setitem__(key: str, value: Any) None#
Add an item.
- Parameters:
key – The key.
value – The value.
- Raises:
RuntimeError – When the key is already present.
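The write-once behaviour described for MetaData can be sketched as follows. `WriteOnceStore` is a hypothetical stand-in that mirrors only what is documented above (a `None` default on lookup, a `RuntimeError` on overwrite). It is not the library's implementation, and the error message is invented for the example.

```python
from typing import Any, Optional


class WriteOnceStore:
    """Hypothetical stand-in mirroring the documented MetaData behaviour."""

    def __init__(self, items: Optional[dict] = None) -> None:
        self._items = dict(items or {})

    def __getitem__(self, key: str) -> Optional[Any]:
        # Missing keys yield None instead of raising KeyError.
        return self._items.get(key)

    def __setitem__(self, key: str, value: Any) -> None:
        # Refuse to overwrite, so processors cannot clobber each other's data.
        if key in self._items:
            raise RuntimeError(f"Key {key!r} is already present.")
        self._items[key] = value


store = WriteOnceStore({"patient_name": "Jan Jansen"})
store["ward"] = "cardiology"
print(store["ward"])     # cardiology
print(store["missing"])  # None

try:
    store["ward"] = "oncology"
except RuntimeError as err:
    print(err)  # Key 'ward' is already present.
```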
- class docdeid.document.Document(text: str, tokenizers: Optional[dict[str, docdeid.tokenizer.Tokenizer]] = None, metadata: Optional[dict] = None)#
Bases: object
Contains the text, its tokens, and other derived info after document processors have been applied to it.
- Parameters:
text – The input text.
tokenizers – A mapping of tokenizer names to Tokenizer. If only one tokenizer is used, default may be used as its name to allow Document.get_tokens() to be called without a tokenizer name.
metadata – A dict with items that can be accessed by document processors. Will be stored in a MetaData object.
- property text: str#
The document text.
- Returns:
The original and unmodified text.
- get_tokens(tokenizer_name: str = 'default') TokenList#
Get the tokens corresponding to the input text, for a specific tokenizer.
- Parameters:
tokenizer_name – The name of the tokenizer, which should be one of the tokenizers passed when initializing the Document.
- Returns:
A TokenList containing the requested tokens.
- Raises:
RuntimeError – If no tokenizers are initialized.
ValueError – If the requested tokenizer is unknown.
- property annotations: AnnotationSet#
Get the annotations.
- Returns:
An AnnotationSet containing the annotations belonging to this document.
- property deidentified_text: Optional[str]#
Get the deidentified text.
- Returns:
The deidentified text, if set by a document processor (else None).
- set_deidentified_text(deidentified_text: str) None#
Set the deidentified text.
- Parameters:
deidentified_text – The deidentified text.
docdeid.pattern#
- class docdeid.pattern.TokenPattern(tag: str)#
Bases: ABC
A pattern that can be applied to a token, and possibly its neighbours, by matching the Token text.
- Parameters:
tag – The tag that the annotations should be tagged with.
- doc_precondition(doc: Document) bool#
Use this to check if the pattern is applicable to a document (e.g. check if some piece of metadata is included). By default, returns True.
- Parameters:
doc – The Document the pattern will be applied to.
- Returns:
True if applicable, False otherwise.
- token_precondition(token: Token) bool#
Use this to check if the pattern is applicable to a token (e.g. check if it has neighbours). By default, returns True.
- Parameters:
token – The Token the pattern will be applied to.
- Returns:
True if applicable, False otherwise.
- abstract match(token: Token, metadata: MetaData) Optional[tuple[docdeid.tokenizer.Token, docdeid.tokenizer.Token]]#
Check if the token provided matches this pattern. Subclasses of TokenPattern should implement the logic of the pattern in this method, for example by checking if the text is lowercase, titlecase, longer than a certain number of characters, etc. The Token neighbours may be accessible through Token.previous() and Token.next(), if linked by the tokenizer.
- Parameters:
token – The token.
metadata – The metadata.
- Returns:
A tuple with the start and end Token if matching, or None if no match is possible.
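To illustrate how a precondition and a match method can work together, here is a stand-alone sketch that does not import docdeid. `Tok`, `PatternSketch` and `TitleThenCapitalised` are hypothetical names made up for this example. The token stand-in only offers `text` and `next()`, and the metadata is passed as a plain dict.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:
    """Hypothetical minimal token: text plus a link to the following token."""

    text: str
    following: Optional["Tok"] = None

    def next(self) -> Optional["Tok"]:
        return self.following


class PatternSketch(ABC):
    """Hypothetical base class shaped like the interface described above."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def token_precondition(self, token: Tok) -> bool:
        return True

    @abstractmethod
    def match(self, token: Tok, metadata: dict) -> Optional[tuple[Tok, Tok]]:
        """Return (start token, end token) on a match, else None."""


class TitleThenCapitalised(PatternSketch):
    """Matches a title such as 'dr.' followed by a capitalised word."""

    TITLES = {"dr.", "prof."}

    def token_precondition(self, token):
        # The pattern needs a right-hand neighbour to look at.
        return token.next() is not None

    def match(self, token, metadata):
        following = token.next()
        if token.text.lower() in self.TITLES and following.text[:1].isupper():
            return token, following
        return None


jansen = Tok("Jansen")
dr = Tok("dr.", following=jansen)
pattern = TitleThenCapitalised(tag="name")

for token in (dr, jansen):
    # Only call match() when the precondition holds.
    if pattern.token_precondition(token):
        result = pattern.match(token, metadata={})
        if result is not None:
            start, end = result
            print(pattern.tag, start.text, end.text)  # name dr. Jansen
```

Checking for a neighbour in the precondition keeps `match()` free of `None` handling.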
docdeid.tokenizer#
- class docdeid.tokenizer.Token(text: str, start_char: int, end_char: int, _previous_token: Optional[Token] = None, _next_token: Optional[Token] = None)#
Bases: object
A token is an atomic part of a text, as determined by a tokenizer.
- text: str#
The text.
- start_char: int#
The start char.
- end_char: int#
The end char.
- set_previous_token(token: Optional[Token]) None#
Set the previous token, in a linked list fashion.
- Parameters:
token – The previous Token.
- set_next_token(token: Optional[Token]) None#
Set the next token, in a linked list fashion.
- Parameters:
token – The next Token.
- previous(num: int = 1) Optional[Token]#
Get the previous Token, if any.
- Parameters:
num – Searches the num-th token to the left.
- Returns:
The token num positions to the left, if any, or None otherwise.
- next(num: int = 1) Optional[Token]#
Get the next Token, if any.
- Parameters:
num – Searches the num-th token to the right.
- Returns:
The token num positions to the right, if any, or None otherwise.
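The linked-list navigation described for `previous()` and `next()` can be sketched without importing docdeid. `LinkedTokenSketch` and `link` are hypothetical stand-ins; the whitespace splitting below is only a convenient way to obtain some tokens and is not a docdeid tokenizer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LinkedTokenSketch:
    """Hypothetical stand-in for a token that knows its neighbours."""

    text: str
    start_char: int
    end_char: int
    _previous: Optional["LinkedTokenSketch"] = field(default=None, repr=False, compare=False)
    _next: Optional["LinkedTokenSketch"] = field(default=None, repr=False, compare=False)

    def previous(self, num: int = 1) -> Optional["LinkedTokenSketch"]:
        return self._walk("_previous", num)

    def next(self, num: int = 1) -> Optional["LinkedTokenSketch"]:
        return self._walk("_next", num)

    def _walk(self, link_name: str, num: int) -> Optional["LinkedTokenSketch"]:
        # Follow the chosen link num times, stopping early at either end.
        token = self
        for _ in range(num):
            token = getattr(token, link_name)
            if token is None:
                return None
        return token


def link(tokens: list) -> list:
    """Connect each token to its left and right neighbour."""
    for left, right in zip(tokens, tokens[1:]):
        left._next = right
        right._previous = left
    return tokens


text = "Jan Jansen was admitted"
tokens, pos = [], 0
for word in text.split():
    start = text.index(word, pos)
    tokens.append(LinkedTokenSketch(word, start, start + len(word)))
    pos = start + len(word)
link(tokens)

print(tokens[0].next().text)       # Jansen
print(tokens[0].next(3).text)      # admitted
print(tokens[0].next(4))           # None
print(tokens[2].previous(2).text)  # Jan
```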
- __len__() int#
The length of the text.
- Returns:
The length of the text.
- __eq__(other)#
Return self==value.
- class docdeid.tokenizer.TokenList(tokens: list[docdeid.tokenizer.Token], link_tokens: bool = True)#
Bases: object
Contains a sequence of tokens, along with some lookup logic.
- Parameters:
tokens – The input tokens (must be final).
- token_index(token: Token) int#
Find the token index in this list, i.e. its nominal position in the list.
- Parameters:
token – The input token.
- Returns:
The index in this TokenList.
- get_words(matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None) set[str]#
Get all unique words (i.e. token texts) in this TokenList. Evaluates lazily.
- Parameters:
matching_pipeline – The matching pipeline to apply.
- Returns:
All the words in this TokenList as a set of strings.
- token_lookup(lookup_values: set[str], matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None) set[docdeid.tokenizer.Token]#
Lookup all tokens of which the text matches a certain set of lookup values. Evaluates lazily.
- Parameters:
lookup_values – The set of lookup values to match the token text against.
matching_pipeline – The matching pipeline.
- Returns:
A set of Token, of which the text matches one of the lookup values.
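The lookup with a matching pipeline can be pictured with the stand-alone sketch below. It makes two simplifying assumptions: tokens are plain `(text, start_char)` tuples, and the pipeline is a list of `str -> str` callables rather than docdeid `StringModifier` objects. `normalise` and `token_lookup_sketch` are hypothetical helper names.

```python
from typing import Callable, Optional


def normalise(text: str, pipeline: Optional[list[Callable[[str], str]]]) -> str:
    """Apply each pipeline step in order to the token text."""
    for step in pipeline or []:
        text = step(text)
    return text


def token_lookup_sketch(
    tokens: list[tuple[str, int]],
    lookup_values: set[str],
    matching_pipeline: Optional[list[Callable[[str], str]]] = None,
) -> set[tuple[str, int]]:
    """Return the tokens whose (normalised) text is in lookup_values."""
    return {
        token
        for token in tokens
        if normalise(token[0], matching_pipeline) in lookup_values
    }


tokens = [("Jan", 0), ("visited", 4), ("UTRECHT", 12)]
lookup_values = {"jan", "utrecht"}

# Without a pipeline the comparison is exact, so nothing matches.
print(token_lookup_sketch(tokens, lookup_values))  # set()

# Lowercasing the token text first makes both names match.
matches = token_lookup_sketch(tokens, lookup_values, [str.lower])
print(sorted(matches, key=lambda t: t[1]))  # [('Jan', 0), ('UTRECHT', 12)]
```

The matched tokens keep their original text. Only the comparison uses the normalised form.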
- __len__() int#
- class docdeid.tokenizer.Tokenizer(link_tokens: bool = True)#
Bases: ABC
Abstract class for tokenizers, which split a text up into its smallest parts, called tokens. Subclasses should implement Tokenizer._split_text().
- Parameters:
link_tokens – Whether the produced TokenList should link the tokens.
docdeid.utils#
- docdeid.utils.annotate_intext(doc: Document) str#
Annotate in-text, which can be useful to compare the annotations of two different runs. This function replaces each piece of annotated text of a document with <TAG>text</TAG>.
- Parameters:
doc – The Document as input, containing a text and zero or more annotations.
- Returns:
A string with each annotated span replaced with <TAG>text</TAG>.
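The replacement described above can be sketched on plain data as follows. `annotate_intext_sketch` is a hypothetical helper, not the library function. It takes `(start_char, end_char, tag)` tuples instead of a Document, assumes the spans do not overlap, and assumes the tag is written in upper case inside the angle brackets.

```python
def annotate_intext_sketch(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Wrap each (start_char, end_char, tag) span in <TAG>...</TAG>."""
    # Work from right to left so that earlier offsets stay valid
    # after each replacement.
    for start, end, tag in sorted(spans, reverse=True):
        label = tag.upper()  # assumption: tags are rendered in upper case
        text = f"{text[:start]}<{label}>{text[start:end]}</{label}>{text[end:]}"
    return text


text = "Jan Jansen lives in Utrecht"
spans = [(0, 10, "name"), (20, 27, "location")]
print(annotate_intext_sketch(text, spans))
# <NAME>Jan Jansen</NAME> lives in <LOCATION>Utrecht</LOCATION>
```

Because the annotated strings of two runs are plain text, they can be compared with any ordinary diff tool.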