docdeid.process#

docdeid.process.annotation_processor#

class docdeid.process.annotation_processor.AnnotationProcessor#

Bases: DocProcessor, ABC

Processes an AnnotationSet.

process(doc: Document, **kwargs) None#

Process the document.

Parameters:
  • doc – The document as input.

  • **kwargs – Any other settings.

abstract process_annotations(annotations: AnnotationSet, text: str) AnnotationSet#

Process an AnnotationSet.

Parameters:
  • annotations – The input AnnotationSet.

  • text – The corresponding text.

Returns:

An AnnotationSet that is processed according to the class logic.

class docdeid.process.annotation_processor.OverlapResolver(sort_by: tuple, sort_by_callbacks: Optional[frozendict[str, Callable]] = None)#

Bases: AnnotationProcessor

Resolves overlap in an AnnotationSet, if any. Use the sort_by and sort_by_callbacks arguments to specify how overlap should be resolved. For instance, sort_by=('start_char',) resolves overlap from left to right, while sort_by=('length',) resolves it from short to long.

Parameters:
  • sort_by – A tuple of Annotation attribute names to use for sorting.

  • sort_by_callbacks – A mapping from class attribute (by string) to a callable, to influence sort order (e.g. reverse with lambda x: -x).
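The effect of sort_by and sort_by_callbacks can be illustrated with a small self-contained sketch. It does not use docdeid itself: Span and resolve_overlap are hypothetical stand-ins, and the greedy strategy (keep annotations in sort order, drop any that overlap one already kept) is an assumption about how such a resolver could work, not a copy of the library's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation; not docdeid's Annotation class."""

    start_char: int
    end_char: int
    tag: str

    @property
    def length(self) -> int:
        return self.end_char - self.start_char


def resolve_overlap(spans, sort_by, sort_by_callbacks=None):
    """Keep spans greedily in sort order, dropping any that overlap a kept span."""
    callbacks = sort_by_callbacks or {}

    def key(span):
        # Apply the optional per-attribute callback, e.g. lambda x: -x to reverse.
        return tuple(
            callbacks.get(attr, lambda x: x)(getattr(span, attr)) for attr in sort_by
        )

    kept = []
    for span in sorted(spans, key=key):
        if all(span.end_char <= k.start_char or span.start_char >= k.end_char for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start_char)


spans = [Span(0, 4, "name"), Span(2, 10, "location"), Span(12, 15, "date")]

# Left to right: the earliest annotation wins.
left_to_right = resolve_overlap(spans, sort_by=("start_char",))

# Longest first: reverse the length ordering with a callback.
longest_first = resolve_overlap(
    spans, sort_by=("length",), sort_by_callbacks={"length": lambda x: -x}
)
```

In this sketch, left_to_right keeps the name and date spans, while longest_first keeps the location and date spans.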

process_annotations(annotations: AnnotationSet, text: str) AnnotationSet#

Process an AnnotationSet.

Parameters:
  • annotations – The input AnnotationSet.

  • text – The corresponding text.

Returns:

An AnnotationSet that is processed according to the class logic.

class docdeid.process.annotation_processor.MergeAdjacentAnnotations(slack_regexp: Optional[str] = None, check_overlap: bool = True)#

Bases: AnnotationProcessor

Merges adjacent annotations, optionally allowing some slack (e.g. whitespace) in between. Assumes the annotations do not overlap. Setting check_overlap=False disables the overlap check for a small performance gain; do so only if you are sure that no overlap can be present.

Parameters:
  • slack_regexp – A regexp that is used to match the characters between two annotations.

  • check_overlap – If set to False, there is no check if annotations are non-overlapping. This will give some minor performance benefit if you are sure there can be no overlap.
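As a rough illustration of the slack_regexp idea, the sketch below merges neighbouring spans when the text between them fully matches the slack pattern. Span and merge_adjacent are hypothetical names, the requirement that merged spans share a tag is an assumption, and none of this is the library's own code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation; not docdeid's Annotation class."""

    start_char: int
    end_char: int
    tag: str


def merge_adjacent(spans, text, slack_regexp=None):
    """Merge neighbouring spans with equal tags when only slack separates them."""
    merged = []
    for span in sorted(spans, key=lambda s: s.start_char):
        if merged:
            prev = merged[-1]
            gap = text[prev.end_char:span.start_char]
            # Without a slack_regexp, only directly touching spans (empty gap) merge.
            gap_ok = gap == "" or (
                slack_regexp is not None and re.fullmatch(slack_regexp, gap) is not None
            )
            if prev.tag == span.tag and gap_ok:
                merged[-1] = Span(prev.start_char, span.end_char, prev.tag)
                continue
        merged.append(span)
    return merged


text = "John  Smith lives in Utrecht"
spans = [Span(0, 4, "name"), Span(6, 11, "name"), Span(21, 28, "location")]

# The two name spans are separated only by whitespace, so they become one span.
result = merge_adjacent(spans, text, slack_regexp=r"\s+")
```

Here the two name spans collapse into a single span covering "John  Smith", and the location span is left untouched because its tag differs.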

process_annotations(annotations: AnnotationSet, text: str) AnnotationSet#

Process an AnnotationSet.

Parameters:
  • annotations – The input AnnotationSet.

  • text – The corresponding text.

Returns:

An AnnotationSet that is processed according to the class logic.

docdeid.process.annotator#

class docdeid.process.annotator.Annotator(tag: str, priority: int = 0)#

Bases: DocProcessor, ABC

Abstract class for annotators, which are responsible for generating annotations from a given document. Subclasses should implement the annotate method.

Parameters:

tag – The tag to use in the annotations.

process(doc: Document, **kwargs) None#

Process a document, by adding annotations to its AnnotationSet.

Parameters:

doc – The document to be processed.

abstract annotate(doc: Document) list[docdeid.annotation.Annotation]#

Generate annotations for a document.

Parameters:

doc – The document that should be annotated.

Returns:

A list of annotations.

class docdeid.process.annotator.SingleTokenLookupAnnotator(lookup_values: Iterable[str], *args, matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None, tokenizer_name: str = 'default', **kwargs)#

Bases: Annotator

Matches single tokens based on lookup values.

Parameters:
  • lookup_values – An iterable of strings that should be used for lookup.

  • matching_pipeline – An optional pipeline that can be used for matching (e.g. lowercasing). Note that this degrades performance.

  • tokenizer_name – If not taking tokens from the default tokenizer, specify which tokenizer to use. The tokenizer should be present in DocDeid.tokenizers.
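The lookup idea can be sketched without the library as follows. The function annotate_single_tokens is hypothetical, the word tokenizer is a simple regular expression rather than a docdeid tokenizer, and the matching pipeline is modelled as a list of plain callables instead of StringModifier objects.

```python
import re


def annotate_single_tokens(text, lookup_values, tag, matching_pipeline=None):
    """Return (start_char, end_char, tag) for every token found in lookup_values."""
    pipeline = matching_pipeline or []
    lookup = set(lookup_values)
    annotations = []
    for match in re.finditer(r"\w+", text):  # assumption: simple word tokenizer
        token = match.group()
        # Each pipeline step transforms the token before the lookup.
        for modify in pipeline:
            token = modify(token)
        if token in lookup:
            annotations.append((match.start(), match.end(), tag))
    return annotations


text = "Patient seen by ANNA and Pieter"

# With lowercasing in the pipeline, both names match the lowercase lookup values.
found = annotate_single_tokens(
    text, {"anna", "pieter"}, "name", matching_pipeline=[str.lower]
)

# Without a pipeline, the exact-case lookup finds nothing.
not_found = annotate_single_tokens(text, {"anna", "pieter"}, "name")
```

The pipeline is applied to every token, which is one way to read the performance note above.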

annotate(doc: Document) list[docdeid.annotation.Annotation]#

Generate annotations for a document.

Parameters:

doc – The document that should be annotated.

Returns:

A list of annotations.

class docdeid.process.annotator.MultiTokenLookupAnnotator(*args, lookup_values: Optional[Iterable[str]] = None, matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None, tokenizer: Optional[Tokenizer] = None, trie: Optional[LookupTrie] = None, overlapping: bool = False, **kwargs)#

Bases: Annotator

Matches lookup values against tokens, where the lookup_values may themselves be sequences.

Parameters:
  • lookup_values – An iterable of strings, that should be matched. These are tokenized internally.

  • matching_pipeline – An optional pipeline that can be used for matching (e.g. lowercasing). This has no specific impact on matching performance, other than overhead for applying the pipeline to each string.

  • tokenizer – A tokenizer that is used to create the sequence patterns from lookup_values.

  • trie – A trie that is used for matching, rather than a combination of lookup_values and a matching_pipeline (cannot be used simultaneously).

  • overlapping – Whether the annotator should match overlapping sequences, or should process from left to right.

Raises:

RuntimeError – When an incorrect combination of lookup_values, matching_pipeline and trie is supplied.
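The difference between overlapping=False and overlapping=True can be shown with a small sketch on token sequences. It uses a set of tuples instead of a trie, prefers the longest match at each position (an assumption), and match_sequences is a hypothetical helper, not part of docdeid.

```python
def match_sequences(tokens, lookup_sequences, overlapping=False):
    """Return (start, end) token index pairs for the longest match at each position."""
    sequences = {tuple(seq) for seq in lookup_sequences}
    max_len = max((len(seq) for seq in sequences), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in sequences:
                matches.append((i, i + n))
                if not overlapping:
                    step = n  # continue after the end of the match
                break
        i += step
    return matches


tokens = ["new", "york", "city", "hall"]
lookup = [["new", "york"], ["york", "city"], ["city", "hall"]]

# Left to right: a match consumes its tokens, so "york city" is never tried.
left_to_right = match_sequences(tokens, lookup)

# Overlapping: every start position is tried, so all three sequences match.
with_overlap = match_sequences(tokens, lookup, overlapping=True)
```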

annotate(doc: Document) list[docdeid.annotation.Annotation]#

Generate annotations for a document.

Parameters:

doc – The document that should be annotated.

Returns:

A list of annotations.

class docdeid.process.annotator.RegexpAnnotator(regexp_pattern: Union[Pattern, str], *args, capturing_group: int = 0, pre_match_words: Optional[list[str]] = None, **kwargs)#

Bases: Annotator

Create annotations based on regular expression patterns. Note that these patterns do not necessarily start/stop on token boundaries.

Parameters:
  • regexp_pattern – A pattern, either as a str or a re.Pattern, that will be used for matching.

  • capturing_group – The capturing group of the pattern that should be used to produce the annotation. By default, the entire match is used.

  • pre_match_words – A list of words (lookup values), of which at least one must be present in the tokens for the annotator to start matching the regexp at all.
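A minimal sketch of how capturing_group and pre_match_words might interact, using only the standard re module. The function annotate_regexp and its simple word tokenization are assumptions made for illustration; the library's own tokenization and annotation objects are not reproduced here.

```python
import re


def annotate_regexp(text, regexp_pattern, tag, capturing_group=0, pre_match_words=None):
    """Return (start_char, end_char, tag) for each match of the chosen group."""
    pattern = re.compile(regexp_pattern)  # accepts a str or a compiled pattern
    if pre_match_words is not None:
        tokens = set(re.findall(r"\w+", text))  # assumption: simple word tokens
        if not tokens & set(pre_match_words):
            return []  # none of the words occur, so the regexp is never run
    return [
        (m.start(capturing_group), m.end(capturing_group), tag)
        for m in pattern.finditer(text)
    ]


text = "Call 06-12345678 or mail the clinic"

# Only group 1 (the number itself) becomes the annotation, not the word "Call".
phones = annotate_regexp(text, r"Call (\d{2}-\d{8})", "phone", capturing_group=1)

# Neither "tel" nor "phone" is a token in the text, so matching is skipped.
skipped = annotate_regexp(
    text, r"\d{2}-\d{8}", "phone", pre_match_words=["tel", "phone"]
)
```

Note that the annotated span starts inside the match, which is consistent with the remark that patterns need not start or stop on token boundaries.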

annotate(doc: Document) list[docdeid.annotation.Annotation]#

Generate annotations for a document.

Parameters:

doc – The document that should be annotated.

Returns:

A list of annotations.

class docdeid.process.annotator.TokenPatternAnnotator(pattern: TokenPattern, *args, **kwargs)#

Bases: Annotator

Annotate based on TokenPattern.

Parameters:

pattern – The token pattern that should be used.

annotate(doc: Document) list[docdeid.annotation.Annotation]#

Generate annotations for a document.

Parameters:

doc – The document that should be annotated.

Returns:

A list of annotations.

docdeid.process.doc_processor#

class docdeid.process.doc_processor.DocProcessor#

Bases: ABC

Something that processes a document.

abstract process(doc: Document, **kwargs) None#

Process the document.

Parameters:
  • doc – The document as input.

  • **kwargs – Any other settings.

class docdeid.process.doc_processor.DocProcessorGroup#

Bases: object

A group of DocProcessor objects that executes its contained processors in order.

A DocProcessorGroup can itself be part of a DocProcessorGroup.

get_names(recursive: bool = True) list[str]#

Get the names of all document processors.

Parameters:

recursive – Whether to recurse on any contained DocProcessorGroup.

Returns:

The names of all document processors.

add_processor(name: str, processor: Union[DocProcessor, DocProcessorGroup], position: Optional[int] = None) None#

Add a document processor to the group.

Parameters:
  • name – The name of the processor.

  • processor – The processor or processor group.

  • position – The position at which to insert it. Will append if left unspecified.

remove_processor(name: str) None#

Remove a processor from the group.

Parameters:

name – The name of the processor.

__getitem__(name: str) Union[DocProcessor, DocProcessorGroup]#

Get a document processor by name.

Parameters:

name – The name of the document processor.

Returns:

The document processor.

process(doc: Document, enabled: Optional[set[str]] = None, disabled: Optional[set[str]] = None) None#

Process a document, by passing it to this group’s processors.

Parameters:
  • doc – The document to be processed.

  • enabled – A set of strings, indicating which document processors to run for this document. By default, all document processors are used. For nested groups, it’s necessary to supply both the name of the processor group and the names of its contained processors (or a subset thereof).

  • disabled – A set of strings, indicating which document processors not to run for this document. Cannot be used together with enabled.
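The nesting rule for enabled can be sketched with a toy group. Group is a hypothetical class, processors are plain callables that append their name to a list, and raising ValueError when both arguments are given is a choice made for this sketch, not a statement about the library's behaviour.

```python
class Group:
    """Hypothetical ordered group; leaf processors are callables taking a list."""

    def __init__(self):
        self._processors = {}  # dicts preserve insertion order

    def add_processor(self, name, processor):
        self._processors[name] = processor

    def process(self, log, enabled=None, disabled=None):
        if enabled is not None and disabled is not None:
            raise ValueError("enabled and disabled cannot be used together")
        for name, processor in self._processors.items():
            if enabled is not None and name not in enabled:
                continue
            if disabled is not None and name in disabled:
                continue
            if isinstance(processor, Group):
                # The same filters are passed down to the nested group.
                processor.process(log, enabled=enabled, disabled=disabled)
            else:
                processor(log)


names = Group()
names.add_processor("first_names", lambda log: log.append("first_names"))
names.add_processor("surnames", lambda log: log.append("surnames"))

pipeline = Group()
pipeline.add_processor("names", names)
pipeline.add_processor("dates", lambda log: log.append("dates"))

# Naming only the group runs nothing: its members are not enabled themselves.
only_group_name = []
pipeline.process(only_group_name, enabled={"names"})

# Naming the group and one member runs exactly that member.
group_and_member = []
pipeline.process(group_and_member, enabled={"names", "first_names"})

# Disabling a single processor runs everything else, in order.
without_dates = []
pipeline.process(without_dates, disabled={"dates"})
```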

docdeid.process.redactor#

class docdeid.process.redactor.Redactor#

Bases: DocProcessor, ABC

Takes care of redacting the text by modifying it, based on the input text and annotations.

Subclasses should implement the logic in Redactor.redact().

process(doc: Document, **kwargs) None#

Process a document by redacting it, according to the logic in Redactor.redact().

Parameters:
  • doc – The document to process.

  • **kwargs – Any other arguments.

abstract redact(text: str, annotations: AnnotationSet) str#

Redact the text.

Parameters:
  • text – The input text.

  • annotations – The annotations that are produced by previous document processors.

Returns:

The redacted text.

class docdeid.process.redactor.RedactAllText(open_char: str = '[', close_char: str = ']')#

Bases: Redactor

Literally redacts all text. Might for example be used when an error is raised.

Parameters:
  • open_char – The open char to use for the REDACTED tag.

  • close_char – The close char to use for the REDACTED tag.

Returns:

The text REDACTED with the open and close char, literally (e.g. [REDACTED]).

redact(text: str, annotations: AnnotationSet) str#

Redact the text.

Parameters:
  • text – The input text.

  • annotations – The annotations that are produced by previous document processors.

Returns:

The redacted text.

class docdeid.process.redactor.SimpleRedactor(open_char: str = '[', close_char: str = ']', check_overlap: bool = True)#

Bases: Redactor

Basic redactor, that replaces each entity in text with its tag. If the same entity occurs multiple times (with the same tag), it is replaced with tag-n. Requires the set of annotations to be non-overlapping.

Parameters:
  • open_char – The open char to use for the replacement tag.

  • close_char – The close char to use for the replacement tag.

  • check_overlap – Whether to check whether annotations overlap. If set to False but annotations are overlapping, will not give correct results.

Returns:

The redacted text, with each entity recognized in the set of annotations replaced with the proper tag.
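The tag-n replacement scheme can be sketched in a few lines. The function simple_redact is hypothetical, annotations are modelled as plain (start_char, end_char, tag) tuples, and the details of the numbering (starting at 1, counted per tag in order of first appearance, tag written as-is) are assumptions rather than documented behaviour.

```python
def simple_redact(text, annotations, open_char="[", close_char="]"):
    """Replace each non-overlapping (start_char, end_char, tag) span with tag-n."""
    ordered = sorted(annotations)

    # Assign one number per (tag, entity text), in order of first appearance.
    numbers = {}
    counters = {}
    for start, end, tag in ordered:
        key = (tag, text[start:end])
        if key not in numbers:
            counters[tag] = counters.get(tag, 0) + 1
            numbers[key] = counters[tag]

    # Replace from right to left so that earlier offsets stay valid.
    for start, end, tag in reversed(ordered):
        n = numbers[(tag, text[start:end])]
        text = f"{text[:start]}{open_char}{tag}-{n}{close_char}{text[end:]}"
    return text


text = "Anna met Bob, then Anna left"
annotations = [(0, 4, "name"), (9, 12, "name"), (19, 23, "name")]
redacted = simple_redact(text, annotations)
```

Both occurrences of "Anna" receive the same number, so the redacted text still shows that they refer to the same entity.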

redact(text: str, annotations: AnnotationSet) str#

Redact the text.

Parameters:
  • text – The input text.

  • annotations – The annotations that are produced by previous document processors.

Returns:

The redacted text.