docdeid.process#
docdeid.process.annotation_processor#
- class docdeid.process.annotation_processor.AnnotationProcessor#
Bases:
DocProcessor, ABC
Processes an AnnotationSet.
- process(doc: Document, **kwargs) None#
Process the document.
- Parameters:
doc – The document as input.
**kwargs – Any other settings.
- abstract process_annotations(annotations: AnnotationSet, text: str) AnnotationSet#
Process an AnnotationSet.
- Parameters:
annotations – The input AnnotationSet.
text – The corresponding text.
- Returns:
An AnnotationSet that is processed according to the class logic.
- class docdeid.process.annotation_processor.OverlapResolver(sort_by: tuple, sort_by_callbacks: Optional[frozendict[str, Callable]] = None)#
Bases:
AnnotationProcessor
Resolves overlap in an AnnotationSet, if any. Use the sort_by and sort_by_callbacks arguments to specify how overlap should be resolved. For instance, sort_by=('start_char',) resolves overlap from left to right, while sort_by=('length',) sorts from short to long.
- Parameters:
sort_by – A tuple of Annotation attributes to use for sorting.
sort_by_callbacks – A mapping from attribute name (as a string) to a callable that influences the sort order (e.g. reverse with lambda x: -x).
- process_annotations(annotations: AnnotationSet, text: str) AnnotationSet#
Process an AnnotationSet.
- Parameters:
annotations – The input AnnotationSet.
text – The corresponding text.
- Returns:
An AnnotationSet that is processed according to the class logic.
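To make the effect of sort_by and sort_by_callbacks concrete, here is a minimal plain-Python sketch of the idea. It does not use docdeid itself: the Span class and the resolve_overlap function are hypothetical stand-ins, and the greedy "keep the first span in sort order, drop anything that overlaps it" strategy is an assumption about how such a resolver could work.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation; not docdeid's Annotation class."""
    text: str
    start_char: int
    end_char: int
    tag: str

    @property
    def length(self) -> int:
        return self.end_char - self.start_char


def resolve_overlap(spans, sort_by: tuple, sort_by_callbacks: Optional[dict] = None):
    """Keep spans greedily in sorted order, skipping any that overlap a kept span."""
    callbacks: dict[str, Callable] = sort_by_callbacks or {}

    def key(span):
        # Apply the optional callback per attribute, e.g. negate to reverse the order.
        return tuple(callbacks.get(attr, lambda x: x)(getattr(span, attr)) for attr in sort_by)

    kept = []
    for span in sorted(spans, key=key):
        if all(span.end_char <= k.start_char or span.start_char >= k.end_char for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start_char)


spans = [
    Span("Jan", 0, 3, "name"),
    Span("Jan Jansen", 0, 10, "name"),
    Span("Jansen", 4, 10, "name"),
]
left_to_right = resolve_overlap(spans, sort_by=("start_char",))
longest_first = resolve_overlap(
    spans, sort_by=("length",), sort_by_callbacks={"length": lambda x: -x}
)
```

In this sketch, sorting on start_char keeps "Jan" and "Jansen", whereas sorting on negated length keeps only the longest span, "Jan Jansen".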
- class docdeid.process.annotation_processor.MergeAdjacentAnnotations(slack_regexp: Optional[str] = None, check_overlap: bool = True)#
Bases:
AnnotationProcessor
Merges adjacent annotations, optionally allowing some slack (e.g. whitespace) in between. Assumes the annotations do not overlap. You can disable the overlap check by setting check_overlap=False to gain some performance, if you are certain that no overlap can be present.
- Parameters:
slack_regexp – A regexp that is used to match the characters between two annotations.
check_overlap – If set to False, annotations are not checked for overlap. This gives a minor performance benefit if you are sure there can be no overlap.
- process_annotations(annotations: AnnotationSet, text: str) AnnotationSet#
Process an AnnotationSet.
- Parameters:
annotations – The input AnnotationSet.
text – The corresponding text.
- Returns:
An AnnotationSet that is processed according to the class logic.
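The role of slack_regexp can be illustrated with a small plain-Python sketch. It is not docdeid's implementation: the merge_adjacent function and the (start, end, tag) tuples are hypothetical, and the sketch assumes that two spans are merged only when they carry the same tag and the text between them is empty or fully matched by the slack pattern.

```python
import re
from typing import Optional


def merge_adjacent(spans, text: str, slack_regexp: Optional[str] = None):
    """Merge neighbouring (start, end, tag) spans, assuming they do not overlap."""
    merged = []
    for start, end, tag in sorted(spans):
        if merged:
            prev_start, prev_end, prev_tag = merged[-1]
            gap = text[prev_end:start]
            # The gap is acceptable if it is empty or consists only of slack.
            gap_ok = gap == "" or (
                slack_regexp is not None and re.fullmatch(slack_regexp, gap) is not None
            )
            if tag == prev_tag and gap_ok:
                merged[-1] = (prev_start, end, tag)
                continue
        merged.append((start, end, tag))
    return merged


text = "Jan  Jansen lives in Utrecht"
spans = [(0, 3, "name"), (5, 11, "name"), (21, 28, "location")]
with_slack = merge_adjacent(spans, text, slack_regexp=r"\s+")
without_slack = merge_adjacent(spans, text)
```

With the whitespace slack pattern, the two name spans collapse into one span covering "Jan  Jansen"; without it, the spans are left as they were.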
docdeid.process.annotator#
- class docdeid.process.annotator.Annotator(tag: str, priority: int = 0)#
Bases:
DocProcessor, ABC
Abstract class for annotators, which are responsible for generating annotations from a given document. Subclasses should implement the annotate method.
- Parameters:
tag – The tag to use in the annotations.
- process(doc: Document, **kwargs) None#
Process a document by adding annotations to its AnnotationSet.
- Parameters:
doc – The document to be processed.
- abstract annotate(doc: Document) list[docdeid.annotation.Annotation]#
Generate annotations for a document.
- Parameters:
doc – The document that should be annotated.
- Returns:
A list of annotations.
- class docdeid.process.annotator.SingleTokenLookupAnnotator(lookup_values: Iterable[str], *args, matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None, tokenizer_name: str = 'default', **kwargs)#
Bases:
Annotator
Matches single tokens based on lookup values.
- Parameters:
lookup_values – An iterable of strings that should be used for lookup.
matching_pipeline – An optional pipeline that can be used for matching (e.g. lowercasing). Note that this degrades performance.
tokenizer_name – If not taking tokens from the default tokenizer, specify which tokenizer to use. The tokenizer should be present in DocDeid.tokenizers.
- annotate(doc: Document) list[docdeid.annotation.Annotation]#
Generate annotations for a document.
- Parameters:
doc – The document that should be annotated.
- Returns:
A list of annotations.
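The interplay between lookup values and a matching pipeline can be sketched in plain Python as follows. This is an illustration only, not the library's code: annotate_single_tokens is a hypothetical helper, a simple \w+ regexp stands in for the document's tokenizer, and plain callables such as str.lower stand in for StringModifier objects.

```python
import re


def annotate_single_tokens(text, lookup_values, tag, matching_pipeline=None):
    """Return (start, end, tag) for every token whose normalised form is a lookup value."""
    pipeline = matching_pipeline or []

    def normalise(value):
        for modifier in pipeline:
            value = modifier(value)
        return value

    # The pipeline is applied to both the lookup values and the tokens.
    lookup = {normalise(value) for value in lookup_values}
    # Assumption: a simple word tokenizer stands in for the document's tokenizer.
    return [
        (m.start(), m.end(), tag)
        for m in re.finditer(r"\w+", text)
        if normalise(m.group()) in lookup
    ]


text = "Patient seen in utrecht and Amsterdam"
exact = annotate_single_tokens(text, ["Utrecht", "Amsterdam"], tag="location")
lowercased = annotate_single_tokens(
    text, ["Utrecht", "Amsterdam"], tag="location", matching_pipeline=[str.lower]
)
```

Without a pipeline only the exact token "Amsterdam" matches; with lowercasing applied on both sides, "utrecht" matches as well, at the cost of normalising every token.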
- class docdeid.process.annotator.MultiTokenLookupAnnotator(*args, lookup_values: Optional[Iterable[str]] = None, matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None, tokenizer: Optional[Tokenizer] = None, trie: Optional[LookupTrie] = None, overlapping: bool = False, **kwargs)#
Bases:
Annotator
Matches lookup values against tokens, where the lookup_values may themselves be sequences.
- Parameters:
lookup_values – An iterable of strings, that should be matched. These are tokenized internally.
matching_pipeline – An optional pipeline that can be used for matching (e.g. lowercasing). This has no specific impact on matching performance, other than overhead for applying the pipeline to each string.
tokenizer – A tokenizer that is used to create the sequence patterns from lookup_values.
trie – A trie that is used for matching, rather than a combination of lookup_values and a matching_pipeline (cannot be used simultaneously).
overlapping – Whether the annotator should match overlapping sequences, or should process from left to right.
- Raises:
RuntimeError – When an incorrect combination of lookup_values, matching_pipeline and trie is supplied.
- annotate(doc: Document) list[docdeid.annotation.Annotation]#
Generate annotations for a document.
- Parameters:
doc – The document that should be annotated.
- Returns:
A list of annotations.
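The difference between left-to-right and overlapping matching of token sequences can be sketched with a small trie in plain Python. The build_trie and match_sequences helpers below are hypothetical and do not reflect docdeid's LookupTrie API; the sketch also assumes whitespace tokenization and that the longest match from each start position is preferred.

```python
def build_trie(phrases):
    """Build a nested-dict trie over whitespace-split token sequences."""
    trie = {}
    for phrase in phrases:
        node = trie
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[None] = True  # marks the end of a complete phrase
    return trie


def match_sequences(tokens, trie, overlapping=False):
    """Return (first_index, end_index_exclusive) of the longest match per start token."""
    matches = []
    i = 0
    while i < len(tokens):
        node, longest = trie, None
        for j in range(i, len(tokens)):
            node = node.get(tokens[j])
            if node is None:
                break
            if None in node:
                longest = j + 1
        if longest is not None:
            matches.append((i, longest))
            # Left to right: continue after the match; overlapping: try the next token.
            i = i + 1 if overlapping else longest
        else:
            i += 1
    return matches


tokens = "Anna van der Berg".split()
trie = build_trie(["Anna van der Berg", "van der Berg", "Berg"])
left_to_right = match_sequences(tokens, trie)
with_overlap = match_sequences(tokens, trie, overlapping=True)
```

Processing from left to right yields a single match for the full name, while overlapping=True additionally reports the shorter sequences that start inside it.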
- class docdeid.process.annotator.RegexpAnnotator(regexp_pattern: Union[Pattern, str], *args, capturing_group: int = 0, pre_match_words: Optional[list[str]] = None, **kwargs)#
Bases:
Annotator
Create annotations based on regular expression patterns. Note that these patterns do not necessarily start/stop on token boundaries.
- Parameters:
regexp_pattern – A pattern, either as a str or a re.Pattern, that will be used for matching.
capturing_group – The capturing group of the pattern that should be used to produce the annotation. By default, the entire match is used.
pre_match_words – A list of words (lookup values), of which at least one must be present in the tokens for the annotator to start matching the regexp at all.
- annotate(doc: Document) list[docdeid.annotation.Annotation]#
Generate annotations for a document.
- Parameters:
doc – The document that should be annotated.
- Returns:
A list of annotations.
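How capturing_group and pre_match_words shape the result can be shown with Python's standard re module. The annotate_regexp function below is a hypothetical sketch rather than the library's implementation, and it assumes a simple \w+ tokenization for the pre-match check.

```python
import re


def annotate_regexp(text, regexp_pattern, tag, capturing_group=0, pre_match_words=None):
    """Return (start, end, tag) for each match of the chosen capturing group."""
    if pre_match_words is not None:
        # Cheap pre-check: skip the regexp entirely unless one of the words is a token.
        tokens = set(re.findall(r"\w+", text))
        if tokens.isdisjoint(pre_match_words):
            return []
    pattern = re.compile(regexp_pattern)
    return [
        (m.start(capturing_group), m.end(capturing_group), tag)
        for m in pattern.finditer(text)
    ]


text = "Call 06-12345678 for details"
whole = annotate_regexp(text, r"(\d{2})-(\d{8})", tag="phone")
group = annotate_regexp(text, r"(\d{2})-(\d{8})", tag="phone", capturing_group=2)
skipped = annotate_regexp(text, r"(\d{2})-(\d{8})", tag="phone", pre_match_words=["tel"])
```

Group 0 annotates the entire match, group 2 only the eight trailing digits, and the last call returns nothing because none of the pre-match words occurs in the text.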
- class docdeid.process.annotator.TokenPatternAnnotator(pattern: TokenPattern, *args, **kwargs)#
Bases:
Annotator
Annotate based on TokenPattern.
- Parameters:
pattern – The token pattern that should be used.
- annotate(doc: Document) list[docdeid.annotation.Annotation]#
Generate annotations for a document.
- Parameters:
doc – The document that should be annotated.
- Returns:
A list of annotations.
docdeid.process.doc_processor#
- class docdeid.process.doc_processor.DocProcessor#
Bases:
ABC
Something that processes a document.
- class docdeid.process.doc_processor.DocProcessorGroup#
Bases:
object
A group of DocProcessor objects that executes the contained processors in order. A DocProcessorGroup can itself be part of a DocProcessorGroup.
- get_names(recursive: bool = True) list[str]#
Get the names of all document processors.
- Parameters:
recursive – Whether to recurse on any contained DocProcessorGroup.
- Returns:
The names of all document processors.
- add_processor(name: str, processor: Union[DocProcessor, DocProcessorGroup], position: Optional[int] = None) None#
Add a document processor to the group.
- Parameters:
name – The name of the processor.
processor – The processor or processor group.
position – The position at which to insert it. Will append if left unspecified.
- remove_processor(name: str) None#
Remove a processor from the group.
- Parameters:
name – The name of the processor.
- __getitem__(name: str) Union[DocProcessor, DocProcessorGroup]#
Get a document processor by name.
- Parameters:
name – The name of the document processor.
- Returns:
The document processor.
- process(doc: Document, enabled: Optional[set[str]] = None, disabled: Optional[set[str]] = None) None#
Process a document, by passing it to this group’s processors.
- Parameters:
doc – The document to be processed.
enabled – A set of strings, indicating which document processors to run for this document. By default, all document processors are used. For nested processor groups, it is necessary to supply both the name of the processor group and the names of its contained processors (or a subset thereof).
disabled – A set of strings, indicating which document processors not to run for this document. Cannot be used together with enabled.
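The selection rules for enabled and disabled, including the nested case, can be sketched in plain Python. The run_order function and the nested dictionaries below are hypothetical stand-ins for a processor group; raising ValueError when both arguments are given is an assumption, since the exception type is not stated above.

```python
from typing import Optional


def run_order(group: dict, enabled: Optional[set] = None, disabled: Optional[set] = None):
    """Return the names of the processors that would run, in order.

    A nested dict represents a nested processor group.
    """
    if enabled is not None and disabled is not None:
        # Assumption: the exception type is illustrative only.
        raise ValueError("enabled and disabled cannot be used together")
    order = []
    for name, processor in group.items():
        if enabled is not None and name not in enabled:
            continue
        if disabled is not None and name in disabled:
            continue
        if isinstance(processor, dict):
            # A group only runs its contents when the group itself is selected.
            order.extend(run_order(processor, enabled, disabled))
        else:
            order.append(name)
    return order


pipeline = {
    "names": {"first_name": object(), "surname": object()},
    "resolver": object(),
    "redactor": object(),
}
everything = run_order(pipeline)
nested_ok = run_order(pipeline, enabled={"names", "surname", "redactor"})
group_missing = run_order(pipeline, enabled={"surname", "redactor"})
without_resolver = run_order(pipeline, disabled={"resolver"})
```

The group_missing case shows why the group name must be included in enabled: naming only "surname" is not enough, because the enclosing "names" group is skipped before its contents are considered.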
docdeid.process.redactor#
- class docdeid.process.redactor.Redactor#
Bases:
DocProcessor, ABC
Takes care of redacting the text by modifying it, based on the input text and annotations.
Subclasses should implement the logic in Redactor.redact().
- process(doc: Document, **kwargs) None#
Process a document by redacting it, according to the logic in Redactor.redact().
- Parameters:
doc – The document to process.
**kwargs – Any other arguments.
- abstract redact(text: str, annotations: AnnotationSet) str#
Redact the text.
- Parameters:
text – The input text.
annotations – The annotations that are produced by previous document processors.
- Returns:
The redacted text.
- class docdeid.process.redactor.RedactAllText(open_char: str = '[', close_char: str = ']')#
Bases:
Redactor
Literally redacts all text. Might for example be used when an error is raised.
- Parameters:
open_char – The open char to use for the REDACTED tag.
close_char – The close char to use for the REDACTED tag.
- Returns:
The text REDACTED, wrapped in the open and close char (e.g. [REDACTED]).
- redact(text: str, annotations: AnnotationSet) str#
Redact the text.
- Parameters:
text – The input text.
annotations – The annotations that are produced by previous document processors.
- Returns:
The redacted text.
- class docdeid.process.redactor.SimpleRedactor(open_char: str = '[', close_char: str = ']', check_overlap: bool = True)#
Bases:
Redactor
Basic redactor that replaces each entity in the text with its tag. If the same entity occurs multiple times (with the same tag), it is replaced with tag-n. Requires the set of annotations to be non-overlapping.
- Parameters:
open_char – The open char to use for the replacement tag.
close_char – The close char to use for the replacement tag.
check_overlap – Whether to check whether annotations overlap. If set to False but annotations are overlapping, the redactor will not give correct results.
- Returns:
The redacted text, with each entity recognized in the set of annotations replaced with the proper tag.
- redact(text: str, annotations: AnnotationSet) str#
Redact the text.
- Parameters:
text – The input text.
annotations – The annotations that are produced by previous document processors.
- Returns:
The redacted text.
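The tag-n replacement scheme can be sketched in plain Python as follows. The redact_spans function is hypothetical and is not docdeid's SimpleRedactor: it assumes non-overlapping (start, end, tag) spans, that every replacement is numbered, that identical entity text with the same tag reuses the same number, and that the tag is inserted unchanged.

```python
def redact_spans(text, annotations, open_char="[", close_char="]"):
    """Replace each (start, end, tag) span with a numbered tag."""
    numbering = {}  # tag -> {entity text -> n}
    replacements = []
    for start, end, tag in sorted(annotations):
        per_tag = numbering.setdefault(tag, {})
        # A new entity text gets the next number; a repeated one keeps its number.
        n = per_tag.setdefault(text[start:end], len(per_tag) + 1)
        replacements.append((start, end, f"{open_char}{tag}-{n}{close_char}"))
    # Apply from right to left so earlier character offsets stay valid.
    for start, end, replacement in reversed(replacements):
        text = text[:start] + replacement + text[end:]
    return text


text = "Jan called Piet, then Jan left"
annotations = [(0, 3, "name"), (11, 15, "name"), (22, 25, "name")]
redacted = redact_spans(text, annotations)
```

Both occurrences of "Jan" map to the same numbered tag, which keeps the redacted text readable while hiding the entity. Overlapping spans would corrupt the offsets in this sketch, which is why the input must be non-overlapping.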