Introduction

Create your own document de-identifier using docdeid, a simple framework independent of language or domain.

Note that docdeid is still at version 0.x.x, so breaking changes may occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.

Installation

Grab the latest version from PyPI:

pip install docdeid

Getting started

import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor

deidentifier = DocDeid()

deidentifier.tokenizers["default"] = WordBoundaryTokenizer()

deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)

deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)

deidentifier.processors.add_processor(
    "redactor",
    SimpleRedactor(),
)

text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)

The returned Document object holds the annotations and the de-identified text:

print(doc.annotations)

AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4), 
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})

print(doc.deidentified_text)

'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
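
To make the link between the annotations and the redacted text concrete, here is a minimal, library-independent sketch of numbered placeholder redaction. It is not docdeid's implementation: the `Span` class and the `redact` function are hypothetical names introduced for illustration, and the sketch assumes the annotations do not overlap.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for an annotation; not docdeid's own class."""

    text: str
    start_char: int
    end_char: int
    tag: str


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with [TAG-n], reusing n for a repeated (tag, text) pair.

    Assumes the spans do not overlap.
    """
    numbering: dict[tuple[str, str], int] = {}  # (tag, text) -> assigned number
    counters: dict[str, int] = {}  # tag -> highest number handed out so far
    pieces: list[str] = []
    cursor = 0

    for span in sorted(spans, key=lambda s: s.start_char):
        key = (span.tag, span.text)
        if key not in numbering:
            counters[span.tag] = counters.get(span.tag, 0) + 1
            numbering[key] = counters[span.tag]
        # Copy the untouched text before the span, then the placeholder.
        pieces.append(text[cursor:span.start_char])
        pieces.append(f"[{span.tag.upper()}-{numbering[key]}]")
        cursor = span.end_char

    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)


text = "John loves Mary, but Mary loves William."
spans = [
    Span("John", 0, 4, "name"),
    Span("Mary", 11, 15, "name"),
    Span("Mary", 21, 25, "name"),
    Span("William", 32, 39, "name"),
]
redacted = redact(text, spans)
print(redacted)
```

Because the number is keyed on the tag and the matched text, both occurrences of "Mary" map to the same placeholder, which keeps the redacted sentence readable.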

Features

Additionally, docdeid features:

Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
Callable from one interface (DocDeid.deidentify())
String processing and filtering
Fast lookup based on sets or tries
Anything you add! PRs welcome.
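
As a rough illustration of the set- and trie-based lookup mentioned in the list above, the sketch below contrasts the two ideas in plain Python. It does not use or mirror docdeid's own classes: `TokenTrie` and its methods are hypothetical names. A set answers "is this single token known?", while a trie over token sequences can also find multi-token entries without testing every possible slice.

```python
class TokenTrie:
    """Minimal trie over token sequences (illustrative, not docdeid's implementation)."""

    _END = object()  # sentinel key marking the end of a stored sequence

    def __init__(self):
        self.root = {}

    def add(self, tokens):
        """Store one sequence of tokens."""
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})
        node[self._END] = True

    def longest_match(self, tokens, start=0):
        """Return the longest stored sequence beginning at tokens[start], or None."""
        node = self.root
        best = None
        for i in range(start, len(tokens)):
            node = node.get(tokens[i])
            if node is None:
                break
            if self._END in node:
                best = tokens[start : i + 1]
        return best


tokens = ["Mary", "Ann", "loves", "John"]

# Set lookup: constant-time membership test per single token.
names = {"John", "Mary"}
single_hits = [token for token in tokens if token in names]

# Trie lookup: follows the tokens one by one and keeps the longest stored entry.
trie = TokenTrie()
trie.add(["Mary"])
trie.add(["Mary", "Ann"])
multi_hit = trie.longest_match(tokens, start=0)
no_hit = trie.longest_match(tokens, start=2)
```

Here `single_hits` contains the two single-token names, `multi_hit` is the two-token entry starting at position 0, and `no_hit` is `None` because no stored sequence starts with "loves".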

For a more in-depth tutorial, see: docs/tutorial