docdeid.ds#

docdeid.ds.ds#

class docdeid.ds.ds.Datastructure#

Bases: ABC

Something that holds data in an efficient way.

class docdeid.ds.ds.DsCollection#

Bases: dict[str, Datastructure]

A collection of datastructures.

Directly inherits from dict.
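Because the collection is a dict keyed by name, all the usual dict operations apply. The following is a minimal, self-contained sketch of that idea. It does not import docdeid: the `Datastructure` and `DsCollection` classes below are simplified stand-ins, and `WordSet` is a hypothetical concrete structure invented for illustration.

```python
from abc import ABC


class Datastructure(ABC):
    """Stand-in for the abstract base class described above."""


class WordSet(Datastructure):
    """Hypothetical concrete structure, for illustration only."""

    def __init__(self, words):
        self.words = set(words)


class DsCollection(dict):
    """Stand-in collection: a plain dict mapping names to structures."""


ds = DsCollection()
ds["first_names"] = WordSet({"Anna", "Pieter"})
ds["cities"] = WordSet({"Utrecht"})

# Regular dict operations (keys, item access, membership) work unchanged.
names = sorted(ds.keys())
has_anna = "Anna" in ds["first_names"].words
```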

docdeid.ds.lookup#

class docdeid.ds.lookup.LookupStructure(matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None)#

Bases: Datastructure

Structure that contains strings, and allows efficiently checking whether a string is contained in it.

Parameters:

matching_pipeline – An optional list of StringModifier, that will be used to match an item against the structure. Implementations of LookupStructure must implement the logic itself.

has_matching_pipeline() → bool#

Whether there’s a matching pipeline or not.

Returns:

True if there is a matching pipeline, False otherwise.

class docdeid.ds.lookup.LookupSet(*args, **kwargs)#

Bases: LookupStructure

Contains strings, that can efficiently be looked up. Additionally, contains some logic for matching.

Parameters:

matching_pipeline – An optional list of StringModifier, that will be used to match an item against the set.

clear_items() → None#

Clear the items.

add_items_from_iterable(items: Iterable[str], cleaning_pipeline: Optional[list[docdeid.str.processor.StringProcessor]] = None) → None#

Add items from an iterable.

Parameters:

items – The iterable of strings.
cleaning_pipeline – An optional cleaning pipeline applied to the strings in the iterator.

remove_items_from_iterable(items: Iterable[str]) → None#

Remove the items in an iterable from this set. Respects the matching pipeline.

Parameters:

items – An iterable of the strings to be removed.
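To make the role of the matching pipeline concrete, here is a minimal sketch of a set that normalizes strings on insertion, removal, and lookup. It is not the library's implementation: `SketchLookupSet` is a hypothetical class, string modifiers are approximated by plain `str -> str` functions, and the choice to silently ignore missing items on removal is an assumption.

```python
from typing import Callable, Iterable, Optional

# Hypothetical stand-in for a string modifier: any function str -> str.
Modifier = Callable[[str], str]


class SketchLookupSet:
    def __init__(self, matching_pipeline: Optional[list[Modifier]] = None) -> None:
        self._items: set[str] = set()
        self._matching_pipeline = matching_pipeline or []

    def _normalize(self, item: str) -> str:
        # Apply every modifier in order, e.g. lowercasing.
        for modify in self._matching_pipeline:
            item = modify(item)
        return item

    def add_items_from_iterable(self, items: Iterable[str]) -> None:
        for item in items:
            self._items.add(self._normalize(item))

    def remove_items_from_iterable(self, items: Iterable[str]) -> None:
        # Assumption: items that are not present are ignored.
        for item in items:
            self._items.discard(self._normalize(item))

    def __contains__(self, item: str) -> bool:
        return self._normalize(item) in self._items

    def __len__(self) -> int:
        return len(self._items)


names = SketchLookupSet(matching_pipeline=[str.lower])
names.add_items_from_iterable(["Anna", "Pieter", "ANNA"])
size_after_add = len(names)  # "Anna" and "ANNA" normalize to the same entry
found = "PIETER" in names  # the query is normalized before the lookup
names.remove_items_from_iterable(["pIeTeR"])
size_after_remove = len(names)
```

Normalizing in one place keeps insertion, removal, and lookup consistent, so an item can be removed using any spelling that the pipeline maps to the stored form.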

add_items_from_file(file_path: str, strip_lines: bool = True, cleaning_pipeline: Optional[list[docdeid.str.processor.StringProcessor]] = None, encoding: str = 'utf-8') → None#

Add items from a file, line by line.

Parameters:

file_path – Full path to the file being opened.
strip_lines – Whether to strip the lines. Applies StripString to each line.
cleaning_pipeline – An optional cleaning pipeline applied to the lines in the file.
encoding – The encoding with which to open the file.
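The line-by-line loading described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `read_items` and `drop_short` are hypothetical names, a cleaning step is modeled as a function that returns the cleaned string or `None` to drop the item, and with `strip_lines=False` only the trailing newline is removed.

```python
import os
import tempfile
from typing import Callable, Optional

# Hypothetical cleaning step: returns the cleaned string, or None to drop it.
Processor = Callable[[str], Optional[str]]


def read_items(
    file_path: str,
    strip_lines: bool = True,
    cleaning_pipeline: Optional[list[Processor]] = None,
    encoding: str = "utf-8",
) -> set[str]:
    items: set[str] = set()
    with open(file_path, encoding=encoding) as handle:
        for line in handle:
            # Assumption: without stripping, only the newline is removed.
            item = line.strip() if strip_lines else line.rstrip("\n")
            for process in cleaning_pipeline or []:
                item = process(item)
                if item is None:
                    break
            if item is not None:
                items.add(item)
    return items


def drop_short(item: str) -> Optional[str]:
    """Hypothetical filter: discard strings shorter than two characters."""
    return item if len(item) >= 2 else None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("  Anna \nPieter\nX\n")
    loaded = read_items(path, cleaning_pipeline=[drop_short])
```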

add_items_from_self(cleaning_pipeline: list[docdeid.str.processor.StringProcessor], replace: bool = False) → None#

Add items from self (the items of this LookupSet). This can be used to do a transformation or replacement of the items.

Parameters:

cleaning_pipeline – A cleaning pipeline applied to the items of this set. This can also be used to transform the items.
replace – Whether to replace the items with the new/transformed items.
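The effect of the `replace` flag can be illustrated on a plain Python set. The helper below, `transform_items`, is a hypothetical function written for this sketch, and cleaning steps are approximated by `str -> str` functions. It shows the two outcomes described above: keeping the original items alongside the transformed ones, or keeping only the transformed ones.

```python
from typing import Callable


def transform_items(
    items: set[str],
    cleaning_pipeline: list[Callable[[str], str]],
    replace: bool = False,
) -> set[str]:
    transformed = set()
    for item in items:
        for process in cleaning_pipeline:
            item = process(item)
        transformed.add(item)
    # With replace=True only the transformed items remain;
    # otherwise they are added next to the originals.
    return transformed if replace else items | transformed


surnames = {"de Vries", "Jansen"}
extended = transform_items(surnames, [str.upper])
replaced = transform_items(surnames, [str.upper], replace=True)
```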

__len__() → int#

The number of items.

Returns:

The number of items.

items() → set[str]#

Get the items in this set.

Returns:

The items in this set.

class docdeid.ds.lookup.LookupTrie(*args, **kwargs)#

Bases: LookupStructure

Efficiently contains lists of strings (e.g. tokens), for lookup. This is done by using a trie datastructure, which maps each element in the sequence of strings to a next trie.

Parameters:

matching_pipeline – An optional list of StringModifier, that will be used to match an item against the Trie.

add_item(item: list[str]) → None#

Add an item, i.e. a list of strings, to this Trie.

Parameters:

item – The item to be added.

longest_matching_prefix(item: list[str], start_i: int = 0) → Optional[list[str]]#

Finds the longest matching prefix of a list of strings. This is used to find the longest matching pattern at a current position of a text. Respects the matching pipeline.

Parameters:

item – The input sequence of strings, of which to find the longest prefix that matches an item in this Trie.
start_i – The index of item at which to start the matching. This is useful to avoid making copies of the items.

Returns:

The longest matching prefix, if any, or None if no matching prefix is found.
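The trie lookup can be sketched in a few lines. The class below, `SketchTrie`, is a hypothetical stand-in and not the library's implementation: it omits the matching pipeline and marks the end of a stored item with an `is_terminal` flag, which is an assumption about one reasonable design.

```python
from typing import Optional


class SketchTrie:
    def __init__(self) -> None:
        self.children: dict[str, "SketchTrie"] = {}
        self.is_terminal = False  # True if a stored item ends at this node

    def add_item(self, item: list[str]) -> None:
        node = self
        for token in item:
            if token not in node.children:
                node.children[token] = SketchTrie()
            node = node.children[token]
        node.is_terminal = True

    def longest_matching_prefix(
        self, item: list[str], start_i: int = 0
    ) -> Optional[list[str]]:
        node = self
        longest_end = None
        # Walk down the trie from start_i, remembering the last position
        # at which a complete stored item ended.
        for i in range(start_i, len(item)):
            node = node.children.get(item[i])
            if node is None:
                break
            if node.is_terminal:
                longest_end = i + 1
        if longest_end is None:
            return None
        return item[start_i:longest_end]


trie = SketchTrie()
trie.add_item(["New", "York"])
trie.add_item(["New", "York", "City"])

tokens = ["in", "New", "York", "City", "today"]
match = trie.longest_matching_prefix(tokens, start_i=1)
no_match = trie.longest_matching_prefix(tokens, start_i=0)
partial = trie.longest_matching_prefix(["New", "Jersey"])
```

Passing `start_i` lets a caller scan a token list position by position without slicing it at every step; only the final match is copied out.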