docdeid.ds#
docdeid.ds.ds#
- class docdeid.ds.ds.Datastructure#
Bases: ABC
Something that holds data in an efficient way.
- class docdeid.ds.ds.DsCollection#
Bases: dict[str, Datastructure]
A collection of datastructures. Directly inherits from dict.
docdeid.ds.lookup#
- class docdeid.ds.lookup.LookupStructure(matching_pipeline: Optional[list[docdeid.str.processor.StringModifier]] = None)#
Bases: Datastructure
Structure that contains strings, and allows efficiently checking whether a string is contained in it.
- Parameters:
matching_pipeline – An optional list of StringModifier, used to match an item against the structure. Implementations of LookupStructure must implement the matching logic themselves.
- has_matching_pipeline() bool#
Whether there’s a matching pipeline or not.
- Returns:
True if there is a matching pipeline, False otherwise.
- class docdeid.ds.lookup.LookupSet(*args, **kwargs)#
Bases: LookupStructure
Contains strings that can be looked up efficiently. Additionally contains some logic for matching.
- Parameters:
matching_pipeline – An optional list of StringModifier, used to match an item against the set.
- clear_items() None#
Clear the items.
- add_items_from_iterable(items: Iterable[str], cleaning_pipeline: Optional[list[docdeid.str.processor.StringProcessor]] = None) None#
Add items from an iterable.
- Parameters:
items – The iterable of strings.
cleaning_pipeline – An optional cleaning pipeline applied to the strings in the iterator.
- remove_items_from_iterable(items: Iterable[str]) None#
Remove items from an iterable. Respects the matching pipeline.
- Parameters:
items – An iterable of the strings to be removed.
- add_items_from_file(file_path: str, strip_lines: bool = True, cleaning_pipeline: Optional[list[docdeid.str.processor.StringProcessor]] = None, encoding: str = 'utf-8') None#
Add items from a file, line by line.
- Parameters:
file_path – Full path to the file being opened.
strip_lines – Whether to strip the lines. Applies StripString to each line.
cleaning_pipeline – An optional cleaning pipeline applied to the lines in the file.
encoding – The encoding with which to open the file.
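The line-by-line loading described above can be illustrated without docdeid installed. What follows is a minimal standalone sketch, not the library's implementation: `load_items`, `Cleaner`, and `drop_short` are hypothetical names, and the cleaning pipeline is modelled as plain callables over an iterable of strings rather than StringProcessor objects.

```python
import os
import tempfile
from typing import Callable, Iterable, Optional

# Assumption: a cleaning step is modelled as a callable over an iterable of strings.
Cleaner = Callable[[Iterable[str]], Iterable[str]]


def load_items(
    file_path: str,
    strip_lines: bool = True,
    cleaning_pipeline: Optional[list[Cleaner]] = None,
    encoding: str = "utf-8",
) -> set[str]:
    """Read a file line by line and return the resulting items as a set."""
    with open(file_path, encoding=encoding) as handle:
        items: Iterable[str] = handle.read().splitlines()
    if strip_lines:
        # Remove leading and trailing whitespace from every line.
        items = [line.strip() for line in items]
    for clean in cleaning_pipeline or []:
        # Each step receives all items and returns the cleaned items.
        items = clean(items)
    return set(items)


def drop_short(items: Iterable[str]) -> list[str]:
    """Hypothetical cleaning step: keep only items of at least two characters."""
    return [item for item in items if len(item) >= 2]


# Usage: write a small temporary file and load it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cities.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("  Amsterdam \nx\nUtrecht\n")
    cities = load_items(path, cleaning_pipeline=[drop_short])

print(sorted(cities))  # ['Amsterdam', 'Utrecht']
```

Stripping happens before the cleaning steps in this sketch, so a length filter sees the stripped lines; whether docdeid applies the steps in the same order is an assumption here.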
- add_items_from_self(cleaning_pipeline: list[docdeid.str.processor.StringProcessor], replace: bool = False) None#
Add items from self (the items of this LookupSet). This can be used to transform or replace the items.
- Parameters:
cleaning_pipeline – A cleaning pipeline applied to the items of this set. This can also be used to transform the items.
replace – Whether to replace the items with the new/transformed items.
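The effect of the replace flag is easiest to see on a plain set. The sketch below is a hedged illustration under stated assumptions, not docdeid code: `add_from_self`, `Cleaner`, and `lowercase_all` are hypothetical names, and it assumes that with replace=False the transformed items are added alongside the existing ones.

```python
from typing import Callable, Iterable

# Assumption: a cleaning step is modelled as a callable over an iterable of strings.
Cleaner = Callable[[Iterable[str]], Iterable[str]]


def lowercase_all(items: Iterable[str]) -> list[str]:
    """Hypothetical cleaning step: lowercase every item."""
    return [item.lower() for item in items]


def add_from_self(
    items: set[str],
    cleaning_pipeline: list[Cleaner],
    replace: bool = False,
) -> set[str]:
    """Run the current items through the pipeline and merge or replace."""
    transformed: Iterable[str] = set(items)
    for clean in cleaning_pipeline:
        transformed = clean(transformed)
    # With replace=True the original items are dropped first.
    base = set() if replace else set(items)
    return base | set(transformed)


names = {"Jan", "Piet"}
print(sorted(add_from_self(names, [lowercase_all])))                # ['Jan', 'Piet', 'jan', 'piet']
print(sorted(add_from_self(names, [lowercase_all], replace=True)))  # ['jan', 'piet']
```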
- __len__() int#
The number of items.
- Returns:
The number of items.
- items() set[str]#
Get the items in this set.
- Returns:
The items in this set.
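To make the role of the matching pipeline concrete, here is a minimal standalone sketch of a set that normalizes items both when storing and when looking them up. It is not docdeid's implementation: `MiniLookupSet` and `Modifier` are hypothetical names, the modifiers are plain `str -> str` callables instead of StringModifier objects, and applying the pipeline at insertion time is an assumption about one reasonable design.

```python
from typing import Callable, Iterable, Optional

# Assumption: a string modifier is modelled as a plain str -> str callable.
Modifier = Callable[[str], str]


class MiniLookupSet:
    """Hypothetical stand-in for a lookup set with a matching pipeline."""

    def __init__(self, matching_pipeline: Optional[list[Modifier]] = None) -> None:
        self.matching_pipeline = matching_pipeline
        self._items: set[str] = set()

    def has_matching_pipeline(self) -> bool:
        return self.matching_pipeline is not None

    def _match_form(self, item: str) -> str:
        # Apply every modifier in order to obtain the form used for matching.
        for modify in self.matching_pipeline or []:
            item = modify(item)
        return item

    def add_items_from_iterable(self, items: Iterable[str]) -> None:
        for item in items:
            self._items.add(self._match_form(item))

    def remove_items_from_iterable(self, items: Iterable[str]) -> None:
        for item in items:
            # discard() ignores items that are not present.
            self._items.discard(self._match_form(item))

    def __contains__(self, item: str) -> bool:
        return self._match_form(item) in self._items

    def __len__(self) -> int:
        return len(self._items)


names = MiniLookupSet(matching_pipeline=[str.lower])
names.add_items_from_iterable(["Alice", "BOB"])
print("alice" in names)  # True
names.remove_items_from_iterable(["bob"])
print(len(names))  # 1
```

Because the same normalization runs on stored and queried strings, a membership check stays a single hash lookup regardless of how the input is cased.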
- class docdeid.ds.lookup.LookupTrie(*args, **kwargs)#
Bases: LookupStructure
Efficiently stores lists of strings (e.g. tokens) for lookup. This is done by using a trie datastructure, which maps each element in the sequence of strings to a next trie.
- Parameters:
matching_pipeline – An optional list of StringModifier, used to match an item against the Trie.
- add_item(item: list[str]) None#
Add an item, i.e. a list of strings, to this Trie.
- Parameters:
item – The item to be added.
- longest_matching_prefix(item: list[str], start_i: int = 0) Optional[list[str]]#
Finds the longest matching prefix of a list of strings. This is used to find the longest matching pattern at a current position of a text. Respects the matching pipeline.
- Parameters:
item – The input sequence of strings, of which to find the longest prefix that matches an item in this Trie.
start_i – The index of item at which to start the matching. This is useful to avoid making copies of the items.
- Returns:
The longest matching prefix, if any, or None if no matching prefix is found.
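The longest-prefix search over a token trie can be sketched in a few lines. This is a standalone illustration under stated assumptions, not docdeid's code: `MiniTrie` is a hypothetical name, modifiers are plain `str -> str` callables, and returning the original (unmodified) tokens of the match is a choice made for this sketch that the text above does not specify.

```python
from typing import Callable, Optional


class MiniTrie:
    """Hypothetical stand-in for a trie over lists of strings."""

    def __init__(self, modifiers: Optional[list[Callable[[str], str]]] = None) -> None:
        self.modifiers = modifiers or []
        self.children: dict[str, "MiniTrie"] = {}
        self.is_terminal = False  # True if an added item ends at this node.

    def _normalize(self, token: str) -> str:
        for modify in self.modifiers:
            token = modify(token)
        return token

    def add_item(self, item: list[str]) -> None:
        node = self
        for token in item:
            key = self._normalize(token)
            if key not in node.children:
                node.children[key] = MiniTrie(self.modifiers)
            node = node.children[key]
        node.is_terminal = True

    def longest_matching_prefix(
        self, item: list[str], start_i: int = 0
    ) -> Optional[list[str]]:
        node: Optional[MiniTrie] = self
        best_end = None  # Exclusive end index of the longest match so far.
        for i in range(start_i, len(item)):
            node = node.children.get(self._normalize(item[i]))
            if node is None:
                break
            if node.is_terminal:
                best_end = i + 1
        if best_end is None:
            return None
        # Only the matched span is sliced; the input list is never copied.
        return item[start_i:best_end]


trie = MiniTrie(modifiers=[str.lower])
trie.add_item(["new", "york"])
trie.add_item(["new", "york", "city"])

tokens = ["in", "New", "York", "today"]
print(trie.longest_matching_prefix(tokens, start_i=1))  # ['New', 'York']
print(trie.longest_matching_prefix(tokens, start_i=0))  # None
```

Passing start_i lets a caller scan a tokenized text position by position without building a new list for each starting point, which is the use the parameter description points at.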