Combining Tokenizer's custom prefix and URL matcher #10987
-
Hello everyone. I'm having difficulties identifying URLs in a key-value text dump. Example code: … When I run it I get: … but for … I've tried adding …
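The example code did not survive in this export; a minimal reproduction of the kind of failure being described might look like this (assumption: the key name and URL below are invented for illustration, they are not from the original post):

```python
import spacy

# Hypothetical reproduction: the original example code was not preserved,
# so the key name and URL below are invented for illustration.
nlp = spacy.blank("en")

# A bare URL is caught by url_match and kept as a single token.
print([t.text for t in nlp("https://my-site.example.com")])

# With a "key=" in front, url_match never matches the whole string, and
# the default infix rules then split on "=" and on "-" inside the URL.
doc = nlp("url=https://my-site.example.com")
print([t.text for t in doc])
```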
-
The `TOKEN_MATCH` and `URL_MATCH` patterns are only checked while handling prefixes and suffixes, not after splitting on infixes. There's a summary of the algorithm at the bottom of the expandable box in the docs: https://spacy.io/usage/linguistic-features#how-tokenizer-works

If you want the existing `URL_MATCH` pattern to apply here, you'd need to handle the other parts as prefixes. If you only need to handle cases like `\w+=`, you could add a prefix pattern that splits on anything before `=` and then also add `=` as its own prefix. This is similar to how French splits `l'` as a prefix, but only when it's followed by an alphabetic character.

The other alternative, if you frequently have URLs embedded in longer tokens, is to add a URL pattern as an infix.
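A sketch of the prefix approach (assumptions: the `\w+(?==)` pattern and the sample `url=` key are illustrative choices, not from the original thread):

```python
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("en")

# Assumption: keys look like \w+=. Add a prefix that strips the key
# before "=", and "=" itself as a prefix, so the tokenizer reaches the
# URL-match step with a clean "https://..." substring.
prefixes = list(nlp.Defaults.prefixes) + [r"\w+(?==)", r"="]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp("url=https://my-site.example.com")
print([t.text for t in doc])
```

Once `url` and `=` are consumed as prefixes, `URL_MATCH` is checked against the remainder, so the URL should survive as a single token.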
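A sketch of the infix alternative (assumption: the deliberately simplified URL regex below stands in for a stricter real-world pattern, such as the one spaCy itself uses for `URL_MATCH`):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Assumption: this simplified pattern stands in for a stricter URL regex.
# Putting it first in the infix list means an embedded URL is emitted as
# one infix token before the default rules can split on "=", "-" or "/".
url_infix = r"https?://\S+"
infixes = [url_infix] + list(nlp.Defaults.infixes)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("url=https://my-site.example.com")
print([t.text for t in doc])
```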
-
Going over the algorithm really made the logic clear. Thanks to you both for the reference and the working examples.