Combining Tokenizer's custom prefix and URL matcher #10987

The `TOKEN_MATCH` and `URL_MATCH` patterns are only checked while handling prefixes and suffixes, not after splitting on infixes. There's a summary of the algorithm in the expandable box in the docs: https://spacy.io/usage/linguistic-features#how-tokenizer-works
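
To make this concrete, here is a minimal sketch with a blank English pipeline (spaCy v3 defaults assumed; the `key=URL` example string is illustrative, not from the original question):

```python
import spacy

nlp = spacy.blank("en")

# On its own, the URL is kept whole: no prefix or suffix is consumed,
# so the tokenizer checks the whole substring against url_match.
print([t.text for t in nlp("https://example.com/a,b")])
# ['https://example.com/a,b']

# With "key=" in front, url_match fails on the full string, so the
# default infix patterns take over (splitting on "=", "/" and ",").
# The URL fragment produced by the infix split is never re-checked
# against URL_MATCH.
print([t.text for t in nlp("key=https://example.com/a,b")])
# with the v3 default infixes, roughly:
# ['key', '=', 'https://example.com', '/', 'a', ',', 'b']
```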

If you want the existing `URL_MATCH` pattern to apply here, you'd need to handle the other parts as prefixes. If you only need to handle cases like `\w+=`, you could add a prefix pattern that splits off anything before `=` and then also add `=` as its own prefix (see the sketch below). This is similar to how French splits `l'` off as a prefix, but only when an alphabetic character follows.
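
A minimal sketch of that workaround, again assuming spaCy v3 defaults; the exact prefix patterns here are one possible way to implement the suggestion, not code from the thread:

```python
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("en")

# Split anything before "=" off as a prefix (only when "=" actually
# follows, via a lookahead), and add "=" as its own prefix.
prefixes = list(nlp.Defaults.prefixes) + [r"\w+(?==)", r"="]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

# Once "key" and "=" are consumed as prefixes, the remaining substring
# is checked against url_match and the URL survives as a single token.
print([t.text for t in nlp("key=https://example.com/a,b")])
# ['key', '=', 'https://example.com/a,b']
```

The lookahead in `\w+(?==)` plays the same role as the "following alphabetic character" condition in the French `l'` prefix: the split only fires when `=` actually follows, so ordinary words are left alone.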

The other alternative if you frequently have URLs em…

Answer selected by adrianeboyd