Combining Tokenizer's custom prefix and URL matcher #10987
-
Hello everyone. I'm having difficulties identifying URLs in a key-value text dump. Example code: … When I run it I get: … but for … I've tried adding …
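The example code did not survive in this export; a minimal reproduction of the kind of failure being described might look like this (assumption: the key name and URL below are invented for illustration, they are not from the original post):

```python
import spacy

# Hypothetical reproduction: the original example code was not preserved,
# so the key name and URL below are invented for illustration.
nlp = spacy.blank("en")

# A bare URL is caught by url_match and kept as a single token.
print([t.text for t in nlp("https://my-site.example.com")])

# With a "key=" in front, url_match never matches the whole string, and
# the default infix rules then split on "=" and on "-" inside the URL.
doc = nlp("url=https://my-site.example.com")
print([t.text for t in doc])
```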
-
The `TOKEN_MATCH` and `URL_MATCH` patterns are only checked while handling prefixes and suffixes, not after splitting on infixes. There's a summary of the algorithm at the bottom of the expandable box in the docs: https://spacy.io/usage/linguistic-features#how-tokenizer-works

If you want the existing `URL_MATCH` pattern to apply here, you'd need to handle the other parts as prefixes. If you only need to handle cases like `\w+=`, you could add a prefix pattern that splits on anything before `=` and then also add `=` as its own prefix. This is similar to how French splits `l'` as a prefix, but only when it's followed by an alphabetic character.

The other alternative, if you frequently have URLs embedded in longer tokens, is to add a URL pattern as an infix.
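A sketch of the prefix approach (assumptions: the `\w+(?==)` pattern and the sample `url=` key are illustrative choices, not from the original thread):

```python
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("en")

# Assumption: keys look like \w+=. Add a prefix that strips the key
# before "=", and "=" itself as a prefix, so the tokenizer reaches the
# URL-match step with a clean "https://..." substring.
prefixes = list(nlp.Defaults.prefixes) + [r"\w+(?==)", r"="]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp("url=https://my-site.example.com")
print([t.text for t in doc])
```

Once `url` and `=` are consumed as prefixes, `URL_MATCH` is checked against the remainder, so the URL should survive as a single token.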
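A sketch of the infix alternative (assumption: the deliberately simplified URL regex below stands in for a stricter real-world pattern, such as the one spaCy itself uses for `URL_MATCH`):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Assumption: this simplified pattern stands in for a stricter URL regex.
# Putting it first in the infix list means an embedded URL is emitted as
# one infix token before the default rules can split on "=", "-" or "/".
url_infix = r"https?://\S+"
infixes = [url_infix] + list(nlp.Defaults.infixes)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("url=https://my-site.example.com")
print([t.text for t in doc])
```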
-
Going over the algorithm really made the logic clear. Thanks to you both for the reference and the working examples.