russian text not matched #251

Closed
acidnik opened this issue Nov 25, 2016 · 3 comments
Labels
bug A bug.

Comments

@acidnik

acidnik commented Nov 25, 2016

$ cat test.txt 
привет
Привет
ПрИвЕт
$ rg -i привет test.txt 
$ grep -i привет test.txt 
привет
Привет
ПрИвЕт
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8
$ file test.txt
test.txt: UTF-8 Unicode text
$ rg --version
ripgrep 0.3.0

Looks like a bug to me.

BurntSushi added the bug label Nov 25, 2016
@BurntSushi
Owner

Hmm, yes, it does appear to be a bug. I don't know the source of it yet, but it does at least appear that the literals detected are correct:

$ rg привет test.txt --debug -i
DEBUG:grep::search: regex ast:
Literal {
    chars: [
        'п',
        'р',
        'и',
        'в',
        'е',
        'т'
    ],
    casei: true
}
DEBUG:grep::literals: required literals found: [Cut(ПРИВ), Cut(пРИВ), Cut(ПрИВ), Cut(прИВ), Cut(ПРиВ), Cut(пРиВ), Cut(ПриВ), Cut(приВ), Cut(ПРИв), Cut(пРИв), Cut(ПрИв), Cut(прИв), Cut(ПРив), Cut(пРив), Cut(Прив), Cut(прив)]
DEBUG:rg::args: will try to use memory maps

Notably Прив is in the set (second from last), and searching that explicitly works:

$ rg 'Прив' test.txt
2:Привет

In fact, adding a -i flag works too:

$ rg 'Прив' test.txt -i
1:привет
2:Привет
3:ПрИвЕт
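
For readers wondering where the sixteen `Cut(...)` entries in the debug output come from: they are every upper/lower combination of the four-character prefix `прив` (2^4 = 16). The Python sketch below only illustrates that enumeration and reproduces the order shown in the log. It is not ripgrep's actual literal-extraction code, and the helper name `case_variants` is made up for this example.

```python
from itertools import product


def case_variants(prefix):
    """Return every upper/lower case combination of `prefix`.

    Hypothetical helper for illustration only; ripgrep's real literal
    extraction lives in its Rust code and is not reproduced here.
    """
    # Each character contributes two choices: (uppercase, lowercase).
    choices = [(c.upper(), c.lower()) for c in prefix]
    # itertools.product varies the LAST position fastest. Reversing the
    # input and then each result makes the FIRST character vary fastest,
    # which is the order seen in the debug log.
    return ["".join(reversed(combo)) for combo in product(*reversed(choices))]


variants = case_variants("прив")
print(len(variants))
print(variants)
```

With this ordering, `Прив` lands second from last and `прив` last, matching the log above.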

Regardless of what's going on, the following should never happen:

$ rg -i привет test.txt
$ rg привет test.txt 
1:привет
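
The property being violated here is that a case-insensitive search must match at least every line the case-sensitive search matches. A minimal sketch of that check, using Python's `re` module as a stand-in for the expected behavior (this is not how ripgrep is tested; `matching_line_numbers` is a hypothetical helper, and the three lines are the contents of `test.txt` from the report):

```python
import re

# Contents of test.txt from the original report.
lines = ["привет", "Привет", "ПрИвЕт"]


def matching_line_numbers(pattern, lines, ignore_case=False):
    """Return the 1-based numbers of the lines that match `pattern`."""
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    return [n for n, line in enumerate(lines, start=1) if rx.search(line)]


sensitive = matching_line_numbers("привет", lines)
insensitive = matching_line_numbers("привет", lines, ignore_case=True)

# The invariant: -i may add matches but must never lose any.
assert set(sensitive) <= set(insensitive)
print(sensitive, insensitive)
```

In the buggy run above, the `-i` search returned nothing while the plain search returned line 1, so the subset check would fail.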

@BurntSushi
Owner

I'll be on vacation for the next few days, so unfortunately this bug will have to linger.

@BurntSushi
Owner

This was a subtly gross bug, but it should be fixed now. See the commit message for details.
