Generating checksums with duplicates plugin fails #2873

lhupitr · 2018-04-09T04:12:34Z

Problem

I can't get the duplicates plugin to generate checksums (neither CRC nor md5sum) when following the examples suggested in the documentation: https://beets.readthedocs.io/en/stable/plugins/duplicates.html

Running this command in verbose (-vv) mode:

$ beet -vv duplicates -C 'ffmpeg -i {file} -f crc -' title:"Blank Generation"

Led to this problem:

Sending event: pluginload
artresizer: method is (2, (7, 0, 7))
thumbnails: using IM to write metadata
thumbnails: using GIO to compute URIs
inline: adding item field multidisc
library database: /net/nfs4/server/mediacentre/Music/Catalogs/beets_seeding.blb
library directory: /net/nfs4/server/mediacentre/Music/Library
Sending event: library_opened
duplicates: key ffmpeg on item /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac not cached:computing checksum
duplicates: failed to checksum /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac: Command 'ffmpeg -i b'/net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac' -f crc -' returned non-zero exit status 1.
duplicates: all keys ['ffmpeg'] on item /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac are null or empty: skipping
Sending event: cli_exit

However running directly from ffmpeg produces a checksum and exits with 0:

$ ffmpeg -i '/media/mediacentre/Mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac' -f crc -

...
    Metadata:
      encoder         : Lavc57.107.100 pcm_s16le
    Side data:
      replaygain: track gain - -4.800000, track peak - 0.000023, album gain - -3.150000, album peak - 0.000023, 
Invalid return value 0 for stream protocol
CRC=0xfd05c9a4
size=       0kB time=00:02:45.20 bitrate=   0.0kbits/s speed= 354x    
video:0kB audio:28458kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown

Setup

OS: Arch linux
Python version: 3.6.4
beets version: 1.4.6
Turning off plugins made problem go away (yes/no): N/A

My configuration (output of beet config) is:

importfeeds:
    relative_to: !!binary |
        L25ldC9uZnM0L3NlcnZlci9tZWRpYWNlbnRyZS9NdXNpYy9MaWJyYXJ5
    formats: m3u
    dir: /media/mediacentre/Mediacentre/Music/Playlists/
    m3u_name: imported.m3u
    absolute_path: no
lyrics:
    bing_lang_from: []
    auto: no
    google_API_key: REDACTED
    sources: google
    bing_client_secret: REDACTED
    bing_lang_to:
    google_engine_ID: REDACTED
    genius_api_key: REDACTED
    fallback:
    force: no
    local: no
directory: /net/nfs4/server/mediacentre/Music/Library
library: /net/nfs4/server/mediacentre/Music/Catalogs/beets_seeding.blb

plugins: inline edit fromfilename zero rewrite fuzzy info missing random types replaygain smartplaylist duplicates badfiles play chroma absubmit acousticbrainz mbsync mbsubmit lastimport embedart fetchart thumbnails importfeeds lyrics wlg popularity
pluginpath: ~/projects/whatlastgenre/plugin/beets/beetsplug
per_disc_numbering: yes
threaded: no

import:
    write: yes
    copy: yes
    hardlink: no
    autotag: yes
    incremental: no
    resume: yes
    quiet_fallback: skip
    none_rec_action: ask
    timid: yes
    log: ~/.config/beets/beetslog.txt
    detail: yes
    bell: yes
    from_scratch: yes

replace:
    '[\\/]': _
    ^\.: _
    '[\x00-\x1f]': _
    \.$: _
    \s+$: ''
    ^\s+: ''
    ^-: _

ui:
    color: yes

paths:
    default: $albumartist_credit/($original_year) $album%aunique{albumartist album year,albumtype label catalognum albumdisambig}/%if{$multidisc,Disc${disc}%if{$disctitle, - $disctitle}/}$track $title
    singleton: $artist/-Misc-/$title
    comp: Various Artists/$album%aunique{albumartist album year,albumtype label catalognum albumdisambig} ($year)/%if{$multidisc,Disc$disc%if{$disctitle, - $disctitle}/}$track $title
    ext:cue: $albumpath/$artist - $album
    ext:log: $albumpath/$artist - $album
    ext:jpg: $albumpath/cover
    ext:jpeg: $albumpath/cover
    ext:png: $albumpath/cover
item_fields:
    multidisc: 1 if disctotal > 1 else 0

match:
    strong_rec_thresh: 0.04
    medium_rec_thresh: 0.25
    rec_gap_thresh: 0.25
    max_rec:
        missing_tracks: medium
        unmatched_tracks: medium

musicbrainz:
    searchlimit: 5
art_filename: cover
fetchart:
    auto: no
    cautious: yes
    minwidth: 450
    maxwidth: 1010
    enforce_ratio: 1.5%
    sources: filesystem albumart amazon wikipedia coverart fanarttv google
    google_key: REDACTED
    store_source: yes
    cover_names:
    - cover
    - front
    - art
    - album
    - folder
    google_engine: 001442825323518660753:hrh5ch1gjzm
    fanarttv_key: REDACTED
embedart:
    auto: no
    maxwidth: 300
    compare_threshold: 0
    ifempty: no
    remove_art_file: no
thumbnails:
    auto: yes
    force: no
    dolphin: no
replaygain:
    auto: yes
    backend: bs1770gain
    overwrite: yes
    targetlevel: 89
    r128: [Opus]
    chunk_at: 5000
    method: replaygain
wlg:
    auto: no
    force: yes
    count: 4
    whitelist: /home/user/.whatlastgenre/genres.txt
    separator: ', '
mpdstats:
    rating_mix: 0.75
zero:
    auto: yes
    fields: images
    update_database: yes
    keep_fields: []
smartplaylist:
    relative_to: /net/nfs4/server/mediacentre/Music/Library/
    playlist_dir: /net/nfs4/server/mediacentre/Music/Playlists/
    playlists: [{name: Loved.m3u, query: 'loved:1'}, {name: Genre_acoustic-country-folk-americana.m3u, album_query: ['genre:Folk', 'genre:Country', 'genre:Acoustic', 'genre:Americana']}]
    auto: yes
types:
    loved: int
    play_count: int
lastfm:
    user: user
    api_key: REDACTED
play:
    command: mpv
    use_folders: no
    relative_to:
    raw: no
    warning_threshold: 100
copyartifacts:
    extensions: .cue .log .jpg .jpeg .png
    print_ignored: yes
acousticbrainz:
    auto: no
    force: no
    tags: []
absubmit:
    auto: no
    extractor: /usr/bin/streaming_extractor_music
chroma:
    auto: yes
acoustid:
    apikey: REDACTED
mbsubmit:
    format: $track. $title - $artist ($length)
    threshold: medium
lastimport:
    per_page: 500
    retry_limit: 3
duplicates:
    album: no
    checksum: ''
    copy: ''
    count: no
    delete: no
    format: ''
    full: no
    keys: []
    merge: no
    move: ''
    path: no
    tiebreak: {}
    strict: no
    tag: ''
fuzzy:
    prefix: '~'
    threshold: 0.7
missing:
    count: no
    total: no
    album: no
pathfields: {}
album_fields: {}
rewrite: {}
edit:
    albumfields: album albumartist
    itemfields: track title artist album
    ignore_fields: id path

The text was updated successfully, but these errors were encountered:

sampsyo · 2018-04-10T02:53:53Z

Interesting! I don't have an obvious explanation for what could be going wrong—maybe someone (perhaps you) can do some digging to find out what ffmpeg's output is and why it's exiting with an error when we invoke it.

lhupitr · 2018-04-10T03:18:55Z

As mentioned running the ffmpeg command directly produces a crc checksum without a non zero exit status, also the problem seems to be beyond ffmpeg as the same error occurs with the suggested md5sum command (which also functions fine when executed directly on the command line).

sampsyo · 2018-04-10T03:28:08Z

Right—so the question is what it’s doing (and what its output is) when it’s invoked by beets instead of manually.

ghost · 2018-12-17T16:19:55Z

So, maybe this is a Python thing, but I find this suspicious:

duplicates: failed to checksum /home/ser/Music/Incoming/Electric Light Orchestra/All Over The World_ The Very Best Of ELO [13]/18 Strange Magic.mp3: Command 'md5sum b'/home/ser/Music/Incoming/Electric Light Orchestra/All Over The World_ The Very Best Of ELO [13]/18 Strange Magic.mp3'' returned non-zero exit status 1.

Same as in OP -- what's with the spurious "b" in front of the file name? No matter which command I pass in (sha512sum, md5sum, ffmpeg), they all exit status 1 -- and if duplicates really is putting a "b" in there, no wonder, because that's not the {file}name.

Edit: copy/paste included some newlines; removed those for accuracy.

schmod · 2019-01-06T19:20:33Z

Indeed, that seems to be the problem. In Python 3, because the file path is a bytes object instead of a string, it's being passed to the shell as `"b'/path/to/file'".

Changing the relevant line from:

args = [p.format(file=item.path) for p in shlex.split(prog)]

to

args = [p.format(file=item.path.decode('utf-8')) for p in shlex.split(prog)]

seems to fix things.

A few problems remain, however. Both md5sum and sha512sum include the file's path in their output:

$ md5sum ~/my/file.mp3
ea187811890ede95aa618ecba4f27f57  ./my/file.mp3

Because beets uses this output to determine duplicates, it's never going to mark anything as a duplicate.

Additionally, because beets caches the checksums (using the first argument of the command), if you somehow mistype your checksum command, once you've cached bad fingerprints from md5sum, you're stuck with them forever.

sampsyo · 2019-01-06T22:38:03Z

Thanks for investigating! It seems like you're on the right track. However, not all filenames are encoded with UTF-8, so just using a hard-coded .decode('utf-8') will throw exceptions and produce incorrect output sometimes. The right way to do this will probably be to turn the template into bytes and interpolate on that—because the final command will need to be bytes, not Unicode strings.

That problem with including filenames in the output does seem bad! Maybe we should change the advice to instead recommend that people somehow pipe data into md5sum's standard input rather than passing the filename on its command line?

stale · 2020-07-12T00:10:32Z

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

lhupitr · 2020-07-15T16:38:36Z

This is still a problem for me.

stale · 2020-09-13T16:39:17Z

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

lhupitr · 2020-09-18T15:44:51Z

I'm not sure what needs to be done here, just keeping it open.

wisp3rwind · 2020-09-18T17:15:01Z

Thanks for investigating! It seems like you're on the right track. However, not all filenames are encoded with UTF-8, so just using a hard-coded .decode('utf-8') will throw exceptions and produce incorrect output sometimes. The right way to do this will probably be to turn the template into bytes and interpolate on that—because the final command will need to be bytes, not Unicode strings.

I think this issue is solved in the convert plugin:

beets/beetsplug/convert.py

Lines 207 to 233 in b659ad6

    
           # On Python 3, we need to construct the command to invoke as a 
        
           # Unicode string. On Unix, this is a little unfortunate---the OS is 
        
           # expecting bytes---so we use surrogate escaping and decode with the 
        
           # argument encoding, which is the same encoding that will then be 
        
           # *reversed* to recover the same bytes before invoking the OS. On 
        
           # Windows, we want to preserve the Unicode filename "as is." 
        
           if not six.PY2: 
        
               command = command.decode(util.arg_encoding(), 'surrogateescape') 
        
               if platform.system() == 'Windows': 
        
                   source = source.decode(util._fsencoding()) 
        
                   dest = dest.decode(util._fsencoding()) 
        
               else: 
        
                   source = source.decode(util.arg_encoding(), 'surrogateescape') 
        
                   dest = dest.decode(util.arg_encoding(), 'surrogateescape') 
        
           # Substitute $source and $dest in the argument list. 
        
           args = shlex.split(command) 
        
           encode_cmd = [] 
        
           for i, arg in enumerate(args): 
        
               args[i] = Template(arg).safe_substitute({ 
        
                   'source': source, 
        
                   'dest': dest, 
        
               }) 
        
               if six.PY2: 
        
                   encode_cmd.append(args[i]) 
        
               else: 
        
                   encode_cmd.append(args[i].encode(util.arg_encoding()))

So the solution is probably to move this to a separate function in beets.util and apply that in the duplicates plugin.

That problem with including filenames in the output does seem bad! Maybe we should change the advice to instead recommend that people somehow pipe data into md5sum's standard input rather than passing the filename on its command line?

I think that's not easily doable without writing a separate script and passing it as the argument to -C. Piping effectively requires running the command through a shell with all the implications about proper escaping.
For example, the following (untested!) incantations might work, but I guess they're not nice advice for the docs due to the amount of escaping required (which means it's non-trivial to modify these commands):

beet dup -C 'sh -c "md5sum < \"$1\"" {file}' ...
beet dup -C 'sh -c "md5sum \"$1\" | awk '"'"'{print $1}'"'"'" {file}' ...

Note that if the plugin would run the -Cargument through command_output(..., shell=True), the {file} itself would need to be quoted properly, which doesn't exactly simplify things.

So maybe the advice should be to create a script

#! /usr/bin/env sh

md5sum < "$1"

or

#! /usr/bin/env sh

md5sum "$1" | awk '{printf $1}'

and use it with

beet dup -C 'myscript {file}' ...

I'll remove the needinfo label since I think that the above implies that there's a clear path forward. I won't implement any of this myself, though, I'm not familiar with the duplicates plugin at all and don't use it myself.

sampsyo · 2020-09-18T19:41:59Z

Thanks, @wisp3rwind! I think you have the right fix there.

zwiebelspaetzle · 2020-12-04T05:14:36Z

This is an issue for me as well.

arogl · 2021-02-28T04:58:49Z

I have replaced line 200 with the following block and it is now computing checksums, currently testing on Ubuntu 20.04, Python 3.8

        if not six.PY2:
            if platform.system() == 'Windows':
                args = [p.format(file=item.path.decode(util._fsencoding()))
                        for p in shlex.split(prog)]
            else:
                args = [p.format(file=item.path.decode(util.arg_encoding(),
                        'surrogateescape')) for p in shlex.split(prog)]

I tried to add a

prog = prog.decode(util.arg_encoding(), 'surrogateescape'))

but I got an error:

AttributeError: 'str' object has no attribute 'decode'

I am not sure if the prog needs the decoding?

Thoughts?

sampsyo added the needinfo We need more details or follow-up from the filer before this can be tagged "bug" or "feature." label Apr 10, 2018

stale bot added the stale label Jul 12, 2020

stale bot removed the stale label Jul 15, 2020

stale bot added the stale label Sep 13, 2020

stale bot removed the stale label Sep 18, 2020

wisp3rwind added bug bugs that are confirmed and actionable and removed needinfo We need more details or follow-up from the filer before this can be tagged "bug" or "feature." labels Sep 18, 2020

arogl mentioned this issue May 8, 2021

Attempt to fix duplicates #3930

Merged

2 tasks

wisp3rwind mentioned this issue May 14, 2021

Duplicate Plugin: beet dup -C "md5sum {file}" doesn't calculate checksum on Win10 cmd (incl ad-hoc fix) #3943

Closed

wisp3rwind closed this as completed in #3930 Jun 15, 2021

wisp3rwind mentioned this issue Jun 15, 2021

Duplicates: Improve docs on using external commands #3979

Open

snejus mentioned this issue Dec 5, 2024

Release: Fix changelog formatting #5529

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generating checksums with duplicates plugin fails #2873

Generating checksums with duplicates plugin fails #2873

lhupitr commented Apr 9, 2018

sampsyo commented Apr 10, 2018

lhupitr commented Apr 10, 2018

sampsyo commented Apr 10, 2018

ghost commented Dec 17, 2018 •

edited by ghost

Loading

schmod commented Jan 6, 2019

sampsyo commented Jan 6, 2019

stale bot commented Jul 12, 2020

lhupitr commented Jul 15, 2020

stale bot commented Sep 13, 2020

lhupitr commented Sep 18, 2020

wisp3rwind commented Sep 18, 2020

sampsyo commented Sep 18, 2020

zwiebelspaetzle commented Dec 4, 2020

arogl commented Feb 28, 2021

Generating checksums with duplicates plugin fails #2873

Generating checksums with duplicates plugin fails #2873

Comments

lhupitr commented Apr 9, 2018

Problem

Setup

sampsyo commented Apr 10, 2018

lhupitr commented Apr 10, 2018

sampsyo commented Apr 10, 2018

ghost commented Dec 17, 2018 • edited by ghost Loading

schmod commented Jan 6, 2019

sampsyo commented Jan 6, 2019

stale bot commented Jul 12, 2020

lhupitr commented Jul 15, 2020

stale bot commented Sep 13, 2020

lhupitr commented Sep 18, 2020

wisp3rwind commented Sep 18, 2020

sampsyo commented Sep 18, 2020

zwiebelspaetzle commented Dec 4, 2020

arogl commented Feb 28, 2021

ghost commented Dec 17, 2018 •

edited by ghost

Loading