Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generating checksums with duplicates plugin fails #2873

Closed
lhupitr opened this issue Apr 9, 2018 · 14 comments · Fixed by #3930
Closed

Generating checksums with duplicates plugin fails #2873

lhupitr opened this issue Apr 9, 2018 · 14 comments · Fixed by #3930
Labels
bug bugs that are confirmed and actionable

Comments

@lhupitr
Copy link

lhupitr commented Apr 9, 2018

Problem

I can't get the duplicates plugin to generate checksums (neither CRC nor md5sum) when following the examples suggested in the documentation: https://beets.readthedocs.io/en/stable/plugins/duplicates.html

Running this command in verbose (-vv) mode:

$ beet -vv duplicates -C 'ffmpeg -i {file} -f crc -' title:"Blank Generation"

Led to this problem:

Sending event: pluginload
artresizer: method is (2, (7, 0, 7))
thumbnails: using IM to write metadata
thumbnails: using GIO to compute URIs
inline: adding item field multidisc
library database: /net/nfs4/server/mediacentre/Music/Catalogs/beets_seeding.blb
library directory: /net/nfs4/server/mediacentre/Music/Library
Sending event: library_opened
duplicates: key ffmpeg on item /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac not cached:computing checksum
duplicates: failed to checksum /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac: Command 'ffmpeg -i b'/net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac' -f crc -' returned non-zero exit status 1.
duplicates: all keys ['ffmpeg'] on item /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac are null or empty: skipping
Sending event: cli_exit

However running directly from ffmpeg produces a checksum and exits with 0:

$ ffmpeg -i '/media/mediacentre/Mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac' -f crc -
...
    Metadata:
      encoder         : Lavc57.107.100 pcm_s16le
    Side data:
      replaygain: track gain - -4.800000, track peak - 0.000023, album gain - -3.150000, album peak - 0.000023, 
Invalid return value 0 for stream protocol
CRC=0xfd05c9a4
size=       0kB time=00:02:45.20 bitrate=   0.0kbits/s speed= 354x    
video:0kB audio:28458kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown

Setup

  • OS: Arch linux
  • Python version: 3.6.4
  • beets version: 1.4.6
  • Turning off plugins made problem go away (yes/no): N/A

My configuration (output of beet config) is:

importfeeds:
    relative_to: !!binary |
        L25ldC9uZnM0L3NlcnZlci9tZWRpYWNlbnRyZS9NdXNpYy9MaWJyYXJ5
    formats: m3u
    dir: /media/mediacentre/Mediacentre/Music/Playlists/
    m3u_name: imported.m3u
    absolute_path: no
lyrics:
    bing_lang_from: []
    auto: no
    google_API_key: REDACTED
    sources: google
    bing_client_secret: REDACTED
    bing_lang_to:
    google_engine_ID: REDACTED
    genius_api_key: REDACTED
    fallback:
    force: no
    local: no
directory: /net/nfs4/server/mediacentre/Music/Library
library: /net/nfs4/server/mediacentre/Music/Catalogs/beets_seeding.blb

plugins: inline edit fromfilename zero rewrite fuzzy info missing random types replaygain smartplaylist duplicates badfiles play chroma absubmit acousticbrainz mbsync mbsubmit lastimport embedart fetchart thumbnails importfeeds lyrics wlg popularity
pluginpath: ~/projects/whatlastgenre/plugin/beets/beetsplug
per_disc_numbering: yes
threaded: no

import:
    write: yes
    copy: yes
    hardlink: no
    autotag: yes
    incremental: no
    resume: yes
    quiet_fallback: skip
    none_rec_action: ask
    timid: yes
    log: ~/.config/beets/beetslog.txt
    detail: yes
    bell: yes
    from_scratch: yes

replace:
    '[\\/]': _
    ^\.: _
    '[\x00-\x1f]': _
    \.$: _
    \s+$: ''
    ^\s+: ''
    ^-: _

ui:
    color: yes

paths:
    default: $albumartist_credit/($original_year) $album%aunique{albumartist album year,albumtype label catalognum albumdisambig}/%if{$multidisc,Disc${disc}%if{$disctitle, - $disctitle}/}$track $title
    singleton: $artist/-Misc-/$title
    comp: Various Artists/$album%aunique{albumartist album year,albumtype label catalognum albumdisambig} ($year)/%if{$multidisc,Disc$disc%if{$disctitle, - $disctitle}/}$track $title
    ext:cue: $albumpath/$artist - $album
    ext:log: $albumpath/$artist - $album
    ext:jpg: $albumpath/cover
    ext:jpeg: $albumpath/cover
    ext:png: $albumpath/cover
item_fields:
    multidisc: 1 if disctotal > 1 else 0

match:
    strong_rec_thresh: 0.04
    medium_rec_thresh: 0.25
    rec_gap_thresh: 0.25
    max_rec:
        missing_tracks: medium
        unmatched_tracks: medium

musicbrainz:
    searchlimit: 5
art_filename: cover
fetchart:
    auto: no
    cautious: yes
    minwidth: 450
    maxwidth: 1010
    enforce_ratio: 1.5%
    sources: filesystem albumart amazon wikipedia coverart fanarttv google
    google_key: REDACTED
    store_source: yes
    cover_names:
    - cover
    - front
    - art
    - album
    - folder
    google_engine: 001442825323518660753:hrh5ch1gjzm
    fanarttv_key: REDACTED
embedart:
    auto: no
    maxwidth: 300
    compare_threshold: 0
    ifempty: no
    remove_art_file: no
thumbnails:
    auto: yes
    force: no
    dolphin: no
replaygain:
    auto: yes
    backend: bs1770gain
    overwrite: yes
    targetlevel: 89
    r128: [Opus]
    chunk_at: 5000
    method: replaygain
wlg:
    auto: no
    force: yes
    count: 4
    whitelist: /home/user/.whatlastgenre/genres.txt
    separator: ', '
mpdstats:
    rating_mix: 0.75
zero:
    auto: yes
    fields: images
    update_database: yes
    keep_fields: []
smartplaylist:
    relative_to: /net/nfs4/server/mediacentre/Music/Library/
    playlist_dir: /net/nfs4/server/mediacentre/Music/Playlists/
    playlists: [{name: Loved.m3u, query: 'loved:1'}, {name: Genre_acoustic-country-folk-americana.m3u, album_query: ['genre:Folk', 'genre:Country', 'genre:Acoustic', 'genre:Americana']}]
    auto: yes
types:
    loved: int
    play_count: int
lastfm:
    user: user
    api_key: REDACTED
play:
    command: mpv
    use_folders: no
    relative_to:
    raw: no
    warning_threshold: 100
copyartifacts:
    extensions: .cue .log .jpg .jpeg .png
    print_ignored: yes
acousticbrainz:
    auto: no
    force: no
    tags: []
absubmit:
    auto: no
    extractor: /usr/bin/streaming_extractor_music
chroma:
    auto: yes
acoustid:
    apikey: REDACTED
mbsubmit:
    format: $track. $title - $artist ($length)
    threshold: medium
lastimport:
    per_page: 500
    retry_limit: 3
duplicates:
    album: no
    checksum: ''
    copy: ''
    count: no
    delete: no
    format: ''
    full: no
    keys: []
    merge: no
    move: ''
    path: no
    tiebreak: {}
    strict: no
    tag: ''
fuzzy:
    prefix: '~'
    threshold: 0.7
missing:
    count: no
    total: no
    album: no
pathfields: {}
album_fields: {}
rewrite: {}
edit:
    albumfields: album albumartist
    itemfields: track title artist album
    ignore_fields: id path
@sampsyo sampsyo added the needinfo We need more details or follow-up from the filer before this can be tagged "bug" or "feature." label Apr 10, 2018
@sampsyo
Copy link
Member

sampsyo commented Apr 10, 2018

Interesting! I don't have an obvious explanation for what could be going wrong—maybe someone (perhaps you) can do some digging to find out what ffmpeg's output is and why it's exiting with an error when we invoke it.

@lhupitr
Copy link
Author

lhupitr commented Apr 10, 2018

As mentioned running the ffmpeg command directly produces a crc checksum without a non zero exit status, also the problem seems to be beyond ffmpeg as the same error occurs with the suggested md5sum command (which also functions fine when executed directly on the command line).

@sampsyo
Copy link
Member

sampsyo commented Apr 10, 2018

Right—so the question is what it’s doing (and what its output is) when it’s invoked by beets instead of manually.

@ghost
Copy link

ghost commented Dec 17, 2018

So, maybe this is a Python thing, but I find this suspicious:

duplicates: failed to checksum /home/ser/Music/Incoming/Electric Light Orchestra/All Over The World_ The Very Best Of ELO [13]/18 Strange Magic.mp3: Command 'md5sum b'/home/ser/Music/Incoming/Electric Light Orchestra/All Over The World_ The Very Best Of ELO [13]/18 Strange Magic.mp3'' returned non-zero exit status 1.

Same as in OP -- what's with the spurious "b" in front of the file name? No matter which command I pass in (sha512sum, md5sum, ffmpeg), they all exit status 1 -- and if duplicates really is putting a "b" in there, no wonder, because that's not the {file}name.

Edit: copy/paste included some newlines; removed those for accuracy.

@schmod
Copy link

schmod commented Jan 6, 2019

Indeed, that seems to be the problem. In Python 3, because the file path is a bytes object instead of a string, it's being passed to the shell as `"b'/path/to/file'".

Changing the relevant line from:

args = [p.format(file=item.path) for p in shlex.split(prog)]

to

args = [p.format(file=item.path.decode('utf-8')) for p in shlex.split(prog)]

seems to fix things.


A few problems remain, however. Both md5sum and sha512sum include the file's path in their output:

$ md5sum ~/my/file.mp3
ea187811890ede95aa618ecba4f27f57  ./my/file.mp3

Because beets uses this output to determine duplicates, it's never going to mark anything as a duplicate.

Additionally, because beets caches the checksums (using the first argument of the command), if you somehow mistype your checksum command, once you've cached bad fingerprints from md5sum, you're stuck with them forever.

@sampsyo
Copy link
Member

sampsyo commented Jan 6, 2019

Thanks for investigating! It seems like you're on the right track. However, not all filenames are encoded with UTF-8, so just using a hard-coded .decode('utf-8') will throw exceptions and produce incorrect output sometimes. The right way to do this will probably be to turn the template into bytes and interpolate on that—because the final command will need to be bytes, not Unicode strings.

That problem with including filenames in the output does seem bad! Maybe we should change the advice to instead recommend that people somehow pipe data into md5sum's standard input rather than passing the filename on its command line?

@stale
Copy link

stale bot commented Jul 12, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 12, 2020
@lhupitr
Copy link
Author

lhupitr commented Jul 15, 2020

This is still a problem for me.

@stale stale bot removed the stale label Jul 15, 2020
@stale
Copy link

stale bot commented Sep 13, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 13, 2020
@lhupitr
Copy link
Author

lhupitr commented Sep 18, 2020

I'm not sure what needs to be done here, just keeping it open.

@stale stale bot removed the stale label Sep 18, 2020
@wisp3rwind
Copy link
Member

Thanks for investigating! It seems like you're on the right track. However, not all filenames are encoded with UTF-8, so just using a hard-coded .decode('utf-8') will throw exceptions and produce incorrect output sometimes. The right way to do this will probably be to turn the template into bytes and interpolate on that—because the final command will need to be bytes, not Unicode strings.

I think this issue is solved in the convert plugin:

beets/beetsplug/convert.py

Lines 207 to 233 in b659ad6

# On Python 3, we need to construct the command to invoke as a
# Unicode string. On Unix, this is a little unfortunate---the OS is
# expecting bytes---so we use surrogate escaping and decode with the
# argument encoding, which is the same encoding that will then be
# *reversed* to recover the same bytes before invoking the OS. On
# Windows, we want to preserve the Unicode filename "as is."
if not six.PY2:
command = command.decode(util.arg_encoding(), 'surrogateescape')
if platform.system() == 'Windows':
source = source.decode(util._fsencoding())
dest = dest.decode(util._fsencoding())
else:
source = source.decode(util.arg_encoding(), 'surrogateescape')
dest = dest.decode(util.arg_encoding(), 'surrogateescape')
# Substitute $source and $dest in the argument list.
args = shlex.split(command)
encode_cmd = []
for i, arg in enumerate(args):
args[i] = Template(arg).safe_substitute({
'source': source,
'dest': dest,
})
if six.PY2:
encode_cmd.append(args[i])
else:
encode_cmd.append(args[i].encode(util.arg_encoding()))

So the solution is probably to move this to a separate function in beets.util and apply that in the duplicates plugin.

That problem with including filenames in the output does seem bad! Maybe we should change the advice to instead recommend that people somehow pipe data into md5sum's standard input rather than passing the filename on its command line?

I think that's not easily doable without writing a separate script and passing it as the argument to -C. Piping effectively requires running the command through a shell with all the implications about proper escaping.
For example, the following (untested!) incantations might work, but I guess they're not nice advice for the docs due to the amount of escaping required (which means it's non-trivial to modify these commands):

beet dup -C 'sh -c "md5sum < \"$1\"" {file}' ...
beet dup -C 'sh -c "md5sum \"$1\" | awk '"'"'{print $1}'"'"'" {file}' ...

Note that if the plugin would run the -Cargument through command_output(..., shell=True), the {file} itself would need to be quoted properly, which doesn't exactly simplify things.

So maybe the advice should be to create a script

#! /usr/bin/env sh

md5sum < "$1"

or

#! /usr/bin/env sh

md5sum "$1" | awk '{printf $1}'

and use it with

beet dup -C 'myscript {file}' ...

I'll remove the needinfo label since I think that the above implies that there's a clear path forward. I won't implement any of this myself, though, I'm not familiar with the duplicates plugin at all and don't use it myself.

@wisp3rwind wisp3rwind added bug bugs that are confirmed and actionable and removed needinfo We need more details or follow-up from the filer before this can be tagged "bug" or "feature." labels Sep 18, 2020
@sampsyo
Copy link
Member

sampsyo commented Sep 18, 2020

Thanks, @wisp3rwind! I think you have the right fix there.

@zwiebelspaetzle
Copy link

This is an issue for me as well.

@arogl
Copy link
Contributor

arogl commented Feb 28, 2021

I have replaced line 200 with the following block and it is now computing checksums, currently testing on Ubuntu 20.04, Python 3.8

        if not six.PY2:
            if platform.system() == 'Windows':
                args = [p.format(file=item.path.decode(util._fsencoding()))
                        for p in shlex.split(prog)]
            else:
                args = [p.format(file=item.path.decode(util.arg_encoding(),
                        'surrogateescape')) for p in shlex.split(prog)]

I tried to add a

prog = prog.decode(util.arg_encoding(), 'surrogateescape'))

but I got an error:

AttributeError: 'str' object has no attribute 'decode'

I am not sure if the prog needs the decoding?

Thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs that are confirmed and actionable
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants