Add common extensions to Motorola 68k Assembly #4637

Merged: 6 commits, Jan 14, 2020
24 changes: 24 additions & 0 deletions lib/linguist/heuristics.yml
@@ -58,6 +58,10 @@ disambiguations:
     pattern: '^[=-]+(\s|\n)|{{[A-Za-z]'
   - language: AGS Script
     pattern: '^(\/\/.+|((import|export)\s+)?(function|int|float|char)\s+((room|repeatedly|on|game)_)?([A-Za-z]+[A-Za-z_0-9]+)\s*[;\(])'
+- extensions: ['.asm']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
 - extensions: ['.asy']
   rules:
   - language: LTspice Symbol
@@ -191,13 +195,21 @@ disambiguations:
   rules:
   - language: Hack
     pattern: '<\?hh'
+- extensions: ['.i']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
+  - language: SWIG
+    pattern: '^[ \t]*%[a-z_]+\b|^%[{}]$'
 - extensions: ['.ice']
   rules:
   - language: JSON
     pattern: '\A\s*[{\[]'
   - language: Slice
 - extensions: ['.inc']
   rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
   - language: PHP
     pattern: '^<\?(?:php)?'
   - language: SourcePawn
@@ -403,6 +415,10 @@ disambiguations:
     pattern: '^(use |fn |mod |pub |macro_rules|impl|#!?\[)'
   - language: RenderScript
     pattern: '#include|#pragma\s+(rs|version)|__attribute__'
+- extensions: ['.s']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
 - extensions: ['.sc']
   rules:
   - language: SuperCollider
@@ -508,6 +524,14 @@ named_patterns:
   - 'std::\w+'
   fortran: '^(?i:[c*][^abd-z]|      (subroutine|program|end|data)\s|\s*!)'
   key_equals_value: '^[^#!;][^=]*='
+  m68k:
+  - '(?im)\bmoveq(?:\.l)?\s+#(?:\$-?[0-9a-f]{1,3}|%[0-1]{1,8}|-?[0-9]{1,3}),\s*d[0-7]\b'
+  - '(?im)^\s*move(?:\.[bwl])?\s+(?:sr|usp),\s*[^\s]+'
+  - '(?im)^\s*move\.[bwl]\s+.*\b[ad]\d'
+  - '(?im)^\s*movem\.[bwl]\b'
+  - '(?im)^\s*move[mp](?:\.[wl])?\b'
+  - '(?im)^\s*btst\b'
+  - '(?im)^\s*dbra\b'
   objectivec: '^\s*(@(interface|class|protocol|property|end|synchronised|selector|implementation)\b|#import\s+.+\.h[">])'
   perl5: '\buse\s+(?:strict\b|v?5\.)'
   perl6: '^\s*(?:use\s+v6\b|\bmodule\b|\b(?:my\s+)?class\b)'
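
As a rough illustration of how the `m68k` named pattern might behave, the sketch below treats the seven expressions as alternatives and checks three short snippets. It uses Python's `re` module as a stand-in. Linguist evaluates these expressions with Ruby's regex engine, where `(?m)` and `^` have different semantics, so this is an approximation only. The helper `looks_like_m68k` and the three snippets are hypothetical and not part of the pull request.

```python
import re

# The seven expressions listed under `m68k`, copied verbatim from the diff.
M68K_PATTERNS = [
    r'(?im)\bmoveq(?:\.l)?\s+#(?:\$-?[0-9a-f]{1,3}|%[0-1]{1,8}|-?[0-9]{1,3}),\s*d[0-7]\b',
    r'(?im)^\s*move(?:\.[bwl])?\s+(?:sr|usp),\s*[^\s]+',
    r'(?im)^\s*move\.[bwl]\s+.*\b[ad]\d',
    r'(?im)^\s*movem\.[bwl]\b',
    r'(?im)^\s*move[mp](?:\.[wl])?\b',
    r'(?im)^\s*btst\b',
    r'(?im)^\s*dbra\b',
]


def looks_like_m68k(source: str) -> bool:
    """Hypothetical helper: True if any one of the expressions matches."""
    return any(re.search(pattern, source) for pattern in M68K_PATTERNS)


# Illustrative snippets (assumptions, not taken from the pull request).
M68K_SNIPPET = "loop:\n\tmove.l\td0,(a0)+\n\tdbra\td1,loop\n"
X86_SNIPPET = "start:\n\tmov\teax, 1\n\tint\t0x80\n"
MC6809_SNIPPET = "starts:\n\tsta $d004;\n\tldd ,u;\n\tbitb $d00D;\n"

if __name__ == "__main__":
    for name, text in [("m68k", M68K_SNIPPET),
                       ("x86", X86_SNIPPET),
                       ("6809", MC6809_SNIPPET)]:
        print(name, looks_like_m68k(text))
```

Only the first snippet should be flagged. The other two would fall through to whatever rule or classifier comes next.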
16 changes: 15 additions & 1 deletion lib/linguist/languages.yml
@@ -334,6 +334,7 @@ Assembly:
   extensions:
   - ".asm"
   - ".a51"
+  - ".i"
   - ".inc"
   - ".nasm"
   tm_scope: source.assembly
@@ -3256,7 +3257,11 @@ Motorola 68K Assembly:
   aliases:
   - m68k
   extensions:
-  - ".X68"
+  - ".asm"
+  - ".i"
+  - ".inc"
+  - ".s"
Collaborator

Are there any registers or opcodes unique to Motorola we can use to disambiguate assembly files with?

We're definitely going to need some heuristics for .asm and .inc. The latter is particularly important because it sees very general use across a range of (unrelated) languages…

Contributor Author

M68k assembly is easily distinguished from other assembly languages, both by registers and opcodes.

How would such a heuristic look?

Collaborator
@Alhadis, Oct 3, 2019

A regular expression; I'm happy to write it for you, provided you give me the names of substrings guaranteed not (or highly unlikely) to appear in the source code of any other assembly language.

Here are our existing heuristics.

Contributor Author

(?im:moveq\b.*?d\d|move\.[bwl]\s+.*\b[ad]\d|movem\.[bwl]\b|btst\b|dbra\b)

Collaborator

Would it be reasonable to limit the moveq heuristic to match two registers (one address, one data)? From what I hear, 68k is unique for differentiating between the two.

If so, we could try this:

(?xi)
	# Mnemonic
	\b moveq (\.l)? \s+
	
	# Immediate operand
	\#( \$ -? [0-9a-f]{1,3}
	  |    %  [0-1]{1,8}
	  |    -? [0-9]{1,3}
	  )
	, \s*
	
	# Register
	d[0-7] \b

When writing heuristics, it's best to be as specific as possible; anything which doesn't match is passed down to the (less accurate) classification techniques.
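
To make the effect of that specificity concrete, here is a minimal sketch of the `moveq` expression above, checked against a few operands. It is written with Python's `re.VERBOSE` rather than Ruby's extended mode, which is an assumption made for illustration. The name `MOVEQ` and the test strings are hypothetical.

```python
import re

# Verbose-mode transcription of the moveq expression from the comment above.
MOVEQ = re.compile(r"""
    \b moveq (\.l)? \s+          # mnemonic with optional .l size suffix
    \# ( \$ -? [0-9a-f]{1,3}     # hexadecimal immediate
       | %  [0-1]{1,8}           # binary immediate
       | -? [0-9]{1,3}           # decimal immediate
       )
    , \s*
    d[0-7] \b                    # destination must be a data register
""", re.VERBOSE | re.IGNORECASE)

if __name__ == "__main__":
    samples = [
        "moveq #0,d0",           # plain decimal immediate
        "\tmoveq.l\t#$7F, d3",   # hex immediate with size suffix
        "moveq #%1010,d7",       # binary immediate
        "moveq #1,a0",           # address register: rejected
        "moveq #1000,d0",        # four digits: rejected
        "mov eax, 0",            # x86 syntax: rejected
    ]
    for line in samples:
        print(repr(line), bool(MOVEQ.search(line)))
```

The first three lines match and the last three do not, which is the behaviour one would want from a narrow heuristic: operands that cannot occur in a valid `moveq` never count as evidence.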

Collaborator

Credit for this expression belongs to @zerkman, since it's taken from the language-m68k grammar we're using to highlight 68k on GitHub (I did clean it up and remove some redundant syntax for clarity).

I've amended the other parts of that expression to use other bits of that grammar, bringing us down to:

(?xim:
	# Mnemonic
	\b moveq (\.l)? \s+
	
	# Immediate operand
	\#( \$ -? [0-9a-f]{1,3}
	  |    %  [0-1]{1,8}
	  |    -? [0-9]{1,3}
	  )
	, \s*
	
	# Register
	d[0-7] \b
	
	| ^ \s* move     (\.[bwl])? \s+ (sr|usp), \s* [^\s]+
	| ^ \s* movem     \.[bwl]  \b
	| ^ \s* move[mp] (\.[wl])? \b
	| ^ \s* btst  \b
	| ^ \s* dbra  \b
)

Notice that I've anchored the remaining parts to match at the beginning of a line (with or without indentation). This reduces the risk of incorrectly matching part of a comment in an unrelated file. For the same reason, I avoid wildcards (.*) wherever possible.
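
The effect of line anchoring can be sketched as follows, comparing an unanchored `btst` expression with the anchored form. Python's `re` is used here as an approximation of the Ruby engine Linguist runs on. The names `UNANCHORED`, `ANCHORED` and both source strings are made up for the example.

```python
import re

UNANCHORED = re.compile(r'(?im)btst\b')      # matches "btst" anywhere in the text
ANCHORED = re.compile(r'(?im)^\s*btst\b')    # only at the start of a line

# Hypothetical C file that merely mentions the mnemonic in a comment.
C_SOURCE = (
    "/* TODO: port the btst loop from the old driver */\n"
    "int main(void) { return 0; }\n"
)

# Hypothetical 68k fragment where btst is a real instruction.
M68K_SOURCE = "wait:\n\tbtst\t#6,$bfe001\n\tbne.s\twait\n"

if __name__ == "__main__":
    print("unanchored on C:   ", bool(UNANCHORED.search(C_SOURCE)))
    print("anchored on C:     ", bool(ANCHORED.search(C_SOURCE)))
    print("anchored on 68k:   ", bool(ANCHORED.search(M68K_SOURCE)))
```

The unanchored form fires on the C comment, while the anchored form matches only where `btst` opens a line.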

Collaborator
@Alhadis, Oct 3, 2019

@idrougge If the above revisions look good to you, then the changes to make to heuristics.yml are below.

I'll still need to test them thoroughly on my end, as well as investigate any possible formats using the .i extension that we've not registered yet.

--- heuristics.yml	2019-10-03 21:45:48.000000000 +1000
+++ heuristics.yml	2019-10-03 22:30:25.000000000 +1000
@@ -49,8 +49,12 @@
   rules:
   - language: ActionScript
     pattern: '^\s*(package\s+[a-z0-9_\.]+|import\s+[a-zA-Z0-9_\.]+;|class\s+[A-Za-z0-9_]+\s+extends\s+[A-Za-z0-9_]+)'
   - language: AngelScript
+- extensions: ['.asm']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
 - extensions: ['.asc']
   rules:
   - language: Public Key
     pattern: '^(----[- ]BEGIN|ssh-(rsa|dss)) '
@@ -191,8 +195,10 @@
     pattern: '\A\s*[{\[]'
   - language: Slice
 - extensions: ['.inc']
   rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
   - language: PHP
     pattern: '^<\?(?:php)?'
   - language: SourcePawn
     pattern: '^public\s+(?:SharedPlugin(?:\s+|:)__pl_\w+\s*=(?:\s*{)?|(?:void\s+)?__pl_\w+_SetNTVOptional\(\)(?:\s*{)?)'
@@ -383,8 +389,12 @@
   - language: Rust
     pattern: '^(use |fn |mod |pub |macro_rules|impl|#!?\[)'
   - language: RenderScript
     pattern: '#include|#pragma\s+(rs|version)|__attribute__'
+- extensions: ['.s']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
 - extensions: ['.sc']
   rules:
   - language: SuperCollider
     pattern: '(?i:\^(this|super)\.|^\s*~\w+\s*=\.)'
@@ -486,7 +496,15 @@
   - '^[ \t]*(private|public|protected):$'
   - 'std::\w+'
   fortran: '^(?i:[c*][^abd-z]|      (subroutine|program|end|data)\s|\s*!)'
   key_equals_value: '^[^#!;][^=]*='
+  m68k:
+  - '(?im)\bmoveq(?:\.l)?\s+#(?:\$-?[0-9a-f]{1,3}|%[0-1]{1,8}|-?[0-9]{1,3}),\s*d[0-7]\b'
+  - '(?im)^\s*move(?:\.[bwl])?\s+(?:sr|usp),\s*[^\s]+'
+  - '(?im)^\s*move\.[bwl]\s+.*\b[ad]\d'
+  - '(?im)^\s*movem\.[bwl]\b'
+  - '(?im)^\s*move[mp](?:\.[wl])?\b'
+  - '(?im)^\s*btst\b'
+  - '(?im)^\s*dbra\b'
   objectivec: '^\s*(@(interface|class|protocol|property|end|synchronised|selector|implementation)\b|#import\s+.+\.h[">])'
   perl5: '\buse\s+(?:strict\b|v?5\.)'
   perl6: '^\s*(?:use\s+v6\b|\bmodule\b|\b(?:my\s+)?class\b)'

Contributor Author

I tested the ^ \s* move (\.[bwl])? \s+ (sr|usp), \s* [^\s]+ line, and it only catches moves from sr or usp, not moves involving any given register. I feel that the Motorola syntax of `move.size`, with a register named Dn or An as source or destination, is sufficiently dissimilar to other assembly syntaxes to avoid confusion with them while also catching even the shortest snippet.
Testing may still show that the heuristics taken together are sufficient to catch all m68k assembly sources.

.i, like .inc, .asm or .s, is used by assemblers on most platforms AFAIK.

Collaborator

In that case, the third line of the m68k heuristic becomes:

- '(?im)^\s*move\.[bwl]\s+.*\b[ad]\d'

I'll update the diff I just posted.
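
A small sketch of the difference between the two `move` expressions discussed here, again using Python's `re` as a stand-in for Ruby. `SR_USP` and `MOVE_REG` are names chosen for this example, and the instruction strings are illustrative.

```python
import re

# Second and third lines of the m68k heuristic, as quoted in this thread.
SR_USP = re.compile(r'(?im)^\s*move(?:\.[bwl])?\s+(?:sr|usp),\s*[^\s]+')
MOVE_REG = re.compile(r'(?im)^\s*move\.[bwl]\s+.*\b[ad]\d')

if __name__ == "__main__":
    samples = [
        "\tmove.l\td0,a1",    # sized move between ordinary registers
        "\tmove\tsr,d0",      # unsized move with sr as first operand
        "\tmov\teax, ebx",    # x86 syntax, for contrast
    ]
    for line in samples:
        print(repr(line),
              "sr/usp:", bool(SR_USP.search(line)),
              "move.size:", bool(MOVE_REG.search(line)))
```

The sized register move is caught only by the `move.size` expression, the unsized `sr` move only by the `sr|usp` expression, and the x86 line by neither.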

".i, like .inc, .asm or .s, is used by assemblers on most platforms AFAIK."

There are currently 11,273,157 .i files publicly indexed on GitHub. Surely there must be other formats hidden out there...

Contributor Author

That regex looks fine for heuristics.

A quick glance at those results indicates that a lot of .i files are SWIG files, which may need a language definition of their own.
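
For illustration, the sketch below shows how a `.i` file might be routed between 68k assembly and SWIG. It assumes rules are tried in the order listed and that the first match wins. It uses only one of the m68k expressions plus the SWIG expression from heuristics.yml, and it runs on Python's `re` rather than Ruby. `classify_dot_i` and the sample sources are hypothetical.

```python
import re

# One of the m68k expressions, and the SWIG expression from heuristics.yml.
M68K_MOVE = re.compile(r'(?im)^\s*move\.[bwl]\s+.*\b[ad]\d')
SWIG = re.compile(r'^[ \t]*%[a-z_]+\b|^%[{}]$', re.MULTILINE)


def classify_dot_i(source):
    """Hypothetical helper: try the rules in order, first match wins."""
    if M68K_MOVE.search(source):
        return "Motorola 68K Assembly"
    if SWIG.search(source):
        return "SWIG"
    return None  # undecided; left to other classification techniques


# Illustrative inputs (assumptions, not files from the pull request).
SWIG_SOURCE = '%module example\n%{\n#include "example.h"\n%}\nint fact(int n);\n'
M68K_SOURCE = "WIDTH\tequ\t320\n\tmove.w\t#WIDTH,d0\n"
MC6809_SOURCE = "asm_draw_3d:\n\tldu 1,x\n\tsta $d001;\n"

if __name__ == "__main__":
    for text in (SWIG_SOURCE, M68K_SOURCE, MC6809_SOURCE):
        print(classify_dot_i(text))
```

The 6809 input matches neither rule, so under these assumptions it stays undecided at the heuristics stage.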

+  - ".x68"
   tm_scope: source.m68k
   ace_mode: assembly_x86
   language_id: 477582706
@@ -4796,6 +4801,15 @@ SVG:
   codemirror_mode: xml
   codemirror_mime_type: text/xml
   language_id: 337
+SWIG:
+  type: programming
+  extensions:
+  - ".i"
+  tm_scope: source.c++
+  ace_mode: c_cpp
+  codemirror_mode: clike
+  codemirror_mime_type: text/x-c++src
+  language_id: 1066250075
 Sage:
   type: programming
   group: Python
82 changes: 82 additions & 0 deletions samples/Assembly/3D_PRG.I
@@ -0,0 +1,82 @@
; this file is part of Release, written by Malban in 2017
;
***********************************************************
; input list in X
; destroys u
; 0 move
; negative use as shift
; positive end
asm_draw_3ds:
ldu 2,x
lda 1,x;
starts:
sta $d004;
ldd ,u;
sta $d001;
clr $d000;
lda ,x;
inc $d000;
stb $d001;
sta $d00A;
clr $d005;
leax 4,x;
ldu 2,x;
lda ,x;
bgt end1s;
lda 1,x;
ldb #$40;
waits: bitb $d00D;
beq waits;
ldb #0
stb $d00A;
bra starts;
end1s: ldd #$0040;
ends: bitb $d00D;
beq ends;
sta $d00A
rts


asm_draw_3d:
ldu 1,x
start: ldd ,u;
sta $d001;
clr $d000;
lda ,x;
inc $d000;
stb $d001;
sta $d00A;
clr $d005;
leax 3,x;
ldu 1,x;
lda ,x;
bgt end1;
ldd #$0040;
wait: bitb $d00D;
beq wait;
sta $d00A;
bra start;
end1: ldd #$0040;
end: bitb $d00D;
beq end;
sta $d00A
rts



; Cosinus data
cosinus3d:
DB 63, 62, 61, 60, 58, 55, 52, 48, 43, 39, 34 ; 11
DB 28, 23, 17, 10, 4, -1, -7, -14, -20, -25, -31 ; 22
DB -36, -41, -46, -50, -53, -56, -59, -61, -62, -62, -62 ; 33
DB -62, -61, -59, -56, -53, -50, -46, -41, -36, -31, -25 ; 44
DB -20, -14, -7, -1, 4, 10, 17, 23, 28, 34, 39 ; 55
DB 43, 48, 52, 55, 58, 60, 61, 62, 63
; Sinus data
sinus3d:
DB 0, 6, 12, 18, 24, 30, 35, 40, 45, 49, 52 ; 11
DB 56, 58, 60, 62, 62, 62, 62, 61, 59, 57, 54 ; 22
DB 51, 47, 42, 38, 32, 27, 21, 15, 9, 3, -3 ; 33
DB -9, -15, -21, -27, -32, -38, -42, -47, -51, -54, -57 ; 44
DB -59, -61, -62, -62, -62, -62, -60, -58, -56, -52, -49 ; 55
DB -45, -40, -35, -30, -24, -18, -12, -6, -3