Scanner Requests

Scanners are the heart of CodeRay. They split input code into tokens and classify them.
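
To make "split and classify" concrete, here is a minimal sketch that uses only Ruby's standard strscan library. The tokenize method and its three toy rules are illustrative assumptions, not CodeRay's actual API; a real scanner is far more thorough.

```ruby
require 'strscan'

# Hypothetical helper (not part of CodeRay): splits a string into
# [text, kind] pairs using a few toy rules.
def tokenize(code)
  scanner = StringScanner.new(code)
  tokens = []
  until scanner.eos?
    if (text = scanner.scan(/\s+/))
      tokens << [text, :space]
    elsif (text = scanner.scan(/\d+/))
      tokens << [text, :integer]
    elsif (text = scanner.scan(/[A-Za-z_]\w*/))
      tokens << [text, :ident]
    else
      # Anything unrecognized is consumed one character at a time.
      tokens << [scanner.getch, :error]
    end
  end
  tokens
end

p tokenize("x = 42")
```

Every character of the input ends up in exactly one token, and each token carries a kind that a highlighter could map to a color.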

Each language has its own scanner; you can see which languages are currently supported in the repository.

Why is the CodeRay language support list so short?

CodeRay development is a slow process because there is only one active developer, and he insists on high-quality software.

Special attention is paid to the scanners: every CodeRay scanner is tested carefully against lots of example source code, as well as randomized and junk code, to make it safe. A CodeRay scanner is not officially released unless it highlights very, very well.
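
The junk-code testing described above can be sketched with the standard library alone. This is not CodeRay's test suite: toy_scan is a hypothetical stand-in for a real scanner, and the two invariants checked (no empty tokens, and tokens that concatenate back to the input) are assumptions about what a safe scanner should guarantee.

```ruby
require 'strscan'

# Hypothetical stand-in for a real scanner: returns [text, kind] pairs.
def toy_scan(code)
  scanner = StringScanner.new(code)
  tokens = []
  until scanner.eos?
    if (text = scanner.scan(/\s+/))
      tokens << [text, :space]
    elsif (text = scanner.scan(/\w+/))
      tokens << [text, :word]
    else
      tokens << [scanner.getch, :error]
    end
  end
  tokens
end

# Feed random junk to the scanner and check two invariants:
# no token is empty, and the tokens concatenate back to the input.
def fuzz(rounds: 200, seed: 42)
  rng = Random.new(seed)
  alphabet = ['a', 'Z', '0', ' ', "\n", '"', '\\', '{', '}', '#']
  rounds.times do
    junk = Array.new(rng.rand(0..40)) { alphabet.sample(random: rng) }.join
    tokens = toy_scan(junk)
    raise "empty token for #{junk.inspect}" if tokens.any? { |text, _| text.empty? }
    raise "lost input for #{junk.inspect}" unless tokens.map(&:first).join == junk
  end
  true
end

puts "fuzzing passed" if fuzz
```

A fixed seed keeps failures reproducible, which matters when a junk input exposes a bug you want to turn into a regular test case.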

I need a new Scanner – What can I do?

Here’s what you can do to speed up the development of a new scanner:

- Request it! File a new ticket unless one already exists, or add a +1 to the existing ticket to show your interest.
- Upload or link to example code in the ticket discussion.
  - Typical code in large quantities is very helpful, also for benchmarking.
  - But we also need the weirdest and strangest code you can find to make the scanner safe.
- Provide links to useful information about the language's lexical structure, such as:
  - a list of reserved words (Did you know that “void” is a JavaScript keyword?)
  - rules for string and number literals (Can a double-quoted string contain a newline?)
  - rules for comments and other token types (Does the language have a special syntax for multiline comments?)
  - a description of any unusual syntactic features (There’s this weird %w() thing in Ruby…)
  - If there are different versions / implementations / dialects of this language: How do they differ?
- Give examples of good and bad highlighters / syntax definitions for the language (usually from editors or other libraries).
- Find more example code!
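
Information like this ends up as plain data inside a scanner. As a rough sketch, a reserved-word list could become a lookup table like the one below. The variable names are made up for illustration, the four JavaScript keywords are only a sample, and a plain Hash with a default value stands in for CodeRay's WordList.

```ruby
# Hypothetical sample of reserved words supplied in a ticket.
keywords = %w(void typeof return function)

# Unknown words fall back to :ident; listed words map to :keyword.
kind_of = Hash.new(:ident)
keywords.each { |word| kind_of[word] = :keyword }

p kind_of['void']   # prints :keyword
p kind_of['voids']  # prints :ident
```

The more complete and precise the word list in the ticket, the less guesswork goes into the table.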

Also, read the next section.

I want to write a Scanner myself

Wow, you’re brave! Writing CodeRay scanners is not an easy task because:

- You need excellent knowledge about the language you want to scan. Every language has a dark side!
- You need good knowledge of (Ruby) regular expressions.
- There’s no documentation to speak of.
  - But this is a wiki (hint, hint) ;o)

But it has been done before, so go and try it!

1. You should still request the scanner (as described above) and announce that you are working on a patch yourself.
2. Check out the repository and try the test suite.
3. Copy a scanner of your choice as a base. You know best which language comes closest.
4. Make sure you have run rake test:scanners to get the scanner test suite.
5. Create a test case directory in test/scanners/<lang> and add example files for your language.
6. Run your test cases with rake test:scanner:<lang> and write your scanner!
7. Also, look into lib/coderay/scanners/_map.rb and lib/coderay/helpers/file_type.rb.
8. Make a patch (scanner, test cases and other changes) and upload it to the ticket.
9. Follow the ensuing discussion.
10. Prepare to be added to the THX list.

Contact me (murphy rubychan de) if you have any questions.

What does a Scanner look like?

For example, the JSON scanner:

~~~ruby
# Namespace; use this form instead of CodeRay::Scanners to avoid messages like
# "uninitialized constant CodeRay" when testing it.
module CodeRay
module Scanners

  # Always inherit from CodeRay::Scanners::Scanner.
  #
  # Scanner inherits directly from StringScanner, the Ruby class for fast
  # string scanning. Read the documentation to understand what's going on here:
  #
  # http://www.ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
  class JSON < Scanner

    # Deprecation notice: The Streamable module is gone.

    # Scanners are plugins and must be registered like this:
    register_for :json

    # You can provide a file extension associated with this language.
    file_extension 'json'

    # List all token kinds that are not considered to be running code
    # in this language. For a typical language, this would just be
    # :comment, but for a data or markup language like JSON, no tokens
    # should count as Line of Code.
    KINDS_NOT_LOC = [
      :float, :char, :content, :delimiter,
      :error, :integer, :operator, :value,
    ] # :nodoc:

    # See the WordList documentation.
    CONSTANTS = %w( true false null )
    IDENT_KIND = WordList.new(:key).add(CONSTANTS, :value)

    ESCAPE = / [bfnrt\\"\/] /x
    UNICODE_ESCAPE = / u[a-fA-F0-9]{4} /x

    # This is the only method you need to define. It scans code.
    #
    # encoder is an object which encodes tokens. It provides the following API:
    # * encoder.text_token(text, kind) for tokens
    # * encoder.begin_group(kind) and encoder.end_group(kind) for token groups
    # * encoder.begin_line(kind) and encoder.end_line(kind) for line tokens
    #
    # options is a hash. Standard options are:
    # * keep_state: Try to save the current scanner state and restore it in the
    #   next call of scan_tokens.
    #
    # scan_tokens must return the encoder variable it was given.
    #
    # You are completely free to use any style you want, just make sure encoder
    # gets what it needs. But typically, a Scanner follows the following scheme:
    def scan_tokens encoder, options

      # The scanner is always in a certain state, which is :initial by default.
      # We use local variables and symbols to maximize speed.
      state = :initial

      # Sometimes, you need a stack. Ruby arrays are perfect for this.
      stack = []

      # Define more flags and variables as you need them.
      key_expected = false

      # The main loop; eos? is true when the end of the code is reached.
      until eos?

        # Deprecation notice: The use of the local variables kind and match is
        # no longer recommended.

        # Depending on the state, we want to do different things.
        case state

        # Normally, we use this case.
        when :initial
          # I like the / ... /x style regexps because white space makes them more
          # readable. x means white space is ignored.
          if match = scan(/ \s+ /x)
            # White space and masked line ends are :space.
            # Make sure you never send an empty token! /\s*/ for example would be
            # very bad (actually creating an infinite loop).
            encoder.text_token match, :space
          elsif match = scan(/ [:,\[{\]}] /x)
            # Operators of JSON. stack is used to determine where we are. stack and
            # key_expected are set depending on which operator was found.
            # key_expected is used to decide whether a "quoted" thing should be
            # classified as key or string.
            encoder.text_token match, :operator
            case match
            when '{' then stack << :object; key_expected = true
            when '[' then stack << :array
            when ':' then key_expected = false
            when ',' then key_expected = true if stack.last == :object
            when '}', ']' then stack.pop # no error recovery, but works for valid JSON
            end
          elsif match = scan(/ true | false | null /x)
            # These are the only idents that are allowed in JSON. Normally, IDENT_KIND
            # would be used to tell keywords and idents apart.
            encoder.text_token match, IDENT_KIND[match]
          elsif match = scan(/ -? (?: 0 | [1-9]\d* ) /x)
            # Pay attention to the details: JSON doesn't allow numbers like 00.
            # The exponent may carry a sign and one or more digits.
            if scan(/ \.\d+ (?:[eE][-+]?\d+)? | [eE][-+]? \d+ /x)
              match << matched
              encoder.text_token match, :float
            else
              encoder.text_token match, :integer
            end
          elsif match = scan(/"/)
            # A "quoted" token was found, and we know whether it is a key or a string.
            state = key_expected ? :key : :string
            # This opens a token group and encodes the delimiter token.
            encoder.begin_group state
            encoder.text_token match, :delimiter
          else
            # Don't forget to add this case: If we reach invalid code, we try to discard
            # chars one by one and mark them as :error.
            encoder.text_token getch, :error
          end

        # String scanning is a bit more complicated, so we use another state for it.
        # The scanner stays in :string state until the string ends or an error occurs.
        #
        # JSON uses the same notation for strings and keys. We want keys to be in a
        # different color, but the lexical rules are the same. This is why we use this
        # case also for the :key state.
        when :string, :key
          # Another if-elsif-else-switch, for strings this time.
          if match = scan(/[^\\"]+/)
            # Everything that is not \ or " is just string content.
            encoder.text_token match, :content
          elsif match = scan(/"/)
            # A " is found, which means this string or key is ending here.
            # A special token class, :delimiter, is used for tokens like this one.
            encoder.text_token match, :delimiter
            # Always close your token groups using the right token kind!
            encoder.end_group state
            # We're going back to normal scanning here.
            state = :initial
            # Deprecation notice: Don't use "next" any more.
          elsif match = scan(/ \\ (?: #{ESCAPE} | #{UNICODE_ESCAPE} ) /mox)
            # A valid special character should be classified as :char.
            encoder.text_token match, :char
          elsif match = scan(/\\./m)
            # Anything else that is escaped (including \n, we use the m modifier) is
            # just content.
            encoder.text_token match, :content
          elsif match = scan(/ \\ | $ /x)
            # A string that suddenly ends in the middle, or reaches the end of the
            # line. This is an error; we go back to :initial now.
            encoder.end_group state
            # $ matches without consuming anything, so never send that empty token.
            encoder.text_token match, :error unless match.empty?
            state = :initial
          else
            # Nice for debugging. Should never happen.
            raise_inspect "else case \" reached; %p not handled." % [peek(1)], encoder
          end

        else

          # Nice for debugging. Should never happen.
          raise_inspect 'Unknown state: %p' % [state], encoder

        end

        # Deprecation notice: The block using the match local variable is gone.
      end

      # If we still have a string or key token group open, close it.
      if [:string, :key].include? state
        encoder.end_group state
      end

      # Return the encoder.
      encoder
    end

  end

end
end
~~~
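
The two-step number rule is a good place to experiment outside of CodeRay. The sketch below isolates it with the standard strscan library; scan_number is a hypothetical helper, not part of the scanner API, and its exponent pattern accepts an optional sign and one or more digits, as the JSON grammar allows.

```ruby
require 'strscan'

# Hypothetical helper: returns [text, kind] for the JSON number at the
# start of source, or nil if the input does not start with a number.
def scan_number(source)
  scanner = StringScanner.new(source)
  # Step 1: the integer part. A leading 0 must stand alone.
  match = scanner.scan(/ -? (?: 0 | [1-9]\d* ) /x)
  return nil unless match
  # Step 2: an optional fraction and/or exponent turns it into a float.
  if scanner.scan(/ \.\d+ (?:[eE][-+]?\d+)? | [eE][-+]? \d+ /x)
    [match + scanner.matched, :float]
  else
    [match, :integer]
  end
end

p scan_number('42')        # prints ["42", :integer]
p scan_number('-1.5e+10')  # prints ["-1.5e+10", :float]
p scan_number('007')       # prints ["0", :integer]
```

The last call shows why the rule matters: only the first 0 is accepted as a number, so the remaining digits are scanned as separate tokens instead of being silently merged into an invalid literal.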
