
customize TOKENS #74

Open
gemerden opened this issue Jul 10, 2017 · 5 comments

Comments

gemerden (Contributor) commented Jul 10, 2017

In the tokenize() method of BooleanAlgebra, would it be possible to customize the tokens without having to override the whole method just to change them, e.g.:

def tokenize(self, expr, TOKENS=None):
    ...
    TOKENS = TOKENS or {
         # current TOKENS
    }
    ...

Or perhaps define the current tokens outside the method and use them as the default for TOKENS instead of None above, e.g. as sketched below.

This would make it less likely that an inheriting class becomes outdated in future versions.
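
For illustration, a rough sketch of that second option (the token mapping shown is just an example, not necessarily the library's exact current set of tokens):

class BooleanAlgebra(object):

    # class-level default, so a subclass only needs to override this mapping
    TOKENS = {
        '&': TOKEN_AND,
        '|': TOKEN_OR,
        '!': TOKEN_NOT,
        '(': TOKEN_LPAR,
        ')': TOKEN_RPAR,
    }

    def tokenize(self, expr, TOKENS=None):
        TOKENS = TOKENS or self.TOKENS
        ...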

Cheers, Lars

pombredanne (Collaborator) commented

@gemerden Thanks. This is an easy change that makes a lot of sense. Just out of curiosity, what custom TOKENS would you need?
BTW, slightly related, here is an example of a tokenizer that uses custom tokens (and a trie/Aho-Corasick automaton for token recognition): https://github.com/nexB/license-expression/blob/f3421c1a1f409249ba86a16b7b46c2e987f6ab35/src/license_expression/__init__.py#L409

gemerden (Contributor, Author) commented Jul 14, 2017

@pombredanne: I only use '|', '&', '!', '(' and ')', and I use e.g. '*' for something else (as a wildcard). I needed to change more in tokenize(); roughly, everything that is not a token is accepted as part of a symbol, but I need to do some more testing. Currently it looks like this:

class KeyParser(BooleanAlgebra):

    DEFAULT_TOKENS = {
        '&': TOKEN_AND,
        '|': TOKEN_OR,
        '!': TOKEN_NOT,
        '(': TOKEN_LPAR,
        ')': TOKEN_RPAR,
    }

    def __init__(self, TOKENS=None, *args, **kwargs):
        super(KeyParser, self).__init__(Symbol_class=WildSymbol,
                                        OR_class=SET_OR,
                                        AND_class=SET_AND,
                                        NOT_class=SET_NOT,
                                        *args, **kwargs)
        self.TOKENS = TOKENS or self.DEFAULT_TOKENS

    def tokenize(self, expr):

        if not isinstance(expr, basestring):
            raise TypeError('expr must be string but it is %s.' % type(expr))
        TOKENS = self.TOKENS
        length = len(expr)
        position = 0
        while position < length:
            tok = expr[position]

            # a character that is not a known token starts a symbol;
            # keep consuming characters until the next known token
            sym = tok not in TOKENS
            if sym:
                position += 1
                while position < length:
                    char = expr[position]
                    if char not in TOKENS:
                        position += 1
                        tok += char
                    else:
                        break
                # step back one so the position += 1 at the bottom of the
                # outer loop does not skip the character that ended the symbol
                position -= 1

            try:
                yield TOKENS[tok], tok, position
            except KeyError:
                if sym:
                    yield TOKEN_SYMBOL, tok, position
                else:
                    raise ParseError(token_string=tok, position=position, error_code=PARSE_UNKNOWN_TOKEN)
            position += 1

With sym = tok not in TOKENS I leave open the possibility to put more (or a different) syntax in the symbols. When I am happy with my project I'll make the repo public and share the link here.
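
For reference, usage then looks roughly like this (the input string is just an example):

parser = KeyParser()
# '*' is not in TOKENS, so 'key*' is collected into a single TOKEN_SYMBOL
tokens = list(parser.tokenize('key*&(!other|more)'))
# roughly: [(TOKEN_SYMBOL, 'key*', ...), (TOKEN_AND, '&', ...), (TOKEN_LPAR, '(', ...), ...]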

pombredanne (Collaborator) commented

gemerden (Contributor, Author) commented

Thanks, the code above is passing all my tests, so for now I am OK.

pombredanne (Collaborator) commented

OK, your call. You can send a PR or close this as you like.
