A POSIX (IEEE Std 1003.1-2017) shell lexer written in Python 3.

micepram/parsify

Parsify

Overview

Parsify is a minimal, POSIX-compliant shell lexer written in Python 3. It is designed to implement the Token Recognition rules defined in Section 2.3 of the Shell Command Language volume of IEEE Std 1003.1-2017. It scans shell script input and breaks it into a stream of tokens, identifying words, operators, and reserved words while respecting quoting and escaping rules.

Features

  • Token Types: Identifies Words, Operators, Keywords, and Newlines.
  • Quoting Support: Handles single quotes ('...') for literals and double quotes ("...") with escape processing.
  • Operator Recognition: Recognizes standard POSIX operators (e.g., |, &, ;) and multi-character operators (e.g., &&, ||, <<, >>).
  • Escaping: Supports backslash escapes in unquoted and double-quoted contexts.
  • Comments: Strips # comments until the end of the line.
  • Keywords: Detects reserved words like if, then, while, done.
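To make the features above concrete, here is a minimal sketch of how such a scanner could be structured. It is not Parsify's actual implementation: the `tokenize` function, the `OPERATORS` and `KEYWORDS` tables, and the `"keyword"` type name are assumptions made for illustration. It also simplifies the standard in several places, for example by treating any unquoted reserved word as a keyword regardless of its position in the command.

```python
# Illustrative sketch only; names and structure are assumptions, not Parsify's code.
# Operators are sorted longest-first so that "&&" wins over "&" (longest match).
OPERATORS = sorted(
    ["&&", "||", ";;", "<<-", "<<", ">>", "<&", ">&", "<>", ">|",
     "|", "&", ";", "<", ">", "(", ")"],
    key=len, reverse=True,
)
KEYWORDS = {"if", "then", "else", "elif", "fi", "do", "done", "case", "esac",
            "while", "until", "for", "in", "{", "}", "!"}


def tokenize(text):
    tokens, buf = [], []
    in_word = quoted = False  # quoted: any part of the current word was quoted/escaped
    i = 0

    def flush():
        """Emit the word collected so far, if any."""
        nonlocal in_word, quoted
        if in_word:
            value = "".join(buf)
            kind = "keyword" if value in KEYWORDS and not quoted else "word"
            tokens.append({"type": kind, "value": value})
            buf.clear()
            in_word = quoted = False

    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Unquoted backslash: next character is literal; backslash-newline is dropped.
            if text[i + 1] != "\n":
                buf.append(text[i + 1])
                in_word = quoted = True
            i += 2
        elif ch == "'":
            # Single quotes: everything up to the next quote is literal.
            end = text.index("'", i + 1)  # ValueError if the quote is unterminated
            buf.append(text[i + 1:end])
            in_word = quoted = True
            i = end + 1
        elif ch == '"':
            # Double quotes: backslash only escapes $, `, " and \ (simplified).
            i += 1
            in_word = quoted = True
            while text[i] != '"':  # IndexError if the quote is unterminated
                if text[i] == "\\" and text[i + 1] in '$`"\\':
                    i += 1
                buf.append(text[i])
                i += 1
            i += 1
        elif (op := next((o for o in OPERATORS if text.startswith(o, i)), None)):
            flush()
            tokens.append({"type": "operator", "value": op})
            i += len(op)
        elif ch == "\n":
            flush()
            tokens.append({"type": "newline", "value": "\n"})
            i += 1
        elif ch in " \t":
            flush()
            i += 1
        elif ch == "#" and not in_word:
            # Comment: skip to end of line, leaving the newline to be tokenized.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            buf.append(ch)
            in_word = True
            i += 1
    flush()
    return tokens
```

Under these assumptions, `tokenize("ls -l | grep '.py'")` yields word tokens for `ls`, `-l`, `grep`, and `.py`, with a single `|` operator token between them.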

License

MIT

Usage

The lexer can be run directly from the command line, accepting either a raw string or a file path. It outputs the list of tokens in JSON format.

Scan a string:

python posix_lexer.py "echo 'hello world'"

Scan a file:

python posix_lexer.py examples/basic.sh

Examples

Input (examples/basic.sh):

echo 'hi' "there" # comment
ls -l | grep ".py"
val=123; echo $val
if true; then echo yes; fi
cat <<EOF

Output (abbreviated JSON):

[
  { "type": "word", "value": "echo" },
  { "type": "word", "value": "hi" },
  { "type": "word", "value": "there" },
  { "type": "newline", "value": "\n" },
  { "type": "word", "value": "ls" },
  { "type": "word", "value": "-l" },
  { "type": "operator", "value": "|" },
  ...
]
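Because the output is plain JSON, it can be post-processed with the standard library. The sketch below assumes only the `type`/`value` shape shown in the abbreviated output above; the helper name `split_commands` is hypothetical and not part of Parsify.

```python
import json
from collections import Counter

# Sample copied from the abbreviated output above (the elided tail is left out).
# A raw string keeps the JSON escape "\n" intact for json.loads.
SAMPLE = r'''
[
  { "type": "word", "value": "echo" },
  { "type": "word", "value": "hi" },
  { "type": "word", "value": "there" },
  { "type": "newline", "value": "\n" },
  { "type": "word", "value": "ls" },
  { "type": "word", "value": "-l" },
  { "type": "operator", "value": "|" }
]
'''


def split_commands(tokens):
    """Group token values into one list per input line, splitting on newline tokens."""
    lines, current = [], []
    for tok in tokens:
        if tok["type"] == "newline":
            if current:
                lines.append(current)
            current = []
        else:
            current.append(tok["value"])
    if current:
        lines.append(current)
    return lines


tokens = json.loads(SAMPLE)
counts = Counter(tok["type"] for tok in tokens)  # how many tokens of each type
lines = split_commands(tokens)
print(counts)
print(lines)
```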

Testing

The project includes a comprehensive unit test suite covering edge cases, mixed quotes, and operator precedence.

Run the tests with:

python -m unittest test_posix_lexer.py

Limitations

  • No Expansion: Variable expansion ($var) and command substitution ($(...)) are tokenized as words or parts of words. The lexer does not perform the actual expansion or execution.
  • No AST: This is strictly a lexical scanner; it does not parse the tokens into an Abstract Syntax Tree.
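Since expansion is left to the caller, a later stage would have to substitute variables itself. The following is a rough sketch of such a stage, assuming the `type`/`value` token shape from the example output; `expand_words` is a hypothetical helper and is not provided by Parsify. It handles only `$name` and `${name}`, and because the token values no longer carry quoting information, it cannot tell that a single-quoted `'$val'` should stay literal.

```python
import re

# Matches $name or ${name}; command substitution and other forms are left untouched.
_VAR = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def expand_words(tokens, env):
    """Return a new token list with variables in word tokens replaced from env.

    Unset variables expand to the empty string, as in the shell.
    """
    def substitute(match):
        name = match.group(1) or match.group(2)
        return env.get(name, "")

    return [
        {**tok, "value": _VAR.sub(substitute, tok["value"])}
        if tok["type"] == "word" else tok
        for tok in tokens
    ]


tokens = [
    {"type": "word", "value": "echo"},
    {"type": "word", "value": "$val"},
    {"type": "word", "value": "${val}x"},
]
expanded = expand_words(tokens, {"val": "123"})
print([tok["value"] for tok in expanded])
```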

Built with ❤️ by Pramika Garg
LinkedIn | Email
