shlex: Simple lexical analysis
Source code: Lib/shlex.py
The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell. This will often be useful for writing minilanguages (for example, in run control files for Python applications) or for parsing quoted strings.
The shlex module defines the following functions:
shlex.split(s, comments=False, posix=True)
Split the string s using shell-like syntax. If comments is False (the default), the parsing of comments in the given string will be disabled (setting the commenters attribute of the shlex instance to the empty string). This function operates in POSIX mode by default, but uses non-POSIX mode if the posix argument is false.
Note
Since the split() function instantiates a shlex instance, passing None for s will read the string to split from standard input.
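As a small illustration of the comments and posix arguments of split(), the following sketch runs the same line through three settings. The sample command line is made up for the example.

```python
import shlex

line = 'cp "my file.txt" /tmp/backup  # nightly copy'

# Default: POSIX mode with comment parsing disabled, so '#' is ordinary text.
default_tokens = shlex.split(line)

# comments=True drops everything from '#' to the end of the line.
no_comment_tokens = shlex.split(line, comments=True)

# posix=False keeps the quote characters as part of the token.
non_posix_tokens = shlex.split(line, comments=True, posix=False)

print(default_tokens)
print(no_comment_tokens)
print(non_posix_tokens)
```

In POSIX mode the quoted file name comes back as a single token without its quotes, which is usually what is wanted when the result is handed to another program as an argument list.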
shlex.quote(s)
Return a shell-escaped version of the string s. The returned value is a string that can safely be used as one token in a shell command line, for cases where you cannot use a list.
This idiom would be unsafe:
- >>> filename = 'somefile; rm -rf ~'
- >>> command = 'ls -l {}'.format(filename)
- >>> print(command) # executed by a shell: boom!
- ls -l somefile; rm -rf ~
quote() lets you plug the security hole:
- >>> from shlex import quote
- >>> command = 'ls -l {}'.format(quote(filename))
- >>> print(command)
- ls -l 'somefile; rm -rf ~'
- >>> remote_command = 'ssh home {}'.format(quote(command))
- >>> print(remote_command)
- ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''
The quoting is compatible with UNIX shells and with split():
- >>> from shlex import split
- >>> remote_command = split(remote_command)
- >>> remote_command
- ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
- >>> command = split(remote_command[-1])
- >>> command
- ['ls', '-l', 'somefile; rm -rf ~']
New in version 3.3.
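When a whole argument list has to be turned into one command string, quote() can be applied to each element. The helper below, join_args, is a hypothetical name introduced only for this sketch (Python 3.8 and later ship a similar shlex.join function).

```python
import shlex


def join_args(args):
    """Build one shell-safe command string from a list of arguments.

    join_args is an illustrative helper, not part of the shlex module.
    """
    return ' '.join(shlex.quote(arg) for arg in args)


args = ['grep', '-r', "it's here", '/var/log/my app']
command = join_args(args)
print(command)

# split() undoes the quoting and recovers the original list.
round_trip = shlex.split(command)
print(round_trip == args)
```

Arguments that contain only safe characters are returned unchanged by quote(), so the resulting string stays readable for simple commands.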
The shlex module defines the following class:
class shlex.shlex(instream=None, infile=None, posix=False, punctuation_chars=False)
A shlex instance or subclass instance is a lexical analyzer object. The initialization argument, if present, specifies where to read characters from. It must be a file-/stream-like object with read() and readline() methods, or a string. If no argument is given, input will be taken from sys.stdin.
The second optional argument is a filename string, which sets the initial value of the infile attribute. If the instream argument is omitted or equal to sys.stdin, this second argument defaults to "stdin".
The posix argument defines the operational mode: when posix is not true (default), the shlex instance will operate in compatibility mode. When operating in POSIX mode, shlex will try to be as close as possible to the POSIX shell parsing rules.
The punctuation_chars argument provides a way to make the behaviour even closer to how real shells parse. This can take a number of values: the default value, False, preserves the behaviour seen under Python 3.5 and earlier. If set to True, then parsing of the characters ();<>|& is changed: any run of these characters (considered punctuation characters) is returned as a single token. If set to a non-empty string of characters, those characters will be used as the punctuation characters. Any characters in the wordchars attribute that appear in punctuation_chars will be removed from wordchars. See Improved Compatibility with Shells for more information. punctuation_chars can be set only upon shlex instance creation and can't be modified later.
Changed in version 3.6: The punctuation_chars parameter was added.
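A minimal sketch of the constructor arguments, assuming an in-memory stream in place of a real file. The file name demo.rc and its contents are invented for the example.

```python
import io
import shlex

# Any object with read() and readline() works; StringIO stands in for a file.
stream = io.StringIO('set color "dark blue"\nset size 12\n')

# infile only labels the input (for error messages); nothing is opened here.
lexer = shlex.shlex(stream, infile='demo.rc', posix=True)

tokens = list(lexer)  # iterating calls get_token() until EOF
print(lexer.infile)
print(tokens)
```

Because posix=True was passed, the quotes around "dark blue" are stripped and the phrase is returned as one token.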
See also
- Module configparser: Parser for configuration files similar to the Windows .ini files.
shlex Objects
A shlex instance has the following methods:
shlex.get_token()
Return a token. If tokens have been stacked using push_token(), pop a token off the stack. Otherwise, read one from the input stream. If reading encounters an immediate end-of-file, eof is returned (the empty string ('') in non-POSIX mode, and None in POSIX mode).
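The interaction between get_token(), push_token() and the eof value can be sketched with an explicit read loop. The input string is arbitrary.

```python
import shlex

lexer = shlex.shlex('alpha beta gamma', posix=True)

# Peek at the first token, then push it back so it is returned again.
first = lexer.get_token()
lexer.push_token(first)

tokens = []
while True:
    token = lexer.get_token()
    if token == lexer.eof:  # None in POSIX mode, '' in non-POSIX mode
        break
    tokens.append(token)

print(first)
print(tokens)
```

Comparing against lexer.eof rather than a hard-coded value keeps the loop correct in both operating modes.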
shlex.read_token()
Read a raw token. Ignore the pushback stack, and do not interpret source requests. (This is not ordinarily a useful entry point, and is documented here only for the sake of completeness.)
shlex.sourcehook(filename)
When shlex detects a source request (see source below) this method is given the following token as argument, and expected to return a tuple consisting of a filename and an open file-like object.
Normally, this method first strips any quotes off the argument. If the result is an absolute pathname, or there was no previous source request in effect, or the previous source was a stream (such as sys.stdin), the result is left alone. Otherwise, if the result is a relative pathname, the directory part of the name of the file immediately before it on the source inclusion stack is prepended (this behavior is like the way the C preprocessor handles #include "file.h").
The result of the manipulations is treated as a filename, and returned as the first component of the tuple, with open() called on it to yield the second component. (Note: this is the reverse of the order of arguments in instance initialization!)
This hook is exposed so that you can use it to implement directory search paths, addition of file extensions, and other namespace hacks. There is no corresponding 'close' hook, but a shlex instance will call the close() method of the sourced input stream when it returns EOF.
For more explicit control of source stacking, use the push_source() and pop_source() methods.
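One way the hook could be used to implement a directory search path is sketched below. SearchPathLexer, its search_dir argument, the include keyword and the file extra.rc are all assumptions made for the example; only sourcehook() and the source attribute come from the module.

```python
import os
import shlex
import tempfile


class SearchPathLexer(shlex.shlex):
    """Hypothetical subclass: resolve sourced files against one directory."""

    def __init__(self, *args, search_dir, **kwargs):
        super().__init__(*args, **kwargs)
        self.search_dir = search_dir
        self.source = 'include'  # token that triggers a source request

    def sourcehook(self, newfile):
        # Strip quotes (still present in non-POSIX mode), then look the
        # name up in search_dir instead of next to the current file.
        name = newfile.strip('"\'')
        path = os.path.join(self.search_dir, name)
        return path, open(path)  # (filename, open file-like object)


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'extra.rc'), 'w') as f:
        f.write('b c\n')
    lexer = SearchPathLexer('a include "extra.rc" d',
                            search_dir=tmp, posix=True)
    tokens = list(lexer)  # the lexer closes extra.rc when it hits EOF

print(tokens)
```

The file name is quoted in the input because, with the default wordchars, an unquoted extra.rc would be split at the dot before the hook ever saw it.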
shlex.push_source(newstream, newfile=None)
Push an input source stream onto the input stack. If the filename argument is specified it will later be available for use in error messages. This is the same method used internally by the sourcehook() method.
shlex.pop_source()
Pop the last-pushed input source from the input stack. This is the same method used internally when the lexer reaches EOF on a stacked input stream.
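Source stacking can also be driven by hand. This sketch pushes an in-memory stream in the middle of lexing and lets the lexer pop it automatically at EOF. The names 'outer' and 'inner' are labels chosen for the example.

```python
import io
import shlex

lexer = shlex.shlex('outer1 outer2', infile='outer', posix=True)
first = lexer.get_token()

# Splice another stream in front of the remaining input.
lexer.push_source(io.StringIO('inner1 inner2'), 'inner')
name_while_stacked = lexer.infile

# The stacked stream is read first; at its EOF the lexer pops it
# and continues with the original input.
rest = list(lexer)
name_after = lexer.infile

print(first, rest)
print(name_while_stacked, name_after)
```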
shlex.error_leader(infile=None, lineno=None)
This method generates an error message leader in the format of a Unix C compiler error label; the format is '"%s", line %d: ', where the %s is replaced with the name of the current source file and the %d with the current input line number (the optional arguments can be used to override these).
This convenience is provided to encourage shlex users to generate error messages in the standard, parseable format understood by Emacs and other Unix tools.
Instances of shlex subclasses have some public instance variables which either control lexical analysis or can be used for debugging:
shlex.commenters
The string of characters that are recognized as comment beginners. All characters from the comment beginner to end of line are ignored. Includes just '#' by default.
shlex.wordchars
The string of characters that will accumulate into multi-character tokens. By default, includes all ASCII alphanumerics and underscore. In POSIX mode, the accented characters in the Latin-1 set are also included. If punctuation_chars is not empty, the characters ~-./*?=, which can appear in filename specifications and command line parameters, will also be included in this attribute, and any characters which appear in punctuation_chars will be removed from wordchars if they are present there.
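The effect of wordchars can be seen by extending it on an instance before reading any tokens. The sample text is arbitrary.

```python
import shlex

text = 'out-file.txt'

# With the default wordchars, '-' and '.' are returned as separate tokens.
default_tokens = list(shlex.shlex(text))

# Adding them to wordchars lets them accumulate into the surrounding word.
lexer = shlex.shlex(text)
lexer.wordchars += '-.'
custom_tokens = list(lexer)

print(default_tokens)
print(custom_tokens)
```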
shlex.whitespace
Characters that will be considered whitespace and skipped. Whitespace bounds tokens. By default, includes space, tab, linefeed and carriage-return.
shlex.escape
Characters that will be considered as escape. This will be only used in POSIX mode, and includes just '\' by default.
shlex.quotes
Characters that will be considered string quotes. The token accumulates until the same quote is encountered again (thus, different quote types protect each other as in the shell). By default, includes ASCII single and double quotes.
shlex.escapedquotes
Characters in quotes that will interpret escape characters defined in escape. This is only used in POSIX mode, and includes just '"' by default.
shlex.whitespace_split
If True, tokens will only be split on whitespace. This is useful, for example, for parsing command lines with shlex, getting tokens in a similar way to shell arguments. If this attribute is True, punctuation_chars will have no effect, and splitting will happen only on whitespace. When using punctuation_chars, which is intended to provide parsing closer to that implemented by shells, it is advisable to leave whitespace_split as False (the default value).
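To see what whitespace_split changes, the sketch below lexes the same made-up command line with the attribute off and on.

```python
import shlex

command = 'ls -l /tmp/*.txt'

# Default: every character that is not a word character, whitespace or a
# quote becomes its own single-character token.
default_tokens = list(shlex.shlex(command))

# whitespace_split=True: only whitespace separates tokens.
lexer = shlex.shlex(command)
lexer.whitespace_split = True
shell_like_tokens = list(lexer)

print(default_tokens)
print(shell_like_tokens)
```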
shlex.infile
The name of the current input file, as initially set at class instantiation time or stacked by later source requests. It may be useful to examine this when constructing error messages.
shlex.instream
The input stream from which this shlex instance is reading characters.
shlex.source
This attribute is None by default. If you assign a string to it, that string will be recognized as a lexical-level inclusion request similar to the source keyword in various shells. That is, the immediately following token will be opened as a filename and input will be taken from that stream until EOF, at which point the close() method of that stream will be called and the input source will again become the original input stream. Source requests may be stacked any number of levels deep.
shlex.debug
If this attribute is numeric and 1 or more, a shlex instance will print verbose progress output on its behavior. If you need to use this, you can read the module source code to learn the details.
shlex.eof
Token used to determine end of file. This will be set to the empty string (''), in non-POSIX mode, and to None in POSIX mode.
shlex.punctuation_chars
A read-only property. Characters that will be considered punctuation. Runs of punctuation characters will be returned as a single token. However, note that no semantic validity checking will be performed: for example, '>>>' could be returned as a token, even though it may not be recognised as such by shells.
New in version 3.6.
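The following sketch (Python 3.6 or later) shows a run of punctuation coming back as one token with no validity check, and that the property cannot be reassigned after construction. The input string is arbitrary.

```python
import shlex

lexer = shlex.shlex('a >>> b', punctuation_chars=True)
tokens = list(lexer)

# '>>>' is not a shell operator, but it is still returned as one token.
print(tokens)
print(lexer.punctuation_chars)

# The property has no setter, so assignment is expected to fail.
try:
    lexer.punctuation_chars = '|'
except AttributeError:
    read_only = True
else:
    read_only = False
print(read_only)
```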
Parsing Rules
When operating in non-POSIX mode, shlex will try to obey the following rules.
- Quote characters are not recognized within words (Do"Not"Separate is parsed as the single word Do"Not"Separate);
- Escape characters are not recognized;
- Enclosing characters in quotes preserve the literal value of all characters within the quotes;
- Closing quotes separate words ("Do"Separate is parsed as "Do" and Separate);
- If whitespace_split is False, any character not declared to be a word character, whitespace, or a quote will be returned as a single-character token. If it is True, shlex will only split words on whitespace;
- EOF is signaled with an empty string ('');
- It's not possible to parse empty strings, even if quoted.
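Several of the non-POSIX rules above can be checked directly. The helper name lex is introduced only for this sketch; it relies on posix=False being the default for the shlex class.

```python
import shlex


def lex(text):
    """Tokenize text in non-POSIX mode (illustrative helper)."""
    return list(shlex.shlex(text))


inside_word = lex('Do"Not"Separate')   # quotes inside a word are kept
closing_quote = lex('"Do"Separate')    # a closing quote ends the token
no_escapes = lex('"a \\ b" c')         # the backslash is an ordinary character
single_chars = lex('a=b')              # '=' is not a word character

print(inside_word)
print(closing_quote)
print(no_escapes)
print(single_chars)
```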
When operating in POSIX mode, shlex will try to obey the following parsing rules.
- Quotes are stripped out, and do not separate words ("Do"Not"Separate" is parsed as the single word DoNotSeparate);
- Non-quoted escape characters (e.g. '\') preserve the literal value of the next character that follows;
- Enclosing characters in quotes which are not part of escapedquotes (e.g. "'") preserve the literal value of all characters within the quotes;
- Enclosing characters in quotes which are part of escapedquotes (e.g. '"') preserve the literal value of all characters within the quotes, with the exception of the characters mentioned in escape. The escape characters retain their special meaning only when followed by the quote in use, or the escape character itself. Otherwise the escape character will be considered a normal character;
- EOF is signaled with a None value;
- Quoted empty strings ('') are allowed.
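The POSIX rules above can be exercised through split(), which runs in POSIX mode by default. The inputs are chosen for illustration; raw strings are used so that the backslashes reach the lexer unchanged.

```python
import shlex

# Quotes are stripped and do not separate words.
stripped = shlex.split('"Do"Not"Separate"')

# An unquoted backslash keeps the next character literal.
escaped_space = shlex.split(r'a\ b')

# Inside single quotes the backslash is an ordinary character.
single_quoted = shlex.split(r"'a\b'")

# Inside double quotes the backslash only escapes the quote or itself;
# before any other character (here 'n') it is kept as-is.
double_quoted = shlex.split(r'"say \"hi\" \n"')

# A quoted empty string yields an empty token.
empty = shlex.split("'' x")

for result in (stripped, escaped_space, single_quoted, double_quoted, empty):
    print(result)
```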
Improved Compatibility with Shells
New in version 3.6.
The shlex class provides compatibility with the parsing performed by common Unix shells like bash, dash, and sh. To take advantage of this compatibility, specify the punctuation_chars argument in the constructor. This defaults to False, which preserves pre-3.6 behaviour. However, if it is set to True, then parsing of the characters ();<>|& is changed: any run of these characters is returned as a single token. While this is short of a full parser for shells (which would be out of scope for the standard library, given the multiplicity of shells out there), it does allow you to perform processing of command lines more easily than you could otherwise. To illustrate, you can see the difference in the following snippet:
- >>> import shlex
- >>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
- >>> list(shlex.shlex(text))
- ['a', '&', '&', 'b', ';', 'c', '&', '&', 'd', '|', '|', 'e', ';', 'f', '>',
- "'abc'", ';', '(', 'def', '"ghi"', ')']
- >>> list(shlex.shlex(text, punctuation_chars=True))
- ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', "'abc'",
- ';', '(', 'def', '"ghi"', ')']
Of course, tokens will be returned which are not valid for shells, and you'll need to implement your own error checks on the returned tokens.
Instead of passing True as the value for the punctuation_chars parameter, you can pass a string with specific characters, which will be used to determine which characters constitute punctuation. For example:
- >>> import shlex
- >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
- >>> list(s)
- ['a', '&', '&', 'b', '||', 'c']
Note
When punctuation_chars is specified, the wordchars attribute is augmented with the characters ~-./*?=. That is because these characters can appear in file names (including wildcards) and command-line arguments (e.g. --color=auto). Hence:
- >>> import shlex
- >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
- ... punctuation_chars=True)
- >>> list(s)
- ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']
For best effect, punctuation_chars should be set in conjunction with posix=True. (Note that posix=False is the default for shlex.)
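Combining the two settings, a rough sketch of splitting a command line into pipeline stages might look like this (Python 3.6 or later). split_pipeline is a hypothetical helper written for the example, not a complete shell parser; in particular, a quoted "|" is indistinguishable from a real pipe once the quotes have been stripped.

```python
import shlex


def split_pipeline(command):
    """Split a command line into argument lists, one per pipeline stage.

    Illustrative helper only: it treats every '|' token as a stage
    separator and performs no validation of the result.
    """
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    stages, current = [], []
    for token in lexer:
        if token == '|':
            stages.append(current)
            current = []
        else:
            current.append(token)
    stages.append(current)
    return stages


stages = split_pipeline('ls -l "my dir" | wc -l')
print(stages)
```

POSIX mode strips the quotes around "my dir", while punctuation_chars keeps -l intact as one word and isolates the pipe as its own token.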