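Regular expressions come up constantly when scraping data. If you are not very familiar with Python's re module, it is easy to mix up its many methods, so today let's review the module together.
Before studying any Python module, it helps to see what the official documentation says. Run: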
    import re
    help(re)

The output is as follows:
Help on module re:

NAME
    re - Support for regular expressions (RE).

FILE
    c:\python27\lib\re.py

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.

    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matches the string 'last'.

    The special characters are:
        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
                 Greedy means that it will match as many repetitions as possible.
        "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
        "?"      Matches 0 or 1 (greedy) of the preceding RE.
        *?,+?,?? Non-greedy versions of the previous three special characters.
        {m,n}    Matches from m to n repetitions of the preceding RE.
        {m,n}?   Non-greedy version of the above.
        "\\"     Either escapes special characters or signals a special sequence.
        []       Indicates a set of characters.
                 A "^" as the first character indicates a complementing set.
        "|"      A|B, creates an RE that will match either A or B.
        (...)    Matches the RE inside the parentheses.
                 The contents can be retrieved or matched later in the string.
        (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
        (?:...)  Non-grouping version of regular parentheses.
        (?P<name>...) The substring matched by the group is accessible by name.
        (?P=name)     Matches the text matched earlier by the group named name.
        (?#...)  A comment; ignored.
        (?=...)  Matches if ... matches next, but doesn't consume the string.
        (?!...)  Matches if ... doesn't match next.
        (?<=...) Matches if preceded by ... (must be fixed length).
        (?<!...) Matches if not preceded by ... (must be fixed length).
        (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                           the (optional) no pattern otherwise.
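Before the help text continues, a few of the special characters listed above can be tried out in a short sketch. The sample strings and variable names here are made up for illustration and are not part of the help output. The help text comes from Python 2.7, but the snippet should run unchanged on Python 3 as well.

```python
import re

# Hypothetical sample text, chosen only to show the difference.
text = "<b>bold</b> and <i>italic</i>"

# Greedy "*" matches as many repetitions as possible;
# the non-greedy "*?" stops at the first opportunity.
greedy = re.search(r"<.*>", text).group(0)
lazy = re.search(r"<.*?>", text).group(0)

# (?P<name>...) makes a group accessible by name, and (?P=name)
# matches the same text again later in the pattern.
tag = re.search(r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>", text)

# (?=...) is a lookahead: it checks what comes next without consuming it.
prices = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

# {m,n} bounds the number of repetitions of the preceding RE.
runs = re.findall(r"a{2,3}", "a aa aaa aaaa")

print(greedy)
print(lazy)
print("%s: %s" % (tag.group("tag"), tag.group("body")))
print(prices)
print(runs)
```

The greedy search returns the whole sample string, while the non-greedy one returns only the first tag, which is usually what is wanted when pulling tags out of scraped HTML.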
    The special sequences consist of "\\" and a character from the list
    below.  If the ordinary character is not on the list, then the
    resulting RE will match the second character.
        \number  Matches the contents of the group of the same number.
        \A       Matches only at the start of the string.
        \Z       Matches only at the end of the string.
        \b       Matches the empty string, but only at the start or end of a word.
        \B       Matches the empty string, but not at the start or end of a word.
        \d       Matches any decimal digit; equivalent to the set [0-9].
        \D       Matches any non-digit character; equivalent to the set [^0-9].
        \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
        \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
        \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
                 With LOCALE, it will match the set [0-9_] plus characters defined
                 as letters for the current locale.
        \W       Matches the complement of \w.
        \\       Matches a literal backslash.

    This module exports the following functions:
        match    Match a regular expression pattern to the beginning of a string.
        search   Search a string for the presence of a pattern.
        sub      Substitute occurrences of a pattern found in a string.
        subn     Same as sub, but also return the number of substitutions made.
        split    Split a string by the occurrences of a pattern.
        findall  Find all occurrences of a pattern in a string.
        finditer Return an iterator yielding a match object for each match.
        compile  Compile a pattern into a RegexObject.
        purge    Clear the regular expression cache.
        escape   Backslash all non-alphanumerics in a string.

    Some of the functions in this module takes flags as optional parameters:
        I  IGNORECASE  Perform case-insensitive matching.
        L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
        M  MULTILINE   "^" matches the beginning of lines (after a newline)
                       as well as the string.
                       "$" matches the end of lines (before a newline) as well
                       as the end of the string.
        S  DOTALL      "." matches any character at all, including the newline.
        X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
        U  UNICODE     Make \w, \W, \b, \B, dependent on the Unicode locale.

    This module also defines an exception 'error'.
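Since the functions and flags listed above are the ones that most often get mixed up, here is a minimal sketch that exercises several of them side by side. The input strings and variable names are invented for illustration. Only calls named in the help text are used, and the snippet is meant to behave the same on Python 2.7 and Python 3.

```python
import re

# Hypothetical log text, used only to demonstrate the calls.
log = "Error: disk full\nwarning: low memory\nERROR: timeout"

# match() only tries the beginning of the string; search() scans all of it.
at_start = re.match(r"disk", log)   # None: "disk" is not at position 0
anywhere = re.search(r"disk", log)

# Flags combine with "|": I ignores case, M lets ^ and $ match at each line.
errors = re.findall(r"^error: (.*)$", log, re.I | re.M)

# sub() substitutes matches; subn() also returns how many were replaced.
masked, count = re.subn(r"\d", "#", "tel 555-1234")

# split() splits a string by the occurrences of a pattern.
fields = re.split(r"\s*[,;]\s*", "a, b;c ,d")

# With more than one group, findall() returns a list of tuples.
pairs = re.findall(r"(\w+)=(\d+)", "x=1, y=22")

# \b matches the empty string at the start or end of a word.
words = re.findall(r"\bcat\b", "cat concat cat.")

# compile() builds a reusable pattern object; escape() backslashes
# non-alphanumeric characters so that "." matches only a literal dot.
literal = re.compile(re.escape("a.b"))

print(errors)
print(masked)
print(fields)
print(pairs)
print(words)
```

The difference between match() and search() is the usual source of confusion: at_start is None here even though "disk" does occur in the text, because match() never looks past the start of the string.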
CLASSES
    exceptions.Exception(exceptions.BaseException)
        sre_constants.error

    class error(exceptions.Exception)
     |  Method resolution order:
     |      error
     |      exceptions.Exception
     |      exceptions.BaseException
     |      __builtin__.object
     |
     |  Data descriptors defined here:
     |
     |  __weakref__
     |      list of weak references to the object (if defined)
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.Exception:
     |
     |  __init__(...)
     |      x.__init__(...) initializes x; see help(type(x)) for signature
     |
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from exceptions.Exception:
     |
     |  __new__ = <built-in method __new__ of type object>
     |      T.__new__(S, ...) -> a new object with type S, a subtype of T
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.BaseException:
     |
     |  __delattr__(...)
     |      x.__delattr__('name') <==> del x.name
     |
     |  __getattribute__(...)
     |      x.__getattribute__('name') <==> x.name
     |
     |  __getitem__(...)
     |      x.__getitem__(y) <==> x[y]
     |
     |  __getslice__(...)
     |      x.__getslice__(i, j) <==> x[i:j]
     |
     |      Use of negative indices is not supported.
     |
     |  __reduce__(...)
     |
     |  __repr__(...)
     |      x.__repr__() <==> repr(x)
     |
     |  __setattr__(...)
     |      x.__setattr__('name', value) <==> x.name = value
     |
     |  __setstate__(...)
     |
     |  __str__(...)
     |      x.__str__() <==> str(x)
     |
     |  __unicode__(...)
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from exceptions.BaseException:
     |
     |  __dict__
     |
     |  args
     |
     |  message

FUNCTIONS
    compile(pattern, flags=0)
        Compile a regular expression pattern, returning a pattern object.

    escape(pattern)
        Escape all non-alphanumeric characters in pattern.

    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.

        If one or more groups are present in the pattern, return a
        list of groups; this will be a list of tuples if the pattern
        has more than one group.

        Empty matches are included in the result.
    finditer(pattern, string, flags=0)
        Return an iterator over all non-overlapping matches in the
        string.  For each match, the iterator returns a match object.

        Empty matches are included in the result.

    match(pattern, string, flags=0)
        Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was fo