Notes on Cygwin

Download setup.exe from http://cygwin.com/, and install it.

Open the "System Variables" list and append the path to Cygwin's bin directory to the end of the "Path" variable; Unix commands such as ls and ps can then be used in the Windows command prompt.

Set up Environment Variables

The CYGWIN variable is used to configure many global settings for the Cygwin runtime system.

The PATH environment variable is used by Cygwin applications as a list of directories to search for executable files to run.

The HOME environment variable is used by many programs to determine the location of your home directory.

The TERM environment variable specifies your terminal type. It is automatically set to cygwin if you have not set it to something else.

The LD_LIBRARY_PATH environment variable is used by the Cygwin function dlopen() as a list of directories to search for .dll files to load.

Customizing bash

In your home directory, you can see the following files: .bash_profile (or .profile), .bashrc, and .inputrc.

.bash_profile is executed when bash is started as login shell, e.g. from the command bash --login. This is a useful place to define and export environment variables and bash functions that will be used by bash and the programs invoked by bash. It is a good place to redefine PATH if needed.

For example, appending ":." to the end of PATH makes the shell also search the current working directory:

PATH=$PATH:.

To avoid delays, you should either unset MAILCHECK or define MAILPATH to point to your existing mail inbox.
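
As a minimal sketch of such a .bash_profile fragment, the PATH change can be guarded so that repeated logins do not keep appending the same entry. The helper name path_append and the directory $HOME/bin are illustrative assumptions, not anything Cygwin provides:

```shell
# Hypothetical helper: append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

path_append "$HOME/bin"              # "$HOME/bin" is only an example directory
export PATH

unset MAILCHECK                      # skip bash's periodic mail check
```

Wrapping both sides in colons inside the case pattern lets the same test cover an entry at the start, middle, or end of PATH.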

.bashrc is similar to .profile but is executed each time an interactive bash shell is launched. It serves to define elements such as aliases. .bashrc is not called automatically for login shells, so you need to source it from .bash_profile.

.inputrc controls how programs using the readline library (including bash) behave. It is loaded automatically.

Consider the following settings:

# Ignore case while completing

set completion-ignore-case on

# Make Bash 8bit clean

set meta-flag on

set convert-meta off

set output-meta on

Tools that do not use readline for display, such as less and ls, require additional settings, which could be put in your .bashrc:

alias less='/bin/less -r'

alias ls='/bin/ls -F --color=tty --show-control-chars'

Using Cygwin

Mapping path names

Cygwin supports both Win32- and POSIX-style paths, where directory delimiters may be either forward or back slashes. UNC pathnames (starting with two slashes and a network name) are also supported.

Cygwin maintains a special internal POSIX view of the Win32 file system that allows POSIX programs to run successfully under Windows. Cygwin uses this mapping to translate between Win32 and POSIX paths as necessary.

The Cygwin Mount Table

The mount utility program is used to map Win32 drives and network shares into Cygwin's internal POSIX directory tree. Whenever Cygwin generates a POSIX path from a Win32 one, it uses the longest matching prefix in the mount table.

Invoking mount without any arguments displays Cygwin’s current set of mount points.

Whenever Cygwin cannot use any of the existing mounts to convert from a particular Win32 path to a POSIX one, Cygwin will automatically default to an imaginary mount point under the default POSIX path /cygdrive.

For example, Z:\ would be automatically converted to /cygdrive/Z.

mount; mount c: /c

Additional Path-related Information

Symbolic links can also be used to map Win32 pathnames to POSIX, but symbolic links cannot set the default file access mode.

ln -s d:/yuanyun/ws /ws

The .exe extension

Executable program filenames end with .exe, but the .exe need not be included in the command, so that traditional UNIX names can be used. However, for programs that end in .bat and .com, you cannot omit the extension.

If a shell script myprog and a program myprog.exe coexist in a directory, the program has precedence and is selected for execution of myprog.

The gcc compiler produces an executable named filename.exe when asked to produce filename. This allows many makefiles written for UNIX systems to work well under Cygwin.

Unfortunately, the install and strip commands do distinguish between filename and filename.exe. They fail when working on a non-existing filename even if filename.exe exists, thus breaking some makefiles. This problem can be solved by writing install and strip shell scripts to provide the extension ".exe" when needed.
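
A wrapper of the kind described above might look like the following sketch. The function names resolve_exe and strip_exe are hypothetical, and the wrapper handles a single file operand only. Newer Cygwin releases may already resolve the suffix themselves, in which case the helper simply returns the name unchanged:

```shell
# Hypothetical helper: print "name.exe" when only that variant exists,
# otherwise print the name unchanged.
resolve_exe() {
    if [ ! -e "$1" ] && [ -e "$1.exe" ]; then
        printf '%s\n' "$1.exe"
    else
        printf '%s\n' "$1"
    fi
}

# Hypothetical wrapper: run the real strip on the resolved file name.
strip_exe() {
    command strip "$(resolve_exe "$1")"
}
```

A makefile rule that runs strip on filename could then call strip_exe filename instead; an install wrapper would follow the same pattern.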

The /proc filesystem

Cygwin supports the /proc virtual filesystem. The files in this directory are representations of various aspects of your system.

cat /proc/cpuinfo

One unique aspect of the Cygwin /proc filesystem is /proc/registry, which displays the Windows registry with each KEY as a directory and each VALUE as a file. As always when dealing with the Windows registry, use caution, since changes may result in an unstable or broken system.

cd /proc/registry;ls

The @pathnames

To circumvent the limitations on shell line length in the native Windows command shells, Cygwin programs expand their arguments starting with "@" in a special way. If a file pathname exists, the argument @pathname expands recursively to the content of pathname. Double quotes can be used inside the file to delimit strings containing blank space. Embedded double quotes must be repeated.

Cygwin Utilities

cygcheck

cygpath

The cygpath program is a utility that converts Windows native filenames to Cygwin POSIX-style pathnames and vice versa. It can be used when a Cygwin program needs to pass a file name to a native Windows program, or expects to get a file name from a native Windows program.

dumper

The dumper utility can be used to create a core dump of a running Windows process. This core dump can later be loaded into gdb and analyzed. One common way to use dumper is to plug it into Cygwin's Just-In-Time debugging facility by adding error_start=x:\path\to\dumper.exe to the CYGWIN environment variable. If error_start is set this way, then dumper will be started whenever some program encounters a fatal error.

dumper can also be started from the command line to create a core dump of any running process. Unfortunately, because of a Windows API limitation, when a core dump is created and dumper exits, the target process is terminated too.

kill

-f, --force force, using win32 interface if necessary

The kill program allows you to send arbitrary signals to other Cygwin programs. The usual purpose is to end a running program from some other window when ^C won't work, but you can also send program-specified signals such as SIGUSR1 to trigger actions within the program, like enabling debugging or re-opening log files. Each program defines the signals it understands.

You may need to specify the full path to use kill from within some shells, including bash, the default Cygwin shell, because bash defines a kill builtin. To make sure you are using the Cygwin version, use: /bin/kill --version
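
To illustrate how a program "defines" a signal such as SIGUSR1, here is a small sketch in which a shell script installs its own handler and then signals itself. The handler name on_usr1 and the counter are hypothetical, and the shell's builtin kill is used here only to keep the example portable:

```shell
reopen_requests=0

# Hypothetical handler: a real daemon would close and re-open its log file here.
on_usr1() {
    reopen_requests=$((reopen_requests + 1))
}
trap on_usr1 USR1

# Send SIGUSR1 to this very shell ($$); the handler runs once the kill command returns.
kill -s USR1 "$$"

echo "re-open requests seen: $reopen_requests"
```

From another window, the same signal could be sent with kill -s USR1 followed by the process ID shown by ps.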

mount

Cygdrive mount points

Whenever Cygwin cannot use any of the existing mounts to convert from a particular Win32 path to a POSIX one, Cygwin will, instead, convert to a POSIX path using a default mount point: /cygdrive. The mount utility can be used to change this default automount prefix through the use of the "--change-cygdrive-prefix" option. We can set the automount prefix to /:

mount --change-cygdrive-prefix /

ps

-a, --all show processes of all users

-e, --everyone show processes of all users

-W, --windows show windows as well as cygwin processes

regtool: View or edit the Win32 registry

umount

-A, --remove-all-mounts remove all mounts

-c, --remove-cygdrive-prefix remove cygdrive prefix

-s, --system remove system mount (default)

-S, --remove-system-mounts remove all system mounts

-u, --user remove user mount

-U, --remove-user-mounts remove all user mounts

Using Cygwin effectively with Windows

Many Windows utilities provide a good way to interact with Cygwin's predominantly command-line environment. From Cygwin, you can also call Windows commands such as ipconfig, net.exe, and notepad.exe. Most of these tools support the /? switch to display usage information.

Pathnames

Windows programs do not understand POSIX pathnames, so any arguments that reference the filesystem must be in Windows (or DOS) format or translated. Cygwin provides the cygpath utility for converting between Windows and POSIX paths.

notepad.exe "$(cygpath -aw "Desktop/Phone Numbers.txt")"

A few programs require a Windows-style, semicolon-delimited path list, which cygpath can translate from a POSIX path with the -p option.

javac -cp "$(cygpath -pw "$CLASSPATH")" hello.java

The cygutils package

Unix tools such as tr can convert between CRLF and LF endings, but cygutils provides several dedicated programs: conv, d2u, dos2unix, u2d, and unix2dos.
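
For the tr route mentioned above, a minimal sketch could look like this. The function names crlf_to_lf and lf_to_crlf are made up for illustration; both filter standard input to standard output:

```shell
# Hypothetical helper: drop every carriage return, turning CRLF endings into LF.
# Note that this also removes any stray CR that is not part of a CRLF pair.
crlf_to_lf() {
    tr -d '\r'
}

# Hypothetical helper for the reverse direction: end every line with CR LF.
lf_to_crlf() {
    awk '{ printf "%s\r\n", $0 }'
}
```

Usage would be along the lines of crlf_to_lf < dosfile.txt > unixfile.txt, where both file names are placeholders.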

Creating shortcuts with cygutils

The cygutils package includes a mkshortcut utility for creating standard Microsoft .lnk files.

Printing with cygutils

There are several options for printing from Cygwin, including the lpr found in cygutils.

The easiest way to use cygutils' lpr is to specify a default device name in the PRINTER environment variable. You may also specify a device on the command line with the -d or -P options, which will override the environment variable setting.

A device name may be a UNC path (\\server_name\printer_name), a reserved DOS device name (prn, lpt1), or a local port name that is mapped to a printer share.

lpr sends raw data to the printer; no formatting is done. Many, but not all, printers accept plain text as input. If your printer supports PostScript, packages such as a2ps and enscript can prepare text files for printing. The ghostscript package also provides some translation from PostScript to various native printer languages. Additionally, a native Windows application for printing PostScript, gsprint, is available from the Ghostscript website.

Programming with Cygwin

Using GCC with Cygwin

Console Mode Applications

gcc --help, gcc hello.c -o hello.exe, g++ Welcome.cpp -o Welcome.exe

The g++ command signifies that the C++ compiler should be used instead of the C compiler.

Compiling Programs with Multiple Source Files

Compiling a program that has two or more source files can be accomplished in two ways. The first method requires listing all the files on the command line. The second method takes advantage of the shell's wildcard character (*).

g++ *.cpp -o Fig06_05

The STLPort Library

  1. This library can be downloaded at www.STLPort.org.

  2. The library must be installed from the Cygwin prompt.

  3. Enter the command make -f gcc-cygwin.mak to start the creation of the install files.

  4. When completed, enter make -f gcc-cygwin.mak install to begin the installation of the new library.

The new library should now be installed and working. To test it, compile a program using g++ -I /usr/Local/Include/STLPort File.cpp -L /usr/Local/lib -lSTLPort_Cygwin -o ExecutableFile.

Using the GDB Debugger

Before you can debug your program, you need to prepare your program for debugging. What you need to do is add -g to all the other flags you use when compiling your sources to objects.

gcc -g myapp.c -o myapp, g++ Debug.cpp -o Debug -g

What this does is add extra information to the objects (they get much bigger too) that tell the debugger about line numbers, variable names, and other useful things. These extra symbols and debugging information give your program enough information about the original sources so that the debugger can make debugging much easier for you.

To invoke GDB, simply type gdb Debug.exe, then (gdb) will appear to prompt you to enter commands, like run or help.

If your program crashes and you’re trying to figure out why it crashed, the best thing to do is type run and let your program run. After it crashes, you can type where to find out where it crashed, or info locals to see the values of all the local variables. There’s also a print that lets you look at individual variables or what pointers point to.

If your program is doing something unexpected, you can use the break command to tell gdb to stop your program when it gets to a specific function or line number:

break 47, break my_function

When you type run, your program will stop at that "breakpoint" and you can use the other gdb commands to look at the state of your program at that point, modify variables, and step through your program's statements one at a time.

You may specify additional arguments to the run command to provide command-line arguments to your program.

Debugging with command line arguments

myprog -t foo --queue 47

gdb myprog

(gdb) run -t foo --queue 47

Use Cygwin in Eclipse.org CDT

Use Linux's telnet on cygwin

To use Linux's telnet on Cygwin instead of the Windows version, we just need to install the inetutils package (in category Net).



Some Little Tricks:

Problem:

When I connect to some Linux/Unix machines and use vi, it reports 'Unknown terminal: cygwin'.

This is because the "cygwin" terminal type is cygwin's emulation of a UNIX terminal in a windows "dos box". Since it's not a "standard" terminal type, it's not included in a lot of terminfo databases.

Solution:

Add 'export TERM=xterm' to .bashrc or .kshrc etc.

Trick2: Use backslash(\) directly as path delimiter on cygwin

On Cygwin, we cannot directly call 'cd D:\dira\dirb' to change directory, because the shell treats each unquoted backslash as an escape character. We can type 'cd D:/dira/dirb' or 'cd D:\\dira\\dirb', but this is somewhat inconvenient.

A little trick is to wrap the path in single quotes:

cd 'D:\dira\dirb'

Now we can just copy a directory path from Windows Explorer and paste it into Cygwin's command line.

Trick3: Display Chinese in Cygwin

Set the home directory if you have not done so, and change to it.

1. Edit ~/.inputrc, and add the following lines:

set meta-flag on

set input-meta on

set convert-meta off

set output-meta on

2. Edit ~/.bash_profile, and add the following line:

alias ls='ls --show-control-chars'



Resources

Cygwin User's Guide

DiveIntoCygwinGCC.pdf

http://ras52-tech.blogspot.com/2007/01/telnet-on-cygwin.html

http://www.nabble.com/Could-I-use-backslash-directly-as-path-delimiter-on-cygwin--to21273102.html#a21273528

http://pinglunliao.blogspot.com/2006/05/chinese-in-cygwin.html


Take Notes from Mastering Regular Expressions

Introduction to Regular Expression
Searching Text Files: Egrep
egrep is freely available for many systems, including DOS, MacOS, Windows, Unix, and so on.
Egrep Metacharacters
^ (caret) and $ (dollar) represent the start and end of the line of text.
Character Classes: Matching any one of several characters
[], usually called a character class, lists the characters you want to allow at that point in the match.
Within a character class, the character-class metacharacter '-' (dash) indicates a range of characters, as in <H[1-6]>, [0-9], [a-z], and [0-9A-Z_!.?].
A dash is a metacharacter only when it is within a character class and is not the first character listed in the class; otherwise it matches the normal dash character.
The question mark and period at the end of the class are usually regular-expression metacharacters, but only when not within a class. The only special characters within the class in [0-9A-Z_!.?] are the two dashes.
^cat$ matches a line that consists of only cat; ^$ matches an empty line (with nothing in it, not even spaces).
^ alone: since every line has a beginning, every line will match, even lines that are empty!
Negated character classes [^]
The leading ^ in the class "negates" the list.
[^1-6] matches a character that's not 1 through 6.
^ is a line anchor outside a class, but a class metacharacter inside a class (but, only when it is immediately after the class's opening bracket; otherwise, it's not special inside a class).
A negated character class means "match a character that's not listed" and not "don't match what is listed." A convenient way to view a negated class is that it is simply a shorthand for a normal class that includes all possible characters except those that are listed.
Matching Any Character with Dot
The metacharacter . is a shorthand for a character class that matches any character.
In 03[-./]19[-./]76, the dots are not metacharacters because they are within a character class.
The dashes are also not class metacharacters in this case because each is the first thing after [ or [^. The list of metacharacters and their meanings are different inside and outside of character classes.
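The difference between a dot inside and outside a class can be checked with a quick sketch like the one below. The sample lines and the variable names strict and loose are invented for illustration:

```shell
dates='03/19/76
03-19-76
03.19.76
03x19x76'

# Inside the class, "." and "/" are literal and the leading "-" is a plain dash,
# so only the three real separators are accepted.
strict=$(printf '%s\n' "$dates" | grep -E -c '03[-./]19[-./]76')

# Outside a class, "." matches any character, so the "x" line matches as well.
loose=$(printf '%s\n' "$dates" | grep -E -c '03.19.76')

echo "class: $strict line(s), dot: $loose line(s)"
```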
Knowing the target text well is an important part of wielding regular expressions effectively.
Alternation: Matching any one of several subexpressions
| means "or", as in Bob|Robert.
In gr[a|e]y, the '|' character is just a normal character, like a and e.
Alternation reaches far, but not beyond parentheses.
A character class can match just a single character in the target text. With alternation, since each alternative can be a full-fledged regular expression in and of itself, each alternative can match an arbitrary amount of text. Character classes are almost like their own special mini-language (with their own ideas about metacharacters, for example), while alternation is part of the "main" regular expression language.
Also, take care when using caret or dollar in an expression that has alternation. Compare ^From|Subject|Date:• with ^(From|Subject|Date):•. The first is composed of three alternatives, so it matches "^From or Subject or Date: •," which is not particularly useful. We want the leading caret and trailing : • to apply to each alternative. We can accomplish this by using parentheses to "constrain" the alternation:
^(From|Subject|Date):•
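As a rough check of why the parentheses matter, the two forms can be run against a few sample header lines. The lines below are made up, and a literal space stands in for the • marker:

```shell
headers='From: alice@example.com
Subject: lunch
X-Original-Subject: hello
Date: today'

# Three separate alternatives: "^From", "Subject", and "Date: ".
# "Subject" is unanchored, so it also hits the X-Original-Subject line.
unanchored=$(printf '%s\n' "$headers" | grep -E -c '^From|Subject|Date: ')

# Parentheses make the caret and the trailing ": " apply to every alternative.
anchored=$(printf '%s\n' "$headers" | grep -E -c '^(From|Subject|Date): ')

echo "unanchored: $unanchored, anchored: $anchored"
```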
Ignoring Differences in Capitalization: -i
egrep's command-line option "-i" tells it to do a case-insensitive match.
Word Boundaries: \< and \>
\<cat\> matches cat only as a whole word; \<cat matches a word that starts with cat, and cat\> matches a word that ends with cat.
The "start of a word" is simply the position where a sequence of alphanumeric characters begins; "end of word" is where such a sequence ends. "Start and end of word" is better phrased as "start and end of an alphanumeric sequence."
Summary of Metacharacters Seen So Far
Metacharacter   Name                      Matches
.               dot                       any one character
[]              character class           any character listed
[^]             negated character class   any character not listed
^               caret                     the position at the start of the line
$               dollar                    the position at the end of the line
\<              backslash less-than       the position at the start of a word (not supported by all versions of egrep)
\>              backslash greater-than    the position at the end of a word (not supported by all versions of egrep)
|               or; bar                   matches either expression it separates
()              parentheses               used to limit scope of |
The rules about which characters are and aren't metacharacters (and exactly what they mean) are different inside a character class. Dot is a metacharacter outside of a class, but not within one. Conversely, a dash is a metacharacter within a class (usually), but not outside. Moreover, a caret has one meaning outside, another if specified inside a class immediately after the opening [, and a third if given elsewhere in the class.
Don't confuse alternation with a character class. The class [abc] and the alternation (a|b|c) effectively mean the same thing, but the similarity in this example does not extend to the general case. A character class can match exactly one character, and that's true no matter how long or short the specified list of acceptable characters might be.
Alternation, on the other hand, can have arbitrarily long alternatives.
A negated character class is simply a notational convenience for a normal character class that matches everything not listed. Thus, [^x] doesn't mean "match unless there is an x," but rather "match if there is something that is not x." The difference is subtle, but important. The first concept matches a blank line, for example, while [^x] does not.
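The blank-line difference can be seen in a small sketch. The three sample lines (an x, an empty line, and abc) and the variable names are invented for illustration:

```shell
sample=$(printf 'x\n\nabc\n')

# [^x] needs an actual character that is not x, so neither the blank line
# nor the line holding only "x" matches.
has_non_x=$(printf '%s\n' "$sample" | grep -c '[^x]')

# "Match unless there is an x" is what grep -v x expresses; the blank line qualifies.
lacks_x=$(printf '%s\n' "$sample" | grep -v -c 'x')

echo "[^x]: $has_non_x line(s), no x at all: $lacks_x line(s)"
```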
The useful -i option discounts capitalization during a match.
Optional Items: ?
The question mark attaches only to the immediately-preceding item.
Other Quantifiers: Repetition: + (plus) and * (star)
<HR•+SIZE•*=•*[0-9]+•*>
Defined range of matches: intervals
Some versions of egrep support a metasequence for providing your own minimum and maximum: {min,max}. This is called the interval quantifier.
Parentheses and Backreferences
Uses for parentheses: to limit the scope of alternation |, and to group multiple characters into larger units to which you can apply quantifiers like question mark and star.
In many regular-expression flavors, parentheses can "remember" text matched by the subexpression they enclose.
Backreferencing allows matching new text that is the same as some text matched earlier in the expression.
egrep -i '\<([a-z]+) +\1\>' files
With tools that support backreferencing, parentheses "remember" the text that the subexpression inside them matches, and the special metasequence \1 represents that text later in the regular expression, whatever it happens to be at the time.
Use \1, \2, \3, etc., to refer to the first, second, third, etc. sets. Pairs of parentheses are numbered by counting opening parentheses from the left, ([a-z])([0-9])\1\2.
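The doubled-word search can be tried on a few sample lines, as in the sketch below. It assumes a grep whose -E mode supports backreferences and the \< \> word anchors, as GNU grep does; the sample text is made up:

```shell
text='this is is a test
the theory holds
one two three'

# -n prefixes each hit with its line number. "the theory" is not reported,
# because \> requires the repeated text to end at a word boundary.
doubled=$(printf '%s\n' "$text" | grep -E -i -n '\<([a-z]+) +\1\>')

echo "$doubled"
```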
The Great Escape: \
ega\.att\.com matches the literal text ega.att.com, and \([a-zA-Z]+\) matches a word within parentheses.
A backslash used in this way is called an "escape": when a metacharacter is escaped, it loses its special meaning and becomes a literal character.
When used before a non-metacharacter, a backslash can have different meanings depending upon the version of the program; for example, \<, \>, and \1 are metasequences.
Variable names: [a-zA-Z_][a-zA-Z_0-9]*
A string within double quotes: "[^"]*"
Dollar amount (with optional cents): \$[0-9]+(\.[0-9][0-9])?
An HTTP/HTML URL
\<http://[-a-z0-9_.:]+/[-a-z0-9_:@&?=+,.!/~*%$]*\.html?\>
egrep -i '\<http://[^ ]*\.html?\>' files...
Time of day, such as "9:17 am" or "12:30 pm"
[0-9]?[0-9]:[0-9][0-9]•(am|pm)
A stricter version that rejects hours above 12 and minutes above 59: (1[012]|[1-9]):[0-5][0-9]•(am|pm)
24-hour time: ([01]?[0-9]|2[0-3]):[0-5][0-9]
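Time-of-day patterns like these are easy to get subtly wrong, so a quick check helps. The sketch below uses the alternation form (1[012]|[1-9]) for the 12-hour clock and parenthesizes the 24-hour alternation so the minutes apply to both branches. The function names is_12h and is_24h are hypothetical, ^ and $ are added to test whole strings, and a literal space replaces the • marker:

```shell
# Succeeds only if the whole argument is a 12-hour time such as "9:17 am".
is_12h() {
    printf '%s\n' "$1" | grep -E -q '^(1[012]|[1-9]):[0-5][0-9] (am|pm)$'
}

# Succeeds only if the whole argument is a 24-hour time such as "23:59".
is_24h() {
    printf '%s\n' "$1" | grep -E -q '^([01]?[0-9]|2[0-3]):[0-5][0-9]$'
}

for t in "9:17 am" "12:30 pm" "13:05 pm"; do
    if is_12h "$t"; then echo "$t: valid"; else echo "$t: rejected"; fi
done
```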
Egrep Metacharacter Summary
Items to Match a Single Character
Metacharacter   Name                      Matches
.               dot                       Matches any one character
[]              character class           Matches any one character listed
[^]             negated character class   Matches any one character not listed
\char           escaped character         Matches the literal character char when char is a metacharacter
Items Appended to Provide "Counting": The Quantifiers
?               question                  One allowed, but it is optional
*               star                      Any number allowed, but all are optional
+               plus                      At least one required; additional are optional
{min,max}       specified range           Min required, max allowed
Items That Match a Position
^               caret                     Matches the position at the start of the line
$               dollar                    Matches the position at the end of the line
\<              word boundary             Matches the position at the start of a word
\>              word boundary             Matches the position at the end of a word
Other
|               alternation               Matches either expression it separates
()              parentheses               Limits scope of alternation, provides grouping for the quantifiers, and "captures" for backreferences
\1, \2, ...     backreference             Matches text previously matched within first, second, etc., set of parentheses
Three reasons for using parentheses are constraining alternation, grouping, and capturing.
Character classes are special, and have their own set of metacharacters totally distinct from the "main" regex language.
Alternation and character classes are fundamentally different, providing unrelated services that appear, in only one limited situation, to overlap.
A negated character class is still a "positive assertion": even negated, a character class must match a character to be successful. Because the listing of characters to match is negated, the matched character must be one of those not listed in the class.
The useful -i option discounts capitalization during a match.
There are three types of escaped items:
1. The pairing of \ and a metacharacter is a metasequence to match the literal character
2. The pairing of \ and selected non-metacharacters becomes a metasequence with an implementation-defined meaning (for example, \< often means "start of word").
3. The pairing of \ and any other character defaults to simply matching the character (that is, the backslash is ignored).
Items governed by a question mark or star don't need to actually match any characters to "match successfully." They are always successful, even if they don't match anything.
Extended Introductory Examples
Check that each file contains 'ResetSize' exactly as many times as 'SetSize' (the command prints the name of each file where the counts differ):
perl -0ne 'print "$ARGV\n" if s/ResetSize//ig != s/SetSize//ig' *
perl -w programFile: -w tells Perl to check program more carefully and issue warnings about items it thinks to be dubious.
Matching Text with Regular Expressions
if ( $celsius =~ m/^[0-9]+$/)
The m means to attempt a regular expression match, while the slashes delimit the regex itself. =~ links a regex search with the target string to be searched.
The operator == tests whether two numbers are the same. (The operator eq is used to test whether two strings are the same.)
Side Effects of a Successful Match
Use the metacharacter \1 within the regular expression to refer to some text matched earlier during the same match attempt, and use the variable $1 in subsequent code to refer to that same text after the match has been successfully completed.
Non-Capturing Parentheses: (?:)
The benefits of this are twofold. One is that by avoiding the unnecessary capturing, the match process is more efficient.
Another is that, overall, using exactly the type of parentheses needed for each situation may be less confusing later to someone reading the code who might otherwise be left wondering about the exact nature of each set of parentheses.
\b normally matches a word boundary, but within a character class, it matches a backspace.
Temperature-conversion program final listing
if ($input =~ m/^([-+]?[0-9]+(\.[0-9]+)?)\s*([CF])$/i) {
    $InputNum = $1;  # Save to named variables to make the ...
    $type     = $3;  # ... rest of the program easier to read.
    if ($type =~ m/c/i) { } else { }  # Celsius branch, else Fahrenheit branch (bodies omitted)
} else {
    # The initial regex did not match, so issue a warning.
}
2. Perl can check a string in a variable against a regex using the construct $variable =~ m/regex/. The m indicates that a match is requested, while the slashes delimit (and are not part of) the regular expression. The whole test, as a unit, is either true or false.
4. Among the more useful shorthands that Perl and many other flavors of regex provide are:
\t a tab character
\n a newline character
\r a carriage-return character
\b a backspace (within a character class; elsewhere \b is a word boundary)
\s matches any "whitespace" character (space, tab, newline, formfeed, and such)
\S anything not \s
\w [a-zA-Z0-9_] (useful as in \w+, ostensibly to match a word)
\W anything not \w, i.e., [^a-zA-Z0-9_]
\d [0-9], i.e., a digit
\D anything not \d, i.e., [^0-9]
5. The /i modifier makes the test case-insensitive. Other modifiers include /g ("global match") and /x ("free-form expressions").
6. (?:) non-capturing parentheses can be used for grouping without capturing.
7. After a successful match, Perl provides the variables $1, $2, $3, etc., which hold the text matched by their respective () parenthesized subexpressions in the regex.
Subexpressions are numbered by counting open parentheses from the left, starting with one. Subexpressions can be nested, as in (Washington(•DC)?). Raw () parentheses can be intended for grouping only, but as a byproduct, they still capture into one of the special variables.
Modifying Text with Regular Expressions: $var =~ s/regex/replacement/
The regex is the same as with m//, but the replacement is actually a Perl string in its own right. You can include references to variables, including $1, $2, and so on to refer to parts of what was just matched.
Perl provides the catch-all \b, which matches at either a start-of-word or an end-of-word position:
$var =~ s/Jeff/Jeffrey/; $var =~ s/\bJeff\b/Jeffrey/; $var =~ s/\bJeff\b/Jeff/i
Just what does $var =~ s/\bJeff\b/Jeff/i do? It matches Jeff as a whole word in any capitalization (jeff, JEFF, and so on) and replaces it with exactly Jeff, which normalizes the capitalization.
Example: Form Letter
$letter =~ s/=FAMILY=/$family/g;
The /g "global replacement" modifier instructs the s/// to replace all occurrences.
Example: Prettifying a Stock Price
Always take the first two digits after the decimal point, and take the third digit only if it is not zero. Then, remove any other digits: $price =~ s/(\.\d\d[1-9]?)\d*/$1/
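The same substitution can be tried outside Perl. The sketch below is a sed -E translation, not the Perl original: \d is written as [0-9] and $1 as \1, and the function name trim_price is hypothetical:

```shell
# Keep two decimals, plus a third only when it is non-zero; drop the rest.
trim_price() {
    printf '%s\n' "$1" | sed -E 's/(\.[0-9][0-9][1-9]?)[0-9]*/\1/'
}

trim_price 12.3750000002
trim_price 37.500
```

In the second call the optional [1-9]? matches nothing, because the third decimal is a zero, so the trailing [0-9]* swallows it.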
Automated Editing
perl -p -i -e 's/sysread/read/g' file
This runs the Perl program s/sysread/read/g. (The -e flag indicates that the entire program follows right there on the command line.) The -p flag results in the substitution being done for every line of the named file, and the -i flag causes any changes to be written back to the file when done.
There is no explicit target string for the substitute command to work on (that is, no $var =~ ) because conveniently, the -p flag implicitly applies the program, in turn, to each line of the file. Also, because I used the /g modifier, I'm sure to replace multiple occurrences that might be in a line.
A Small Mail Utility
Perl’s magic "<>" operator gives the next line of input when you assign from it to a normal $variable, as with "$variable = <>". It is just Perl's funny way to express a kind of a getline ().
while ($line = <>) {
    if ($line =~ m/^\s*$/) {   # If we have an empty line...
        last;                  # this immediately ends the 'while' loop.
    }
    if ($line =~ m/^Subject: (.*)/i) { $subject = $1; }
    if ($line =~ m/^From: (\S+) \(([^()]*)\)/i) { $reply_address = $1; $from_name = $2; }
}
$line =~ s/^/|> /;
The substitute searches for ^, which of course immediately matches at the beginning of the string. It doesn't actually match any characters, though, so the substitute "replaces" the "nothingness" at the beginning of the string with '|>•'. In effect, it inserts '|>•' at the beginning of the string.
Perl's defined function indicates whether the variable has a value, while the die function issues an error message and exits the program.
Adding Commas to a Number with Lookaround
Regular expressions generally work left-to-right.
Lookaround matches positions within the text and doesn't "consume" text.
Positive lookahead (?=), such as (?=\d).
Lookbehind (?<=) looks back (toward the left), such as (?<=\d)
Lookaround doesn't "consume" text
(?=Jeffrey) matches only a location: the position immediately before an occurrence of Jeffrey.
Lookahead uses its subexpression to check the text, but only to find a location in the text at which it can be matched, not the actual text it matches.
The combined expression, (?=Jeffrey)Jeff effectively matches "Jeff" only if it is part of "Jeffrey." It is effectively the same as Jeff(?=rey).
Approaches to the "Jeffs" Problem
s/\bJeffs\b/Jeff's/g
s/\b(Jeff)(s)\b/$1'$2/g
s/\bJeff(?=s\b)/Jeff'/g
s/(?<=\bJeff)(?=s\b)/'/g
This regex doesn't actually "consume" any text. It uses both lookahead and lookbehind to match positions of interest, at which an apostrophe is inserted. Very useful to illustrate lookaround.
s/(?=s\b)(?<=\bJeff)/'/g
This is exactly the same as the one above, but the two lookaround tests are reversed. Because the tests don't consume text, the order makes no difference to whether there's a match.
Which "Jeffs" solutions would preserve case when applied with /i?
s/\b(Jeff)(s)\b/$1'$2/g and s/(?<=\bJeff)(?=s\b)/'/g
To preserve case, you've got to either replace the exact characters consumed (rather than just always inserting 'Jeff's'), or not consume any letters.
Back to the comma example ...
"Locations having digits on the right in exact sets of three, and at least some digits on the left."
The second requirement is simple enough with lookbehind.
$pop =~ s/(?<=\d)(?=(\d\d\d)+$)/,/g; print "The US population is $pop\n";
Also (?<=\d)(?=(?:\d\d\d)+$)
$text =~ s/(?<=\d)(?=(\d\d\d)+(?!\d))/,/g;
Four Types of Lookaround
Type                  Regex         Successful if the enclosed subexpression ...
Positive Lookbehind   (?<=......)   can match to the left
Negative Lookbehind   (?<!......)   cannot match to the left
Positive Lookahead    (?=......)    can match to the right
Negative Lookahead    (?!......)    cannot match to the right
Commafication without lookbehind
Lookbehind is not as widely supported (nor as widely used) as lookahead.
$text =~ s/(\d)(?=(\d\d\d)+(?!\d))/$1,/g;
When one iteration ends, the next picks up the inspection of the text at the point where the previous match ended.
Does $text =~ s/(\d)((\d\d\d)+\b)/$1,$2/g "commafy" a number?
This won't work the way we want. It leaves results such as "281,421906." This is because the digits matched by (\d\d\d)+ are now actually part of the final match, and so are not left "unmatched" and available to the next iteration of the regex via the /g.

When one iteration ends, the next picks up the inspection of the text at the point where the previous match ended. The whole point of using lookahead was to get the positional check without actually having the inspected text check count toward the final "string that matched."

Actually, this expression can still be used to solve this problem. With each such application, one more comma is added (to each number in the target string, due to the /g modifier).
while ( $text =~ s/(\d)((\d\d\d)+\b)/$1,$2/g ) {
# Nothing to do inside the body of the while -- we merely want to reapply the regex until it fails
}
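The repeated-application idea can be sketched in Python with re.subn, which returns the number of replacements made. The helper name commafy_by_repetition is an assumption for illustration.

```python
import re

# Consumes the digits it inspects, so each pass adds only one comma per number.
STEP = re.compile(r"(\d)((?:\d\d\d)+\b)")

def commafy_by_repetition(text):
    """Reapply the substitution until it no longer changes anything."""
    count = 1
    while count:
        text, count = STEP.subn(r"\1,\2", text)
    return text

print(STEP.sub(r"\1,\2", "281421906"))      # a single pass is incomplete
print(commafy_by_repetition("281421906"))   # looping finishes the job
```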
Text-to-HTML Conversion
Separating paragraphs
^ and $ normally refer not to logical line positions, but to the absolute start- and end-of-string positions. Most regex-endowed languages give us an easy solution: an enhanced line-anchor match mode in which the meaning of ^ and $ changes from string-related to logical-line-related. With Perl, this mode is specified with the /m modifier:
$text =~ s/^\s*$/<p>/mg;
\s can match a newline. This means that if we have several blank lines in a row, ^\s*$ is able to match them all in one shot. The fortunate result is that the replacement leaves just one <p> instead of the several in a row we would otherwise end up with.
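To see the effect of the enhanced line-anchor mode in isolation, here is a small Python sketch, where re.M plays the role of Perl's /m. The sample text and variable names are made up for illustration.

```python
import re

text = "one\ntwo\nthree"

# Default mode: ^ and $ relate to the whole string, so no single
# "word-only" stretch spans it and nothing matches.
string_mode = re.findall(r"^\w+$", text)

# Enhanced line-anchor mode: ^ and $ relate to each logical line.
line_mode = re.findall(r"^\w+$", text, re.M)

print(string_mode)
print(line_mode)
```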
"Linkizing" an email address
$text =~ s/\b(username regex\@hostname regex)\b/<a href="mailto:$1">$1<\/a>/g;
Perl allows picking our own delimiters. s!regex!string!modifiers or s{regex}{string}modifiers.
Matching the username and hostname
The /x modifier does two simple but powerful things for the regular expression. First, it causes most whitespace to be ignored, so you can "free-format" the expression for readability. Secondly, it allows comments with a leading #.
Specifically, /x turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within a character class (which means that classes are not free-format, even with /x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace, as in m/<a \s+ href=...>/x.
/x applies only to the regular expression, and not to the replacement string.
Putting it together
undef $/; # Enter "file-slurp" mode
$text = <>; # Slurp up the first file given on the command line.
$text =~ s/&/&amp;/g; # Make the basic HTML ...
$text =~ s/</&lt;/g; # ... characters &, <, and > ...
$text =~ s/>/&gt;/g; # ... HTML safe.
$text =~ s/^\s*$/<p>/mg; # Separate paragraphs.
# Turn email addresses into links ...
$text =~ s{
\b
# Capture the address to $1 ...
(
\w[-.\w]* # username
\@
[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info) # hostname
)
\b
}{<a href="mailto:$1">$1</a>}gix;
# Turn HTTP URLs into links ...
$text =~ s{
\b
# Capture the URL to $1 ...
(
http:// [-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info) \b # hostname
(
/ [-a-z0-9_:\@&?=+,.!/~*'%\$]* # Optional path
(?<![.,?!]) # Not allowed to end with [.,?!]
)?
)
}{<a href="$1">$1</a>}gix;
Building a regex library
The same expression is used for each of the two hostnames, which means that if we ever update one, we have to be sure to update the other. Rather than keeping that potential source of confusion, consider the three instances of $HostnameRegex in this modified snippet from our program:
$HostnameRegex = qr/[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)/i;
# Turn email addresses into links ...
$text =~ s{
\b
# Capture the address to $1 ...
(
\w[-.\w]* # username
\@
$HostnameRegex # hostname
)
\b
}{<a href="mailto:$1">$1</a>}gix;
# Turn HTTP URLs into links ...
$text =~ s{
\b
# Capture the URL to $1 ...
(
http:// $HostnameRegex \b # hostname
(
/ [-a-z0-9_:\@&?=+,.!/~*'%\$]* # Optional path
(?<![.,?!]) #not allowed to end with [.,?!]
)?
)
}{<a href="$1">$1</a>}gix;
qr operator converts the regex provided into a regex object. Later you can use that object in place of a regular expression, or even as a subexpression of some other regex.
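A comparable "regex library" can be sketched in Python by keeping the hostname pattern in one string and interpolating it into larger free-format (re.VERBOSE) patterns. The names HOSTNAME, EMAIL, URL, and linkize are assumptions for illustration, not part of the original program.

```python
import re

# Shared building block, playing the role of $HostnameRegex.
HOSTNAME = r"[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)"

EMAIL = re.compile(rf"""
    \b
    (                 # capture the address to group 1
      \w[-.\w]*       # username
      @
      {HOSTNAME}      # hostname
    )
    \b
""", re.IGNORECASE | re.VERBOSE)

URL = re.compile(rf"""
    \b
    (                 # capture the URL to group 1
      http://{HOSTNAME}\b
      (?:
        /[-a-z0-9_:@&?=+,.!/~*'%$]*   # optional path
        (?<![.,?!])                   # path may not end with . , ? or !
      )?
    )
""", re.IGNORECASE | re.VERBOSE)

def linkize(text):
    """Wrap email addresses and HTTP URLs in anchor tags."""
    text = EMAIL.sub(r'<a href="mailto:\1">\1</a>', text)
    return URL.sub(r'<a href="\1">\1</a>', text)

print(linkize("mail jeff@example.edu or see http://www.example.com/docs."))
```

Updating HOSTNAME in one place changes both patterns, which is the point of the qr-based version above.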
Why '$' and '@' sometimes need to be escaped
You'll notice that the same '$' is used as both the end-of-string metacharacter, and to request interpolation (inclusion) of a variable. Normally, there's no ambiguity to what '$' means, but within a character class it gets a bit tricky. Since it can't possibly mean end-of-string within a class, in that situation Perl considers it a request to interpolate (include from) a variable, unless it's escaped. If escaped, the '$' is just included as a member of the class. That's what we want this time, so that's why we have to escape the dollar sign in the path part of the URL-matching regex.

Perl uses @ at the beginning of array names, and Perl string or regex literals allow arrays to be interpolated. If we wish a literal @ to be part of a regex, we must escape it so that it's not taken as an array interpolation.
That Doubled-Word Thing
$/ = ".\n";
while (<>) {
next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;
s/^(?:[^\e]*\n)+//mg; # Remove any unmarked lines.
s/^/$ARGV: /mg; # Ensure lines begin with filename.
print;
}
Double-word example in Perl
$/ = ".\n"; # Sets a special "chunk-mode"; chunks end with a period-newline combination
while (<>)
{
next unless s{# (regex starts here)
### Need to match one word:
\b # Start of word ... .
( [a-z]+ ) # Grab word, filling $1 (and \1).

### Now need to allow any number of spaces and/or <TAGS>
( #Save what intervenes to $2.
(?: # (Non-capturing parens for grouping the alternation)
\s # Whitespace (includes newline, which is good).
| # -or-
<[^>]+> # Item like <TAG>.
)+ # Need at least one of the above, but allow more.
)
### Now match the first word again:
(\1\b) # \b ensures not embedded. This copy saved to $3.
#(regex ends here)
}
# Above is the regex. The replacement string is below, followed
# by the modifiers, /i, /g, and /x
{\e[7m$1\e[m$2\e[7m$3\e[m}igx;
s/^(?:[^\e]+\n)+//mg; # Remove any unmarked lines.
s/^/$ARGV: /mg; # Ensure lines begin with filename.
print;
}
❶ Because the doubled-word problem must work even when the doubled words are split across lines, I can't use the normal line-by-line processing I used with the mail utility example. Setting the special variable $/ (yes, that's a variable) as shown puts the subsequent <> into a magic mode such that it returns not single lines, but more-or-less paragraph-sized chunks. The value returned is just one string, but a string that could potentially contain many of what we would consider to be logical lines.

❷ Did you notice that I don't assign the value from <> to anything? When used as the conditional of a while like this, <> magically assigns the string to a special default variable, $_ (yes, that's a variable too), which is used as the default operand for many functions and operators. That same variable holds the default string that s/// works on, and that print displays. Using these defaults makes the program less cluttered, but also less understandable to someone new to the language, so I recommend using explicit operands until you're comfortable.

❻ The variable $ARGV magically provides the name of the input file. Combined with /m and /g, this substitution tacks the input filename to the beginning of each logical line remaining in the string. Cool!
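For comparison, the core doubled-word substitution can be sketched in Python. The function name mark_doubled and its on/off marker parameters are assumptions for illustration; the defaults mirror the ANSI highlight sequences used in the Perl version.

```python
import re

# word, then whitespace and/or <TAGS>, then the same word again
DOUBLED = re.compile(r"\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)", re.IGNORECASE)

def mark_doubled(text, on="\x1b[7m", off="\x1b[m"):
    """Surround both copies of each doubled word with the given markers."""
    def repl(m):
        return f"{on}{m.group(1)}{off}{m.group(2)}{on}{m.group(3)}{off}"
    return DOUBLED.sub(repl, text)

print(mark_doubled("this is is a test", "[", "]"))
```

A replacement function is used instead of a template string so that the markers are inserted literally, whatever characters they contain.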

Moving bits around: operators, functions, and objects
You'll also notice that the regular expressions are located not in the main text-processing part of the program, but at the start, in the initialization section. The Pattern.compile function merely analyzes the string as a regular expression, and builds an internal "compiled version" that is assigned to a Pattern variable (regex1, etc.). Then, in the main text-processing part of the program, that compiled version is applied to text with regex1.matcher(text), the result of which is used to do the replacement.
import java.io.*;
import java.util.regex.*;

public class TwoWord
{
public static void main(String [] args)
{
Pattern regex1 =
Pattern.compile("\\b([a-z]+)((?:\\s|\\<[^>]+\\>)+)(\\1\\b)", Pattern.CASE_INSENSITIVE);
String replace1 = "\033[7m$1\033[m$2\033[7m$3\033[m";
Pattern regex2 = Pattern.compile("^(?:[^\\e]*\\n)+", Pattern.MULTILINE);
Pattern regex3 = Pattern.compile("^([^\\n]+)", Pattern.MULTILINE);
// For each command-line argument....
for (int i = 0; i < args.length; i++)
{
try {
BufferedReader in = new BufferedReader(new FileReader(args[i]));
String text;
// For each paragraph of each file.....
while ((text = getPara(in)) != null)
{
// Apply the three substitutions
text = regex1.matcher(text).replaceAll(replace1);
text = regex2.matcher(text).replaceAll("");
text = regex3.matcher(text).replaceAll(args[i] + ": $1");
System.out.print(text);
}
} catch (IOException e) {
System.err.println("can't read ["+args[i]+"]: " + e.getMessage());}
}
}
// Routine to read next "paragraph" and return as a string
static String getPara(BufferedReader in) throws java.io.IOException
{
StringBuffer buf = new StringBuffer();
String line;
while ((line = in.readLine()) != null && (buf.length() == 0 || line.length() != 0))
{buf.append(line + "\n");}
return buf.length() == 0 ? null : buf.toString();
}
}

Overview of Regular Expression Features and Flavors
Care and Handling of Regular Expressions
Regex handling in Java
  1. Inspect the regular expression and compile it into an internal form that matches in a case-insensitive manner, yielding a "Pattern" object.
  2. Associate it with some text to be inspected, yielding a "Matcher" object.
  3. Actually apply the regex to see if there is a match in the previously-associated text, and let us know the result.
  4. If there is a match, make available the text matched within the first set of capturing parentheses.
Regex handling in Python
import re
R = re.compile("^Subject:(.*)", re.IGNORECASE)
M = R.search(line)
if M:
    subject = M.group(1)
A Search-and-Replace Example
$text =~ s{
\b
# Capture the address to $1 ...
(
\w[-.\w]* # username
\@
[-\w]+(\.[-\w]+)*\.(com|edu|info) # hostname
)
\b
}{<a href="mailto:$1">$1</a>}gix;
Search and replace in Java
import java.util.regex.*; // Make regex classes easily available
Pattern r = Pattern.compile(
"\\b \n"+
"# Capture the address to $1 ... \n"+
"( \n"+
" \\w[-.\\w]* # username \n"+
" @ \n"+
" [-\\w]+(\\.[-\\w]+)*\\.(com|edu|info) # hostname \n"+
") \n"+
"\\b \n",
Pattern.CASE_INSENSITIVE|Pattern.COMMENTS);
Matcher m = r.matcher(text);
text = m.replaceAll("<a href=\"mailto:$1\">$1</a>");
Note that each '\' wanted in a string's value requires '\\' in the string literal, so if you're providing regular expressions via string literals as we are here, \w requires '\\w'. For debugging, System.out.println(r.pattern()) can be useful to display the regular expression as the regex function actually received it. One reason that I include newlines in the regex is so that it displays nicely when printed this way. Another reason is that each '#' introduces a comment that goes until the next newline; so, at least some of the newlines are required to restrain the comments.

Perl uses notations like /g, /i, and /x to signify special conditions (these are the modifiers for replace all, case-insensitivity, and free formatting modes), but java.util.regex uses either different functions (replaceAll versus replaceFirst) or flag arguments passed to the function (e.g., Pattern.CASE_INSENSITIVE and Pattern.COMMENTS).

Search and Replace in Other Languages
Awk uses an integrated approach, /regex/, to perform a match on the current input line, and uses "var ~ /regex/" to perform a match on other data.
The sub function is used for substitution: sub(/mizpel/, "misspell") replaces the first match in the current line.
To replace all matches within the line, awk does not use any kind of /g modifier, but a different operator altogether: gsub(/mizpel/, "misspell").
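For comparison, Python folds both behaviors into one function: re.sub replaces every match by default, and its count argument limits the number of replacements. A minimal sketch with a made-up input line:

```python
import re

line = "mizpel once, mizpel twice"

# Like awk's sub(): only the first match is replaced.
first_only = re.sub(r"mizpel", "misspell", line, count=1)

# Like awk's gsub(): every match is replaced.
every_one = re.sub(r"mizpel", "misspell", line)

print(first_only)
print(every_one)
```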
Strings, Character Encodings, and Modes
Strings as Regular Expressions
Strings in Java
Java string literals are like those presented in the introduction, in that they are delimited by double quotes, and backslash is a metacharacter. Common combinations such as '\t' (tab), '\n' (newline), '\\' (literal backslash), etc. are supported. Using a backslash in a sequence not explicitly supported by literal strings results in an error.
Strings in Python
Python uses either single quotes or double quotes to create strings. Python also offers "triple-quoted" strings of the form '''...''' and """...""", which are different in that they may contain unescaped newlines. All four types offer the common backslash sequences such as \n, but have the same twist that PHP has in that unrecognized sequences are left in the string verbatim. Contrast this with Java and C# strings, for which unrecognized sequences cause an error.

Like PHP and C#, Python offers a more literal type of string, its "raw string." Similar to C#'s @"" notation, Python uses an 'r' before the opening quote of any of the four quote types. For example, r"\t\x2A" yields \t\x2A. Unlike the other languages, though, with Python's raw strings, all backslashes are kept in the string, including those that escape a double quote (so that the double quote can be included within the string): r"he said \"hi\"\." results in he said \"hi\"\.. This isn't really a problem when using strings for regular expressions, since Python's regex flavor treats \" as ", but if you like, you can bypass the issue by using one of the other types of raw quoting: r'he said "hi"\.'
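A short Python sketch of the raw-string behavior described above; the variable names are chosen only for illustration.

```python
import re

raw = r"\t\x2A"      # six characters: backslash, t, backslash, x, 2, A
cooked = "\t\x2A"    # two characters: a tab and an asterisk

# In a raw string the backslashes that escape the quotes stay in the value.
quoted = r"he said \"hi\"\."

# As a regex this is harmless: \" simply matches a double quote.
match = re.fullmatch(quoted, 'he said "hi".')

print(len(raw), len(cooked))
print(quoted)
print(match is not None)
```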

Regex literals in Perl
$str =~ m/(\w+)/; can also be written as: $regex = '(\w+)'; $str =~ $regex;
or perhaps: $regex = "(\\w+)"; $str =~ $regex;
When a regex is provided as a literal, Perl provides extra features that the regular-expression engine itself does not, including:
Interpolation of variables.
Support for a literal-text mode via \Q...\E.
Optional support for a \N{name} construct, which allows you to specify characters via their official Unicode names. For example, you can match '¡Hola!' with \N{INVERTED EXCLAMATION MARK}Hola!.

In Perl, a regex literal is parsed like a very special kind of string. In fact, these features are also available with Perl double-quoted strings. The point to be aware of is that these features are not provided by the regular-expression engine. Since the vast majority of regular expressions used within Perl are as regex literals, most think that \Q\E is part of Perl's regex language, but if you ever use regular expressions read from a configuration file (or from the command line, etc.), it's important to know exactly what features are provided by which aspect of the language.

Richness of encoding-related support
Sometimes things are not as simple as they might seem. For example, the \b of Sun's java.util.regex package properly understands all the word-related characters of Unicode, but its \w does not (it understands only basic ASCII).

Regular expressions for programs that work with Unicode often support a \unum metasequence that can be used to match a specific Unicode character.
It's important to realize that \uC0B5 is saying "match the Unicode character U+C0B5," and says nothing about what actual bytes are to be compared, which is dependent on the particular encoding used internally to represent Unicode code points. If the program happens to use UTF-8 internally, that character happens to be represented with three bytes. But you, as someone using the Unicode-enabled program, don't normally need to care.
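The distinction between a code point and its bytes can be checked with a small Python sketch. Python's re module accepts \uXXXX escapes in string patterns; the variable names are illustrative.

```python
import re

ch = "\uC0B5"

# The regex escape names the character, not any byte sequence.
matched = re.fullmatch(r"\uC0B5", ch) is not None

# How many bytes that one character occupies depends on the encoding.
utf8_len = len(ch.encode("utf-8"))
utf16_len = len(ch.encode("utf-16-le"))

print(matched, utf8_len, utf16_len)
```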

Characters versus combining-character sequences
Perl and PCRE (and by extension, PHP's preg suite) support the \X metasequence, which fulfills what many might expect from dot ("match one character") in that it matches a base character followed by any number of combining characters.
Regex Modes and Match Modes
Case-insensitive match mode
Common Metacharacters and Features
Character Representations
  1. Character Shorthands: \n, \t, \a, \b, \e, \f, \r, \v, ...
  2. Octal Escapes: \num
  3. Hex/Unicode Escapes: \xnum, \x{num}, \unum, \Unum, ...
  4. Control Characters: \cchar
Character Classes and Class-Like Constructs
  1. Normal classes: [a-z] and [^a-z]
  2. Almost any character: dot
  3. Exactly one byte: \C
  4. Unicode Combining Character Sequence: \X
  5. Class shorthands: \w, \d, \s, \W, \D, \S
  6. Unicode properties, blocks, and categories: \p{Prop}, \P{Prop}
  7. Class set operations: [[a-z]&&[^aeiou]]
  8. POSIX bracket-expression "character class": [[:alpha:]]
  9. POSIX bracket-expression "collating sequences": [[.span-ll.]]
  10. POSIX bracket-expression "character equivalents": [[=n=]]
  11. Emacs syntax classes
Anchors and Other "Zero-Width Assertions"
  1. Start of line/string: ^, \A
  2. End of line/string: $, \Z, \z
  3. Start of match (or end of previous match): \G
  4. Word boundaries: \b, \B, \<, \>, ...
  5. Lookahead (?=), (?!); Lookbehind, (?<=), (?<!)
Comments and Mode Modifiers
  1. Mode modifier: (?modifier), such as (?i) or (?-i)
  2. Mode-modified span: (?modifier:), such as (?i:)
  3. Comments: (?#) and #
  4. Literal-text span: \Q\E
Grouping, Capturing, Conditionals, and Control
  1. Capturing/grouping parentheses: (), \1, \2, ...
  2. Grouping-only parentheses: (?:)
  3. Named capture: (?<Name>)
  4. Atomic grouping: (?>)
  5. Alternation: ...|...|...
  6. Conditional: (?if then|else)
  7. Greedy quantifiers: *, +, ?, {num,num}
  8. Lazy quantifiers: *?, +?, ??, {num,num}?
  9. Possessive quantifiers: *+, ++, ?+, {num,num}+
Character Representations
This group of metacharacters provides visually pleasing ways to match specific characters that are otherwise difficult to represent.
Character shorthands
Many utilities provide metacharacters to represent certain control characters that are sometimes machine-dependent, and which would otherwise be difficult to input or to visualize:
\a Alert (e.g., to sound the bell when "printed") Usually maps to the ASCII <BEL> character, 007 octal.
\b Backspace Usually maps to the ASCII <BS> character, 010 octal. (With many flavors, \b is a shorthand only within a character class, a word-boundary metacharacter outside)
\e Escape character Usually maps to the ASCII <ESC> character, 033 octal.
\f Form feed Usually maps to the ASCII <FF> character, 014 octal.
\n Newline On most platforms (including Unix and DOS/Windows), usually maps to the ASCII <LF> character, 012 octal. On MacOS systems, usually maps to the ASCII <CR> character, 015 octal. With Java or any .NET language, always the ASCII <LF> character regardless of platform.
\r Carriage return Usually maps to the ASCII <CR> character. On MacOS systems, usually maps to the ASCII <LF> character. With Java or any .NET language, always the ASCII <CR> character regardless of platform.
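\t Normal horizontal tab Maps to the ASCII <HT> character, 011 octal.
\v Vertical tab Usually maps to the ASCII <VT> character, 013 octal.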
Octal escape \num
Implementations supporting octal (base 8) escapes generally allow two- and three-digit octal escapes to be used to indicate a byte or character with a particular value. For example, \015\012 matches an ASCII CR/LF sequence. Octal escapes can be convenient for inserting hard-to-type characters into an expression. In Perl, for instance, you can use \e for the ASCII escape character, but you can't in awk.

Since awk does support octal escapes, you can use the ASCII code for the escape character directly: \033.
Some implementations, as a special case, allow \0 to match a NUL byte. Some allow all one-digit octal escapes, but usually don't if backreferences such as \1 are supported. When there's a conflict, backreferences generally take precedence over octal escapes. Some allow four-digit octal escapes, usually to support a requirement that any octal escape begin with a zero (such as with java.util.regex).

You might wonder what happens with out-of-range values like \565 (8-bit octal values range only from \000 through \377). It seems that half the implementations leave it as a larger-than-byte value (which may match a Unicode character if Unicode is supported), while the other half strip it to a byte. In general, it's best to limit octal escapes to \377 and below.

Hex and Unicode escapes: \xnum, \x{num}, \unum, \Unum, ...
Similar to octal escapes, many utilities allow a hexadecimal (base 16) value to be entered using \x, \u, or sometimes \U. If allowed with \x, for example, \x0D\x0A matches the CR/LF sequence.
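As a quick check of the octal and hex forms, here is a Python sketch. Python's re treats an escape that begins with 0, or that has three octal digits, as an octal character rather than a backreference; the variable names are illustrative.

```python
import re

crlf = "\r\n"

octal_ok = re.fullmatch(r"\015\012", crlf) is not None   # octal CR, LF
hex_ok = re.fullmatch(r"\x0D\x0A", crlf) is not None     # hex CR, LF
esc_ok = re.fullmatch(r"\033", "\x1b") is not None       # octal 033 is ESC

print(octal_ok, hex_ok, esc_ok)
```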

Control characters: \cchar
Many flavors offer the \cchar sequence to match control characters with encoding values less than 32 (some allow a wider range). For example, \cH matches a Control-H, which represents a backspace in ASCII, while \cJ matches an ASCII linefeed (which is often also matched by \n, but sometimes by \r, depending on the platform).

Details aren't uniform among systems that offer this construct. You'll always be safe using uppercase English letters as in the examples. With most implementations, you can use lowercase letters as well, but Sun's Java regex package, for example, does not support them. And what exactly happens with non-alphabetics is very flavor-dependent, so I recommend using only uppercase letters with \c.

Related Note: GNU Emacs supports this functionality, but with the rather ungainly metasequence ?\^char (e.g., ?\^H to match an ASCII backspace).
Table 3-7. A Few Utilities and the Octal and Hex Regex Escapes Their Regexes Support

Character Classes and Class-Like Constructs
Normal classes: [a-z] and [^a-z]
* is never a metacharacter within a class, while - usually is. Some metasequences, such as \b, sometimes have a different meaning within a class than outside of one.

Almost any character: dot
In some tools, dot is a shorthand for a character class that can match any character, while in most others, it is a shorthand to match any character except a newline. It's a subtle difference that is important when working with tools that allow target text to contain multiple logical lines (or to span logical lines, such as in a text editor). Concerns about dot include:
In some Unicode-enabled systems, such as Sun's Java regex package, dot normally does not match a Unicode line terminator.

A match mode can change the meaning of what dot matches.

The POSIX standard dictates that dot not match a NUL (a character with the value zero), although all the major scripting languages allow NULs in their text (and dot matches them).

Dot versus a negated character class
When working with tools that allow multiline text to be searched, take care to note that dot usually does not match a newline, while a negated class like [^"] usually does. This could yield surprises when changing from something such as ".*" to "[^"]*". The matching qualities of dot can often be changed by a match mode.
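The surprise is easy to reproduce in Python, where re.S is the dot-matches-all mode. The sample text and variable names are made up for illustration.

```python
import re

text = 'say "one\ntwo" now'

# Dot stops at the newline, so the quoted string is never completed.
dot_default = re.search(r'".*"', text)

# The negated class happily crosses the newline.
negated = re.search(r'"[^"]*"', text)

# With dot-matches-all mode, dot crosses it too.
dot_all = re.search(r'".*"', text, re.S)

print(dot_default)
print(repr(negated.group()))
print(repr(dot_all.group()))
```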

Exactly one byte
Perl and PCRE (and hence PHP) support \C, which matches one byte, even if that byte is one of several that might encode a single character (on the other hand, everything else works on a per-character basis). This is dangerous; its misuse can cause internal errors, so it shouldn't be used unless you really know what you're doing.

Unicode combining character sequence: \X
Perl and PHP support \X as a shorthand for \P{M}\p{M}*, which is like an extended . (dot). It matches a base character (anything not \p{M}), possibly followed by any number of combining characters (anything that is \p{M}).

Unicode uses a system of base and combining characters which, in combination, create what look like single, accented characters like à ('a' U+0061 combined with the grave accent '`' U+0300). You can use more than one combining character if that's what you need to create the final result. For example, if for some reason you need 'ç̆', that would be 'c' followed by a combining cedilla and a combining breve (U+0063 followed by U+0327 and U+0306).

Besides the fact that \X matches trailing combining characters, there are two differences between it and dot. One is that \X always matches a newline and other Unicode line terminators, while dot is subject to dot-matches-all match-mode, and perhaps other match modes depending on the tool. Another difference is that a dot-matches-all dot is guaranteed to match all characters at all times, while \X doesn't match a leading combining character.
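Python's built-in re module does not offer \X, so the following is only a rough emulation of \P{M}\p{M}* using the standard unicodedata module. The function name combining_sequences is an assumption, and unlike \X, this sketch lets a leading combining character start a cluster of its own.

```python
import unicodedata

def combining_sequences(text):
    """Split text into base characters, each with its trailing combining marks."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc, and Me correspond to \p{M}.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

a_grave = "a\u0300"                 # 'a' plus combining grave accent
print(len(a_grave))                 # two code points that display as one character
print(combining_sequences("a\u0300bc\u0327\u0306"))
```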

Class shorthands: \w, \d, \s, \W, \D, \S
\d Digit Generally the same as [0-9] or, in some Unicode-enabled tools, all Unicode digits.
\D Non-digit Generally the same as [^\d]
\w Part-of-word character Often the same as [a-zA-Z0-9_]. Some tools omit the underscore, while others include all alphanumerics in the current locale. If Unicode is supported, \w usually refers to all alphanumerics; notable exceptions include java.util.regex and PCRE (and by extension, PHP), whose \w are exactly [a-zA-Z0-9_].
\W Non-word character Generally the same as [^\w].
\s Whitespace character On ASCII-only systems, this is often the same as [ \f\n\r\t\v]. Unicode-enabled systems sometimes also include the Unicode "next line" control character U+0085, and sometimes the "white space" property \p{Z} (described in the next section).
\S Non-whitespace character Generally the same as [^\s].
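How much the shorthands cover depends on the tool. As one data point, Python's re matches Unicode digits and letters with \d and \w in string patterns unless re.ASCII is given. The variable names below are illustrative.

```python
import re

arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE
n_tilde = "\u00f1"        # LATIN SMALL LETTER N WITH TILDE

digit_unicode = re.fullmatch(r"\d", arabic_three) is not None
digit_ascii = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
word_unicode = re.fullmatch(r"\w", n_tilde) is not None
word_ascii = re.fullmatch(r"\w", n_tilde, re.ASCII) is not None

print(digit_unicode, digit_ascii, word_unicode, word_ascii)
```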
Simple class subtraction: [[a-z]-[aeiou]]
.NET offers a simple class "subtraction" nomenclature, which allows you to remove from what a class can match those characters matchable by another class. For example, the characters matched by [[a-z]-[aeiou]] are those matched by [a-z] minus those matched by [aeiou], i.e., the lowercase ASCII letters that are not vowels.

As another example, [\p{P}-[\p{Ps}\p{Pe}]] is a class that matches characters in \p{P} except those matchable by [\p{Ps}\p{Pe}], which is to say that it matches all punctuation except opening and closing punctuation such as ) and (.

Full class set operations: [[a-z] && [^aeiou]]
Sun's Java regex package supports a full range of set operations (union, subtraction, intersection) within character classes.
OR allows you to add characters to the class by including what looks like an embedded class within the class.
AND does a conceptual "bitwise AND" of two sets, keeping only those characters found in both sets. It is achieved by inserting the special class metasequence && between two sets of characters.

AND is less confusing in that [\p{InThai}&&\P{Cn}] is normally read as "match only characters matchable by \p{InThai} and \P{Cn}," although it is sometimes read as "the list of allowed characters is the intersection of \p{InThai} and \P{Cn}."

POSIX bracket-expression "character class": [[:alpha:]]
A POSIX character class is one of several special metasequences for use within a POSIX bracket expression. An example is [:lower:], which represents any lowercase letter within the current locale. For English text, [:lower:] is comparable to a-z. Since this entire sequence is valid only within a bracket expression, the full class comparable to [a-z] is [[:lower:]]. Yes, it's that ugly. But, it has the advantage over [a-z] of including other characters, such as ö, ñ, and the like if the locale actually indicates that they are "lowercase letters."
The exact list of POSIX character classes is locale dependent, but the following are usually supported:
[:alnum:] alphabetic characters and numeric characters
[:alpha:] alphabetic characters
[:blank:] space and tab
[:cntrl:] control characters
[:digit:] digits
[:graph:] non-blank characters (not spaces, control characters, or the like)
[:lower:] lowercase alphabetics
[:print:] like [:graph:], but includes the space character
[:punct:] punctuation characters
[:space:] all whitespace characters ([:blank:], newline, carriage return, and the like)
[:upper:] uppercase alphabetics
[:xdigit:] digits allowed in a hexadecimal number (i.e., 0-9a-fA-F).
POSIX bracket-expression "collating sequences" : [[.span-ll.]]
A locale can have collating sequences to describe how certain characters or sets of characters should be ordered. For example, in Spanish, the two characters ll (as in tortilla) traditionally sort as if they were one logical character between l and m, and the German ß is a character that falls between s and t, but sorts as if it were the two characters ss. These rules might be manifested in collating sequences named, for example, span-ll and eszet.

A collating sequence that maps multiple physical characters to a single logical character, such as the span-ll example, is considered "one character" to a fully compliant POSIX regex engine. This means that [^abc] matches a 'll' sequence.

A collating sequence element is included within a bracket expression using a [.name.] notation: torti[[.span-ll.]]a matches tortilla. A collating sequence allows you to match against those characters that are made up of combinations of other characters. It also creates a situation where a bracket expression can match more than one physical character.
POSIX bracket-expression "character equivalents" : [[=n=]]
Some locales define character equivalents to indicate that certain characters should be considered identical for sorting and such. For example, a locale might define an equivalence class 'n' as containing n and ñ, or perhaps one named 'a' as containing a, à, and á. Using a notation similar to [:...:], but with '=' instead of a colon, you can reference these equivalence classes within a bracket expression: [[=n=][=a=]] matches any of the characters just mentioned.

If a character equivalence with a single-letter name is used but not defined in the locale, it defaults to the collating sequence of the same name. Locales normally include normal characters as collating sequences [.a.], [.b.], [.c.], and so on, so in the absence of special equivalents, [[=n=][=a=]] defaults to [na].
Emacs syntax classes
GNU Emacs doesn't support the traditional \w, \s, etc.; rather, it uses special sequences to reference "syntax classes":
\schar: matches characters in the Emacs syntax class as described by char
\Schar: matches characters not in the Emacs syntax class
\sw matches a "word constituent" character, and \s- matches a "whitespace character." These would be written as \w and \s in many other systems.
Emacs is special because the choice of which characters fall into these classes can be modified on the fly, so, for example, the concept of which characters are word constituents can be changed depending upon the kind of text being edited.
Anchors and Other "Zero-Width Assertions"
Anchors and other "zero-width assertions" don't match actual text, but rather positions in the text.
Start of line/string: ^, \A
Caret ^ matches at the beginning of the text being searched, and, if in an enhanced line-anchor match mode, after any newline. In some systems, an enhanced-mode ^ can match after Unicode line terminators as well.

When supported, \A always matches only at the start of the text being searched, regardless of any match mode.
End of line/string: $, \Z, \z
$ has a variety of meanings among different tools, but the most common meaning is that it matches at the end of the target string, and before a string-ending newline, as well. The latter is common, to allow an expression like s$ (ostensibly, to match "a line ending with s") to match '...s' followed by a newline, that is, a line ending with s that's capped with an ending newline.

Two other common meanings for $ are to match only at the end of the target text, and to match before any newline. In some Unicode systems, the special meaning of newline in these rules is replaced by Unicode line terminators (Java, for example, offers particularly complex semantics for $ with respect to Unicode line terminators).
A match mode can change the meaning of $ to match before any embedded newline (or Unicode line terminator as well).
When supported, \Z usually matches what the "unmoded" $ matches, which often means to match at the end of the string, or before a string-ending newline. To complement these, \z matches only at the end of the string, period, without regard to any newline.
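These meanings can be tried out in Python, with one caveat worth labeling: in Python's re module, \Z matches only at the very end of the string, so it behaves like the \z described above rather than like the common \Z. The sample strings and variable names are illustrative.

```python
import re

text = "cats\n"

# Unmoded $: end of string, or just before a string-ending newline.
dollar = re.search(r"s$", text) is not None

# Python's \Z: only the absolute end of the string.
strict_end = re.search(r"s\Z", text) is not None

two_lines = "cats\ndogs\n"
default_hits = re.findall(r"s$", two_lines)         # only before the final newline
multiline_hits = re.findall(r"s$", two_lines, re.M) # before every newline

print(dollar, strict_end, default_hits, multiline_hits)
```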

Start of match (or end of previous match): \G
Word boundaries: \b, \B (not word boundary), \<, \>, ...
Lookahead: (?=⋯), (?!⋯); Lookbehind: (?<=⋯), (?<!⋯)
The most restrictive rule exists in Perl and Python, where the lookbehind can match only fixed-length strings. For example, (?<!;\w) and (?<!this|that) are allowed, but (?<!books?) and (?<!^\w+:) are not, as they can match a variable amount of text.
The next level of support allows alternatives of different lengths within the lookbehind, so (?<!books?) can be written as (?<!book|books). PCRE (and as such the preg suite in PHP) allows this.

The next level allows for regular expressions that match a variable amount of text, but only if it's of a finite length. This allows (?<!books?) directly, but still disallows (?<!^\w+:) since the \w+ is open-ended. Sun's Java regex package supports this level.

The fourth level, however, allows the subexpression within lookbehind to match any amount of text, including the (?<!^\w+:) example. This level, supported by Microsoft's .NET languages, is truly superior to the others, but does carry a potentially huge efficiency penalty if used unwisely. (When faced with lookbehind that can match any amount of text, the engine is forced to check the lookbehind subexpression from the start of the string, which may mean a lot of wasted effort when requested from near the end of a long string.)
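The most restrictive level can be observed directly in Python's re module, which refuses to compile a lookbehind whose width is not fixed. The sketch below uses a hypothetical helper, compiles, and also shows one common workaround that is an assumption of this example rather than something from the text above: chaining two fixed-length lookbehinds in place of (?<!books?).

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

equal_length_ok = compiles(r"(?<!this|that)")    # both alternatives are 4 characters
optional_char_ok = compiles(r"(?<!books?)")      # 4 or 5 characters: rejected
mixed_lengths_ok = compiles(r"(?<!book|books)")  # also rejected by Python's re

# Workaround: two fixed-length lookbehinds in a row act like (?<!books?).
not_after_book = re.compile(r"\w+(?<!book)(?<!books)\.")
hits = not_after_book.findall("I like books. I like cats. One book.")

print(equal_length_ok, optional_char_ok, mixed_lengths_ok)  # True False False
print(hits)                                                 # ['cats.']
```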

Comments and Mode Modifiers
Mode modifier: (?modifier), such as (?i) or (?-i)
Many flavors now allow some of the regex and match modes to be set within the regular expression itself. A common example is the special notation (?i), which turns on case-insensitive matching, and (?-i), which turns it off. For example, <B>(?i)very(?-i)</B> has the very part match with case insensitivity, while still keeping the tag names case-sensitive.
This example works with most systems that support (?i), including Perl, PHP, java.util.regex, Ruby, and the .NET languages. It doesn't work with Python or Tcl, neither of which supports (?-i).

With most implementations except Python, the effects of (?i) within any type of parentheses are limited by the parentheses (that is, turn off at the closing parentheses). So, the (?-i) can be eliminated by wrapping the case-insensitive part in parentheses and putting (?i) as the first thing inside: <B>(?:(?i)very)</B>.
Common Mode Modifiers
Letter  Mode
i       case-insensitivity match mode
x       free-spacing and comments regex mode
s       dot-matches-all match mode
m       enhanced line-anchor match mode
Mode-modified span: (?modifier:⋯), such as (?i:⋯)
Using a syntax like (?i:⋯), a mode-modified span turns on the mode only for what's matched within the parentheses. The <B>(?:(?i)very)</B> example is simplified to <B>(?i:very)</B>. When supported, this form generally works for all mode-modifier letters the system supports. Tcl is an example that supports the (?i) form, but not the mode-modified span (?i:⋯) form. Python was long in the same position, but its re module has accepted the span form since version 3.6.
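A quick way to see the mode-modified span at work is the following sketch in Python. It assumes Python 3.6 or later, where the re module accepts (?i:⋯); the test strings are made up.

```python
import re

# Only "very" is matched case-insensitively; the <B> tags stay case-sensitive.
pattern = re.compile(r"<B>(?i:very)</B>")

upper_body = bool(pattern.search("<B>VERY</B>"))
mixed_body = bool(pattern.search("<B>Very</B>"))
lower_tags = bool(pattern.search("<b>very</b>"))

print(upper_body)  # True
print(mixed_body)  # True
print(lower_tags)  # False
```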
Comments: (?#⋯) and #⋯
Literal-text span: \Q⋯\E
The special sequence \Q⋯\E turns off all regex metacharacters between them, except for \E itself. (If the \E is omitted, they are turned off until the end of the regex.) It allows what would otherwise be taken as normal metacharacters to be treated as literal text. This is especially useful when including the contents of a variable while building a regular expression.
With the Perl code m/\Q$query\E/i, a $query of 'C:\WINDOWS\' becomes C\:\\WINDOWS\\, resulting in a search that finds the original 'C:\WINDOWS\' as the user expects.

This feature is less useful in systems with procedural and object-oriented handling, as they accept normal strings. While building the string to be used as a regular expression, it's fairly easy to call a function to make the value from the variable "safe" for use in a regular expression. PHP has the preg_quote function; Java has a quote method.
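Python follows the same function-call approach: its re module provides re.escape for making a string safe to embed in a pattern. The sketch below reuses the 'C:\WINDOWS\' value from the Perl example; the text being searched is made up.

```python
import re

query = "C:\\WINDOWS\\"  # the literal text C:\WINDOWS\

# re.escape backslash-protects every character that would otherwise be a metacharacter.
safe = re.escape(query)
pattern = re.compile(safe, re.IGNORECASE)

text = r"Files live under c:\windows\system32"
match = pattern.search(text)

print(match.group(0))  # c:\windows\
```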

The only regex engines that I know of that support \Q⋯\E are java.util.regex and PCRE. Perl supports \Q⋯\E within regex literals, but not within the contents of variables that might be interpolated into them.
Grouping, Capturing, Conditionals, and Control
Capturing/Grouping Parentheses: (⋯) and \1, \2, ...
Common, unadorned parentheses generally perform two functions, grouping and capturing.
One of the most common uses of parentheses is to pluck data from a string. The text matched by a parenthesized subexpression (also called "the text matched by the parentheses") is made available after the match in different ways by different programs, such as Perl's $1, $2, etc. (A common mistake is to try to use the \1 syntax outside the regular expression, which is allowed only with sed and vi.)
Grouping-only parentheses: (?:⋯)
A Few Utilities and Their Access to Captured Text
Program  Entire match        First set of parentheses
Perl     $&                  $1
PHP      $matches[0]         $matches[1]
Python   MatchObj.group(0)   MatchObj.group(1)
Ruby     $&                  $1
Java     MatcherObj.group()  MatcherObj.group(1)
vi       &                   \1
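The Python row of the table can be sketched as follows. The second half shows \1 used where it belongs, inside the regular expression itself, to find a doubled word. The sample strings are made up.

```python
import re

match = re.search(r"(\d+)-(\d+)", "pages 12-34 of the report")
entire = match.group(0)  # entire match
first = match.group(1)   # text matched by the first set of parentheses

print(entire)  # 12-34
print(first)   # 12

# Within the pattern, \1 refers back to whatever the first group matched.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))  # is
```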
Named capture: (?<Name>⋯)
Python, PHP's preg engine, and .NET languages support captures to named locations. Python and PHP use the syntax (?P<name>⋯), while the .NET languages use (?<name>⋯).
for .NET: \b(?<Area>\d\d\d)-(?<Exch>\d\d\d)-(?<Num>\d\d\d\d)\b
for Python/PHP: \b(?P<Area>\d\d\d)-(?P<Exch>\d\d\d)-(?P<Num>\d\d\d\d)\b
A program can then refer to each matched substring through its name: RegexObj.Groups["Area"] in C#, MatchObj.group("Area") in Python, and $matches["Area"] in PHP.
Within the regular expression itself, the captured text is available via \k<Area> with .NET, and (?P=Area) in Python and PHP.
With .NET (but not with PHP, nor with Python's re module, which rejects a redefined group name), you can use the same name more than once within the same expression. For example, to match the area code part of a US phone number, which looks like '(###)' or '###-', you might use (shown in .NET syntax): ⋯(?:\((?<Area>\d\d\d)\)|(?<Area>\d\d\d)-)⋯. When either set matches, the three-digit code is saved to the name Area.
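Here is a minimal sketch of named capture in Python, using the (?P<name>⋯) syntax and the (?P=name) backreference. The phone number and variable names are made up for the example.

```python
import re

phone = re.compile(r"\b(?P<Area>\d\d\d)-(?P<Exch>\d\d\d)-(?P<Num>\d\d\d\d)\b")
m = phone.search("Call 555-867-5309 today")

# Each captured substring is available by name on the match object.
area, exch, num = m.group("Area"), m.group("Exch"), m.group("Num")
print(area, exch, num)  # 555 867 5309

# (?P=Area) is a backreference by name inside the pattern itself.
repeated = re.compile(r"(?P<Area>\d\d\d)-(?P=Area)")
same_twice = bool(repeated.search("555-555"))
different = bool(repeated.search("555-556"))
print(same_twice, different)  # True False
```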
Atomic grouping: (?>⋯)
Atomic grouping, (?>⋯), is easy to explain once the important details of how the regex engine carries out its work are understood. Here, I'll just say that once the parenthesized subexpression matches, what it matches is fixed (becomes atomic, unchangeable) for the rest of the match, unless it turns out that the whole set of atomic parentheses needs to be abandoned and subsequently revisited. A simple example helps to illustrate this indivisible, "atomic" nature of text matched by these parentheses.

The string '¡Hola!' is matched by ¡.*!, but is not matched if .* is wrapped with atomic grouping, ¡(?>.*)!. In either case, .* first internally matches as much as it can ('Hola!'), but the inability of the subsequent ! to match wants to force the .* to give up some of what it had matched (the final '!'). That can't happen in the second case because .* is inside atomic grouping, which never "gives up" anything once the matching leaves it.

Although this example doesn't hint at it, atomic grouping has important uses. In particular, it can help make matching more efficient, and can be used to finely control what can and can't be matched.
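The '¡Hola!' example can be reproduced in Python. Because Python's re module only gained (?>⋯) in version 3.11, this sketch does not use it; instead it emulates the atomic behavior with a lookahead plus a backreference, (?=(.*))\1, which is a swapped-in technique and an assumption of this example. Lookahead never gives back what it matched, so \1 is stuck with the full 'Hola!'.

```python
import re

text = "¡Hola!"

# Ordinary greedy .* backs off one character so the final ! can match.
normal = re.search(r"¡.*!", text)

# Emulated atomic group: the lookahead captures 'Hola!' and cannot be re-entered,
# so \1 must consume all of it and nothing is left for the final !.
atomic_like = re.search(r"¡(?=(.*))\1!", text)

print(normal.group(0))  # ¡Hola!
print(atomic_like)      # None
```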

Alternation: ⋯|⋯|⋯
Alternation has very low precedence, so this and|or that matches the same as (this and)|(or that), and not this (and|or) that, even though visually, the and|or looks like a unit.

Most flavors allow an empty alternative, as in (this|that|). The empty subexpression can always match, so this example is comparable to (this|that)?.
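The precedence rule is easy to confirm with a short Python sketch. The names loose, grouped, and optional are made up for this example, and fullmatch is used so that each pattern must account for the whole string.

```python
import re

loose = re.compile(r"this and|or that")         # same as (this and)|(or that)
grouped = re.compile(r"this (?:and|or) that")   # parentheses limit the alternation

loose_first = bool(loose.fullmatch("this and"))
loose_whole = bool(loose.fullmatch("this or that"))
grouped_whole = bool(grouped.fullmatch("this or that"))
print(loose_first, loose_whole, grouped_whole)  # True False True

# An empty alternative can always match, so the empty string is accepted too.
optional = re.compile(r"(this|that|)")
empty_ok = optional.fullmatch("") is not None
print(empty_ok)  # True
```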
Conditional: (?if then|else)
This construct allows you to express an if/then/else within a regex. The if part is a special kind of conditional expression, described below. Both the then and else parts are normal regex subexpressions. If the if part tests true, the then expression is attempted. Otherwise, the else part is attempted. (The else part may be omitted, and if so, the '|' before it may be omitted as well.)

The kinds of if tests available are flavor-dependent, but most implementations allow at least special references to capturing subexpressions and lookaround.
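As a sketch of the "reference to a capturing subexpression" kind of test, Python's re module supports (?(1)then|else), where the condition is whether group 1 took part in the match. Python does not accept lookaround as the if part. The phone-number pattern below is made up for this example: an opening parenthesis before the area code requires a closing one, otherwise a hyphen is expected.

```python
import re

# Group 1 is the optional "(". If it matched, require ")"; otherwise require "-".
phone = re.compile(r"^(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}$")

with_parens = bool(phone.match("(555)123-4567"))
with_hyphen = bool(phone.match("555-123-4567"))
unclosed = bool(phone.match("(555-123-4567"))
unopened = bool(phone.match("555)123-4567"))

print(with_parens, with_hyphen, unclosed, unopened)  # True True False False
```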

Greedy quantifiers: *, +, ?, {num,num}
Intervals: {min,max} or \{min,max\}
Lazy quantifiers: *?, +?, ??, {num,num}?
Quantifiers are normally "greedy," and try to match as much as possible. Conversely, these non-greedy versions match as little as possible, just the bare minimum needed to satisfy the match.
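The difference between greedy and lazy quantifiers shows up clearly when matching tags. This is a small Python sketch with a made-up input string.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy .+ runs to the last > it can reach.
greedy = re.findall(r"<.+>", html)

# Lazy .+? stops at the first > that lets the match succeed.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```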
Possessive quantifiers: *+, ++, ?+, {num,num}+
Currently supported only by java.util.regex and PCRE (and hence PHP), possessive quantifiers are like normally greedy quantifiers, but once they match something, they never "give it up."
