This is the web page for Data Engineering at the University of Florida.
The goal of this assignment is to ensure that you are familiar with the concept and practice of regular expressions.
h/t to Library Carpentry
Regular expressions are a concept and an implementation used in many different programming environments for sophisticated pattern matching. They are an incredibly powerful tool that can amplify your capacity to find, manage, and transform data and files.
A regular expression, often abbreviated to regex, is a method of using a sequence of characters to define a search to match strings, i.e. “find and replace”-like operations. In computation, a ‘string’ is a contiguous sequence of symbols or values: for example, a word, a date, a set of numbers (e.g., a phone number), or an alphanumeric value (e.g., an identifier). A string could be any length, ranging from empty (zero characters) to one that spans many lines of text (including line break characters). The terms ‘string’ and ‘line’ are sometimes used interchangeably, even when they are not strictly the same thing.
In library searches, we are most familiar with a small part of regular expressions known as the “wild card character,” but there are many more features to the complete regular expressions syntax. Regular expressions will let you:
Regex can also be useful for daily work. For example, say your organization wants to change the way they display telephone numbers on their website by removing the parentheses around the area code. Rather than search for each specific phone number (that could take forever and be prone to error) or searching for every open parenthesis character (could also take forever and return many false-positives), you could search for the pattern of a phone number. Regular expressions rely on the use of literal characters and metacharacters. A metacharacter is any American Standard Code for Information Interchange (ASCII) character that has a special meaning. By using metacharacters and possibly literal characters, you can construct a regex for finding strings or files that match a pattern rather than a specific string.
Since regular expressions define some ASCII characters as “metacharacters” that have more than their literal meaning, it is also important to be able to “escape” these metacharacters to use them for their normal, literal meaning. For example, the period . means “match any character”, but if you want to match a literal period you need to put a \ in front of it to signal to the regular expression processor that you want a plain old period and not a metacharacter. That notation is called “escaping” the special character. The concept of “escaping” special characters is shared across a variety of computational settings, including markdown and Hypertext Markup Language (HTML).
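As a rough illustration of escaping, here is a minimal Python sketch of the phone-number scenario described earlier. The input layout "(352) 555-0100", the sample numbers, and the function name strip_area_code_parens are assumptions made for this example; the digit shorthand \d and the {3} repetition count are explained later in this guide.

```python
import re

# Assumed input layout: "(352) 555-0100". Real sites may use other layouts.
# \( and \) are escaped so they match literal parentheses; the unescaped
# ( ) pairs are capture groups around the three runs of digits.
PHONE = re.compile(r"\((\d{3})\) (\d{3})-(\d{4})")


def strip_area_code_parens(text):
    """Rewrite "(352) 555-0100" as "352-555-0100"; leave other text alone."""
    return PHONE.sub(r"\1-\2-\3", text)


sample = "Call (352) 555-0100 (weekdays) or (352) 555-0199."
print(strip_area_code_parens(sample))
# Call 352-555-0100 (weekdays) or 352-555-0199.
```

The parenthesised word "(weekdays)" is left alone because the pattern describes the shape of a phone number rather than just an opening parenthesis.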
[!NOTE] The syntax may vary slightly between different programming languages and tools, so it is best to consult the documentation for the specific language or tool you are using. The following is a general guide to the syntax of regular expressions.
A very simple use of a regular expression would be to locate the same word spelled two different ways. For example, the regular expression organi[sz]e matches both organise and organize. But because it locates all matches for the pattern in the file, not just for that word, it would also match reorganise, reorganize, organises, organizes, organised, organized, etc.
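The behaviour described above can be checked with a short Python sketch using the standard re module. The sample sentence is invented for the example.

```python
import re

text = "We organise events; they organize files. She reorganised it and he organizes more."

# findall returns the matched substring itself, so a hit inside a longer
# word such as "reorganised" is reported as just "organise".
matches = re.findall(r"organi[sz]e", text)
print(matches)
# ['organise', 'organize', 'organise', 'organize']

# Listing the whitespace-separated tokens that contain a match shows
# which longer words were caught.
words = [w for w in text.split() if re.search(r"organi[sz]e", w)]
print(words)
# ['organise', 'organize', 'reorganised', 'organizes']
```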
Learning common regex metacharacters
Square brackets can be used to define a list or range of characters to be found. So:
[ABC] matches A or B or C.
[A-Z] matches any upper case letter.
[A-Za-z] matches any upper or lower case letter.
[A-Za-z0-9] matches any upper or lower case letter or any digit.
Then there are:
. matches any character (in most tools, any character except a line break).
\d matches any single digit.
\w matches any word character (equivalent to [A-Za-z0-9_]; note that the underscore is included).
\s matches any space, tab, or newline.
\ is used to escape the following character when that character is a special character. So, for example, a regular expression that found .com would be \.com because . is a special character that matches any character.
^ is an “anchor” which asserts the position at the start of the line. So what you put after the caret will only match if they are the first characters of a line. The caret is also known as a circumflex.
$ is an “anchor” which asserts the position at the end of the line. So what you put before it will only match if they are the last characters of a line.
\b asserts that the pattern must match at a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words. So:
mark will match not only mark but also find marking, market, unremarkable, and so on.
\bword will match word, wordless, and wordlessly.
comb\b will match comb and honeycomb but not combine.
\brespect\b will match respect but not respectable or disrespectful.
So, what is ^[Oo]rgani.e\b going to match?
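Answer:
organise
Organize
organife
Organi2e
Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, and ends with the letter e. Because \b requires a word boundary immediately after the e, longer words such as Organized or organifer do not match.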
See solution visualised on Regexper.com
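The anchors, word boundaries, and escaping described above can be tried out with a small Python sketch. The helper name matching and the extra sample words (sword, telecom, example.com, reorganise) are assumptions added for the illustration.

```python
import re


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]


# No boundary: the pattern is found inside longer words too.
print(matching(r"mark", ["mark", "marking", "market", "unremarkable"]))
# ['mark', 'marking', 'market', 'unremarkable']

# \b at the start: the match must begin at the start of a word.
print(matching(r"\bword", ["word", "wordless", "wordlessly", "sword"]))
# ['word', 'wordless', 'wordlessly']

# \b at the end: the match must finish at the end of a word.
print(matching(r"comb\b", ["comb", "honeycomb", "combine"]))
# ['comb', 'honeycomb']

# \b on both sides: whole word only.
print(matching(r"\brespect\b", ["respect", "respectable", "disrespectful"]))
# ['respect']

# ^ anchors the match to the start of the string (or line).
print(matching(r"^[Oo]rgani.e\b", ["organise", "Organize", "reorganise"]))
# ['organise', 'Organize']

# An unescaped . matches any character; \. matches only a literal period.
print(matching(r".com", ["example.com", "telecom"]))
# ['example.com', 'telecom']
print(matching(r"\.com", ["example.com", "telecom"]))
# ['example.com']
```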
Other useful special characters are:
* matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.
+ matches the preceding element one or more times. For example, ab+c matches “abc”, “abbbc” but not “ac”.
? matches when the preceding character appears zero or one time.
{VALUE} matches the preceding character the number of times defined by VALUE; ranges, say, 1-6, can be specified with the syntax {VALUE,VALUE}, e.g. \d{1,9} will match any number between one and nine digits in length.
| means or.
/i renders an expression case-insensitive (so that, for example, [a-z] behaves like [A-Za-z]).
So, what are these going to match?
What will ^[Oo]rgani.e\w* match?
organise
Organize
organifer
Organi2ed111
Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or more characters from the range [A-Za-z0-9_].
What will [Oo]rgani.e\w+$ match?
organiser
Organized
organifer
Organi2ed111
Or, any other string that ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and one or more characters from the range [A-Za-z0-9_].
What will ^[Oo]rgani.e\w?\b match?
organise
Organized
organifer
Organi2ek
Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with zero or one characters from the range [A-Za-z0-9_].
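To compare the three quantifier exercises side by side, the following Python sketch runs each pattern over one shared word list. The list (including the extra word organising) and the helper name matching are assumptions for the illustration. Python has no /i suffix; the re.IGNORECASE flag plays that role.

```python
import re

candidates = ["organise", "Organized", "organifer", "Organi2ed111", "organising"]


def matching(pattern, flags=0):
    """Return the candidate words in which the pattern is found."""
    return [c for c in candidates if re.search(pattern, c, flags)]


star = matching(r"^[Oo]rgani.e\w*")       # zero or more word characters after e
plus = matching(r"[Oo]rgani.e\w+$")       # at least one word character, then end of line
optional = matching(r"^[Oo]rgani.e\w?\b") # at most one word character, then a boundary

print(star)      # ['organise', 'Organized', 'organifer', 'Organi2ed111']
print(plus)      # ['Organized', 'organifer', 'Organi2ed111']
print(optional)  # ['organise', 'Organized', 'organifer']

# Case-insensitive matching replaces the explicit [Oo] class.
print(matching(r"^organi.e\w*", re.IGNORECASE) == star)  # True
```

organising drops out of every list because its eighth character is not e, and organise drops out of the + list because nothing follows its final e.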
How would you match a date in the format dd-MM-yyyy?
\b\d{2}-\d{2}-\d{4}\b
Depending on your data, you may choose to remove the word boundaries.
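A small Python sketch of the date pattern, with and without the word boundaries. The sample sentence and its reference number are invented for the example.

```python
import re

text = "Issued 31-01-1987, renewed 05-12-2003; ref 123-45-67890 is not a date."

# With \b on both sides, the digits must form a complete dd-MM-yyyy token.
bounded = re.findall(r"\b\d{2}-\d{2}-\d{4}\b", text)
print(bounded)
# ['31-01-1987', '05-12-2003']

# Without the boundaries, part of the longer reference number also matches.
unbounded = re.findall(r"\d{2}-\d{2}-\d{4}", text)
print(unbounded)
# ['31-01-1987', '05-12-2003', '23-45-6789']
```

Note that the pattern only checks the shape of the string; it does not validate that the day and month are real calendar values.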
How would you match a date in the format dd-MM-yyyy or dd-MM-yy at the end of a line only?
\d{2}-\d{2}-\d{2,4}$
Note this will also find strings such as 31-01-198 at the end of a line, so you may wish to check your data and revise the expression to exclude false positives. Depending on your data, you may choose to add a word boundary at the start of the expression.
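The false positive mentioned above can be reproduced in Python, along with one possible revision. The revised pattern is only a sketch of one way to tighten the expression using the | operator from this guide; the sample lines are invented.

```python
import re

lines = [
    "Due 31-01-1987",
    "Due 31-01-87",
    "Due 31-01-198",
    "31-01-1987 was the due date",
]

# Original expression: {2,4} accepts two, three, or four digits for the year.
loose = [s for s in lines if re.search(r"\d{2}-\d{2}-\d{2,4}$", s)]
print(loose)
# ['Due 31-01-1987', 'Due 31-01-87', 'Due 31-01-198']

# One possible revision: spell out the two allowed year lengths with |
# and add a word boundary at the start of each alternative.
strict_pattern = r"\b\d{2}-\d{2}-\d{2}$|\b\d{2}-\d{2}-\d{4}$"
strict = [s for s in lines if re.search(strict_pattern, s)]
print(strict)
# ['Due 31-01-1987', 'Due 31-01-87']
```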
What will \b[Oo]rgani.e\w{2}\b match?
organisers
Organizers
organifers
Organi2ek1
Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with two characters from the range [A-Za-z0-9_].
What will \b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b match?
organise
Organi1e
Organizer
organifed
Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, and ends with letter e, or any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with a single character from the range [A-Za-z0-9_].
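The alternation in this exercise can be compared against a shorter pattern that uses the ? quantifier. The Python sketch below checks that both give the same results on a handful of sample words; the word list (including organisers) and the helper name matching are assumptions, and agreement on these samples is an illustration rather than a proof.

```python
import re

candidates = ["organise", "Organi1e", "Organizer", "organifed", "organisers"]


def matching(pattern):
    """Return the candidate words in which the pattern is found."""
    return [c for c in candidates if re.search(pattern, c)]


with_alternation = matching(r"\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b")
with_optional = matching(r"\b[Oo]rgani.e\w?\b")

print(with_alternation)
# ['organise', 'Organi1e', 'Organizer', 'organifed']
print(with_optional == with_alternation)
# True
```

organisers is rejected by both patterns because two word characters follow the e, so no word boundary is available after zero or one of them.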
This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. Some real-world use cases for regex were included in an ACRL Tech Connect blog post now archived at Library Hat.
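As a sketch of the file-name use case, the Python snippet below picks out one year's monthly reports from a list of names. The file names and the report-YYYY-MM.csv naming scheme are hypothetical; in practice the list might come from os.listdir or a similar call.

```python
import re

# Hypothetical file names following an assumed report-YYYY-MM.csv scheme.
filenames = [
    "report-2023-01.csv",
    "report-2023-02.csv",
    "report-2024-01.csv",
    "report-2023-01.csv.bak",
    "notes.txt",
]

# ^ and $ force the whole name to fit the pattern; \. matches the literal
# period before the extension; \d{2} matches the two-digit month.
pattern = re.compile(r"^report-2023-\d{2}\.csv$")

selected = [name for name in filenames if pattern.search(name)]
print(selected)
# ['report-2023-01.csv', 'report-2023-02.csv']
```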
[] defines a list or range of characters.
. matches any character.
\ is used to escape the following character when that character is a special character. So, for example, a regular expression that found ‘.com’ would be \.com because . is a special character that matches any character.
\d matches any single digit.
\w matches any word character (equivalent to [A-Za-z0-9_]).
\s matches any space, tab, or newline.
^ asserts the position at the start of the line. So what you put after it will only match if they are the first characters of a line.
$ asserts the position at the end of the line. So what you put before it will only match if they are the last characters of a line.
\b adds a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words.
* matches the preceding element zero or more times. For example, ab*c matches ‘ac’, ‘abc’, ‘abbbc’, etc.
+ matches the preceding element one or more times. For example, ab+c matches ‘abc’, ‘abbbc’ but not ‘ac’.
? matches when the preceding character appears zero or one time.
{VALUE} matches the preceding character the number of times defined by VALUE; ranges can be specified with the syntax {VALUE,VALUE}.
| means or.
Please see this pdf for your assignment instructions: Instructions PDF.
Back to CIS6930