Understanding Core Python Data Structures and Patterns
In this module, you will learn about Python’s most commonly used data structures: strings, lists, dictionaries, and tuples. You will also explore how to leverage regular expressions to search for patterns in text. Finally, you will see examples that combine these structures to solve more advanced tasks, followed by tips for debugging and practice exercises.
Why are these data structures important?
Strings store text and are immutable sequences of characters. In Python, they form the foundation of almost all user-facing output and file processing.
A string is a sequence of characters in a specific order. A character can be a letter, digit, punctuation mark, or whitespace. You can select any character in a string using the bracket operator:
fruit = "banana"
letter = fruit[1]
letter
'a'
The index in brackets starts at 0, so fruit[0] is the first character ('b'), fruit[1] is the second character ('a'), and so on.
fruit[0]
'b'
You can use variables or expressions as indices:
i = 1
fruit[i+1]  # fruit[2]
'n'
If you use a non-integer index, you get a TypeError. You can use len() to determine a string’s length:
n = len(fruit)  # 6 for "banana"
Because indices start at 0, the last character is at position len(fruit) - 1, which is fruit[n-1].
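The short sketch below reuses the fruit string to illustrate both boundary cases described above: a non-integer index and an index one past the last valid position. The errors list is a name introduced here only to record which exception was raised.

```python
fruit = "banana"
n = len(fruit)
errors = []  # hypothetical helper list, used only to record what happened

print(fruit[n - 1])  # the last valid index is n - 1

try:
    fruit[1.5]  # a float is not a valid index
except TypeError as error:
    errors.append(type(error).__name__)
    print("TypeError:", error)

try:
    fruit[n]  # one past the last valid index
except IndexError as error:
    errors.append(type(error).__name__)
    print("IndexError:", error)
```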
Alternatively, negative indices let you count backward:
print(fruit[-1]) # last character
print(fruit[-2]) # second to last
a
n
You can quickly access any position in the string without manual loops.
A slice selects a substring by indicating a range of indices with [start:end]. It includes the start index but excludes the end.
fruit = 'banana'
print(fruit[0:3])  # 'ban'
print(fruit[3:6]) # 'ana'
ban
ana
Omitting start means “from the beginning”, and omitting end means “to the end”:
print(fruit[:3]) # 'ban'
print(fruit[3:]) # 'ana'
ban
ana
If the first index is greater than or equal to the second, you get an empty string. For example, fruit[3:3] returns ''.
Use slices to easily extract segments of text, such as prefixes, suffixes, or partial filenames.
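As a small sketch of that idea, the snippet below slices a made-up filename into a prefix, a middle segment, and a suffix. The name report_2024.csv and the index positions are assumptions chosen purely for illustration.

```python
# Hypothetical filename, used only for illustration.
filename = "report_2024.csv"

prefix = filename[:6]    # first six characters
year = filename[7:11]    # characters at indices 7 through 10
suffix = filename[-4:]   # last four characters

print(prefix)
print(year)
print(suffix)
```

Fixed positions like these only work when every filename follows the same layout; for variable layouts, the split method covered later is usually a better fit.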
Strings are immutable, so you cannot modify them in place. An assignment like greeting[0] = 'J' causes a TypeError. Instead, create a new string:
greeting = 'Hello, world!'
new_greeting = 'J' + greeting[1:]
This prevents accidental data corruption, making string handling more predictable.
You can compare strings using relational operators:
word = 'banana'
if word == 'banana':
    print('All right, banana.')
All right, banana.
Other operators let you determine alphabetical ordering:
def compare_word(word):
    if word < 'banana':
        print(word, 'comes before banana.')
    elif word > 'banana':
        print(word, 'comes after banana.')
    else:
        print('All right, banana.')
compare_word("apple")
compare_word("Orange")
apple comes before banana.
Orange comes before banana.
Uppercase letters come before lowercase letters in Python’s default sort order, so be mindful of case differences. You can convert strings to lowercase or uppercase for case-insensitive comparisons.
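One way to make such a comparison case-insensitive is to lowercase both strings first. The sketch below assumes a hypothetical helper, compare_ignoring_case, that is not part of the original examples.

```python
def compare_ignoring_case(first, second):
    """Return -1, 0, or 1 depending on how first orders relative to second,
    ignoring differences in letter case."""
    a = first.lower()
    b = second.lower()
    if a < b:
        return -1
    if a > b:
        return 1
    return 0

# Case-sensitive: the uppercase 'O' orders before the lowercase 'b'.
print("Orange" < "banana")

# Case-insensitive: 'orange' orders after 'banana'.
print(compare_ignoring_case("Orange", "banana"))
```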
A method is like a function but follows the object-dot-method syntax. For example:
text = "Hello World"
print(text.lower())
print(text.upper())
print(text.replace("Hello", "Hi"))
print(text.split())
hello world
HELLO WORLD
Hi World
['Hello', 'World']
These methods make it easy to transform text for data cleaning or user-facing output.
Regular expressions (regex) help you search for complex patterns in text. Python’s built-in re module provides powerful tools for matching and manipulating text.
For example, you can verify formats (phone numbers, emails), capture specific bits of text, or do advanced replacements.
import re
text = "Hello, my name is Jane. It's nice to meet you."
pattern = 'Jane'
result = re.search(pattern, text)
if result:
    print("Found:", result.group())
else:
    print("Not found.")
Found: Jane
If the pattern is found, re.search returns a Match object with .group(), .span(), etc. If it is not found, re.search returns None.
This allows very fast pattern matching in large strings and is flexible enough for partial matches (e.g., 'Jan[eE]*' to allow slight variations).
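To make the two possible return values concrete, the following sketch inspects a successful match and a failed one. The second pattern, "John", is an assumption chosen only because it does not occur in the text.

```python
import re

text = "Hello, my name is Jane. It's nice to meet you."

match = re.search("Jane", text)
if match is not None:
    print(match.group())  # the text that matched
    print(match.span())   # (start, end) positions of the match in text

missing = re.search("John", text)
print(missing)            # nothing matched, so this is None
```

The span can be used as slice bounds: text[start:end] gives back the matched text.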
When writing regex, prefix patterns with r to create raw strings, which interpret backslashes literally:
normal_str = "Hello\nWorld"  # \n is a newline
raw_str = r"Hello\nWorld"  # keeps the literal \n
print(normal_str)
print(raw_str)
Hello
World
Hello\nWorld
Prefix strings with r to avoid having to escape backslashes, e.g. r"\d+" instead of "\\d+".
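A quick check, sketched below, suggests that the two spellings describe the same pattern. The sample sentence is made up for illustration.

```python
import re

# Both strings contain a backslash, the letter d, and a plus sign.
raw_pattern = r"\d+"
escaped_pattern = "\\d+"
print(raw_pattern == escaped_pattern)

# Hypothetical sample sentence.
sentence = "Order 66 shipped in 3 boxes."
first_number = re.search(raw_pattern, sentence)
print(first_number.group())
```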
For the following examples, we will use this file:
for line in open('data/sample_text.txt'):
    print(line)
Hello, world!
Alice smiled as she greeted Bob with a cheerful hello.
In the quiet morning, Bob whispered hello to the sleeping world.
Alice and Bob wandered through a world that seemed to echo with hello.
A simple hello from Alice brightened Bob’s day in an ordinary world.
Bob called out, "Hello, Alice!" as they explored the world together.
In a magical world, hello was the key that united Alice and Bob.
Alice thought, "Hello to a new day in this ever-changing world," as Bob nodded.
With a friendly hello, Bob opened the door to Alice’s mysterious world.
The world felt lighter when Alice and Bob exchanged a heartfelt hello.
Bob wrote in his journal: "Today, Alice said hello to the whole world."
Amid the busy city, a quiet hello from Alice and Bob brought calm to the world.
In the realm of dreams, Alice and Bob discovered that every hello sparked wonder in the world.
A warm hello from Bob melted the chill of the early world, as Alice looked on.
Alice and Bob laughed together, their hello echoing through the vibrant world.
While strolling through the park, Bob’s spontaneous hello made the world seem friendlier to Alice.
In a story of friendship, every hello by Alice and every nod from Bob transformed their little world.
The world listened as Bob said hello, while Alice beamed in response.
Under the starlight, Alice and Bob shared a soft hello that warmed their world.
A final hello from Alice to Bob closed a day where the world felt wonderfully alive.
You might loop over each line in a file and call re.search:
def find_first(pattern, filename='data/sample_text.txt'):
    import re
    for line in open(filename):
        result = re.search(pattern, line)
        if result is not None:
            return result
find_first("Hello")
<re.Match object; span=(0, 5), match='Hello'>
Use the | symbol for logical OR within a regex. For example, to find either “Alice” or “Bob”:
pattern = 'Alice|Bob'
result = find_first(pattern)
print(result)
<re.Match object; span=(0, 5), match='Alice'>
You can also loop through lines, counting matches. For instance:
def count_matches(pattern, filename='data/sample_text.txt'):
    import re
    count = 0
    for line in open(filename):
        if re.search(pattern, line) is not None:
            count += 1
    return count
mentions = count_matches('Alice|Bob')
print(mentions)
19
^: start of a line
$: end of a line
find_first('^Hello')
<re.Match object; span=(0, 5), match='Hello'>
find_first('world!$')
<re.Match object; span=(7, 13), match='world!'>
Regex includes special metacharacters and quantifiers:
. matches any character (except newline).
* matches 0 or more of the preceding element.
+ matches 1 or more of the preceding element.
? makes the preceding element optional (0 or 1).
[...] matches any one character in the brackets.
(...) captures the matched text as a group.
\ escapes special characters or denotes special sequences, such as the \d seen earlier.
Use re.sub(pattern, replacement, text) to substitute matches:
text_line = "This is the centre of the city."
pattern = r'cent(er|re)'
updated_line = re.sub(pattern, 'center', text_line)
print(updated_line)
This is the center of the city.
This allows you to clean up strings in powerful ways, such as normalizing different spellings or removing special characters.
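As a sketch of the "removing special characters" case, the snippet below deletes everything except letters, digits, and spaces. The input string is made up, and the negated bracket form [^...] (match any character not listed) is an addition not covered in the list above.

```python
import re

# Hypothetical messy input.
messy = "Hello!!! Are you #ready?"

# [^A-Za-z0-9 ] matches any single character that is NOT a letter, digit, or space.
# Replacing each such character with an empty string removes it.
cleaned = re.sub(r"[^A-Za-z0-9 ]", "", messy)
print(cleaned)
```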
Use re.findall to get all matches, re.split to split a string by a regex, and various flags (e.g., re.IGNORECASE) to alter matching behavior.
Regex is extremely powerful for tasks like extracting email addresses, validating formats, or searching logs.
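The three tools just mentioned can be sketched briefly as follows. The first string is taken from the sample file; the color list is a made-up input.

```python
import re

line = 'Bob called out, "Hello, Alice!" as they explored the world together.'

# findall returns every non-overlapping match, in order, as a list of strings.
names = re.findall(r"Alice|Bob", line)
print(names)

# With re.IGNORECASE, the lowercase pattern also matches 'Hello'.
greeting = re.search(r"hello", line, flags=re.IGNORECASE)
print(greeting.group())

# split breaks the string wherever the pattern matches (hypothetical input).
parts = re.split(r", ", "red, green, blue")
print(parts)
```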
Lists are mutable sequences that can store elements of any type (including other lists). They form the workhorse for many data-processing tasks due to their flexibility.
A list is a sequence of values (of any type). Create one with square brackets:
numbers = [42, 123]
cheeses = ['Cheddar', 'Edam', 'Gouda']
mixed = ['spam', 2.0, 5, [10, 20]]  # nested list
empty = []
len(cheeses) returns the length of a list. The length of an empty list is 0.
Use the bracket operator to read or write an element:
numbers[1] = 17  # modifies the list
print(numbers)
[42, 17]
Unlike strings, lists allow you to assign directly to their indices. You can still use negative indices to count backward.
Use the in operator to check membership:
'Edam' in cheeses
True
Lists support slicing with the same [start:end] syntax as strings:
letters = ['a', 'b', 'c', 'd']
letters[1:3]
['b', 'c']
letters[:2]
['a', 'b']
letters[2:]
['c', 'd']
letters[:]  # copy of the list
['a', 'b', 'c', 'd']
+ concatenates, * repeats:
[1, 2] + [3, 4]
[1, 2, 3, 4]
['spam'] * 4
['spam', 'spam', 'spam', 'spam']
sum([1, 2, 3])
6
min([3, 1, 4])
1
max([3, 1, 4])
4
append(x) adds an item at the end.
extend([x, y]) adds multiple items.
pop(index) removes and returns the item at index.
remove(x) removes the first occurrence of x.
letters = ['a', 'b', 'c']
letters.append('d')  # modifies letters
print(letters)
['a', 'b', 'c', 'd']
letters.extend(['e', 'f'])
print(letters)
['a', 'b', 'c', 'd', 'e', 'f']
letters.pop(1)  # removes 'b'
print(letters)
['a', 'c', 'd', 'e', 'f']
letters.remove('e')  # removes 'e'
print(letters)
['a', 'c', 'd', 'f']
These list methods help manage growing or shrinking lists without extra variables.
List methods often modify a list in place and return None. This can confuse people who expect them to behave like string methods. For instance:
t = [1, 2, 3]
t = t.remove(3)  # WRONG!
print(t)
# Expect: [1, 2]
# Return: None
None
remove(3) modifies t and returns None, so assigning it back to t loses the original list. If you see an error like NoneType object has no attribute 'remove', check whether you accidentally assigned a list method’s return value to the list.
For the example above, you would do this:
t = [1, 2, 3]
t.remove(3)  # CORRECT!
print(t)
[1, 2]
A list of characters is not the same as a string. To convert a string to a list of characters, use list():
s = 'coal'
t = list(s)
print(t)
['c', 'o', 'a', 'l']
To split a string by whitespace into a list of words:
s = "The children yearn for the mines"
words = s.split()
print(words)
['The', 'children', 'yearn', 'for', 'the', 'mines']
You can specify a delimiter for split, and you can use ''.join(list_of_strings) to rebuild a single string from a list. These are useful for text tokenization, splitting logs, or reconstructing messages.
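A minimal sketch of both operations follows. The comma-separated record is a made-up log line used only for illustration.

```python
# Hypothetical comma-separated record.
record = "2024-01-15,ERROR,disk full"

# Passing a delimiter splits on commas instead of whitespace.
fields = record.split(",")
print(fields)

# The string that join is called on is placed between the items.
rebuilt = " | ".join(fields)
print(rebuilt)

# An empty separator glues characters back into a single string.
letters = ['c', 'o', 'a', 'l']
word = "".join(letters)
print(word)
```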
A for loop iterates over each element:
for cheese in cheeses:
    print(cheese)
Cheddar
Edam
Gouda
Use sorted() to return a new sorted list without modifying the original:
scrambled_list = ["c", "a", "b"]
sorted_list = sorted(scrambled_list)
print(sorted_list)
print(scrambled_list)
['a', 'b', 'c']
['c', 'a', 'b']
sorted('letters') returns a list of characters. Combine with "".join() to build a sorted string:
"".join(sorted('letters'))
'eelrstt'
Variables can refer to the same object or different objects that have the same value. For example:
a = 'banana'
b = 'banana'
a is b  # often True (same object)
True
In this example, Python only created one string object, and both a and b refer to it. But when you create two lists, you get two objects.
x = [1, 2, 3]
y = [1, 2, 3]
x is y  # False (different objects)
False
In this case we would say that the two lists are equivalent, because they have the same elements, but not identical, because they are not the same object. If two objects are identical, they are also equivalent, but if they are equivalent, they are not necessarily identical.
When you assign one variable to another, both variables reference the same object:
a = [1, 2, 3]
b = a
b is a
True
If an object is mutable, changes made via one variable affect the other:
print(a)
b[0] = 5
print(a)
[1, 2, 3]
[5, 2, 3]
Avoid aliasing unless it’s intentional.
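When you need an independent list, one option is to copy it instead of assigning it. The sketch below contrasts the two; the variable names are chosen here for illustration.

```python
original = [1, 2, 3]

alias = original            # a second name for the same list object
duplicate = list(original)  # a new list with the same elements

alias[0] = 5                # visible through both original and alias
duplicate[1] = 99           # affects only the copy

print(original)
print(alias is original)
print(duplicate)
print(duplicate is original)
```

Note that list(original) and original[:] copy only the outer list; nested lists inside it would still be shared.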
When you pass a list to a function, you pass a reference to that list. The function can modify the original list:
def pop_first(lst):
    return lst.pop(0)
letters = ['a', 'b', 'c']
pop_first(letters)
print(letters)
['b', 'c']
If you do not want a function to modify the original list, pass a copy:
pop_first(list(letters))  # or pop_first(letters[:])
'b'
A dictionary maps keys to values and offers very fast lookups. Keys must be immutable, while values can be anything (including lists).
Instead of using integer indices, a dictionary can use almost any hashable type as a key. You create a dictionary with curly braces:
numbers = {}
numbers['zero'] = 0
numbers['one'] = 1
numbers
{'zero': 0, 'one': 1}
Access a value using its key:
numbers['one']
1
Dictionary keys must be unique and immutable. Lists cannot serve as keys because they are mutable. Dictionaries are useful for fast lookup by label (e.g., “user_id” -> user info) instead of by integer position.
You can create a dictionary all at once:
numbers = {'zero': 0, 'one': 1, 'two': 2}
or use dict():
numbers_copy = dict(numbers)
print(numbers_copy)
empty = dict()
print(empty)
{'zero': 0, 'one': 1, 'two': 2}
{}
The in operator checks whether a key appears in the dictionary, without searching through all entries:
'one' in numbers
True
'three' in numbers
False
To check if something appears as a value, use numbers.values():
1 in numbers.values()
True
Use a dictionary to count how often each character appears in a string:
def value_counts(string):
    counter = {}
    for letter in string:
        if letter not in counter:
            counter[letter] = 1
        else:
            counter[letter] += 1
    return counter
value_counts('brontosaurus')
{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}
When you loop over a dictionary, you traverse its keys:
counter = value_counts('banana')
for key in counter:
    print(key)
b
a
n
Use counter.values() to loop over values:
for value in counter.values():
    print(value)
1
3
2
Or you can use the bracket operator to get the key and value:
for key in counter:
    print(key, counter[key])
b 1
a 3
n 2
This approach looks up each key in the counter dictionary on every pass through the loop; the tuples section shows a more efficient version of this loop.
A dictionary’s values can be lists (or other dictionaries), but keys must be hashable:
d = {
    "fruits": ["apple", "banana", "cherry"],
    "numbers": [10, 20, 30],
    "colors": {
        "red": [True, False, True],
        "yellow": [True, True, False],
        "green": [True, False, False]
    }
}
print(d)
{'fruits': ['apple', 'banana', 'cherry'], 'numbers': [10, 20, 30], 'colors': {'red': [True, False, True], 'yellow': [True, True, False], 'green': [True, False, False]}}
This allows you to combine structures for more complex data representations, such as JSON-like objects. You cannot use a list as a key. Python uses a hash table for quick lookups, and hash values must not change.
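The restriction on keys can be observed directly. In the sketch below, using a list as a key raises a TypeError, while a tuple (an immutable sequence covered in the next section) is accepted. The exact wording of the error message may vary between Python versions, so it is captured in a variable and printed as-is.

```python
lookup = {}
message = None

try:
    lookup[[1, 2]] = "list key"   # lists are mutable, so they cannot be hashed
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

lookup[(1, 2)] = "tuple key"      # an immutable tuple works as a key
print(lookup)
```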
Tuples are immutable sequences that can hold multiple items. They’re often used where immutability is helpful (e.g., as dictionary keys).
Tuples work like lists but cannot be modified once created. You create a tuple with comma-separated values, usually enclosed in parentheses:
t = ('l', 'u', 'p', 'i', 'n')
t_2 = 'l', 'u', 'p', 'i', 'n'
print(type(t))
print(type(t_2))
<class 'tuple'>
<class 'tuple'>
You can create a single-element tuple by adding a trailing comma:
t_single = "a",
print(t_single)
('a',)
Wrapping a single element in parentheses does not make a single-element tuple:
t_single_bad = ("a")
print(t_single_bad)
print(type(t_single_bad))
a
<class 'str'>
Like strings, tuples are immutable. Attempting to modify a tuple directly causes an error. Tuples do not have list-like methods such as append or remove.
t[0] = "L"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[68], line 1
----> 1 t[0] = "L"
TypeError: 'tuple' object does not support item assignment
Because they are immutable, tuples are hashable and can serve as keys in a dictionary:
coords = {}
coords[(1, 2)] = "Location A"
coords[(3, 4)] = "Location B"
print(coords)
{(1, 2): 'Location A', (3, 4): 'Location B'}
You cannot alter tuple contents after creation.
You can assign multiple variables at once with tuple unpacking:
a, b = 1, 2  # could also use: (a, b) = (1, 2) or any combination of parentheses
print(a, b)
1 2
If the right side has the wrong number of values, Python raises a ValueError.
a, b = 1, 2, 3
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[71], line 1
----> 1 a, b = 1, 2, 3
ValueError: too many values to unpack (expected 2)
Tuple assignment also lets you swap two variables in one line, without an extra temporary variable:
print(a, b)
a, b = b, a  # swap
print(a, b)
1 2
2 1
You often use tuple assignment to iterate over (key, value) pairs from a dictionary:
d = {'one': 1, 'two': 2, 'three': 3}
for item in d.items():
    key, value = item
    print(key, '->', value)
one -> 1
two -> 2
three -> 3
Each time through the loop, item is assigned a tuple that contains a key and the corresponding value.
We can write this loop more concisely, like this:
for key, value in d.items():
    print(key, '->', value)
one -> 1
two -> 2
three -> 3
A function can return a single tuple, effectively returning multiple values:
def min_max(t):
    return min(t), max(t)  # could also write: (min(t), max(t))
low, high = min_max([2, 4, 1, 3])
print(low, high)
1 4
This offers a clean way to return more than one piece of information from a function.
If a function parameter starts with *, Python packs extra arguments into a tuple:
def mean(*args):
    return sum(args) / len(args)
mean(1, 2, 3)
2.0
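Here is an example you are already familiar with, print. Its signature looks like this (shown for reference only; running this stub would replace the built-in):
def print(*args, sep=' ', end='\n', file=None, flush=False):
    """print code"""
print(1, 2, 3, sep=", ")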
1, 2, 3
You can unpack a sequence by using * when calling a function:
divmod(*[7, 3]) # same as divmod(7, 3)
(2, 1)
Consider a function that calculates a “trimmed” mean by removing the lowest and highest values:
def trimmed_mean(*args):
    low, high = min_max(args)
    trimmed = list(args)
    trimmed.remove(low)
    trimmed.remove(high)
    return mean(*trimmed)
trimmed_mean(1, 2, 3, 4, 5)
3.0
While this is a bit more advanced than we will need for this course, it allows flexible argument passing and returning, which helps build utility functions that accept varying numbers of inputs.
The built-in zip function pairs up corresponding elements from multiple sequences:
scores1 = [1, 2, 4, 5, 1, 5, 2]
scores2 = [5, 5, 2, 5, 5, 2, 3]
for s1, s2 in zip(scores1, scores2):
    if s1 > s2:
        print("Team1 wins this game!")
    elif s1 < s2:
        print("Team2 wins this game!")
    else:
        print("It's a tie!")
Team2 wins this game!
Team2 wins this game!
Team1 wins this game!
It's a tie!
Team2 wins this game!
Team1 wins this game!
Team2 wins this game!
list(zip(a, b)) returns a list of tuples. You can also combine zip with dict to create dictionaries from two parallel lists:
letters = 'abc'
numbers = [0, 1, 2]
dict(zip(letters, numbers))  # try list(zip(letters, numbers)) on your own
{'a': 0, 'b': 1, 'c': 2}
Use enumerate to loop over the indices and elements of a sequence at the same time:
for index, element in enumerate('abcefghijk'):
    print(index, element)
0 a
1 b
2 c
3 e
4 f
5 g
6 h
7 i
8 j
9 k
To see the values enumerate creates, you need to turn the enumerate object into either a list, tuple, or dictionary:
enumerate('abcefghijk')
<enumerate at 0x2488a161df0>
list(enumerate('abcefghijk'))
[(0, 'a'),
(1, 'b'),
(2, 'c'),
(3, 'e'),
(4, 'f'),
(5, 'g'),
(6, 'h'),
(7, 'i'),
(8, 'j'),
(9, 'k')]
tuple(enumerate('abcefghijk'))
((0, 'a'),
(1, 'b'),
(2, 'c'),
(3, 'e'),
(4, 'f'),
(5, 'g'),
(6, 'h'),
(7, 'i'),
(8, 'j'),
(9, 'k'))
dict(enumerate('abcefghijk'))
{0: 'a',
1: 'b',
2: 'c',
3: 'e',
4: 'f',
5: 'g',
6: 'h',
7: 'i',
8: 'j',
9: 'k'}
This is true for many Python functions that create objects, so remember to experiment with new code.
To invert a dictionary that maps a key to a value, you might need to map each value to a list of keys (because multiple keys can share the same value). For example:
def invert_dict(d):
    new_d = {}
    for key, val in d.items():
        if val not in new_d:
            new_d[val] = [key]
        else:
            new_d[val].append(key)
    return new_d
This is useful for reverse lookups when multiple keys share the same value:
counts = {
    "a": 1,
    "b": 23,
    "c": 1,
    "d": 4,
    "e": 4
}
invert_dict(counts)
{1: ['a', 'c'], 23: ['b'], 4: ['d', 'e']}
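For comparison, the same inversion can be sketched more compactly with the standard dict.setdefault method, which this module does not otherwise use. The function name invert_dict_compact is hypothetical; this is one possible variant, not a replacement for the version above.

```python
def invert_dict_compact(d):
    """Map each value in d to the list of keys that share it."""
    inverted = {}
    for key, val in d.items():
        # setdefault returns the list already stored under val,
        # or stores a new empty list there and returns it.
        inverted.setdefault(val, []).append(key)
    return inverted

counts = {"a": 1, "b": 23, "c": 1, "d": 4, "e": 4}
print(invert_dict_compact(counts))
```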
Tuples are hashable, so we can use them as dictionary keys:
locations = {}
locations[(1, 2)] = "Start"
locations[(3, 4)] = "Goal"
print(locations[(3, 4)])
Goal
This could be useful for coordinate-based lookups (e.g., board games or grid-based apps).
Write a program that checks if the word "apple" appears in the sentence "I bought some apples and oranges at the market." Print "Found" or "Not Found" accordingly. Consider using re.search() with a pattern allowing an optional s.
Given:
text = """
Call me at 123-456-7890 or at (123) 456-7890.
Alternatively, reach me at 123.456.7890.
"""
Write a single regex that matches all three phone formats. Use re.findall() to capture them.
For a product catalog:
catalog = """Product ID: ABC-123 Price: $29.99
Product ID: XY-999 Price: $199.95
Product ID: TT-100 Price: $10.50
Product ID: ZZ-777 Price: $777.00
Product ID: FF-333 Price: $2.99
"""
Write a regex that captures (ProductID, Price) as groups. Use re.findall() to produce a list of tuples.
Two words are anagrams if one can be rearranged to form the other.
Write is_anagram that returns True if two strings are anagrams. Then find all anagrams of "takes" in a given word list.
A palindrome reads the same forward and backward. Write is_palindrome that checks if a string is a palindrome. Use reversed or slice notation to reverse strings.
Rewrite the value_counts function to eliminate the if statement by using dict.get(key, default).
Write has_duplicates(sequence) that returns True if any element appears more than once. Test it to see if you can find a word longer than "unpredictably" with all unique letters.
Write find_repeats(counter) that takes a dictionary mapping from keys to counts and returns a list of keys appearing more than once.
Write most_frequent_letters(string) that prints letters in decreasing order of frequency. You can use reversed(sorted(...)) or sorted(..., reverse=True).