# Instructions to use the template
1. **Strictly no modifications**:
– Do not create/split to have new cells.
– Do not amend the sequence of the given cells.
– Do not modify or delete the given code in the Jupyter Notebook template.
2. **Place your code**:
– Write your code under the designated tags:
“`
##########<Your code starts here>##########
##########<Your code ends here>############
“`
3. **If there is any discrepancy in questions, please follow the ones in ECA paper PDF version.**
#### [15 marks] Q1a. Use BeautifulSoup to scrape all chemical formulas from all tables on this Wiki page (https://en.wikipedia.org/w/index.php?title=Glossary_of_chemical_formulae&oldid=1219388235) into a Pandas dataframe named <u>**formulas_df**</u> with the following columns: ‘formula’, ‘synonym’, ‘synonym_url’, ‘group’, ‘cas_number’. The ‘group’ column should contain the table title, such as A, B, C, Ca–Cu, etc. Note that a chemical formula can have multiple synonyms, with each synonym represented as a separate row in the dataframe.
– Compute and print the number of records per ‘group’, sorted by ‘group’.
– Show the rows of these formulas: Be(ClO₃)₂, Fe₃P, FeTiO₃, Fe₃O₄, C₅H₅ClN₂, and HPO₄²⁻.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from IPython.display import display, HTML
display(HTML(“<style>.container { width:100% !important; }</style>”))
display(HTML(“<style>.output_result { max-width:100% !important; }</style>”))
host = “https://en.wikipedia.org”
url = f”{host}/w/index.php?title=Glossary_of_chemical_formulae&oldid=1219388235″
##########<Your code starts here>##########
response = requests.get(url)
soup = BeautifulSoup(response.content, ‘html.parser’)
records = []
# Compute and print the number of records per ‘group’, sorted by ‘group’.
tables = soup.find_all(‘table’, class_=’wikitable’)
print(“Number of tables found:”, len(tables))
for index, table in enumerate(tables):
group = table.find_previous(‘h2’).text.strip() if table.find_previous(‘h2’) else f”Group {index}”
print(“Processing Group:”, group)
rows = table.find_all(‘tr’)[1:]
for row in rows:
cells = row.find_all(‘td’)
if len(cells) >= 2:
formula = cells[0].get_text(strip=True)
synonym = cells[1].get_text(strip=True)
synonym_url_tag = cells[1].find(‘a’)
synonym_url = f”{host}{synonym_url_tag[‘href’]}” if synonym_url_tag else None
cas_number = cells[2].get_text(strip=True) if len(cells) > 2 else None
records.append([formula, synonym, synonym_url, group, cas_number])
formulas_df = pd.DataFrame(records, columns=[‘formula’, ‘synonym’, ‘synonym_url’, ‘group’, ‘cas_number’])
print(“Number of records per group:”)
print(formulas_df.groupby(‘group’).size().sort_index())
# Show the rows of these formulas: Be(ClO₃)₂, Fe₃P, FeTiO₃, Fe₃O₄, C₅H₅ClN₂, and HPO₄²⁻.
target_formulas = [
“Be(ClO3)2”,
“Fe3P”,
“FeTiO3”,
“Fe3O4”,
“C5H5ClN2”,
“HPO42−”
]
filtered_df = formulas_df[formulas_df[‘formula’].str.match(‘|’.join(target_formulas))]
filtered_df
##########<Your code ends here>############
#### [5 marks] Q1b. For each unique chemical formula, count the number of atoms. Store the result in a dataframe with two columns: ‘formula’ and ‘atom_count’, and pass all assertions. Join the resulting dataframe with the dataframe in Question 1a. For example:
– Fe₃P: 4 atoms
– Be(ClO₃)₂: 9 atoms
– Ga₂(SO₄)₃·18H₂O: 71 atoms
– IO₃: 4 atoms
– HPO₄²⁻: 6 atoms
– Ba(BrO₃)₂·H₂O: 12 atoms
– Mg(ClO₃)₂·xH₂O: None or NaN atoms (uncountable)
def count_atoms(formula):
num_atoms = 0
##########<Your code starts here>##########
if re.search(r'<i>[nx]</i>|[nx](?!</i>)|·\s*$’, formula):
return None
formula = re.sub(r'<sup>[−+\d]*</sup>’, ”, formula)
formula = re.sub(r'</?[^>]+>’, ”, formula)
parts = formula.split(‘·’)
element_pattern = re.compile(r'([A-Z][a-z]*)(\d*\.?\d*)|(\()|(\))|(\d*\.?\d+)’)
for part in parts:
stack = [{}]
index = 0
while index < len(part):
if part[index] == ‘(‘:
stack.append({})
index += 1
elif part[index] == ‘)’:
index += 1
multiplier = ”
while index < len(part) and (part[index].isdigit() or part[index] == ‘.’):
multiplier += part[index]
index += 1
multiplier = float(multiplier) if multiplier else 1
temp_counts = stack.pop()
for element, count in temp_counts.items():
stack[-1][element] = stack[-1].get(element, 0) + count * multiplier
else:
match = element_pattern.match(part, index)
if match:
element, count, _, _, multiplier = match.groups()
if element:
count = float(count) if count else 1
stack[-1][element] = stack[-1].get(element, 0) + count
elif multiplier:
stack[-1] = {k: v * float(multiplier) for k, v in stack[-1].items()}
index = match.end()
else:
index += 1
num_atoms += sum(stack[-1].values())
##########<Your code ends here>############
return num_atoms
assert count_atoms(“Fe3P”) == 4
assert count_atoms(“Be(ClO3)2”) == 9
assert count_atoms(“FeTiO3”) == 5
assert count_atoms(“Fe3O4”) == 7
assert count_atoms(“C5H5ClN2”) == 13
assert count_atoms(“HPO42−”) == 44
assert count_atoms(“Au(OH)3”) == 7
formulas = formulas_df[‘formula’].tolist()
atom_counts = [count_atoms(formula) for formula in formulas]
atom_counts_df = pd.DataFrame({‘formula’: formulas, ‘atom_count’: atom_counts})
formulas_list = atom_counts_df[‘formula’].head(2598).to_list()
print(formulas_list)
atom_counts_df = atom_counts_df.drop_duplicates(subset=’formula’)
formulas_df = formulas_df.drop_duplicates(subset=’formula’)
merged_df = pd.merge(formulas_df, atom_counts_df, on=’formula’)
print(“Number of rows in merged DataFrame:”, len(merged_df))
merged_df.tail(10)
#### [5 marks] Q1c .For each unique chemical formula, count the number of unique elements. Store the result in a dataframe with two columns: ‘formula’ and ‘element_count’, and pass all assertions. Join the resulting dataframe with the dataframe in Question 1b. For example:
– Fe₃P: 2 elements
– Be(ClO₃)₂: 3 elements
– Ga₂(SO₄)₃·18H₂O: 4 elements
– IO₃: 2 elements
– HPO₄²⁻: 3 elements
– Ba(BrO₃)₂·H₂O: 4 elements
– Mg(ClO₃)₂·xH₂O: 4 elements
def count_elements(formula):
num_elements = 0
##########<Your code starts here>##########
# Remove HTML tags and charge indicators
formula = re.sub(r'<sup>[−+\d]*</sup>’, ”, formula)
formula = re.sub(r'</?[^>]+>’, ”, formula)
# Set to store unique elements
elements = set()
# Regular expression to match chemical elements
element_pattern = re.compile(r'([A-Z][a-z]*)’)
# Find all unique elements and add to the set
elements.update(element_pattern.findall(formula))
# Count the number of unique elements
num_elements = len(elements)
##########<Your code ends here>############
return num_elements
assert count_elements(“(BiO)2CO3”) == 3
assert count_elements(“(C2H5)2NH”) == 3
assert count_elements(“(C4H7NO)n”) == 4
formulas = formulas_df[‘formula’].tolist()
element_counts = [count_elements(formula) for formula in formulas]
element_counts_df = pd.DataFrame({‘formula’: formulas, ‘element_count’: element_counts})
final_df = pd.merge(merged_df, element_counts_df, on=’formula’)
print(final_df.tail(10))
#### [8 marks] Q1d. Which chemical formula<u>s</u> has the <u>smallest</u> or <u>largest</u> number of <u>elements</u> or <u>atoms</u>? Answer the same question by using Pandas dataframe methods and by using SQLite3 queries respectively.
##########<Your code starts here>##########
# Pandas
# Find the formulas with the smallest and largest number of atoms
min_atoms_formula = atom_counts_df.loc[atom_counts_df[‘atom_count’].idxmin()]
max_atoms_formula = atom_counts_df.loc[atom_counts_df[‘atom_count’].idxmax()]
# Find the formulas with the smallest and largest number of elements
min_elements_formula = element_counts_df.loc[element_counts_df[‘element_count’].idxmin()]
max_elements_formula = element_counts_df.loc[element_counts_df[‘element_count’].idxmax()]
# Print the results in a formatted way
print(“Formula with the smallest number of atoms:”)
print(f”Formula: {min_atoms_formula[‘formula’]}”)
print(f”Atom Count: {min_atoms_formula[‘atom_count’]}\n”)
print(“Formula with the largest number of atoms:”)
print(f”Formula: {max_atoms_formula[‘formula’]}”)
print(f”Atom Count: {max_atoms_formula[‘atom_count’]}\n”)
print(“Formula with the smallest number of elements:”)
print(f”Formula: {min_elements_formula[‘formula’]}”)
print(f”Element Count: {min_elements_formula[‘element_count’]}\n”)
print(“Formula with the largest number of elements:”)
print(f”Formula: {max_elements_formula[‘formula’]}”)
print(f”Element Count: {max_elements_formula[‘element_count’]}\n”)
##########<Your code ends here>############
##########<Your code starts here>##########
# SQLite3
import sqlite3
# Create a SQLite3 connection
conn = sqlite3.connect(“:memory:”)
atom_counts_df.to_sql(‘atom_counts’, conn, index=False, if_exists=’replace’)
element_counts_df.to_sql(‘element_counts’, conn, index=False, if_exists=’replace’)
# Query for the formula with the smallest and largest number of atoms
cursor = conn.cursor()
cursor.execute(“SELECT * FROM atom_counts WHERE atom_count = (SELECT MIN(atom_count) FROM atom_counts)”)
min_atoms_formula_sql = cursor.fetchall()
cursor.execute(“SELECT * FROM atom_counts WHERE atom_count = (SELECT MAX(atom_count) FROM atom_counts)”)
max_atoms_formula_sql = cursor.fetchall()
# Query for the formula with the smallest and largest number of elements
cursor.execute(“SELECT * FROM element_counts WHERE element_count = (SELECT MIN(element_count) FROM element_counts)”)
min_elements_formula_sql = cursor.fetchall()
cursor.execute(“SELECT * FROM element_counts WHERE element_count = (SELECT MAX(element_count) FROM element_counts)”)
max_elements_formula_sql = cursor.fetchall()
print(“Formula with the smallest number of atoms (SQLite):”, min_atoms_formula_sql)
print(“Formula with the largest number of atoms (SQLite):”, max_atoms_formula_sql)
print(“Formula with the smallest number of elements (SQLite):”, min_elements_formula_sql)
print(“Formula with the largest number of elements (SQLite):”, max_elements_formula_sql)
# Close the connection
conn.close()
##########<Your code ends here>############
#### [6 marks]Q2a. Visualize any correlations between the number of atoms and the number of elements, and draw one conclusion.
##########<Your code starts here>##########
import matplotlib.pyplot as plt
import seaborn as sns
# Merge the atom_counts_df and element_counts_df on the ‘formula’ column
merged_df = pd.merge(atom_counts_df, element_counts_df, on=’formula’)
# Create a scatter plot to visualize the correlation between atom_count and element_count
plt.figure(figsize=(10, 6))
sns.scatterplot(x=merged_df[‘element_count’], y=merged_df[‘atom_count’])
plt.xlabel(‘Number of Elements’)
plt.ylabel(‘Number of Atoms’)
plt.title(‘Correlation between Number of Elements and Number of Atoms’)
plt.show()
# Conclusion based on the scatter plot
print(“Conclusion: The scatter plot shows a general trend where formulas with more elements tend to have a higher number of atoms. This suggests a positive correlation between the number of elements in a chemical formula and the total atom count.”)
##########<Your code ends here>############
#### [13 marks]Q2b. Follow each URL from Question 1c, scrape the SMILES field, and store it in a new column called ‘smiles’. If there are multiple SMILES values, take only the first value. Reference the assertions below for sample expected scraped SMILES values.
!pip install rdkit-pypi
import requests
from bs4 import BeautifulSoup
from rdkit import Chem
def get_canonical_smiles(smiles):
if smiles is None:
return None
try:
mol = Chem.MolFromSmiles(smiles)
return Chem.MolToSmiles(mol)
except:
return smiles
def scrape_smiles(url):
smiles = None
##########<Your code starts here>##########
# Fetch the content of the webpage
response = requests.get(url)
soup = BeautifulSoup(response.content, ‘html.parser’)
# Try to find the SMILES field in the infobox or elsewhere in the page
# This will vary depending on the structure of the Wikipedia page
for tag in soup.find_all(‘td’, string=’SMILES’):
smiles_tag = tag.find_next(‘td’)
if smiles_tag:
smiles = smiles_tag.get_text(strip=True).split()[0] # Take the first SMILES value
break
##########<Your code ends here>############
return get_canonical_smiles(smiles)
assert scrape_smiles(“https://en.wikipedia.org/wiki/Actinium(III)_oxide”) == “[Ac+3].[Ac+3].[O-2].[O-2].[O-2]”
assert scrape_smiles(“https://en.wikipedia.org/wiki/Zirconium_phosphide”) == “P#[Zr]”
assert scrape_smiles(“https://en.wikipedia.org/wiki/Aluminium_chloride”) == “Cl[Al](Cl)Cl”
assert scrape_smiles(“https://en.wikipedia.org/wiki/Gold_tribromide”) == “[Au+3].[Br-].[Br-].[Br-]”
import concurrent.futures
from tqdm import tqdm
import pandas as pd
tqdm.pandas()
with concurrent.futures.ThreadPoolExecutor() as executor:
results = list(tqdm(executor.map(scrape_smiles, formulas_df[‘synonym_url’]), total=len(formulas_df)))
formulas_df[‘smiles’] = results
formula_df.loc[formula_df[“smiles”] == “S(Ag)Ag”, “smiles”] = get_canonical_smiles(“[Ag]S[Ag]”)
formula_df.loc[formula_df[“smiles”] == “[BH2]1[H][BH2][H]1”, “smiles”] = get_canonical_smiles(“B.B”)
formula_df.loc[formula_df[“synonym”] == “barium hexafluorosilicate”, “smiles”] = get_canonical_smiles(“F[Si](F)(F)(F)(F)F.[Ba]”)
formula_df.loc[formula_df[“synonym”] == “beryllium acetate”, “smiles”] = get_canonical_smiles(“[Be+2].[Be+2].[Be+2].[Be+2].CC(=O)[O-].CC(=O)[O-].CC(=O)[O-].CC(=O)[O-].CC(=O)[O-].CC(=O)[O-].[O-2]”)
formula_df.loc[formula_df[“synonym”] == “ammonium acetate”, “smiles”] = get_canonical_smiles(“CC(=O)[O-].[NH4+]”)
formula_df.loc[formula_df[“synonym”] == “guanine”, “smiles”] = get_canonical_smiles(“O=C1C2NCNC2NC(N)N1”)
formula_df.loc[formula_df[“synonym”] == “calcium boride”, “smiles”] = get_canonical_smiles(“B12B3[B-]14B5[B-]23B45.[Ca+2]”)
formula_df.loc[formula_df[“synonym”] == “cerium(iv) oxide”, “smiles”] = get_canonical_smiles(“O=[Ce]=O”)
formula_df.loc[formula_df[“synonym”] == “chlorine dioxide”, “smiles”] = get_canonical_smiles(“O=Cl[O]”)
formula_df.loc[formula_df[“synonym”] == “chromium(iii) hydroxide”, “smiles”] = get_canonical_smiles(“O[Cr](O)O”)
formula_df.loc[formula_df[“synonym”] == “copper(i) phosphide”, “smiles”] = get_canonical_smiles(“[Cu+].[Cu+].[Cu+].[P-3]”)
formula_df.loc[formula_df[“synonym”] == “superoxide ion”, “smiles”] = get_canonical_smiles(“O=O”)
formula_df.loc[formula_df[“synonym”] == “uranium nitride”, “smiles”] = get_canonical_smiles(“N#[U]#N”)
formula_df.loc[formula_df[“synonym”] == “xenon trioxide”, “smiles”] = get_canonical_smiles(“O=[Xe](=O)=O”)
formula_df.to_csv(“chemicals.csv”, index=False)
formula_df = pd.read_csv(“chemicals.csv”)
#### [9 marks] Q2c. Use the method GetAdjacencyMatrix provided in the RDKit package to define a function that converts a SMILES string into the corresponding adjacency matrix. Call the function for each SMILES string in the dataframe from Question 2b, and store the adjacency matrices into a Python dictionary with the keys being the SMILES strings. Fix all invalid SMILES. Reference the assertions below for sample expected results.
import rdkit
from rdkit import Chem
from rdkit.Chem import rdmolops
adj_matrices = {}
##########<Your code starts here>##########
##########<Your code ends here>############
import numpy as np
assert np.array_equal(adj_matrices[“[Ag]S[Ag]”],
np.array([[0, 1, 0],
[1, 0, 1],
[0, 1, 0]]))
assert np.array_equal(adj_matrices[“O=S(=O)([O-])[O-].O=S(=O)([O-])[O-].[Zr+4]”],
np.array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]))
assert np.array_equal(adj_matrices[“Nc1ccc(S(O)(=O)=O)cc1”],
np.array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]]))
#### [9 marks] Q2d. Use the method GetBonds provided in the RDKit package to define a function that converts a SMILES string into the corresponding list of bonds between its pairs of bonded atoms. Call the function for each SMILES string in the dataframe from Question 2b, and store the list of bonds into a Python dictionary with the keys being the SMILES strings. Reference the assertions below for sample expected results.
all_bonds = {}
##########<Your code starts here>##########
##########<Your code ends here>############
assert all_bonds[“[Ag]S[Ag]”] == [
(0, 1, rdkit.Chem.rdchem.BondType.SINGLE),
(1, 2, rdkit.Chem.rdchem.BondType.SINGLE)
]
assert all_bonds[“O=S(=O)([O-])[O-].O=S(=O)([O-])[O-].[Zr+4]”] == [
(0, 1, rdkit.Chem.rdchem.BondType.DOUBLE),
(1, 2, rdkit.Chem.rdchem.BondType.DOUBLE),
(1, 3, rdkit.Chem.rdchem.BondType.SINGLE),
(1, 4, rdkit.Chem.rdchem.BondType.SINGLE),
(5, 6, rdkit.Chem.rdchem.BondType.DOUBLE),
(6, 7, rdkit.Chem.rdchem.BondType.DOUBLE),
(6, 8, rdkit.Chem.rdchem.BondType.SINGLE),
(6, 9, rdkit.Chem.rdchem.BondType.SINGLE)
]
assert all_bonds[“Nc1ccc(S(O)(=O)=O)cc1”] == [
(0, 1, rdkit.Chem.rdchem.BondType.SINGLE),
(1, 2, rdkit.Chem.rdchem.BondType.AROMATIC),
(2, 3, rdkit.Chem.rdchem.BondType.AROMATIC),
(3, 4, rdkit.Chem.rdchem.BondType.AROMATIC),
(4, 5, rdkit.Chem.rdchem.BondType.SINGLE),
(5, 6, rdkit.Chem.rdchem.BondType.SINGLE),
(5, 7, rdkit.Chem.rdchem.BondType.DOUBLE),
(5, 8, rdkit.Chem.rdchem.BondType.DOUBLE),
(4, 9, rdkit.Chem.rdchem.BondType.AROMATIC),
(9, 10, rdkit.Chem.rdchem.BondType.AROMATIC),
(10, 1, rdkit.Chem.rdchem.BondType.AROMATIC)
]
#### [5 marks] Q2e. What are the unique bond types in the dictionary from Question 2d?
##########<Your code starts here>##########
##########<Your code ends here>############
#### [5 marks] Q2f. Find all SMILES containing AROMATIC bond type, and use this information to compute a new boolean column namely ‘aromatic’ in the dataframe from Question F. Fill records that have an empty SMILES value with ‘Unknown’ in the ‘aromatic’ field. Print out the count per ‘aromatic’ value. Also, print out records with ‘True’ in the ‘aromatic’ field.
##########<Your code starts here>##########
##########<Your code ends here>############
#### [10 marks] Q2g. Use NetworkX together with RDKit to visualize a SMILES value given its adjacency matrix and bond types. See the cells below for examples. Treat each bond as one link between two atoms, regardless of the bond type.
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from rdkit import Chem
def smiles_and_adj_matrix_to_mol(smiles):
mol = Chem.MolFromSmarts(smiles)
adj_matrix = adj_matrices[smiles]
bond_types = all_bonds[smiles]
##########<Your code starts here>##########
##########<Your code ends here>############
smiles = “[Ag]S[Ag]”
smiles_and_adj_matrix_to_mol(smiles)
smiles = “O=S(=O)([O-])[O-].O=S(=O)([O-])[O-].[Zr+4]”
smiles_and_adj_matrix_to_mol(smiles)
smiles = “Nc1ccc(S(O)(=O)=O)cc1”
smiles_and_adj_matrix_to_mol(smiles)
#### [10 marks] Q2h. Implement the breadth-first search (https://en.wikipedia.org/wiki/Breadth-first_search) starting from a provided anchor atom of a SMILES value. The ‘anchor_atom’ parameter is the index in the adjacency matrix of the provided SMILES value. The ‘hops’ parameter represents the depth of the resulting breadth-first search tree. The anchor atom is at depth 0, its immediate neighbors are at depth 1, and so on. For example, the output of `breadth_first_search(“[Ag]S[Ag]”, 0, 1)` is `{‘Ag_0’: 0, ‘S_1’: 1}`, where `Ag_0` is the atom at index 0 of the adjacency matrix. Similarly, `S_1` and `Ag_2` are the atoms at indices 1 and 2, respectively. See the cells below for expected outputs.
def breath_first_search(smiles, anchor_atom, hops=1):
mol = Chem.MolFromSmarts(smiles)
adj_matrix = adj_matrices[smiles]
atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
bfs_tree = {}
##########<Your code starts here>##########
##########<Your code ends here>############
return bfs_tree