Byte Order mark (BOM)

The \ufeff character at the beginning of a file is a Unicode character called the Byte Order Mark (BOM). It is used to indicate the endianness (byte order) of a text file or stream. It is especially common in files encoded as UTF-16 or UTF-8.

When you see \ufeff at the beginning of a file, it usually means that the file is encoded in UTF-8 with BOM. The BOM is not required for UTF-8 and can cause issues, as some software does not handle it correctly. It is generally recommended to use UTF-8 without BOM for maximum compatibility.

Handling BOM in Python

When reading a CSV file with the csv module in Python, you can handle the BOM by opening the file with the utf-8-sig encoding, which will automatically remove the BOM if it is present.

Here’s an example:

import csv
from io import StringIO
 
data = "\ufefffoo,bar\nfirst,second"
 
# Convert the string data to a file object using StringIO
csv_file = StringIO(data)
 
# Open the file with utf-8-sig encoding
csv_file = open(csv_file.name, encoding='utf-8-sig')
 
# Create a CSV reader object
reader = csv.reader(csv_file)
 
# Iterate over the rows in the reader and print them
for row in reader:
    print(row)
 
# Close the file
csv_file.close()

In this example, if a BOM is present, it will be removed by the utf-8-sig encoding, and the CSV reader will read the file correctly. If you are reading the CSV data from a string, you can use the StringIO object with the utf-8-sig encoding as well.

public-available

Alisson's Notes

Explorador

Byte Order mark (BOM)

Handling BOM in Python

Visão de gráfico

Backlinks