๐ Problem Statement
Given an integer array data representing the UTF-8 encoded data, return true if it is a valid UTF-8 encoding, else return false.
A UTF-8 encoding is a variable-width character encoding standard that can encode every character in the Unicode character set. It uses one to four bytes to represent a character, depending on the character's value.
- 1-byte character:
0xxxxxxx
- 2-byte character:
110xxxxx 10xxxxxx
- 3-byte character:
1110xxxx 10xxxxxx 10xxxxxx
- 4-byte character:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
To be a valid UTF-8 encoding, the following rules must be satisfied:
- 1-byte character: The first bit is
0
, followed by 7 bits of data.
- 2-byte character: The first 3 bits are
110
, followed by 5 bits of data; the second byte starts with 10
.
- 3-byte character: The first 4 bits are
1110
, followed by 4 bits of data; the second and third bytes start with 10
.
- 4-byte character: The first 5 bits are
11110
, followed by 3 bits of data; the second, third, and fourth bytes start with 10
.
๐ Example:
python
Input:
data = [197, 130, 1]
Output:
true
python
Input:
data = [235, 140, 4]
Output:
false
๐ง Key Insight
The main challenge in this problem is to identify whether the given byte sequence adheres to valid UTF-8 encoding rules. We need to process the input as a series of bits and ensure that each byte correctly conforms to one of the 1-4 byte UTF-8 patterns.
๐ก Approach
- Bitwise Analysis:
- Each number in the array represents a byte. We examine the binary representation of each byte to check if it conforms to the expected bit pattern for the UTF-8 encoding.
- Identify Byte Sequences:
- The first byte of each sequence tells us how many subsequent bytes are part of that character's encoding. We can classify each byte as either:
- A single byte (valid if the first bit is
0
).
- A continuation byte (valid if the first two bits are
10
).
- A leading byte (valid if it starts with
110
, 1110
, or 11110
depending on the byte sequence length).
- Traverse the Array:
- For each byte in the array, check whether it is a valid start byte or continuation byte. If itโs a start byte, determine how many subsequent bytes are expected, then check if they conform to the continuation byte pattern.
- Return Result:
- If we encounter any byte that doesnโt conform to the expected pattern, return
false
. If we successfully traverse the entire array, return true
.
๐ Python Code
python
class Solution:
def validUtf8(self, data: list[int]) -> bool:
n = len(data)
i = 0
while i < n:
# Get the first byte in binary
byte = data[i]
if byte >> 7 == 0:
# 1-byte character
i += 1
elif byte >> 5 == 0b110:
# 2-byte character
if i + 1 < n and data[i + 1] >> 6 == 0b10:
i += 2
else:
return False
elif byte >> 4 == 0b1110:
# 3-byte character
if i + 2 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10:
i += 3
else:
return False
elif byte >> 3 == 0b11110:
# 4-byte character
if i + 3 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10 and data[i + 3] >> 6 == 0b10:
i += 4
else:
return False
else:
# Invalid byte
return False
return True
๐ Step-by-Step Explanation
1. Initialize Variables:
python
n = len(data)
i = 0
- We start with the length of the input
data
and a pointer i
initialized to 0, which we will use to iterate through the array.
2. Examine Each Byte:
python
byte = data[i]
- For each byte in the data, we examine its first few bits to determine if it is a valid UTF-8 byte.
3. Check for 1-byte Character:
python
if byte >> 7 == 0:
i += 1
- If the byte starts with
0
, it is a valid 1-byte character, and we move to the next byte.
4. Check for 2-byte Character:
python
elif byte >> 5 == 0b110:
if i + 1 < n and data[i + 1] >> 6 == 0b10:
i += 2
else:
return False
- If the byte starts with
110
, itโs a 2-byte character. We then check if the next byte starts with 10
(a continuation byte).
5. Check for 3-byte Character:
python
elif byte >> 4 == 0b1110:
if i + 2 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10:
i += 3
else:
return False
- If the byte starts with
1110
, itโs a 3-byte character. We then check if the next two bytes start with 10
(continuation bytes).
6. Check for 4-byte Character:
python
elif byte >> 3 == 0b11110:
if i + 3 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10 and data[i + 3] >> 6 == 0b10:
i += 4
else:
return False
- If the byte starts with
11110
, itโs a 4-byte character. We check if the next three bytes start with 10
(continuation bytes).
7. Return False for Invalid Byte:
python
else:
return False
- If the byte doesnโt match any of the valid UTF-8 patterns, return
False
.
8. Return True After Traversing All Bytes:
python
return True
- If we successfully validate all bytes, return
True
.
๐ก Example Walkthrough
Example 1:
python
Input:
data = [197, 130, 1]
Output:
true
- First Byte (
197 = 11000101
):
- The first 3 bits
110
indicate a 2-byte character. The second byte (130 = 10 000010
) matches the expected continuation byte pattern.
- We successfully parse the first character.
- Second Byte (
130 = 10 000010
):
- Itโs a continuation byte, and we move to the next byte.
- Third Byte (
1 = 00000001
):
- This byte is a valid 1-byte character.
- Result: All bytes are valid UTF-8 characters. Return
True
.
Example 2:
python
Input:
data = [235, 140, 4]
Output:
false
- First Byte (
235 = 11101011
):
- The first 3 bits
111
indicate the start of a 4-byte character. The next three bytes should start with 10
.
- But the second byte (
140 = 10 000011
) does not follow the continuation byte pattern.
- Result: The sequence is invalid, return
False
.
โฑ๏ธ Time & Space Complexity
MetricComplexityTime ComplexityO(n)Space ComplexityO(1)
- Time Complexity:
- We iterate through the
data
array once, so the time complexity is O(n), where n is the length of data
.
- Space Complexity:
- We use a constant amount of extra space for the variables, so the space complexity is O(1).
๐งต Final Thoughts
LeetCode 393 is a great problem for practicing bit manipulation and validating byte sequences against a set of encoding rules. The two-pointer or sequence validation approach is efficient and ensures we check each byte and its corresponding continuation bytes in constant time. This problem is a good exercise in understanding how multi-byte encodings work at the bit level.