LeetCode 393: UTF-8 Validation – Validate UTF-8 Encoded Data

Programming & Development / April 10, 2025

LeetCode 393: UTF-8 Validation – Validate UTF-8 Encoded Data

LeetCode 393 UTF-8 Validation Validate UTF-8 UTF-8 Encoding Algorithm Python Bit Manipulation Byte Sequence Time Complexity Space Complexity

Copy Link Bookmark Print

📘 Problem Statement

Given an integer array data representing the UTF-8 encoded data, return true if it is a valid UTF-8 encoding, else return false.

A UTF-8 encoding is a variable-width character encoding standard that can encode every character in the Unicode character set. It uses one to four bytes to represent a character, depending on the character's value.

1-byte character: 0xxxxxxx
2-byte character: 110xxxxx 10xxxxxx
3-byte character: 1110xxxx 10xxxxxx 10xxxxxx
4-byte character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

To be a valid UTF-8 encoding, the following rules must be satisfied:

1-byte character: The first bit is 0, followed by 7 bits of data.
2-byte character: The first 3 bits are 110, followed by 5 bits of data; the second byte starts with 10.
3-byte character: The first 4 bits are 1110, followed by 4 bits of data; the second and third bytes start with 10.
4-byte character: The first 5 bits are 11110, followed by 3 bits of data; the second, third, and fourth bytes start with 10.

📚 Example:

python

Input:
data = [197, 130, 1]
Output:
true
python

Input:
data = [235, 140, 4]
Output:
false

🧠 Key Insight

The main challenge in this problem is to identify whether the given byte sequence adheres to valid UTF-8 encoding rules. We need to process the input as a series of bits and ensure that each byte correctly conforms to one of the 1-4 byte UTF-8 patterns.

💡 Approach

Bitwise Analysis:

Each number in the array represents a byte. We examine the binary representation of each byte to check if it conforms to the expected bit pattern for the UTF-8 encoding.

Identify Byte Sequences:

The first byte of each sequence tells us how many subsequent bytes are part of that character's encoding. We can classify each byte as either:
A single byte (valid if the first bit is 0).
A continuation byte (valid if the first two bits are 10).
A leading byte (valid if it starts with 110, 1110, or 11110 depending on the byte sequence length).

Traverse the Array:

For each byte in the array, check whether it is a valid start byte or continuation byte. If it’s a start byte, determine how many subsequent bytes are expected, then check if they conform to the continuation byte pattern.

Return Result:

If we encounter any byte that doesn’t conform to the expected pattern, return false. If we successfully traverse the entire array, return true.

🐍 Python Code

python

class Solution:
    def validUtf8(self, data: list[int]) -> bool:
        n = len(data)
        i = 0
        
        while i < n:
            # Get the first byte in binary
            byte = data[i]
            if byte >> 7 == 0:
                # 1-byte character
                i += 1
            elif byte >> 5 == 0b110:
                # 2-byte character
                if i + 1 < n and data[i + 1] >> 6 == 0b10:
                    i += 2
                else:
                    return False
            elif byte >> 4 == 0b1110:
                # 3-byte character
                if i + 2 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10:
                    i += 3
                else:
                    return False
            elif byte >> 3 == 0b11110:
                # 4-byte character
                if i + 3 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10 and data[i + 3] >> 6 == 0b10:
                    i += 4
                else:
                    return False
            else:
                # Invalid byte
                return False
        
        return True

🔍 Step-by-Step Explanation

1. Initialize Variables:

python

n = len(data)
i = 0

We start with the length of the input data and a pointer i initialized to 0, which we will use to iterate through the array.

2. Examine Each Byte:

python

byte = data[i]

For each byte in the data, we examine its first few bits to determine if it is a valid UTF-8 byte.

3. Check for 1-byte Character:

python

if byte >> 7 == 0:
    i += 1

If the byte starts with 0, it is a valid 1-byte character, and we move to the next byte.

4. Check for 2-byte Character:

python

elif byte >> 5 == 0b110:
    if i + 1 < n and data[i + 1] >> 6 == 0b10:
        i += 2
    else:
        return False

If the byte starts with 110, it’s a 2-byte character. We then check if the next byte starts with 10 (a continuation byte).

5. Check for 3-byte Character:

python

elif byte >> 4 == 0b1110:
    if i + 2 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10:
        i += 3
    else:
        return False

If the byte starts with 1110, it’s a 3-byte character. We then check if the next two bytes start with 10 (continuation bytes).

6. Check for 4-byte Character:

python

elif byte >> 3 == 0b11110:
    if i + 3 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10 and data[i + 3] >> 6 == 0b10:
        i += 4
    else:
        return False

If the byte starts with 11110, it’s a 4-byte character. We check if the next three bytes start with 10 (continuation bytes).

7. Return False for Invalid Byte:

python

else:
    return False

If the byte doesn’t match any of the valid UTF-8 patterns, return False.

8. Return True After Traversing All Bytes:

python

return True

If we successfully validate all bytes, return True.

💡 Example Walkthrough

Example 1:

python

Input:
data = [197, 130, 1]
Output:
true

First Byte (197 = 11000101):

The first 3 bits 110 indicate a 2-byte character. The second byte (130 = 10 000010) matches the expected continuation byte pattern.
We successfully parse the first character.

Second Byte (130 = 10 000010):

It’s a continuation byte, and we move to the next byte.

Third Byte (1 = 00000001):

This byte is a valid 1-byte character.

Result: All bytes are valid UTF-8 characters. Return True.

Example 2:

python

Input:
data = [235, 140, 4]
Output:
false

First Byte (235 = 11101011):

The first 3 bits 111 indicate the start of a 4-byte character. The next three bytes should start with 10.
But the second byte (140 = 10 000011) does not follow the continuation byte pattern.

Result: The sequence is invalid, return False.

⏱️ Time & Space Complexity

MetricComplexityTime ComplexityO(n)Space ComplexityO(1)

Time Complexity:
We iterate through the data array once, so the time complexity is O(n), where n is the length of data.
Space Complexity:
We use a constant amount of extra space for the variables, so the space complexity is O(1).

🧵 Final Thoughts

LeetCode 393 is a great problem for practicing bit manipulation and validating byte sequences against a set of encoding rules. The two-pointer or sequence validation approach is efficient and ensures we check each byte and its corresponding continuation bytes in constant time. This problem is a good exercise in understanding how multi-byte encodings work at the bit level.

Share this article:

Comments

No comments yet

NUHMAN.com