Programming & Development / April 10, 2025

LeetCode 393: UTF-8 Validation โ€“ Validate UTF-8 Encoded Data

LeetCode 393 UTF-8 Validation Validate UTF-8 UTF-8 Encoding Algorithm Python Bit Manipulation Byte Sequence Time Complexity Space Complexity

๐Ÿ“˜ Problem Statement

Given an integer array data representing the UTF-8 encoded data, return true if it is a valid UTF-8 encoding, else return false.

A UTF-8 encoding is a variable-width character encoding standard that can encode every character in the Unicode character set. It uses one to four bytes to represent a character, depending on the character's value.

  • 1-byte character: 0xxxxxxx
  • 2-byte character: 110xxxxx 10xxxxxx
  • 3-byte character: 1110xxxx 10xxxxxx 10xxxxxx
  • 4-byte character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

To be a valid UTF-8 encoding, the following rules must be satisfied:

  1. 1-byte character: The first bit is 0, followed by 7 bits of data.
  2. 2-byte character: The first 3 bits are 110, followed by 5 bits of data; the second byte starts with 10.
  3. 3-byte character: The first 4 bits are 1110, followed by 4 bits of data; the second and third bytes start with 10.
  4. 4-byte character: The first 5 bits are 11110, followed by 3 bits of data; the second, third, and fourth bytes start with 10.

๐Ÿ“š Example:

python

Input:
data = [197, 130, 1]
Output:
true
python

Input:
data = [235, 140, 4]
Output:
false

๐Ÿง  Key Insight

The main challenge in this problem is to identify whether the given byte sequence adheres to valid UTF-8 encoding rules. We need to process the input as a series of bits and ensure that each byte correctly conforms to one of the 1-4 byte UTF-8 patterns.

๐Ÿ’ก Approach

  1. Bitwise Analysis:
  • Each number in the array represents a byte. We examine the binary representation of each byte to check if it conforms to the expected bit pattern for the UTF-8 encoding.
  1. Identify Byte Sequences:
  • The first byte of each sequence tells us how many subsequent bytes are part of that character's encoding. We can classify each byte as either:
  • A single byte (valid if the first bit is 0).
  • A continuation byte (valid if the first two bits are 10).
  • A leading byte (valid if it starts with 110, 1110, or 11110 depending on the byte sequence length).
  1. Traverse the Array:
  • For each byte in the array, check whether it is a valid start byte or continuation byte. If itโ€™s a start byte, determine how many subsequent bytes are expected, then check if they conform to the continuation byte pattern.
  1. Return Result:
  • If we encounter any byte that doesnโ€™t conform to the expected pattern, return false. If we successfully traverse the entire array, return true.

๐Ÿ Python Code

python

class Solution:
    def validUtf8(self, data: list[int]) -> bool:
        n = len(data)
        i = 0
        
        while i < n:
            # Get the first byte in binary
            byte = data[i]
            if byte >> 7 == 0:
                # 1-byte character
                i += 1
            elif byte >> 5 == 0b110:
                # 2-byte character
                if i + 1 < n and data[i + 1] >> 6 == 0b10:
                    i += 2
                else:
                    return False
            elif byte >> 4 == 0b1110:
                # 3-byte character
                if i + 2 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10:
                    i += 3
                else:
                    return False
            elif byte >> 3 == 0b11110:
                # 4-byte character
                if i + 3 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10 and data[i + 3] >> 6 == 0b10:
                    i += 4
                else:
                    return False
            else:
                # Invalid byte
                return False
        
        return True

๐Ÿ” Step-by-Step Explanation

1. Initialize Variables:

python

n = len(data)
i = 0
  • We start with the length of the input data and a pointer i initialized to 0, which we will use to iterate through the array.

2. Examine Each Byte:

python

byte = data[i]
  • For each byte in the data, we examine its first few bits to determine if it is a valid UTF-8 byte.

3. Check for 1-byte Character:

python

if byte >> 7 == 0:
    i += 1
  • If the byte starts with 0, it is a valid 1-byte character, and we move to the next byte.

4. Check for 2-byte Character:

python

elif byte >> 5 == 0b110:
    if i + 1 < n and data[i + 1] >> 6 == 0b10:
        i += 2
    else:
        return False
  • If the byte starts with 110, itโ€™s a 2-byte character. We then check if the next byte starts with 10 (a continuation byte).

5. Check for 3-byte Character:

python

elif byte >> 4 == 0b1110:
    if i + 2 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10:
        i += 3
    else:
        return False
  • If the byte starts with 1110, itโ€™s a 3-byte character. We then check if the next two bytes start with 10 (continuation bytes).

6. Check for 4-byte Character:

python

elif byte >> 3 == 0b11110:
    if i + 3 < n and data[i + 1] >> 6 == 0b10 and data[i + 2] >> 6 == 0b10 and data[i + 3] >> 6 == 0b10:
        i += 4
    else:
        return False
  • If the byte starts with 11110, itโ€™s a 4-byte character. We check if the next three bytes start with 10 (continuation bytes).

7. Return False for Invalid Byte:

python

else:
    return False
  • If the byte doesnโ€™t match any of the valid UTF-8 patterns, return False.

8. Return True After Traversing All Bytes:

python

return True
  • If we successfully validate all bytes, return True.

๐Ÿ’ก Example Walkthrough

Example 1:

python

Input:
data = [197, 130, 1]
Output:
true
  1. First Byte (197 = 11000101):
  • The first 3 bits 110 indicate a 2-byte character. The second byte (130 = 10 000010) matches the expected continuation byte pattern.
  • We successfully parse the first character.
  1. Second Byte (130 = 10 000010):
  • Itโ€™s a continuation byte, and we move to the next byte.
  1. Third Byte (1 = 00000001):
  • This byte is a valid 1-byte character.
  1. Result: All bytes are valid UTF-8 characters. Return True.

Example 2:

python

Input:
data = [235, 140, 4]
Output:
false
  1. First Byte (235 = 11101011):
  • The first 3 bits 111 indicate the start of a 4-byte character. The next three bytes should start with 10.
  • But the second byte (140 = 10 000011) does not follow the continuation byte pattern.
  1. Result: The sequence is invalid, return False.

โฑ๏ธ Time & Space Complexity

MetricComplexityTime ComplexityO(n)Space ComplexityO(1)

  • Time Complexity:
  • We iterate through the data array once, so the time complexity is O(n), where n is the length of data.
  • Space Complexity:
  • We use a constant amount of extra space for the variables, so the space complexity is O(1).

๐Ÿงต Final Thoughts

LeetCode 393 is a great problem for practicing bit manipulation and validating byte sequences against a set of encoding rules. The two-pointer or sequence validation approach is efficient and ensures we check each byte and its corresponding continuation bytes in constant time. This problem is a good exercise in understanding how multi-byte encodings work at the bit level.


Comments

No comments yet

Add a new Comment

NUHMAN.COM

Information Technology website for Programming & Development, Web Design & UX/UI, Startups & Innovation, Gadgets & Consumer Tech, Cloud Computing & Enterprise Tech, Cybersecurity, Artificial Intelligence (AI) & Machine Learning (ML), Gaming Technology, Mobile Development, Tech News & Trends, Open Source & Linux, Data Science & Analytics

Categories

Tags

©{" "} Nuhmans.com . All Rights Reserved. Designed by{" "} HTML Codex