Validating UTF-8 strings using as little as 0.7 cycles per byte – Daniel Lemire's blog

Most strings found on the Internet are encoded using a particular Unicode format called UTF-8. However, not all strings of bytes are valid UTF-8. The rules as to what constitutes a valid UTF-8 string are somewhat arcane. Yet it seems important to quickly validate these strings before you consume them.
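To make the "arcane" rules concrete, here is a minimal byte-at-a-time sketch in C. It is not the fast approach this post is about, just a plain reference check. The function name `is_valid_utf8` is a hypothetical helper chosen for illustration. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogate code points (U+D800 to U+DFFF), and anything above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the len bytes at data are valid UTF-8. */
bool is_valid_utf8(const unsigned char *data, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = data[i];
        size_t n;      /* number of continuation bytes expected */
        uint32_t cp;   /* code point being decoded */
        uint32_t min;  /* smallest code point allowed for this length */

        if (b < 0x80) {               /* ASCII: one byte */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {  /* 110xxxxx: two bytes */
            n = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {  /* 1110xxxx: three bytes */
            n = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {  /* 11110xxx: four bytes */
            n = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;  /* stray continuation byte, or 0xF8..0xFF */
        }

        if (len - i <= n) return false;  /* sequence is truncated */

        for (size_t k = 1; k <= n; k++) {
            unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* must be 10xxxxxx */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                     /* overlong encoding */
        if (cp > 0x10FFFF) return false;                /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate */

        i += n + 1;
    }
    return true;
}
```

For instance, the two bytes `0xC3 0xA9` (the letter é) pass this check, whereas `0xC0 0x80` fails because it is an overlong encoding of U+0000. A check like this branches on every byte, which is part of why a faster validator is worth pursuing.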