38. Explaining and exemplifying UTF-8, UTF-16, and UTF-32

Character encoding/decoding is important for browsers, databases, text editors, filesystems, networking, and so on, so it’s a major topic for any programmer. Check out the following figure:

Figure 2.1: Representing text with different char sets

In Figure 2.1, we see several Chinese characters represented in UTF-8, UTF-16, and ANSI on a computer screen. But, what are these? What is ANSI? What is UTF-8 and how did we get to it? Why don’t these characters look normal in ANSI?

Well, the story may begin with computers trying to represent characters (such as letters of the alphabet, digits, or punctuation marks). Computers understand/process everything from the real world as a binary representation, that is, as a sequence of 0s and 1s. This means that every character (for instance, A, 5, +, and so on) has to be mapped to a sequence of 0s and 1s.

The process of mapping a character to a sequence of 0s and 1s is known as character encoding or simply encoding. The reverse process of mapping a sequence of 0s and 1s back to a character is known as character decoding or simply decoding. Ideally, an encoding-decoding cycle should return the same character; otherwise, we obtain something that we don’t understand or cannot use.

For instance, a Chinese character should be encoded in the computer’s memory as a sequence of 0s and 1s. Next, when this sequence is decoded, we expect back the same Chinese character. In Figure 2.1, this happens in the left and middle screenshots, while in the right screenshot, the returned character is …. A Chinese speaker will not understand this (actually, nobody will), so something went wrong!

Of course, we don’t have only Chinese characters to represent. We have many other sets of characters grouped in alphabets, emoticons, and so on. A set of characters has well-defined content (for instance, an alphabet has a certain number of well-defined characters) and is known as a character set or, in short, a charset.

Once we have a charset, the problem is to define a set of rules (a standard) that clearly explains how the characters of this charset should be encoded/decoded in the computer’s memory. Without a clear set of rules, encoding and decoding may lead to errors or indecipherable characters. Such a standard is known as an encoding scheme.

One of the first encoding schemes was ASCII.

Introducing ASCII encoding scheme (or single-byte encoding)

ASCII stands for American Standard Code for Information Interchange. This encoding scheme relies on a 7-bit binary system. In other words, each character that is part of the ASCII charset (http://ee.hawaii.edu/~tep/EE160/Book/chap4/subsection2.1.1.1.html) should be representable (encoded) on 7 bits. A 7-bit number can be a decimal between 0 and 127, as in the next figure:

Figure 2.2: ASCII charset encoding

So, ASCII is an encoding scheme based on a 7-bit system that supports 128 different characters. But, we know that computers operate on bytes (octets) and a byte has 8 bits. This means that ASCII is a single-byte encoding scheme that leaves a bit free for each byte. See the following figure:

Figure 2.3: The highlighted bit is left free in ASCII encoding

In ASCII encoding, the letter A is 65, the letter B is 66, and so on. In Java, we can easily check this via the existing API, as in the following simple code:

int decimalA = "A".charAt(0); // 65
String binaryA = Integer.toBinaryString(decimalA); // 1000001

Or, let’s see the encoding of the text Hello World. This time, we prepended the free bit as well, so the result will be 01001000 01100101 01101100 01101100 01101111 0100000 01010111 01101111 01110010 01101100 01100100 (the space shows up with only 7 digits because Integer.toBinaryString() doesn’t emit leading zeros):

char[] chars = "Hello World".toCharArray();
for(char ch : chars) {
  System.out.print("0" + Integer.toBinaryString(ch) + " ");
}

If we perform a match, then we see that 01001000 is H, 01100101 is e, 01101100 is l, 01101111 is o, 0100000 is the space, 01010111 is W, 01110010 is r, and 01100100 is d. So, the ASCII encoding can represent the English alphabet (upper and lower case), digits, the space, punctuation marks, and some special characters.
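If you’d rather see every character padded to a full octet (including the space), a small variation of the previous snippet could look like this (just a sketch, using String.format() for the padding):

char[] chars = "Hello World".toCharArray();
for (char ch : chars) {
  // left-pad the binary representation with zeros up to 8 bits
  String bits = String.format("%8s",
      Integer.toBinaryString(ch)).replace(' ', '0');
  System.out.print(bits + " ");
}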

Besides the core ASCII for English, we also have ASCII extensions, which are basically variations of the original ASCII meant to support other alphabets. Most probably, you’ve heard of ISO-8859-1 (known as ISO Latin-1), which is a famous ASCII extension. But, even with ASCII extensions, there are still a lot of characters in the world that cannot be encoded yet. There are countries that have far more characters than ASCII can encode, and even countries that don’t use alphabets. So, ASCII has its limitations.

I know what you are thinking … let’s use that free bit (2⁷ + 127 = 255). Yes, but even so, we can only go up to 256 characters. Still not enough! It is time to encode characters using more than 1 byte.

Introducing multi-byte encoding

In different parts of the world, people started to create multi-byte encoding schemes (commonly, 2 bytes). For instance, speakers of languages with a lot of characters created encoding schemes such as Shift-JIS (for Japanese) and Big5 (for Chinese), which use 1 or 2 bytes to represent characters.

But, what happens when most countries come up with their own multi-byte encoding schemes trying to cover their special characters, symbols, and so on? Obviously, this leads to huge incompatibilities between the encoding schemes used in different countries. Even worse, some countries have multiple encoding schemes that are totally incompatible with each other. For instance, Japan has three different incompatible encoding schemes, which means that encoding a document with one of them and decoding it with another will lead to a garbled document.

However, this incompatibility was not such a big issue before the Internet, when documents were rarely shared across the globe. Once the Internet arrived and documents started to be massively shared all around the globe, the incompatibility between encoding schemes conceived in isolation (for instance, per country or geographical region) started to be painful.

It was the perfect moment for the Unicode Consortium to be created.

Unicode

In a nutshell, Unicode (https://unicode-table.com/en/) is a universal encoding standard capable of encoding/decoding every possible character in the world (we are talking about hundreds of thousands of characters).

Unicode needs more bytes to represent all these characters. But, Unicode didn’t get involved in this representation. It just assigned a number to each character. This number is named a code point. For instance, the letter A in Unicode is associated with the code point 65 in decimal, and we refer to it as U+0041. This is the prefix U+ followed by 41, the hexadecimal representation of 65. As you can see, in Unicode, A is 65, exactly as in the ASCII encoding. In other words, Unicode is backward compatible with ASCII. As you’ll see soon, this is a big deal, so keep it in mind!
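As a quick illustration (a sketch using only standard java.lang APIs), we can build the U+ notation of a character ourselves:

int codePoint = "A".charAt(0);                         // 65 (fine here, since A fits in a char)
String notation = String.format("U+%04X", codePoint);  // U+0041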

Early versions of Unicode contained only characters with code points up to 65,535 (0xFFFF). Java represents these characters via the 16-bit char data type. For instance, the French ê (e with circumflex) is associated with the code point 234 in decimal or U+00EA in hexadecimal. In Java, we can use charAt() to reveal this for any Unicode character whose code point doesn’t exceed 65,535:

int e = "ê".charAt(0);                // 234
String hexe = Integer.toHexString(e); // ea

We also may see the binary representation of this character:

String binarye = Integer.toBinaryString(e); // 11101010 = 234

Later, Unicode added more and more characters, extending the code space to 1,114,112 code points (the maximum code point being 0x10FFFF). Obviously, the 16-bit Java char was not enough to represent these characters, and calling charAt() was not useful anymore.

Important note

Java 19+ supports Unicode 14.0. The java.lang.Character API supports Level 14 of the Unicode Character Database (UCD). In numbers, we have 47 new emojis, 838 new characters, and 5 new scripts. Java 20+ supports Unicode 15.0, which means 4,489 new characters for java.lang.Character.

In addition, JDK 21 has added a set of methods especially for working with emojis based on their code point. Among these methods, we have boolean isEmoji(int codePoint), boolean isEmojiPresentation(int codePoint), boolean isEmojiModifier(int codePoint), boolean isEmojiModifierBase(int codePoint), boolean isEmojiComponent(int codePoint), and boolean isExtendedPictographic(int codePoint). In the bundled code, you can find a small application showing you how to fetch all available emojis and check if a given string contains emoji. So, we can easily obtain the code point of a character via Character.codePointAt() and pass it as an argument to these methods to determine whether the character is an emoji or not.
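As a rough sketch of this idea (assuming JDK 21+ and using only the methods listed above), checking whether a string contains at least one emoji could look like this:

// 128525 is the code point of an emoji (Smiling Face with Heart-Shaped Eyes)
String text = "I love Java " + new String(Character.toChars(128525));
boolean hasEmoji = text.codePoints()
    .anyMatch(Character::isEmoji);  // true (JDK 21+)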

However, Unicode doesn’t get involved in how these code points are encoded into bits. This is the job of special encoding schemes within Unicode, such as the Unicode Transformation Format (UTF) schemes. Most commonly, we use UTF-32, UTF-16, and UTF-8.

UTF-32

UTF-32 is an encoding scheme for Unicode that represents every code point on 4 bytes (32 bits). For instance, the letter A (having code point 65), which can be encoded on a 7-bit system, is encoded in UTF-32 as in the following figure next to the other two characters:

Figure 2.4: Three sample characters encoded in UTF-32

As you can see in Figure 2.4, UTF-32 uses 4 bytes (fixed length) to represent every character. In the case of the letter A, we see that UTF-32 wasted 3 bytes of memory. This means that converting an ASCII file to UTF-32 will increase its size by 4 times (for instance, a 1KB ASCII file is a 4KB UTF-32 file). Because of this shortcoming, UTF-32 is not very popular.

Java doesn’t list UTF-32 among its standard charsets (the StandardCharsets class); internally, it relies on surrogate pairs instead (introduced in the next section).
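Nevertheless, a UTF-32 Charset instance can usually be obtained by name via Charset.forName() (as we’ll also do at the end of this problem), so we can quickly verify the fixed 4-byte representation with a small sketch:

// a quick sketch; requires java.nio.charset.Charset
byte[] utf32A = "A".getBytes(Charset.forName("UTF-32"));
System.out.println(utf32A.length); // 4 - even the letter A takes 4 bytes in UTF-32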

UTF-16

UTF-16 is an encoding scheme for Unicode that represents every code point on 2 or 4 bytes (not on 3 bytes). UTF-16 has a variable length and uses an optional Byte-Order Mark (BOM), but it is recommended to use UTF-16BE (BE stands for Big-Endian byte order), or UTF-16LE (LE stands for Little-Endian byte order). While more details about Big-Endian vs. Little-Endian are available at https://en.wikipedia.org/wiki/Endianness, the following figure reveals how the orders of bytes differ in UTF-16BE (left side) vs. UTF-16LE (right side) for three characters:

Figure 2.5: UTF-16BE (left side) vs. UTF-16LE (right side)

Since the figure is self-explanatory, let’s move forward. Now, we have to tackle a trickier aspect of UTF-16. We know that in UTF-32, we take the code point and transform it into a 32-bit number and that’s it. But, in UTF-16, we can’t do that every time because we have code points that don’t fit into 16 bits. That being said, UTF-16 uses so-called 16-bit code units. It can use 1 or 2 code units per code point. There are three types of code units, as follows:

  • A code point needs a single code unit: these are 16-bit code units (covering U+0000 to U+D7FF, and U+E000 to U+FFFF)
  • A code point needs 2 code units:
    • The first code unit is named high surrogate and it covers 1,024 values (U+D800 to U+DBFF)
    • The second code unit is named low surrogate and it covers 1,024 values (U+DC00 to U+DFFF)

A high surrogate followed by a low surrogate is named a surrogate pair. Surrogate pairs are needed to represent the so-called supplementary Unicode characters or characters having a code point larger than 65,535 (0xFFFF).

Characters such as the letter A (65) or the Chinese character 暗 (26263) have a code point that can be represented via a single code unit. The following figure shows these characters in UTF-16BE:

Figure 2.6: UTF-16 encoding of A and 暗

This was easy! Now, let’s consider the following figure (the UTF-16 encoding of the Unicode character 😍, Smiling Face with Heart-Shaped Eyes):

Figure 2.7: UTF-16 encoding using a surrogate pair

The character from this figure has a code point of 128525 (or 0x1F60D) and is represented on 4 bytes.

Check the first byte: the sequence of 6 bits, 110110, identifies a high surrogate.

Check the third byte: the sequence of 6 bits, 110111, identifies a low surrogate.

These 12 bits (identifying the high and low surrogates) can be dropped and we keep the remaining 20 bits: 00001111011000001101. We can compute this number as 2⁰ + 2² + 2³ + 2⁹ + 2¹⁰ + 2¹² + 2¹³ + 2¹⁴ + 2¹⁵ = 1 + 4 + 8 + 512 + 1024 + 4096 + 8192 + 16384 + 32768 = 62989 (or 0xF60D in hexadecimal).

Finally, we have to compute 0xF60D + 0x10000 = 0x1F60D, or in decimal, 62989 + 65536 = 128525 (the code point of this Unicode character). We have to add 0x10000 because the 20 bits stored in a surrogate pair represent the code point minus 0x10000 (surrogate pairs are used only for supplementary characters, whose code points start at 0x10000).
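We can also let Java do this arithmetic for us. The following sketch (relying on the standard Character API) splits the code point 128525 into its surrogate pair and puts it back together:

int codePoint = 128525;                          // 0x1F60D
char high = Character.highSurrogate(codePoint);  // 0xD83D (starts with 110110)
char low  = Character.lowSurrogate(codePoint);   // 0xDE0D (starts with 110111)
int back  = Character.toCodePoint(high, low);    // 128525 again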

Java supports UTF-16, UTF-16BE, and UTF-16LE. Actually, UTF-16 is the native character encoding for Java.

UTF-8

UTF-8 is an encoding scheme for Unicode that represents every code point on 1, 2, 3, or 4 bytes. Having this 1- to 4-byte flexibility, UTF-8 uses space in a very efficient way.

Important note

UTF-8 is the most popular encoding scheme that dominates the Internet and applications.

For instance, we know that the code point of the letter A is 65 and it can be encoded using a 7-bit binary representation. The following figure represents this letter encoded in UTF-8:

Figure 2.8: Letter A encoded in UTF-8

This is very cool! UTF-8 has used a single byte to encode A. The first (leftmost) 0 signals that this is a single-byte encoding. Next, let’s see the Chinese character 暗:

Figure 2.9: The Chinese character 暗 encoded in UTF-8

The code point of 暗 is 26263, so UTF-8 uses 3 bytes to represent it. The first byte contains 4 bits (1110) that signal that this is a 3-byte encoding. The next two bytes each start with the 2 bits 10. All these 8 bits can be dropped and we keep only the remaining 16 bits, which gives us the expected code point.

Finally, let’s tackle the following figure:

Figure 2.10: UTF-8 encoding with 4 bytes

This time, the first byte signals that this is a 4-byte encoding via 11110. The remaining 3 bytes start with 10. All these 11 bits can be dropped and we keep only the remaining 21 bits, 000011111011000001101, which gives us the expected code point, 128525.
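To double-check these byte counts from Java, a quick sketch based on getBytes() could look like this (26263 is the code point of the Chinese character 暗 used above, and 128525 is the code point of the emoji from Figure 2.10):

// requires java.nio.charset.StandardCharsets
String a = "A";                                         // code point 65
String chinese = new String(Character.toChars(26263));  // code point 26263
String emoji = new String(Character.toChars(128525));   // code point 128525

System.out.println(a.getBytes(StandardCharsets.UTF_8).length);       // 1
System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length); // 3
System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);   // 4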

In the following figure you can see the UTF-8 template used for encoding Unicode characters:

Figure 2.11: UTF-8 template used for encoding Unicode characters

Did you know that 8 zeros in a row (00000000 – U+0000) are interpreted as NULL? In many languages and protocols, a NULL marks the end of a string, so sending one “accidentally” would be a problem because the rest of the string would not be processed. Fortunately, UTF-8 prevents this issue: a zero byte can appear in the encoded output only if we effectively encode the U+0000 code point.
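We can verify this property with a tiny sketch: in UTF-8, every byte of a multi-byte sequence has its high bit set, so a zero byte never shows up unless U+0000 itself is encoded:

// requires java.nio.charset.StandardCharsets
byte[] utf8 = new String(Character.toChars(26263)) // the 3-byte Chinese character
    .getBytes(StandardCharsets.UTF_8);             // E6 9A 97
for (byte b : utf8) {
  System.out.println(b != 0);                      // true, true, true - no zero byte
}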

Java and Unicode

As long as we use characters with code points that don’t exceed 65,535 (0xFFFF), we can rely on the charAt() method to obtain the code point. Here are some examples:

int cp1 = "A".charAt(0);                   // 65
String hcp1 = Integer.toHexString(cp1);    // 41
String bcp1 = Integer.toBinaryString(cp1); // 1000001
int cp2 = "暗".charAt(0);                  // 26263
String hcp2 = Integer.toHexString(cp2);    // 6697
String bcp2 = Integer.toBinaryString(cp2); // 110011010010111

Based on these examples, we may write a helper method that returns the binary representation of strings whose characters have code points that don’t exceed 65,535 (0xFFFF), as follows (you already saw the imperative version of the following functional code earlier):

public static String strToBinary(String str) {
   String binary = str.chars()
     .mapToObj(Integer::toBinaryString)
     .map(t -> "0" +  t)
     .collect(Collectors.joining(" "));
   return binary;
}

If you run this code against a Unicode character having a code point greater than 65,535 (0xFFFF), then you’ll get the wrong result. You’ll not get an exception or any kind of warning.
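For instance (a quick sketch), calling charAt(0) on a string holding a single emoji silently returns just the high surrogate:

String emoji = new String(Character.toChars(128525)); // code point 128525
int result = emoji.charAt(0); // 55357 (0xD83D) - the high surrogate, not 128525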

So, charAt() covers only a subset of Unicode characters. For covering all Unicode characters, Java provides an API that consists of several methods. For instance, if we replace charAt() with codePointAt(), then we obtain the correct code point in all cases, as you can see in the following figure:

Figure 2.12: charAt() vs. codePointAt()

Check out the last example, c2. Since codePointAt() returns the correct code point (128525), we can obtain the binary representation as follows:

String uc = Integer.toBinaryString(c2); // 11111011000001101

So, if we need a method that returns the binary encoding of any Unicode character, then we can replace the chars() call with the codePoints() call. The codePoints() method returns the code points of the given sequence:

public static String codePointToBinary(String str) {
   String binary = str.codePoints()
      .mapToObj(Integer::toBinaryString)
      .collect(Collectors.joining(" "));
   return binary;
}

The codePoints() method is just one of the methods provided by Java for working with code points. The Java API also includes codePointAt(), offsetByCodePoints(), codePointCount(), codePointBefore(), codePointOf(), and so on. You can find several examples of them in the bundled code, next to this example of obtaining a String from a given code point:

String str1 = String.valueOf(Character.toChars(65)); // A
String str2 = String.valueOf(Character.toChars(128525));

The toChars() method gets a code point and returns the UTF-16 representation via a char[]. The string returned by the first example (str1) has a length of 1 and is the letter A. The second example returns a string of length 2 since the character having the code point 128525 needs a surrogate pair. The returned char[] contains both the high and low surrogates.
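As a related sketch, codePointCount() reports the number of code points rather than the number of chars, and the two differ exactly for such surrogate pairs:

String str2 = String.valueOf(Character.toChars(128525));
int chars = str2.length();                               // 2 (a surrogate pair)
int codePoints = str2.codePointCount(0, str2.length());  // 1 (a single character)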

Finally, let’s have a helper method that allows us to obtain the binary representation of a string for a given encoding scheme:

public static String stringToBinaryEncoding(
      String str, String encoding) {
   final Charset charset = Charset.forName(encoding);
   final byte[] strBytes = str.getBytes(charset);
   final StringBuilder strBinary = new StringBuilder();
   for (byte strByte : strBytes) {
      for (int i = 0; i < 8; i++) {
        strBinary.append((strByte & 128) == 0 ? 0 : 1);
        strByte <<= 1;
      }
      strBinary.append(" ");
   }
   return strBinary.toString().trim();
}

Using this method is quite simple, as you can see in the following examples:

// 00000000 00000000 00000000 01000001
String r1 = Charsets.stringToBinaryEncoding("A", "UTF-32");
// 10010111 01100110
String r2 = Charsets.stringToBinaryEncoding("暗",
              StandardCharsets.UTF_16LE.name());

You can practice more examples in the bundled code.

JDK 18 defaults the charset to UTF-8

Before JDK 18, the default charset was determined by the operating system charset and locale (for instance, on a Windows machine, it could be windows-1252). Starting with JDK 18, the default charset is UTF-8 (Charset.defaultCharset() returns UTF-8). Moreover, having a PrintStream instance, we can find out the charset it uses via the charset() method (starting with JDK 18).
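A minimal sketch for checking both of these at runtime (JDK 18+):

// requires java.nio.charset.Charset
System.out.println(Charset.defaultCharset()); // UTF-8 on JDK 18+
System.out.println(System.out.charset());     // the charset used by System.out (JDK 18+)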

But, the default charset can be explicitly set via the file.encoding and native.encoding system properties at the command line. For instance, you may need to perform such a modification to compile legacy code developed before JDK 18:

// the default charset is computed from native.encoding
java -Dfile.encoding=COMPAT
// the default charset is windows-1252
java -Dfile.encoding=windows-1252

So, since JDK 18, classes that use encoding (for instance, FileReader/FileWriter, InputStreamReader/OutputStreamWriter, PrintStream, Formatter, Scanner, and URLEncoder/URLDecoder) can take advantage of UTF-8 out of the box. For instance, using UTF-8 before JDK 18 for reading a file can be accomplished by explicitly specifying this charset encoding scheme as follows:

try ( BufferedReader br = new BufferedReader(new FileReader(
   chineseUtf8File.toFile(), StandardCharsets.UTF_8))) {
   ...
}

Accomplishing the same thing in JDK 18+ doesn’t require explicitly specifying the charset encoding scheme:

try ( BufferedReader br = new BufferedReader(
   new FileReader(chineseUtf8File.toFile()))) {
   ...
}

However, for System.out and System.err, JDK 18+ still uses the default system charset. So, if you are using System.out/err and you see question marks (?) instead of the expected characters, then most probably you should set UTF-8 via the new properties -Dstdout.encoding and -Dstderr.encoding:

-Dstderr.encoding=utf8 -Dstdout.encoding=utf8

Or, you can set them globally via the _JAVA_OPTIONS environment variable:

_JAVA_OPTIONS="-Dstdout.encoding=utf8 -Dstderr.encoding=utf8"

In the bundled code you can see more examples.