2024/05/13

Guava Hashing

Guava Hash 裡面有幾個重要的類別:Hashing, HashFunction, Hasher, HashCode, Funnel, PrimitiveSink

HashFunction

HashFunction 是 stateless function,可將任意的 data 轉換為 fixed number of bits,相同的 input 會得到相同的 output 結果,不同的 input 得到不同的唯一的結果

可產生 Hasher 物件,也可以根據條件回傳 HashCode 結果

    @Test
    public void hashFunction() {
        HashFunction hf = Hashing.md5();
        HashCode hc = hf.newHasher()
                .putLong(10)
                .putString("abc", Charsets.UTF_8)
                .hash();
        System.out.println("hashFunction="+hc.toString());
    }

Hasher

HashFunction 可以使用 stateful Hasher,Hasher 提供 fluent syntax,可將資料填充到 Hash 裡面,並得到 hash value。Hasher 可接受 primitive input, byte array, slices of byte arrays, character sequences, 任意物件產生的 Funnel。

資料透過 putXXX() method 放到 Hasher 的資料源中,有 putByte, putBytes, putShort, putInt, putLong, putFloat, putdouble, putBoolean, putChar, putEnencodedChars, putString, putObject

Funnel

如果有一個類別

class Person {
  final int id;
  final String firstName;
  final String lastName;
  final int birthYear;
}

其 Funnel 會是

Funnel<Person> personFunnel = new Funnel<Person>() {
  @Override
  public void funnel(Person person, PrimitiveSink into) {
    into
        .putInt(person.id)
        .putString(person.firstName, Charsets.UTF_8)
        .putString(person.lastName, Charsets.UTF_8)
        .putInt(birthYear);
  }
};

組合起來的code

    @Test
    public void objectToHashCode() {

        // 需要hash的对象
        Person person = new Person(27, "John", "Wink", 1990);

        Funnel<Person> personFunnel = new Funnel<Person>() {
            @Override
            public void funnel(Person person, PrimitiveSink into) {
                into
                        .putInt(person.id)
                        .putString(person.firstName, Charsets.UTF_8)
                        .putString(person.lastName, Charsets.UTF_8)
                        .putInt(person.birthYear);
            }
        };

        // md5算法
        HashFunction hf = Hashing.md5();
        HashCode hc = hf.newHasher()
                .putString("abc", Charsets.UTF_8)
                .putObject(person, personFunnel)
                .hash();
        System.out.println("objectToHashCode="+hc.toString());
    }

    class Person {
        int id;
        String firstName;
        String lastName;
        int birthYear;

        public Person(int id, String firstName, String lastName, int birthYear) {
            this.id = id;
            this.firstName = firstName;
            this.lastName = lastName;
            this.birthYear = birthYear;
        }
    }

HashCode

當 Hasher 給了所有 input 資料,他的 hash() method 會產生 HashCode 結果

HashCode 支援 euqality testing 例如 asInt(), asLong, asBytes()

另外有 writeBytesTo(array, offset, maxLength) 可將 hash 的 maxLength bytes 寫入 array

BloomFilter

可偵測物件是否 "definitely" 不在 filter 裡面,或是有可能已經被加入 Bloom filter

對於 hash 來說,就是偵測是否 hash value 有碰撞,不同的 input 得到相同的結果

BloomFilter<Person> friends = BloomFilter.create(personFunnel, 500, 0.01);
for (Person friend : friendsList) {
  friends.put(friend);
}
// much later
if (friends.mightContain(someone)) {
  // the probability that someone reached this place if they aren't a friend is
  // 1% we might, for example, start asynchronously loading things for someone
  // while we do a more expensive exact check
}

Hashing

Hashing (Guava: Google Core Libraries for Java HEAD-jre-SNAPSHOT API)

裡面有 static method,可取得某些演算法的 HashFunction

  • adler32()

  • crc32(), crc32c()

  • farmHashFingerprint64()

  • fingerprint2011()

  • goodFastHash(int minimumBits)

  • hmacMd5

  • hmacSha1

  • hmacSha256

  • hmacSha512

  • md5()

  • murmur3_128(), murmur3_32()

  • sha1(), sha256(), sha384(), sha512()

  • sipHash24

    @Test
    public void hashing() {
        // 測試 Hasing 演算法
        String input = "hello, world";
        // MD5
        System.out.println("md5="+Hashing.md5().hashBytes(input.getBytes()).toString());
        // sha256
        System.out.println("sha256="+Hashing.sha256().hashBytes(input.getBytes()).toString());
        // sha512
        System.out.println("sha512="+Hashing.sha512().hashBytes(input.getBytes()).toString());
        // crc32
        System.out.println("crc32="+Hashing.crc32().hashBytes(input.getBytes()).toString());
        // MD5
        System.out.println("MD5="+Hashing.md5().hashUnencodedChars(input).toString());
    }

References

HashingExplained · google/guava Wiki · GitHub

2024/05/06

Guava String Utilities 2

CharMatcher

CharMatcher 用在字元的 trimming, collapsing, removing, retaining

    @Test
    public void char_matchers() {
        String string = "12 ab 34 CD\r\n 56 ef789GH";
        String noControl = CharMatcher.javaIsoControl().removeFrom(string); // remove control characters
        String theDigits = CharMatcher.digit().retainFrom(string); // only the digits
        String spaced = CharMatcher.whitespace().trimAndCollapseFrom(string, ' ');
        // trim whitespace at ends, and replace/collapse whitespace into single spaces
        String noDigits = CharMatcher.javaDigit().replaceFrom(string, "*"); // star out all digits
        String lowerAndDigit = CharMatcher.javaDigit().or(CharMatcher.javaLowerCase()).retainFrom(string);
        // eliminate all characters that aren't digits or lowercase

        assertEquals("12 ab 34 CD 56 ef789GH", noControl);
        assertEquals("123456789", theDigits);
        assertEquals("12 ab 34 CD 56 ef789GH", spaced);
        assertEquals("** ab ** CD\r\n ** ef***GH", noDigits);
        assertEquals("12ab3456ef789", lowerAndDigit);
    }

除了一些已經封裝的methods 以外,通用的 method 有三個

Method Description example
anyOf(CharSequence) 要符合的字元 CharMatcher.anyOf("aeiou")
is(char) 特定的 char
inRange(char, char) 一連串的字元 CharMatcher.inRange('a', 'z')
    @Test
    public void char_matchers2() {
        String string = "12 ab 34 CD\r\n 56";
        String anyOfResult = CharMatcher.anyOf("cdef\r\n").removeFrom(string);
        String isResult = CharMatcher.is('C').removeFrom(string);
        String inRangeResult = CharMatcher.inRange('A', 'Z').removeFrom(string);

        assertEquals("12 ab 34 CD 56", anyOfResult);
        assertEquals("12 ab 34 D\r\n 56", isResult);
        assertEquals("12 ab 34 \r\n 56", inRangeResult);
    }

使用 CharMatcher 的 methods

Method Description
collapseFrom(CharSequence, char) 將一連串的字元,縮小變成一個  ex: WHITESPACE.collapseFrom(string, ' ') 會減少為只有一個空白字元
matchesAllOf(CharSequence) 符合所有字元
removeFrom(CharSequence) 移除符合字元
retainFrom(CharSequence) 保留符合的字元
trimFrom(CharSequence) 去掉 leading, trailing 符合的字元
replaceFrom(CharSequence, CharSequence) 取代字元

Charsets

注意 "不要" 這樣寫

try {
  bytes = string.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
  // how can this possibly happen?
  throw new AssertionError(e);
}

要改用 Charsets

bytes = string.getBytes(Charsets.UTF_8);

CaseFormat

Format Example
LOWER_CAMEL lowerCamel
LOWER_HYPHEN lower-hyphen
LOWER_UNDERSCORE lower_underscore
UPPER_CAMEL UpperCamel
UPPER_UNDERSCORE UPPER_UNDERSCORE
CaseFormat.UPPER_UNDERSCORE.to(CaseFormat.LOWER_CAMEL, "CONSTANT_NAME"));
// returns "constantName"

References

StringsExplained · google/guava Wiki · GitHub