HACKER Q&A
📣 qwertygnu

What research questions do you want answered?


Feel free to be as impractical, inane, and/or idealistic as you wish. What have you been curious about and thought, "I wonder what would happen if you ran a study on this..."

Here's a couple examples:

- What numbers are people least/most likely to notice missing in an ordered list of numbers? E.g. If you ask people to quickly scan the list [60,61,62,63,64,65,66,67,68,69], which element when taken out would have the highest/lowest % of people who respond that there are no missing elements? Is it numbers that end in 7? Maybe people always notice when numbers with repeated digits are missed (66).

-


  👤 dalke Accepted Answer ✓
The Jaccard index between two bit-vectors A and B is

    ||A intersection B|| / ||A union B||
    = popcount(A & B) / popcount(A | B)
Let N be the number of bits in the bit-vector.

Consider A_1, A_2, A_3, ... A_i which are distinct bit-vectors in the 2^N-1 possibilities. (Clearly the all-zero bit vector is not a solution since it will have an index of 0.0 with all other bit-vectors.)

For each N, what is the largest i such that all pair-wise Jaccard index values are distinct? And what is an example?

This comes up in testing algorithms in my field of cheminformatics. The bit-vectors are "molecular fingerprints", and my field refers to the Jaccard index as "the Tanimoto score".

If there are ties in the scores then different implementations of the same algorithm break ties arbitrarily.

For cross testing, I want a test set which has no ties. All implementations should give the same results.

This is not urgent. We were able to come up with a random sampling method which found 27 bit vectors for N=64 bits. (I also tried a GA and a Z3 search, but those weren't as effective.)

  010ca5212bb73967        id1
  222257801a012513        id2
  0005683000080038        id3
  efffff7fffffffff        id4
  fbfbdfffedfbdeff        id5
  21602a5034b064a0        id6
  7d01cc7d85503009        id7
  e7ff2efbfa723f4d        id8
  dfe9bcffddffefff        id9
  11eaf6836d1a1b13        id10
  356c55eff7decef9        id11
  b6bfdfda73f7faff        id12
  0200000000002122        id13
  dfdd7fdf7f7dfeff        id14
  12449480c4c7cb1c        id15
  67efdeff75fbbe4a        id16
  d0ae48211cc31910        id17
  77e23fb4dfd848b1        id18
  53c4fffbe5fbf4fb        id19
  b51f9cbffede9f10        id20
  0100200100000281        id21
  f7f60d895cc39d5f        id22
  f997c59d6c75067a        id23
  eeffffffffffdf7f        id24
  0003d010581e0813        id25
  914d85e6e9f9566b        id26
  9d6f3f77b74a779c        id27
Still, I would like to know the maximum value for i.

I brute-forced it in C and found values up to N=8 or so. Starting from N=2: max i = 2, 3, 4, 5, 5, 6, 7. An example solution for N=8 is:

  01 id1
  0e id2
  17 id3
  37 id4
  77 id5
  7f id6
  ff id7