[Cascavel-pm] Approximate comparison between two strings

Saturday October 15 22:11:22 PDT 2005

Hmm...

I would start by "cleaning" the strings, removing a) special characters
and b) "stop words", common words that carry little information and
could distort the results.

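sub Clean {
  my $str = shift;
  # Lookup table of stop words; matched case-insensitively below.
  my %stop_words = map { $_ => 1 } qw/a an the for and is are to/;

  # Remove special characters (anything that is not a word character
  # or whitespace).
  $str =~ s/[^\w\s]//g;

  # Keep every word that is not a stop word. Filtering word by word
  # also handles repeated and adjacent stop words.
  return join ' ', grep { !$stop_words{lc $_} } split ' ', $str;
}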

Next, I would count the number of words the two strings have in common:

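sub Similarity {
  # Clean both strings first, so punctuation and stop words are ignored.
  my ($str1,$str2) = map { Clean($_) } @_;

  my (%words1,%words2);
  $words1{$_} = 1 for split ' ', $str1;
  $words2{$_} = 1 for split ' ', $str2;

  # A shared word is visited twice (once per string), so the result is
  # 2 * common / (size1 + size2), i.e. the Dice coefficient.
  my ($common,$total) = (0,0);
  foreach (keys %words1, keys %words2) {
    $total++;
    $common++ if exists $words1{$_} and exists $words2{$_};
  }

  # Avoid dividing by zero when both strings are empty after cleaning.
  return $total ? $common/$total : 0;
}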

Examples:

Similarity(
        "WIM: an Information Mine Model for the World Wide Web",
        "WIM: World Wide Web Information Mine Model",
); # => 100%

Similarity(
        "WIM: an Information Mine Model for the World Wide Web",
        "WIM: an Information Mining Model for the Web"
); # => 66.7%

Similarity(
        "A Practical Minimal Perfect Hashing Method",
        "WIM: an Information Mining Model for the Web"
); # => 0
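
To see where the 66.7% in the second example comes from, here is a small
self-contained sketch that spells out the arithmetic. The helper names
word_set and dice_percent are hypothetical (they are not part of the code
above), the stop-word list is copied from Clean, and the sketch assumes
both strings are cleaned before they are compared.

```perl
use strict;
use warnings;

# Hypothetical helper: the set of words left after removing punctuation
# and stop words (same stop-word list as Clean above).
sub word_set {
  my $str = shift;
  my %stop = map { $_ => 1 } qw/a an the for and is are to/;
  $str =~ s/[^\w\s]//g;
  my %set = map { $_ => 1 } grep { !$stop{lc $_} } split ' ', $str;
  return \%set;
}

# Hypothetical helper: 2 * |common| / (|set1| + |set2|), as a percentage.
sub dice_percent {
  my ($set1, $set2) = map { word_set($_) } @_;
  my $common = grep { exists $set2->{$_} } keys %$set1;
  my $total  = keys(%$set1) + keys(%$set2);
  return $total ? 100 * 2 * $common / $total : 0;
}

# Second example: 7 words vs. 5 words, 4 in common => 2*4/(7+5).
printf "%.1f%%\n", dice_percent(
  "WIM: an Information Mine Model for the World Wide Web",
  "WIM: an Information Mining Model for the Web",
);  # prints 66.7%
```

The shared words are WIM, Information, Model and Web; "Mine" and
"Mining" count as different words because no stemming is done.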

I hope this is useful! : )

Cheers,

Nelson

--
Nelson Ferraz
GNU BIS - www.gnubis.com.br