Elasticsearch: Using the distance feature query to boost relevance

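The distance_feature query boosts the relevance score of documents that are closer to a provided origin date or location. For example, you can use it to give more weight to documents nearer a certain date or place.

You can use the distance_feature query to find the nearest neighbors of a location. You can also use it in the should clause of a bool query, so that its relevance score is added to the score of the bool query.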

Below, we walk through a concrete example of how to use this API.

Preparing the data

We reuse the index from my earlier article on field collapsing, “Elasticsearch: 运用Field collapsing来减少基于单个字段的搜索结果”. In that example we imported the data into Elasticsearch. We can take a look at its mapping:

GET best_games/_mapping
{
  "best_games" : {
    "mappings" : {
      "_meta" : {
        "created_by" : "ml-file-data-visualizer"
      },
      "properties" : {
        "critic_score" : {
          "type" : "long"
        },
        "developer" : {
          "type" : "text"
        },
        "genre" : {
          "type" : "keyword"
        },
        "global_sales" : {
          "type" : "double"
        },
        "id" : {
          "type" : "keyword"
        },
        "image_url" : {
          "type" : "keyword"
        },
        "name" : {
          "type" : "text"
        },
        "platform" : {
          "type" : "keyword"
        },
        "publisher" : {
          "type" : "keyword"
        },
        "user_score" : {
          "type" : "long"
        },
        "year" : {
          "type" : "long"
        }
      }
    }
  }
}

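As the mapping shows, year is stored as a long. That is not what we need here: the distance_feature query only works on date, date_nanos, or geo_point fields, so we want year to be a date, and we have to reindex the data. If you are not familiar with reindexing, see my earlier article “Elasticsearch: Reindex接口”.

Let's define a new index called best_games1: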

PUT best_games1
{
  "mappings": {
    "properties": {
      "critic_score": {
        "type": "long"
      },
      "developer": {
        "type": "text"
      },
      "genre": {
        "type": "keyword"
      },
      "global_sales": {
        "type": "double"
      },
      "id": {
        "type": "keyword"
      },
      "image_url": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      },
      "platform": {
        "type": "keyword"
      },
      "publisher": {
        "type": "keyword"
      },
      "user_score": {
        "type": "long"
      },
      "year": {
        "type": "date",
        "format": "strict_year"
      }
    }
  }
}

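In the mapping above, year is now defined as a date with the strict_year format. We can then copy the documents from best_games into the new best_games1 index with the following command: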

POST _reindex
{
  "source": {
    "index": "best_games"
  },
  "dest": {
    "index": "best_games1"
  }
}

Once the command completes, we can check the document count of the best_games1 index:

GET best_games1/_count

The result is:

{
  "count" : 500,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

The best_games1 index now contains the data we want.

The distance_feature query

Next, let's run some queries. For example, we can search for all documents whose critic_score is at least 90:

GET /best_games1/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "critic_score": {
            "gte": 90
          }
        }
      }
    }
  }
}

The result is:

    "hits" : [
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "hnLfF28BjrINWI3xWOLh",
        "_score" : 0.0,
        "_source" : {
          "id" : "mario-kart-ds-ds-2005",
          "name" : "Mario Kart DS",
          "year" : 2005,
          "platform" : "DS",
          "genre" : "Racing",
          "publisher" : "Nintendo",
          "global_sales" : 23.21,
          "critic_score" : 91,
          "user_score" : 8,
          "developer" : "Nintendo",
          "image_url" : "https://upload.wikimedia.org/wikipedia/en/thumb/a/ad/Mario_Kart_DS_screenshot.png/220px-Mario_Kart_DS_screenshot.png"
        }
      },
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "inLfF28BjrINWI3xWOLh",
        "_score" : 0.0,
        "_source" : {
          "id" : "grand-theft-auto-v-ps3-2013",
          "name" : "Grand Theft Auto V",
          "year" : 2013,
          "platform" : "PS3",
          "genre" : "Action",
          "publisher" : "Take-Two Interactive",
          "global_sales" : 21.04,
          "critic_score" : 97,
          "user_score" : 8,
          "developer" : "Rockstar North",
          "image_url" : "https://pmcvariety.files.wordpress.com/2013/09/gta-v-big.jpg?w=1000&h=563&crop=1"
        }
      },
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "i3LfF28BjrINWI3xWOLh",
        "_score" : 0.0,
        "_source" : {
          "id" : "grand-theft-auto-san-andreas-ps2-2004",
          "name" : "Grand Theft Auto: San Andreas",
          "year" : 2004,
          "platform" : "PS2",
          "genre" : "Action",
          "publisher" : "Take-Two Interactive",
          "global_sales" : 20.81,
          "critic_score" : 95,
          "user_score" : 9,
          "developer" : "Rockstar North",
          "image_url" : "http://4.bp.blogspot.com/-IITyrVJdS50/Udvw7XLG-oI/AAAAAAAASwY/H1j2GYBjXng/s1600/GTA+SA+0.jpg"
        }
      },
 ...

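Some of the documents in this result are quite old and may not be what we are looking for. Every hit also has a _score of 0.0, because a filter clause does not contribute to scoring. So how can we rank documents from more recent years first? The answer is the distance_feature query.

We run the query as follows: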

GET /best_games1/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "critic_score": {
            "gte": 90
          }
        }
      },
      "should": {
        "distance_feature": {
          "field": "year",
          "pivot": "2555d",
          "origin": "now"
        }
      }
    }
  }
}

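Here origin is now and pivot is 2555d (roughly 7 years). The pivot is the distance from the origin at which a document receives half of the boost: a document dated exactly at the origin gets the full boost, a document 7 years away gets half of it, and the score keeps decaying smoothly as the distance grows. The results of this search are: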

    "hits" : [
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "-XLfF28BjrINWI3xWOLi",
        "_score" : 0.63837445,
        "_source" : {
          "id" : "uncharted-4-a-thiefs-end-ps4-2016",
          "name" : "Uncharted 4: A Thief's End",
          "year" : 2016,
          "platform" : "PS4",
          "genre" : "Shooter",
          "publisher" : "Sony Computer Entertainment",
          "global_sales" : 5.38,
          "critic_score" : 93,
          "user_score" : 7,
          "developer" : "Naughty Dog",
          "image_url" : "https://i.ytimg.com/vi/hh5HV4iic1Y/maxresdefault.jpg"
        }
      },
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "T3LfF28BjrINWI3xWOPi",
        "_score" : 0.5850225,
        "_source" : {
          "id" : "the-witcher-3-wild-hunt-ps4-2015",
          "name" : "The Witcher 3: Wild Hunt",
          "year" : 2015,
          "platform" : "PS4",
          "genre" : "Role-Playing",
          "publisher" : "Namco Bandai Games",
          "global_sales" : 3.97,
          "critic_score" : 92,
          "user_score" : 9,
          "developer" : "CD Projekt Red Studio",
          "image_url" : "https://www.godisageek.com/wp-content/uploads/the-witcher-3-monster-1024x576.jpg"
        }
      },
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "inLfF28BjrINWI3xWOPi",
        "_score" : 0.5850225,
        "_source" : {
          "id" : "metal-gear-solid-v-the-phantom-pain-ps4-2015",
          "name" : "Metal Gear Solid V: The Phantom Pain",
          "year" : 2015,
          "platform" : "PS4",
          "genre" : "Action",
          "publisher" : "Konami Digital Entertainment",
          "global_sales" : 3.41,
          "critic_score" : 93,
          "user_score" : 8,
          "developer" : "Kojima Productions, Moby Dick Studio",
          "image_url" : "https://i.ytimg.com/vi/gtgNUFSoHv8/maxresdefault.jpg"
        }
      },
  ...

As the results show, documents from the most recent years appear first and receive higher scores.
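
To make the effect of pivot more concrete, here is a minimal Python sketch of the scoring formula given in the Elasticsearch documentation, relevance_score = boost * pivot / (pivot + distance), where distance is the absolute difference between origin and the field value. The function names and the fixed origin date are assumptions made for illustration. In particular, the article does not say when the query was run, so 2019-12-18 is only a hypothetical stand-in for now.

```python
from datetime import datetime, timedelta


def distance_feature_score(distance: float, pivot: float, boost: float = 1.0) -> float:
    """Documented formula: boost * pivot / (pivot + distance)."""
    return boost * pivot / (pivot + distance)


def date_score(doc_date: datetime, origin: datetime, pivot: timedelta,
               boost: float = 1.0) -> float:
    """Score a date field; distance is the absolute time difference."""
    distance = abs(origin - doc_date)
    return distance_feature_score(distance.total_seconds(),
                                  pivot.total_seconds(), boost)


pivot = timedelta(days=2555)     # same pivot as the query above ("2555d")
origin = datetime(2019, 12, 18)  # hypothetical "now", an assumption for illustration

for year in (2016, 2015, 2004):
    # With the strict_year format, a value such as 2016 is indexed as 2016-01-01.
    doc_date = datetime(year, 1, 1)
    print(year, round(date_score(doc_date, origin, pivot), 4))
```

With this assumed origin, the values for 2016 and 2015 come out close to the scores listed above, and a document exactly one pivot away from the origin scores 0.5. A different origin date would shift all of the numbers.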

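Similarly, we can boost documents based on location. The following bool search returns documents whose name matches chocolate. It also uses a distance_feature query to increase the relevance score of documents whose location value is closer to [-71.3, 41.15]. For this to work, location must be mapped as a geo_point; in the array form, longitude comes first and latitude second. The items index is not created in this article.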

GET /items/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "name": "chocolate"
        }
      },
      "should": {
        "distance_feature": {
          "field": "location",
          "pivot": "1000m",
          "origin": [-71.3, 41.15]
        }
      }
    }
  }
}
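
The same formula applies to geo_point fields, with the distance measured in meters. The sketch below is a rough Python illustration, assuming the distance is the great-circle (haversine) distance and using an approximate mean Earth radius. The function names and the sample coordinates are hypothetical and are not taken from any real index.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # approximate mean Earth radius in meters (assumption)


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two (lon, lat) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def geo_score(doc_lon: float, doc_lat: float, origin: tuple,
              pivot_m: float, boost: float = 1.0) -> float:
    """boost * pivot / (pivot + distance), with distance in meters."""
    distance = haversine_m(doc_lon, doc_lat, origin[0], origin[1])
    return boost * pivot_m / (pivot_m + distance)


origin = (-71.3, 41.15)  # [lon, lat], as in the query above
pivot_m = 1000.0         # "1000m"

# Hypothetical document locations, for illustration only.
samples = {
    "at the origin": (-71.3, 41.15),
    "a little north": (-71.3, 41.16),
    "far away": (-71.0, 42.0),
}
for label, (lon, lat) in samples.items():
    print(label, round(geo_score(lon, lat, origin, pivot_m), 4))
```

A document located at the origin gets the full boost, one about 1000 m away gets roughly half, and the contribution shrinks toward zero for documents that are far away.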

 

References:

【1】https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-distance-feature-query.html
