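Elasticsearch: Add flexibility to your data science with inference pipeline aggregations

Elastic 7.6 introduced the inference processor, which runs inference on documents as they pass through an ingest pipeline. Ingest pipelines are powerful and flexible, but they are designed to work at ingest time. To learn more about the inference processor, see the article "Binary classification with Elastic supervised machine learning". But what if your data has already been ingested?

The new Elasticsearch inference pipeline aggregation lets you apply new inference models to data that is already indexed. With this aggregation type you can run machine learning inference at search time, inside an aggregation, and get results on the fly against the latest data. You can keep adopting new models without worrying about reindexing your data in Elasticsearch!

This post shows how to use the inference pipeline aggregation to apply a new model to existing data (the example uses customer service churn data), and then how to display the inference results with Kibana Vega.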


 

Inference pipeline aggregations

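Inference is the process of making predictions with a model trained by a machine learning algorithm. The Elastic Stack supports two types of models: regression and classification. Regression models are trained to find relationships in the data and predict numerical values such as house prices, while classification models predict which class or category a data point belongs to. If you are interested in machine learning with classification models, see my earlier article "Elastic: Binary classification with Elastic supervised machine learning".

Model inputs can be numeric, categorical (text/keyword fields), or even IP addresses. A pipeline aggregation is an aggregation that takes the output of other aggregations as its input. Those outputs can be numeric values, keywords, or even IP addresses.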

 

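Example: applying a new customer churn model

The dataset we use was first presented in the Elastic webinar "Introduction to supervised machine learning". In that webinar we used fictional telecom call record data (from OpenML and Kaggle) to predict which customers will churn based on their history. A subset of the customer records is labeled with the target churn field, which makes it suitable for training a model with supervised learning. Experiments showed that features such as total call charges and the number of customer service calls are particularly important predictors of churn.

Important: when sharing models between clusters, use the for_export parameter of the "get trained models" request to return the model in a format that can be uploaded directly to another cluster.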

 

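Create the inference model

Elasticsearch and Kibana

You can spin up a free trial cluster on Elastic Cloud. If you prefer a local deployment of the Elastic Stack, see my earlier articles on installing Elasticsearch and Kibana.

Machine learning is a Platinum feature, so on a local deployment we need to start the Platinum trial.

That completes the Platinum trial setup. In this article I use Elastic 7.9.

Experimental data

Our experimental data comes from the Customer churn demo. It consists of two parts, calls.csv and customers.csv.

The dataset is derived from several sources: openml, kaggle, and Larose 2014. It has been broken down into line items, and random features have been added. Phone numbers and other data are fabricated, and any resemblance to real data is coincidental.

Import the data

See my earlier article "Kibana: Analyzing CSV data with Data Visualizer" to import the two CSV datasets above into Elasticsearch. It should be very straightforward. After importing calls.csv and customers.csv, we can see the following two indices in Kibana: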

Documents in the two indices look like this:

calls index

{
  "column1" : 0,
  "dialled_number" : "415-421-9401",
  "call_charges" : 0.2647226166736342,
  "call_duration" : 2.569473435514605,
  "@timestamp" : "2020-09-08T10:16:15.019+08:00",
  "phone_number" : "415-382-4657",
  "timestamp" : "2020-09-08 10:16:15.019388"
}

Note that the column1 field above is the first column of the CSV file; it is a sequence number. The calls index contains 1002092 documents.

customers index

{
  "column1" : 0,
  "number_vmail_messages" : 25,
  "account_length" : 128,
  "churn" : 0,
  "customer_service_calls" : 1,
  "voice_mail_plan" : "yes",
  "phone_number" : "415-382-4657",
  "state" : "KS",
  "international_plan" : "no"
}

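Note that the column1 field above is the first column of the CSV file; it is a sequence number. The customers index contains 3333 documents. The churn field indicates whether the customer has churned: 1 means churned, 0 means not churned.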

 

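Create the model

Create an index pattern

We create an index pattern named calls,customers. This kind of index pattern is unusual: it combines two indices with different document structures.

That completes the creation of the index pattern.

Create transforms

Select the calls,customers index pattern:

We choose Group by phone_number and enable Edit JSON config, which makes it easier and faster to write the aggregations below:

In the Pivot configuration object section above, we can paste the following JSON directly: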

{
  "group_by": {
    "phone_number": {
      "terms": {
        "field": "phone_number"
      }
    }
  },
  "aggregations": {
    "call_charges": {
      "sum": {
        "field": "call_charges"
      }
    },
    "churn": {
      "max": {
        "field": "churn"
      }
    },
    "call_duration": {
      "sum": {
        "field": "call_duration"
      }
    },
    "call_count": {
      "value_count": {
        "field": "dialled_number"
      }
    },
    "customer_service_calls": {
      "sum": {
        "field": "customer_service_calls"
      }
    },
    "number_vmail_messages": {
      "sum": {
        "field": "number_vmail_messages"
      }
    },
    "account_length": {
      "scripted_metric": {
        "init_script": "state.account_length = 0",
        "map_script": "state.account_length = params._source.account_length",
        "combine_script": "return state.account_length",
        "reduce_script": "for (d in states) if (d != null) return d"
      }
    },   
    "international_plan": {
      "scripted_metric": {
        "init_script": "state.international_plan = ''",
        "map_script": "state.international_plan = params._source.international_plan",
        "combine_script": "return state.international_plan",
        "reduce_script": "for (d in states) if (d != null) return d"
      }
    },
    "voice_mail_plan": {
      "scripted_metric": {
        "init_script": "state.voice_mail_plan = ''",
        "map_script": "state.voice_mail_plan = params._source.voice_mail_plan",
        "combine_script": "return state.voice_mail_plan",
        "reduce_script": "for (d in states) if (d != null) return d"
      }
    },
    "state": {
      "scripted_metric": {
        "init_script": "state.state = ''",
        "map_script": "state.state = params._source.state",
        "combine_script": "return state.state",
        "reduce_script": "for (d in states) if (d != null) return d"
      }
    }
  }
}

Click the Apply changes button:

Click the Next button:

Click Next:

Click the Create and start button:

Select Discover. We can now see the newly generated churn index:
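
The pivot configured through the UI above can also be assembled programmatically, which avoids retyping the repeated scripted_metric blocks. The following is a minimal Python sketch rather than the author's method: the helper names (first_value, build_pivot, build_transform_body) are made up for illustration, and wrapping the pivot in source/pivot/dest for a create-transform request is an assumption about how you might submit it outside Kibana. The destination index name churn matches the index seen in Discover.

```python
import json


def first_value(field, initial):
    """scripted_metric that copies one customer attribute into the pivot row."""
    return {
        "scripted_metric": {
            "init_script": f"state.{field} = {initial}",
            "map_script": f"state.{field} = params._source.{field}",
            "combine_script": f"return state.{field}",
            "reduce_script": "for (d in states) if (d != null) return d",
        }
    }


def build_pivot():
    """Rebuild the pivot configuration pasted into the transform wizard."""
    aggs = {
        "call_charges": {"sum": {"field": "call_charges"}},
        "churn": {"max": {"field": "churn"}},
        "call_duration": {"sum": {"field": "call_duration"}},
        "call_count": {"value_count": {"field": "dialled_number"}},
        "customer_service_calls": {"sum": {"field": "customer_service_calls"}},
        "number_vmail_messages": {"sum": {"field": "number_vmail_messages"}},
        "account_length": first_value("account_length", "0"),
    }
    # The three categorical attributes start from an empty string.
    for field in ("international_plan", "voice_mail_plan", "state"):
        aggs[field] = first_value(field, "''")
    return {
        "group_by": {"phone_number": {"terms": {"field": "phone_number"}}},
        "aggregations": aggs,
    }


def build_transform_body(dest_index="churn"):
    """Assumed request body for creating the transform through the REST API."""
    return {
        "source": {"index": ["calls", "customers"]},
        "pivot": build_pivot(),
        "dest": {"index": dest_index},
    }


# Print the body so it can be compared with what the wizard shows.
print(json.dumps(build_transform_body(), indent=2))
```

Keeping the scripted_metric template in one function means a change to the reduce logic only has to be made once.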

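Create Data frame analytics

Next we create a Data frame analytics job from the churn index generated above. This produces the inference model:

Select the churn index:

Click the Continue button:

Click the Create button above:

The job has now been created. Click Data Frame Analytics to go to the management page:

Click the View link:

We can see an accuracy of 90%, which is a very good result for this inference model.

In Dev Tools we can run the following command to look up the model_id: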

GET _ml/inference/churn_analysis*?human=true

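The command above returns:

We need to note down the model_id shown above: churn_analysis-1602730610742. It will differ on your system, so in the commands that follow replace it with your own model_id.

At this point we have used machine learning to create our own inference model.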
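
If you script the lookup instead of reading the ID off the screen, the model_id can be picked out of the response body. The sketch below is a hedged illustration: it assumes the response lists models under a trained_model_configs array, and it assumes the numeric suffix of the ID is a creation timestamp, so that the largest suffix is the newest model. The sample response is hand-written and contains only the field the sketch reads; the second ID in it is invented for the example.

```python
def latest_model_id(response, prefix="churn_analysis"):
    """Return the matching model_id with the largest numeric suffix.

    Assumption: IDs look like '<job name>-<epoch millis>'.
    """
    ids = [
        config["model_id"]
        for config in response.get("trained_model_configs", [])
        if config["model_id"].startswith(prefix + "-")
    ]
    if not ids:
        raise LookupError(f"no trained model found with prefix {prefix!r}")
    return max(ids, key=lambda model_id: int(model_id.rsplit("-", 1)[1]))


# Hand-written sample; only the first ID comes from the article.
sample_response = {
    "count": 2,
    "trained_model_configs": [
        {"model_id": "churn_analysis-1602730610742"},
        {"model_id": "churn_analysis-1500000000000"},  # invented older model
    ],
}

print(latest_model_id(sample_response))
```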


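Use the inference aggregation

The customer call data you just ingested is normalized and split across two indices, customers and calls. A phone number is unique to a customer and is present in both indices; a composite terms aggregation on the phone_number field creates the customer entities, and the features are built with sub-aggregations. The inference aggregation is a parent pipeline aggregation that runs on each bucket, or entity.

Run the following inference aggregation from the console: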

GET calls,customers/_search
{
  "size": 0,
  "aggs": {
    "phone_number": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "phone_number": {
              "terms": {
                "field": "phone_number"
              }
            }
          }
        ]
      },
      "aggs": {
        "call_charges": {
          "sum": {
            "field": "call_charges"
          }
        },
        "call_duration": {
          "sum": {
            "field": "call_duration"
          }
        },
        "call_count": {
          "value_count": {
            "field": "dialled_number"
          }
        },
        "customer_service_calls": {
          "sum": {
            "field": "customer_service_calls"
          }
        },
        "number_vmail_messages": {
          "sum": {
            "field": "number_vmail_messages"
          }
        },
        "account_length": {
          "scripted_metric": {
            "init_script": "state.account_length = 0",
            "map_script": "state.account_length = params._source.account_length",
            "combine_script": "return state.account_length",
            "reduce_script": "for (d in states) if (d != null) return d"
          }
        },
        "international_plan": {
          "scripted_metric": {
            "init_script": "state.international_plan = ''",
            "map_script": "state.international_plan = params._source.international_plan",
            "combine_script": "return state.international_plan",
            "reduce_script": "for (d in states) if (d != null) return d"
          }
        },
        "voice_mail_plan": {
          "scripted_metric": {
            "init_script": "state.voice_mail_plan = ''",
            "map_script": "state.voice_mail_plan = params._source.voice_mail_plan",
            "combine_script": "return state.voice_mail_plan",
            "reduce_script": "for (d in states) if (d != null) return d"
          }
        },
        "state": {
          "scripted_metric": {
            "init_script": "state.state = ''",
            "map_script": "state.state = params._source.state",
            "combine_script": "return state.state",
            "reduce_script": "for (d in states) if (d != null) return d"
          }
        },
        "churn_classification": {
          "inference": {
            "model_id": "churn_analysis-1602730610742",
            "inference_config": {
              "classification": {
                "prediction_field_type": "number"
              }
            },
            "buckets_path": {
              "account_length": "account_length.value",
              "call_charges": "call_charges.value",
              "call_count": "call_count.value",
              "call_duration": "call_duration.value",
              "customer_service_calls": "customer_service_calls.value",
              "international_plan": "international_plan.value",
              "number_vmail_messages": "number_vmail_messages.value",
              "state": "state.value"
            }
          }
        }
      }
    }
  }
}

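Admittedly this looks complicated, because every feature the model expects is computed by a sub-aggregation. But let's focus on the inference configuration, which is relatively simple.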

        "churn_classification": {
          "inference": {
            "model_id": "churn_analysis-1602730610742",
            "inference_config": {
              "classification": {
                "prediction_field_type": "number"
              }
            },
            "buckets_path": {
              "account_length": "account_length.value",
              "call_charges": "call_charges.value",
              "call_count": "call_count.value",
              "call_duration": "call_duration.value",
              "customer_service_calls": "customer_service_calls.value",
              "international_plan": "international_plan.value",
              "number_vmail_messages": "number_vmail_messages.value",
              "state": "state.value"
            }
          }
        }

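Note that you need to replace the model_id value above with your own model_id. The required fields are model_id (in this case the ID of the model I trained earlier) and buckets_path (which maps the inputs using the standard pipeline aggregation bucket path syntax). The inference_config section specifies that the output should be a number rather than a string, which will prove useful later.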
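
Because the model_id changes from system to system and every buckets_path entry must point at a sibling aggregation, it can help to generate the inference block and check it before sending the request. The following Python sketch is one way to do that under stated assumptions: the function names inference_agg and unresolved_inputs are hypothetical helpers, and the check only looks at the name before the first dot in each path.

```python
FEATURES = [
    "account_length", "call_charges", "call_count", "call_duration",
    "customer_service_calls", "international_plan",
    "number_vmail_messages", "state",
]


def inference_agg(model_id, features=FEATURES):
    """Build the inference aggregation for the given trained model."""
    return {
        "inference": {
            "model_id": model_id,
            "inference_config": {
                "classification": {"prediction_field_type": "number"}
            },
            # Each model input reads the value of the same-named sub-aggregation.
            "buckets_path": {name: f"{name}.value" for name in features},
        }
    }


def unresolved_inputs(agg, sibling_names):
    """Return buckets_path targets that no sibling aggregation provides."""
    missing = []
    for path in agg["inference"]["buckets_path"].values():
        target = path.split(".", 1)[0]
        if target not in sibling_names:
            missing.append(target)
    return sorted(missing)


# Sub-aggregations defined next to the inference aggregation in the query.
siblings = set(FEATURES) | {"voice_mail_plan"}
agg = inference_agg("churn_analysis-1602730610742")
print(unresolved_inputs(agg, siblings))
```

An empty list means every input named in buckets_path is produced by a sub-aggregation in the same bucket.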

The output of the inference aggregation, together with the other aggregations, looks like this:

  "aggregations" : {
    "phone_number" : {
      "meta" : { },
      "after_key" : {
        "phone_number" : "408-337-7163"
      },
      "buckets" : [
        {
          "key" : {
            "phone_number" : "408-327-6764"
          },
          "doc_count" : 309,
          "voice_mail_plan" : {
            "value" : "no"
          },
          "number_vmail_messages" : {
            "value" : 0.0
          },
          "call_count" : {
            "value" : 308
          },
          "account_length" : {
            "value" : 105
          },
          "state" : {
            "value" : "NE"
          },
          "international_plan" : {
            "value" : "no"
          },
          "call_charges" : {
            "value" : 55.310000000000024
          },
          "customer_service_calls" : {
            "value" : 4.0
          },
          "call_duration" : {
            "value" : 589.5000000000002
          },
          "churn_classification" : {
            "value" : 1,
            "top_classes" : [
              {
                "class_name" : 1,
                "class_probability" : 0.16952640760696347,
                "class_score" : 0.16952640760696347
              },
              {
                "class_name" : 0,
                "class_probability" : 0.8304735923930365,
                "class_score" : 0.07990906234541963
              }
            ],
            "prediction_probability" : 0.16952640760696347,
            "prediction_score" : 0.16952640760696347
          }
        },
      ...

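A value of 0 means the customer is not predicted to churn, and a value of 1 means the customer is predicted to churn, as for the bucket shown above; prediction_probability tells us how confident the model is in the result.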
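
To make the fields of one bucket easier to read, the result can be flattened into a small summary. The sketch below copies the churn_classification values from the sample output above into a Python dictionary; the summarize helper is a hypothetical name. In this sample the reported value coincides with the class that has the highest class_score, not the highest class_probability. Treat that as an observation about this output, not a general rule.

```python
sample_bucket = {
    "key": {"phone_number": "408-327-6764"},
    "doc_count": 309,
    "churn_classification": {
        "value": 1,
        "top_classes": [
            {"class_name": 1,
             "class_probability": 0.16952640760696347,
             "class_score": 0.16952640760696347},
            {"class_name": 0,
             "class_probability": 0.8304735923930365,
             "class_score": 0.07990906234541963},
        ],
        "prediction_probability": 0.16952640760696347,
        "prediction_score": 0.16952640760696347,
    },
}


def summarize(bucket, agg_name="churn_classification"):
    """Flatten one composite bucket into the fields worth reading."""
    result = bucket[agg_name]
    classes = result["top_classes"]
    return {
        "phone_number": bucket["key"]["phone_number"],
        "predicted_class": result["value"],
        "prediction_probability": result["prediction_probability"],
        "top_by_score": max(classes, key=lambda c: c["class_score"])["class_name"],
        "top_by_probability": max(
            classes, key=lambda c: c["class_probability"]
        )["class_name"],
    }


summary = summarize(sample_bucket)
print(summary)
```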

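We can see these predictions, but the customers we are really interested in are those the model predicts will churn. To find those fickle customers, we can filter with a bucket_selector aggregation: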

"will_churn_filter": { 
  "bucket_selector": { 
    "buckets_path": { 
      "will_churn": "churn_classification>value" 
    }, 
    "script": "params.will_churn > 0" 
  } 
}

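Now we see only the customers who are predicted to churn. The script requires the aggregation output to be numeric, which is why we set "prediction_field_type": "number" in the inference configuration.

The complete search with the filter looks like this: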

GET calls,customers/_search
{
  "size": 0,
  "aggs": {
    "phone_number": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "phone_number": {
              "terms": {
                "field": "phone_number"
              }
            }
          }
        ]
      },
      "aggs": {
        "call_charges": {
          "sum": {
            "field": "call_charges"
          }
        },
        "call_duration": {
          "sum": {
            "field": "call_duration"
          }
        },
        "call_count": {
          "value_count": {
            "field": "dialled_number"
          }
        },
        "customer_service_calls": {
          "sum": {
            "field": "customer_service_calls"
          }
        },
        "number_vmail_messages": {
          "sum": {
            "field": "number_vmail_messages"
          }
        },
        "account_length": {
          "scripted_metric": {
            "init_script": "state.account_length = 0",
            "map_script": "state.account_length = params._source.account_length",
            "combine_script": "return state.account_length",
            "reduce_script": "for (d in states) if (d != null) return d"
          }
        },
        "international_plan": {
          "scripted_metric": {
            "init_script": "state.international_plan = ''",
            "map_script": "state.international_plan = params._source.international_plan",
            "combine_script": "return state.international_plan",
            "reduce_script": "for (d in states) if (d != null) return d"
          }
        },
        "voice_mail_plan": {
          "scripted_metric": {
            "init_script": "state.voice_mail_plan = ''",
            "map_script": "state.voice_mail_plan = params._source.voice_mail_plan",
            "combine_script": "return state.voice_mail_plan",
            "reduce_script": "for (d in states) if (d != null) return d"
          }
        },
        "state": {
          "scripted_metric": {
            "init_script": "state.state = ''",
            "map_script": "state.state = params._source.state",
            "combine_script": "return state.state",
            "reduce_script": "for (d in states) if (d != null) return d"
          }
        },
        "churn_classification": {
          "inference": {
            "model_id": "churn_analysis-1602730610742",
            "inference_config": {
              "classification": {
                "prediction_field_type": "number"
              }
            },
            "buckets_path": {
              "account_length": "account_length.value",
              "call_charges": "call_charges.value",
              "call_count": "call_count.value",
              "call_duration": "call_duration.value",
              "customer_service_calls": "customer_service_calls.value",
              "international_plan": "international_plan.value",
              "number_vmail_messages": "number_vmail_messages.value",
              "state": "state.value"
            }
          }
        },
        "will_churn_filter": {
          "bucket_selector": {
            "buckets_path": {
              "will_churn": "churn_classification>value"
            },
            "script": "params.will_churn > 0"
          }
        }
      }
    }
  }
}
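
The composite aggregation above returns at most 100 customer buckets per request, and the response carries an after_key for fetching the next page. A minimal Python sketch of that paging loop follows. It is written against a search callable that you supply (a hypothetical parameter, so the sketch runs without a cluster), and it assumes that a page can come back with few or no buckets once bucket_selector has filtered it, which is why the loop stops on a missing after_key instead of on an empty page.

```python
import copy


def iter_buckets(search, body, agg_name="phone_number"):
    """Yield every bucket of a composite aggregation, page by page.

    `search` is any callable that takes a request body and returns the
    parsed search response as a dict.
    """
    body = copy.deepcopy(body)  # do not modify the caller's query
    while True:
        agg = search(body)["aggregations"][agg_name]
        yield from agg["buckets"]
        after_key = agg.get("after_key")
        if after_key is None:
            break
        # Ask the composite aggregation to continue after the last key seen.
        body["aggs"][agg_name]["composite"]["after"] = after_key


# Usage idea (assumption: an elasticsearch-py 7.x client named `es`):
#   churners = list(iter_buckets(
#       lambda b: es.search(index="calls,customers", body=b), query_body))
```

Copying the body first keeps the original query reusable, for example as the data source of a visualization.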

 

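Show the results in Kibana

That's great, but unless you are a computer, the result format is a lot to parse! How can I get a quick overview of who will churn and who won't, ideally visually? Step up the versatile Kibana Vega plugin, which can both transform the data and visualize it.

The data source is the same query as above, and Vega transforms group the classification results by predicted class and produce a count for each class. First, a lookup gives the numeric predicted class values meaningful names.

We create the Vega visualization as follows:

We enter the following code in the panel on the right: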

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json"
  "title": "Telco Churn Predictions"

  // Define the data source
  "data": {
    "url": {

      // Do not apply dashboard context filters
      "%context%": false

      // Which index to search
      "index": "calls,customers"

      // The Inference Pipeline agg      
      "body": {
        "size": 0,
        "aggs": {
          "phone_number": {
            "composite": {
              "size": 100,
              "sources": [
                {
                  "phone_number": {
                    "terms": {
                      "field": "phone_number"
                    }
                  }
                }
              ]
            },
            "aggs": {
              "call_charges": {
                "sum": {
                  "field": "call_charges"
                }
              },
              "call_duration": {
                "sum": {
                  "field": "call_duration"
                }
              },
              "call_count": {
                "value_count": {
                  "field": "dialled_number"
                }
              },
              "customer_service_calls": {
                "sum": {
                  "field": "customer_service_calls"
                }
              },
              "number_vmail_messages": {
                "sum": {
                  "field": "number_vmail_messages"
                }
              },
              "account_length": {
                "scripted_metric": {
                  "init_script": "state.account_length = 0",
                  "map_script": "state.account_length = params._source.account_length",
                  "combine_script": "return state.account_length",
                  "reduce_script": "for (d in states) if (d != null) return d"
                }
              },
              "international_plan": {
                "scripted_metric": {
                  "init_script": "state.international_plan = ''",
                  "map_script": "state.international_plan = params._source.international_plan",
                  "combine_script": "return state.international_plan",
                  "reduce_script": "for (d in states) if (d != null) return d"
                }
              },
              "voice_mail_plan": {
                "scripted_metric": {
                  "init_script": "state.voice_mail_plan = ''",
                  "map_script": "state.voice_mail_plan = params._source.voice_mail_plan",
                  "combine_script": "return state.voice_mail_plan",
                  "reduce_script": "for (d in states) if (d != null) return d"
                }
              },
              "state": {
                "scripted_metric": {
                  "init_script": "state.state = ''",
                  "map_script": "state.state = params._source.state",
                  "combine_script": "return state.state",
                  "reduce_script": "for (d in states) if (d != null) return d"
                }
              },
              "churn_classification": {
                "inference": {
                  "model_id": "churn_analysis-1602730610742",
                  "inference_config": {
                    "classification": {
                      "prediction_field_type": "number"
                    }
                  },
                  "buckets_path": {
                    "account_length": "account_length.value",
                    "call_charges": "call_charges.value",
                    "call_count": "call_count.value",
                    "call_duration": "call_duration.value",
                    "customer_service_calls": "customer_service_calls.value",
                    "international_plan": "international_plan.value",
                    "number_vmail_messages": "number_vmail_messages.value",
                    "state": "state.value"
                  }
                }
              }
            }
          }
        }
      }
    }

/*
For our graph, we only need the list of bucket values.  Use the format.property to discard everything else.
*/
    "format": {"property": "aggregations.phone_number.buckets"}
  },
  
  /*
  The aggregation result tree needs to be transformed into a format we can plot.
  In this case, group by the predicted class and count the docs of each class.
  First, give the classification field an expressive name.
  */
  "transform": [
    {
      "lookup": "churn_classification.value",
      "from": {
        "data": {   
          "values": [
            {"category": 0, "classification_class": "Won't Churn"}, 
            {"category": 1, "classification_class": "Will Churn"}
          ]
        },
        "key": "category",
        "fields": ["classification_class"]
      }
    },
    {
      "aggregate": [
        {
          "op": "count", 
          "as": "class_count"
        }
      ],
      "groupby": ["classification_class"]
    }
  ],

  "mark": "arc",
  "encoding": {
    "theta": {"field": "class_count", "type": "quantitative"},
    "color": {"field": "classification_class", "type": "nominal", "legend": {"title": null }}
  },
 
 "view": {"stroke": null}
}

Click the Update button above:

We can see the proportion of customers predicted to churn versus not to churn.
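
To sanity-check the proportions outside Kibana, the lookup and count that the Vega transforms perform can be reproduced in a few lines of Python. This is a sketch under assumptions: class_counts is a hypothetical helper, the input is the list of buckets from the search response, and a class value with no entry in the lookup table is counted under None. The three buckets in the example are made up.

```python
from collections import Counter

# Same mapping as the lookup table in the Vega specification.
CLASS_NAMES = {0: "Won't Churn", 1: "Will Churn"}


def class_counts(buckets, agg_name="churn_classification"):
    """Count buckets per predicted class, using the readable class names."""
    return Counter(CLASS_NAMES.get(bucket[agg_name]["value"]) for bucket in buckets)


# Made-up buckets reduced to the single field the count needs.
example_buckets = [
    {"churn_classification": {"value": 1}},
    {"churn_classification": {"value": 0}},
    {"churn_classification": {"value": 0}},
]

counts = class_counts(example_buckets)
print(dict(counts))
```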

 

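Conclusion

Everyone wants to use their historical data to make better business decisions, and this is a very practical, direct way to do it. With a machine learning model, the inference aggregation makes it quick and easy to find the customers who are about to churn. Predictions are made in real time, and the results are displayed with a Kibana Vega visualization.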