ALS Matrix Factorization of User Interest Tags
Alternating Least Squares (ALS) is an optimization method that needs no backpropagation: it is a matrix-factorization technique rooted in linear algebra rather than gradient descent. ALS alternates between holding the user matrix fixed and holding the item matrix fixed, so each step reduces to an ordinary least-squares problem whose closed-form solution directly updates the user or item vectors.
Unlike neural-network models or FM, which rely on backpropagation, ALS never evaluates a chain of derivatives. This makes it simple to implement and well suited to parallel computation on large, sparse datasets.
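To make the alternating closed-form update concrete, here is a minimal NumPy sketch of explicit-feedback ALS on a small dense toy matrix. It is illustrative only (the toy `R`, the rank, and the `reg` value are made up); the Spark job below uses the distributed, implicit-feedback variant of the same idea.

```python
import numpy as np

def als_step(R, fixed, reg):
    """Solve one side of the factorization in closed form while the other side is fixed."""
    k = fixed.shape[1]
    A = fixed.T @ fixed + reg * np.eye(k)   # (k, k) regularized normal-equation matrix
    B = R @ fixed                           # (n, k) right-hand sides, one per row of R
    return np.linalg.solve(A, B.T).T        # each row is a closed-form least-squares solution

rng = np.random.default_rng(0)
R = rng.random((6, 5))                      # toy user x tag weight matrix
U = rng.normal(size=(6, 3))                 # user factors, rank 3
V = rng.normal(size=(5, 3))                 # tag factors, rank 3
for _ in range(10):                         # alternate the two least-squares solves
    U = als_step(R, V, reg=0.1)             # fix tag factors, update user factors
    V = als_step(R.T, U, reg=0.1)           # fix user factors, update tag factors
print(np.linalg.norm(R - U @ V.T))          # reconstruction error shrinks over iterations
```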
Our goal is to train vectors for third-level interest tags. The input is user interest scores of the form (user, tag, weight); ALS matrix factorization then yields user and tag vectors for downstream tasks.
Theory and results: Machine Learning in Practice — Matrix Factorization with ALS
In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.sql.functions import col, udf, expr, collect_set, lit
from pyspark.ml.evaluation import RegressionEvaluator, BinaryClassificationEvaluator
import pandas as pd
import numpy as np
import random
import argparse
import faiss
from typing import Tuple
from loguru import logger
In [2]:
__doc__ = """
ALS模型训练和评估
"""
In [3]:
# 配置Pandas显示选项
pd.options.display.max_colwidth = 500
In [4]:
def in_notebook():
"""
检查当前代码是否在Jupyter Notebook中运行
"""
try:
return get_ipython().__class__.__name__ == 'ZMQInteractiveShell'
except NameError:
return False
In [5]:
def get_spark():
    """
    获取Spark会话,如果不存在则创建一个
    """
    # getOrCreate() already reuses an existing session, so no extra existence check is needed
    spark = SparkSession.builder \
        .appName("ALS") \
        .config("spark.sql.catalogImplementation", "hive") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    return spark
In [8]:
# 获取Spark会话
spark = get_spark()
spark
Out[8]:
SparkSession - hive
Configuration parameters
- When running inside an IPython notebook, global parameters are passed through the `args` object defined below
- When running from the command line, global parameters are passed through the argparse module
In [7]:
# ===============================
# 主函数,执行ALS模型训练和评估
# ===============================
class Args:
seed = 123
dataset = "bigdata_vf_als_user_tag_tuple"
dataset_dt = "20241015"
dataset_pt = "long_obj"
dataset_substr = "5"
maxIter = 10
regParam = 0.01
rank = 10
implicitPrefs = True
alpha = 200
train_test_split = 0.9
als_blocks = 10
split_by_user = True
add_negative_samples = False
neg_sample_ratio = 1.0
transform = "power4"
topk_items = 50
faiss_index_type = "IP"
args = Args()
if not in_notebook():
example = """
# 示例
python train.py --seed 123 --dataset bigdata_vf_als_user_tag_tuple --dataset_dt 20241015 --dataset_pt long_obj --maxIter 10 --regParam 0.01 --rank 10 --implicitPrefs --alpha 200 --train_test_split 0.9 --split_by_user --add_negative_samples --neg_sample_ratio 1.0 --transform power4 --topk_items 50 --faiss_index_type IP
"""
parser = argparse.ArgumentParser(description=__doc__, epilog=example, formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument("--seed", type=int, default=args.seed, help="随机种子")
# dataset group
group = parser.add_argument_group("dataset")
group.add_argument("--dataset", type=str, default=args.dataset, help="数据集表名")
group.add_argument("--dataset_dt", type=str, default=args.dataset_dt, help="数据集日期字段名")
group.add_argument("--dataset_pt", type=str, default=args.dataset_pt, help="数据集分区字段名")
group.add_argument("--dataset_substr", type=str, default=None, help="数据集分区子串,逗号分隔,例如1,4")
# model group
group = parser.add_argument_group("model")
group.add_argument("--maxIter", type=int, default=args.maxIter, help="最大迭代次数")
group.add_argument("--regParam", type=float, default=args.regParam, help="正则化参数")
group.add_argument("--rank", type=int, default=args.rank, help="潜在因子数")
group.add_argument("--alpha", type=float, default=args.alpha, help="ALS模型参数alpha")
group.add_argument("--implicitPrefs", action="store_true", default=args.implicitPrefs, help="是否使用隐式反馈")
group.add_argument("--no_implicitPrefs", dest="implicitPrefs", action="store_false", help="不使用隐式反馈")
group.add_argument("--transform", choices=["power5", "power4", "power3", "power2", "logit", "none"], default=args.transform, help="评分数据转换")
group.add_argument("--als_blocks", type=int, default=args.als_blocks, help="ALS计算的并行度")
# split group
group = parser.add_argument_group("split")
group.add_argument("--train_test_split", type=float, default=args.train_test_split, help="训练集和测试集的分割比例")
group.add_argument("--neg_sample_ratio", type=float, default=args.neg_sample_ratio, help="负样本比例")
group.add_argument("--split_by_user", action="store_true", default=args.split_by_user, help="是否按用户分割数据集")
group.add_argument("--add_negative_samples", action="store_true", default=args.add_negative_samples, help="是否添加训练集负样本")
group.add_argument("--no_add_negative_samples", dest="add_negative_samples", action="store_false", help="不添加训练集负样本")
# faiss group
group = parser.add_argument_group("faiss")
group.add_argument("--topk_items", type=int, default=args.topk_items, help="TopK相似物品数")
group.add_argument("--faiss_index_type", choices=["IP", "L2"], default="IP", help="faiss索引类型")
args = parser.parse_args()
if not args.dataset_substr:
args.dataset_substr = ",".join(str(i) for i in range(10))
Loading the data
- The data comes from a Hive table in the (user_code, tag_code, weight) format required by the ALS model
- For testing only 10% of the user_code values are used; a cluster submission can use the full data
- The data is split into training and test sets; negative samples are added to the test set so that the AUC is meaningful (the raw data contains only positive interactions)
In [6]:
def split_dataset(dataset, train_test_split=0.9, seed=123, split_by_user=True, add_negative_samples=True, neg_sample_ratio=1.0):
"""
分割数据集为训练集和测试集,并根据需要添加负样本
"""
def add_random_negative_samples(df, all_tags, ratio=1.0):
"""为每个用户生成随机负样本"""
def add_negative_samples(row):
user_code = row['user_code']
positive_tags = set(row['positive_tags'])
negative_tags = random.sample(list(all_tags.value - positive_tags), int(len(positive_tags) * ratio))
return [(user_code, tag_code, 0.0) for tag_code in negative_tags]
partitions = df.rdd.getNumPartitions()
df_grouped = df.groupby('user_code').agg(collect_set('tag_code').alias('positive_tags')).repartition(partitions)
rdd_neg = df_grouped.rdd.flatMap(add_negative_samples)
return spark.createDataFrame(rdd_neg, ['user_code', 'tag_code', 'weight'])
# 分割数据集
train_data, test_data = dataset.select("user_code", "tag_code", "weight") \
.randomSplit([train_test_split, 1 - train_test_split], seed=seed)
# 按用户分割(可选)
if split_by_user:
dividend = int(1.0 / (1 - train_test_split))
test_data = test_data.where(f"user_code % {dividend} = 1")
# 添加测试集负样本
all_tags = dataset.select('tag_code').distinct().rdd.map(lambda row: row['tag_code']).collect()
all_tags = spark.sparkContext.broadcast(set(all_tags))
logger.info(f"数据中总标签数:{len(all_tags.value)}")
test_data_neg = add_random_negative_samples(test_data, all_tags, ratio=neg_sample_ratio)
test_data = test_data.union(test_data_neg)
# 添加训练集负样本(可选)
if add_negative_samples:
train_data_neg = add_random_negative_samples(train_data, all_tags, ratio=neg_sample_ratio)
train_data = train_data.union(train_data_neg)
return train_data, test_data
In [9]:
# 加载数据集
dataset = spark.sql(f"""
select * from {args.dataset}
where dt='{args.dataset_dt}' and pt='{args.dataset_pt}'
and substr(uid, 3, 1) in ({args.dataset_substr})
""")
logger.info("数据集的schema:")
dataset.printSchema()
2024-12-11 11:08:07.718 | INFO | __main__:<module>:8 - 数据集的schema:
root
 |-- uid: string (nullable = true)
 |-- user_code: integer (nullable = true)
 |-- tag_id: string (nullable = true)
 |-- tag_code: integer (nullable = true)
 |-- tag_name: string (nullable = true)
 |-- weight: double (nullable = true)
 |-- card_u: long (nullable = true)
 |-- card_t: long (nullable = true)
 |-- dt: string (nullable = true)
 |-- pt: string (nullable = true)
In [10]:
# 分割数据集
train_data, test_data = split_dataset(
dataset,
train_test_split=args.train_test_split,
split_by_user=args.split_by_user,
add_negative_samples=args.add_negative_samples,
seed=args.seed,
neg_sample_ratio=args.neg_sample_ratio
)
2024-12-11 11:11:52.583 | INFO | __main__:split_dataset:32 - 数据中总标签数:34140
In [11]:
test_data.show()
+---------+--------+------------------+
|user_code|tag_code|            weight|
+---------+--------+------------------+
|   134081|     300|            0.7741|
|   134171|     400|            0.8609|
|   135441|     300|            0.5002|
|   136631|     100|0.7268000000000001|
|   138281|     100|            0.6828|
|   139201|     300|            0.5934|
|   139841|     300|            0.5072|
|   140441|     100|            0.4849|
|   144021|     200|             0.688|
|   145171|     200|             0.622|
|   145481|     100|            0.9839|
|   146051|     300|0.5356000000000001|
|   151771|     400|            0.7461|
|   152281|     300|             0.536|
|   152471|     200|            0.5232|
|   152601|     200|             0.624|
|   154951|     200|0.9289000000000001|
|   156181|     400|0.9520000000000001|
|   157831|     100|            0.7539|
|   158221|     200|            0.6084|
+---------+--------+------------------+
only showing top 20 rows
In [12]:
dataset.select("user_code").distinct().count()
Out[12]:
8009382
With 10% of the data there are roughly 34k tags and about 8 million users, each carrying a few dozen tags on average, so the user-tag matrix is highly sparse (on the order of 50 / 34,140 ≈ 0.15% non-zero).
ALS model training
In [13]:
# 设置ALS模型参数
params = {
"maxIter": args.maxIter,
"regParam": args.regParam,
"userCol": 'user_code',
"itemCol": 'tag_code',
"ratingCol": 'weight',
"rank": args.rank,
"coldStartStrategy": 'drop',
"implicitPrefs": args.implicitPrefs,
"alpha": args.alpha,
"numUserBlocks": args.als_blocks,
"numItemBlocks": args.als_blocks
}
als = ALS(**params)
logger.info("模型参数:{}", params)
2024-12-11 11:26:25.980 | INFO | __main__:<module>:17 - 模型参数:{'maxIter': 10, 'regParam': 0.01, 'userCol': 'user_code', 'itemCol': 'tag_code', 'ratingCol': 'weight', 'rank': 10, 'coldStartStrategy': 'drop', 'implicitPrefs': True, 'alpha': 200, 'numUserBlocks': 10, 'numItemBlocks': 10}
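A note on the parameters above: with `implicitPrefs=True`, Spark's ALS follows the Hu–Koren–Volinsky implicit-feedback formulation, which turns each weight $r_{ui}$ into a binary preference plus a confidence. This is why a large `alpha` (200 here) is reasonable even though the raw weights lie in $[0, 1]$:

$$
\min_{X,Y}\ \sum_{u,i} c_{ui}\left(p_{ui} - x_u^\top y_i\right)^2
+ \lambda\left(\sum_u \lVert x_u\rVert^2 + \sum_i \lVert y_i\rVert^2\right),
\qquad p_{ui} = \mathbb{1}\!\left[r_{ui} > 0\right],\quad c_{ui} = 1 + \alpha\, r_{ui}.
$$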
In [14]:
# 分数数据转换
transform = {
"power5": expr("pow(weight, 5)"),
"power4": expr("pow(weight, 4)"),
"power3": expr("pow(weight, 3)"),
"power2": expr("pow(weight, 2)"),
"logit": expr("ln(weight/(1.001 - weight)) + 0.5"),
"none": col("weight")
}
transform_name = args.transform
logger.info(f"使用转换:{transform_name} -> {transform[transform_name]}")
train_data = train_data.withColumn("weight", transform[transform_name])
2024-12-11 11:26:30.720 | INFO | __main__:<module>:12 - 使用转换:power4 -> Column<'pow(weight, 4)'>
In [15]:
# 训练模型
model = als.fit(train_data)
In [16]:
# 模型保存
model_path = f"viewfs:///user_ext/weibo_bigdata_vf/yandi/als/checkpoints/dt={args.dataset_dt}/pt={args.dataset_pt}/train.d_{params['rank']}.imp_{params['implicitPrefs']}.reg_{params['regParam']}.a_{params['alpha']}.it_{params['maxIter']}.tf_{transform_name}.sub_{args.dataset_substr.replace(',', '%')}"
model.write().overwrite().save(model_path)
logger.info(f"Model saved to {model_path}")
2024-12-11 11:54:05.846 | INFO | __main__:<module>:5 - Model saved to viewfs:///user_ext/weibo_bigdata_vf/yandi/als/checkpoints/dt=20241015/pt=long_obj/train.d_10.imp_True.reg_0.01.a_200.it_10.tf_power4.sub_5
In [ ]:
## 加载模型
# model_path = r"viewfs:///user_ext/weibo_bigdata_vf/yandi/als/checkpoints/dt=20241015/pt=long_obj/train.d_10.imp_True.reg_0.01.a_200.0.it_10.tf_power4.sub_0%5"
# model = ALSModel.load(model_path)
# logger.info(f"Model loaded from {model_path}")
Numeric evaluation
- RMSE: root-mean-square error; lower is better
- AUC: the fraction of positive/negative pairs whose order is predicted correctly; higher is better, and 0.5 corresponds to random guessing. Strictly speaking GAUC (per-user AUC) should be used, but it is simplified here; see the sketch below.
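Since the bullet above mentions GAUC, here is a minimal, illustrative pandas/scikit-learn sketch of a group AUC (per-user AUC averaged with per-user sample counts). It is not part of the training script; the column names simply follow the notebook's `user_code` / `label` / `prediction` conventions.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def gauc(df: pd.DataFrame, user_col="user_code", label_col="label", score_col="prediction") -> float:
    """Weighted average of per-user AUC; users with only one class are skipped."""
    total, weight = 0.0, 0
    for _, g in df.groupby(user_col):
        if g[label_col].nunique() < 2:   # AUC is undefined when a user has a single class
            continue
        total += roc_auc_score(g[label_col], g[score_col]) * len(g)
        weight += len(g)
    return total / weight if weight else float("nan")
```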
In [17]:
def evaluate(model, data, threshold=0.75):
""" 评估模型性能,计算RMSE和AUC """
pred = model.transform(data).cache()
pred = pred.withColumn("prediction", col("prediction").cast("double"))
pred = pred.withColumn("label", expr(f"if(weight > {threshold}, 1.0, 0.0)"))
metrics = {}
evaluator = RegressionEvaluator(metricName="rmse", labelCol="weight", predictionCol="prediction")
metrics["rmse"] = evaluator.evaluate(pred)
logger.info("Root-mean-square error = " + str(metrics["rmse"]))
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label")
metrics["AUC"] = evaluator.evaluate(pred)
logger.info("AUC = " + str(metrics["AUC"]))
pred.unpersist()
return metrics
In [18]:
# 评估模型
metrics = evaluate(model, test_data, 0.75)
2024-12-11 12:02:13.229 | INFO | __main__:evaluate:10 - Root-mean-square error = 0.2089015628973701
2024-12-11 12:03:13.382 | INFO | __main__:evaluate:14 - AUC = 0.868035915721492
In [19]:
model
Out[19]:
ALSModel: uid=ALS_79934957ad25, rank=10
In [20]:
# 保存评估结果
report = spark.createDataFrame([(
model_path,
args.dataset_dt,
args.dataset_pt,
metrics["rmse"],
metrics["AUC"],
*list(params.values()),
transform_name,
args.split_by_user,
args.add_negative_samples,
args.neg_sample_ratio,
args.train_test_split,
)], [
"model_path",
"dataset_dt",
"dataset_pt",
"rmse", "AUC"] + \
list(params.keys()) + [
"transform",
"split_by_user",
"add_negative_samples",
"neg_sample_ratio",
"train_test_split"])
report.toPandas()
Out[20]:
| | model_path | dataset_dt | dataset_pt | rmse | AUC | maxIter | regParam | userCol | itemCol | ratingCol | ... | coldStartStrategy | implicitPrefs | alpha | numUserBlocks | numItemBlocks | transform | split_by_user | add_negative_samples | neg_sample_ratio | train_test_split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | viewfs:///user_ext/weibo_bigdata_vf/yandi/als/checkpoints/dt=20241015/pt=long_obj/train.d_10.imp_True.reg_0.01.a_200.it_10.tf_power4.sub_5 | 20241015 | long_obj | 0.208902 | 0.868036 | 10 | 0.01 | user_code | tag_code | weight | ... | drop | True | 200 | 10 | 10 | power4 | True | True | 1.0 | 0.9 |
1 rows × 21 columns
In [21]:
# 保存结果追加到文件
report_path = f"viewfs:///user_ext/weibo_bigdata_vf/yandi/als/checkpoints/dt={args.dataset_dt}/pt={args.dataset_pt}/report.json"
report.write.mode("append").json(report_path)
logger.info(f"Report saved to {report_path}")
2024-12-11 12:04:36.519 | INFO | __main__:<module>:5 - Report saved to viewfs:///user_ext/weibo_bigdata_vf/yandi/als/checkpoints/dt=20241015/pt=long_obj/report.json
Business evaluation
- For each item, retrieve the N items with the highest cosine similarity and eyeball whether the neighbors look reasonable; the item vectors are L2-normalized first, so an inner-product search is equivalent to cosine similarity (see the check after this list)
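As a quick sanity check of the claim that inner products of L2-normalized vectors equal cosine similarity, here is a tiny NumPy sketch (shapes and random values are illustrative only):

```python
import numpy as np

v = np.random.default_rng(1).normal(size=(4, 10))        # 4 toy item vectors, rank 10
v_norm = v / np.linalg.norm(v, axis=1, keepdims=True)    # L2-normalize each row
ip = v_norm @ v_norm.T                                   # inner products of normalized vectors
cos = (v @ v.T) / np.outer(np.linalg.norm(v, axis=1), np.linalg.norm(v, axis=1))
assert np.allclose(ip, cos)                              # identical to cosine similarity
```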
In [22]:
def prepare_item_features(model: ALSModel) -> pd.DataFrame:
""" 获取模型的物品特征矩阵,并与标签信息关联 """
model.itemFactors.createOrReplaceTempView("item_factors")
item_feature = spark.sql("""
SELECT
a.tag_id AS tag_id,
a.code AS tag_code,
a.tag_name AS tag_name,
b.features
FROM yandi_bigdata_vf_tag_dim a
JOIN item_factors b ON a.code = b.id
WHERE dt='working' AND pt='obj'
""")
return item_feature.toPandas()
In [23]:
def normalize_features(item_feature_df: pd.DataFrame) -> Tuple[pd.DataFrame, np.ndarray]:
""" 物品特征矩阵归一化 """
feature_matrix = np.stack(item_feature_df["features"])
logger.info("Feature matrix shape: {}", feature_matrix.shape)
normalized_matrix = feature_matrix / np.linalg.norm(feature_matrix, axis=1, keepdims=True)
item_feature_df["features_norm"] = [
",".join(f"{value:.5f}" for value in row) for row in normalized_matrix
]
return item_feature_df, normalized_matrix
In [24]:
# 物品特征向量标准化
logger.info("准备物品特征...")
item_feature_df = prepare_item_features(model)
item_feature_df, feature_matrix = normalize_features(item_feature_df)
2024-12-11 12:04:57.699 | INFO | __main__:<module>:3 - 准备物品特征...
/data0/spark/spark-3.2.0-bin/python/pyspark/sql/context.py:127: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  FutureWarning
2024-12-11 12:05:00.568 | INFO | __main__:normalize_features:5 - Feature matrix shape: (34140, 10)
In [25]:
def create_faiss_index(index_type: str, rank: int) -> faiss.Index:
index_map = {
"IP": faiss.IndexFlatIP,
"L2": faiss.IndexFlatL2
}
index_class = index_map.get(index_type)
if not index_class:
raise ValueError(f"Invalid faiss_index_type: {index_type}")
return index_class(rank)
In [26]:
def compute_similarity(index: faiss.Index, feature_matrix: np.ndarray, topk_items: int) -> Tuple[np.ndarray, np.ndarray]:
index.add(feature_matrix)
D, I = index.search(feature_matrix, topk_items)
return D, I
In [27]:
def add_topk_related(item_feature_df: pd.DataFrame, I: np.ndarray, D: np.ndarray) -> pd.DataFrame:
""" 计算每个物品的TopK相似物品 """
tag_name_dict = item_feature_df["tag_name"].to_dict()
def compute_topk_related(id_: int):
try:
return "|".join([
f"{tag_name_dict[I[id_][k]]}:{D[id_][k]:.1f}" for k in range(I.shape[1])
])
except Exception as e:
logger.error(e)
return ""
item_feature_df["topk_related"] = [compute_topk_related(id_) for id_ in range(len(item_feature_df))]
return item_feature_df
In [28]:
# 检索每个物品的TopK相似物品
logger.info("计算相似物品...")
index = create_faiss_index(args.faiss_index_type, model.rank)
D, I = compute_similarity(index, feature_matrix, args.topk_items)
item_feature_df = add_topk_related(item_feature_df, I, D)
2024-12-11 12:05:16.475 | INFO | __main__:<module>:3 - 计算相似物品...
In [29]:
def log_matches(tags: list, item_feature_df: pd.DataFrame) -> None:
""" 打印标签的相似物品 """
for tag_name in tags:
matched = item_feature_df.loc[item_feature_df.tag_name == tag_name]
for row in matched.itertuples():
logger.info(f"{tag_name} -> {row.tag_id} -> {row.topk_related}")
In [30]:
# 测试一些标签的相似物品
tags = ["李鸿章", "复习", "瑞幸咖啡", "单板滑雪", "李世石", "德约科维奇", "赵露思", "灌篮高手",
"定妆粉", "延禧攻略", "黄圣依", "徐熙娣", "谏山创", "章子怡", "阴阳师手游", "model y"]
log_matches(tags, item_feature_df)
2024-12-11 12:05:37.194 | INFO | __main__:log_matches:7 - 李鸿章 -> 1042015:historyConcept_8994373d3a93c2fe6ebb625a71c7d195 -> 李鸿章:1.0|成吉思汗:1.0|冯玉祥:1.0|李自成:1.0|黄金荣:1.0|戴笠:1.0|林彪:1.0|崇祯:1.0|洪秀全:1.0|十大元帅:1.0|袁世凯:1.0|蒋经国:1.0|军统:1.0|阎锡山:1.0|粟裕:1.0|朱元璋:1.0|元朝:1.0|抗日:1.0|二战:1.0|周恩来:1.0|解放战争:1.0|许世友:1.0|蒋介石:1.0|八国联军:1.0|汪精卫:1.0|白崇禧:1.0|第二次世界大战:1.0|原子弹:1.0|老视频:1.0|李宗仁:1.0|黄埔军校:1.0|蒋纬国:1.0|希特勒:1.0|明成祖:1.0|明朝:1.0|启功:1.0|战争史:1.0|安禄山:1.0|朝鲜战争:1.0|西安事变:1.0|杜月笙:1.0|汉朝:1.0|北宋:1.0|唐朝:1.0|孙殿英:1.0|努尔哈赤:1.0|孙中山:1.0|秦始皇:1.0|冷战:1.0|岳飞:1.0 2024-12-11 12:05:37.201 | INFO | __main__:log_matches:7 - 复习 -> 1042015:extendtag_143d13cd3598ca3ef00513a952d62628 -> 复习:1.0|知识点:1.0|考证:1.0|刷题:1.0|押题:1.0|证书:1.0|行政处罚法:1.0|备考:1.0|考点:1.0|公司法:1.0|复习计划:1.0|逢考必过:1.0|选择题:1.0|真题:1.0|判断题:1.0|成绩:1.0|考试报名:1.0|模拟题:1.0|国际法:1.0|学校教育:1.0|商经:1.0|行政管理:1.0|学习计划:1.0|鄢梦萱:1.0|民诉法:1.0|商法:1.0|戴鹏:1.0|信用卡诈骗罪:1.0|李晗:1.0|每日一题:1.0|理论法:1.0|抢劫罪:1.0|三国法:1.0|民诉:1.0|考研报名:1.0|五院四系:1.0|法学:1.0|考试大纲:1.0|文学:1.0|考博:1.0|杜洪波:1.0|柏浪涛:1.0|行政法:1.0|行政诉讼法:1.0|非法经营罪:1.0|统考:1.0|英语阅读:1.0|应届毕业生:1.0|考试时间:1.0|合伙企业法:1.0 2024-12-11 12:05:37.201 | INFO | __main__:log_matches:7 - 复习 -> 1042015:extendtag_143d13cd3598ca3ef00513a952d62628 -> 复习:1.0|知识点:1.0|考证:1.0|刷题:1.0|押题:1.0|证书:1.0|行政处罚法:1.0|备考:1.0|考点:1.0|公司法:1.0|复习计划:1.0|逢考必过:1.0|选择题:1.0|真题:1.0|判断题:1.0|成绩:1.0|考试报名:1.0|模拟题:1.0|国际法:1.0|学校教育:1.0|商经:1.0|行政管理:1.0|学习计划:1.0|鄢梦萱:1.0|民诉法:1.0|商法:1.0|戴鹏:1.0|信用卡诈骗罪:1.0|李晗:1.0|每日一题:1.0|理论法:1.0|抢劫罪:1.0|三国法:1.0|民诉:1.0|考研报名:1.0|五院四系:1.0|法学:1.0|考试大纲:1.0|文学:1.0|考博:1.0|杜洪波:1.0|柏浪涛:1.0|行政法:1.0|行政诉讼法:1.0|非法经营罪:1.0|统考:1.0|英语阅读:1.0|应届毕业生:1.0|考试时间:1.0|合伙企业法:1.0 2024-12-11 12:05:37.208 | INFO | __main__:log_matches:7 - 瑞幸咖啡 -> 1042015:foodMenu_5fc6544e33238feaed99287363e1a82a -> 瑞幸咖啡:1.0|欧诗漫:1.0|春季护肤:1.0|薇诺娜精华液:1.0|爆皮:1.0|舒肤佳:1.0|尿酸:1.0|高颅顶:1.0|库迪咖啡:0.9|Nars超方瓶粉底液:0.9|蜜丝婷(Mistine):0.9|六桂福:0.9|彩妆:0.9|霏丝佳:0.9|app:0.9|urban revivo(UR):0.9|手机支架:0.9|修护:0.9|胶原:0.9|塔斯汀:0.9|修丽可(Skin Ceuticals):0.9|祖玛珑:0.9|德克士:0.9|护发:0.9|面颊:0.9|三酸:0.9|亮眼:0.9|生发:0.9|雅萌:0.9|醉象:0.9|徕芬:0.9|打印机:0.9|欧树:0.9|MOKINGRAN(梦金园):0.9|博柏利(Burberry):0.9|Sabon:0.9|倩碧紫胖子卸妆膏:0.9|杯子:0.9|丝芙兰:0.9|头皮护理:0.9|双汇:0.9|面膜贴:0.9|茶百道:0.9|飞利浦:0.9|Hotwind热风:0.9|黛优佳:0.9|冲锋衣:0.9|奥伦纳素:0.9|国货彩妆:0.9|淘宝直播:0.9 2024-12-11 12:05:37.209 | INFO | __main__:log_matches:7 - 瑞幸咖啡 -> 1042015:stock_LKNCY -> 瑞幸咖啡:1.0|JC:0.9|韩雨彤:0.9|费加罗男士:0.9|卡地亚猎豹手表:0.9|王译磊:0.9|睿士ELLE MEN:0.9|王千硕:0.9|马乐婕:0.9|白妍:0.9|金秋:0.9|杨欢:0.9|偏爱人间烟火:0.9|王宇威:0.9|苏易水:0.9|游山恋:0.9|CEO:0.9|李柯以:0.9|PradaGalleria手袋:0.9|云起时:0.9|等不到的等待:0.9|叶十七:0.9|梁思伟:0.9|曹君豪:0.9|郑适:0.9|霓虹甜心:0.9|工作室:0.9|李亦非:0.9|莫欺少年穷:0.9|赵佳:0.9|澄芓:0.9|着迷:0.9|伍怡桥:0.9|尚美:0.9|贾翼瑄:0.9|永不失联的爱:0.9|爱情有烟火:0.9|白昕怡:0.9|王凯沐:0.9|千喆:0.9|何聪睿:0.9|马千欢:0.9|仙台有树:0.9|一念无明:0.9|唐奇:0.9|滤镜:0.9|段美洋:0.9|费加罗FIGARO:0.9|冯祥琨:0.9|翟一莹:0.9 2024-12-11 12:05:37.215 | INFO | __main__:log_matches:7 - 单板滑雪 -> 1042015:sportIceConcept_68e4a62e9330d02b632adb1dc3c477c1 -> 单板滑雪:1.0|悉尼奥运会:1.0|冯雨:1.0|杨家玉:1.0|陈艾森:1.0|冰雪项目:1.0|水球:1.0|伦敦奥运会:1.0|发球:1.0|巴塞罗那奥运会:1.0|冰球:1.0|鲍春来:1.0|雅典奥运会:1.0|陈虹伊:1.0|濑户大也:1.0|跆拳道:1.0|里约热内卢奥运会:1.0|于昕沂:1.0|举重:1.0|皮划艇:1.0|中国举重队:1.0|陈清晨:1.0|Long Live:1.0|何诗蓓:1.0|水上运动:1.0|王仪涵:1.0|汤慕涵:1.0|高志丹:1.0|八一队:1.0|刘天艺:1.0|李子君:1.0|陈雨菲:1.0|王芊懿:1.0|巩立姣:1.0|杨易溪:1.0|全运会:1.0|中国代表团:1.0|罗雪娟:1.0|牛广盛:1.0|万乐天:1.0|广州亚运会:1.0|邓舜阳:1.0|冰壶:1.0|葛楚彤:1.0|吴卿风:1.0|程玉洁:1.0|邢珈宁:1.0|全国冠军:1.0|李冰洁:1.0|罗切特:1.0 2024-12-11 12:05:37.221 | INFO | __main__:log_matches:7 - 李世石 -> 1042015:sportweiqiPlayer_e0463285bdcdc52656e0d4a1f278eeef -> 
李世石:1.0|金明训:1.0|古力:1.0|春兰杯:1.0|李钦诚:1.0|芈昱廷:1.0|廖元赫:1.0|时越:1.0|李维清:1.0|许皓鋐:1.0|李昌镐:1.0|谢尔豪:1.0|申旻埈:1.0|檀啸:1.0|许嘉阳:1.0|柁嘉熹:1.0|三星杯:1.0|杨鼎新:1.0|范廷钰:1.0|姜东润:1.0|党毅飞:1.0|卞相壹:1.0|丁浩:1.0|周泓余:1.0|赵晨宇:1.0|崔精:1.0|屠晓宇:1.0|连笑:1.0|王星昊:1.0|芝野虎丸:1.0|朴廷桓:1.0|辜梓豪:1.0|元晟溱:1.0|聂卫平:1.0|於之莹:1.0|常昊:1.0|井山裕太:1.0|李轩豪:1.0|吴依铭:1.0|申真谞:1.0|围棋:1.0|谢科:0.9|一力辽:0.9|柯洁:0.9|棋牌:0.9|赛季:0.9|卡卡:0.9|小罗:0.9|唐韦星:0.9|巴黎圣日耳曼:0.9 2024-12-11 12:05:37.227 | INFO | __main__:log_matches:7 - 德约科维奇 -> 1042015:sportTennisPlayer_103999953 -> 德约科维奇:1.0|纳达尔:1.0|瓦林卡:1.0|多米尼克蒂姆:1.0|加斯奎特:1.0|费德勒:1.0|德尔波特罗:1.0|亚历山大-兹维列夫:1.0|ATP:1.0|史蒂夫-诺瓦克:1.0|马德里大师赛:1.0|迈阿密大师赛:1.0|罗杰斯杯:1.0|上海大师赛:1.0|罗兰加洛斯:1.0|丹尼尔·梅德韦杰夫:1.0|小威廉姆斯:1.0|安迪穆雷:1.0|阿尔卡拉兹:1.0|尼克克耶高斯:1.0|泰勒·弗里茨:1.0|锦织圭:1.0|科维托娃:1.0|阿加西:1.0|巴黎大师赛:1.0|麦肯罗:1.0|库兹涅佐娃:1.0|斯托瑟:1.0|贝尔滕斯:1.0|伊万尼塞维奇:1.0|卡洛斯·阿尔卡拉斯:1.0|斯蒂芬斯:1.0|罗德·拉沃尔:1.0|辛辛那提大师赛:1.0|西西帕斯:1.0|孔塔:1.0|阿利亚西姆:1.0|拉沃尔杯:1.0|卫冕冠军:1.0|沃兹尼亚奇:1.0|塞伦多罗:1.0|麦迪逊-凯斯:1.0|塞巴斯蒂安-科达:1.0|大威廉姆斯:1.0|蒙特卡洛大师赛:1.0|中国国家田径队:1.0|法网:1.0|扬尼克-辛纳:1.0|霍尔格·鲁内:1.0|水球:1.0 2024-12-11 12:05:37.233 | INFO | __main__:log_matches:7 - 赵露思 -> 1042015:moviePerson_c2459842283015b387530776c04c3985 -> 赵露思:1.0|迪丽热巴:1.0|鞠婧祎:1.0|虞书欣:1.0|刘亦菲:0.9|Angelababy:0.9|白鹿:0.9|杨幂:0.9|赵今麦:0.9|黄晓明:0.9|王鹤棣:0.9|关晓彤:0.9|鹿晗:0.9|刘浩存:0.9|赵丽颖:0.9|古力娜扎:0.9|田曦薇:0.9|白敬亭:0.9|范丞丞:0.9|贾玲:0.9|黄子韬:0.9|吴磊:0.9|Lisa:0.9|壁纸插画:0.9|可爱萌娃:0.8|杨超越:0.8|张雨绮:0.8|杨洋:0.8|王俊凯:0.8|明星盘点:0.8|易烊千玺:0.8|奔跑吧:0.8|萌娃日常:0.8|蔡徐坤:0.8|黄景瑜:0.8|张若昀:0.8|金晨:0.8|林更新:0.8|日常记录:0.8|宋轶:0.8|秦霄贤:0.8|吴谨言:0.8|少年感帅哥:0.8|情感语录:0.8|周冬雨:0.8|杨紫:0.8|减肥:0.8|王一博:0.8|欧阳娜娜:0.8|张婧仪:0.8 2024-12-11 12:05:37.240 | INFO | __main__:log_matches:7 - 灌篮高手 -> 1042015:movie_37910 -> 灌篮高手:1.0|樱木花道:1.0|井上雄彦:1.0|幽游白书:1.0|仙道彰:1.0|流川枫:1.0|三井寿:1.0|水户洋平:1.0|好兆头第二季:1.0|深津一成:1.0|仙流:1.0|宫城良田:1.0|夜翼:1.0|卡嘉莉:1.0|泽北荣治:1.0|久保带人:1.0|基拉:1.0|大卫·田纳特:1.0|中村悠一:1.0|富坚义博:1.0|荒木飞吕彦:1.0|浪客剑心:1.0|乱马1/2:1.0|全职猎人:1.0|银魂:1.0|好兆头:1.0|死神:1.0|异兽魔都:1.0|猎人:1.0|关于地球的运动:1.0|假面骑士OOO:1.0|呪术廻戦:1.0|丹尼尔·雷德克里夫:1.0|神探夏洛克:1.0|黑崎一护:1.0|超蝙:1.0|藤本树:1.0|极乐迪斯科:1.0|伊藤润二:1.0|黄金神威:1.0|心理测量者:1.0|Aniplex:1.0|jojo的奇妙冒险 黄金之风:1.0|谏山创:1.0|妮可·罗宾:1.0|朽木露琪亚:1.0|周刊少年Jump:1.0|最游记:1.0|鲁邦三世:1.0|内田雄马:1.0 2024-12-11 12:05:37.241 | INFO | __main__:log_matches:7 - 灌篮高手 -> 1042015:gameName_10200387 -> 灌篮高手:1.0|行尸走肉:0.9|骑马与砍杀2:0.9|复仇者联盟4:0.9|火影忍者:0.9|龙珠超:0.9|超级龙珠英雄:0.9|蜘蛛侠:英雄远征:0.9|火影忍者OL:0.9|海贼王:0.9|X战警:黑凤凰:0.9|海贼王剧场版:0.9|索隆:0.9|火影忍者之疾风传:0.9|大侦探皮卡丘:0.9|罗宾:0.9|四川美术学院:0.9|全面战争:0.9|七龙珠:0.9|广州美术学院:0.9|率土之滨:0.9|复仇者联盟3:无限战争:0.9|复联3:0.9|玩具总动员4:0.8|灰烬战线:0.8|日本漫画:0.8|乔巴:0.8|海贼王:0.8|NBA吐槽大会:0.8|远征:0.8|莫斯科震中杯:0.8|卢西奥:0.8|魔神英雄传:0.8|狂鼠:0.8|杰-克劳德:0.8|鸟山明:0.8|我爱罗:0.8|拳皇:0.8|布鲁克:0.8|龙珠z:0.8|弗兰奇:0.8|X战警:天启:0.8|暗黑破坏神2:0.8|九尾:0.8|美:0.8|彩虹六号:0.8|死神:0.8|综合球类:0.8|速度与激情9:0.8|卢本伟:0.8 2024-12-11 12:05:37.242 | INFO | __main__:log_matches:7 - 灌篮高手 -> 1042015:cartoonItem_2644 -> 灌篮高手:1.0|井上雄彦:1.0|死神:千年血战篇:1.0|死神:1.0|剑风传奇:1.0|山治:1.0|幽游白书:1.0|蝙蝠侠:1.0|久保带人:1.0|乌索普:1.0|灌篮高手:1.0|周刊少年Jump:1.0|妮可·罗宾:1.0|沙·克洛克达尔:0.9|波特卡斯·D·艾斯:0.9|巴基:0.9|尾田荣一郎:0.9|蜘蛛侠:纵横宇宙:0.9|罗罗诺亚·索隆:0.9|异兽魔都:0.9|佩罗娜:0.9|乌塔:0.9|火影忍者:0.9|进击的巨人:0.9|特拉法尔加·罗:0.9|浦饭幽助:0.9|青山渚:0.9|星际牛仔:0.9|假面骑士Build:0.9|萨博:0.9|超级战队:0.9|jojo的奇妙冒险:0.9|超级战队:0.9|大和:0.9|娜美:0.9|卡卡西:0.9|阿姆罗:0.9|樱木花道:0.9|金刚狼:0.9|欧尔麦特:0.9|新条茜:0.9|街霸:0.9|夜翼:0.9|蜘蛛侠:平行宇宙:0.9|电锯人X电钻人:0.9|古见同学有交流障碍症:0.9|高野麻里佳:0.9|假面骑士剑:0.9|基拉:0.9|赛博朋克:边缘行者:0.9 2024-12-11 12:05:37.248 | INFO | __main__:log_matches:7 - 定妆粉 -> 1042015:beautyProductForm_dd6b307cd28c33c688f820ec7bd9be1c -> 
定妆粉:1.0|粉底霜:1.0|医美面膜:1.0|保湿面膜:1.0|混合皮:1.0|染发剂:1.0|眼线液:1.0|睫毛打底:1.0|精华水:1.0|美宝莲:1.0|眉刷:1.0|红血丝:1.0|清洁面膜:1.0|菲鹿儿focallure:1.0|护理液:1.0|遮瑕膏:1.0|艾杜纱:1.0|悦诗风吟水乳:1.0|悦诗风吟:1.0|化妆棉:1.0|唇膏笔:1.0|小棕瓶:1.0|面膜纸:1.0|眼影膏:1.0|卸妆水:1.0|镜面唇釉:1.0|Urban Decay:1.0|眼唇卸妆液:1.0|粉底刷:1.0|卸妆乳:1.0|水光肌护理:1.0|肌底液:1.0|妆前:1.0|colorkey小黑镜唇釉:1.0|化妆镜:1.0|液体眼影:1.0|磨砂膏:1.0|苏菲娜:1.0|潘达Panda.W:1.0|NYX:1.0|诗佩妮Spenny:1.0|卸妆巾:1.0|隔离霜:1.0|防晒喷雾:1.0|bb霜:1.0|混油:1.0|眼影刷:1.0|美容液:1.0|稚优泉CHIOTURE:1.0|SUQQU:1.0 2024-12-11 12:05:37.253 | INFO | __main__:log_matches:7 - 延禧攻略 -> 1042015:tv_90e66f0b5cecf54ca12c131b679fb92d -> 延禧攻略:1.0|还珠格格:1.0|蒋欣:1.0|欢乐颂:1.0|三十而已:1.0|知否知否应是绿肥红瘦:1.0|步步惊心:0.9|如懿传:0.9|回家的诱惑:0.9|情深深雨濛濛:0.9|我的前半生:0.9|美人心计:0.9|陈建斌:0.9|恶作剧之吻:0.9|清平乐:0.9|蓝盈莹:0.9|父母爱情:0.9|霍建华:0.9|放羊的星星:0.9|何以笙箫默:0.9|千山暮雪:0.9|梅婷:0.9|又见一帘幽梦:0.9|周迅:0.9|蜗居:0.9|刘荷娜:0.9|金粉世家:0.9|芈月传:0.9|匪我思存:0.8|阳光之下:0.8|命中注定我爱你:0.8|刁蛮公主:0.8|上错花轿嫁对郎:0.8|倾世皇妃:0.8|司藤:0.8|新还珠格格:0.8|郭珍霓:0.8|三生三世十里桃花:0.8|金太郎的幸福生活:0.8|她们的名字:0.8|欢乐颂2:0.8|樊胜美:0.8|门第:0.8|请回答1988:0.8|郑元畅:0.8|都挺好:0.8|奋斗:0.8|黄维德:0.8|宫锁珠帘:0.8|左耳:0.8 2024-12-11 12:05:37.260 | INFO | __main__:log_matches:7 - 黄圣依 -> 1042015:moviePerson_5107 -> 黄圣依:1.0|杜江:1.0|孙莉:1.0|霍思燕:1.0|窦骁:1.0|蔡少芬:1.0|应采儿:1.0|麦迪娜:1.0|贾静雯:1.0|袁咏仪:1.0|陈岚(向太):1.0|刘芸:1.0|王子文:0.9|幸福三重奏:0.9|张歆艺:0.9|吴尊:0.9|伊能静:0.9|袁弘:0.9|焦俊艳:0.9|朱丹:0.9|林心如:0.9|钟丽缇:0.9|张嘉倪:0.9|姜潮:0.9|黄奕:0.9|李承铉:0.9|金莎:0.9|陆毅:0.9|田亮:0.9|颖儿:0.9|安以轩:0.9|甜馨:0.9|叶一茜:0.9|陈小春:0.9|鲍蕾:0.9|杜淳:0.9|郭碧婷:0.9|李国毅:0.9|李湘:0.9|林志颖:0.9|爸爸去哪儿:0.9|妻子的浪漫旅行:0.9|陈超:0.9|霍建华:0.9|立威廉:0.9|姚晨:0.9|春日迟迟再出发:0.9|何超莲:0.9|包文婧:0.9|邓莎:0.9 2024-12-11 12:05:37.270 | INFO | __main__:log_matches:7 - 谏山创 -> 1042015:cartoonPeople_488 -> 谏山创:1.0|巨人:1.0|朽木露琪亚:1.0|少年JUMP:1.0|不死不幸:1.0|JOJO的奇妙冒险:石之海:1.0|Aniplex:1.0|全职猎人:1.0|银魂:1.0|呪术廻戦:1.0|集英社:1.0|富坚义博:1.0|乱马1/2:1.0|MAPPA:1.0|荒木飞吕彦:1.0|关于地球的运动:1.0|利威尔:1.0|猎人:1.0|jojo的奇妙冒险 黄金之风:1.0|堀越耕平:1.0|武内崇:1.0|电锯人2:1.0|藤本树:1.0|月刊少女野崎君:1.0|钢之炼金术师:1.0|出胜:1.0|波奇塔:1.0|五条老师:1.0|艾伦:1.0|炼狱杏寿郎:1.0|反叛的鲁路修:1.0|东方仗助:1.0|我推的孩子:1.0|RWBY:1.0|五悠:1.0|Fate Zero:1.0|地狱乐:1.0|平山宽菜:1.0|爆豪胜己:1.0|偶像大师闪耀色彩:1.0|黑崎一护:1.0|阿尔托莉雅:1.0|怪兽8号:1.0|女神异闻录4:1.0|赤坂明:1.0|青之驱魔师:1.0|鬼太郎:1.0|空条徐伦:1.0|银土:1.0|九井谅子:1.0 2024-12-11 12:05:37.276 | INFO | __main__:log_matches:7 - 章子怡 -> 1042015:moviePerson_40998 -> 章子怡:1.0|张柏芝:0.9|林志颖:0.9|高圆圆:0.9|郭晶晶:0.9|向佐:0.9|林心如:0.9|佟丽娅:0.9|郭碧婷:0.9|大S:0.9|李湘:0.9|孙俪:0.9|王力宏:0.9|马伊琍:0.9|汪小菲:0.9|汪峰:0.9|姚晨:0.9|苗苗:0.9|陈岚(向太):0.9|刘涛:0.9|窦骁:0.9|贾静雯:0.9|周冬雨:0.9|伊能静:0.9|霍思燕:0.9|章泽天:0.9|钟丽缇:0.9|刘恺威:0.9|王子文:0.9|具俊晔:0.9|甜馨:0.9|张馨予:0.9|陈妍希:0.9|陈乔恩:0.9|冯绍峰:0.9|小s:0.9|林志玲:0.9|张嘉倪:0.9|何超莲:0.9|贾乃亮:0.9|杜江:0.9|谢霆锋:0.9|范冰冰:0.9|杜淳:0.9|张钧甯:0.9|昆凌:0.8|唐嫣:0.8|张兰:0.8|霍建华:0.8|倪妮:0.8 2024-12-11 12:05:37.283 | INFO | __main__:log_matches:7 - 阴阳师手游 -> 1042015:gameName_9f43a6f58e2801c14726c9cc51cabd22 -> 阴阳师手游:1.0|阴阳师:1.0|网易游戏:1.0|克制:1.0|阴阳师:超鬼王:1.0|破势:1.0|地震鲶:1.0|面灵气:1.0|荒骷髅:1.0|海坊主:1.0|超鬼王:1.0|吸血姬:1.0|幻书启世录:1.0|蝎女:1.0|技巧:1.0|永生之海:1.0|猫之城:1.0|土蜘蛛:1.0|御魂:1.0|LF:1.0|火灵:0.9|涂佛:0.9|薙魂:0.9|NS:0.9|TK:0.9|逢魔之时:0.9|狙击:0.9|神都夜行录:0.9|伤魂鸟:0.9|花合战:0.9|哈利波特:魔法觉醒:0.9|道馆:0.9|石距:0.9|共潜:0.9|铃彦姬:0.9|小松丸:0.9|OPL常规赛:0.9|大闹天宫:0.9|任天堂Switch:0.9|斗技:0.9|ZGDX:0.9|海忍:0.9|镰鼬:0.9|ODG:0.9|渡劫:0.9|式神:0.9|空相面灵气:0.9|沙石镇时光:0.9|万年竹:0.9|阴阳师:百闻牌:0.9 2024-12-11 12:05:37.289 | INFO | __main__:log_matches:7 - model y -> 1042015:carSubBrand_21897151790127c8475f046482069cf5 -> model y:1.0|理想mpv:1.0|model 3:1.0|腾势:1.0|理想L9:1.0|李斌:1.0|何小鹏:1.0|zeekr 001:1.0|aito:1.0|吉利汽车:1.0|蔚来汽车:1.0|零跑汽车:1.0|理想l6:1.0|问界M7:1.0|智己汽车:1.0|哪吒汽车:1.0|高合汽车:1.0|小鹏汽车:1.0|威马汽车:1.0|大众:1.0|李想:1.0|折叠手机:1.0|蒂姆·库克:1.0|仰望:1.0|王自如:1.0|折叠屏:1.0|哪吒S:1.0|问界m5:1.0|鸿蒙智行:1.0|ff 91:1.0|智己ls7:1.0|理想l7:1.0|阿维塔:1.0|小鹏p7:1.0|faraday 
future:1.0|领克:1.0|零跑t03:1.0|cybertruck:0.9|广汽集团:0.9|比亚迪汉:0.9|zeekr 009:0.9|蔚来es6:0.9|model x:0.9|余承东:0.9|零跑c11:0.9|仰望u8:0.9|比亚迪宋pro dm:0.9|aion v:0.9|小鹏g9:0.9|上汽集团:0.9