## How to Configure Algolia Search

The mx-space documentation includes a fairly detailed setup guide; other blog frameworks should be broadly similar.
## Index Size Limit

Unfortunately, after configuring everything per the documentation, the log reported an error:

```
16:40:40 ERROR [AlgoliaSearch] algolia 推送错误
16:40:40 ERROR [Event] Record at the position 10 objectID=xxxxxxxx is too big size=12097/10000 bytes. Please have a look at
https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-your-data/in-depth/index-and-records-size-and-usage-limitations/#record-size-limits
```

The cause is clear: one of my posts is too long, and Algolia's free plan allows only 10 KB per record. As someone determined to stay on the free tier, I couldn't accept that, so I immediately set out to fix it.
## Solution

### Approach

With mx-space, once an API token is configured, the JSON meant for manual submission to the Algolia index can be fetched from `/api/v2/search/algolia/import-json`.
It contains a list of posts, pages, and notes; a sample record looks like this:
```json
{
  "title": "南京大学IPv4地址范围",
  "text": "# 动机\n\n<details>\n<summary>动机来自于搭建的网页。由于校内和公网都有搭建....",
  "slug": "nju-ipv4",
  "categoryId": "abcdefg",
  "category": {
    "_id": "abcdefg",
    "name": "其他",
    "slug": "others",
    "id": "abcdefg"
  },
  "id": "1234567",
  "objectID": "1234567",
  "type": "post"
},
```
The `objectID` field is the critical one: every record submitted to Algolia must have a unique `objectID`.
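Before splitting anything, it helps to see which records actually breach the limit. Algolia measures the size of the serialized record, so a minimal check over the exported JSON (assumed saved as `data.json`; the script below does exactly that) looks like:

```python
import json

LIMIT = 10000  # Algolia free-plan cap, in bytes per serialized record

with open("data.json", encoding="utf-8") as f:
    records = json.load(f)

# report every record whose serialized size exceeds the cap
for rec in records:
    size = len(json.dumps(rec, ensure_ascii=False).encode("utf-8"))
    if size > LIMIT:
        print(f"{rec['title']}: {size} bytes")
```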
The approach I came up with was pagination: split any record whose `text` is too long into several chunks, giving each chunk its own `objectID`. Since the overhead (every field except `text`) stays constant, each chunk can carry at most `MAXSIZE` minus that overhead in text bytes; for example, with 500 bytes of overhead and 12,000 bytes of text, each chunk may hold 9990 - 500 = 9490 bytes, so two chunks suffice. Problem solved, right?! (Clearly, I had not yet realized how much trouble this would cause.)

Some of my pages also contain `<style>` and `<script>` blocks; those can simply be deleted with a regex match.

This leads to the following Python script, which edits the JSON downloaded from the endpoint above and submits it to Algolia:
```python
from algoliasearch.search.client import SearchClientSync
import requests
import json
import math
from copy import deepcopy
import re

MAXSIZE = 9990  # stay a little below Algolia's 10,000-byte record limit
APPID = "..."
APPKey = "..."
MXSPACETOKEN = "..."

# Download the index data from mx-space
url = "https://www.do1e.cn/api/v2/search/algolia/import-json"
headers = {
    "Authorization": MXSPACETOKEN,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0",
}
ret = requests.get(url, headers=headers)
ret = ret.json()
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(ret, f, ensure_ascii=False, indent=2)

to_push = []

def json_length(item):
    """Size in bytes of the serialized record, as Algolia counts it."""
    content = json.dumps(item, ensure_ascii=False).encode("utf-8")
    return len(content)

def right_text(text):
    """Check that a byte slice is still valid UTF-8."""
    try:
        text.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def cut_json(item):
    length = json_length(item)
    text_length = len(item["text"].encode("utf-8"))
    # Number of chunks: each chunk may carry at most
    # MAXSIZE - (length - text_length) bytes of text
    n = math.ceil(text_length / (MAXSIZE - length + text_length))
    start = 0
    text_content = item["text"].encode("utf-8")
    for i in range(n):
        new_item = deepcopy(item)
        new_item["objectID"] = f"{item['objectID']}_{i}"
        # the last chunk takes the remainder, so no bytes are dropped
        end = text_length if i == n - 1 else start + text_length // n
        # back up if the cut lands inside a multi-byte UTF-8 character
        # (a CJK character spans several bytes)
        while not right_text(text_content[start:end]):
            end -= 1
        new_item["text"] = text_content[start:end].decode("utf-8")
        start = end
        to_push.append(new_item)

for item in ret:
    # strip <style> and <script> tags
    item["text"] = re.sub(r"<style.*?>.*?</style>", "", item["text"], flags=re.DOTALL)
    item["text"] = re.sub(r"<script.*?>.*?</script>", "", item["text"], flags=re.DOTALL)
    if json_length(item) > MAXSIZE:  # over the limit: split it
        print(f"{item['title']} is too large, cut it")
        cut_json(item)
    else:  # under the limit: still suffix the objectID for consistency
        item["objectID"] = f"{item['objectID']}_0"
        to_push.append(item)

with open("topush.json", "w", encoding="utf-8") as f:
    json.dump(to_push, f, ensure_ascii=False, indent=2)

client = SearchClientSync(APPID, APPKey)
resp = client.replace_all_objects("mx-space", to_push)
print(resp)
```
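Before re-enabling search, it is worth a sanity check that every record now fits. A quick assertion in the same session, reusing `json_length`, `to_push`, and `MAXSIZE` from the script above:

```python
# verify that no chunk still exceeds the size budget
for item in to_push:
    size = json_length(item)
    assert size <= MAXSIZE, f"{item['objectID']} still too big: {size} bytes"
print(f"OK: {len(to_push)} records, all within {MAXSIZE} bytes")
```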
If you use a different blog framework, this is as far as you need to read; I hope it gave you some ideas.

Great. With the search index edited in Python, re-submitted to Algolia, and search enabled in the mx-space admin panel, let's try searching for that over-limit post on JPEG encoding details.

Why are there no results? And why is the backend throwing errors again?
```
17:03:46 ERROR [Catch] Cast to ObjectId failed for value "1234567_0" (type string) at path "_id" for model "posts"
    at SchemaObjectId.cast (entrypoints.js:1073:883)
    at SchemaType.applySetters (entrypoints.js:1187:226)
    at SchemaType.castForQuery (entrypoints.js:1199:338)
    at cast (entrypoints.js:159:5360)
    at Query.cast (entrypoints.js:799:583)
    at Query._castConditions (entrypoints.js:765:9879)
    at Hr.Query._findOne (entrypoints.js:768:4304)
    at Hr.Query.exec (entrypoints.js:784:5145)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Promise.all (index 0)
```
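The failing value `1234567_0` is simply not a valid MongoDB ObjectId, which must be a 24-character hex string (or 12 raw bytes). The same cast failure can be reproduced outside Mongoose, for instance with the `bson` package bundled with PyMongo (assuming it is installed):

```python
from bson import ObjectId
from bson.errors import InvalidId

try:
    ObjectId("1234567_0")  # the suffixed ID generated by the script above
except InvalidId as err:
    print(err)  # '1234567_0' is not a valid ObjectId ...
```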
## Time to Edit the mx-space Code

The log makes it easy to see that mx-space uses the `ObjectId` as the lookup key rather than `id`, so the suffixed `objectID` values no longer cast. The fix: append `_{i}` to each chunk's `objectID` when building the index, and strip the suffix with `objectID.split('_')[0]` before calling `findById` when resolving hits. I located the relevant code and edited it as follows:
```typescript
import type { SearchResponse } from '@algolia/client-search'
import {
  BadRequestException,
  forwardRef,
  Inject,
  Injectable,
  Logger,
} from '@nestjs/common'
import { OnEvent } from '@nestjs/event-emitter'
import { CronExpression } from '@nestjs/schedule'
import { CronDescription } from '~/common/decorators/cron-description.decorator'
import { CronOnce } from '~/common/decorators/cron-once.decorator'
import { BusinessEvents } from '~/constants/business-event.constant'
import { EventBusEvents } from '~/constants/event-bus.constant'
import { isDev } from '~/global/env.global'
import type { SearchDto } from '~/modules/search/search.dto'
import { DatabaseService } from '~/processors/database/database.service'
import type { Pagination } from '~/shared/interface/paginator.interface'
import { transformDataToPaginate } from '~/transformers/paginate.transformer'
import algoliasearch from 'algoliasearch'
import removeMdCodeblock from 'remove-md-codeblock'

import { ConfigsService } from '../configs/configs.service'
import { NoteService } from '../note/note.service'
import { PageService } from '../page/page.service'
import { PostService } from '../post/post.service'

@Injectable()
export class SearchService {
  private readonly logger = new Logger(SearchService.name)

  constructor(
    @Inject(forwardRef(() => NoteService))
    private readonly noteService: NoteService,

    @Inject(forwardRef(() => PostService))
    private readonly postService: PostService,

    @Inject(forwardRef(() => PageService))
    private readonly pageService: PageService,

    private readonly configs: ConfigsService,
    private readonly databaseService: DatabaseService,
  ) {}

  async searchNote(searchOption: SearchDto, showHidden: boolean) {
    const { keyword, page, size } = searchOption
    const select = '_id title created modified nid'

    const keywordArr = keyword
      .split(/\s+/)
      .map((item) => new RegExp(String(item), 'gi'))
    return transformDataToPaginate(
      await this.noteService.model.paginate(
        {
          $or: [{ title: { $in: keywordArr } }, { text: { $in: keywordArr } }],
          $and: [
            { password: { $not: null } },
            { isPublished: { $in: showHidden ? [false, true] : [true] } },
            {
              $or: [
                { publicAt: { $not: null } },
                { publicAt: { $lte: new Date() } },
              ],
            },
          ],
        },
        {
          limit: size,
          page,
          select,
        },
      ),
    )
  }

  async searchPost(searchOption: SearchDto) {
    const { keyword, page, size } = searchOption
    const select = '_id title created modified categoryId slug'
    const keywordArr = keyword
      .split(/\s+/)
      .map((item) => new RegExp(String(item), 'gi'))
    return await this.postService.model.paginate(
      {
        $or: [{ title: { $in: keywordArr } }, { text: { $in: keywordArr } }],
      },
      {
        limit: size,
        page,
        select,
      },
    )
  }

  public async getAlgoliaSearchIndex() {
    const { algoliaSearchOptions } = await this.configs.waitForConfigReady()
    if (!algoliaSearchOptions.enable) {
      throw new BadRequestException('algolia not enable.')
    }
    if (
      !algoliaSearchOptions.appId ||
      !algoliaSearchOptions.apiKey ||
      !algoliaSearchOptions.indexName
    ) {
      throw new BadRequestException('algolia not config.')
    }
    const client = algoliasearch(
      algoliaSearchOptions.appId,
      algoliaSearchOptions.apiKey,
    )
    const index = client.initIndex(algoliaSearchOptions.indexName)

    return index
  }

  async searchAlgolia(searchOption: SearchDto): Promise<
    | SearchResponse<{
        id: string
        text: string
        title: string
        type: 'post' | 'note' | 'page'
      }>
    | (Pagination<any> & {
        raw: SearchResponse<{
          id: string
          text: string
          title: string
          type: 'post' | 'note' | 'page'
        }>
      })
  > {
    const { keyword, size, page } = searchOption
    const index = await this.getAlgoliaSearchIndex()

    const search = await index.search<{
      id: string
      text: string
      title: string
      type: 'post' | 'note' | 'page'
    }>(keyword, {
      // Algolia pages start at 0
      page: page - 1,
      hitsPerPage: size,
      attributesToRetrieve: ['*'],
      snippetEllipsisText: '...',
      responseFields: ['*'],
      facets: ['*'],
    })
    if (searchOption.rawAlgolia) {
      return search
    }
    const data: any[] = []
    const tasks = search.hits.map((hit) => {
      const { type, objectID } = hit

      const model = this.databaseService.getModelByRefType(type as 'post')
      if (!model) {
        return Promise.resolve()
      }
      // strip the `_{i}` chunk suffix to recover the real ObjectId
      return model
        .findById(objectID.split('_')[0])
        .select('_id title created modified categoryId slug nid')
        .lean({
          getters: true,
          autopopulate: true,
        })
        .then((doc) => {
          if (doc) {
            Reflect.set(doc, 'type', type)
            data.push(doc)
          }
        })
    })
    await Promise.all(tasks)

    return {
      data,
      raw: search,
      pagination: {
        currentPage: page,
        total: search.nbHits,
        hasNextPage: search.nbPages > search.page,
        hasPrevPage: search.page > 1,
        size: search.hitsPerPage,
        totalPage: search.nbPages,
      },
    }
  }

  /**
   * @description push everything to Algolia Search at midnight every day
   */
  @CronOnce(CronExpression.EVERY_DAY_AT_MIDNIGHT, {
    name: 'pushToAlgoliaSearch',
  })
  @CronDescription('推送到 Algolia Search')
  @OnEvent(EventBusEvents.PushSearch)
  @OnEvent(BusinessEvents.POST_CREATE)
  @OnEvent(BusinessEvents.POST_UPDATE)
  @OnEvent(BusinessEvents.POST_DELETE)
  @OnEvent(BusinessEvents.NOTE_CREATE)
  @OnEvent(BusinessEvents.NOTE_UPDATE)
  @OnEvent(BusinessEvents.NOTE_DELETE)
  async pushAllToAlgoliaSearch() {
    const configs = await this.configs.waitForConfigReady()
    if (!configs.algoliaSearchOptions.enable || isDev) {
      return
    }
    const index = await this.getAlgoliaSearchIndex()

    this.logger.log('--> 开始推送到 Algolia')
    const documents = await this.buildAlgoliaIndexData()
    try {
      await Promise.all([
        index.replaceAllObjects(documents, {
          autoGenerateObjectIDIfNotExist: false,
        }),
        index.setSettings({
          attributesToHighlight: ['text', 'title'],
        }),
      ])
      this.logger.log('--> 推送到 algoliasearch 成功')
    } catch (error) {
      Logger.error('algolia 推送错误', 'AlgoliaSearch')
      throw error
    }
  }

  private canBeDecoded(textEncoded: Uint8Array): boolean {
    try {
      new TextDecoder('utf-8', { fatal: true }).decode(textEncoded)
      return true
    } catch {
      return false
    }
  }

  async buildAlgoliaIndexData() {
    const combineDocuments = await Promise.all([
      this.postService.model
        .find()
        .select('title text categoryId category slug')
        .populate('category', 'name slug')
        .lean()
        .then((list) => {
          return list.map((data) => {
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            return {
              ...data,
              text: removeMdCodeblock(data.text),
              type: 'post',
            }
          })
        }),
      this.pageService.model
        .find({}, 'title text slug subtitle')
        .lean()
        .then((list) => {
          return list.map((data) => {
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            return {
              ...data,
              type: 'page',
            }
          })
        }),
      this.noteService.model
        .find(
          {
            isPublished: true,
            $or: [
              { password: undefined },
              { password: null },
              { password: { $exists: false } },
            ],
          },
          'title text nid',
        )
        .lean()
        .then((list) => {
          return list.map((data) => {
            const id = data.nid.toString()
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            Reflect.deleteProperty(data, 'nid')
            return {
              ...data,
              type: 'note',
              id,
            }
          })
        }),
    ])
    const { algoliaSearchOptions } = await this.configs.waitForConfigReady()
    const combineDocumentsSplited: any[] = []
    combineDocuments.flat().forEach((item) => {
      const objectToAdjust = JSON.parse(JSON.stringify(item))
      // strip <style> and <script> tags
      objectToAdjust.text = objectToAdjust.text.replaceAll(
        /<style[^>]*>[\s\S]*?<\/style>/gi,
        '',
      )
      objectToAdjust.text = objectToAdjust.text.replaceAll(
        /<script[^>]*>[\s\S]*?<\/script>/gi,
        '',
      )
      const encodedSize = new TextEncoder().encode(
        JSON.stringify(objectToAdjust),
      ).length
      if (encodedSize <= algoliaSearchOptions.maxTruncateSize) {
        // small enough: keep it whole, but still suffix the objectID
        objectToAdjust.objectID = `${objectToAdjust.objectID}_0`
        combineDocumentsSplited.push(objectToAdjust)
      } else {
        // too big: split the text into n chunks, each within the budget
        // of maxTruncateSize minus the size of the non-text fields
        const textEncoded = new TextEncoder().encode(objectToAdjust.text)
        const textSize = textEncoded.length
        const n = Math.ceil(
          textSize /
            (algoliaSearchOptions.maxTruncateSize - encodedSize + textSize),
        )
        let start = 0
        for (let i = 0; i < n; i++) {
          const newObject = JSON.parse(JSON.stringify(objectToAdjust))
          // the last chunk takes the remainder, so no bytes are dropped
          let end = i === n - 1 ? textSize : start + Math.floor(textSize / n)
          // back up if the cut lands inside a multi-byte UTF-8 character
          while (!this.canBeDecoded(textEncoded.slice(start, end))) {
            end--
          }
          newObject.text = new TextDecoder('utf-8').decode(
            textEncoded.slice(start, end),
          )
          newObject.objectID = `${newObject.objectID}_${i}`
          combineDocumentsSplited.push(newObject)
          start = end
        }
      }
    })
    return combineDocumentsSplited
  }
}
```
## A Side Note

While editing the code, I discovered that it already truncated records exceeding a length limit. That limit was defined as 100 KB, so the developer is apparently a paying customer. Personally, I think the limit would be better read from an environment variable than hard-coded:
https://github.com/mx-space/core/blob/20a1eef/apps/core/src/modules/search/search.service.ts#L370
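Applied to the standalone Python script above, that suggestion is a one-liner; the variable name `ALGOLIA_MAX_RECORD_SIZE` is my own invention:

```python
import os

# Read the per-record byte limit from the environment, defaulting to 9990.
# ALGOLIA_MAX_RECORD_SIZE is a hypothetical variable name.
MAXSIZE = int(os.environ.get("ALGOLIA_MAX_RECORD_SIZE", "9990"))
```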
2024/12/21: The author has since made the truncation size configurable, but I still prefer chunked submission, since it keeps full-text search intact:
https://github.com/mx-space/core/commit/6da1c13799174e746708844d0b149b4607e8f276