This article is synchronized and updated to xLog by Mix Space. For the best browsing experience, visit the original link: https://www.do1e.cn/posts/code/algolia-search
## Algolia Search Configuration Method
The mx-space documentation contains a detailed configuration tutorial, and the process is likely similar for other blog frameworks.
## Index Size Limit
Unfortunately, after configuring everything according to the documentation, an error showed up in the log:

```
16:40:40 ERROR [AlgoliaSearch] Algolia push error
16:40:40 ERROR [Event] Record at the position 10 objectID=xxxxxxxx is too big size=12097/10000 bytes. Please have a look at https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-your-data/in-depth/index-and-records-size-and-usage-limitations/#record-size-limits
```
The cause is clear: one of my blog posts is too long, and Algolia's free plan allows only 10 KB per record. For someone like me who wants to stay on the free tier, upgrading is not an option, so I immediately started thinking about a workaround.
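To see exactly which records blow the limit, note that Algolia counts the size of the record as serialized JSON, in bytes. A quick sanity check (a sketch; the 10000-byte ceiling comes from the error message above) looks like this:

```python
import json

LIMIT = 10000  # free-plan record size limit, per the error log above

def record_size(record: dict) -> int:
    """Size of a record as Algolia counts it: UTF-8 bytes of its JSON form."""
    return len(json.dumps(record, ensure_ascii=False).encode("utf-8"))

# Example: a record whose "text" holds 4000 CJK characters (3 bytes each
# in UTF-8) is already 12000+ bytes and exceeds the free-plan limit.
record = {"objectID": "1234567", "title": "demo", "text": "字" * 4000}
print(record_size(record), record_size(record) > LIMIT)
```

Running this over the exported records quickly shows which posts need to be split.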
## Solution

### Idea
For mx-space, once an API token is configured, you can fetch the JSON that would be submitted to the Algolia index from `/api/v2/search/algolia/import-json`. It contains a list of posts, pages, and notes; a sample record looks like this:
```json
{
  "title": "Nanjing University IPv4 Address Range",
  "text": "# Motivation\n\n<details>\n<summary>The motivation comes from the website I built. Due to both internal and public networks being set up....",
  "slug": "nju-ipv4",
  "categoryId": "abcdefg",
  "category": {
    "_id": "abcdefg",
    "name": "Others",
    "slug": "others",
    "id": "abcdefg"
  },
  "id": "1234567",
  "objectID": "1234567",
  "type": "post"
},
```
The `objectID` is crucial: every record submitted to Algolia must have a unique one. So my idea was simple: split records with long `text` into several chunks, giving each chunk its own `objectID`. Easy, right?! (Clearly, at this point, I had not yet realized the severity of the problem.)
Additionally, some of my pages contain `<style>` and `<script>` blocks, which can be stripped directly with a regex.
So I wrote the following Python script to download the JSON from the endpoint above, edit it, and submit it to Algolia.
```python
from algoliasearch.search.client import SearchClientSync
import requests
import json
import math
from copy import deepcopy
import re

MAXSIZE = 9990  # stay slightly below Algolia's 10000-byte record limit
APPID = "..."
APPKey = "..."
MXSPACETOKEN = "..."

url = "https://www.do1e.cn/api/v2/search/algolia/import-json"
headers = {
    "Authorization": MXSPACETOKEN,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0",
}
ret = requests.get(url, headers=headers).json()
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(ret, f, ensure_ascii=False, indent=2)

to_push = []

def json_length(item):
    """Record size as Algolia counts it: UTF-8 bytes of the serialized JSON."""
    return len(json.dumps(item, ensure_ascii=False).encode("utf-8"))

def right_text(text):
    """True if the byte slice is valid UTF-8, i.e. no character was cut in half."""
    try:
        text.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def cut_json(item):
    length = json_length(item)
    text_length = len(item["text"].encode("utf-8"))
    # Bytes available for text per record = MAXSIZE minus the non-text
    # overhead (length - text_length), so the number of chunks is:
    n = math.ceil(text_length / (MAXSIZE - (length - text_length)))
    text_content = item["text"].encode("utf-8")
    start = 0
    for i in range(n):
        new_item = deepcopy(item)
        new_item["objectID"] = f"{item['objectID']}_{i}"
        # The last chunk takes everything that remains, so no trailing
        # bytes are lost to integer division.
        end = text_length if i == n - 1 else start + text_length // n
        # Back off until the cut lands on a character boundary (a CJK
        # character occupies 3 bytes in UTF-8 and must not be split).
        while not right_text(text_content[start:end]):
            end -= 1
        new_item["text"] = text_content[start:end].decode("utf-8")
        start = end
        to_push.append(new_item)

for item in ret:
    # Remove <style> and <script> blocks
    item["text"] = re.sub(r"<style.*?>.*?</style>", "", item["text"], flags=re.DOTALL)
    item["text"] = re.sub(r"<script.*?>.*?</script>", "", item["text"], flags=re.DOTALL)
    if json_length(item) > MAXSIZE:  # over the limit: split it
        print(f"{item['title']} is too large, cut it")
        cut_json(item)
    else:  # under the limit: still suffix the objectID for consistency
        item["objectID"] = f"{item['objectID']}_0"
        to_push.append(item)

with open("topush.json", "w", encoding="utf-8") as f:
    json.dump(to_push, f, ensure_ascii=False, indent=2)

client = SearchClientSync(APPID, APPKey)
resp = client.replace_all_objects("mx-space", to_push)
print(resp)
```
If you use a different blog framework, a script like this should be all you need; I hope it gives you some ideas.
Great. After rewriting the search index with the Python script and re-submitting it to Algolia, I enabled search in mx-space and looked for the over-limit post, "JPEG Encoding Details".
Why are there no results? And why is the backend throwing errors again?
```
17:03:46 ERROR [Catch] Cast to ObjectId failed for value "1234567_0" (type string) at path "_id" for model "posts"
    at SchemaObjectId.cast (entrypoints.js:1073:883)
    at SchemaType.applySetters (entrypoints.js:1187:226)
    at SchemaType.castForQuery (entrypoints.js:1199:338)
    at cast (entrypoints.js:159:5360)
    at Query.cast (entrypoints.js:799:583)
    at Query._castConditions (entrypoints.js:765:9879)
    at Hr.Query._findOne (entrypoints.js:768:4304)
    at Hr.Query.exec (entrypoints.js:784:5145)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Promise.all (index 0)
```
## Let's Edit the mx-space Code
From the log above it is easy to see that mx-space looks records up by MongoDB `ObjectId` rather than by `id`, so suffixed values like `1234567_0` fail the cast. Locate the relevant lookup in the code and modify it accordingly.
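The essence of the change, sketched here in Python with made-up helper names (the real fix lives in mx-space's TypeScript search service, and the exact location depends on the version), is to strip the `_<n>` chunk suffix from each hit's `objectID` before the database lookup, and to de-duplicate hits that came from the same document:

```python
def original_id(object_id: str) -> str:
    """Strip the `_<n>` chunk suffix added when a record was split."""
    base, sep, suffix = object_id.rpartition("_")
    return base if sep and suffix.isdigit() else object_id

def unique_document_ids(hits: list[dict]) -> list[str]:
    """Map Algolia hits back to document ids, dropping duplicate chunks."""
    seen: list[str] = []
    for hit in hits:
        doc_id = original_id(hit["objectID"])
        if doc_id not in seen:
            seen.append(doc_id)
    return seen

hits = [{"objectID": "1234567_0"}, {"objectID": "1234567_1"}, {"objectID": "89_0"}]
print(unique_document_ids(hits))  # ['1234567', '89']
```

With the suffix stripped, the resulting strings cast to `ObjectId` again and the `posts` lookup succeeds.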
## Further Improvements
However, this is still not very elegant: a Python script has to run on a schedule to push data to Algolia. Since I was already modifying the mx-space code, why not integrate the pagination directly? Fortunately, various AI tools helped me get up to speed quickly in a programming language I wasn't very familiar with.
After `buildAlgoliaIndexData()` in `/apps/core/src/modules/search/search.service.ts`, add code with the same logic as the Python above.
Rebuild the Docker image, and it works! (You can switch back to the official image later.)
However, the original version also defined three event types (add, delete, modify) that each trigger the push of a single record. I was too lazy to change those, so I simply moved the decorator (is that what it's called in TypeScript? I only know the Python term) to `pushAllToAlgoliaSearch`.
## A Side Note
While editing the code, I discovered that it already truncates over-limit records — but at 100 KB, which suggests the developer is a paid user. In my opinion, this limit would be better set via an environment variable than hardcoded.
https://github.com/mx-space/core/blob/20a1eef/apps/core/src/modules/search/search.service.ts#L370
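Sketched in Python (the variable name `ALGOLIA_MAX_RECORD_SIZE` is made up for illustration, not an actual mx-space setting), the env-var approach I have in mind looks like this:

```python
import os

# Hypothetical environment variable; the default mirrors the free-plan limit.
MAX_RECORD_SIZE = int(os.environ.get("ALGOLIA_MAX_RECORD_SIZE", "10000"))

def truncate_text(text: str, budget: int) -> str:
    """Cut `text` to at most `budget` UTF-8 bytes without splitting a character."""
    raw = text.encode("utf-8")[:budget]
    return raw.decode("utf-8", errors="ignore")
```

This way, free-tier users could set the limit to 10 KB and paid users could raise it, without anyone patching the source.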
2024/12/21: The author has since made the truncation length configurable, but I still prefer paginated submission, since it keeps the full text searchable.
https://github.com/mx-space/core/commit/6da1c13799174e746708844d0b149b4607e8f276