HKU NLP

HKU NLP / instructor-base

AI Model: Milvus Integrated

Task: Embedding

Modality: Text

Similarity Metric: Cosine

License: Apache 2.0

Dimensions: 768

Max Input Tokens: 512

Price: Free

Introduction to the Instructor Model Family

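The Instructor model from HKU NLP is an instruction-finetuned text embedding model. Given only a task instruction, it can generate task-specific embeddings for classification, retrieval, clustering, text evaluation, and more, across domains such as science and finance, without any additional finetuning.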

Figure 1: How the Instructor model works (image source: HKU NLP)
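
To make the instruction mechanism concrete, the sketch below pairs one piece of text with two different task instructions. It only builds the `[instruction, text]` input pairs, the same shape passed to `INSTRUCTOR.encode()` in the InstructorEmbedding example on this page; it does not load the model. The helper name `with_instruction` and the clustering instruction string are illustrative assumptions, not part of any library.

```python
def with_instruction(instruction, texts):
    """Pair every text with the same task instruction.

    Returns a list of [instruction, text] pairs, the input shape passed to
    INSTRUCTOR.encode() in the InstructorEmbedding example on this page.
    """
    return [[instruction, text] for text in texts]


text = "Artificial intelligence was founded as an academic discipline in 1956."

# Same text, two different tasks. The second instruction string is illustrative.
retrieval_inputs = with_instruction(
    "Represent the Wikipedia document for retrieval: ", [text])
clustering_inputs = with_instruction(
    "Represent the Wikipedia sentence for clustering: ", [text])

print(retrieval_inputs[0][0])
print(clustering_inputs[0][0])
```

Encoding each list with the model would be expected to give a different vector for the same sentence, because the instruction is part of the model input.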

The Instructor model comes in three variants: instructor-base, instructor-large, and instructor-xl. They differ in performance and scalability, so you can choose the one that fits your needs.

Introduction to instructor-base

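instructor-base is the smallest variant in the Instructor model family. It is designed to balance efficiency and performance, and suits tasks that need high-quality text embeddings while keeping compute costs under control.

[Table image: comparison of instructor-base, instructor-large, and instructor-xl]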

How to generate embeddings with instructor-base

There are two main ways to generate embeddings:

  • PyMilvus: the Python SDK for Milvus, which integrates the instructor-base model seamlessly
  • Instructor library: the InstructorEmbedding library for Python

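Once the embeddings are generated, you can store them in Zilliz Cloud (a fully managed vector database service powered by Milvus) and use them for semantic similarity search. The four key steps are:

  • Sign up for a free Zilliz Cloud account
  • Create a Serverless cluster and obtain its Public Endpoint and API Key
  • Create a collection and insert the vector data
  • Run a semantic search over the stored embeddings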
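
The code on this page refers to the cluster's Public Endpoint and API Key as `ZILLIZ_PUBLIC_ENDPOINT` and `ZILLIZ_API_KEY` without defining them. One possible way to supply them, sketched below, is to read them from environment variables so the key stays out of the source code. The helper name `load_zilliz_credentials` and the environment variable names are assumptions chosen to match those placeholders; they are not part of PyMilvus.

```python
import os


def load_zilliz_credentials(env=os.environ):
    """Read the cluster endpoint and API key from environment variables.

    The variable names ZILLIZ_PUBLIC_ENDPOINT and ZILLIZ_API_KEY are an
    assumption chosen to match the placeholders in the code on this page.
    """
    missing = [name for name in ("ZILLIZ_PUBLIC_ENDPOINT", "ZILLIZ_API_KEY")
               if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ZILLIZ_PUBLIC_ENDPOINT"], env["ZILLIZ_API_KEY"]
```

With both variables exported in the shell, `ZILLIZ_PUBLIC_ENDPOINT, ZILLIZ_API_KEY = load_zilliz_credentials()` can be run before creating the `MilvusClient`.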

Generate embeddings with PyMilvus

from pymilvus.model.dense import InstructorEmbeddingFunction
from pymilvus import MilvusClient

ef = InstructorEmbeddingFunction(
    "hkunlp/instructor-base", 
    query_instruction="Represent the Wikipedia question for retrieving supporting documents:", 
    doc_instruction="Represent the Wikipedia document for retrieval:")

docs = [
   "Artificial intelligence was founded as an academic discipline in 1956.",
   "Alan Turing was the first person to conduct substantial research in AI.",
   "Born in Maida Vale, London, Turing was raised in southern England."
]
# Generate embeddings for documents
docs_embeddings = ef.encode_documents(docs)

queries = ["When was artificial intelligence founded",
          "Where was Alan Turing born?"]

# Generate embeddings for queries
query_embeddings = ef.encode_queries(queries)

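# Connect to Zilliz Cloud with the Public Endpoint and API Key
# (ZILLIZ_PUBLIC_ENDPOINT and ZILLIZ_API_KEY must be defined beforehand)
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY)
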
COLLECTION = "documents"
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True)

for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"])
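
The search call above stores its response in `results` but does not show how to read it. The sketch below assumes the response is one list of hits per query, each hit a dict with `"distance"` and `"entity"` keys, which is my understanding of what `MilvusClient.search()` returns; check the shape against your PyMilvus version. The helper name `top_hits` and the sample data are hypothetical, and the distances are made up.

```python
def top_hits(queries, results):
    """Map each query to a list of (text, distance) tuples.

    Assumes `results` has the shape returned by MilvusClient.search():
    one list of hits per query, each hit a dict with "distance" and
    "entity" keys, where "entity" holds the requested output fields.
    """
    summary = {}
    for query, hits in zip(queries, results):
        summary[query] = [(hit["entity"]["text"], hit["distance"]) for hit in hits]
    return summary


# Hand-written stand-in for a search response; the distances are made up.
sample_results = [
    [{"id": 1, "distance": 0.9, "entity": {"text": "doc A"}}],
    [{"id": 2, "distance": 0.8, "entity": {"text": "doc B"}}],
]
sample_queries = ["first query", "second query"]

for query, hits in top_hits(sample_queries, sample_results).items():
    print(query, "->", hits)
```

In the example above, `top_hits(queries, results)` would pair each question with the stored documents it matched.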

Generate embeddings with the InstructorEmbedding library

from InstructorEmbedding import INSTRUCTOR
from pymilvus import MilvusClient

model = INSTRUCTOR('hkunlp/instructor-base')

docs = [["Represent the Wikipedia document for retrieval: ", "Artificial intelligence was founded as an academic discipline in 1956."],
        ["Represent the Wikipedia document for retrieval: ", "Alan Turing was the first person to conduct substantial research in AI."],
        ["Represent the Wikipedia document for retrieval: ", "Born in Maida Vale, London, Turing was raised in southern England."]]
# Generate embeddings for documents
docs_embeddings = model.encode(docs, normalize_embeddings=True)


queries = [["Represent the Wikipedia question for retrieving supporting documents: ", "When was artificial intelligence founded"],
           ["Represent the Wikipedia question for retrieving supporting documents: ", "Where was Alan Turing born?"]]
# Generate embeddings for queries
query_embeddings = model.encode(queries, normalize_embeddings=True)

# Connect to Zilliz Cloud with Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY)

COLLECTION = "documents"
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=768,
    auto_id=True)

# Each entry in docs is an [instruction, text] pair; store only the text
for (_, text), embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": text, "vector": embedding})

results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"])
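
This page lists cosine as the similarity metric for instructor-base, and the example above encodes with `normalize_embeddings=True`. As a check that does not depend on Milvus, the plain-Python sketch below shows why the two fit together: for unit-length vectors, the dot product equals the cosine similarity. The three-dimensional vectors are toy values, not real instructor-base output (which has 768 dimensions), and the helper names are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]
doc = [4.0, 3.0, 0.0]

# After normalization the plain dot product already is the cosine similarity.
print(cosine_similarity(query, doc))
print(dot(normalize(query), normalize(doc)))
```

Both printed values agree up to floating-point rounding.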
