Let's build a chatbot that can talk with Sherlock Holmes!
1. Preparing the data
a. Downloading the raw data
!curl https://sherlock-holm.es/stories/plain-text/cano.txt -o ../dataset/holmes/canon.txt
The command above uses curl to download the text file from the URL and save it to ../dataset/holmes/canon.txt. This file contains the complete Sherlock Holmes canon.
b. Setting the API key
This step sets the OpenAI API key. Using os.environ, the key is stored in the OPENAI_API_KEY environment variable so that OpenAI's services can be called later with this key.
import os
api_key = "sk-xxx"
os.environ["OPENAI_API_KEY"] = api_key
os.environ.get("OPENAI_API_KEY")
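If you would rather not hardcode the key in the notebook, a minimal alternative (a sketch, not part of the original post) reads it interactively with Python's getpass:
# Read the key interactively instead of hardcoding it (sketch).
import os
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")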
2. Loading the documents
a. Loading documents from a directory
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader('../dataset/holmes', glob="*", show_progress=True)
docs = loader.load()
This code loads every file in the specified directory.
- DirectoryLoader reads the files in the given directory (../dataset/holmes) and stores the resulting list of documents in the variable docs.
- glob="*" means every file is read.
- show_progress=True shows the loading progress visually.
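As a quick sanity check (not in the original notebook), you can confirm how many documents were loaded and where the first one came from:
# Hypothetical check: number of loaded documents and the source path of the first one.
print(len(docs))
print(docs[0].metadata)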
3. Splitting the documents
a. Configuring the text splitter
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=2048,
    chunk_overlap=256,
)
documents = text_splitter.split_documents(docs)
documents[0].page_content
CharacterTextSplitter breaks long text into smaller chunks. In this code:
- separator="\n\n": the text is split on double line breaks, i.e. paragraph boundaries.
- chunk_size=2048: each chunk is at most 2048 characters.
- chunk_overlap=256: adjacent chunks overlap by 256 characters.
The resulting chunks are stored in the variable documents.
CharacterTextSplitter was chosen because cutting purely by token count can slice through the middle of a passage and lose the surrounding context; splitting on paragraph boundaries keeps that context intact.
'THE COMPLETE SHERLOCK HOLMES\n\nArthur Conan Doyle\n\nTable of contents\n\nA Study In Scarlet\n\nThe Sign of the Four\n\nThe Adventures of Sherlock Holmes A Scandal in Bohemia The Red-Headed League A Case of Identity The Boscombe Valley Mystery The Five Orange Pips The Man with the Twisted Lip The Adventure of the Blue Carbuncle The Adventure of the Speckled Band The Adventure of the Engineer\'s Thumb The Adventure of the Noble Bachelor The Adventure of the Beryl Coronet The Adventure of the Copper Beeches\n\nThe Memoirs of Sherlock Holmes Silver Blaze The Yellow Face The Stock-Broker\'s Clerk The "Gloria Scott" The Musgrave Ritual The Reigate Squires The Crooked Man The Resident Patient The Greek Interpreter The Naval Treaty The Final Problem\n\nThe Return of Sherlock Holmes The Adventure of the Empty House The Adventure of the Norwood Builder The Adventure of the Dancing Men The Adventure of the Solitary Cyclist The Adventure of the Priory School The Adventure of Black Peter The Adventure of Charles Augustus Milverton The Adventure of the Six Napoleons The Adventure of the Three Students The Adventure of the Golden Pince-Nez The Adventure of the Missing Three-Quarter The Adventure of the Abbey Grange The Adventure of the Second Stain\n\nThe Hound of the Baskervilles\n\nThe Valley Of Fear\n\nHis Last Bow Preface The Adventure of Wisteria Lodge The Adventure of the Cardboard Box The Adventure of the Red Circle The Adventure of the Bruce-Partington Plans The Adventure of the Dying Detective The Disappearance of Lady Frances Carfax The Adventure of the Devil\'s Foot His Last Bow\n\nThe Case-Book of Sherlock Holmes Preface The Illustrious Client The Blanched Soldier The Adventure Of The Mazarin Stone The Adventure of the Three Gables The Adventure of the Sussex Vampire The Adventure of the Three Garridebs The Problem of Thor Bridge The Adventure of the Creeping Man The Adventure of the Lion\'s Mane The Adventure of the Veiled Lodger The Adventure of Shoscombe Old Place The Adventure of the Retired Colourman\n\nA STUDY IN SCARLET'
Let's check how many chunks (documents) were created.
len(documents)
2126
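To verify the splitter settings, a small sketch (assumed, not in the original post) inspects the chunk lengths; note that CharacterTextSplitter can produce chunks longer than chunk_size when a single paragraph exceeds the limit.
# Inspect chunk lengths; most should stay near chunk_size=2048.
lengths = [len(d.page_content) for d in documents]
print(min(lengths), max(lengths), sum(lengths) / len(lengths))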
4. Filtering the documents and printing a sample
documents = [d for d in documents if d.page_content.find('"') > -1]
This code keeps only the documents that contain a double quote (") and stores them back in the documents list.
Dialogue is normally wrapped in quotation marks, so this filters the documents down to those that contain dialogue; finally, the number of filtered documents is printed.
len(documents)
2021
Let's take a look at the data.
print(documents[1].page_content)
On the very day that I had come to this conclusion, I was standing at the Criterion Bar, when some one tapped me on the shoulder, and turning round I recognized young Stamford, who had been a dresser under me at Bart's. The sight of a friendly face in the great wilderness of London is a pleasant thing indeed to a lonely man. In old days Stamford had never been a particular crony of mine, but now I hailed him with enthusiasm, and he, in his turn, appeared to be delighted to see me. In the exuberance of my joy, I asked him to lunch with me at the Holborn, and we started off together in a hansom. "Whatever have you been doing with yourself, Watson?" he asked in undisguised wonder, as we rattled through the crowded London streets. "You are as thin as a lath and as brown as a nut." I gave him a short sketch of my adventures, and had hardly concluded it by the time that we reached our destination. "Poor devil!" he said, commiseratingly, after he had listened to my misfortunes. "What are you up to now?" "Looking for lodgings," I answered. "Trying to solve the problem as to whether it is possible to get comfortable rooms at a reasonable price." "That's a strange thing," remarked my companion; "you are the second man to-day that has used that expression to me." "And who was the first?" I asked. "A fellow who is working at the chemical laboratory up at the hospital. He was bemoaning himself this morning because he could not get someone to go halves with him in some nice rooms which he had found, and which were too much for his purse." "By Jove!" I cried, "if he really wants someone to share the rooms and the expense, I am the very man for him. I should prefer having a partner to being alone." Young Stamford looked rather strangely at me over his wine-glass. "You don't know Sherlock Holmes yet," he said; "perhaps you would not care for him as a constant companion." "Why, what is there against him?"
5. Preparing the dialogue extraction
a. Configuring the LLM
ChatOpenAI is the class that configures OpenAI's chat model. Here the gpt-3.5-turbo model is used with temperature=0, which makes the output deterministic: the same input always produces the same output.
from langchain_community.chat_models import ChatOpenAI
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
)
b. Configuring the extraction chain
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text
example_text = """
"Which is it today?" I asked,-
"morphine or cocaine?"
He raised his eyes languidly from the old black-letter volume which he had opened. "It is cocaine," he said,--"a seven-per-cent solution. Would you care to try it?"
"No, indeed," I answered, brusquely. "My constitution has not got over the Afghan campaign yet. I cannot afford to throw any extra strain upon it."
He smiled at my vehemence. "Perhaps you are right, Watson," he said. "I suppose that its influence is physically a bad one. I find it, however, so transcendently stimulating and clarifying to the mind that its secondary action is a matter of small moment."
"""
result = [
    {"role": "Watson", "dialogue": "Which is it today? morphine or cocaine?"},
    {"role": "Holmes", "dialogue": "It is cocaine, a seven-per-cent solution. Would you care to try it?"},
    {"role": "Watson", "dialogue": "No, indeed, My constitution has not got over the Afghan campaign yet. I cannot afford to throw any extra strain upon it."},
    {"role": "Holmes", "dialogue": "Perhaps you are right, Watson, I suppose that its influence is physically a bad one. I find it, however, so transcendently stimulating and clarifying to the mind that its secondary action is a matter of small moment."},
]
schema = Object(
    id="script",
    description="Extract dialogue from given piece of the novel 'Sherlock holmes', ignore the non-dialogue parts. When analyzing the document, make the most of your knowledge about the Sherlock Holmes novels you know. When the speaker is not clear, infer from the character's personality, occupation, and way of speaking.",
    attributes=[
        Text(
            id="role",
            description="The character who is speaking, use context to predict the role",
        ),
        Text(
            id="dialogue",
            description="The dialogue spoken by the characters in the context",
        ),
    ],
    examples=[
        (example_text, result)
    ],
    many=True,
)
Here we define the schema used to extract only the dialogue from the text.
- Object defines the overall data structure, and Text defines each dialogue field.
- The example text and its corresponding result serve as a few-shot example that shows the model the expected output.
import pickle
with open("../dataset/kor_schema_holmes.json", "wb") as file:
pickle.dump(schema, file)
The schema is the core definition (the rules) of the dialogue extraction task. Saving it with pickle (note that despite the .json extension, the file is a pickle) lets you load it back later and process data with exactly the same rules, which is useful when you want to reuse the same schema or pick up the work again later.
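Later, the pickled schema can be loaded back the same way (a sketch mirroring the save step above):
# Reload the pickled schema so the same extraction rules can be reused.
import pickle

with open("../dataset/kor_schema_holmes.json", "rb") as file:
    schema = pickle.load(file)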
c. Creating the extraction chain
kor_chain = create_extraction_chain(llm, schema)
This code creates the chain that performs the dialogue extraction. The create_extraction_chain function combines the following pieces into an extraction chain:
- LLM: the large language model, here ChatOpenAI, which provides the natural-language capability needed to extract information from text.
- Schema: the schema that defines the structure of the data to extract, i.e. what to extract and what shape it should take.
- Role of the chain: by combining the LLM and the schema, it forms an automated pipeline that extracts the specified information from the text you provide.
- Running the extraction: the chain is executed with kor_chain.invoke(text), which extracts the required data from the text based on the LLM and the schema.
print(kor_chain.prompt.format_prompt(text="[user_input]").to_string())
The code above prints the prompt that kor_chain will use for the dialogue extraction. format_prompt shows the instruction that will be sent to the LLM for the given input (text="[user_input]").
- Purpose: preview how the extraction will be carried out and what prompt the LLM will receive.
- Result: the prompt that will be sent to the LLM is printed.
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript
script: Array<{ // Extract dialogue from given piece of the novel 'Sherlock holmes', ignore the non-dialogue parts. When analyzing the document, make the most of your knowledge about the Sherlock Holmes novels you know. When the speaker is not clear, infer from the character's personality, occupation, and way of speaking.
 role: string // The character who is speaking, use context to predict the role
 dialogue: string // The dialogue spoken by the characters in the context
}>
```

Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.

Input: "Which is it today?" I asked,- "morphine or cocaine?" He raised his eyes languidly from the old black-letter volume which he had opened. "It is cocaine," he said,--"a seven-per-cent solution. Would you care to try it?" "No, indeed," I answered, brusquely. "My constitution has not got over the Afghan campaign yet. I cannot afford to throw any extra strain upon it."
...
Holmes|Perhaps you are right, Watson, I suppose that its influence is physically a bad one. I find it, however, so transcendently stimulating and clarifying to the mind that its secondary action is a matter of small moment. Input: [user_input] Output:
6. Testing the dialogue extraction
text = documents[1].page_content
print(text)
This code picks one of the filtered documents and prints its content.
On the very day that I had come to this conclusion, I was standing at the Criterion Bar, when some one tapped me on the shoulder, and turning round I recognized young Stamford, who had been a dresser under me at Bart's. The sight of a friendly face in the great wilderness of London is a pleasant thing indeed to a lonely man. In old days Stamford had never been a particular crony of mine, but now I hailed him with enthusiasm, and he, in his turn, appeared to be delighted to see me. In the exuberance of my joy, I asked him to lunch with me at the Holborn, and we started off together in a hansom. "Whatever have you been doing with yourself, Watson?" he asked in undisguised wonder, as we rattled through the crowded London streets. "You are as thin as a lath and as brown as a nut." I gave him a short sketch of my adventures, and had hardly concluded it by the time that we reached our destination. "Poor devil!" he said, commiseratingly, after he had listened to my misfortunes. "What are you up to now?" "Looking for lodgings," I answered. "Trying to solve the problem as to whether it is possible to get comfortable rooms at a reasonable price." "That's a strange thing," remarked my companion; "you are the second man to-day that has used that expression to me." "And who was the first?" I asked. "A fellow who is working at the chemical laboratory up at the hospital. He was bemoaning himself this morning because he could not get someone to go halves with him in some nice rooms which he had found, and which were too much for his purse." "By Jove!" I cried, "if he really wants someone to share the rooms and the expense, I am the very man for him. I should prefer having a partner to being alone." Young Stamford looked rather strangely at me over his wine-glass. "You don't know Sherlock Holmes yet," he said; "perhaps you would not care for him as a constant companion." "Why, what is there against him?"
Now let's actually extract the dialogue from the given text.
result = kor_chain.invoke(text)
result
kor_chain is the dialogue extraction chain defined earlier, and the invoke method runs the chain to extract only the dialogue from the given text. The result is stored in the result variable.
- Purpose: run the chain to extract only the dialogue from real text.
- Result: the data containing the dialogue extracted from the text is returned.
{'text': {'data': {'script': [{'role': 'Watson', 'dialogue': 'Whatever have you been doing with yourself, Watson? You are as thin as a lath and as brown as a nut.'}, {'role': 'Stamford', 'dialogue': 'Looking for lodgings. Trying to solve the problem as to whether it is possible to get comfortable rooms at a reasonable price.'}, {'role': 'Stamford', 'dialogue': "That's a strange thing, you are the second man to-day that has used that expression to me."}, {'role': 'Watson', 'dialogue': 'And who was the first?'}, {'role': 'Stamford', 'dialogue': 'A fellow who is working at the chemical laboratory up at the hospital. He was bemoaning himself this morning because he could not get someone to go halves with him in some nice rooms which he had found, and which were too much for his purse.'}, {'role': 'Watson', 'dialogue': 'By Jove! if he really wants someone to share the rooms and the expense, I am the very man for him. I should prefer having a partner to being alone.'}, {'role': 'Stamford', 'dialogue': "You don't know Sherlock Holmes yet, perhaps you would not care for him as a constant companion."}, {'role': 'Watson', 'dialogue': 'Why, what is there against him?'}]}, 'raw': "role|dialogue\nWatson|Whatever have you been doing with yourself, Watson? You are as thin as a lath and as brown as a nut.\nStamford|Looking for lodgings. Trying to solve the problem as to whether it is possible to get comfortable rooms at a reasonable price.\nStamford|That's a strange thing, you are the second man to-day that has used that expression to me.\nWatson|And who was the first?\nStamford|A fellow who is working at the chemical laboratory up at the hospital. He was bemoaning himself this morning because he could not get someone to go halves with him in some nice rooms which he had found, and which were too much for his purse.\nWatson|By Jove! if he really wants someone to share the rooms and the expense, I am the very man for him. I should prefer having a partner to being alone.\nStamford|You don't know Sherlock Holmes yet, perhaps you would not care for him as a constant companion.\nWatson|Why, what is there against him?", 'errors': [], 'validated_data': {}}}
- Prompt inspection code: before running the extraction, preview how the instruction (prompt) sent to the LLM is composed.
- kor_chain.invoke(text): actually performs the dialogue extraction and returns the result.
7. Parsing the extracted dialogue
def parse_kor_result(data):
    script = data['text']['data']['script']
    results = [f"{scr['role']}: {scr['dialogue']}\n" for scr in script if 'role' in scr]
    holmes_inc = any(scr['role'] == 'Holmes' for scr in script if 'role' in scr)
    return ''.join(results), holmes_inc
parse_kor_result(result)
The parse_kor_result function parses the extracted dialogue so that each role and line can be printed, and it also checks whether Holmes appears in the dialogue.
("Watson: Whatever have you been doing with yourself, Watson? You are as thin as a lath and as brown as a nut.\nStamford: Looking for lodgings. Trying to solve the problem as to whether it is possible to get comfortable rooms at a reasonable price.\nStamford: That's a strange thing, you are the second man to-day that has used that expression to me.\nWatson: And who was the first?\nStamford: A fellow who is working at the chemical laboratory up at the hospital. He was bemoaning himself this morning because he could not get someone to go halves with him in some nice rooms which he had found, and which were too much for his purse.\nWatson: By Jove! if he really wants someone to share the rooms and the expense, I am the very man for him. I should prefer having a partner to being alone.\nStamford: You don't know Sherlock Holmes yet, perhaps you would not care for him as a constant companion.\nWatson: Why, what is there against him?\n", False)
holmes_inc is False here because Holmes is not one of the speakers in this chunk.
8. Extracting dialogue from every document
Everything so far has run in a notebook, but this part takes quite a while, so the following code is meant to be run in a terminal.
import time

from langchain.docstore.document import Document
from tqdm import tqdm

doc_script = []
pbar = tqdm(total=len(documents))
idx = 0
while idx < len(documents):
    try:
        doc = documents[idx]
        script = kor_chain.invoke(doc.page_content)
        script_parsed, holmes_inc = parse_kor_result(script)
        if holmes_inc:
            doc_script.append(script_parsed)
        idx += 1
        pbar.update(1)
    except Exception as e:
        # On any error (e.g. a rate limit), print it, wait a minute, and retry the same document.
        print(e)
        time.sleep(60)
The code above runs the dialogue extraction repeatedly over every document; tqdm shows the progress visually.
- kor_chain.invoke(doc.page_content): extracts the dialogue from each document.
- parse_kor_result(script): parses the extracted dialogue.
- Only dialogue that includes Sherlock Holmes is appended to the doc_script list.
The same loop, refined so that OpenAI rate-limit errors are caught specifically, with a one-minute back-off before retrying the same document:
import openai

doc_script = []
pbar = tqdm(total=len(documents))
idx = 0
while idx < len(documents):
    try:
        doc = documents[idx]
        script = kor_chain.invoke(doc.page_content)
        script_parsed, holmes_inc = parse_kor_result(script)
        if holmes_inc:
            doc_script.append(script_parsed)
        idx += 1
        pbar.update(1)
    except openai.RateLimitError as e:
        print(f"OpenAI RATE LIMIT error {e.status_code}: {e.response}")
        time.sleep(60)
9. Saving the results and setting up the search engine
a. Saving and reloading the results
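The post does not show the code that wrote holmes_script.txt; a minimal sketch of what it likely looked like, assuming each parsed block is separated by a "###" line (inferred from the loading code below):
# Hypothetical saving step (inferred from the loading code below): write each parsed
# dialogue block to holmes_script.txt, separated by a "###" line.
with open("../dataset/holmes_script.txt", "w") as f:
    f.write("###\n".join(doc_script))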
with open("../dataset/holmes_script.txt", "r") as f:
lines = "\n".join(f.readlines()).split("###\n")
This code loads the extracted dialogue script back from the file, splitting it into blocks on the "###" separator, so it can be used for the search engine later.
from langchain.docstore.document import Document
doc_script = [Document(page_content=script_parsed,metadata={"source": "Sherlock Holmes"}) for script_parsed in lines]
This code converts the loaded script blocks into Document objects.
10. Creating the retriever
a. Building the vector search engine
from langchain_openai.embeddings import OpenAIEmbeddings
First, import the OpenAIEmbeddings embedding model.
An embedding converts text into a high-dimensional vector that represents its meaning numerically. This makes it possible to compute similarity between texts or to find similar documents in a search engine.
embed_model = OpenAIEmbeddings(
    api_key=api_key,
    model='text-embedding-3-small',
)
This creates an instance of the OpenAIEmbeddings class. Here:
- api_key: the key used to access the OpenAI API.
- model: the 'text-embedding-3-small' model, which converts text into embedding vectors.
The embed_model configured this way is used from here on to turn text into vectors.
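To get a feel for what the embeddings do, here is a small experiment (a sketch, not in the original post) that embeds two sentences and compares them with cosine similarity:
# Embed two short texts and compute their cosine similarity (illustrative only).
import numpy as np

v1 = np.array(embed_model.embed_query("Sherlock Holmes plays the violin."))
v2 = np.array(embed_model.embed_query("Holmes is a brilliant detective."))
print(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))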
from langchain_community.vectorstores import FAISS
FAISS (Facebook AI Similarity Search) is a library for fast search and similarity lookup over vectorized data. Here we use FAISS to set up the vector store.
vector_index = FAISS.from_documents(doc_script, embed_model)
This converts the given documents (doc_script) into embedding vectors and stores them in the FAISS vector store (vector_index).
- doc_script: the list of documents to embed; each document is converted into a vector by the embedding model.
- embed_model: the embedding model created above, used to convert the documents into vectors.
As a result, vector_index becomes a database in which the documents are stored in vector form.
retriever = vector_index.as_retriever(search_type="mmr", search_kwargs={"k": 3})
This code turns vector_index into a retriever, whose job is to find the documents most similar to a given query.
- search_type="mmr": use MMR (Maximal Marginal Relevance), an algorithm that returns results that are both relevant and diverse, preventing the results from being filled with near-duplicate documents.
- search_kwargs={"k": 3}: the number of documents to return; here the top 3 are returned.
As a result, the retriever vectorizes a text query and searches vector_index for similar documents.
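MMR also accepts optional tuning parameters (a sketch; fetch_k and lambda_mult are supported by LangChain's MMR search): fetch_k controls how many candidates are fetched before re-ranking, and lambda_mult trades off relevance against diversity (1 = pure relevance, 0 = maximum diversity).
# Optional MMR tuning (sketch): fetch 20 candidates, then re-rank down to 3 diverse results.
retriever = vector_index.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 20, "lambda_mult": 0.5},
)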
b. Saving the vector index
vector_index.save_local("../models/holmes_faiss.json")
The code above saves the vector index to a local file. Saving it locally means the index you have already built can be loaded and reused later.
- vector_index: the vector index that was created; it contains the embedding vectors of all documents.
- save_local(): saves the vector index to the given path, here "../models/holmes_faiss.json".
Saving the index helps in three ways:
- Saving time: building a vector index can take a long time when there are many documents or embedding is slow. With a saved index, you can load it directly instead of rebuilding it the next time you analyze the same data.
- Reusability: when you pause a project and come back to it later, you can reuse the index you built before, avoiding duplicated work and continuing other experiments on the same index.
- Consistency: a saved index gives consistent search results for the same dataset. Rebuilding the index can introduce small differences each time; loading the saved index avoids that.
Later, you can bring this vector index back with the load_local() method.
from langchain_community.vectorstores import FAISS
vector_index = FAISS.load_local("../models/holmes_faiss.json", embed_model)
retriever = vector_index.as_retriever(search_type="mmr", search_kwargs={"k": 3})
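Depending on the installed langchain_community version, load_local may additionally require opting in to pickle deserialization (version-dependent; check your release):
# Newer releases require explicitly allowing deserialization of the locally pickled index.
vector_index = FAISS.load_local(
    "../models/holmes_faiss.json",
    embed_model,
    allow_dangerous_deserialization=True,
)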
c. Testing the search engine
result = retriever.get_relevant_documents("What is solar system?")
for d in result:
    print(d.page_content)
    print("===")
Watson: You appear to be astonished, Now that I do know it I shall do my best to forget it. Holmes: To forget it! Watson: You see, I consider that a man's brain originally is like a little empty attic, and you have to stock it with such furniture as you choose. A fool takes in all the lumber of every sort that he comes across, so that the knowledge which might be useful to him gets crowded out, or at best is jumbled up with a lot of other things so that he has a difficulty in laying his hands upon it. Now the skilful workman is very careful indeed as to what he takes into his brain-attic. He will have nothing but the tools which may help him in doing his work, but of these he has a large assortment, and all in the most perfect order. It is a mistake to think that that little room has elastic walls and can distend to any extent. Depend upon it there comes a time when for every addition of knowledge you forget something that you knew before. It is of the highest importance, therefore, not to have useless facts elbowing out the useful ones. Watson: But the Solar System! Holmes: What the deuce is it to me? you say that we go round the sun. If we went round the moon it would not make a pennyworth of difference to me or to my work. Watson: I was on the point of asking him what that work might be, but something in his manner showed me that the question would be an unwelcome one. I pondered over our short conversation, however, and endeavoured to draw my deductions from it. He said that he would acquire no knowledge which did not bear upon his object. Therefore all the knowledge which he possessed was such as would be useful to him. I enumerated in my own mind all the various points upon which he had shown me that he was exceptionally well-informed. I even took a pencil and jotted them down. I could not help smiling at the document when I had completed it. It ran in this way-- Sherlock Holmes- his limits. === Unknown: Is Mr. Sherlock Holmes here? Holmes: Mr. Sandeford, of Reading, I suppose? Sandeford: Yes, sir, I fear that I am a little late; but the trains were awkward. You wrote to me about a bust that is in my possession. Holmes: Exactly. Sandeford: I have your letter here. You said, 'I desire to possess a copy of Devine's Napoleon, and am prepared to pay you ten pounds for the one which is in your possession.' Is that right?
...
Holmes: Do you hear me? Who are you? What are you doing here? ===
result = retriever.get_relevant_documents("Who is your brother?")
for d in result:
    print(d.page_content)
    print("===")
Holmes: He is coming. Holmes: This way! Holmes: You can write me down an ass this time, Watson. This was not the bird that I was looking for. Mycroft: Who is he? Holmes: The younger brother of the late Sir James Walter, the head of the Submarine Department. Yes, yes; I see the fall of the cards. He is coming to. I think that you had best leave his examination to me. Prisoner: What is this? I came here to visit Mr. Oberstein. === Watson: In your own case, from all that you have told me, it seems obvious that your faculty of observation and your peculiar facility for deduction are due to your own systematic training. Holmes: To some extent. My ancestors were country squires, who appear to have led much the same life as is natural to their class. But, none the less, my turn that way is in my veins, and may have come with my grandmother, who was the sister of Vernet, the French artist. Art in the blood is liable to take the strangest forms. Watson: But how do you know that it is hereditary? Holmes: Because my brother Mycroft possesses it in a larger degree than I do.
...
Holmes: Pray be precise as to details. ===
🔥 Conclusion
We have now extracted the dialogue from the Sherlock Holmes stories and built a search engine that, given a user question, retrieves the most relevant dialogue and context. The search engine can find appropriate documents for a wide range of Sherlock Holmes questions, but it does not yet generate an answer to the user's question directly.
In the next post, we will build prompts on top of this search engine and implement response generation so that the Sherlock Holmes chatbot can actually converse with the user!