In this lab, you will analyze unstructured data files such as contract documents, leases, and user manuals using OCR, the Azure Cognitive Search Semantic Search feature, and Azure OpenAI Large Language Models to summarize key information after converting the documents into more structured index files. For this lab, you will use the dataset provided at Lab3 Sample Data.
- How to leverage a GPT-3 Large Language Model (LLM) to extract a concise summary from a subset of a large document repository using Azure OpenAI and Azure Cognitive Search Semantic Search
- The accelerator is deployed and ready in the resource group
- You have access to sample data to test OpenAI
At this stage, select the model you want to use and the feature you want to leverage. In this case, we will use the Davinci model and the Summarize feature. The playground loads a sample into the editor. Select the content of the 'Conversation' section and replace it with ${document} so that the dynamic content is used at runtime. After that, click 'View Code' at the top right.
In the pop-up, there is a drop-down menu with 'Python' selected by default. Change it to 'json' and copy the code snippet.
Go back to the BPA tab and replace the default text in the Generic OpenAI component opened earlier with the copied text.
That completes the pipeline.
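For reference, the snippet you copy from 'View Code' is a JSON request body for the Completions API. Its exact contents depend on your playground settings, but it will look roughly like the sketch below (the parameter values here are illustrative assumptions; the important part is that `prompt` carries the ${document} placeholder you set earlier):

```json
{
  "prompt": "${document}",
  "max_tokens": 250,
  "temperature": 0.3,
  "top_p": 1,
  "frequency_penalty": 0,
  "presence_penalty": 0
}
```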
There are 2 options for ingesting the data for the pipeline:
- Use single-file upload for smaller files (fewer than 4 pages)
- Use the split-document option to split larger documents and upload the individual split files to the pipeline.
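The split-document option handles the splitting for you, but as a rough illustration of the idea, a script along these lines could break a large text document into smaller chunks before upload. The file names and chunk size below are assumptions for illustration, not part of the accelerator:

```python
from pathlib import Path

def split_document(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Example: write each chunk to its own file for individual upload.
# for i, chunk in enumerate(split_document(Path("lease.txt").read_text())):
#     Path(f"lease_part{i + 1}.txt").write_text(chunk)
```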
Next, get to the Search Service. To view the results, go to portal.azure.com (Azure Portal) again in your browser and navigate to the resource group as you did earlier in Step 1. In the resource group, click the resource of type Search Service.
Provide a name for the data source and click 'Choose an existing connection' for the Connection String. The Azure Cosmos DB resource created as part of the BPA accelerator setup will be one of the sources you can choose from.
Keep the default of None for Managed identity authentication. For Database and Collection, use the dropdowns to select the same names as the Cosmos DB you selected in step 15.
Under Query, use the following query. The pipeline name should match the pipeline name you used in step 3; @HighWaterMark is the indexer's change-tracking parameter, so only documents changed since the last indexer run are picked up:
SELECT * FROM c WHERE c.pipeline = 'YOUR-PIPELINE-NAME' AND c._ts > @HighWaterMark
Click 'Next: Add cognitive skills (Optional)'. This validates the data source and creates the index schema.
On the next screen (Add cognitive skills (Optional)), click 'Skip to: Customize target index'.
Provide a name for the index and click 'Next: Create an indexer'.
Provide a name for the indexer and click Submit.
You will get a notification that the import was successfully configured.
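Once the indexer has run, you can sanity-check that the pipeline's documents landed in the index by querying the Search Service's REST API. This is an optional sketch, not part of the lab steps; the service name, index name, and API key are placeholders you would fill in from your own deployment:

```python
import json
import urllib.request

SERVICE = "your-search-service"    # placeholder: your Search Service name
INDEX = "your-index-name"          # placeholder: the index created above
API_KEY = "your-query-or-admin-key"

def build_search_request(service, index, api_key, search_text):
    """Build an Azure Cognitive Search 'search documents' POST request."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           "/docs/search?api-version=2021-04-30-Preview")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    body = json.dumps({"search": search_text, "top": 5}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# To actually run the query against your service:
# with urllib.request.urlopen(build_search_request(SERVICE, INDEX, API_KEY, "lease")) as resp:
#     for doc in json.load(resp)["value"]:
#         print(doc)
```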
Go back to your search index and configure the Semantic Configuration.
Select the Semantic Configuration and click on Create new.
In the pop-up, do the following:
- Give a name to the Semantic Search configuration
- For the Title field, select 'filename'
- For Content fields, select the 'content' field and any other relevant fields
- Select Save
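For reference, the equivalent semantic configuration in the index definition (as it would appear if created via the REST API rather than the portal) looks roughly like the fragment below. The configuration name is an assumption; the field names match the 'filename' and 'content' fields used in this lab:

```json
"semantic": {
  "configurations": [
    {
      "name": "my-semantic-config",
      "prioritizedFields": {
        "titleField": { "fieldName": "filename" },
        "prioritizedContentFields": [
          { "fieldName": "content" }
        ]
      }
    }
  ]
}
```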