February 2024

New Embedding Models

  • We now support embedding generation using OpenAI’s text-embedding-3-small and text-embedding-3-large models.

  • To select the embedding model, set the embedding_model parameter in the POST body of /embeddings and other API endpoints. If no model is specified, the system defaults to OPENAI (the original Ada-2).

  • Find more details on the models available here.
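As a rough illustration of the embedding_model parameter, the sketch below builds a POST body for /embeddings. Only the embedding_model parameter and the OPENAI default come from the notes above; the "query" field and the helper name build_embeddings_body are assumptions made for the example, and the identifiers for the two new models should be taken from the model documentation rather than guessed.

```python
import json
from typing import Optional


def build_embeddings_body(query: str, embedding_model: Optional[str] = None) -> dict:
    """Build a JSON-serializable POST body for /embeddings.

    "query" is an assumed field name used only for illustration.
    Leaving embedding_model unset omits the key, so the service
    falls back to its default (OPENAI, the original Ada-2).
    """
    body = {"query": query}
    if embedding_model is not None:
        body["embedding_model"] = embedding_model
    return body


# Explicitly request the default model; other identifiers are listed in the docs.
print(json.dumps(build_embeddings_body("refund policy", "OPENAI")))
```

Omitting the key instead of sending a null value keeps the request identical to what older clients already send.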

Return HTML for Webpages

  • presigned_url field under user_files_v2 now returns a pre-signed URL to the raw HTML content for each web page.

  • parsed_text_url field still returns a pre-signed URL for the corresponding plain text.

  • Find more details here.

Return Website Tags in File Metadata

  • file_metadata field under user_files_v2 now returns og:image and og:description for each web page.

  • Find more details here.

Omit Content by CSS Selector

  • You can now exclude elements matching specific CSS selectors from web scraping. Text content within these elements does not appear in the parsed plaintext, chunks, or embeddings. This is useful for omitting irrelevant elements, such as headers or footers, that might affect semantic search results.

  • The web_scrape request object supports a new field:

    • css_selectors_to_skip: Optional[list[str]] = []

  • Find more details here.
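A minimal sketch of a web_scrape request that uses the new field. Only css_selectors_to_skip and its type come from the notes above; the WebScrapeRequest class name, the url field, and the example selectors are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class WebScrapeRequest:
    # "url" is an assumed field; the real request object may differ.
    url: str
    # Mirrors the documented field: optional, defaults to an empty list.
    css_selectors_to_skip: Optional[list[str]] = field(default_factory=list)


request = WebScrapeRequest(
    url="https://example.com/blog",
    css_selectors_to_skip=["header", "footer", ".cookie-banner"],
)

# Convert to a plain dict that can be serialized as the JSON request body.
payload = asdict(request)
print(payload)
```

An empty list behaves like omitting the field, so existing requests are unaffected.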

JSON File Support

  • We’ve added support for JSON files via local upload and third-party connectors.

  • How It Works:

    • The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot-separated path to the key’s value. Each component of the path is either a string key (for a nested object) or an integer index (for a nested list).

    • max_items_per_chunk is a parameter that determines how many JSON objects to include in a single chunk.

    • A new chunk is created when either the max_items_per_chunk or the chunk_size limit is reached. For example:

      • If each JSON object is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is not set, then each chunk will contain 3 JSON objects (750 tokens; a fourth object would exceed 800).

      • If each JSON object is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is set to 1, then each chunk will contain 1 JSON object.

  • Find more details here.
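The flattening and chunking rules described above can be sketched as follows. This is an illustrative approximation, not the parser's actual implementation: the function names flatten and chunk_items are hypothetical, token counts are assumed to be computed elsewhere, and placing an object larger than chunk_size into its own chunk is an assumption.

```python
from typing import Any, Optional


def flatten(obj: Any, prefix: str = "") -> dict:
    """Flatten nested dicts and lists into one level of dot-separated paths.

    Dict keys and list indices both become path components.
    Empty nested containers produce no entries in this sketch.
    """
    flat = {}
    items = obj.items() if isinstance(obj, dict) else enumerate(obj)
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def chunk_items(
    token_counts: list[int],
    chunk_size: int,
    max_items_per_chunk: Optional[int] = None,
) -> list[list[int]]:
    """Group object indices into chunks, closing a chunk when either limit is hit."""
    chunks, current, current_tokens = [], [], 0
    for index, tokens in enumerate(token_counts):
        over_tokens = current_tokens + tokens > chunk_size
        over_items = (
            max_items_per_chunk is not None and len(current) >= max_items_per_chunk
        )
        # Never close an empty chunk, so an oversized object still gets a chunk.
        if current and (over_tokens or over_items):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


record = {"id": 1, "user": {"name": "Ada", "tags": ["a", "b"]}}
print(flatten(record))
# {'id': 1, 'user.name': 'Ada', 'user.tags.0': 'a', 'user.tags.1': 'b'}

print(chunk_items([250] * 6, chunk_size=800))
# [[0, 1, 2], [3, 4, 5]]
print(chunk_items([250] * 3, chunk_size=800, max_items_per_chunk=1))
# [[0], [1], [2]]
```

The two chunk_items calls reproduce the examples above: three 250-token objects fit under an 800-token chunk_size, and max_items_per_chunk set to 1 forces one object per chunk regardless of size.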