June 2024

New Endpoint: /list_users

  • A new endpoint has been added to list all users under your organization.

    • Filters: Supports filtering by a list of customer_id values.

    • Pagination: The request body accepts limit and offset parameters.

    • Sorting: Sort by created_at or updated_at, in ascending or descending order.

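As a sketch, a /list_users request combining the filters, pagination, and sorting described above might be assembled like this. The exact nesting and field names are assumptions for illustration, not taken from the API schema:

```python
# Hypothetical /list_users request body; field names are illustrative.

def build_list_users_body(customer_ids, limit=100, offset=0,
                          order_by="created_at", order_dir="desc"):
    """Assemble a /list_users request body with filters, pagination, and sorting."""
    if order_by not in ("created_at", "updated_at"):
        raise ValueError("order_by must be created_at or updated_at")
    return {
        "filters": {"customer_id": customer_ids},   # filter by a list of customer_id values
        "pagination": {"limit": limit, "offset": offset},
        "order_by": order_by,
        "order_dir": order_dir,                     # "asc" or "desc"
    }

body = build_list_users_body(["cust_123", "cust_456"], limit=50)
```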

More Chunk Metadata

  • We’ve added chunk metadata for the following data sources to the /embeddings and /list_chunks_and_embeddings endpoints:

    • Websites:

"content_metadata": { "file_metadata": { "file_level_embedding_properties": { "title": "RGB to HEX", "description": "Convert RGB color codes to HEX HTML format for use in web design and CSS. Also converts RGBA to HEX." } } }

  • Email:

"content_metadata": { "file_metadata": { "file_level_embedding_properties": { "cc": "Swapnil Banga <swapnilbanga@outlook.com>, swapnil.banga@squareboat.com", "sender": "Swapnil Banga <swapnil.galaxy@gmail.com>", "timestamp": "2024-06-21T14:52:24Z" } } }

New Endpoint: /list_chunks_and_embeddings

  • A new endpoint, /list_chunks_and_embeddings, has been added. It is similar to the existing /text_chunks endpoint, with some key differences:

    • Scope: Retrieves chunks for multiple files that match the filter criteria, instead of just a single file.

    • Filters: The filters for this endpoint are the same as those found via user_files_v2, allowing for more granular filtering of chunks based on file-level data.

    • Ordering: The order_by parameter enables sorting based on file-specific attributes.

    • Pagination Behavior: The order_by, limit, and offset parameters now correspond to the initial query that filters files, before chunks and embeddings are fetched for the filtered files.

    • The count parameter still refers to the total count of all embeddings for all filtered files, not the total count of filtered files.
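Putting the pagination behavior together, a request sketch might look like the following. Only order_by, limit, and offset are named in the changelog; the other field names (filters, file_ids) are illustrative:

```python
# Hypothetical /list_chunks_and_embeddings request body. Note that
# order_by, limit, and offset apply to the initial query that filters
# files, before chunks and embeddings are fetched for those files.

def build_chunks_query(file_filters, order_by="created_at", limit=20, offset=0):
    return {
        "filters": file_filters,   # same file-level filters as user_files_v2
        "order_by": order_by,      # sorts on a file-specific attribute
        "pagination": {"limit": limit, "offset": offset},
    }

query = build_chunks_query({"file_ids": [101, 102]}, limit=10)
```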

Introducing /fetch_webpage Endpoint

  • We’re excited to announce a new and improved endpoint for fetching webpage URLs: /fetch_webpage. This endpoint offers an asynchronous way to retrieve webpage data and URLs.

    • Fetch URLs: The /fetch_webpage endpoint accepts a POST request with the url parameter as input.

    • Webhook Notifications: Upon completion of a webpage request, one of the following webhooks will be sent:

      • WEBPAGE_ERROR (object type: WEBPAGE): Indicates that the request failed. The webhook payload includes the corresponding webpage ID.

      • WEBPAGE_READY (object type: WEBPAGE): Indicates that the request succeeded. The webhook payload includes the corresponding webpage ID.

  • User Webpage History: Users can access their webpage request history by querying the /user_webpages endpoint. This endpoint returns the results of all the user’s webpage requests.
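A consumer of the two webhooks above might dispatch on the event type as in this sketch. The payload field names (webhook_type, object_id) are assumptions about the webhook shape, not taken from the documentation:

```python
# Dispatch on the /fetch_webpage webhook types described above.
# Payload field names are assumed for illustration.

def handle_webpage_webhook(payload: dict) -> str:
    event = payload.get("webhook_type")
    webpage_id = payload.get("object_id")   # assumed to carry the webpage ID
    if event == "WEBPAGE_READY":
        return f"ready:{webpage_id}"        # success: webpage data can be retrieved
    if event == "WEBPAGE_ERROR":
        return f"error:{webpage_id}"        # failure: retry or surface the error
    return "ignored"                        # unrelated webhook types pass through
```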

split_rows for Third-Party Data Sources and Carbon Connect

  • split_rows has been added for Excel and CSV files uploaded from third-party data sources and Carbon Connect.

    • For Carbon Connect, it is available as the splitRows parameter at the integration and file-extension level for LOCAL_FILES, and as part of the fileSyncConfig field for third-party integrations.

Notion Updates

  • Notion now supports the sync_files_on_connection parameter.

    • When set to true (the default), files selected via the Notion file picker will be synced immediately.

    • When set to false, permissions to access the selected files will still be granted, but users need to use Carbon’s file picker or the /integrations/items/list and /integrations/files/sync endpoints to sync the files.

  • The root_external_id field under /integrations/items/list is now returned for Notion files as well.
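The deferred-sync flow with sync_files_on_connection set to false could be sketched as follows. Apart from the sync_files_on_connection parameter itself, the request bodies and field names below are assumptions:

```python
# Sketch: connect Notion without immediate sync, then sync chosen items later.

def build_connect_request():
    # Grant access but defer syncing of the files picked in the Notion file picker.
    return {"service": "NOTION", "sync_files_on_connection": False}

def build_sync_request(file_ids):
    # Later, sync specific items via /integrations/files/sync (body is illustrative).
    return {"service": "NOTION", "ids": file_ids}

connect = build_connect_request()
sync = build_sync_request(["notion_page_1", "notion_page_2"])
```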

Slack Connector Launch

  • We’ve officially launched our Slack Connector.

  • How It Works:

    • OAuth and Token Refresh

      • The Slack integration uses OAuth for authentication, with automatic token refresh.

    • Conversation Listing

      • Users can list their Slack conversations using the /integrations/slack/conversations endpoint.

      • The endpoint supports filtering by conversation type:

        • Public channels

        • Private channels

        • Private messages

        • Group conversations

  • Conversation Sync

    • There will be no automatic or global file sync for Slack. Instead, a dedicated sync endpoint /integrations/slack/sync is available, which accepts conversation_id (required) and after (date) filter parameters.

    • Messages are synced in 15-minute blocks. For example, all messages between 2:15 and 2:30 will be synced together. Replies posted outside a message's 15-minute block are still synced in the same block as the original message.

    • Currently, only message content is synced. Attachments and reactions are not included.
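Per the description above, a sync call takes a required conversation_id and an optional after date filter. A minimal sketch, where only those two parameter names come from the changelog:

```python
# Build a /integrations/slack/sync request body: conversation_id is
# required; `after` optionally limits the sync to messages after a date.

def build_slack_sync_request(conversation_id, after=None):
    if not conversation_id:
        raise ValueError("conversation_id is required")
    body = {"conversation_id": conversation_id}
    if after is not None:
        body["after"] = after   # e.g. "2024-06-01"
    return body

req = build_slack_sync_request("C0123456789", after="2024-06-01")
```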

Chunk Metadata for CSV and Excel Files

  • We added row metadata to each chunk result. Moving forward, the beginning and ending row numbers in the table corresponding to a specific chunk will be returned under content_metadata.

  • Similar to the page numbers and x/y coordinates we return for PDF chunks, the beginning and ending row numbers allow you to directly reference where in a file the chunk was found.

FILE_DELETED Webhook Update

  • We pushed an update to fire the FILE_DELETED webhook if you manually re-sync a file via the /integrations/oauth_url, /integrations/connect, resync_file, and /integrations/files/sync endpoints.

  • This means that if you pass in the external ID of a folder and we find that an item in the folder has been deleted, we delete the file and fire the webhook.

New CSV Parameter: split_rows

  • We have introduced a new optional query parameter called split_rows that accepts a boolean value. This parameter provides more flexibility when handling CSV rows that exceed max token limits.

  • Here’s how it works:

    • If split_rows is set to true:

      • CSV rows will be automatically split if they exceed either the specified chunk size or the maximum token limit of the embedding model.

      • This allows for processing of larger CSV rows without encountering errors.

    • If split_rows is set to false (default value):

      • The behavior remains the same as before.

      • If a CSV row exceeds the limits, an error will be thrown.

  • The default value of split_rows is set to false to ensure backwards compatibility.

  • This parameter is currently available for local file uploads via the API and will be rolled out for external data sources and Carbon Connect shortly.
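The behavior described above can be modeled with a small sketch (this is an illustration of the semantics, not Carbon's implementation; the whitespace split below stands in for the embedding model's tokenizer):

```python
# Illustrative model of split_rows: oversized rows either error (default)
# or are split into chunks that fit within the token limit.

def chunk_csv_row(row_text, max_tokens, split_rows=False):
    tokens = row_text.split()   # naive stand-in for real tokenization
    if len(tokens) <= max_tokens:
        return [row_text]
    if not split_rows:
        # Default behavior: rows exceeding the limit raise an error.
        raise ValueError("CSV row exceeds the maximum token limit")
    # split_rows=True: break the row into pieces within the limit.
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

chunks = chunk_csv_row("a b c d e", max_tokens=2, split_rows=True)
```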

Dropbox Business Support

  • We’ve launched support for Dropbox Business. Users with Dropbox Business accounts can now access and sync files from team folders shared with them.

  • Users will need to reconnect their existing Dropbox connections if they want to start syncing files from their teams.

Inclusion and Exclusions for Sitemaps and Web Scrapes

  • We’ve added a new feature that allows you to filter web and sitemap scrapes by specific URL paths. This enhancement gives you greater control over the data you collect, enabling you to focus on the most relevant content for your needs.

    • For sitemaps, you can now include or exclude URLs based on their paths using the following parameters:

      • url_paths_to_include

        • Description: Includes only sitemap URLs that contain any of the specified paths.

        • Value: A list of up to 10 strings representing the URL paths to include.

        • Example: url_paths_to_include: ["/products", "/collections"]

      • url_paths_to_exclude

        • Description: Filters out sitemap URLs that contain any of the specified paths.

        • Value: A list of up to 10 strings representing the URL paths to exclude.

        • Example: url_paths_to_exclude: ["/products", "/collections"]

    • For web scrapes, you can now specify the starting paths based on the URL paths:

      • url_paths_to_include

        • Description: The scrape will start at the specified paths, and if a recursion depth is set, it will only include links that also contain these paths.

        • Value: A list of up to 10 strings representing the URL paths to include.

        • Example: url_paths_to_include: ["/products", "/collections"]
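Combining the parameters above, a sitemap scrape request could be sketched like this. The 10-path cap mirrors the documented limit, while the overall body shape is an assumption:

```python
# Hypothetical sitemap-scrape request body using the path filters above.

MAX_PATHS = 10  # documented cap: up to 10 paths per filter list

def build_sitemap_scrape(url, include=None, exclude=None):
    include = include or []
    exclude = exclude or []
    if len(include) > MAX_PATHS or len(exclude) > MAX_PATHS:
        raise ValueError("at most 10 URL paths per filter list")
    return {
        "url": url,
        "url_paths_to_include": include,   # keep URLs containing any of these paths
        "url_paths_to_exclude": exclude,   # drop URLs containing any of these paths
    }

req = build_sitemap_scrape("https://example.com/sitemap.xml",
                           include=["/products", "/collections"])
```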

