June 2024
New Endpoint: /list_users
A new endpoint has been added to list all users under your organization.
Filters: Filter users by a list of customer_id values.
Pagination: The request body requires a pagination limit and offset.
Sorting: Sort by created_at or updated_at, in ascending or descending order.
Find more details here.
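As a rough illustration of how the filter, pagination, and sorting options could fit together, here is a minimal sketch that builds a request body. The field names (filters, customer_ids, pagination, order_by, order_dir) and the helper build_list_users_body are assumptions made for this example, not the documented schema; check the API reference for the exact shape.

```python
from typing import Any, Optional

# Sort fields named in the changelog; the direction values are assumed.
ALLOWED_ORDER_BY = {"created_at", "updated_at"}
ALLOWED_ORDER_DIR = {"asc", "desc"}


def build_list_users_body(
    customer_ids: Optional[list[str]] = None,
    limit: int = 10,
    offset: int = 0,
    order_by: str = "created_at",
    order_dir: str = "desc",
) -> dict[str, Any]:
    """Build a JSON-serializable body for /list_users (hypothetical field names)."""
    if order_by not in ALLOWED_ORDER_BY:
        raise ValueError(f"order_by must be one of {sorted(ALLOWED_ORDER_BY)}")
    if order_dir not in ALLOWED_ORDER_DIR:
        raise ValueError(f"order_dir must be one of {sorted(ALLOWED_ORDER_DIR)}")
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")

    body: dict[str, Any] = {
        "pagination": {"limit": limit, "offset": offset},
        "order_by": order_by,
        "order_dir": order_dir,
    }
    # Only send a filter when the caller actually supplied customer IDs.
    if customer_ids:
        body["filters"] = {"customer_ids": list(customer_ids)}
    return body
```

The resulting dictionary can be serialized with json.dumps and sent as the POST body by whichever HTTP client you already use.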
More Chunk Metadata
We’ve added chunk metadata for the following data sources to the /embeddings and /list_chunks_and_embeddings endpoints:
Websites:
"content_metadata": { "file_metadata": { "file_level_embedding_properties": { "title": "RGB to HEX", "description": "Convert RGB color codes to HEX HTML format for use in web design and CSS. Also converts RGBA to HEX." } } }
Email:
"content_metadata": { "file_metadata": { "file_level_embedding_properties": { "cc": "Swapnil Banga <swapnilbanga@outlook.com>, swapnil.banga@squareboat.com", "sender": "Swapnil Banga <swapnil.galaxy@gmail.com>", "timestamp": "2024-06-21T14:52:24Z" } } }
New Endpoint: /list_chunks_and_embeddings
A new endpoint, /list_chunks_and_embeddings, has been added. It is similar to the existing /text_chunks endpoint, with some key differences:
Retrieve chunks for multiple files that match the filter criteria instead of just a single file.
Filters: The filters for this endpoint are the same as those for user_files_v2, allowing more granular filtering of chunks based on file-level data.
Ordering: The order_by values enable sorting based on file-specific attributes.
Pagination Behavior: The order_by, limit, and offset parameters now apply to the initial query that filters files, before chunks and embeddings are fetched for the filtered files.
The count parameter still refers to the total count of all embeddings for all filtered files, not the total count of filtered files.
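The pagination semantics above are easy to misread, so here is a small in-memory simulation of them. It is only a sketch: the function name, the record layout (id, created_at, chunks), and the response keys (results, count) are invented for illustration and do not reflect the real response schema.

```python
from typing import Any


def simulate_list_chunks(
    filtered_files: list[dict[str, Any]],
    limit: int,
    offset: int,
    order_by: str = "created_at",
) -> dict[str, Any]:
    """Simulate file-level pagination followed by chunk retrieval."""
    # Step 1: order_by, limit, and offset apply to the filtered files.
    ordered = sorted(filtered_files, key=lambda f: f[order_by])
    page = ordered[offset : offset + limit]

    # Step 2: chunks are fetched only for the files on this page.
    results = [
        {"file_id": f["id"], "chunk": chunk}
        for f in page
        for chunk in f["chunks"]
    ]

    # count covers the chunks of every filtered file, not just this page,
    # and it is not the number of files.
    count = sum(len(f["chunks"]) for f in filtered_files)
    return {"results": results, "count": count}
```

With three filtered files holding 3, 2, and 4 chunks, a call with limit=2 and offset=0 returns the 5 chunks of the first two files, while count is 9.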
Introducing the /fetch_webpage Endpoint
We’re excited to announce a new and improved endpoint for fetching webpage URLs: /fetch_webpage. This endpoint offers an asynchronous way to retrieve webpage data and URLs.
Fetch URLs: The /fetch_webpage endpoint accepts a POST request with the url parameter as input.
Webhook Notifications: Upon completion of a webpage request, one of the following webhooks will be sent:
WEBPAGE_ERROR (object type: WEBPAGE): Indicates that the request failed. The webhook payload includes the corresponding webpage ID.
WEBPAGE_READY (object type: WEBPAGE): Indicates that the request succeeded. The webhook payload includes the corresponding webpage ID.
User Webpage History: Users can access their webpage request history by querying the /user_webpages endpoint. This endpoint returns the results of all the user’s webpage requests.
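Because /fetch_webpage is asynchronous, the result arrives through a webhook. The sketch below shows one way a receiver might dispatch on the two event types. The payload keys (webhook_type, obj, object_type, object_id) are assumptions chosen for this example; only the event names and the WEBPAGE object type come from the notes above.

```python
from typing import Any


def handle_webpage_webhook(payload: dict[str, Any]) -> tuple[str, Any]:
    """Return (status, webpage_id) for a webhook payload with an assumed shape."""
    event = payload.get("webhook_type")  # hypothetical key
    obj = payload.get("obj", {})         # hypothetical key

    # Ignore webhooks that are not about webpage requests.
    if obj.get("object_type") != "WEBPAGE":
        return ("ignored", None)

    webpage_id = obj.get("object_id")    # hypothetical key
    if event == "WEBPAGE_READY":
        return ("ready", webpage_id)
    if event == "WEBPAGE_ERROR":
        return ("failed", webpage_id)
    return ("ignored", None)
```

A caller would typically record the returned webpage ID and, on "ready", look up the result through /user_webpages.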
split_rows for Third-Party Data Sources and Carbon Connect
split_rows has been added for Excel and CSV files uploaded from third-party data sources and Carbon Connect.
For Carbon Connect, it is available as a parameter (splitRows) at the integration and file extension level for LOCAL_FILES, and as part of the fileSyncConfig field for third-party integrations.
Notion Updates
Notion now supports the sync_files_on_connection parameter.
When set to true (default), files selected via the Notion file picker will be synced immediately.
When set to false, permissions to access the selected files will still be granted, but users need to use Carbon’s file picker or the /integrations/items/list and /integrations/files/sync endpoints to sync the files.
The root_external_id field under /integrations/items/list is now returned for Notion files as well.
Slack Connector Launch
We’ve officially launched our Slack Connector.
How It Works:
OAuth and Token Refresh
Slack integration will use OAuth for authentication.
Conversation Listing
Users can list their Slack conversations using the /integrations/slack/conversations endpoint.
The endpoint supports filtering by conversation type:
Public channels
Private channels
Private messages
Group conversations
Conversation Sync
There is no automatic or global file sync for Slack. Instead, a dedicated sync endpoint, /integrations/slack/sync, is available. It accepts the filter parameters conversation_id (required) and after (a date).
Messages are synced in 15-minute blocks. For example, all messages between 2:15 and 2:30 will be synced together. Replies to messages outside the block will still be synced in the same block.
Currently, only message content is synced. Attachments and reactions are not included.
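To make the 15-minute blocks concrete, the sketch below groups message timestamps into blocks. It assumes blocks are aligned to quarter-hour boundaries (:00, :15, :30, :45), which matches the 2:15 to 2:30 example but is not stated explicitly, and it does not model reply handling. The function names are illustrative.

```python
from collections import defaultdict
from datetime import datetime


def block_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 15-minute block."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def group_into_blocks(
    messages: list[tuple[datetime, str]],
) -> dict[datetime, list[str]]:
    """Group (timestamp, text) pairs by the block they fall into."""
    blocks: dict[datetime, list[str]] = defaultdict(list)
    for ts, text in messages:
        blocks[block_start(ts)].append(text)
    return dict(blocks)
```

Under this assumption, messages sent at 2:15:00 and 2:29:59 land in the same block, while a message at 2:30:00 starts the next one.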
Chunk Metadata for CSV and Excel Files
We added row metadata to each chunk result. Going forward, the beginning and ending row numbers in the table corresponding to a specific chunk will be returned under content_metadata.
Similar to the page numbers and x/y coordinates we return for PDF chunks, the beginning and ending row numbers allow you to directly reference where in a file the chunk was found.
FILE_DELETED Webhook Update
We pushed an update to fire the FILE_DELETED webhook when you manually re-sync a file via the /integrations/oauth_url, /integrations/connect, resync_file, and /integrations/files/sync endpoints.
This means that if you pass in the external ID of a folder and we find that an item in the folder has been deleted, we delete the file and fire the webhook.
New CSV Parameter: split_rows
We have introduced a new optional query parameter called split_rows that accepts a boolean value. This parameter provides more flexibility when handling CSV rows that exceed max token limits.
Here’s how it works:
If split_rows is set to true:
CSV rows will be automatically split if they exceed either the specified chunk size or the maximum token limit of the embedding model.
This allows for processing of larger CSV rows without encountering errors.
If split_rows is set to false (default value):
The behavior remains the same as before.
If a CSV row exceeds the limits, an error will be thrown.
The default value of split_rows is set to false to ensure backwards compatibility.
This parameter is currently available for local file uploads via API and will be rolled out for external data sources and Carbon Connect shortly.
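The two behaviors can be pictured with a small local simulation. This is only a sketch of the described semantics, not Carbon's implementation: it counts whitespace-separated words as tokens (a real embedding model uses its own tokenizer), folds the chunk size and model limit into a single max_tokens value, and the function name chunk_csv_row is made up for the example.

```python
def chunk_csv_row(row: str, max_tokens: int, split_rows: bool = False) -> list[str]:
    """Return the chunks produced for one CSV row under the split_rows setting."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be >= 1")

    # Assumption: one whitespace-separated word counts as one token.
    tokens = row.split()
    if len(tokens) <= max_tokens:
        return [row]

    if not split_rows:
        # Default behavior: an oversized row is an error.
        raise ValueError(
            f"row has {len(tokens)} tokens, which exceeds the limit of {max_tokens}"
        )

    # split_rows=True: break the row into pieces that each fit the limit.
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

Rows within the limit are returned unchanged in both modes, so enabling split_rows only affects rows that would otherwise fail.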
Dropbox Business Support
We’ve launched support for Dropbox Business. Users with Dropbox Business accounts can now access and sync files from team folders shared with them.
Users will need to reconnect their existing Dropbox connections if they want to start syncing files from their teams.
Inclusion and Exclusions for Sitemaps and Web Scrapes
We’ve added a new feature that allows you to filter web and sitemap scrapes by specific URL paths. This enhancement gives you greater control over the data you collect, enabling you to focus on the most relevant content for your needs.
For sitemaps, you can now include or exclude URLs based on their paths using the following parameters:
url_paths_to_include
Description: Keeps only the sitemap URLs that contain any of the specified paths.
Value: A list of up to 10 strings representing the URL paths to include.
Example:
url_paths_to_include: ["/products", "/collections"]
url_paths_to_exclude
Description: Filters out sitemap URLs that contain any of the specified paths.
Value: A list of up to 10 strings representing the URL paths to exclude.
Example:
url_paths_to_exclude: ["/products", "/collections"]
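The include and exclude rules for sitemaps can be approximated locally as shown below. This sketch assumes that "contain" means a plain substring match against the URL path and that both lists may be combined; the actual matching rules are not spelled out here. The helper filter_sitemap_urls is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse

MAX_PATHS = 10  # each list accepts up to 10 strings


def filter_sitemap_urls(
    urls: list[str],
    url_paths_to_include: Optional[list[str]] = None,
    url_paths_to_exclude: Optional[list[str]] = None,
) -> list[str]:
    """Keep URLs whose path matches the include list and avoids the exclude list."""
    include = url_paths_to_include or []
    exclude = url_paths_to_exclude or []
    if len(include) > MAX_PATHS or len(exclude) > MAX_PATHS:
        raise ValueError(f"at most {MAX_PATHS} paths are allowed per list")

    kept = []
    for url in urls:
        path = urlparse(url).path
        # Assumption: substring match on the path component only.
        if include and not any(p in path for p in include):
            continue
        if any(p in path for p in exclude):
            continue
        kept.append(url)
    return kept
```

For instance, passing url_paths_to_include=["/products"] keeps https://example.com/products/shoes and drops https://example.com/blog/post.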
For web scrapes, you can now specify the URL paths where the scrape starts:
url_paths_to_include
Description: The scrape will start at the specified paths, and if a recursion depth is set, it will only include links that also contain these paths.
Value: A list of up to 10 strings representing the URL paths to include.
Example:
url_paths_to_include: ["/products", "/collections"]
COPYRIGHT @ 2024 JCDT DBA CARBON