Guru Connector
The Guru connector allows users to sync collections, folders, and cards from their Guru account.
CCv3 support for Guru will be coming soon, and the enabledIntegration value is GURU.
See more details here.
Sync Filter for Email Attachments
Customers can specifically select to sync only emails that contain attachments.
You will still need to set sync_attachments to true and also set the following filter:
{ "key": "has", "value": "attachment" }
Auto-Refresh Synced Files List in CCv3
We now automatically refresh the synced file list whenever users select additional files using our in-house or third-party file picker view. This eliminates the need for users to manually refresh the view.
Updated Children Prop
The children prop of the CCv3 component now accepts any valid React node as the children of the modal, from a simple <div> to an entire component.
Here’s an example of how the children prop can be used:
children={ <button onClick={() => setOpen((prev) => !prev)}> Toggle Connect </button> }
Custom Styling for Carbon Connect
Users can now control styling of CCv3 by targeting the specific class names we’ve provided. This allows for complete customization to match the desired look and feel of the application.
For example, class names include:
cc-modal: Applies to the entire modal component
cc-modal-header: Targets the header section of the modal
cc-modal-footer: Targets the footer section of the modal
cc-modal-close: Applies to the close button of the modal
cc-modal-overlay: Targets the overlay background of the modal
By utilizing these class names, users can easily override the default styles and apply their own CSS rules to achieve the desired appearance.
OCR Support for JPG and PNG
We now support jpg, jpeg and png file formats for OCR.
In addition to the normal steps for enabling OCR, please set media_type to TEXT (via file upload and /integrations/oauth_url) so Carbon knows to process the image via OCR (versus generating image embeddings via our image embedding model).
HTML for Confluence Articles
We now return the raw HTML output for each Confluence article via the file_metadata.saved_filename object under user_files_v2.
Cancel Source Items Sync
We added an endpoint, /integrations/items/sync/cancel, to cancel data source syncs that are initiated via /integrations/items/sync.
This allows customers to manually stop syncing for user data sources where sync_status = SYNCING.
New Gmail Filter
We added a new Gmail filter to sync all emails sent from a given account. Example:
{ "filters": { "key": "in", "value": "sent" } }
Return Raw Notion Blocks
We now return the raw output (blocks) for each Notion page via saved_filename under user_files_v2 when include_raw_file is set to true.
Shared Google Drive Source Items
We now return shared Google Drive files and folders via /integrations/items/list.
Clearer Error Message for SYNC_ERROR Status
When a file goes into SYNC_ERROR from re-syncing via /resync_file because it has been deleted in source, sync_error_message will now say File not found in data source.
The webhook sent for that error will also contain sync_error_message in additional_information.
Slack UI in Carbon Connect v3 (3.0.0-beta32)
Select Conversations to Sync
After authenticating, users have full control over which conversations they want to sync via CCv3, including:
Public channels
Private channels
Direct messages (DMs)
Group DMs
Manage Synced Conversations
Users can manage their list of synced conversations at any time via CCv3.
Easily add or remove channels and DMs to adjust what gets synced between Slack and Carbon.
Carbon Connect Enhancements
Synced URLs for Web Scrapes (CCv3 beta30)
We now display synced URLs in a dedicated list view under the WEB_SCRAPE integration.
The default columns displayed in the list view are name, status, created_at.
Parent URLs will be displayed as “folders” and children URLs will be displayed as “files” within the folder.
When showFilesTab is set to false, we surface a Select files button in the account drop-down for users to sync new files.
Data Source Polling Interval
Added a new configuration property at the component level called dataSourcePollingInterval.
This property controls how frequently data sources are polled for any updates and events.
The value is specified in milliseconds (ms) and the minimum allowed value for this property is 3000 ms. The default is 8000 ms.
Speaker Diarization
Added includeSpeakerLabels for LOCAL_FILES integration and file extensions.
Added include_speaker_labels to fileSyncConfig for third-party connectors.
openFilesTabTo Param
The openFilesTabTo prop is set on the component level and determines which tab (FILE_PICKER or FILES_LIST) the user is taken to by default when they select an integration.
The prop takes a string value of either "FILE_PICKER" | "FILES_LIST".
This prop only applies when the customer has enabled Carbon’s in-house file picker.
We now display a banner when data source items are being synced. The user will still be able to select previously synced items for upload in the meantime.
Guru support in CCv3 has been added. The enabledIntegration is GURU.
We improved the file list view to be better optimized for mobile devices and ensured that the column headers and values align properly.
Pongo Reranking Model
We’ve added Pongo as a supported reranker model alongside Jina and Cohere.
Similar to Cohere and Jina reranking, users can now use PONGO_RERANKER in the following manner on the embeddings endpoint:
{ "query": "how is anime made?", "k": 5, "rerank": {"model": "PONGO_RERANKER"} }
Third-Party File Picker Behavior
We added a new parameter automatically_open_file_picker to the external file sync URLs: /integrations/oauth_url and /integrations/connect. When true, the file picker for Google Drive, Box, OneDrive, SharePoint, and Dropbox will automatically open when the user lands on the successful connection page.
It’s important to note that some users’ browsers may have popup blockers that could prevent this parameter from functioning. In such cases, the user may receive a prompt from their browser asking for permission to allow popups from the platform. If the user grants permission, the feature will work as intended for future syncs.
It’s worth mentioning that OneDrive and SharePoint behave differently due to Microsoft treating the file picker as a separate app. Instead of directly opening the file picker, it will trigger another OAuth prompt. If the user consents to the file picker OAuth, the file picker will then automatically open afterwards.
Speaker Diarization
Speaker diarization has been added for audio transcription models. This allows us to format chunks so that the text is organized by utterances and each utterance will be labeled with the speaker. It’ll take this format:
[Speaker A] speaker A's utterance
[Speaker B] speaker B's utterance
For local file uploads, there is a new parameter include_speaker_labels. For external file uploads, the file_sync_config object can take a new property include_speaker_labels. When either is set to true, speaker diarization will be enabled for the audio transcription services.
Minor note: Speaker labels may appear differently depending on the transcription service. Deepgram uses numbers to label speakers while AssemblyAI uses letters.
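As a rough illustration of consuming this format, the sketch below splits a diarized chunk back into speaker/utterance pairs. It assumes each utterance starts on its own line with a [Speaker X] prefix, where X is a letter or a number; the helper name parse_utterances and the handling of unlabeled continuation lines are assumptions, not part of Carbon's API.

```python
import re
from typing import List, Tuple

# Matches a leading "[Speaker X]" label followed by the utterance text.
# X may be a letter (AssemblyAI-style) or a number (Deepgram-style).
_LABEL = re.compile(r"^\[Speaker (\w+)\]\s*(.*)$")


def parse_utterances(chunk_text: str) -> List[Tuple[str, str]]:
    """Split a diarized chunk into (speaker, utterance) pairs."""
    utterances: List[Tuple[str, str]] = []
    for raw_line in chunk_text.splitlines():
        line = raw_line.strip()
        match = _LABEL.match(line)
        if match:
            utterances.append((match.group(1), match.group(2)))
        elif utterances and line:
            # Assumption: an unlabeled line continues the previous utterance.
            speaker, text = utterances[-1]
            utterances[-1] = (speaker, f"{text} {line}")
    return utterances
```

Because the label is captured as plain text, the same helper works whether the transcription service labels speakers with letters or numbers.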
request_id on Additional Webhooks
request_id is now included in the following webhook events under the additional_information object for external files: UPDATE, FILES_CREATED, FILE_READY, FILE_ERROR, FILES_SKIPPED, FILE_SYNC_LIMIT_REACHED
Cold Storage for Files (Beta)
Overview
Carbon supports moving file embeddings between hot and cold storage. This feature allows you to optimize storage costs and improve performance by keeping embeddings for frequently accessed files in hot storage (vector storage) while moving less frequently used files to cold storage (object storage).
Enabling Cold Storage
By default, the cold storage feature is not enabled. Once enabled, files will automatically be moved to cold storage after a set period of inactivity. To enable cold storage, you must set a flag at file upload time. Currently cold storage is only available for local file uploads via /uploadfile, /upload_text and /upload_file_from_url.
Moving Files from Hot to Cold Storage
Once enabled, files will be automatically moved from hot to cold storage after a specified period of inactivity. This period is determined by the time_to_move_to_cold_storage parameter, which represents the number of seconds a file must be inactive before it’s moved to cold storage. There is no manual way to move files to cold storage.
You can make an API request to the /modify_cold_storage_parameters endpoint, which allows customers to update existing files to use cold storage.
Moving Files from Cold to Hot Storage
To move files from cold to hot storage, you must make an API request to /move_to_hot_storage. The request will take filters similar to /user_files_v2, and all files matching the provided filters will be moved to hot storage.
To avoid a single request hogging resources, there is a limit of 200 files that can be moved in one request. If the number of files matching the filters exceeds 200, the files will be processed in batches of 200 over a longer period of time.
/embeddings Endpoint Behavior
If a request is made to /embeddings that involves files in cold storage, an error will be returned that includes all file_ids for the affected files. This allows the client to know which files need to be moved to hot storage before the request can be processed.
However, if exclude_cold_storage_embeddings is set to true, any files in cold storage will be ignored, and no error will be thrown for requests involving files in cold storage. The search will then naturally exclude those files.
In the future, we may enable a way to allow /embeddings to work with files that are in both cold and hot storage.
File Object Information
Activity is defined as when a file was last used, which currently includes file re-syncs, queries involving that file, and updates to file tags.
The following fields under the file object (under user_files_v2) are related to cold storage:
last_use: A timestamp indicating when a file was last used (i.e., when it last had activity).
supports_cold_storage: A flag indicating whether or not a file can be moved to cold storage.
time_to_move_to_cold_storage: An integer representing the number of seconds a file must be inactive before it’s moved to cold storage.
embedding_storage_status: The storage status of the embeddings for a file, indicating whether they are in cold or hot storage.
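To make the relationship between these fields concrete, here is a minimal sketch that estimates when a file becomes eligible for cold storage. It assumes last_use arrives as an ISO 8601 string (for example "2024-06-21T14:52:24Z"); the function names are hypothetical and the actual move is performed by Carbon, not by client code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cold_storage_due_at(file_obj: dict) -> Optional[datetime]:
    """Return when the file becomes eligible for cold storage, or None."""
    if not file_obj.get("supports_cold_storage"):
        return None
    seconds = file_obj.get("time_to_move_to_cold_storage")
    last_use = file_obj.get("last_use")
    if seconds is None or last_use is None:
        return None
    # Assumption: last_use is an ISO 8601 timestamp; "Z" is normalized
    # so that datetime.fromisoformat accepts it on older Python versions.
    last_used_at = datetime.fromisoformat(last_use.replace("Z", "+00:00"))
    return last_used_at + timedelta(seconds=seconds)


def is_due(file_obj: dict, now: Optional[datetime] = None) -> bool:
    """True if the inactivity period has elapsed as of `now` (UTC by default)."""
    due_at = cold_storage_due_at(file_obj)
    current = now or datetime.now(timezone.utc)
    return due_at is not None and current >= due_at
```

Any activity (a re-sync, a query, or a tag update) resets last_use, so the estimate should be recomputed from a fresh file object.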
New Cold Storage Webhooks
MOVED_TO_COLD_STORAGE - This event is fired when a file is moved to cold storage.
MOVED_TO_HOT_STORAGE - This event is fired when a file is moved to hot storage.
You can find our documentation on cold storage here.
Warnings Object to API Responses
In the next two weeks, we plan to add a warnings object to our API responses to display warning messages.
Here’s an example of how it looks:
{ "documents": [], "warnings": [ { "warning_type": "FILES_IN_COLD_STORAGE", "object_type": "FILE_LIST", "object_id": [ 47058 ], "message": "These files won't be queried because they are not in hot storage." } ] }
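A client could read this object roughly as follows. The sketch only relies on the fields shown in the example response above; the helper name cold_storage_file_ids is an assumption.

```python
from typing import Dict, List


def cold_storage_file_ids(response: Dict) -> List[int]:
    """Collect file IDs flagged by FILES_IN_COLD_STORAGE warnings."""
    file_ids: List[int] = []
    for warning in response.get("warnings", []):
        if warning.get("warning_type") == "FILES_IN_COLD_STORAGE":
            file_ids.extend(warning.get("object_id", []))
    return file_ids


example = {
    "documents": [],
    "warnings": [
        {
            "warning_type": "FILES_IN_COLD_STORAGE",
            "object_type": "FILE_LIST",
            "object_id": [47058],
            "message": "These files won't be queried because they are not in hot storage.",
        }
    ],
}
skipped = cold_storage_file_ids(example)
```

The collected IDs identify which files would need to be moved back to hot storage before they can be queried.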
Carbon Connect 3.0 (CCv3) Enhancements
We’ve added 3 new props to CCv3:
The showFilesTab (boolean) prop has been reintroduced to CCv3 with a default value of true. As a quick reminder, this prop allows customers to hide the file selector and file list view from the CCv3 component. It can be enabled or disabled at both the component and integration levels. If specified for a specific integration, it will override the component-level configuration.
The filesTabColumns (array) prop has been added on both the component and integration levels. This prop controls which columns are displayed and hidden in the file list view and accepts an array of strings with values “name”, “status”, “created_at”, and “external_url”.
The transcription_service (enum) prop has been added under fileSyncConfig, and transcriptionService for the LOCAL_FILES integration, to specify which speech-to-text model to use for transcriptions. You can specify the enum as ASSEMBLYAI or DEEPGRAM, but the prop defaults to DEEPGRAM.
Google Cloud Storage Connector
We launched our GCS connector that enables syncing files from buckets.
The Carbon Connect enabledIntegrations value for GCS is GCS.
See more specifics about our GCS connector here.
DigitalOcean Storage Connector
We launched our DigitalOcean Storage connector that enables syncing files from buckets.
The Carbon Connect enabledIntegrations value for DigitalOcean Spaces is S3 (CC support will be launched tomorrow).
The Spaces API is interoperable with the AWS S3 API, so DigitalOcean Spaces makes use of the existing S3 endpoints.
This means that the source of DigitalOcean files is S3. To differentiate between data sources and files from Spaces Object Storage, additional metadata has been added:
Data Source Metadata
data_source_metadata: Indicates the type of data source. Possible values include:
S3: Represents an Amazon S3 data source.
DigitalOcean Space: Represents a DigitalOcean Spaces data source.
File Metadata
file_metadata: Specifies the type of file. Possible values include:
S3 File: Represents a file stored in Amazon S3.
DigitalOcean Space File: Represents a file stored in DigitalOcean Spaces.
S3 Bucket: Represents a file representation for an S3 Bucket.
DigitalOcean Space Bucket: Represents a file representation for a DigitalOcean Space Bucket.
See more specifics about our DigitalOcean Spaces connector here.
New file_types_at_source Filter for /user_files_v2 and /embeddings
Introduced a new optional field file_types_at_source for /user_files_v2 and /embeddings.
The file_types_at_source field is an array type that currently accepts the following values:
TICKET
ARTICLE
This new field allows users to specify whether we return tickets, articles or both when retrieving content (files and embeddings) from Zendesk, Intercom and Freshdesk.
If file_types_at_source contains TICKET, ticket content from Zendesk, Intercom and Freshdesk is returned.
If file_types_at_source contains ARTICLE, article content from Zendesk, Intercom and Freshdesk is returned.
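The sketch below shows one way a client might attach this field to a request body while rejecting unsupported values. It assumes file_types_at_source sits at the top level of the payload, which the notes above do not confirm; the helper name with_file_types_at_source is hypothetical.

```python
from typing import Dict, Iterable

# The two values currently accepted, per the list above.
ALLOWED_FILE_TYPES_AT_SOURCE = {"TICKET", "ARTICLE"}


def with_file_types_at_source(payload: Dict, file_types: Iterable[str]) -> Dict:
    """Return a copy of `payload` with a validated file_types_at_source list."""
    requested = list(file_types)
    unknown = set(requested) - ALLOWED_FILE_TYPES_AT_SOURCE
    if unknown:
        raise ValueError(f"Unsupported file types: {sorted(unknown)}")
    # Assumption: the field is placed at the top level of the request body.
    return {**payload, "file_types_at_source": requested}


tickets_only = with_file_types_at_source({"query": "refund policy", "k": 5}, ["TICKET"])
```

Passing both TICKET and ARTICLE would request both kinds of content.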
AssemblyAI Integration for Audio Transcriptions
We are excited to announce that Carbon now supports multiple audio transcription services. In addition to our existing integration with Deepgram, we have added support for AssemblyAI, providing our users with more options and flexibility when transcribing audio files.
To accommodate the new transcription service, we have updated the following endpoints to accept a new parameter, transcription_service, that allows you to specify which service to use. Valid values are deepgram and assemblyai. If no value is specified, Deepgram will be used as the default transcription service.
For local files, the endpoints are:
/uploadfile
/upload_file_from_url
For external files, transcription_service is set within the file_sync_config parameter, under:
/integrations/oauth_url
/integrations/connect
/integrations/files/sync
Similar to files transcribed by Deepgram, files transcribed by AssemblyAI also have an additional saved file containing the full JSON response from the AssemblyAI service. To access the transcription response, query the files using the user_files_v2 endpoint with the include_additional_files parameter set to true.
Carbon Webhook Libraries
We have released our official webhook libraries for handling the verification of webhook signatures. You can find our updated documentation here, and access our libraries on GitHub here.
Zendesk Auto-Sync Update
We are thrilled to announce that the Zendesk connector now supports auto-sync.
Carbon can now sync any new articles with auto-sync enabled.
Help Center Categories are now synced into Carbon as files, and Help Center Categories and articles form a parent-child relationship.
Reconnecting Existing Zendesk Connections:
If you have existing Zendesk connections in Carbon, please note that you will need to reconnect them to enable the updates above.
Organization Connector Settings
The /organization endpoint now includes connector_settings in the response, providing additional information about the organization’s connector configurations, starting with permitted file formats.
The /organization/update endpoint has been updated to accept the data_source_config parameter, allowing customers to configure permitted file formats for organization users. The data_source_config parameter should be provided in the following format:
{ "data_source_configs": { "GOOGLE_DRIVE": { "allowed_file_formats": ["PDF", "DOCX"] }, "DROPBOX": { "allowed_file_formats": ["XLSX", "CSV"] }, "DEFAULT": { "allowed_file_formats": ["PDF", "DOCX", "XLSX", "NOTION"] } } }
DEFAULT is applied to all data sources that do not have configs defined.
If the data_source_config parameter includes file formats that are not supported by Carbon, those formats will be ignored, and only the supported formats from each data source will be synced.
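The DEFAULT fallback described above can be sketched as a simple lookup. This is an illustration of the rule, not Carbon's implementation; the function name allowed_formats is hypothetical, and returning None when neither the source nor DEFAULT is configured is an assumption.

```python
from typing import Dict, List, Optional


def allowed_formats(configs: Dict[str, Dict], data_source: str) -> Optional[List[str]]:
    """Resolve allowed file formats for a data source, falling back to DEFAULT."""
    if data_source in configs:
        return configs[data_source].get("allowed_file_formats")
    if "DEFAULT" in configs:
        return configs["DEFAULT"].get("allowed_file_formats")
    # Assumption: None stands for "nothing configured for this source".
    return None


configs = {
    "GOOGLE_DRIVE": {"allowed_file_formats": ["PDF", "DOCX"]},
    "DROPBOX": {"allowed_file_formats": ["XLSX", "CSV"]},
    "DEFAULT": {"allowed_file_formats": ["PDF", "DOCX", "XLSX", "NOTION"]},
}
```

With the example configuration, Google Drive resolves to its own list, while a source without its own entry resolves to the DEFAULT list.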
Carbon Self-Hosting on AWS
Starting today, customers have the option to host a Carbon instance on their own cloud, with full access to all features of our managed solution, including data connectors, hybrid search, and more.
We’re launching on Microsoft Azure and Google Cloud later next month!
Book a demo if you’re interested to learn more: https://cal.com/carbon-ai/30min
Confluence Enhancements
We’ve made improvements to the Confluence Connector related to the following:
Auto-Sync Improvements
The auto-sync process will now index new pages that are added to a previously synced parent page. If a user syncs their entire Confluence account, then the space will be the top-most file.
If pages are deleted from a synced parent page in Confluence, the scheduled sync will remove them from the synced content.
File Metadata Enhancements
The file_metadata property now includes additional information about the type of Confluence item each file represents (spaces and pages).
The file_metadata property will also record the external_id of the file’s parent and root, providing better context and hierarchy information.
To take advantage of these updates, users will need to reconnect their Confluence account and re-sync their Confluence files.
Reranker Models for Search
We are excited to introduce native support for reranker models. With this release, customers now have the option to rerank search result chunks to provide more relevant and accurate results.
How it works:
When making a search query via the embeddings endpoint, customers can control the reranking behavior by setting the rerank parameter in the payload.
If rerank is set to "JINA_MULTILINGUAL_BASE_V2", the search result chunks will be reranked using the Jina reranking algorithm.
If rerank is set to "COHERE_RERANK_MULTILINGUAL_V3", the search result chunks will be reranked using the Cohere reranking algorithm.
If the rerank parameter is not specified or set to any other value, the default ranking will be used.
The response format from the embeddings endpoint remains consistent regardless of whether rerank is enabled or not.
We’ll be adding support for more reranker models in the weeks to come!
New Webhook: WEBSCRAPE_URLS_READY
We’ve added a new webhook named WEBSCRAPE_URLS_READY that triggers each time a specific web page from a web scrape request is finished processing.
Introducing Carbon Connect 3.0
We’re thrilled to announce the beta release of Carbon Connect 3.0, packed with exciting updates and improvements, based on customer feedback.
Key Features and Improvements
1. Seamless File and Folder Uploads
Carbon Connect 3.0 now supports both file and folder uploads by default, eliminating the need for the filePickerMode property. Uploading entire folder directories is now a breeze with our new drag-and-drop functionality.
2. Carbon’s In-House File Picker
Carbon’s in-house file picker is now available for all connectors, except for Slack, Gmail, and Outlook (currently in development). To use Carbon’s file picker instead of the source’s file picker, simply set the new useCarbonFilePicker property to true.
3. Enhanced In-Modal Notifications
We’ve completely replaced toast notifications with in-modal notifications, providing a more cohesive and user-friendly experience. As a result, the enableToasts property has been removed.
4. Customizable Theme Options
Personalize your Carbon Connect experience with our new theme options. Use the theme property to set the application’s theme to light, dark, or auto (default). When set to auto, Carbon Connect will automatically adapt to your system’s theme.
5. Simplified File Limit Control
Limiting the number of files is now easier than ever. Simply set the maxFilesCount property to 1 to restrict uploads to a single file. The allowMultipleFiles property has been removed for a more straightforward approach.
Upcoming Enhancements
We’re continuously working to improve Carbon Connect and have exciting plans for the near future:
1. Enhanced Customization Options
We’re working on bringing back customization options from Carbon Connect 2.0, including loadingIconColor, primaryBackgroundColor, primaryTextColor, secondaryBackgroundColor, and secondaryTextColor.
2. Expanded In-House File Pickers
In the coming weeks, we’ll be launching Carbon’s in-house file pickers for Outlook, Slack, and Gmail, providing a consistent and seamless experience across all connectors.
Installation
You can install the new component for testing via the command npm install carbon-connect@beta. We plan to bring 3.0 out of beta by the end of the month!
Here’s a Loom video providing a quick walkthrough of the new modal: https://www.loom.com/share/b7b241fa5e5e4d0a92fb5e748d3d6ec3
External URLs Filter
A new external_urls filter has been added to the user_files_v2 endpoint.
This filter allows you to refine the results returned by the endpoint based on a list of external_urls passed.
File Deletion Enhancements
When a customer deletes a file from Carbon (via delete_files_v2), they can control whether the file row in the database is preserved or marked as deleted.
This behavior is managed by the preserve_file_record flag. If preserve_file_record is set to true, then we delete the files stored in our S3/GCS while keeping the file record and metadata to allow for re-syncs and auto-syncs.
We also added a file_contents_deleted field to the user_files_v2 endpoint. If the field is returned as true, then the file record still exists, but the stored file content is deleted.
Find more details here.
High Accuracy Mode
We’ve introduced a new optional boolean parameter to the /embeddings endpoint called high_accuracy. If set to true, then vector search may give more accurate results at a slight performance penalty. By default, it’s false.
Find more details here.
To And From Filters for Outlook and Gmail
We added 2 more filters for syncing emails from Outlook and Gmail:
to: Supports an email (email@address.com) as a string to which the email was sent.
from: Supports an email (email@address.com) as a string from which the email was sent.
Note: Outlook only supports from filters.
Intercom Auto-Sync Update
We are thrilled to announce 2 updates to our Intercom connector:
Carbon can now sync multiple Intercom Help Centers:
Help Centers are now synced into Carbon as files, and Help Center and articles form a parent-child relationship.
Just as only published articles are synced, only activated Help Centers will be synced.
Carbon can now sync any new published articles when auto-sync is enabled.
Reconnecting Existing Intercom Connections:
If you have existing Intercom connections in Carbon, please note that you will need to reconnect them to enable the updates above.
New Endpoint: /list_users
A new endpoint has been added to list all users under your organization.
Filters: Filter by a list of customer_id values.
Pagination: The request body needs a pagination limit and offset.
Sorting: Sort by created_at and updated_at, ascending or descending.
Find more details here.
More Chunk Metadata
We’ve added chunk metadata for the following data sources to the /embeddings and /list_chunks_and_embeddings endpoints:
Websites:
"content_metadata": { "file_metadata": { "file_level_embedding_properties": { "title": "RGB to HEX", "description": "Convert RGB color codes to HEX HTML format for use in web design and CSS. Also converts RGBA to HEX." } } }
Email:
"content_metadata": { "file_metadata": { "file_level_embedding_properties": { "cc": "Swapnil Banga <swapnilbanga@outlook.com>, swapnil.banga@squareboat.com", "sender": "Swapnil Banga <swapnil.galaxy@gmail.com>", "timestamp": "2024-06-21T14:52:24Z" } } }
New Endpoint: /list_chunks_and_embeddings
A new endpoint, /list_chunks_and_embeddings, has been added. This endpoint is similar to the existing /text_chunks endpoint but with some key differences:
Retrieve chunks for multiple files that match the filter criteria instead of just a single file.
Filters: The filters for this endpoint are the same as those found via user_files_v2, allowing for more granular filtering of chunks based on file-level data.
Ordering: The order_by values enable sorting based on file-specific attributes.
Pagination Behavior: The order_by, limit, and offset parameters now correspond to the initial query that filters files, before chunks and embeddings are fetched for the filtered files.
The count parameter still refers to the total count of all embeddings for all filtered files, not the total count of filtered files.
Introducing /fetch_webpage Endpoint
We’re excited to announce a new and improved endpoint for fetching webpage URLs: /fetch_webpage. This endpoint offers an asynchronous way to retrieve webpage data and URLs.
Fetch URLs: The /fetch_webpage endpoint accepts a POST request with the url parameter as input.
Webhook Notifications: Upon completion of a webpage request, one of the following webhooks will be sent:
WEBPAGE_ERROR (object type: WEBPAGE): Indicates that the request failed. The webhook payload includes the corresponding webpage ID.
WEBPAGE_READY (object type: WEBPAGE): Indicates that the request succeeded. The webhook payload includes the corresponding webpage ID.
User Webpage History: Users can access their webpage request history by querying the /user_webpages endpoint. This endpoint returns the results of all the user’s webpage requests.
split_rows for Third-Party Data Sources and Carbon Connect
split_rows has been added for Excel and CSV files uploaded from third-party data sources and Carbon Connect.
For Carbon Connect, it is available as a parameter (splitRows) on the integration and file extension level for LOCAL_FILES and as part of the fileSyncConfig field for third-party integrations.
Notion Updates
Notion now supports the sync_files_on_connection parameter.
When set to true (default), selected files via the Notion file picker will be synced immediately.
When set to false, permissions to access the selected files will still be granted, but users need to use Carbon’s file picker or the /integrations/items/list and /integrations/files/sync endpoints to sync the files.
The root_external_id field under /integrations/items/list is now returned for Notion files as well.
Slack Connector Launch
We’ve officially launched our Slack Connector.
How It Works:
OAuth and Token Refresh
Slack integration will use OAuth for authentication.
Conversation Listing
Users can list their Slack conversations using the /integrations/slack/conversations endpoint.
The endpoint supports filtering by conversation type:
Public channels
Private channels
Private messages
Group conversations
Conversation Sync
There will be no automatic or global file sync for Slack. Instead, a dedicated sync endpoint, /integrations/slack/sync, is available, which accepts conversation_id (required) and after (date) filter parameters.
Messages are synced in 15-minute blocks. For example, all messages between 2:15 and 2:30 will be synced together. Replies to messages outside the block will still be synced in the same block.
Currently, only message content is synced. Attachments and reactions are not included.
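To illustrate the 15-minute grouping, the sketch below maps a message timestamp to the block it would fall into. It assumes blocks are aligned to quarter hours, as the 2:15 to 2:30 example suggests, and that the block start is inclusive; both the alignment and the helper name sync_block are assumptions.

```python
from datetime import datetime, timedelta
from typing import Tuple

BLOCK = timedelta(minutes=15)


def sync_block(timestamp: datetime) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the 15-minute block containing `timestamp`."""
    # Round the minute down to the nearest quarter hour and drop sub-minute parts.
    start = timestamp.replace(
        minute=timestamp.minute - timestamp.minute % 15,
        second=0,
        microsecond=0,
    )
    return start, start + BLOCK


start, end = sync_block(datetime(2024, 1, 1, 14, 20, 5))
```

For a message sent at 14:20:05, the computed block runs from 14:15 to 14:30.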
Chunk Metadata for CSV and Excel Files
We added row metadata to each chunk result. Moving forward, the beginning and ending row number in the table corresponding to a specific chunk will be returned under content_metadata.
Similar to the page numbers and x/y coordinates we return for PDF chunks, the beginning and ending row numbers allow you to directly reference where in a file the chunk was found.
FILE_DELETED Webhook Update
We pushed the update to fire the FILE_DELETED webhook if you manually re-sync a file via the /integrations/oauth_url, /integrations/connect, /resync_file and /integrations/files/sync endpoints.
This means if you pass in the external ID of a folder, and we find that an item in the folder is deleted, then we will delete the file and fire the webhook.
New CSV Parameter: split_rows
We have introduced a new optional query parameter called split_rows that accepts a boolean value. This parameter provides more flexibility when handling CSV rows that exceed max token limits.
Here’s how it works:
If split_rows is set to true:
CSV rows will be automatically split if they exceed either the specified chunk size or the maximum token limit of the embedding model.
This allows for processing of larger CSV rows without encountering errors.
If split_rows is set to false (default value):
The behavior remains the same as before.
If a CSV row exceeds the limits, an error will be thrown.
The default value of split_rows is set to false to ensure backwards compatibility.
This param is currently available for local file uploads via API and will be rolled out for external data sources and Carbon Connect shortly.
Dropbox Business Support
We’ve launched support for Dropbox Business. Users with Dropbox Business accounts can now access and sync files from team folders shared with them.
Users will need to reconnect their existing Dropbox connections if they want to start syncing files from their teams.
Inclusion and Exclusions for Sitemaps and Web Scrapes
We’ve added a new feature that allows you to filter web and sitemap scrapes by specific URL paths. This enhancement gives you greater control over the data you collect, enabling you to focus on the most relevant content for your needs.
For sitemaps, you can now include or exclude URLs based on their paths using the following parameters:
url_paths_to_include
Description: Filters sitemap URLs that contain any of the specified paths.
Value: A list of up to 10 strings representing the URL paths to include.
Example:
url_paths_to_include: ["/products", "/collections"]
url_paths_to_exclude
Description: Filters out sitemap URLs that contain any of the specified paths.
Value: A list of up to 10 strings representing the URL paths to exclude.
Example:
url_paths_to_exclude: ["/products", "/collections"]
For web scrapes, you can now specify the starting paths based on the URL paths:
url_paths_to_include
Description: The scrape will start at the specified paths, and if a recursion depth is set, it will only include links that also contain these paths.
Value: A list of up to 10 strings representing the URL paths to include.
Example:
url_paths_to_include: ["/products", "/collections"]
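The sitemap include and exclude rules can be previewed client-side with a sketch like the one below. It assumes "contain" means a substring match against the URL's path component and that exclusions are applied after inclusions; the function name filter_sitemap_urls and the example URLs are hypothetical, and Carbon's own matching may differ.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse

MAX_PATHS = 10  # Each list accepts up to 10 strings, per the notes above.


def filter_sitemap_urls(
    urls: Iterable[str],
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
) -> List[str]:
    """Keep URLs whose path contains any `include` entry and no `exclude` entry."""
    for paths in (include, exclude):
        if paths and len(paths) > MAX_PATHS:
            raise ValueError(f"At most {MAX_PATHS} paths are allowed")
    kept = []
    for url in urls:
        path = urlparse(url).path
        if include and not any(p in path for p in include):
            continue
        if exclude and any(p in path for p in exclude):
            continue
        kept.append(url)
    return kept


urls = [
    "https://shop.example/products/1",
    "https://shop.example/blog/post",
    "https://shop.example/collections/summer",
]
included = filter_sitemap_urls(urls, include=["/products", "/collections"])
```

With the include list from the example, only the product and collection URLs remain.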
Webhook Health Monitoring
We added more robust health check logic for webhook URLs.
If a URL is flagged as unhealthy (and marked as status FLAGGED), the system will automatically poll the URL every 10 seconds to check its status and fire a new webhook event called CHECKUP per poll request.
For CHECKUP events, there is no requirement to verify the signature, although you still have the option to do so if desired.
When receiving a CHECKUP event, it is safe to simply return a 200 response without any additional processing.
If a successful response is received during the health check, the URL will be re-activated.
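A receiving endpoint might short-circuit CHECKUP events along these lines. The sketch assumes the event type is carried in a field named webhook_type, which is not confirmed by the notes above; the handler name, the callbacks, and the 401 response for a failed signature check are likewise assumptions.

```python
from typing import Callable, Dict


def handle_webhook(
    event: Dict,
    verify_signature: Callable[[], bool],
    process: Callable[[Dict], None],
) -> int:
    """Return the HTTP status code to send back for a webhook event."""
    # Assumption: the event type is carried in a "webhook_type" field.
    if event.get("webhook_type") == "CHECKUP":
        # Health-check pings need no signature check and no processing.
        return 200
    if not verify_signature():
        return 401
    process(event)
    return 200
```

Answering CHECKUP quickly with a 200 is what allows a flagged URL to be re-activated.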
Notifications via Email
We are excited to announce the launch of email notifications to keep our customers informed about important events and actions occurring on our platform. In this initial release, we have implemented the following email notifications:
Webhook Events Paused
Trigger: This notification is sent when a webhook has been temporarily paused due to failing to return a response 20 times within a 60-second window.
Purpose: To alert customers about any interruptions in webhook functionality and provide them with timely information to investigate and resolve the issue.
Webhook Events Unpaused
Trigger: This notification is sent when a previously paused webhook has been unpaused after our system’s polling mechanism (which runs every 10 seconds) determines that the webhook is healthy and responsive again.
Purpose: To inform customers that the webhook has resumed normal operation and that data flow has been restored.
Video Embeddings Support
We now support embedding generation for videos, allowing you to run semantic search on the video content based on the similarity of a video snippet to the search query or the text within the video frames, similar to OCR.
/uploadfile now takes a new optional parameter called media_type, whose value comes from the FileContentTypes enum. If media_type isn’t provided, all video file formats default to audio processing.
Currently videos are supported via the uploadfile and upload_file_from_url endpoints, but we’ll be adding support for third-party connectors and Carbon Connect soon.
We support the following video file formats:
AVI
FLV
MKV
MOV
MP4
MPEG
MPG
WEBM
WMV
The maximum file size is 1 GB, but it can be increased upon request.
See more details here.
Please note that video embedding generation takes much longer than text and image embeddings. For example, it took 60-90s to embed a 3-minute video.
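Because video embedding is slow, it can be worth rejecting unsupported files on the client before uploading. The sketch below checks the extension list and the default size limit stated above. It assumes that 1 GB means 1024^3 bytes, and can_upload_video is a hypothetical helper, not part of any Carbon SDK.

```python
from pathlib import Path

# Formats listed in the changelog entry above.
SUPPORTED_VIDEO_EXTENSIONS = {
    "avi", "flv", "mkv", "mov", "mp4", "mpeg", "mpg", "webm", "wmv",
}
# Assumption: the 1 GB default limit is interpreted as 1 GiB.
MAX_VIDEO_BYTES = 1024 ** 3


def can_upload_video(filename: str, size_bytes: int) -> bool:
    """Check a file against the supported video formats and default size limit."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in SUPPORTED_VIDEO_EXTENSIONS and size_bytes <= MAX_VIDEO_BYTES
```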
Intercom Tickets Integration
We’re thrilled to announce that our Intercom connector now has support for tickets.
The /integrations/oauth_url and integrations/connect endpoints sync articles by default. To customize the sync behavior, use the file_sync_config parameter.
You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.
To start syncing ticket content, the Intercom scope should include:
To sync user articles only, add these scopes:
Read one admin
Read and List Articles
To sync both user articles and tickets, also add:
Read and list users and companies
Read tickets
The following ticket information is available as tags for filtering:
{ "ticket_type": "Support Request", "ticket_status": "resolved", "ticket_category": "Customer", "ticket_submitter": "example.user@projectmap.com", "ticket_assigned_team": "Technical", "ticket_assigned_admin": "swapnil@carbon.ai" }
Text chunks will include the conversation history (comments on the ticket).
You can find more details here.
New Webhook Statuses
Each created webhook will now have a status of either ACTIVE or FLAGGED, which is returned in the webhooks endpoint response.
ACTIVE: The webhook is operating normally and successfully receiving events.
FLAGGED: The webhook URL failed to return a response more than 20 times within a 60-second window. This indicates a potential issue with your webhook URL that you should check. If a webhook is moved to the FLAGGED status, please contact us to update it.
Incremental Syncs for Gmail and Outlook
We have introduced incremental syncs for the following endpoints for Gmail and Outlook:
/integrations/items/sync
/integrations/connect
/integrations/oauth_url
How It Works
By setting incremental_sync to true, only new or updated files since the last sync will be re-synced. This means that if a file has already been synced and hasn’t been modified, it will be skipped during the next sync.
If the embedding properties or tags of a file change between sync requests, those specific files will be re-synced.
Carbon sends a FILE_SKIPPED webhook event for files skipped during the incremental sync. The body of the webhook will contain a list of file_ids for files and a reason in additional_information.
This update addresses a common problem where files would be re-synced if a user went through the 3rd-party file selector to select files that had already been synced. With incremental syncs, this issue is resolved, ensuring that only truly new or updated files are synchronized.
Note: Incremental syncs are already enabled for Box, Dropbox, OneDrive and Google Drive.
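The skip rule described under How It Works can be read as a simple predicate. The sketch below only illustrates that rule as worded in this entry; it is not Carbon's actual logic, and the SyncedFile fields and the should_resync name are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SyncedFile:
    # Hypothetical record of what was stored at the last sync.
    modified_at: datetime
    last_synced_at: datetime
    embedding_properties: dict
    tags: dict


def should_resync(file: SyncedFile, requested_properties: dict, requested_tags: dict) -> bool:
    """Return True when an incremental sync would re-sync the file."""
    # Modified in the source since the last sync.
    if file.modified_at > file.last_synced_at:
        return True
    # Embedding properties changed between sync requests.
    if file.embedding_properties != requested_properties:
        return True
    # Tags changed between sync requests.
    return file.tags != requested_tags
```

A file for which this returns False is one that would be skipped and reported through the FILE_SKIPPED event.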
Aggregated Usage Metrics Update
We’re excited to announce several improvements to how we aggregate and expose file statistics across the API.
The following metrics will now be returned via the /organization and /user endpoints:
aggregate_file_size
aggregate_num_characters
aggregate_num_tokens
aggregate_num_embeddings
aggregate_num_files_by_source
aggregate_num_files_by_file_format
To fetch the most up-to-date metrics via the organization endpoint moving forward, take the following steps:
The endpoint /organization/statistics takes no parameters and submits a request to asynchronously re-aggregate organization file statistics.
When the re-aggregation is complete, a webhook of the event type FILE_STATISTICS_AGGREGATED will be sent.
After receiving that event, making a request to /organization will return the updated file statistics in the response body.
Additionally, a timestamp of when the file statistics were last updated can be found in file_statistics_aggregated_at.
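The last two steps can be sketched as a webhook callback that waits for the aggregation event and only then reads /organization. The HTTP call is injected as fetch_organization so the example stays self-contained. Two assumptions are made here: that the event type is exposed in a field named webhook_type, and that the statistics appear as top-level keys of the /organization response.

```python
from typing import Callable, Optional

# Field names taken from the list above.
STATISTIC_KEYS = (
    "aggregate_file_size",
    "aggregate_num_characters",
    "aggregate_num_tokens",
    "aggregate_num_embeddings",
    "aggregate_num_files_by_source",
    "aggregate_num_files_by_file_format",
    "file_statistics_aggregated_at",
)


def refreshed_statistics(event: dict, fetch_organization: Callable[[], dict]) -> Optional[dict]:
    """After FILE_STATISTICS_AGGREGATED, read the fresh values from /organization."""
    # Assumption: "webhook_type" is the field that names the event.
    if event.get("webhook_type") != "FILE_STATISTICS_AGGREGATED":
        return None
    organization = fetch_organization()  # e.g. a GET request to /organization
    return {key: organization.get(key) for key in STATISTIC_KEYS}
```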
fileSyncConfig Property for Carbon Connect
We have added a new fileSyncConfig prop for Carbon Connect that is set at the component or integration level and accepts the following properties:
auto_synced_source_types (AutoSyncedSourceTypes array): An array specifying the types of sources to automatically sync files from.
sync_attachments (boolean): Set to true to enable synchronization of attachments, or false to disable attachment syncing. Currently applies to helpdesk tickets.
detect_audio_language (boolean): Set to true to enable automatic detection of audio language during file upload, or false to disable audio language detection.
Deepgram Audio Language Detection
This feature enables automatic language detection for audio file uploads.
Added a new optional query parameter detect_audio_language.
When set to true, Deepgram will automatically detect the language of the uploaded audio file.
Defaults to false if not specified.
Applies to the upload_files_from_url and uploadfile endpoints.
Updated Webhook Event: FILE_SYNC_LIMIT_REACHED
We have improved the functionality of the FILE_SYNC_LIMIT_REACHED webhook event to provide more granular information when users exceed file upload limits. This event will now be triggered in the following scenarios:
When a user attempts to upload files that would cause them to exceed the maximum number of allowed files (max_files).
When a user tries to upload more files than the maximum allowed per upload (max_files_per_upload).
When a user exceeds the daily 2.5GB file sync limit (existing functionality).
To differentiate between the three limit scenarios, we have introduced a new reason property in the event’s additional information. The reason property will have one of the following values:
Max files per upload limit exceeded.
Max files limit exceeded.
Organization daily limit for file sync has been reached.
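One way to act on the reason property is to map each documented string to the limit it refers to. In the sketch below the reason strings are copied from the list above, while the returned labels (in particular daily_sync_limit) and the classify_limit name are made up for illustration.

```python
# Reason strings as documented above, mapped to hypothetical internal labels.
REASON_TO_LIMIT = {
    "Max files per upload limit exceeded.": "max_files_per_upload",
    "Max files limit exceeded.": "max_files",
    "Organization daily limit for file sync has been reached.": "daily_sync_limit",
}


def classify_limit(additional_information: dict) -> str:
    """Translate the webhook's reason string into the limit that was hit."""
    reason = additional_information.get("reason", "")
    return REASON_TO_LIMIT.get(reason.strip(), "unknown")
```

Falling back to "unknown" keeps the handler from breaking if further reasons are added later.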
HTML File Support
We now support uploading .html files from local and third-party data sources.
Similar to other file formats, we provide the original .html file as well as a plain text version of the file as pre-signed URLs via the user_files_v2 endpoint.
Freshdesk Tickets Integration
We’re thrilled to announce that our Freshdesk connector now has support for tickets.
The /integrations/freshdesk and integrations/connect endpoints sync articles by default. To customize the sync behavior, use the file_sync_config parameter.
You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.
To start syncing ticket content, the Freshdesk API key should belong to a user with access to agents and tickets permissions.
The following ticket information is available as tags for filtering:
{ "ticket_type": "incident", "ticket_status": "open", "ticket_assignee": "swapnil+zen1@carbon.ai", "ticket_priority": "normal", "ticket_requester": "customer@example.com" }
Text chunks will include the conversation history (comments on the ticket).
You can find more details here.
New Webhook Type: SPARSE_VECTOR_GENERATION
We have introduced a new webhook event type SPARSE_VECTOR_GENERATION that is triggered when the queued status of sparse vector generation for a file changes. It is called SPARSE_VECTOR_QUEUE_STATUS and has object type CHUNK_LIST.
This new webhook includes an object in the additional_information with the key name sparse_vector_queue_status. The object has two fields:
sparse_vector_queue_status, which can be one of queued, aborted, or failed
sparse_vector_queue_error, which is null unless sparse_vector_queue_status is aborted or failed
See more details here.
parent_file_id for Embeddings
The embeddings response now includes a parent_file_id field for each chunk returned.
This field can contain an integer value representing the ID of the parent file, or null if there is no parent file associated with the embedding.
SharePoint and OneDrive Folder Selection and Syncing
You can now select an entire folder for upload, and Carbon will automatically include all nested subfolders and files. This brings our SharePoint and OneDrive functionality in line with popular services like Google Drive, Dropbox and Notion.
We have also introduced auto-sync for SharePoint and OneDrive folders. Any new folders and files added to your selected parent folder will be automatically detected and synced by Carbon. To enable auto-sync on folders, the user will need to re-upload the folders through the 3rd-party file picker.
Dropbox Folder Selection and Syncing
You can now select an entire folder for upload, and Carbon will automatically include all nested subfolders and files.
We have also introduced auto-sync for Dropbox folders. Any new folders and files added to your selected parent folder will be automatically detected and synced by Carbon, which brings our Dropbox functionality in line with popular services like Google Drive and Notion.
Webhook for Files Skipped
To improve visibility into your file processing pipeline, we’ve added a new webhook event: FILES_SKIPPED.
This event is triggered whenever Carbon skips processing for one or more files, such as when a file exceeds the size limits imposed by a third-party integration. The webhook payload will include a list of external_file_ids for the affected files, as well as an additional_information field with details on why processing was skipped. This allows you to easily identify and handle files that couldn’t be processed.
Zendesk Tickets Integration
We’re thrilled to announce that our Zendesk connector now has support for tickets.
The integrations/oauth_url and integrations/connect endpoints now sync articles by default. To sync only tickets or both articles and tickets, use the file_sync_config parameter. The file_sync_config parameter can also enable syncing attachments from ticket comments.
You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.
To start syncing ticket content, users must disconnect and reconnect their accounts with the new scopes. Don’t worry, disconnecting won’t affect your files.
The following ticket information is available as tags for filtering:
{ "ticket_type": "incident", "ticket_status": "open", "ticket_assignee": "swapnil+zen1@carbon.ai", "ticket_priority": "normal", "ticket_requester": "customer@example.com", "ticket_submitter": "swapnil+zen1@carbon.ai" }
Text chunks will include the conversation history (comments on the ticket).
You can find more details here.
Carbon Connect 2.0 Exits Beta
Carbon Connect 2.0 has officially exited beta as version 2.0.0.
Incremental Syncs for Data Sources
We have introduced incremental syncs for the following endpoints:
/integrations/items/sync
/integrations/connect
/integrations/oauth_url
How It Works
By setting incremental_sync to true, only new or updated files since the last sync will be re-synced. This means that if a file has already been synced and hasn’t been modified, it will be skipped during the next sync.
If the embedding properties or tags of a file change between sync requests, those specific files will be re-synced.
Carbon sends a FILE_SKIPPED webhook event for files skipped during the incremental sync. The body of the webhook will contain a list of file_ids for files and a reason in additional_information.
This update addresses a common problem where files would be re-synced if a user went through the 3rd-party file selector to select files that had already been synced. With incremental syncs, this issue is resolved, ensuring that only truly new or updated files are synchronized.
Note: Incremental syncs are only enabled on certain sources to start, including Box, Dropbox, OneDrive and Google Drive.
Re-Sync Child Files via resync_file Endpoint
When a file-id that belongs to a parent file (i.e., a folder) is submitted for re-sync via the resync_file endpoint, the associated child files will now also be re-synced.
This enhancement ensures that all related files within a folder hierarchy are properly synced when the parent file is re-synced.
Post Messages for Third-Party File Pickers
External data sources that utilize third-party file pickers will now post messages containing data of the selected file to the parent window when they are used in an iframe.
The message will be structured in the following format:
{ "event": "SELECTED", "data": list[{ "external_id": str, "parent_external_id": str | null, "name": str, "url": str | null, "is_folder": bool, "file_format": str | null, }], }
Note: Not all of the properties in the data list are available for every data source. For example, GDrive will have parent_external_id, but parent_external_id will always be null for Microsoft because its file picker does not return that data.
New Parameter include_containers
A new optional boolean parameter filters.include_containers has been added to the user_files_v2 API endpoint. This parameter allows you to control whether containers (folders) should be included in the API response.
When include_containers is set to false, the API will exclude folders from the response. This means that only files with actual content will be returned.
In addition to folders, the following types of files will also be excluded when include_containers is false:
RSS feed URLs
Email queries
GitBook spaces
GitHub directories
These excluded files typically group other files together but do not have any content themselves.
The default behavior of user_files_v2 remains unchanged. If the include_containers parameter is not provided or is set to true, folders will be included in the API response as before.
File Statistics Now Include MIME Type
file_statistics under the user_files_v2 endpoint now returns the MIME type of the file, providing more detailed information about each file.
Organization-Level User Settings
Introduced the ability to configure user settings at the organization level.
Use the /organization/update endpoint with the global_user_config parameter to set the following organization-wide user settings:
auto_sync_enabled_sources
max_file
max_files_per_upload
Find more details here.
Customizable Sync Page Copy
Organizations now have the ability to customize the copy on the sync page after a user has connected to an external source.
Customizable elements include:
Header text
Subheader text
Button text
To update the sync page copy, DM us to make the requested changes. This is a white label specific feature.
Please note that success and error messages are not customizable at this time.
File List for Local File Uploads
Added a new screen in Carbon Connect 2.0 (2.0.0-beta25) that displays a list of files uploaded locally by the user.
Use the showFilesTab configuration option to control whether this view is visible.
Limit File Uploads by Type
Organizations can now restrict the types of files that can be uploaded to Carbon.
File extension restrictions can be set per data source or globally for a given organization.
Users can still select disallowed file formats from the file picker, but these files will be ignored during the upload process.
To enable this feature, provide Carbon with a list of allowed file extensions, which must be a subset of Carbon’s supported file formats. A dedicated API endpoint will be coming soon!
New GitHub Endpoints
We’ve added two new endpoints to enhance the usability of the GitHub connector:
/integrations/github/repos: This endpoint allows users to retrieve a list of their GitHub repositories.
/integrations/github/sync_repos: This endpoint accepts a list of GitHub repository IDs, enabling users to list items from the specified repositories.
These new endpoints provide a more streamlined and efficient way to interact with GitHub repositories within Carbon.
GitHub Repository Selection Screen
We’ve introduced a dedicated screen in Carbon Connect 2.0 (2.0.0-beta24) for selecting GitHub repositories.
This new feature allows users to easily choose the repositories they want to sync and list items from. The repository selection screen is automatically displayed whenever a user connects their GitHub account.
This enhancement simplifies the process of managing GitHub repositories within Carbon Connect, providing a more intuitive and user-friendly experience.
Enhancements to Item Listing
We’ve added a new parameter called sync_source_items (or syncSourceItems in Carbon Connect) to give users more control over item syncing. By setting this parameter to false, users can prevent listing items from the corresponding connector.
By default, sync_source_items is set to true for all connectors, except for GitHub, where it is set to false. This default for GitHub helps prevent rate limit-related sync issues.
This enhancement provides users with greater flexibility in managing item syncing across different connectors.
Sorting Options for Source Items
We’ve introduced new sorting parameters, order_by and order_dir, for source items (/integrations/items/list). Users can now choose to sort items by the following criteria:
id: Sort items by their unique identifier.
name: Sort items alphabetically by their name.
directories_first: Sort folders first, followed by the remaining items. Both folders and files are sorted by name.
By default, items are sorted by name in ascending order (asc), maintaining the existing behavior. Please note that when directories_first is selected, the order_dir parameter is ignored.
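The ordering rules above can be mirrored locally, for instance to re-sort a cached item list. The sketch below is a reading of those rules rather than the server's code. It assumes each item is a dict with id, name and is_folder keys, and that the descending value of order_dir is spelled desc; only asc is named in this entry.

```python
def sort_items(items, order_by="name", order_dir="asc"):
    """Sort source items the way the order_by and order_dir rules describe."""
    if order_by == "directories_first":
        # order_dir is ignored: folders first, then files, each group by name.
        return sorted(items, key=lambda item: (not item["is_folder"], item["name"]))
    if order_by not in ("id", "name"):
        raise ValueError(f"unsupported order_by: {order_by}")
    # Assumption: "desc" is the descending counterpart of "asc".
    return sorted(items, key=lambda item: item[order_by], reverse=(order_dir == "desc"))
```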
External URLs in Salesforce
We now return the external URL for Salesforce Knowledge articles for Lightning users.
Support for Solar Embeddings
Exciting news! We’ve integrated Upstage’s Solar Embeddings into our platform, offering you a powerful new embedding model on Carbon.
To utilize this embedding model, specify the slug SOLAR for embedding_model.
You can find more details here.
FILE_CREATED for Web Scrape
We have expanded the FILE_CREATED webhook events to fire when files are generated from web scraping requests.
IS_RESYNC for FILE_READY Webhook
We’ve added a new boolean property additional_information.is_resync to the FILE_READY webhook event.
When it is false, the file was synced for the first time.
When it is true, the file was already synced previously so the current sync is a re-sync.
Carbon Connect 2.0 Is Exiting Beta
Carbon Connect 2.0 is exiting beta by this Friday!
This means if you run npm install carbon-connect moving forward and do not specify a version, we’ll install 2.0 by default.
If you need help or have any questions moving over to Carbon Connect 2.0, DM me.
Loading Screen for Carbon Connect 2.0 (carbon-connect@2.0.0-beta22)
We added a new component-level prop, loadingIconColor, which defines the color of the loader icon. This can be specified using standard CSS color names, or directly as either a Hexadecimal (Hex) code or RGB color values.
Support for Google Drive Shortcuts
Users can now seamlessly sync Google Drive shortcuts to reference the files and folders they point to.
How It Works:
For shortcuts within folders, a file object will be generated. When this shortcut file is synced, it will also synchronize its targeted file separately, though not as a child. Please note, there is no hierarchical relationship between a shortcut and its target.
If the shortcut is directly selected from Google’s file picker, a shortcut file object will not be created. Instead, the target will be synced directly.
Importantly, the shortcut file itself will not contain any parsed text or chunks. Instead, it acts as a pointer, with the file_metadata.target_external_file_id attribute identifying the file the shortcut targets.
New Webhook Events
We’ve introduced 2 additional webhook events to help track file sync statuses:
FILE_CREATED: This event is fired when a user queues up a file to be synced for the first time. The body of the webhook will contain a list of file_ids for files that were created in the same upload, and multiple events could fire for the same upload if a lot of files were queued.
ALL_UPLOADED_FILES_QUEUED: This event is fired when every single item in an upload has been queued for sync, including all children of folders in an upload. The body will contain the upload’s request_id.
A couple of notes:
Both file_ids and request_ids can be used to filter for the files in /user_files_v2.
A request_id is now always generated for an upload to support the ALL_UPLOADED_FILES_QUEUED webhook. Previously, it was only generated by the user (unless you’re using Carbon Connect) and passed to us as a parameter. You may still do that and we’ll use your generated request_id, but if you don’t, we’ll generate a request_id for you on behalf of the user’s upload.
These two webhooks are currently supported for 3rd party data sources only. Support for web scrapes and local file uploads will be coming soon.
You can find more details here.
GitHub Connector
We launched our GitHub integration today that syncs pages from both public and private repositories.
The Carbon Connect enabledIntegration slug for GitHub is GITHUB. You’ll need to update to 2.0.0-beta19 to access the new screen.
Users should first submit their GitHub username and access token to our integration endpoint at /integrations/github. You can then use our global endpoints for listing and syncing specific files in different repositories:
List files from repositories with the global endpoint /integrations/items/list
Sync files from repositories with the global endpoint /integrations/files/sync
See more specifics about our GitHub integration here.
Set Max Files Per Upload
A new user-level parameter, max_files_per_upload, has been introduced that can be modified via the /update_users endpoint. It determines the maximum number of files a user can upload in a single request.
Files that exceed the maximum number of files will be moved into the SYNC_ERROR status with webhooks being fired to alert you.
You can check the file_single_upload_limit set for a particular user via the user endpoint.
Find more details here.
Important Update: The parameter max_files now serves to establish the overall file upload limit for a user across all uploads.
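To anticipate which files would end up in SYNC_ERROR, a client can apply the per-upload limit itself before submitting. The sketch below assumes that files are counted in submission order, so that the first max_files_per_upload files are accepted and the remainder are rejected. That ordering is a guess, and split_upload is a hypothetical helper.

```python
def split_upload(file_ids, max_files_per_upload):
    """Split an upload into files within the limit and files over it."""
    if max_files_per_upload < 0:
        raise ValueError("max_files_per_upload must not be negative")
    # Assumption: files are counted in the order they were submitted.
    accepted = list(file_ids[:max_files_per_upload])
    rejected = list(file_ids[max_files_per_upload:])
    return accepted, rejected
```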
Add include_all_children to Embeddings Endpoint
Added param include_all_children to the embeddings endpoint. When this param is set to true, the search is run over all filtered files as well as their children.
Filters applied to the endpoint extend to the returned child files.
In-House File Picker for Confluence and Salesforce
We’re excited to introduce our in-house file picker, starting with Confluence and Salesforce. Our in-house file picker is still in beta, but you can test it out by manually running npm install carbon-connect@2.0.0-beta13.
With this update, end users gain the ability to directly select and upload specific files from Confluence and Salesforce. Previously, this functionality was unavailable as neither platform offered their own dedicated file pickers.
When syncFilesOnConnection is set to false, our file picker will be enabled.
Hiding 3rd-Party File Picker
The endpoints /integrations/oauth_url and /integrations/connect now support a new boolean parameter named enable_file_picker.
When enable_file_picker is set to true (default behavior), a button will be displayed on the success page. Clicking this button will open the file picker associated with the respective source. This is the standard behavior.
Conversely, setting enable_file_picker to false will hide the file picker button on the success page. In such cases, end users will be directed to use custom or in-house file pickers for file selection.
Sync Outlook and Gmail Attachments
We’ve introduced a new property called sync_attachments, which can be specified when syncing via the /integrations/gmail/sync and /integrations/outlook/sync endpoints. By default, this property is set to false.
Setting sync_attachments to true enables Carbon to automatically sync file attachments from corresponding emails. This includes not only traditional file attachments but also files (such as images) that are added in-line within emails.
Each file attachment will be assigned a unique file_id, with the parent_id corresponding to the email the file was attached to.
Please note that the same rules that apply to our file uploads also apply to attachments in terms of file size and supported extensions.
Set User File Limits
You have the flexibility to set the maximum number of files that a unique customer ID can upload using the file_upload_limit field on the update_users endpoint.
This value can be adjusted as needed, allowing you to tailor it according to your own plan limits.
Then you can check the upload limit set for a specific user via the custom_limits object on the user endpoint.
See details here.
Flags for OCR
Added ocr_job_started_at to the user_files_v2 response to denote whether OCR was enabled for a particular file.
Added additional OCR properties to be returned via ocr_properties, including whether table parsing was enabled.
See details here.
Role Management in Customer Portal
You now have the ability to manage who in your organization can create, delete, and view API keys.
Here’s a breakdown of the current roles available:
Admin: This role is empowered to both create and delete API keys.
User: Users with this role can view API keys.
Moving forward, these roles will determine user permissions and access across different sections of the Carbon Customer Portal.
You can access the customer portal via portal.carbon.ai
Expanded OCR Support in Carbon Connect
The prop useOCR can now be enabled on the integration level for the following connectors (in addition to local files):
OneDrive
Dropbox
Box
Google Drive
Zotero
SharePoint
The prop parsePdfTablesWithOcr can now be enabled on the integration level to parse tables with OCR when useOCR is set to true.
Please note OCR support is only applicable for PDFs at the moment.
You can find more details here.
Return chunk_index on the /embeddings Endpoint
We now return the chunk_index for specific chunks returned via the /embeddings endpoint.
You can find more details here.
Migrations between Embedding Models
You can now request migrations between embedding models with minimal downtime.
Email me if you’re interested. The cost per migration (not including embedding token costs) starts at $850 one-time.
New request_id Field
Carbon now accommodates the inclusion of a request_id within OAuth URLs, global sync endpoints, and custom sync endpoints (such as Gmail, Outlook, etc.), allowing users to define it as needed. Non-OAuth URL endpoints that auto-sync upon connection (e.g., Freshdesk, Gitbook) also support this value. The request_id serves as a filter for files through user_files_v2.
With Carbon Connect, setting the useRequestIds parameter to true will trigger automatic assignment of the request_id. This request_id will be returned in INITIATE and ADD/UPDATE callbacks.
It’s essential to note that this configuration adjustment is applicable at the component level rather than the integration level.
This enhancement is part of version 2.0.0-beta8.
Find more details here.
syncFilesOnConnection for More Data Sources
We’ve added the sync_files_on_connection parameter to the oauth_url endpoint for the following data sources: Intercom, Salesforce, Zendesk, Confluence, Freshdesk, and Gitbook.
This parameter is also accessible for each enabledIntegration in Carbon Connect. You can find more information about this here.
By default, this parameter is set to true. When enabled, all files will be synchronized automatically after a user connects their account. This is particularly useful when a user connects a data source that doesn’t have a built-in file picker.
Delete Child Files Based on Parent ID
Added a flag named delete_child_files to the delete_files endpoint. When set to true, it will delete all files that have the same parent_file_ids as the file submitted for deletion. This flag defaults to false.
Find more details here.
upload_chunks_and_embeddings Updates
You can now upload only chunks to Carbon via the upload_chunks_and_embeddings endpoint, and we can generate the embeddings for you. This is useful for migrations where you want to migrate between embedding models and vector databases.
In the API request, you can exclude embeddings and set chunks_only to true. Then, include your embedding model API key (OpenAI or Cohere) under custom_credentials.
{ "api_key": "lkdsjflds" }
Make sure to include some delay between requests. There are also stricter limits on how many embeddings/chunks can be uploaded per request if chunks_only is true. Each request can only include 100 chunks.
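The 100-chunk limit and the advice to pause between requests suggest a simple batching loop. In the sketch below the actual HTTP call is passed in as send so nothing here depends on a real client. The payload layout (a top-level chunks key next to chunks_only and custom_credentials) is an assumption, as is the one-second default delay.

```python
import time
from typing import Callable, Iterator, Sequence

MAX_CHUNKS_PER_REQUEST = 100  # limit stated above for chunks_only uploads


def batch_chunks(chunks: Sequence, size: int = MAX_CHUNKS_PER_REQUEST) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` chunks."""
    for start in range(0, len(chunks), size):
        yield chunks[start:start + size]


def upload_in_batches(
    chunks: Sequence,
    send: Callable[[dict], None],
    api_key: str,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send chunks in batches with a pause between requests; return the request count."""
    requests_sent = 0
    for batch in batch_chunks(chunks):
        if requests_sent:
            sleep(delay_seconds)  # space requests out, as the notes recommend
        # Assumption: hypothetical request body; check the API reference for the real shape.
        send({
            "chunks_only": True,
            "custom_credentials": {"api_key": api_key},
            "chunks": list(batch),
        })
        requests_sent += 1
    return requests_sent
```

Passing sleep as a parameter makes the pause easy to stub out in tests.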
Data Source Connections with Pre-Existing Auth
If you’re using our white labeling add-on, we added a new POST endpoint /integrations/connect so customers can bypass the authentication flow on Carbon by directly passing in an access token.
The request takes an authentication object that contains all the necessary pieces of data to connect to a user’s account. The object will vary by data source and a list specifying the required keys can be found in our docs. If the connection is successful, the upserted data source will be returned.
This endpoint also returns a sync url for some data source types that will initiate the sync process.
Improvements to CSV, TSV, XLSX, GSheet Parsing
You can now chunk CSV, TSV, XLSX, and Google Sheets files by tokens via the chunk_size parameter and/or by rows via the max_items_per_chunk parameter. When a file is processed, we add rows to a chunk until adding the next row would exceed chunk_size or max_items_per_chunk.
If a single row exceeds chunk_size or the embedding model’s token limit, the file’s sync_error_message will point out which row has too many tokens.
For example:
If each CSV row is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is not set, then each chunk will contain 3 CSV rows.
If each CSV row is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is set to 1, then each chunk will contain 1 CSV row.
It is therefore essential to ensure that the number of tokens in a single CSV row does not exceed the token limit of the embedding model. Token counting is currently only supported for OpenAI models.
You can find more details here.
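The row-packing rule described above can be sketched as a small simulation. This is an illustration of the stated behaviour, not Carbon's actual parser: it assumes the token count of each row is already known, and the function name and error message are made up for the example.

```python
from typing import Optional


def chunk_rows(row_token_counts: list, chunk_size: int,
               max_items_per_chunk: Optional[int] = None) -> list:
    """Group row indexes into chunks.

    A chunk is closed when adding the next row would exceed chunk_size
    tokens or max_items_per_chunk rows.
    """
    chunks = []
    current = []
    current_tokens = 0
    for index, tokens in enumerate(row_token_counts):
        if tokens > chunk_size:
            # Mirrors the documented case where one row is too large on its own.
            raise ValueError(f"row {index} has {tokens} tokens, more than chunk_size={chunk_size}")
        too_many_tokens = current_tokens + tokens > chunk_size
        too_many_items = max_items_per_chunk is not None and len(current) >= max_items_per_chunk
        if current and (too_many_tokens or too_many_items):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
```

Six rows of 250 tokens with chunk_size=800 come out as two chunks of three rows each. Adding max_items_per_chunk=1 gives six single-row chunks.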
Improvements to OCR
Table parsing in PDFs has been improved significantly with this most recent OCR update.
In order to use the enhanced table parsing features, you need to set parse_pdf_tables_with_ocr to true when uploading PDFs (use_ocr must also be true).
Any tables parsed when parse_pdf_tables_with_ocr is true have their own chunk(s) assigned to them. These chunks can be identified by the presence of the string TABLE in embedding_metadata.block_types.
The format of these tabular chunks will be the same format as CSV-derived chunks.
Using this table-parsing feature in conjunction with hybrid search should provide much better search results than before (assuming the PDF has tables that need to be searched over).
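A minimal sketch of picking out table chunks on the client side is shown below. It assumes each chunk has already been fetched and is available as a dictionary with an embedding_metadata object whose block_types value is a list of strings; the exact response shape may differ.

```python
def is_table_chunk(chunk: dict) -> bool:
    """Return True when the chunk's block types include the string TABLE."""
    metadata = chunk.get("embedding_metadata") or {}
    block_types = metadata.get("block_types") or []
    return "TABLE" in block_types


def split_table_chunks(chunks: list) -> tuple:
    """Separate table-derived chunks from all other chunks."""
    tables = [c for c in chunks if is_table_chunk(c)]
    others = [c for c in chunks if not is_table_chunk(c)]
    return tables, others
```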
If you’re using OCR, we now also return metadata such as coordinates and page numbers even if set_page_as_boundary is set to false.
Specifically, we will return the bounding box coordinates as well as the start and end page number of the chunk.
If pg_start < pg_end, the chunk spans multiple pages and you should interpret the bounding box coordinates slightly differently. x1 and x2 will correspond to the minimum x1 and maximum x2 over all pages for the chunk. y1 will correspond to the upper-most coordinate of the part of the chunk on pg_start, and y2 will correspond to the bottom-most coordinate of the part of the chunk on pg_end.
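To make the multi-page rule concrete, the sketch below combines hypothetical per-page boxes into one chunk-level box following the description above. The PageBox type and the assumption of one box per page are illustrative only; Carbon returns the combined values and does not expose this function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageBox:
    """Hypothetical per-page bounding box for the part of a chunk on one page."""
    page: int
    x1: float
    y1: float
    x2: float
    y2: float


def merge_chunk_boxes(boxes: list) -> dict:
    """Combine per-page boxes into a single chunk-level bounding box."""
    if not boxes:
        raise ValueError("at least one page box is required")
    ordered = sorted(boxes, key=lambda box: box.page)
    first, last = ordered[0], ordered[-1]
    return {
        "pg_start": first.page,
        "pg_end": last.page,
        "x1": min(box.x1 for box in ordered),  # minimum x1 over all pages
        "x2": max(box.x2 for box in ordered),  # maximum x2 over all pages
        "y1": first.y1,  # upper-most coordinate on pg_start
        "y2": last.y2,   # bottom-most coordinate on pg_end
    }
```

For a chunk that starts near the bottom of page 3 and ends near the top of page 4, y1 can therefore be larger than y2, so the two values should be read against different pages.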
Carbon Connect 2.0 (Beta)
We are thrilled to announce the beta launch of Carbon Connect 2.0, with the following improvements:
Support multiple active accounts per data source.
Improved data source list.
Built-in user interface for users to view and re-sync files per account.
Ability for users to directly disconnect active accounts.
To install Carbon Connect 2.0, run npm install carbon-connect@2.0.0-beta5. It is not treated as the latest version of Carbon Connect, so you won’t get this version automatically.
A few other important updates for Carbon Connect 2.0:
We’ve removed file details from the payload of UPDATE callbacks. If you used to get files in this way, you’ll now need to switch to using our SDK or API to get the updated files when a data source updates.
When specifying embedding models, use the format embeddingModel={EmbeddingGenerators.OPENAI_ADA_LARGE_1024} instead of writing out a string.
You can hide our built-in UI for viewing and re-syncing files using the showFilesTab param at either the global component or enabledIntegration level.
Scheduled Syncs Per User and Data Source
Control user and data source syncing using the /update_users endpoint, which allows organizations to enable syncing for particular users and data source types. The endpoint accepts a list of user IDs and data sources, with an option to enable syncing for all sources using the string 'ALL'.
Each request supports up to 100 customer IDs.
In the following example, future Gmail accounts for specified users will automatically have syncing enabled according to the provided settings.
{ "customer_ids": ["swapnil@carbon.ai", "swapnil.galaxy@gmail.com"], "auto_sync_enabled_sources": ["GMAIL"] }
Find more details in our documentation here.
Note: This update is meant to replace our file-level sync logic and any existing auto-syncs have been migrated over to use this updated logic.
Delete Files Based on Filters
We added the /delete_files_v2 endpoint, which allows customers to delete files via the same filters as /user_files_v2.
We plan to deprecate the /delete_files endpoint in a month.
Find more details in our documentation here.
Filtering for Child Files
We added the ability to include all descendant (child) files on both /delete_files_v2 and /user_files_v2 when filtering.
Filters applied to the endpoint extend to the returned child files.
We plan to deprecate the parent_file_ids filter on the /user_files_v2 endpoint in a month.
Customer Portal v1
We’ve officially launched v1 of our Customer Portal - portal.carbon.ai
You can currently manage your API keys directly via the Portal, and we plan to release the following functionality next quarter:
User management
Usage monitoring
Billing management
For current customers, you can reset your password with the email provided to Carbon to gain access. If you don’t know the email you have on file, DM me!
integrations/items/list Improvements
We are implementing four distinct filters: external_ids, ids, root_files_only, and name, each meant to filter data based on its respective field.
The root_files_only filter will exclusively return top-level files. However, if a parent_id is specified, then root_files_only can’t be specified, and vice versa.
The external_url field has been added to the response body of the integrations/items/list endpoint.
See more details here.
Multiple Active Accounts Per Data Source
Carbon now supports multiple active accounts per data connection!
We’ve introduced two new parameters across various API endpoints to support this functionality across all our connectors. While these parameters are optional for users with a single data source of each type, they become mandatory when managing multiple accounts.
/integrations/oauth_url
data_source_id: Specifies the data source from which synchronization should occur when dealing with multiple data sources of the same type.
connecting_new_account: This parameter is utilized to consistently generate an OAuth URL as opposed to a sync URL. A sync URL is the destination where users are redirected after a successful OAuth authentication to synchronize their files. While this parameter can be skipped when adding the first data source of that type, it should be explicitly specified for subsequent additions.
/integrations/s3/files, /integrations/outlook/sync, /integrations/gmail/sync
data_source_id: Used to specify the data source for synchronization when managing multiple data sources of the same type.
/integrations/outlook/user_folders, /integrations/outlook/user_categories, /integrations/gmail/user_labels
data_source_id: Specifies the data source to be utilized when there are multiple data sources of the same type.
Note that the following endpoints already require a data_source_id: /integrations/items/sync, /integrations/items/list, /integrations/files/sync, /integrations/gitbook/spaces, /integrations/gitbook/sync.
New Embedding Models
We now support embedding generation using OpenAI’s text-embedding-3-small and text-embedding-3-large models.
To define the embedding model, utilize the embedding_model parameter in the POST body for the /embeddings and other API endpoints. By default, if no specific model is provided, the system will use OPENAI (the original Ada-2).
Find more details on the models available here.
Return HTML for Webpages
The presigned_url field under user_files_v2 now returns a pre-signed URL to the raw HTML content for each web page. The parsed_text_url field still returns a pre-signed URL for the corresponding plain text.
Find more details here.
Return Website Tags in File Metadata
The file_metadata field under user_files_v2 now returns og:image and og:description for each web page.
Find more details here.
Omit Content by CSS Selector
You can now exclude specific CSS selectors from web scraping. This ensures that text content within these elements does not appear in the parsed plaintext, chunks, and embeddings. Useful for omitting irrelevant elements, such as headers or footers, which might affect semantic search results.
The web_scrape request object supports a new field: css_selectors_to_skip: Optional[list[str]] = []
Find more details here.
JSON File Support
We’ve added support for JSON files via local upload and 3rd party connectors.
How It Works:
The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot separated path to reach the key’s value. Each component of the path can either be a string for a nested object or integer for a nested list.
max_items_per_chunk is a parameter that determines how many JSON objects to include in a single chunk.
A new chunk is created when either the max_items_per_chunk or the chunk_size limit is reached. For example:
If each JSON object is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is not set, then each chunk will contain 3 JSON objects.
If each JSON object is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is set to 1, then each chunk will contain 1 JSON object.
Learn more details here.
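The flattening step described under How It Works can be sketched as a short recursive function. It illustrates the dot-separated path convention only and is not Carbon's parser. How empty objects and lists are treated is not stated above, so this sketch simply drops them.

```python
def flatten(value, prefix: str = "") -> dict:
    """Flatten nested dicts and lists into a single-level dict.

    Top-level keys are kept as-is; nested keys become dot-separated paths,
    using the key name for nested objects and the index for nested lists.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}  # leaf value: record it under its full path
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))  # empty containers contribute nothing (assumption)
    return flat
```

For example, flattening {"id": 1, "author": {"name": "Ada", "tags": ["x", "y"]}} yields the keys id, author.name, author.tags.0, and author.tags.1.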
Gitbook Connector
We launched our Gitbook integration today that syncs pages from any public and shared spaces.
The Carbon Connect enabledIntegrations value for Gitbook is GITBOOK.
Gitbook does not come with a pre-built file selector, so we added 2 endpoints for listing and syncing Gitbook spaces:
List all Gitbook spaces with /integrations/gitbook/spaces (API Reference)
Sync multiple spaces at once with /integrations/gitbook/sync (API Reference)
You can also use our global endpoints for listing and syncing specific pages in Gitbook spaces:
List pages in spaces with the global endpoint /integrations/items/list
Sync pages in spaces with the global endpoint /integrations/files/sync
Note: Spaces are treated like folders via the Carbon API.
See more specifics about our Gitbook integration here.
Note: our Gitbook page parser is still in beta, so feedback is much appreciated!
Delete Endpoint Update
We’re transitioning file deletion from sync to async processing.
This means that the FILE_DELETED webhook event will not fire immediately; instead, it will fire when the file is actually deleted.
We are also limiting each /delete_files request to 50 files to reduce the load on our servers. We advise spacing out delete requests every 24 hours.
Pinecone Integration
We’ve launched our Pinecone destination connector! We offer support for both pod-based and serverless offerings.
Carbon seamlessly updates your Pinecone instance with the latest embeddings upon processing user files. Users gain full access to Carbon’s API endpoints, including hybrid search for supported sparse vector storage.
Find more details here.
New Carbon SDKs
Moving forward, we will be able to provide support for a greater number of SDKs and promptly release SDK support for API updates. If there is a language for which you want us to add SDK support, we should be able to turn that around in less than a week.
We’re adding support for the following languages today:
The current Javascript SDK will continue to be supported for the next month, and it will be available longer term. However, new features that are introduced will only be supported in the new Typescript SDK moving forward.
Delete Users Endpoint
Added an endpoint /delete_users that takes an array of customer IDs and deletes all those users.
Deleting a user revokes all of the user’s OAuth connections and deletes all their files, embeddings and chunks.
The request format is:
{ "customer_ids": ["USER_1", "USER_2", "USER_3"] }
Find more details here.
Salesforce Connector is Live
All articles from an end user’s Salesforce Knowledge can be listed and synced via the global API endpoints /integrations/items/list and /integrations/files/sync.
The Carbon Connect integration (launching tomorrow) will sync all articles by default.
The enabledIntegrations value is SALESFORCE.
You can find more info here.
Outlook Folders
After connecting your Outlook account, you can use this endpoint to list all of your folders in Outlook.
This includes both system folders like inbox and user-created folders.
Find more details here.
Gmail Labels
After connecting a Gmail account, you can use the /integrations/gmail/user_labels endpoint to list all of your labels.
User-created labels will have the type user and Gmail’s default labels will have the type system.
Find more details here.
Carbon Connect Updates
Added support for JSON file formats and the maxItemsPerChunk param to specify the number of items to include in a specific chunk.
Added cssSelectorsToSkip to WEB_SCRAPE to define CSS selectors to exclude when converting HTML to plaintext.
Added SALESFORCE as an enabledIntegration on Carbon Connect.
For Salesforce, we added a param syncFilesOnConnection that defaults to true and will automatically sync all pages from a user’s Salesforce account.
We’ll be adding this param to other connectors too, meaning you can automatically sync all files from connectors that don’t have built-in file selectors (Gitbook, Confluence, etc).
This parameter is also added to the /integrations/oauth_url endpoint as sync_files_on_connection and also defaults to true.
Freshdesk Connector is Live
All Published articles from an end user’s Freshdesk knowledge base are synced when connected to Carbon.
The Carbon Connect enabledIntegrations value is FRESHDESK.
You can find more info here.
Speed Improvements to Hybrid Search
We improved the speed of hybrid search by a factor of 10 by creating sparse vector indexes at file upload time instead of at query time.
Steps to Enable:
Pass the following body to the /modify_user_configuration endpoint: { "configuration_key_name": "sparse_vectors", "value": { "enabled": true } }
Set the parameter generate_sparse_vectors to true via the /uploadfile endpoint.
We’ll be rolling out faster hybrid search support across 3rd party connectors in the upcoming weeks.
Deleting Files based on Sync Status
You can now delete file(s) based on sync_status via the delete_files endpoint.
We added 2 parameters:
sync_statuses - parameter to pass a list of sync statuses for file deletion.
For example, { "sync_statuses": ["SYNC_ERROR", "QUEUED_FOR_SYNC"] }. When this parameter value is passed, we will delete all files in the SYNC_ERROR and QUEUED_FOR_SYNC statuses that belong to the end user identified by the customer-id header of the request.
delete_non_synced_only - boolean parameter that limits deletion to files that have not been re-synced before.
For example, a previously synced Google Drive file enters the QUEUED_FOR_SYNC status again during a scheduled re-sync. Setting delete_non_synced_only to true would prevent this file from being deleted.
Files are deletable in all statuses except SYNCING, EVALUATING_RESYNC and QUEUED_FOR_OCR. Including SYNCING, EVALUATING_RESYNC or QUEUED_FOR_OCR in the list will result in an error response; files in these statuses must transition out of the status before they can be deleted.
Find more details here.
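Since passing a non-deletable status results in an error response, it can help to validate the list before sending the request. The helper below is a minimal sketch: the function name is hypothetical, and placing delete_non_synced_only at the top level of the body alongside sync_statuses is an assumption based on the parameter descriptions above.

```python
# Statuses that cannot be deleted, per the note above.
NON_DELETABLE_STATUSES = frozenset({"SYNCING", "EVALUATING_RESYNC", "QUEUED_FOR_OCR"})


def build_delete_by_status_body(sync_statuses, delete_non_synced_only: bool = False) -> dict:
    """Build a request body for status-based deletion, rejecting blocked statuses early."""
    statuses = list(sync_statuses)
    blocked = sorted(NON_DELETABLE_STATUSES.intersection(statuses))
    if blocked:
        raise ValueError(f"files in these statuses cannot be deleted: {', '.join(blocked)}")
    return {
        "sync_statuses": statuses,
        "delete_non_synced_only": delete_non_synced_only,  # body placement is an assumption
    }
```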
Carbon Connect Updates
Added support for the following functionalities in Carbon Connect (React component + JavaScript SDK):
Additional embedding models (OPENAI, AZURE_OPENAI, COHERE_MULTILINGUAL_V3 for text and audio files, and VERTEX_MULTIMODAL for image files).
Enable audio and image file support. Reference documentation on file formats available.
OCR support for PDFs from local file uploads via Carbon Connect.
Hybrid search supported.
You can find details to enable any of these functionalities in our documentation:
Remove Customer-Id on Select Endpoints
We’re removing customer-id as a required header for the following endpoints where it is not required:
/auth/v1/white_labeling
/user
/webhooks
/add_webhook
/delete_webhook/{webhook_id}
/organization
Vector Database Integration
We are starting to build out direct integrations with vector database providers!
What this means:
After authenticating a vector database provider via API key, Carbon automatically synchronizes between user data sources and the embeddings within your vector database. Whenever a user file is processed, we handle the seamless update of your vector database with the latest embeddings.
You’ll have full access to all of Carbon’s API endpoints, including hybrid search if sparse vector storage is supported by your vector database.
Migrations between vector databases are made simple since Carbon provides a unified API to interface with all providers.
The first vector database integration we’re announcing is with Turbopuffer. Many more to come!
S3 Connector
We launched our S3 connector today that enables syncing objects from buckets.
The Carbon Connect enabledIntegrations value for S3 is S3.
See more specifics about our S3 connector here.
File + Account Management Component (BETA)
We’ve launched a new component that enables the following:
Users to add and revoke access to accounts under each connection.
Users to view and select specific folders and files for sync.
The aim is to offer a pre-built file selector for integrations without their own.
The component is currently offered in React but we’ll add support for other frameworks soon.
You can find the npm package here. Please note it’s still in BETA so your feedback is much appreciated!
Expanding sort for user_files_v2
You can sort by name, file_size and last_sync on the order_by field in the user_files_v2 body.
See more details here.
Support for audio file uploads via connectors
We’ve enabled support for audio files via the following connectors: S3, Google Drive, OneDrive, SharePoint, Box, Dropbox, Zotero.
See list of supported audio files here.
Google Verification
Carbon’s Google Connector is officially Google-verified. This means users will no longer see the warning screen when authenticating with Carbon’s Google connector.
OCR Public Preview
We’ve been rolling out support for OCR, starting with PDFs uploaded locally (images and data connectors to follow).
Exposing Sync Error Reasons
We are now exposing error messages under the sync_error_reason field for files entering SYNC_ERROR status.
You can find a list of common errors here and we’ll be updating this on an ongoing basis.
List and Sync Items from Data Sources
We’re introducing new functionalities that allow customers to synchronize and retrieve a comprehensive list of items such as files, folders, collections, articles, and more from a user’s data source. This enhancement empowers you to create an in-house file selection flow, while enabling Carbon to also provide a user-friendly file selector UI and convenient helper methods within our SDK.
You can find more details here.
Upload Chunks and Embeddings
Added the /upload_chunks_and_embeddings endpoint to enable uploading of chunks and vectors to Carbon directly.
See more specific details here.
CARBON
Data Connectors for LLMs
COPYRIGHT @ 2024 JCDT DBA CARBON