XLSM File Support
Added support for XLSM file uploads from both third-party connectors and local file uploads. As with XLSX files, each row is placed on its own line, and each element within a row is prefixed with the header of its corresponding column. Our parser assumes that the first row, and only the first row, is the header. Macros, images, and charts aren’t supported yet.
Webscrape Improvements
We now immediately abort the web scrape run when the file is deleted in Carbon, freeing up resources to submit another web scrape request.
Auto-sync for web scrapes can now be managed at the user and organization level via auto_sync_enabled_sources with the value WEB_SCRAPE. The default auto-sync schedule is every 2 weeks (as opposed to daily for other data sources).
Store Files Without Parsing
The sync endpoints now take a new parameter store_file_only (file_sync_config.store_file_only for external files) that allows users to skip parsing during the sync. This means the file will have a presigned_url but not a parsed_text_url. Because parsing is skipped, we won’t be able to count the number of characters in the file, so the only metrics we’ll report to Stripe are bytes uploaded (and URLs scraped, for web scrape files).
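As an illustrative sketch, a sync configuration that skips parsing could be assembled like this. Only the store_file_only / file_sync_config.store_file_only parameter names come from the note above; the exact request shape is an assumption:

```python
def build_sync_config(store_file_only: bool = True) -> dict:
    """Assemble a file_sync_config for an external-file sync.

    With store_file_only set, the file is stored but never parsed, so it
    gets a presigned_url and no parsed_text_url.
    """
    return {"file_sync_config": {"store_file_only": store_file_only}}

config = build_sync_config()
```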
Sync Additional GitHub Data
In addition to syncing files from repositories, you can fetch data directly from GitHub via the following endpoints:
/integrations/data/github/pull_requests: Lists all pull requests for a repository
/integrations/data/github/pull_requests/{pull_number}: Retrieves a specific pull request
/integrations/data/github/pull_requests/comments: Fetches comments on a pull request
/integrations/data/github/pull_requests/files: Retrieves files that were changed
/integrations/data/github/pull_requests/commits: Retrieves a list of commits on a pull request
/integrations/data/github/issues: Lists repository issues
/integrations/data/github/issues/{issue_number}: Retrieves a specific issue
By default, we return responses with mappings applied, but every endpoint has an option (include_remote_data) to include the entire GitHub response. Find more details in our documentation here.
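A hedged sketch of assembling query parameters for one of these endpoints; everything except the include_remote_data option is an assumed name for illustration:

```python
def github_pr_query(repo: str, include_remote_data: bool = False) -> dict:
    """Assemble query params for /integrations/data/github/pull_requests.

    The 'repository' key is an assumed parameter name for illustration;
    include_remote_data opts in to the full, unmapped GitHub response.
    """
    params = {"repository": repo}
    if include_remote_data:
        params["include_remote_data"] = True
    return params

params = github_pr_query("carbon-ai/docs", include_remote_data=True)
```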
/user_files_v2: New upload_id Property
User files now contain a new property called upload_id, which is generated internally by Carbon. This property groups together files that were part of the same upload. Each upload from a third-party file picker will have its own unique upload_id, even if the files were uploaded in the same session. Sessions are still identified by the request_id. If the same file is uploaded multiple times, only the most recent upload_id is saved. Webhooks that send the request_id will now also send the upload_id. The /user_files_v2 endpoint now accepts a new filter called upload_ids.
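A minimal sketch of the new filter; only the upload_ids name comes from the note above, and the nesting under a filters key is an assumption:

```python
def files_by_upload_ids(upload_ids: list) -> dict:
    """Build a /user_files_v2 request body that filters files by upload_ids."""
    return {"filters": {"upload_ids": upload_ids}}

body = files_by_upload_ids(["upload_abc", "upload_def"])
```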
New ALL_FILES_PROCESSED Webhook
The new webhook ALL_FILES_PROCESSED will be sent when all files in an upload have moved into the READY, SYNC_ERROR, READY_TO_SYNC, or RATE_LIMITED status. It includes the request_id as the sent object and the upload_id as additional information.
API Update
Starting next Tuesday (10/15), the hot_storage_time_to_live field on file upload endpoints will no longer accept values in seconds. Instead, it must be a discrete number of days from the list [1, 3, 7, 14, 30]. Any other value will raise an exception.
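The new contract can be expressed as a small client-side check. This is a sketch of the rule stated above, not Carbon’s server-side code:

```python
# Allowed day values for hot_storage_time_to_live after the change.
ALLOWED_TTL_DAYS = [1, 3, 7, 14, 30]

def validate_hot_storage_ttl(days: int) -> int:
    """Reject any hot_storage_time_to_live outside the allowed day values."""
    if days not in ALLOWED_TTL_DAYS:
        raise ValueError(
            f"hot_storage_time_to_live must be one of {ALLOWED_TTL_DAYS}, got {days}"
        )
    return days
```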
Self-Hosting Updates
You can now bring your own S3-compatible object storage instead of using S3 (AWS) or Google Blob Storage (GCP).
Added a flag DISABLE_RATE_LIMITS to disable all of Carbon’s rate limits listed here.
Premium Proxies for Web Scraping
We have introduced a new feature called use_premium_proxies for web scraping and sitemap scraping that can be enabled upon request. It aims to improve the success rate when scraping websites that use captchas or bot blockers. Please note that enabling this feature may result in longer web scraping durations.
Limit File Syncs by Character Count
Initial file syncs now include the option to limit based on the number of characters. There are three levels of character limits:
max_characters_per_file: A single file from the user cannot exceed this character limit.
max_characters_per_upload: Custom character limit for the user across a single upload request.
max_characters: Custom character limit across all of the user’s files. Please note that in this case, the value can slightly exceed the limit.
These limits can be configured using the user (/update_user) or organization (/organization/update) update endpoints. If a limit is exceeded, the file that surpasses the threshold will be moved to SYNC_ERROR, and the corresponding webhook (FILE_ERROR) will be sent. Please be aware that files from the same upload request that have already synced will not be rolled back.
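A sketch of what an /update_user body setting all three limits might look like. The three limit names come from the note above; the customer_id key and the sample values are assumptions:

```python
def build_user_limit_update(customer_id: str) -> dict:
    """Sketch of an /update_user body setting the three character limits."""
    return {
        "customer_id": customer_id,          # assumed key name
        "max_characters_per_file": 100_000,   # no single file above 100k chars
        "max_characters_per_upload": 500_000, # cap per upload request
        "max_characters": 2_000_000,          # cap across all the user's files
    }

body = build_user_limit_update("user_123")
```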
Email Notifications for Usage Limits
You can now enable the following emails (currently upon request) to be sent to admins and users under your portal.carbon.ai account:
Daily Limit Reached: Your organization has reached the 2.5GB (or custom) upload limit across all users and data sources. We’ll return the organizationName, uploadLimit, and resetTime.
User Exceeded Their Upload Limit: A given user has exceeded the upload limits you set via max_files or max_files_per_upload. We’ll return the customerId, limitType, and dateTime.
User Exceeded Their Upload Limit: A given user has exceeded the upload limits you set via max_characters_per_file, max_characters_per_upload, or max_characters. We’ll return the customerId, limitType, and dateTime.
Self-Hosting Updates
We have added two new environment variables:
HASH_BEARER_TOKEN: When set to true, we store only hashed bearer tokens in the database. This is optional and adds an additional layer of security if your database is already encrypted at rest.
DATA_SOURCE_ENCRYPTION_KEY: Enables encryption of client secrets and access tokens when set. This key should be a URL-safe, base64-encoded 32-byte key. Refresh tokens are not encrypted because they are not useful without the client secret. Encrypted values can be decrypted and rolled back using a migration that does this for all tokens.
You can now use your own SQS-compatible queue instead of using SQS (AWS) or PubSub (GCP). Currently we’ve implemented elasticmq as the open-source SQS alternative.
Carbon Connect Enhancements
Users can now opt to have the updated_at column displayed in filesTabColumns instead of created_at, allowing for sorting by this column.
Status updates for files in a syncing state have been implemented. The file view will automatically refresh when the file changes to either Ready or Sync Error.
API Endpoint for White Labeling
If white-labeling is enabled for your organization, you can directly manage your OAuth credentials for white-labeling via the following endpoints:
/white_label/create: Add credentials to white-label data sources.
/white_label/delete: Delete credentials for white-labeled data sources.
/white_label/update: Update credentials for a white-labeled data source.
/white_label/list: List credentials for white-labeled data sources.
Below is a list of data sources that can be white-labeled:
NOTION GOOGLE_DRIVE BOX ONEDRIVE SHAREPOINT INTERCOM SLACK ZENDESK OUTLOOK GMAIL SERVICENOW SALESFORCE ZOTERO CONFLUENCE DROPBOX GOOGLE_CLOUD_STORAGE GONG
For all these data source types, client_id and redirect_uri are required credentials. client_secret is optional for those who want to create data sources with access tokens obtained outside of Carbon. Data source-specific credentials:
Google Drive optionally takes an api_key for those who want to use Google’s file picker.
OneDrive and SharePoint take a file_picker_client_id and file_picker_redirect_uri for those who want to use Microsoft’s file picker.
Note: Carbon will encrypt client secrets in our database, but return them unencrypted in the API responses.
Disabling File Formats in CCv3 File Picker (3.0.21)
You can now disable the selection of unsupported or disabled file formats in the CCv3 in-house file picker for the following integrations:
GOOGLE_DRIVE ONEDRIVE SHAREPOINT BOX DROPBOX S3 (includes Digital Ocean Spaces) ZOTERO AZURE_BLOB_STORAGE GOOGLE_CLOUD_STORAGE
By default, all file formats supported by Carbon are enabled. Users can set allowed_file_formats under connector settings at the user (update_users) or organization (organization/update) level to control which file formats are enabled.
Self-Hosting Updates (1.3.18)
We now allow environment variables for the Carbon application to be passed as a yaml file. The config.yaml file is a configuration file that stores all the environment variables for the application. It can have multiple environments, such as dev, prod, etc. Each environment should be placed under a key with the name of the environment, and that key must be at the top level of the file. The environment that is used is determined by the global_env key-value pair (e.g. global_env: dev). It’s important to note that the variables in this file are converted into environment variables: every key-value pair at the leaf level is extracted, the key becomes the environment variable’s name in all caps, and the value remains the same. For instance, the microsoft_file_picker_client_id variable under prod.data_connects.onedrive would be converted to the env variable MICROSOFT_FILE_PICKER_CLIENT_ID=test_id_here.
Here is an example of the
.yaml
file for reference.
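The leaf-extraction rule described above can be sketched in a few lines of Python. This mirrors the stated conversion (leaf keys uppercased, values unchanged); it is not Carbon’s actual implementation:

```python
def extract_env_vars(config: dict, env: str) -> dict:
    """Flatten the selected environment's leaf key-value pairs into
    env-var style ALL_CAPS names, per the conversion rule above."""
    def walk(node: dict, out: dict) -> dict:
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, out)  # descend until we hit a leaf
            else:
                out[key.upper()] = value
        return out
    return walk(config[env], {})

config = {
    "global_env": "prod",
    "prod": {
        "data_connects": {
            "onedrive": {"microsoft_file_picker_client_id": "test_id_here"}
        }
    },
}
env_vars = extract_env_vars(config, config["global_env"])
# env_vars == {'MICROSOFT_FILE_PICKER_CLIENT_ID': 'test_id_here'}
```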
Custom Metadata for Data Sources
We added the functionality to add custom tags to data sources, similar to those currently supported for files.
You can add tags to any data source via the following endpoints:
data_sources/tags/add: Add tags to a data source.
data_sources/tags/remove: Remove tags from a data source.
Any endpoint for connecting data sources (i.e., integrations/connect and /integrations/oauth_url) takes a data_source_tags param for adding tags. The tags must be added as key-value pairs (same as file tags). Example: {"userId": "swapnil@carbon.ai"}
We have also introduced two parameters in Carbon Connect (3.0.23), allowing customers to add tags to and filter displayed data sources for users:
dataSourceTags: Key-value pairs that will be added to all data sources connected through CCv3.
dataSourceTagsFilterQuery: This parameter filters on tags when querying data sources. It functions similarly to our documented file filters. If not provided, all data sources will be returned. Example: {"key": "userId", "value": "swapnil@carbon.ai"}
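For illustration, a connect request carrying tags might be assembled like this. The data_source_tags param and tag format come from the note above; the service key is an assumed name:

```python
def build_connect_request(service: str, tags: dict) -> dict:
    """Sketch of an /integrations/connect body carrying data_source_tags."""
    return {
        "service": service,        # assumed key name for illustration
        "data_source_tags": tags,  # key-value pairs, same format as file tags
    }

request = build_connect_request("NOTION", {"userId": "swapnil@carbon.ai"})
```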
Sharepoint Team Site Support
We now support SharePoint team sites. To connect a SharePoint team site, leave sharepoint_site_name undefined when calling /integrations/oauth_url.
Cursor-Based Pagination
We have begun to implement a more efficient pagination system for our API endpoints, starting with the /user_files_v2 endpoint. We introduced a new parameter called starting_id in the pagination block. We recommend using a combination of limit and starting_id instead of limit and offset. This not only reduces the load on our backend but also leads to significantly faster response times for you. The limit-starting_id approach is essentially cursor-based pagination, with the cursor being the starting_id. If you are unsure which ID to use for starting_id, initially make a query with just a limit, order direction, and field to order by. For example:
{ "order_by": "id", "order_dir": "desc", "pagination": { "limit": 10 } }
You will receive a list of results (files in the case of /user_files_v2), ordered by id in descending order. From here, use the last ID in the list as the starting ID for the next API call. For instance:
{ "order_by": "id", "order_dir": "desc", "pagination": { "limit": 10, "starting_id": 25032 } }
This assumes that the last ID of the first API call was 25032. By following this method, you can retrieve the next 10 files, and you can continue this process as needed. We aim to eventually phase out offset-based pagination in favor of this cursor-based pagination, as offset-based pagination performs significantly worse at the database level.
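The loop described above can be sketched as follows, with a local stub standing in for the real /user_files_v2 call so the cursor logic is visible end to end:

```python
def list_files_page(all_ids, limit, starting_id=None):
    """Stub for /user_files_v2: returns ids in descending order, starting
    strictly below starting_id when provided (cursor-based pagination)."""
    ids = sorted(all_ids, reverse=True)
    if starting_id is not None:
        ids = [i for i in ids if i < starting_id]
    return ids[:limit]

def fetch_all(all_ids, limit=10):
    """Walk every page by seeding each request with the last id seen."""
    results, cursor = [], None
    while True:
        page = list_files_page(all_ids, limit, cursor)
        if not page:
            break
        results.extend(page)
        cursor = page[-1]  # last id of this page becomes the next starting_id
    return results

ids = list(range(1, 26))
assert fetch_all(ids, limit=10) == sorted(ids, reverse=True)
```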
Self-Hosting Updates
Azure Blob Storage has been added as an object storage backend, alongside S3 (AWS), Google Blob Storage (GCP), and S3-compatible open source alternatives.
Customer Portal v2
We have completely redesigned our customer portal UI (portal.carbon.ai) and have a roadmap to significantly enhance the functionality.
You can now manage the following through the portal:
Webhooks
API keys
Admin and User Permissions
Subscription Plans
Drives Listed As Top-Level Items
Personal and Shared Drives are now listed as top-level source items via both the API and the in-house Carbon file picker.
Drives themselves cannot be selected for syncing, but you can click into a Drive to select folders and files within it.
Self-Hosting Updates
We added the following environment variables for self-hosted deployments:
default_request_timeout: The default timeout for all requests made to external APIs and URLs. Defaults to 7 seconds.
web_scrape_request_timeout: The timeout specifically for requests made during web scraping. Defaults to 60 seconds.
<data_source_name>_request_timeout: Allows you to customize the request timeout for specific data sources. Replace <data_source_name> with the actual name, such as notion_request_timeout or google_drive_request_timeout. Defaults to 7 seconds.
Custom Scopes for Connectors
You can now directly pass custom scopes to request from the OAuth provider via /integrations/oauth_url. The scopes are used as-is, not combined with the default scopes that Carbon requests. The scopes must be passed in as an array, example:
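For illustration only: the payload shape and service value below are assumptions, and the scope string is a real Google OAuth scope used here purely as an example of the array format:

```python
# Hypothetical /integrations/oauth_url body passing custom scopes as an array.
oauth_request = {
    "service": "GOOGLE_DRIVE",  # assumed key/value for illustration
    "scopes": ["https://www.googleapis.com/auth/drive.readonly"],
}
```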
Support for custom scopes has also been added to Carbon Connect 3.0.26. The prop is called scopes and is an array that can be set only at the integration level.
Presigned URL Expiry Time
We added a new, optional field on the /user_files_v2 endpoint called presigned_url_expiry_time_seconds that can be used to set the expiry time for generated presigned URLs. The default is 3600 seconds.
List Sharepoint Sites
You can now list all the SharePoint sites associated with a user’s SharePoint account.
After connecting to a SharePoint account, you can use the endpoint /integrations/sharepoint/sites/list to retrieve a list of all sites in the account. This endpoint has two optional parameters:
data_source_id: Must be provided if there are multiple SharePoint connections under the same customer ID.
cursor: Used for pagination.
Each site will return three properties:
site_url
site_display_name
site_name: This value is used for sharepoint_site_name when connecting sites with integrations/oauth_url.
Please note that this endpoint requires an additional scope, Sites.Read.All, which Carbon does not request by default. In order to list sites, connect sites, and sync files from connected sites, you must include Sites.Read.All in the /integrations/oauth_url request through the scopes parameter, along with the required scopes: openid, offline_access, User.Read, and Files.Read.All.
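Putting the scope requirements above together as a small helper (a sketch; only the scope names come from the note above):

```python
# Scopes Carbon requires by default for SharePoint, per the note above.
REQUIRED_SCOPES = ["openid", "offline_access", "User.Read", "Files.Read.All"]

def sharepoint_oauth_scopes(list_sites: bool = True) -> list:
    """Build the scopes array for /integrations/oauth_url; Sites.Read.All
    is added only when site listing/connecting is needed."""
    scopes = list(REQUIRED_SCOPES)
    if list_sites:
        scopes.append("Sites.Read.All")
    return scopes

scopes = sharepoint_oauth_scopes()
```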
New Filters for Source Items
We added two new optional filters for /integrations/items/list:
file_formats: Filter based on all file formats supported by Carbon. This is a new feature that won’t be backfilled, so it only applies to data sources that are synced or re-synced going forward.
item_types: Filter on different item types at the source; for example, help centers have TICKET and ARTICLE, while Google Drive has FILE and FOLDER.
Return external_url for Freshdesk
We now return the external_url value for Freshdesk articles.
Sync Outlook Emails Across All Folders
We have introduced support for syncing Outlook emails across all of a user’s folders. Specify the folder as null to achieve this; the default is the inbox if this input is excluded.
Support for .eml and .msg Files
We’ve added support for .eml and .msg files for both local and third-party file uploads.
Return Document Chunks without Embeddings
We added a new flag generate_chunks_only under file_sync_config for third-party connectors (as generate_chunks_only) and at the top level for web scrapes, file uploads, and raw text (as generateChunksOnly). When this flag is set to true, documents will be chunked without generating embeddings, and /list_chunks_and_embeddings will list chunks only. If generate_chunks_only is set to true, it overrides skip_embedding_generation: embeddings will not be generated regardless of the value passed for skip_embedding_generation.
ServiceNow Connector
The ServiceNow connector allows customers to synchronize incidents and attachments from their accounts, and support for knowledge articles and catalogs will be added soon!
Carbon Connect support is coming tomorrow. The enabledIntegration value will be SERVICENOW. You can find more details here.
Carbon Connect Enhancements
If a synced file in the “Synced File” list view is in ERROR status, an error message will be displayed when hovering over the Error status label. If a file is re-synced via the “Synced File” list view, a success or error message will be provided based on the outcome.
The ServiceNow connector has been added to CCv3. The slug for the enabledIntegration is SERVICENOW.
Gong Connector
We just launched our Gong connector for syncing Gong calls and retrieving the call transcripts.
CCv3 support for the Gong connector will be added later this week, with the enabledIntegration slug being GONG.
By default, the Gong connector will sync all of your workspaces and calls. However, you can customize this behavior:
To turn off automatic syncing of all workspaces and calls, set the sync_files_on_connection parameter to false when configuring the connector.
To manually sync specific workspaces or calls, use the global endpoints (/integrations/items/list and /integrations/files/sync).
To include speaker names and emails (when available), set the include_speaker_labels flag under file_sync_config to true.
New calls are auto-synced from existing workspaces, but any new workspaces created later will require manual syncing.
Find more details here.
External URL for Gmail and Outlook
The external_url field is now returned for both Gmail and Outlook email files under user_files_v2.
Return Raw Slack Messages
We now return the individual Slack messages under the additional_presigned_urls->messages_json field when you set the include_additional_files parameter to true for user_files_v2. The pre-signed file will contain the raw Slack response for all the messages in that file. The JSON will have one entry per conversation, with the conversation timestamp as the key.
Improved Search for Carbon Connect (3.0.12)
The search functionality in CCv3 has been enhanced to enable searching through all items in the directory or selected folder, rather than just what is displayed in the front-end.
Improved Notion Parsing
We’ve improved our Notion parser to support parsing for the following blocks:
Toggle lists
In-line tables, text, code blocks, and lists
Numbered and bullet lists
Synced blocks
Multi-column blocks
Text with links
Syncing Intercom Conversations
In addition to articles and tickets, Carbon now syncs Intercom Conversations.
You can specify CONVERSATION under file_sync_config to enable syncing conversations:
"file_sync_config": { "auto_synced_source_types": ["CONVERSATION"], "sync_attachments": true }
The following conversation information is available as tags for filtering:
{ "conversation_status": "open", "conversation_priority": "not_priority", "conversation_submitter": "example.user@projectmap.com", "conversation_assigned_team": "Support", "conversation_assigned_admin": "swapnil+int2@carbon.ai" }
If you are white-labeling the Intercom connector, you’ll need to add the Read conversations scope as well. See more details here: https://docs.carbon.ai/connectors/intercom
Notion Database Properties
Notion database properties are now returned per page within the database.
All Notion database properties are supported except for relation. Properties are parsed per page in a database, in a key-value format (property_name: property_value), and are added to the beginning of the parsed page (parsed_text_url) as a newline-separated list. The file returned by presigned_url now also contains the JSON representation of the Notion page; the page’s properties and child blocks can be found in the object.
Sync Files Without Processing
We now allow new file records to be created in Carbon (and displayed via /user_files_v2) without processing and saving the actual file. The remote file content will not be downloaded, and no chunks or embeddings will be generated. Only some metadata, such as name, external ID, and external URL (depending on the source being synced from), will be stored. This feature can be enabled by setting the flag skip_file_processing to true under file_sync_config for a given data source; the sync_status of files in this state will be READY_TO_SYNC. It’s important to note that this flag overrides both the skip_embedding_generation and generate_chunks_only flags.
apiURL prop for CCv3 (3.0.14)
For customers that self-host Carbon, we added the apiURL prop to CCv3, which defaults to https://api.carbon.ai but can be set to another URL. This URL then acts as the base path for all requests made through Carbon Connect.
Qdrant Destination Connector
You can now “bring your own” Qdrant index to use with Carbon.
Carbon can automatically synchronize embeddings generated from customer data sources with any Qdrant index.
To enable this, we’ll require your Qdrant API key, a URL, and a mapping of embedding generators (e.g., OPENAI) to collection names:
{ "api_key": "API_KEY", "url": "URL", "collection_names": { "EMBEDDING_GENERATOR_1": "COLLECTION_NAME_1", "EMBEDDING_GENERATOR_2": "COLLECTION_NAME_2" } }
Azure Blob Storage Connector
We launched our Azure Blob Storage connector, which enables syncing files and folders from blobs. The Carbon Connect enabledIntegrations value for Azure Blob Storage is AZURE_BLOB_STORAGE, and CCv3 support will launch tomorrow. Find more details on our Azure Blob Storage connector here.
Business OneDrive Support for Microsoft File Picker
The file picker button will now appear on the successful connection page for Business OneDrive accounts.
In order to open the file picker, the tenant name of the business account is required. Carbon will try to find it through Microsoft’s API by default. If it can’t be found, the file picker button won’t appear, and the successful connection page will instruct the user to close the tab.
Carbon Self-Hosting on Google Cloud Platform
Starting today, customers have the option to host a Carbon instance within their own GCP instance, with full access to all features of our managed solution, including data connectors, hybrid search, and more.
As a reminder, we’re already live on AWS and launching on Azure next month!
Book a demo if you’re interested in learning more: https://cal.com/carbon-ai/30min
Unified API for CRMs
We are introducing a unified API to access standardized data directly from CRM systems, starting with Salesforce.
To start, you can now sync data from the following CRM objects:
Accounts
Leads
Contacts
Opportunities
You can find more details in our documentation here.
Google Sheets Update
The file returned in presigned_url for Google Sheets has been changed from txt to xlsx. The txt file is still available in parsed_text_url.
Sync Filter for Email Attachments
Customers can now choose to sync only emails that contain attachments.
You will still need to set sync_attachments to true and also set the following filter:
{ "key": "has", "value": "attachment" }
Auto-Refresh Synced Files List in CCv3
We now automatically refresh the synced file list whenever users select additional files using our in-house or third-party file picker view. This eliminates the need for users to manually refresh the view.
Updated Children Prop
The children prop of the CCv3 component now accepts any valid React node as the children of the modal, from a simple <div> to an entire component. Here’s an example of how the children prop can be used:
children={ <button onClick={() => setOpen((prev) => !prev)}> Toggle Connect </button> }
Custom Styling for Carbon Connect
Users can now control styling of CCv3 by targeting the specific class names we’ve provided. This allows for complete customization to match the desired look and feel of the application.
For example, class names include:
cc-modal: Applies to the entire modal component
cc-modal-header: Targets the header section of the modal
cc-modal-footer: Targets the footer section of the modal
cc-modal-close: Applies to the close button of the modal
cc-modal-overlay: Targets the overlay background of the modal
By utilizing these class names, users can easily override the default styles and apply their own CSS rules to achieve the desired appearance.
OCR Support for JPG and PNG
We now support jpg, jpeg, and png file formats for OCR. In addition to the normal steps for enabling OCR, please set media_type to TEXT (via file upload and /integrations/oauth_url) so Carbon knows to process the image via OCR (versus generating image embeddings via our image embedding model).
HTML for Confluence Articles
We now return the raw HTML output for each Confluence article via the file_metadata.saved_filename object under user_files_v2.
Cancel Source Items Sync
We added an endpoint /integrations/items/sync/cancel to cancel data source syncs that are initiated via /integrations/items/sync. This allows customers to manually stop syncing for user data sources where sync_status = SYNCING.
New Gmail Filter
We added a new Gmail filter to sync all emails sent from a given account. Example:
{ "filters": { "key": "in", "value": "sent" } }
Return Raw Notion Blocks
We now return the raw output (blocks) for each Notion page via saved_filename under user_files_v2 when include_raw_file: true.
Shared Google Drive Source Items
We now return shared Google Drive files and folders via /integrations/items/list.
Clearer Error Message for SYNC_ERROR Status
When a file goes into SYNC_ERROR from re-syncing via /resync_file because it has been deleted at the source, sync_error_message will now say File not found in data source. The webhook sent for that error will also contain sync_error_message in additional_information.
Slack UI in Carbon Connect v3 (3.0.0-beta32)
Select Conversations to Sync
After authenticating, users have full control over which conversations they want to sync via CCv3, including:
Public channels
Private channels
Direct messages (DMs)
Group DMs
Manage Synced Conversations
Users can manage their list of synced conversations at any time via CCv3.
Easily add or remove channels and DMs to adjust what gets synced between Slack and Carbon.
Carbon Connect Enhancements
Synced URLs for Web Scrapes (CCv3 beta30)
We now display synced URLs in a dedicated list view under the WEB_SCRAPE integration. The default columns displayed in the list view are name, status, and created_at. Parent URLs are displayed as “folders” and child URLs are displayed as “files” within the folder.
When showFilesTab is set to false, we surface a Select files button in the account drop-down for users to sync new files.
Data Source Polling Interval
Added a new configuration property at the component level called dataSourcePollingInterval. This property controls how frequently data sources are polled for any updates and events.
The value is specified in milliseconds (ms) and the minimum allowed value for this property is 3000 ms. The default is 8000 ms.
Speaker Diarization
Added includeSpeakerLabels for the LOCAL_FILES integration and file extensions. Added include_speaker_labels to fileSyncConfig for third-party connectors.
openFilesTabTo Param
The openFilesTabTo prop is set at the component level and determines which tab (FILE_PICKER or FILES_LIST) the user is taken to by default when they select an integration. The prop takes a string value of either "FILE_PICKER" | "FILES_LIST". This prop only applies when the customer has enabled Carbon’s in-house file picker.
We now display a banner when data source items are being synced. The user will still be able to select previously synced items for upload in the meantime.
Guru support in CCv3 has been added. The enabledIntegration is GURU.
We improved the file list view to be better optimized for mobile devices and ensured that the column headers and values align properly.
Pongo Reranking Model
We’ve added Pongo as a supported reranker model alongside Jina and Cohere.
Similar to Cohere and Jina reranking, users can now use PONGO_RERANKER in the following manner on the embeddings endpoint:
{ "query": "how is anime made?", "k": 5, "rerank": {"model": "PONGO_RERANKER"} }
Third-Party File Picker Behavior
We added a new parameter automatically_open_file_picker to the external file sync URLs: /integrations/oauth_url and /integrations/connect. When true, the file picker for Google Drive, Box, OneDrive, SharePoint, and Dropbox will automatically open when the user lands on the successful connection page. It’s important to note that some users’ browsers may have popup blockers that could prevent this parameter from functioning. In such cases, the user may receive a prompt from their browser asking for permission to allow popups from the platform. If the user grants permission, the feature will work as intended for future syncs.
It’s worth mentioning that OneDrive and SharePoint behave differently due to Microsoft treating the file picker as a separate app. Instead of directly opening the file picker, it will trigger another OAuth prompt. If the user consents to the file picker OAuth, the file picker will then automatically open afterwards.
Speaker Diarization
Speaker diarization has been added for audio transcription models. This allows us to format chunks so that the text is organized by utterances and each utterance will be labeled with the speaker. It’ll take this format:
[Speaker A] speaker A's utterance
[Speaker B] speaker B's utterance
For local file uploads, there is a new parameter include_speaker_labels, and for external file uploads, the file_sync_config object can take a new property include_speaker_labels. When either is set to true, speaker diarization will be enabled for the audio transcription services. Minor note: speaker labels may appear differently depending on the transcription service; Deepgram uses numbers to label speakers while AssemblyAI uses letters.
request_id on Additional Webhooks
request_id is now included in the following webhook events under the additional_information object for external files: UPDATE, FILES_CREATED, FILE_READY, FILE_ERROR, FILES_SKIPPED, FILE_SYNC_LIMIT_REACHED.
Cold Storage for Files (Beta)
Overview
Carbon supports moving file embeddings between hot and cold storage. This feature allows you to optimize storage costs and improve performance by keeping embeddings for frequently accessed files in hot storage (vector storage) while moving less frequently used files to cold storage (object storage).
Enabling Cold Storage
By default, the cold storage feature is not enabled. Once enabled, files will automatically be moved to cold storage after a set period of inactivity. To enable cold storage, you must set a flag at file upload time. Currently, cold storage is only available for local file uploads via /uploadfile, /upload_text, and /upload_file_from_url.
Moving Files from Hot to Cold Storage
Once enabled, files will be automatically moved from hot to cold storage after a specified period of inactivity. This period is determined by the time_to_move_to_cold_storage parameter, which represents the number of seconds a file must be inactive before it’s moved to cold storage. There is no manual way to move files to cold storage. You can make an API request to the /modify_cold_storage_parameters endpoint, which allows customers to update existing files to use cold storage.
Moving Files from Cold to Hot Storage
To move files from cold to hot storage, you must make an API request to /move_to_hot_storage. The request takes filters similar to /user_files_v2, and all files matching the provided filters will be moved to hot storage. To avoid a single request hogging resources, there is a limit of 200 files that can be moved in one request. If the number of files matching the filters exceeds 200, the files will be processed in batches of 200 over a longer period of time.
/embeddings Endpoint Behavior
If a request is made to /embeddings that involves files in cold storage, an error will be returned that includes a list of file_ids for the affected files. This allows the client to know which files need to be moved to hot storage before the request can be processed. However, if exclude_cold_storage_embeddings is set to true, any files in cold storage will be ignored, and no error will be thrown for requests involving files in cold storage; the search will simply exclude those files. In the future, we may enable a way to allow /embeddings to work with files that are in both cold and hot storage.
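A client might handle the two behaviors above like this. Only the parameter name exclude_cold_storage_embeddings and the presence of file_ids in the error come from this changelog; the exact error and payload shapes are assumptions:

```python
# Hedged sketch: two ways a client can deal with cold-storage files
# when querying /embeddings. The error dict shape is illustrative.

def files_needing_hot_storage(error_response):
    """Extract the file_ids that must be moved to hot storage before retrying."""
    return error_response.get("file_ids", [])

# Option 1: inspect the error and move the listed files to hot storage.
error = {"message": "Files are in cold storage", "file_ids": [101, 102]}
to_move = files_needing_hot_storage(error)

# Option 2: skip cold-storage files entirely instead of erroring.
query_payload = {
    "query": "refund policy",
    "exclude_cold_storage_embeddings": True,
}
```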
File Object Information
Activity is defined as when a file was last used, which currently includes file re-syncs, queries involving that file, and updates to file tags.
The following fields on the file object (under user_files_v2) are related to cold storage:
last_use: A timestamp indicating when a file was last used (i.e., when it last had activity).
supports_cold_storage: A flag indicating whether or not a file can be moved to cold storage.
time_to_move_to_cold_storage: An integer representing the number of seconds a file must be inactive before it’s moved to cold storage.
embedding_storage_status: The storage status of the embeddings for a file, indicating whether they are in cold or hot storage.
New Cold Storage Webhooks
MOVED_TO_COLD_STORAGE: This event is fired when a file is moved to cold storage.
MOVED_TO_HOT_STORAGE: This event is fired when a file is moved to hot storage.
You can find our documentation on cold storage here.
Warnings Object in API Responses
In the next two weeks, we plan to add a warnings object to our API responses to display warning messages. Here’s an example of how it looks:
{
  "documents": [],
  "warnings": [
    {
      "warning_type": "FILES_IN_COLD_STORAGE",
      "object_type": "FILE_LIST",
      "object_id": [47058],
      "message": "These files won't be queried because they are not in hot storage."
    }
  ]
}
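A caller could surface these warnings like so. The response mirrors the example shape above; how clients choose to react to each warning_type is up to them:

```python
# Hedged sketch: pulling affected file ids out of the planned warnings
# object. The response dict matches the example in this changelog.

response = {
    "documents": [],
    "warnings": [
        {
            "warning_type": "FILES_IN_COLD_STORAGE",
            "object_type": "FILE_LIST",
            "object_id": [47058],
            "message": "These files won't be queried because they are not in hot storage.",
        }
    ],
}

# Collect the ids of files excluded from the query due to cold storage.
cold_storage_file_ids = [
    fid
    for warning in response.get("warnings", [])
    if warning["warning_type"] == "FILES_IN_COLD_STORAGE"
    for fid in warning["object_id"]
]
```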
Carbon Connect 3.0 (CCv3) Enhancements
We’ve added 3 new props to CCv3:
The showFilesTab (boolean) prop has been reintroduced to CCv3 with a default value of true. As a quick reminder, this prop allows customers to hide the file selector and file list view in the CCv3 component. It can be enabled or disabled at both the component and integration levels; if specified for a specific integration, it overrides the component-level configuration.
The filesTabColumns (array) prop has been added at both the component and integration levels. This prop controls which columns are displayed or hidden in the file list view and accepts an array of strings with the values “name”, “status”, “created_at”, and “external_url”.
The transcription_service (enum) prop has been added under fileSyncConfig (and as transcriptionService for the LOCAL_FILES integration) to specify which speech-to-text model to use for transcriptions. You can specify the enum as ASSEMBLYAI or DEEPGRAM; the prop defaults to DEEPGRAM.
Google Cloud Storage Connector
We launched our GCS connector that enables syncing files from buckets.
The Carbon Connect enabledIntegrations value for GCS is GCS. See more specifics about our GCS connector here.
DigitalOcean Storage Connector
We launched our DigitalOcean Spaces connector that enables syncing files from buckets.
The Carbon Connect enabledIntegrations value for DigitalOcean Spaces is S3 (CC support will be launched tomorrow). The Spaces API is interoperable with AWS S3, so DigitalOcean Spaces makes use of the existing S3 endpoints.
This means that the source of DigitalOcean files is S3. To differentiate between data sources and files from Spaces Object Storage, additional metadata has been added:
Data Source Metadata
data_source_metadata: Indicates the type of data source. Possible values include:
S3: Represents an Amazon S3 data source.
DigitalOcean Space: Represents a DigitalOcean Spaces data source.
File Metadata
file_metadata: Specifies the type of file. Possible values include:
S3 File: Represents a file stored in Amazon S3.
DigitalOcean Space File: Represents a file stored in DigitalOcean Spaces.
S3 Bucket: Represents a file representation for an S3 bucket.
DigitalOcean Space Bucket: Represents a file representation for a DigitalOcean Spaces bucket.
See more specifics about our DigitalOcean Spaces connector here.
New file_types_at_source
Filter for /user_files_v2
and /embeddings
Introduced a new optional field file_types_at_source for /user_files_v2 and /embeddings. The file_types_at_source field is an array type that currently accepts the following values: TICKET, ARTICLE.
This new field allows users to specify whether we return tickets, articles, or both when retrieving content (files and embeddings) from Zendesk, Intercom, and Freshdesk.
If file_types_at_source contains TICKET, ticket content from Zendesk, Intercom, and Freshdesk is returned. If file_types_at_source contains ARTICLE, article content from Zendesk, Intercom, and Freshdesk is returned.
AssemblyAI Integration for Audio Transcriptions
We are excited to announce that Carbon now supports multiple audio transcription services. In addition to our existing integration with Deepgram, we have added support for AssemblyAI, providing our users with more options and flexibility when transcribing audio files.
To accommodate the new transcription service, we have updated the following endpoints to accept a new parameter, transcription_service, that allows you to specify which service to use. Valid values are deepgram and assemblyai. If no value is specified, Deepgram will be used as the default transcription service. For local files, the endpoints are:
/uploadfile
/upload_file_from_url
For external files, transcription_service is set within the file_sync_config parameter, under:
/integrations/oauth_url
/integrations/connect
/integrations/files/sync
Similar to files transcribed by Deepgram, files transcribed by AssemblyAI also have an additional saved file containing the full JSON response from the AssemblyAI service. To access the transcription response, query the files using the user_files_v2 endpoint with the include_additional_files parameter set to true.
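Putting the pieces together, the parameters might be supplied like this. Parameter names and values come from this changelog; the payload shapes are illustrative:

```python
# Hedged sketch: selecting a transcription service, then retrieving the
# saved raw transcription JSON. Payload shapes are assumptions.

# Local uploads (/uploadfile, /upload_file_from_url):
upload_params = {"transcription_service": "assemblyai"}

# External files (/integrations/oauth_url, /integrations/connect,
# /integrations/files/sync): nested under file_sync_config.
connect_body = {"file_sync_config": {"transcription_service": "assemblyai"}}

# Fetch the additional saved file with the full service response:
files_query = {"include_additional_files": True}
```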
Carbon Webhook Libraries
We have released our official webhook libraries for handling the verification of webhook signatures. You can find our updated documentation here, and access our libraries on GitHub here.
Zendesk Auto-Sync Update
We are thrilled to announce that the Zendesk connector now supports auto-sync.
Carbon can now sync any new articles with auto-sync enabled.
Help Center Categories are now synced into Carbon as files, and Help Center Categories and articles form a parent-child relationship.
Reconnecting Existing Zendesk Connections:
If you have existing Zendesk connections in Carbon, please note that you will need to reconnect them to enable the updates above.
Organization Connector Settings
The /organization endpoint now includes connector_settings in the response, providing additional information about the organization’s connector configurations, starting with permitted file formats. The /organization/update endpoint has been updated to accept the data_source_config parameter, allowing customers to configure permitted file formats for organization users. The data_source_config parameter should be provided in the following format:
{
  "data_source_configs": {
    "GOOGLE_DRIVE": { "allowed_file_formats": ["PDF", "DOCX"] },
    "DROPBOX": { "allowed_file_formats": ["XLSX", "CSV"] },
    "DEFAULT": { "allowed_file_formats": ["PDF", "DOCX", "XLSX", "NOTION"] }
  }
}
DEFAULT is applied to all data sources that do not have configs defined. If the data_source_config parameter includes file formats that are not supported by Carbon, those formats will be ignored, and only the supported formats from each data source will be synced.
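The DEFAULT fallback behavior described above can be sketched as follows, using the example config from this changelog. The resolution helper is an illustration of the documented semantics, not Carbon's implementation:

```python
# Hedged sketch: resolving the effective format list for a data source,
# falling back to DEFAULT when no per-source config exists.

update_body = {
    "data_source_configs": {
        "GOOGLE_DRIVE": {"allowed_file_formats": ["PDF", "DOCX"]},
        "DROPBOX": {"allowed_file_formats": ["XLSX", "CSV"]},
        "DEFAULT": {"allowed_file_formats": ["PDF", "DOCX", "XLSX", "NOTION"]},
    }
}

def allowed_formats(configs, source):
    """Per-source config wins; any unconfigured source gets DEFAULT."""
    return configs.get(source, configs["DEFAULT"])["allowed_file_formats"]
```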
Carbon Self-Hosting on AWS
Starting today, customers have the option to host a Carbon instance on their own cloud, with full access to all features of our managed solution, including data connectors, hybrid search, and more.
We’re launching on Microsoft Azure and Google Cloud later next month!
Book a demo if you’re interested in learning more: https://cal.com/carbon-ai/30min
Confluence Enhancements
We’ve made improvements to the Confluence Connector related to the following:
Auto-Sync Improvements
The auto-sync process will now index new pages that are added to a previously synced parent page. If a user syncs their entire Confluence account, then the space will be the top-most file.
If pages are deleted from a synced parent page in Confluence, the scheduled sync will remove them from the synced content.
File Metadata Enhancements
The file_metadata property now includes additional information about the type of Confluence item each file represents (spaces and pages). The file_metadata property will also record the external_id of the file’s parent and root, providing better context and hierarchy information.
To take advantage of these updates, users will need to reconnect their Confluence account and re-sync their Confluence files.
Reranker Models for Search
We are excited to introduce native support for reranker models. With this release, customers now have the option to rerank search result chunks to provide more relevant and accurate results.
How it works:
When making a search query via the embeddings endpoint, customers can control the reranking behavior by setting the rerank parameter in the payload.
If rerank is set to "JINA_MULTILINGUAL_BASE_V2", the search result chunks will be reranked using the Jina reranking algorithm. If rerank is set to "COHERE_RERANK_MULTILINGUAL_V3", the search result chunks will be reranked using the Cohere reranking algorithm. If the rerank parameter is not specified or is set to any other value, the default ranking will be used.
The response format from the embeddings endpoint remains consistent regardless of whether rerank is enabled.
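The reranker options described above can be sketched as query payloads. The rerank enum strings are documented here; the rest of the payload is illustrative:

```python
# Hedged sketch: opting into a reranker model on an embeddings search.
# "query" and "k" are illustrative fields; "rerank" values are documented.

base_query = {"query": "quarterly revenue", "k": 10}

jina_query = {**base_query, "rerank": "JINA_MULTILINGUAL_BASE_V2"}
cohere_query = {**base_query, "rerank": "COHERE_RERANK_MULTILINGUAL_V3"}
# Omitting "rerank" (or passing any other value) keeps the default ranking.
```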
We’ll be adding support for more reranker models in the weeks to come!
New Webhook: WEBSCRAPE_URLS_READY
We’ve added a new webhook named WEBSCRAPE_URLS_READY that triggers each time a specific web page from a web scrape request finishes processing.
Introducing Carbon Connect 3.0
We’re thrilled to announce the beta release of Carbon Connect 3.0, packed with exciting updates and improvements based on customer feedback.
Key Features and Improvements
1. Seamless File and Folder Uploads
Carbon Connect 3.0 now supports both file and folder uploads by default, eliminating the need for the filePickerMode
property. Uploading entire folder directories is now a breeze with our new drag-and-drop functionality.
2. Carbon’s In-House File Picker
We’re excited to announce that Carbon’s in-house file picker is now available for all connectors, except for Slack, Gmail, and Outlook (currently in development). To use Carbon’s file picker instead of the source’s file picker, simply set the new useCarbonFilePicker
property to true
.
3. Enhanced In-Modal Notifications
We’ve completely replaced toast notifications with in-modal notifications, providing a more cohesive and user-friendly experience. As a result, the enableToasts
property has been removed.
4. Customizable Theme Options
Personalize your Carbon Connect experience with our new theme options. Use the theme
property to set the application’s theme to light
, dark
, or auto
(default). When set to auto
, Carbon Connect will automatically adapt to your system’s theme.
5. Simplified File Limit Control
Limiting the number of files is now easier than ever. Simply set the maxFilesCount
property to 1
to restrict uploads to a single file. The allowMultipleFiles
property has been removed for a more straightforward approach.
Upcoming Enhancements
We’re continuously working to improve Carbon Connect and have exciting plans for the near future:
1. Enhanced Customization Options
We’re working on bringing back customization options from Carbon Connect 2.0, including loadingIconColor
, primaryBackgroundColor
, primaryTextColor
, secondaryBackgroundColor
, and secondaryTextColor
.
2. Expanded In-House File Pickers
In the coming weeks, we’ll be launching Carbon’s in-house file pickers for Outlook, Slack, and Gmail, providing a consistent and seamless experience across all connectors.
Installation
You can install the new component for testing via the command npm install carbon-connect@beta
. We plan to bring 3.0 out of beta
by the end of the month!
Here’s a Loom video providing a quick walkthrough of the new modal: https://www.loom.com/share/b7b241fa5e5e4d0a92fb5e748d3d6ec3
External URLs Filter
A new external_urls filter has been added to the user_files_v2 endpoint. This filter allows you to refine the results returned by the endpoint based on a list of external_urls passed.
File Deletion Enhancements
When a customer deletes a file from Carbon (via delete_files_v2), they have the flexibility to control whether the file row in the database is preserved or marked as deleted. This behavior is managed by the preserve_file_record flag. If preserve_file_record is set to true, then we delete the file stored in our S3/GCS buckets while keeping the file record and metadata to allow for re-syncs and auto-syncs.
We also added a file_contents_deleted field to the user_files_v2 endpoint. If the field is returned as true, then the file record still exists, but the stored file content is deleted.
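The deletion behavior can be sketched as a request/response pair. The preserve_file_record and file_contents_deleted names come from this changelog; the filter shape is an illustrative assumption:

```python
# Hedged sketch: deleting file content while preserving the record so
# re-syncs and auto-syncs remain possible.

delete_body = {
    "filters": {"ids": [123]},     # illustrative filter shape
    "preserve_file_record": True,  # keep the row + metadata, drop stored content
}

# Afterwards, the file object from user_files_v2 would report the content
# as deleted while the record itself still exists:
expected_file_state = {"file_contents_deleted": True}
```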
Find more details here.
High Accuracy Mode
We’ve introduced a new optional boolean parameter to the /embeddings endpoint called high_accuracy. If set to true, then vector search may give more accurate results at a slight performance penalty. By default, it’s false. Find more details here.
to and from Filters for Outlook and Gmail
We added 2 more filters for syncing emails from Outlook and Gmail:
to: Supports an email address (email@address.com) as a string to which the email was sent.
from: Supports an email address (email@address.com) as a string from which the email was sent.
Note: Outlook only supports
from
filters.
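The two email filters can be sketched as sync payload fragments. The filter names come from this changelog; the payload shape is illustrative:

```python
# Hedged sketch: filtering synced emails by recipient and sender.
# Per the note above, Outlook supports only the "from" filter.

gmail_filters = {
    "to": "recipient@example.com",   # address the email was sent to
    "from": "sender@example.com",    # address the email was sent from
}

outlook_filters = {"from": "sender@example.com"}  # "to" is not supported
```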
Intercom Auto-Sync Update
We are thrilled to announce 2 updates to our Intercom connector:
Carbon can now sync multiple Intercom Help Centers:
Help Centers are now synced into Carbon as files, and Help Centers and articles form a parent-child relationship.
Just as only published articles are synced, only activated Help Centers will be synced.
Carbon can now sync any new published articles when auto-sync is enabled.
Reconnecting Existing Intercom Connections:
If you have existing Intercom connections in Carbon, please note that you will need to reconnect them to enable the updates above.
COPYRIGHT @ 2024 JCDT DBA CARBON