New Webhook Events
We’ve introduced two additional webhook events to help track file sync statuses:

`FILE_CREATED`: This event fires when a user queues up a file to be synced for the first time. The body of the webhook contains a list of `file_ids` for files that were created in the same upload; multiple events can fire for the same upload if many files were queued.

`ALL_UPLOADED_FILES_QUEUED`: This event fires when every item in an upload has been queued for sync, including all children of folders in the upload. The body contains the upload’s `request_id`.
A couple of notes:

Both `file_ids` and `request_ids` can be used to filter for files in `/user_files_v2`.

A `request_id` is now always generated for an upload to support the `ALL_UPLOADED_FILES_QUEUED` webhook. Previously, it was only generated by the user (unless you’re using Carbon Connect) and passed to us as a parameter. You may still pass your own `request_id` and we’ll use it; otherwise, we’ll generate a `request_id` on behalf of the user’s upload.

These two webhooks are currently supported for 3rd party data sources only. Support for web scrapes and local file uploads is coming soon.
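As an illustration, a webhook receiver might dispatch on the event type like the sketch below. The payload field names (`webhook_type`, `file_ids`, `request_id`) are assumptions for this sketch, not Carbon’s documented schema; consult the webhook docs for the exact format.

```python
def handle_webhook(payload: dict) -> str:
    """Minimal dispatcher for the two new events (illustrative only)."""
    event = payload.get("webhook_type")
    if event == "FILE_CREATED":
        # Body carries the file_ids created in this upload; several
        # FILE_CREATED events may arrive for one large upload.
        file_ids = payload.get("file_ids", [])
        return f"queued {len(file_ids)} file(s)"
    if event == "ALL_UPLOADED_FILES_QUEUED":
        # Body carries the upload's request_id once every item is queued.
        return f"upload {payload.get('request_id')} fully queued"
    return "ignored"
```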
You can find more details here.
GitHub Connector
We launched our GitHub integration today, which syncs pages from both public and private repositories.

The Carbon Connect `enabledIntegration` slug for GitHub is `GITHUB`. You’ll need to update to `2.0.0-beta19` to access the new screen.

Users should first submit their GitHub username and access token to our integration endpoint at `/integrations/github`. Then you can use our global endpoints for listing and syncing specific files in different repositories:

List files from repositories with the global endpoint `/integrations/items/list`

Sync files from repositories with the global endpoint `/integrations/files/sync`
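The two-step flow above can be sketched as request payloads. The body field names (`username`, `access_token`, `data_source_id`) are illustrative assumptions; consult the API reference for the exact parameter names.

```python
# Step 1: connect the account (field names are assumptions for this sketch).
connect_request = {
    "endpoint": "/integrations/github",
    "body": {"username": "octocat", "access_token": "ghp_..."},
}

# Step 2: list repository files via the global endpoint, scoped to the
# data source created by the connect step (id is a placeholder).
list_request = {
    "endpoint": "/integrations/items/list",
    "body": {"data_source_id": 123},
}
```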
See more specifics about our GitHub integration here.
Set Max Files Per Upload
A new user-level parameter, `max_files_per_upload`, has been introduced that can be modified via the `/update_users` endpoint. It determines the maximum number of files a user can upload in a single request.

Files beyond this maximum will be moved into the `SYNC_ERROR` status, with webhooks fired to alert you.

You can check the `file_single_upload_limit` set for a particular user via the `/user` endpoint. Find more details here.
Important Update: The parameter `max_files` now serves to establish the overall file upload limit for a user across all uploads.
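For illustration, an `/update_users` request body setting the new per-upload cap might look like the sketch below; the `customer_ids` field is assumed from the other `/update_users` examples in this changelog, and the values are placeholders.

```python
# Hedged sketch of an /update_users request body (values are placeholders).
update_users_body = {
    "customer_ids": ["user_1"],      # users to update
    "max_files_per_upload": 50,      # cap on files per single upload request
}
```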
Add `include_all_children` to Embeddings Endpoint
Added param `include_all_children` to the `/embeddings` endpoint. When this param is set to `true`, the search runs over all filtered files as well as their children. Filters applied to the endpoint extend to the returned child files.
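As a sketch, an `/embeddings` request body using the new param might look like this; the `query`, `k`, and `file_ids` fields are assumptions for the illustration, not a verbatim copy of the documented schema.

```python
# Illustrative /embeddings request body: search file 42 and its children.
embeddings_body = {
    "query": "quarterly revenue",
    "k": 5,                        # number of chunks to return
    "file_ids": [42],              # parent file(s) to filter on
    "include_all_children": True,  # also search the children of file 42
}
```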
In-House File Picker for Confluence and Salesforce
We’re excited to introduce our in-house file picker, starting with Confluence and Salesforce. Our in-house file picker is still in beta, but you can test it out by manually running `npm install carbon-connect@2.0.0-beta13`.
With this update, end users gain the ability to directly select and upload specific files from Confluence and Salesforce. Previously, this functionality was unavailable as neither platform offered their own dedicated file pickers.
When `syncFilesOnConnection` is set to `false`, our file picker will be enabled.
Hiding 3rd-Party File Picker
The endpoints `/integrations/oauth_url` and `/integrations/connect` now support a new boolean parameter named `enable_file_picker`.

When `enable_file_picker` is set to `true` (the default), a button is displayed on the success page; clicking it opens the file picker associated with the respective source.

Conversely, setting `enable_file_picker` to `false` hides the file picker button on the success page. In that case, end users will be directed to use custom or in-house file pickers for file selection.
Sync Outlook and Gmail Attachments
We’ve introduced a new property called `sync_attachments`, which can be specified when syncing via the `/integrations/gmail/sync` and `/integrations/outlook/sync` endpoints. By default, this property is set to `false`.

Setting `sync_attachments` to `true` enables Carbon to automatically sync file attachments from the corresponding emails. This includes not only traditional file attachments but also files (such as images) added in-line within emails.

Each file attachment is assigned a unique `file_id`, with the `parent_id` corresponding to the email the file was attached to.

Please note that the same rules that apply to our file uploads also apply to attachments in terms of file size and supported extensions.
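For illustration, a `/integrations/gmail/sync` request body enabling attachment sync might look like the sketch below; the `data_source_id` parameter is described elsewhere in this changelog, and its value here is a placeholder.

```python
# Illustrative /integrations/gmail/sync request body.
gmail_sync_body = {
    "data_source_id": 123,     # which Gmail connection to sync (placeholder)
    "sync_attachments": True,  # also sync attachments and in-line images
}
```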
Set User File Limits
You have the flexibility to set the maximum number of files that a unique customer ID can upload using the `file_upload_limit` field on the `/update_users` endpoint. This value can be adjusted as needed, allowing you to tailor it according to your own plan limits.

You can then check the upload limit set for a specific user via the `custom_limits` object on the `/user` endpoint. See details here.
Flags for OCR
Added `ocr_job_started_at` to the `user_files_v2` response to denote whether OCR was enabled for a particular file.

Added additional OCR properties to be returned via `ocr_properties`, including whether table parsing was enabled.

See details here.
Role Management in Customer Portal
You now have the ability to manage who in your organization can create, delete, and view API keys.
Here’s a breakdown of the current roles available:
Admin: This role is empowered to both create and delete API keys.
User: Users with this role can view API keys.
Moving forward, these roles will determine user permissions and access across different sections of the Carbon Customer Portal.
You can access the customer portal via portal.carbon.ai
Expanded OCR Support in Carbon Connect
The prop `useOCR` can now be enabled on the integration level for the following connectors (in addition to local files):

OneDrive
Dropbox
Box
Google Drive
Zotero
SharePoint

The prop `parsePdfTablesWithOcr` can now be enabled on the integration level to parse tables with OCR when `useOCR` is set to `true`.

Please note OCR support is only applicable for PDFs at the moment. You can find more details here.
Return `chunk_index` on the `/embeddings` Endpoint
We now return the `chunk_index` for specific chunks returned via the `/embeddings` endpoint. You can find more details here.
Migrations between Embedding Models
You can now request migrations between embedding models with minimal downtime.
Email me if you’re interested. The cost per migration (not including embedding token costs) starts at $850 one-time.
New `request_id` Field
Carbon now accommodates the inclusion of a `request_id` within OAuth URLs, global sync endpoints, and custom sync endpoints (such as Gmail, Outlook, etc.), allowing users to define it as needed. Non-OAuth-URL endpoints that auto-sync upon connection (e.g., Freshdesk, Gitbook) also support this value. The `request_id` serves as a filter for files through `user_files_v2`.

With Carbon Connect, setting the `useRequestIds` parameter to `true` will trigger automatic assignment of the `request_id`. This `request_id` will be returned in `INITIATE` and `ADD`/`UPDATE` callbacks. It’s essential to note that this configuration adjustment is applicable at the component level rather than the integration level.

This enhancement is part of version `2.0.0-beta8`. Find more details here.
`syncFilesOnConnection` For More Data Sources
We’ve added the `sync_files_on_connection` parameter to the `oauth_url` endpoint for the following data sources: Intercom, Salesforce, Zendesk, Confluence, Freshdesk, and Gitbook.

This parameter is also accessible for each `enabledIntegration` in Carbon Connect. You can find more information about this here.

By default, this parameter is set to `true`. When enabled, all files will be synchronized automatically after a user connects their account. This is particularly useful when a user connects a data source that doesn’t have a built-in file picker.
Delete Child Files Based on Parent ID
Added a flag named `delete_child_files` to the `delete_files` endpoint. When set to `true`, it will delete all files that have the same `parent_file_ids` as the file submitted for deletion. This flag defaults to `false`. Find more details here.
`upload_chunks_and_embeddings` Updates
You can now upload only chunks to Carbon via the `upload_chunks_and_embeddings` endpoint and we can generate the embeddings for you. This is useful for migrations where you want to move between embedding models and vector databases.

In the API request, you can exclude embeddings and set `chunks_only` to `true`. Then, include your embedding model API key (OpenAI or Cohere) under `custom_credentials`:

{ "api_key": "lkdsjflds" }

Make sure to include some delay between requests. There are also stricter limits on how many embeddings/chunks can be uploaded per request when `chunks_only` is `true`: each request can only include 100 chunks.
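Given the 100-chunk cap and the advice to pace requests, a client-side uploader might batch its chunks before sending. This helper is purely illustrative, not part of the Carbon SDK:

```python
def batch_chunks(chunks, max_per_request=100):
    """Split a chunk list into request-sized batches; chunks_only uploads
    are capped at 100 chunks per request."""
    return [chunks[i:i + max_per_request]
            for i in range(0, len(chunks), max_per_request)]
```

In a real uploader you would also add a delay (e.g. `time.sleep`) between batch requests, per the note above.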
Data Source Connections with Pre-Existing Auth
If you’re using our white labeling add-on, we added a new POST endpoint `/integrations/connect` so customers can bypass the authentication flow on Carbon by directly passing in an access token.

The request takes an authentication object that contains all the necessary pieces of data to connect to a user’s account. The object will vary by data source, and a list specifying the required keys can be found in our docs. If the connection is successful, the upserted data source will be returned.

This endpoint also returns a sync URL for some data source types that will initiate the sync process.
Improvements to CSV, TSV, XLSX, GSheet Parsing
You now have the option to chunk CSV, TSV, XLSX, and Google Sheets files by tokens via the `chunk_size` parameter and/or by rows via the `max_items_per_chunk` parameter. When a file is processed, we add rows to a chunk until adding the next row would exceed `chunk_size` or `max_items_per_chunk`.

If a single row exceeds `chunk_size` or the embedding model’s token limit, the file’s `sync_error_message` will point out which row has too many tokens.

For example:

If each CSV row is 250 tokens, with a `chunk_size` of 800 tokens and no `max_items_per_chunk` set, then each chunk will contain 3 CSV rows.

If each CSV row is 250 tokens, with a `chunk_size` of 800 tokens and `max_items_per_chunk` set to 1, then each chunk will contain 1 CSV row.

Consequently, it is essential to ensure that the number of tokens in a CSV row does not surpass the token limits established by the embedding models. Token counting is currently only supported for OpenAI models.
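The chunking rule above can be sketched as a small simulation (per-row token counts in, chunks out). This is an illustrative model of the behavior, not Carbon’s actual implementation:

```python
def chunk_rows(row_tokens, chunk_size, max_items_per_chunk=None):
    """Group rows into chunks, adding rows until the next row would exceed
    chunk_size tokens or max_items_per_chunk rows."""
    chunks, current, tokens = [], [], 0
    for t in row_tokens:
        over_tokens = tokens + t > chunk_size
        over_items = (max_items_per_chunk is not None
                      and len(current) >= max_items_per_chunk)
        if current and (over_tokens or over_items):
            chunks.append(current)   # close the current chunk
            current, tokens = [], 0
        current.append(t)
        tokens += t
    if current:
        chunks.append(current)
    return chunks
```

With 250-token rows and `chunk_size=800`, this reproduces the 3-rows-per-chunk example above.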
You can find more details here.
Improvements to OCR
Table parsing in PDFs has been improved significantly with this most recent OCR update.

To use the enhanced table parsing features, you need to set `parse_pdf_tables_with_ocr` to `true` when uploading PDFs (`use_ocr` must also be `true`).

Any tables parsed when `parse_pdf_tables_with_ocr` is `true` have their own chunk(s) assigned to them. These chunks can be identified by the presence of the string `TABLE` in `embedding_metadata.block_types`. The format of these tabular chunks is the same as that of CSV-derived chunks.

Using this table-parsing feature in conjunction with hybrid search should provide much better search results than before (assuming the PDF has tables that need to be searched over).

If you’re using OCR, we now also return metadata such as coordinates and page numbers even if `set_page_as_boundary` is set to `false`. Specifically, we return the bounding box coordinates as well as the start and end page number of the chunk.
In the event that `pg_start` < `pg_end`, you should interpret the bounding box coordinates slightly differently: `x1` and `x2` will correspond to the minimum `x1` and maximum `x2` over all pages for the chunk; `y1` will correspond to the upper-most coordinate of the part of the chunk on `pg_start`, and `y2` will correspond to the bottom-most coordinate of the part of the chunk on `pg_end`.
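One way to reason about the multi-page convention: given a per-page box for each page a chunk spans, the reported values combine as below. This helper is purely illustrative (the API returns the combined values directly; you never compute them yourself):

```python
def chunk_bbox(page_boxes):
    """Combine per-page (x1, y1, x2, y2) boxes for one chunk following the
    multi-page convention: x1/x2 span all pages, y1 comes from the first
    page, y2 from the last page."""
    pg_start, pg_end = min(page_boxes), max(page_boxes)
    return {
        "pg_start": pg_start,
        "pg_end": pg_end,
        "x1": min(b[0] for b in page_boxes.values()),
        "x2": max(b[2] for b in page_boxes.values()),
        "y1": page_boxes[pg_start][1],  # top of the chunk on its first page
        "y2": page_boxes[pg_end][3],    # bottom of the chunk on its last page
    }
```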
Carbon Connect 2.0 (Beta)
We are thrilled to announce the beta launch of Carbon Connect 2.0, with the following improvements:
Support multiple active accounts per data source.
Improved data source list.
Built-in user interface for users to view and re-sync files per account.
Ability for users to directly disconnect active accounts.
To install Carbon Connect 2.0, run `npm install carbon-connect@2.0.0-beta5`. It is not treated as the latest version of Carbon Connect, so you won’t get this version automatically.

A few other important updates for Carbon Connect 2.0:

We’ve removed file details from the payload of `UPDATE` callbacks. If you used to get files this way, you’ll now need to switch to using our SDK or API to fetch the updated files when a data source updates.

When specifying embedding models, make sure to use a format like embeddingModel={EmbeddingGenerators.OPENAI_ADA_LARGE_1024} instead of just writing out a string.

You can hide our built-in UI for viewing and re-syncing files using the `showFilesTab` param at either the global component or `enabledIntegration` level.
Scheduled Syncs Per User and Data Source
Control user and data source syncing using the `/update_users` endpoint, allowing organizations to enable syncing for particular users and data source types. The endpoint accepts a list of user IDs and data sources, with an option to enable syncing for all sources using the string `'ALL'`. Each request supports up to 100 customer IDs.

In the following example, future Gmail accounts for the specified users will automatically have syncing enabled according to the provided settings:

{ "customer_ids": ["swapnil@carbon.ai", "swapnil.galaxy@gmail.com"], "auto_sync_enabled_sources": ["GMAIL"] }
Find more details in our documentation here.
Note: This update is meant to replace our file-level sync logic and any existing auto-syncs have been migrated over to use this updated logic.
Delete Files Based on Filters
We added the `/delete_files_v2` endpoint, which allows customers to delete files via the same filters as `/user_files_v2`.

We plan to deprecate the `/delete_files` endpoint in a month. Find more details in our documentation here.
Filtering for Child Files
We added the ability to include all descendant (child) files on both `/delete_files_v2` and `/user_files_v2` when filtering. Filters applied to the endpoint extend to the returned child files.

We plan to deprecate the `parent_file_ids` filter on the `/user_files_v2` endpoint in a month.
Customer Portal v1
We’ve officially launched v1 of our Customer Portal - portal.carbon.ai
You can currently manage your API keys directly via the Portal, and we plan to release the following functionality next quarter:
User management
Usage monitoring
Billing management
For current customers, you can reset your password with the email provided to Carbon to gain access. If you don’t know the email you have on file, DM me!
`/integrations/items/list` Improvements
We are implementing four distinct filters: `external_ids`, `ids`, `root_files_only`, and `name`, each meant to filter data based on its respective field.

The `root_files_only` filter will exclusively return top-level files. However, if a `parent_id` is specified, then `root_files_only` can’t be specified, and vice versa.

The `external_url` field has been added to the response body of the `/integrations/items/list` endpoint. See more details here.
Multiple Active Accounts Per Data Source
Carbon now supports multiple active accounts per data connection!
We’ve introduced two new parameters across various API endpoints to support this functionality across all our connectors. While these parameters are optional for users with a single data source of each type, they become mandatory when managing multiple accounts.
`/integrations/oauth_url`

`data_source_id`: Specifies the data source from which synchronization should occur when dealing with multiple data sources of the same type.

`connecting_new_account`: This parameter is utilized to consistently generate an OAuth URL as opposed to a sync URL. A sync URL is the destination where users are redirected after a successful OAuth authentication to synchronize their files. While this parameter can be skipped when adding the first data source of that type, it should be explicitly specified for subsequent additions.

`/integrations/s3/files`, `/integrations/outlook/sync`, `/integrations/gmail/sync`

`data_source_id`: Used to specify the data source for synchronization when managing multiple data sources of the same type.

`/integrations/outlook/user_folders`, `/integrations/outlook/user_categories`, `/integrations/gmail/user_labels`

`data_source_id`: Specifies the data source to be utilized when there are multiple data sources of the same type.

Note that the following endpoints already have a mandatory requirement to pass in a `data_source_id`: `/integrations/items/sync`, `/integrations/items/list`, `/integrations/files/sync`, `/integrations/gitbook/spaces`, `/integrations/gitbook/sync`
New Embedding Models
We now support embedding generation using OpenAI’s `text-embedding-3-small` and `text-embedding-3-large` models.

To define the embedding model, utilize the `embedding_model` parameter in the POST body for the `/embeddings` and other API endpoints. By default, if no specific model is provided, the system will use `OPENAI` (the original Ada-2). Find more details on the models available here.
Return HTML for Webpages
The `presigned_url` field under `user_files_v2` now returns a pre-signed URL to the raw HTML content for each web page. The `parsed_text_url` field still returns a pre-signed URL for the corresponding plain text. Find more details here.
Return Website Tags in File Metadata
The `file_metadata` field under `user_files_v2` now returns `og:image` and `og:description` for each web page. Find more details here.
Omit Content by CSS Selector
You can now exclude specific CSS selectors from web scraping. This ensures that text content within these elements does not appear in the parsed plaintext, chunks, and embeddings. Useful for omitting irrelevant elements, such as headers or footers, which might affect semantic search results.
The `web_scrape` request object supports a new field: `css_selectors_to_skip: Optional[list[str]] = []`

Find more details here.
JSON File Support
We’ve added support for JSON files via local upload and 3rd party connectors.
How It Works:
The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot-separated path to reach the key’s value. Each component of the path is either a string for a nested object or an integer for a nested list.
`max_items_per_chunk` is a parameter that determines how many JSON objects to include in a single chunk. A new chunk is created when either the `max_items_per_chunk` or `chunk_size` limit is reached. For example:

If each JSON object is 250 tokens, with a `chunk_size` of 800 tokens and no `max_items_per_chunk` set, then each chunk will contain 3 JSON objects.

If each JSON object is 250 tokens, with a `chunk_size` of 800 tokens and `max_items_per_chunk` set to 1, then each chunk will contain 1 JSON object.
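The flattening described above can be sketched as follows. This is an illustrative model of the parser’s behavior (the actual implementation may differ in details):

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into dot-separated key paths: string components
    for nested objects, integer indices for nested lists."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}  # leaf value
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat
```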
Learn more details here.
Gitbook Connector
We launched our Gitbook integration today, which syncs pages from public and shared spaces.

The Carbon Connect `enabledIntegrations` value for Gitbook is `GITBOOK`.

Gitbook does not come with a pre-built file selector, so we added 2 endpoints for listing and syncing Gitbook spaces:

List all Gitbook spaces with `/integrations/gitbook/spaces` (API Reference)

Sync multiple spaces at once with `/integrations/gitbook/sync` (API Reference)
You can also use our global endpoints for listing and syncing specific pages in Gitbook spaces:
List pages in spaces with the global endpoint `/integrations/items/list`

Sync pages in spaces with the global endpoint `/integrations/files/sync`
Note: Spaces are treated like folders via the Carbon API.
See more specifics about our Gitbook integration here.
Note: our Gitbook page parser is still in beta, so feedback is much appreciated!
Delete Endpoint Update
We’re transitioning file deletion from sync to async processing.
This means that the `FILE_DELETED` webhook event will not fire immediately; instead, it fires when the file is actually deleted.

We are also limiting deletions to 50 files per `/delete_files` request to limit the load on our servers. We advise spacing out delete requests every 24 hours.
Pinecone Integration
We’ve launched our Pinecone destination connector! We offer support for both pod-based and serverless offerings.
Carbon seamlessly updates your Pinecone instance with the latest embeddings upon processing user files. Users gain full access to Carbon’s API endpoints, including hybrid search for supported sparse vector storage.
Find more details here.
New Carbon SDKs
Moving forward, we will be able to provide support for a greater number of SDKs and promptly release SDK support for API updates. If there is a language for which you want us to add SDK support, we should be able to turn that around in less than a week.
We’re adding support for the following languages today:
The current JavaScript SDK will continue to be supported for the next month, and it will remain available longer term. However, new features will only be supported in the new TypeScript SDK moving forward.
Delete Users Endpoint
Added an endpoint `/delete_users` that takes an array of customer IDs and deletes all those users. Deleting a user revokes all of the user’s OAuth connections and deletes all their files, embeddings, and chunks.
The request format is:
{ "customer_ids": ["USER_1", "USER_2", "USER_3"] }
Find more details here.
Salesforce Connector is Live
All articles from an end user’s Salesforce Knowledge can be listed and synced via the global API endpoints `/integrations/items/list` and `/integrations/files/sync`. The Carbon Connect integration (launching tomorrow) will sync all articles by default.

The `enabledIntegrations` value is `SALESFORCE`. You can find more info here.
Outlook Folders
After connecting your Outlook account, you can use the `/integrations/outlook/user_folders` endpoint to list all of your folders on Outlook. This includes both system folders like `inbox` and user-created folders. Find more details here.
Gmail Labels
After connecting a Gmail account, you can use the `/integrations/gmail/user_labels` endpoint to list all of your labels. User-created labels will have the type `user` and Gmail’s default labels will have the type `system`. Find more details here.
Carbon Connect Updates
Added support for JSON file formats and a `maxItemsPerChunk` param to specify the number of items to include in a single chunk.

Added `cssSelectorsToSkip` to `WEB_SCRAPE` to define CSS selectors to exclude when converting HTML to plaintext.

Added `SALESFORCE` as an `enabledIntegration` on Carbon Connect.

For Salesforce, we added a param `syncFilesOnConnection` that defaults to `true` and will automatically sync all pages from a user’s Salesforce account. We’ll be adding this param to other connectors too, meaning you can automatically sync all files from connectors that don’t have built-in file selectors (Gitbook, Confluence, etc.).

This parameter has also been added to the `/integrations/oauth_url` endpoint as `sync_files_on_connection`, where it also defaults to `true`.
Freshdesk Connector is Live
All `Published` articles from an end user’s Freshdesk knowledge base are synced when connected to Carbon.

The Carbon Connect `enabledIntegrations` value is `FRESHDESK`. You can find more info here.
Speed Improvements to Hybrid Search
We improved the speed of hybrid search by 10x by creating sparse vector indexes at file upload time instead of at query time.
Steps to Enable:
Pass the following body to the `/modify_user_configuration` endpoint: { "configuration_key_name": "sparse_vectors", "value": { "enabled": true } }

Set the parameter `generate_sparse_vectors` to `true` via the `/uploadfile` endpoint.
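The two steps above can be captured as request payloads. The parameter names come from the steps themselves; everything else is illustrative:

```python
# Step 1: enable sparse vectors for the user via /modify_user_configuration.
enable_sparse_vectors = {
    "configuration_key_name": "sparse_vectors",
    "value": {"enabled": True},
}

# Step 2: request sparse vector generation when uploading via /uploadfile.
upload_params = {"generate_sparse_vectors": True}
```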
We’ll be rolling out faster hybrid search support across 3rd party connectors in the upcoming weeks.
Deleting Files based on Sync Status
You can now delete file(s) based on `sync_status` via the `delete_files` endpoint. We added 2 parameters:

`sync_statuses` - pass a list of sync statuses for file deletion. For example: { "sync_statuses": ["SYNC_ERROR", "QUEUED_FOR_SYNC"] }. When this parameter is passed, we delete all files in the `SYNC_ERROR` and `QUEUED_FOR_SYNC` statuses that belong to the end user identified by the `customer-id` header of the request.
`delete_non_synced_only` - boolean parameter that limits deletion to files that have never been synced before. For example, a previously synced Google Drive file enters the `QUEUED_FOR_SYNC` status again during a scheduled re-sync. Setting `delete_non_synced_only` to `true` would prevent this file from being deleted.
Files are deletable in all statuses except the `SYNCING`, `EVALUATING_RESYNC`, and `QUEUED_FOR_OCR` states. Including `SYNCING`, `EVALUATING_RESYNC`, or `QUEUED_FOR_OCR` in the list will result in an error response - files in these statuses must wait until they transition out of the status before they can be deleted. Find more details here.
Carbon Connect Updates
Added support for the following functionalities in Carbon Connect (React component + JavaScript SDK):
Additional embedding models (`OPENAI`, `AZURE_OPENAI`, `COHERE_MULTILINGUAL_V3` for text and audio files, and `VERTEX_MULTIMODAL` for image files).

Audio and image file support. Reference documentation on available file formats.

OCR support for PDFs from local file uploads via Carbon Connect.

Hybrid search support.
You can find details to enable any of these functionalities in our documentation:
Remove `Customer-Id` on Select Endpoints
We’re removing `customer-id` as a required header for the following endpoints, where it isn’t needed:

/auth/v1/white_labeling
/user
/webhooks
/add_webhook
/delete_webhook/{webhook_id}
/organization
Vector Database Integration
We are starting to build out direct integrations with vector database providers!
What this means:
After authenticating a vector database provider via API key, Carbon automatically synchronizes between user data sources and the embeddings within your vector database. Whenever a user file is processed, we handle the seamless update of your vector database with the latest embeddings.
You’ll have full access to all of Carbon’s API endpoints, including hybrid search if sparse vector storage is supported by your vector database.

Migrations between vector databases are made simple since Carbon provides a unified API to interface with all providers.
The first vector database integration we’re announcing is with Turbopuffer. Many more to come!
S3 Connector
We launched our S3 connector today that enables syncing objects from buckets.
The Carbon Connect `enabledIntegrations` value for S3 is `S3`. See more specifics about our S3 connector here.
File + Account Management Component (BETA)
We’ve launched a new component that enables the following:
Users to add and revoke access to accounts under each connection.
Users to view and select specific folders and files for sync.
The aim is to offer a pre-built file selector for integrations without their own.
The component is currently offered in React but we’ll add support for other frameworks soon.
You can find the npm package here. Please note it’s still in BETA so your feedback is much appreciated!
Expanding sort for user_files_v2
You can sort by `name`, `file_size`, and `last_sync` via the `order_by` field in the `user_files_v2` body. See more details here.
Support for audio file uploads via connectors
We’ve enabled support for audio files via the following connectors: S3, Google Drive, OneDrive, SharePoint, Box, Dropbox, and Zotero.
See list of supported audio files here.
Google Verification
Carbon’s Google Connector is officially Google-verified. This means users will no longer see the warning screen when authenticating with Carbon’s Google connector.
OCR Public Preview
We’ve been rolling out support for OCR, starting with PDFs uploaded locally (images and data connectors to follow).
Exposing Sync Error Reasons
We are now exposing error messages under the `sync_error_reason` field for files entering the `SYNC_ERROR` status. You can find a list of common errors here, and we’ll be updating it on an ongoing basis.
List and Sync Items from Data Sources
We’re introducing new functionalities that allow customers to synchronize and retrieve a comprehensive list of items such as files, folders, collections, articles, and more from a user’s data source. This enhancement empowers you to create an in-house file selection flow, while enabling Carbon to also provide a user-friendly file selector UI and convenient helper methods within our SDK.
You can find more details here.
Upload Chunks and Embeddings
Added the `/upload_chunks_and_embeddings` endpoint to enable uploading chunks and vectors to Carbon directly. See more specific details here.
COPYRIGHT @ 2024 JCDT DBA CARBON