February 2024
New Embedding Models
We now support embedding generation using OpenAI’s text-embedding-3-small and text-embedding-3-large models.
To define the embedding model, use the embedding_model parameter in the POST body for the /embeddings and other API endpoints. By default, if no model is specified, the system uses OPENAI (the original Ada-2).
Find more details on the available models here.
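A minimal sketch of how a request body might carry the embedding_model parameter. The query field name and the helper build_embeddings_body are assumptions made for illustration; only embedding_model and the OPENAI default come from the note above, and the exact identifiers for the two new models should be taken from the API reference.

```python
import json
from typing import Optional

# Documented default: OPENAI (the original Ada-2).
DEFAULT_EMBEDDING_MODEL = "OPENAI"


def build_embeddings_body(query: str, embedding_model: Optional[str] = None) -> dict:
    """Build a POST body for /embeddings (hypothetical helper).

    `query` is an assumed field name; `embedding_model` is the parameter
    named in the release note.
    """
    body = {"query": query}
    # Leaving the parameter out lets the server fall back to its default.
    if embedding_model is not None:
        body["embedding_model"] = embedding_model
    return body


print(json.dumps(build_embeddings_body("refund policy", DEFAULT_EMBEDDING_MODEL)))
```

Omitting the parameter and passing OPENAI explicitly should be equivalent according to the note; passing it explicitly simply makes the choice visible in client code.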
Return HTML for Webpages
The presigned_url field under user_files_v2 now returns a pre-signed URL to the raw HTML content for each web page.
The parsed_text_url field still returns a pre-signed URL for the corresponding plain text.
Find more details here.
Return Website Tags in File Metadata
The file_metadata field under user_files_v2 now returns og:image and og:description for each web page.
Find more details here.
Omit Content by CSS Selector
You can now exclude specific CSS selectors from web scraping. This ensures that text content within the matching elements does not appear in the parsed plaintext, chunks, or embeddings. This is useful for omitting irrelevant elements, such as headers or footers, which might affect semantic search results.
The web_scrape request object supports a new field: css_selectors_to_skip: Optional[list[str]] = []
Find more details here.
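As an illustration, the new field could be modeled on the client side like this. The WebScrapeRequest class and its url field are assumptions; only css_selectors_to_skip and its empty-list default come from the signature above.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class WebScrapeRequest:
    """Hypothetical client-side model of one web_scrape request object."""

    url: str  # assumed field name
    # Mirrors the documented field: Optional[list[str]] = []
    css_selectors_to_skip: Optional[list[str]] = field(default_factory=list)


request = WebScrapeRequest(
    url="https://example.com/docs",
    css_selectors_to_skip=["header", "footer", ".cookie-banner"],
)
body = asdict(request)  # plain dict, ready to be serialized as JSON
print(body)
```

Using default_factory keeps each request's list independent, which avoids the shared mutable default that a literal [] would create in a Python class.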
JSON File Support
We’ve added support for JSON files via local upload and 3rd party connectors.
How It Works:
The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot-separated path to reach the key’s value. Each component of the path is either a string (for a nested object) or an integer (for a nested list).
max_items_per_chunk is a parameter that determines how many JSON objects to include in a single chunk.
A new chunk is created when either the max_items_per_chunk or the chunk_size limit is reached. For example:
If each JSON object is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is not set, then each chunk will contain 3 JSON objects.
If each JSON object is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is set to 1, then each chunk will contain 1 JSON object.
Learn more details here.
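The flattening and chunking rules described above can be sketched as follows. This is an illustration under stated assumptions, not Carbon's parser: flatten and chunk_items are hypothetical names, token counts are supplied by the caller, and the handling of empty containers or of a single object larger than chunk_size is a guess.

```python
from typing import Any, Optional


def flatten(obj: Any, prefix: str = "") -> dict:
    """Flatten nested dicts and lists into dot-separated paths."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # list positions become integer path parts
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, path))  # empty containers add nothing
        else:
            flat[path] = value
    return flat


def chunk_items(
    token_counts: list[int],
    chunk_size: int,
    max_items_per_chunk: Optional[int] = None,
) -> list[list[int]]:
    """Group object indices, starting a new chunk when either limit is hit."""
    chunks: list[list[int]] = []
    current: list[int] = []
    current_tokens = 0
    for index, tokens in enumerate(token_counts):
        over_tokens = bool(current) and current_tokens + tokens > chunk_size
        over_items = (
            max_items_per_chunk is not None and len(current) >= max_items_per_chunk
        )
        if over_tokens or over_items:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


print(flatten({"id": 1, "author": {"name": "Ada", "tags": ["a", "b"]}}))
# {'id': 1, 'author.name': 'Ada', 'author.tags.0': 'a', 'author.tags.1': 'b'}
print(chunk_items([250] * 6, chunk_size=800))
# [[0, 1, 2], [3, 4, 5]]
print(chunk_items([250] * 6, chunk_size=800, max_items_per_chunk=1))
# [[0], [1], [2], [3], [4], [5]]
```

The two chunk_items calls reproduce the examples above: three 250-token objects fit under an 800-token limit, and max_items_per_chunk set to 1 forces one object per chunk.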
Gitbook Connector
Today we launched our Gitbook integration, which syncs pages from any public or shared space.
The Carbon Connect enabledIntegrations value for Gitbook is GITBOOK.
Gitbook does not come with a pre-built file selector, so we added 2 endpoints for listing and syncing Gitbook spaces:
List all Gitbook spaces with /integrations/gitbook/spaces (API Reference)
Sync multiple spaces at once with /integrations/gitbook/sync (API Reference)
You can also use our global endpoints for listing and syncing specific pages in Gitbook spaces:
List pages in spaces with the global endpoint /integrations/items/list
Sync pages in spaces with the global endpoint /integrations/files/sync
Note: Spaces are treated like folders via the Carbon API.
See more specifics about our Gitbook integration here.
Note: our Gitbook page parser is still in beta, so feedback is much appreciated!
Delete Endpoint Update
We’re transitioning file deletion from sync to async processing.
This means that the FILE_DELETED webhook event will not fire immediately; it will fire once the file has actually been deleted.
We are also limiting each /delete_files request to 50 files to reduce the load on our servers. We advise spacing out delete requests every 24 hours.
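One way to stay under the per-request limit is to split a long list of file IDs into batches before calling /delete_files. In this sketch the file_ids field name and the batch_file_ids helper are assumptions; only the limit of 50 comes from the note above.

```python
from typing import Iterator

MAX_FILES_PER_DELETE_REQUEST = 50  # limit stated in the release note


def batch_file_ids(
    file_ids: list[int], batch_size: int = MAX_FILES_PER_DELETE_REQUEST
) -> Iterator[list[int]]:
    """Yield consecutive slices of at most batch_size file IDs."""
    for start in range(0, len(file_ids), batch_size):
        yield file_ids[start:start + batch_size]


# "file_ids" is an assumed field name for the /delete_files body.
bodies = [{"file_ids": batch} for batch in batch_file_ids(list(range(1, 121)))]
print([len(body["file_ids"]) for body in bodies])  # [50, 50, 20]
```

Because deletion is now asynchronous, a client would typically send each body and then wait for the FILE_DELETED webhook rather than assume the files are gone when the request returns.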
Pinecone Integration
We’ve launched our Pinecone destination connector! We offer support for both pod-based and serverless offerings.
Carbon seamlessly updates your Pinecone instance with the latest embeddings upon processing user files. Users gain full access to Carbon’s API endpoints, including hybrid search for supported sparse vector storage.
Find more details here.
New Carbon SDKs
Moving forward, we will be able to provide support for a greater number of SDKs and promptly release SDK support for API updates. If there is a language for which you want us to add SDK support, we should be able to turn that around in less than a week.
We’re adding support for the following languages today:
The current JavaScript SDK will continue to be supported for the next month, and it will remain available longer term. However, new features will only be supported in the new TypeScript SDK moving forward.
Delete Users Endpoint
Added an endpoint, /delete_users, that takes an array of customer IDs and deletes all of those users.
Deleting a user revokes all of the user’s OAuth connections and deletes all of their files, embeddings, and chunks.
The request format is:
{ "customer_ids": ["USER_1", "USER_2", "USER_3"] }
Find more details here.
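For illustration, the request format above could be assembled with the Python standard library as follows. The base URL is a placeholder, the POST method is an assumption, and authentication headers are left out because the note does not describe them; the request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder, not the real host


def build_delete_users_request(customer_ids: list[str]) -> urllib.request.Request:
    """Build (but do not send) a request for /delete_users."""
    payload = json.dumps({"customer_ids": customer_ids}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/delete_users",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed HTTP method
    )


request = build_delete_users_request(["USER_1", "USER_2", "USER_3"])
print(request.full_url)
print(request.data.decode("utf-8"))
```

Since deleting a user also removes their files, embeddings, and chunks, it is worth confirming the list of customer IDs before passing the request to urllib.request.urlopen.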
Salesforce Connector is Live
All articles from an end user’s Salesforce Knowledge can be listed and synced via the global API endpoints /integrations/items/list and /integrations/files/sync.
The Carbon Connect integration (launching tomorrow) will sync all articles by default.
The enabledIntegrations value is SALESFORCE.
You can find more info here.
Outlook Folders
After connecting your Outlook account, you can use this endpoint to list all of your Outlook folders.
This includes both system folders, like inbox, and user-created folders.
Find more details here.
Gmail Labels
After connecting a Gmail account, you can use the /integrations/gmail/user_labels endpoint to list all of your labels.
User-created labels will have the type user, and Gmail’s default labels will have the type system.
Find more details here.
Delete Child Files Based on Parent ID
Added a flag named delete_child_files to the delete_files endpoint. When set to true, it will delete all files that have the same parent_file_ids as the file submitted for deletion. This flag defaults to false.
Find more details here.
Carbon Connect Updates
Added support for JSON file formats and a maxItemsPerChunk param to specify the number of items to include in a single chunk.
Added cssSelectorsToSkip to WEB_SCRAPE to define CSS selectors to exclude when converting HTML to plaintext.
Added SALESFORCE as an enabledIntegration on Carbon Connect.
For Salesforce, we added a param syncFilesOnConnection that defaults to true and will automatically sync all pages from a user’s Salesforce account.
We’ll be adding this param to other connectors too, meaning you can automatically sync all files from connectors that don’t have built-in file selectors (Gitbook, Confluence, etc.).
This parameter is also added to the /integrations/oauth_url endpoint as sync_files_on_connection and also defaults to true.
COPYRIGHT @ 2024 JCDT DBA CARBON