Exploring the concepts of Elasticsearch

Understanding keyword and text data types in Elasticsearch

In Elasticsearch, data is stored in documents, which are then organized into indexes. When defining an index, you have the option to specify a mapping, which includes defining data types for your fields. Among the various data types available, two commonly used but often confusing types are text and keyword.

  • Text: This data type is analyzed before being stored in the inverted index. Analysis breaks the text into individual terms (tokens), typically lowercased words, which enables full-text search and matching on individual words within the value.
  • Keyword: This data type is not analyzed. The value is stored as is, making it ideal for exact-match searches, filtering, and aggregations. It is commonly used for IDs, tags, or categories.
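
As a minimal sketch of how the two types are declared, the mapping below creates a products index whose name field is a keyword and whose description field is text. The description field is a hypothetical addition for illustration; only name appears in the examples in this section.

```console
PUT products
{
  "mappings": {
    "properties": {
      "name":        { "type": "keyword" },
      "description": { "type": "text" }
    }
  }
}
```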

Suppose you have defined the name field as a keyword field and indexed the following document:

POST products/_doc
{
  "name": "washing machine"
}

If you execute a search query like this:

POST products/_search
{
  "query": {
    "match": {
      "name": "washing"
    }
  }
}

It will not return any matches. This happens because a keyword field is not analyzed: the whole value "washing machine" is indexed as a single term, so only an exact match on the full value succeeds. To retrieve the document, you must search for the full value, "washing machine". A text field, on the other hand, is analyzed, so you can search using individual tokens from the field value. For example, if name were mapped as text instead, the same query would behave differently:

POST products/_search
{
  "query": {
    "match": {
      "name": "washing"
    }
  }
}

This returns the document because a text field is matched on the tokens derived from its value, and "washing" is one of those tokens.
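
One way to see what analysis does is the _analyze API. This sketch assumes the standard analyzer, which is what a text field uses when no other analyzer is configured:

```console
POST _analyze
{
  "analyzer": "standard",
  "text": "washing machine"
}
```

The response lists two tokens, washing and machine, which is why a match query for "washing" finds the document when name is a text field. Swapping in "analyzer": "keyword" returns the whole string as a single token, which mirrors how a keyword field stores the value.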

Query context vs filter context

By default, Elasticsearch sorts matching search results by relevance score. This score, returned in the _score metadata field, indicates how well each document matches a query. The higher the _score, the more relevant the document. Whether a score is calculated at all depends on whether the query clause operates in a query context or a filter context.

In the query context, the query clause answers the question: How well does this document match this query clause? Besides determining whether a document matches, it also calculates a relevance score stored in the _score metadata field. The query context is applied whenever a query clause is passed to a query parameter and is typically used for full-text searches.

In contrast, the filter context operates on a simple yes/no basis to determine if a document matches the query clause. It does not contribute to relevance scores. Filters are efficient for operations such as keyword searches, exact matches, range queries, and numerical data comparisons.

POST trade/_search
{
  "query": {
    "term": { "symbol": "META" }
  }
}

A term query passed directly to the query parameter runs in the query context, so Elasticsearch calculates a score for it. Even if the score is identical for every document returned, computing it costs extra work. So how do we speed up the query? We can skip scoring by wrapping the term query in the filter of a constant_score query, which runs it in the filter context:

POST trade/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "symbol": "META" }
      }
    }
  }
}
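
A bool query with only a filter clause is another way to run the term query in the filter context (Boolean queries are covered in their own section). This is a sketch of an alternative to the constant_score form, using the same trade index and symbol field:

```console
POST trade/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "symbol": "META" } }
      ]
    }
  }
}
```

With this form every hit has a _score of 0, because no scoring clause is present. The constant_score form instead assigns each hit its boost value, which is 1.0 by default.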

Match queries vs term queries

Previously, we discussed the differences between text fields and keyword fields. Term queries are designed to return documents that contain an exact term in a specified field. These queries are particularly useful for retrieving documents based on precise values, such as a price, product ID, or username.

However, term queries are not suitable for text fields: the values of a text field are analyzed at index time, while the search term of a term query is not, so the two often fail to line up. For text fields, match queries are the preferred option. Match queries return documents that match a provided text, number, date, or boolean value. Before matching, the provided text is analyzed. Match queries are ideal for performing full-text searches and also support features like fuzzy matching for more flexible results.

POST articles/_search
{
  "query": {
    "match": { "content": "boyfriend loves me" }
  }
}

Imagine you have an index called articles that contains scraped Reddit posts. To search for articles related to the words boyfriend loves me, you can perform a full-text search using a match query. This query analyzes your search text into terms and, by default, retrieves articles that contain any of them, not only articles that contain the exact phrase.

In this case, we are using the query context instead of the filter context. This means that the _score metadata is used to indicate how relevant each article is to the search term, allowing for results ranked by relevance.
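
By default a match query returns documents containing any of the analyzed terms. The sketch below shows two variations on the same articles index. The first requires all terms and tolerates small typos through the operator and fuzziness parameters (the misspelling in the request is deliberate). The second uses match_phrase to require the terms to appear together and in order.

```console
POST articles/_search
{
  "query": {
    "match": {
      "content": {
        "query": "boyfrend loves me",
        "operator": "and",
        "fuzziness": "AUTO"
      }
    }
  }
}

POST articles/_search
{
  "query": {
    "match_phrase": { "content": "boyfriend loves me" }
  }
}
```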

Boolean queries

Boolean queries combine multiple queries. A bool query supports four clause types: must, should, filter, and must_not.

must: The query must appear in matching documents and contributes to the score. It acts as a logical "AND", ensuring all specified queries match.

should: The query should appear in the matching document, and it contributes to the score when it does. Each query defined under should acts as a logical "OR", returning documents that match any of the specified queries. When the bool query has no must or filter clauses, at least one should clause has to match.

filter: The query must appear in matching documents. However, unlike must, the score of the query is ignored. Filter clauses are executed in the filter context and are considered for caching. Each query defined under filter acts as a logical "AND", returning only documents that match all the specified queries.

must_not: The query must not appear in the matching documents. Clauses are executed in the filter context, meaning that scoring is ignored and clauses are considered for caching. Because scoring is ignored, a bool query made up only of must_not clauses returns a score of 0 for every document. Each query defined under must_not acts as a logical "NOT", returning only documents that do not match any of the specified queries.

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "symbol": "META"
          }
        },
        {
          "range": {
            "date": {
              "gte": "2025-01-10T00:00",
              "lte": "2025-01-10T12:00"
            }
          }
        }
      ]
    }
  }
}

Based on the JSON above, the query returns all trade orders that have the symbol "META" and were executed between "2025-01-10T00:00" and "2025-01-10T12:00", inclusive at both ends because the range uses gte and lte.
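
Since neither clause in the request above needs a relevance score, the same conditions could also sit under filter. The sketch below does that and adds must_not and should clauses to show all four types together. The status and side fields and their values are hypothetical; they are not part of the trade index described in this article.

```console
POST trade/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "symbol": "META" } },
        { "range": { "date": { "gte": "2025-01-10T00:00", "lte": "2025-01-10T12:00" } } }
      ],
      "must_not": [
        { "term": { "status": "CANCELLED" } }
      ],
      "should": [
        { "term": { "side": "BUY" } }
      ]
    }
  }
}
```

Because a filter clause is present, the should clause is optional here: it only raises the score of trades that match it.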

Aggregations

Aggregations in Elasticsearch allow you to analyze your data by summarizing it into metrics, statistics, or other meaningful insights. They provide a powerful way to understand large datasets efficiently. There are 3 types of aggregations:

Metric Aggregations: Compute numerical metrics like sum, average, max, or min directly from field values.

Bucket Aggregations: Group documents into buckets based on field values, ranges, or other criteria.

Pipeline Aggregations: Perform operations on the output of other aggregations, such as calculating derivatives or cumulative sums.

One common example of a bucket aggregation is the terms aggregation. It retrieves the most frequent terms within a specified field. By default, it returns the top ten terms with the highest document counts. To retrieve more terms, you can adjust the size parameter.

Note: The terms aggregation works on keyword, numeric, ip, boolean, and binary fields. By default it cannot be run on text fields, because their values are analyzed and tokenized; aggregate on a keyword field (or a keyword sub-field of the text field) instead.

POST articles/_search
{
  "aggs": {
    "genres": {
      "terms": { "field": "genre" }
    }
  }
}
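
As a sketch of the size parameter of the terms aggregation, this request asks for the top 20 genres instead of the default ten. Setting the top-level size to 0 suppresses the search hits so that only the aggregation results are returned.

```console
POST articles/_search
{
  "size": 0,
  "aggs": {
    "genres": {
      "terms": {
        "field": "genre",
        "size": 20
      }
    }
  }
}
```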

Combining aggregations

Elasticsearch allows you to include multiple aggregations in a single request, so you can extract several insights with one query. For example, the following request retrieves the most popular stocks by document count (the top ten by default), the average trading volume across all documents, and the total trading volume across all documents.

POST trade/_search
{
  "size": 0,
  "aggs": {
    "most_popular_stock": {
      "terms": {
        "field": "symbol"
      }
    },
    "avg_volume": {
      "avg": {
        "field": "volume"
      }
    },
    "total_volume": {
      "sum": {
        "field": "volume"
      }
    }
  }
}
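
Aggregations run over the documents matched by the request's query, so a query can narrow what gets summarized. As a sketch, this variation computes the average and total volume only for trades dated 2025-01-10, assuming the trade index has the date field used in the Boolean query example:

```console
POST trade/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": { "gte": "2025-01-10", "lt": "2025-01-11" }
    }
  },
  "aggs": {
    "avg_volume":   { "avg": { "field": "volume" } },
    "total_volume": { "sum": { "field": "volume" } }
  }
}
```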

Subaggregations

Subaggregations in Elasticsearch allow you to nest one aggregation within another, enabling more detailed analysis of your data. For example, the following request calculates the trade volume every 30 minutes for a specific stock symbol (META).

{
  "size": 0,
  "aggs": {
    "filter_by_symbol": {
      "filter": {
        "term": {
          "symbol": "META"
        }
      },
      "aggs": {
        "trade_over_time": {
          "date_histogram": {
            "field": "date",
            "fixed_interval": "30m"
          },
          "aggs": {
            "total_volume_transacted": {
              "sum": {
                "field": "volume"
              }
            }
          }
        }
      }
    }
  }
}
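
A pipeline aggregation can be nested in the same way. As a sketch, the request below adds a cumulative_sum next to the sum, giving a running total of volume across the 30-minute buckets. The name cumulative_volume is an arbitrary label chosen for this example, and the symbol restriction is written as a top-level query instead of a filter aggregation to keep the nesting shallow.

```console
POST trade/_search
{
  "size": 0,
  "query": { "term": { "symbol": "META" } },
  "aggs": {
    "trade_over_time": {
      "date_histogram": { "field": "date", "fixed_interval": "30m" },
      "aggs": {
        "total_volume_transacted": { "sum": { "field": "volume" } },
        "cumulative_volume": {
          "cumulative_sum": { "buckets_path": "total_volume_transacted" }
        }
      }
    }
  }
}
```

The buckets_path value points at the sibling sum aggregation by name, so each bucket's cumulative_volume is the sum of total_volume_transacted over that bucket and all earlier ones.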