|
1642 | 1642 | <quote>Matrix input</quote> below for alternative usage.
|
1643 | 1643 | </para>
|
1644 | 1644 | <para>
|
1645 |
| - In the most minimal usage, <argname>x</argname> is set to |
| 1645 | + In the simplest usage <argname>x</argname> is set to |
1646 | 1646 | <lit>null</lit>, <argname>byvar</argname> is a single series
|
1647 |
| - and the third argument is omitted, or set to |
1648 |
| - <lit>null</lit>. In this case, the return value is a matrix |
1649 |
| - with two columns holding, respectively, the distinct values |
1650 |
| - of <argname>byvar</argname>, sorted in ascending order, and |
1651 |
| - the count of observations at which <argname>byvar</argname> |
| 1647 | + and the third argument is omitted or set to |
| 1648 | + <lit>null</lit>. The return value is then a matrix with two |
| 1649 | + columns holding, respectively, the distinct values of |
| 1650 | + <argname>byvar</argname> sorted in ascending order, and the |
| 1651 | + count of observations at which <argname>byvar</argname> |
1652 | 1652 | takes on each of these values. For example,
|
1653 | 1653 | </para>
|
1654 | 1654 | <code>
|
|
1661 | 1661 | </para>
|
1662 | 1662 | <para>
|
1663 | 1663 | More generally, if <argname>byvar</argname> is a list with
|
1664 |
| - <math>n</math> members, then the left-hand <math>n</math> |
1665 |
| - columns hold the combinations of the distinct values of each |
1666 |
| - of the <math>n</math> series and the count column holds the |
1667 |
| - number of observations at which each combination is |
1668 |
| - realized. Note that the count column can always be found at |
1669 |
| - the position <lit>nelem(byvar) + 1</lit>. |
| 1664 | + <math>n</math> members then the first <math>n</math> columns |
| 1665 | + of the returned matrix hold the combinations of the distinct |
| 1666 | + values of each of the <math>n</math> series, and the count |
| 1667 | + column holds the number of observations at which each |
| 1668 | + combination is realized. (The count column can always be |
| 1669 | + found at the position <lit>nelem(byvar)+1</lit>.) |
1670 | 1670 | </para>
|
1671 | 1671 | <subhead>Specifying an aggregation function</subhead>
|
1672 | 1672 | <para>
|
1673 |
| - If the third argument is given, then <argname>x</argname> |
| 1673 | + If the third argument is given then <argname>x</argname> |
1674 | 1674 | must not be <lit>null</lit>, and the rightmost
|
1675 | 1675 | <math>m</math> columns hold the values of the statistic
|
1676 | 1676 | specified by <argname>funcname</argname> for each of the
|
1677 |
| - variables in <argname>x</argname>. (Thus, <math>m</math> is |
| 1677 | + variables in <argname>x</argname>. (So <math>m</math> is |
1678 | 1678 | equal to 1 if <argname>x</argname> is a single series and
|
1679 | 1679 | equal to <lit>nelem(x)</lit> if <argname>x</argname> is a
|
1680 |
| - list.) The given statistic is calculated on the respective |
| 1680 | + list.) The specified statistic is calculated on the |
1681 | 1681 | sub-samples defined by the combinations in
|
1682 | 1682 | <argname>byvar</argname> (in ascending order); these
|
1683 | 1683 | combinations are shown in the first <math>n</math> column(s)
|
1684 | 1684 | of the returned matrix.
|
1685 | 1685 | </para>
|
1686 | 1686 | <para>
|
1687 |
| - So, in the special case where <argname>x</argname> and |
1688 |
| - <argname>byvar</argname> are both individual series, the |
1689 |
| - return value is a matrix with three columns holding, |
1690 |
| - respectively, the distinct values of |
1691 |
| - <argname>byvar</argname>, sorted in ascending order; the |
1692 |
| - count of observations at which <argname>byvar</argname> |
1693 |
| - takes on each of these values; and the values of the |
1694 |
| - statistic specified by <argname>funcname</argname> |
1695 |
| - calculated on series <argname>x</argname>, using only those |
1696 |
| - observations at which <argname>byvar</argname> takes on the |
1697 |
| - value given in the first column. |
| 1687 | + So, if both <argname>x</argname> and |
| 1688 | + <argname>byvar</argname> are individual series, the return |
| 1689 | + value is a matrix with three columns holding the distinct |
| 1690 | + values of <argname>byvar</argname> sorted in ascending |
| 1691 | + order; the count of observations at which |
| 1692 | + <argname>byvar</argname> takes on each of these values; and |
| 1693 | + the values of the statistic specified by |
| 1694 | + <argname>funcname</argname> calculated on series |
| 1695 | + <argname>x</argname>, using just those observations at which |
| 1696 | + <argname>byvar</argname> takes on the value given in the |
| 1697 | + first column. |
1698 | 1698 | </para>
|
1699 | 1699 | <para>
|
1700 | 1700 | The following values of <argname>funcname</argname> are
|
|
1710 | 1710 | be said to <quote>aggregate</quote> the series in some way.
|
1711 | 1711 | If none of these built-in functions does what you need, you
|
1712 | 1712 | can give the name of a user-defined function as the
|
1713 |
| - aggregator; like the built-ins, such a function must take a |
| 1713 | + aggregator. Like the built-ins, such a function must take a |
1714 | 1714 | single series argument and return a scalar value.
|
1715 | 1715 | </para>
|
1716 | 1716 | <para>
|
|
1720 | 1720 | (non-missing) observations on <argname>x</argname> at
|
1721 | 1721 | each <argname>byvar</argname> combination.
|
1722 | 1722 | </para>
|
| 1723 | + <subhead>Some examples</subhead> |
1723 | 1724 | <para>
|
1724 |
| - For a simple example, suppose that <lit>region</lit> |
1725 |
| - represents a coding of geographical region using integer |
1726 |
| - values 1 to <math>n</math>, and <lit>income</lit> represents |
1727 |
| - household income. Then the following would produce an <by |
1728 |
| - r="n" c="3"/> matrix holding the region codes, the count of |
| 1725 | + First, suppose that <lit>region</lit> represents a coding of |
| 1726 | + geographical region using integer values 1 to |
| 1727 | + <math>n</math>, and <lit>income</lit> represents household |
| 1728 | + income. Then the following would produce an <by r="n" |
| 1729 | + c="3"/> matrix holding the region codes, the count of |
1729 | 1730 | observations in each region, and mean household income for
|
1730 | 1731 | each of the regions:
|
1731 | 1732 | </para>
|
|
1752 | 1753 | of <lit>income</lit> and <lit>age</lit>.
|
1753 | 1754 | </para>
|
1754 | 1755 | <para>
|
1755 |
| - Note that if <argname>byvar</argname> is a list, some |
1756 |
| - combinations of the <argname>byvar</argname> values may not |
1757 |
| - be present in the data (giving a count of zero). In that |
1758 |
| - case the value of the statistics for <argname>x</argname> |
1759 |
| - are recorded as <lit>NaN</lit> (not a number). If you want |
1760 |
| - to ignore such cases you can use the <fncref targ="selifr"/> |
1761 |
| - function to select only those rows that have a non-zero |
1762 |
| - count. The column to test is one place to the right of the |
1763 |
| - number of <argname>byvar</argname> variables, so we can do: |
| 1756 | + If <argname>byvar</argname> is a list, some combinations of |
| 1757 | + the <argname>byvar</argname> values may not be present in |
| 1758 | + the data (giving a count of zero). In that case the values |
| 1759 | + of the statistics for <argname>x</argname> are recorded as |
| 1760 | + <lit>NaN</lit> (not a number). To cut out such cases you |
| 1761 | + can use the <fncref targ="selifr"/> function to select only |
| 1762 | + those rows that have a non-zero count. The column to test is |
| 1763 | + one place to the right of the number of |
| 1764 | + <argname>byvar</argname> variables, so we can do: |
1764 | 1765 | </para>
|
1765 | 1766 | <code>
|
1766 | 1767 | matrix m = aggregate(X, BY, sd)
|
|
1774 | 1775 | form. However, if both arguments are provided they must
|
1775 | 1776 | match in type (you cannot give a series or list for one
|
1776 | 1777 | argument and a matrix for the other) and two matrix
|
1777 |
| - arguments must have the same number of rows. Also note that |
1778 |
| - in this context matrix columns are treated as if they were |
1779 |
| - series, so the aggregation function must follow the pattern |
| 1778 | + arguments must have the same number of rows. In this |
| 1779 | + context matrix columns are treated as if they were series, |
| 1780 | + so the aggregation function must follow the pattern |
1780 | 1781 | described above, taking a series argument and returning a
|
1781 | 1782 | scalar.
|
1782 | 1783 | </para>
|
|
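The usage patterns described in the revised text can be sketched in hansl as follows. This is an illustration only, not part of the commit: it assumes the data4-1 sample file distributed with gretl, with series named price, bedrms and baths, and the matrix names counts, means and m are arbitrary.

```hansl
# Sketch only: assumes the data4-1 sample file (series price, bedrms, baths)
open data4-1

# x = null, single byvar: distinct values of bedrms and their counts
matrix counts = aggregate(null, bedrms)
print counts

# single x and single byvar: values, counts, and mean price per value
matrix means = aggregate(price, bedrms, mean)
print means

# byvar as a list: one row per combination of values, including
# combinations that do not occur in the data (count of zero)
list BY = bedrms baths
matrix m = aggregate(price, BY, sd)

# the count column sits at nelem(BY)+1; keep rows with a non-zero count
m = selifr(m, m[,nelem(BY)+1])
print m
```

In the last call the first two columns of m hold the bedrms and baths combinations, the third the counts, and the fourth the statistic, which is why the selifr test uses column nelem(BY)+1.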