Chapter 5. String Handling - 5.5. The Tokenizer Library Boost.Tokenizer - The Boost C++ Libraries

5.5. The Tokenizer Library Boost.Tokenizer

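The Boost.Tokenizer library lets you iterate over the tokens of a string, that is, the partial expressions that remain once certain characters are interpreted as separators.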

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 
 
int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer; 
  std::string s = "Boost C++ libraries"; 
  tokenizer tok(s); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
}

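Boost.Tokenizer defines the class template boost::tokenizer in the header boost/tokenizer.hpp. Its template parameter is a class that identifies the tokens. The example above uses boost::char_separator, which by default treats whitespace and punctuation marks as separators.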

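With these template arguments, the tokenizer must be initialized with a string of type std::string. Through its begin() and end() member functions the tokenizer can be accessed like a container, and dereferencing the iterators yields the tokens of that string. The type passed as the template parameter determines how the tokens are found.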

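Because boost::char_separator by default treats whitespace and punctuation marks as separators, this example displays Boost, C, +, + and libraries. To identify these separators, the boost::char_separator class calls std::isspace() and std::ispunct(). Boost.Tokenizer distinguishes between separators that are dropped and separators that are kept as tokens. By default, whitespace is dropped and punctuation marks are kept, which is why the two plus signs appear in the output.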
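The default classification described above can be approximated without Boost. The following is only a hand-written sketch under that description, not the library's implementation; the function name split_like_default is hypothetical.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hand-written approximation of the default behavior described in the text:
// whitespace ends a token and is dropped, a punctuation mark ends a token
// and is itself returned as a one-character token. Not Boost's actual code.
std::vector<std::string> split_like_default(const std::string& text)
{
  std::vector<std::string> tokens;
  std::string current;
  for (std::string::size_type i = 0; i < text.size(); ++i)
  {
    // <cctype> functions require a value representable as unsigned char.
    const unsigned char ch = static_cast<unsigned char>(text[i]);
    if (std::isspace(ch) || std::ispunct(ch))
    {
      // Any separator closes the token collected so far.
      if (!current.empty())
      {
        tokens.push_back(current);
        current.clear();
      }
      // Punctuation is kept as a token of its own; whitespace is dropped.
      if (std::ispunct(ch))
        tokens.push_back(std::string(1, text[i]));
    }
    else
    {
      current += text[i];
    }
  }
  if (!current.empty())
    tokens.push_back(current);
  return tokens;
}
```

Calling split_like_default("Boost C++ libraries") yields the five tokens Boost, C, +, + and libraries, the same sequence the example program prints.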

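If punctuation marks should not act as separators, initialize a boost::char_separator object accordingly and pass it to the tokenizer. The following example does exactly that.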

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 
 
int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer; 
  std::string s = "Boost C++ libraries"; 
  boost::char_separator<char> sep(" "); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
}

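The constructor of boost::char_separator accepts up to three parameters, of which only the first is required. The first parameter lists the separators that are dropped. In this example, the space is still treated as a separator.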

The second parameter lists the separators that are kept and displayed. If it is omitted, no separators are displayed. Running the program displays Boost, C++ and libraries.

If a plus sign is passed as the second parameter, the output matches that of the first example.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 
 
int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer; 
  std::string s = "Boost C++ libraries"; 
  boost::char_separator<char> sep(" ", "+"); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
}

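The third parameter determines whether empty tokens are displayed. If two separators are found back to back, the token between them is empty. By default, such empty tokens are not displayed. The third parameter changes this default.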

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 
 
int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer; 
  std::string s = "Boost C++ libraries"; 
  boost::char_separator<char> sep(" ", "+", boost::keep_empty_tokens); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
}

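Running this program displays two additional, empty tokens. The first lies between the two plus signs, the second between the second plus sign and the space that follows it.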

A tokenizer can also be used with other string types.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 
 
int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<wchar_t>, std::wstring::const_iterator, std::wstring> tokenizer; 
  std::wstring s = L"Boost C++ libraries"; 
  boost::char_separator<wchar_t> sep(L" "); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::wcout << *it << std::endl; 
}

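This example iterates over a string of type std::wstring. To support this string type, the tokenizer must be instantiated with two additional template arguments, the iterator type and the string type. The same applies to boost::char_separator, which must be instantiated with wchar_t.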

Besides boost::char_separator, Boost.Tokenizer provides two more classes for identifying tokens.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 
 
int main() 
{ 
  typedef boost::tokenizer<boost::escaped_list_separator<char> > tokenizer; 
  std::string s = "Boost,\"C++ libraries\""; 
  tokenizer tok(s); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
}

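The boost::escaped_list_separator class reads multiple values separated by commas, a format commonly known as CSV (comma-separated values). It also handles double quotes and escape sequences, which is why the example outputs Boost and C++ libraries.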

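The other class is boost::offset_separator, which is best explained with an example. An object of this class, initialized with the desired offsets, must be passed to the constructor of boost::tokenizer as the second parameter.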

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 
 
int main() 
{ 
  typedef boost::tokenizer<boost::offset_separator> tokenizer; 
  std::string s = "Boost C++ libraries"; 
  int offsets[] = { 5, 5, 9 }; 
  boost::offset_separator sep(offsets, offsets + 3); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
}

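boost::offset_separator specifies where in the string each token ends. The program above specifies that the first token ends after 5 characters, the second after another 5 characters, and the third and last after the following 9 characters. The output is Boost, C++ and libraries. Note that the second token is five characters long and therefore includes the space before and after C++.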